Enable GPU: LLM + Whisper on the RTX 5050, pick qwen3:8b
Some checks failed
Release / semantic-release (push) Successful in 19s
tests / Unit tests (Linux, Python 3.11) (push) Successful in 9m54s
Release / build-linux (push) Failing after 7m14s
Release / build-windows (push) Has been cancelled
Release / build-macos (arm64, macos-latest) (push) Has been cancelled
Release / build-macos (x64, macos-15-intel) (push) Has been cancelled
Release / release-main (push) Has been cancelled
Release / release-develop (push) Has been cancelled
GPU acceleration is now on by default and verified end-to-end on the Blackwell RTX 5050 (sm_120):

- Ollama offloads 100% to GPU (log: library=CUDA compute=12.0, BLACKWELL_NATIVE_FP4=1). Compose passes the GPU via CDI (devices: nvidia.com/gpu=all) to both ollama and javis.
- Whisper STT on GPU: faster-whisper>=1.1.0 + nvidia-cublas/cudnn cu12, LD_LIBRARY_PATH baked into the image. Verified float16 transcribe on sm_120; the bridge auto-falls back to CPU when no GPU is present.
- Model: default chat model -> qwen3:8b (best 8GB-VRAM tool-calling, ~5GB Q4). Embed stays nomic-embed-text.
- README documents the one-time host setup (nvidia-container-toolkit + `nvidia-ctk cdi generate`) and GPU on/off.

Verified: image builds; GPU visible in both containers via compose; ollama ps = 100% GPU; faster-whisper cuda OK + CPU fallback OK; bridge /health 200.
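The "ollama ps = 100% GPU" check can also be done programmatically. The sketch below is not part of this commit; it assumes Ollama's `/api/ps` endpoint, whose entries report `size` and `size_vram` in bytes, and the helper names `gpu_offload_percent` and `loaded_models` are hypothetical.

```python
import json
from urllib.request import urlopen


def gpu_offload_percent(model: dict) -> float:
    """Share of a loaded model that is resident in VRAM, in percent.

    `model` is one entry from the "models" list returned by /api/ps.
    """
    size = model.get("size", 0)
    if size <= 0:
        return 0.0
    return 100.0 * model.get("size_vram", 0) / size


def loaded_models(base_url: str = "http://127.0.0.1:11434") -> list:
    """Fetch the currently loaded models from a running Ollama server."""
    with urlopen(f"{base_url}/api/ps", timeout=5) as resp:
        return json.load(resp).get("models", [])
```

With the stack running, `[gpu_offload_percent(m) for m in loaded_models()]` should report 100.0 for a fully offloaded model; anything lower would suggest part of the model spilled to system RAM.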
.env.example (10 changed lines)
@@ -20,9 +20,10 @@ BRIDGE_HOST=127.0.0.1
 BRIDGE_PORT=8765
 JARVIS_BRAIN_ENABLED=1
 JARVIS_TTS_ENABLED=1
-# faster-whisper device/compute. On this RTX 5050 box: cuda / float16.
-WHISPER_DEVICE=auto
-WHISPER_COMPUTE_TYPE=auto
+# faster-whisper device/compute. GPU by default (RTX 5050 / sm_120, verified).
+# Falls back to CPU automatically if no GPU is passed to the container.
+WHISPER_DEVICE=cuda
+WHISPER_COMPUTE_TYPE=float16
 # Optional explicit Piper voice model (.onnx). If empty, the jarvis default is used.
 TTS_PIPER_MODEL_PATH=
 
@@ -32,7 +33,8 @@ TTS_PIPER_MODEL_PATH=
 # ---------------------------------------------------------------------------
 # In docker-compose this is overridden to http://ollama:11434 automatically.
 OLLAMA_BASE_URL=http://127.0.0.1:11434
-OLLAMA_CHAT_MODEL=llama3.1:8b
+# qwen3:8b — best 8GB-VRAM pick: strongest tool-calling, ~5GB Q4, fits the RTX 5050.
+OLLAMA_CHAT_MODEL=qwen3:8b
 OLLAMA_EMBED_MODEL=nomic-embed-text
 WHISPER_MODEL=small
 

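The "~5GB Q4" figure for qwen3:8b can be sanity-checked with back-of-the-envelope arithmetic. This is a rough sketch, not a measurement: the parameter count (~8.2B) and the average of ~4.85 bits per weight for a Q4-class quantization are assumptions, not values taken from this commit.

```python
def estimate_weights_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate size of quantized weights in GB (10^9 bytes)."""
    return params_billions * bits_per_weight / 8.0


# Assumptions (not from the commit): ~8.2B parameters, ~4.85 bits/weight.
weights_gb = estimate_weights_gb(8.2, 4.85)

# Whatever is left of an 8 GB card has to hold the KV cache, the CUDA
# context, and the Whisper model running in the other container.
headroom_gb = 8.0 - weights_gb

print(f"weights ~{weights_gb:.1f} GB, headroom ~{headroom_gb:.1f} GB")
```

The estimate covers weights only; actual VRAM use grows with context length, so the headroom number is an upper bound on what remains.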
@@ -8,7 +8,10 @@ FROM ubuntu:24.04
 ENV DEBIAN_FRONTEND=noninteractive \
     LANG=C.UTF-8 \
     DISPLAY=:1 \
-    PATH=/opt/venv/bin:/root/.bun/bin:/usr/local/bin:/usr/bin:/bin
+    PATH=/opt/venv/bin:/root/.bun/bin:/usr/local/bin:/usr/bin:/bin \
+    NVIDIA_VISIBLE_DEVICES=all \
+    NVIDIA_DRIVER_CAPABILITIES=compute,utility \
+    LD_LIBRARY_PATH=/opt/venv/lib/python3.12/site-packages/nvidia/cublas/lib:/opt/venv/lib/python3.12/site-packages/nvidia/cudnn/lib
 
 # --- System packages: desktop, VNC, Chrome deps, ffmpeg, python, ocr ---
 RUN apt-get update && apt-get install -y --no-install-recommends \

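The `LD_LIBRARY_PATH` value above hardcodes the Python 3.12 site-packages path. As an illustration of how that string is composed, here is a small sketch that builds it from a site-packages root; the helper name `nvidia_lib_path` is hypothetical, and it assumes the `nvidia-*-cu12` wheels place their shared libraries under `nvidia/<component>/lib`.

```python
import posixpath


def nvidia_lib_path(site_packages: str, components: list) -> str:
    """Join the lib directories of the given NVIDIA wheel components.

    Example component names: "cublas", "cudnn".
    """
    return ":".join(
        posixpath.join(site_packages, "nvidia", name, "lib")
        for name in components
    )


print(nvidia_lib_path("/opt/venv/lib/python3.12/site-packages", ["cublas", "cudnn"]))
```

If the base image's Python version changes, the hardcoded path in the `ENV` line would need to change with it; passing `sysconfig.get_paths()["purelib"]` as `site_packages` is one way to derive the root at build time.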
README.md (19 changed lines)
@@ -75,7 +75,24 @@ docker compose up -d # bot starts and registers the /자비스 command
 
 Summon it in Discord with `/자비스 join`. (To change models such as `OLLAMA_CHAT_MODEL`, set them in `.env`, then run `docker compose up -d`.)
 
-- GPU (RTX 5050) acceleration: install nvidia-container-toolkit on the host, uncomment the GPU block in `docker-compose.yml`, and set `WHISPER_DEVICE=cuda` / `WHISPER_COMPUTE_TYPE=float16` in `.env`.
+### GPU acceleration (ON by default)
+
+The LLM (Ollama) and Whisper STT run **on the GPU by default (RTX 5050, Blackwell sm_120)**. Verified: Ollama offloads 100% to the GPU, and faster-whisper runs in float16 on the GPU.
+
+One-time host setup:
+
+```bash
+# After installing nvidia-container-toolkit, generate the CDI spec (Docker 29 CDI mode, no daemon restart needed)
+sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml
+docker run --rm --device nvidia.com/gpu=all ubuntu nvidia-smi -L # OK if the GPU is listed
+```
+
+`docker-compose.yml` passes the GPU to both containers via `devices: ["nvidia.com/gpu=all"]` (CDI).
+
+- Model: `qwen3:8b` by default. It has the most reliable tool calling on 8GB of VRAM and fits comfortably at ~5GB (Q4). For a lighter or heavier model, change `OLLAMA_CHAT_MODEL` in `.env`.
+- Whisper defaults to `WHISPER_DEVICE=cuda`/`float16`. **If no GPU is available it falls back to CPU automatically**, so this is safe.
+- On a host with no GPU at all, remove the two `devices:` blocks from `docker-compose.yml` and set `WHISPER_DEVICE=cpu` in `.env`.
+
 - Data (memory DB), the Whisper cache, and Piper voices persist in named volumes.
 - Selfbot video-streaming dependencies are not included in the image by default. To use them, run `cd /app/bot && bun add discord.js-selfbot-v13 @dank074/discord-video-stream` in the container and restart (or add them to the Dockerfile).
 

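Before running `docker compose up`, it can help to confirm that the CDI spec from the host setup step actually exists. The following is a minimal sketch, not part of the commit; it assumes the conventional CDI spec directories `/etc/cdi` and `/var/run/cdi`, and the helper name `find_cdi_specs` is hypothetical.

```python
from pathlib import Path


def find_cdi_specs(dirs=("/etc/cdi", "/var/run/cdi")) -> list:
    """Return CDI spec files (.yaml/.yml/.json) found in the given directories."""
    specs = []
    for directory in dirs:
        path = Path(directory)
        if path.is_dir():
            specs.extend(
                sorted(
                    f for f in path.iterdir()
                    if f.is_file() and f.suffix in {".yaml", ".yml", ".json"}
                )
            )
    return specs


if not find_cdi_specs():
    print("no CDI spec found; run: sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml")
```

An empty result only means no spec file was found; it does not prove the driver or toolkit is missing, so the `nvidia-smi -L` container check in the README remains the authoritative test.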
@@ -5,12 +5,19 @@
 
 # --- Brain runtime (imported when the reply engine loads) ---
 python-dotenv==1.0.1
-faster-whisper==1.0.3
+# >=1.1.0 pulls a ctranslate2 with Blackwell (sm_120) CUDA kernels.
+faster-whisper>=1.1.0
 mcp==1.13.1
 numpy<2.0.0
 rapidfuzz==3.6.1
 requests==2.32.3
 
+# --- CUDA libraries for GPU-accelerated Whisper (RTX 5050 / sm_120) ---
+# ctranslate2 dlopens these at transcribe time; LD_LIBRARY_PATH is set in the
+# Dockerfile to point at them. Verified working on Blackwell sm_120.
+nvidia-cublas-cu12
+nvidia-cudnn-cu12
+
 # --- Bridge HTTP service ---
 flask>=3.0.0
 

@@ -90,7 +90,16 @@ def _ensure_brain():
     )
     device = os.environ.get("WHISPER_DEVICE", "auto")
     compute = os.environ.get("WHISPER_COMPUTE_TYPE", "auto")
-    whisper = WhisperModel(cfg.whisper_model, device=device, compute_type=compute)
+    try:
+        whisper = WhisperModel(cfg.whisper_model, device=device, compute_type=compute)
+    except Exception as ge:
+        # GPU not available / unsupported -> fall back to CPU so the
+        # bridge still works without a GPU passed to the container.
+        if device != "cpu":
+            print(f"[bridge] whisper device='{device}' failed ({ge}); falling back to CPU", flush=True)
+            whisper = WhisperModel(cfg.whisper_model, device="cpu", compute_type="int8")
+        else:
+            raise
 
     _cfg, _db, _dialogue_memory, _whisper = cfg, db, dialogue_memory, whisper
     print(f"[bridge] brain ready (chat={cfg.ollama_chat_model}, whisper={cfg.whisper_model})", flush=True)

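The fallback in this hunk can be exercised without a GPU by isolating the pattern behind an injected loader. The sketch below is illustrative only: `load_with_cpu_fallback` is a hypothetical helper, not a function in the bridge, and `loader` stands in for `faster_whisper.WhisperModel`.

```python
from typing import Any, Callable, Tuple


def load_with_cpu_fallback(
    loader: Callable[..., Any],
    model_name: str,
    device: str = "auto",
    compute_type: str = "auto",
) -> Tuple[Any, str]:
    """Try the requested device first; on failure retry once on CPU/int8.

    Returns the loaded model and the device that was actually used.
    """
    try:
        return loader(model_name, device=device, compute_type=compute_type), device
    except Exception as exc:
        if device == "cpu":
            # Already on CPU: nothing left to fall back to.
            raise
        print(f"device={device!r} failed ({exc}); falling back to CPU", flush=True)
        return loader(model_name, device="cpu", compute_type="int8"), "cpu"
```

Passing `WhisperModel` as `loader` reproduces the behavior of the hunk. Note that this only covers errors raised while constructing the model; a library that fails to load later, during transcription, would not be caught here.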
@@ -19,14 +19,10 @@ services:
     restart: unless-stopped
     volumes:
       - ollama_models:/root/.ollama
-    # --- GPU (optional): needs nvidia-container-toolkit on the host ---
-    # deploy:
-    #   resources:
-    #     reservations:
-    #       devices:
-    #         - driver: nvidia
-    #           count: all
-    #           capabilities: [gpu]
+    # GPU: needs nvidia-container-toolkit on the host (CDI). Verified on the
+    # RTX 5050 (Blackwell sm_120) — Ollama offloads 100% to GPU.
+    devices:
+      - "nvidia.com/gpu=all"
 
   # Auto-pull the models the brain needs, then exit. Idempotent (re-runnable).
   ollama-init:
@@ -36,7 +32,7 @@ services:
     restart: "no"
     environment:
       OLLAMA_HOST: http://ollama:11434
-      CHAT_MODEL: ${OLLAMA_CHAT_MODEL:-llama3.1:8b}
+      CHAT_MODEL: ${OLLAMA_CHAT_MODEL:-qwen3:8b}
       EMBED_MODEL: ${OLLAMA_EMBED_MODEL:-nomic-embed-text}
     entrypoint: ["/bin/sh", "-c"]
     command:
@@ -58,12 +54,18 @@ services:
     environment:
       # Point the brain at the ollama service and the bot at the in-container bridge.
       OLLAMA_BASE_URL: http://ollama:11434
-      OLLAMA_CHAT_MODEL: ${OLLAMA_CHAT_MODEL:-llama3.1:8b}
+      OLLAMA_CHAT_MODEL: ${OLLAMA_CHAT_MODEL:-qwen3:8b}
       OLLAMA_EMBED_MODEL: ${OLLAMA_EMBED_MODEL:-nomic-embed-text}
       WHISPER_MODEL: ${WHISPER_MODEL:-small}
+      WHISPER_DEVICE: ${WHISPER_DEVICE:-cuda}
+      WHISPER_COMPUTE_TYPE: ${WHISPER_COMPUTE_TYPE:-float16}
       BRIDGE_URL: http://127.0.0.1:8765
     depends_on:
       - ollama
+    # GPU: accelerates Whisper STT (and anything else CUDA) in this container.
+    # Verified: faster-whisper float16 works on the RTX 5050 (sm_120).
+    devices:
+      - "nvidia.com/gpu=all"
     shm_size: "1gb" # Chrome needs a larger /dev/shm
     ports:
       # Host ports are overridable. If the HOST already runs VNC on 5901
@@ -75,7 +77,6 @@ services:
       - javis_data:/data # jarvis db + memory
       - whisper_cache:/root/.cache/huggingface # cached Whisper models
       - piper_voices:/opt/piper-voices # TTS voices
-    # --- GPU (optional): mirror the ollama GPU block above to accelerate Whisper ---
 
 volumes:
   ollama_models:

@@ -8,11 +8,11 @@ set -euo pipefail
 : "${VNC_PASSWORD:=javis123}"
 : "${VNC_RESOLUTION:=1920x1080}"
 : "${OLLAMA_BASE_URL:=http://ollama:11434}"
-: "${OLLAMA_CHAT_MODEL:=llama3.1:8b}"
+: "${OLLAMA_CHAT_MODEL:=qwen3:8b}"
 : "${OLLAMA_EMBED_MODEL:=nomic-embed-text}"
 : "${WHISPER_MODEL:=small}"
-: "${WHISPER_DEVICE:=cpu}"
-: "${WHISPER_COMPUTE_TYPE:=int8}"
+: "${WHISPER_DEVICE:=cuda}"
+: "${WHISPER_COMPUTE_TYPE:=float16}"
 : "${JARVIS_DB_PATH:=/data/jarvis.db}"
 : "${BRIDGE_HOST:=0.0.0.0}"
 : "${BRIDGE_PORT:=8765}"

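The same defaults now appear in three places: `.env.example`, the compose `${VAR:-default}` substitutions, and the entrypoint's `: "${VAR:=default}"` lines. Both shell-style forms use the default when the variable is unset or empty. The sketch below models that rule in Python to show which layer wins; the `resolve` helper is hypothetical and not part of the repository.

```python
def resolve(env: dict, name: str, default: str) -> str:
    """Mimic ${NAME:-default}: use the default when NAME is unset or empty."""
    value = env.get(name)
    return value if value else default


# Layer 1: values the user put in .env (empty here, so nothing is set).
dotenv = {}

# Layer 2: compose substitutes ${WHISPER_DEVICE:-cuda} into the container env.
container_env = {"WHISPER_DEVICE": resolve(dotenv, "WHISPER_DEVICE", "cuda")}

# Layer 3: the entrypoint default applies only if the variable is still empty.
final_device = resolve(container_env, "WHISPER_DEVICE", "cuda")
print(final_device)
```

Under compose, the entrypoint default is effectively a safety net: it matters mainly when the image is started directly with `docker run` and no environment is passed, which is presumably why this commit updates it alongside the compose defaults.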