Enable GPU: LLM + Whisper on the RTX 5050, pick qwen3:8b

GPU acceleration is now on by default and verified end-to-end on the Blackwell RTX 5050 (sm_120):

- Ollama offloads 100% to GPU (log: library=CUDA compute=12.0, BLACKWELL_NATIVE_FP4=1). compose passes GPU via CDI (devices: nvidia.com/gpu=all) to both ollama and javis.
- Whisper STT on GPU: faster-whisper>=1.1.0 + nvidia-cublas/cudnn cu12, LD_LIBRARY_PATH baked into the image. Verified float16 transcribe on sm_120; bridge auto-falls back to CPU when no GPU is present.
- Model: default chat model -> qwen3:8b (best 8GB-VRAM tool-calling, ~5GB Q4). Embed stays nomic-embed-text.
- README documents the host one-time setup (nvidia-container-toolkit + `nvidia-ctk cdi generate`) and GPU on/off.

Verified: image builds; GPU visible in both containers via compose; ollama ps = 100% GPU; faster-whisper cuda OK + CPU fallback OK; bridge /health 200.
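
For readers who want to see the shape of the CPU fallback described above, here is a minimal sketch. It is not the bridge's actual code. The helper names (`load_with_cpu_fallback`, `whisper_loader`, `load_from_env`) are hypothetical, and the choice of `int8` as the CPU compute type is an assumption. Only the `WHISPER_DEVICE`, `WHISPER_COMPUTE_TYPE`, and `WHISPER_MODEL` variables and the `WhisperModel(..., device=..., compute_type=...)` constructor come from the project and from faster-whisper.

```python
import os
from typing import Any, Callable, Tuple

# A loader takes (device, compute_type) and returns a ready model object.
Loader = Callable[[str, str], Any]


def load_with_cpu_fallback(
    loader: Loader, device: str, compute_type: str
) -> Tuple[Any, str, str]:
    """Try the requested device first; retry once on CPU if loading fails."""
    try:
        return loader(device, compute_type), device, compute_type
    except Exception:
        if device == "cpu":
            raise  # already on CPU, nothing left to fall back to
        # Assumption: int8 is used as the CPU compute type in this sketch.
        return loader("cpu", "int8"), "cpu", "int8"


def whisper_loader(device: str, compute_type: str) -> Any:
    """Build a faster-whisper model for the given device."""
    # Imported lazily so the fallback helper works without faster-whisper.
    from faster_whisper import WhisperModel

    model_name = os.environ.get("WHISPER_MODEL", "small")
    return WhisperModel(model_name, device=device, compute_type=compute_type)


def load_from_env() -> Tuple[Any, str, str]:
    """Read device settings from the environment (defaults as in .env.example)."""
    return load_with_cpu_fallback(
        whisper_loader,
        os.environ.get("WHISPER_DEVICE", "cuda"),
        os.environ.get("WHISPER_COMPUTE_TYPE", "float16"),
    )
```

Calling `load_from_env()` returns the model together with the device and compute type that were actually used, so a caller can log whether the fallback was taken. Depending on the installed CUDA libraries, some GPU failures may only surface at transcribe time rather than at load time, which this sketch does not cover.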
2026-06-09 15:49:21 +09:00
parent 25c77ac794
commit 0dbc0300d7
7 changed files with 61 additions and 22 deletions
--- a/.env.example
+++ b/.env.example
@@ -20,9 +20,10 @@ BRIDGE_HOST=127.0.0.1
 BRIDGE_PORT=8765
 JARVIS_BRAIN_ENABLED=1
 JARVIS_TTS_ENABLED=1
-# faster-whisper device/compute. On this RTX 5050 box: cuda / float16.
-WHISPER_DEVICE=auto
-WHISPER_COMPUTE_TYPE=auto
+# faster-whisper device/compute. GPU by default (RTX 5050 / sm_120, verified).
+# Falls back to CPU automatically if no GPU is passed to the container.
+WHISPER_DEVICE=cuda
+WHISPER_COMPUTE_TYPE=float16
 # Optional explicit Piper voice model (.onnx). If empty, the jarvis default is used.
 TTS_PIPER_MODEL_PATH=

@@ -32,7 +33,8 @@ TTS_PIPER_MODEL_PATH=
 # ---------------------------------------------------------------------------
 # In docker-compose this is overridden to http://ollama:11434 automatically.
 OLLAMA_BASE_URL=http://127.0.0.1:11434
-OLLAMA_CHAT_MODEL=llama3.1:8b
+# qwen3:8b — best 8GB-VRAM pick: strongest tool-calling, ~5GB Q4, fits the RTX 5050.
+OLLAMA_CHAT_MODEL=qwen3:8b
 OLLAMA_EMBED_MODEL=nomic-embed-text
 WHISPER_MODEL=small