Enable GPU: LLM + Whisper on the RTX 5050, pick qwen3:8b

GPU acceleration is now on by default and verified end-to-end on the Blackwell RTX 5050 (sm_120): - Ollama offloads 100% to GPU (log: library=CUDA compute=12.0, BLACKWELL_NATIVE_FP4=1). compose passes GPU via CDI (devices: nvidia.com/gpu=all) to both ollama and javis. - Whisper STT on GPU: faster-whisper>=1.1.0 + nvidia-cublas/cudnn cu12, LD_LIBRARY_PATH baked into the image. Verified float16 transcribe on sm_120; bridge auto-falls back to CPU when no GPU is present. - Model: default chat model -> qwen3:8b (best 8GB-VRAM tool-calling, ~5GB Q4). Embed stays nomic-embed-text. - README documents the host one-time setup (nvidia-container-toolkit + `nvidia-ctk cdi generate`) and GPU on/off. Verified: image builds; GPU visible in both containers via compose; ollama ps = 100% GPU; faster-whisper cuda OK + CPU fallback OK; bridge /health 200.
2026-06-09 15:49:21 +09:00
parent 25c77ac794
commit 0dbc0300d7
7 changed files with 61 additions and 22 deletions
--- a/bridge/requirements-bridge.txt
+++ b/bridge/requirements-bridge.txt
@@ -5,12 +5,19 @@

 # --- Brain runtime (imported when the reply engine loads) ---
 python-dotenv==1.0.1
-faster-whisper==1.0.3
+# >=1.1.0 pulls a ctranslate2 with Blackwell (sm_120) CUDA kernels.
+faster-whisper>=1.1.0
 mcp==1.13.1
 numpy<2.0.0
 rapidfuzz==3.6.1
 requests==2.32.3

+# --- CUDA libraries for GPU-accelerated Whisper (RTX 5050 / sm_120) ---
+# ctranslate2 dlopens these at transcribe time; LD_LIBRARY_PATH is set in the
+# Dockerfile to point at them. Verified working on Blackwell sm_120.
+nvidia-cublas-cu12
+nvidia-cudnn-cu12
+
 # --- Bridge HTTP service ---
 flask>=3.0.0

--- a/bridge/server.py
+++ b/bridge/server.py
@@ -90,7 +90,16 @@ def _ensure_brain():
            )
            device = os.environ.get("WHISPER_DEVICE", "auto")
            compute = os.environ.get("WHISPER_COMPUTE_TYPE", "auto")
-            whisper = WhisperModel(cfg.whisper_model, device=device, compute_type=compute)
+            try:
+                whisper = WhisperModel(cfg.whisper_model, device=device, compute_type=compute)
+            except Exception as ge:
+                # GPU not available / unsupported -> fall back to CPU so the
+                # bridge still works without a GPU passed to the container.
+                if device != "cpu":
+                    print(f"[bridge] whisper device='{device}' failed ({ge}); falling back to CPU", flush=True)
+                    whisper = WhisperModel(cfg.whisper_model, device="cpu", compute_type="int8")
+                else:
+                    raise

            _cfg, _db, _dialogue_memory, _whisper = cfg, db, dialogue_memory, whisper
            print(f"[bridge] brain ready (chat={cfg.ollama_chat_model}, whisper={cfg.whisper_model})", flush=True)