Enable GPU: LLM + Whisper on the RTX 5050, pick qwen3:8b
Some checks failed
Release / semantic-release (push) Successful in 19s
tests / Unit tests (Linux, Python 3.11) (push) Successful in 9m54s
Release / build-linux (push) Failing after 7m14s
Release / build-windows (push) Has been cancelled
Release / build-macos (arm64, macos-latest) (push) Has been cancelled
Release / build-macos (x64, macos-15-intel) (push) Has been cancelled
Release / release-main (push) Has been cancelled
Release / release-develop (push) Has been cancelled
GPU acceleration is now on by default and verified end-to-end on the Blackwell RTX 5050 (sm_120):

- Ollama offloads 100% to GPU (log: library=CUDA compute=12.0, BLACKWELL_NATIVE_FP4=1). compose passes GPU via CDI (devices: nvidia.com/gpu=all) to both ollama and javis.
- Whisper STT on GPU: faster-whisper>=1.1.0 + nvidia-cublas/cudnn cu12, LD_LIBRARY_PATH baked into the image. Verified float16 transcribe on sm_120; bridge auto-falls back to CPU when no GPU is present.
- Model: default chat model -> qwen3:8b (best 8GB-VRAM tool-calling, ~5GB Q4). Embed stays nomic-embed-text.
- README documents the host one-time setup (nvidia-container-toolkit + `nvidia-ctk cdi generate`) and GPU on/off.

Verified: image builds; GPU visible in both containers via compose; ollama ps = 100% GPU; faster-whisper cuda OK + CPU fallback OK; bridge /health 200.
19
README.md
@@ -75,7 +75,24 @@ docker compose up -d # the bot starts and registers the /자비스 command

Invoke it in Discord with `/자비스 join`. (To change models, e.g. `OLLAMA_CHAT_MODEL`, set them in `.env` and run `docker compose up -d`.)

- GPU (RTX 5050) acceleration: install nvidia-container-toolkit on the host, uncomment the GPU block in `docker-compose.yml`, and set `WHISPER_DEVICE=cuda` / `WHISPER_COMPUTE_TYPE=float16` in `.env`.
### GPU acceleration (on by default)

The LLM (Ollama) and Whisper STT run **on the GPU by default (RTX 5050, Blackwell sm_120)**. Verified: Ollama offloads 100% to the GPU, and faster-whisper runs float16 on the GPU.

Host prerequisites (one time):
```bash
# After installing nvidia-container-toolkit, generate the CDI spec (Docker 29 CDI mode, no daemon restart needed)
sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml
docker run --rm --device nvidia.com/gpu=all ubuntu nvidia-smi -L # OK if the GPU is listed
```
`docker-compose.yml` passes the GPU to both containers via `devices: ["nvidia.com/gpu=all"]` (CDI).
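One way to check that the GPU handed in by compose is actually used by Ollama is to read its `/api/ps` endpoint, which reports `size` and `size_vram` for each loaded model. The sketch below is only an illustration and not part of this repository: the helper names are hypothetical, and the base URL assumes Ollama's default port 11434 is reachable from where the script runs.

```python
import json
import urllib.request


def gpu_offload_percent(model: dict) -> float:
    """Share of a loaded model that is resident in VRAM, as a percentage."""
    size = model.get("size", 0)
    if size <= 0:
        return 0.0
    return 100.0 * model.get("size_vram", 0) / size


def loaded_models(base_url: str = "http://localhost:11434") -> list:
    """Fetch the currently loaded models from Ollama's /api/ps.

    The base URL is an assumption (Ollama's default port); adjust it to
    however the ollama service is published in your setup.
    """
    with urllib.request.urlopen(f"{base_url}/api/ps", timeout=5) as resp:
        return json.load(resp).get("models", [])


# Example use (needs a running Ollama with a model loaded; not executed here):
#   for m in loaded_models():
#       print(f"{m['name']}: {gpu_offload_percent(m):.0f}% GPU")
```

A model that is fully offloaded has `size_vram` equal to `size`, which should correspond to the "100% GPU" line that `ollama ps` prints.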
- Model: the default is `qwen3:8b`. On 8GB of VRAM it is the most stable at tool calling and fits well at ~5GB (Q4). To go lighter or heavier, change `OLLAMA_CHAT_MODEL` in `.env`.
- Whisper defaults to `WHISPER_DEVICE=cuda`/`float16`. **If no GPU is present it automatically falls back to CPU**, so this is safe.
- On a host with no GPU at all, delete the two `devices:` blocks in `docker-compose.yml` and set `WHISPER_DEVICE=cpu` in `.env`.
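The automatic CPU fallback mentioned in the list above could look roughly like the following. The bridge's real code is not shown in this commit, so this is a minimal sketch under assumptions: `load_whisper` is a hypothetical helper, `"small"` is a placeholder model name, `int8` is just one plausible CPU compute type, and catching `RuntimeError`/`ValueError` is a guess at how a missing CUDA device surfaces.

```python
import logging

log = logging.getLogger("whisper_fallback")


def load_whisper(factory, model_name="small", device="cuda", compute_type="float16"):
    """Try the requested device first; if loading fails, retry on CPU.

    `factory` is expected to behave like faster_whisper.WhisperModel, i.e.
    factory(model_name, device=..., compute_type=...).
    Returns a (model, device_actually_used) tuple.
    """
    try:
        return factory(model_name, device=device, compute_type=compute_type), device
    except (RuntimeError, ValueError) as exc:
        if device == "cpu":
            raise  # already on CPU, nothing left to fall back to
        log.warning("Whisper on %s failed (%s); falling back to CPU", device, exc)
        # float16 is generally a GPU setting, so switch compute type as well
        return factory(model_name, device="cpu", compute_type="int8"), "cpu"


# Example use (requires faster-whisper to be installed; not executed here):
#   from faster_whisper import WhisperModel
#   model, used = load_whisper(WhisperModel)
```

Passing the model class in as `factory` keeps the sketch importable on machines without faster-whisper and makes the fallback path easy to exercise with a stub.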
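- Data (the memory DB), the Whisper cache, and Piper voices persist in named volumes.
- The selfbot video-streaming dependencies are not included in the image by default. To use them, run `cd /app/bot && bun add discord.js-selfbot-v13 @dank074/discord-video-stream` in the container and restart (or add them to the Dockerfile).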