diff --git a/.env.example b/.env.example index 92ee550..7579972 100644 --- a/.env.example +++ b/.env.example @@ -2,15 +2,23 @@ DISCORD_BOT_TOKEN= DISCORD_APPLICATION_ID= DISCORD_COMMAND_GUILD_ID= -ELEVENLABS_API_KEY= -ELEVENLABS_VOICE_ID= -ELEVENLABS_STT_MODEL=scribe_v2_realtime -ELEVENLABS_TTS_MODEL=eleven_flash_v2_5 OLLAMA_BASE_URL=http://localhost:11434 OLLAMA_MODEL=qwen3:0.6b OLLAMA_KEEP_ALIVE=5m OLLAMA_NUM_CTX=4096 +LOCAL_AI_VENV_PATH=.local-ai/.venv +LOCAL_AI_CACHE_DIR=.local-ai/cache +LOCAL_AI_PYTHON= +LOCAL_STT_MODEL=tiny +LOCAL_STT_DEVICE=auto +LOCAL_STT_COMPUTE_TYPE=auto +LOCAL_STT_BEAM_SIZE=1 +LOCAL_TTS_LANGUAGE=KR +LOCAL_TTS_SPEAKER=KR +LOCAL_TTS_DEVICE=auto +LOCAL_TTS_SPEED=1.12 + BOT_DEFAULT_LANGUAGE=ko MAX_CONVERSATION_TURNS=12 LOCAL_AUDIO_SOURCE= diff --git a/.gitignore b/.gitignore index 9c97bbd..8d7204a 100644 --- a/.gitignore +++ b/.gitignore @@ -1,3 +1,4 @@ node_modules dist .env +.local-ai diff --git a/README.md b/README.md index 4a6dbad..4879f0f 100644 --- a/README.md +++ b/README.md @@ -1,127 +1,138 @@ # realtime_voice_bot -디스코드 음성 채널 또는 로컬 PC 마이크/스피커에서 한국어 음성을 인식하고, 로컬 LLM 응답을 생성한 뒤 ElevenLabs TTS로 다시 읽어주는 최소 프로토타입입니다. +디스코드 음성 채널 또는 로컬 PC 마이크에서 한국어 음성을 인식하고, 완전 로컬 스택으로 답변을 생성한 뒤 다시 음성으로 읽어주는 최소 프로토타입입니다. + +## 현재 스택 + +- STT: `faster-whisper` + Whisper multilingual +- LLM: `Ollama` + `qwen3:0.6b` +- TTS: `MeloTTS` Korean +- VAD: `avr-vad` + +외부 유료 API나 무료 한도형 API는 쓰지 않습니다. ## 현재 구현 범위 - Discord slash command 기반 제어: `/join`, `/leave`, `/status`, `/reset`, `/say` -- 로컬 테스트 모드: `pw-record` 입력, `pw-play` 출력 +- 로컬 테스트 모드: PC 마이크로 직접 말하고 바로 응답 확인 - `@discordjs/voice` 기반 음성 채널 입장 및 유저별 오디오 수신 - 48k stereo PCM을 16k mono로 내려서 유저별 VAD 처리 -- Silero 계열 VAD(`avr-vad`)로 발화 시작/종료 감지 -- ElevenLabs Scribe Realtime WebSocket으로 발화 단위 STT -- Ollama 로컬 LLM으로 짧은 한국어 답변 생성 -- ElevenLabs Flash v2.5 스트리밍 TTS -- 채널 단위 단일 재생 큐 -- 사용자 발화 시작 시 현재 TTS와 대기열 중단(barge-in) +- 화자 발화 시작 시 현재 재생과 대기열 즉시 중단 +- Python 로컬 워커를 한 번 띄워 STT/TTS 모델을 메모리에 유지 -## 권장 환경 +## 필수 준비물 - Bun `1.3+` - Node.js `22.12+` +- Python `3.11+` +- `ffmpeg` - Ollama -- Discord bot with Voice permissions -- ElevenLabs API key + 사용할 Voice ID + +Discord 모드까지 쓸 거면 추가로: + +- Discord bot token +- Discord application id + +## 빠른 시작 + +```bash +bun install +ollama pull qwen3:0.6b +bun run setup:local-ai +``` + +그다음 로컬 장치 확인: + +```bash +bun run devices +``` + +실행: + +```bash +bun run start:local +``` + +Discord 모드: + +```bash +bun run start:discord +``` ## 환경 변수 -`.env.example`를 참고해서 `.env`를 채우면 됩니다. - -필수: - -- `ELEVENLABS_API_KEY` -- `ELEVENLABS_VOICE_ID` +`.env.example`를 복사해서 `.env`를 채우면 됩니다. Discord 모드에서만 필수: - `DISCORD_BOT_TOKEN` - `DISCORD_APPLICATION_ID` +기본값이 이미 들어있는 로컬 AI 설정: + +- `OLLAMA_BASE_URL` +- `OLLAMA_MODEL` +- `OLLAMA_KEEP_ALIVE` +- `OLLAMA_NUM_CTX` +- `LOCAL_AI_VENV_PATH` +- `LOCAL_AI_CACHE_DIR` +- `LOCAL_STT_MODEL` +- `LOCAL_STT_DEVICE` +- `LOCAL_STT_COMPUTE_TYPE` +- `LOCAL_STT_BEAM_SIZE` +- `LOCAL_TTS_LANGUAGE` +- `LOCAL_TTS_SPEAKER` +- `LOCAL_TTS_DEVICE` +- `LOCAL_TTS_SPEED` + 선택: - `DISCORD_COMMAND_GUILD_ID` - 테스트 서버에만 slash command를 즉시 반영하려면 설정 -- `OLLAMA_BASE_URL` - - 기본값: `http://localhost:11434` -- `OLLAMA_MODEL` - - 기본값: `qwen3:0.6b` - - 가장 빠른 무료 오픈웨이트 로컬 기본값 -- `OLLAMA_KEEP_ALIVE` - - 기본값: `5m` -- `OLLAMA_NUM_CTX` - - 기본값: `4096` +- `LOCAL_AI_PYTHON` + - Python 경로 자동 탐지가 안 되면 설정 + - 예시: `python` + - Windows 예시: `py -3` - `LOCAL_AUDIO_SOURCE` - - `pw-record --target` 에 넣을 PipeWire source id 또는 node name + - 로컬 입력 장치 + - Linux는 `pw-record --target`, Windows는 `ffmpeg dshow` 장치 이름 - `LOCAL_AUDIO_SINK` - - `pw-play --target` 에 넣을 PipeWire sink id 또는 node name + - Linux 로컬 출력 장치 + - Windows는 현재 시스템 기본 출력 장치 사용 - `LOCAL_SPEAKER_NAME` - 로컬 테스트에서 프롬프트에 넣을 화자 이름 -- `ELEVENLABS_STT_MODEL` - - 기본값: `scribe_v2_realtime` -- `ELEVENLABS_TTS_MODEL` - - 기본값: `eleven_flash_v2_5` +- `BOT_DEFAULT_LANGUAGE` + - 기본값 `ko` - `DEBUG_TEXT_EVENTS` - - `true`면 명령을 실행한 텍스트 채널에 transcript/reply를 같이 올림 + - `true`면 transcript/reply를 콘솔에 같이 출력 -## 실행 +## 속도 우선 기본값 -```bash -bun install +- STT 기본 모델은 `tiny` +- LLM 기본 모델은 `qwen3:0.6b` +- TTS 기본 속도는 `1.12` + +정확도가 아쉬우면: + +```env +LOCAL_STT_MODEL=small +OLLAMA_MODEL=qwen3:1.7b ``` -Ollama 준비: +## 로컬 테스트 순서 -```bash -ollama pull qwen3:0.6b -``` - -속도보다 품질이 더 중요하면: - -```bash -ollama pull qwen3:1.7b -# 또는 -ollama pull qwen3:4b -``` - -디스코드 모드: - -```bash -bun run start:discord -``` - -로컬 장치 목록: - -```bash -bun run audio:devices -``` - -로컬 테스트 모드: - -```bash -bun run start:local -``` - -타입 체크: - -```bash -bun run check -``` - -## 사용 흐름 - -1. 봇을 서버에 초대하고 음성 권한을 부여합니다. -2. 음성 채널에 들어갑니다. -3. 텍스트 채널에서 `/join` 실행 -4. 말을 하면 봇이 발화 단위로 인식하고 음성으로 짧게 답합니다. -5. 다시 말하면 현재 읽고 있던 TTS는 즉시 중단됩니다. - -로컬 테스트: - -1. `bun run audio:devices` 로 source/sink id 또는 이름 확인 +1. `bun install` 2. `ollama pull qwen3:0.6b` -3. 필요하면 `.env` 에 `LOCAL_AUDIO_SOURCE`, `LOCAL_AUDIO_SINK`, `OLLAMA_MODEL` 설정 -3. `bun run start:local` -4. 마이크로 바로 말해서 응답 확인 +3. `bun run setup:local-ai` +4. `bun run devices` +5. 필요하면 `.env` 에 `LOCAL_AUDIO_SOURCE` 설정 +6. `bun run start:local` + +## Windows 메모 + +- `bun run devices` 와 Windows 로컬 녹음은 `ffmpeg`가 필요합니다. +- 출력 장치 직접 선택은 아직 미구현이라 시스템 기본 출력 장치로 재생됩니다. +- Python 탐지가 안 되면 `.env` 에 `LOCAL_AI_PYTHON=py -3` 또는 `LOCAL_AI_PYTHON=python` 을 넣으면 됩니다. ## 설계 메모 @@ -129,5 +140,4 @@ bun run check - 출력은 길드 세션당 단일 큐 - 로컬 모드는 단일 화자 입력 기준 - 화자 구분은 `speaker_id`, `speaker_name`을 LLM 프롬프트에 항상 포함 -- 현재 기본 LLM은 `qwen3:0.6b` 이며 속도 우선 설정이라 답변 품질이 약하면 `qwen3:1.7b` 또는 `qwen3:4b` 로 올리는 것을 권장합니다. -- STT/TTS는 아직 ElevenLabs API를 사용하므로 프로젝트 전체가 완전 무과금은 아닙니다. +- 모델 다운로드 캐시는 기본적으로 `.local-ai/cache` 아래에 저장 diff --git a/bun.lock b/bun.lock index aa5c27c..700a171 100644 --- a/bun.lock +++ b/bun.lock @@ -12,7 +12,6 @@ "ffmpeg-static": "^5.3.0", "opusscript": "^0.1.1", "prism-media": "^1.3.5", - "ws": "^8.20.0", "zod": "^4.3.6", }, "devDependencies": { @@ -22,6 +21,7 @@ }, }, "trustedDependencies": [ + "ffmpeg-static", "onnxruntime-node", ], "packages": { diff --git a/package.json b/package.json index c59611c..64cc66f 100644 --- a/package.json +++ b/package.json @@ -8,6 +8,7 @@ "start": "bun src/index.ts discord", "start:discord": "bun src/index.ts discord", "start:local": "bun src/index.ts local", + "setup:local-ai": "bun src/setup-local-ai.ts", "devices": "bun src/index.ts local-devices", "audio:devices": "bun src/index.ts local-devices", "check": "tsc --noEmit", @@ -25,7 +26,6 @@ "ffmpeg-static": "^5.3.0", "opusscript": "^0.1.1", "prism-media": "^1.3.5", - "ws": "^8.20.0", "zod": "^4.3.6" }, "devDependencies": { diff --git a/python/__pycache__/local_stt_worker.cpython-314.pyc b/python/__pycache__/local_stt_worker.cpython-314.pyc new file mode 100644 index 0000000..4aeda52 Binary files /dev/null and b/python/__pycache__/local_stt_worker.cpython-314.pyc differ diff --git a/python/__pycache__/local_tts_worker.cpython-314.pyc b/python/__pycache__/local_tts_worker.cpython-314.pyc new file mode 100644 index 0000000..e12fcdb Binary files /dev/null and b/python/__pycache__/local_tts_worker.cpython-314.pyc differ diff --git a/python/local_stt_worker.py b/python/local_stt_worker.py new file mode 100644 index 0000000..a94783f --- /dev/null +++ b/python/local_stt_worker.py @@ -0,0 +1,145 @@ +import base64 +import json +import os +import sys +import tempfile +import traceback +import wave + + +os.environ.setdefault("PYTHONIOENCODING", "utf-8") + + +def log(message: str) -> None: + print(message, file=sys.stderr, flush=True) + + +def write_response(request_id: int, ok: bool, result=None, error: str | None = None) -> None: + payload = { + "id": request_id, + "ok": ok, + } + if ok: + payload["result"] = result + else: + payload["error"] = error or "unknown error" + + sys.stdout.write(json.dumps(payload, ensure_ascii=False) + "\n") + sys.stdout.flush() + + +def resolve_device() -> str: + raw = os.environ.get("LOCAL_STT_DEVICE", "auto").strip().lower() + if raw and raw != "auto": + return raw + + try: + import ctranslate2 + + if ctranslate2.get_cuda_device_count() > 0: + return "cuda" + except Exception: + pass + + return "cpu" + + +def resolve_compute_type(device: str) -> str: + raw = os.environ.get("LOCAL_STT_COMPUTE_TYPE", "auto").strip().lower() + if raw and raw != "auto": + return raw + if device == "cuda": + return "int8_float16" + return "int8" + + +class SttWorker: + def __init__(self) -> None: + from faster_whisper import WhisperModel + + self.model_name = os.environ.get("LOCAL_STT_MODEL", "tiny").strip() or "tiny" + self.device = resolve_device() + self.compute_type = resolve_compute_type(self.device) + self.beam_size = int(os.environ.get("LOCAL_STT_BEAM_SIZE", "1")) + self.model = WhisperModel( + self.model_name, + device=self.device, + compute_type=self.compute_type, + ) + log( + f"local-stt ready model={self.model_name} device={self.device} compute={self.compute_type} beam={self.beam_size}" + ) + + def transcribe(self, audio_base64: str, language: str | None) -> str: + pcm_bytes = base64.b64decode(audio_base64) + temp_path = "" + + try: + with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as handle: + temp_path = handle.name + + with wave.open(temp_path, "wb") as wav_file: + wav_file.setnchannels(1) + wav_file.setsampwidth(2) + wav_file.setframerate(16000) + wav_file.writeframes(pcm_bytes) + + segments, _info = self.model.transcribe( + temp_path, + language=language, + beam_size=self.beam_size, + best_of=1, + condition_on_previous_text=False, + vad_filter=False, + without_timestamps=True, + temperature=0.0, + ) + return " ".join(segment.text.strip() for segment in segments if segment.text.strip()).strip() + finally: + if temp_path: + try: + os.unlink(temp_path) + except OSError: + pass + + +def main() -> int: + try: + worker = SttWorker() + except Exception as exc: + log("failed to initialize local STT worker") + log("run `bun run setup:local-ai` first if dependencies are missing") + log("".join(traceback.format_exception(exc))) + return 1 + + for line in sys.stdin: + line = line.strip() + if not line: + continue + + try: + request = json.loads(line) + request_id = int(request["id"]) + method = request["method"] + params = request.get("params", {}) + + if method == "ping": + write_response(request_id, True, {"ready": True}) + continue + if method != "transcribe": + raise ValueError(f"unsupported method: {method}") + + text = worker.transcribe( + audio_base64=str(params.get("audio_base64", "")), + language=str(params.get("language") or "").strip() or None, + ) + write_response(request_id, True, {"text": text}) + except Exception as exc: + error_text = "".join(traceback.format_exception_only(type(exc), exc)).strip() + write_response(request_id, False, error=error_text) + + return 0 + + +if __name__ == "__main__": + raise SystemExit(main()) diff --git a/python/local_tts_worker.py b/python/local_tts_worker.py new file mode 100644 index 0000000..10a25e9 --- /dev/null +++ b/python/local_tts_worker.py @@ -0,0 +1,125 @@ +import base64 +import json +import os +import sys +import tempfile +import traceback + + +os.environ.setdefault("PYTHONIOENCODING", "utf-8") + + +def log(message: str) -> None: + print(message, file=sys.stderr, flush=True) + + +def write_response(request_id: int, ok: bool, result=None, error: str | None = None) -> None: + payload = { + "id": request_id, + "ok": ok, + } + if ok: + payload["result"] = result + else: + payload["error"] = error or "unknown error" + + sys.stdout.write(json.dumps(payload, ensure_ascii=False) + "\n") + sys.stdout.flush() + + +class TtsWorker: + def __init__(self) -> None: + from melo.api import TTS + + self.language = os.environ.get("LOCAL_TTS_LANGUAGE", "KR").strip() or "KR" + self.speaker_key = os.environ.get("LOCAL_TTS_SPEAKER", "KR").strip() or "KR" + self.device = os.environ.get("LOCAL_TTS_DEVICE", "auto").strip() or "auto" + self.speed = float(os.environ.get("LOCAL_TTS_SPEED", "1.12")) + + self.model = TTS(language=self.language, device=self.device) + speaker_ids = self.model.hps.data.spk2id + self.speaker_id = speaker_ids.get(self.speaker_key) + + if self.speaker_id is None: + normalized = self.speaker_key.upper() + self.speaker_id = speaker_ids.get(normalized) + + if self.speaker_id is None: + self.speaker_id = next(iter(speaker_ids.values())) + + log( + f"local-tts ready language={self.language} speaker={self.speaker_key} device={self.device} speed={self.speed}" + ) + + def synthesize(self, text: str) -> bytes: + temp_path = "" + + try: + with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as handle: + temp_path = handle.name + + self.model.tts_to_file( + text, + self.speaker_id, + temp_path, + speed=self.speed, + quiet=True, + ) + + with open(temp_path, "rb") as handle: + return handle.read() + finally: + if temp_path: + try: + os.unlink(temp_path) + except OSError: + pass + + +def main() -> int: + try: + worker = TtsWorker() + except Exception as exc: + log("failed to initialize local TTS worker") + log("run `bun run setup:local-ai` first if dependencies are missing") + log("".join(traceback.format_exception(exc))) + return 1 + + for line in sys.stdin: + line = line.strip() + if not line: + continue + + try: + request = json.loads(line) + request_id = int(request["id"]) + method = request["method"] + params = request.get("params", {}) + + if method == "ping": + write_response(request_id, True, {"ready": True}) + continue + if method != "synthesize": + raise ValueError(f"unsupported method: {method}") + + text = str(params.get("text", "")).strip() + if not text: + raise ValueError("text is empty") + + audio = worker.synthesize(text) + write_response( + request_id, + True, + { + "wav_base64": base64.b64encode(audio).decode("ascii"), + }, + ) + except Exception as exc: + error_text = "".join(traceback.format_exception_only(type(exc), exc)).strip() + write_response(request_id, False, error=error_text) + + return 0 + + +if __name__ == "__main__": + raise SystemExit(main()) diff --git a/python/requirements.txt b/python/requirements.txt new file mode 100644 index 0000000..ccdfa62 --- /dev/null +++ b/python/requirements.txt @@ -0,0 +1,2 @@ +faster-whisper==1.2.1 +git+https://github.com/myshell-ai/MeloTTS.git@v0.1.2 diff --git a/src/audio/guild-voice-session.ts b/src/audio/guild-voice-session.ts index fb93718..78bf6e7 100644 --- a/src/audio/guild-voice-session.ts +++ b/src/audio/guild-voice-session.ts @@ -22,9 +22,9 @@ import type { AppConfig } from "../config.js"; import { Logger } from "../logger.js"; import { float32ToPcm16Buffer, int16ArrayToFloat32, Stereo48kToMono16kDownsampler, takeFrame } from "./pcm.js"; import { ConversationMemory, type UserUtterance } from "../services/conversation.js"; -import { ElevenLabsSttService } from "../services/elevenlabs-stt.js"; -import { ElevenLabsTtsService, type PreparedSpeechAudio } from "../services/elevenlabs-tts.js"; import type { LlmService } from "../services/llm.js"; +import type { SttService } from "../services/stt.js"; +import type { PreparedSpeechAudio, TtsService } from "../services/tts.js"; interface GuildVoiceSessionOptions { client: Client; @@ -33,8 +33,8 @@ interface GuildVoiceSessionOptions { guild: Guild; voiceChannel: VoiceBasedChannel; textChannelId?: string; - stt: ElevenLabsSttService; - tts: ElevenLabsTtsService; + stt: SttService; + tts: TtsService; llm: LlmService; } diff --git a/src/audio/local-voice-session.ts b/src/audio/local-voice-session.ts index 47c8148..8541e7c 100644 --- a/src/audio/local-voice-session.ts +++ b/src/audio/local-voice-session.ts @@ -12,15 +12,15 @@ import { Logger } from "../logger.js"; import { requireFfmpegPath } from "./ffmpeg-path.js"; import { takeFrame, int16ArrayToFloat32, float32ToPcm16Buffer } from "./pcm.js"; import { ConversationMemory, type UserUtterance } from "../services/conversation.js"; -import { ElevenLabsSttService } from "../services/elevenlabs-stt.js"; -import { ElevenLabsTtsService, type PreparedSpeechAudio } from "../services/elevenlabs-tts.js"; import type { LlmService } from "../services/llm.js"; +import type { SttService } from "../services/stt.js"; +import type { PreparedSpeechAudio, TtsService } from "../services/tts.js"; interface LocalVoiceSessionOptions { config: AssistantRuntimeConfig; logger: Logger; - stt: ElevenLabsSttService; - tts: ElevenLabsTtsService; + stt: SttService; + tts: TtsService; llm: LlmService; } diff --git a/src/config.ts b/src/config.ts index cd33135..c814ac1 100644 --- a/src/config.ts +++ b/src/config.ts @@ -15,14 +15,21 @@ const envSchema = z.object({ DISCORD_BOT_TOKEN: emptyToUndefined, DISCORD_APPLICATION_ID: emptyToUndefined, DISCORD_COMMAND_GUILD_ID: emptyToUndefined, - ELEVENLABS_API_KEY: emptyToUndefined, - ELEVENLABS_VOICE_ID: emptyToUndefined, - ELEVENLABS_STT_MODEL: z.string().min(1).default("scribe_v2_realtime"), - ELEVENLABS_TTS_MODEL: z.string().min(1).default("eleven_flash_v2_5"), OLLAMA_BASE_URL: z.string().min(1).default("http://localhost:11434"), OLLAMA_MODEL: z.string().min(1).default("qwen3:0.6b"), OLLAMA_KEEP_ALIVE: z.string().min(1).default("5m"), OLLAMA_NUM_CTX: z.coerce.number().int().min(512).max(32768).default(4096), + LOCAL_AI_VENV_PATH: z.string().min(1).default(".local-ai/.venv"), + LOCAL_AI_CACHE_DIR: z.string().min(1).default(".local-ai/cache"), + LOCAL_AI_PYTHON: emptyToUndefined, + LOCAL_STT_MODEL: z.string().min(1).default("tiny"), + LOCAL_STT_DEVICE: z.string().min(1).default("auto"), + LOCAL_STT_COMPUTE_TYPE: z.string().min(1).default("auto"), + LOCAL_STT_BEAM_SIZE: z.coerce.number().int().min(1).max(8).default(1), + LOCAL_TTS_LANGUAGE: z.string().min(1).default("KR"), + LOCAL_TTS_SPEAKER: z.string().min(1).default("KR"), + LOCAL_TTS_DEVICE: z.string().min(1).default("auto"), + LOCAL_TTS_SPEED: z.coerce.number().min(0.8).max(1.6).default(1.12), BOT_DEFAULT_LANGUAGE: z.string().min(2).default("ko"), MAX_CONVERSATION_TURNS: z.coerce.number().int().min(4).max(30).default(12), LOCAL_AUDIO_SOURCE: emptyToUndefined, @@ -36,10 +43,7 @@ const envSchema = z.object({ }); export type AppConfig = z.infer; -export type AssistantRuntimeConfig = AppConfig & { - ELEVENLABS_API_KEY: string; - ELEVENLABS_VOICE_ID: string; -}; +export type AssistantRuntimeConfig = AppConfig; export type DiscordRuntimeConfig = AssistantRuntimeConfig & { DISCORD_BOT_TOKEN: string; DISCORD_APPLICATION_ID: string; @@ -57,11 +61,7 @@ function requirePresent(value: string | undefined, name: string): string { } export function requireAssistantRuntimeConfig(config: AppConfig): AssistantRuntimeConfig { - return { - ...config, - ELEVENLABS_API_KEY: requirePresent(config.ELEVENLABS_API_KEY, "ELEVENLABS_API_KEY"), - ELEVENLABS_VOICE_ID: requirePresent(config.ELEVENLABS_VOICE_ID, "ELEVENLABS_VOICE_ID"), - }; + return config; } export function requireDiscordRuntimeConfig(config: AppConfig): DiscordRuntimeConfig { diff --git a/src/discord-main.ts b/src/discord-main.ts index f06045d..88cde72 100644 --- a/src/discord-main.ts +++ b/src/discord-main.ts @@ -15,8 +15,8 @@ import { Client as DiscordClient } from "discord.js"; import { GuildVoiceSession } from "./audio/guild-voice-session.js"; import { type DiscordRuntimeConfig } from "./config.js"; import { Logger } from "./logger.js"; -import { ElevenLabsSttService } from "./services/elevenlabs-stt.js"; -import { ElevenLabsTtsService } from "./services/elevenlabs-tts.js"; +import { LocalFasterWhisperSttService } from "./services/local-stt.js"; +import { LocalMeloTtsService } from "./services/local-tts.js"; import { OllamaLlmService } from "./services/ollama-llm.js"; export async function runDiscordBot(config: DiscordRuntimeConfig, logger: Logger): Promise { @@ -37,11 +37,14 @@ export async function runDiscordBot(config: DiscordRuntimeConfig, logger: Logger intents: [GatewayIntentBits.Guilds, GatewayIntentBits.GuildVoiceStates], }); - const stt = new ElevenLabsSttService(config); - const tts = new ElevenLabsTtsService(config); + const stt = new LocalFasterWhisperSttService(config, logger); + const tts = new LocalMeloTtsService(config, logger); const llm = new OllamaLlmService(config); const sessions = new Map(); + await stt.warmup(); + await tts.warmup(); + function getVoiceChannel(interaction: ChatInputCommandInteraction): VoiceBasedChannel | null { const member = interaction.member as GuildMember | null; return member?.voice.channel ?? null; @@ -174,6 +177,7 @@ export async function runDiscordBot(config: DiscordRuntimeConfig, logger: Logger }); } sessions.clear(); + await Promise.allSettled([stt.destroy?.(), tts.destroy?.()]); await client.destroy(); process.exit(exitCode); } diff --git a/src/local-main.ts b/src/local-main.ts index 124e57f..ebd2c5a 100644 --- a/src/local-main.ts +++ b/src/local-main.ts @@ -5,8 +5,8 @@ import type { AssistantRuntimeConfig } from "./config.js"; import { Logger } from "./logger.js"; import { LocalVoiceSession } from "./audio/local-voice-session.js"; import { requireFfmpegPath } from "./audio/ffmpeg-path.js"; -import { ElevenLabsSttService } from "./services/elevenlabs-stt.js"; -import { ElevenLabsTtsService } from "./services/elevenlabs-tts.js"; +import { LocalFasterWhisperSttService } from "./services/local-stt.js"; +import { LocalMeloTtsService } from "./services/local-tts.js"; import { OllamaLlmService } from "./services/ollama-llm.js"; export async function printLocalAudioDevices(): Promise { @@ -67,9 +67,13 @@ export async function printLocalAudioDevices(): Promise { } export async function runLocalAssistant(config: AssistantRuntimeConfig, logger: Logger): Promise { - const stt = new ElevenLabsSttService(config); - const tts = new ElevenLabsTtsService(config); + const stt = new LocalFasterWhisperSttService(config, logger); + const tts = new LocalMeloTtsService(config, logger); const llm = new OllamaLlmService(config); + + await stt.warmup(); + await tts.warmup(); + const session = new LocalVoiceSession({ config, logger, @@ -91,6 +95,7 @@ export async function runLocalAssistant(config: AssistantRuntimeConfig, logger: await session.destroy().catch((error) => { logger.warn("Local session shutdown failed", error); }); + await Promise.allSettled([stt.destroy?.(), tts.destroy?.()]); process.exit(exitCode); }; diff --git a/src/python-runtime.ts b/src/python-runtime.ts new file mode 100644 index 0000000..48d4bcb --- /dev/null +++ b/src/python-runtime.ts @@ -0,0 +1,90 @@ +import { existsSync } from "node:fs"; +import { spawnSync } from "node:child_process"; +import path from "node:path"; + +import type { AppConfig } from "./config.js"; + +export interface PythonLaunch { + command: string; + args: string[]; + source: "venv" | "configured" | "system"; +} + +function splitCommandSpec(spec: string): string[] { + return spec.match(/(?:[^\s"]+|"[^"]*")+/g)?.map((part) => part.replace(/^"|"$/g, "")) ?? []; +} + +function canRun(command: string, args: string[]): boolean { + const result = spawnSync(command, [...args, "--version"], { + encoding: "utf8", + }); + return result.status === 0; +} + +export function resolveLocalAiVenvPath(config: AppConfig): string { + return path.resolve(process.cwd(), config.LOCAL_AI_VENV_PATH); +} + +export function resolveLocalAiCachePath(config: AppConfig): string { + return path.resolve(process.cwd(), config.LOCAL_AI_CACHE_DIR); +} + +export function resolveVenvPythonPath(config: AppConfig): string { + const venvPath = resolveLocalAiVenvPath(config); + return process.platform === "win32" + ? path.join(venvPath, "Scripts", "python.exe") + : path.join(venvPath, "bin", "python"); +} + +export function resolvePythonLaunch(config: AppConfig, options?: { preferVenv?: boolean }): PythonLaunch { + const preferVenv = options?.preferVenv ?? true; + const venvPython = resolveVenvPythonPath(config); + + if (preferVenv && existsSync(venvPython)) { + return { + command: venvPython, + args: [], + source: "venv", + }; + } + + const configured = config.LOCAL_AI_PYTHON ? splitCommandSpec(config.LOCAL_AI_PYTHON) : []; + if (configured.length > 0 && canRun(configured[0]!, configured.slice(1))) { + return { + command: configured[0]!, + args: configured.slice(1), + source: "configured", + }; + } + + const candidates = + process.platform === "win32" + ? [ + ["py", "-3"], + ["python"], + ["python3"], + ] + : [ + ["python3"], + ["python"], + ]; + + for (const [command, ...args] of candidates) { + if (canRun(command, args)) { + return { + command, + args, + source: "system", + }; + } + } + + throw new Error( + [ + "Python 실행 파일을 찾지 못했습니다.", + "1. Python 3.11 이상을 설치", + "2. 필요하면 `.env` 에 `LOCAL_AI_PYTHON=python` 또는 `LOCAL_AI_PYTHON=py -3` 설정", + "3. 그 다음 `bun run setup:local-ai` 실행", + ].join("\n"), + ); +} diff --git a/src/services/elevenlabs-stt.ts b/src/services/elevenlabs-stt.ts deleted file mode 100644 index 67b7979..0000000 --- a/src/services/elevenlabs-stt.ts +++ /dev/null @@ -1,124 +0,0 @@ -import WebSocket from "ws"; - -import type { AssistantRuntimeConfig } from "../config.js"; - -interface ElevenLabsMessage { - message_type?: string; - text?: string; - error?: string; -} - -const NON_FATAL_ERROR_TYPES = new Set([ - "insufficient_audio_activity", -]); - -export class ElevenLabsSttService { - constructor(private readonly config: AssistantRuntimeConfig) {} - - async transcribePcm16(pcm16MonoAudio: Buffer): Promise { - if (pcm16MonoAudio.byteLength === 0) { - return null; - } - - const url = new URL("wss://api.elevenlabs.io/v1/speech-to-text/realtime"); - url.searchParams.set("model_id", this.config.ELEVENLABS_STT_MODEL); - url.searchParams.set("language_code", this.config.BOT_DEFAULT_LANGUAGE); - url.searchParams.set("audio_format", "pcm_16000"); - url.searchParams.set("commit_strategy", "manual"); - url.searchParams.set("include_timestamps", "false"); - url.searchParams.set("include_language_detection", "false"); - url.searchParams.set("enable_logging", "false"); - - return await new Promise((resolve, reject) => { - const socket = new WebSocket(url, { - headers: { - "xi-api-key": this.config.ELEVENLABS_API_KEY, - }, - }); - - let settled = false; - let lastTranscript = ""; - - const timeout = setTimeout(() => { - finish(lastTranscript || null); - }, 15_000); - - const finish = (result: string | null, error?: Error) => { - if (settled) { - return; - } - settled = true; - clearTimeout(timeout); - try { - socket.close(); - } catch { - // Ignore close race. - } - - if (error) { - reject(error); - return; - } - resolve(result); - }; - - socket.on("message", (raw) => { - let message: ElevenLabsMessage; - try { - message = JSON.parse(raw.toString()) as ElevenLabsMessage; - } catch (error) { - finish(null, error as Error); - return; - } - - switch (message.message_type) { - case "session_started": - socket.send( - JSON.stringify({ - message_type: "input_audio_chunk", - audio_base_64: pcm16MonoAudio.toString("base64"), - commit: true, - sample_rate: 16000, - }), - ); - return; - case "partial_transcript": - return; - case "committed_transcript": - case "committed_transcript_with_timestamps": { - const transcript = message.text?.trim() ?? ""; - if (transcript.length > 0) { - lastTranscript = transcript; - finish(transcript); - } - return; - } - default: - if (!message.message_type?.endsWith("error") && !message.message_type) { - return; - } - - if (message.message_type && NON_FATAL_ERROR_TYPES.has(message.message_type)) { - finish(null); - return; - } - - finish( - null, - new Error(message.error ?? `ElevenLabs STT error: ${message.message_type ?? "unknown"}`), - ); - } - }); - - socket.on("error", (error) => { - finish(null, error as Error); - }); - - socket.on("close", () => { - if (!settled) { - finish(lastTranscript || null); - } - }); - }); - } -} diff --git a/src/services/elevenlabs-tts.ts b/src/services/elevenlabs-tts.ts deleted file mode 100644 index 2b38975..0000000 --- a/src/services/elevenlabs-tts.ts +++ /dev/null @@ -1,78 +0,0 @@ -import { Readable } from "node:stream"; - -import prism from "prism-media"; - -import type { AssistantRuntimeConfig } from "../config.js"; -import { resolveFfmpegPath } from "../audio/ffmpeg-path.js"; - -export interface PreparedSpeechAudio { - stream: Readable; - dispose: () => void; -} - -export class ElevenLabsTtsService { - constructor(private readonly config: AssistantRuntimeConfig) { - const resolvedFfmpegPath = resolveFfmpegPath(); - if (resolvedFfmpegPath && !process.env.FFMPEG_PATH) { - process.env.FFMPEG_PATH = resolvedFfmpegPath; - } - } - - async preparePlayback(text: string, signal?: AbortSignal): Promise { - const url = new URL(`https://api.elevenlabs.io/v1/text-to-speech/${this.config.ELEVENLABS_VOICE_ID}/stream`); - url.searchParams.set("output_format", "mp3_44100_128"); - url.searchParams.set("enable_logging", "false"); - - const response = await fetch(url, { - method: "POST", - headers: { - "Content-Type": "application/json", - "xi-api-key": this.config.ELEVENLABS_API_KEY, - }, - body: JSON.stringify({ - text, - model_id: this.config.ELEVENLABS_TTS_MODEL, - language_code: this.config.BOT_DEFAULT_LANGUAGE, - voice_settings: { - stability: 0.35, - similarity_boost: 0.75, - speed: 1.05, - }, - }), - signal, - }); - - if (!response.ok || !response.body) { - throw new Error(`ElevenLabs TTS request failed with status ${response.status}`); - } - - const input = Readable.fromWeb(response.body as never); - const ffmpeg = new prism.FFmpeg({ - args: [ - "-analyzeduration", - "0", - "-loglevel", - "0", - "-i", - "pipe:0", - "-f", - "s16le", - "-ar", - "48000", - "-ac", - "2", - "pipe:1", - ], - }); - - input.pipe(ffmpeg); - - return { - stream: ffmpeg, - dispose: () => { - input.destroy(); - ffmpeg.destroy(); - }, - }; - } -} diff --git a/src/services/local-stt.ts b/src/services/local-stt.ts new file mode 100644 index 0000000..60c3339 --- /dev/null +++ b/src/services/local-stt.ts @@ -0,0 +1,43 @@ +import type { AssistantRuntimeConfig } from "../config.js"; +import type { Logger } from "../logger.js"; +import { PythonJsonWorker } from "./python-json-worker.js"; +import type { SttService } from "./stt.js"; + +interface TranscribeResult { + text?: string; +} + +export class LocalFasterWhisperSttService implements SttService { + private readonly worker: PythonJsonWorker; + + constructor(private readonly config: AssistantRuntimeConfig, logger: Logger) { + this.worker = new PythonJsonWorker(config, logger, "local_stt_worker.py", "local-stt", { + LOCAL_STT_MODEL: config.LOCAL_STT_MODEL, + LOCAL_STT_DEVICE: config.LOCAL_STT_DEVICE, + LOCAL_STT_COMPUTE_TYPE: config.LOCAL_STT_COMPUTE_TYPE, + LOCAL_STT_BEAM_SIZE: String(config.LOCAL_STT_BEAM_SIZE), + }); + } + + async warmup(): Promise { + await this.worker.request("ping", {}); + } + + async transcribePcm16(pcm16MonoAudio: Buffer): Promise { + if (pcm16MonoAudio.byteLength === 0) { + return null; + } + + const result = await this.worker.request("transcribe", { + audio_base64: pcm16MonoAudio.toString("base64"), + language: this.config.BOT_DEFAULT_LANGUAGE, + }); + + const transcript = result.text?.trim() ?? ""; + return transcript.length > 0 ? transcript : null; + } + + async destroy(): Promise { + await this.worker.destroy(); + } +} diff --git a/src/services/local-tts.ts b/src/services/local-tts.ts new file mode 100644 index 0000000..3191c09 --- /dev/null +++ b/src/services/local-tts.ts @@ -0,0 +1,94 @@ +import { Readable } from "node:stream"; + +import prism from "prism-media"; + +import type { AssistantRuntimeConfig } from "../config.js"; +import type { Logger } from "../logger.js"; +import { resolveFfmpegPath } from "../audio/ffmpeg-path.js"; +import { PythonJsonWorker } from "./python-json-worker.js"; +import type { PreparedSpeechAudio, TtsService } from "./tts.js"; + +interface SynthesizeResult { + wav_base64?: string; +} + +export class LocalMeloTtsService implements TtsService { + private readonly worker: PythonJsonWorker; + + constructor(config: AssistantRuntimeConfig, logger: Logger) { + const resolvedFfmpegPath = resolveFfmpegPath(); + if (resolvedFfmpegPath && !process.env.FFMPEG_PATH) { + process.env.FFMPEG_PATH = resolvedFfmpegPath; + } + + this.worker = new PythonJsonWorker(config, logger, "local_tts_worker.py", "local-tts", { + LOCAL_TTS_LANGUAGE: config.LOCAL_TTS_LANGUAGE, + LOCAL_TTS_SPEAKER: config.LOCAL_TTS_SPEAKER, + LOCAL_TTS_DEVICE: config.LOCAL_TTS_DEVICE, + LOCAL_TTS_SPEED: String(config.LOCAL_TTS_SPEED), + }); + } + + async warmup(): Promise { + await this.worker.request("ping", {}); + } + + async preparePlayback(text: string, signal?: AbortSignal): Promise { + const result = await this.worker.request( + "synthesize", + { + text, + }, + signal, + ); + + const wavBase64 = result.wav_base64; + if (!wavBase64) { + throw new Error("로컬 TTS가 빈 오디오를 반환했습니다."); + } + + const input = Readable.from([Buffer.from(wavBase64, "base64")]); + const ffmpeg = new prism.FFmpeg({ + args: [ + "-analyzeduration", + "0", + "-loglevel", + "0", + "-i", + "pipe:0", + "-f", + "s16le", + "-ar", + "48000", + "-ac", + "2", + "pipe:1", + ], + }); + + if (signal) { + signal.addEventListener( + "abort", + () => { + input.destroy(); + ffmpeg.destroy(); + }, + { once: true }, + ); + } + + input.pipe(ffmpeg); + + return { + stream: ffmpeg, + dispose: () => { + input.destroy(); + ffmpeg.destroy(); + }, + }; + } + + async destroy(): Promise { + await this.worker.destroy(); + } +} diff --git a/src/services/python-json-worker.ts b/src/services/python-json-worker.ts new file mode 100644 index 0000000..f48f304 --- /dev/null +++ b/src/services/python-json-worker.ts @@ -0,0 +1,189 @@ +import { spawn, type ChildProcessWithoutNullStreams } from "node:child_process"; +import { createInterface } from "node:readline"; +import path from "node:path"; + +import type { AssistantRuntimeConfig } from "../config.js"; +import type { Logger } from "../logger.js"; +import { resolveLocalAiCachePath, resolvePythonLaunch } from "../python-runtime.js"; + +interface WorkerRequest { + id: number; + method: string; + params: Record; +} + +interface WorkerResponse { + id: number; + ok: boolean; + result?: unknown; + error?: string; +} + +export class PythonJsonWorker { + private child: ChildProcessWithoutNullStreams | null = null; + private nextId = 1; + private readonly pending = new Map< + number, + { + resolve: (value: unknown) => void; + reject: (error: Error) => void; + } + >(); + + constructor( + private readonly config: AssistantRuntimeConfig, + private readonly logger: Logger, + private readonly scriptName: string, + private readonly label: string, + private readonly workerEnv: Record, + ) {} + + async request(method: string, params: Record, signal?: AbortSignal): Promise { + const child = this.ensureStarted(); + const id = this.nextId++; + + return await new Promise((resolve, reject) => { + if (signal?.aborted) { + reject(new Error(`${this.label} request aborted before start`)); + return; + } + + const abortHandler = () => { + this.pending.delete(id); + reject(new Error(`${this.label} request aborted`)); + }; + + if (signal) { + signal.addEventListener("abort", abortHandler, { once: true }); + } + + this.pending.set(id, { + resolve: (value) => { + if (signal) { + signal.removeEventListener("abort", abortHandler); + } + resolve(value as T); + }, + reject: (error) => { + if (signal) { + signal.removeEventListener("abort", abortHandler); + } + reject(error); + }, + }); + + const message: WorkerRequest = { + id, + method, + params, + }; + + child.stdin.write(`${JSON.stringify(message)}\n`); + }); + } + + async destroy(): Promise { + this.rejectAll(new Error(`${this.label} worker terminated`)); + + if (!this.child) { + return; + } + + const child = this.child; + this.child = null; + + child.kill("SIGTERM"); + await new Promise((resolve) => { + child.once("exit", () => resolve()); + setTimeout(resolve, 1_500); + }); + } + + private ensureStarted(): ChildProcessWithoutNullStreams { + if (this.child) { + return this.child; + } + + const launch = resolvePythonLaunch(this.config); + const scriptPath = path.resolve(process.cwd(), "python", this.scriptName); + const cachePath = resolveLocalAiCachePath(this.config); + const recentStderr: string[] = []; + + const child = spawn(launch.command, [...launch.args, scriptPath], { + stdio: ["pipe", "pipe", "pipe"], + env: { + ...process.env, + HF_HOME: cachePath, + TRANSFORMERS_CACHE: cachePath, + PYTHONIOENCODING: "utf-8", + BOT_DEFAULT_LANGUAGE: this.config.BOT_DEFAULT_LANGUAGE, + ...this.workerEnv, + }, + }); + + createInterface({ + input: child.stdout, + crlfDelay: Number.POSITIVE_INFINITY, + }).on("line", (line) => { + if (!line.trim()) { + return; + } + + let payload: WorkerResponse; + try { + payload = JSON.parse(line) as WorkerResponse; + } catch (error) { + this.logger.warn(`${this.label} stdout parse failed`, error); + return; + } + + const pending = this.pending.get(payload.id); + if (!pending) { + return; + } + + this.pending.delete(payload.id); + if (payload.ok) { + pending.resolve(payload.result); + return; + } + + pending.reject(new Error(payload.error ?? `${this.label} worker error`)); + }); + + child.stderr.on("data", (chunk: Buffer) => { + const text = chunk.toString().trim(); + if (text.length > 0) { + recentStderr.push(text); + if (recentStderr.length > 20) { + recentStderr.shift(); + } + this.logger.warn(`[${this.label}]`, text); + } + }); + + child.on("exit", (code, signal) => { + if (this.child === child) { + this.child = null; + } + + const detail = recentStderr.length > 0 ? `\n${recentStderr.join("\n")}` : ""; + this.rejectAll(new Error(`${this.label} worker exited code=${code ?? "null"} signal=${signal ?? "null"}${detail}`)); + }); + + child.on("error", (error) => { + this.rejectAll(error as Error); + }); + + this.child = child; + return child; + } + + private rejectAll(error: Error): void { + const pending = [...this.pending.values()]; + this.pending.clear(); + for (const item of pending) { + item.reject(error); + } + } +} diff --git a/src/services/stt.ts b/src/services/stt.ts new file mode 100644 index 0000000..393aeb8 --- /dev/null +++ b/src/services/stt.ts @@ -0,0 +1,4 @@ +export interface SttService { + transcribePcm16(pcm16MonoAudio: Buffer): Promise; + destroy?(): Promise; +} diff --git a/src/services/tts.ts b/src/services/tts.ts new file mode 100644 index 0000000..556b3a6 --- /dev/null +++ b/src/services/tts.ts @@ -0,0 +1,11 @@ +import type { Readable } from "node:stream"; + +export interface PreparedSpeechAudio { + stream: Readable; + dispose: () => void; +} + +export interface TtsService { + preparePlayback(text: string, signal?: AbortSignal): Promise; + destroy?(): Promise; +} diff --git a/src/setup-local-ai.ts b/src/setup-local-ai.ts new file mode 100644 index 0000000..229ac92 --- /dev/null +++ b/src/setup-local-ai.ts @@ -0,0 +1,88 @@ +import { existsSync } from "node:fs"; +import { mkdir } from "node:fs/promises"; +import { spawn } from "node:child_process"; +import path from "node:path"; + +import { loadConfig } from "./config.js"; +import { resolveLocalAiCachePath, resolveLocalAiVenvPath, resolvePythonLaunch, resolveVenvPythonPath } from "./python-runtime.js"; + +async function run(command: string, args: string[], extraEnv?: NodeJS.ProcessEnv): Promise { + await new Promise((resolve, reject) => { + const child = spawn(command, args, { + stdio: "inherit", + env: { + ...process.env, + ...extraEnv, + }, + }); + + child.on("exit", (code) => { + if (code === 0) { + resolve(); + return; + } + reject(new Error(`${command} ${args.join(" ")} exited with code ${code ?? "null"}`)); + }); + child.on("error", reject); + }); +} + +async function ensurePip(pythonBin: string, env: NodeJS.ProcessEnv): Promise { + await new Promise((resolve, reject) => { + const child = spawn(pythonBin, ["-m", "pip", "--version"], { + stdio: "ignore", + env, + }); + child.on("exit", (code) => { + if (code === 0) { + resolve(); + return; + } + reject(new Error("pip missing")); + }); + child.on("error", reject); + }).catch(async () => { + await run(pythonBin, ["-m", "ensurepip", "--upgrade"], env); + }); +} + +async function main(): Promise { + const config = loadConfig(); + const venvPath = resolveLocalAiVenvPath(config); + const venvPython = resolveVenvPythonPath(config); + const cachePath = resolveLocalAiCachePath(config); + const requirementsPath = path.resolve(process.cwd(), "python", "requirements.txt"); + const baseEnv = { + HF_HOME: cachePath, + TRANSFORMERS_CACHE: cachePath, + PYTHONIOENCODING: "utf-8", + }; + + await mkdir(cachePath, { recursive: true }); + + if (!existsSync(venvPython)) { + const launch = resolvePythonLaunch(config, { preferVenv: false }); + console.log(`기본 Python 확인: ${launch.command} ${launch.args.join(" ")}`.trim()); + console.log(`가상환경 생성: ${venvPath}`); + await run(launch.command, [...launch.args, "-m", "venv", venvPath], baseEnv); + } + + await ensurePip(venvPython, { + ...process.env, + ...baseEnv, + }); + + console.log("로컬 AI 의존성 설치를 시작합니다."); + await run(venvPython, ["-m", "pip", "install", "--upgrade", "pip", "setuptools", "wheel"], baseEnv); + await run(venvPython, ["-m", "pip", "install", "-r", requirementsPath], baseEnv); + + console.log("설치가 끝났습니다."); + console.log("다음 순서:"); + console.log("1. bun run devices"); + console.log("2. bun run start:local"); +} + +void main().catch((error) => { + console.error(error instanceof Error ? error.message : String(error)); + process.exit(1); +});