bringIntoView returned the last boundingBox() unconditionally after the
scroll loop exhausted, so an element still outside the viewport would be
clicked anyway. Validate the final box against the actual viewport bounds
on both axes (innerWidth/innerHeight) and return null otherwise, so
humanClick fails instead of clicking an off-screen coordinate.
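A minimal sketch of the viewport check described above, not the actual bringIntoView code. The `Box` and `Viewport` shapes and the `validateBox` name are hypothetical, and testing the box centre (the point a click would target) is an assumption; the real check may require the whole box to be on screen.

```typescript
// Hypothetical shapes: Box mirrors the {x, y, width, height} object that
// boundingBox() yields; Viewport stands in for window.innerWidth/innerHeight.
interface Box { x: number; y: number; width: number; height: number }
interface Viewport { innerWidth: number; innerHeight: number }

// Returns the box only when its centre lies inside the viewport on both
// axes; returns null otherwise so the caller can fail instead of clicking.
function validateBox(box: Box | null, vp: Viewport): Box | null {
  if (box === null) return null;
  const cx = box.x + box.width / 2;
  const cy = box.y + box.height / 2;
  const insideX = cx >= 0 && cx <= vp.innerWidth;
  const insideY = cy >= 0 && cy <= vp.innerHeight;
  return insideX && insideY ? box : null;
}
```

A caller such as humanClick would treat a null result as "no on-screen box" and throw.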
Address review feedback on accuracy: humanClick used DOM
scrollIntoViewIfNeeded and fell back to Playwright locator.click() when an
element had no box; neither is real input. It now brings elements into view
with a real wheel scroll and throws if
there is no on-screen box (no synthetic click). Header comment and README
corrected: xdotool injects synthetic X input (not a physical HID device), and
all actions are real input while the CDP/DOM API is used only to read state.
Make every action real keyboard/mouse via xdotool, not just the visible
browsing: address-bar navigation (Ctrl+L + char-by-char typing), the YouTube
settings gear -> 화질 (Quality) -> 1080p menu (real clicks, verified hd1080), the autoplay
toggle, the play button, and fullscreen via the real 'f' key (F11 isn't honored
by this WM; 'f' yields true 1080p fullscreen without pausing). CDP/DOM API is
now used only to read state for verification.
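As a rough illustration of the address-bar navigation above, the sketch below only builds the xdotool command lines; it does not run them. The `addressBarNavigation` name is hypothetical, and pressing Return at the end is an assumption about how the navigation is submitted.

```typescript
type Argv = string[];

// Builds the xdotool invocations for: focus the address bar (Ctrl+L),
// type the URL one character per invocation, then press Return.
// Running the commands and pausing between them is left to the caller.
function addressBarNavigation(url: string): Argv[] {
  const steps: Argv[] = [["xdotool", "key", "ctrl+l"]];
  for (const ch of url) {
    steps.push(["xdotool", "type", ch]);
  }
  steps.push(["xdotool", "key", "Return"]);
  return steps;
}
```

Emitting one `type` call per character lets the caller insert its own randomised delay between keystrokes rather than relying on a fixed one.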
The startup catch cleared this.active unconditionally. In a stop()+restart
race during the slow login/pauses, the first attempt's catch would fire after
the second start() had already taken the lock, unlocking it mid-startup and
letting a third start() race in. Guard the active/state reset with
`this.controller === controller`, matching the field-null guard and the guard
in playStream's .finally.
Verified live: stop during login then restart keeps the restart's lock
(active stays true), and it clears to false only once truly stopped; no crash.
The human-pause delays leave start() in-flight for several seconds, which
exposed two races:
- stop() during a pause only ended the pause; start() continued and called
joinVoice on the streamer stop() had already nulled (null deref).
- `active` was set only just before go-live, so a second /stream during the
delay passed the guard and both calls raced on the same overwritten streamer.
Now start() locks `active` before any await, keeps controller/streamer/capture
as local refs, and calls signal.throwIfAborted() after each await so an
interleaved stop() unwinds into a catch that tears down via the local refs and
clears instance state only if it still points at this attempt. isActive() now
reflects "starting" during the delay too.
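The locking scheme above can be sketched as follows. This is a simplified model under stated assumptions, not the real selfbot.ts: `StreamSession`, the `steps` parameter and the returned strings are hypothetical, and teardown of streamer/capture is reduced to a comment.

```typescript
type Step = (signal: AbortSignal) => Promise<void>;

class StreamSession {
  private active = false;
  private controller: AbortController | null = null;

  isActive(): boolean {
    return this.active;
  }

  async start(steps: Step[]): Promise<string> {
    if (this.active) return "already streaming";
    this.active = true; // take the lock before any await
    const controller = new AbortController(); // local ref for this attempt
    this.controller = controller;
    try {
      for (const step of steps) {
        await step(controller.signal);
        controller.signal.throwIfAborted(); // unwind if stop() interleaved
      }
      return "live";
    } catch {
      // Tear down via local refs here, then reset instance state only if it
      // still belongs to this attempt (a restart may already own the lock).
      if (this.controller === controller) {
        this.controller = null;
        this.active = false;
      }
      return "cancelled";
    }
  }

  stop(): void {
    const controller = this.controller;
    this.controller = null;
    this.active = false;
    controller?.abort();
  }
}
```

The identity check in the catch is what keeps a stale attempt from unlocking a newer start().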
Verified live: concurrent start is rejected ("이미 송출 중입니다", i.e. "already
streaming"), stop() mid-startup returns a cancel message with isActive=false
and no uncaught error, and
the happy path still goes live and tears down cleanly. tsc --noEmit passes.
Joining voice and starting the broadcast instantly looks like a bot. Add
randomised, human-plausible pauses (~0.9-2.2s after coming online before
joining the channel, ~2.5-5s after joining before hitting Go Live) so the
cadence isn't machine-instant or fingerprintable. The pause resolves
immediately on stop() so teardown never hangs mid-wait.
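One way to write such a pause, as a sketch only: `humanPause` is a hypothetical name, and wiring stop() to an AbortSignal is an assumption about how the real code cancels the wait.

```typescript
// Waits a uniformly random time in [minMs, maxMs], but resolves right away
// if the signal is (or becomes) aborted, so teardown never hangs on the wait.
function humanPause(minMs: number, maxMs: number, signal: AbortSignal): Promise<void> {
  const ms = minMs + Math.random() * (maxMs - minMs);
  return new Promise<void>((resolve) => {
    if (signal.aborted) {
      resolve();
      return;
    }
    const timer = setTimeout(() => {
      signal.removeEventListener("abort", onAbort);
      resolve();
    }, ms);
    const onAbort = (): void => {
      clearTimeout(timer);
      resolve();
    };
    signal.addEventListener("abort", onAbort, { once: true });
  });
}
```

Because the pause resolves rather than rejects on abort, the caller still needs its own abort check after the await, for example `signal.throwIfAborted()`. A call such as `await humanPause(900, 2200, signal)` would correspond to the first delay mentioned above.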
Verified live: end-to-end join -> settle -> Go Live took ~8s before the
stream went live, held for 15s, and tore down cleanly. tsc --noEmit passes.
Address review: the capture ffmpeg had no -b:v, so it encoded at nvenc's
low default (~2.47 Mbps). The library then re-encoded to 8 Mbps, which
could not restore the detail already lost. The double encode also kept CPU
decode + scale + re-encode in the library, contradicting the "GPU handles it"
claim.
Now the system ffmpeg produces the final Discord-ready H264 in one pass
(-b:v/-maxrate at the configured bitrate, -bf 0, 1s keyframes, yuv420p,
-forced-idr) and prepareStream uses noTranscoding:true to remux only. One
GPU encode, no library decode/scale/re-encode.
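For illustration, the single-pass arguments listed above could be assembled like this. It is a sketch, not the actual selfbot.ts: `captureArgs` and `CaptureOptions` are hypothetical names, and the matroska container written to stdout is an assumption, since the output format is not stated here.

```typescript
interface CaptureOptions {
  display: string; // X display to grab, e.g. ":1"
  width: number;
  height: number;
  fps: number;
  bitrateKbps: number;
}

// Builds the argument list for one GPU encode that is already Discord-ready,
// so the library only has to remux it.
function captureArgs(o: CaptureOptions): string[] {
  return [
    "-f", "x11grab",
    "-framerate", String(o.fps),
    "-video_size", `${o.width}x${o.height}`,
    "-i", o.display,
    "-c:v", "h264_nvenc",
    "-b:v", `${o.bitrateKbps}k`,
    "-maxrate", `${o.bitrateKbps}k`,
    "-bf", "0", // no B-frames
    "-g", String(o.fps), // one keyframe per second
    "-forced-idr", "1",
    "-pix_fmt", "yuv420p",
    "-f", "matroska", "pipe:1", // assumption: container and stdout pipe
  ];
}
```

With the defaults mentioned in this log, `captureArgs({ display: ":1", width: 1920, height: 1080, fps: 60, bitrateKbps: 8000 })` yields `-b:v 8000k -maxrate 8000k` and `-g 60`.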
Verified locally: high-motion source fills 8.7 Mbps at these args (vs the
~2.47 Mbps no-bitrate default), real :1 desktop holds 60fps at realtime,
and the capture -> copy/remux chain yields h264 1920x1080 yuv420p 60fps
has_b_frames=0. tsc --noEmit passes. Live Discord test pending reboot.
Bump the default broadcast to 1080p 60fps at 8 Mbps and route both encode
stages through the GPU (RTX 5050, h264_nvenc) so 60fps stays smooth without
loading the 4-core host.
- selfbot.ts: capture ffmpeg uses h264_nvenc when streamHw is on (falls back
to software x264 otherwise), and prepareStream now passes Encoders.nvenc()
so the library's transcode runs on the GPU too. Guard loadLib for Encoders.
- config.ts: VNC_FRAMERATE default 30 -> 60, VNC_BITRATE_KBPS 4000 -> 8000.
- .env.example: document the new 1080p60/8 Mbps defaults and STREAM_HW.
Verified locally: h264_nvenc x11grab holds a steady 60fps with headroom,
Encoders.nvenc() returns valid h264_nvenc settings, and tsc --noEmit passes.
Live Discord voice-channel verification pending a host reboot.
End-to-end verified with a real burner token + voice channel: login OK, posts
to the text channel, joins voice, and Go-Live streams the host :1 desktop.
- selfbot.ts now captures the X display with the SYSTEM ffmpeg (reliable
x11grab) and pipes it into prepareStream, instead of relying on the lib's
bundled libav input devices (not portable). Capture process is killed on stop.
- package.json: trustedDependencies (node-av, @lng2004/node-datachannel) so the
native streaming deps build automatically on bun install (incl. Docker).
- Dropped the unused nvenc path (the lib's exported `nvenc` is undefined at
runtime); software H264 encode for now.
get-token.ts now writes the Remote Auth URL as a 512x512 QR image
(/tmp/javis_qr.png, override via QR_OUT) in addition to printing the link, so
it can be sent to the user and scanned from a second screen with the Discord
mobile app. Adds the qrcode dependency.
bot/src/get-token.ts uses discord.js-selfbot-v13 DiscordAuthWebsocket: it
prints the Discord Remote Auth URL (https://discord.com/ra/<code>, the same
thing a login QR encodes). Open it on a phone with the Discord app, approve the
"New login" prompt, and the user token is written to .env as
DISCORD_SELFBOT_TOKEN. Works from a single mobile device (no second screen, no
password, no browser devtools). Run with `bun run token`.
- voice.ts: reply playback is now a FIFO queue (AudioPlayerStatus.Idle drains
it) so concurrent speakers no longer cut each other's replies off.
- selfbot.ts: rewritten against the REAL @dank074/discord-video-stream v6 API
(verified from its d.ts): prepareStream(input, opts, signal)->{command,output},
playStream(output, streamer, {type:"go-live"}, signal), Streamer.joinVoice.
x11grab via customInputOptions; optional NVENC encode (RTX 5050) via exported
`nvenc`. package.json pinned to ^6.0.0 (previously ^4.2.1, which was wrong).
- Dockerfile: dropped the hardcoded python3.12 LD_LIBRARY_PATH. faster-whisper
>=1.1 self-locates the pip CUDA libs; ldconfig (full path, glob) registers
them as a robust fallback. Verified: ld.so cache lists libcublas/libcudnn and
GPU whisper works with LD_LIBRARY_PATH empty.
- bridge: STT resample 48k->16k upgraded from nearest-neighbor to linear
(np.interp).
Verified: tsc clean, image builds, GPU whisper OK via ldconfig, compose valid.
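The FIFO reply queue from the voice.ts item above can be sketched independently of the audio library. `ReplyQueue` and its `play` callback are hypothetical; in the real code the `onIdle` hook would presumably be driven by the player's AudioPlayerStatus.Idle event.

```typescript
class ReplyQueue<T> {
  private readonly pending: T[] = [];
  private playing = false;
  private readonly play: (item: T) => void;

  constructor(play: (item: T) => void) {
    this.play = play;
  }

  // Queue a reply; start it right away only if nothing is playing.
  enqueue(item: T): void {
    this.pending.push(item);
    if (!this.playing) this.next();
  }

  // Call when the player goes idle: the finished reply is done, so drain
  // the next one in arrival order.
  onIdle(): void {
    this.playing = false;
    this.next();
  }

  private next(): void {
    if (this.pending.length === 0) return;
    const item = this.pending.shift() as T;
    this.playing = true;
    this.play(item);
  }
}
```

Replies that arrive while one is playing wait their turn instead of replacing it, which is what stops concurrent speakers from cutting each other off.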
Code review of the bridge/bot/docker work found:
- TTS bug: bridge called PiperVoice.synthesize(text, wav) but that method
returns AudioChunks and takes a SynthesisConfig as its 2nd arg, not a wav
file -> TTS would fail. Switched to synthesize_wav(text, wav_file).
Verified: produces a valid 22050Hz mono WAV.
- run-bot.sh now waits if ANY of DISCORD_BOT_TOKEN/APP_ID/GUILD_ID is missing
(config.ts throws on a missing one), preventing a supervisor crash-loop.
Verified clean: discord.js Events.ClientReady == 'clientReady' (existing
handler correct); image rebuilds.
GPU acceleration is now on by default and verified end-to-end on the
Blackwell RTX 5050 (sm_120):
- Ollama offloads 100% to GPU (log: library=CUDA compute=12.0,
BLACKWELL_NATIVE_FP4=1). compose passes GPU via CDI
(devices: nvidia.com/gpu=all) to both ollama and javis.
- Whisper STT on GPU: faster-whisper>=1.1.0 + nvidia-cublas/cudnn cu12,
LD_LIBRARY_PATH baked into the image. Verified float16 transcribe on
sm_120; bridge auto-falls back to CPU when no GPU is present.
- Model: default chat model -> qwen3:8b (best 8GB-VRAM tool-calling,
~5GB Q4). Embed stays nomic-embed-text.
- README documents the host one-time setup (nvidia-container-toolkit +
`nvidia-ctk cdi generate`) and GPU on/off.
Verified: image builds; GPU visible in both containers via compose;
ollama ps = 100% GPU; faster-whisper cuda OK + CPU fallback OK;
bridge /health 200.
`docker compose up -d --build` now brings up the whole stack automatically,
with no host setup needed:
- All-in-one javis image: TigerVNC+XFCE desktop, Chrome, Python brain bridge,
Node/bun bot, managed by supervisord (verified: all 6 programs RUNNING).
- ollama service + one-shot ollama-init that auto-pulls chat+embed models
(verified end-to-end; `ollama list` shows pulled models).
- Discord token deferred: without DISCORD_BOT_TOKEN the desktop, bridge,
Ollama and models all run; only the bot waits (no crash loop).
- Slim container deps (bridge/requirements-bridge.txt) drop the unused
PyQt6/torch/chatterbox/sounddevice stack. Piper voice + Whisper models
auto-download into named volumes.
- Configurable host ports (VNC_PORT/NOVNC_PORT/BRIDGE_PORT) to avoid clashing
with a host VNC already on 5901. Bridge binds 0.0.0.0 in-container.
Verified: image builds; brain imports; bridge /health 200; noVNC 200;
X display :1 @1920x1080; auto-pull completes; supervisorctl status all RUNNING.