First increment of the STREAM_BROWSER real-time-info modes (true = browser,
false = Gemini):
- browse-search.mjs: drives the on-screen Chrome via CDP so the action shows on
the broadcast. `search` returns the top Google results (title/url/snippet);
`youtube` plays the first result. Verified live: real-time Seoul weather
results, and IU 'Good Day' MV playback.
- .env.example: GEMINI_API_KEY / GEMINI_MODEL for the false-mode Gemini account.
- docs/stream_browser_modes.md: architecture + integration map (brain config,
the two mode-gated tools, registry, design decisions) for the remaining wiring.
The Python brain wiring (config.py mode/gemini fields, browseAndSearch +
geminiSearch tools, registry, specs, llm_contexts) lands next - it needs a
running brain and a Gemini key to verify, rather than committing untested edits
into the 39k-line engine.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Addresses review of the STREAM_BROWSER / broadcast-defaults work:
- SelfbotStreamer now spawns broadcast-helper.mjs on stream start and kills it on
stop/self-end (alongside capture + keepalive). The ad-skip, subtitle rule and
fullscreen-toolbar-hide are therefore guaranteed broadcast-wide defaults tied
to the broadcast - not a manual process. Fail-open: if node/Chrome deps are
absent the stream runs without the helper. Verified the helper is a child of
the broadcast holder and armed.
- Enforce STREAM_BROWSER at the streamer (start() returns early when
screenBrowser===false), so EVERY caller including stream-hold.ts is voice-only
when it's off, not just the slash command. stream-hold.ts reads STREAM_BROWSER.
- Fix broadcast-helper fullscreen: resolve the window of the tab actually in
HTML5 fullscreen (via its CDP targetId) instead of the first HTTP tab, so the
right Chrome window is toggled when multiple windows exist.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- Add STREAM_BROWSER (.env) gating screen-share/browser mode. false => the
/자비스 stream command stays voice + API/MCP only (no Go-Live); true (default)
=> screen share as before. (Browser-driven info retrieval in true mode is a
follow-up build; the bot has no browser-control tools yet.)
- Make the two test-time fixes broadcast-wide defaults via broadcast-helper.mjs:
it now also watches every tab for HTML5 fullscreen and toggles Chrome window
fullscreen so the address bar is hidden for ANY video (xfwm4 won't hide it on
'f' alone), restoring on exit. Subtitles were already enforced per video.
scenario.mjs drops its own fullscreen toggle and relies on the helper.
- Revert the test-settings env vars from .env.example (not wanted).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Addresses review of the ad/subtitle work (the ad-skip.mjs -> broadcast-helper.mjs
rename's other half; the prior commit only recorded the deletion):
- ad mute leak: the ad-skipper muted during an ad but never un-muted, so the
main video stayed silent after the first ad. Save the pre-ad muted/playbackRate
and restore them when the ad ends (verified: muted false -> true -> false).
- captions were only applied once when scenario.mjs ran, not for the whole
broadcast. The persistent helper now applies the rule (OFF by default, Korean
ON if offered) per video and ENFORCES it every tick - one-shot did not hold
because YouTube silently re-enabled captions (verified it stays off across 8s).
- ad-skip + captions merged into broadcast-helper.mjs (one CDP process).
- the 60fps MV test now lives in the repo: scenario.mjs gains MV_QUERY (search +
auto-pick the first >=60fps result) and WATCH_SECONDS, plus the
fullscreen-toolbar-hide fix. The broadcast runs via the committed
stream-hold.ts (audio + keepalive), not an out-of-repo copy.
- document the test env vars (CDP_PORT, HOLD_MS, TEST_*, MV_QUERY, WATCH_SECONDS).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Addresses review of the ad/subtitle work:
- ad mute leak: the ad-skipper muted during an ad but never un-muted, so the
main video stayed silent after the first ad. Save the pre-ad muted/playbackRate
and restore them when the ad ends (verified: muted false -> true -> false).
- captions were only applied once when scenario.mjs ran, not for the whole
broadcast. Move the rule (OFF by default, Korean ON if offered) into the
persistent helper so it runs per video, and ENFORCE it every tick - one-shot
did not hold because YouTube silently re-enabled captions (verified it now
stays off across 8s).
- merge ad-skip.mjs + captions into broadcast-helper.mjs (one CDP process).
- the actual 60fps MV test now lives in the repo: scenario.mjs gains MV_QUERY
(search + auto-pick the first >=60fps result) and WATCH_SECONDS, with the
fullscreen-toolbar-hide fix. The broadcast runs via the committed
stream-hold.ts (audio + keepalive), not an out-of-repo copy.
- document the test env vars (CDP_PORT, HOLD_MS, TEST_*, MV_QUERY, WATCH_SECONDS)
in .env.example.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Adds ad-skip.mjs: connects over CDP and injects a watcher into every tab
(current and future) that clicks "Skip ad" the moment it appears, closes overlay
ads, and fast-forwards unskippable ads (seek-to-end + 16x + mute) so they clear
in ~1s. Self-contained (no extension, no hosts/network changes) and reconnects
across Chrome restarts. Documented in the README.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Two broadcast-experience improvements:
- Audio: the Go-Live stream was video-only. Capture the desktop sound (the
default PipeWire/Pulse sink monitor, @DEFAULT_MONITOR@) as a second ffmpeg
input and mux AAC into the mpegts; the library re-encodes it to Opus for
Discord. Controlled by STREAM_AUDIO / STREAM_AUDIO_SOURCE (default on). ffmpeg
inherits XDG_RUNTIME_DIR to reach the pulse socket. Verified: the streamer now
reports "Found audio stream" and the monitor carries Chrome audio (~-11 dB).
- Subtitles: in the browse scenario, default captions OFF, but auto-enable a
Korean track when the video offers one (getOption captions tracklist ->
setOption / unloadModule).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
On the streamed VNC desktop (xfwm4), Chrome did not hide its toolbar when a
video entered HTML5 fullscreen via 'f' - the window was full-screen (outerHeight
1080) but the tab/address bar stayed, leaving only 988px of content, so the
address bar bled into the Go-Live broadcast.
Toggle Chrome-initiated browser fullscreen via CDP (Browser.setWindowBounds
windowState fullscreen) around the 'f' step. That reliably hides the toolbar
(innerHeight 1080 vs 988); the toolbar is restored on exit, so normal browsing
still shows it. Verified live: clean full-screen video, no toolbar.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The Go-Live broadcast looked badly choppy: video and scrolling stuttered while
the cursor stayed smooth. Root cause is TigerVNC: it only refreshes its
framebuffer while a VNC client is attached, but the broadcast reads that
framebuffer with x11grab (not as a VNC client). With no viewer attached the
captured screen idled at ~1.5 fps (measured 3/30 distinct frames); the cursor
looked smooth only because x11grab overlays the live cursor on every frame.
- Add a headless RFB keepalive (vnc-keepalive.ts) that stays connected for the
life of the stream and requests incremental framebuffer updates at the stream
framerate. SelfbotStreamer starts it on broadcast start and tears it down on
stop/self-end. Measured 3/30 -> 57/60 distinct frames at 60 fps. Fail-open;
authenticates with VNC_PASSWORD or the ~/.config/tigervnc/passwd file.
- Fix a resource leak: when the Go-Live ended on its own, only the active flag
was cleared, leaving the x11grab->nvenc ffmpeg running forever (pinning a CPU
core while no media was transmitted, with only the gateway TCP left and no UDP
media). The self-end path now tears down capture, keepalive and voice like
stop() does.
- Tests for both paths (self-end teardown; keepalive DES auth, port mapping,
password resolution). Add @types/bun so bun:test typechecks; document the
keepalive and recommended Chrome flags in README and .env.example.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
bringIntoView returned the last boundingBox() unconditionally after the
scroll loop exhausted, so an element still outside the viewport would be
clicked anyway. Validate the final box against the actual viewport bounds
on both axes (innerWidth/innerHeight) and return null otherwise, so
humanClick fails instead of clicking an off-screen coordinate.
Address review accuracy: humanClick used DOM scrollIntoViewIfNeeded and fell
back to Playwright locator.click() when an element had no box - neither is real
input. Now it brings elements into view with a real wheel scroll and throws if
there is no on-screen box (no synthetic click). Header comment and README
corrected: xdotool injects synthetic X input (not a physical HID device), and
all actions are real input while the CDP/DOM API is used only to read state.
Make every action real keyboard/mouse via xdotool, not just the visible
browsing: address-bar navigation (Ctrl+L + char-by-char typing), the YouTube
settings gear -> 화질 -> 1080p menu (real clicks, verified hd1080), the autoplay
toggle, the play button, and fullscreen via the real 'f' key (F11 isn't honored
by this WM; 'f' yields true 1080p fullscreen without pausing). CDP/DOM API is
now used only to read state for verification.
The startup catch cleared this.active unconditionally. In a stop()+restart
race during the slow login/pauses, the first attempt's catch would fire after
the second start() had already taken the lock, unlocking it mid-startup and
letting a third start() race in. Guard the active/state reset with
`this.controller === controller`, matching the field-null and playStream
.finally guards.
Verified live: stop during login then restart keeps the restart's lock
(active stays true), and it clears to false only once truly stopped; no crash.
The human-pause delays leave start() in-flight for several seconds, which
exposed two races:
- stop() during a pause only ended the pause; start() continued and called
joinVoice on the streamer stop() had already nulled (null deref).
- `active` was set only just before go-live, so a second /stream during the
delay passed the guard and both calls raced on the same overwritten streamer.
Now start() locks `active` before any await, keeps controller/streamer/capture
as local refs, and calls signal.throwIfAborted() after each await so an
interleaved stop() unwinds into a catch that tears down via the local refs and
clears instance state only if it still points at this attempt. isActive() now
reflects "starting" during the delay too.
Verified live: concurrent start is rejected ("이미 송출 중입니다"), stop() mid-
startup returns a cancel message with isActive=false and no uncaught error, and
the happy path still goes live and tears down cleanly. tsc --noEmit passes.
Joining voice and starting the broadcast instantly looks like a bot. Add
randomised, human-plausible pauses (~0.9-2.2s after coming online before
joining the channel, ~2.5-5s after joining before hitting Go Live) so the
cadence isn't machine-instant or fingerprintable. The pause resolves
immediately on stop() so teardown never hangs mid-wait.
Verified live: end-to-end join -> settle -> Go Live took ~8s before the
stream went live, held for 15s, and tore down cleanly. tsc --noEmit passes.
Address review: the capture ffmpeg had no -b:v, so it encoded at nvenc's
low default (~2.47 Mbps) and the library then re-encoded to 8 Mbps, which
only upscaled already-lost detail. The double encode also kept CPU decode
+ scale + re-encode in the library, contradicting the "GPU handles it"
claim.
Now the system ffmpeg produces the final Discord-ready H264 in one pass
(-b:v/-maxrate at the configured bitrate, -bf 0, 1s keyframes, yuv420p,
-forced-idr) and prepareStream uses noTranscoding:true to remux only. One
GPU encode, no library decode/scale/re-encode.
Verified locally: high-motion source fills 8.7 Mbps at these args (vs the
~2.47 Mbps no-bitrate default), real :1 desktop holds 60fps at realtime,
and the capture -> copy/remux chain yields h264 1920x1080 yuv420p 60fps
has_b_frames=0. tsc --noEmit passes. Live Discord test pending reboot.
Bump the default broadcast to 1080p 60fps at 8 Mbps and route both encode
stages through the GPU (RTX 5050, h264_nvenc) so 60fps stays smooth without
loading the 4-core host.
- selfbot.ts: capture ffmpeg uses h264_nvenc when streamHw is on (falls back
to software x264 otherwise), and prepareStream now passes Encoders.nvenc()
so the library's transcode runs on the GPU too. Guard loadLib for Encoders.
- config.ts: VNC_FRAMERATE default 30 -> 60, VNC_BITRATE_KBPS 4000 -> 8000.
- .env.example: document the new 1080p60/8 Mbps defaults and STREAM_HW.
Verified locally: h264_nvenc x11grab holds a steady 60fps with headroom,
Encoders.nvenc() returns valid h264_nvenc settings, and tsc --noEmit passes.
Live Discord voice-channel verification pending a host reboot.
End-to-end verified with a real burner token + voice channel: login OK, posts
to the text channel, joins voice, and Go-Live streams the host :1 desktop.
- selfbot.ts now captures the X display with the SYSTEM ffmpeg (reliable
x11grab) and pipes it into prepareStream, instead of relying on the lib's
bundled libav input devices (not portable). Capture process is killed on stop.
- package.json: trustedDependencies (node-av, @lng2004/node-datachannel) so the
native streaming deps build automatically on bun install (incl. Docker).
- Dropped the unused nvenc path (the lib's exported `nvenc` is undefined at
runtime); software H264 encode for now.
get-token.ts now writes the Remote Auth URL as a 512x512 QR image
(/tmp/javis_qr.png, override via QR_OUT) in addition to printing the link, so
it can be sent to the user and scanned from a second screen with the Discord
mobile app. Adds the qrcode dependency.
bot/src/get-token.ts uses discord.js-selfbot-v13 DiscordAuthWebsocket: it
prints the Discord Remote Auth URL (https://discord.com/ra/<code> — the same
thing a login QR encodes). Open it on a phone with the Discord app, approve the
"New login" prompt, and the user token is written to .env as
DISCORD_SELFBOT_TOKEN. Works from a single mobile device (no second screen, no
password, no browser devtools). `bun run token`.
- voice.ts: reply playback is now a FIFO queue (AudioPlayerStatus.Idle drains
it) so concurrent speakers no longer cut each other's replies off.
- selfbot.ts: rewritten against the REAL @dank074/discord-video-stream v6 API
(verified from its d.ts): prepareStream(input, opts, signal)->{command,output},
playStream(output, streamer, {type:"go-live"}, signal), Streamer.joinVoice.
x11grab via customInputOptions; optional NVENC encode (RTX 5050) via exported
`nvenc`. package.json pinned to ^6.0.0 (was a wrong ^4.2.1).
- Dockerfile: dropped the hardcoded python3.12 LD_LIBRARY_PATH. faster-whisper
>=1.1 self-locates the pip CUDA libs; ldconfig (full path, glob) registers
them as a robust fallback. Verified: ld.so cache lists libcublas/libcudnn and
GPU whisper works with LD_LIBRARY_PATH empty.
- bridge: STT resample 48k->16k upgraded from nearest-neighbor to linear
(np.interp).
Verified: tsc clean, image builds, GPU whisper OK via ldconfig, compose valid.
Code review of the bridge/bot/docker work found:
- TTS bug: bridge called PiperVoice.synthesize(text, wav) but that method
returns AudioChunks and takes a SynthesisConfig as its 2nd arg, not a wav
file -> TTS would fail. Switched to synthesize_wav(text, wav_file).
Verified: produces a valid 22050Hz mono WAV.
- run-bot.sh now waits if ANY of DISCORD_BOT_TOKEN/APP_ID/GUILD_ID is missing
(config.ts throws on a missing one), preventing a supervisor crash-loop.
Verified clean: discord.js Events.ClientReady == 'clientReady' (existing
handler correct); image rebuilds.
GPU acceleration is now on by default and verified end-to-end on the
Blackwell RTX 5050 (sm_120):
- Ollama offloads 100% to GPU (log: library=CUDA compute=12.0,
BLACKWELL_NATIVE_FP4=1). compose passes GPU via CDI
(devices: nvidia.com/gpu=all) to both ollama and javis.
- Whisper STT on GPU: faster-whisper>=1.1.0 + nvidia-cublas/cudnn cu12,
LD_LIBRARY_PATH baked into the image. Verified float16 transcribe on
sm_120; bridge auto-falls back to CPU when no GPU is present.
- Model: default chat model -> qwen3:8b (best 8GB-VRAM tool-calling,
~5GB Q4). Embed stays nomic-embed-text.
- README documents the host one-time setup (nvidia-container-toolkit +
`nvidia-ctk cdi generate`) and GPU on/off.
Verified: image builds; GPU visible in both containers via compose;
ollama ps = 100% GPU; faster-whisper cuda OK + CPU fallback OK;
bridge /health 200.
`docker compose up -d --build` now brings up the whole thing automatically —
no host setup needed:
- All-in-one javis image: TigerVNC+XFCE desktop, Chrome, Python brain bridge,
Node/bun bot, managed by supervisord (verified: all 6 programs RUNNING).
- ollama service + one-shot ollama-init that auto-pulls chat+embed models
(verified end-to-end; `ollama list` shows pulled models).
- Discord token deferred: without DISCORD_BOT_TOKEN the desktop, bridge,
Ollama and models all run; only the bot waits (no crash loop).
- Slim container deps (bridge/requirements-bridge.txt) drop the unused
PyQt6/torch/chatterbox/sounddevice stack. Piper voice + Whisper models
auto-download into named volumes.
- Configurable host ports (VNC_PORT/NOVNC_PORT/BRIDGE_PORT) to avoid clashing
with a host VNC already on 5901. Bridge binds 0.0.0.0 in-container.
Verified: image builds; brain imports; bridge /health 200; noVNC 200;
X display :1 @1920x1080; auto-pull completes; supervisorctl status all RUNNING.