Add Discord-native hybrid front-end for Jarvis (bot + bridge)
Transform isair/jarvis into a Discord-controlled voice assistant running on
the Ubuntu VNC desktop, keeping the mature ~39k-line Python brain intact.

- bot/ (Node + bun, discord.js): /자비스 slash commands (ephemeral),
  voice channel join + voice receive/playback, pluggable VNC screen broadcast
  (selfbot live / noVNC / screenshot)
- bridge/ (Python, Flask): wraps jarvis STT + run_reply_engine + Piper TTS
  behind a thin localhost HTTP API
- .env.example, scripts/ (start_bridge/start_bot/dev), README rewrite,
  docs/language-comparison.md and docs/vnc-xfce-setup.md

Language decision: hybrid (Python brain + Node/bun Discord layer) because
Discord blocks bot video; native screen broadcast only works via a Node
selfbot library.
2026-06-09 14:51:05 +09:00

🧪 Jarvis Evaluation Report

Generated: 2026-05-04 (gemma4:e2b column refreshed with retry-aware outcomes from a full --single run; gpt-oss:20b column inherited unchanged from the 2026-04-27 regen)

📊 TL;DR

Overall: 🟢 340/354 passed (96.0%) across all categories. The small-model column was re-baselined from a fresh gemma4:e2b run with up to 3× retries. Three new tests were added in #352, and one intent-judge regression introduced by a8f133c was recovered by the prompt fix in this PR (see "Intent judge" below).

| Category | Model | Passed | Failed | Skipped | Pass Rate |
|---|---|---|---|---|---|
| 🤖 Agent behaviour | gemma4:e2b | 136 | 7 | 2 | 🟢 95.1% |
| 🤖 Agent behaviour | gpt-oss:20b | 145 | 7 | 0 | 🟢 95.4% |
| 🎤 Intent judge | gemma4:e2b (fixed) | 48 | 0 | 0 | 🟢 100.0% |
| 🧠 Memory merge consolidation | gemma4:e2b | 11 | 0 | 0 | 🟢 100.0% |
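
The percentages in the summary are consistent with dividing passed tests by passed plus failed, leaving skipped tests out of the denominator. The following is a minimal sketch of that arithmetic, inferred from the reported numbers rather than taken from the eval suite's code; `CategoryResult` is a hypothetical name used only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CategoryResult:
    """One row of the summary table (hypothetical helper, not from the suite)."""
    passed: int
    failed: int
    skipped: int = 0

    @property
    def pass_rate(self) -> float:
        # Assumption: skipped tests are excluded from the denominator.
        return 100.0 * self.passed / (self.passed + self.failed)


# Counts copied from the summary table above.
rows = {
    ("Agent behaviour", "gemma4:e2b"): CategoryResult(136, 7, 2),
    ("Agent behaviour", "gpt-oss:20b"): CategoryResult(145, 7, 0),
    ("Intent judge", "gemma4:e2b"): CategoryResult(48, 0, 0),
    ("Memory merge consolidation", "gemma4:e2b"): CategoryResult(11, 0, 0),
}

total_passed = sum(r.passed for r in rows.values())
total_run = sum(r.passed + r.failed for r in rows.values())
overall_rate = 100.0 * total_passed / total_run

for (category, model), result in rows.items():
    print(f"{category} [{model}]: {result.pass_rate:.1f}%")
print(f"Overall: {total_passed}/{total_run} ({overall_rate:.1f}%)")
```

Under that assumption the per-row rates round to 95.1%, 95.4%, 100.0% and 100.0%, and the overall figure comes out at 340/354 (96.0%), matching the reported values.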

💡 Model Selection Guide

| Model | Best For | Trade-offs |
|---|---|---|
| gemma4:e2b | Quick responses, lower RAM usage | May struggle with complex reasoning |
| gpt-oss:20b | Best accuracy, complex tasks | Slower, requires more RAM |

🤖 Agent behaviour

Runs the full agent pipeline against each judge model. Tests are compared side-by-side.

| Test Case | gemma4:e2b | gpt-oss:20b |
|---|---|---|
| 3-turn conversation with topic changes | 1/1 (100%) | 1/1 (100%) |
| Active hot window follow up accepted | 1/1 (100%) | 1/1 (100%) |
| Adversarial: all three branches in one summary | 1/1 (100%) | 1/1 (100%) |
| Adversarial: food preference (USER) vs list-length rule (DIRECTIVES) | 1/1 (100%) | 1/1 (100%) |
| Agent calls webSearch for info queries | 1/1 (100%) | 1/1 (100%) |
| Agent chains search → fetch for details | 1/1 (100%) | 1/1 (100%) |
| Agent uses memory + nutrition data | 1/1 (100%) | 1/1 (100%) |
| Assistant checks memory before asking about interests | 1/1 (100%) | 1/1 (100%) |
| Assistant does not deny having long-term memory | 1/1 (100%) | 1/1 (100%) |
| Bad: deflection without attempting answer | 1/1 (100%) | 1/1 (100%) |
| Bad: empty acknowledgment | 1/1 (100%) | 1/1 (100%) |
| Bad: generic greeting ignores query | 1/1 (100%) | 1/1 (100%) |
| Casual statement without wake word rejected | 1/1 (100%) | 1/1 (100%) |
| Chained research: who directed Possessor and what else have they made | 1/1 (100%) | 1/1 (100%) |
| Correction loop accepts single or retry | 1/1 (100%) | 1/1 (100%) |
| Cross turn pronoun resolution | 1/1 (100%) | 1/1 (100%) |
| DIRECTIVES: tone, length, forbidden phrases, address form | 1/1 (100%) | 1/1 (100%) |
| Date query with date in context returns none | 1/1 (100%) | 1/1 (100%) |
| Diary location grounds getWeather call (#352) | 0/1 (0%) | Not run |
| Diet changed from bulking to cutting | ⏭️ SKIPPED | 🔸 1/1 XFAIL |
| Digested tool result produces grounded reply | 🔸 1/1 XFAIL | 1/1 (100%) |
| Director-then-filmography needs two searches | 1/1 (100%) | 1/1 (100%) |
| Enrichment results appear in system message | 1/1 (100%) | 1/1 (100%) |
| Enrichment skips questions answered by context | 1/1 (100%) | 1/1 (100%) |
| Escape hatch then follow up action | 1/1 (100%) | 1/1 (100%) |
| Evaluator emits structured tool call for obvious search | 1/1 (100%) | 🔸 1/1 XFAIL |
| Extraction with explicit quantities | 1/1 (100%) | 1/1 (100%) |
| First turn calls web search not clarification | 🔸 1/1 XFAIL | 1/1 (100%) |
| Follow up after correction calls web search | 🔸 1/1 XFAIL | 1/1 (100%) |
| Follow up resolves pronoun in search query | 1/1 (100%) | 1/1 (100%) |
| Follow-up references previous turn context | 1/1 (100%) | 1/1 (100%) |
| Followup naming place routes to getWeather | 1/1 (100%) | 1/1 (100%) |
| Followup supplies missing tool arg — short follow-up continues previous tool chain (#352) | 1/1 (100%) | Not run |
| Good: brief but informative | 1/1 (100%) | 1/1 (100%) |
| Good: complete weekly forecast | 1/1 (100%) | 1/1 (100%) |
| Graph supplies missing tool arg — warm-profile fact grounds getWeather call (#352) | 0/1 (0%) | Not run |
| Graph-enriched facts surface in the reply, no denial | 1/1 (100%) | 1/1 (100%) |
| Greeting: hello | 1/1 (100%) | 1/1 (100%) |
| Greeting: ni hao (Chinese) | 1/1 (100%) | 1/1 (100%) |
| Handles ambiguous portion descriptions | 1/1 (100%) | 1/1 (100%) |
| Honest block when all providers fail | 1/1 (100%) | 1/1 (100%) |
| Hot window query is directed and non empty | 1/1 (100%) | 1/1 (100%) |
| Identity query does not trigger recommendation engagement rule | 1/1 (100%) | 1/1 (100%) |
| Identity query surfaces multiple user facts when present | 1/1 (100%) | 1/1 (100%) |
| Identity query surfaces user stated fact over past qa | 1/1 (100%) | 1/1 (100%) |
| Identity query with only past qa returns none or no false facts | 1/1 (100%) | 1/1 (100%) |
| Instruction: be more brief | 1/1 (100%) | 1/1 (100%) |
| Instruction: use Celsius | 1/1 (100%) | 1/1 (100%) |
| Judge echo claim overridden in hot window | 1/1 (100%) | 1/1 (100%) |
| LLM uses enrichment-surfaced interests for personalised search | 1/1 (100%) | 1/1 (100%) |
| Links only payload produces honest cant read reply | 🔸 1/1 XFAIL | 1/1 (100%) |
| Location context flows to search queries | 1/1 (100%) | 1/1 (100%) |
| Location query with location in context returns none | 1/1 (100%) | 1/1 (100%) |
| Location query with partial hint still routes sensibly | 🔸 1/1 XFAIL | 1/1 (100%) |
| LogMealTool stores meals with macros | 1/1 (100%) | 1/1 (100%) |
| Max-turn cap delivers a digest reply, never silence | 1/1 (100%) | 1/1 (100%) |
| Memory enrichment: personalized news | 1/1 (100%) | 1/1 (100%) |
| Memory enrichment: time-based recall | 1/1 (100%) | 1/1 (100%) |
| Memory enrichment: topic recall | 1/1 (100%) | 1/1 (100%) |
| Mixed summary: keep novel facts, drop stale weather/recommendations | 1/1 (100%) | 1/1 (100%) |
| Navigate prose gets nudged into tool call | 🔸 1/1 XFAIL | 1/1 (100%) |
| No deflection: tech news | 1/1 (100%) | 1/1 (100%) |
| No deflection: time query | 1/1 (100%) | 1/1 (100%) |
| No deflection: tomorrow weather | 1/1 (100%) | 1/1 (100%) |
| No deflection: weekly rain forecast | 1/1 (100%) | 1/1 (100%) |
| No email tool declines honestly | 1/1 (100%) | 1/1 (100%) |
| No hint at all still routes sensibly | 1/1 (100%) | 1/1 (100%) |
| No wake word rejected despite judge | 1/1 (100%) | 1/1 (100%) |
| Novel knowledge: local business details and user location | 1/1 (100%) | 1/1 (100%) |
| Novel knowledge: non-English summary (Turkish) | 1/1 (100%) | 1/1 (100%) |
| Novel knowledge: relocation plans and employment | 1/1 (100%) | 1/1 (100%) |
| Novel knowledge: user diet plan and preferred recipe | 1/1 (100%) | 1/1 (100%) |
| Nudge cap stops loop | 1/1 (100%) | 1/1 (100%) |
| Nutrition: cheeseburger with fries | 1/1 (100%) | 1/1 (100%) |
| Nutrition: chicken with broccoli | 1/1 (100%) | 1/1 (100%) |
| Nutrition: oatmeal with banana | 1/1 (100%) | 1/1 (100%) |
| Office days changed from Mon/Wed to Mon/Thu | ⏭️ SKIPPED | 🔸 1/1 XFAIL |
| Omits deflection narration for unknown entity | 1/1 (100%) | 🔸 1/1 XFAIL |
| Omits deflection when topic never resolved | 🔸 1/1 XFAIL | 1/1 (100%) |
| Open-ended prompt grounds in stored knowledge | 0/1 (0%) | 1/1 (100%) |
| Parallel weather lookup: compare Paris and London | 1/1 (100%) | 1/1 (100%) |
| Preserves legitimate user preferences | 1/1 (100%) | 1/1 (100%) |
| Realistic web search payload is not deflected to links | 🔸 1/1 XFAIL | 1/1 (100%) |
| Recommendation query still surfaces engagement when user facts present | 1/1 (100%) | 1/1 (100%) |
| Reframing: life events framed as facts with temporal context | 1/1 (100%) | 1/1 (100%) |
| Reframing: requests become knowledge, not interaction descriptions | 1/1 (100%) | 1/1 (100%) |
| Reject: assistant self-references (recommendations are not knowledge) | 1/1 (100%) | 1/1 (100%) |
| Reject: stale temporal snapshots (weather, time of day) | 1/1 (100%) | 1/1 (100%) |
| Restaurant recommendation surfaces past cuisine interest | 1/1 (100%) | 1/1 (100%) |
| Returns NONE for non-food inputs | 1/1 (100%) | 1/1 (100%) |
| Returns valid JSON with all required fields | 0/1 (0%) | 1/1 (100%) |
| Simple meal baseline (2 boiled eggs) | 1/1 (100%) | 1/1 (100%) |
| Single weather query ends after one tool call | 1/1 (100%) | 0/1 (0%) |
| Speech long after tts requires wake word | 1/1 (100%) | 1/1 (100%) |
| Stop during tts interrupts immediately | 1/1 (100%) | 1/1 (100%) |
| Time query with time in context returns none | 1/1 (100%) | 1/1 (100%) |
| Tool calls literal not surfaced after web search | 1/1 (100%) | 1/1 (100%) |
| Tool retry: explicit tool mention | 1/1 (100%) | 1/1 (100%) |
| Tool retry: vague go ahead | 1/1 (100%) | 1/1 (100%) |
| Tool retry: vague just try | 1/1 (100%) | 1/1 (100%) |
| Toolsearchtool widens then navigate | 🔸 1/1 XFAIL | 🔸 1/1 XFAIL |
| Topic switch: search → weather uses getWeather | 1/1 (100%) | 1/1 (100%) |
| Topic switch: weather → store hours uses webSearch | 1/1 (100%) | 1/1 (100%) |
| Trivial conversations produce no extracted facts | 1/1 (100%) | 1/1 (100%) |
| Tts echo segments skipped user query extracted | 1/1 (100%) | 1/1 (100%) |
| Turn1 possessor then turn2 weather | 0/1 (0%) | 1/1 (100%) |
| Two-turn celebrity flow: identity then pronoun follow-up | 🔸 1/1 XFAIL | 0/1 (0%) |
| USER: identity, location, pets, diet, job | 1/1 (100%) | 1/1 (100%) |
| Unknown entity with poisoned diary still triggers web search live | 🔸 1/1 XFAIL | 1/1 (100%) |
| Unknown entity: Piranesi (book) | 🔸 1/1 XFAIL | 1/1 (100%) |
| Unknown entity: Possessor (film) | 1/1 (100%) | 1/1 (100%) |
| Unknown entity: have-you-heard-of (Piranesi) | 🔸 1/1 XFAIL | 1/1 (100%) |
| Unknown entity: permission-framed (Possessor) | 🔸 1/1 XFAIL | 1/1 (100%) |
| Unrelated domain still returns none | 1/1 (100%) | 1/1 (100%) |
| Unrelated topics are not welded into one clause | 1/1 (100%) | 1/1 (100%) |
| User query not confused with echo after tts | 1/1 (100%) | 1/1 (100%) |
| Utterance started during tts treated as hot window | 1/1 (100%) | 1/1 (100%) |
| WORLD: local business details, film attribution | 1/1 (100%) | 1/1 (100%) |
| Wake word query after echo segments | 1/1 (100%) | 1/1 (100%) |
| Wake word query uses judge extraction | 1/1 (100%) | 1/1 (100%) |
| Watch recommendation surfaces recently discussed films | 1/1 (100%) | 1/1 (100%) |
| Weather query is answered with current conditions | 0/1 (0%) | 1/1 (100%) |
| Weather query still picks getWeather | 1/1 (100%) | 1/1 (100%) |
| Weather query still triggers tools after a greeting | 1/1 (100%) | 1/1 (100%) |
| Wikipedia payload produces grounded reply | 1/1 (100%) | 1/1 (100%) |
| Wikipedia rescues when ddg blocks | 1/1 (100%) | 1/1 (100%) |
| calorie budget → fetchMeals | 1/1 (100%) | 1/1 (100%) |
| cold-memory-short-query-how's the weather | 1/1 (100%) | 1/1 (100%) |
| cold-memory-week-forecast-what's the weather this week | 1/1 (100%) | 0/1 (0%) |
| dietary check → fetchMeals | 1/1 (100%) | 1/1 (100%) |
| explicit-recall-then-search | 1/1 (100%) | 0/1 (0%) |
| find the invoice PDF on my computer | 1/1 (100%) | 1/1 (100%) |
| food decision → fetchMeals | 🔸 1/1 XFAIL | 1/1 (100%) |
| jacket → getWeather | 1/1 (100%) | 1/1 (100%) |
| location weather query selects getWeather and few others | 1/1 (100%) | 0/1 (0%) |
| log that I just ate a banana | 1/1 (100%) | 1/1 (100%) |
| meal logging selects logMeal and few others | 1/1 (100%) | 1/1 (100%) |
| meal recall (colloquial) → fetchMeals | 1/1 (100%) | 1/1 (100%) |
| meal recall selects fetchMeals and few others | 1/1 (100%) | 1/1 (100%) |
| news-interesting-for-me | 1/1 (100%) | 1/1 (100%) |
| news-of-interest-to-me | 1/1 (100%) | 0/1 (0%) |
| news-that-would-interest-me | 1/1 (100%) | 1/1 (100%) |
| recommend a book I'd like | 1/1 (100%) | 1/1 (100%) |
| research → webSearch + fetchWebPage | 1/1 (100%) | 1/1 (100%) |
| run forecast → getWeather | 1/1 (100%) | 1/1 (100%) |
| search the web for flight deals | 1/1 (100%) | 1/1 (100%) |
| suggest something I'd enjoy watching ton | 1/1 (100%) | 1/1 (100%) |
| take a screenshot | 1/1 (100%) | 1/1 (100%) |
| tell me some news that might interest me | 1/1 (100%) | 0/1 (0%) |
| warm-memory-short-query-how's the weather | 1/1 (100%) | 1/1 (100%) |
| weather + meals | 1/1 (100%) | 1/1 (100%) |
| weather query selects getWeather and few others | 1/1 (100%) | 1/1 (100%) |
| web search query selects webSearch and few others | 1/1 (100%) | 1/1 (100%) |
| weekly weather keeps getWeather | 1/1 (100%) | 1/1 (100%) |
| what is the capital of France | 1/1 (100%) | 1/1 (100%) |
| what should I cook for dinner | 1/1 (100%) | 1/1 (100%) |
| what's 2 plus 2 | 1/1 (100%) | 1/1 (100%) |
| what's on my screen right now? | 1/1 (100%) | 1/1 (100%) |
| what's the weather like? | 1/1 (100%) | 1/1 (100%) |
| who is Britney Spears | 0/1 (0%) | 1/1 (100%) |

🎤 Intent judge

Pinned to gemma4:e2b (the voice intent classifier), so it is not affected by the judge model. Re-run on 2026-05-04 with the prompt fix in this PR; cases that sit on the edge of the small model's capability were repeated 5×.

Notes:

  • cross_segment_answer_that_with_noise regressed between main and develop. The regression was introduced by a8f133c's "big Mac" few-shot example, which biased the small model toward preserving user text instead of resolving cross-segment imperatives. This PR adds two contrasting examples, one for prior-question-with-noise and one for the multi-word "go ahead and answer" imperative. Together they restore this case as well as multi_person_weather_discussion and cross_segment_go_ahead_and_answer (each 5/5).
  • The new case wake_word_trailing_after_capitalised_brand (added in a8f133c) covers the original "big Mac" regression and is preserved by the fix.
  • The three edge cases were each repeated 5× during the prompt iteration to confirm stability; they are recorded as 1/1 here for consistency with the rest of the table.
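
The 5× repetition described in the notes can be pictured as running a case several times and requiring every run to pass. This is only a sketch of that idea: `run_reps` and `is_stable` are hypothetical helpers, and the stand-in cases below replace the real intent-judge calls, which this report does not show.

```python
from typing import Callable


def run_reps(case: Callable[[], bool], reps: int = 5) -> tuple[int, int]:
    """Run `case` `reps` times and return (passes, reps)."""
    passes = sum(1 for _ in range(reps) if case())
    return passes, reps


def is_stable(passes: int, reps: int) -> bool:
    """A case counts as stable only if every repetition passed (e.g. 5/5)."""
    return passes == reps


def always_passes() -> bool:
    # Stand-in for a real eval case; a real one would call the model.
    return True


def make_alternating_case() -> Callable[[], bool]:
    # Stand-in for a flaky case: passes on odd calls, fails on even calls.
    calls = {"n": 0}

    def case() -> bool:
        calls["n"] += 1
        return calls["n"] % 2 == 1

    return case


stable_result = run_reps(always_passes)
flaky_result = run_reps(make_alternating_case())

print("stable case:", stable_result, is_stable(*stable_result))
print("flaky case:", flaky_result, is_stable(*flaky_result))
```

A case that passes all five repetitions can reasonably be collapsed to a single 1/1 entry, which is how the table below records the three edge cases.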

| Test Case | Pass Rate | Status |
|---|---|---|
| Hot window mode indicated in prompt | 1/1 (100%) | Passed |
| Old query not re extracted | 1/1 (100%) | Passed |
| Processed segment not reextracted | 1/1 (100%) | Passed |
| Returns none when ollama unavailable | 1/1 (100%) | Passed |
| System prompt has echo guidance | 1/1 (100%) | Passed |
| Tts text included for echo detection | 1/1 (100%) | Passed |
| alias_after_narrative_context | 1/1 (100%) | Passed |
| alias_treated_as_wake_word | 1/1 (100%) | Passed |
| buffer_echo_then_followup_hot_window | 1/1 (100%) | Passed |
| buried_target_amid_unrelated_chatter | 1/1 (100%) | Passed |
| buried_target_plural_vague_ref_they | 1/1 (100%) | Passed |
| buried_target_topicless_question | 1/1 (100%) | Passed |
| context_synthesis_weather_opinion | 1/1 (100%) | Passed |
| context_synthesis_with_prior_ambient | 1/1 (100%) | Passed |
| cross_segment_answer_that_weather | 1/1 (100%) | Passed |
| cross_segment_answer_that_with_noise | 1/1 (100%) | Passed |
| cross_segment_answered_that_whisper_variant | 1/1 (100%) | Passed |
| cross_segment_dinosaur_opinion | 1/1 (100%) | Passed |
| cross_segment_go_ahead_and_answer | 1/1 (100%) | Passed |
| cross_segment_hot_window_followup | 1/1 (100%) | Passed |
| cross_segment_imperative_superseded_by_new_question | 1/1 (100%) | Passed |
| echo_plus_followup_extracted | 1/1 (100%) | Passed |
| echo_plus_rejected_similar_plus_wake_retry | 1/1 (100%) | Passed |
| hot_window_override_topicless_followup | 1/1 (100%) | Passed |
| hot_window_simple_followup | 1/1 (100%) | Passed |
| mentioned_in_narrative_past_tense | 1/1 (100%) | Passed |
| multi_person_vague_reference | 1/1 (100%) | Passed |
| multi_person_weather_discussion | 1/1 (100%) | Passed |
| multiple_echoes_then_interrupt | 1/1 (100%) | Passed |
| no_wake_word_casual_speech | 1/1 (100%) | Passed |
| no_wake_word_in_buffer | 1/1 (100%) | Passed |
| stop_command_during_tts | 1/1 (100%) | Passed |
| user_followup_statement_after_question_nihilism | 1/1 (100%) | Passed |
| wake_word_after_narrative_addresses_assistant | 1/1 (100%) | Passed |
| wake_word_command_timer | 1/1 (100%) | Passed |
| wake_word_mid_sentence | 1/1 (100%) | Passed |
| wake_word_open_imperative_give_me_advice | 1/1 (100%) | Passed |
| wake_word_open_imperative_say_something | 1/1 (100%) | Passed |
| wake_word_open_imperative_surprise_me | 1/1 (100%) | Passed |
| wake_word_open_imperative_tell_me_a_joke | 1/1 (100%) | Passed |
| wake_word_open_imperative_tell_me_anything | 1/1 (100%) | Passed |
| wake_word_share_statement_burger | 1/1 (100%) | Passed |
| wake_word_share_statement_feeling | 1/1 (100%) | Passed |
| wake_word_share_statement_trailing | 1/1 (100%) | Passed |
| wake_word_simple_question | 1/1 (100%) | Passed |
| wake_word_statement_remember | 1/1 (100%) | Passed |
| wake_word_trailing_after_capitalised_brand | 1/1 (100%) | Passed |
| wake_word_trailing_after_named_entity | 1/1 (100%) | Passed |

🧠 Memory merge consolidation

Exercises merge_node_data against a real picker model. Pins the rewrite-on-write merge against its five advertised behaviours: dedupe of near-duplicates, pattern consolidation of repeated activities, independence (unrelated facts coexist, no silent erasure), meta-narrative pruning (assistant-narrating extractor leftovers get scrubbed), and end-to-end correctness of the batched signature. Run via pytest evals/test_merge_consolidation.py.
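
To make the dedupe and independence behaviours concrete, here is a rough sketch of the kind of assertions such an eval might make. It does not call the real `merge_node_data`, whose signature is not shown in this report: `toy_merge` is a hypothetical, purely textual stand-in for the model-backed merge, and `missing_facts` is an illustrative helper.

```python
def toy_merge(existing: list[str], new: list[str]) -> str:
    """Hypothetical stand-in for the merge: append new lines and drop
    exact duplicates (ignoring case and extra whitespace)."""
    seen: set[str] = set()
    kept: list[str] = []
    for line in existing + new:
        key = " ".join(line.lower().split())
        if key not in seen:
            seen.add(key)
            kept.append(line)
    return "\n".join(kept)


def missing_facts(merged: str, required_keywords: list[str]) -> list[str]:
    """Return the keywords that no longer appear in the merged text."""
    lowered = merged.lower()
    return [kw for kw in required_keywords if kw.lower() not in lowered]


existing = ["Allergic to peanuts.", "Prefers tea over coffee."]
new = ["Went hiking last weekend.", "allergic to  peanuts."]
merged = toy_merge(existing, new)

# Independence: unrelated facts coexist, nothing is silently erased.
lost = missing_facts(merged, ["peanut", "tea", "hiking"])
# Dedupe: the repeated allergy line is stored once.
allergy_mentions = merged.lower().count("allergic to")

print(merged)
print("lost facts:", lost, "| allergy mentions:", allergy_mentions)
```

The real evals exercise a picker model, so near-duplicates with different wording (for example lives-in vs based-in London) are in scope there; the stand-in above only catches textual duplicates and is meant to show the shape of the checks, not the merge logic.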

| Test Case | Pass Rate | Status |
|---|---|---|
| Dedupe — same fact, different wording (lives-in vs based-in London) | 1/1 (100%) | Passed |
| Dedupe — job title rephrased | 1/1 (100%) | Passed |
| Pattern — repeated sushi meals fold into "regularly eats sushi" | 1/1 (100%) | Passed |
| Pattern boundary — distinct one-off dated events stay distinct | 1/1 (100%) | Passed |
| Independence — peanut allergy + tea preference survive unrelated hiking fact | 1/1 (100%) | Passed |
| Independence — software-engineer job survives unrelated guitar fact | 1/1 (100%) | Passed |
| Meta-narrative — capability-denial line dropped, real directive kept | 1/1 (100%) | Passed |
| Meta-narrative — assistant-suggested line dropped, factual lookup survives | 1/1 (100%) | Passed |
| Meta-narrative — polluted node receiving new fact: drop + incorporate | 1/1 (100%) | Passed |
| Meta-narrative — clean directives node not over-pruned | 1/1 (100%) | Passed |
| Batched merge — three independent new facts in one call all land | 1/1 (100%) | Passed |

Notes: the pattern-boundary case was previously xfail(strict=False) because gemma4:e2b clustered dated entries and silently dropped older ones. After the META-NARRATIVE rule landed it now passes 3/3 reps; the causal link is unconfirmed but the eval is the right place to catch a regression, so the marker is dropped and the case stands as a regular PASS.


📖 Legend

Symbol Meaning
Fully passed (100% pass rate)
⚠️ Partial pass (some runs failed)
Fully failed (0% pass rate)
⏭️ Skipped (missing dependencies)
🔸 Expected failure (known limitation)
🎉 Unexpectedly passed (bug fixed!)
Not run for this model
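
The count-based rows of the legend can be expressed as a small classification over passes and runs. This is an illustrative sketch, not the report generator's code; `classify` is a hypothetical name. Skipped, expected-failure and unexpectedly-passed outcomes depend on test markers rather than counts, so they are left out.

```python
def classify(passes: int, runs: int) -> str:
    """Map a passes/runs cell to the legend's count-based meanings."""
    if runs == 0:
        return "Not run for this model"
    if passes == runs:
        return "Fully passed"
    if passes == 0:
        return "Fully failed"
    return "Partial pass"


for passes, runs in [(1, 1), (0, 1), (3, 5), (0, 0)]:
    print(f"{passes}/{runs}: {classify(passes, runs)}")
```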

Report generated by Jarvis eval suite