javis_bot/EVALS.md
# 🧪 Jarvis Evaluation Report
**Generated:** 2026-05-04 (gemma4:e2b column refreshed with retry-aware outcomes from a full `--single` run; gpt-oss:20b column inherited unchanged from the 2026-04-27 regen)
## 📊 TL;DR
**Overall:** 🟢 **340/354 passed (96.0%)** across all categories *(the small-model column was re-baselined from a fresh `gemma4:e2b` run with up to 3× retries; three new tests were added in #352; one intent-judge regression introduced by `a8f133c` was recovered by the prompt fix in this PR, see "Intent judge" below)*
| Category | Model | Passed | Failed | Skipped | Pass Rate |
|----------|-------|-------:|-------:|--------:|----------:|
| 🤖 Agent behaviour | `gemma4:e2b` | 136 | 7 | 2 | 🟢 95.1% |
| 🤖 Agent behaviour | `gpt-oss:20b` | 145 | 7 | 0 | 🟢 95.4% |
| 🎤 Intent judge | `gemma4:e2b` (pinned) | 48 | 0 | 0 | 🟢 100.0% |
| 🧠 Memory merge consolidation | `gemma4:e2b` | 11 | 0 | 0 | 🟢 100.0% |
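
The pass rates in the table appear to be computed over passed plus failed cases only, with skipped and expected-failure (XFAIL) cases left out of the denominator; for example, 136 / (136 + 7) gives 95.1%. The sketch below reproduces that arithmetic from the figures in the table. The `pass_rate` helper is a hypothetical name for illustration, not part of the eval suite.

```python
def pass_rate(passed: int, failed: int) -> float:
    """Percentage of passed cases among those that passed or failed.

    Assumption: skipped and XFAIL cases are excluded from the denominator,
    which is what reproduces the figures in the summary table.
    """
    total = passed + failed
    return 100.0 * passed / total if total else 0.0


# (category, model) -> (passed, failed), copied from the summary table.
summary = {
    ("Agent behaviour", "gemma4:e2b"): (136, 7),
    ("Agent behaviour", "gpt-oss:20b"): (145, 7),
    ("Intent judge", "gemma4:e2b"): (48, 0),
    ("Memory merge consolidation", "gemma4:e2b"): (11, 0),
}

rates = {key: round(pass_rate(p, f), 1) for key, (p, f) in summary.items()}
total_passed = sum(p for p, _ in summary.values())
total_failed = sum(f for _, f in summary.values())
overall = round(pass_rate(total_passed, total_failed), 1)

for (category, model), rate in rates.items():
    print(f"{category} [{model}]: {rate}%")
print(f"Overall: {total_passed}/{total_passed + total_failed} ({overall}%)")
```

Run as written, the last line prints `Overall: 340/354 (96.0%)`, matching the headline figure.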
### 💡 Model Selection Guide
| Model | Best For | Trade-offs |
|-------|----------|------------|
| `gemma4:e2b` | Quick responses, lower RAM usage | May struggle with complex reasoning |
| `gpt-oss:20b` | Best accuracy, complex tasks | Slower, requires more RAM |
---
## 🤖 Agent behaviour
> Runs the full agent pipeline against each judge model. Tests are compared side-by-side.
| Test Case | gemma4:e2b | gpt-oss:20b |
|-----------|----------:|----------:|
| 3-turn conversation with topic changes | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| Active hot window follow up accepted | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| Adversarial: all three branches in one summary | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| Adversarial: food preference (USER) vs list-length rule (DIRECTIVES) | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| Agent calls webSearch for info queries | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| Agent chains search → fetch for details | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| Agent uses memory + nutrition data | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| Assistant checks memory before asking about interests | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| Assistant does not deny having long-term memory | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| Bad: deflection without attempting answer | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| Bad: empty acknowledgment | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| Bad: generic greeting ignores query | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| Casual statement without wake word rejected | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| Chained research: who directed Possessor and what else have they made | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| Correction loop accepts single or retry | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| Cross turn pronoun resolution | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| DIRECTIVES: tone, length, forbidden phrases, address form | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| Date query with date in context returns none | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| Diary location grounds getWeather call (#352) | ❌ 0/1 (0%) | |
| Diet changed from bulking to cutting | ⏭️ SKIPPED | 🔸 1/1 XFAIL |
| Digested tool result produces grounded reply | 🔸 1/1 XFAIL | ✅ 1/1 (100%) |
| Director-then-filmography needs two searches | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| Enrichment results appear in system message | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| Enrichment skips questions answered by context | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| Escape hatch then follow up action | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| Evaluator emits structured tool call for obvious search | ✅ 1/1 (100%) | 🔸 1/1 XFAIL |
| Extraction with explicit quantities | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| First turn calls web search not clarification | 🔸 1/1 XFAIL | ✅ 1/1 (100%) |
| Follow up after correction calls web search | 🔸 1/1 XFAIL | ✅ 1/1 (100%) |
| Follow up resolves pronoun in search query | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| Follow-up references previous turn context | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| Followup naming place routes to getWeather | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| Followup supplies missing tool arg — short follow-up continues previous tool chain (#352) | ✅ 1/1 (100%) | |
| Good: brief but informative | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| Good: complete weekly forecast | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| Graph supplies missing tool arg — warm-profile fact grounds getWeather call (#352) | ❌ 0/1 (0%) | |
| Graph-enriched facts surface in the reply, no denial | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| Greeting: hello | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| Greeting: ni hao (Chinese) | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| Handles ambiguous portion descriptions | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| Honest block when all providers fail | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| Hot window query is directed and non empty | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| Identity query does not trigger recommendation engagement rule | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| Identity query surfaces multiple user facts when present | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| Identity query surfaces user stated fact over past qa | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| Identity query with only past qa returns none or no false facts | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| Instruction: be more brief | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| Instruction: use Celsius | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| Judge echo claim overridden in hot window | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| LLM uses enrichment-surfaced interests for personalised search | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| Links only payload produces honest cant read reply | 🔸 1/1 XFAIL | ✅ 1/1 (100%) |
| Location context flows to search queries | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| Location query with location in context returns none | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| Location query with partial hint still routes sensibly | 🔸 1/1 XFAIL | ✅ 1/1 (100%) |
| LogMealTool stores meals with macros | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| Max-turn cap delivers a digest reply, never silence | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| Memory enrichment: personalized news | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| Memory enrichment: time-based recall | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| Memory enrichment: topic recall | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| Mixed summary: keep novel facts, drop stale weather/recommendations | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| Navigate prose gets nudged into tool call | 🔸 1/1 XFAIL | ✅ 1/1 (100%) |
| No deflection: tech news | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| No deflection: time query | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| No deflection: tomorrow weather | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| No deflection: weekly rain forecast | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| No email tool declines honestly | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| No hint at all still routes sensibly | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| No wake word rejected despite judge | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| Novel knowledge: local business details and user location | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| Novel knowledge: non-English summary (Turkish) | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| Novel knowledge: relocation plans and employment | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| Novel knowledge: user diet plan and preferred recipe | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| Nudge cap stops loop | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| Nutrition: cheeseburger with fries | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| Nutrition: chicken with broccoli | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| Nutrition: oatmeal with banana | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| Office days changed from Mon/Wed to Mon/Thu | ⏭️ SKIPPED | 🔸 1/1 XFAIL |
| Omits deflection narration for unknown entity | ✅ 1/1 (100%) | 🔸 1/1 XFAIL |
| Omits deflection when topic never resolved | 🔸 1/1 XFAIL | ✅ 1/1 (100%) |
| Open-ended prompt grounds in stored knowledge | ❌ 0/1 (0%) | ✅ 1/1 (100%) |
| Parallel weather lookup: compare Paris and London | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| Preserves legitimate user preferences | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| Realistic web search payload is not deflected to links | 🔸 1/1 XFAIL | ✅ 1/1 (100%) |
| Recommendation query still surfaces engagement when user facts present | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| Reframing: life events framed as facts with temporal context | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| Reframing: requests become knowledge, not interaction descriptions | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| Reject: assistant self-references (recommendations are not knowledge) | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| Reject: stale temporal snapshots (weather, time of day) | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| Restaurant recommendation surfaces past cuisine interest | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| Returns NONE for non-food inputs | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| Returns valid JSON with all required fields | ❌ 0/1 (0%) | ✅ 1/1 (100%) |
| Simple meal baseline (2 boiled eggs) | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| Single weather query ends after one tool call | ✅ 1/1 (100%) | ❌ 0/1 (0%) |
| Speech long after tts requires wake word | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| Stop during tts interrupts immediately | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| Time query with time in context returns none | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| Tool calls literal not surfaced after web search | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| Tool retry: explicit tool mention | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| Tool retry: vague go ahead | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| Tool retry: vague just try | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| Toolsearchtool widens then navigate | 🔸 1/1 XFAIL | 🔸 1/1 XFAIL |
| Topic switch: search → weather uses getWeather | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| Topic switch: weather → store hours uses webSearch | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| Trivial conversations produce no extracted facts | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| Tts echo segments skipped user query extracted | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| Turn1 possessor then turn2 weather | ❌ 0/1 (0%) | ✅ 1/1 (100%) |
| Two-turn celebrity flow: identity then pronoun follow-up | 🔸 1/1 XFAIL | ❌ 0/1 (0%) |
| USER: identity, location, pets, diet, job | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| Unknown entity with poisoned diary still triggers web search live | 🔸 1/1 XFAIL | ✅ 1/1 (100%) |
| Unknown entity: Piranesi (book) | 🔸 1/1 XFAIL | ✅ 1/1 (100%) |
| Unknown entity: Possessor (film) | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| Unknown entity: have-you-heard-of (Piranesi) | 🔸 1/1 XFAIL | ✅ 1/1 (100%) |
| Unknown entity: permission-framed (Possessor) | 🔸 1/1 XFAIL | ✅ 1/1 (100%) |
| Unrelated domain still returns none | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| Unrelated topics are not welded into one clause | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| User query not confused with echo after tts | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| Utterance started during tts treated as hot window | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| WORLD: local business details, film attribution | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| Wake word query after echo segments | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| Wake word query uses judge extraction | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| Watch recommendation surfaces recently discussed films | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| Weather query is answered with current conditions | ❌ 0/1 (0%) | ✅ 1/1 (100%) |
| Weather query still picks getWeather | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| Weather query still triggers tools after a greeting | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| Wikipedia payload produces grounded reply | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| Wikipedia rescues when ddg blocks | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| calorie budget → fetchMeals | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| cold-memory-short-query-how's the weather | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| cold-memory-week-forecast-what's the weather this week | ✅ 1/1 (100%) | ❌ 0/1 (0%) |
| dietary check → fetchMeals | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| explicit-recall-then-search | ✅ 1/1 (100%) | ❌ 0/1 (0%) |
| find the invoice PDF on my computer | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| food decision → fetchMeals | 🔸 1/1 XFAIL | ✅ 1/1 (100%) |
| jacket → getWeather | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| location weather query selects getWeather and few others | ✅ 1/1 (100%) | ❌ 0/1 (0%) |
| log that I just ate a banana | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| meal logging selects logMeal and few others | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| meal recall (colloquial) → fetchMeals | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| meal recall selects fetchMeals and few others | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| news-interesting-for-me | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| news-of-interest-to-me | ✅ 1/1 (100%) | ❌ 0/1 (0%) |
| news-that-would-interest-me | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| recommend a book I'd like | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| research → webSearch + fetchWebPage | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| run forecast → getWeather | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| search the web for flight deals | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| suggest something I'd enjoy watching ton | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| take a screenshot | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| tell me some news that might interest me | ✅ 1/1 (100%) | ❌ 0/1 (0%) |
| warm-memory-short-query-how's the weather | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| weather + meals | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| weather query selects getWeather and few others | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| web search query selects webSearch and few others | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| weekly weather keeps getWeather | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| what is the capital of France | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| what should I cook for dinner | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| what's 2 plus 2 | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| what's on my screen right now? | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| what's the weather like? | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| who is Britney Spears | ❌ 0/1 (0%) | ✅ 1/1 (100%) |
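
To cross-check the summary counts against the per-test table, the status cells can be tallied per model column. The sketch below does this for four rows copied from the table; `classify` and `tally` are hypothetical helper names, and the parser assumes that test names never contain a `|` character.

```python
from collections import Counter

# A few rows copied from the table above; a real check would read the whole file.
ROWS = """\
| Diary location grounds getWeather call (#352) | ❌ 0/1 (0%) | |
| Diet changed from bulking to cutting | ⏭️ SKIPPED | 🔸 1/1 XFAIL |
| Digested tool result produces grounded reply | 🔸 1/1 XFAIL | ✅ 1/1 (100%) |
| Single weather query ends after one tool call | ✅ 1/1 (100%) | ❌ 0/1 (0%) |
"""

# Leading symbol of a result cell -> status name (symbols as in the legend).
SYMBOLS = {"✅": "passed", "❌": "failed", "🔸": "xfail", "⏭": "skipped"}


def classify(cell: str) -> str:
    """Map one result cell to a status name; an empty cell means not run."""
    cell = cell.strip()
    if not cell:
        return "not_run"
    for symbol, status in SYMBOLS.items():
        if cell.startswith(symbol):
            return status
    return "other"  # header or separator rows end up here


def tally(table: str, models: list[str]) -> dict[str, Counter]:
    """Count statuses per model for rows shaped `| name | cell | cell |`."""
    counts = {model: Counter() for model in models}
    for line in table.splitlines():
        line = line.strip()
        if not (line.startswith("|") and line.endswith("|")):
            continue
        cells = line[1:-1].split("|")  # first cell is the test name
        for model, cell in zip(models, cells[1:]):
            counts[model][classify(cell)] += 1
    return counts


counts = tally(ROWS, ["gemma4:e2b", "gpt-oss:20b"])
for model, counter in counts.items():
    print(model, dict(counter))
```

Applied to the full table, the same tally should reproduce the passed, failed, and skipped figures reported in the TL;DR for each model.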
---
## 🎤 Intent judge
> Pinned to `gemma4:e2b` (the voice intent classifier), so results are not affected by the judge model. Re-run on 2026-05-04 with the prompt fix in this PR; cases that sit on the small-model edge were repeated 5×.
**Notes:**
- `cross_segment_answer_that_with_noise` regressed between `main` and `develop`. The regression was introduced by `a8f133c`'s "big Mac" few-shot example, which biased the small model toward preserving user text instead of resolving cross-segment imperatives. This PR adds two contrasting examples, one for a prior question with noise and one for the multi-word "go ahead and answer" imperative. Together they restore this case as well as `multi_person_weather_discussion` and `cross_segment_go_ahead_and_answer` (each 5/5).
- New case `wake_word_trailing_after_capitalised_brand` (added in `a8f133c`) covers the original "big Mac" regression and is preserved by the fix.
- The three edge cases were each repeated 5× during the prompt iteration to confirm stability; they are recorded as 1/1 here for consistency with the rest of the table.
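
One way to read the repetition note above: a borderline case is run several times, counted as stable only when every repetition passes, and then shown in the table as a single 1/1 entry. The sketch below illustrates that reading. `repeat_case`, `table_cell`, and `fake_case` are hypothetical placeholders, not functions from the eval suite, and the handling of partially passing cases is an assumption.

```python
from typing import Callable


def repeat_case(run_case: Callable[[], bool], reps: int = 5) -> tuple[int, int]:
    """Run one eval case `reps` times and return (passes, reps)."""
    passes = sum(1 for _ in range(reps) if run_case())
    return passes, reps


def table_cell(passes: int, reps: int) -> str:
    """Collapse a fully stable case to the 1/1 form used in the table.

    Assumption: anything short of reps/reps is reported as-is rather than
    being rounded up to a pass.
    """
    if passes == reps:
        return "1/1 (100%)"
    return f"{passes}/{reps} ({100 * passes // reps}%)"


# Hypothetical stand-in for a real intent-judge case; it always passes here.
def fake_case() -> bool:
    return True


passes, reps = repeat_case(fake_case)
print(table_cell(passes, reps))  # stable case, shown as 1/1
print(table_cell(4, 5))          # an unstable case would keep its real ratio
```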
| Test Case | Pass Rate | Status |
|-----------|-----------|:------:|
| Hot window mode indicated in prompt | 1/1 (100%) | ✅ |
| Old query not re extracted | 1/1 (100%) | ✅ |
| Processed segment not reextracted | 1/1 (100%) | ✅ |
| Returns none when ollama unavailable | 1/1 (100%) | ✅ |
| System prompt has echo guidance | 1/1 (100%) | ✅ |
| Tts text included for echo detection | 1/1 (100%) | ✅ |
| alias_after_narrative_context | 1/1 (100%) | ✅ |
| alias_treated_as_wake_word | 1/1 (100%) | ✅ |
| buffer_echo_then_followup_hot_window | 1/1 (100%) | ✅ |
| buried_target_amid_unrelated_chatter | 1/1 (100%) | ✅ |
| buried_target_plural_vague_ref_they | 1/1 (100%) | ✅ |
| buried_target_topicless_question | 1/1 (100%) | ✅ |
| context_synthesis_weather_opinion | 1/1 (100%) | ✅ |
| context_synthesis_with_prior_ambient | 1/1 (100%) | ✅ |
| cross_segment_answer_that_weather | 1/1 (100%) | ✅ |
| cross_segment_answer_that_with_noise | 1/1 (100%) | ✅ |
| cross_segment_answered_that_whisper_variant | 1/1 (100%) | ✅ |
| cross_segment_dinosaur_opinion | 1/1 (100%) | ✅ |
| cross_segment_go_ahead_and_answer | 1/1 (100%) | ✅ |
| cross_segment_hot_window_followup | 1/1 (100%) | ✅ |
| cross_segment_imperative_superseded_by_new_question | 1/1 (100%) | ✅ |
| echo_plus_followup_extracted | 1/1 (100%) | ✅ |
| echo_plus_rejected_similar_plus_wake_retry | 1/1 (100%) | ✅ |
| hot_window_override_topicless_followup | 1/1 (100%) | ✅ |
| hot_window_simple_followup | 1/1 (100%) | ✅ |
| mentioned_in_narrative_past_tense | 1/1 (100%) | ✅ |
| multi_person_vague_reference | 1/1 (100%) | ✅ |
| multi_person_weather_discussion | 1/1 (100%) | ✅ |
| multiple_echoes_then_interrupt | 1/1 (100%) | ✅ |
| no_wake_word_casual_speech | 1/1 (100%) | ✅ |
| no_wake_word_in_buffer | 1/1 (100%) | ✅ |
| stop_command_during_tts | 1/1 (100%) | ✅ |
| user_followup_statement_after_question_nihilism | 1/1 (100%) | ✅ |
| wake_word_after_narrative_addresses_assistant | 1/1 (100%) | ✅ |
| wake_word_command_timer | 1/1 (100%) | ✅ |
| wake_word_mid_sentence | 1/1 (100%) | ✅ |
| wake_word_open_imperative_give_me_advice | 1/1 (100%) | ✅ |
| wake_word_open_imperative_say_something | 1/1 (100%) | ✅ |
| wake_word_open_imperative_surprise_me | 1/1 (100%) | ✅ |
| wake_word_open_imperative_tell_me_a_joke | 1/1 (100%) | ✅ |
| wake_word_open_imperative_tell_me_anything | 1/1 (100%) | ✅ |
| wake_word_share_statement_burger | 1/1 (100%) | ✅ |
| wake_word_share_statement_feeling | 1/1 (100%) | ✅ |
| wake_word_share_statement_trailing | 1/1 (100%) | ✅ |
| wake_word_simple_question | 1/1 (100%) | ✅ |
| wake_word_statement_remember | 1/1 (100%) | ✅ |
| wake_word_trailing_after_capitalised_brand | 1/1 (100%) | ✅ |
| wake_word_trailing_after_named_entity | 1/1 (100%) | ✅ |
---
## 🧠 Memory merge consolidation
> Exercises `merge_node_data` against a real picker model. Pins the rewrite-on-write merge against its five advertised behaviours: dedupe of near-duplicates, pattern consolidation of repeated activities, independence (unrelated facts coexist, no silent erasure), meta-narrative pruning (assistant-narrating extractor leftovers get scrubbed), and end-to-end correctness of the batched signature. Run via `pytest evals/test_merge_consolidation.py`.
| Test Case | Pass Rate | Status |
|-----------|-----------|:------:|
| Dedupe — same fact, different wording (lives-in vs based-in London) | 1/1 (100%) | ✅ |
| Dedupe — job title rephrased | 1/1 (100%) | ✅ |
| Pattern — repeated sushi meals fold into "regularly eats sushi" | 1/1 (100%) | ✅ |
| Pattern boundary — distinct one-off dated events stay distinct | 1/1 (100%) | ✅ |
| Independence — peanut allergy + tea preference survive unrelated hiking fact | 1/1 (100%) | ✅ |
| Independence — software-engineer job survives unrelated guitar fact | 1/1 (100%) | ✅ |
| Meta-narrative — capability-denial line dropped, real directive kept | 1/1 (100%) | ✅ |
| Meta-narrative — assistant-suggested line dropped, factual lookup survives | 1/1 (100%) | ✅ |
| Meta-narrative — polluted node receiving new fact: drop + incorporate | 1/1 (100%) | ✅ |
| Meta-narrative — clean directives node not over-pruned | 1/1 (100%) | ✅ |
| Batched merge — three independent new facts in one call all land | 1/1 (100%) | ✅ |
**Notes:** the pattern-boundary case was previously `xfail(strict=False)` because `gemma4:e2b` clustered dated entries and silently dropped older ones. After the META-NARRATIVE rule landed it now passes 3/3 reps; the causal link is unconfirmed but the eval is the right place to catch a regression, so the marker is dropped and the case stands as a regular PASS.
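
Dropping the marker matters because of how pytest reports `xfail` tests: with `strict=False`, a failure is reported as XFAIL and a pass as XPASS, and neither fails the suite, so a regression in that case would go unnoticed. The sketch below models that documented behaviour in plain Python; it does not call pytest, and `pytest_outcome` is a hypothetical name used only for illustration.

```python
def pytest_outcome(test_passed: bool, *, xfail: bool = False,
                   strict: bool = False) -> str:
    """Model how pytest reports a test with or without an xfail marker."""
    if not xfail:
        return "passed" if test_passed else "failed"
    if not test_passed:
        return "xfailed"  # expected failure, the suite stays green
    # The test passed despite the marker.
    return "failed" if strict else "xpassed"


# With xfail(strict=False): neither outcome fails the suite.
print(pytest_outcome(True, xfail=True))   # xpassed
print(pytest_outcome(False, xfail=True))  # xfailed

# With the marker dropped: a regression now shows up as a real failure.
print(pytest_outcome(True))   # passed
print(pytest_outcome(False))  # failed
```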
---
### 📖 Legend
| Symbol | Meaning |
|--------|---------|
| ✅ | Fully passed (100% pass rate) |
| ⚠️ | Partial pass (some runs failed) |
| ❌ | Fully failed (0% pass rate) |
| ⏭️ | Skipped (missing dependencies) |
| 🔸 | Expected failure (known limitation) |
| 🎉 | Unexpectedly passed (bug fixed!) |
| *(blank)* | Not run for this model |
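
The legend can be read as a small decision rule over a case's run counts. The sketch below is one possible encoding of it, not the report generator's actual logic: `status_symbol` is a hypothetical name, `passed` counts runs that genuinely passed, and treating `total == 0` as "not run" is an assumption.

```python
def status_symbol(passed: int, total: int, *, skipped: bool = False,
                  expected_failure: bool = False) -> str:
    """Pick the legend symbol for one test case on one model.

    Assumptions: `total == 0` means the case was not run for this model,
    and an expected-failure case that passes on every run is shown as 🎉.
    """
    if skipped:
        return "⏭️"
    if total == 0:
        return ""  # blank cell: not run for this model
    if expected_failure:
        return "🎉" if passed == total else "🔸"
    if passed == total:
        return "✅"
    if passed == 0:
        return "❌"
    return "⚠️"


print(status_symbol(1, 1))                         # fully passed
print(status_symbol(0, 1))                         # fully failed
print(status_symbol(0, 1, expected_failure=True))  # known limitation
```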
*Report generated by Jarvis eval suite*