Some checks failed
Release / semantic-release (push) Successful in 59s
tests / Unit tests (Linux, Python 3.11) (push) Successful in 13m45s
Release / build-linux (push) Failing after 7m47s
Release / build-windows (push) Has been cancelled
Release / build-macos (arm64, macos-latest) (push) Has been cancelled
Release / build-macos (x64, macos-15-intel) (push) Has been cancelled
Release / release-main (push) Has been cancelled
Release / release-develop (push) Has been cancelled
Transform isair/jarvis into a Discord-controlled voice assistant running on the Ubuntu VNC desktop, keeping the mature ~39k-line Python brain intact. - bot/ (Node + bun, discord.js): /자비스 slash commands (ephemeral), voice channel join + voice receive/playback, pluggable VNC screen broadcast (selfbot live / noVNC / screenshot) - bridge/ (Python, Flask): wraps jarvis STT + run_reply_engine + Piper TTS behind a thin localhost HTTP API - .env.example, scripts/ (start_bridge/start_bot/dev), README rewrite, docs/language-comparison.md and docs/vnc-xfce-setup.md Language decision: hybrid (Python brain + Node/bun Discord layer) because Discord blocks bot video; native screen broadcast only works via a Node selfbot library.
290 lines
20 KiB
Markdown
290 lines
20 KiB
Markdown
# 🧪 Jarvis Evaluation Report
|
||
|
||
**Generated:** 2026-05-04 (gemma4:e2b column refreshed with retry-aware outcomes from a full `--single` run; gpt-oss:20b column inherited unchanged from the 2026-04-27 regen)
|
||
|
||
## 📊 TL;DR
|
||
|
||
**Overall:** 🟢 **340/354 passed (96.0%)** across all categories *(small-model column re-baselined from a fresh `gemma4:e2b` run with up to 3× retries; three new tests added in #352, one intent-judge regression introduced by `a8f133c` recovered by the prompt fix in this PR — see "Intent judge" below)*
|
||
|
||
| Category | Model | Passed | Failed | Skipped | Pass Rate |
|
||
|----------|-------|-------:|-------:|--------:|----------:|
|
||
| 🤖 Agent behaviour | `gemma4:e2b` | 136 | 7 | 2 | 🟢 95.1% |
|
||
| 🤖 Agent behaviour | `gpt-oss:20b` | 145 | 7 | 0 | 🟢 95.4% |
|
||
| 🎤 Intent judge | `gemma4:e2b` (fixed) | 48 | 0 | 0 | 🟢 100.0% |
|
||
| 🧠 Memory merge consolidation | `gemma4:e2b` | 11 | 0 | 0 | 🟢 100.0% |
|
||
|
||
### 💡 Model Selection Guide
|
||
|
||
| Model | Best For | Trade-offs |
|
||
|-------|----------|------------|
|
||
| `gemma4:e2b` | Quick responses, lower RAM usage | May struggle with complex reasoning |
|
||
| `gpt-oss:20b` | Best accuracy, complex tasks | Slower, requires more RAM |
|
||
|
||
---
|
||
|
||
## 🤖 Agent behaviour
|
||
|
||
> Runs the full agent pipeline against each judge model. Tests are compared side-by-side.
|
||
|
||
| Test Case | gemma4:e2b | gpt-oss:20b |
|
||
|-----------|----------:|----------:|
|
||
| 3-turn conversation with topic changes | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
|
||
| Active hot window follow up accepted | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
|
||
| Adversarial: all three branches in one summary | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
|
||
| Adversarial: food preference (USER) vs list-length rule (DIRECTIVES) | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
|
||
| Agent calls webSearch for info queries | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
|
||
| Agent chains search → fetch for details | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
|
||
| Agent uses memory + nutrition data | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
|
||
| Assistant checks memory before asking about interests | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
|
||
| Assistant does not deny having long-term memory | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
|
||
| Bad: deflection without attempting answer | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
|
||
| Bad: empty acknowledgment | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
|
||
| Bad: generic greeting ignores query | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
|
||
| Casual statement without wake word rejected | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
|
||
| Chained research: who directed Possessor and what else have they made | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
|
||
| Correction loop accepts single or retry | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
|
||
| Cross turn pronoun resolution | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
|
||
| DIRECTIVES: tone, length, forbidden phrases, address form | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
|
||
| Date query with date in context returns none | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
|
||
| Diary location grounds getWeather call (#352) | ❌ 0/1 (0%) | ➖ |
|
||
| Diet changed from bulking to cutting | ⏭️ SKIPPED | 🔸 1/1 XFAIL |
|
||
| Digested tool result produces grounded reply | 🔸 1/1 XFAIL | ✅ 1/1 (100%) |
|
||
| Director-then-filmography needs two searches | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
|
||
| Enrichment results appear in system message | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
|
||
| Enrichment skips questions answered by context | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
|
||
| Escape hatch then follow up action | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
|
||
| Evaluator emits structured tool call for obvious search | ✅ 1/1 (100%) | 🔸 1/1 XFAIL |
|
||
| Extraction with explicit quantities | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
|
||
| First turn calls web search not clarification | 🔸 1/1 XFAIL | ✅ 1/1 (100%) |
|
||
| Follow up after correction calls web search | 🔸 1/1 XFAIL | ✅ 1/1 (100%) |
|
||
| Follow up resolves pronoun in search query | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
|
||
| Follow-up references previous turn context | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
|
||
| Followup naming place routes to getWeather | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
|
||
| Followup supplies missing tool arg — short follow-up continues previous tool chain (#352) | ✅ 1/1 (100%) | ➖ |
|
||
| Good: brief but informative | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
|
||
| Good: complete weekly forecast | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
|
||
| Graph supplies missing tool arg — warm-profile fact grounds getWeather call (#352) | ❌ 0/1 (0%) | ➖ |
|
||
| Graph-enriched facts surface in the reply, no denial | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
|
||
| Greeting: hello | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
|
||
| Greeting: ni hao (Chinese) | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
|
||
| Handles ambiguous portion descriptions | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
|
||
| Honest block when all providers fail | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
|
||
| Hot window query is directed and non empty | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
|
||
| Identity query does not trigger recommendation engagement rule | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
|
||
| Identity query surfaces multiple user facts when present | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
|
||
| Identity query surfaces user stated fact over past qa | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
|
||
| Identity query with only past qa returns none or no false facts | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
|
||
| Instruction: be more brief | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
|
||
| Instruction: use Celsius | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
|
||
| Judge echo claim overridden in hot window | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
|
||
| LLM uses enrichment-surfaced interests for personalised search | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
|
||
| Links only payload produces honest cant read reply | 🔸 1/1 XFAIL | ✅ 1/1 (100%) |
|
||
| Location context flows to search queries | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
|
||
| Location query with location in context returns none | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
|
||
| Location query with partial hint still routes sensibly | 🔸 1/1 XFAIL | ✅ 1/1 (100%) |
|
||
| LogMealTool stores meals with macros | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
|
||
| Max-turn cap delivers a digest reply, never silence | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
|
||
| Memory enrichment: personalized news | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
|
||
| Memory enrichment: time-based recall | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
|
||
| Memory enrichment: topic recall | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
|
||
| Mixed summary: keep novel facts, drop stale weather/recommendations | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
|
||
| Navigate prose gets nudged into tool call | 🔸 1/1 XFAIL | ✅ 1/1 (100%) |
|
||
| No deflection: tech news | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
|
||
| No deflection: time query | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
|
||
| No deflection: tomorrow weather | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
|
||
| No deflection: weekly rain forecast | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
|
||
| No email tool declines honestly | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
|
||
| No hint at all still routes sensibly | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
|
||
| No wake word rejected despite judge | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
|
||
| Novel knowledge: local business details and user location | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
|
||
| Novel knowledge: non-English summary (Turkish) | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
|
||
| Novel knowledge: relocation plans and employment | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
|
||
| Novel knowledge: user diet plan and preferred recipe | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
|
||
| Nudge cap stops loop | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
|
||
| Nutrition: cheeseburger with fries | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
|
||
| Nutrition: chicken with broccoli | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
|
||
| Nutrition: oatmeal with banana | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
|
||
| Office days changed from Mon/Wed to Mon/Thu | ⏭️ SKIPPED | 🔸 1/1 XFAIL |
|
||
| Omits deflection narration for unknown entity | ✅ 1/1 (100%) | 🔸 1/1 XFAIL |
|
||
| Omits deflection when topic never resolved | 🔸 1/1 XFAIL | ✅ 1/1 (100%) |
|
||
| Open-ended prompt grounds in stored knowledge | ❌ 0/1 (0%) | ✅ 1/1 (100%) |
|
||
| Parallel weather lookup: compare Paris and London | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
|
||
| Preserves legitimate user preferences | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
|
||
| Realistic web search payload is not deflected to links | 🔸 1/1 XFAIL | ✅ 1/1 (100%) |
|
||
| Recommendation query still surfaces engagement when user facts present | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
|
||
| Reframing: life events framed as facts with temporal context | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
|
||
| Reframing: requests become knowledge, not interaction descriptions | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
|
||
| Reject: assistant self-references (recommendations are not knowledge) | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
|
||
| Reject: stale temporal snapshots (weather, time of day) | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
|
||
| Restaurant recommendation surfaces past cuisine interest | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
|
||
| Returns NONE for non-food inputs | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
|
||
| Returns valid JSON with all required fields | ❌ 0/1 (0%) | ✅ 1/1 (100%) |
|
||
| Simple meal baseline (2 boiled eggs) | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
|
||
| Single weather query ends after one tool call | ✅ 1/1 (100%) | ❌ 0/1 (0%) |
|
||
| Speech long after tts requires wake word | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
|
||
| Stop during tts interrupts immediately | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
|
||
| Time query with time in context returns none | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
|
||
| Tool calls literal not surfaced after web search | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
|
||
| Tool retry: explicit tool mention | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
|
||
| Tool retry: vague go ahead | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
|
||
| Tool retry: vague just try | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
|
||
| Toolsearchtool widens then navigate | 🔸 1/1 XFAIL | 🔸 1/1 XFAIL |
|
||
| Topic switch: search → weather uses getWeather | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
|
||
| Topic switch: weather → store hours uses webSearch | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
|
||
| Trivial conversations produce no extracted facts | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
|
||
| Tts echo segments skipped user query extracted | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
|
||
| Turn1 possessor then turn2 weather | ❌ 0/1 (0%) | ✅ 1/1 (100%) |
|
||
| Two-turn celebrity flow: identity then pronoun follow-up | 🔸 1/1 XFAIL | ❌ 0/1 (0%) |
|
||
| USER: identity, location, pets, diet, job | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
|
||
| Unknown entity with poisoned diary still triggers web search live | 🔸 1/1 XFAIL | ✅ 1/1 (100%) |
|
||
| Unknown entity: Piranesi (book) | 🔸 1/1 XFAIL | ✅ 1/1 (100%) |
|
||
| Unknown entity: Possessor (film) | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
|
||
| Unknown entity: have-you-heard-of (Piranesi) | 🔸 1/1 XFAIL | ✅ 1/1 (100%) |
|
||
| Unknown entity: permission-framed (Possessor) | 🔸 1/1 XFAIL | ✅ 1/1 (100%) |
|
||
| Unrelated domain still returns none | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
|
||
| Unrelated topics are not welded into one clause | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
|
||
| User query not confused with echo after tts | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
|
||
| Utterance started during tts treated as hot window | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
|
||
| WORLD: local business details, film attribution | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
|
||
| Wake word query after echo segments | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
|
||
| Wake word query uses judge extraction | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
|
||
| Watch recommendation surfaces recently discussed films | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
|
||
| Weather query is answered with current conditions | ❌ 0/1 (0%) | ✅ 1/1 (100%) |
|
||
| Weather query still picks getWeather | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
|
||
| Weather query still triggers tools after a greeting | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
|
||
| Wikipedia payload produces grounded reply | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
|
||
| Wikipedia rescues when ddg blocks | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
|
||
| calorie budget \u2192 fetchMeals | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
|
||
| cold-memory-short-query-how's the weather | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
|
||
| cold-memory-week-forecast-what's the weather this week | ✅ 1/1 (100%) | ❌ 0/1 (0%) |
|
||
| dietary check \u2192 fetchMeals | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
|
||
| explicit-recall-then-search | ✅ 1/1 (100%) | ❌ 0/1 (0%) |
|
||
| find the invoice PDF on my computer | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
|
||
| food decision \u2192 fetchMeals | 🔸 1/1 XFAIL | ✅ 1/1 (100%) |
|
||
| jacket \u2192 getWeather | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
|
||
| location weather query selects getWeather and few others | ✅ 1/1 (100%) | ❌ 0/1 (0%) |
|
||
| log that I just ate a banana | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
|
||
| meal logging selects logMeal and few others | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
|
||
| meal recall (colloquial) \u2192 fetchMeals | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
|
||
| meal recall selects fetchMeals and few others | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
|
||
| news-interesting-for-me | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
|
||
| news-of-interest-to-me | ✅ 1/1 (100%) | ❌ 0/1 (0%) |
|
||
| news-that-would-interest-me | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
|
||
| recommend a book I'd like | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
|
||
| research \u2192 webSearch + fetchWebPage | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
|
||
| run forecast \u2192 getWeather | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
|
||
| search the web for flight deals | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
|
||
| suggest something I'd enjoy watching ton | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
|
||
| take a screenshot | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
|
||
| tell me some news that might interest me | ✅ 1/1 (100%) | ❌ 0/1 (0%) |
|
||
| warm-memory-short-query-how's the weather | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
|
||
| weather + meals | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
|
||
| weather query selects getWeather and few others | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
|
||
| web search query selects webSearch and few others | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
|
||
| weekly weather keeps getWeather | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
|
||
| what is the capital of France | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
|
||
| what should I cook for dinner | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
|
||
| what's 2 plus 2 | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
|
||
| what's on my screen right now? | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
|
||
| what's the weather like? | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
|
||
| who is Britney Spears | ❌ 0/1 (0%) | ✅ 1/1 (100%) |
|
||
|
||
---
|
||
|
||
## 🎤 Intent judge
|
||
|
||
> Pinned to `gemma4:e2b` (the voice intent classifier). Not affected by the judge model. Re-run on 2026-05-04 with the prompt fix in this PR; cells repped 5× where they sit on the small-model edge.
|
||
|
||
**Notes:**
|
||
- `cross_segment_answer_that_with_noise` regressed between `main` and `develop` (introduced by `a8f133c`'s "big Mac" few-shot example, which biased the small model toward preserving user text instead of resolving cross-segment imperatives). Two contrasting examples added in this PR — one for prior-question-with-noise, one for the multi-word "go ahead and answer" imperative — restore both this case and `multi_person_weather_discussion` and `cross_segment_go_ahead_and_answer` (each 5/5).
|
||
- New case `wake_word_trailing_after_capitalised_brand` (added in `a8f133c`) covers the original "big Mac" regression and is preserved by the fix.
|
||
- The three edge cases were each repped 5× during the prompt iteration to confirm stability; recorded as 1/1 here for consistency with the rest of the table.
|
||
|
||
| Test Case | Pass Rate | Status |
|
||
|-----------|-----------|:------:|
|
||
| Hot window mode indicated in prompt | 1/1 (100%) | ✅ |
|
||
| Old query not re extracted | 1/1 (100%) | ✅ |
|
||
| Processed segment not reextracted | 1/1 (100%) | ✅ |
|
||
| Returns none when ollama unavailable | 1/1 (100%) | ✅ |
|
||
| System prompt has echo guidance | 1/1 (100%) | ✅ |
|
||
| Tts text included for echo detection | 1/1 (100%) | ✅ |
|
||
| alias_after_narrative_context | 1/1 (100%) | ✅ |
|
||
| alias_treated_as_wake_word | 1/1 (100%) | ✅ |
|
||
| buffer_echo_then_followup_hot_window | 1/1 (100%) | ✅ |
|
||
| buried_target_amid_unrelated_chatter | 1/1 (100%) | ✅ |
|
||
| buried_target_plural_vague_ref_they | 1/1 (100%) | ✅ |
|
||
| buried_target_topicless_question | 1/1 (100%) | ✅ |
|
||
| context_synthesis_weather_opinion | 1/1 (100%) | ✅ |
|
||
| context_synthesis_with_prior_ambient | 1/1 (100%) | ✅ |
|
||
| cross_segment_answer_that_weather | 1/1 (100%) | ✅ |
|
||
| cross_segment_answer_that_with_noise | 1/1 (100%) | ✅ |
|
||
| cross_segment_answered_that_whisper_variant | 1/1 (100%) | ✅ |
|
||
| cross_segment_dinosaur_opinion | 1/1 (100%) | ✅ |
|
||
| cross_segment_go_ahead_and_answer | 1/1 (100%) | ✅ |
|
||
| cross_segment_hot_window_followup | 1/1 (100%) | ✅ |
|
||
| cross_segment_imperative_superseded_by_new_question | 1/1 (100%) | ✅ |
|
||
| echo_plus_followup_extracted | 1/1 (100%) | ✅ |
|
||
| echo_plus_rejected_similar_plus_wake_retry | 1/1 (100%) | ✅ |
|
||
| hot_window_override_topicless_followup | 1/1 (100%) | ✅ |
|
||
| hot_window_simple_followup | 1/1 (100%) | ✅ |
|
||
| mentioned_in_narrative_past_tense | 1/1 (100%) | ✅ |
|
||
| multi_person_vague_reference | 1/1 (100%) | ✅ |
|
||
| multi_person_weather_discussion | 1/1 (100%) | ✅ |
|
||
| multiple_echoes_then_interrupt | 1/1 (100%) | ✅ |
|
||
| no_wake_word_casual_speech | 1/1 (100%) | ✅ |
|
||
| no_wake_word_in_buffer | 1/1 (100%) | ✅ |
|
||
| stop_command_during_tts | 1/1 (100%) | ✅ |
|
||
| user_followup_statement_after_question_nihilism | 1/1 (100%) | ✅ |
|
||
| wake_word_after_narrative_addresses_assistant | 1/1 (100%) | ✅ |
|
||
| wake_word_command_timer | 1/1 (100%) | ✅ |
|
||
| wake_word_mid_sentence | 1/1 (100%) | ✅ |
|
||
| wake_word_open_imperative_give_me_advice | 1/1 (100%) | ✅ |
|
||
| wake_word_open_imperative_say_something | 1/1 (100%) | ✅ |
|
||
| wake_word_open_imperative_surprise_me | 1/1 (100%) | ✅ |
|
||
| wake_word_open_imperative_tell_me_a_joke | 1/1 (100%) | ✅ |
|
||
| wake_word_open_imperative_tell_me_anything | 1/1 (100%) | ✅ |
|
||
| wake_word_share_statement_burger | 1/1 (100%) | ✅ |
|
||
| wake_word_share_statement_feeling | 1/1 (100%) | ✅ |
|
||
| wake_word_share_statement_trailing | 1/1 (100%) | ✅ |
|
||
| wake_word_simple_question | 1/1 (100%) | ✅ |
|
||
| wake_word_statement_remember | 1/1 (100%) | ✅ |
|
||
| wake_word_trailing_after_capitalised_brand | 1/1 (100%) | ✅ |
|
||
| wake_word_trailing_after_named_entity | 1/1 (100%) | ✅ |
|
||
|
||
---
|
||
|
||
## 🧠 Memory merge consolidation
|
||
|
||
> Exercises `merge_node_data` against a real picker model. Pins the rewrite-on-write merge against its five advertised behaviours: dedupe of near-duplicates, pattern consolidation of repeated activities, independence (unrelated facts coexist, no silent erasure), meta-narrative pruning (assistant-narrating extractor leftovers get scrubbed), and end-to-end correctness of the batched signature. Run via `pytest evals/test_merge_consolidation.py`.
|
||
|
||
| Test Case | Pass Rate | Status |
|
||
|-----------|-----------|:------:|
|
||
| Dedupe — same fact, different wording (lives-in vs based-in London) | 1/1 (100%) | ✅ |
|
||
| Dedupe — job title rephrased | 1/1 (100%) | ✅ |
|
||
| Pattern — repeated sushi meals fold into "regularly eats sushi" | 1/1 (100%) | ✅ |
|
||
| Pattern boundary — distinct one-off dated events stay distinct | 1/1 (100%) | ✅ |
|
||
| Independence — peanut allergy + tea preference survive unrelated hiking fact | 1/1 (100%) | ✅ |
|
||
| Independence — software-engineer job survives unrelated guitar fact | 1/1 (100%) | ✅ |
|
||
| Meta-narrative — capability-denial line dropped, real directive kept | 1/1 (100%) | ✅ |
|
||
| Meta-narrative — assistant-suggested line dropped, factual lookup survives | 1/1 (100%) | ✅ |
|
||
| Meta-narrative — polluted node receiving new fact: drop + incorporate | 1/1 (100%) | ✅ |
|
||
| Meta-narrative — clean directives node not over-pruned | 1/1 (100%) | ✅ |
|
||
| Batched merge — three independent new facts in one call all land | 1/1 (100%) | ✅ |
|
||
|
||
**Notes:** the pattern-boundary case was previously `xfail(strict=False)` because `gemma4:e2b` clustered dated entries and silently dropped older ones. After the META-NARRATIVE rule landed it now passes 3/3 reps; the causal link is unconfirmed but the eval is the right place to catch a regression, so the marker is dropped and the case stands as a regular PASS.
|
||
|
||
---
|
||
|
||
### 📖 Legend
|
||
|
||
| Symbol | Meaning |
|
||
|--------|---------|
|
||
| ✅ | Fully passed (100% pass rate) |
|
||
| ⚠️ | Partial pass (some runs failed) |
|
||
| ❌ | Fully failed (0% pass rate) |
|
||
| ⏭️ | Skipped (missing dependencies) |
|
||
| 🔸 | Expected failure (known limitation) |
|
||
| 🎉 | Unexpectedly passed (bug fixed!) |
|
||
| ➖ | Not run for this model |
|
||
|
||
*Report generated by Jarvis eval suite* |