# 🧪 Jarvis Evaluation Report

**Generated:** 2026-05-04 (gemma4:e2b column refreshed with retry-aware outcomes from a full `--single` run; gpt-oss:20b column inherited unchanged from the 2026-04-27 regen)

## 📊 TL;DR

**Overall:** 🟢 **340/354 passed (96.0%)** across all categories

*(small-model column re-baselined from a fresh `gemma4:e2b` run with up to 3× retries; three new tests added in #352, one intent-judge regression introduced by `a8f133c` recovered by the prompt fix in this PR; see "Intent judge" below)*

| Category | Model | Passed | Failed | Skipped | Pass Rate |
|----------|-------|-------:|-------:|--------:|----------:|
| 🤖 Agent behaviour | `gemma4:e2b` | 136 | 7 | 2 | 🟢 95.1% |
| 🤖 Agent behaviour | `gpt-oss:20b` | 145 | 7 | 0 | 🟢 95.4% |
| 🎤 Intent judge | `gemma4:e2b` (fixed) | 48 | 0 | 0 | 🟢 100.0% |
| 🧠 Memory merge consolidation | `gemma4:e2b` | 11 | 0 | 0 | 🟢 100.0% |

### 💡 Model Selection Guide

| Model | Best For | Trade-offs |
|-------|----------|------------|
| `gemma4:e2b` | Quick responses, lower RAM usage | May struggle with complex reasoning |
| `gpt-oss:20b` | Best accuracy, complex tasks | Slower, requires more RAM |

---

## 🤖 Agent behaviour

> Runs the full agent pipeline against each judge model. Tests are compared side-by-side.

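
The category pass rates reported in the TL;DR are consistent with passed / (passed + failed), with skipped runs counted in neither term (for example, 136 / (136 + 7) for `gemma4:e2b`). That formula is inferred from the published numbers, not taken from the eval suite's code, and `pass_rate` is a hypothetical helper name used only for illustration. A minimal Python sketch that reproduces the reported figures under that assumption:

```python
def pass_rate(passed: int, failed: int) -> float:
    """Percentage of passes among runs that passed or failed (hypothetical helper)."""
    total = passed + failed
    return 100.0 * passed / total if total else 0.0


# (passed, failed) counts copied from the TL;DR table; skipped runs are left out.
categories = {
    ("Agent behaviour", "gemma4:e2b"): (136, 7),
    ("Agent behaviour", "gpt-oss:20b"): (145, 7),
    ("Intent judge", "gemma4:e2b"): (48, 0),
    ("Memory merge consolidation", "gemma4:e2b"): (11, 0),
}

# Round to one decimal place, matching the precision used in the report.
rates = {key: round(pass_rate(p, f), 1) for key, (p, f) in categories.items()}
total_passed = sum(p for p, _ in categories.values())
total_failed = sum(f for _, f in categories.values())
overall = round(pass_rate(total_passed, total_failed), 1)

for (category, model), rate in rates.items():
    print(f"{category} ({model}): {rate:.1f}%")
print(f"Overall: {total_passed}/{total_passed + total_failed} ({overall:.1f}%)")
```

Summing the four rows gives the 340 passes out of 354 scored runs quoted in the overall line.
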
| Test Case | gemma4:e2b | gpt-oss:20b |
|-----------|----------:|----------:|
| 3-turn conversation with topic changes | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| Active hot window follow up accepted | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| Adversarial: all three branches in one summary | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| Adversarial: food preference (USER) vs list-length rule (DIRECTIVES) | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| Agent calls webSearch for info queries | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| Agent chains search → fetch for details | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| Agent uses memory + nutrition data | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| Assistant checks memory before asking about interests | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| Assistant does not deny having long-term memory | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| Bad: deflection without attempting answer | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| Bad: empty acknowledgment | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| Bad: generic greeting ignores query | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| Casual statement without wake word rejected | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| Chained research: who directed Possessor and what else have they made | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| Correction loop accepts single or retry | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| Cross turn pronoun resolution | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| DIRECTIVES: tone, length, forbidden phrases, address form | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| Date query with date in context returns none | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| Diary location grounds getWeather call (#352) | ❌ 0/1 (0%) | ➖ |
| Diet changed from bulking to cutting | ⏭️ SKIPPED | 🔸 1/1 XFAIL |
| Digested tool result produces grounded reply | 🔸 1/1 XFAIL | ✅ 1/1 (100%) |
| Director-then-filmography needs two searches | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| Enrichment results appear in system message | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| Enrichment skips questions answered by context | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| Escape hatch then follow up action | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| Evaluator emits structured tool call for obvious search | ✅ 1/1 (100%) | 🔸 1/1 XFAIL |
| Extraction with explicit quantities | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| First turn calls web search not clarification | 🔸 1/1 XFAIL | ✅ 1/1 (100%) |
| Follow up after correction calls web search | 🔸 1/1 XFAIL | ✅ 1/1 (100%) |
| Follow up resolves pronoun in search query | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| Follow-up references previous turn context | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| Followup naming place routes to getWeather | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| Followup supplies missing tool arg - short follow-up continues previous tool chain (#352) | ✅ 1/1 (100%) | ➖ |
| Good: brief but informative | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| Good: complete weekly forecast | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| Graph supplies missing tool arg - warm-profile fact grounds getWeather call (#352) | ❌ 0/1 (0%) | ➖ |
| Graph-enriched facts surface in the reply, no denial | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| Greeting: hello | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| Greeting: ni hao (Chinese) | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| Handles ambiguous portion descriptions | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| Honest block when all providers fail | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| Hot window query is directed and non empty | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| Identity query does not trigger recommendation engagement rule | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| Identity query surfaces multiple user facts when present | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| Identity query surfaces user stated fact over past qa | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| Identity query with only past qa returns none or no false facts | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| Instruction: be more brief | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| Instruction: use Celsius | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| Judge echo claim overridden in hot window | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| LLM uses enrichment-surfaced interests for personalised search | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| Links only payload produces honest cant read reply | 🔸 1/1 XFAIL | ✅ 1/1 (100%) |
| Location context flows to search queries | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| Location query with location in context returns none | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| Location query with partial hint still routes sensibly | 🔸 1/1 XFAIL | ✅ 1/1 (100%) |
| LogMealTool stores meals with macros | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| Max-turn cap delivers a digest reply, never silence | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| Memory enrichment: personalized news | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| Memory enrichment: time-based recall | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| Memory enrichment: topic recall | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| Mixed summary: keep novel facts, drop stale weather/recommendations | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| Navigate prose gets nudged into tool call | 🔸 1/1 XFAIL | ✅ 1/1 (100%) |
| No deflection: tech news | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| No deflection: time query | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| No deflection: tomorrow weather | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| No deflection: weekly rain forecast | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| No email tool declines honestly | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| No hint at all still routes sensibly | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| No wake word rejected despite judge | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| Novel knowledge: local business details and user location | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| Novel knowledge: non-English summary (Turkish) | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| Novel knowledge: relocation plans and employment | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| Novel knowledge: user diet plan and preferred recipe | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| Nudge cap stops loop | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| Nutrition: cheeseburger with fries | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| Nutrition: chicken with broccoli | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| Nutrition: oatmeal with banana | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| Office days changed from Mon/Wed to Mon/Thu | ⏭️ SKIPPED | 🔸 1/1 XFAIL |
| Omits deflection narration for unknown entity | ✅ 1/1 (100%) | 🔸 1/1 XFAIL |
| Omits deflection when topic never resolved | 🔸 1/1 XFAIL | ✅ 1/1 (100%) |
| Open-ended prompt grounds in stored knowledge | ❌ 0/1 (0%) | ✅ 1/1 (100%) |
| Parallel weather lookup: compare Paris and London | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| Preserves legitimate user preferences | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| Realistic web search payload is not deflected to links | 🔸 1/1 XFAIL | ✅ 1/1 (100%) |
| Recommendation query still surfaces engagement when user facts present | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| Reframing: life events framed as facts with temporal context | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| Reframing: requests become knowledge, not interaction descriptions | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| Reject: assistant self-references (recommendations are not knowledge) | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| Reject: stale temporal snapshots (weather, time of day) | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| Restaurant recommendation surfaces past cuisine interest | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| Returns NONE for non-food inputs | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| Returns valid JSON with all required fields | ❌ 0/1 (0%) | ✅ 1/1 (100%) |
| Simple meal baseline (2 boiled eggs) | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| Single weather query ends after one tool call | ✅ 1/1 (100%) | ❌ 0/1 (0%) |
| Speech long after tts requires wake word | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| Stop during tts interrupts immediately | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| Time query with time in context returns none | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| Tool calls literal not surfaced after web search | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| Tool retry: explicit tool mention | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| Tool retry: vague go ahead | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| Tool retry: vague just try | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| Toolsearchtool widens then navigate | 🔸 1/1 XFAIL | 🔸 1/1 XFAIL |
| Topic switch: search → weather uses getWeather | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| Topic switch: weather → store hours uses webSearch | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| Trivial conversations produce no extracted facts | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| Tts echo segments skipped user query extracted | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| Turn1 possessor then turn2 weather | ❌ 0/1 (0%) | ✅ 1/1 (100%) |
| Two-turn celebrity flow: identity then pronoun follow-up | 🔸 1/1 XFAIL | ❌ 0/1 (0%) |
| USER: identity, location, pets, diet, job | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| Unknown entity with poisoned diary still triggers web search live | 🔸 1/1 XFAIL | ✅ 1/1 (100%) |
| Unknown entity: Piranesi (book) | 🔸 1/1 XFAIL | ✅ 1/1 (100%) |
| Unknown entity: Possessor (film) | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| Unknown entity: have-you-heard-of (Piranesi) | 🔸 1/1 XFAIL | ✅ 1/1 (100%) |
| Unknown entity: permission-framed (Possessor) | 🔸 1/1 XFAIL | ✅ 1/1 (100%) |
| Unrelated domain still returns none | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| Unrelated topics are not welded into one clause | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| User query not confused with echo after tts | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| Utterance started during tts treated as hot window | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| WORLD: local business details, film attribution | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| Wake word query after echo segments | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| Wake word query uses judge extraction | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| Watch recommendation surfaces recently discussed films | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| Weather query is answered with current conditions | ❌ 0/1 (0%) | ✅ 1/1 (100%) |
| Weather query still picks getWeather | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| Weather query still triggers tools after a greeting | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| Wikipedia payload produces grounded reply | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| Wikipedia rescues when ddg blocks | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| calorie budget → fetchMeals | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| cold-memory-short-query-how's the weather | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| cold-memory-week-forecast-what's the weather this week | ✅ 1/1 (100%) | ❌ 0/1 (0%) |
| dietary check → fetchMeals | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| explicit-recall-then-search | ✅ 1/1 (100%) | ❌ 0/1 (0%) |
| find the invoice PDF on my computer | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| food decision → fetchMeals | 🔸 1/1 XFAIL | ✅ 1/1 (100%) |
| jacket → getWeather | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| location weather query selects getWeather and few others | ✅ 1/1 (100%) | ❌ 0/1 (0%) |
| log that I just ate a banana | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| meal logging selects logMeal and few others | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| meal recall (colloquial) → fetchMeals | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| meal recall selects fetchMeals and few others | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| news-interesting-for-me | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| news-of-interest-to-me | ✅ 1/1 (100%) | ❌ 0/1 (0%) |
| news-that-would-interest-me | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| recommend a book I'd like | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| research → webSearch + fetchWebPage | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| run forecast → getWeather | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| search the web for flight deals | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| suggest something I'd enjoy watching ton | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| take a screenshot | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| tell me some news that might interest me | ✅ 1/1 (100%) | ❌ 0/1 (0%) |
| warm-memory-short-query-how's the weather | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| weather + meals | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| weather query selects getWeather and few others | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| web search query selects webSearch and few others | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| weekly weather keeps getWeather | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| what is the capital of France | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| what should I cook for dinner | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| what's 2 plus 2 | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| what's on my screen right now? | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| what's the weather like? | ✅ 1/1 (100%) | ✅ 1/1 (100%) |
| who is Britney Spears | ❌ 0/1 (0%) | ✅ 1/1 (100%) |

---

## 🎤 Intent judge

> Pinned to `gemma4:e2b` (the voice intent classifier). Not affected by the judge model. Re-run on 2026-05-04 with the prompt fix in this PR; cells repeated 5× where they sit on the small-model edge.

**Notes:**

- `cross_segment_answer_that_with_noise` regressed between `main` and `develop` (introduced by `a8f133c`'s "big Mac" few-shot example, which biased the small model toward preserving user text instead of resolving cross-segment imperatives). Two contrasting examples added in this PR (one for prior-question-with-noise, one for the multi-word "go ahead and answer" imperative) restore this case as well as `multi_person_weather_discussion` and `cross_segment_go_ahead_and_answer` (each 5/5).
- New case `wake_word_trailing_after_capitalised_brand` (added in `a8f133c`) covers the original "big Mac" regression and is preserved by the fix.
- The three edge cases were each repeated 5× during the prompt iteration to confirm stability; recorded as 1/1 here for consistency with the rest of the table.

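
The notes above describe re-running edge cases 5× and reporting them as stable when every repetition passes (the 5/5 results). A minimal Python sketch of that kind of stability check, assuming a case can be wrapped as a zero-argument callable that returns `True` on pass; `run_repeated`, `is_stable`, and the stand-in case are hypothetical names and not part of the Jarvis eval suite:

```python
from typing import Callable, Tuple


def run_repeated(case: Callable[[], bool], reps: int = 5) -> Tuple[int, int]:
    """Run `case` `reps` times and return (passes, reps)."""
    passes = sum(1 for _ in range(reps) if case())
    return passes, reps


def is_stable(passes: int, reps: int) -> bool:
    """Treat a case as stable only if every repetition passed."""
    return reps > 0 and passes == reps


# Stand-in for a real intent-judge case; a real one would call the model.
def always_passes() -> bool:
    return True


passes, reps = run_repeated(always_passes)
print(f"{passes}/{reps} stable={is_stable(passes, reps)}")
```

Requiring all repetitions to pass is a stricter bar than a majority vote, which suits cases that sit on the small-model edge.
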
| Test Case | Pass Rate | Status |
|-----------|-----------|:------:|
| Hot window mode indicated in prompt | 1/1 (100%) | ✅ |
| Old query not re extracted | 1/1 (100%) | ✅ |
| Processed segment not reextracted | 1/1 (100%) | ✅ |
| Returns none when ollama unavailable | 1/1 (100%) | ✅ |
| System prompt has echo guidance | 1/1 (100%) | ✅ |
| Tts text included for echo detection | 1/1 (100%) | ✅ |
| alias_after_narrative_context | 1/1 (100%) | ✅ |
| alias_treated_as_wake_word | 1/1 (100%) | ✅ |
| buffer_echo_then_followup_hot_window | 1/1 (100%) | ✅ |
| buried_target_amid_unrelated_chatter | 1/1 (100%) | ✅ |
| buried_target_plural_vague_ref_they | 1/1 (100%) | ✅ |
| buried_target_topicless_question | 1/1 (100%) | ✅ |
| context_synthesis_weather_opinion | 1/1 (100%) | ✅ |
| context_synthesis_with_prior_ambient | 1/1 (100%) | ✅ |
| cross_segment_answer_that_weather | 1/1 (100%) | ✅ |
| cross_segment_answer_that_with_noise | 1/1 (100%) | ✅ |
| cross_segment_answered_that_whisper_variant | 1/1 (100%) | ✅ |
| cross_segment_dinosaur_opinion | 1/1 (100%) | ✅ |
| cross_segment_go_ahead_and_answer | 1/1 (100%) | ✅ |
| cross_segment_hot_window_followup | 1/1 (100%) | ✅ |
| cross_segment_imperative_superseded_by_new_question | 1/1 (100%) | ✅ |
| echo_plus_followup_extracted | 1/1 (100%) | ✅ |
| echo_plus_rejected_similar_plus_wake_retry | 1/1 (100%) | ✅ |
| hot_window_override_topicless_followup | 1/1 (100%) | ✅ |
| hot_window_simple_followup | 1/1 (100%) | ✅ |
| mentioned_in_narrative_past_tense | 1/1 (100%) | ✅ |
| multi_person_vague_reference | 1/1 (100%) | ✅ |
| multi_person_weather_discussion | 1/1 (100%) | ✅ |
| multiple_echoes_then_interrupt | 1/1 (100%) | ✅ |
| no_wake_word_casual_speech | 1/1 (100%) | ✅ |
| no_wake_word_in_buffer | 1/1 (100%) | ✅ |
| stop_command_during_tts | 1/1 (100%) | ✅ |
| user_followup_statement_after_question_nihilism | 1/1 (100%) | ✅ |
| wake_word_after_narrative_addresses_assistant | 1/1 (100%) | ✅ |
| wake_word_command_timer | 1/1 (100%) | ✅ |
| wake_word_mid_sentence | 1/1 (100%) | ✅ |
| wake_word_open_imperative_give_me_advice | 1/1 (100%) | ✅ |
| wake_word_open_imperative_say_something | 1/1 (100%) | ✅ |
| wake_word_open_imperative_surprise_me | 1/1 (100%) | ✅ |
| wake_word_open_imperative_tell_me_a_joke | 1/1 (100%) | ✅ |
| wake_word_open_imperative_tell_me_anything | 1/1 (100%) | ✅ |
| wake_word_share_statement_burger | 1/1 (100%) | ✅ |
| wake_word_share_statement_feeling | 1/1 (100%) | ✅ |
| wake_word_share_statement_trailing | 1/1 (100%) | ✅ |
| wake_word_simple_question | 1/1 (100%) | ✅ |
| wake_word_statement_remember | 1/1 (100%) | ✅ |
| wake_word_trailing_after_capitalised_brand | 1/1 (100%) | ✅ |
| wake_word_trailing_after_named_entity | 1/1 (100%) | ✅ |

---

## 🧠 Memory merge consolidation

> Exercises `merge_node_data` against a real picker model. Pins the rewrite-on-write merge against its five advertised behaviours: dedupe of near-duplicates, pattern consolidation of repeated activities, independence (unrelated facts coexist, no silent erasure), meta-narrative pruning (assistant-narrating extractor leftovers get scrubbed), and end-to-end correctness of the batched signature. Run via `pytest evals/test_merge_consolidation.py`.

| Test Case | Pass Rate | Status |
|-----------|-----------|:------:|
| Dedupe - same fact, different wording (lives-in vs based-in London) | 1/1 (100%) | ✅ |
| Dedupe - job title rephrased | 1/1 (100%) | ✅ |
| Pattern - repeated sushi meals fold into "regularly eats sushi" | 1/1 (100%) | ✅ |
| Pattern boundary - distinct one-off dated events stay distinct | 1/1 (100%) | ✅ |
| Independence - peanut allergy + tea preference survive unrelated hiking fact | 1/1 (100%) | ✅ |
| Independence - software-engineer job survives unrelated guitar fact | 1/1 (100%) | ✅ |
| Meta-narrative - capability-denial line dropped, real directive kept | 1/1 (100%) | ✅ |
| Meta-narrative - assistant-suggested line dropped, factual lookup survives | 1/1 (100%) | ✅ |
| Meta-narrative - polluted node receiving new fact: drop + incorporate | 1/1 (100%) | ✅ |
| Meta-narrative - clean directives node not over-pruned | 1/1 (100%) | ✅ |
| Batched merge - three independent new facts in one call all land | 1/1 (100%) | ✅ |

**Notes:** the pattern-boundary case was previously `xfail(strict=False)` because `gemma4:e2b` clustered dated entries and silently dropped older ones. After the META-NARRATIVE rule landed it now passes 3/3 reps; the causal link is unconfirmed but the eval is the right place to catch a regression, so the marker is dropped and the case stands as a regular PASS.

---

### 📖 Legend

| Symbol | Meaning |
|--------|---------|
| ✅ | Fully passed (100% pass rate) |
| ⚠️ | Partial pass (some runs failed) |
| ❌ | Fully failed (0% pass rate) |
| ⏭️ | Skipped (missing dependencies) |
| 🔸 | Expected failure (known limitation) |
| 🎉 | Unexpectedly passed (bug fixed!) |
| ➖ | Not run for this model |

*Report generated by Jarvis eval suite*