# Performance tests

Per-context timings for the reply pipeline. Excluded from the default pytest run (see `pytest.ini`'s `addopts = -m "not performance"`).

## Running

```bash
pytest tests/performance/ -v -m performance -s
```

The `-s` flag lets the report table print to stdout. Tests auto-skip when Ollama is unreachable, so the harness is safe to leave in the repo.

## Env vars

| Var | Default | Description |
|-----|---------|-------------|
| `JARVIS_PERF_OLLAMA_URL` | `http://localhost:11434` | Ollama endpoint |
| `JARVIS_PERF_MODEL` | `gemma4:e2b` | Model pulled in Ollama for the run |
| `JARVIS_PERF_RUNS` | `3` | Runs per query (bump for tighter p95) |
| `JARVIS_PERF_REPORT_DIR` | `tests/performance/reports/` | JSON report output directory |

`JARVIS_PERF_RUNS=3` is a fast-iteration default. For stable p95 numbers when benchmarking a change, use `JARVIS_PERF_RUNS=10` or higher.

## What it measures

- **`test_micro_benchmark_tiny_prompt`**: one warmup + N tiny round-trips. Hardware baseline: the floor for every context's per-call cost.
- **`test_pipeline_timings_by_context`**: three representative queries × N runs of `run_reply_engine`, with per-context timings bucketed via stack-frame inspection in [`timing_recorder.py`](timing_recorder.py).

Shape invariants (relative checks, not absolute numbers):

- Evaluator p50 ≤ main chat turn p50 × 1.5.
- Tool router p50 ≤ main chat turn p50 × 1.5.
- Enrichment extractor shares the router model chain.

Unmapped callers print as `other:`. That is a signal to update the `_CALLER_TO_CONTEXT` map in `timing_recorder.py` alongside `docs/llm_contexts.md`.

Reports are written to `tests/performance/reports/` by default (override with `JARVIS_PERF_REPORT_DIR`) and are git-ignored.
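
The p50-based shape invariants listed under "What it measures" can be sketched roughly as follows. This is a minimal illustration, not the harness's actual code: the context keys (`main_chat_turn`, `evaluator`, `tool_router`), the helper names, and the nearest-rank p95 method are all assumptions made for the example.

```python
import math
import statistics


def p50(samples):
    """Median of a list of timings (seconds)."""
    return statistics.median(samples)


def p95(samples):
    """Nearest-rank 95th percentile (one of several common definitions)."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(0.95 * len(ordered)))
    return ordered[rank - 1]


def check_shape_invariants(timings, ratio=1.5):
    """Return the contexts whose p50 exceeds main chat turn p50 * ratio.

    `timings` maps a context name to its per-call durations.
    The key names used here are hypothetical.
    """
    budget = p50(timings["main_chat_turn"]) * ratio
    return [
        context
        for context in ("evaluator", "tool_router")
        if p50(timings[context]) > budget
    ]


# Illustrative numbers only, not measured results.
sample_timings = {
    "main_chat_turn": [1.0, 1.2, 1.1],
    "evaluator": [0.5, 0.6, 0.4],
    "tool_router": [2.0, 2.1, 1.9],
}
violations = check_shape_invariants(sample_timings)
```

With only three samples, a nearest-rank p95 is simply the slowest run, which is one reason to raise `JARVIS_PERF_RUNS` when the p95 figure matters.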