eval-audit is eval as a pre-flight check, not a nightly batch. It sits between /review and /ship: on a PR diff that touches prompts, model calls, or tool definitions, the skill runs only the eval cases that stress the changed surface, computes per-evaluator deltas vs the baseline, and blocks the ship if any safety-class evaluator regresses at p < 0.05. Targeted runs fit in under 60 seconds on a typical PR.
When it auto-invokes
The skill auto-invokes in two situations:
Post-/review triggers. After /review succeeds AND the diff touches any of:
- Files with LLM call sites (anthropic/openai/google/litellm imports plus .create() or .completion() calls).
- System prompts or instruction-tuned templates.
- Tool definitions (the tools= argument, function-call schemas).
- Retrieval pipeline code (chunkers, embedders, rerankers).
- Evaluator threshold YAML or _calibration_data/.
Phrase triggers. When you ask for an audit directly, for example:
- “audit this prompt change”
- “will this regress evals”
- “regression check before I ship”
- “eval impact of this PR”
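As a rough illustration of the file-based triggers listed above, the check for LLM call sites and tool definitions could look something like the following. This is a minimal sketch, not the skill's actual detection logic: the regexes and the function names has_llm_call_site and file_triggers_audit are hypothetical, and prompt-template and retrieval-pipeline detection are left out because the page does not say how those are recognized.

```python
import re

# Assumption: a provider import is a plain `import X` or `from X ...` line.
PROVIDER_IMPORT = re.compile(
    r"^\s*(?:from|import)\s+(?:anthropic|openai|google|litellm)\b",
    re.MULTILINE,
)
# Assumption: call sites are spotted by the method names quoted in the list.
CALL_SITE = re.compile(r"\.(?:create|completion)\(")
# Assumption: a tool definition shows up as a `tools=` keyword argument.
TOOLS_ARG = re.compile(r"\btools\s*=")


def has_llm_call_site(source: str) -> bool:
    """True when a file both imports a provider SDK and calls it."""
    return bool(PROVIDER_IMPORT.search(source)) and bool(CALL_SITE.search(source))


def file_triggers_audit(path: str, source: str) -> bool:
    """Hypothetical per-file check covering three of the listed triggers."""
    if "_calibration_data/" in path:
        return True
    if has_llm_call_site(source):
        return True
    return bool(TOOLS_ARG.search(source))
```

A text-level heuristic like this will produce false positives (any .create( call in a file that imports a provider SDK), which is acceptable for a trigger whose only cost is running a short audit.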
allowed-tools
The skill's tool access is narrow: running git diff and the eval entrypoint, Read and Grep to scope the change, and Edit only for writing the audit JSON output file. It does not Write new source files.
What it does
Scope the diff
Runs git diff --name-only origin/main...HEAD to enumerate changed files. Cross-references against multivon_eval.attribution.scan (added in 0.9.4) to find which prompt fingerprints actually changed, since not every code change touches a prompt-relevant surface.
Identify stressing cases
Reads the existing eval_suite.py. For each evaluator, marks which seed cases exercise the changed surface: a system-prompt edit affects all cases, while a single tool definition change only affects cases whose expected_tool_calls reference that tool.
Targeted run
Executes only the marked cases. Aims for under 60 seconds wall-clock. For flaky-sensitive evaluators, re-runs multiple times to surface real signal vs noise via python eval_suite.py --runs 3. The --runs flag belongs to your eval-suite entrypoint (it threads runs= into EvalSuite.run), not to the multivon-eval CLI itself.
Compare against baseline
Loads baseline_report.json if a previous /ship committed one, otherwise re-runs against origin/main. Computes per-evaluator delta, Wilson CI, and paired-McNemar p-value via multivon_eval.compare_reports; the skill does not reimplement the math.
The three verdicts
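PASS: no regression at p < 0.05.
WARN: a non-safety evaluator regresses at p < 0.05; the delta and CI are reported so you can decide.
BLOCK: a safety-class evaluator regresses at p < 0.05.
Safety-class auto-escalation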
Any evaluator whose name contains safety, toxicity, bias, pii, or hallucination is treated as safety-class. A regression at p < 0.05 on a safety-class evaluator is always BLOCK, never WARN and never PASS-with-note. Non-safety regressions at the same statistical strength surface as WARN with delta plus CI so you can decide.
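The escalation rule can be written down compactly. The sketch below is illustrative only: the skill itself delegates the statistics to multivon_eval.compare_reports, whose signature is not shown on this page, so this example substitutes a hand-rolled exact (binomial) two-sided McNemar test on discordant pair counts. The function names, the case-insensitive substring match, and the choice of the exact form of the test are all assumptions.

```python
from math import comb

SAFETY_MARKERS = ("safety", "toxicity", "bias", "pii", "hallucination")
ALPHA = 0.05


def mcnemar_exact_p(fixed: int, broken: int) -> float:
    """Two-sided exact McNemar p-value from discordant pair counts.

    fixed:  cases that failed on the baseline and pass on HEAD
    broken: cases that passed on the baseline and fail on HEAD
    """
    n = fixed + broken
    if n == 0:
        return 1.0  # no discordant pairs, nothing to test
    k = min(fixed, broken)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def is_safety_class(evaluator_name: str) -> bool:
    """Assumption: the name match is a case-insensitive substring check."""
    lowered = evaluator_name.lower()
    return any(marker in lowered for marker in SAFETY_MARKERS)


def evaluator_verdict(name: str, fixed: int, broken: int) -> str:
    """PASS unless the evaluator got significantly worse; then WARN or BLOCK."""
    regressed = broken > fixed and mcnemar_exact_p(fixed, broken) < ALPHA
    if not regressed:
        return "PASS"
    return "BLOCK" if is_safety_class(name) else "WARN"


def overall_verdict(verdicts: list[str]) -> str:
    """The most severe per-evaluator verdict wins."""
    for level in ("BLOCK", "WARN"):
        if level in verdicts:
            return level
    return "PASS"
```

With six cases newly failing and none newly passing, the exact p-value is 2/64, which is about 0.031, so an evaluator named pii_leak would BLOCK while one named tool_call_accuracy would only WARN.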
Output path convention
The audit JSON is written to the first existing location among:
- benchmarks/results/eval-audit/<head_sha>.json, if the repo uses the multivon-eval benchmarks/results/ pattern.
- evals/results/eval-audit/<head_sha>.json, if the repo has an evals/ directory from eval-bootstrap.
- .eval-audit/<head_sha>.json at the repo root, otherwise.
Because the output is plain JSON, you can cat | jq straight to the failing cases.
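The lookup order for the output location might be resolved roughly as follows. This is a sketch under assumptions: audit_output_path is a hypothetical helper, and "uses the benchmarks/results/ pattern" and "has an evals/ directory" are interpreted here as simple directory-existence checks, which the page does not spell out.

```python
from pathlib import Path


def audit_output_path(repo_root: Path, head_sha: str) -> Path:
    """Pick the audit JSON location using the precedence described above."""
    filename = f"{head_sha}.json"
    # 1. Repos following the multivon-eval benchmarks/results/ pattern.
    if (repo_root / "benchmarks" / "results").is_dir():
        return repo_root / "benchmarks" / "results" / "eval-audit" / filename
    # 2. Repos with an evals/ directory (for example from eval-bootstrap).
    if (repo_root / "evals").is_dir():
        return repo_root / "evals" / "results" / "eval-audit" / filename
    # 3. Fallback: a dot-directory at the repo root.
    return repo_root / ".eval-audit" / filename
```

Keying the filename on the head SHA means each audited commit gets its own file, so earlier audits are not overwritten when the branch moves.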
CI-side counterpart
eval-audit is the pre-ship local check. For post-merge runs, scheduled nightly suites, and PR comment automation, pair it with eval-action — the GitHub Action that runs the full suite on push, posts a diff comment to the PR, and (optionally) blocks merge on safety regressions. The two share the same multivon-eval machinery; the skill is local-fast-targeted, the Action is CI-thorough-comprehensive.
What it doesn’t do
- Doesn’t replace the full nightly suite. This is a targeted pre-ship check; comprehensive runs go through eval-action.
- Doesn’t auto-fix regressions. It surfaces them; fixes are still human judgment.
- Doesn’t add new eval cases. If a regression points at an unexercised surface, the skill suggests adding a case but doesn’t write it inline (you pick the right framing).
See also
- eval-bootstrap — generates the suite this skill audits against.
- eval-explain — when a particular evaluator flags a regression and you ask “wait, what does this even measure?”
- CI/CD integration guide, for wiring multivon-eval into GitHub Actions.

