Introduction

The problem

The question every team building with AI eventually hits: did this change make the model better or worse? Numeric scores drift, judges hallucinate, and regressions reach users before anyone notices. multivon-eval gives you eval scoring you can trust and defend. Scores are calibrated against human labels, every report carries confidence intervals, and the audit trail holds up when you ship into regulated environments.

# pip install multivon-eval
from multivon_eval import EvalSuite, EvalCase, Faithfulness

suite = EvalSuite("rag")
suite.add_cases([EvalCase(input="What's the refund window?",
                          context="Refunds within 30 days.")])
suite.add_evaluators(Faithfulness())
# my_model_fn is your own callable wrapping the model under test (defined elsewhere)
report = suite.run(my_model_fn, save_json="report.json")

print(f"Pass rate {report.pass_rate:.0%}  ({report.evaluated} evaluated, {report.errors} errored)")

Install, import, run. The default judge is claude-haiku-4-5 (F1 0.804 [0.71–0.88] on HaluEval QA in-distribution; F1 0.830 [0.70–0.92] held-out on HaluEval-Sum; $0.00127 per case). Swap in any judge via JudgeConfig, or run fully offline with a local Ollama model: JudgeConfig(provider="ollama", model="qwen2.5:14b"). Want a runnable eval suite from a one-paragraph product description? Run multivon-eval bootstrap. Want it wired into Claude Code as auto-invoking skills? Run multivon-eval install-skills.

Why trust the numbers? Every shipped benchmark cites Wilson + bootstrap 95% CIs, and thresholds are calibrated per (judge × evaluator) with provenance. When our own measurement catches us, we publish it: a PDF leaderboard that nearly inverted once we sent pixels instead of text, a determinacy gate we set at 50% and missed at 20.9%, and the 0.9.4→0.9.7 self-correction trail. See Why multivon-eval.
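To make the interval claims concrete, here is a minimal sketch of the two kinds of 95% interval named above, applied to a pass rate. It is not multivon-eval's implementation: the function names, the percentile variant of the bootstrap, and the example counts are all assumptions chosen for illustration.

```python
import math
import random


def wilson_interval(passes: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 gives ~95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p_hat = passes / total
    denom = 1 + z * z / total
    center = (p_hat + z * z / (2 * total)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / total + z * z / (4 * total * total)) / denom
    return max(0.0, center - half), min(1.0, center + half)


def bootstrap_interval(outcomes: list[int], n_resamples: int = 2000,
                       alpha: float = 0.05, seed: int = 0) -> tuple[float, float]:
    """Percentile bootstrap interval for the mean of 0/1 outcomes."""
    if not outcomes:
        raise ValueError("outcomes must be non-empty")
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(outcomes)
    # Resample the cases with replacement and record each resample's pass rate.
    means = sorted(sum(rng.choices(outcomes, k=n)) / n for _ in range(n_resamples))
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lo, hi


# Hypothetical run: 42 of 50 cases passed.
outcomes = [1] * 42 + [0] * 8
print("Wilson:   ", wilson_interval(42, 50))
print("Bootstrap:", bootstrap_interval(outcomes))
```

The Wilson interval stays inside [0, 1] and behaves sensibly at small sample sizes and extreme pass rates, which is why it is a common choice for pass-rate reporting. The bootstrap makes fewer distributional assumptions and extends to metrics that are not simple proportions.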

New in 0.16.0: multivon-eval validate grades your graders against reference outputs before any model is blamed, pass@k and pass^k separate capability from reliability with real CIs, and a saturation monitor quantifies what a 100% pass rate can still claim. The release-to-recommendation mapping is in Demystifying evals, operationalized.
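The difference between pass@k and pass^k is easiest to see in code. The sketch below uses the commonly cited combinatorial estimators for the two metrics; the source does not say which estimators multivon-eval uses, so treat the function names and formulas as assumptions. Here n is the number of trials run for a case, c the number that passed, and k the number of attempts being reasoned about.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate P(at least one of k attempts passes) from n trials with c passes."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) counts the k-subsets made up entirely of failing trials.
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_hat_k(n: int, c: int, k: int) -> float:
    """Estimate P(all k attempts pass) from n trials with c passes."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(c, k) counts the k-subsets made up entirely of passing trials.
    return comb(c, k) / comb(n, k)


# Hypothetical case: 10 trials, 7 passes, reasoning about 3 attempts.
print(f"pass@3 = {pass_at_k(10, 7, 3):.3f}")   # capability: can it succeed at all?
print(f"pass^3 = {pass_hat_k(10, 7, 3):.3f}")  # reliability: does it succeed every time?
```

With a 70% per-trial pass rate, the first number is close to 1 while the second is below 0.3: the same model looks capable but unreliable, which is the distinction the two metrics are meant to expose.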

What it does

Deterministic

String matching, regex, JSON schema, BLEU, ROUGE, latency. Instant, free, no LLM.

LLM-as-judge

QAG scoring — yes/no questions instead of unreliable 1-10 ratings. Faithfulness, hallucination, relevance, and more.

Agent trace

Tool call accuracy, plan quality, step faithfulness, task completion. Framework-agnostic.

Conversation

Knowledge retention, relevance, consistency, and completeness across multi-turn sessions.

Compliance

Local PII detection across GDPR, HIPAA, DPDP, and more, plus Pydantic / JSON Schema validation. Zero API calls.

Consistency

Self-consistency across repeated runs — catch answers that drift between identical prompts.

Multimodal

VQA faithfulness and document grounding for image and document inputs.

The full catalog spans seven families — see the Evaluators reference for every metric.

Key features

QAG scoring. Instead of asking a judge “rate this 1-10”, we generate binary yes/no questions about the output and score by fraction answered correctly. Binary questions eliminate scale ambiguity and are easier for LLMs to answer consistently. Every score is auditable down to which questions passed or failed.

Plain-English checks. suite.add_check("Response should mention the return policy") is all you need to write your first eval. The SDK generates yes/no questions from your criterion automatically: no evaluator class to pick, no QAG questions to craft. No cold start, either: point generate_from_file() at your docs and get eval cases in seconds, no labeled dataset required.

Reliability and flakiness detection. Run each case N times with suite.run(runs=5) and flaky cases (inconsistent pass/fail) are flagged automatically. Experiment comparison shows statistical significance, so you know whether a regression is real or sampling noise. NAACL 2025 research found single-run eval scores unreliable enough that variance alone can reverse model rankings. For tracking over time, every run can be recorded to ~/.multivon/experiments/; compare two runs side-by-side and get a pass rate delta with p-values.

Framework integrations. Capture agent traces from LangChain, LangSmith, or any custom agent. Import existing LangSmith runs as eval cases without re-running the agent.

Shareable HTML reports. report.save_html("report.html") produces a self-contained HTML file with per-case breakdowns, evaluator scores, and flakiness indicators. No server required.

CI/CD first. One line exits with code 1 if your pass rate drops below a threshold. Evals that don’t run in CI catch nothing.

Prompt-drift staleness and provenance. multivon-eval staleness diffs a committed baseline of every prompt call site in your repo against a live scan and tells you which prompts changed since your cases were authored: CHANGED, REMOVED, ADDED, or UNKNOWN, never overclaiming what static analysis can know. An opt-in runtime recorder (pytest --record-prompts) captures the prompts your code actually rendered, with verdicts phrased as “matched k of N observed renderings”. See Prompt-drift staleness.

Persona simulation. multivon-eval simulate drives adaptive multi-turn conversations against your live system. A persona LLM with a profile, goal, and behavior traits generates each user turn in response to what your model actually said; transcripts are scored by the conversation evaluators plus a goal-completion judge, with a hard budget ceiling and every output labeled as synthetic. See Simulate.
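The arithmetic behind three of the features above (QAG scoring, flakiness detection, and the pass rate comparison) can be sketched in a few lines. This is an illustration under stated assumptions, not the SDK's code: QuestionResult, qag_score, is_flaky, and two_proportion_p_value are hypothetical names, and the pooled two-proportion z-test is one standard way to get a p-value for a pass rate delta, not necessarily the test multivon-eval runs.

```python
import math
from dataclasses import dataclass


@dataclass
class QuestionResult:
    question: str
    expected: bool  # the answer a correct output should produce
    answer: bool    # the answer the judge actually gave


def qag_score(results: list[QuestionResult]) -> float:
    """Fraction of yes/no questions the judge answered as expected."""
    if not results:
        raise ValueError("need at least one question")
    return sum(r.answer == r.expected for r in results) / len(results)


def is_flaky(run_passed: list[bool]) -> bool:
    """A case is flaky when repeated runs disagree on pass/fail."""
    return len(set(run_passed)) > 1


def two_proportion_p_value(pass_a: int, n_a: int, pass_b: int, n_b: int) -> float:
    """Two-sided p-value for a difference in pass rates (pooled z-test)."""
    pooled = (pass_a + pass_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:  # both runs all-pass or all-fail: no evidence of a difference
        return 1.0
    z = (pass_a / n_a - pass_b / n_b) / se
    return math.erfc(abs(z) / math.sqrt(2))


# Hypothetical judge answers for one model output.
results = [
    QuestionResult("Does the response mention the return policy?", True, True),
    QuestionResult("Does the response state a 30-day window?", True, True),
    QuestionResult("Does the response invent a restocking fee?", False, True),
]
print(f"QAG score: {qag_score(results):.2f}")
print("Flaky:", is_flaky([True, True, False, True, True]))
print(f"p-value: {two_proportion_p_value(80, 100, 65, 100):.4f}")
```

Because each question result is kept alongside the score, a failing case can be traced to the specific question that failed, which is the auditability property described under QAG scoring.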

Install

pip install multivon-eval

Requires Python 3.10+. Set ANTHROPIC_API_KEY and/or OPENAI_API_KEY for LLM-judge evaluators.

Quickstart

Up and running in 5 minutes

Generate datasets

No labeled data? Generate cases from your docs
