The problem

The question every team building with AI eventually hits: did this change make the model better or worse? Numeric scores drift, judges hallucinate, and regressions reach users before anyone notices. multivon-eval gives you eval scoring you can trust and defend. Scores are calibrated against human labels, every report carries confidence intervals, and the audit trail holds up when you ship into regulated environments.
# pip install multivon-eval
from multivon_eval import EvalSuite, EvalCase, Faithfulness

suite = EvalSuite("rag")
suite.add_cases([EvalCase(input="What's the refund window?",
                          context="Refunds within 30 days.")])
suite.add_evaluators(Faithfulness())
report = suite.run(my_model_fn, save_json="report.json")

print(f"Pass rate {report.pass_rate:.0%}  ({report.evaluated} evaluated, {report.errors} errored)")
Install, import, run. The default judge is claude-haiku-4-5 (F1 0.804 [0.71–0.88] on HaluEval QA in-distribution; F1 0.830 [0.70–0.92] held-out on HaluEval-Sum; $0.00127 per case). Swap in any judge via JudgeConfig, or run fully offline against a local Ollama model: JudgeConfig(provider="ollama", model="qwen2.5:14b"). Want a runnable eval suite from a one-paragraph product description? Run multivon-eval bootstrap. Want it wired into Claude Code as auto-invoking skills? Run multivon-eval install-skills.
Why trust the numbers? Every shipped benchmark cites Wilson + bootstrap 95% CIs, thresholds are calibrated per (judge × evaluator) with provenance, and the 0.9.4→0.9.7 release sequence is a public self-correction trail. See Why multivon-eval.
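
To make the confidence-interval claim concrete, here is a minimal sketch of a Wilson 95% interval on a pass rate. It is an illustration using only the standard library, not multivon-eval's own implementation; the function name wilson_interval and its parameters are assumptions made for this example.

```python
import math


def wilson_interval(passes: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion.

    Hypothetical helper for illustration; z=1.96 gives roughly 95% coverage.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= passes <= total:
        raise ValueError("passes must be between 0 and total")

    p_hat = passes / total
    denom = 1 + z**2 / total
    # The interval is centred on a value pulled toward 0.5, not on p_hat itself.
    centre = (p_hat + z**2 / (2 * total)) / denom
    half_width = (
        z * math.sqrt(p_hat * (1 - p_hat) / total + z**2 / (4 * total**2)) / denom
    )
    return max(0.0, centre - half_width), min(1.0, centre + half_width)


if __name__ == "__main__":
    low, high = wilson_interval(passes=80, total=100)
    print(f"Pass rate 80%  (95% CI {low:.3f} to {high:.3f})")
```

Unlike the naive normal approximation, the Wilson interval stays inside [0, 1] and behaves sensibly when the pass rate is near 0% or 100%, which is common for small eval suites.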

What it does

Deterministic

String matching, regex, JSON schema, BLEU, ROUGE, latency. Instant, free, no LLM.

LLM-as-judge

QAG scoring — yes/no questions instead of unreliable 1-10 ratings. Faithfulness, hallucination, relevance, and more.

Agent trace

Tool call accuracy, plan quality, step faithfulness, task completion. Framework-agnostic.

Conversation

Knowledge retention, relevance, consistency, and completeness across multi-turn sessions.

Key features

QAG scoring. Instead of asking a judge “rate this 1-10”, we generate binary yes/no questions about the output and score by fraction answered correctly. Binary questions eliminate scale ambiguity and are easier for LLMs to answer consistently. Every score is auditable down to which questions passed or failed.

Plain-English checks. suite.add_check("Response should mention the return policy") is all you need to write your first eval. The SDK generates yes/no questions from your criterion automatically: no evaluator class to pick, no QAG questions to craft. No cold start, either: point generate_from_file() at your docs and get eval cases in seconds, no labeled dataset required.

Reliability and flakiness detection. Run each case N times with suite.run(runs=5) and flaky cases (inconsistent pass/fail) are flagged automatically. Experiment comparison shows statistical significance, so you know whether a regression is real or sampling noise. NAACL 2025 research found single-run eval scores unreliable enough that variance alone can reverse model rankings. For tracking over time, every run can be recorded to ~/.multivon/experiments/; compare two runs side-by-side and get a pass rate delta with p-values.

Framework integrations. Capture agent traces from LangChain, LangSmith, or any custom agent. Import existing LangSmith runs as eval cases without re-running the agent.

Shareable HTML reports. report.save_html("report.html") produces a self-contained HTML file with per-case breakdowns, evaluator scores, and flakiness indicators. No server required.

CI/CD first. One line exits with code 1 if your pass rate drops below a threshold. Evals that don’t run in CI catch nothing.

Prompt-drift staleness and provenance. multivon-eval staleness diffs a committed baseline of every prompt call site in your repo against a live scan and tells you which prompts changed since your cases were authored: CHANGED, REMOVED, ADDED, or UNKNOWN, never overclaiming what static analysis can know. An opt-in runtime recorder (pytest --record-prompts) captures the prompts your code actually rendered, with verdicts phrased as “matched k of N observed renderings”. See Prompt-drift staleness.

Persona simulation. multivon-eval simulate drives adaptive multi-turn conversations against your live system. A persona LLM with a profile, goal, and behavior traits generates each user turn in response to what your model actually said; transcripts are scored by the conversation evaluators plus a goal-completion judge, with a hard budget ceiling and every output labeled as synthetic. See Simulate.
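
The arithmetic behind QAG scoring, flakiness detection, and pass-rate comparison is simple enough to sketch in a few lines. The block below is an illustration under stated assumptions, not the SDK's code: the function names are hypothetical, and the pooled two-proportion z-test is one common way to get a p-value for a pass-rate delta, chosen here because the text above does not say which test multivon-eval uses.

```python
import math
from collections.abc import Sequence


def qag_score(answers: Sequence[bool]) -> float:
    """Score an output as the fraction of yes/no checks that passed."""
    if not answers:
        raise ValueError("need at least one yes/no answer")
    return sum(answers) / len(answers)


def is_flaky(run_outcomes: Sequence[bool]) -> bool:
    """A case is flaky when repeated runs disagree on pass/fail."""
    return len(set(run_outcomes)) > 1


def two_proportion_p_value(pass_a: int, n_a: int, pass_b: int, n_b: int) -> float:
    """Two-sided p-value for a difference in pass rates.

    Assumption: pooled two-proportion z-test, used here only as an example.
    """
    pooled = (pass_a + pass_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if std_err == 0:
        # Both runs passed everything or failed everything: no evidence of a difference.
        return 1.0
    z = (pass_a / n_a - pass_b / n_b) / std_err
    # Two-sided tail probability of a standard normal: erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2))


if __name__ == "__main__":
    # Three of four generated questions answered "yes" gives a score of 0.75.
    print(qag_score([True, True, False, True]))
    # Five runs of one case with mixed outcomes are flagged as flaky.
    print(is_flaky([True, True, False, True, True]))
    # Baseline passed 90/100 cases, candidate passed 70/100.
    print(two_proportion_p_value(90, 100, 70, 100))
```

Because each score is a plain fraction over named questions, a failing case can be traced back to the specific question that flipped, and a pass-rate drop can be checked against sampling noise before it is treated as a regression.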

Install

pip install multivon-eval
Requires Python 3.10+. Set ANTHROPIC_API_KEY and/or OPENAI_API_KEY for LLM-judge evaluators.

Quickstart

Up and running in 5 minutes

Generate datasets

No labeled data? Generate cases from your docs