The problem

Every team building with AI hits the same wall: how do you know if your model is getting better or worse? Numeric scores drift, judges hallucinate, and regressions reach users before anyone notices. multivon-eval gives you eval scoring you can trust and defend — calibrated against human labels, statistically rigorous, and audit-ready when you need to ship into regulated environments.
```python
# pip install multivon-eval
from multivon_eval import EvalSuite, EvalCase, Faithfulness

suite = EvalSuite("rag")
suite.add_cases([EvalCase(input="What's the refund window?",
                          context="Refunds within 30 days.")])
suite.add_evaluators(Faithfulness())
report = suite.run(my_model_fn, save_json="report.json")

print(f"Pass rate {report.pass_rate:.0%}  ({report.evaluated} evaluated, {report.errors} errored)")
```
That’s everything: install, import, run. The default judge is claude-haiku-4-5 (F1 0.804 [0.71–0.88] on HaluEval QA in-distribution; F1 0.830 [0.70–0.92] held-out on HaluEval-Sum; $0.00127 per case). Swap in any judge via JudgeConfig, or run fully offline with a local Ollama model: JudgeConfig(provider="ollama", model="qwen2.5:14b"). Want a runnable eval suite from a one-paragraph product description? Run multivon-eval bootstrap. Want it wired into Claude Code as auto-invoking skills? Run multivon-eval install-skills.
Why trust the numbers? Every shipped benchmark cites Wilson + bootstrap 95% CIs; thresholds are calibrated per (judge × evaluator) with provenance; the 0.9.4→0.9.7 sequence is a self-correction trail that shows what real eval discipline looks like. See Why multivon-eval.
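
To make the confidence intervals mentioned above concrete, here is a minimal sketch of a Wilson 95% interval for a pass rate, computed from pass and total counts. It uses the standard textbook formula and only the Python standard library. It is not multivon-eval's own implementation, and `wilson_interval` is a hypothetical helper name introduced for illustration.

```python
import math


def wilson_interval(passes: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 is roughly 95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p_hat = passes / total
    denom = 1 + z**2 / total
    # The center is pulled toward 0.5, which keeps the interval sane for small samples.
    center = (p_hat + z**2 / (2 * total)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / total + z**2 / (4 * total**2)) / denom
    return max(0.0, center - half_width), min(1.0, center + half_width)


# Hypothetical numbers: 80 of 100 cases passed.
low, high = wilson_interval(passes=80, total=100)
print(f"pass rate 80%, 95% CI [{low:.3f}, {high:.3f}]")
```

Unlike the naive normal approximation, the Wilson interval never extends below 0 or above 1 and behaves reasonably when the pass rate is close to either extreme, which is likely why it is a common choice for reporting eval pass rates.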

What it does

Deterministic

String matching, regex, JSON schema, BLEU, ROUGE, latency. Instant, free, no LLM.

LLM-as-judge

QAG scoring — yes/no questions instead of unreliable 1-10 ratings. Faithfulness, hallucination, relevance, and more.

Agent trace

Tool call accuracy, plan quality, step faithfulness, task completion. Framework-agnostic.

Conversation

Knowledge retention, relevance, consistency, and completeness across multi-turn sessions.
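
The QAG idea from the LLM-as-judge card can be sketched in a few lines: a score is the fraction of yes/no questions whose answer matches what a good output should produce. The sketch below is illustrative only. `QagAnswer` and `qag_score` are hypothetical names, not part of the multivon-eval API, and the judge answers are hard-coded where a real run would obtain them from an LLM judge.

```python
from dataclasses import dataclass


@dataclass
class QagAnswer:
    question: str
    expected: bool  # answer a correct output should earn
    actual: bool    # answer the judge gave (hard-coded here for illustration)


def qag_score(answers: list[QagAnswer]) -> float:
    """Fraction of binary questions answered as expected."""
    if not answers:
        raise ValueError("at least one question is required")
    correct = sum(a.expected == a.actual for a in answers)
    return correct / len(answers)


answers = [
    QagAnswer("Does the response state a refund window?", expected=True, actual=True),
    QagAnswer("Is the stated window 30 days?", expected=True, actual=True),
    QagAnswer("Does the response add claims not in the context?", expected=False, actual=True),
]
score = qag_score(answers)

# Every question is visible, so a failing score can be traced to specific checks.
failed = [a.question for a in answers if a.expected != a.actual]
print(f"score={score:.2f}, failed={failed}")
```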

Key features

QAG scoring: Instead of asking a judge “rate this 1-10”, we generate binary yes/no questions about the output and score by fraction answered correctly. Binary questions eliminate scale ambiguity, are easier for LLMs to answer consistently, and make every score fully auditable: you can see exactly which questions passed or failed.

Plain-English checks: suite.add_check("Response should mention the return policy") is all you need to write your first eval. The SDK generates yes/no questions from your criterion automatically. No need to pick an evaluator class or craft QAG questions manually.

No cold start: Point generate_from_file() at your docs and get eval cases in seconds. No labeled dataset required to get started.

Reliability & flakiness detection: Run each case N times with suite.run(runs=5). Flaky cases (inconsistent pass/fail) are flagged automatically. Experiment comparison shows statistical significance so you know whether a regression is real or sampling noise. NAACL 2025 research confirms single-run eval scores are unreliable: variance alone can reverse model rankings.

Experiment tracking: Record every run to ~/.multivon/experiments/. Compare two runs side-by-side and get a pass rate delta with p-values. Catch regressions before they ship.

Framework integrations: Capture agent traces from LangChain, LangSmith, or any custom agent. Import existing LangSmith runs as eval cases without re-running the agent.

Shareable HTML reports: report.save_html("report.html") produces a self-contained HTML file with per-case breakdowns, evaluator scores, and flakiness indicators. No server required.

CI/CD first: One line exits with code 1 if your pass rate drops below a threshold. Evals that don’t run in CI catch nothing.
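
The flakiness and significance ideas above can be illustrated without the SDK. The sketch below flags cases whose pass/fail result varies across repeated runs, then compares two pass rates with a two-proportion z-test. The source does not say which statistical test multivon-eval uses, so the z-test is an assumption chosen because it is a common default. `flaky_cases` and `pass_rate_p_value` are hypothetical helper names, and the data is made up.

```python
import math


def flaky_cases(results: dict[str, list[bool]]) -> list[str]:
    """Return IDs of cases that both passed and failed across repeated runs."""
    return sorted(case_id for case_id, runs in results.items() if len(set(runs)) > 1)


def pass_rate_p_value(pass_a: int, n_a: int, pass_b: int, n_b: int) -> float:
    """Two-sided p-value for a difference in pass rates (two-proportion z-test)."""
    pooled = (pass_a + pass_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if std_err == 0:
        return 1.0  # both runs all-pass or all-fail: no evidence of a difference
    z = (pass_a / n_a - pass_b / n_b) / std_err
    return math.erfc(abs(z) / math.sqrt(2))


# Hypothetical per-case outcomes over three runs.
runs = {
    "refund-window": [True, True, False],
    "shipping-time": [True, True, True],
}
print("flaky:", flaky_cases(runs))

# Hypothetical experiment comparison: baseline 90/100 passed, candidate 80/100.
p = pass_rate_p_value(90, 100, 80, 100)
print(f"pass rate delta -10 points, p = {p:.3f}")
```

A small p-value suggests the drop is unlikely to be sampling noise, while a large one suggests collecting more cases or more runs before calling it a regression.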

Install

pip install multivon-eval
Requires Python 3.10+. Set ANTHROPIC_API_KEY and/or OPENAI_API_KEY for LLM-judge evaluators.

Quickstart

Up and running in 5 minutes

Generate datasets

No labeled data? Generate cases from your docs