The problem
Every team building with AI hits the same wall: how do you know if your model is getting better or worse? Numeric scores drift, judges hallucinate, and regressions reach users before anyone notices. multivon-eval gives you eval scoring you can trust and defend: calibrated against human labels, statistically rigorous, and audit-ready when you need to ship into regulated environments.
Default judge: claude-haiku-4-5 (F1 0.804 [0.71–0.88] on HaluEval QA in-distribution; F1 0.830 [0.70–0.92] held-out on HaluEval-Sum; $0.00127 per case). Swap in any judge via JudgeConfig, or run fully offline with a local Ollama model: JudgeConfig(provider="ollama", model="qwen2.5:14b").
Want a runnable eval suite from a one-paragraph product description? Run multivon-eval bootstrap. Want it wired into Claude Code as auto-invoking skills? Run multivon-eval install-skills.
What it does
Deterministic
String matching, regex, JSON schema, BLEU, ROUGE, latency. Instant, free, no LLM.
LLM-as-judge
QAG scoring — yes/no questions instead of unreliable 1-10 ratings. Faithfulness, hallucination, relevance, and more.
Agent trace
Tool call accuracy, plan quality, step faithfulness, task completion. Framework-agnostic.
Conversation
Knowledge retention, relevance, consistency, and completeness across multi-turn sessions.
Key features
QAG scoring — Instead of asking a judge “rate this 1-10”, we generate binary yes/no questions about the output and score by fraction answered correctly. Binary questions eliminate scale ambiguity, are easier for LLMs to answer consistently, and make every score fully auditable: you can see exactly which questions passed or failed.
Plain-English checks — suite.add_check("Response should mention the return policy") is all you need to write your first eval. The SDK generates yes/no questions from your criterion automatically. No need to pick an evaluator class or craft QAG questions manually.
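The QAG scoring rule described above (score = fraction of binary questions answered correctly) can be illustrated with a small standalone sketch. It does not use the multivon-eval SDK: the QagQuestion class, its fields, and both helper functions are hypothetical names chosen for this example, and the judge's answers are hard-coded rather than produced by an LLM.

```python
from dataclasses import dataclass


@dataclass
class QagQuestion:
    """One binary question about a model output (hypothetical structure)."""
    text: str
    expected: bool  # the answer a good output should produce
    answer: bool    # the answer the judge actually gave


def qag_score(questions: list[QagQuestion]) -> float:
    """Fraction of questions whose judged answer matches the expected answer."""
    if not questions:
        raise ValueError("need at least one question")
    correct = sum(q.answer == q.expected for q in questions)
    return correct / len(questions)


def failed_questions(questions: list[QagQuestion]) -> list[str]:
    """Text of every question that did not match, for auditing a score."""
    return [q.text for q in questions if q.answer != q.expected]


# Hard-coded judge answers, for illustration only.
questions = [
    QagQuestion("Does the response mention the return policy?", True, True),
    QagQuestion("Does the response state a refund window?", True, False),
    QagQuestion("Does the response contradict the source document?", False, False),
    QagQuestion("Is the response on topic?", True, True),
]

score = qag_score(questions)          # 3 of 4 questions match
failed = failed_questions(questions)  # the one question that did not
```

Because each question is recorded individually, a low score can be traced back to the exact questions that failed, which is the auditability property the paragraph above refers to.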
No cold start — Point generate_from_file() at your docs and get eval cases in seconds. No labeled dataset required to get started.
Reliability & flakiness detection — Run each case N times with suite.run(runs=5). Flaky cases (inconsistent pass/fail) are flagged automatically. Experiment comparison shows statistical significance so you know whether a regression is real or sampling noise. NAACL 2025 research confirms single-run eval scores are unreliable — variance alone can reverse model rankings.
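As a rough illustration of what "flaky" means here, the sketch below flags a case as flaky when its repeated runs disagree, that is, when it neither always passes nor always fails. This is a standalone example, not the SDK's implementation: summarize_runs and the shape of the results dictionary are assumptions made for the sketch.

```python
def summarize_runs(results: dict[str, list[bool]]) -> dict[str, dict]:
    """Compute per-case pass rate and a flaky flag from repeated runs.

    `results` maps a case id to the pass/fail outcome of each run
    (a hypothetical input shape chosen for this example).
    """
    summary = {}
    for case_id, outcomes in results.items():
        if not outcomes:
            raise ValueError(f"case {case_id!r} has no runs")
        passes = sum(outcomes)
        summary[case_id] = {
            "pass_rate": passes / len(outcomes),
            # Flaky: at least one pass and at least one fail.
            "flaky": 0 < passes < len(outcomes),
        }
    return summary


# Five runs per case, with made-up outcomes.
runs = {
    "refund-policy": [True, True, True, True, True],
    "shipping-eta": [True, False, True, True, False],
    "greeting": [False, False, False, False, False],
}
summary = summarize_runs(runs)
flaky_cases = [case_id for case_id, s in summary.items() if s["flaky"]]
```

A case that fails on every run is a consistent failure rather than a flaky one, so only "shipping-eta" is flagged in this example.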
Experiment tracking — Record every run to ~/.multivon/experiments/. Compare two runs side-by-side and get a pass rate delta with p-values. Catch regressions before they ship.
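One common way to attach a p-value to a pass rate delta is a two-sided two-proportion z-test. The source does not say which test multivon-eval uses, so treat the sketch below as an assumption-laden illustration of the idea, written with the standard library only. The function name pass_rate_delta and the example counts are hypothetical.

```python
import math


def pass_rate_delta(pass_a: int, n_a: int, pass_b: int, n_b: int) -> tuple[float, float]:
    """Return (delta, p_value) comparing run B against run A.

    Uses a pooled two-proportion z-test with a two-sided p-value.
    This is an assumed method, not necessarily what the SDK does.
    """
    if n_a <= 0 or n_b <= 0:
        raise ValueError("both runs need at least one case")
    p_a = pass_a / n_a
    p_b = pass_b / n_b
    pooled = (pass_a + pass_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        # Both runs passed everything or failed everything: no evidence of change.
        return p_b - p_a, 1.0
    z = (p_b - p_a) / se
    # Two-sided p-value for a standard normal: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return p_b - p_a, p_value


# Made-up counts: baseline passes 90 of 100 cases, candidate passes 80 of 100.
delta, p_value = pass_rate_delta(90, 100, 80, 100)
is_significant = p_value < 0.05
```

With small suites the normal approximation is coarse, so an exact test or a bootstrap may be a better fit when a run has only a few dozen cases.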
Framework integrations — Capture agent traces from LangChain, LangSmith, or any custom agent. Import existing LangSmith runs as eval cases without re-running the agent.
Shareable HTML reports — report.save_html("report.html") produces a self-contained HTML file with per-case breakdowns, evaluator scores, and flakiness indicators. No server required.
CI/CD first — One line exits with code 1 if your pass rate drops below a threshold. Evals that don’t run in CI catch nothing.
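The gating logic behind a CI check like this is simple enough to sketch in a few lines. The example below is generic Python, not the SDK's one-liner (which is not reproduced here): exit_code is a hypothetical helper that maps a pass rate and a threshold to a process exit status.

```python
def exit_code(passed: int, total: int, threshold: float) -> int:
    """Return 0 if the pass rate meets the threshold, otherwise 1.

    An empty suite is treated as a failure so that a misconfigured
    run cannot pass CI silently (a design choice of this sketch).
    """
    if total == 0:
        return 1
    return 0 if passed / total >= threshold else 1


# Made-up results: 47 of 50 cases passed, and the gate requires 90%.
status = exit_code(passed=47, total=50, threshold=0.90)

# In a CI script, the final line would hand this status to the shell:
#     raise SystemExit(status)
```

Most CI systems mark a step as failed on any non-zero exit status, so returning 1 is enough to block a merge.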
Install
Set ANTHROPIC_API_KEY and/or OPENAI_API_KEY to use the LLM-judge evaluators.
Quickstart
Up and running in 5 minutes
Generate datasets
No labeled data? Generate cases from your docs

