Single-run benchmark scores are unreliable. NAACL 2025 research showed that variance between runs is large enough to reverse model rankings: a model that looks 5% better in one run may simply be lucky. multivon-eval operationalizes this with confidence intervals (CIs) on every report by default, power warnings, multiple comparison correction, and judge calibration against human ground truth.
CIs shown by default
Every suite.run() report now includes confidence intervals without any extra code.
The score distribution shows what avg_score hides. A model that scores either 0.95 or 0.40 (never in between) can have the same avg_score as one that always scores 0.67, but the two behave very differently. A bimodal distribution usually means the evaluation criterion has a sharp decision boundary and your model is straddling it.
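To make the point about averages concrete, here is a minimal sketch using only the Python standard library. The score lists are invented for illustration and `summarize` is a hypothetical helper, not part of multivon-eval.

```python
from statistics import mean, pstdev

# Hypothetical per-case scores, invented for illustration.
bimodal = [0.95, 0.40] * 5   # half the cases score high, half score low
steady = [0.675] * 10        # every case scores the same

def summarize(scores):
    """Return (mean, population standard deviation), rounded to 3 places."""
    return round(mean(scores), 3), round(pstdev(scores), 3)

bimodal_summary = summarize(bimodal)
steady_summary = summarize(steady)
print("bimodal:", bimodal_summary)
print("steady: ", steady_summary)
```

Both lists report the same mean, but only the spread reveals that the first model alternates between near-perfect and poor answers.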
Why single-run scores lie
LLMs are non-deterministic. Even with temperature=0, hosted APIs introduce variance through hardware parallelism and batching. A 91% pass rate on 50 cases could be anywhere from 80% to 97% if you ran the same cases again.
The fix: run more cases, run each case multiple times, and use confidence intervals to understand what your score actually means.
Confidence intervals with wilson_interval
The Wilson score interval is the most reliable CI for binomial proportions — it handles small n and extreme pass rates much better than the normal approximation.
experiment.compare() shows these intervals automatically.
Know how many cases you need
Before running an eval, calculate whether your test suite is large enough to detect the improvement you care about.

| Effect size | Min cases needed |
|---|---|
| 15% improvement | ~118 |
| 10% improvement | ~291 |
| 5% improvement | ~1,248 |
| 2% improvement | ~7,700 |
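Numbers of this kind can be approximated with the standard two-proportion sample size formula. The sketch below is an assumption about how such a table could be derived, not the library's implementation: `cases_needed` is a hypothetical helper, and the 70% baseline pass rate, two-sided α=0.05, and 80% power are assumed values.

```python
from math import ceil
from statistics import NormalDist

def cases_needed(baseline, delta, alpha=0.05, power=0.80):
    """Cases per variant to detect a pass-rate change from baseline to baseline + delta.

    Standard two-proportion approximation; hypothetical helper, not runs_needed().
    """
    p1, p2 = baseline, baseline + delta
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided test
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / delta ** 2)

# Assumed baseline pass rate of 70%.
for delta in (0.15, 0.10, 0.05):
    print(f"{delta:.0%} improvement: {cases_needed(0.70, delta)} cases")
```

Under these assumptions the formula reproduces the 15%, 10%, and 5% rows of the table. The 2% row comes out somewhat higher than ~7,700, so the library's exact method may differ. Because the required count grows with the inverse square of the effect size, halving the detectable improvement roughly quadruples the cases needed.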
Power hints in compare()
When compare() finds a difference that doesn’t reach p < 0.05, it tells you how many more cases you’d need.
Multi-run flakiness detection
Combine runs=N with statistical rigor for per-case stability analysis.
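As a rough sketch of what per-case stability means, the code below flags any case whose outcome changes between runs. The data and the `find_flaky` helper are hypothetical; the library's own flakiness criteria and report fields are not reproduced here.

```python
def find_flaky(results_by_case):
    """Return {case_id: pass_rate} for cases whose runs disagree with each other."""
    flaky = {}
    for case_id, outcomes in results_by_case.items():
        passes = sum(outcomes)
        if 0 < passes < len(outcomes):  # mixed pass/fail across runs
            flaky[case_id] = passes / len(outcomes)
    return flaky

# Hypothetical outcomes for three cases, five runs each (True = pass).
results = {
    "refund-policy": [True, True, True, True, True],
    "date-parsing": [True, False, True, True, False],
    "tone-check": [False, False, False, False, False],
}
flaky = find_flaky(results)
print(flaky)
```

A case that always passes or always fails is stable, even if it is wrong. Only the case with mixed outcomes adds run-to-run noise to the aggregate score.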
Recommended defaults
| Scenario | Setting |
|---|---|
| Quick iteration | runs=1, 20–50 cases (fast, coarse) |
| Pre-ship check | runs=3, 100+ cases |
| Regression gate | runs=5, 200+ cases, fail_threshold=0.85 |
| Significance test | runs=1, ≥291 cases for 10% delta detection |
| Flakiness audit | runs=10, any case count |
Multiple comparison correction
Running N evaluators and reporting N raw p-values inflates the false positive rate. At α=0.05 with 10 evaluators, you’d expect ~0.5 spurious “significant” results per run just by chance.

exp.compare() applies Benjamini-Hochberg correction automatically when comparing evaluator scores, showing adjusted p-values with * for those that survive correction.
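The Benjamini-Hochberg adjustment is simple enough to sketch in a few lines. The p-values below are invented and `bh_adjust` is a hypothetical helper; this shows the standard procedure, not the library's internal code or output format.

```python
def bh_adjust(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices, smallest p first
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted

# Hypothetical raw p-values from four evaluators.
raw = [0.01, 0.04, 0.03, 0.20]
adjusted = bh_adjust(raw)
for p, q in zip(raw, adjusted):
    marker = "*" if q < 0.05 else ""
    print(f"raw={p:.3f} adjusted={q:.3f}{marker}")
```

In this example three raw p-values fall below 0.05, but only the smallest one still does after adjustment.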
Judge calibration
A passing eval score only means something if your judge actually agrees with human judgment. suite.calibrate() measures this directly.
- Agreement ≥ 85%: judge is reliable for CI gating
- Agreement 70–85%: usable for iteration but don’t gate deploys on it alone
- Agreement < 70%: judge and humans disagree too often — reconsider your evaluator or threshold
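The agreement bands above can be applied to any labeled sample. The following sketch computes raw agreement between judge verdicts and human labels and maps it onto those bands. Both helpers and the sample data are hypothetical, and suite.calibrate() may compute agreement differently.

```python
def agreement_rate(judge_verdicts, human_labels):
    """Fraction of samples where the judge's verdict matches the human label."""
    if len(judge_verdicts) != len(human_labels) or not human_labels:
        raise ValueError("need two non-empty lists of equal length")
    matches = sum(j == h for j, h in zip(judge_verdicts, human_labels))
    return matches / len(human_labels)

def reliability_tier(agreement):
    """Map an agreement rate onto the three bands listed above."""
    if agreement >= 0.85:
        return "reliable for gating"
    if agreement >= 0.70:
        return "iteration only"
    return "reconsider evaluator or threshold"

# Hypothetical labels for ten samples (True = pass).
judge = [True, True, False, True, True, False, True, True, True, False]
human = [True, True, False, True, False, False, True, True, True, True]
rate = agreement_rate(judge, human)
print(rate, reliability_tier(rate))
```

With only ten labeled samples the agreement rate is itself a noisy estimate, so a larger labeled set gives a more trustworthy calibration.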
Interpretation checklist
Before trusting an eval result, ask:
- Is the improvement statistically significant? (exp.compare() shows p-value)
- Are the confidence intervals non-overlapping? If CI(before) and CI(after) overlap, the difference is inconclusive.
- Do I have enough cases? The power warning tells you automatically; use runs_needed() to plan ahead.
- Are there flaky cases inflating the variance? Check report.flaky_count.
- Are multi-evaluator comparisons corrected? exp.compare() applies BH correction automatically; watch the * markers.
- Is my judge calibrated? Run suite.calibrate() once on a labeled sample before using eval scores to gate deploys.

