Single-run benchmark scores are unreliable. NAACL 2025 research showed that run-to-run variance is large enough to reverse model rankings: a model that looks 5% better in one run may simply be lucky. multivon-eval builds this into the workflow with confidence intervals (CIs) on every report by default, power warnings, multiple comparison correction, and judge calibration against human ground truth.

CIs shown by default

Every suite.run() report now includes confidence intervals without any extra code:
Pass Rate: 80% [69%–89% 95% CI]   Avg Score: 0.82 [0.74–0.90]
Score distribution  p10:0.41  p50:0.88  p90:0.96
Access them programmatically:
lo, hi = report.pass_rate_ci()        # Wilson 95% CI on pass rate
lo, hi = report.avg_score_ci()        # bootstrap 95% CI on mean score
pct    = report.score_percentiles()   # {"p10": 0.41, "p50": 0.88, "p90": 0.96}
The percentiles reveal what avg_score hides. A model that scores 0.95 on half the cases and 0.40 on the other half (never in between) has about the same avg_score as one that always scores 0.67, but the two behave very differently. A bimodal distribution usually means the evaluation criterion has a sharp decision boundary and your model is straddling it.
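To make the point concrete, here is a small sketch that uses only Python's standard library and does not call multivon-eval. The `percentiles` helper is hypothetical and uses a simple nearest-rank rule, which may differ from how score_percentiles() interpolates. It builds two score lists with the same mean, one bimodal and one constant, and shows that only the percentiles tell them apart.

```python
import math
from statistics import mean

def percentiles(scores, points=(10, 50, 90)):
    """Nearest-rank percentiles; a hypothetical stand-in for report.score_percentiles()."""
    ordered = sorted(scores)
    n = len(ordered)
    result = {}
    for p in points:
        # Nearest-rank rule: take the value at position ceil(p% of n), 1-indexed.
        rank = max(1, math.ceil(p * n / 100))
        result[f"p{p}"] = ordered[rank - 1]
    return result

bimodal = [0.95] * 50 + [0.40] * 50   # near-perfect or a clear failure, never in between
steady = [0.675] * 100                # always the same middling score

print(round(mean(bimodal), 3), percentiles(bimodal))
print(round(mean(steady), 3), percentiles(steady))
```

Both lists average about 0.675, but the bimodal list reports p10 = 0.40 and p90 = 0.95, while the constant list reports 0.675 at every percentile.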

Why single-run scores lie

LLMs are non-deterministic. Even with temperature=0, hosted APIs introduce variance through hardware parallelism and batching. A 91% pass rate on 50 cases could be anywhere from 80% to 97% if you ran the same cases again. The fix: run more cases, run each case multiple times, and use confidence intervals to understand what your score actually means.

Confidence intervals with wilson_interval

The Wilson score interval is a reliable default CI for binomial proportions: it handles small n and extreme pass rates much better than the normal (Wald) approximation.
from multivon_eval import wilson_interval

# 80 passing out of 100 cases
lo, hi = wilson_interval(80, 100)
print(f"95% CI: [{lo:.1%}, {hi:.1%}]")
# → 95% CI: [71.1%, 86.7%]

# Small test suite: 8 of 10 passing
lo, hi = wilson_interval(8, 10)
print(f"95% CI: [{lo:.1%}, {hi:.1%}]")
# → 95% CI: [49.0%, 94.3%]
# Wide interval — you can't conclude much from 10 cases
experiment.compare() shows these automatically:
  95% CI (before): [71.4%, 89.3%]
  95% CI (after):  [83.5%, 96.2%]
  Statistical significance: p=0.01 ✦✦ highly significant
  Verdict: IMPROVED — pass rate up +12.0%
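For readers who want to see what the interval computes, here is a minimal sketch in plain Python. It assumes the standard Wilson formula without continuity correction. `wilson_sketch` is a hypothetical name, and the library's own wilson_interval may differ in edge-case handling.

```python
import math

def wilson_sketch(passes, n, z=1.959964):
    """Wilson score interval for a binomial proportion (z of about 1.96 gives 95%)."""
    if n == 0:
        return (0.0, 1.0)  # no data: the interval is the whole range
    p_hat = passes / n
    denom = 1 + z**2 / n
    # The center is pulled toward 0.5, which keeps the interval inside [0, 1].
    center = (p_hat + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return (max(0.0, center - half), min(1.0, center + half))

lo, hi = wilson_sketch(80, 100)
print(f"95% CI: [{lo:.1%}, {hi:.1%}]")
# → 95% CI: [71.1%, 86.7%]
```

Unlike the normal approximation, the interval is not symmetric around the observed pass rate, which is why it stays sensible for small suites and for pass rates near 0% or 100%.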

Know how many cases you need

Before running an eval, calculate whether your test suite is large enough to detect the improvement you care about.
from multivon_eval import runs_needed

# How many test cases to detect a 10% improvement (80% power)?
n = runs_needed(delta=0.10)
# → 291

# Smaller improvement requires more cases
n = runs_needed(delta=0.05)
# → 1248

# Higher power threshold
n = runs_needed(delta=0.10, power=0.90)
# → 390

# Different baseline pass rate
n = runs_needed(delta=0.10, baseline=0.85)
# → 193
Rule of thumb:
  Effect size         Min cases needed
  ──────────────────────────────────────
  15% improvement     ~118
  10% improvement     ~291
  5% improvement      ~1,248
  2% improvement      ~7,700
A 2% improvement requires ~7,700 cases to confirm statistically. Most teams shouldn’t chase differences that small.
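The source does not say which formula or default baseline runs_needed() uses. The sketch below is an assumption: it applies the textbook normal-approximation sample size for comparing two proportions (two-sided α = 0.05, unpooled variance) with a baseline pass rate of 0.70. Under those assumptions it reproduces the 291, 1,248 and 118 figures, but it is not guaranteed to match the library for other arguments. `cases_needed_sketch` is a hypothetical name.

```python
import math
from statistics import NormalDist

def cases_needed_sketch(delta, baseline=0.70, alpha=0.05, power=0.80):
    """Cases per arm to detect a pass-rate change of `delta` (assumed formula)."""
    p1, p2 = baseline, baseline + delta
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_beta = NormalDist().inv_cdf(power)           # quantile for the target power
    variance = p1 * (1 - p1) + p2 * (1 - p2)       # sum of the two binomial variances
    return math.ceil((z_alpha + z_beta) ** 2 * variance / delta**2)

print(cases_needed_sketch(0.10))  # → 291
print(cases_needed_sketch(0.05))  # → 1248
print(cases_needed_sketch(0.15))  # → 118
```

The required sample size scales with 1/delta², which is why halving the effect you want to detect roughly quadruples the number of cases.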

Power hints in compare()

When compare() finds a difference that doesn’t reach p < 0.05, it tells you how many more cases you’d need:
  Statistical significance: p=0.23 not significant (likely noise)
  Hint: need ≥291 test cases to detect this 10% delta at 80% power.
  Verdict: No meaningful change in pass rate.
This is the difference between “we improved” and “we think we improved but can’t tell yet.”
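The source does not state which significance test compare() runs. As an illustration only, the sketch below uses a pooled two-proportion z-test, one common choice for comparing pass rates. `two_proportion_p_value` is a hypothetical helper. It shows how the same 10-point delta can be noise on 50 cases and clearly significant on 300.

```python
import math
from statistics import NormalDist

def two_proportion_p_value(pass_a, n_a, pass_b, n_b):
    """Two-sided p-value for a difference in pass rates (pooled z-test)."""
    p_a, p_b = pass_a / n_a, pass_b / n_b
    pooled = (pass_a + pass_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0  # both runs all-pass or all-fail: no evidence of a difference
    z = (p_b - p_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

# 70% vs 80% on 50 cases each: not significant
small = two_proportion_p_value(35, 50, 40, 50)
print(f"p={small:.2f}")  # → p=0.25

# The same pass rates on 300 cases each: well below 0.05
large = two_proportion_p_value(210, 300, 240, 300)
print(large < 0.05)  # → True
```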

Multi-run flakiness detection

Pass runs=N to run every case N times, then put a Wilson interval on each case's pass rate to see which cases are unstable:
from multivon_eval import wilson_interval

report = suite.run(model_fn, runs=10)

print(f"Stability: {report.stability_score:.0%}")   # % of non-flaky cases
print(f"Flaky cases: {report.flaky_count}")

for cr in report.case_results:
    lo, hi = wilson_interval(cr.pass_count, cr.runs)
    print(f"  {cr.case_input[:40]}: {cr.run_pass_rate:.0%} [{lo:.0%}, {hi:.0%}]")
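The text describes stability_score as the share of non-flaky cases but does not define "flaky". The sketch below assumes a case is flaky when its runs disagree, that is, it passes at least once and fails at least once. `stability_sketch` is a hypothetical helper working on plain pass counts, not the library's implementation.

```python
def stability_sketch(pass_counts, runs):
    """Return (stability, flaky_count), treating a case as flaky when its runs disagree."""
    if not pass_counts:
        return (1.0, 0)
    # A case that always passes or always fails is stable; anything in between is flaky.
    flaky = sum(1 for passed in pass_counts if 0 < passed < runs)
    return (1 - flaky / len(pass_counts), flaky)

# Five cases, 10 runs each: two of them pass only some of the time
stability, flaky = stability_sketch([10, 10, 7, 0, 3], runs=10)
print(f"Stability: {stability:.0%}, flaky cases: {flaky}")
# → Stability: 60%, flaky cases: 2
```

Note that under this definition a case that fails on every run counts as stable: it is consistently wrong, not flaky.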

  Scenario            Setting
  ────────────────────────────────────────────────────────────────
  Quick iteration     runs=1, 20–50 cases (fast, coarse)
  Pre-ship check      runs=3, 100+ cases
  Regression gate     runs=5, 200+ cases, fail_threshold=0.85
  Significance test   runs=1, ≥291 cases for 10% delta detection
  Flakiness audit     runs=10, any case count

Multiple comparison correction

Running N evaluators and reporting N raw p-values inflates the false positive rate. At α=0.05 with 10 evaluators, you’d expect ~0.5 spurious “significant” results per run just by chance. exp.compare() applies Benjamini-Hochberg correction automatically when comparing evaluator scores, showing adjusted p-values with * for those that survive correction:
  Evaluator scores         Before           After    BH-adj p
  ────────────────────────────────────────────────────────────
  faithfulness             0.7800  →   0.6800  ↓   0.023 *
  context_precision        0.9200  →   0.7900  ↓   0.034 *
  relevance                0.8800  →   0.8600        0.412

  (* significant after Benjamini-Hochberg correction, FDR 5%)
For standalone use:
from multivon_eval import benjamini_hochberg

# p-values from 5 simultaneous tests
raw = [0.001, 0.040, 0.030, 0.200, 0.800]
adj = benjamini_hochberg(raw)
# → [0.005, 0.067, 0.067, 0.250, 0.800]
# Only the first test survives at FDR 5%
BH is less conservative than Bonferroni — it controls the rate of false discoveries rather than the probability of any false discovery.
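The adjustment itself is short. Here is a minimal sketch of the standard Benjamini-Hochberg step-up procedure in plain Python. `bh_adjust_sketch` is a hypothetical name, not the library function. Each p-value is scaled by m/rank, then a running minimum taken from the largest p-value downward keeps the adjusted values monotone.

```python
def bh_adjust_sketch(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices, smallest p first
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so a smaller rank never gets a larger adjusted p.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted

raw = [0.001, 0.040, 0.030, 0.200, 0.800]
print([round(p, 3) for p in bh_adjust_sketch(raw)])
# → [0.005, 0.067, 0.067, 0.25, 0.8]
```

The raw value 0.030 would scale to 0.075 on its own, but the running minimum caps it at 0.067, the adjusted value of the next-larger p-value.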

Judge calibration

A passing eval score only means something if your judge actually agrees with human judgment. suite.calibrate() measures this directly.
result = suite.calibrate([
    (EvalCase(input="How do I cancel?"), "Please contact billing.", False),
    (EvalCase(input="Reset password?"),  "Click Forgot Password.",  True),
    # ... more (case, output, human_pass) tuples
])
print(result)
Judge Calibration — 50 labeled cases
  Agreement:  88.0%
  Precision:  84.0%
  Recall:     91.0%
  F1 Score:   87.4%
  By evaluator:
    faithfulness: agreement=90.0%  F1=89.0%
    relevance:    agreement=82.0%  F1=80.0%
Interpreting the results:
  • Agreement ≥ 85%: judge is reliable for CI gating
  • Agreement 70–85%: usable for iteration but don’t gate deploys on it alone
  • Agreement < 70%: judge and humans disagree too often — reconsider your evaluator or threshold
Low precision means the judge passes cases humans would reject (over-permissive), so regressions can slip through a deploy gate. Low recall means the judge rejects cases humans would pass (over-strict), so good changes get blocked.
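The four headline numbers follow from a confusion matrix with "pass" as the positive class. Here is a minimal sketch, assuming the usual definitions. `calibration_metrics_sketch` is a hypothetical helper that takes plain (judge_pass, human_pass) booleans and is not the suite.calibrate() implementation.

```python
def calibration_metrics_sketch(pairs):
    """pairs: list of (judge_pass, human_pass) booleans; 'pass' is the positive class."""
    tp = sum(1 for j, h in pairs if j and h)          # both say pass
    fp = sum(1 for j, h in pairs if j and not h)      # judge passes, human rejects
    fn = sum(1 for j, h in pairs if not j and h)      # judge rejects, human passes
    tn = sum(1 for j, h in pairs if not j and not h)  # both say fail
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    agreement = (tp + tn) / len(pairs) if pairs else 0.0
    return {"agreement": agreement, "precision": precision, "recall": recall, "f1": f1}

# 8 labeled cases: judge agrees on 6, wrongly passes 1, wrongly rejects 1
pairs = [(True, True)] * 4 + [(False, False)] * 2 + [(True, False)] + [(False, True)]
print(calibration_metrics_sketch(pairs))
```

F1 is the harmonic mean of precision and recall, so the sample report's 84.0% precision and 91.0% recall give 2 × 0.84 × 0.91 / (0.84 + 0.91) ≈ 87.4%.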

Interpretation checklist

Before trusting an eval result, ask:
  1. Is the improvement statistically significant? (exp.compare() shows p-value)
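  2. Do the confidence intervals overlap? Non-overlapping intervals are strong evidence of a real change. Overlapping intervals are not proof of no change (a difference can still be significant), so check the p-value before concluding either way.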
  3. Do I have enough cases? The power warning tells you automatically; use runs_needed() to plan ahead.
  4. Are there flaky cases inflating the variance? Check report.flaky_count.
  5. Are multi-evaluator comparisons corrected? exp.compare() applies BH correction automatically; watch the * markers.
  6. Is my judge calibrated? Run suite.calibrate() once on a labeled sample before using eval scores to gate deploys.