Statistical Rigor

Single-run benchmark scores are unreliable. NAACL 2025 research showed that variance between runs is large enough to reverse model rankings — a model that looks 5% better in one run may simply be lucky. multivon-eval operationalizes this: CIs on every report by default, power warnings, multiple comparison correction, and judge calibration against human ground truth.

CIs shown by default

Every suite.run() report now includes confidence intervals without any extra code:

Pass Rate: 80% [69%–89% 95% CI]   Avg Score: 0.82 [0.74–0.90]
Score distribution  p10:0.41  p50:0.88  p90:0.96

Access them programmatically:

lo, hi = report.pass_rate_ci()        # Wilson 95% CI on pass rate
lo, hi = report.avg_score_ci()        # bootstrap 95% CI on mean score
pct    = report.score_percentiles()   # {"p10": 0.41, "p50": 0.88, "p90": 0.96}

The percentiles reveal what avg_score hides. A model that scores 0.95 on half the cases and 0.40 on the rest (never in between) averages 0.675, about the same avg_score as one that always scores 0.67, but the two behave very differently. A bimodal distribution usually means the evaluation criterion has a sharp decision boundary and your model is straddling it.
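
To make the point concrete, here is a minimal sketch in plain Python that builds the two score distributions described above and compares their mean and percentiles. The `percentile` helper is hypothetical and uses the nearest-rank method; the method behind `report.score_percentiles()` is not stated here and may differ.

```python
from math import ceil
from statistics import mean


def percentile(scores, q):
    """Nearest-rank percentile (q in 1..100). Hypothetical helper, not the library's."""
    ordered = sorted(scores)
    rank = max(1, ceil(q * len(ordered) / 100))  # 1-based rank
    return ordered[rank - 1]


# Half the cases score 0.95, the other half 0.40 (never in between).
bimodal = [0.95] * 50 + [0.40] * 50
# A model that lands on the same mean for every case.
steady = [0.675] * 100

for name, scores in (("bimodal", bimodal), ("steady", steady)):
    summary = {f"p{q}": percentile(scores, q) for q in (10, 50, 90)}
    print(f"{name}: mean={mean(scores):.3f} {summary}")
```

Both lists have a mean of 0.675, but only the percentiles show that the first model fails half its cases outright.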

Why single-run scores lie

LLMs are non-deterministic. Even with temperature=0, hosted APIs introduce variance through hardware parallelism and batching. A 91% pass rate on 50 cases is consistent with a true pass rate anywhere from about 80% to 97%, so a rerun of the same cases could land well away from 91%. The fix: run more cases, run each case multiple times, and use confidence intervals to understand what your score actually means.

Confidence intervals with `wilson_interval`

The Wilson score interval is a robust CI for binomial proportions: it handles small n and extreme pass rates much better than the normal (Wald) approximation.

from multivon_eval import wilson_interval

# 80 passing out of 100 cases
lo, hi = wilson_interval(80, 100)
print(f"95% CI: [{lo:.1%}, {hi:.1%}]")
# → 95% CI: [71.1%, 86.7%]

# Small test suite: 8 of 10 passing
lo, hi = wilson_interval(8, 10)
print(f"95% CI: [{lo:.1%}, {hi:.1%}]")
# → 95% CI: [49.0%, 94.3%]
# Wide interval — you can't conclude much from 10 cases

experiment.compare() shows these automatically:

  95% CI (before): [71.4%, 89.3%]
  95% CI (after):  [83.5%, 96.2%]
  Statistical significance: p=0.01 ✦✦ highly significant
  Verdict: IMPROVED — pass rate up +12.0%
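
For readers who want to see the arithmetic, the following is a minimal sketch of the standard Wilson score formula using only the standard library. `wilson_sketch` is a hypothetical name for illustration, not the library's implementation, though it reproduces the two `wilson_interval` results quoted above.

```python
from math import sqrt
from statistics import NormalDist


def wilson_sketch(passes, n, confidence=0.95):
    """Wilson score interval for a binomial proportion (illustrative only)."""
    if n <= 0:
        raise ValueError("n must be positive")
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # about 1.96 for 95%
    p_hat = passes / n
    denom = 1 + z * z / n
    # The interval is centred on a value shrunk toward 0.5, not on p_hat itself.
    center = (p_hat + z * z / (2 * n)) / denom
    half = z * sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)


for passes, n in ((80, 100), (8, 10)):
    lo, hi = wilson_sketch(passes, n)
    print(f"{passes}/{n}: [{lo:.1%}, {hi:.1%}]")
# → 80/100: [71.1%, 86.7%]
# → 8/10: [49.0%, 94.3%]
```

The shrinkage toward 0.5 is what keeps the interval sensible at small n and at pass rates near 0% or 100%, where the normal approximation collapses to zero width.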

Know how many cases you need

Before running an eval, calculate whether your test suite is large enough to detect the improvement you care about.

from multivon_eval import runs_needed

# How many test cases to detect a 10% improvement (80% power)?
n = runs_needed(delta=0.10)
# → 291

# Smaller improvement requires more cases
n = runs_needed(delta=0.05)
# → 1248

# Higher power threshold
n = runs_needed(delta=0.10, power=0.90)
# → 389

# Different baseline pass rate
n = runs_needed(delta=0.10, baseline=0.85)
# → 138

Rule of thumb:

| Effect size | Min cases needed |
|---|---|
| 15% improvement | ~118 |
| 10% improvement | ~291 |
| 5% improvement | ~1,248 |
| 2% improvement | ~8,077 |

A 2% improvement requires ~8,000 cases to confirm statistically. Most teams shouldn’t chase differences that small.
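
The figures above are consistent with the textbook two-proportion sample-size formula, n = (z_α/2 + z_β)² · (p₁(1−p₁) + p₂(1−p₂)) / δ², with a two-sided α of 0.05 and a default baseline of 0.70. Both the formula and the 0.70 default are inferred from the numbers, not stated by the library, so treat this as a sketch. `cases_needed_sketch` is a hypothetical name.

```python
from math import ceil
from statistics import NormalDist


def cases_needed_sketch(delta, baseline=0.70, power=0.80, alpha=0.05):
    """Cases needed to detect a pass-rate change of `delta` (illustrative only).

    Assumption: the 0.70 default baseline is inferred from the documented
    outputs, not confirmed by the library.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance
    z_beta = NormalDist().inv_cdf(power)
    p1, p2 = baseline, baseline + delta
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / delta ** 2)


print(cases_needed_sketch(0.10))                 # → 291
print(cases_needed_sketch(0.05))                 # → 1248
print(cases_needed_sketch(0.10, power=0.90))     # → 389
print(cases_needed_sketch(0.10, baseline=0.85))  # → 138
```

Because δ is squared in the denominator, halving the effect you want to detect roughly quadruples the number of cases.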

Power hints in `compare()`

When compare() finds a difference that doesn’t reach p < 0.05, it tells you how many more cases you’d need:

  Statistical significance: p=0.23 not significant (likely noise)
  Hint: need ≥291 test cases to detect this 10% delta at 80% power.
  Verdict: NO MEANINGFUL CHANGE (delta +10.0%, not significant)

This is the difference between “we improved” and “we think we improved but can’t tell yet.”

What a 100% pass rate actually tells you

New in 0.16.0. A suite at 100% can no longer detect improvement — and, less obviously, it is bad at detecting regressions too. Two properties on every report quantify what a perfect score can still claim:

report.saturated                    # True when every EVALUATED task passed
report.min_detectable_regression    # smallest pass-rate DROP detectable at 80% power

saturated is built on evaluated, not total — a 100% assembled from judge outages doesn’t count. min_detectable_regression anchors its variance near the observed rate, capped at a 0.95 baseline, so a perfect score can’t flatter its own sensitivity. Concretely, a 40-task suite at 100%:

  ⚠ Saturated: 40/40 trials passed. All this run can claim is a pass rate
  ≥ 91.2% (95% Wilson). At n=40, the smallest regression this suite can
  detect at 80% power is ~14% — a real 5pp quality drop would look like
  noise. Graduate this suite to a regression suite (purpose='regression')
  and add harder capability tasks.

The Wilson lower bound is the honest floor: 40/40 is consistent with a true pass rate as low as 91.2%. And at n=40, a real drop of about 14pp is the smallest this suite would reliably notice (about 6pp at n=200). This is always a warning, never a gate.

Graduation. Declare the suite’s intent with the purpose kwarg, which is '' (unset), 'capability', or 'regression':

suite = EvalSuite("smoke-tests", purpose="capability")   # saturation = nag to graduate
suite = EvalSuite("smoke-tests", purpose="regression")   # saturation = expected; warning inverts

A saturated capability suite (n ≥ 3) gets the graduation warning above. A purpose='regression' suite inverts it: 100% is the expected steady state, and any task below ceiling prints a triage warning instead (“N previously-passing task(s) below ceiling — something broke; triage before shipping”). The purpose is copied onto the report and serialized in the JSON summary alongside saturated and min_detectable_regression; view --dir shows a saturated badge. The 0% end of the scale has its own detector — see zero-pass suspects.
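
The numbers in the saturation warning can be reproduced with a short sketch. The Wilson lower bound for a perfect score reduces to n / (n + z²). The sensitivity figures (~14% at n=40, ~6% at n=200) match a two-sample formula with variance anchored at the capped 0.95 baseline. Whether the library computes `min_detectable_regression` exactly this way is an assumption; both function names below are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def perfect_score_floor(n, confidence=0.95):
    """Wilson lower bound when all n evaluated tasks pass: n / (n + z^2)."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return n / (n + z * z)


def min_detectable_drop_sketch(n, observed_rate=1.0, power=0.80, alpha=0.05, cap=0.95):
    """Smallest pass-rate drop detectable at the given power (illustrative only)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    # Cap the baseline so a perfect score does not imply zero variance.
    p = min(observed_rate, cap)
    return (z_alpha + z_beta) * sqrt(2 * p * (1 - p) / n)


print(f"{perfect_score_floor(40):.1%}")          # → 91.2%
print(f"{min_detectable_drop_sketch(40):.0%}")   # → 14%
print(f"{min_detectable_drop_sketch(200):.0%}")  # → 6%
```

Sensitivity improves only with the square root of n, which is why going from 40 to 200 tasks moves the detectable drop from about 14pp to about 6pp rather than to 3pp.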

Multi-run flakiness detection

Combine runs=N with per-case Wilson intervals for stability analysis:

from multivon_eval import wilson_interval

report = suite.run(model_fn, runs=10)

print(f"Stability: {report.stability_score:.0%}")   # % of non-flaky cases
print(f"Flaky cases: {report.flaky_count}")

for cr in report.case_results:
    lo, hi = wilson_interval(cr.pass_count, cr.runs)
    print(f"  {cr.case_input[:40]}: {cr.run_pass_rate:.0%} [{lo:.0%}, {hi:.0%}]")

The same multi-run data also yields pass@k (capability) and pass^k (consistency) with cluster-bootstrap CIs — see pass@k and pass^k.
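
As a rough illustration of what a stability score over multi-run data can look like, the sketch below treats a case as flaky when its runs disagree (at least one pass and at least one fail). That definition, the function names, and the sample data are all assumptions for illustration; the library's own flakiness criterion is not spelled out here.

```python
def is_flaky(outcomes):
    """Assumed definition: a case is flaky if its runs are neither all-pass nor all-fail."""
    return 0 < sum(outcomes) < len(outcomes)


def stability(runs_by_case):
    """Share of cases whose runs all agree (illustrative only)."""
    flaky = sum(is_flaky(outcomes) for outcomes in runs_by_case.values())
    return 1 - flaky / len(runs_by_case)


# Hypothetical per-case pass/fail outcomes over 5 runs each.
runs_by_case = {
    "refund policy": [True, True, True, True, True],
    "cancel order": [True, False, True, True, False],  # mixed → flaky
    "reset password": [False, False, False, False, False],
    "update email": [True, True, True, True, True],
}

print(f"Stability: {stability(runs_by_case):.0%}")  # → Stability: 75%
```

Note that a case failing every run counts as stable under this definition: it is consistently wrong, not flaky, and needs a different kind of fix.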

Recommended defaults

| Scenario | Setting |
|---|---|
| Quick iteration | `runs=1`, 20–50 cases (fast, coarse) |
| Pre-ship check | `runs=3`, 100+ cases |
| Regression gate | `runs=5`, 200+ cases, `fail_threshold=0.85` |
| Significance test | `runs=1`, ≥291 cases for 10% delta detection |
| Flakiness audit | `runs=10`, any case count |

Multiple comparison correction

Running N evaluators and reporting N raw p-values inflates the false positive rate. At α=0.05 with 10 evaluators, you’d expect ~0.5 spurious “significant” results per run just by chance. exp.compare() applies Benjamini-Hochberg correction automatically when comparing evaluator scores, showing adjusted p-values with * for those that survive correction:

  Evaluator scores         Before           After    BH-adj p
  ────────────────────────────────────────────────────────────
  faithfulness             0.7800  →   0.6800  ↓   0.023 *
  context_precision        0.9200  →   0.7900  ↓   0.034 *
  relevance                0.8800  →   0.8600        0.412

  (* significant after Benjamini-Hochberg correction, FDR 5%)

For standalone use:

from multivon_eval import benjamini_hochberg

# p-values from 5 simultaneous tests
raw = [0.001, 0.040, 0.030, 0.200, 0.800]
adj = benjamini_hochberg(raw)
# → [0.005, 0.067, 0.067, 0.250, 0.800]
# Only the first test survives at FDR 5%

BH is less conservative than Bonferroni — it controls the rate of false discoveries rather than the probability of any false discovery.
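
The adjustment itself is short enough to sketch in plain Python. This is the standard Benjamini-Hochberg step-up procedure: scale each p-value by m / rank, then enforce monotonicity from the largest p-value down. `bh_adjust_sketch` is a hypothetical name and not the library's code, though it reproduces the adjusted values shown above.

```python
def bh_adjust_sketch(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices, smallest p first
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that each adjusted
    # value never exceeds the one ranked above it.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


raw = [0.001, 0.040, 0.030, 0.200, 0.800]
print([round(p, 3) for p in bh_adjust_sketch(raw)])
# → [0.005, 0.067, 0.067, 0.25, 0.8]
```

The monotonicity step is why 0.030 and 0.040 end up with the same adjusted value: the raw scaling gives 0.075 and 0.067, and the smaller-ranked test inherits the lower of the two.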

Judge calibration

A passing eval score only means something if your judge actually agrees with human judgment. suite.calibrate() measures this directly.

result = suite.calibrate([
    (EvalCase(input="How do I cancel?"), "Please contact billing.", False),
    (EvalCase(input="Reset password?"),  "Click Forgot Password.",  True),
    # ... more (case, output, human_pass) tuples
])
print(result)

Judge Calibration — 50 labeled cases
  Agreement:  88.0%
  Precision:  84.0%
  Recall:     91.0%
  F1 Score:   87.4%
  By evaluator:
    faithfulness: agreement=90.0%  F1=89.0%
    relevance:    agreement=82.0%  F1=80.0%

Interpreting the results:

Agreement ≥ 85%: judge is reliable enough to gate CI (continuous integration) runs
Agreement 70–85%: usable for iteration but don’t gate deploys on it alone
Agreement < 70%: judge and humans disagree too often — reconsider your evaluator or threshold

Low precision means the judge passes cases humans would reject (over-permissive), so regressions can slip through a gate. Low recall means the judge rejects cases humans would pass (over-strict), so good changes get blocked.
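
The four headline numbers follow from a confusion matrix with "pass" as the positive class and the human label as ground truth. The sketch below shows that arithmetic on made-up labels; `calibration_metrics` is a hypothetical helper, not what `suite.calibrate()` runs internally.

```python
def calibration_metrics(pairs):
    """pairs: list of (judge_pass, human_pass) booleans. Illustrative only."""
    tp = sum(j and h for j, h in pairs)          # both say pass
    fp = sum(j and not h for j, h in pairs)      # judge passes, human rejects
    fn = sum(h and not j for j, h in pairs)      # judge rejects, human passes
    tn = sum(not j and not h for j, h in pairs)  # both say fail
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    return {
        "agreement": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Made-up labels: 6 true positives, 2 false positives, 1 false negative, 1 true negative.
pairs = [(True, True)] * 6 + [(True, False)] * 2 + [(False, True)] + [(False, False)]
for name, value in calibration_metrics(pairs).items():
    print(f"{name}: {value:.1%}")
```

In this toy sample precision (75%) is lower than recall (about 86%), the signature of an over-permissive judge.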

Interpretation checklist

Before trusting an eval result, ask:

Is the improvement statistically significant? (exp.compare() shows p-value)
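Are the confidence intervals non-overlapping? Non-overlapping CIs are strong evidence of a real difference. Overlapping CIs are not proof of noise (two intervals can overlap while the difference is still significant), so let the p-value from exp.compare() decide.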
Do I have enough cases? The power warning tells you automatically; use runs_needed() to plan ahead.
Are there flaky cases inflating the variance? Check report.flaky_count.
Are multi-evaluator comparisons corrected? exp.compare() applies BH correction automatically; watch the * markers.
Is my judge calibrated? Run suite.calibrate() once on a labeled sample before using eval scores to gate deploys.
Is the suite saturated? A 100% pass rate only proves a Wilson floor; check report.min_detectable_regression for what the suite can still see, and graduate it to purpose='regression'.
