EvalReport is the object returned by EvalSuite.run(). It exposes the run’s results both as flat attributes (the common readouts) and as derived methods (breakdowns, exports, comparisons). This page is the complete public API — if something isn’t on this page, treat it as internal and don’t rely on it.
report: EvalReport = suite.run(my_model_fn)
print(report.pass_rate)                  # float, 0.0–1.0
print(report.pass_rate_ci())             # (lo, hi) Wilson 95% CI
print(report.costs.by_model[0].cost_usd) # USD for the first (provider, model) entry
for case in report.failed_cases:         # CaseResult objects
    print(case.case_input, case.score)

Quick reference

| Attribute / method | Type | What it returns |
| --- | --- | --- |
| suite_name | str | Suite name passed to EvalSuite(name). |
| model_id | str | Identifier of the model under test (set automatically when known). |
| total | int | Total cases run. |
| evaluated | int | Cases that produced at least one evaluator result. |
| passed | int | Cases where every evaluator passed. |
| failed | int | Cases where at least one evaluator failed. |
| errors | int | Cases that errored before scoring (model or judge crash). |
| skipped | int | Cases skipped because the case shape didn’t fit. |
| pass_rate | float | passed / total as a fraction. |
| pass_rate_ci(confidence=0.95) | tuple[float, float] | Wilson CI on the pass rate. |
| avg_score | float | Mean score across all cases. |
| avg_score_ci(confidence=0.95) | tuple[float, float] | CI on the mean score. |
| score_percentiles(percentiles=[10,50,90]) | dict[str,float] | {"p10": …, "p50": …, "p90": …}. |
| case_results | list[CaseResult] | The per-case results. See CaseResult shape. |
| passed_cases | list[CaseResult] | Cases that fully passed. |
| failed_cases | list[CaseResult] | Cases that failed at least one evaluator. |
| sample(n, failed_only=False) | list[CaseResult] | Random sample for spot-checking. |
| filter_by_evaluator(name) | list[CaseResult] | Cases that an evaluator scored, in original order. |
| passed_by_evaluator() | dict[str,float] | {evaluator_name: pass_rate}. |
| scores_by_evaluator() | dict[str,float] | {evaluator_name: avg_score}. |
| passed_by_tag() | dict[str,float] | Same shape, grouped by case tag. |
| scores_by_tag() | dict[str,float] | Average score per tag. |
| count_by_tag() | dict[str,int] | Case count per tag. |
| costs | Costs | Token / call / USD totals. See Costs shape. |
| flaky_count | int | Cases where multiple runs disagreed. Requires runs > 1. |
| stability_score | float | 1.0 when no flakiness; lower when cases disagreed across runs. |
| judge_reliability | float \| None | Judge agreement rate when JudgeConfig.reliability_check is enabled. |
| runs_per_case | int | How many times each case was rerun (from suite.run(runs=N)). |
| errors_by_kind | dict[str,int] | { "model_error": 2, "judge_error": 1, ... }. |
| suite_lock | SuiteLock | Hash chain over evaluators + cases. Use for reproducibility. |
| compare(other) | Any | Diff vs another EvalReport. Use for regression detection. |
| assert_budget(**limits) | None | Raise if total cost / latency exceeds a limit. CI-friendly. |
| save_json(path) | None | Write the report as JSON. |
| save_html(path) | None | Write a static HTML viewer. |
| save_csv(path) | None | Write a per-case CSV. |
| save_junit_xml(path) | None | Write JUnit XML for CI runners. |
| to_json() | str | Same as save_json but returns the string. |
| to_html() | str | Same as save_html but returns the string. |
| to_junit_xml() | str | Same as save_junit_xml but returns the string. |
| from_dict(data) | classmethod | Re-hydrate from a parsed to_json() payload (a dict, e.g. from json.loads). |

Common gotchas

Field-vs-method shapes. Pass-rate and average score are attributes (plain access, no parens). The 95% CIs are methods (call them) so you can pass a different confidence level when needed. So:
report.pass_rate          # ✓ attr
report.pass_rate_ci()     # ✓ method
report.pass_rate_ci(0.99) # ✓ wider band (higher confidence)
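To make the confidence argument concrete, here is a minimal standalone sketch of the textbook Wilson score interval. It is not the library's implementation: `wilson_interval` is a hypothetical helper, and returning (0.0, 1.0) for an empty run is an assumption made for this sketch.

```python
import math
from statistics import NormalDist


def wilson_interval(passed: int, total: int, confidence: float = 0.95) -> tuple[float, float]:
    """Wilson score interval for the proportion passed / total."""
    if total == 0:
        # Assumption for this sketch: no data means no information.
        return (0.0, 1.0)
    # Two-sided critical value of the standard normal distribution.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    p = passed / total
    denom = 1 + z * z / total
    centre = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return (max(0.0, centre - half), min(1.0, centre + half))


lo95, hi95 = wilson_interval(17, 20)                   # 17 of 20 cases passed
lo99, hi99 = wilson_interval(17, 20, confidence=0.99)  # same data, higher confidence
print(f"95%: ({lo95:.3f}, {hi95:.3f})")
print(f"99%: ({lo99:.3f}, {hi99:.3f})")
```

Raising the confidence level widens the interval, so a gate built on the lower bound becomes stricter at 0.99 than at 0.95.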
costs is a dataclass, not a dict. Use attribute access:
report.costs.total_cost_usd          # ✓
report.costs.total_calls             # ✓
report.costs.by_model[0].provider    # ✓
report.costs['total_cost_usd']       # ✗ TypeError — not subscriptable
The serialised JSON exposes the same data under string keys (r['costs']['total_cost_usd']), which is sometimes a source of confusion. Use the dataclass at runtime and the JSON-keyed view when consuming a saved report.

case_results is the iterable, not cases. Older docs and blog posts sometimes show report.cases[i]; the correct attribute is case_results.

passed_by_evaluator is a method. Some older snippets show it as an attribute. Always call it:
report.passed_by_evaluator()   # ✓ → {'faithfulness': 0.83, 'hallucination': 0.67}
report.passed_by_evaluator     # ✗ returns the bound method, not the dict
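The missing-parentheses mistake is easy to overlook because a bound method is always truthy, so a check such as `if report.passed_by_evaluator:` passes silently. The sketch below illustrates this with `FakeReport`, a hypothetical stand-in class written for this example; it is not the real EvalReport.

```python
class FakeReport:
    """Hypothetical stand-in for EvalReport, used only to show call vs. attribute access."""

    def passed_by_evaluator(self) -> dict[str, float]:
        return {"faithfulness": 0.83, "hallucination": 0.67}


report = FakeReport()
called = report.passed_by_evaluator()   # the dict you want
uncalled = report.passed_by_evaluator   # a bound method object

print(type(called).__name__)    # dict
print(callable(uncalled))       # True
print(bool(uncalled))           # True, so truthiness checks cannot catch the mistake
```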

CaseResult shape

The objects in report.case_results.
| Attribute | Type | Meaning |
| --- | --- | --- |
| case_input | str | The case’s input string. |
| actual_output | str | What the model produced. |
| results | list[EvalResult] | One per evaluator that ran on this case. |
| evaluators | alias of results | Same list; older code uses this name. |
| passed | bool | All evaluators on this case passed. |
| score | float | Average of evaluator scores on this case. |
| latency_ms | float | Wall time of the model call only. |
| tags | list[str] | Case tags inherited from EvalCase.tags. |
| model_error | str \| None | Set when the model function raised. |
| judge_error | str \| None | Set when a judge call raised. |
| evaluator_error | str \| None | Set when an evaluator raised. |
| skipped | bool | Case was skipped end-to-end. |
| agent_trace | list[AgentStep] | Populated when an AgentTracer instrumented the model. |
| runs | int | Number of times this case was rerun (from suite.run(runs=N)). |
| all_scores | list[float] | Per-run scores; empty unless runs > 1. |
| pass_count | int | Number of passing runs across reruns; -1 when runs == 1. |
| retry_attempts | int | How many times judge_retry rescued a transient failure. |
| retry_errors | list[str] | The transient errors that got retried. |
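The passed and tags fields are enough to compute your own per-tag breakdowns. The sketch below shows one plausible way to derive a per-tag pass rate; `FakeCase` and `pass_rate_by_tag` are hypothetical names for this example, and the library's own passed_by_tag() may aggregate differently (for instance, in how it treats skipped or errored cases).

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class FakeCase:
    """Hypothetical stand-in exposing a few CaseResult fields."""
    case_input: str
    passed: bool
    score: float
    tags: list[str] = field(default_factory=list)


def pass_rate_by_tag(cases) -> dict[str, float]:
    """Fraction of passing cases per tag; a case with several tags counts under each."""
    totals = defaultdict(int)
    passes = defaultdict(int)
    for case in cases:
        for tag in case.tags:
            totals[tag] += 1
            passes[tag] += int(case.passed)
    return {tag: passes[tag] / totals[tag] for tag in totals}


cases = [
    FakeCase("q1", True, 0.9, ["smoke"]),
    FakeCase("q2", False, 0.4, ["smoke", "adversarial"]),
    FakeCase("q3", True, 0.8, ["adversarial"]),
    FakeCase("q4", True, 1.0, ["adversarial"]),
    FakeCase("q5", True, 1.0),  # untagged, so it appears under no tag
]
rates = pass_rate_by_tag(cases)
print(rates)
```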

EvalResult shape

The objects in case_result.results (per-evaluator).
| Attribute | Type | Meaning |
| --- | --- | --- |
| evaluator | str | Evaluator name. |
| score | float | 0.0–1.0. |
| passed | bool | score >= evaluator.threshold. |
| reason | str | Human-readable reason. Strings starting with [skipped] mean the case shape didn’t fit this evaluator (returns a passing skip, so it does not contaminate aggregates). |
| metadata | dict | {"skipped": True} is set when the evaluator skipped. Free-form otherwise. |
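When you post-process results yourself, it can help to separate skipped evaluator results from scored ones, since a skip reports passed=True without measuring anything. A minimal sketch using the two skip signals described in the table; `FakeEvalResult` and `is_skipped` are hypothetical names introduced for this example.

```python
from dataclasses import dataclass, field


@dataclass
class FakeEvalResult:
    """Hypothetical stand-in with the EvalResult fields listed in the table."""
    evaluator: str
    score: float
    passed: bool
    reason: str = ""
    metadata: dict = field(default_factory=dict)


def is_skipped(result) -> bool:
    """True when either documented skip signal is present."""
    return bool(result.metadata.get("skipped")) or result.reason.startswith("[skipped]")


results = [
    FakeEvalResult("faithfulness", 0.9, True, "grounded in context"),
    FakeEvalResult("tool_use", 1.0, True, "[skipped] no agent trace", {"skipped": True}),
    FakeEvalResult("hallucination", 0.3, False, "unsupported claim"),
]

# Keep only results that actually scored the case.
scored = [r for r in results if not is_skipped(r)]
print([r.evaluator for r in scored])
```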

Costs shape

The dataclass on report.costs.
| Attribute | Type | Meaning |
| --- | --- | --- |
| total_input_tokens | int | Sum across all judge calls. |
| total_output_tokens | int | Sum across all judge calls. |
| total_tokens | int | Convenience sum. |
| total_calls | int | Number of judge requests issued. |
| total_cost_usd | float | USD spent at provider list price. |
| by_model | list[ProviderUsage] | Per (provider, model) breakdown. |
ProviderUsage has provider, model, input_tokens, output_tokens, total_tokens, calls, and cost_usd.
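The difference between the runtime dataclass and the saved JSON view can be sketched with stand-in classes. `FakeCosts` and `FakeProviderUsage` are hypothetical and cover only some of the fields, the provider and model names are placeholders, and serialising with `dataclasses.asdict` is an assumption for illustration; the library's own serialiser may differ.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class FakeProviderUsage:
    """Hypothetical stand-in for ProviderUsage."""
    provider: str
    model: str
    input_tokens: int
    output_tokens: int
    total_tokens: int
    calls: int
    cost_usd: float


@dataclass
class FakeCosts:
    """Hypothetical stand-in for Costs (subset of fields)."""
    total_cost_usd: float
    total_calls: int
    by_model: list = field(default_factory=list)


costs = FakeCosts(
    total_cost_usd=0.42,
    total_calls=12,
    by_model=[FakeProviderUsage("example-provider", "example-judge", 9000, 1000, 10000, 12, 0.42)],
)

# Runtime: attribute access works, subscripting does not.
print(costs.total_cost_usd)
try:
    costs["total_cost_usd"]
    subscript_failed = False
except TypeError:
    subscript_failed = True
print(subscript_failed)

# Saved report: the same data appears under string keys after a JSON round trip.
saved = json.loads(json.dumps({"costs": asdict(costs)}))
print(saved["costs"]["total_cost_usd"])
print(saved["costs"]["by_model"][0]["provider"])
```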

CI examples

Fail the build on regressions:
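import sys

report = suite.run(model_fn)
# Raises if total cost or p95 latency exceeds the given limit.
report.assert_budget(max_total_cost_usd=2.0, max_p95_latency_ms=5000)

# Exit non-zero so the CI job fails when the pass rate drops below 85%.
if report.pass_rate < 0.85:
    sys.exit(1)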
Compare vs a baseline run:
import json
from pathlib import Path

prev = EvalReport.from_dict(json.loads(Path("baseline.json").read_text()))
delta = report.compare(prev)
print(f"pass_rate Δ = {delta.pass_rate_delta:+.2%}")
Filter by tag (e.g. only the adversarial subset):
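for c in report.filter_by_evaluator("hallucination"):
    if not c.passed and "adversarial:ungrounded_claim" in c.tags:
        # results holds one entry per evaluator, so pick the hallucination one.
        reason = next(r.reason for r in c.results if r.evaluator == "hallucination")
        print(c.case_input, reason)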