Writing assertions is the easy part of LLM evaluation. Knowing what to measure for your specific product is the hard part, and multivon-eval bootstrap answers that question. Hand it a one-paragraph product description and a few real traces, and it emits a runnable EvalSuite with metrics tuned to your shape, thresholds calibrated from your data, and adversarial seed cases targeting the most likely failure mode.
The whole flow
pip install multivon-eval
multivon-eval bootstrap \
--product product.md \
--traces traces.jsonl \
--output ./eval-bootstrap/
You get five artifacts:
| File | What it is |
|---|---|
| eval_suite.py | Runnable suite with 4–6 evaluators picked for your product shape |
| seed_cases.jsonl | 30 adversarial seed cases targeting the primary failure mode |
| thresholds.yaml | Per-evaluator thresholds calibrated from your traces (p25 of scores) |
| DISCOVERY_REPORT.md | An eval design review you can forward to your team |
| prompt_baseline.json | Snapshot of every prompt call site in your repo, written at the --repo root (default .); the input to multivon-eval staleness |
Total cost: ~$0.12 per bootstrap on default settings, or **$0.00 with --judge-provider ollama**. Latency: a few minutes on default settings (measured ~4 min, almost entirely judge-API wait), with progress lines printing to stderr as each stage starts. The local-judge path runs about 5× that wall-clock time.
As of 0.9.4 the emitted eval_suite.py is genuinely runnable end-to-end: python eval_suite.py --runs 1 executes the full suite without further edits. Two things to know before that first run. It makes real judge calls (30 cases × your LLM evaluators ≈ 120 calls, a few cents on Haiku), and its main() exits 1 when the pass rate is under 50% — which is expected against the placeholder stub_model. The all-fail run is the signal to wire in your real model. 0.9.4 also fixed a stale suite.run(cases=...) kwarg and report.print_summary() call that previously broke the generated file; if you hit those on a 0.9.3-and-earlier install, upgrade.
Scaled, gated generation (0.12.0)
--n-seed-cases now goes to 500. Generation runs in batches of ≤30, with
later batches steered away from already-accepted inputs, and every case
passes gates before acceptance: well-formed (structural), duplicate
(NFC-normalized identity or token-Jaccard ≥ 0.85, across batches), and, with
--validate-cases --baseline-model-file model.py, the hardness band from
auto.validate_adversarial_cases (judge-priced, so opt-in).
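The duplicate gate described above can be sketched in a few lines. This is a minimal illustration, not the library's implementation: the function names are hypothetical, and the tokenizer (lowercase, whitespace split) is an assumption, since the source only states the NFC identity check and the 0.85 token-Jaccard cutoff.

```python
import unicodedata


def nfc(text: str) -> str:
    """Return the NFC form so composed and decomposed characters compare equal."""
    return unicodedata.normalize("NFC", text)


def tokens(text: str) -> set[str]:
    # Assumption: lowercase + whitespace split; the real tokenizer is not documented here.
    return set(nfc(text).lower().split())


def token_jaccard(a: str, b: str) -> float:
    """Size of the token intersection divided by the size of the token union."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 1.0  # two empty inputs are treated as identical
    return len(ta & tb) / len(ta | tb)


def is_duplicate(candidate: str, accepted: list[str], threshold: float = 0.85) -> bool:
    """True if candidate matches any accepted input by NFC identity or Jaccard >= threshold."""
    return any(
        nfc(candidate) == nfc(prev) or token_jaccard(candidate, prev) >= threshold
        for prev in accepted
    )


# Inputs accepted in earlier batches are checked too, which is what "across batches" implies.
accepted = ["Please tell me how long the refund window is for shoes"]
print(is_duplicate("Please tell me how long the refund window is for shoes today", accepted))
print(is_duplicate("Can I return electronics after 20 days?", accepted))
```

The first candidate shares 11 of 12 distinct tokens with the accepted input, so it lands above the cutoff; the second shares none.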
No silent caps. The CLI summary and DISCOVERY_REPORT.md print the full
accounting — generated 500, accepted 431 — dropped 38 duplicates, 12 malformed, 19 outside hardness band [0.5, 1.0] — and a skipped hardness gate
says so explicitly. Each accepted case’s metadata["generation"] records its
batch, the gates it passed, and its hardness score. --budget-usd (default
2.00) is a pre-spend ceiling: the cost estimate is checked before the first
LLM call.
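To make the pre-spend ceiling concrete, here is a rough sketch of a budget check that runs before any generation call. Everything in it is hypothetical: the names `estimate_cost_usd`, `check_budget`, and `BudgetExceeded` are invented for illustration, and the flat $0.02-per-batch rate is an assumption loosely based on the seed-case cost quoted in the configuration reference, not the tool's actual pricing model.

```python
import math


class BudgetExceeded(RuntimeError):
    """Raised before any spend when the estimate is over the ceiling."""


def estimate_cost_usd(n_cases: int, batch_size: int = 30, usd_per_batch: float = 0.02) -> float:
    # Hypothetical pricing: one generation call per batch at a flat illustrative rate.
    return math.ceil(n_cases / batch_size) * usd_per_batch


def check_budget(n_cases: int, budget_usd: float = 2.00) -> float:
    """Return the estimate if it fits the budget; raise before any LLM call otherwise."""
    estimate = estimate_cost_usd(n_cases)
    if estimate > budget_usd:
        raise BudgetExceeded(f"estimated ${estimate:.2f} exceeds budget ${budget_usd:.2f}")
    return estimate


# 500 cases in batches of 30 means 17 batches under this sketch's assumptions.
print(f"estimate: ${check_budget(500):.2f}")
```

The point of the design is ordering: the check raises before the first call is made, so an oversized request costs nothing.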
# Product
AcmeCorp customer support bot. Answers questions about return policies,
shipping windows, and refund timelines using a retrieval-augmented knowledge
base.
# Inputs / Outputs
Input: free-form question in English.
Output: 1-3 sentence answer grounded in the retrieved policy text.
# Known risks
- Hallucinating policy details not in the retrieved context
- Going off-topic into product recommendations (out of scope)
- Echoing customer PII back into the response unnecessarily
Suggested sections (none are required): # Product, # Users, # Inputs / Outputs, # Known risks. The LLM uses the description to ground its metric recommendations in your domain.
traces.jsonl — newline-delimited JSON, max 10 000 rows
Only input is required. Optional keys are detected and used to infer product shape:
{"input": "What's the refund window?", "context": "Refunds within 30 days.", "output": "30 days from purchase.", "expected_output": "30 days"}
{"input": "Can I return electronics after 20 days?", "context": "Electronics: 14-day window.", "output": "No, electronics have a 14-day return window."}
{"input": "How long does it take to receive my refund?", "context": "Refunds processed in 5-7 business days.", "output": "5-7 business days after we receive your item."}
| Optional key | Triggers |
|---|---|
| context | RAG-shape detection → adds Faithfulness, Hallucination |
| expected_output | Enables threshold calibration + AnswerAccuracy / ContextRecall |
| expected_tool_calls | Agent-shape detection → adds ToolCallAccuracy |
| conversation | Multi-turn detection → adds ConversationRelevance |
| metadata.image_url | Multimodal detection → adds VQAFaithfulness |
If you don’t have output traces yet (pre-launch), the tool still proposes evaluators based on the product description alone — threshold calibration is the only step that’s skipped.
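The key-based detection in the table above can be approximated as follows. This is a sketch under assumptions: the shape labels (`rag`, `agent`, `multi_turn`, `multimodal`) and both function names are hypothetical, and the real detector may weigh keys differently (for example, by how many rows carry them).

```python
import json

# Hypothetical shape labels; the optional keys come from the table above.
KEY_TO_SHAPE = {
    "context": "rag",
    "expected_tool_calls": "agent",
    "conversation": "multi_turn",
}


def detect_shapes(traces: list[dict]) -> set[str]:
    """Collect every shape signalled by at least one trace."""
    shapes = set()
    for trace in traces:
        for key, shape in KEY_TO_SHAPE.items():
            if trace.get(key):
                shapes.add(shape)
        metadata = trace.get("metadata")
        if isinstance(metadata, dict) and metadata.get("image_url"):
            shapes.add("multimodal")
    return shapes


def has_expected_output(traces: list[dict]) -> bool:
    """True if any trace carries an expected_output value."""
    return any(trace.get("expected_output") for trace in traces)


raw = [
    '{"input": "What\'s the refund window?", "context": "Refunds within 30 days.", '
    '"output": "30 days from purchase.", "expected_output": "30 days"}',
    '{"input": "Can I return electronics after 20 days?", '
    '"context": "Electronics: 14-day window.", "output": "No."}',
]
traces = [json.loads(line) for line in raw]
print(detect_shapes(traces), has_expected_output(traces))
```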
load_traces accepts your existing dump shape. As of 0.9.4, the loader auto-aliases field names from LangSmith (query → input, answer → output, retrieved_context → context), LangFuse (prompt / completion), and Phoenix (input / output). It prints a one-line summary to stderr — loaded 287/300 traces · renamed 814 fields · skipped 13 (missing input) — so you can see exactly what was renamed and what was dropped. The silent-skip behavior from earlier releases is gone.
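A simplified picture of the aliasing step, for readers who want to predict what the loader will do to their dump. This is not `load_traces` itself: `load_rows` and the `ALIASES` table are hypothetical, the mapping of LangFuse's prompt / completion to input / output is an assumption, and the real loader may resolve conflicts differently.

```python
import json
import sys

# Alias map based on the renames listed above; prompt/completion are
# assumed to map to input/output.
ALIASES = {
    "query": "input",
    "answer": "output",
    "retrieved_context": "context",
    "prompt": "input",
    "completion": "output",
}


def load_rows(lines):
    """Normalize field names, skip rows without an input, and report what happened."""
    loaded, total, renamed, skipped = [], 0, 0, 0
    for line in lines:
        if not line.strip():
            continue  # blank lines are not counted as traces
        total += 1
        row = json.loads(line)
        normalized = {}
        for key, value in row.items():
            target = ALIASES.get(key, key)
            if target != key and target not in row:
                normalized[target] = value  # rename only when the canonical key is absent
                renamed += 1
            else:
                normalized[key] = value
        if not normalized.get("input"):
            skipped += 1
            continue
        loaded.append(normalized)
    summary = (
        f"loaded {len(loaded)}/{total} traces · renamed {renamed} fields · "
        f"skipped {skipped} (missing input)"
    )
    print(summary, file=sys.stderr)
    return loaded, summary


rows, summary = load_rows([
    '{"query": "q1", "answer": "a1", "retrieved_context": "c1"}',
    '{"prompt": "p2", "completion": "c2"}',
    '{"output": "orphan"}',
])
```

Keeping an existing canonical key untouched (rather than overwriting it with an alias) is one reasonable choice; it means a row that already has `input` never loses it.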
How the recommendation works
Three layers, in order:
- Heuristic anchor. auto_evaluators() inspects the trace shape and picks a deterministic starting set: Faithfulness + Hallucination for RAG, ToolCallAccuracy for agents, etc. This runs in microseconds, no LLM call, and provides the safety net.
- LLM proposal. A single call to Claude Haiku (configurable via --judge-provider / --judge-model) reads your product description + trace summary + a sample of traces, then proposes 4–6 evaluators with per-metric rationale and threshold suggestions. The LLM is constrained to an enumerated allow-list of evaluator names; it cannot invent metrics, and any proposal outside the list is silently dropped.
- Merge + threshold calibration. LLM picks take priority for the same evaluator name (more contextual); heuristic picks fill in tiers the LLM missed (e.g. NotEmpty as a guardrail). Thresholds are then re-tuned by running each evaluator over a sample of your traces (50 for deterministic evaluators, 15 for LLM-judge evaluators) and setting the threshold to the 25th percentile of observed scores.
The whole pipeline runs in one Haiku call for proposal + one Haiku call for seed cases + minimal trace-side scoring. Total cost: roughly $0.12 on default settings.
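The merge and calibration step is easy to reason about with a small worked example. The sketch below is illustrative only: `merge_picks` and `calibrate` are hypothetical names, the threshold values are made up, and the percentile method (linear interpolation) is an assumption because the exact method the tool uses is not stated.

```python
import statistics


def merge_picks(heuristic: dict[str, float], llm: dict[str, float]) -> dict[str, float]:
    """LLM picks win on a name clash; heuristic picks fill the gaps."""
    return {**heuristic, **llm}


def calibrate(scores: list[float], fallback: float) -> float:
    """Return the 25th percentile of observed scores, or the fallback with too little data."""
    if len(scores) < 2:
        return fallback
    # Assumption: linear interpolation ("inclusive"); the tool's percentile method is not stated.
    return statistics.quantiles(scores, n=4, method="inclusive")[0]


# Evaluator name -> proposed threshold (illustrative values).
proposed = merge_picks(
    heuristic={"Faithfulness": 0.5, "NotEmpty": 1.0},
    llm={"Faithfulness": 0.8, "AnswerAccuracy": 0.7},
)

# Scores observed when running an evaluator over sampled traces (made-up data).
observed = {"Faithfulness": [0.9, 0.6, 1.0, 0.7, 0.8]}

thresholds = {
    name: calibrate(observed.get(name, []), fallback)
    for name, fallback in proposed.items()
}
print(thresholds)
```

Setting the threshold at p25 means roughly three quarters of your current traces pass on day one, so a failing run signals a regression rather than a miscalibrated bar.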
PII handling — no surprises
Every trace is locally scanned for high-confidence secrets and PII before any data leaves your machine: AWS keys, OpenAI / Anthropic / GitHub tokens, JWTs, private-key PEM headers, US SSNs, emails, Luhn-valid credit cards, and more. Three policies:
| --pii-policy | Behavior |
|---|---|
| redact (default) | Detections are replaced with [REDACTED:<label>] before being sent to the judge. Logged in the report. |
| strict | Abort with a non-zero exit code if any high-confidence detection fires. For procurement / regulated environments. |
| allow | Send raw traces, with an explicit terminal-prompt confirmation. Use only when you’re sure your traces are clean. |
If PII was detected and redacted, the report’s ## Trace evidence section surfaces the counts (email=12, ssn=2) and the bootstrap pipeline auto-adds PIIEvaluator as a guardrail in the generated suite.
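For intuition about what the redact policy does, here is a small local-scan sketch covering three of the detectors named above (emails, US SSNs, Luhn-valid credit cards). The regexes are deliberately simple and are not the library's patterns; the `credit_card` label and the `redact` function name are assumptions, while `email` and `ssn` follow the labels shown in the report counts.

```python
import re
from collections import Counter

EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
CARD = re.compile(r"\b(?:\d[ -]?){12,18}\d\b")  # 13 to 19 digits, optional separators


def luhn_valid(number: str) -> bool:
    """Luhn checksum: double every second digit from the right, sum, check mod 10."""
    digits = [int(c) for c in number if c.isdigit()]
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def redact(text: str) -> tuple[str, Counter]:
    """Replace detections with [REDACTED:<label>] and count them per label."""
    counts: Counter = Counter()

    def substitute(label, pattern, source, validator=None):
        def replace(match):
            if validator and not validator(match.group()):
                return match.group()  # leave digit runs that fail the checksum
            counts[label] += 1
            return f"[REDACTED:{label}]"
        return pattern.sub(replace, source)

    text = substitute("email", EMAIL, text)
    text = substitute("ssn", SSN, text)
    text = substitute("credit_card", CARD, text, luhn_valid)  # label is an assumption
    return text, counts


clean, counts = redact("Reach me at jane@example.com, SSN 123-45-6789, card 4111 1111 1111 1111.")
print(clean)
print(dict(counts))
```

The Luhn check is what keeps order numbers and tracking IDs from being flagged as cards: a 16-digit run is only redacted if its checksum passes.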
The discovery report
Codex’s review of the design doc flagged this: “The most valuable output may be DISCOVERY_REPORT.md, not the generated eval code.” The report is built to be forwarded inside your team as an eval design review: what shape your product is, which metrics matter, what was rejected, and why. Many users treat it as the artifact that gets a launch greenlit, even when they edit the generated eval_suite.py heavily afterward.
Drift detection comes free — --repo (0.10.0)
As of 0.10.0, bootstrap also sets up prompt-drift staleness detection, at no extra cost or latency:
- --repo PATH (default .) tells bootstrap which app repo to scan for prompt call sites. It writes prompt_baseline.json at that root: the committed snapshot that multivon-eval staleness later diffs against.
- Every generated case is stamped with metadata._provenance: authored_by="bootstrap", the repo SHA, and targets=[]. The empty targets list means “authored against this repo state”, and nothing more. Bootstrap generates cases from your product description and traces; it knows nothing about call sites, so case→site bindings are never fabricated. Bind cases explicitly later with multivon-eval staleness stamp.
- The completion checklist gains one line confirming both: ✓ baseline + provenance stamped: 14 call site(s) @ a1b2c3d
Stamping flows through the existing case-serialization path and never
perturbs suite.lock (the lockfile’s cases hash excludes metadata by
design). From that point on, a zero-arg multivon-eval staleness . in the
repo reports which prompts changed since the suite was bootstrapped.
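Why stamping cannot disturb the lockfile is easiest to see in code. The sketch below illustrates the idea of a cases hash that excludes metadata; it is not the real suite.lock algorithm. The function names, the SHA-256 choice, the JSON canonicalization, and the `repo_sha` key name are all assumptions made for the example.

```python
import hashlib
import json


def cases_hash(cases: list[dict]) -> str:
    """Hash cases with the metadata field removed, so stamps never change the digest."""
    stripped = [{k: v for k, v in case.items() if k != "metadata"} for case in cases]
    payload = json.dumps(stripped, sort_keys=True, ensure_ascii=False).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def stamp(case: dict, repo_sha: str) -> dict:
    """Return a copy of the case with a provenance stamp; targets stays empty."""
    stamped = dict(case)
    metadata = dict(stamped.get("metadata") or {})
    metadata["_provenance"] = {
        "authored_by": "bootstrap",
        "repo_sha": repo_sha,  # hypothetical key name for the repo SHA
        "targets": [],         # no call-site bindings are invented
    }
    stamped["metadata"] = metadata
    return stamped


cases = [{"input": "What's the refund window?", "expected_output": "30 days"}]
stamped_cases = [stamp(case, "a1b2c3d") for case in cases]
print(cases_hash(cases) == cases_hash(stamped_cases))
```

Changing the case input, by contrast, does change the digest, which is the behavior a lockfile needs.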
What’s NOT in the bootstrap (and why)
- No model-side wrapper. The bootstrap emits a suite that calls your model_fn; you bring the wiring to your real model. No vendor lock-in.
- No “monitor production for me” loop. The bootstrap is a setup-time tool: once you have the suite, you run it with python eval_suite.py or wire it into CI via --fail-threshold.
- No template marketplace. Picking your suite from a 20-template menu is the old approach; the bootstrap picks for you and tells you why.
- No dashboard. Local-first by design. Use multivon-eval view report.json if you want a browsable HTML report.
When to use it (and when not)
Use it when:
- You’re starting a new LLM feature and don’t yet have an eval suite.
- You inherited an LLM product with no evals and need a credible starting point.
- You’re switching evaluation tools and want a clean baseline.
- You’re preparing for a launch and want a forwardable “what we eval and why” document for your team.
Don’t use it when:
- You have a mature, hand-tuned eval suite and just want to add one more evaluator. Use add_evaluators() directly.
- You only need a deterministic check (length, regex, schema). The bootstrap is overkill; write the assertion in two lines.
Configuration reference
multivon-eval bootstrap --help
| Flag | Default | Notes |
|---|---|---|
| --product, -p | required | Path to product description markdown |
| --traces, -t | required | Path to traces JSONL |
| --output, -o | ./eval-bootstrap | Output directory (created if absent) |
| --judge-model | claude-haiku-4-5-20251001 | Judge model for the proposal call |
| --judge-provider | anthropic | One of anthropic, openai, google, ollama, litellm |
| --judge-base-url | unset | OpenAI-compatible base URL (vLLM, LM Studio, custom Ollama host) |
| --n-seed-cases | 30 | Adversarial cases to generate (set to 0 to skip) |
| --pii-policy | redact | One of redact, strict, allow |
| --skip-seed-cases | off | Save ~$0.02 if you don’t need adversarial cases |
| --skip-calibration | off | Use proposed thresholds without trace-side scoring |
| --repo | . | App repo to scan for prompt call sites; writes prompt_baseline.json there and stamps generated cases with repo-state provenance (targets=[]) |
Local judge — run bootstrap fully offline
The whole pipeline routes through make_judge_call as of 0.9.4 / 0.9.6 — including the adversarial seed-case generator (auto.py). That means --judge-provider ollama (or litellm) runs end-to-end, not just at argparse level.
multivon-eval bootstrap \
--judge-provider ollama \
--judge-model qwen2.5:14b \
--product product.md \
--traces traces.jsonl
- The OLLAMA_HOST env var controls the daemon address (default 127.0.0.1:11434).
- --judge-base-url http://localhost:8000/v1 overrides it for vLLM, LM Studio, or a remote Ollama.
- Wall-clock is roughly 5× the Haiku default at default --n-seed-cases 30. Cost: $0.00.
- Cases generated under a local judge carry metadata['judge_used'] = "ollama:qwen2.5:14b" and metadata['prompt_version'] for downstream replay.
Programmatic API
If you’d rather invoke the bootstrap from Python (e.g. inside a Jupyter notebook or CI pipeline), the same pipeline lives in multivon_eval.bootstrap:
from multivon_eval import bootstrap, JudgeConfig
result = bootstrap(
description_path="product.md",
traces_path="traces.jsonl",
output_dir="./eval-bootstrap",
judge=JudgeConfig(provider="anthropic", model="claude-haiku-4-5-20251001"),
pii_policy="redact",
n_seed_cases=30,
)
print(f"shape: {result.shape}")
print(f"evaluators: {[r.name for r in result.evaluators]}")
print(f"cost: ${result.cost_usd:.4f}")
print(f"artifacts: {result.artifacts}")
See also
- Intelligent eval primitives: full reference for the three primitives the bootstrap composes (auto_evaluators, generate_adversarial_cases, validate_adversarial_cases). Use these directly when you want fine-grained control or want to compose them into your own pipeline.
- Synthetic data generation: the higher-level generate_from_file / generate_from_text helpers, when you want generation without targeting a specific failure mode.
- Prompt-drift staleness: the drift report that consumes the prompt_baseline.json and provenance stamps bootstrap writes.
- Quickstart: the manual path, where you write EvalCase objects directly.
- /eval-bootstrap Claude Code skill: the auto-invoking wrapper around this CLI for Claude Code users. Install with multivon-eval install-skills.