Skip to main content
Writing assertions is the easy part of LLM evaluation. Knowing what to measure for your specific product is the hard part, and multivon-eval bootstrap answers that question. Hand it a one-paragraph product description and a few real traces, and it emits a runnable EvalSuite with metrics tuned to your shape, thresholds calibrated from your data, and adversarial seed cases targeting the most likely failure mode.

The whole flow

pip install multivon-eval
multivon-eval bootstrap \
  --product product.md \
  --traces traces.jsonl \
  --output ./eval-bootstrap/
You get five artifacts:
FileWhat it is
eval_suite.pyRunnable suite with 4–6 evaluators picked for your product shape
seed_cases.jsonl30 adversarial seed cases targeting the primary failure mode
thresholds.yamlPer-evaluator thresholds calibrated from your traces (p25 of scores)
DISCOVERY_REPORT.mdAn eval design review you can forward to your team
prompt_baseline.jsonSnapshot of every prompt call site in your repo (written at the --repo root, default .) — the input to multivon-eval staleness
Total cost: ~0.12perbootstrapondefaultsettings,or0.12 per bootstrap on default settings, or **0.00 with --judge-provider ollama**. Latency: a few minutes on default settings — measured ~4 min, almost entirely judge-API wait, with progress lines printing to stderr as each stage starts. The local-judge path runs about 5× that wall-clock. As of 0.9.4 the emitted eval_suite.py is genuinely runnable end-to-end: python eval_suite.py --runs 1 executes the full suite without further edits. Two things to know before that first run. It makes real judge calls (30 cases × your LLM evaluators ≈ 120 calls, a few cents on Haiku), and its main() exits 1 when the pass rate is under 50% — which is expected against the placeholder stub_model. The all-fail run is the signal to wire in your real model. 0.9.4 also fixed a stale suite.run(cases=...) kwarg and report.print_summary() call that previously broke the generated file; if you hit those on a 0.9.3-and-earlier install, upgrade.

Scaled, gated generation (0.12.0)

--n-seed-cases now goes to 500. Generation runs in batches of ≤30, with later batches steered away from already-accepted inputs, and every case passes gates before acceptance: well-formed (structural), duplicate (NFC-normalized identity or token-Jaccard ≥ 0.85, across batches), and, with --validate-cases --baseline-model-file model.py, the hardness band from auto.validate_adversarial_cases (judge-priced, so opt-in). No silent caps. The CLI summary and DISCOVERY_REPORT.md print the full accounting — generated 500, accepted 431 — dropped 38 duplicates, 12 malformed, 19 outside hardness band [0.5, 1.0] — and a skipped hardness gate says so explicitly. Each accepted case’s metadata["generation"] records its batch, the gates it passed, and its hardness score. --budget-usd (default 2.00) is a pre-spend ceiling: the cost estimate is checked before the first LLM call.

What the inputs look like

product.md — free-form markdown, under 5000 words

# Product

AcmeCorp customer support bot. Answers questions about return policies,
shipping windows, and refund timelines using a retrieval-augmented knowledge
base.

# Inputs / Outputs

Input: free-form question in English.
Output: 1-3 sentence answer grounded in the retrieved policy text.

# Known risks

- Hallucinating policy details not in the retrieved context
- Going off-topic into product recommendations (out of scope)
- Echoing customer PII back into the response unnecessarily
Suggested sections (none are required): # Product, # Users, # Inputs / Outputs, # Known risks. The LLM uses the description to ground its metric recommendations in your domain.

traces.jsonl — newline-delimited JSON, max 10 000 rows

Only input is required. Optional keys are detected and used to infer product shape:
{"input": "What's the refund window?", "context": "Refunds within 30 days.", "output": "30 days from purchase.", "expected_output": "30 days"}
{"input": "Can I return electronics after 20 days?", "context": "Electronics: 14-day window.", "output": "No, electronics have a 14-day return window."}
{"input": "How long does it take to receive my refund?", "context": "Refunds processed in 5-7 business days.", "output": "5-7 business days after we receive your item."}
Optional keyTriggers
contextRAG-shape detection → adds Faithfulness, Hallucination
expected_outputEnables threshold calibration + AnswerAccuracy / ContextRecall
expected_tool_callsAgent-shape detection → adds ToolCallAccuracy
conversationMulti-turn detection → adds ConversationRelevance
metadata.image_urlMultimodal detection → adds VQAFaithfulness
If you don’t have output traces yet (pre-launch), the tool still proposes evaluators based on the product description alone — threshold calibration is the only step that’s skipped.
load_traces accepts your existing dump shape. As of 0.9.4, the loader auto-aliases field names from LangSmith (queryinput, answeroutput, retrieved_contextcontext), LangFuse (prompt / completion), and Phoenix (input / output). It prints a one-line summary to stderr — loaded 287/300 traces · renamed 814 fields · skipped 13 (missing input) — so you can see exactly what was renamed and what was dropped. The silent-skip behavior from earlier releases is gone.

How the recommendation works

Three layers, in order:
  1. Heuristic anchor. auto_evaluators() inspects the trace shape and picks a deterministic starting set — Faithfulness + Hallucination for RAG, ToolCallAccuracy for agents, etc. This runs in microseconds, no LLM call, and provides the safety net.
  2. LLM proposal. A single call to Claude Haiku (configurable via --judge-provider / --judge-model) reads your product description + trace summary + a sample of traces, then proposes 4–6 evaluators with per-metric rationale and threshold suggestions. The LLM is constrained to an enumerated allow-list of evaluator names — it cannot invent metrics, and any proposal outside the list is silently dropped.
  3. Merge + threshold calibration. LLM picks take priority for the same evaluator name (more contextual); heuristic picks fill in tiers the LLM missed (e.g. NotEmpty as a guardrail). Thresholds are then re-tuned by running each evaluator over a sample of your traces (50 for deterministic evaluators, 15 for LLM-judge evaluators) and setting the threshold to the 25th percentile of observed scores.
The whole pipeline runs in one Haiku call for proposal + one Haiku call for seed cases + minimal trace-side scoring. Total cost: roughly $0.12 on default settings.

PII handling — no surprises

Every trace is locally scanned for high-confidence secrets and PII before any data leaves your machine: AWS keys, OpenAI / Anthropic / GitHub tokens, JWTs, private-key PEM headers, US SSNs, emails, Luhn-valid credit cards, and more. Three policies:
--pii-policyBehavior
redact (default)Detections are replaced with [REDACTED:<label>] before being sent to the judge. Logged in the report.
strictAbort with a non-zero exit code if any high-confidence detection fires. For procurement / regulated environments.
allowSend raw traces, with an explicit terminal-prompt confirmation. Use only when you’re sure your traces are clean.
If PII was detected and redacted, the report’s ## Trace evidence section surfaces the counts (email=12, ssn=2) and the bootstrap pipeline auto-adds PIIEvaluator as a guardrail in the generated suite.

The discovery report

Codex’s review of the design doc flagged this: “The most valuable output may be DISCOVERY_REPORT.md, not the generated eval code.” The report is built to be forwarded inside your team as an eval design review: what shape your product is, which metrics matter, what was rejected, and why. Many users treat it as the artifact that gets a launch greenlit, even when they edit the generated eval_suite.py heavily afterward.

Drift detection comes free — --repo (0.10.0)

As of 0.10.0, bootstrap also sets up prompt-drift staleness detection, at no extra cost or latency:
  • --repo PATH (default .) tells bootstrap which app repo to scan for prompt call sites. It writes prompt_baseline.json at that root — the committed snapshot that multivon-eval staleness later diffs against.
  • Every generated case is stamped with metadata._provenance: authored_by="bootstrap", the repo SHA, and targets=[]. The empty targets list means “authored against this repo state”, and nothing more. Bootstrap generates cases from your product description and traces; it knows nothing about call sites, so case→site bindings are never fabricated. Bind cases explicitly later with multivon-eval staleness stamp.
  • The completion checklist gains one line confirming both:
  ✓ baseline + provenance stamped: 14 call site(s) @ a1b2c3d
Stamping flows through the existing case-serialization path and never perturbs suite.lock (the lockfile’s cases hash excludes metadata by design). From that point on, a zero-arg multivon-eval staleness . in the repo reports which prompts changed since the suite was bootstrapped.

What’s NOT in the bootstrap (and why)

  • No model-side wrapper. The bootstrap emits a suite that calls your model_fn; you bring the wiring to your real model. No vendor lock-in.
  • There’s no “monitor production for me” loop. The bootstrap is a setup-time tool: once you have the suite, you run it with python eval_suite.py or wire it into CI via --fail-threshold.
  • No template marketplace, either. Picking your suite from a 20-template menu is the old approach; the bootstrap picks for you and tells you why.
  • No dashboard. Local-first by design. Use multivon-eval view report.json if you want a browsable HTML report.

When to use it (and when not)

Use it when:
  • You’re starting a new LLM feature and don’t yet have an eval suite.
  • You inherited an LLM product with no evals and need a credible starting point.
  • You’re switching evaluation tools and want a clean baseline.
  • You’re preparing for a launch and want a forwardable “what we eval and why” document for your team.
Don’t use it when:
  • You have a mature, hand-tuned eval suite and just want to add one more evaluator. Use add_evaluators() directly.
  • You only need a deterministic check (length, regex, schema). The bootstrap is overkill — write the assertion in two lines.

Configuration reference

multivon-eval bootstrap --help
FlagDefaultNotes
--product, -prequiredPath to product description markdown
--traces, -trequiredPath to traces JSONL
--output, -o./eval-bootstrapOutput directory (created if absent)
--judge-modelclaude-haiku-4-5-20251001Judge model for the proposal call
--judge-provideranthropicanthropic | openai | google | ollama | litellm
--judge-base-urlunsetOpenAI-compatible base URL (vLLM, LM Studio, custom Ollama host)
--n-seed-cases30Adversarial cases to generate (set to 0 to skip)
--pii-policyredactredact | strict | allow
--skip-seed-casesoffSave ~$0.02 if you don’t need adversarial cases
--skip-calibrationoffUse proposed thresholds without trace-side scoring
--repo.App repo to scan for prompt call sites — writes prompt_baseline.json there and stamps generated cases with repo-state provenance (targets=[])

Local judge — run bootstrap fully offline

The whole pipeline routes through make_judge_call as of 0.9.4 / 0.9.6 — including the adversarial seed-case generator (auto.py). That means --judge-provider ollama (or litellm) runs end-to-end, not just at argparse level.
multivon-eval bootstrap \
  --judge-provider ollama \
  --judge-model qwen2.5:14b \
  --product product.md \
  --traces traces.jsonl
  • OLLAMA_HOST env var controls the daemon address (default 127.0.0.1:11434).
  • --judge-base-url http://localhost:8000/v1 overrides for vLLM, LM Studio, or a remote Ollama.
  • Wall-clock is roughly 5× the Haiku default at default --n-seed-cases 30. Cost: $0.00.
  • Cases generated under a local judge carry metadata['judge_used'] = "ollama:qwen2.5:14b" and metadata['prompt_version'] for downstream replay.

Programmatic API

If you’d rather invoke the bootstrap from Python (e.g. inside a Jupyter notebook or CI pipeline), the same pipeline lives in multivon_eval.bootstrap:
from multivon_eval import bootstrap, JudgeConfig

result = bootstrap(
    description_path="product.md",
    traces_path="traces.jsonl",
    output_dir="./eval-bootstrap",
    judge=JudgeConfig(provider="anthropic", model="claude-haiku-4-5-20251001"),
    pii_policy="redact",
    n_seed_cases=30,
)

print(f"shape: {result.shape}")
print(f"evaluators: {[r.name for r in result.evaluators]}")
print(f"cost: ${result.cost_usd:.4f}")
print(f"artifacts: {result.artifacts}")

See also

  • Intelligent eval primitives — full reference for the three primitives the bootstrap composes (auto_evaluators, generate_adversarial_cases, validate_adversarial_cases). Use these directly when you want fine-grained control or want to compose them into your own pipeline.
  • Synthetic data generation — the higher-level generate_from_file / generate_from_text helpers, when you want generation without targeting a specific failure mode.
  • Prompt-drift staleness — the drift report that consumes the prompt_baseline.json and provenance stamps bootstrap writes.
  • Quickstart — the manual path: write EvalCase objects directly.
  • /eval-bootstrap Claude Code skill — the auto-invoking wrapper around this CLI for Claude Code users. Install with multivon-eval install-skills.