Bootstrap an eval suite from your product

Writing assertions is the easy part of LLM evaluation. Knowing what to measure for your specific product is the hard part, and multivon-eval bootstrap answers that question. Hand it a one-paragraph product description and a few real traces, and it emits a runnable EvalSuite with metrics tuned to your shape, thresholds calibrated from your data, and adversarial seed cases targeting the most likely failure mode.

The whole flow

pip install multivon-eval
multivon-eval bootstrap \
  --product product.md \
  --traces traces.jsonl \
  --output ./eval-bootstrap/

You get five artifacts:

File	What it is
`eval_suite.py`	Runnable suite with 4–6 evaluators picked for your product shape
`seed_cases.jsonl`	30 adversarial seed cases targeting the primary failure mode
`thresholds.yaml`	Per-evaluator thresholds calibrated from your traces (p25 of scores)
`DISCOVERY_REPORT.md`	An eval design review you can forward to your team
`prompt_baseline.json`	Snapshot of every prompt call site in your repo (written at the `--repo` root, default `.`) — the input to `multivon-eval staleness`

Total cost: ~

0.12 per bootstrap on default settings, or **

0.00 with --judge-provider ollama**. Latency: a few minutes on default settings — measured ~4 min, almost entirely judge-API wait, with progress lines printing to stderr as each stage starts. The local-judge path runs about 5× that wall-clock. As of 0.9.4 the emitted eval_suite.py is genuinely runnable end-to-end: python eval_suite.py --runs 1 executes the full suite without further edits. Two things to know before that first run. It makes real judge calls (30 cases × your LLM evaluators ≈ 120 calls, a few cents on Haiku), and its main() exits 1 when the pass rate is under 50% — which is expected against the placeholder stub_model. The all-fail run is the signal to wire in your real model. 0.9.4 also fixed a stale suite.run(cases=...) kwarg and report.print_summary() call that previously broke the generated file; if you hit those on a 0.9.3-and-earlier install, upgrade.

Scaled, gated generation (0.12.0)

--n-seed-cases now goes to 500. Generation runs in batches of ≤30, with later batches steered away from already-accepted inputs, and every case passes gates before acceptance: well-formed (structural), duplicate (NFC-normalized identity or token-Jaccard ≥ 0.85, across batches), and, with --validate-cases --baseline-model-file model.py, the hardness band from auto.validate_adversarial_cases (judge-priced, so opt-in). No silent caps. The CLI summary and DISCOVERY_REPORT.md print the full accounting — generated 500, accepted 431 — dropped 38 duplicates, 12 malformed, 19 outside hardness band [0.5, 1.0] — and a skipped hardness gate says so explicitly. Each accepted case’s metadata["generation"] records its batch, the gates it passed, and its hardness score. --budget-usd (default 2.00) is a pre-spend ceiling: the cost estimate is checked before the first LLM call.

What the inputs look like

`product.md` — free-form markdown, under 5000 words

# Product

AcmeCorp customer support bot. Answers questions about return policies,
shipping windows, and refund timelines using a retrieval-augmented knowledge
base.

# Inputs / Outputs

Input: free-form question in English.
Output: 1-3 sentence answer grounded in the retrieved policy text.

# Known risks

- Hallucinating policy details not in the retrieved context
- Going off-topic into product recommendations (out of scope)
- Echoing customer PII back into the response unnecessarily

Suggested sections (none are required): # Product, # Users, # Inputs / Outputs, # Known risks. The LLM uses the description to ground its metric recommendations in your domain.

`traces.jsonl` — newline-delimited JSON, max 10 000 rows

Only input is required. Optional keys are detected and used to infer product shape:

{"input": "What's the refund window?", "context": "Refunds within 30 days.", "output": "30 days from purchase.", "expected_output": "30 days"}
{"input": "Can I return electronics after 20 days?", "context": "Electronics: 14-day window.", "output": "No, electronics have a 14-day return window."}
{"input": "How long does it take to receive my refund?", "context": "Refunds processed in 5-7 business days.", "output": "5-7 business days after we receive your item."}

Optional key	Triggers
`context`	RAG-shape detection → adds Faithfulness, Hallucination
`expected_output`	Enables threshold calibration + AnswerAccuracy / ContextRecall
`expected_tool_calls`	Agent-shape detection → adds ToolCallAccuracy
`conversation`	Multi-turn detection → adds ConversationRelevance
`metadata.image_url`	Multimodal detection → adds VQAFaithfulness

If you don’t have output traces yet (pre-launch), the tool still proposes evaluators based on the product description alone — threshold calibration is the only step that’s skipped.

load_traces accepts your existing dump shape. As of 0.9.4, the loader auto-aliases field names from LangSmith (query → input, answer → output, retrieved_context → context), LangFuse (prompt / completion), and Phoenix (input / output). It prints a one-line summary to stderr — loaded 287/300 traces · renamed 814 fields · skipped 13 (missing input) — so you can see exactly what was renamed and what was dropped. The silent-skip behavior from earlier releases is gone.

How the recommendation works

Three layers, in order:

Heuristic anchor. auto_evaluators() inspects the trace shape and picks a deterministic starting set — Faithfulness + Hallucination for RAG, ToolCallAccuracy for agents, etc. This runs in microseconds, no LLM call, and provides the safety net.
LLM proposal. A single call to Claude Haiku (configurable via --judge-provider / --judge-model) reads your product description + trace summary + a sample of traces, then proposes 4–6 evaluators with per-metric rationale and threshold suggestions. The LLM is constrained to an enumerated allow-list of evaluator names — it cannot invent metrics, and any proposal outside the list is silently dropped.
Merge + threshold calibration. LLM picks take priority for the same evaluator name (more contextual); heuristic picks fill in tiers the LLM missed (e.g. NotEmpty as a guardrail). Thresholds are then re-tuned by running each evaluator over a sample of your traces (50 for deterministic evaluators, 15 for LLM-judge evaluators) and setting the threshold to the 25th percentile of observed scores.

The whole pipeline runs in one Haiku call for proposal + one Haiku call for seed cases + minimal trace-side scoring. Total cost: roughly $0.12 on default settings.

PII handling — no surprises

Every trace is locally scanned for high-confidence secrets and PII before any data leaves your machine: AWS keys, OpenAI / Anthropic / GitHub tokens, JWTs, private-key PEM headers, US SSNs, emails, Luhn-valid credit cards, and more. Three policies:

`--pii-policy`	Behavior
`redact` (default)	Detections are replaced with `[REDACTED:<label>]` before being sent to the judge. Logged in the report.
`strict`	Abort with a non-zero exit code if any high-confidence detection fires. For procurement / regulated environments.
`allow`	Send raw traces, with an explicit terminal-prompt confirmation. Use only when you’re sure your traces are clean.

If PII was detected and redacted, the report’s ## Trace evidence section surfaces the counts (email=12, ssn=2) and the bootstrap pipeline auto-adds PIIEvaluator as a guardrail in the generated suite.

The discovery report

Codex’s review of the design doc flagged this: “The most valuable output may be DISCOVERY_REPORT.md, not the generated eval code.” The report is built to be forwarded inside your team as an eval design review: what shape your product is, which metrics matter, what was rejected, and why. Many users treat it as the artifact that gets a launch greenlit, even when they edit the generated eval_suite.py heavily afterward.

Drift detection comes free — `--repo` (0.10.0)

As of 0.10.0, bootstrap also sets up prompt-drift staleness detection, at no extra cost or latency:

--repo PATH (default .) tells bootstrap which app repo to scan for prompt call sites. It writes prompt_baseline.json at that root — the committed snapshot that multivon-eval staleness later diffs against.
Every generated case is stamped with metadata._provenance: authored_by="bootstrap", the repo SHA, and targets=[]. The empty targets list means “authored against this repo state”, and nothing more. Bootstrap generates cases from your product description and traces; it knows nothing about call sites, so case→site bindings are never fabricated. Bind cases explicitly later with multivon-eval staleness stamp.
The completion checklist gains one line confirming both:

  ✓ baseline + provenance stamped: 14 call site(s) @ a1b2c3d

Stamping flows through the existing case-serialization path and never perturbs suite.lock (the lockfile’s cases hash excludes metadata by design). From that point on, a zero-arg multivon-eval staleness . in the repo reports which prompts changed since the suite was bootstrapped.

What’s NOT in the bootstrap (and why)

No model-side wrapper. The bootstrap emits a suite that calls your model_fn; you bring the wiring to your real model. No vendor lock-in.
There’s no “monitor production for me” loop. The bootstrap is a setup-time tool: once you have the suite, you run it with python eval_suite.py or wire it into CI via --fail-threshold.
No template marketplace, either. Picking your suite from a 20-template menu is the old approach; the bootstrap picks for you and tells you why.
No dashboard. Local-first by design. Use multivon-eval view report.json if you want a browsable HTML report.

When to use it (and when not)

Use it when:

You’re starting a new LLM feature and don’t yet have an eval suite.
You inherited an LLM product with no evals and need a credible starting point.
You’re switching evaluation tools and want a clean baseline.
You’re preparing for a launch and want a forwardable “what we eval and why” document for your team.

Don’t use it when:

You have a mature, hand-tuned eval suite and just want to add one more evaluator. Use add_evaluators() directly.
You only need a deterministic check (length, regex, schema). The bootstrap is overkill — write the assertion in two lines.

Configuration reference

multivon-eval bootstrap --help

Flag	Default	Notes
`--product`, `-p`	required	Path to product description markdown
`--traces`, `-t`	required	Path to traces JSONL
`--output`, `-o`	`./eval-bootstrap`	Output directory (created if absent)
`--judge-model`	`claude-haiku-4-5-20251001`	Judge model for the proposal call
`--judge-provider`	`anthropic`	`anthropic` \| `openai` \| `google` \| `ollama` \| `litellm`
`--judge-base-url`	unset	OpenAI-compatible base URL (vLLM, LM Studio, custom Ollama host)
`--n-seed-cases`	`30`	Adversarial cases to generate (set to 0 to skip)
`--pii-policy`	`redact`	`redact` \| `strict` \| `allow`
`--skip-seed-cases`	off	Save ~$0.02 if you don’t need adversarial cases
`--skip-calibration`	off	Use proposed thresholds without trace-side scoring
`--repo`	`.`	App repo to scan for prompt call sites — writes `prompt_baseline.json` there and stamps generated cases with repo-state provenance (`targets=[]`)

Local judge — run bootstrap fully offline

The whole pipeline routes through make_judge_call as of 0.9.4 / 0.9.6 — including the adversarial seed-case generator (auto.py). That means --judge-provider ollama (or litellm) runs end-to-end, not just at argparse level.

multivon-eval bootstrap \
  --judge-provider ollama \
  --judge-model qwen2.5:14b \
  --product product.md \
  --traces traces.jsonl

OLLAMA_HOST env var controls the daemon address (default 127.0.0.1:11434).
--judge-base-url http://localhost:8000/v1 overrides for vLLM, LM Studio, or a remote Ollama.
Wall-clock is roughly 5× the Haiku default at default --n-seed-cases 30. Cost: $0.00.
Cases generated under a local judge carry metadata['judge_used'] = "ollama:qwen2.5:14b" and metadata['prompt_version'] for downstream replay.

Programmatic API

If you’d rather invoke the bootstrap from Python (e.g. inside a Jupyter notebook or CI pipeline), the same pipeline lives in multivon_eval.bootstrap:

from multivon_eval import bootstrap, JudgeConfig

result = bootstrap(
    description_path="product.md",
    traces_path="traces.jsonl",
    output_dir="./eval-bootstrap",
    judge=JudgeConfig(provider="anthropic", model="claude-haiku-4-5-20251001"),
    pii_policy="redact",
    n_seed_cases=30,
)

print(f"shape: {result.shape}")
print(f"evaluators: {[r.name for r in result.evaluators]}")
print(f"cost: ${result.cost_usd:.4f}")
print(f"artifacts: {result.artifacts}")

​The whole flow

​Scaled, gated generation (0.12.0)

​What the inputs look like

​product.md — free-form markdown, under 5000 words

​traces.jsonl — newline-delimited JSON, max 10 000 rows

​How the recommendation works

​PII handling — no surprises

​The discovery report

​Drift detection comes free — --repo (0.10.0)

​What’s NOT in the bootstrap (and why)

​When to use it (and when not)

​Configuration reference

​Local judge — run bootstrap fully offline

​Programmatic API

​See also