# Bootstrap an eval suite from your product

> Cold-start your evals — `multivon-eval bootstrap` proposes a tuned suite from a product description + sample traces in a few minutes.

Writing assertions is the easy part of LLM evaluation. Knowing what to measure for your specific product is the hard part, and `multivon-eval bootstrap` answers that question. Hand it a one-paragraph product description and a few real traces, and it emits a runnable `EvalSuite` with metrics tuned to your shape, thresholds calibrated from your data, and adversarial seed cases targeting the most likely failure mode.

## The whole flow

```bash theme={null}
pip install multivon-eval
multivon-eval bootstrap \
  --product product.md \
  --traces traces.jsonl \
  --output ./eval-bootstrap/
```

You get five artifacts:

| File                   | What it is                                                                                                                                                |
| ---------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `eval_suite.py`        | Runnable suite with 4–6 evaluators picked for your product shape                                                                                          |
| `seed_cases.jsonl`     | 30 adversarial seed cases targeting the primary failure mode                                                                                              |
| `thresholds.yaml`      | Per-evaluator thresholds calibrated from your traces (p25 of scores)                                                                                      |
| `DISCOVERY_REPORT.md`  | An eval design review you can forward to your team                                                                                                        |
| `prompt_baseline.json` | Snapshot of every prompt call site in your repo (written at the `--repo` root, default `.`) — the input to [`multivon-eval staleness`](/guides/staleness) |

Total cost: \~\$0.12 per bootstrap on default settings, or **\$0.00 with `--judge-provider ollama`**. Latency: about 4 minutes measured on default settings, almost entirely judge-API wait, with progress lines printing to stderr as each stage starts. The local-judge path takes about 5× that wall-clock time.

As of 0.9.4 the emitted `eval_suite.py` runs end-to-end: `python eval_suite.py --runs 1` executes the full suite without further edits. Two things to know before that first run. It makes real judge calls (30 cases × your LLM evaluators ≈ 120 calls, a few cents on Haiku), and its `main()` exits 1 when the pass rate is under 50%, which is expected against the placeholder `stub_model`. The all-fail run is the signal to wire in your real model. 0.9.4 also fixed a stale `suite.run(cases=...)` kwarg and a `report.print_summary()` call that previously broke the generated file; if you hit those on 0.9.3 or earlier, upgrade.

## Scaled, gated generation (0.12.0)

`--n-seed-cases` now goes to 500. Generation runs in batches of ≤30, with
later batches steered away from already-accepted inputs, and every case
passes gates before acceptance: well-formed (structural), duplicate
(NFC-normalized identity or token-Jaccard ≥ 0.85, across batches), and, with
`--validate-cases --baseline-model-file model.py`, the hardness band from
`auto.validate_adversarial_cases` (judge-priced, so opt-in).
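
The duplicate gate can be pictured with a short sketch. This is not the library's implementation: the whitespace tokenizer, the lowercasing, and the helper names (`token_jaccard`, `is_duplicate`, `dedupe`) are assumptions made for illustration. Only the NFC normalization and the 0.85 Jaccard cutoff come from the description above.

```python
from __future__ import annotations

import unicodedata


def token_jaccard(a: str, b: str) -> float:
    """Jaccard similarity over lowercase whitespace tokens (tokenizer is an assumption)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def is_duplicate(candidate: str, accepted: list[str], threshold: float = 0.85) -> bool:
    """True if the candidate matches an accepted input exactly (after NFC) or nearly."""
    cand = unicodedata.normalize("NFC", candidate)
    for prior in accepted:
        prior_n = unicodedata.normalize("NFC", prior)
        if cand == prior_n or token_jaccard(cand, prior_n) >= threshold:
            return True
    return False


def dedupe(candidates: list[str]) -> tuple[list[str], int]:
    """Keep first occurrences; the accepted list persists across batches."""
    accepted: list[str] = []
    dropped = 0
    for text in candidates:
        if is_duplicate(text, accepted):
            dropped += 1
        else:
            accepted.append(text)
    return accepted, dropped
```

Because `accepted` is carried forward, a case generated in a later batch is compared against everything accepted so far, which is what "across batches" implies.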

**No silent caps.** The CLI summary and `DISCOVERY_REPORT.md` print the full
accounting — `generated 500, accepted 431 — dropped 38 duplicates, 12
malformed, 19 outside hardness band [0.5, 1.0]` — and a skipped hardness gate
says so explicitly. Each accepted case's `metadata["generation"]` records its
batch, the gates it passed, and its hardness score. `--budget-usd` (default
2.00) is a pre-spend ceiling: the cost estimate is checked before the first
LLM call.

## What the inputs look like

### `product.md` — free-form markdown, under 5000 words

```markdown theme={null}
# Product

AcmeCorp customer support bot. Answers questions about return policies,
shipping windows, and refund timelines using a retrieval-augmented knowledge
base.

# Inputs / Outputs

Input: free-form question in English.
Output: 1-3 sentence answer grounded in the retrieved policy text.

# Known risks

- Hallucinating policy details not in the retrieved context
- Going off-topic into product recommendations (out of scope)
- Echoing customer PII back into the response unnecessarily
```

Suggested sections (none are required): `# Product`, `# Users`, `# Inputs / Outputs`, `# Known risks`. The LLM uses the description to ground its metric recommendations in your domain.

### `traces.jsonl` — newline-delimited JSON, max 10 000 rows

Only `input` is required. Optional keys are detected and used to infer product shape:

```jsonl theme={null}
{"input": "What's the refund window?", "context": "Refunds within 30 days.", "output": "30 days from purchase.", "expected_output": "30 days"}
{"input": "Can I return electronics after 20 days?", "context": "Electronics: 14-day window.", "output": "No, electronics have a 14-day return window."}
{"input": "How long does it take to receive my refund?", "context": "Refunds processed in 5-7 business days.", "output": "5-7 business days after we receive your item."}
```

| Optional key          | Triggers                                                       |
| --------------------- | -------------------------------------------------------------- |
| `context`             | RAG-shape detection → adds Faithfulness, Hallucination         |
| `expected_output`     | Enables threshold calibration + AnswerAccuracy / ContextRecall |
| `expected_tool_calls` | Agent-shape detection → adds ToolCallAccuracy                  |
| `conversation`        | Multi-turn detection → adds ConversationRelevance              |
| `metadata.image_url`  | Multimodal detection → adds VQAFaithfulness                    |

If you don't have output traces yet (pre-launch), the tool still proposes evaluators based on the product description alone — threshold calibration is the only step that's skipped.
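
To make the key-to-evaluator mapping concrete, here is a minimal sketch that scans JSONL lines and lists the evaluators the table above associates with the keys it finds. The function names (`has_key`, `suggest_evaluators`) are hypothetical, and the real `auto_evaluators()` may weigh keys differently (for example, by how many rows carry them).

```python
import json

# Mapping copied from the table above: optional key -> evaluators it adds.
KEY_TO_EVALUATORS = {
    "context": ["Faithfulness", "Hallucination"],
    "expected_output": ["AnswerAccuracy", "ContextRecall"],
    "expected_tool_calls": ["ToolCallAccuracy"],
    "conversation": ["ConversationRelevance"],
    "metadata.image_url": ["VQAFaithfulness"],
}


def has_key(row, dotted):
    """Check for a possibly nested key such as 'metadata.image_url'."""
    current = row
    for part in dotted.split("."):
        if not isinstance(current, dict) or part not in current:
            return False
        current = current[part]
    return True


def suggest_evaluators(jsonl_lines):
    """Return load counts plus the evaluators implied by the keys present."""
    loaded, skipped, present = 0, 0, set()
    for line in jsonl_lines:
        if not line.strip():
            continue
        row = json.loads(line)
        if "input" not in row:  # `input` is the only required key
            skipped += 1
            continue
        loaded += 1
        present.update(key for key in KEY_TO_EVALUATORS if has_key(row, key))
    evaluators = []
    for key, names in KEY_TO_EVALUATORS.items():
        if key in present:
            evaluators.extend(n for n in names if n not in evaluators)
    return {"loaded": loaded, "skipped": skipped, "evaluators": evaluators}
```

Running something like this over your own `traces.jsonl` before bootstrapping is a quick way to see which product shape your data is likely to signal.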

<Note>
  **`load_traces` accepts your existing dump shape.** As of 0.9.4, the loader auto-aliases field names from LangSmith (`query` → `input`, `answer` → `output`, `retrieved_context` → `context`), LangFuse (`prompt` / `completion`), and Phoenix (`input` / `output`). It prints a one-line summary to stderr — `loaded 287/300 traces · renamed 814 fields · skipped 13 (missing input)` — so you can see exactly what was renamed and what was dropped. The silent-skip behavior from earlier releases is gone.
</Note>
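
The aliasing behavior in the note can be sketched as below. This is an illustration, not the `load_traces` source: the helper name `alias_and_filter` is hypothetical, the `prompt` → `input` and `completion` → `output` directions are inferred from the LangFuse pairing, and two choices (never overwriting an existing canonical key, counting renames only for kept rows) are assumptions.

```python
# Alias table built from the field names listed in the note above.
ALIASES = {
    "query": "input",
    "answer": "output",
    "retrieved_context": "context",
    "prompt": "input",
    "completion": "output",
}


def alias_and_filter(rows):
    """Rename known aliases, drop rows without `input`, and build a summary line."""
    kept, renamed, skipped = [], 0, 0
    for row in rows:
        new_row, row_renames = {}, 0
        # Copy canonical keys first so an alias never overwrites them (assumption).
        for key, value in row.items():
            if key not in ALIASES:
                new_row[key] = value
        for key, value in row.items():
            target = ALIASES.get(key)
            if target is not None and target not in new_row:
                new_row[target] = value
                row_renames += 1
        if "input" not in new_row:
            skipped += 1
            continue
        kept.append(new_row)
        renamed += row_renames
    summary = (
        f"loaded {len(kept)}/{len(rows)} traces · renamed {renamed} fields "
        f"· skipped {skipped} (missing input)"
    )
    return kept, summary
```

If the summary line from the real loader shows more skips than you expect, comparing your dump's field names against the alias list is the first thing to check.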

## How the recommendation works

Three layers, in order:

1. **Heuristic anchor.** `auto_evaluators()` inspects the trace shape and picks a deterministic starting set — Faithfulness + Hallucination for RAG, ToolCallAccuracy for agents, etc. This runs in microseconds, no LLM call, and provides the safety net.

2. **LLM proposal.** A single call to Claude Haiku (configurable via `--judge-provider` / `--judge-model`) reads your product description + trace summary + a sample of traces, then proposes 4–6 evaluators with per-metric rationale and threshold suggestions. The LLM is constrained to an enumerated allow-list of evaluator names — it cannot invent metrics, and any proposal outside the list is silently dropped.

3. **Merge + threshold calibration.** LLM picks take priority for the same evaluator name (more contextual); heuristic picks fill in tiers the LLM missed (e.g. NotEmpty as a guardrail). Thresholds are then re-tuned by running each evaluator over a sample of your traces (50 for deterministic evaluators, 15 for LLM-judge evaluators) and setting the threshold to the 25th percentile of observed scores.

The whole pipeline amounts to one Haiku call for the proposal, one Haiku call for seed cases, and minimal trace-side scoring. Total cost: roughly \$0.12 on default settings.
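
The merge and calibration step can be sketched in a few lines. Everything here is illustrative: `merge_picks`, `percentile`, and `calibrate` are hypothetical names, the dictionaries stand in for whatever structures the tool uses internally, and the linear-interpolation percentile is an assumption, since the exact percentile method is not stated above.

```python
def merge_picks(heuristic, proposed, allow_list):
    """LLM picks win on name collisions; heuristic picks fill the gaps."""
    merged = {
        name: {"threshold": threshold, "source": "heuristic"}
        for name, threshold in heuristic.items()
    }
    for name, threshold in proposed.items():
        if name not in allow_list:
            continue  # proposals outside the allow-list are dropped
        merged[name] = {"threshold": threshold, "source": "llm"}
    return merged


def percentile(values, q):
    """Percentile with linear interpolation between closest ranks (assumed method)."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("need at least one score")
    position = (len(ordered) - 1) * q
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = position - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction


def calibrate(merged, observed_scores):
    """Replace each threshold with the 25th percentile of scores seen on real traces."""
    calibrated = {name: dict(pick) for name, pick in merged.items()}
    for name, scores in observed_scores.items():
        if name in calibrated and scores:
            calibrated[name]["threshold"] = percentile(scores, 0.25)
    return calibrated
```

Setting the threshold at p25 means roughly three quarters of your current traces pass by construction, so a later drop in pass rate reflects a regression relative to today's behavior rather than an arbitrary bar.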

## PII handling — no surprises

Every trace is locally scanned for high-confidence secrets and PII **before** any data leaves your machine: AWS keys, OpenAI / Anthropic / GitHub tokens, JWTs, private-key PEM headers, US SSNs, emails, Luhn-valid credit cards, and more. Three policies:

| `--pii-policy`     | Behavior                                                                                                          |
| ------------------ | ----------------------------------------------------------------------------------------------------------------- |
| `redact` (default) | Detections are replaced with `[REDACTED:<label>]` before being sent to the judge. Logged in the report.           |
| `strict`           | Abort with a non-zero exit code if any high-confidence detection fires. For procurement / regulated environments. |
| `allow`            | Send raw traces, with an explicit terminal-prompt confirmation. Use only when you're sure your traces are clean.  |

If PII was detected and redacted, the report's `## Trace evidence` section surfaces the counts (`email=12, ssn=2`) and the bootstrap pipeline auto-adds `PIIEvaluator` as a guardrail in the generated suite.
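
As a rough picture of what the `redact` policy does, the sketch below covers three of the listed detectors (emails, US SSNs, Luhn-valid card numbers). The regular expressions, the `credit_card` label, and the function names are assumptions for illustration; the tool's actual patterns are not shown in this guide and are likely stricter.

```python
import re
from collections import Counter

EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
CARD = re.compile(r"\b\d(?:[ -]?\d){12,18}\b")  # 13-19 digits, optional separators


def luhn_ok(digits):
    """Standard Luhn checksum: double every second digit from the right."""
    total = 0
    for index, char in enumerate(reversed(digits)):
        value = int(char)
        if index % 2 == 1:
            value *= 2
            if value > 9:
                value -= 9
        total += value
    return total % 10 == 0


def redact(text):
    """Replace detections with [REDACTED:<label>] and count them per label."""
    counts = Counter()

    def replacer(label, validator=None):
        def _sub(match):
            if validator is not None and not validator(match.group(0)):
                return match.group(0)  # looks like a card but fails Luhn: keep it
            counts[label] += 1
            return f"[REDACTED:{label}]"
        return _sub

    text = EMAIL.sub(replacer("email"), text)
    text = SSN.sub(replacer("ssn"), text)
    text = CARD.sub(replacer("credit_card", lambda s: luhn_ok(re.sub(r"\D", "", s))), text)
    return text, counts
```

The Luhn check is what keeps long order numbers from being redacted as cards, which is the point of restricting detection to high-confidence matches.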

## The discovery report

Codex's review of the design doc flagged this: *"The most valuable output may be `DISCOVERY_REPORT.md`, not the generated eval code."* The report is built to be forwarded inside your team as an eval design review: what shape your product is, which metrics matter, what was rejected, and why. Many users treat it as the artifact that gets a launch greenlit, even when they edit the generated `eval_suite.py` heavily afterward.

## Drift detection comes free — `--repo` (0.10.0)

As of [0.10.0](https://github.com/multivon-ai/multivon-eval/blob/main/CHANGELOG.md#0100--2026-06-11),
bootstrap also sets up [prompt-drift staleness detection](/guides/staleness),
at no extra cost or latency:

* **`--repo PATH`** (default `.`) tells bootstrap which app repo to scan for
  prompt call sites. It writes `prompt_baseline.json` at that root — the
  committed snapshot that `multivon-eval staleness` later diffs against.
* **Every generated case is stamped** with `metadata._provenance`:
  `authored_by="bootstrap"`, the repo SHA, and `targets=[]`. The empty
  targets list means "authored against this repo state", and nothing more.
  Bootstrap generates cases from your product description and traces; it
  knows nothing about call sites, so case→site bindings are never
  fabricated. Bind cases explicitly later with
  `multivon-eval staleness stamp`.
* The completion checklist gains one line confirming both:

```text theme={null}
  ✓ baseline + provenance stamped: 14 call site(s) @ a1b2c3d
```

Stamping flows through the existing case-serialization path and never
perturbs `suite.lock` (the lockfile's cases hash excludes metadata by
design). From that point on, a zero-arg `multivon-eval staleness .` in the
repo reports which prompts changed since the suite was bootstrapped.
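
If you want to see which generated cases still have no call-site binding, a small script over `seed_cases.jsonl` is enough. This sketch assumes each JSONL row carries a top-level `metadata` object containing `_provenance`; the function name `unbound_cases` is hypothetical, and only the `authored_by` and `targets` fields described above are read.

```python
import json


def unbound_cases(jsonl_lines):
    """Return bootstrap-authored cases whose provenance has an empty targets list."""
    unbound = []
    for line in jsonl_lines:
        if not line.strip():
            continue
        case = json.loads(line)
        metadata = case.get("metadata") or {}
        provenance = metadata.get("_provenance")
        if not isinstance(provenance, dict):
            continue  # case was never stamped
        if provenance.get("authored_by") == "bootstrap" and provenance.get("targets") == []:
            unbound.append(case)
    return unbound
```

A typical use would be `unbound_cases(open("eval-bootstrap/seed_cases.jsonl"))` to list the cases worth binding with `multivon-eval staleness stamp`.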

## What's NOT in the bootstrap (and why)

* **No model-side wrapper.** The bootstrap emits a suite that calls *your* `model_fn`; you bring the wiring to your real model. No vendor lock-in.
* **No production-monitoring loop.** The bootstrap is a setup-time tool: once you have the suite, you run it with `python eval_suite.py` or wire it into CI via `--fail-threshold`.
* **No template marketplace.** Picking your suite from a 20-template menu is the old approach; the bootstrap picks for you and tells you why.
* **No dashboard.** Local-first by design. Use `multivon-eval view report.json` if you want a browsable HTML report.

## When to use it (and when not)

**Use it when:**

* You're starting a new LLM feature and don't yet have an eval suite.
* You inherited an LLM product with no evals and need a credible starting point.
* You're switching evaluation tools and want a clean baseline.
* You're preparing for a launch and want a forwardable "what we eval and why" document for your team.

**Don't use it when:**

* You have a mature, hand-tuned eval suite and just want to add one more evaluator. Use `add_evaluators()` directly.
* You only need a deterministic check (length, regex, schema). The bootstrap is overkill — write the assertion in two lines.

## Configuration reference

```text theme={null}
multivon-eval bootstrap --help
```

| Flag                 | Default                     | Notes                                                                                                                                             |
| -------------------- | --------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------- |
| `--product`, `-p`    | required                    | Path to product description markdown                                                                                                              |
| `--traces`, `-t`     | required                    | Path to traces JSONL                                                                                                                              |
| `--output`, `-o`     | `./eval-bootstrap`          | Output directory (created if absent)                                                                                                              |
| `--judge-model`      | `claude-haiku-4-5-20251001` | Judge model for the proposal call                                                                                                                 |
| `--judge-provider`   | `anthropic`                 | `anthropic` \| `openai` \| `google` \| `ollama` \| `litellm`                                                                                      |
| `--judge-base-url`   | unset                       | OpenAI-compatible base URL (vLLM, LM Studio, custom Ollama host)                                                                                  |
| `--n-seed-cases`     | `30`                        | Adversarial cases to generate, up to 500 (set to 0 to skip)                                                                                       |
| `--pii-policy`       | `redact`                    | `redact` \| `strict` \| `allow`                                                                                                                   |
| `--skip-seed-cases`  | off                         | Save \~\$0.02 if you don't need adversarial cases                                                                                                 |
| `--skip-calibration` | off                         | Use proposed thresholds without trace-side scoring                                                                                                |
| `--repo`             | `.`                         | App repo to scan for prompt call sites — writes `prompt_baseline.json` there and stamps generated cases with repo-state provenance (`targets=[]`) |
| `--budget-usd`       | `2.00`                      | Pre-spend ceiling: the cost estimate is checked before the first LLM call                                                                         |
| `--validate-cases`   | off                         | Opt-in, judge-priced hardness-band gate; used together with `--baseline-model-file`                                                               |
| `--baseline-model-file` | unset                    | Baseline model file for the hardness gate, e.g. `model.py`                                                                                        |

## Local judge — run bootstrap fully offline

The whole pipeline routes through `make_judge_call` as of 0.9.4 / 0.9.6, including the adversarial seed-case generator (`auto.py`). That means `--judge-provider ollama` (or `litellm`) works end-to-end, rather than merely being accepted by the argument parser.

```bash theme={null}
multivon-eval bootstrap \
  --judge-provider ollama \
  --judge-model qwen2.5:14b \
  --product product.md \
  --traces traces.jsonl
```

* `OLLAMA_HOST` env var controls the daemon address (default `127.0.0.1:11434`).
* `--judge-base-url http://localhost:8000/v1` overrides for vLLM, LM Studio, or a remote Ollama.
* Wall-clock is roughly **5× the Haiku default** at default `--n-seed-cases 30`. Cost: \$0.00.
* Cases generated under a local judge carry `metadata['judge_used'] = "ollama:qwen2.5:14b"` and `metadata['prompt_version']` for downstream replay.
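
Before starting a long local run, it can help to confirm the Ollama daemon is reachable. The pre-flight sketch below is not part of `multivon-eval`; the function names are hypothetical, and it assumes `OLLAMA_HOST` is given as `host:port` or a full URL. It probes Ollama's model-listing endpoint (`/api/tags`).

```python
import os
import urllib.error
import urllib.request


def ollama_base_url(env=None):
    """Resolve the daemon address from OLLAMA_HOST, defaulting to 127.0.0.1:11434."""
    env = os.environ if env is None else env
    host = env.get("OLLAMA_HOST", "127.0.0.1:11434")
    if "://" not in host:
        host = "http://" + host
    return host.rstrip("/")


def ollama_is_up(base_url, timeout=2.0):
    """Return True if the daemon answers the model-listing endpoint."""
    try:
        with urllib.request.urlopen(base_url + "/api/tags", timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        return False
```

Calling `ollama_is_up(ollama_base_url())` first avoids discovering a stopped daemon several minutes into a bootstrap.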

## Programmatic API

If you'd rather invoke the bootstrap from Python (e.g. inside a Jupyter notebook or CI pipeline), the same pipeline lives in `multivon_eval.bootstrap`:

```python theme={null}
from multivon_eval import bootstrap, JudgeConfig

result = bootstrap(
    description_path="product.md",
    traces_path="traces.jsonl",
    output_dir="./eval-bootstrap",
    judge=JudgeConfig(provider="anthropic", model="claude-haiku-4-5-20251001"),
    pii_policy="redact",
    n_seed_cases=30,
)

print(f"shape: {result.shape}")
print(f"evaluators: {[r.name for r in result.evaluators]}")
print(f"cost: ${result.cost_usd:.4f}")
print(f"artifacts: {result.artifacts}")
```

## See also

* [Intelligent eval primitives](/guides/intelligent-eval) — full reference for the three primitives the bootstrap composes (`auto_evaluators`, `generate_adversarial_cases`, `validate_adversarial_cases`). Use these directly when you want fine-grained control or want to compose them into your own pipeline.
* [Synthetic data generation](/guides/synthetic-data) — the higher-level `generate_from_file` / `generate_from_text` helpers, when you want generation without targeting a specific failure mode.
* [Prompt-drift staleness](/guides/staleness) — the drift report that consumes the `prompt_baseline.json` and provenance stamps bootstrap writes.
* [Quickstart](/quickstart) — the manual path: write `EvalCase` objects directly.
* [/eval-bootstrap Claude Code skill](/skills/eval-bootstrap) — the auto-invoking wrapper around this CLI for Claude Code users. Install with `multivon-eval install-skills`.
