eval-bootstrap wraps the multivon-eval bootstrap CLI in a Claude Code workflow. A fresh project goes from “no eval scaffolding” to “first eval running” without you typing CLI flags. The skill picks the judge provider from your environment, finds sample traces in conventional locations, rewrites the generated stub_model to call your real model, and writes an EVALS.md so the next session picks up where this one left off.
Trigger phrases
The skill auto-invokes on any of these:
- “add evals to this project”
- “set up evaluation”
- “eval this codebase”
- “evaluate this project”
- “what evaluators should I run”
It also auto-invokes when the repo imports anthropic, openai, google.genai, litellm, langchain, or llama_index and has no eval/, evals/, tests/eval/, or evaluation/ directory.
It does NOT auto-invoke if the repo already has a working eval suite. In that case it suggests eval-audit instead.
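The import-plus-missing-directory condition can be approximated with a small check. This is only a sketch of the idea, not the skill's actual detection logic: it assumes the condition refers to Python import statements, and the function names, the regex, and the list of skipped directories are all illustrative.

```python
import re
from pathlib import Path

# Package names and directory names are taken from the text above.
LLM_MODULES = ("anthropic", "openai", "google.genai", "litellm", "langchain", "llama_index")
EVAL_DIRS = ("eval", "evals", "tests/eval", "evaluation")

# Assumption: vendored or virtualenv code should not count as "the repo".
SKIP_PARTS = {".git", ".venv", "venv", "node_modules"}

_IMPORT_RE = re.compile(
    r"^\s*(?:import|from)\s+(?:" + "|".join(re.escape(m) for m in LLM_MODULES) + r")\b",
    re.MULTILINE,
)


def imports_llm_sdk(repo: Path) -> bool:
    """True if any Python file under repo imports one of the listed packages.

    Limitation: `from google import genai` is not matched by this regex.
    """
    for path in repo.rglob("*.py"):
        if SKIP_PARTS.intersection(path.parts):
            continue
        try:
            text = path.read_text(encoding="utf-8", errors="ignore")
        except OSError:
            continue
        if _IMPORT_RE.search(text):
            return True
    return False


def has_eval_dir(repo: Path) -> bool:
    """True if any of the conventional eval directories already exists."""
    return any((repo / name).is_dir() for name in EVAL_DIRS)


def should_offer_bootstrap(repo: Path) -> bool:
    """LLM SDK in use and no eval directory yet."""
    return imports_llm_sdk(repo) and not has_eval_dir(repo)
```

Calling `should_offer_bootstrap(Path("."))` from a repo root gives a quick way to predict whether the skill is likely to offer itself.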
allowed-tools
eval_suite.py plus EVALS.md. It has no Network or MCP access beyond what the CLI itself uses.
Step-by-step flow
Scope check
Reads pyproject.toml or package.json to identify the LLM provider in use. If multiple providers are detected, the skill asks which one to target; it never silently picks for you.
Trace collection
Looks for sample traces in this order: traces/*.jsonl, data/traces/*.jsonl, notebooks/*/traces.jsonl. If none are found, it asks you to paste 5–20 sample (input, output) pairs into a temp file. If the project uses LangSmith / LangFuse / Phoenix, the skill prompts for a dump command. Those are your secrets, so the skill never runs the dump itself.
Product description
Uses PRODUCT.md, OVERVIEW.md, or the top-level README.md as the product description. Falls back to asking you for two or three sentences if none exist.
Run bootstrap
Executes multivon-eval bootstrap with the detected provider as --judge-provider, a sensible default model, and --pii-policy redact. The CLI emits four artifacts: eval_suite.py, seed_cases.jsonl, thresholds.yaml, DISCOVERY_REPORT.md.
Rewrite stub_model
The generated eval_suite.py ships with a placeholder stub_model(). The skill reads one or two existing LLM call sites in your repo and rewrites stub_model to use the same client setup instead of inventing its own.
Write EVALS.md
A short doc the next Claude Code session reads first. It lists which evaluators were picked and why (one sentence each), the CLI command to re-run the suite, and a TODO link to eval-audit for PR gating.
Local-judge path: no API key required
If no ANTHROPIC_API_KEY / OPENAI_API_KEY / GOOGLE_API_KEY is in env, the skill checks for a running Ollama instance via curl -s http://localhost:11434/api/tags. If Ollama responds, it runs ollama list first to see what models you have actually pulled, then picks the strongest instruction-tuned model available. Rough order of judge quality, strongest first:
- qwen2.5:72b
- llama3.3:70b-instruct
- deepseek-r1:32b
- qwen2.5:14b
The skill passes the selected model to the CLI via --judge-model.
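The selection logic can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the skill's implementation: it reads the installed models from Ollama's /api/tags response (a JSON object with a models array whose entries carry a name) instead of shelling out to curl and ollama list, and the function names are hypothetical. What the skill does when none of the listed models is installed is not documented here, so the sketch simply returns None.

```python
import json
import urllib.request

# Preference order from the list above, strongest first.
PREFERRED = ["qwen2.5:72b", "llama3.3:70b-instruct", "deepseek-r1:32b", "qwen2.5:14b"]


def installed_models(base_url="http://localhost:11434", timeout=2.0):
    """Return model names reported by Ollama, or [] if it is unreachable."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
            payload = json.load(resp)
    except (OSError, ValueError):
        # Connection refused, timeout, or a non-JSON body: treat as "no Ollama".
        return []
    return [m["name"] for m in payload.get("models", []) if "name" in m]


def pick_judge_model(available, preferred=PREFERRED):
    """Return the first preferred model that is installed, else None."""
    for name in preferred:
        if name in available:
            return name
    return None
```

With `pick_judge_model(installed_models())` you can check in advance which local judge would be chosen from this list on your machine.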
Costs
| Path | Cost | Wall-clock |
|---|---|---|
| Cloud (claude-haiku-4-5 judge, default) | ~$0.15 | under 60 seconds |
| Local (Ollama) | free | ~5× cloud wall-clock |
How to extend
The skill is a SKILL.md file under multivon_eval/_skills/eval-bootstrap/ inside the installed package. Two ways to customize:
- Edit the symlinked file. install-skills writes a directory symlink by default: ~/.claude/skills/eval-bootstrap/SKILL.md points at the package copy, so pip install -U multivon-eval overwrites your edits. To customize safely, copy the directory out of the symlink chain first.
- Fork the entire skills set. Vendor _skills/ into your own repo, point the symlinks at your fork, and PR upstream changes back to multivon-ai/multivon-eval when they’re broadly useful.
The SKILL.md spec is documented in Anthropic’s Agent Skills reference: plain Markdown with YAML frontmatter, no DSL.
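One way to copy the skill directory out of the symlink chain is sketched below. The helper name detach_skill is hypothetical, and the sketch assumes a POSIX filesystem where ~/.claude/skills/eval-bootstrap is a directory symlink, as described above. It replaces the link with a real copy of the target so later package upgrades no longer touch your edits.

```python
import shutil
from pathlib import Path


def detach_skill(link: Path) -> Path:
    """Replace a directory symlink with an independent copy of its target."""
    if not link.is_symlink():
        # Already a real directory (or missing): nothing to detach.
        return link
    target = link.resolve(strict=True)
    staging = link.with_name(link.name + ".local")
    # Copy first, then swap, so a failed copy leaves the symlink in place.
    shutil.copytree(target, staging)
    link.unlink()
    staging.rename(link)
    return link


# Example call (path taken from the text above):
# detach_skill(Path.home() / ".claude" / "skills" / "eval-bootstrap")
```

After detaching, the copy no longer receives upstream updates, so you would need to merge changes from new package releases by hand.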
What it doesn’t do
- Doesn’t promise the generated suite covers everything you should evaluate. It covers the shape it can infer from traces plus the product description. Treat the output as a starting point.
- Doesn’t replace pytest or your existing test suite.
- Doesn’t ship a hosted dashboard: output is plain JSON, Markdown, and runnable Python.
- Doesn’t gate PRs by itself. Pair with eval-audit (Claude Code) or eval-action (GitHub Actions) for CI gating.
See also
- Bootstrap CLI guide — the underlying CLI this skill wraps, with full flag reference and the PII handling policies.
- eval-audit — the PR-time counterpart that runs the suite this skill generates.
- eval-explain — the skill that explains why a particular evaluator was picked, right after bootstrap finishes.

