Quickstart

The fastest path: `multivon-eval bootstrap`

Don’t know what to eval for your specific LLM product? Describe it and hand over a few sample traces. multivon-eval bootstrap proposes a tuned suite in a few minutes.

pip install multivon-eval
multivon-eval bootstrap --product product.md --traces traces.jsonl --output ./eval-bootstrap/

Returns five files: eval_suite.py (runnable), seed_cases.jsonl (30 adversarial cases), thresholds.yaml (calibrated from your traces), DISCOVERY_REPORT.md (an eval design review), and prompt_baseline.json (a prompt call-site baseline written at the repo root for staleness tracking). Cost: ~$0.12 default, or free with --judge-provider ollama. Full walkthrough →

Run it fully offline with a local judge:

multivon-eval bootstrap \
  --judge-provider ollama --judge-model qwen2.5:14b \
  --product product.md --traces traces.jsonl --output ./eval-bootstrap/
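
After a run, a quick sanity check is to confirm that the five files listed above actually exist. The following is a minimal sketch, assuming the first four files land in the --output directory and prompt_baseline.json at the repo root, as described above. The helper missing_bootstrap_files and its constants are hypothetical and not part of multivon-eval.

```python
import tempfile
from pathlib import Path

# File names come from the list above; the helper and constants are hypothetical.
OUTPUT_FILES = ["eval_suite.py", "seed_cases.jsonl", "thresholds.yaml", "DISCOVERY_REPORT.md"]
ROOT_FILES = ["prompt_baseline.json"]


def missing_bootstrap_files(output_dir, repo_root="."):
    """Return the expected bootstrap artifacts that are not present on disk."""
    expected = [Path(output_dir) / name for name in OUTPUT_FILES]
    expected += [Path(repo_root) / name for name in ROOT_FILES]
    return [str(path) for path in expected if not path.is_file()]


# Demo against a throwaway directory so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as root:
    out = Path(root) / "eval-bootstrap"
    out.mkdir()
    before = missing_bootstrap_files(out, root)  # nothing generated yet
    for name in OUTPUT_FILES:
        (out / name).touch()
    (Path(root) / "prompt_baseline.json").touch()
    after = missing_bootstrap_files(out, root)  # everything in place

print(len(before), after)
```

In a real project you would call missing_bootstrap_files("./eval-bootstrap/") from the repo root and treat a non-empty result as a failed bootstrap.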

Wire it into Claude Code with `install-skills`

multivon-eval install-skills

Symlinks three bundled Claude Code skills into ~/.claude/skills/. From that point on:

Say “add evals to this project” → Claude Code auto-invokes /eval-bootstrap.
Ask “why did multivon recommend Faithfulness?” → /eval-explain answers.
Before /ship on a PR that touches prompts or tool defs → /eval-audit runs only the cases that exercise the changed surface and gates the PR.

See /guides/install-skills and /skills/index.

Or try the canned demo

pip install multivon-eval && python -m multivon_eval

Runs a self-contained customer-support eval; the deterministic tier needs no API key. If ANTHROPIC_API_KEY, OPENAI_API_KEY, or a local endpoint is detected, LLM-judge evaluators are added automatically.

Install

pip install multivon-eval

For LLM-judge evaluators, add your API key:

export ANTHROPIC_API_KEY=sk-ant-...
# or
export OPENAI_API_KEY=sk-...
# or point at a local Ollama / LM Studio server
export OPENAI_BASE_URL=http://localhost:11434/v1
export DEMO_MODEL=llama3
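
Before running judge-backed evaluators, it can help to check which of these variables is set. This is a minimal sketch using only the standard library. The function name detect_judge_backend, the returned labels, and the priority order are assumptions for illustration, not multivon-eval's actual detection logic.

```python
import os


def detect_judge_backend(env=None):
    """Guess which judge backend the environment points at (hypothetical helper)."""
    env = os.environ if env is None else env
    if env.get("ANTHROPIC_API_KEY"):
        return "anthropic"
    if env.get("OPENAI_BASE_URL"):
        # An OpenAI-compatible server such as Ollama or LM Studio.
        return "local"
    if env.get("OPENAI_API_KEY"):
        return "openai"
    return "none"  # only deterministic evaluators would be usable


print(detect_judge_backend())
```

Passing a plain dict instead of os.environ makes the check easy to unit test.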

The fastest start: plain-English checks

Don’t know which evaluator to use? Write what you want in English:

from multivon_eval import EvalSuite, EvalCase

def my_model(input: str) -> str:
    return call_my_llm(input)

suite = EvalSuite("return policy eval")
suite.add_check("Response should mention the return policy")
suite.add_check("Tone should be professional and not defensive")
suite.add_cases([EvalCase(input="What is your return policy?")])
report = suite.run(my_model)

add_check auto-generates yes/no questions from your criterion and scores with QAG. When you want to pin the exact questions, graduate to CustomRubric.
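
To make the yes/no idea concrete: one common way to turn a set of yes/no answers into a score is the fraction answered "yes". The sketch below shows only that arithmetic. Whether multivon-eval aggregates QAG answers exactly this way is an assumption, and qag_score is a hypothetical name.

```python
def qag_score(answers):
    """Fraction of yes/no questions answered 'yes' (True).

    Assumed aggregation for illustration; not taken from the library.
    """
    if not answers:
        raise ValueError("need at least one answer")
    return sum(answers) / len(answers)


# Four generated questions, three answered "yes" by the judge.
score = qag_score([True, True, False, True])
print(score)
```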

Option A — Generate cases from your docs

No labeled data? Point generate_from_file() at any text file and get eval cases immediately.

from multivon_eval import generate_from_file, EvalSuite, Faithfulness, Relevance

# Generate QA pairs from your docs
cases = generate_from_file("docs/faq.md", n=20)

def my_model(input: str) -> str:
    # Your model call here
    return call_my_llm(input)

suite = EvalSuite("FAQ Eval")
suite.add_cases(cases)
suite.add_evaluators(Faithfulness(), Relevance())
report = suite.run(my_model, verbose=True)

Option B — Define cases manually

from multivon_eval import EvalSuite, EvalCase, NotEmpty, ExactMatch, Contains, Faithfulness

suite = EvalSuite("My First Eval")

suite.add_cases([
    EvalCase(
        input="What is the capital of France?",
        expected_output="Paris",
    ),
    EvalCase(
        input="Summarize this article.",
        context="The article discusses climate change and its effects on polar ice...",
    ),
])

suite.add_evaluators(
    NotEmpty(),
    Contains(["Paris"]),
    Faithfulness(),
)

# my_model is your model callable, as defined in the earlier examples
report = suite.run(my_model, verbose=True)

Load cases from a file

from multivon_eval import load

cases = load("cases.jsonl")  # or .csv
suite.add_cases(cases)

cases.jsonl

{"input": "What is 2+2?", "expected_output": "4"}
{"input": "Summarize this.", "context": "Long article text here..."}
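
A malformed line is easier to catch before it reaches the suite. Here is a minimal sketch of a pre-flight check for a JSONL file like the one above, using only the standard library. It assumes every case needs an "input" field, which holds for both sample lines but is not confirmed as a rule of the format. parse_cases is a hypothetical helper, not the library's load().

```python
import json


def parse_cases(text):
    """Parse JSONL text into dicts, failing loudly on a bad line."""
    cases = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # tolerate blank lines
        record = json.loads(line)  # raises json.JSONDecodeError on invalid JSON
        if "input" not in record:  # assumed requirement, see lead-in
            raise ValueError(f"line {lineno}: missing 'input'")
        cases.append(record)
    return cases


sample = (
    '{"input": "What is 2+2?", "expected_output": "4"}\n'
    '{"input": "Summarize this.", "context": "Long article text here..."}\n'
)
cases = parse_cases(sample)
print(len(cases))
```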

Run in parallel

report = suite.run(my_model, workers=8)
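
Running with several workers means your model callable may be invoked concurrently, so it should be safe to call from multiple threads or processes. The sketch below illustrates the general pattern with a thread pool from the standard library. That multivon-eval uses threads internally is an assumption, and run_parallel and fake_model are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor


def run_parallel(model, inputs, workers=8):
    """Call model on every input concurrently, keeping the input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Executor.map yields results in the order of the inputs.
        return list(pool.map(model, inputs))


def fake_model(prompt):
    """Stand-in for a real LLM call."""
    return prompt.upper()


outputs = run_parallel(fake_model, ["a", "b", "c"], workers=2)
print(outputs)
```

Threads suit this workload because model calls are usually network-bound.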

Block CI on regression

report = suite.run(my_model, fail_threshold=0.85)
# exits with code 1 if pass rate < 85%
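
The gating rule in the comment above can be written out as plain arithmetic. This is a sketch of that rule only, not the library's implementation. exit_code_for is a hypothetical name, and the per-case results are modelled here as simple booleans.

```python
def exit_code_for(results, fail_threshold):
    """Return 1 when the pass rate falls below the threshold, else 0."""
    if not results:
        raise ValueError("no results to gate on")
    pass_rate = sum(results) / len(results)
    return 1 if pass_rate < fail_threshold else 0


passing_run = [True] * 18 + [False] * 2  # 90% pass rate
failing_run = [True] * 16 + [False] * 4  # 80% pass rate
print(exit_code_for(passing_run, 0.85), exit_code_for(failing_run, 0.85))
```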

Use inside pytest

Drop a suite into an existing pytest test file — no special plugin required.

from multivon_eval import EvalSuite, EvalCase, NotEmpty, Faithfulness

# my_model is your model callable; define it here or import it from your application code

def test_support_bot_quality():
    suite = EvalSuite("Support Bot")
    suite.add_cases([
        EvalCase(
            input="How do I reset my password?",
            context="Users reset passwords via the 'Forgot Password' link.",
        ),
    ])
    suite.add_evaluators(NotEmpty(), Faithfulness())

    report = suite.run(my_model, verbose=False)
    assert report.pass_rate >= 0.85, f"Pass rate dropped: {report.pass_rate:.1%}"

Run it like any other test: pytest tests/test_evals.py. Pair with fail_threshold if you prefer an exit-code approach over an assertion.

Track experiments across runs

from multivon_eval import Experiment

exp = Experiment("my-pipeline")
run_id = exp.record(report, tags={"model": "gpt-4o", "prompt_v": "3"})

# Later, compare two runs (old_run_id is a run ID saved from an earlier exp.record() call)
exp.compare(old_run_id, run_id)

Next steps

Plain-English checks: write criteria in English and the SDK generates the questions.
Synthetic dataset generation: generate eval cases from your docs, no labels required.
LLM judge evaluators: Faithfulness, hallucination, relevance, and more.
Agent evaluation: tool call accuracy and plan quality.
Experiment tracking: compare runs and catch regressions.
CI/CD integration: run evals as a quality gate.
Prompt-drift staleness: know which prompts changed since your cases were authored.
Persona simulation: drive adaptive multi-turn conversations with synthetic personas.
Install Claude Code skills: auto-invoke the bootstrap, audit, and explain skills.
