The fastest path: multivon-eval bootstrap

Don’t know what to eval for your specific LLM product? Describe it and hand over a few sample traces — multivon-eval bootstrap proposes a tuned suite in under 60 seconds.
pip install multivon-eval
multivon-eval bootstrap --product product.md --traces traces.jsonl --output ./eval-bootstrap/
Returns four files: eval_suite.py (runnable), seed_cases.jsonl (30 adversarial cases), thresholds.yaml (calibrated from your traces), and DISCOVERY_REPORT.md (an eval design review). Cost: ~$0.12 by default (or free with --judge-provider ollama). Full walkthrough →
Run it fully offline with a local judge:
multivon-eval bootstrap \
  --judge-provider ollama --judge-model qwen2.5:14b \
  --product product.md --traces traces.jsonl --output ./eval-bootstrap/

Wire it into Claude Code with install-skills

multivon-eval install-skills
Symlinks three bundled Claude Code skills into ~/.claude/skills/. From that point on:
  • Say “add evals to this project” → Claude Code auto-invokes /eval-bootstrap.
  • Ask “why did multivon recommend Faithfulness?” → /eval-explain answers.
  • Before /ship on a PR that touches prompts or tool defs → /eval-audit runs only the cases that exercise the changed surface and gates the PR.
See /guides/install-skills and /skills/overview.

Or try the canned demo

pip install multivon-eval && python -m multivon_eval
Runs a self-contained customer-support eval — no API key required for the deterministic tier. If ANTHROPIC_API_KEY, OPENAI_API_KEY, or a local endpoint is detected, LLM-judge evaluators are added automatically.

Install

pip install multivon-eval
For LLM-judge evaluators, add your API key:
export ANTHROPIC_API_KEY=sk-ant-...
# or
export OPENAI_API_KEY=sk-...
# or point at a local Ollama / LM Studio server
export OPENAI_BASE_URL=http://localhost:11434/v1
export DEMO_MODEL=llama3
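To confirm which judge credentials your shell actually exposes before a run, a small standard-library check can help. This is only a sketch: the helper name detect_judge_backend and the precedence order are assumptions made for illustration, not the SDK's own detection logic.

```python
import os

def detect_judge_backend(env=None):
    """Return a label for the first judge setting found, or None.

    Hypothetical helper: the precedence order below is an assumption,
    not documented library behavior.
    """
    env = os.environ if env is None else env
    if env.get("ANTHROPIC_API_KEY"):
        return "anthropic"
    if env.get("OPENAI_BASE_URL"):
        # A custom base URL usually means a local Ollama / LM Studio server.
        return "local"
    if env.get("OPENAI_API_KEY"):
        return "openai"
    return None

# None means only the deterministic tier would be available.
backend = detect_judge_backend()
```

Passing a plain dict instead of the real environment makes the check easy to exercise in tests.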

The fastest start: plain-English checks

Don’t know which evaluator to use? Write what you want in English:
from multivon_eval import EvalSuite, EvalCase

def my_model(input: str) -> str:
    return call_my_llm(input)

suite = EvalSuite("return policy eval")
suite.add_check("Response should mention the return policy")
suite.add_check("Tone should be professional and not defensive")
suite.add_cases([EvalCase(input="What is your return policy?")])
report = suite.run(my_model)
add_check auto-generates yes/no questions from your criterion and scores the output with QAG (question-answer generation). Graduate to CustomRubric when you want to pin the exact questions.
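One way to picture the scoring step is as the share of yes/no questions that a judge answers "yes". The sketch below assumes that reading. The function name qag_score, the example questions, and the 0.5 pass threshold are hypothetical and are not taken from the SDK.

```python
def qag_score(answers):
    """Fraction of yes/no answers that are 'yes' (case-insensitive)."""
    if not answers:
        raise ValueError("need at least one answer")
    yes_count = sum(1 for a in answers if a.strip().lower() == "yes")
    return yes_count / len(answers)

# Hypothetical questions a criterion might be expanded into,
# paired with the judge's answers.
judged = {
    "Does the response mention the return policy?": "yes",
    "Does the response state the return window?": "no",
}

score = qag_score(list(judged.values()))
passed = score >= 0.5  # threshold chosen for illustration only
```

Breaking a criterion into closed questions keeps each judgment narrow, which is the usual motivation for this style of scoring.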

Option A — Generate cases from your docs

No labeled data? Point generate_from_file() at any text file and get eval cases immediately.
from multivon_eval import generate_from_file, EvalSuite, Faithfulness, Relevance

# Generate QA pairs from your docs
cases = generate_from_file("docs/faq.md", n=20)

def my_model(input: str) -> str:
    # Your model call here
    return call_my_llm(input)

suite = EvalSuite("FAQ Eval")
suite.add_cases(cases)
suite.add_evaluators(Faithfulness(), Relevance())
report = suite.run(my_model, verbose=True)

Option B — Define cases manually

from multivon_eval import EvalSuite, EvalCase, NotEmpty, Contains, Faithfulness

suite = EvalSuite("My First Eval")

suite.add_cases([
    EvalCase(
        input="What is the capital of France?",
        expected_output="Paris",
    ),
    EvalCase(
        input="Summarize this article.",
        context="The article discusses climate change and its effects on polar ice...",
    ),
])

suite.add_evaluators(
    NotEmpty(),
    Contains(["Paris"]),
    Faithfulness(),
)

report = suite.run(my_model, verbose=True)

Load cases from a file

from multivon_eval import load

cases = load("cases.jsonl")  # or .csv
suite.add_cases(cases)
Example cases.jsonl:
{"input": "What is 2+2?", "expected_output": "4"}
{"input": "Summarize this.", "context": "Long article text here..."}
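Each line of a JSONL file is one standalone JSON object. A minimal sketch of producing and re-reading such a file with the standard library follows. It uses only the field names shown above, and the write_cases and read_cases helpers are hypothetical, not part of the SDK.

```python
import json
import os
import tempfile

def write_cases(path, cases):
    """Write one JSON object per line (JSONL)."""
    with open(path, "w", encoding="utf-8") as f:
        for case in cases:
            f.write(json.dumps(case, ensure_ascii=False) + "\n")

def read_cases(path):
    """Read a JSONL file back into a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

cases = [
    {"input": "What is 2+2?", "expected_output": "4"},
    {"input": "Summarize this.", "context": "Long article text here..."},
]

# Write to a temporary directory so the sketch leaves no files behind in the repo.
path = os.path.join(tempfile.mkdtemp(), "cases.jsonl")
write_cases(path, cases)
round_tripped = read_cases(path)
```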

Run in parallel

report = suite.run(my_model, workers=8)
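Model calls are mostly network-bound, so running them concurrently tends to cut wall-clock time. The SDK's internals are not shown here. The sketch below assumes a simple thread pool, and run_parallel and fake_model are hypothetical names used only to illustrate the idea.

```python
from concurrent.futures import ThreadPoolExecutor

def run_parallel(model, inputs, workers=8):
    """Call model on each input concurrently; results keep input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Executor.map yields results in the order of the inputs,
        # regardless of which call finishes first.
        return list(pool.map(model, inputs))

def fake_model(text: str) -> str:
    # Stand-in for a real LLM call.
    return text.upper()

outputs = run_parallel(fake_model, ["a", "b", "c"], workers=2)
```

If your model function holds shared state, make sure it is thread-safe before raising the worker count.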

Block CI on regression

report = suite.run(my_model, fail_threshold=0.85)
# exits with code 1 if pass rate < 85%
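The gating rule itself is simple: compute the pass rate and map it to a process exit code. A minimal sketch, assuming per-case results are plain booleans; the gate function is hypothetical, and treating an empty run as a failure is an assumption rather than documented behavior.

```python
def gate(results, fail_threshold=0.85):
    """Return 0 if the pass rate meets the threshold, else 1."""
    if not results:
        return 1  # assumption: an empty run should not pass the gate
    pass_rate = sum(results) / len(results)
    return 0 if pass_rate >= fail_threshold else 1

passing_code = gate([True] * 18 + [False] * 2)  # pass rate 0.90
failing_code = gate([True] * 16 + [False] * 4)  # pass rate 0.80

# In a CI script, the returned code would be handed to sys.exit().
```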

Use inside pytest

Drop a suite into an existing pytest test file — no special plugin required.
from multivon_eval import EvalSuite, EvalCase, NotEmpty, Faithfulness

# my_model is your model callable (input: str) -> str, defined or imported in this test module.

def test_support_bot_quality():
    suite = EvalSuite("Support Bot")
    suite.add_cases([
        EvalCase(
            input="How do I reset my password?",
            context="Users reset passwords via the 'Forgot Password' link.",
        ),
    ])
    suite.add_evaluators(NotEmpty(), Faithfulness())

    report = suite.run(my_model, verbose=False)
    assert report.pass_rate >= 0.85, f"Pass rate dropped: {report.pass_rate:.1%}"
Run it like any other test: pytest tests/test_evals.py. Use fail_threshold instead if you prefer an exit code over an assertion.

Track experiments across runs

from multivon_eval import Experiment

exp = Experiment("my-pipeline")
run_id = exp.record(report, tags={"model": "gpt-4o", "prompt_v": "3"})

# Later, compare two runs.
# old_run_id is the ID returned by an earlier exp.record() call.
exp.compare(old_run_id, run_id)

Next steps

  • Plain-English checks: write criteria in English and the SDK generates the questions.
  • Synthetic dataset generation: generate eval cases from your docs, no labels required.
  • LLM judge evaluators: Faithfulness, hallucination, relevance, and more.
  • Agent evaluation: tool call accuracy and plan quality.
  • Experiment tracking: compare runs and catch regressions.
  • CI/CD integration: run evals as a quality gate.
  • Install Claude Code skills: auto-invoke the bootstrap, audit, and explain skills.