Don’t know what to eval for your specific LLM product? Describe it and hand over a few sample traces — multivon-eval bootstrap proposes a tuned suite in under 60 seconds.
Returns four files:
- eval_suite.py (runnable)
- seed_cases.jsonl (30 adversarial cases)
- thresholds.yaml (calibrated from your traces)
- DISCOVERY_REPORT.md (an eval design review)

Cost: ~$0.12 default (or free with --judge-provider ollama). Full walkthrough →

Run it fully offline with a local judge:
Runs a self-contained customer-support eval — no API key required for the deterministic tier. If ANTHROPIC_API_KEY, OPENAI_API_KEY, or a local endpoint is detected, LLM-judge evaluators are added automatically.
```shell
export ANTHROPIC_API_KEY=sk-ant-...
# or
export OPENAI_API_KEY=sk-...
# or point at a local Ollama / LM Studio server
export OPENAI_BASE_URL=http://localhost:11434/v1
export DEMO_MODEL=llama3
```
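The automatic detection described above can be pictured roughly as an environment lookup. The following is a minimal sketch, not multivon-eval's actual code: the function name `detect_judge_provider`, the returned labels, and the precedence between variables are all assumptions made for illustration.

```python
import os
from typing import Mapping, Optional


def detect_judge_provider(env: Optional[Mapping[str, str]] = None) -> Optional[str]:
    """Hypothetical helper: guess which judge backend the environment enables.

    Returns None when nothing is configured, i.e. only the deterministic
    tier would run. The precedence order below is an assumption.
    """
    env = os.environ if env is None else env
    if env.get("ANTHROPIC_API_KEY"):
        return "anthropic"
    if env.get("OPENAI_BASE_URL"):
        # A custom base URL suggests a local OpenAI-compatible server
        # such as Ollama or LM Studio.
        return "local"
    if env.get("OPENAI_API_KEY"):
        return "openai"
    return None


if __name__ == "__main__":
    provider = detect_judge_provider()
    print(provider or "deterministic tier only")
```

Passing a plain dict instead of reading `os.environ` directly keeps the sketch easy to test without touching your shell.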
Don’t know which evaluator to use? Write what you want in English:
```python
from multivon_eval import EvalSuite, EvalCase

def my_model(input: str) -> str:
    return call_my_llm(input)

suite = EvalSuite("return policy eval")
suite.add_check("Response should mention the return policy")
suite.add_check("Tone should be professional and not defensive")
suite.add_cases([EvalCase(input="What is your return policy?")])
report = suite.run(my_model)
```
add_check auto-generates yes/no questions from your criterion and scores the response with QAG (question-answer generation). Graduate to CustomRubric when you want to pin the exact questions.
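To make the QAG idea concrete, here is a minimal sketch of how a score can be derived from yes/no questions. It assumes the score is simply the fraction of questions answered "yes"; the names `qag_score` and `toy_judge` are hypothetical, and the keyword-matching judge stands in for the LLM judge a real run would use. It does not reflect multivon-eval's internals.

```python
from typing import Callable, List

# Hypothetical yes/no questions that a criterion such as
# "Response should mention the return policy" might expand into.
QUESTION_KEYWORDS = {
    "Does the response mention the return policy?": "return policy",
    "Does the response state a time window for returns?": "days",
}


def toy_judge(question: str, response: str) -> bool:
    """Deterministic stand-in for an LLM judge: answers 'yes' on a keyword hit."""
    return QUESTION_KEYWORDS[question] in response.lower()


def qag_score(
    questions: List[str],
    response: str,
    judge: Callable[[str, str], bool],
) -> float:
    """Return the fraction of questions the judge answers 'yes' to."""
    if not questions:
        raise ValueError("at least one question is required")
    yes_count = sum(1 for q in questions if judge(q, response))
    return yes_count / len(questions)


questions = list(QUESTION_KEYWORDS)
score = qag_score(questions, "Our return policy covers all unworn items.", toy_judge)
print(score)
```

Because each question is closed-ended, a failing score can be traced back to the specific question that came back "no", which is the main appeal of this style of scoring over a single free-form rating.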
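```python
# Faithfulness is used below, so it is imported here as well
# (assumes it is exported from the top-level package like the other evaluators).
from multivon_eval import (
    EvalSuite, EvalCase, NotEmpty, ExactMatch, Contains, Faithfulness,
)

suite = EvalSuite("My First Eval")

suite.add_cases([
    EvalCase(
        input="What is the capital of France?",
        expected_output="Paris",
    ),
    EvalCase(
        input="Summarize this article.",
        context="The article discusses climate change and its effects on polar ice...",
    ),
])

suite.add_evaluators(
    NotEmpty(),
    Contains(["Paris"]),
    Faithfulness(),
)

# my_model is the callable defined in the previous example.
report = suite.run(my_model, verbose=True)
```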