LLM judge evaluators use a secondary model to assess output quality. multivon-eval uses QAG scoring (Question-Answer Generation) — generating binary yes/no questions about the output instead of asking for a numeric 1-10 rating.
Why QAG? Binary questions are easier for LLMs to get right, fully auditable (you see which questions passed), and cheaper (shorter prompts).
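The scoring idea can be illustrated with a minimal sketch. It is not multivon-eval's implementation: `qag_score` and `fake_judge` are hypothetical names, and the stand-in judge replaces what would be an LLM call in practice. The sketch assumes the score is the fraction of questions answered "yes".

```python
from typing import Callable, Sequence


def qag_score(
    questions: Sequence[str],
    ask_judge: Callable[[str], bool],
) -> tuple[float, list[tuple[str, bool]]]:
    """Return (score, audit trail); score is the fraction of 'yes' answers."""
    if not questions:
        raise ValueError("at least one question is required")
    # Keep every (question, answer) pair so a failing score can be audited.
    audit = [(q, bool(ask_judge(q))) for q in questions]
    passed = sum(1 for _, ok in audit if ok)
    return passed / len(audit), audit


# Hypothetical stand-in judge: a real judge would send each question to an LLM.
def fake_judge(question: str) -> bool:
    return "cite" not in question


score, audit = qag_score(
    [
        "Does the answer address the question?",
        "Is every claim supported by the context?",
        "Does the answer cite a source?",
    ],
    fake_judge,
)
print(round(score, 2))  # 0.67
for question, ok in audit:
    print("PASS" if ok else "FAIL", question)
```

The audit list is what makes this style of scoring inspectable: a failing score points to the specific questions that failed.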
Configuration
JudgeConfig
The judge model is fully decoupled from your pipeline model. Configure it once globally, override per-evaluator, or fall back to environment variables.
from multivon_eval import configure, Faithfulness, JudgeConfig
# Set globally at startup — all evaluators use this unless overridden
configure(JudgeConfig(provider="openai", model="gpt-4o-mini"))
# Override for a specific evaluator
Faithfulness(judge=JudgeConfig(provider="anthropic", model="claude-opus-4-7"))
Resolution order (highest to lowest):
- Per-evaluator judge= kwarg
- configure() global
- JUDGE_PROVIDER / JUDGE_MODEL environment variables
- Built-in default: anthropic / claude-haiku-4-5
# Environment variable fallback
export ANTHROPIC_API_KEY=sk-ant-...
export JUDGE_PROVIDER=anthropic
export JUDGE_MODEL=claude-haiku-4-5
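The resolution order can be sketched as a simple chain of fallbacks. This is an illustration only, not the library's code: `JudgeChoice` and `resolve_judge` are hypothetical names, and requiring both environment variables to be set is an assumption, since the page does not say how a partially set environment is handled.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class JudgeChoice:
    """Hypothetical stand-in for the provider/model part of a judge config."""
    provider: str
    model: str


BUILT_IN_DEFAULT = JudgeChoice("anthropic", "claude-haiku-4-5")


def resolve_judge(
    per_evaluator: Optional[JudgeChoice] = None,
    global_config: Optional[JudgeChoice] = None,
    env: Mapping[str, str] = os.environ,
) -> JudgeChoice:
    # 1. A judge passed directly to the evaluator wins.
    if per_evaluator is not None:
        return per_evaluator
    # 2. Otherwise use the globally configured judge.
    if global_config is not None:
        return global_config
    # 3. Otherwise read the environment (assumption: both variables must be set).
    provider = env.get("JUDGE_PROVIDER")
    model = env.get("JUDGE_MODEL")
    if provider and model:
        return JudgeChoice(provider, model)
    # 4. Finally fall back to the built-in default.
    return BUILT_IN_DEFAULT


print(resolve_judge(env={}))
print(resolve_judge(env={"JUDGE_PROVIDER": "openai", "JUDGE_MODEL": "gpt-4o-mini"}))
```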
| JudgeConfig field | Default | Description |
|---|---|---|
| provider | "anthropic" | "anthropic" or "openai" |
| model | "claude-haiku-4-5" | Model name for the chosen provider |
| base_url | "" | Custom endpoint for local/self-hosted servers (see below) |
| temperature | 0.0 | Sampling temperature (0 = deterministic) |
| max_tokens | 1024 | Token budget for judge responses |
| timeout | 30 | Request timeout in seconds |
The model under test and the judge model can come from different providers.
Local and self-hosted models
Any OpenAI-compatible server works as a judge — Ollama, LM Studio, vLLM, llama.cpp, or a self-hosted endpoint:
# Ollama running locally
configure(JudgeConfig(
    provider="openai",
    model="llama3",
    base_url="http://localhost:11434/v1",
))
# LM Studio
configure(JudgeConfig(
    provider="openai",
    model="local-model",
    base_url="http://localhost:1234/v1",
))
# Self-hosted vLLM or any OpenAI-compatible endpoint
configure(JudgeConfig(
    provider="openai",
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    base_url="https://my-inference-server.internal/v1",
))
base_url is also read from the OPENAI_BASE_URL environment variable, so no code changes are needed to switch between cloud and local judges in CI.
Calibrated thresholds
Faithfulness, Hallucination, and Relevance automatically apply a threshold calibrated for the configured judge model, derived from benchmarks against human-labeled datasets, so you don’t need to tune it manually. Judge models without a calibration entry use 0.70.
| Judge | Hallucination | Faithfulness | Relevance |
|---|---|---|---|
| claude-haiku-4-5-20251001 | 0.55 | 0.90 | 0.30 |
| claude-sonnet-4-6 | 0.30 | 0.90 | 0.30 |
| gpt-4o-mini | 0.30 | 0.90 | 0.30 |
| Other models | 0.70 | 0.70 | 0.70 |
Pass threshold= explicitly to override:
Faithfulness(threshold=0.8) # use your own threshold, skip calibration
To inspect the full calibration table:
from multivon_eval import threshold_table
print(threshold_table())
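The selection logic can be sketched as a table lookup with two fallbacks. This is an illustration, not the library's code: `CALIBRATED`, `OTHER_MODELS`, and `pick_threshold` are hypothetical names, the values are copied from the table above, and exact matching on the model name is an assumption (the page does not say whether aliases such as claude-haiku-4-5 resolve to the dated entry).

```python
from typing import Optional

# Values copied from the calibration table on this page.
CALIBRATED = {
    "claude-haiku-4-5-20251001": {"hallucination": 0.55, "faithfulness": 0.90, "relevance": 0.30},
    "claude-sonnet-4-6": {"hallucination": 0.30, "faithfulness": 0.90, "relevance": 0.30},
    "gpt-4o-mini": {"hallucination": 0.30, "faithfulness": 0.90, "relevance": 0.30},
}
OTHER_MODELS = 0.70  # the "Other models" row


def pick_threshold(evaluator: str, judge_model: str, explicit: Optional[float] = None) -> float:
    # An explicit threshold= skips calibration entirely.
    if explicit is not None:
        return explicit
    # Assumption: exact match on the model name; unknown models use the generic row.
    return CALIBRATED.get(judge_model, {}).get(evaluator, OTHER_MODELS)


print(pick_threshold("hallucination", "claude-haiku-4-5-20251001"))  # 0.55
print(pick_threshold("faithfulness", "llama3"))                      # 0.7
print(pick_threshold("faithfulness", "gpt-4o-mini", explicit=0.8))   # 0.8
```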
Faithfulness
Checks that the output is grounded in the provided context — no invented facts.
When to use: RAG pipelines, document Q&A, or any task where the model must answer strictly from a provided source.
from multivon_eval import EvalCase, Faithfulness
case = EvalCase(
    input="What does the report say about Q3 revenue?",
    context="Q3 revenue was $4.2M, up 18% YoY...",
)
Faithfulness()
Faithfulness(threshold=0.8)
Requires context on the EvalCase.
| Parameter | Type | Default | Description |
|---|---|---|---|
| threshold | float | 0.7 | Minimum score to pass |
Hallucination
Checks that the output doesn’t introduce claims not supported by context.
When to use: Complement to Faithfulness — Faithfulness checks what’s present in the answer; Hallucination checks what shouldn’t be there.
from multivon_eval import EvalCase, Hallucination
case = EvalCase(
    input="Summarize the contract terms",
    context="Contract: 12-month term, $500/month, 30-day cancellation notice.",
)
Hallucination()
Hallucination(threshold=0.8)
Requires context.
| Parameter | Type | Default | Description |
|---|---|---|---|
| threshold | float | 0.7 | Minimum score to pass |
Relevance
Checks that the output actually addresses the input question.
When to use: Catch evasive, off-topic, or boilerplate responses that technically answer but don’t address what was asked.
from multivon_eval import EvalCase, Relevance
case = EvalCase(input="How do I reset my password?")
Relevance()
Relevance(threshold=0.7)
No context required.
| Parameter | Type | Default | Description |
|---|---|---|---|
| threshold | float | 0.7 | Minimum score to pass |
Coherence
Checks that the output is clear, well-structured, and logically sound.
When to use: Long-form generation (articles, reports, explanations) where logical flow and clarity matter.
from multivon_eval import EvalCase, Coherence
case = EvalCase(input="Explain how photosynthesis works")
Coherence()
Coherence(threshold=0.8)
| Parameter | Type | Default | Description |
|---|---|---|---|
| threshold | float | 0.7 | Minimum score to pass |
Toxicity
Checks that the output is safe, non-harmful, and appropriate.
When to use: User-facing products, content moderation pipelines, or any deployment where harmful output is unacceptable. Score 1.0 = not toxic; 0.0 = toxic.
from multivon_eval import EvalCase, Toxicity
case = EvalCase(input="Write a response to this angry customer message")
Toxicity()
Toxicity(threshold=0.9)
| Parameter | Type | Default | Description |
|---|---|---|---|
| threshold | float | 0.9 | Minimum score to pass (higher default reflects zero-tolerance) |
Bias
Checks that the output is free of demographic, political, or cultural bias.
When to use: HR tools, news summarizers, recommendation systems, or any application where systematic favoritism is a risk. Score 1.0 = no bias detected; 0.0 = significant bias.
from multivon_eval import EvalCase, Bias
case = EvalCase(input="Describe the ideal job candidate for this role")
Bias()
Bias(threshold=0.8)
| Parameter | Type | Default | Description |
|---|---|---|---|
| threshold | float | 0.8 | Minimum score to pass |
Summarization
Checks that a summary captures the key points of the source faithfully, without adding or omitting critical information.
When to use: Summarization pipelines — news, legal documents, meeting transcripts.
from multivon_eval import EvalCase, Summarization
case = EvalCase(
    input="Summarize this article",
    context="[Full source article text here...]",
)
Summarization()
Summarization(threshold=0.8)
Requires context (the source document).
| Parameter | Type | Default | Description |
|---|---|---|---|
| threshold | float | 0.7 | Minimum score to pass |
AnswerAccuracy
Checks factual correctness of the output against expected_output. Uses judge comparison rather than string matching, so paraphrasing is handled correctly.
When to use: Knowledge QA, fact retrieval, or any task with a known correct answer where the phrasing may vary.
from multivon_eval import EvalCase, AnswerAccuracy
case = EvalCase(
    input="What is the capital of France?",
    expected_output="Paris",
)
AnswerAccuracy()
AnswerAccuracy(threshold=0.8)
| Parameter | Type | Default | Description |
|---|---|---|---|
| threshold | float | 0.7 | Minimum score to pass |
ContextPrecision
For RAG systems: checks that retrieved context chunks are actually relevant to the question. High precision = low noise in retrieval.
When to use: Evaluating the retrieval stage of a RAG pipeline independently from generation.
from multivon_eval import EvalCase, ContextPrecision
case = EvalCase(
    input="What is our refund policy?",
    context=["Refund policy: 30 days...", "Shipping rates: ...", "Contact us at..."],
)
ContextPrecision()
ContextPrecision(threshold=0.8)
Accepts context as either a string or a list of strings (chunks). Evaluates up to 8 chunks.
| Parameter | Type | Default | Description |
|---|---|---|---|
| threshold | float | 0.7 | Minimum score to pass |
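As a rough illustration of what a chunk-level precision score measures, the sketch below computes the fraction of chunks judged relevant. It is not the library's implementation: `context_precision` and `keyword_judge` are hypothetical names, the keyword check stands in for an LLM judge, and both the "fraction of relevant chunks" formula and taking the first 8 chunks are assumptions made for the example.

```python
from typing import Callable, Sequence, Union

MAX_CHUNKS = 8  # this page states that up to 8 chunks are evaluated


def context_precision(
    question: str,
    context: Union[str, Sequence[str]],
    is_relevant: Callable[[str, str], bool],
) -> float:
    # Accept a single string or a list of chunks, as the evaluator does.
    chunks = [context] if isinstance(context, str) else list(context)
    chunks = chunks[:MAX_CHUNKS]  # assumption: the first 8 chunks are kept
    if not chunks:
        raise ValueError("context must contain at least one chunk")
    relevant = sum(1 for chunk in chunks if is_relevant(question, chunk))
    return relevant / len(chunks)


# Hypothetical stand-in judge: a real judge would ask an LLM about each chunk.
def keyword_judge(question: str, chunk: str) -> bool:
    return "refund" in chunk.lower()


score = context_precision(
    "What is our refund policy?",
    ["Refund policy: 30 days...", "Shipping rates: ...", "Contact us at..."],
    keyword_judge,
)
print(round(score, 2))  # 0.33
```

With one relevant chunk out of three, the sketch scores 0.33, which would fail the default threshold of 0.7 and point to noisy retrieval.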
ContextRecall
For RAG systems: checks that the retrieved context contains everything needed to derive the expected answer.
When to use: Diagnosing retrieval gaps — cases where the model gave a wrong answer because the right chunk wasn’t retrieved.
from multivon_eval import EvalCase, ContextRecall
case = EvalCase(
    input="What is the cancellation fee?",
    context="Cancellation within 30 days: $50 fee applies.",
    expected_output="$50",
)
ContextRecall()
ContextRecall(threshold=0.8)
Requires both context and expected_output.
| Parameter | Type | Default | Description |
|---|---|---|---|
| threshold | float | 0.7 | Minimum score to pass |
CustomRubric
Define your own yes/no criteria. Each criterion is a (question, expected_answer) tuple. Score = fraction of criteria where the judge’s answer matches expected_answer.
When to use: Domain-specific quality checks that don’t map to the built-in evaluators — support tone, legal disclaimers, brand voice.
from multivon_eval import EvalCase, CustomRubric
case = EvalCase(input="Handle this support ticket: 'My order hasn't arrived'")
CustomRubric(
    name="support_quality",
    criteria=[
        ("Does the response acknowledge the customer's problem?", True),
        ("Does the response provide a concrete next step?", True),
        ("Does the response use apologetic or defensive language?", False),
        ("Is the response under 150 words?", True),
    ],
    threshold=0.75,
)
| Parameter | Type | Default | Description |
|---|---|---|---|
| criteria | list[tuple[str, bool]] | required | List of (question, expect_yes) pairs |
| name | str | "custom_rubric" | Display name for this evaluator in reports |
| threshold | float | 0.7 | Minimum fraction of criteria to pass |
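The scoring rule described above (the fraction of criteria where the judge's answer matches the expected answer) can be worked through by hand. The sketch below is illustrative only: `rubric_score` is a hypothetical helper, and the `answers` dictionary is made-up judge output for a single response rather than real data.

```python
from typing import Callable, Sequence

Criterion = tuple[str, bool]  # (question, expected answer)


def rubric_score(criteria: Sequence[Criterion], ask_judge: Callable[[str], bool]) -> float:
    if not criteria:
        raise ValueError("at least one criterion is required")
    # A criterion counts when the judge's yes/no matches the expected value,
    # so an expected False is satisfied by a "no".
    matches = sum(1 for question, expected in criteria if ask_judge(question) == expected)
    return matches / len(criteria)


criteria = [
    ("Does the response acknowledge the customer's problem?", True),
    ("Does the response provide a concrete next step?", True),
    ("Does the response use apologetic or defensive language?", False),
    ("Is the response under 150 words?", True),
]

# Made-up judge answers for one response; the third one misses its expected False.
answers = {
    "Does the response acknowledge the customer's problem?": True,
    "Does the response provide a concrete next step?": True,
    "Does the response use apologetic or defensive language?": True,
    "Is the response under 150 words?": True,
}

score = rubric_score(criteria, lambda question: answers[question])
print(score)          # 0.75
print(score >= 0.75)  # True
```

With four criteria, a threshold of 0.75 tolerates exactly one mismatch, assuming a score equal to the threshold passes.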
GEval
Holistic numeric scoring for qualities that don’t decompose well into yes/no questions (creativity, tone, polish). The judge returns a 0.0–1.0 score directly with reasoning.
When to use: Subjective qualities like writing style, creativity, or polish where binary questions don’t capture the nuance. Use sparingly — less auditable than QAG evaluators.
from multivon_eval import EvalCase, GEval
case = EvalCase(input="Write a product description for wireless headphones")
GEval(
    name="writing_quality",
    criteria="The response is engaging, concise, and professionally written.",
    threshold=0.7,
)
| Parameter | Type | Default | Description |
|---|---|---|---|
| criteria | str | required | Free-text description of what to evaluate |
| name | str | "g_eval" | Display name for this evaluator in reports |
| runs | int | 2 | Number of judge runs to average; reduces position and framing bias |
| judge | JudgeConfig | None | Override the judge model for this evaluator |
| threshold | float | 0.7 | Minimum score to pass |
GEval is the only evaluator that uses a numeric score directly from the judge rather than QAG aggregation.
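The effect of the runs parameter can be pictured as averaging several direct scores. The sketch below is an assumption-laden illustration: `averaged_score` is a hypothetical helper, the canned replies stand in for judge calls, and a plain arithmetic mean is assumed because the page says only that runs are averaged.

```python
from statistics import fmean
from typing import Callable


def averaged_score(score_once: Callable[[], float], runs: int = 2) -> float:
    """Call the judge `runs` times and return the mean of its 0.0-1.0 scores."""
    if runs < 1:
        raise ValueError("runs must be at least 1")
    return fmean(score_once() for _ in range(runs))


# Canned replies standing in for two judge calls on the same output.
replies = iter([0.8, 0.6])
result = averaged_score(lambda: next(replies), runs=2)
print(round(result, 2))  # 0.7
```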
CheckEvaluator
The fastest way to add a quality check. You write a plain-English criterion; CheckEvaluator auto-generates specific yes/no questions from it and scores with QAG. No need to pick an evaluator class or write questions manually.
from multivon_eval import EvalSuite, EvalCase
suite = EvalSuite("return policy eval")
suite.add_check("Response should mention the return policy")
suite.add_check("Tone should be professional and not defensive", threshold=0.8)
suite.add_cases([EvalCase(input="What is your return policy?")])
report = suite.run(my_model)
Questions are generated once at the start of suite.run() (eager warmup), so no case pays the generation cost and failures surface before the eval loop starts.
Escape hatch: pin questions for CI
Generated questions vary per run and per model. For reproducible CI runs, pin them explicitly:
suite.add_check(
    "Response should mention the return policy",
    questions=[
        "Does the response mention a return window?",
        "Does the response name the return policy by name?",
        "Does the response provide a link or next step?",
    ],
)
When questions= is set, no LLM call is made during prepare().
Inspect generated questions
import logging
logging.basicConfig(level=logging.INFO) # logs generated questions (basicConfig writes to stderr by default)
ev = suite._evaluators[0]
suite.run(my_model)
print(ev.resolved_questions) # ['Does the response...', ...]
Discrete scores for N questions
With the default num_questions=3, the only possible scores are 0.0, 0.33, 0.67, and 1.0. The default threshold of 0.7 therefore requires 3/3 questions to pass. Lower the threshold or use num_questions=5 if you want more granularity.
| num_questions | Possible scores | Threshold 0.7 requires |
|---|---|---|
| 3 | 0.0, 0.33, 0.67, 1.0 | 3/3 |
| 5 | 0.0, 0.2, 0.4, 0.6, 0.8, 1.0 | 4/5 |
| 10 | 0.0, 0.1, …, 1.0 | 7/10 |
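The figures in the table can be reproduced with a few lines of arithmetic. The helper names below are hypothetical, and the sketch assumes a check passes when its score is greater than or equal to the threshold.

```python
def possible_scores(num_questions: int) -> list[float]:
    """Every score reachable with num_questions yes/no questions, rounded to 2 places."""
    return [round(k / num_questions, 2) for k in range(num_questions + 1)]


def min_passing(num_questions: int, threshold: float = 0.7) -> int:
    """Smallest number of passing questions whose score meets the threshold."""
    if num_questions < 1 or not 0.0 <= threshold <= 1.0:
        raise ValueError("need num_questions >= 1 and 0.0 <= threshold <= 1.0")
    return next(k for k in range(num_questions + 1) if k / num_questions >= threshold)


for n in (3, 5, 10):
    print(n, possible_scores(n), f"{min_passing(n)}/{n}")
```

The same helpers show the effect of lowering the threshold: `min_passing(3, threshold=0.6)` returns 2, so two of three questions would be enough.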
Fallback behavior
If question generation fails after two attempts, CheckEvaluator emits a warning via warnings.warn and falls back to using the criterion itself as a single yes/no question. The EvalResult reason will include a [⚠ question generation failed — using fallback] tag. To detect the fallback programmatically, check ev._used_fallback.
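The retry-then-fallback pattern described above might look like the following sketch. It is not the library's code: `questions_or_fallback` and `broken_generator` are hypothetical, and treating an empty result the same as an exception is an assumption.

```python
import warnings
from typing import Callable, Sequence


def questions_or_fallback(
    criterion: str,
    generate: Callable[[str], Sequence[str]],
    attempts: int = 2,
) -> tuple[list[str], bool]:
    """Return (questions, used_fallback)."""
    for _ in range(attempts):
        try:
            questions = generate(criterion)
        except Exception:
            continue  # retry on any generation error
        if questions:
            return list(questions), False
    # Every attempt failed: warn and use the criterion itself as one question.
    warnings.warn("question generation failed; using the criterion as a single question")
    return [criterion], True


# Hypothetical generator that always fails, to exercise the fallback path.
def broken_generator(criterion: str) -> Sequence[str]:
    raise RuntimeError("judge unavailable")


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    questions, used_fallback = questions_or_fallback(
        "Tone should be professional", broken_generator
    )

print(questions, used_fallback, len(caught))
```

Note that a single fallback question makes the score binary (0.0 or 1.0), so a fallback run is not directly comparable to a run with generated questions.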
| Parameter | Type | Default | Description |
|---|---|---|---|
| criterion | str | required | Plain-English quality criterion (max 300 chars) |
| threshold | float | 0.7 | Fraction of questions that must pass |
| num_questions | int | 3 | Number of yes/no questions to generate (clamped 1–10) |
| questions | list[str] | None | Pin specific questions; skips generation entirely |
| name | str | derived | Evaluator name in reports (auto-derived from criterion if omitted) |
| judge | JudgeConfig | None | Override the judge model for this evaluator |