LLM judge evaluators use a secondary model to assess output quality. multivon-eval uses QAG (Question-Answer Generation) scoring: it generates binary yes/no questions about the output instead of asking the judge for a numeric 1-10 rating. Why QAG? Binary questions are easier for LLMs to answer correctly, the results are fully auditable (you see which questions passed), and the prompts are shorter and therefore cheaper.
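As a rough illustration of QAG aggregation, the sketch below scores an output as the fraction of yes/no questions that passed. The helper name `qag_score` and the list-of-booleans input are assumptions made for this example; multivon-eval's internal aggregation may differ.

```python
def qag_score(answers: list[bool]) -> float:
    """Return the fraction of binary judge answers that passed.

    Hypothetical helper for illustration, not part of multivon-eval.
    """
    if not answers:
        raise ValueError("at least one question is required")
    return sum(answers) / len(answers)


# Three questions, two of which the judge marked as passed.
example_answers = [True, True, False]
example_score = qag_score(example_answers)
print(f"score={example_score:.2f}")  # prints "score=0.67"
```

Because every question is recorded as a pass or a fail, the list of answers doubles as an audit trail for the final score.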

Configuration

JudgeConfig

The judge model is fully decoupled from your pipeline model. Configure it once globally, override per-evaluator, or fall back to environment variables.
from multivon_eval import configure, JudgeConfig, Faithfulness

# Set globally at startup — all evaluators use this unless overridden
configure(JudgeConfig(provider="openai", model="gpt-4o-mini"))

# Override for a specific evaluator
Faithfulness(judge=JudgeConfig(provider="anthropic", model="claude-opus-4-7"))
Resolution order (highest to lowest):
  1. Per-evaluator judge= kwarg
  2. configure() global
  3. JUDGE_PROVIDER / JUDGE_MODEL environment variables
  4. Built-in default: anthropic / claude-haiku-4-5
# Environment variable fallback
export ANTHROPIC_API_KEY=sk-ant-...
export JUDGE_PROVIDER=anthropic
export JUDGE_MODEL=claude-haiku-4-5
| JudgeConfig field | Default | Description |
| --- | --- | --- |
| provider | "anthropic" | "anthropic" or "openai" |
| model | "claude-haiku-4-5" | Model name for the chosen provider |
| base_url | "" | Custom endpoint for local/self-hosted servers (see below) |
| temperature | 0.0 | Sampling temperature (0 = deterministic) |
| max_tokens | 1024 | Token budget for judge responses |
| timeout | 30 | Request timeout in seconds |
The model under test and the judge model can be different providers.
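The resolution order described above can be sketched in plain Python. `JudgeChoice` and `resolve_judge` are hypothetical stand-ins written for this example, not multivon-eval APIs, and the sketch assumes the environment is used only when both `JUDGE_PROVIDER` and `JUDGE_MODEL` are set.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class JudgeChoice:
    """Hypothetical stand-in for the provider/model pair of a JudgeConfig."""
    provider: str
    model: str


BUILT_IN_DEFAULT = JudgeChoice("anthropic", "claude-haiku-4-5")


def resolve_judge(
    per_evaluator: Optional[JudgeChoice] = None,
    global_config: Optional[JudgeChoice] = None,
    env: Optional[Mapping[str, str]] = None,
) -> JudgeChoice:
    """Pick a judge following the documented precedence, highest first."""
    env = os.environ if env is None else env
    if per_evaluator is not None:   # 1. per-evaluator judge= kwarg
        return per_evaluator
    if global_config is not None:   # 2. configure() global
        return global_config
    provider = env.get("JUDGE_PROVIDER")
    model = env.get("JUDGE_MODEL")
    if provider and model:          # 3. environment variables (assumed: both set)
        return JudgeChoice(provider, model)
    return BUILT_IN_DEFAULT         # 4. built-in default
```

Passing `env` explicitly keeps the function easy to test without touching the real process environment.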

Local and self-hosted models

Any OpenAI-compatible server works as a judge — Ollama, LM Studio, vLLM, llama.cpp, or a self-hosted endpoint:
from multivon_eval import configure, JudgeConfig

# Ollama running locally
configure(JudgeConfig(
    provider="openai",
    model="llama3",
    base_url="http://localhost:11434/v1",
))

# LM Studio
configure(JudgeConfig(
    provider="openai",
    model="local-model",
    base_url="http://localhost:1234/v1",
))

# Self-hosted vLLM or any OpenAI-compatible endpoint
configure(JudgeConfig(
    provider="openai",
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    base_url="https://my-inference-server.internal/v1",
))
base_url is also read from the OPENAI_BASE_URL environment variable, so no code changes are needed to switch between cloud and local judges in CI.

Calibrated thresholds

Faithfulness, Hallucination, and Relevance automatically apply the optimal threshold for the configured judge model, derived from benchmarks against human-labeled datasets. You don’t need to tune this manually.
| Judge | Hallucination | Faithfulness | Relevance |
| --- | --- | --- | --- |
| claude-haiku-4-5-20251001 | 0.55 | 0.90 | 0.30 |
| claude-sonnet-4-6 | 0.30 | 0.90 | 0.30 |
| gpt-4o-mini | 0.30 | 0.90 | 0.30 |
| Other models | 0.70 | 0.70 | 0.70 |
Pass threshold= explicitly to override:
Faithfulness(threshold=0.8)   # use your own threshold, skip calibration
To inspect the full calibration table:
from multivon_eval import threshold_table
print(threshold_table())
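To make the lookup concrete, here is a minimal sketch of how a calibrated threshold could be chosen. The values are copied from the calibration table in this section; the dictionary layout, the exact-name matching, and the function `effective_threshold` are assumptions for illustration rather than the library's implementation.

```python
from typing import Optional

# Values copied from the calibration table; the structure is hypothetical.
CALIBRATED = {
    "claude-haiku-4-5-20251001": {"hallucination": 0.55, "faithfulness": 0.90, "relevance": 0.30},
    "claude-sonnet-4-6": {"hallucination": 0.30, "faithfulness": 0.90, "relevance": 0.30},
    "gpt-4o-mini": {"hallucination": 0.30, "faithfulness": 0.90, "relevance": 0.30},
}
OTHER_MODELS = 0.70  # the "Other models" row


def effective_threshold(evaluator: str, model: str, explicit: Optional[float] = None) -> float:
    """Return the explicit threshold if given, else the calibrated or fallback value."""
    if explicit is not None:
        return explicit  # an explicit threshold= skips calibration
    return CALIBRATED.get(model, {}).get(evaluator, OTHER_MODELS)


print(effective_threshold("faithfulness", "gpt-4o-mini"))       # 0.9
print(effective_threshold("relevance", "llama3"))               # 0.7
print(effective_threshold("faithfulness", "gpt-4o-mini", 0.8))  # 0.8
```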

Faithfulness

Checks that the output is grounded in the provided context — no invented facts. When to use: RAG pipelines, document Q&A, or any task where the model must answer strictly from a provided source.
from multivon_eval import EvalCase, Faithfulness

case = EvalCase(
    input="What does the report say about Q3 revenue?",
    context="Q3 revenue was $4.2M, up 18% YoY...",
)
Faithfulness()
Faithfulness(threshold=0.8)
Requires context on the EvalCase.
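| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| threshold | float | 0.7 | Minimum score to pass. When omitted, the calibrated threshold for the configured judge model applies (0.7 for uncalibrated models). |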

Hallucination

Checks that the output doesn’t introduce claims not supported by context. When to use: Complement to Faithfulness — Faithfulness checks what’s present in the answer; Hallucination checks what shouldn’t be there.
from multivon_eval import EvalCase, Hallucination

case = EvalCase(
    input="Summarize the contract terms",
    context="Contract: 12-month term, $500/month, 30-day cancellation notice.",
)
Hallucination()
Hallucination(threshold=0.8)
Requires context.
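| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| threshold | float | 0.7 | Minimum score to pass. When omitted, the calibrated threshold for the configured judge model applies (0.7 for uncalibrated models). |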

Relevance

Checks that the output actually addresses the input question. When to use: Catch evasive, off-topic, or boilerplate responses that technically answer but don’t address what was asked.
from multivon_eval import EvalCase, Relevance

case = EvalCase(input="How do I reset my password?")
Relevance()
Relevance(threshold=0.7)
No context required.
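| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| threshold | float | 0.7 | Minimum score to pass. When omitted, the calibrated threshold for the configured judge model applies (0.7 for uncalibrated models). |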

Coherence

Checks that the output is clear, well-structured, and logically sound. When to use: Long-form generation (articles, reports, explanations) where logical flow and clarity matter.
from multivon_eval import EvalCase, Coherence

case = EvalCase(input="Explain how photosynthesis works")
Coherence()
Coherence(threshold=0.8)
| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| threshold | float | 0.7 | Minimum score to pass |

Toxicity

Checks that the output is safe, non-harmful, and appropriate. When to use: User-facing products, content moderation pipelines, or any deployment where harmful output is unacceptable. Score 1.0 = not toxic; 0.0 = toxic.
from multivon_eval import EvalCase, Toxicity

case = EvalCase(input="Write a response to this angry customer message")
Toxicity()
Toxicity(threshold=0.9)
| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| threshold | float | 0.9 | Minimum score to pass (higher default reflects zero-tolerance) |

Bias

Checks that the output is free of demographic, political, or cultural bias. When to use: HR tools, news summarizers, recommendation systems, or any application where systematic favoritism is a risk. Score 1.0 = no bias detected; 0.0 = significant bias.
from multivon_eval import EvalCase, Bias

case = EvalCase(input="Describe the ideal job candidate for this role")
Bias()
Bias(threshold=0.8)
| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| threshold | float | 0.8 | Minimum score to pass |

Summarization

Checks that a summary captures the key points of the source faithfully, without adding or omitting critical information. When to use: Summarization pipelines — news, legal documents, meeting transcripts.
from multivon_eval import EvalCase, Summarization

case = EvalCase(
    input="Summarize this article",
    context="[Full source article text here...]",
)
Summarization()
Summarization(threshold=0.8)
Requires context (the source document).
| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| threshold | float | 0.7 | Minimum score to pass |

AnswerAccuracy

Checks factual correctness of the output against expected_output. Uses judge comparison rather than string matching, so paraphrasing is handled correctly. When to use: Knowledge QA, fact retrieval, or any task with a known correct answer where the phrasing may vary.
from multivon_eval import EvalCase, AnswerAccuracy

case = EvalCase(
    input="What is the capital of France?",
    expected_output="Paris",
)
AnswerAccuracy()
AnswerAccuracy(threshold=0.8)
| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| threshold | float | 0.7 | Minimum score to pass |

ContextPrecision

For RAG systems: checks that retrieved context chunks are actually relevant to the question. High precision = low noise in retrieval. When to use: Evaluating the retrieval stage of a RAG pipeline independently from generation.
from multivon_eval import EvalCase, ContextPrecision

case = EvalCase(
    input="What is our refund policy?",
    context=["Refund policy: 30 days...", "Shipping rates: ...", "Contact us at..."],
)
ContextPrecision()
ContextPrecision(threshold=0.8)
Accepts context as either a string or a list of strings (chunks). Evaluates up to 8 chunks.
| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| threshold | float | 0.7 | Minimum score to pass |
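The string-or-list handling of context might look roughly like the sketch below. It rests on two assumptions the source does not confirm: a bare string counts as a single chunk, and chunks beyond the first 8 are simply ignored. `normalize_context` is a hypothetical helper, not a multivon-eval function.

```python
from typing import Sequence, Union

MAX_CHUNKS = 8  # limit stated in the ContextPrecision description


def normalize_context(context: Union[str, Sequence[str]]) -> list[str]:
    """Return context as a list of at most MAX_CHUNKS chunks.

    Assumed behavior: a bare string is one chunk; extra chunks are dropped.
    """
    chunks = [context] if isinstance(context, str) else list(context)
    return chunks[:MAX_CHUNKS]


print(normalize_context("Refund policy: 30 days..."))               # one-element list
print(len(normalize_context([f"chunk {i}" for i in range(12)])))    # 8
```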

ContextRecall

For RAG systems: checks that the retrieved context contains everything needed to derive the expected answer. When to use: Diagnosing retrieval gaps — cases where the model gave a wrong answer because the right chunk wasn’t retrieved.
from multivon_eval import EvalCase, ContextRecall

case = EvalCase(
    input="What is the cancellation fee?",
    context="Cancellation within 30 days: $50 fee applies.",
    expected_output="$50",
)
ContextRecall()
ContextRecall(threshold=0.8)
Requires both context and expected_output.
| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| threshold | float | 0.7 | Minimum score to pass |

CustomRubric

Define your own yes/no criteria. Each criterion is a (question, expected_answer) tuple. Score = fraction of criteria where the judge’s answer matches expected_answer. When to use: Domain-specific quality checks that don’t map to the built-in evaluators — support tone, legal disclaimers, brand voice.
from multivon_eval import EvalCase, CustomRubric

case = EvalCase(input="Handle this support ticket: 'My order hasn't arrived'")
CustomRubric(
    name="support_quality",
    criteria=[
        ("Does the response acknowledge the customer's problem?", True),
        ("Does the response provide a concrete next step?", True),
        ("Does the response use apologetic or defensive language?", False),
        ("Is the response under 150 words?", True),
    ],
    threshold=0.75,
)
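| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| criteria | list[tuple[str, bool]] | required | List of (question, expected_answer) pairs; True means a "yes" answer is expected |
| name | str | "custom_rubric" | Display name for this evaluator in reports |
| threshold | float | 0.7 | Minimum fraction of criteria to pass |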
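The scoring rule for CustomRubric (fraction of criteria where the judge's answer matches the expected answer) can be worked through by hand. In this sketch, `rubric_score` is a hypothetical helper and the judge answers are made up for the example; it also assumes a case passes when the score is greater than or equal to the threshold.

```python
def rubric_score(criteria: list[tuple[str, bool]], judge_answers: list[bool]) -> float:
    """Fraction of criteria whose judge answer equals the expected answer."""
    if not criteria or len(criteria) != len(judge_answers):
        raise ValueError("need one judge answer per criterion")
    matches = sum(
        expected == answer
        for (_question, expected), answer in zip(criteria, judge_answers)
    )
    return matches / len(criteria)


criteria = [
    ("Does the response acknowledge the customer's problem?", True),
    ("Does the response provide a concrete next step?", True),
    ("Does the response use apologetic or defensive language?", False),
    ("Is the response under 150 words?", True),
]
# Invented judge answers: the third one ("yes") contradicts the expected "no".
judge_answers = [True, True, True, True]

score = rubric_score(criteria, judge_answers)
print(score)          # 0.75
print(score >= 0.75)  # True, assuming pass means score >= threshold
```

With four criteria, a threshold of 0.75 tolerates exactly one mismatch.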

GEval

Holistic numeric scoring for qualities that don’t decompose well into yes/no questions (creativity, tone, polish). The judge returns a 0.0–1.0 score directly with reasoning. When to use: Subjective qualities like writing style, creativity, or polish where binary questions don’t capture the nuance. Use sparingly — less auditable than QAG evaluators.
from multivon_eval import EvalCase, GEval

case = EvalCase(input="Write a product description for wireless headphones")
GEval(
    name="writing_quality",
    criteria="The response is engaging, concise, and professionally written.",
    threshold=0.7,
)
| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| criteria | str | required | Free-text description of what to evaluate |
| name | str | "g_eval" | Display name for this evaluator in reports |
| runs | int | 2 | Number of judge runs to average; reduces position and framing bias |
| judge | JudgeConfig | None | Override the judge model for this evaluator |
| threshold | float | 0.7 | Minimum score to pass |
GEval is the only evaluator that uses a numeric score directly from the judge rather than QAG aggregation.
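To illustrate what averaging over `runs` does, the sketch below combines per-run judge scores with a plain arithmetic mean. Both the use of a simple mean and the helper `averaged_score` are assumptions for this example; the per-run scores are invented.

```python
from statistics import fmean


def averaged_score(run_scores: list[float], threshold: float = 0.7) -> tuple[float, bool]:
    """Average per-run scores in [0.0, 1.0] and compare against the threshold."""
    if not run_scores:
        raise ValueError("at least one run is required")
    if any(not 0.0 <= s <= 1.0 for s in run_scores):
        raise ValueError("scores must lie in [0.0, 1.0]")
    score = fmean(run_scores)
    return score, score >= threshold


# Two runs (the default runs=2) with invented scores.
score, passed = averaged_score([0.6, 0.9])
print(f"{score:.2f} {passed}")  # prints "0.75 True"
```

A single low run no longer decides the outcome on its own, which is the point of averaging.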

CheckEvaluator

The fastest way to add a quality check. You write a plain-English criterion; CheckEvaluator auto-generates specific yes/no questions from it and scores with QAG. No need to pick an evaluator class or write questions manually.
from multivon_eval import EvalSuite, EvalCase

suite = EvalSuite("return policy eval")
suite.add_check("Response should mention the return policy")
suite.add_check("Tone should be professional and not defensive", threshold=0.8)
suite.add_cases([EvalCase(input="What is your return policy?")])
report = suite.run(my_model)
Questions are generated once at the start of suite.run() (eager warmup), so no case pays the generation cost and failures surface before the eval loop starts.

Escape hatch: pin questions for CI

Generated questions vary per run and per model. For reproducible CI runs, pin them explicitly:
suite.add_check(
    "Response should mention the return policy",
    questions=[
        "Does the response mention a return window?",
        "Does the response name the return policy by name?",
        "Does the response provide a link or next step?",
    ],
)
When questions= is set, no LLM call is made during prepare().

Inspect generated questions

import logging
logging.basicConfig(level=logging.INFO)  # logs generated questions (to stderr by default)

ev = suite._evaluators[0]
suite.run(my_model)
print(ev.resolved_questions)  # ['Does the response...', ...]

Discrete scores for N questions

With the default num_questions=3, the only possible scores are 0.0, 0.33, 0.67, and 1.0. The default threshold of 0.7 therefore requires 3/3 questions to pass. Lower the threshold or use num_questions=5 if you want more granularity.
| num_questions | Possible scores | Threshold 0.7 requires |
| --- | --- | --- |
| 3 | 0.0, 0.33, 0.67, 1.0 | 3/3 |
| 5 | 0.0, 0.2, 0.4, 0.6, 0.8, 1.0 | 4/5 |
| 10 | 0.0, 0.1, …, 1.0 | 7/10 |
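The arithmetic behind these discrete scores is easy to reproduce. The sketch below assumes a check passes when its score is greater than or equal to the threshold; `possible_scores` and `min_passes` are hypothetical helpers written for this example.

```python
def possible_scores(num_questions: int) -> list[float]:
    """All scores reachable with num_questions binary questions (rounded to 2 places)."""
    if num_questions < 1:
        raise ValueError("num_questions must be at least 1")
    return [round(k / num_questions, 2) for k in range(num_questions + 1)]


def min_passes(num_questions: int, threshold: float = 0.7) -> int:
    """Smallest number of passing questions whose score meets the threshold."""
    if num_questions < 1 or not 0.0 <= threshold <= 1.0:
        raise ValueError("need num_questions >= 1 and 0.0 <= threshold <= 1.0")
    return next(k for k in range(num_questions + 1) if k / num_questions >= threshold)


print(possible_scores(3))  # [0.0, 0.33, 0.67, 1.0]
for n in (3, 5, 10):
    print(f"{min_passes(n)}/{n}")  # prints 3/3, 4/5, 7/10
```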

Fallback behavior

If question generation fails after two attempts, CheckEvaluator emits a warning via warnings.warn and falls back to using the criterion itself as a single yes/no question. The EvalResult reason then includes a [⚠ question generation failed — using fallback] tag. To detect the fallback programmatically, check ev._used_fallback.
| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| criterion | str | required | Plain-English quality criterion (max 300 chars) |
| threshold | float | 0.7 | Fraction of questions that must pass |
| num_questions | int | 3 | Number of yes/no questions to generate (clamped 1–10) |
| questions | list[str] | None | Pin specific questions; skips generation entirely |
| name | str | derived | Evaluator name in reports (auto-derived from criterion if omitted) |
| judge | JudgeConfig | None | Override the judge model for this evaluator |
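The fallback behavior described under "Fallback behavior" follows a common retry-then-degrade pattern, sketched here in isolation. `resolve_questions` and the `generate` callable are hypothetical; the sketch treats an exception or an empty result as a failed attempt, which is an assumption about what "fails" means.

```python
import warnings
from typing import Callable


def resolve_questions(
    criterion: str,
    generate: Callable[[str], list[str]],
    attempts: int = 2,
) -> tuple[list[str], bool]:
    """Return (questions, used_fallback) after up to `attempts` generation tries."""
    for _ in range(attempts):
        try:
            questions = generate(criterion)
        except Exception:
            continue  # treat any error as a failed attempt (assumption)
        if questions:
            return list(questions), False
    warnings.warn("question generation failed; using the criterion as a single question")
    return [criterion], True


def always_fails(criterion: str) -> list[str]:
    """Stand-in generator that simulates a judge outage."""
    raise RuntimeError("judge unavailable")


questions, used_fallback = resolve_questions("Tone should be professional", always_fails)
print(questions, used_fallback)  # ['Tone should be professional'] True
```

Returning the flag alongside the questions lets a caller decide whether a fallback run should count in CI.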