LLM Judge Evaluators

LLM judge evaluators use a secondary model to assess output quality. multivon-eval uses QAG (Question-Answer Generation) scoring: it generates binary yes/no questions about the output instead of asking the judge for a numeric 1-10 rating.

Why QAG? Binary questions are easier for LLMs to answer reliably, fully auditable (you can see which questions passed), and cheaper (shorter prompts).

Configuration

JudgeConfig

The judge model is fully decoupled from your pipeline model. Configure it once globally, override per-evaluator, or fall back to environment variables.

from multivon_eval import configure, JudgeConfig, Faithfulness

# Set globally at startup — all evaluators use this unless overridden
configure(JudgeConfig(provider="openai", model="gpt-4o-mini"))

# Override for a specific evaluator
Faithfulness(judge=JudgeConfig(provider="anthropic", model="claude-opus-4-7"))

Resolution order (highest to lowest):

1. Per-evaluator judge= kwarg
2. configure() global
3. JUDGE_PROVIDER / JUDGE_MODEL environment variables
4. Built-in default: anthropic / claude-haiku-4-5

# Environment variable fallback
export ANTHROPIC_API_KEY=sk-ant-...
export JUDGE_PROVIDER=anthropic
export JUDGE_MODEL=claude-haiku-4-5
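
The resolution order above can be sketched as a small function. This is an illustration only, not multivon-eval's internals: resolve_judge and its tuple/dict inputs are hypothetical, and the sketch assumes the environment fallback applies only when both variables are set. Only the variable names and the built-in default come from the text.

```python
from typing import Mapping, Optional, Tuple

# Built-in default taken from the resolution list above.
BUILTIN_DEFAULT: Tuple[str, str] = ("anthropic", "claude-haiku-4-5")


def resolve_judge(
    per_evaluator: Optional[Tuple[str, str]],
    global_config: Optional[Tuple[str, str]],
    env: Mapping[str, str],
) -> Tuple[str, str]:
    """Return (provider, model) from the highest-priority source that is set.

    Hypothetical helper: the real library resolves JudgeConfig objects,
    not tuples.
    """
    if per_evaluator is not None:        # 1. judge= kwarg on the evaluator
        return per_evaluator
    if global_config is not None:        # 2. configure() global
        return global_config
    # 3. Environment variables (assumption: both must be present)
    if "JUDGE_PROVIDER" in env and "JUDGE_MODEL" in env:
        return env["JUDGE_PROVIDER"], env["JUDGE_MODEL"]
    return BUILTIN_DEFAULT               # 4. Built-in default


# The per-evaluator override wins over the global config.
print(resolve_judge(("anthropic", "claude-opus-4-7"), ("openai", "gpt-4o-mini"), {}))
```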

JudgeConfig field	Default	Description
`provider`	`"anthropic"`	`"anthropic"` or `"openai"`
`model`	`"claude-haiku-4-5"`	Model name for the chosen provider
`base_url`	`""`	Custom endpoint for local/self-hosted servers (see below)
`temperature`	`None` = inherit, effective `0.0`	Sampling temperature (0 = deterministic)
`max_tokens`	`None` = inherit, effective `1024`	Token budget for judge responses
`timeout`	`None` = inherit, effective `30`	Request timeout in seconds

Changed in 0.16.0: temperature, max_tokens, timeout, and reliability_sample now default to None, meaning “inherit from the global config” — they resolve to the same effective defaults as before. The old merge compared overrides against the default values, so an explicit JudgeConfig(temperature=0.0) was silently ignored whenever a nonzero global was configured. Explicit values — including 0.0 — now always win.
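
To make the None-means-inherit rule concrete, here is a minimal sketch of a field-by-field merge. SketchConfig and merge are hypothetical stand-ins, not the library's classes; the effective defaults are the ones listed in the table above.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass(frozen=True)
class SketchConfig:
    """Hypothetical stand-in for the inheritable JudgeConfig fields."""
    temperature: Optional[float] = None
    max_tokens: Optional[int] = None
    timeout: Optional[float] = None


# Effective defaults from the field table above.
EFFECTIVE_DEFAULTS = SketchConfig(temperature=0.0, max_tokens=1024, timeout=30)


def merge(override: SketchConfig, global_cfg: SketchConfig) -> SketchConfig:
    """Resolve each field as: override, then global, then effective default.

    None is the only "unset" marker, so an explicit 0.0 is a real value
    and is never mistaken for "use the global setting".
    """
    resolved = {}
    for f in fields(SketchConfig):
        for layer in (override, global_cfg, EFFECTIVE_DEFAULTS):
            value = getattr(layer, f.name)
            if value is not None:
                resolved[f.name] = value
                break
    return SketchConfig(**resolved)


merged = merge(SketchConfig(temperature=0.0), SketchConfig(temperature=0.7, timeout=60))
print(merged)
```

Here the explicit temperature=0.0 wins over the global 0.7, timeout is inherited from the global config, and max_tokens falls through to its effective default.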

The model under test and the judge model can use different providers.

Local and self-hosted models

Any OpenAI-compatible server works as a judge — Ollama, LM Studio, vLLM, llama.cpp, or a self-hosted endpoint:

# Ollama running locally
configure(JudgeConfig(
    provider="openai",
    model="llama3",
    base_url="http://localhost:11434/v1",
))

# LM Studio
configure(JudgeConfig(
    provider="openai",
    model="local-model",
    base_url="http://localhost:1234/v1",
))

# Self-hosted vLLM or any OpenAI-compatible endpoint
configure(JudgeConfig(
    provider="openai",
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    base_url="https://my-inference-server.internal/v1",
))

base_url is also read from the OPENAI_BASE_URL environment variable, so no code changes are needed to switch between cloud and local judges in CI.

Calibrated thresholds

Faithfulness, Hallucination, and Relevance automatically apply the optimal threshold for the configured judge model, derived from benchmarks against human-labeled datasets. You don’t need to tune this manually.

Judge	Hallucination	Faithfulness	Relevance
`claude-haiku-4-5-20251001`	0.55	0.90	0.30
`claude-sonnet-4-6`	0.30	0.90	0.30
`gpt-4o-mini`	0.30	0.90	0.30
Other models	0.70	0.70	0.70

Pass threshold= explicitly to override:

Faithfulness(threshold=0.8)   # use your own threshold, skip calibration

To inspect the full calibration table:

from multivon_eval import threshold_table
print(threshold_table())
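
As a rough illustration of how the table and the threshold= override interact, the lookup can be sketched as below. effective_threshold is a hypothetical helper, and exact-match lookup on the model name is an assumption; the numbers are copied from the table above.

```python
from typing import Optional

# Values copied from the calibration table above.
CALIBRATED = {
    "claude-haiku-4-5-20251001": {"hallucination": 0.55, "faithfulness": 0.90, "relevance": 0.30},
    "claude-sonnet-4-6": {"hallucination": 0.30, "faithfulness": 0.90, "relevance": 0.30},
    "gpt-4o-mini": {"hallucination": 0.30, "faithfulness": 0.90, "relevance": 0.30},
}
OTHER_MODELS = 0.70  # the "Other models" row


def effective_threshold(metric: str, judge_model: str, explicit: Optional[float] = None) -> float:
    """Hypothetical lookup: an explicit threshold skips calibration entirely."""
    if explicit is not None:
        return explicit
    # Assumption: model names are matched exactly; unknown models fall back.
    return CALIBRATED.get(judge_model, {}).get(metric, OTHER_MODELS)


print(effective_threshold("hallucination", "claude-haiku-4-5-20251001"))
print(effective_threshold("faithfulness", "llama3"))
print(effective_threshold("faithfulness", "gpt-4o-mini", explicit=0.8))
```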

UNKNOWN verdicts

New in 0.16.0. QAG scoring asks the judge binary yes/no questions — but judges hedge, and a hedge is not a verdict. The parser now has three outcomes per question:

1. A reply starting with “yes”/“no”, or containing exactly one unambiguous verdict word, parses as before. The calibrated thresholds above were fit under these semantics, and they are unchanged.
2. A reply with no unambiguous verdict is UNKNOWN: it is excluded from the score denominator entirely and disclosed in the result reason, e.g. 1 of 3 question(s) UNKNOWN — excluded from score denominator. An UNKNOWN never counts for or against the model.
3. If every verdict for a case is unparseable, the evaluator raises JudgeUnavailable and the case gets JUDGE_ERROR status, which is excluded from pass_rate and counted in report.errors.

Before 0.16.0, the parser fell back to checking whether “yes” appeared in the first 50 characters, so a judge replying “I cannot say yes or no with certainty” was scored as YES. That yes-bias is gone; replies that were previously mis-scored this way are now UNKNOWN.
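
The three outcomes can be sketched as a small parser plus a scoring function. This is not the library's parser: parse_verdict, qag_score, and AllVerdictsUnknown are hypothetical names. The sketch reads "exactly one verdict word" as exactly one occurrence of "yes" or "no", and treats YES as the passing answer for every question.

```python
import re
from typing import List, Optional, Tuple


class AllVerdictsUnknown(Exception):
    """Stand-in for the library's JudgeUnavailable (hypothetical local class)."""


def parse_verdict(reply: str) -> Optional[bool]:
    """Return True (YES), False (NO), or None (UNKNOWN)."""
    text = reply.strip().lower()
    first_word = re.match(r"[a-z]+", text)
    if first_word and first_word.group(0) in ("yes", "no"):
        return first_word.group(0) == "yes"
    # Assumption: "exactly one verdict word" means one occurrence in the reply.
    verdict_words = re.findall(r"\b(yes|no)\b", text)
    if len(verdict_words) == 1:
        return verdict_words[0] == "yes"
    return None


def qag_score(replies: List[str]) -> Tuple[float, int]:
    """Return (score, unknown_count); UNKNOWNs are left out of the denominator."""
    verdicts = [parse_verdict(r) for r in replies]
    known = [v for v in verdicts if v is not None]
    if not known:
        raise AllVerdictsUnknown("every verdict was unparseable")
    return sum(known) / len(known), len(verdicts) - len(known)


score, unknown = qag_score([
    "Yes.",
    "No, it does not.",
    "I cannot say yes or no with certainty",
])
print(score, unknown)  # 0.5 1
```

The hedged third reply contains both verdict words, so it is UNKNOWN and the score is computed over the two parseable replies only.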

Error budget: `max_error_rate`

New in 0.16.0. pass_rate excludes errored cases by design — a judge outage is not a quality regression. The blind spot: 90 judge errors plus 10 passes is a 100% pass rate, and a fail_threshold gate would wave it through. max_error_rate closes it:

from multivon_eval import EvalSuite, EvalCase, EvalGateFailure
from multivon_eval.evaluators.deterministic import Contains

suite = EvalSuite("checkout-bot")
suite.add_cases([EvalCase(input=f"question {i}") for i in range(10)])
suite.add_evaluators(Contains(["ok"]))

def mostly_down(prompt: str) -> str:
    if prompt != "question 0":
        raise TimeoutError("upstream 504")   # stands in for judge/model outages
    return "ok"

try:
    suite.run(mostly_down, fail_threshold=0.85, max_error_rate=0.10, verbose=False)
except EvalGateFailure as e:
    print(e)
# Eval gate INDETERMINATE: error rate 90.0% exceeds error budget 10.0% —
# 9 of 10 case(s) errored before quality could be scored (model_error=9).
# pass_rate 100.0% covers only the 1 evaluated case(s); fix the errors
# before trusting this gate.

Available on suite.run, run_async, and run_on_cases. With max_error_rate unset, a fail_threshold gate still warns loudly on stderr when the error rate reaches 10%, the same threshold that view --dir flags. report.error_rate exposes the number directly. Its denominator is the total number of cases, not the evaluated ones, which is exactly what pass_rate cannot see, so read the two together.
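
The arithmetic behind the blind spot is easy to check by hand. The sketch below is not library code: gate_metrics is a hypothetical helper that only mirrors the two denominators described above (evaluated cases for pass_rate, total cases for error_rate).

```python
from typing import Tuple


def gate_metrics(passed: int, failed: int, errored: int) -> Tuple[float, float]:
    """Return (pass_rate, error_rate) using the two denominators described above."""
    evaluated = passed + failed          # errored cases are excluded from pass_rate
    total = evaluated + errored          # error_rate is measured against all cases
    # Returning 0.0 for an empty denominator is an arbitrary choice for this sketch.
    pass_rate = passed / evaluated if evaluated else 0.0
    error_rate = errored / total if total else 0.0
    return pass_rate, error_rate


# The scenario from the text: 90 judge errors plus 10 passes.
pass_rate, error_rate = gate_metrics(passed=10, failed=0, errored=90)
print(pass_rate, error_rate)  # 1.0 0.9

fail_threshold, max_error_rate = 0.85, 0.10
print(pass_rate >= fail_threshold)   # True: the quality gate alone would pass
print(error_rate > max_error_rate)   # True: the error budget is exceeded
```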

Faithfulness

Checks that the output is grounded in the provided context — no invented facts. When to use: RAG pipelines, document Q&A, or any task where the model must answer strictly from a provided source.

from multivon_eval import EvalCase, Faithfulness

case = EvalCase(
    input="What does the report say about Q3 revenue?",
    context="Q3 revenue was $4.2M, up 18% YoY...",
)
Faithfulness()
Faithfulness(threshold=0.8)

Requires context on the EvalCase.

Parameter	Type	Default	Description
`threshold`	`float`	calibrated per judge (`0.7` for other models)	Minimum score to pass

Hallucination

Checks that the output doesn’t introduce claims not supported by context. When to use: Complement to Faithfulness — Faithfulness checks what’s present in the answer; Hallucination checks what shouldn’t be there.

from multivon_eval import EvalCase, Hallucination

case = EvalCase(
    input="Summarize the contract terms",
    context="Contract: 12-month term, $500/month, 30-day cancellation notice.",
)
Hallucination()
Hallucination(threshold=0.8)

Requires context.

Parameter	Type	Default	Description
`threshold`	`float`	calibrated per judge (`0.7` for other models)	Minimum score to pass

Relevance

Checks that the output actually addresses the input question. When to use: Catch evasive, off-topic, or boilerplate responses that technically answer but don’t address what was asked.

from multivon_eval import EvalCase, Relevance

case = EvalCase(input="How do I reset my password?")
Relevance()
Relevance(threshold=0.7)

No context required.

Parameter	Type	Default	Description
`threshold`	`float`	calibrated per judge (`0.7` for other models)	Minimum score to pass

Coherence

Checks that the output is clear, well-structured, and logically sound. When to use: Long-form generation (articles, reports, explanations) where logical flow and clarity matter.

from multivon_eval import EvalCase, Coherence

case = EvalCase(input="Explain how photosynthesis works")
Coherence()
Coherence(threshold=0.8)

Parameter	Type	Default	Description
`threshold`	`float`	`0.7`	Minimum score to pass

Toxicity

Checks that the output is safe, non-harmful, and appropriate. When to use: User-facing products, content moderation pipelines, or any deployment where harmful output is unacceptable. Score 1.0 = not toxic; 0.0 = toxic.

from multivon_eval import EvalCase, Toxicity

case = EvalCase(input="Write a response to this angry customer message")
Toxicity()
Toxicity(threshold=0.9)

Parameter	Type	Default	Description
`threshold`	`float`	`0.9`	Minimum score to pass (higher default reflects zero-tolerance)

Bias

Checks that the output is free of demographic, political, or cultural bias. When to use: HR tools, news summarizers, recommendation systems, or any application where systematic favoritism is a risk. Score 1.0 = no bias detected; 0.0 = significant bias.

from multivon_eval import EvalCase, Bias

case = EvalCase(input="Describe the ideal job candidate for this role")
Bias()
Bias(threshold=0.8)

Parameter	Type	Default	Description
`threshold`	`float`	`0.8`	Minimum score to pass

Summarization

Checks that a summary captures the key points of the source faithfully, without adding or omitting critical information. When to use: Summarization pipelines — news, legal documents, meeting transcripts.

from multivon_eval import EvalCase, Summarization

case = EvalCase(
    input="Summarize this article",
    context="[Full source article text here...]",
)
Summarization()
Summarization(threshold=0.8)

Requires context (the source document).

Parameter	Type	Default	Description
`threshold`	`float`	`0.7`	Minimum score to pass

AnswerAccuracy

Checks factual correctness of the output against expected_output. Uses judge comparison rather than string matching, so paraphrasing is handled correctly. When to use: Knowledge QA, fact retrieval, or any task with a known correct answer where the phrasing may vary.

from multivon_eval import EvalCase, AnswerAccuracy

case = EvalCase(
    input="What is the capital of France?",
    expected_output="Paris",
)
AnswerAccuracy()
AnswerAccuracy(threshold=0.8)

Parameter	Type	Default	Description
`threshold`	`float`	`0.7`	Minimum score to pass

ContextPrecision

For RAG systems: checks that retrieved context chunks are actually relevant to the question. High precision = low noise in retrieval. When to use: Evaluating the retrieval stage of a RAG pipeline independently from generation.

from multivon_eval import EvalCase, ContextPrecision

case = EvalCase(
    input="What is our refund policy?",
    context=["Refund policy: 30 days...", "Shipping rates: ...", "Contact us at..."],
)
ContextPrecision()
ContextPrecision(threshold=0.8)

Accepts context as either a string or a list of strings (chunks). Evaluates up to 8 chunks.

Parameter	Type	Default	Description
`threshold`	`float`	`0.7`	Minimum score to pass

ContextRecall

For RAG systems: checks that the retrieved context contains everything needed to derive the expected answer. When to use: Diagnosing retrieval gaps — cases where the model gave a wrong answer because the right chunk wasn’t retrieved.

from multivon_eval import EvalCase, ContextRecall

case = EvalCase(
    input="What is the cancellation fee?",
    context="Cancellation within 30 days: $50 fee applies.",
    expected_output="$50",
)
ContextRecall()
ContextRecall(threshold=0.8)

Requires both context and expected_output.

Parameter	Type	Default	Description
`threshold`	`float`	`0.7`	Minimum score to pass

CustomRubric

Define your own yes/no criteria. Each criterion is a (question, expected_answer) tuple. Score = fraction of criteria where the judge’s answer matches expected_answer. When to use: Domain-specific quality checks that don’t map to the built-in evaluators — support tone, legal disclaimers, brand voice.

from multivon_eval import EvalCase, CustomRubric

case = EvalCase(input="Handle this support ticket: 'My order hasn't arrived'")
CustomRubric(
    name="support_quality",
    criteria=[
        ("Does the response acknowledge the customer's problem?", True),
        ("Does the response provide a concrete next step?", True),
        ("Does the response use apologetic or defensive language?", False),
        ("Is the response under 150 words?", True),
    ],
    threshold=0.75,
)

Parameter	Type	Default	Description
`criteria`	`list[tuple[str, bool]]`	required	List of `(question, expect_yes)` pairs
`name`	`str`	`"custom_rubric"`	Display name for this evaluator in reports
`threshold`	`float`	`0.7`	Minimum fraction of criteria to pass
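
As a rough sketch of the scoring rule (fraction of criteria where the judge's answer matches the expected answer), assuming every judge reply parses to a clean yes or no: rubric_score and the sample judge answers are hypothetical, and the criteria are the ones from the example above.

```python
from typing import List, Tuple


def rubric_score(criteria: List[Tuple[str, bool]], judge_answers: List[bool]) -> float:
    """Fraction of criteria whose judge answer equals the expected answer."""
    if len(criteria) != len(judge_answers):
        raise ValueError("need exactly one judge answer per criterion")
    matches = sum(
        expected == answer
        for (_, expected), answer in zip(criteria, judge_answers)
    )
    return matches / len(criteria)


criteria = [
    ("Does the response acknowledge the customer's problem?", True),
    ("Does the response provide a concrete next step?", True),
    ("Does the response use apologetic or defensive language?", False),
    ("Is the response under 150 words?", True),
]

# Hypothetical judge answers: "yes" to all four questions.
score = rubric_score(criteria, [True, True, True, True])
print(score)  # 0.75: the third criterion expected "no", so 3 of 4 match
```

With threshold=0.75 this case sits exactly on the boundary, so one more mismatch would fail it.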

GEval

Holistic numeric scoring for qualities that don’t decompose well into yes/no questions (creativity, tone, polish). The judge returns a 0.0–1.0 score directly with reasoning. When to use: Subjective qualities like writing style, creativity, or polish where binary questions don’t capture the nuance. Use sparingly — less auditable than QAG evaluators.

from multivon_eval import EvalCase, GEval

case = EvalCase(input="Write a product description for wireless headphones")
GEval(
    name="writing_quality",
    criteria="The response is engaging, concise, and professionally written.",
    threshold=0.7,
)

Parameter	Type	Default	Description
`criteria`	`str`	required	Free-text description of what to evaluate
`name`	`str`	`"g_eval"`	Display name for this evaluator in reports
`runs`	`int`	`2`	Number of judge runs to average — reduces position and framing bias
`judge`	`JudgeConfig`	`None`	Override the judge model for this evaluator
`threshold`	`float`	`0.7`	Minimum score to pass

GEval is the only evaluator that uses a numeric score directly from the judge rather than QAG aggregation.

CheckEvaluator

The fastest way to add a quality check. You write a plain-English criterion; CheckEvaluator auto-generates specific yes/no questions from it and scores with QAG. No need to pick an evaluator class or write questions manually.

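from multivon_eval import EvalSuite, EvalCase

def my_model(prompt: str) -> str:
    # Placeholder for the model under test; replace with your own pipeline call
    return "You can return any item within 30 days of delivery."

suite = EvalSuite("return policy eval")
suite.add_check("Response should mention the return policy")
suite.add_check("Tone should be professional and not defensive", threshold=0.8)
suite.add_cases([EvalCase(input="What is your return policy?")])
report = suite.run(my_model)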

Questions are generated once at the start of suite.run() (eager warmup), so no case pays the generation cost and failures surface before the eval loop starts.

Escape hatch: pin questions for CI

Generated questions vary per run and per model. For reproducible CI runs, pin them explicitly:

suite.add_check(
    "Response should mention the return policy",
    questions=[
        "Does the response mention a return window?",
        "Does the response name the return policy by name?",
        "Does the response provide a link or next step?",
    ],
)

When questions= is set, no LLM call is made during prepare().

Inspect generated questions

import logging
logging.basicConfig(level=logging.INFO)  # logs generated questions (to stderr by default)

ev = suite._evaluators[0]
suite.run(my_model)
print(ev.resolved_questions)  # ['Does the response...', ...]

Discrete scores for N questions

With the default num_questions=3, the only possible scores are 0.0, 0.33, 0.67, and 1.0. The default threshold of 0.7 therefore requires 3/3 questions to pass. Lower the threshold or use num_questions=5 if you want more granularity.

`num_questions`	Possible scores	Threshold 0.7 requires
3	0.0, 0.33, 0.67, 1.0	3/3
5	0.0, 0.2, 0.4, 0.6, 0.8, 1.0	4/5
10	0.0, 0.1, …, 1.0	7/10
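
The rows above can be reproduced with a few lines of arithmetic. This sketch assumes a case passes when score >= threshold, consistent with "minimum score to pass"; possible_scores and min_passes are hypothetical helpers, not part of the library.

```python
from typing import List


def possible_scores(num_questions: int) -> List[float]:
    """All scores reachable with num_questions binary questions, rounded to 2 places."""
    return [round(k / num_questions, 2) for k in range(num_questions + 1)]


def min_passes(num_questions: int, threshold: float) -> int:
    """Smallest number of passing questions whose score meets the threshold."""
    for k in range(num_questions + 1):
        if k / num_questions >= threshold:
            return k
    raise ValueError("threshold above 1.0 can never be met")


for n in (3, 5, 10):
    print(n, possible_scores(n), f"{min_passes(n, 0.7)}/{n}")
```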

Fallback behavior

If question generation fails after two attempts, CheckEvaluator emits a warning via warnings.warn and falls back to using the criterion itself as a single yes/no question. The EvalResult reason will include a [⚠ question generation failed — using fallback] tag. To detect the fallback programmatically, check ev._used_fallback.
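
The retry-then-fallback pattern described above might look roughly like this. It is a sketch under assumptions, not CheckEvaluator's code: resolve_questions and the generate callable are hypothetical, and catching every Exception is a simplification.

```python
import warnings
from typing import Callable, List, Tuple


def resolve_questions(
    criterion: str,
    generate: Callable[[str], List[str]],
    attempts: int = 2,
) -> Tuple[List[str], bool]:
    """Return (questions, used_fallback).

    Tries the generator up to `attempts` times; if every attempt raises or
    returns nothing, warns and uses the criterion itself as the only question.
    """
    for _ in range(attempts):
        try:
            questions = generate(criterion)
        except Exception:  # simplification: any generator error counts as a failed attempt
            continue
        if questions:
            return questions, False
    warnings.warn("question generation failed; using criterion as fallback question")
    return [criterion], True


def always_fails(criterion: str) -> List[str]:
    """Hypothetical generator that simulates a judge outage."""
    raise TimeoutError("judge unavailable")


questions, used_fallback = resolve_questions(
    "Response should mention the return policy", always_fails
)
print(questions, used_fallback)
```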

Parameter	Type	Default	Description
`criterion`	`str`	required	Plain-English quality criterion (max 300 chars)
`threshold`	`float`	`0.7`	Fraction of questions that must pass
`num_questions`	`int`	`3`	Number of yes/no questions to generate (clamped 1–10)
`questions`	`list[str]`	`None`	Pin specific questions — skips generation entirely
`name`	`str`	derived	Evaluator name in reports (auto-derived from criterion if omitted)
`judge`	`JudgeConfig`	`None`	Override the judge model for this evaluator
