Deterministic Evaluators

Deterministic evaluators run in milliseconds with no API calls. Use them as a first pass — if an output fails a string check, there’s no need to send it to a judge.

NotEmpty

Passes if the output is non-empty after stripping whitespace. When to use: As the first guard in any eval suite — there’s no point running other evaluators on an empty string.

from multivon_eval import EvalCase, NotEmpty

case = EvalCase(input="Describe the product")
NotEmpty()

Parameter	Type	Default	Description
`threshold`	`float`	`1.0`	Minimum score to pass (inherited from base)

ExactMatch

Passes if the output exactly matches expected_output (stripped). Case-insensitive by default. When to use: Classification outputs, yes/no questions, or any task where the valid answer is one of a small fixed set.

from multivon_eval import EvalCase, ExactMatch

case = EvalCase(
    input="Is this review positive or negative?",
    expected_output="positive",
)
ExactMatch()                        # case-insensitive
ExactMatch(case_sensitive=True)     # strict

Requires expected_output on the EvalCase.

Parameter	Type	Default	Description
`case_sensitive`	`bool`	`False`	If `True`, match is case-sensitive
`threshold`	`float`	`1.0`	Minimum score to pass

Contains

Passes if the output contains all required substrings. Score is the fraction found. When to use: When the output must include certain keywords, section headers, or phrases — but you don’t care about exact wording.

from multivon_eval import EvalCase, Contains

case = EvalCase(input="Explain our refund policy")
Contains(["refund policy", "contact us"])               # all required
Contains(["Paris"], threshold=1.0)                      # must find all
Contains(["red", "blue", "green"], threshold=0.67)      # 2/3 is enough

Parameter	Type	Default	Description
`substrings`	`list[str]`	required	List of strings the output must contain
`case_sensitive`	`bool`	`False`	If `True`, matching is case-sensitive
`threshold`	`float`	`1.0`	Minimum fraction of substrings found to pass

RegexMatch

Passes if the output matches a regex pattern anywhere in the text. When to use: Structured format checks — phone numbers, dates, citation patterns, JSON keys, code blocks.

from multivon_eval import EvalCase, RegexMatch

case = EvalCase(input="What year was Python created?")
RegexMatch(r"\d{4}")          # matches any 4-digit number
RegexMatch(r"^(yes|no)$")     # exact yes/no

Parameter	Type	Default	Description
`pattern`	`str`	required	Regex pattern string
`flags`	`int`	`re.IGNORECASE`	Python `re` flags (e.g. `re.IGNORECASE`, `re.DOTALL`)
`threshold`	`float`	`1.0`	Minimum score to pass

JSONSchemaEval

Passes if the output is valid JSON that conforms to a JSON Schema. When to use: Structured output tasks where the model must return well-typed JSON (e.g. API response generation, extraction pipelines).

from multivon_eval import EvalCase, JSONSchemaEval

case = EvalCase(input="Classify this review")
JSONSchemaEval({
    "type": "object",
    "properties": {
        "sentiment": {"type": "string", "enum": ["positive", "negative", "neutral"]},
        "score": {"type": "number", "minimum": 0, "maximum": 1},
    },
    "required": ["sentiment", "score"],
})

Parameter	Type	Default	Description
`schema`	`dict`	required	A valid JSON Schema dict
`threshold`	`float`	`1.0`	Minimum score to pass

WordCount

Passes if the word count is within [min_words, max_words]. When to use: Enforcing response length — summaries that must be concise, reports that must have a minimum length.

from multivon_eval import EvalCase, WordCount

case = EvalCase(input="Summarize this article in 50-200 words")
WordCount(min_words=50, max_words=200)   # must be 50-200 words
WordCount(max_words=100)                  # at most 100 words
WordCount(min_words=10)                   # at least 10 words

Parameter	Type	Default	Description
`min_words`	`int`	`0`	Minimum number of words required
`max_words`	`int`	`10000`	Maximum number of words allowed
`threshold`	`float`	`1.0`	Minimum score to pass

Latency

Passes if response latency is under max_ms milliseconds. Requires latency_ms to be passed to evaluate() — the suite handles this automatically. When to use: SLA enforcement in production — catching regressions where a model or pipeline exceeds your latency budget.

from multivon_eval import EvalCase, Latency

case = EvalCase(input="What is 2+2?")
Latency(max_ms=2000)    # must respond within 2 seconds

Score degrades linearly above the limit rather than hard-failing.

Parameter	Type	Default	Description
`max_ms`	`float`	required	Maximum allowed latency in milliseconds
`threshold`	`float`	`1.0`	Minimum score to pass

MaxLatency

Alias for Latency. Passes if response latency is under max_ms milliseconds. When to use: Use whichever name reads better in your suite. MaxLatency emphasizes the upper bound; Latency reads more naturally inline.

from multivon_eval import EvalCase, MaxLatency

case = EvalCase(input="What is 2+2?")
MaxLatency(max_ms=2000)

Score degrades linearly above the limit rather than hard-failing.

Parameter	Type	Default	Description
`max_ms`	`float`	required	Maximum allowed latency in milliseconds
`threshold`	`float`	`1.0`	Minimum score to pass

BLEU

BLEU-n score between output and expected_output. Pure Python, no dependencies. When to use: Translation evaluation, constrained generation where phrasing matters, or any task with a canonical reference output.

from multivon_eval import EvalCase, BLEU

case = EvalCase(
    input="Translate to French: 'Good morning'",
    expected_output="Bonjour",
)
BLEU()          # BLEU-4 (default), threshold 0.5
BLEU(n=2)       # bigram BLEU
BLEU(n=1, threshold=0.7)

Score of 1.0 = perfect match. Includes brevity penalty.

Parameter	Type	Default	Description
`n`	`int`	`4`	Maximum n-gram order (BLEU-1 through BLEU-4)
`threshold`	`float`	`0.5`	Minimum score to pass

ROUGE

ROUGE-L F1 score (longest common subsequence) between output and expected_output. When to use: Summarization tasks where recall of key content matters more than exact wording.

from multivon_eval import EvalCase, ROUGE

case = EvalCase(
    input="Summarize this article",
    expected_output="The article discusses climate change impacts on coastal cities.",
)
ROUGE()                  # threshold 0.5
ROUGE(threshold=0.7)

Score of 1.0 = perfect recall and precision on LCS.

Parameter	Type	Default	Description
`threshold`	`float`	`0.5`	Minimum score to pass

StartsWith

Passes if output starts with the given prefix. Case-insensitive by default. When to use: Enforcing response format conventions — code blocks that must open with a fence, structured outputs that must begin with a specific token.

from multivon_eval import EvalCase, StartsWith

case = EvalCase(input="Generate a JSON response")
StartsWith("Sure")
StartsWith("```json", case_sensitive=True)

Parameter	Type	Default	Description
`prefix`	`str`	required	Expected prefix string
`case_sensitive`	`bool`	`False`	If `True`, match is case-sensitive
`threshold`	`float`	`1.0`	Minimum score to pass

Combining evaluators

Evaluators are independent — all run and each contributes its own score and pass/fail.

suite.add_evaluators(
    NotEmpty(),
    WordCount(min_words=20, max_words=300),
    RegexMatch(r"\d{4}"),      # must mention a year
    Contains(["Paris"]),
)

Getting Started

Evaluators

Claude Code Skills

Guides

Reference

Compliance

Deterministic Evaluators

NotEmpty

ExactMatch

Contains

RegexMatch

JSONSchemaEval

WordCount

Latency

MaxLatency

BLEU

ROUGE

StartsWith

Combining evaluators

​NotEmpty

​ExactMatch

​Contains

​RegexMatch

​JSONSchemaEval

​WordCount

​Latency

​MaxLatency

​BLEU

​ROUGE

​StartsWith

​Combining evaluators

NotEmpty

ExactMatch

Contains

RegexMatch

JSONSchemaEval

WordCount

Latency

MaxLatency

BLEU

ROUGE

StartsWith

Combining evaluators