Deterministic evaluators run in milliseconds with no API calls. Use them as a first pass — if an output fails a string check, there’s no need to send it to a judge.
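
The first-pass idea can be sketched in plain Python, independent of how the suite itself schedules evaluators. This is only an illustration: `run_with_gate`, `not_empty`, and `fake_judge` are hypothetical names, and the judge here is a stand-in function rather than a real model call.

```python
from typing import Callable, Optional

def run_with_gate(
    output: str,
    cheap_checks: list[Callable[[str], bool]],
    judge: Callable[[str], float],
) -> Optional[float]:
    """Return the judge score, or None if a cheap check failed first."""
    for check in cheap_checks:
        if not check(output):
            return None  # short-circuit: the judge is never called
    return judge(output)

calls = []  # records every text the stand-in judge receives

def fake_judge(text: str) -> float:
    # Stand-in for an expensive LLM judge call (hypothetical).
    calls.append(text)
    return 0.9

def not_empty(text: str) -> bool:
    # Same idea as a non-empty check: whitespace-only output fails.
    return bool(text.strip())

empty_result = run_with_gate("   ", [not_empty], fake_judge)
ok_result = run_with_gate("Paris is the capital.", [not_empty], fake_judge)
```

In this sketch the whitespace-only output never reaches the judge, so only one judge call is recorded.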

NotEmpty

Passes if the output is non-empty after stripping whitespace. When to use: As the first guard in any eval suite — there’s no point running other evaluators on an empty string.
from multivon_eval import EvalCase, NotEmpty

case = EvalCase(input="Describe the product")
NotEmpty()
| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| threshold | float | 1.0 | Minimum score to pass (inherited from base) |

ExactMatch

Passes if the output exactly matches expected_output (stripped). Case-insensitive by default. When to use: Classification outputs, yes/no questions, or any task where the valid answer is one of a small fixed set.
from multivon_eval import EvalCase, ExactMatch

case = EvalCase(
    input="Is this review positive or negative?",
    expected_output="positive",
)
ExactMatch()                        # case-insensitive
ExactMatch(case_sensitive=True)     # strict
Requires expected_output on the EvalCase.
| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| case_sensitive | bool | False | If True, match is case-sensitive |
| threshold | float | 1.0 | Minimum score to pass |
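
The strip-then-compare rule described above can be sketched in a few lines. This is an illustration of the documented behavior, not the library's implementation; `exact_match_score` is a hypothetical helper name.

```python
def exact_match_score(output: str, expected: str, case_sensitive: bool = False) -> float:
    """Return 1.0 if the stripped strings are equal, else 0.0."""
    a, b = output.strip(), expected.strip()
    if not case_sensitive:
        # Case-insensitive comparison is the documented default.
        a, b = a.lower(), b.lower()
    return 1.0 if a == b else 0.0

loose = exact_match_score("  Positive\n", "positive")
strict = exact_match_score("Positive", "positive", case_sensitive=True)
```

Surrounding whitespace and casing do not affect the default comparison, while the strict variant treats `Positive` and `positive` as different answers.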

Contains

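Checks the output for each required substring. The score is the fraction of substrings found, and the evaluator passes when that fraction reaches threshold (1.0 by default, so every substring must be present). When to use: When the output must include certain keywords, section headers, or phrases, but you don’t care about exact wording.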
from multivon_eval import EvalCase, Contains

case = EvalCase(input="Explain our refund policy")
Contains(["refund policy", "contact us"])               # all required
Contains(["Paris"], threshold=1.0)                      # must find all
Contains(["red", "blue", "green"], threshold=0.66)      # 2/3 ≈ 0.667 is enough
| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| substrings | list[str] | required | List of strings the output must contain |
| case_sensitive | bool | False | If True, matching is case-sensitive |
| threshold | float | 1.0 | Minimum fraction of substrings found to pass |
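
The fraction-based scoring can be sketched as follows. This is a plain-Python illustration under the assumption that the score is the raw fraction compared with `>=` against the threshold; `contains_score` is a hypothetical helper, not the library's code.

```python
def contains_score(output: str, substrings: list[str], case_sensitive: bool = False) -> float:
    """Return the fraction of required substrings present in the output."""
    if not substrings:
        return 1.0  # assumption: nothing required means nothing can be missing
    text = output if case_sensitive else output.lower()
    found = 0
    for s in substrings:
        needle = s if case_sensitive else s.lower()
        if needle in text:
            found += 1
    return found / len(substrings)

score = contains_score("Red and Blue are primary colors.", ["red", "blue", "green"])
passes_at_066 = score >= 0.66
passes_at_067 = score >= 0.67
```

Two of three matches give a score of about 0.667, so under this comparison a threshold of 0.66 passes while 0.67 does not. Choosing thresholds slightly below the target fraction avoids that rounding trap.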

RegexMatch

Passes if the output matches a regex pattern anywhere in the text. When to use: Structured format checks — phone numbers, dates, citation patterns, JSON keys, code blocks.
from multivon_eval import EvalCase, RegexMatch

case = EvalCase(input="What year was Python created?")
RegexMatch(r"\d{4}")          # matches any 4-digit number
RegexMatch(r"^(yes|no)$")     # exact yes/no
| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| pattern | str | required | Regex pattern string |
| flags | int | re.IGNORECASE | Python re flags (e.g. re.IGNORECASE, re.DOTALL) |
| threshold | float | 1.0 | Minimum score to pass |
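
Because the pattern may match anywhere in the text, anchors and word boundaries decide how strict a check is. The sketch below uses Python's standard `re.search` to show the difference; `regex_score` is a hypothetical helper that assumes search semantics with `re.IGNORECASE` as the default flag, as the table above describes.

```python
import re

def regex_score(output: str, pattern: str, flags: int = re.IGNORECASE) -> float:
    """Return 1.0 if the pattern matches anywhere in the output, else 0.0."""
    return 1.0 if re.search(pattern, output, flags) else 0.0

year_found = regex_score("Python was created in 1991.", r"\d{4}")
# Unanchored \d{4} also matches inside a longer run of digits.
loose_hit = regex_score("Ticket id 123456", r"\d{4}")
# Word boundaries restrict the match to standalone 4-digit numbers.
bounded_miss = regex_score("Ticket id 123456", r"\b\d{4}\b")
# ^ and $ force the whole output to be yes or no.
exact_yes = regex_score("Yes", r"^(yes|no)$")
chatty_yes = regex_score("Yes, it is", r"^(yes|no)$")
```

If a check must reject extra text around the match, anchor the pattern; if it only needs to confirm that something appears, leave it unanchored.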

JSONSchemaEval

Passes if the output is valid JSON that conforms to a JSON Schema. When to use: Structured output tasks where the model must return well-typed JSON (e.g. API response generation, extraction pipelines).
from multivon_eval import EvalCase, JSONSchemaEval

case = EvalCase(input="Classify this review")
JSONSchemaEval({
    "type": "object",
    "properties": {
        "sentiment": {"type": "string", "enum": ["positive", "negative", "neutral"]},
        "score": {"type": "number", "minimum": 0, "maximum": 1},
    },
    "required": ["sentiment", "score"],
})
| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| schema | dict | required | A valid JSON Schema dict |
| threshold | float | 1.0 | Minimum score to pass |

WordCount

Passes if the word count is within [min_words, max_words]. When to use: Enforcing response length — summaries that must be concise, reports that must have a minimum length.
from multivon_eval import EvalCase, WordCount

case = EvalCase(input="Summarize this article in 50-200 words")
WordCount(min_words=50, max_words=200)   # must be 50-200 words
WordCount(max_words=100)                  # at most 100 words
WordCount(min_words=10)                   # at least 10 words
| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| min_words | int | 0 | Minimum number of words required |
| max_words | int | 10000 | Maximum number of words allowed |
| threshold | float | 1.0 | Minimum score to pass |
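
How words are counted affects borderline cases. The sketch below assumes whitespace splitting, which is a common choice but is not confirmed by this page; `word_count_score` is a hypothetical helper.

```python
def word_count_score(output: str, min_words: int = 0, max_words: int = 10000) -> float:
    """Return 1.0 if the whitespace-delimited word count is within bounds."""
    n = len(output.split())  # assumption: words are whitespace-separated tokens
    return 1.0 if min_words <= n <= max_words else 0.0

in_range = word_count_score("one two  three\nfour", min_words=3, max_words=5)
too_short = word_count_score("one two", min_words=3)
# Under whitespace splitting, a hyphenated compound counts as a single word.
hyphenated_count = len("state-of-the-art model".split())
```

If a limit is tight, leave a small margin for tokenization differences such as hyphenated compounds or punctuation attached to words.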

Latency

Passes if response latency is under max_ms milliseconds. Requires latency_ms to be passed to evaluate() — the suite handles this automatically. When to use: SLA enforcement in production — catching regressions where a model or pipeline exceeds your latency budget.
from multivon_eval import EvalCase, Latency

case = EvalCase(input="What is 2+2?")
Latency(max_ms=2000)    # must respond within 2 seconds
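Score degrades linearly above the limit rather than dropping straight to 0. With the default threshold of 1.0, any response over the limit still fails; lower threshold to tolerate small overruns.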
| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| max_ms | float | required | Maximum allowed latency in milliseconds |
| threshold | float | 1.0 | Minimum score to pass |
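
One way linear degradation could work is sketched below. The page only says the score falls linearly above the limit, so the exact slope is an assumption here: this version reaches 0.0 at twice `max_ms`. `latency_score` is a hypothetical helper, not the library's formula.

```python
def latency_score(latency_ms: float, max_ms: float) -> float:
    """Return 1.0 within the budget, then fall linearly to 0.0 at 2 * max_ms."""
    if latency_ms <= max_ms:
        return 1.0
    overrun = (latency_ms - max_ms) / max_ms  # assumption: slope of 1 / max_ms
    return max(0.0, 1.0 - overrun)

within_budget = latency_score(1500, 2000)
half_over = latency_score(3000, 2000)
far_over = latency_score(5000, 2000)
```

Under this assumed formula, a 3-second response against a 2-second budget scores 0.5, so a threshold of 0.5 would tolerate it while the default of 1.0 would not.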

MaxLatency

Alias for Latency. Passes if response latency is under max_ms milliseconds. When to use: Use whichever name reads better in your suite. MaxLatency emphasizes the upper bound; Latency reads more naturally inline.
from multivon_eval import EvalCase, MaxLatency

case = EvalCase(input="What is 2+2?")
MaxLatency(max_ms=2000)
Score degrades linearly above the limit rather than hard-failing.
| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| max_ms | float | required | Maximum allowed latency in milliseconds |
| threshold | float | 1.0 | Minimum score to pass |

BLEU

BLEU-n score between output and expected_output. Pure Python, no dependencies. When to use: Translation evaluation, constrained generation where phrasing matters, or any task with a canonical reference output.
from multivon_eval import EvalCase, BLEU

case = EvalCase(
    input="Translate to French: 'Good morning'",
    expected_output="Bonjour",
)
BLEU()          # BLEU-4 (default), threshold 0.5
BLEU(n=2)       # bigram BLEU
BLEU(n=1, threshold=0.7)
Score of 1.0 = perfect match. Includes brevity penalty.
| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| n | int | 4 | Maximum n-gram order (BLEU-1 through BLEU-4) |
| threshold | float | 0.5 | Minimum score to pass |
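
For intuition about what the score measures, here is a minimal pure-Python sketch of sentence-level BLEU with clipped n-gram precision and a brevity penalty. It assumes lowercase whitespace tokenization and applies no smoothing; the library may tokenize or smooth differently, and `bleu` and `ngrams` are hypothetical names.

```python
import math
from collections import Counter

def ngrams(tokens: list[str], n: int) -> Counter:
    """Count all n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(output: str, reference: str, n: int = 4) -> float:
    hyp, ref = output.lower().split(), reference.lower().split()
    if not hyp or not ref:
        return 0.0
    log_precisions = []
    for k in range(1, n + 1):
        hyp_counts, ref_counts = ngrams(hyp, k), ngrams(ref, k)
        total = sum(hyp_counts.values())
        # Clip each n-gram count by how often it appears in the reference.
        matched = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
        if total == 0 or matched == 0:
            return 0.0  # unsmoothed: any empty order zeroes the score
        log_precisions.append(math.log(matched / total))
    # Brevity penalty: penalize outputs shorter than the reference.
    bp = 1.0 if len(hyp) >= len(ref) else math.exp(1 - len(ref) / len(hyp))
    return bp * math.exp(sum(log_precisions) / n)

perfect = bleu("the cat sat on the mat", "the cat sat on the mat")
one_word_bleu4 = bleu("Bonjour", "Bonjour")
one_word_bleu1 = bleu("Bonjour", "Bonjour", n=1)
short_output = bleu("the cat", "the cat sat", n=1)
```

In this unsmoothed sketch, a one-word reference has no 2-grams or longer, so BLEU-4 is 0.0 even for an identical output. For very short references, a lower `n` is the safer choice unless the implementation is known to smooth.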

ROUGE

ROUGE-L F1 score (longest common subsequence) between output and expected_output. When to use: Summarization tasks where recall of key content matters more than exact wording.
from multivon_eval import EvalCase, ROUGE

case = EvalCase(
    input="Summarize this article",
    expected_output="The article discusses climate change impacts on coastal cities.",
)
ROUGE()                  # threshold 0.5
ROUGE(threshold=0.7)
Score of 1.0 = perfect recall and precision on LCS.
| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| threshold | float | 0.5 | Minimum score to pass |
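
The ROUGE-L F1 computation can be sketched with a standard longest-common-subsequence table. This version assumes lowercase whitespace tokenization and equal weighting of precision and recall; it is an illustration, not the library's code, and `lcs_length` and `rouge_l_f1` are hypothetical names.

```python
def lcs_length(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence, using one rolling row."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def rouge_l_f1(output: str, reference: str) -> float:
    hyp, ref = output.lower().split(), reference.lower().split()
    if not hyp or not ref:
        return 0.0
    lcs = lcs_length(hyp, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(hyp)  # share of the output that is in the LCS
    recall = lcs / len(ref)     # share of the reference that is in the LCS
    return 2 * precision * recall / (precision + recall)

identical = rouge_l_f1("the cat sat", "the cat sat")
one_word_changed = rouge_l_f1("the cat sat on the mat", "the cat is on the mat")
disjoint = rouge_l_f1("hello", "world")
```

Swapping one word out of six leaves an LCS of five tokens, giving precision and recall of 5/6 and therefore an F1 of about 0.83. Word order matters, since the LCS must preserve sequence.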

StartsWith

Passes if output starts with the given prefix. Case-insensitive by default. When to use: Enforcing response format conventions — code blocks that must open with a fence, structured outputs that must begin with a specific token.
from multivon_eval import EvalCase, StartsWith

case = EvalCase(input="Generate a JSON response")
StartsWith("Sure")
StartsWith("```json", case_sensitive=True)
| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| prefix | str | required | Expected prefix string |
| case_sensitive | bool | False | If True, match is case-sensitive |
| threshold | float | 1.0 | Minimum score to pass |

Combining evaluators

Evaluators are independent — all run and each contributes its own score and pass/fail.
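from multivon_eval import NotEmpty, WordCount, RegexMatch, Contains

# `suite` is an eval suite created elsewhere; its construction is not shown here.
suite.add_evaluators(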
    NotEmpty(),
    WordCount(min_words=20, max_words=300),
    RegexMatch(r"\d{4}"),      # must mention a year
    Contains(["Paris"]),
)