NotEmpty
Passes if the output is non-empty after stripping whitespace. When to use: As the first guard in any eval suite — there’s no point running other evaluators on an empty string.| Parameter | Type | Default | Description |
|---|---|---|---|
threshold | float | 1.0 | Minimum score to pass (inherited from base) |
ExactMatch
Passes if the output exactly matchesexpected_output (stripped). Case-insensitive by default.
When to use: Classification outputs, yes/no questions, or any task where the valid answer is one of a small fixed set.
expected_output on the EvalCase.
| Parameter | Type | Default | Description |
|---|---|---|---|
case_sensitive | bool | False | If True, match is case-sensitive |
threshold | float | 1.0 | Minimum score to pass |
Contains
Passes if the output contains all required substrings. Score is the fraction found. When to use: When the output must include certain keywords, section headers, or phrases — but you don’t care about exact wording.| Parameter | Type | Default | Description |
|---|---|---|---|
substrings | list[str] | required | List of strings the output must contain |
case_sensitive | bool | False | If True, matching is case-sensitive |
threshold | float | 1.0 | Minimum fraction of substrings found to pass |
RegexMatch
Passes if the output matches a regex pattern anywhere in the text. When to use: Structured format checks — phone numbers, dates, citation patterns, JSON keys, code blocks.| Parameter | Type | Default | Description |
|---|---|---|---|
pattern | str | required | Regex pattern string |
flags | int | re.IGNORECASE | Python re flags (e.g. re.IGNORECASE, re.DOTALL) |
threshold | float | 1.0 | Minimum score to pass |
JSONSchemaEval
Passes if the output is valid JSON that conforms to a JSON Schema. When to use: Structured output tasks where the model must return well-typed JSON (e.g. API response generation, extraction pipelines).| Parameter | Type | Default | Description |
|---|---|---|---|
schema | dict | required | A valid JSON Schema dict |
threshold | float | 1.0 | Minimum score to pass |
WordCount
Passes if the word count is within[min_words, max_words].
When to use: Enforcing response length — summaries that must be concise, reports that must have a minimum length.
| Parameter | Type | Default | Description |
|---|---|---|---|
min_words | int | 0 | Minimum number of words required |
max_words | int | 10000 | Maximum number of words allowed |
threshold | float | 1.0 | Minimum score to pass |
Latency
Passes if response latency is undermax_ms milliseconds. Requires latency_ms to be passed to evaluate() — the suite handles this automatically.
When to use: SLA enforcement in production — catching regressions where a model or pipeline exceeds your latency budget.
| Parameter | Type | Default | Description |
|---|---|---|---|
max_ms | float | required | Maximum allowed latency in milliseconds |
threshold | float | 1.0 | Minimum score to pass |
MaxLatency
Alias forLatency. Passes if response latency is under max_ms milliseconds.
When to use: Use whichever name reads better in your suite. MaxLatency emphasizes the upper bound; Latency reads more naturally inline.
| Parameter | Type | Default | Description |
|---|---|---|---|
max_ms | float | required | Maximum allowed latency in milliseconds |
threshold | float | 1.0 | Minimum score to pass |
BLEU
BLEU-n score between output andexpected_output. Pure Python, no dependencies.
When to use: Translation evaluation, constrained generation where phrasing matters, or any task with a canonical reference output.
| Parameter | Type | Default | Description |
|---|---|---|---|
n | int | 4 | Maximum n-gram order (BLEU-1 through BLEU-4) |
threshold | float | 0.5 | Minimum score to pass |
ROUGE
ROUGE-L F1 score (longest common subsequence) between output andexpected_output.
When to use: Summarization tasks where recall of key content matters more than exact wording.
| Parameter | Type | Default | Description |
|---|---|---|---|
threshold | float | 0.5 | Minimum score to pass |
StartsWith
Passes if output starts with the given prefix. Case-insensitive by default. When to use: Enforcing response format conventions — code blocks that must open with a fence, structured outputs that must begin with a specific token.| Parameter | Type | Default | Description |
|---|---|---|---|
prefix | str | required | Expected prefix string |
case_sensitive | bool | False | If True, match is case-sensitive |
threshold | float | 1.0 | Minimum score to pass |

