In regulated industries (healthcare, finance, legal, government), eval traces often cannot leave your environment. multivon-eval’s compliance tools run entirely locally: no cloud services, and no LLM calls for PII detection.
PII Detection
PIIEvaluator scans LLM outputs for personally identifiable information using regex patterns. Zero API calls — suitable for air-gapped environments.
Basic usage
from multivon_eval import EvalSuite, PIIEvaluator
suite = EvalSuite("Patient Intake Bot Eval")
suite.add_evaluators(PIIEvaluator())
report = suite.run(model_fn)
A case fails if any PII is detected in the output. The failure reason lists each PII type and example matches.
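To make the pass/fail rule concrete, here is a minimal sketch of a regex-based scan in plain Python. The pattern set and the `scan_for_pii` helper are illustrative assumptions, not multivon-eval's actual patterns or internals.

```python
import re

# Illustrative patterns only (assumption): the evaluator's real regexes
# are not shown on this page.
PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "phone_us": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
    "ip_address": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
}


def scan_for_pii(text: str) -> dict[str, list[str]]:
    """Return every match per PII type; an empty dict means the text is clean."""
    found = {}
    for name, pattern in PATTERNS.items():
        matches = pattern.findall(text)
        if matches:
            found[name] = matches
    return found


output = "Call 555-123-4567 or quote SSN 123-45-6789."
hits = scan_for_pii(output)
passed = not hits  # a case fails if any PII is detected
print(passed, hits)
```

Because the scan is pure pattern matching, it needs no network access, which is what makes this style of check usable in air-gapped environments.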
Jurisdiction-specific patterns
# All patterns (default)
PIIEvaluator()
# GDPR (EU) — adds EU VAT numbers
PIIEvaluator(jurisdiction="gdpr")
# CCPA (California) — adds bank account numbers
PIIEvaluator(jurisdiction="ccpa")
# PIPEDA (Canada) — base patterns
PIIEvaluator(jurisdiction="pipeda")
# HIPAA — adds MRN, health plan numbers, VINs, fax numbers,
# admission/discharge dates, device IDs, NPI/DEA numbers, URLs
PIIEvaluator(jurisdiction="hipaa")
HIPAA coverage note: This evaluator detects 13 of 18 HIPAA Safe Harbor PHI identifiers via regex. The remaining 5 (patient names, geographic subdivisions below state, photographs, biometric data, and arbitrary unique identifiers) cannot be reliably detected from text output and require de-identification before the text reaches the evaluator. For full HIPAA Safe Harbor compliance, combine PIIEvaluator(jurisdiction="hipaa") with an upstream de-identification step.
Custom patterns
PIIEvaluator(patterns={
    "employee_id": r"EMP-\d{6}",
    "case_number": r"CASE-[A-Z]{2}\d{8}",
})
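Since custom patterns are plain regexes, it can help to exercise them with the standard `re` module before wiring them into an evaluator. The sample strings below are made up for illustration, and the check assumes the patterns are applied with search semantics.

```python
import re

custom_patterns = {
    "employee_id": r"EMP-\d{6}",
    "case_number": r"CASE-[A-Z]{2}\d{8}",
}

# Hypothetical examples: (text that should match, text that should not).
samples = {
    "employee_id": ("Badge EMP-004217 was issued.", "Badge EMP-42 was issued."),
    "case_number": ("See CASE-NY20240001.", "See CASE-20240001."),
}

for name, regex in custom_patterns.items():
    compiled = re.compile(regex)  # raises re.error if the pattern is malformed
    should_match, should_not_match = samples[name]
    assert compiled.search(should_match), f"{name} missed a valid example"
    assert not compiled.search(should_not_match), f"{name} matched an invalid example"

print("all custom patterns behave as expected")
```

Note that without `\b` anchors, `EMP-\d{6}` under `re.search` also matches the first six digits of a longer number such as `EMP-1234567`; add word boundaries if that matters for your identifiers.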
Redacting PII from reports
By default, matched PII is shown in the reason field. To mask it in audit logs:
PIIEvaluator(redact=True)
# reason: PII detected (2 type(s)):
# email: "[REDACTED-EMAIL]"
# phone_us: "[REDACTED-PHONE_US]"
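The masked format shown above can be reproduced with a regex substitution. This is a sketch of the idea rather than the library's code; the `redact` helper and both patterns are assumptions.

```python
import re

# Illustrative patterns (assumption), keyed by the PII type name.
PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "phone_us": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
}


def redact(text: str) -> str:
    """Replace each match with a [REDACTED-<TYPE>] placeholder."""
    for name, pattern in PATTERNS.items():
        text = pattern.sub(f"[REDACTED-{name.upper()}]", text)
    return text


masked = redact("Reach jane.doe@example.com or 555-123-4567.")
print(masked)
```

Keeping the PII type in the placeholder preserves the audit signal (what kind of data leaked) without copying the value itself into the log.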
What’s detected
| Pattern | Examples |
|---|---|
| email | [email protected] |
| phone_us | 555-123-4567, (800) 555-0100 |
| phone_intl | +44 7911 123456 |
| ssn | 123-45-6789 |
| credit_card | 4111 1111 1111 1111 |
| iban | DE89370400440532013000 |
| ip_address | 192.168.1.1 |
| date_of_birth | DOB: 12/05/1985 |
| passport | AB1234567 |
| address | 123 Main Street |
| eu_vat (GDPR) | DE123456789 |
| bank_account (CCPA) | 12345678901234 |
Structured Output Validation
SchemaEvaluator validates that LLM outputs conform to a defined structure. Works with Pydantic models and JSON Schema dicts. Reports per-field failures — not just valid/invalid.
StructEval (2025) found GPT-4 fails complex structured extraction ~12% of the time. This evaluator catches those failures in your specific pipeline.
Pydantic model
from pydantic import BaseModel
from multivon_eval import SchemaEvaluator

class InvoiceExtraction(BaseModel):
    vendor: str
    amount: float
    currency: str
    invoice_date: str
    line_items: list[str]

suite.add_evaluators(SchemaEvaluator(InvoiceExtraction))
Supports Pydantic v1 and v2. Field-level error messages:
Schema validation failed:
amount: Input should be a valid number, unable to parse string as a number
currency: Field required
JSON Schema
suite.add_evaluators(SchemaEvaluator({
    "type": "object",
    "required": ["title", "score", "category"],
    "properties": {
        "title": {"type": "string", "maxLength": 100},
        "score": {"type": "number", "minimum": 0, "maximum": 1},
        "category": {"type": "string", "enum": ["positive", "negative", "neutral"]},
    }
}))
Handling markdown code fences
SchemaEvaluator automatically strips markdown code fences from outputs:
```json
{"title": "Great product", "score": 0.9, "category": "positive"}
```
This is valid — the schema evaluator strips the fence before parsing.
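For intuition, fence stripping before JSON parsing can be sketched with the standard library alone. The `strip_code_fence` helper and its regex are assumptions for illustration; SchemaEvaluator's actual handling may differ.

```python
import json
import re

TICKS = "`" * 3  # a markdown fence marker, built this way to keep the example readable

# Optional language tag after the opening fence, body captured lazily.
FENCE = re.compile(r"^\s*`{3}[\w-]*[ \t]*\n(.*?)\n?[ \t]*`{3}\s*$", re.DOTALL)


def strip_code_fence(text: str) -> str:
    """Return the body of a fenced block, or the stripped text if unfenced."""
    match = FENCE.match(text)
    return match.group(1) if match else text.strip()


raw = (
    TICKS + "json\n"
    '{"title": "Great product", "score": 0.9, "category": "positive"}\n'
    + TICKS
)
parsed = json.loads(strip_code_fence(raw))
print(parsed["category"])
```

Unfenced output passes through unchanged, so the same path handles models that sometimes wrap their JSON and sometimes do not.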
Compliance Audit Trail
ComplianceReporter writes a hash-chained, tamper-evident NDJSON log of every eval run, with Article-level regulatory control annotations on each evaluator result.
Basic usage
from multivon_eval import EvalSuite, ComplianceReporter
suite = EvalSuite("HR Bot Eval")
reporter = ComplianceReporter(
    output_dir="./audit-logs",
    framework="eu-ai-act",
)
report = suite.run(model_fn)
record_id = reporter.record(report, tags={"version": "2.1", "env": "staging"})
# [compliance] audit record → a3f9b2c1 (hr_bot_eval.audit.ndjson)
# [compliance] framework: eu-ai-act
EU AI Act high-risk factory
For high-risk systems under Annex III, use the factory — it wires the standard
measurable controls with calibrated thresholds:
from multivon_eval import EvalSuite, ComplianceReporter
suite = EvalSuite.eu_ai_act_high_risk(jurisdiction="gdpr")
suite.add_cases(cases)
reporter = ComplianceReporter("./audit-logs", framework="eu-ai-act")
report = suite.run(model_fn, runs=5)
reporter.record(report, tags={"system": "triage-bot"})
The factory wires: NotEmpty, Faithfulness, Hallucination, Relevance,
Toxicity, Bias, PIIEvaluator (plus optional SchemaEvaluator if you pass
schema=). Pair with runs=5 to surface flakiness — Art. 15(2) robustness
isn’t a single-run claim.
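One way to read repeated runs is to flag any case whose pass/fail outcome differs between runs. The sketch below uses made-up per-run results and a hypothetical `flaky_cases` helper; it does not reflect how multivon-eval aggregates `runs=5` internally.

```python
def flaky_cases(results: dict[str, list[bool]]) -> dict[str, float]:
    """Map each unstable case id to its pass rate across runs."""
    return {
        case_id: sum(outcomes) / len(outcomes)
        for case_id, outcomes in results.items()
        if len(set(outcomes)) > 1  # mixed outcomes mean the case is not stable
    }


# Hypothetical outcomes for three cases over five runs.
results = {
    "case-001": [True, True, True, True, True],
    "case-002": [True, False, True, True, False],
    "case-003": [False, False, False, False, False],
}
unstable = flaky_cases(results)
print(unstable)
```

A case that always fails is a plain accuracy problem; a case that passes only some of the time is the robustness problem that a single run would hide.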
Coverage report
Before you run, check which Articles the suite actually exercises:
print(reporter.coverage(suite))
eu-ai-act coverage for suite 'EU AI Act High-Risk Eval'
───────────────────────────────────────────────────────
  [x] Art. 9(2)(b) Foreseeable misuse risk identification
      covered by: toxicity
  [x] Art. 10(2)(f-g) Examination and mitigation of possible biases
      covered by: bias
  [x] Art. 10(5) Processing of personal data
      covered by: pii_detection
  [x] Art. 15(1) Accuracy
      covered by: faithfulness, hallucination, relevance
  [x] Art. 15(2) Robustness
      covered by: not_empty

Process controls (not satisfiable by evaluators alone):
  Art. 11 Technical documentation
  Art. 12 Record-keeping (satisfied by this reporter)
  Art. 13 Transparency and information to deployers
  Art. 14 Human oversight
  Art. 15(4-5) Cybersecurity and resilience

Coverage: 5/5 measurable controls exercised.
Process controls (Art. 11, 13, 14, 15(4-5)) require organizational measures
outside of model evaluation; the reporter surfaces them so you don’t mistake
“high coverage” for “fully compliant.”
Framework mappings
ComplianceReporter(framework="eu-ai-act") # Article-level EU AI Act controls
ComplianceReporter(framework="nist-ai-rmf") # NIST AI RMF subcategories
ComplianceReporter(framework="none") # raw scores only
EU AI Act mappings (Regulation (EU) 2024/1689):
| Evaluator | Control |
|---|---|
| toxicity | Art. 9(2)(b): Foreseeable misuse risk identification |
| bias | Art. 10(2)(f-g): Examination & mitigation of possible biases |
| pii_detection | Art. 10(5): Processing of personal data |
| faithfulness, hallucination, relevance, answer_accuracy, context_precision, context_recall, summarization, coherence, task_completion, tool_call_accuracy, plan_quality, step_faithfulness, … | Art. 15(1): Accuracy |
| not_empty, schema_compliance, json_schema, self_consistency, turn_consistency, latency, max_latency, agent_memory, … | Art. 15(2): Robustness |
NIST AI RMF subcategories: accuracy evaluators → MEASURE 2.3, robustness → MEASURE 2.5, toxicity → MEASURE 2.6, pii_detection → MEASURE 2.10, bias → MEASURE 2.11.
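The coverage idea can be sketched as a lookup from evaluator names to controls. The dictionary below copies a subset of the EU AI Act table above; the `coverage` function is a hypothetical stand-in for illustration, not the reporter's implementation.

```python
# Subset of the evaluator-to-control table above; not the library's full mapping.
EU_AI_ACT = {
    "toxicity": "Art. 9(2)(b)",
    "bias": "Art. 10(2)(f-g)",
    "pii_detection": "Art. 10(5)",
    "faithfulness": "Art. 15(1)",
    "hallucination": "Art. 15(1)",
    "relevance": "Art. 15(1)",
    "not_empty": "Art. 15(2)",
    "schema_compliance": "Art. 15(2)",
}


def coverage(evaluators: list[str]) -> dict[str, list[str]]:
    """Group the suite's evaluators under the control each one exercises."""
    covered = {control: [] for control in dict.fromkeys(EU_AI_ACT.values())}
    for name in evaluators:
        control = EU_AI_ACT.get(name)
        if control is not None:
            covered[control].append(name)
    return covered


# Hypothetical suite with three evaluators.
suite_evaluators = ["faithfulness", "pii_detection", "schema_compliance"]
report = coverage(suite_evaluators)
exercised = [c for c, names in report.items() if names]
missing = [c for c, names in report.items() if not names]
print(f"Coverage: {len(exercised)}/{len(report)} measurable controls exercised.")
print("Missing:", missing)
```

A suite like this one would leave the toxicity and bias controls unexercised, which is the kind of gap a coverage check is meant to surface before a run.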
Verifying integrity (hash chain)
Each record stores prev_hash pointing at the previous record’s record_hash,
forming a SHA-256 chain. verify() walks the chain end-to-end:
ok = reporter.verify("HR Bot Eval")
# OK a3f9b2c1 2026-05-13T09:23:11
# OK b7d1e4f2 2026-05-14T14:07:42
# Verification: PASS — all records intact
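If a record is edited in place, verify() reports TAMPERED for that record.
If a middle record is deleted, verify() reports CHAIN BROKEN on the next
record, because its prev_hash no longer matches; per-record hashing alone
would not detect the deletion.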
Each NDJSON line:
{
  "record_id": "a3f9b2c1ef20",
  "suite_name": "HR Bot Eval",
  "model_id": "claude-sonnet-4-6",
  "timestamp": "2026-05-13T09:23:11.821Z",
  "framework": "eu-ai-act",
  "chain_version": 1,
  "prev_hash": "0000…0000",
  "summary": {
    "total": 50,
    "passed": 46,
    "pass_rate": 0.92,
    "tags": {"version": "2.1", "env": "staging"}
  },
  "evaluator_results": [
    {
      "evaluator": "faithfulness",
      "avg_score": 0.89,
      "pass_rate": 0.88,
      "controls": [
        {"id": "Art. 15(1)", "description": "Accuracy"}
      ]
    }
  ],
  "record_hash": "e3b0c44298fc1c149afb…"
}
controls is a list because some evaluators may map to multiple controls in
future framework versions. prev_hash is "0" * 64 for the first record in
the log (genesis).
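The chain mechanics can be sketched with `hashlib` and `json`. The canonical serialization, the choice of hashed fields, and the `append`/`verify` helpers here are assumptions for illustration; only the `prev_hash`/`record_hash` linkage and the all-zero genesis value come from the description above.

```python
import hashlib
import json

GENESIS = "0" * 64  # prev_hash of the first record


def record_hash(record: dict) -> str:
    """SHA-256 over the record's canonical JSON, excluding its own hash field."""
    body = {k: v for k, v in record.items() if k != "record_hash"}
    canonical = json.dumps(body, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def append(log: list[dict], payload: dict) -> None:
    """Link a new record to the previous one and seal it with its hash."""
    prev = log[-1]["record_hash"] if log else GENESIS
    record = {**payload, "prev_hash": prev}
    record["record_hash"] = record_hash(record)
    log.append(record)


def verify(log: list[dict]) -> list[str]:
    """Return one status per record: OK, TAMPERED, or CHAIN BROKEN."""
    statuses, expected_prev = [], GENESIS
    for record in log:
        if record_hash(record) != record["record_hash"]:
            statuses.append("TAMPERED")
        elif record["prev_hash"] != expected_prev:
            statuses.append("CHAIN BROKEN")
        else:
            statuses.append("OK")
        expected_prev = record["record_hash"]
    return statuses


log: list[dict] = []
for rate in (0.92, 0.95, 0.90):
    append(log, {"suite_name": "HR Bot Eval", "pass_rate": rate})

intact = verify(log)
deleted = verify([log[0], log[2]])      # middle record removed
edited_log = [dict(r) for r in log]
edited_log[1]["pass_rate"] = 1.0        # record edited in place
edited = verify(edited_log)
print(intact, deleted, edited)
```

The deletion case is the reason for chaining: each surviving record still hashes correctly on its own, and only the mismatch between `prev_hash` and the preceding record's hash reveals the gap.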
Full compliance pipeline
from multivon_eval import (
    EvalSuite, EvalCase,
    Faithfulness, PIIEvaluator, SchemaEvaluator,
    ComplianceReporter,
)
from pydantic import BaseModel

class ClinicalSummary(BaseModel):
    diagnosis: str
    recommended_action: str
    urgency: str

suite = EvalSuite("Clinical AI Eval")
# load() is not defined in this snippet; it stands in for your own helper
# that returns the test cases (presumably EvalCase objects).
suite.add_cases(load("tests/clinical_cases.jsonl"))
suite.add_evaluators(
    Faithfulness(),
    PIIEvaluator(jurisdiction="gdpr", redact=True),
    SchemaEvaluator(ClinicalSummary),
)

reporter = ComplianceReporter("./audit-logs", framework="eu-ai-act")
report = suite.run(model_fn)
reporter.record(report, tags={"regulatory_period": "Q4-2025"})

# Fail CI if any case fails (PII detected, schema invalid, or unfaithful output)
if report.pass_rate < 1.0:
    raise SystemExit(f"Compliance check failed: {report.failed} case(s) failed")
CI/CD Integration
# .github/workflows/compliance-eval.yml
jobs:
  compliance:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pip install multivon-eval
      - run: python evals/compliance_check.py
        # No API key needed for PIIEvaluator + SchemaEvaluator
      - uses: actions/upload-artifact@v4
        with:
          name: audit-logs
          path: ./audit-logs/
The audit logs in ./audit-logs/ are the compliance artifacts — store them alongside your release artifacts.
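Because each log line is a standalone JSON record, a later pipeline step can inspect the artifact with the standard library. The sketch below writes a stand-in log to a temporary directory and reads back the latest record; the `latest_record` helper and the sample records are assumptions, and only the `record_id` and `summary` fields from the record format above are used.

```python
import json
import tempfile
from pathlib import Path


def latest_record(path: Path) -> dict:
    """Parse an NDJSON file (one JSON object per line) and return the last record."""
    lines = [
        line
        for line in path.read_text(encoding="utf-8").splitlines()
        if line.strip()
    ]
    return json.loads(lines[-1])


# Stand-in log (made-up values) with only the fields this sketch reads.
sample = [
    {"record_id": "a3f9b2c1ef20", "summary": {"total": 50, "passed": 46, "pass_rate": 0.92}},
    {"record_id": "b7d1e4f2", "summary": {"total": 50, "passed": 50, "pass_rate": 1.0}},
]

with tempfile.TemporaryDirectory() as tmp:
    log_path = Path(tmp) / "hr_bot_eval.audit.ndjson"
    log_path.write_text(
        "\n".join(json.dumps(r) for r in sample) + "\n", encoding="utf-8"
    )
    record = latest_record(log_path)

gate_passed = record["summary"]["pass_rate"] >= 1.0
print(record["record_id"], "PASS" if gate_passed else "FAIL")
```

A release job could run a check like this against the uploaded artifact to confirm that the most recent audited run met the gate before promoting a build.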