For regulated industries (healthcare, finance, legal, government), your eval traces can’t leave your environment. multivon-eval’s compliance tools run entirely locally: no cloud, no LLM calls for PII detection.

PII Detection

PIIEvaluator scans LLM outputs for personally identifiable information using regex patterns. Zero API calls — suitable for air-gapped environments.

Basic usage

from multivon_eval import EvalSuite, PIIEvaluator

suite = EvalSuite("Patient Intake Bot Eval")
suite.add_evaluators(PIIEvaluator())

report = suite.run(model_fn)
A case fails if any PII is detected in the output. The failure reason lists each PII type and example matches.
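
To make the regex approach concrete, here is a minimal sketch of pattern-based PII scanning that uses only Python's standard library. The three patterns and the `scan_for_pii` helper are illustrative assumptions; they are not multivon-eval's actual patterns or API.

```python
import re

# Illustrative patterns only; the library's real patterns are not shown here.
ILLUSTRATIVE_PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "phone_us": re.compile(r"\(?\b\d{3}\)?[-. ]\d{3}[-. ]\d{4}\b"),
}


def scan_for_pii(text: str) -> dict[str, list[str]]:
    """Return every match per PII type; an empty dict means nothing was found."""
    found = {}
    for name, pattern in ILLUSTRATIVE_PATTERNS.items():
        matches = pattern.findall(text)
        if matches:
            found[name] = matches
    return found


hits = scan_for_pii("Call 555-123-4567 or write to jane@example.com.")
passed = not hits  # the case fails as soon as any PII type matches
```

Because the scan is pure pattern matching, it needs no network access, which is what makes this style of check usable in an air-gapped environment.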

Jurisdiction-specific patterns

# All patterns (default)
PIIEvaluator()

# GDPR (EU) — adds EU VAT numbers
PIIEvaluator(jurisdiction="gdpr")

# CCPA (California) — adds bank account numbers
PIIEvaluator(jurisdiction="ccpa")

# PIPEDA (Canada) — base patterns
PIIEvaluator(jurisdiction="pipeda")

# HIPAA — adds MRN, health plan numbers, VINs, fax numbers,
#          admission/discharge dates, device IDs, NPI/DEA numbers, URLs
PIIEvaluator(jurisdiction="hipaa")
HIPAA coverage note: This evaluator detects 13 of the 18 HIPAA Safe Harbor PHI identifiers via regex. The remaining 5 (patient names, geographic subdivisions smaller than a state, photographs, biometric data, and arbitrary unique identifiers) cannot be reliably detected by pattern matching on text output, so they must be removed by de-identification before the text reaches the evaluator. For full HIPAA Safe Harbor compliance, combine PIIEvaluator(jurisdiction="hipaa") with an upstream de-identification step.

Custom patterns

PIIEvaluator(patterns={
    "employee_id": r"EMP-\d{6}",
    "case_number": r"CASE-[A-Z]{2}\d{8}",
})
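
Before registering custom patterns, it can help to check them against sample strings. The sketch below does that with the standard `re` module, using the two patterns above; the sample text is made up for illustration.

```python
import re

custom_patterns = {
    "employee_id": r"EMP-\d{6}",
    "case_number": r"CASE-[A-Z]{2}\d{8}",
}

sample = "Escalated by EMP-004217 under CASE-NY20240117; see also EMP-12."

# Collect matches per pattern; EMP-12 has too few digits for \d{6} and is skipped.
matches = {name: re.findall(rx, sample) for name, rx in custom_patterns.items()}
```

Note that without `\b` anchors, `EMP-\d{6}` would also match the first six digits of a longer ID such as EMP-1234567, so consider adding word boundaries if your ID formats overlap.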

Redacting PII from reports

By default, matched PII is shown in the reason field. To mask it in audit logs:
PIIEvaluator(redact=True)
# reason: PII detected (2 type(s)):
#   email: "[REDACTED-EMAIL]"
#   phone_us: "[REDACTED-PHONE_US]"
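
As a rough illustration of what redaction amounts to, the sketch below replaces each match with a `[REDACTED-<TYPE>]` placeholder in the format shown above. The patterns and the `redact` function are assumptions for illustration, not the library's implementation.

```python
import re

# Illustrative patterns; not the library's own.
PATTERNS = {
    "email": r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}",
    "ssn": r"\b\d{3}-\d{2}-\d{4}\b",
}


def redact(text: str) -> str:
    """Replace each match with a [REDACTED-<TYPE>] placeholder."""
    for name, rx in PATTERNS.items():
        text = re.sub(rx, f"[REDACTED-{name.upper()}]", text)
    return text


masked = redact("SSN 123-45-6789, contact jane@example.com")
```

The placeholder keeps the PII type visible, so an auditor can still see what kind of data leaked without the log itself becoming a store of personal data.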

What’s detected

| Pattern | Examples |
| --- | --- |
| email | [email protected] |
| phone_us | 555-123-4567, (800) 555-0100 |
| phone_intl | +44 7911 123456 |
| ssn | 123-45-6789 |
| credit_card | 4111 1111 1111 1111 |
| iban | DE89370400440532013000 |
| ip_address | 192.168.1.1 |
| date_of_birth | DOB: 12/05/1985 |
| passport | AB1234567 |
| address | 123 Main Street |
| eu_vat (GDPR) | DE123456789 |
| bank_account (CCPA) | 12345678901234 |

Structured Output Validation

SchemaEvaluator validates that LLM outputs conform to a defined structure. Works with Pydantic models and JSON Schema dicts. Reports per-field failures — not just valid/invalid. StructEval (2025) found GPT-4 fails complex structured extraction ~12% of the time. This evaluator catches those failures in your specific pipeline.

Pydantic model

from pydantic import BaseModel
from multivon_eval import SchemaEvaluator

class InvoiceExtraction(BaseModel):
    vendor: str
    amount: float
    currency: str
    invoice_date: str
    line_items: list[str]

suite.add_evaluators(SchemaEvaluator(InvoiceExtraction))
Supports Pydantic v1 and v2. Field-level error messages:
Schema validation failed:
  amount: Input should be a valid number, unable to parse string as a number
  currency: Field required

JSON Schema

suite.add_evaluators(SchemaEvaluator({
    "type": "object",
    "required": ["title", "score", "category"],
    "properties": {
        "title": {"type": "string", "maxLength": 100},
        "score": {"type": "number", "minimum": 0, "maximum": 1},
        "category": {"type": "string", "enum": ["positive", "negative", "neutral"]},
    }
}))

Handling markdown code fences

SchemaEvaluator automatically strips markdown code fences from outputs:
```json
{"title": "Great product", "score": 0.9, "category": "positive"}
```
This is valid — the schema evaluator strips the fence before parsing.
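
Fence stripping of this kind can be sketched with the standard library alone. The `strip_code_fence` helper and its regex are assumptions for illustration; the library's own parsing rules may differ (for example, in how it treats text around the fence).

```python
import json
import re

FENCE = "`" * 3  # built this way so the example does not contain a literal fence

# Optional language tag after the opening fence, then the payload, then the closing fence.
FENCED = re.compile(
    r"^\s*" + FENCE + r"[A-Za-z0-9_-]*\s*\n(.*?)\n\s*" + FENCE + r"\s*$",
    re.DOTALL,
)


def strip_code_fence(output: str) -> str:
    """Return the payload inside a single wrapping fence, else the trimmed text."""
    match = FENCED.match(output)
    return match.group(1) if match else output.strip()


raw = FENCE + 'json\n{"title": "Great product", "score": 0.9}\n' + FENCE
parsed = json.loads(strip_code_fence(raw))
```

Unfenced output passes through with only surrounding whitespace trimmed, so the same code path handles models that do and do not wrap their JSON.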

Compliance Audit Trail

ComplianceReporter writes a hash-chained, tamper-evident NDJSON log of every eval run, with Article-level regulatory control annotations on each evaluator result.

Basic usage

from multivon_eval import EvalSuite, ComplianceReporter

suite = EvalSuite("HR Bot Eval")
reporter = ComplianceReporter(
    output_dir="./audit-logs",
    framework="eu-ai-act",
)

report = suite.run(model_fn)
record_id = reporter.record(report, tags={"version": "2.1", "env": "staging"})
# [compliance] audit record → a3f9b2c1  (hr_bot_eval.audit.ndjson)
# [compliance] framework: eu-ai-act

EU AI Act high-risk factory

For high-risk systems under Annex III, use the factory — it wires the standard measurable controls with calibrated thresholds:
from multivon_eval import EvalSuite, ComplianceReporter

suite = EvalSuite.eu_ai_act_high_risk(jurisdiction="gdpr")
suite.add_cases(cases)

reporter = ComplianceReporter("./audit-logs", framework="eu-ai-act")
report = suite.run(model_fn, runs=5)
reporter.record(report, tags={"system": "triage-bot"})
The factory wires: NotEmpty, Faithfulness, Hallucination, Relevance, Toxicity, Bias, PIIEvaluator (plus optional SchemaEvaluator if you pass schema=). Pair with runs=5 to surface flakiness — Art. 15(2) robustness isn’t a single-run claim.
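
To see why multiple runs matter, consider a sketch that classifies each case by whether its outcomes agree across runs. The `outcomes` dictionary is hypothetical data; the shape of multivon-eval's real report object is not assumed here.

```python
# Hypothetical per-case outcomes over 5 runs (True = passed).
outcomes = {
    "case-001": [True, True, True, True, True],
    "case-002": [True, False, True, True, False],
    "case-003": [False, False, False, False, False],
}


def classify(runs: list[bool]) -> str:
    """Stable pass, stable fail, or flaky when the runs disagree."""
    if all(runs):
        return "stable-pass"
    if not any(runs):
        return "stable-fail"
    return "flaky"


labels = {case_id: classify(runs) for case_id, runs in outcomes.items()}
flaky = [case_id for case_id, label in labels.items() if label == "flaky"]
```

A single run would have reported case-002 as either a pass or a fail depending on luck; only repeated runs reveal that its behavior is unstable.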

Coverage report

Before you run, check which Articles the suite actually exercises:
print(reporter.coverage(suite))
eu-ai-act coverage for suite 'EU AI Act High-Risk Eval'
───────────────────────────────────────────────────────
  [x] Art. 9(2)(b)   Foreseeable misuse risk identification
      covered by: toxicity
  [x] Art. 10(2)(f-g) Examination and mitigation of possible biases
      covered by: bias
  [x] Art. 10(5)     Processing of personal data
      covered by: pii_detection
  [x] Art. 15(1)     Accuracy
      covered by: faithfulness, hallucination, relevance
  [x] Art. 15(2)     Robustness
      covered by: not_empty

  Process controls (not satisfiable by evaluators alone):
      Art. 11        Technical documentation
      Art. 12        Record-keeping (satisfied by this reporter)
      Art. 13        Transparency and information to deployers
      Art. 14        Human oversight
      Art. 15(4-5)   Cybersecurity and resilience

  Coverage: 5/5 measurable controls exercised.
Process controls (Art. 11, 13, 14, 15(4-5)) require organizational measures outside of model evaluation; the reporter surfaces them so you don’t mistake “high coverage” for “fully compliant.”

Framework mappings

ComplianceReporter(framework="eu-ai-act")    # Article-level EU AI Act controls
ComplianceReporter(framework="nist-ai-rmf")  # NIST AI RMF subcategories
ComplianceReporter(framework="none")         # raw scores only
EU AI Act mappings (Regulation (EU) 2024/1689):
| Evaluator | Control |
| --- | --- |
| toxicity | Art. 9(2)(b): Foreseeable misuse risk identification |
| bias | Art. 10(2)(f-g): Examination & mitigation of possible biases |
| pii_detection | Art. 10(5): Processing of personal data |
| faithfulness, hallucination, relevance, answer_accuracy, context_precision, context_recall, summarization, coherence, task_completion, tool_call_accuracy, plan_quality, step_faithfulness, … | Art. 15(1): Accuracy |
| not_empty, schema_compliance, json_schema, self_consistency, turn_consistency, latency, max_latency, agent_memory, … | Art. 15(2): Robustness |
NIST AI RMF subcategories: accuracy evaluators → MEASURE 2.3, robustness → MEASURE 2.5, toxicity → MEASURE 2.6, pii_detection → MEASURE 2.10, bias → MEASURE 2.11.
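
The idea behind a coverage check can be sketched as a lookup from evaluator names to controls. The dictionary below copies a subset of the EU AI Act mappings listed above; the `coverage` function is a hypothetical stand-in, not the reporter's implementation.

```python
# Subset of the evaluator-to-control mappings listed above.
EU_AI_ACT_CONTROLS = {
    "toxicity": "Art. 9(2)(b)",
    "bias": "Art. 10(2)(f-g)",
    "pii_detection": "Art. 10(5)",
    "faithfulness": "Art. 15(1)",
    "hallucination": "Art. 15(1)",
    "relevance": "Art. 15(1)",
    "not_empty": "Art. 15(2)",
    "schema_compliance": "Art. 15(2)",
}


def coverage(evaluator_names: list[str]) -> dict[str, list[str]]:
    """Map every measurable control to the suite evaluators that exercise it."""
    covered = {control: [] for control in EU_AI_ACT_CONTROLS.values()}
    for name in evaluator_names:
        control = EU_AI_ACT_CONTROLS.get(name)
        if control is not None:
            covered[control].append(name)
    return covered


result = coverage(["faithfulness", "pii_detection", "not_empty"])
missing = [control for control, names in result.items() if not names]
```

Here a suite with only three evaluators leaves the misuse and bias controls unexercised, which is the kind of gap a coverage report is meant to surface before a run.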

Verifying integrity (hash chain)

Each record stores prev_hash pointing at the previous record’s record_hash, forming a SHA-256 chain. verify() walks the chain end-to-end:
ok = reporter.verify("HR Bot Eval")
#   OK  a3f9b2c1  2026-05-13T09:23:11
#   OK  b7d1e4f2  2026-05-14T14:07:42
#   Verification: PASS — all records intact
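If a record is edited in place, verify() reports TAMPERED for that record. If a middle record is deleted, which per-record hashing alone would not detect, verify() reports CHAIN BROKEN on the next record, because that record's prev_hash no longer matches the record_hash of the record before it.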
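
The chaining idea can be sketched in a few lines with `hashlib`. Everything here is an assumption for illustration: the canonical JSON serialization, the choice of hashed fields, and the `append` and `verify` helpers are not taken from the library.

```python
import hashlib
import json

GENESIS = "0" * 64  # prev_hash of the first record


def record_hash(record: dict) -> str:
    """SHA-256 over the record's canonical JSON, excluding the hash field itself."""
    body = {k: v for k, v in record.items() if k != "record_hash"}
    canonical = json.dumps(body, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def append(chain: list[dict], payload: dict) -> None:
    """Link a new record to the current tail and seal it with its own hash."""
    prev = chain[-1]["record_hash"] if chain else GENESIS
    record = {**payload, "prev_hash": prev}
    record["record_hash"] = record_hash(record)
    chain.append(record)


def verify(chain: list[dict]) -> list[str]:
    """Return one status per record: OK, TAMPERED, or CHAIN BROKEN."""
    statuses, expected_prev = [], GENESIS
    for record in chain:
        if record_hash(record) != record["record_hash"]:
            statuses.append("TAMPERED")
        elif record["prev_hash"] != expected_prev:
            statuses.append("CHAIN BROKEN")
        else:
            statuses.append("OK")
        expected_prev = record["record_hash"]
    return statuses


chain: list[dict] = []
for run in (1, 2, 3):
    append(chain, {"suite_name": "HR Bot Eval", "run": run})

intact = verify(chain)
after_delete = verify([chain[0], chain[2]])  # middle record removed
edited = [dict(r) for r in chain]
edited[1]["run"] = 99                        # in-place edit
after_edit = verify(edited)
```

Because each record's hash covers its `prev_hash`, an attacker who edits or removes a record would have to recompute every later hash to hide the change.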

Audit record format

Each NDJSON line:
{
  "record_id": "a3f9b2c1ef20",
  "suite_name": "HR Bot Eval",
  "model_id": "claude-sonnet-4-6",
  "timestamp": "2026-05-13T09:23:11.821Z",
  "framework": "eu-ai-act",
  "chain_version": 1,
  "prev_hash": "0000…0000",
  "summary": {
    "total": 50,
    "passed": 46,
    "pass_rate": 0.92,
    "tags": {"version": "2.1", "env": "staging"}
  },
  "evaluator_results": [
    {
      "evaluator": "faithfulness",
      "avg_score": 0.89,
      "pass_rate": 0.88,
      "controls": [
        {"id": "Art. 15(1)", "description": "Accuracy"}
      ]
    }
  ],
  "record_hash": "e3b0c44298fc1c149afb…"
}
controls is a list because some evaluators may map to multiple controls in future framework versions. prev_hash is "0" * 64 for the first record in the log (genesis).
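
Since each line is a standalone JSON object, the log can be consumed with the standard library alone. The sketch below reads two abbreviated records in the shape shown above and lists those under a pass-rate threshold; the second record ID, the values, and the 0.95 threshold are illustrative assumptions.

```python
import io
import json

# Two abbreviated records in the shape shown above; values are illustrative.
ndjson = io.StringIO(
    '{"record_id": "a3f9b2c1ef20", "summary": {"total": 50, "passed": 46, "pass_rate": 0.92}}\n'
    '{"record_id": "b7d1e4f2aa01", "summary": {"total": 50, "passed": 50, "pass_rate": 1.0}}\n'
)


def read_records(stream):
    """Yield one dict per non-empty NDJSON line."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)


below_threshold = [
    r["record_id"] for r in read_records(ndjson) if r["summary"]["pass_rate"] < 0.95
]
```

In practice you would pass an open file handle for the `.audit.ndjson` file instead of the in-memory stream used here.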

Full compliance pipeline

from multivon_eval import (
    EvalSuite, EvalCase,
    Faithfulness, PIIEvaluator, SchemaEvaluator,
    ComplianceReporter,
)
from pydantic import BaseModel

class ClinicalSummary(BaseModel):
    diagnosis: str
    recommended_action: str
    urgency: str

suite = EvalSuite("Clinical AI Eval")
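# load() is not defined in this snippet; it stands in for your own helper
# that reads the JSONL file and returns a list of EvalCase objects.
suite.add_cases(load("tests/clinical_cases.jsonl"))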
suite.add_evaluators(
    Faithfulness(),
    PIIEvaluator(jurisdiction="gdpr", redact=True),
    SchemaEvaluator(ClinicalSummary),
)

reporter = ComplianceReporter("./audit-logs", framework="eu-ai-act")
report = suite.run(model_fn)
reporter.record(report, tags={"regulatory_period": "Q4-2025"})

# Fail CI if any case fails any evaluator (PII detected, schema invalid, or a Faithfulness failure)
if report.pass_rate < 1.0:
    raise SystemExit(f"Compliance check failed: {report.failed} case(s) failed")

CI/CD Integration

# .github/workflows/compliance-eval.yml
jobs:
  compliance:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pip install multivon-eval
      - run: python evals/compliance_check.py
        # No API key needed for PIIEvaluator + SchemaEvaluator
      - uses: actions/upload-artifact@v4
        with:
          name: audit-logs
          path: ./audit-logs/
The audit logs in ./audit-logs/ are the compliance artifacts — store them alongside your release artifacts.