Compliance & Privacy

For regulated industries (healthcare, finance, legal, government), eval traces often cannot leave your environment. multivon-eval’s compliance tools run entirely locally: no cloud services, and no LLM calls for PII detection.

PII Detection

PIIEvaluator scans LLM outputs for personally identifiable information using regex patterns. It makes no API calls, so it works in air-gapped environments.

Basic usage

from multivon_eval import EvalSuite, PIIEvaluator

suite = EvalSuite("Patient Intake Bot Eval")
suite.add_evaluators(PIIEvaluator())

report = suite.run(model_fn)  # model_fn: your model function under test, defined elsewhere

A case fails if any PII is detected in the output. The failure reason lists each PII type and example matches.
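To make the pass/fail rule concrete, here is a minimal sketch of regex-based PII scanning using only the standard library. The two patterns, `find_pii`, and `check_case` are illustrative assumptions; they are not the patterns or internals that PIIEvaluator actually ships.

```python
import re

# Illustrative patterns only (assumption): the library's real patterns are
# not reproduced on this page.
SKETCH_PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def find_pii(text: str) -> dict[str, list[str]]:
    """Return {pii_type: [matches]} for every pattern that matched."""
    hits = {}
    for name, pattern in SKETCH_PATTERNS.items():
        matches = pattern.findall(text)
        if matches:
            hits[name] = matches
    return hits


def check_case(output: str) -> tuple[bool, str]:
    """A case passes only if no pattern matched; the reason lists each type."""
    hits = find_pii(output)
    if not hits:
        return True, "no PII detected"
    lines = [f"PII detected ({len(hits)} type(s)):"]
    lines += [f'  {name}: "{matches[0]}"' for name, matches in hits.items()]
    return False, "\n".join(lines)


passed, reason = check_case("Contact jane@example.com, SSN 123-45-6789")
print(passed)
print(reason)
```

Because everything here is pattern matching on a string, a scan like this needs no network access, which is the property that makes the air-gapped use case possible.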

Jurisdiction-specific patterns

# All patterns (default)
PIIEvaluator()

# GDPR (EU) — adds EU VAT plus ~13 national-ID patterns: UK NI and NHS
# numbers, Spain DNI/NIE, Italy codice fiscale, France NIR, Germany tax
# ID, Netherlands BSN, Poland PESEL, Sweden personnummer, Denmark CPR,
# Ireland PPSN, Finland HETU
PIIEvaluator(jurisdiction="gdpr")

# CCPA (California) — adds bank account numbers
PIIEvaluator(jurisdiction="ccpa")

# PIPEDA (Canada) — base patterns
PIIEvaluator(jurisdiction="pipeda")

# HIPAA — adds MRN, health plan numbers, VINs, fax numbers,
#          admission/discharge dates, device IDs, NPI/DEA numbers, URLs
PIIEvaluator(jurisdiction="hipaa")

HIPAA coverage note: This evaluator detects 13 of 18 HIPAA Safe Harbor PHI identifiers via regex. The remaining 5 (patient names, geographic subdivisions below state, photographs, biometric data, and arbitrary unique identifiers) cannot be reliably detected from text output and require de-identification before the text reaches the evaluator. For full HIPAA Safe Harbor compliance, combine PIIEvaluator(jurisdiction="hipaa") with an upstream de-identification step.

Custom patterns

PIIEvaluator(patterns={
    "employee_id": r"EMP-\d{6}",
    "case_number": r"CASE-[A-Z]{2}\d{8}",
})
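Before handing custom patterns to the evaluator, it can help to check them against known samples with the standard `re` module. The sketch below is one way to do that; `validate_patterns` and the sample strings are hypothetical and not part of multivon-eval.

```python
import re

custom_patterns = {
    "employee_id": r"EMP-\d{6}",
    "case_number": r"CASE-[A-Z]{2}\d{8}",
}

# Hypothetical sample strings: one that should match and one that should not.
should_match = {
    "employee_id": "Assigned to EMP-004217",
    "case_number": "See CASE-NY20250113",
}
should_not_match = {
    "employee_id": "EMP-42",
    "case_number": "CASE-1234",
}


def validate_patterns(patterns, positives, negatives):
    """Return a list of human-readable problems; an empty list means all good."""
    problems = []
    for name, raw in patterns.items():
        compiled = re.compile(raw)  # raises re.error if the pattern is invalid
        if not compiled.search(positives[name]):
            problems.append(f"{name}: missed {positives[name]!r}")
        if compiled.search(negatives[name]):
            problems.append(f"{name}: false positive on {negatives[name]!r}")
    return problems


print(validate_patterns(custom_patterns, should_match, should_not_match))
```

Note that `EMP-\d{6}` has no trailing boundary, so it also matches the first six digits of a longer run such as `EMP-1234567`. Appending `\b` to the pattern rejects that case if it matters for your IDs.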

Redacting PII from reports

By default, matched PII is shown in the reason field. To mask it in audit logs:

PIIEvaluator(redact=True)
# reason: PII detected (2 type(s)):
#   email: "[REDACTED-EMAIL]"
#   phone_us: "[REDACTED-PHONE_US]"
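The masking shown above can be approximated with `re.sub`. This is a sketch under assumptions: the patterns and the `redact` helper are illustrative, and only the `[REDACTED-<TYPE>]` token format is taken from the output above.

```python
import re

# Illustrative patterns only (assumption), keyed by the pattern names used above.
SKETCH_PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "phone_us": re.compile(r"\(?\d{3}\)?[ -]?\d{3}-\d{4}"),
}


def redact(text: str) -> str:
    """Replace every match with a token naming the PII type, not the value."""
    for name, pattern in SKETCH_PATTERNS.items():
        text = pattern.sub(f"[REDACTED-{name.upper()}]", text)
    return text


print(redact("Call (800) 555-0100 or mail jane@example.com"))
# Call [REDACTED-PHONE_US] or mail [REDACTED-EMAIL]
```

Keeping the type in the token preserves what an auditor needs (which kind of PII leaked) without copying the value itself into the log.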

What’s detected

Pattern	Examples
`email`	[email protected]
`phone_us`	(800) 555-0100, 212-555-0198
`phone_intl`	+44 7911 123456
`ssn`	123-45-6789
`credit_card`	4111 1111 1111 1111
`iban`	DE89370400440532013000
`ip_address`	192.168.1.1
`date_of_birth`	DOB: 12/05/1985
`passport`	AB1234567
`address`	123 Main Street
`eu_vat` (GDPR)	DE123456789
`bank_account` (CCPA)	12345678901234

Structured Output Validation

SchemaEvaluator validates that LLM outputs conform to a defined structure. It works with Pydantic models and JSON Schema dicts, and reports per-field failures rather than a bare valid/invalid verdict. StructEval (2025) found GPT-4 fails complex structured extraction ~12% of the time. This evaluator catches those failures in your specific pipeline.

Pydantic model

from pydantic import BaseModel
from multivon_eval import SchemaEvaluator

class InvoiceExtraction(BaseModel):
    vendor: str
    amount: float
    currency: str
    invoice_date: str
    line_items: list[str]

suite.add_evaluators(SchemaEvaluator(InvoiceExtraction))

Supports Pydantic v1 and v2. Field-level error messages:

Schema validation failed:
  amount: Input should be a valid number, unable to parse string as a number
  currency: Field required

JSON Schema

suite.add_evaluators(SchemaEvaluator({
    "type": "object",
    "required": ["title", "score", "category"],
    "properties": {
        "title": {"type": "string", "maxLength": 100},
        "score": {"type": "number", "minimum": 0, "maximum": 1},
        "category": {"type": "string", "enum": ["positive", "negative", "neutral"]},
    }
}))

Handling markdown code fences

SchemaEvaluator automatically strips markdown code fences from outputs:

```json
{"title": "Great product", "score": 0.9, "category": "positive"}
```

This is valid — the schema evaluator strips the fence before parsing.
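For intuition, fence stripping followed by JSON parsing can be sketched with the standard library as follows. `strip_code_fence` and `parse_output` are hypothetical helpers, and the regex is an assumption about what counts as a fence; SchemaEvaluator's own logic may differ.

```python
import json
import re

# Build the fence marker programmatically so this example can itself be
# placed inside a fenced block without confusing a renderer.
FENCE = "`" * 3

FENCE_RE = re.compile(
    rf"^\s*{FENCE}[A-Za-z0-9_-]*[ \t]*\n(.*?)\n?[ \t]*{FENCE}\s*$",
    re.DOTALL,
)


def strip_code_fence(text: str) -> str:
    """Return the body of a single surrounding fence, or the trimmed text."""
    match = FENCE_RE.match(text)
    return match.group(1) if match else text.strip()


def parse_output(text: str):
    """Parse model output as JSON, tolerating one surrounding code fence."""
    return json.loads(strip_code_fence(text))


fenced = (
    FENCE + "json\n"
    '{"title": "Great product", "score": 0.9, "category": "positive"}\n'
    + FENCE
)
print(parse_output(fenced)["category"])
```

Unfenced output passes through unchanged apart from whitespace trimming, so the same parser handles both styles of model response.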

Compliance Audit Trail

ComplianceReporter writes a hash-chained, tamper-evident NDJSON log of every eval run, with Article-level regulatory control annotations on each evaluator result.

Basic usage

from multivon_eval import EvalSuite, ComplianceReporter

suite = EvalSuite("HR Bot Eval")
reporter = ComplianceReporter(
    output_dir="./audit-logs",
    framework="eu-ai-act",
)

report = suite.run(model_fn)
record_id = reporter.record(report, tags={"version": "2.1", "env": "staging"})
# [compliance] summary record → a3f9b2c1  (hr_bot_eval.audit.ndjson)
# [compliance] framework: eu-ai-act

EU AI Act high-risk factory

For high-risk systems under Annex III, use the factory — it wires the standard measurable controls with calibrated thresholds:

from multivon_eval import EvalSuite, ComplianceReporter

suite = EvalSuite.eu_ai_act_high_risk(jurisdiction="gdpr")
suite.add_cases(cases)

reporter = ComplianceReporter("./audit-logs", framework="eu-ai-act")
report = suite.run(model_fn, runs=5)
reporter.record(report, tags={"system": "triage-bot"})

The factory wires: NotEmpty, Faithfulness, Hallucination, Relevance, Toxicity, Bias, PIIEvaluator (plus optional SchemaEvaluator if you pass schema=). Pair with runs=5 to surface flakiness — Art. 15(2) robustness isn’t a single-run claim.
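To illustrate what repeated runs surface, the sketch below flags cases whose outcome differs between runs. The input shape (one dict per run mapping a case id to pass/fail) and the `flaky_cases` helper are assumptions for illustration; they are not the report structure that multivon-eval returns.

```python
from collections import defaultdict


def flaky_cases(run_results):
    """Return {case_id: pass_fraction} for cases that both passed and failed.

    run_results is a list with one dict per run, mapping case id -> passed.
    """
    outcomes = defaultdict(list)
    for run in run_results:
        for case_id, passed in run.items():
            outcomes[case_id].append(bool(passed))

    flaky = {}
    for case_id, results in outcomes.items():
        passes = sum(results)
        if 0 < passes < len(results):  # neither always-pass nor always-fail
            flaky[case_id] = passes / len(results)
    return flaky


# Five hypothetical runs over three cases.
runs = [
    {"case-a": True, "case-b": True, "case-c": False},
    {"case-a": True, "case-b": False, "case-c": False},
    {"case-a": True, "case-b": True, "case-c": False},
    {"case-a": True, "case-b": False, "case-c": False},
    {"case-a": True, "case-b": True, "case-c": False},
]
print(flaky_cases(runs))
```

A single run would report case-b as simply passing or failing depending on luck; only the repeated runs show that its result is unstable.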

Coverage report

Before you run, check which Articles the suite actually exercises:

print(reporter.coverage(suite))

eu-ai-act coverage for suite 'EU AI Act High-Risk Eval'
───────────────────────────────────────────────────────
  [x] Art. 9(2)(b)   Foreseeable misuse risk identification
      covered by: toxicity
  [x] Art. 10(2)(f-g) Examination and mitigation of possible biases
      covered by: bias
  [x] Art. 10(5)     Processing of personal data
      covered by: pii_detection
  [x] Art. 15(1)     Accuracy
      covered by: faithfulness, hallucination, relevance
  [x] Art. 15(2)     Robustness
      covered by: not_empty

  Process controls (not satisfiable by evaluators alone):
      Art. 11        Technical documentation
      Art. 12        Record-keeping (satisfied by this reporter)
      Art. 13        Transparency and information to deployers
      Art. 14        Human oversight
      Art. 15(4-5)   Cybersecurity and resilience

  Coverage: 5/5 measurable controls exercised.

Process controls (Art. 11, 13, 14, 15(4-5)) require organizational measures outside of model evaluation; the reporter surfaces them so you don’t mistake “high coverage” for “fully compliant.”
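The coverage figure is essentially a lookup from evaluator names to controls. A minimal sketch, assuming the evaluator-to-Article pairs shown in the report above; the `coverage` and `summarize` helpers are hypothetical and not the reporter's implementation:

```python
# Evaluator -> control pairs as printed in the coverage report above.
CONTROL_BY_EVALUATOR = {
    "toxicity": "Art. 9(2)(b)",
    "bias": "Art. 10(2)(f-g)",
    "pii_detection": "Art. 10(5)",
    "faithfulness": "Art. 15(1)",
    "hallucination": "Art. 15(1)",
    "relevance": "Art. 15(1)",
    "not_empty": "Art. 15(2)",
}

MEASURABLE_CONTROLS = [
    "Art. 9(2)(b)",
    "Art. 10(2)(f-g)",
    "Art. 10(5)",
    "Art. 15(1)",
    "Art. 15(2)",
]


def coverage(evaluator_names):
    """Map each measurable control to the evaluators in the suite that hit it."""
    covered = {control: [] for control in MEASURABLE_CONTROLS}
    for name in evaluator_names:
        control = CONTROL_BY_EVALUATOR.get(name)
        if control is not None:  # unmapped evaluators contribute nothing
            covered[control].append(name)
    return covered


def summarize(covered):
    exercised = sum(1 for names in covered.values() if names)
    return f"Coverage: {exercised}/{len(covered)} measurable controls exercised."


partial = coverage(["faithfulness", "pii_detection"])
print(summarize(partial))
# Coverage: 2/5 measurable controls exercised.
```

Process controls never appear in this mapping, which is why they are listed separately rather than counted toward the ratio.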

Framework mappings

ComplianceReporter(framework="eu-ai-act")    # Article-level EU AI Act controls
ComplianceReporter(framework="nist-ai-rmf")  # NIST AI RMF subcategories
ComplianceReporter(framework="none")         # raw scores only

EU AI Act mappings (Regulation (EU) 2024/1689):

Evaluator	Control
`toxicity`	Art. 9(2)(b) — Foreseeable misuse risk identification
`bias`	Art. 10(2)(f-g) — Examination & mitigation of possible biases
`pii_detection`	Art. 10(5) — Processing of personal data
`faithfulness`, `hallucination`, `relevance`, `answer_accuracy`, `context_precision`, `context_recall`, `summarization`, `coherence`, `task_completion`, `tool_call_accuracy`, `plan_quality`, `step_faithfulness`, …	Art. 15(1) — Accuracy
`not_empty`, `schema_compliance`, `json_schema`, `self_consistency`, `turn_consistency`, `latency`, `max_latency`, `agent_memory`, …	Art. 15(2) — Robustness

NIST AI RMF subcategories: accuracy evaluators → MEASURE 2.3, robustness → MEASURE 2.5, toxicity → MEASURE 2.6, pii_detection → MEASURE 2.10, bias → MEASURE 2.11.

Verifying integrity (hash chain)

Each record stores prev_hash pointing at the previous record’s record_hash, forming a SHA-256 chain. verify() walks the chain end-to-end:

ok = reporter.verify("HR Bot Eval")
#   OK  a3f9b2c1  2026-05-13T09:23:11
#   OK  b7d1e4f2  2026-05-14T14:07:42
#   Verification: PASS — all records intact

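If a record is edited in place, verify() reports TAMPERED for that record. If a middle record is deleted, which per-record hashing alone would not detect, the next record reports CHAIN BROKEN because its prev_hash no longer matches the record before it.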
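The two failure modes can be reproduced with a small hash chain built on `hashlib`. This is a sketch under assumptions: the canonical form (sorted-key JSON over every field except record_hash) and the helpers `record_hash`, `append_record`, and `verify_chain` are illustrative, not the reporter's actual hashing scheme.

```python
import hashlib
import json

GENESIS = "0" * 64  # prev_hash of the first record


def record_hash(record: dict) -> str:
    """SHA-256 over a canonical JSON form of everything except record_hash."""
    body = {k: v for k, v in record.items() if k != "record_hash"}
    canonical = json.dumps(body, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def append_record(chain: list, payload: dict) -> None:
    """Link the new record to the previous one, then seal it with its hash."""
    prev = chain[-1]["record_hash"] if chain else GENESIS
    record = {**payload, "prev_hash": prev}
    record["record_hash"] = record_hash(record)
    chain.append(record)


def verify_chain(chain: list) -> list:
    """Return one status per record: OK, TAMPERED, or CHAIN BROKEN."""
    statuses = []
    expected_prev = GENESIS
    for record in chain:
        if record_hash(record) != record["record_hash"]:
            statuses.append("TAMPERED")      # contents no longer match the seal
        elif record["prev_hash"] != expected_prev:
            statuses.append("CHAIN BROKEN")  # predecessor missing or replaced
        else:
            statuses.append("OK")
        expected_prev = record["record_hash"]
    return statuses


chain = []
for rate in (0.92, 0.95, 0.90):
    append_record(chain, {"suite_name": "HR Bot Eval", "pass_rate": rate})

print(verify_chain(chain))                   # intact log
print(verify_chain([chain[0], chain[2]]))    # middle record deleted
```

Deleting the middle record leaves every remaining record internally consistent, so only the prev_hash link exposes the gap.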

Audit record format

Each NDJSON line:

{
  "record_id": "a3f9b2c1ef20",
  "suite_name": "HR Bot Eval",
  "model_id": "claude-sonnet-4-6",
  "timestamp": "2026-05-13T09:23:11.821Z",
  "framework": "eu-ai-act",
  "chain_version": 1,
  "prev_hash": "0000…0000",
  "summary": {
    "total": 50,
    "passed": 46,
    "pass_rate": 0.92,
    "tags": {"version": "2.1", "env": "staging"}
  },
  "evaluator_results": [
    {
      "evaluator": "faithfulness",
      "avg_score": 0.89,
      "pass_rate": 0.88,
      "controls": [
        {"id": "Art. 15(1)", "description": "Accuracy"}
      ]
    }
  ],
  "record_hash": "e3b0c44298fc1c149afb…"
}

controls is a list because some evaluators may map to multiple controls in future framework versions. prev_hash is "0" * 64 for the first record in the log (genesis).
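Since each line is a standalone JSON object, the log can be read with the standard library alone. The sketch below groups evaluator pass rates by control id, using only the field names shown in the record above; `iter_records` and `pass_rates_by_control` are hypothetical helpers.

```python
import json


def iter_records(lines):
    """Yield one dict per non-empty NDJSON line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def pass_rates_by_control(records):
    """Collect evaluator pass rates under each control id they map to."""
    rates = {}
    for record in records:
        for result in record["evaluator_results"]:
            for control in result["controls"]:
                rates.setdefault(control["id"], []).append(result["pass_rate"])
    return rates


# One sample line with the same shape as the record above (abbreviated).
sample_line = json.dumps({
    "record_id": "a3f9b2c1ef20",
    "suite_name": "HR Bot Eval",
    "evaluator_results": [
        {
            "evaluator": "faithfulness",
            "avg_score": 0.89,
            "pass_rate": 0.88,
            "controls": [{"id": "Art. 15(1)", "description": "Accuracy"}],
        }
    ],
})

print(pass_rates_by_control(iter_records([sample_line])))
```

`iter_records` accepts any iterable of lines, so an open file handle on an audit log in your output directory can be passed to it directly.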

Full compliance pipeline

from multivon_eval import (
    EvalSuite, EvalCase,
    Faithfulness, PIIEvaluator, SchemaEvaluator,
    ComplianceReporter,
)
from pydantic import BaseModel

class ClinicalSummary(BaseModel):
    diagnosis: str
    recommended_action: str
    urgency: str

suite = EvalSuite("Clinical AI Eval")
# load() is a placeholder for your own JSONL loader that returns a list of EvalCase
suite.add_cases(load("tests/clinical_cases.jsonl"))
suite.add_evaluators(
    Faithfulness(),
    PIIEvaluator(jurisdiction="gdpr", redact=True),
    SchemaEvaluator(ClinicalSummary),
)

reporter = ComplianceReporter("./audit-logs", framework="eu-ai-act")
report = suite.run(model_fn)
reporter.record(report, tags={"regulatory_period": "Q4-2025"})

# Fail CI if any case fails (PII detected, schema invalid, or faithfulness check failed)
if report.pass_rate < 1.0:
    raise SystemExit(f"Compliance check failed: {report.failed} case(s) failed")

CI/CD Integration

# .github/workflows/compliance-eval.yml
jobs:
  compliance:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pip install multivon-eval
      - run: python evals/compliance_check.py
        # No API key needed for PIIEvaluator + SchemaEvaluator
      - uses: actions/upload-artifact@v4
        with:
          name: audit-logs
          path: ./audit-logs/

The audit logs in ./audit-logs/ are the compliance artifacts — store them alongside your release artifacts.
