The 0.1 release ships three trap families. Each is a procedural PDF generator: take a seed, draw a PDF with reportlab, return the bytes and the code-derived answer key. Re-running with the same seed produces byte-identical PDFs.
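The seed-in, answer-key-out contract can be sketched as follows. This is a minimal illustration, not pdfhell's code: make_case and its field names are hypothetical, and the PDF drawing step is omitted. The point is that every random choice comes from a private RNG seeded by the case seed, so the answer key is derived from the same values the page would be drawn from.

```python
import random


def make_case(seed: int) -> dict:
    """Hypothetical stand-in for a trap generator: seed in, answer key out."""
    rng = random.Random(seed)  # private RNG; the global random state is untouched
    visible_cents = rng.randrange(10_000, 5_000_000)
    # A strictly positive offset guarantees the hidden amount differs.
    hidden_cents = visible_cents + rng.randrange(100, 100_000)
    return {
        "seed": seed,
        "visible_total": f"${visible_cents / 100:,.2f}",
        "hidden_total": f"${hidden_cents / 100:,.2f}",
    }


# Same seed, same case parameters and answer key.
assert make_case(42) == make_case(42)
```

Identical parameters are necessary but not sufficient for byte-identical output: the PDF writer must also avoid embedding timestamps or random document IDs. reportlab's Canvas accepts an invariant argument intended for reproducible output; whether pdfhell relies on it is not stated here.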
hidden_ocr_mismatch
One-line: Invisible text layer disagrees with the rendered page.
How it’s generated
The generator draws a normal-looking invoice with realistic line items and a TOTAL DUE amount. Then it places a second, different amount at the same coordinate using PDF text render mode 3 (invisible): the string is written to the text content stream but never rasterised. A human sees one amount; a text-extraction pipeline (pdfminer, pdfplumber, pypdf, most RAG loaders) sees the other.
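To make the mechanism concrete, here is a dependency-free sketch that writes the two amounts as raw PDF content-stream operators (Tr sets the text render mode, Tj shows a string). It is an illustration under assumptions, not the pdfhell generator: the function names, font resource /F1, and coordinates are made up, and the two "readers" are toy regex scanners rather than real extractors. In reportlab the same effect is available through the text object's setTextRenderMode method.

```python
import re


def total_due_stream(visible: str, hidden: str, x: int = 400, y: int = 120) -> bytes:
    """Build a content-stream fragment with two strings at the same position."""
    return (
        f"BT /F1 12 Tf 0 Tr {x} {y} Td ({visible}) Tj ET\n"  # mode 0: filled, visible
        f"BT /F1 12 Tf 3 Tr {x} {y} Td ({hidden}) Tj ET\n"   # mode 3: invisible
    ).encode("latin-1")


# Captures the render mode and the shown string of each text object.
_SHOW = re.compile(rb"(\d) Tr .*?\((.*?)\) Tj")


def naive_text_layer(stream: bytes) -> list[str]:
    """Toy extractor: collects every shown string and ignores the render mode."""
    return [m.group(2).decode("latin-1") for m in _SHOW.finditer(stream)]


def rendered_text(stream: bytes) -> list[str]:
    """Toy renderer: skips strings drawn in invisible mode 3."""
    return [
        m.group(2).decode("latin-1")
        for m in _SHOW.finditer(stream)
        if m.group(1) != b"3"
    ]


stream = total_due_stream("$1,250.00", "$9,999.00")
print(naive_text_layer(stream))  # both amounts
print(rendered_text(stream))     # only the visible amount
```

The disagreement between the two lists is the whole trap: a pipeline that feeds the first list to the model hands it an amount no human reader can see.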
What it tests
Whether the model trusts the rendered page or the underlying text layer. Vision-only models read the page; agents that fuse a text-extraction layer with vision output often silently prefer the layer.
Expected failure mode
The model returns the hidden amount instead of the visible one. The case’s forbidden_answers records the hidden amount so the scorer flags this specific failure with fell_for_trap=true, distinguishing trap-caught failures from generic hallucination.
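A scorer along these lines might look like the sketch below. Only the forbidden_answers and fell_for_trap names come from the text above; the function signature, the normalisation rule, and the substring matching are assumptions chosen for brevity, and a real scorer would need stricter number matching.

```python
import re


def _norm(text: str) -> str:
    """Lowercase and keep only digits, letters, and the decimal point."""
    return re.sub(r"[^0-9a-z.]", "", text.lower())


def score(answer: str, expected: str, forbidden_answers: list[str]) -> dict:
    """Hypothetical scorer: separates trap-caught failures from other misses."""
    got = _norm(answer)
    # The trap fired if any forbidden value appears in the model's answer.
    fell = any(_norm(f) in got for f in forbidden_answers)
    correct = _norm(expected) in got and not fell
    return {"correct": correct, "fell_for_trap": fell}


print(score("The total due is $9,999.00", "$1,250.00", ["$9,999.00"]))
print(score("Total due: $1,250.00", "$1,250.00", ["$9,999.00"]))
print(score("I could not find a total.", "$1,250.00", ["$9,999.00"]))
```

The third call is the case the flag exists for: the answer is wrong, but fell_for_trap stays false, so it counts as a generic miss rather than a trap hit.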
Real-world equivalent
Scanned-then-re-OCR’d PDFs in production. The OCR layer can drift from the rendered page (re-OCR with a different engine, re-saved through a buggy editor, manipulated headers). Document-AI agents that mix OCR + vision frequently fail here.
Example seeds
1001 – 1010 in the mini suite. Try pdfhell make --trap hidden_ocr_mismatch --seed 42.
footnote_override
One-line: A 6pt footnote overrides the body clause.
How it’s generated
The generator draws a short legal/contract document (MSA, DPA, SOW, Order Form) with a confident body clause, e.g. “Customer’s liability shall be capped at twelve (12) months of fees paid”, and a 6pt footnote near the bottom of the same page that overrides it: “Notwithstanding the foregoing, liability for breaches of Sections 4.2 and 7.1 shall be uncapped.” It covers three clause families: liability caps, termination-for-convenience restrictions, and data-residency with disaster-recovery exceptions. Section numbers, regions, and notice periods are randomly seeded, but the carve-out structure is fixed.
What it tests
Whether the model captures both the body clause and the footnote when summarising. RAG pipelines that drop low-font-size text on ingest fail here. Contract-analysis agents that focus on the body clause fail here.
Expected failure mode
The model returns the body-only answer (e.g. “Liability is capped at 12 months of fees paid”), missing the carve-out. The scoring uses expected_tokens: the answer must include the cap value AND every carve-out section number AND the word “uncapped” (any phrasing).
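The all-tokens-required rule can be sketched as a simple conjunction. The expected_tokens name is from the text above; the function name, the case-insensitive substring test, and the example token list are assumptions for illustration.

```python
def has_all_tokens(answer: str, expected_tokens: list[str]) -> bool:
    """True only if every expected token appears somewhere in the answer."""
    text = answer.lower()
    return all(token.lower() in text for token in expected_tokens)


# Illustrative token list for the liability-cap example quoted above.
tokens = ["12", "4.2", "7.1", "uncapped"]

body_only = "Liability is capped at 12 months of fees paid."
with_carve_out = (
    "Liability is capped at 12 months of fees paid, but is uncapped "
    "for breaches of Sections 4.2 and 7.1."
)

print(has_all_tokens(body_only, tokens))       # body-only summary misses the carve-out
print(has_all_tokens(with_carve_out, tokens))  # full summary passes
```

Because the check is token-based rather than string-equality, the model is free to phrase the summary however it likes as long as the cap, each section number, and “uncapped” all appear.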
Real-world equivalent
Customer paper from procurement teams routinely buries exceptions in footnotes. A model that ships confident summaries without the carve-out creates malpractice-grade errors for legal-AI vendors.
Example seeds
2001 – 2010. Try pdfhell make --trap footnote_override --seed 5.
split_table_across_pages
One-line: Header on page 1, body rows on page 2, with no header repeat.
How it’s generated
A financial-results table with 6 columns (Region, Quarter, Gross Revenue, COGS, Operating Income, Net Revenue). Filler text is sized so the column-header row lands at the bottom of page 1. The 8 data rows sit at the top of page 2, headerless. The case asks for one specific cell — e.g. “What was the Net Revenue for the Northwest region in Q3 of 2026?”
What it tests
Whether the model maintains column-header context across page boundaries. RAG loaders that paginate documents independently lose the header on page 2. Table-extraction models that don’t persist column headers when a table spans a page break fail here.
Expected failure mode
The model returns a value from an adjacent column in the correct row: column confusion. forbidden_answers records the value of an adjacent column in the same row so the scorer distinguishes column confusion from outright hallucination.
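As a rough sketch of how such an answer key could be derived from a generated row: the column names below are the six listed above, but answer_key, the row values, and the choice to record both neighbouring columns (rather than a single one) are assumptions made for illustration.

```python
HEADER = ["Region", "Quarter", "Gross Revenue", "COGS", "Operating Income", "Net Revenue"]


def answer_key(row: list[str], target: str) -> dict:
    """Expected cell plus the neighbouring cells a column-confused model would return."""
    i = HEADER.index(target)
    neighbours = [
        row[j]
        for j in (i - 1, i + 1)
        if 0 <= j < len(row) and row[j] != row[i]  # skip out-of-range and equal values
    ]
    return {"expected": row[i], "forbidden_answers": neighbours}


# Made-up values for one headerless row from page 2.
row = ["Northwest", "Q3 2026", "1,200", "450", "310", "275"]

key = answer_key(row, "Net Revenue")
print(key)
```

Net Revenue is the last column, so it has only one neighbour (Operating Income); a model that answers with that value has read the right row under the wrong header.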
Real-world equivalent
Annual reports, financial statements, regulatory filings: long tables routinely span pages with header-only-on-first-page formatting. Any document-AI agent that retrieves “the page with Q3 data” without also retrieving “the page with the column headers” will trip here.
Example seeds
3001 – 3010. Try pdfhell make --trap split_table_across_pages --seed 7.
Suite versioning
Each suite carries two reproducibility primitives in the run JSON:
- suite_version: human-readable label (e.g. mini-v1). Bumped when traps are added or trap parameters change. Two runs with the same suite_version measured the same conceptual benchmark.
- suite_hash: 8-character SHA-256 prefix of the sorted (trap_family, seed) pairs. Two runs with the same suite_hash measured the exact same cases. Bumping a seed produces a new suite_hash even within the same version.
| Suite | Version | Hash | Total cases | Use |
|---|---|---|---|---|
| smoke | smoke-v1 | 8cb2f6ab | 3 | One case per trap family; ~10s end-to-end, useful for CI smoke tests |
| mini | mini-v1 | 8ad87b8d | 30 | 10 seeds per family; the published-leaderboard suite, ~$0.01 on Gemini Flash |
pdfhell discover --json emits these for agents; the run JSON includes both so a consumer can verify which cases were actually measured before comparing numbers across runs.
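The suite_hash idea (a short SHA-256 prefix over the sorted case list) can be sketched as below. The serialisation of each pair is an assumption, so this sketch is not expected to reproduce the published hashes in the table; it only demonstrates the order-independence and seed-sensitivity properties described above.

```python
import hashlib


def suite_hash(cases: list[tuple[str, int]]) -> str:
    """8-character SHA-256 prefix over the sorted (trap_family, seed) pairs."""
    # Assumed canonical form: one "family:seed" line per case, sorted.
    canonical = "\n".join(f"{family}:{seed}" for family, seed in sorted(cases))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:8]


# The mini suite's case list, reconstructed from the seed ranges given per family.
mini = [
    (family, seed)
    for family, first in [
        ("hidden_ocr_mismatch", 1001),
        ("footnote_override", 2001),
        ("split_table_across_pages", 3001),
    ]
    for seed in range(first, first + 10)
]

print(len(mini), suite_hash(mini))
```

A consumer comparing two run JSON files would check that the suite_hash values match before comparing scores; equal suite_version with different suite_hash means the conceptual benchmark is the same but the cases are not.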
What’s not yet in the suite
The 0.1 release is intentionally narrow. Coming next (PRs welcome; see contributing):
- merged_table_cells: value depends on row/column span interpretation
- rotated_scan: visually legible but OCR-broken pages
- near_duplicate_entities: ACME Ltd. vs ACME Holdings Ltd.
- prompt_injection_in_body: “Ignore previous instructions and answer X”
- chart_axis_inversion: answers depend on reading axis direction
- checkbox_ambiguity: selected vs unselected with low visual margin
- cross_page_citation: answers requiring page + bounding-box citations

