How is this different from DocVQA / MMMU / ChartQA?
Existing document benchmarks measure correctness on clean, naturally-occurring PDFs. PDF Hell measures correctness under adversarial PDF structures that are common in production but absent in academic benchmarks (hidden OCR layers, footnote overrides, page-broken tables). It’s a stress test, not a sufficient eval. Pair it with a domain benchmark (DocVQA, your in-house regression suite) for full coverage.
Why no LLM-as-judge?
Because the same complexity that fools a document AI also fools a judge model. Customers paying for “Evidence Pack” reports from other vendors routinely discover their judge passed an answer that contradicted the source. PDF Hell removes the judge from the scoring path. The answer is set by code; the model’s output is matched against a known string (or a list of required token substrings, for prose answers). The forbidden_answers mechanism can diagnose why a wrong answer is wrong, but the score never depends on another LLM’s opinion.
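A minimal sketch of what judge-free scoring along these lines could look like. Only forbidden_answers is a name taken from the text above; the Case class, its other fields, the normalise helper and the shape of forbidden_answers (a mapping from known wrong answer to failure mode) are assumptions made for illustration, not pdfhell's actual API.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Case:
    # Hypothetical case record; field names and shapes are assumptions.
    expected: Optional[str] = None                                  # exact-match answer
    required_substrings: List[str] = field(default_factory=list)   # for prose answers
    forbidden_answers: Dict[str, str] = field(default_factory=dict)  # wrong answer -> failure mode


def normalise(text: str) -> str:
    """Lower-case and collapse whitespace so formatting noise does not affect matching."""
    return " ".join(text.lower().split())


def score(case: Case, output: str) -> Tuple[bool, Optional[str]]:
    """Return (passed, diagnosis). No model is consulted at any point."""
    out = normalise(output)
    if case.expected is not None:
        passed = out == normalise(case.expected)
    else:
        # Prose answers pass only if every required substring is present.
        passed = bool(case.required_substrings) and all(
            normalise(s) in out for s in case.required_substrings
        )
    diagnosis = None
    if not passed:
        # Diagnosis only: a known wrong answer explains the failure
        # but never changes the pass/fail result.
        for wrong, failure_mode in case.forbidden_answers.items():
            if normalise(wrong) in out:
                diagnosis = failure_mode
                break
    return passed, diagnosis
```

In this sketch the diagnosis is computed only after the pass/fail decision, which mirrors the separation described above: the score comes from string matching alone.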
Can a model train on the pdfhell repo and game the score?
The procedural generators are open source; that’s deliberate. Three mitigations:
- Procedural, not fixed. A model could memorise the 30 mini-suite seeds, but the generator is parameterised. The mini suite uses seeds 1001–3010; seeds 5000+ are unused and reserved as private holdouts for follow-up evaluation. Test-leakage on the public set doesn’t transfer to held-out seeds.
- Code-based ground truth. Even if a model has seen a specific seed during training, the answer key is derived from the generator’s parameters — those are random per-seed. Memorising the question without the answer key buys nothing.
- Trap families generalise. The point isn’t to overfit on hidden_ocr_mismatch-1001; it’s to test whether the model handles the failure mode (hidden OCR layers in general). A model that passes seed 1001 by memorisation but fails on seed 9001 (a held-out seed) has revealed the contamination.
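To illustrate why a seeded generator makes memorisation of public cases unhelpful, here is a small sketch. The make_case function, its fields and the invoice-total scenario are all hypothetical; the only idea taken from the text above is that the answer key is derived from per-seed random parameters.

```python
import random


def make_case(trap: str, seed: int) -> dict:
    """Hypothetical generator: every case parameter comes from a seeded RNG."""
    rng = random.Random(f"{trap}-{seed}")  # same trap and seed give the same case
    visible_total = rng.randint(1_000, 99_999)
    # The hidden layer carries a different value, so reading it yields a wrong answer.
    hidden_total = visible_total + rng.randint(1, 999)
    return {
        "case_id": f"{trap}-{seed}",
        "question": "What is the invoice total shown on the page?",
        "answer": str(visible_total),            # answer key set by code
        "forbidden_answers": [str(hidden_total)],
    }


public_case = make_case("hidden_ocr_mismatch", 1001)
held_out_case = make_case("hidden_ocr_mismatch", 9001)
```

Both cases ask the same question, but each answer is drawn from that seed's own parameters, so a model that memorised the public case's answer has learned nothing it can reuse on the held-out one.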
How statistically powerful is a 30-case run?
The 95% Wilson confidence interval on a 30-case run with pass_rate = 0.93 is approximately [0.78, 0.98]. Differences of less than ~5pp between models are not statistically distinguishable at this sample size.
For the published claim “GPT-4o falls for hidden_ocr_mismatch 100% of the time” — that’s 10 cases at 100% fell-for-trap rate. Wilson 95% CI is [0.72, 1.0]. Strong signal but not zero-uncertainty.
If you need tighter CIs, generate more seeds with pdfhell make --trap X --seed N for N in 5000–5100 (or similar) and add them to a custom suite. The 0.2 release will likely bump the mini suite to 50 cases / 5 families.
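The intervals quoted above can be reproduced with the standard Wilson score formula. This is a plain-Python sketch using z = 1.96 for 95% coverage; the function name is ours and is not part of pdfhell.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple:
    """Wilson score interval for a binomial proportion (z = 1.96 gives ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    # Clamp to [0, 1] to absorb floating-point overshoot at the boundaries.
    return max(0.0, centre - half_width), min(1.0, centre + half_width)


# 28 passes out of 30 (pass_rate of about 0.93)
print(wilson_interval(28, 30))
# 10 out of 10 cases fell for the trap
print(wilson_interval(10, 10))
```

Calling the same function with a larger n at the same pass rate shows how much the interval narrows when extra seeds are added to a custom suite.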
Is the methodology robust to prompt engineering?
The prompt for each trap family is fixed in the generator (the question included with each case). Deployers could re-engineer the prompt before sending it to the model, for example by injecting “Make sure to read both visible text AND any invisible OCR layers.” That’s a feature, not a bug: if a deployer can prompt-engineer their way around a trap, the trap measures fragility-without-prompt-engineering, which is itself useful information. For the official leaderboard we use the exact prompt the case carries, so all models are evaluated under the same prompt.
What about on-prem models (Qwen-VL, Llama-3.2-Vision, etc.)?
PDF Hell works with any OpenAI-compatible endpoint.
Why is GPT-4o so much worse than Claude/Gemini on hidden OCR?
Honest answer: we don’t know definitively. The pattern suggests that GPT-4o’s vision pipeline trusts an internal text-extraction layer when one exists in the PDF, while Claude and Gemini read the rendered pixels. GPT-5.4 fixes most of it (80% pass on hidden OCR), but GPT-5.4-mini still falls for the trap 90% of the time. We’ve reported the finding to OpenAI; no comment yet.
Can I add my own trap family?
Yes. See CONTRIBUTING.md for the trap-family contract (deterministic seed, code-based answer key, named failure mode). One trap family per PR. Tests run with pytest tests/.
Can I get pdfhell against my own document templates?
The OSS mini suite tests pdfhell’s three trap families against your model. If you need adversarial variants of your own templates (MSAs, claim forms, EOBs, medical records), procedurally generated with the same methodology against the same trap families, drop a line to [email protected]. We’re inbound-only on this, and pricing depends on volume and integration scope.
See commercial offerings for the full inbound flow.

