Quickstart - Multivon Docs

PDF Hell is an adversarial benchmark for AI document readers. Every test case is a PDF generated from code, so the correct answer is exactly known — no LLM judges another LLM.

Install

pip install pdfhell

The bare install pulls in multivon-eval (the engine), reportlab (PDF generation), pypdf, and the three frontier-provider SDKs (anthropic, openai, google-genai). No GPU is required, and there are no provider extras to remember. If you use uv, you can skip the install: uvx pdfhell <cmd> runs without any setup.

Set an API key

PDF Hell sends each PDF to a vision-capable model. Bring your own key for at least one of:

export ANTHROPIC_API_KEY=sk-ant-...
export OPENAI_API_KEY=sk-proj-...
export GOOGLE_API_KEY=AIza...
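Before spending a run on a missing key, it can help to check which providers are configured. The snippet below is a minimal sketch, not part of pdfhell: the helper name configured_providers is hypothetical, and the provider prefixes "anthropic" and "openai" are assumed by analogy with the google: prefix used in the run command.

```python
import os

# Environment variable expected for each provider.
# The "google" prefix appears in this guide; the other two prefixes are assumptions.
PROVIDER_KEYS = {
    "anthropic": "ANTHROPIC_API_KEY",
    "openai": "OPENAI_API_KEY",
    "google": "GOOGLE_API_KEY",
}


def configured_providers(env=None):
    """Return the providers whose API key variable is set and non-empty."""
    env = os.environ if env is None else env
    return [name for name, var in PROVIDER_KEYS.items() if env.get(var)]


if __name__ == "__main__":
    found = configured_providers()
    if found:
        print("keys found for:", ", ".join(found))
    else:
        print("no provider key set; export at least one of:",
              ", ".join(PROVIDER_KEYS.values()))
```

Passing a plain dict instead of os.environ makes the check easy to exercise in a unit test.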

Run the smoke suite (3 cases, ~10 seconds)

pdfhell run --model google:gemini-2.5-flash --suite smoke

You should see something like:

PDF Hell smoke suite — n=3
model: google:gemini-2.5-flash
pass: 3/3  (100.0%)
refused: 0.0%

per-trap pass rate:
  footnote_override               pass=100%  fell-for-trap=0%
  hidden_ocr_mismatch             pass=100%  fell-for-trap=0%
  split_table_across_pages        pass=100%  fell-for-trap=0%

The smoke suite runs one case per trap family. The mini suite has 30 cases (10 per family); run it with the same command and --suite mini.
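To make the per-trap columns concrete, here is a rough sketch of how such rates could be tallied. It is not pdfhell's implementation: the function name per_trap_rates, the outcome labels, and the toy results are all hypothetical, and it assumes each rate is simply a count divided by the number of cases in that family.

```python
from collections import defaultdict


def per_trap_rates(results):
    """Aggregate (trap_family, outcome) pairs into percentages per family.

    outcome is assumed to be one of "pass", "trap" (the model gave the
    forbidden answer), or "other" (wrong in some other way, or refused).
    """
    counts = defaultdict(lambda: {"n": 0, "pass": 0, "trap": 0})
    for family, outcome in results:
        bucket = counts[family]
        bucket["n"] += 1
        if outcome in ("pass", "trap"):
            bucket[outcome] += 1
    return {
        family: {
            "pass": 100.0 * c["pass"] / c["n"],
            "fell_for_trap": 100.0 * c["trap"] / c["n"],
        }
        for family, c in counts.items()
    }


# Toy outcomes for illustration only; these are not real benchmark results.
toy_results = [
    ("footnote_override", "pass"),
    ("footnote_override", "trap"),
    ("hidden_ocr_mismatch", "pass"),
    ("split_table_across_pages", "other"),
]

for family, rates in per_trap_rates(toy_results).items():
    print(f"  {family:<30}  pass={rates['pass']:.0f}%  "
          f"fell-for-trap={rates['fell_for_trap']:.0f}%")
```

Keeping "fell for the trap" separate from other failures shows whether a model was misled by the planted value or simply got the answer wrong.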

Try a specific trap

Generate one PDF and inspect it visually:

pdfhell make --trap hidden_ocr_mismatch --seed 42
open ./cases/hidden_ocr_mismatch-0042.pdf

The PDF looks like a normal invoice. The trap is invisible: a second copy of the total amount is written into the PDF’s text content stream using text rendering mode 3 (present in the text stream but never painted on the page). A vision model reads the visible amount; a text-extraction pipeline reads the hidden one. Every case JSON (cases/<case_id>.json) carries the expected answer, the forbidden answer (the value the trap is designed to elicit), and a description of the failure mode.
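The mechanism can be illustrated without any PDF library. The sketch below hand-writes a tiny page content stream in which the Tr operator switches between rendering mode 0 (filled, visible) and mode 3 (invisible), then compares an extractor that ignores the rendering mode with one that tracks it. This is a simplified illustration, not pdfhell's generator: the amounts are invented, the hidden value is assumed to differ from the visible one (as the trap name suggests), and the regex-based parser only handles this toy stream.

```python
import re

# Invented amounts for illustration; not taken from any real pdfhell case.
VISIBLE_TOTAL = "1,250.00"  # what a human or a vision model sees on the page
HIDDEN_TOTAL = "1,520.00"   # what the trap plants in the text stream


def build_content_stream(visible, hidden):
    """Return a page content stream with one visible and one invisible string."""
    return "\n".join([
        "BT",                        # begin text object
        "/F1 12 Tf",                 # select font F1 at 12 pt
        "0 Tr",                      # rendering mode 0: fill glyphs (visible)
        "72 700 Td",
        f"(Total: {visible}) Tj",
        "3 Tr",                      # rendering mode 3: no fill, no stroke (invisible)
        "0 -14 Td",
        f"(Total: {hidden}) Tj",
        "ET",                        # end text object
    ])


# Matches either a rendering-mode change ("<digit> Tr") or a shown string ("(...) Tj").
TOKEN = re.compile(r"(\d) Tr|\(([^)]*)\) Tj")


def extract_strings(stream, skip_invisible):
    """Collect shown strings, optionally dropping those drawn in mode 3."""
    mode = 0
    found = []
    for match in TOKEN.finditer(stream):
        if match.group(1) is not None:
            mode = int(match.group(1))       # track the current rendering mode
        elif not (skip_invisible and mode == 3):
            found.append(match.group(2))
    return found


stream = build_content_stream(VISIBLE_TOTAL, HIDDEN_TOTAL)
naive = extract_strings(stream, skip_invisible=False)    # sees both totals
painted = extract_strings(stream, skip_invisible=True)   # sees only the visible one
print("naive extraction:", naive)
print("painted text only:", painted)
```

A pipeline that trusts the raw text stream picks up the hidden total alongside the visible one, which is the kind of answer a case's forbidden value is meant to catch.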

What’s next

See the Trap families reference to understand what each trap tests.
Wire pdfhell into your CI with the CI integration guide.
Look up commands and flags in the full CLI reference.
The FAQ covers methodology questions (test leakage, prompt sensitivity, statistical power).