Multimodal evaluators score LLM outputs that are supposed to describe an image or a multi-page document. Same QAG philosophy as the LLM-judge tier, with one constraint: the judge must be vision-capable.
Experimental: first shipped in 0.7.3 (2026-05-16). These are the seed evaluators for the Document Agent Acceptance Protocol. Calibrated thresholds ship in a follow-up release; until then the standard calibration-fallback policy applies (default 0.7).

Vision-capable judges

| Provider | Models the evaluator will accept |
|---|---|
| anthropic | claude-haiku-4-5, claude-sonnet-4-6, claude-opus-4-7, claude-3-5-sonnet, claude-3-5-haiku, claude-3-opus |
| openai | gpt-4o, gpt-4o-mini, gpt-4.1, gpt-5, gpt-5-mini, gpt-5.5, gpt-5.5-mini |
| google | gemini-2.5-pro, gemini-2.5-flash, gemini-2.5-flash-lite, gemini-1.5-pro, gemini-1.5-flash |

google:gemini-2.5-flash for cost reasons. Wire a text-only judge into a vision evaluator and you’ll get a JudgeUnavailable with a hint to pick a vision model.
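
As a rough illustration of the check described above, the sketch below validates a provider:model string against the table. It is not the library's implementation: the JudgeUnavailable class here is a local stand-in, require_vision_judge is a hypothetical helper, and exact-name matching is an assumption (the real evaluator may match model names differently).

```python
# Hypothetical sketch of a vision-capability check; names are illustrative only.
VISION_MODELS = {
    "anthropic": {
        "claude-haiku-4-5", "claude-sonnet-4-6", "claude-opus-4-7",
        "claude-3-5-sonnet", "claude-3-5-haiku", "claude-3-opus",
    },
    "openai": {
        "gpt-4o", "gpt-4o-mini", "gpt-4.1", "gpt-5",
        "gpt-5-mini", "gpt-5.5", "gpt-5.5-mini",
    },
    "google": {
        "gemini-2.5-pro", "gemini-2.5-flash", "gemini-2.5-flash-lite",
        "gemini-1.5-pro", "gemini-1.5-flash",
    },
}


class JudgeUnavailable(RuntimeError):
    """Local stand-in for the exception the docs mention."""


def require_vision_judge(spec: str) -> tuple[str, str]:
    """Split 'provider:model' and reject models that are not in the table."""
    provider, sep, model = spec.partition(":")
    if not sep:
        raise ValueError(f"expected 'provider:model', got {spec!r}")
    # Assumption: exact-name matching against the table.
    if model not in VISION_MODELS.get(provider, set()):
        raise JudgeUnavailable(
            f"{spec!r} is not a vision-capable judge; pick a vision model "
            "such as 'google:gemini-2.5-flash'"
        )
    return provider, model
```

Running a check like this before a long evaluation starts surfaces a misconfigured judge immediately instead of partway through the run.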
Image input format
Both evaluators read images from case.metadata:
- case.metadata["image_url"]: single HTTP(S) URL or data: URI.
- case.metadata["image_path"]: single local filesystem path.
- case.metadata["images"]: list of any of the above (multi-page documents).

For DocumentGrounding, case.metadata["images"] is the expected key (one entry per page).
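
The three keys above can be sketched as plain dictionaries. This example only builds the metadata payloads; how they are attached to a case object is not shown because that API is not covered here. The to_data_uri helper, the example URLs, and the file paths are all hypothetical.

```python
import base64


def to_data_uri(image_bytes: bytes, mime: str = "image/png") -> str:
    """Encode raw image bytes as a data: URI suitable for the image_url key."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime};base64,{encoded}"


# Single remote image (placeholder URL).
single_url = {"image_url": "https://example.com/chart.png"}

# Single inline image. The bytes are just the PNG signature, used as a
# placeholder; a real case would pass the full file contents.
inline = {"image_url": to_data_uri(b"\x89PNG\r\n\x1a\n")}

# Single local file (placeholder path).
local = {"image_path": "scans/invoice.png"}

# Multi-page document: one entry per page, mixing paths and URLs.
multi_page = {
    "images": [
        "contract/page-1.png",
        "contract/page-2.png",
        "https://example.com/contract/page-3.png",
    ]
}
```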
VQAFaithfulness
Image-grounded faithfulness. The judge extracts up to 3 factual claims from the response, then verifies each claim against the image. Score is the fraction of claims confirmed. Use this when an LLM has produced text purporting to describe what’s in an image: chart reading, scan interpretation, image captioning, visual QA.

Signature
- threshold: explicit override. Falls through to the calibration-fallback policy when None (default 0.7).
- judge: vision-capable JudgeConfig. Falls back to configure(), then the environment, if omitted.
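
One way to read the threshold fallback is sketched below. This is an assumption about the policy, not the library's code: it supposes an explicit value always wins, a calibrated value (once those ship) comes next, and 0.7 is the last resort. The resolve_threshold function and its calibrated parameter are hypothetical.

```python
from typing import Optional

DEFAULT_THRESHOLD = 0.7  # the documented default


def resolve_threshold(
    explicit: Optional[float],
    calibrated: Optional[float] = None,
) -> float:
    """Pick the pass threshold: explicit, then calibrated, then the default."""
    # Compare against None rather than truthiness so an explicit 0.0 is honored.
    if explicit is not None:
        return explicit
    # Assumption: a calibrated value, when available, beats the default.
    if calibrated is not None:
        return calibrated
    return DEFAULT_THRESHOLD
```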
Result
EvalResult.reason includes a per-claim trace:
✗ lines tell you precisely which claims the vision judge found unsupported.
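
The scoring rule and trace can be approximated as follows. This is a minimal sketch, not the evaluator's code: ClaimVerdict and score_claims are hypothetical names, the ✓ marker and the exact trace layout are assumptions (only the ✗ marker is mentioned above), passing is assumed to mean score >= threshold, and raising on zero claims is a choice made for the sketch.

```python
from dataclasses import dataclass


@dataclass
class ClaimVerdict:
    claim: str
    supported: bool  # True if the vision judge confirmed the claim


def score_claims(
    verdicts: list[ClaimVerdict], threshold: float = 0.7
) -> tuple[float, bool, str]:
    """Return (score, passed, reason) for a list of per-claim verdicts."""
    if not verdicts:
        # How the real evaluator treats zero extracted claims is not documented here.
        raise ValueError("no claims to score")
    confirmed = sum(v.supported for v in verdicts)
    score = confirmed / len(verdicts)
    # One line per claim, marked with ✓ (confirmed) or ✗ (unsupported).
    reason = "\n".join(
        f"{'✓' if v.supported else '✗'} {v.claim}" for v in verdicts
    )
    return score, score >= threshold, reason


score, passed, reason = score_claims([
    ClaimVerdict("Revenue peaks in Q3", True),
    ClaimVerdict("The y-axis is in millions of USD", True),
    ClaimVerdict("Q1 revenue is 4.2M", False),
])
```

With two of three claims confirmed the score is 2/3, which falls below the 0.7 default under the assumption above.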
DocumentGrounding
Multi-page document-grounded faithfulness. Score is the fraction of three strict yes/no questions answered positively against the assembled document pages:
- Q1: Is every factual claim in the answer supported by content visible in at least one of the document pages?
- Q2: Does the answer avoid inventing any entity (name, date, number, amount, clause) that does not appear in the pages?
- Q3: Does the answer correctly handle the most important exception, caveat, or carve-out visible on the pages?
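
Because there are exactly three questions, the score can only take four values. The sketch below works through them, assuming a case passes when score >= threshold (that comparison is an assumption, and grounding_score is a hypothetical helper rather than part of the library).

```python
def grounding_score(q1: bool, q2: bool, q3: bool) -> float:
    """Fraction of the three yes/no questions answered positively."""
    return sum((q1, q2, q3)) / 3


DEFAULT_THRESHOLD = 0.7

# Enumerate the possible outcomes by number of questions answered "yes".
outcomes = {
    yes_count: grounding_score(*(i < yes_count for i in range(3)))
    for yes_count in range(4)
}

# Which outcomes clear the default threshold under the >= assumption.
passing = {n: s >= DEFAULT_THRESHOLD for n, s in outcomes.items()}
```

Under that assumption, two positive answers out of three (about 0.667) falls just short of 0.7, so only an answer that satisfies all three questions passes at the default threshold.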
Signature
Result
The reason field reports per-question outcomes:

How the multimodal evaluators relate to pdfhell
VQAFaithfulness and DocumentGrounding use an LLM judge with vision input. pdfhell uses code-derived ground truth against adversarial PDF structures — no judge in the scoring path.
Use them together:
- pdfhell to stress-test a document-AI pipeline against known failure modes (hidden OCR layers, footnote overrides, page-broken tables).
- DocumentGrounding to grade open-ended summaries / Q&A on real customer documents where the ground truth isn’t code-derivable.

