These are the prompts you’d actually type to a Claude Desktop / Cursor / Claude Code session that has multivon-mcp configured. The agent figures out which tool to call from the natural-language request; these recipes just show the shape of the resulting flow.
Recipe 1: evaluate a RAG output for hallucination
Use case: you just shipped a RAG endpoint and want to confirm the model didn’t invent facts not present in the retrieved context. Prompt the agent:
I just got this output from my RAG endpoint. Can you check it for hallucinations?
Question: “What is the auto-renewal policy?”
Retrieved context: “Section 12.4: This Agreement shall automatically renew for successive one-year terms unless either party gives 30 days written notice prior to the renewal date.”
Output: “The contract auto-renews annually with 30-day written notice, and includes automatic price increases capped at 5%.”
What happens:
- Agent calls eval_faithfulness(input, context, output), and optionally eval_hallucination(output, context) to confirm.
- JSON comes back: {"score": 0.667, "passed": false, "threshold": 0.9, "reason": "2/3 claims grounded ..."}.
- Agent reads the reason field, sees the unsupported claim (“price increases capped at 5%”), and tells you precisely which span was hallucinated.
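The numbers in that response are consistent with a simple grounded-claims ratio. A minimal sketch, assuming the score is the fraction of grounded claims and that passed means the score meets the threshold; the faithfulness_verdict helper is hypothetical and is not part of multivon-mcp:

```python
def faithfulness_verdict(grounded: int, total: int, threshold: float = 0.9) -> dict:
    """Hypothetical helper: score = grounded claims / total claims."""
    if total <= 0:
        raise ValueError("total must be positive")
    score = round(grounded / total, 3)
    # Assumption: a case passes when the score reaches the threshold.
    return {"score": score, "passed": score >= threshold, "threshold": threshold}


# Recipe 1: two of the three claims in the output are supported by the context.
verdict = faithfulness_verdict(grounded=2, total=3)
print(verdict)
```

With a 0.9 threshold, a single unsupported claim out of three is enough to fail the case, which matches the passed: false in the response.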
Recipe 2: score an agent tool call
Use case: you’re building an agent and want to verify it called the right tool with the right arguments on a known trace. Prompt the agent:
Score this trace. The expected tool was search_products with query="organic flour" and max_results=5. The actual call was search_products with query="organic flour" and max_results=10.
What happens:
- Agent calls eval_tool_call_accuracy(expected_tool="search_products", actual_tool="search_products", expected_arguments={"query": "organic flour", "max_results": 5}, actual_arguments={"query": "organic flour", "max_results": 10}).
- JSON comes back deterministically, with no LLM judge involved: eval_tool_call_accuracy is pure string comparison.
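To make "deterministic" concrete, here is a rough sketch of a judge-free comparison. It is not the actual implementation of eval_tool_call_accuracy: the compare_tool_call function and its return fields are hypothetical, and the real response may be shaped differently.

```python
import json


def _canon(arguments: dict, key: str):
    """Serialize one argument to a stable string, or None if the key is absent."""
    return json.dumps(arguments[key], sort_keys=True) if key in arguments else None


def compare_tool_call(expected_tool, actual_tool, expected_arguments, actual_arguments):
    """Hypothetical check: compare tool names and serialized arguments as strings."""
    tool_match = expected_tool == actual_tool
    keys = sorted(set(expected_arguments) | set(actual_arguments))
    mismatched = [
        k for k in keys
        if _canon(expected_arguments, k) != _canon(actual_arguments, k)
    ]
    return {
        "tool_match": tool_match,
        "mismatched_arguments": mismatched,
        "passed": tool_match and not mismatched,
    }


result = compare_tool_call(
    "search_products",
    "search_products",
    {"query": "organic flour", "max_results": 5},
    {"query": "organic flour", "max_results": 10},
)
print(result)
```

Because nothing here calls a model, the same trace always produces the same verdict, which is what makes this kind of check suitable for CI.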
Recipe 3: run a pdfhell benchmark + ship the audit pack
Use case: procurement wants evidence that your document-AI vendor’s model holds up against adversarial PDFs. Prompt the agent:
Run the pdfhell mini suite against anthropic:claude-sonnet-4-6, then package the result as a procurement-ready audit pack.
What happens:
- Agent calls pdfhell_run(model="anthropic:claude-sonnet-4-6", suite="mini"), which returns a full report dict with pass_rate, per-trap CIs, per-case details, and suite hash.
- Agent inspects the report, notes any trap family below threshold (e.g. footnote_override: 0.7), and surfaces the specific case IDs that failed.
- If the run JSON was written to disk, the agent calls eval_audit_pack(run_json_path, cases_dir, output_zip_path) to build the hash-chained ZIP. The manifest in the response confirms suite hash + file count.
Follow up by asking the agent to show the failing footnote_override cases (model output vs expected), so you can see whether the model dropped the carve-out clause specifically.
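The "hash-chained" property of the audit pack can be illustrated with a small sketch. The actual manifest layout produced by eval_audit_pack is not described here, so build_chain, verify_chain, the entry fields, and the file names below are all assumptions chosen for illustration:

```python
import hashlib

GENESIS = "0" * 64  # assumed starting value for the chain


def build_chain(files: dict[str, bytes]) -> list[dict]:
    """Hypothetical manifest: each link commits to the file and to the previous link."""
    chain, prev = [], GENESIS
    for name in sorted(files):
        file_hash = hashlib.sha256(files[name]).hexdigest()
        link = hashlib.sha256(f"{prev}:{name}:{file_hash}".encode()).hexdigest()
        chain.append({"name": name, "sha256": file_hash, "link": link})
        prev = link
    return chain


def verify_chain(files: dict[str, bytes], chain: list[dict]) -> bool:
    """Recompute the chain from the files and compare it with the stored manifest."""
    return build_chain(files) == chain


# Illustrative file names only; a real pack would contain the run JSON and case files.
files = {"cases/case_001.pdf": b"%PDF-1.7 example", "run.json": b'{"pass_rate": 0.9}'}
manifest = build_chain(files)

tampered = dict(files)
tampered["cases/case_001.pdf"] = b"%PDF-1.7 edited"
print(verify_chain(files, manifest), verify_chain(tampered, manifest))
```

Since every link includes the previous one, editing an early file changes all later links, so a reviewer only needs to trust the final link to detect tampering anywhere in the pack.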
Recipe 4: did my fix actually help? (multivon-mcp 0.3.0)
Use case: you just refactored a RAG pipeline and want to know whether the new version regressed on any case the old version got right. Prompt the agent:
Compare runs/before.json against runs/after.json and tell me what got worse.
What happens:
- Agent calls eval_compare_runs(baseline_json_path="runs/before.json", new_json_path="runs/after.json").
- Response includes pass_rate_delta, per-case regressions (cases that passed before but fail now), improvements, and a McNemar p-value over paired cases.
- Agent flags the regressions case-by-case so you can decide whether to ship.
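The fields in that response can be reproduced by hand on a toy example. The sketch below uses the exact (binomial) form of McNemar's test; which variant eval_compare_runs uses is not stated here, and the run-file schema is not shown, so the inputs are simplified to hypothetical dicts mapping case ID to pass/fail:

```python
from math import comb


def compare_runs(baseline: dict[str, bool], new: dict[str, bool]) -> dict:
    """Hypothetical paired comparison; inputs map case ID -> passed."""
    paired = sorted(baseline.keys() & new.keys())
    regressions = [cid for cid in paired if baseline[cid] and not new[cid]]
    improvements = [cid for cid in paired if not baseline[cid] and new[cid]]

    # Exact McNemar: under the null, each discordant pair is a fair coin flip.
    b, c = len(regressions), len(improvements)
    n = b + c
    if n == 0:
        p_value = 1.0
    else:
        tail = sum(comb(n, k) for k in range(min(b, c) + 1)) / 2**n
        p_value = min(1.0, 2 * tail)

    if paired:
        passed_new = sum(new[cid] for cid in paired)
        passed_old = sum(baseline[cid] for cid in paired)
        delta = (passed_new - passed_old) / len(paired)
    else:
        delta = 0.0

    return {
        "pass_rate_delta": delta,
        "regressions": regressions,
        "improvements": improvements,
        "mcnemar_p_value": p_value,
    }


before = {"c1": True, "c2": True, "c3": False, "c4": False, "c5": True}
after = {"c1": True, "c2": False, "c3": True, "c4": True, "c5": True}
report = compare_runs(before, after)
print(report)
```

In this toy run the pass rate rises, yet c2 is a regression and three discordant pairs are far too few for significance, which is why the per-case lists matter more than the headline delta on small suites.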
Recipe 5: synthesise an eval suite from your docs (multivon-mcp 0.3.0)
Use case: you want to bootstrap an eval suite from a FAQ or product doc and don’t want to handwrite cases. Prompt the agent:
Generate 20 RAG cases from docs/faq.md so I can score my pipeline against them.
What happens:
- Agent reads the doc.
- Calls eval_generate_cases(from_text=<file content>, n=20, task="qa").
- Gets back a list of {input, expected_output, context} dicts; drop them straight into EvalCase(...) and you have a runnable suite.
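Turning the returned dicts into cases can be as short as a keyword-argument unpack. In the sketch below, EvalCase is a stand-in dataclass defined locally so the example runs on its own; the real EvalCase from the SDK may have different fields or types (for example, context might be a list), and the sample case is made up from the contract text in Recipe 1:

```python
from dataclasses import dataclass


@dataclass
class EvalCase:
    """Stand-in for the SDK's EvalCase; field names follow the dict keys above."""
    input: str
    expected_output: str
    context: str


# Illustrative shape of what eval_generate_cases might return (one case shown).
generated = [
    {
        "input": "What is the auto-renewal policy?",
        "expected_output": "The agreement renews for successive one-year terms "
                           "unless either party gives 30 days written notice.",
        "context": "Section 12.4: This Agreement shall automatically renew for "
                   "successive one-year terms unless either party gives 30 days "
                   "written notice prior to the renewal date.",
    },
]

# Unpack each dict into a case; a missing or unexpected key raises TypeError early.
suite = [EvalCase(**case) for case in generated]
print(len(suite), suite[0].input)
```

Unpacking with ** fails fast when a generated dict has an unexpected shape, which is a cheap sanity check before running a 20-case suite.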
Recipe 6: score the trace your agent just executed (multivon-mcp 0.3.0)
Use case: the agent ran a multi-step task and wants to evaluate its own trajectory. Prompt the agent:
Score the trace from the run you just executed.
What happens:
- Agent dumps its trace as a JSON dict (supported shapes: LangGraph, OpenAI Agents SDK, or canonical manual format).
- Calls eval_ingest_trace(trace_json=<dict>, framework="langgraph").
- Receives an EvalCase with input + steps + tool_calls. Can then pipe through eval_tool_call_accuracy / eval_step_faithfulness for the actual scoring.
Discovering tools mid-session
If the agent isn’t sure which evaluator to use, it can call eval_discover to dump the full catalog. The response includes every shipped evaluator, its tier (deterministic / llm_judge_qag / agent_trace / compliance / safety / multimodal / consistency), the import path for direct SDK use, and which (evaluator, judge_model) pairs have shipped calibration data.
For agents that don’t speak MCP, the same catalog is available via the CLI: multivon-eval discover --json and pdfhell discover --json.
This is what makes multivon-mcp self-describing — the agent never has to read documentation to know what’s available.
