Most LLM evals stop at the final string. Agent evals can’t: the same output can come from a clean three-step plan or a thirty-call death spiral, and you want to fail the death spiral.
multivon-eval ships a framework-agnostic trace model, three tracer
adapters, and eight evaluators that score the trajectory itself.
This page covers the data model, the tracers, the evaluators, and a runnable
LangGraph example.
The trace data model
Two dataclasses, defined in multivon_eval/case.py, describe a trace: a
list[AgentStep], attached to an EvalCase as agent_trace.
The semantic unit is one AgentStep per LLM/agent turn, not per framework
event. A ReAct loop that thinks, decides to call three tools in parallel, and then
incorporates the results produces ONE step with three ToolCall entries — not three
steps. This choice is enforced by every shipped tracer and is what the evaluators
assume.
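The "one step per LLM turn" rule can be illustrated with a small stand-in model. This is a sketch, not the definitions in multivon_eval/case.py: only ToolCall.arguments and ToolCall.result are named on this page, so the other fields (name, thought, tool_calls) are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class ToolCall:
    name: str  # assumed field: which tool was invoked
    arguments: dict[str, Any] = field(default_factory=dict)
    result: Any = None


@dataclass
class AgentStep:
    thought: str = ""  # assumed field: the model's reasoning for this turn
    tool_calls: list[ToolCall] = field(default_factory=list)


# One LLM turn that fans out to three tools in parallel is ONE step
# holding three ToolCall entries, not three separate steps.
trace: list[AgentStep] = [
    AgentStep(
        thought="Need the order, the customer, and the policy before deciding.",
        tool_calls=[
            ToolCall("lookup_order", {"order_id": "A1"}, {"status": "shipped"}),
            ToolCall("lookup_customer", {"customer_id": "C9"}, {"tier": "gold"}),
            ToolCall("get_refund_policy", {}, {"window_days": 30}),
        ],
    ),
    AgentStep(thought="Order is inside the window; answer the user."),
]

# Flatten the trace into the ordered list of tool names an evaluator would see.
called = [call.name for step in trace for call in step.tool_calls]
```

Flattening per step keeps the call order stable while still letting a per-step evaluator see which turn made which decision.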
EvalCase.expected_tool_calls is the assertion surface. ToolCall.arguments and
ToolCall.result are captured automatically by the tracers — you don’t write
them by hand unless you’re using ManualTracer.
Tracer adapters
A tracer wraps your model_fn so the suite gets back not just a string but a
list[AgentStep]. All three implement the AgentTracer ABC from
multivon_eval/integrations/base.py.
- LangGraphTracer: callback-based tracer for LangGraph compiled graphs. Step boundary = each on_llm_start, so tools fire into the step that decided to call them.
- OpenAIAgentsTracer: two modes. Post-hoc capture(result) parses RunResult.new_items; live run_hooks() uses per-run isolated buffers.
- ManualTracer: for custom agents and frameworks without integration. You call tracer.step(...) inside your agent code.
LangGraphTracer
A CallbackTracer subclass: it builds a BaseCallbackHandler and injects it via
config={"callbacks": [...]} on the compiled graph’s invoke. Implementation in
integrations/langgraph.py.
The tracer uses LangGraph run metadata (langgraph_node, langgraph_checkpoint_ns,
graph:step:N tags) to attribute calls to the right node. The tools node of a
ReAct graph collapses into the preceding LLM turn, which is the semantic unit
evaluators score against.
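The step-boundary rule (a new step on each on_llm_start, with tools attached to the step that requested them) can be sketched without LangGraph at all. The event stream below is a simplified stand-in for callback events, and Step and group_events are hypothetical names, not part of the tracer's code.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    tool_calls: list[str] = field(default_factory=list)


def group_events(events: list[tuple[str, str]]) -> list[Step]:
    """Fold a flat (event_kind, name) stream into steps.

    Each "llm_start" opens a new step; each "tool_start" is attached to the
    most recent step, i.e. the LLM turn that decided to call the tool.
    """
    steps: list[Step] = []
    for kind, name in events:
        if kind == "llm_start":
            steps.append(Step())
        elif kind == "tool_start":
            if not steps:  # a tool fired before any LLM turn: open a step for it
                steps.append(Step())
            steps[-1].tool_calls.append(name)
    return steps


events = [
    ("llm_start", "agent"),
    ("tool_start", "search"),
    ("tool_start", "fetch_page"),
    ("llm_start", "agent"),
    ("tool_start", "summarize"),
    ("llm_start", "agent"),  # final answer, no tools
]
steps = group_events(events)
```

Here the tools node never produces a step of its own, which mirrors how a ReAct graph's tool execution collapses into the preceding LLM turn.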
Install with pip install 'multivon-eval[langgraph]'. Compatible with LangGraph
≥ 0.2, verified through 0.5+.
OpenAIAgentsTracer
For the OpenAI Agents SDK. Two integration paths, both shipped from integrations/openai_agents.py.
Post-hoc (default, recommended): the tracer reads RunResult.new_items after
the run completes. No global state, no thread-safety concerns.
Live: each run_hooks() call returns a RunHooksBase with its own buffer, so concurrent
runs do not interleave. merge is idempotent: the second call is a no-op because
the first clears the buffer.
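The two properties described above, per-run isolation and an idempotent merge, can be shown with a plain buffer class. RunBuffer and its methods are hypothetical stand-ins; the actual hook objects and the signature of merge are not documented on this page.

```python
class RunBuffer:
    """Hypothetical per-run buffer: one instance per run, never shared."""

    def __init__(self) -> None:
        self._pending: list[str] = []

    def record(self, item: str) -> None:
        self._pending.append(item)

    def merge(self, trace: list[str]) -> int:
        """Move buffered items into `trace`, then clear the buffer.

        Returns the number of items moved, so a repeated call returns 0.
        """
        moved = len(self._pending)
        trace.extend(self._pending)
        self._pending.clear()
        return moved


# Two concurrent runs each get their own buffer, so records cannot interleave.
run_a, run_b = RunBuffer(), RunBuffer()
run_a.record("lookup_order")
run_b.record("search_docs")
run_a.record("issue_refund")

trace_a: list[str] = []
first = run_a.merge(trace_a)   # moves both of run A's items
second = run_a.merge(trace_a)  # buffer already cleared: nothing is duplicated

trace_b: list[str] = []
run_b.merge(trace_b)
```

Clearing on merge is what makes a second call safe: callers do not need to track whether a run has already been flushed.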
Install with pip install 'multivon-eval[openai-agents]'. Items the SDK ships but
the tracer doesn’t fully model yet (CompactionItem, ToolApprovalItem, MCP /
ComputerCall / CodeInterpreter / ToolSearch items) are preserved as visible
[ItemClassName] markers in the trace rather than silently dropped.
ManualTracer
Your fallback for any agent that isn’t LangGraph or the OpenAI SDK. You record steps explicitly. Source in integrations/manual.py.
step() returns a _StepRecorder context manager that flushes on exit.
record_tool_call and record_output are also available without a step block —
they create implicit single-call steps. ManualTracer is what powers most custom
agents in production: it’s the lowest-friction adapter and has zero framework
dependencies.
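A context manager that flushes on exit is a small pattern, sketched below with stand-in classes. The method names step, record_tool_call, and record_output come from this page, but their parameters, the label argument, and the dictionary layout are assumptions; SketchTracer is not ManualTracer.

```python
from typing import Any


class StepRecorder:
    """Collects one step's data and appends it to the sink on exit."""

    def __init__(self, sink: list[dict[str, Any]], label: str) -> None:
        self._sink = sink
        self._step: dict[str, Any] = {"label": label, "tool_calls": [], "output": None}

    def record_tool_call(self, name: str, arguments: dict[str, Any], result: Any) -> None:
        self._step["tool_calls"].append(
            {"name": name, "arguments": arguments, "result": result}
        )

    def record_output(self, text: str) -> None:
        self._step["output"] = text

    def __enter__(self) -> "StepRecorder":
        return self

    def __exit__(self, exc_type, exc, tb) -> bool:
        # Flush even if the block raised, so a failed step is still visible.
        self._sink.append(self._step)
        return False  # never swallow the exception


class SketchTracer:
    def __init__(self) -> None:
        self.steps: list[dict[str, Any]] = []

    def step(self, label: str = "") -> StepRecorder:
        return StepRecorder(self.steps, label)


tracer = SketchTracer()
with tracer.step("plan") as s:
    s.record_tool_call("lookup_order", {"order_id": "A1"}, {"status": "shipped"})
with tracer.step("answer") as s:
    s.record_output("Your order has shipped.")
```

Flushing in __exit__ rather than in each record call means a step appears in the trace exactly once, in the order the blocks finished.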
The eight agent-trace evaluators
All eight live in multivon_eval/evaluators/agent.py.
One is deterministic; seven use an LLM judge (the QAG eval pattern documented in
Evaluators). Every evaluator returns a skipped pass, not a
0.0, when the case shape doesn’t fit it (e.g. no agent_trace, no
expected_tool_calls). This was hardened in 0.9.0.
| Evaluator | Deterministic? | Requires |
|---|---|---|
| ToolCallAccuracy | yes | expected_tool_calls (and agent_trace when non-empty) |
| ToolArgumentAccuracy | no (judge) | agent_trace |
| PlanQuality | no (judge) | agent_trace |
| TaskCompletion | no (judge) | output (and agent_trace if available) |
| StepFaithfulness | no (judge) | agent_trace |
| ToolCallNecessity | no (judge) | agent_trace |
| TrajectoryEfficiency | no (judge) | agent_trace |
| AgentMemoryEval | no (judge) | context (prior session) + current input |
ToolCallAccuracy and its three input shapes
The most-used evaluator on the list. The shape of expected_tool_calls triggers
three different behaviors, formalized in 0.9.0:
None: skip
expected_tool_calls=None returns a skipped pass with the reason
“Requires case.expected_tool_calls — set it (or [] to assert no tools) to enable
ToolCallAccuracy.” You haven’t asserted anything, so the evaluator doesn’t
score.
Empty list: assert no tools
expected_tool_calls=[] says “the agent must NOT call any tools.” If the
trace has zero calls → score 1.0 (“Correctly called no tools”). If any tool
fires → score 0.0 with the unexpected calls listed. This is the trivial-question
assertion: “for what’s 2 + 2, don’t reach for a calculator.”
Populated list: real expectation
expected_tool_calls=["lookup_order", "issue_refund"] is the standard case.
Requires agent_trace; skips if missing. Default scoring is
fraction of expected tools called. Two modifiers tune the strictness:
- require_order=True: positional match instead of set match. Use when call order matters (auth before query, validate before commit).
- penalize_unexpected=True: score = matched / (expected ∪ unexpected), so every extra tool drags the score down. Use for negative cases like “the agent must NOT call refund_order on an already-refunded order” where extra calls are the failure mode you care about.
The judge-driven seven
The remaining seven are sketched here; for the judge resolution logic, see Calibration and Statistical rigor.
- ToolArgumentAccuracy: a per-call yes/no judge prompt (“Are these arguments appropriate and well-formed for <tool> given the task?”), capped at the first 8 calls to bound judge cost.
- PlanQuality: 5-question QAG over the full trace (addresses task, logical order, no redundancy, follows from prior steps, expert-efficient).
- TaskCompletion: 4-question QAG. Includes a negated check (“Did the agent fail / error?”) so an error trace can’t slip through with a confident partial-credit score.
- StepFaithfulness: per-step judge asking “Does this step follow logically from the task and prior steps, without introducing contradictions or hallucinated information?” Catches reasoning drift that other evaluators miss.
- ToolCallNecessity: per-call judge asking “Is this call strictly necessary, or redundant given prior calls?” Detects the “call tools just in case” failure mode. An empty trace is a PASS (nothing to flag as redundant), not a skip, which distinguishes “no expectation set” from “agent correctly did nothing.”
- TrajectoryEfficiency: 3-question QAG plus an error-recovery check: if any tool result contains error, an extra judge call asks whether the agent recovered well, and the score is penalized by 0.2 if not.
- AgentMemoryEval: multi-session memory check. Requires case.context (prior session summary) and scores correct recall, no hallucination, no use of stale info. Aligned with AMA-Bench (2025).
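To make the QAG-plus-penalty shape of TrajectoryEfficiency concrete, here is a sketch with a stubbed judge. The question wording, the yes-fraction aggregation, the plain substring test for "error", and the clamp at 0.0 are assumptions for illustration; only the three-question count, the extra recovery call, and the 0.2 penalty come from the description above.

```python
from typing import Callable

# Hypothetical judge interface: takes a yes/no question, returns the verdict.
Judge = Callable[[str], bool]

# Placeholder questions; the evaluator's real prompts are not shown on this page.
QUESTIONS = [
    "Did the agent avoid repeating the same call?",
    "Did every step move the task forward?",
    "Was the number of steps reasonable for the task?",
]
RECOVERY_QUESTION = "Did the agent recover well from the tool error?"


def qag_score(questions: list[str], judge: Judge) -> float:
    """Fraction of yes/no questions the judge answers with yes."""
    answers = [judge(question) for question in questions]
    return sum(answers) / len(answers)


def trajectory_efficiency(tool_results: list[str], judge: Judge) -> float:
    score = qag_score(QUESTIONS, judge)
    # The extra judge call is made only when some tool result mentions an error.
    if any("error" in result for result in tool_results):
        if not judge(RECOVERY_QUESTION):
            score = max(0.0, score - 0.2)
    return score


def stub_judge(question: str) -> bool:
    # Deterministic stand-in for an LLM judge: yes to everything but recovery.
    return question != RECOVERY_QUESTION


clean = trajectory_efficiency(["status: shipped"], stub_judge)
unrecovered = trajectory_efficiency(["error: timeout"], stub_judge)
```

Gating the recovery question on the presence of an error keeps judge cost flat for clean runs while still separating agents that retry sensibly from those that do not.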
Runnable example: LangGraph + ToolCallAccuracy + TrajectoryEfficiency
End-to-end: a tiny LangGraph agent, two cases (one positive, one negative), and a ManualTracer fallback for an adversarial case that bypasses LangGraph. The example lives in agent_eval.py.
The adversarial case is recorded with ManualTracer. The same ToolCallAccuracy and TrajectoryEfficiency instances score both traces;
the evaluators don’t care which tracer captured the trace.
Inspecting and debugging traces
AgentTracer.format_trace() and tracer.print_trace() pretty-print a captured
trace. Use them inside a notebook when a case fails and you want to see what the
agent actually did.
CaseResult.agent_trace was added in 0.7.0 so notebooks can iterate steps from the
report without reaching back into the suite.
Importing existing traces
If your traces already live in LangSmith, Langfuse, Phoenix, Datadog, or any other observability store, you don’t have to re-run the agent. AgentTracer’s sibling abstraction CaseImporter (defined in the same
integrations/base.py) pulls runs as EvalCase objects with agent_trace
populated, and importer.as_model_fn(cases) gives you a passthrough that
replays the stored outputs in order.
load_traces accepts field aliases for the common platforms: LangSmith’s
query/answer/retrieved_context, Langfuse’s prompt/completion,
Phoenix’s input/output all auto-rename to the canonical shape. See
Bootstrap workflow for the end-to-end import → score → ship
loop.
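Field aliasing and passthrough replay are both simple to picture. The sketch below is not load_traces or CaseImporter: the alias table, the canonical names input/output/context, and the helper names normalize and replay_model_fn are assumptions chosen to match the field names mentioned on this page.

```python
from typing import Any, Callable

# Assumed alias table mapping platform-specific keys to one canonical shape.
ALIASES = {
    "query": "input",
    "prompt": "input",
    "answer": "output",
    "completion": "output",
    "retrieved_context": "context",
}


def normalize(record: dict[str, Any]) -> dict[str, Any]:
    """Rename known aliases; keys that are already canonical pass through."""
    return {ALIASES.get(key, key): value for key, value in record.items()}


def replay_model_fn(cases: list[dict[str, Any]]) -> Callable[[str], Any]:
    """Return a passthrough that replays stored outputs in order.

    The input argument is ignored: nothing is re-run, the stored output
    for the next case is simply handed back.
    """
    outputs = iter([case["output"] for case in cases])

    def model_fn(_input: str) -> Any:
        return next(outputs)

    return model_fn


raw = [
    {"query": "Where is order A1?", "answer": "It shipped.",
     "retrieved_context": ["order A1: shipped"]},
    {"prompt": "Refund order B2", "completion": "Refund issued."},
]
cases = [normalize(record) for record in raw]
model_fn = replay_model_fn(cases)
replayed = [model_fn(case["input"]) for case in cases]
```

Because the replay is order-based, the cases must be scored in the same order they were imported; a keyed lookup would relax that at the cost of requiring unique inputs.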
Related
- Evaluators: full evaluator catalog and the QAG eval pattern
- Bootstrap workflow: generate a tuned agent suite from product + traces
- Statistical rigor: Wilson + bootstrap CIs on agent pass rates
- Compliance: hash-chained audit logs over agent runs
- multivon_eval/integrations/: tracer source
- multivon_eval/evaluators/agent.py: evaluator source

