Agent evaluators work with agent_trace, a structured record of what your agent did. They are framework-agnostic and work with LangChain, LlamaIndex, CrewAI, or any custom agent.
Setting up an agent trace
from multivon_eval import EvalCase, AgentStep, ToolCall
case = EvalCase(
input="Search for recent AI papers and write a summary",
agent_trace=[
AgentStep(
thought="I need to search for recent papers first",
tool_calls=[
ToolCall(
name="search",
arguments={"query": "AI papers 2025"},
result=["Paper A", "Paper B", "Paper C"],
)
],
),
AgentStep(
thought="Now I'll summarize what I found",
tool_calls=[ToolCall(name="summarize")],
output="Here are the key AI papers from 2025...",
),
],
expected_tool_calls=["search", "summarize"],
)
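Because a trace is just a list of steps holding tool calls, output from most frameworks can be mapped onto it. The following is a minimal sketch, assuming your framework emits a list of dict events. The `Step` and `Call` dataclasses are hypothetical stand-ins that mirror the `AgentStep` and `ToolCall` fields used above so the snippet runs without the library installed, and the event keys (`thought`, `tools`, `args`, `result`, `output`) are assumptions about your own logging format.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class Call:
    """Hypothetical stand-in for ToolCall (same field names as used above)."""
    name: str
    arguments: dict = field(default_factory=dict)
    result: Any = None


@dataclass
class Step:
    """Hypothetical stand-in for AgentStep (same field names as used above)."""
    thought: Optional[str] = None
    tool_calls: list = field(default_factory=list)
    output: Optional[str] = None


def events_to_steps(events):
    """Map raw framework events (assumed dict shape) onto trace steps."""
    steps = []
    for event in events:
        calls = [
            Call(name=t["name"], arguments=t.get("args", {}), result=t.get("result"))
            for t in event.get("tools", [])
        ]
        steps.append(
            Step(thought=event.get("thought"), tool_calls=calls, output=event.get("output"))
        )
    return steps


# Example events in the assumed shape
raw_events = [
    {"thought": "search first", "tools": [{"name": "search", "args": {"query": "AI papers 2025"}}]},
    {"tools": [{"name": "summarize"}], "output": "Here are the key AI papers from 2025..."},
]
trace = events_to_steps(raw_events)
called = [call.name for step in trace for call in step.tool_calls]
print(called)  # ['search', 'summarize']
```

With the real library, the same loop would presumably build `AgentStep` and `ToolCall` objects instead of the stand-ins.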
ToolCallAccuracy
Checks that the agent called the expected tools. By default, order doesn’t matter (set match).
When to use: Regression testing — confirm that new model versions still call the right tools for standard tasks.
from multivon_eval import ToolCallAccuracy
ToolCallAccuracy() # unordered set match
ToolCallAccuracy(require_order=True) # must match in exact order
Score = fraction of expected tools that were called.
| Outcome (require_order=False) | Score |
|---|---|
| All expected tools called | 1.0 |
| Half expected tools called | 0.5 |
| No expected tools called | 0.0 |
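The unordered score can be reproduced by hand when checking a surprising result. This is a sketch of the set-match rule described above, not the library's code; the function name and the handling of an empty expected list are assumptions.

```python
def unordered_tool_score(expected, called):
    """Fraction of distinct expected tool names that appear among the called tools."""
    expected_set = set(expected)
    if not expected_set:
        return 1.0  # assumption: with nothing expected, nothing can be missed
    return len(expected_set & set(called)) / len(expected_set)


print(unordered_tool_score(["search", "summarize"], ["summarize", "search"]))  # 1.0
print(unordered_tool_score(["search", "summarize"], ["search"]))               # 0.5
print(unordered_tool_score(["search", "summarize"], ["translate"]))            # 0.0
```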
When require_order=True, the evaluator uses sequence alignment, so a partially correct order scores between 0 and 1.
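The alignment algorithm is not specified here. One common way to give partial credit for order is the longest common subsequence (LCS) between the expected and actual call sequences, divided by the number of expected calls. The sketch below uses LCS as an assumption to illustrate the idea; the library's ordered scores may differ.

```python
def ordered_tool_score(expected, called):
    """LCS length of the two name sequences, divided by the number of expected calls."""
    if not expected:
        return 1.0  # assumption: nothing expected counts as a full score
    rows, cols = len(expected) + 1, len(called) + 1
    # dp[i][j] = LCS length of expected[:i] and called[:j]
    dp = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        for j in range(1, cols):
            if expected[i - 1] == called[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1] / len(expected)


print(ordered_tool_score(["search", "summarize"], ["search", "summarize"]))  # 1.0
print(ordered_tool_score(["search", "summarize"], ["summarize", "search"]))  # 0.5
```

In the second call both tools were used, but only one of them can be kept while preserving the expected order, so the score drops to 0.5.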
Requires case.agent_trace and case.expected_tool_calls.
| Parameter | Type | Default | Description |
|---|---|---|---|
| require_order | bool | False | If True, tools must be called in the exact listed order |
| threshold | float | 0.7 | Minimum score to pass |
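Every evaluator on this page takes a threshold with the same meaning. The sketch below shows how a score maps to pass or fail, assuming the comparison is inclusive (a score equal to the threshold passes). That reading follows from "minimum score to pass" but is not confirmed here, and the helper name is hypothetical.

```python
def passing_evaluators(scores, thresholds, default=0.7):
    """Return {evaluator name: passed?}, using 0.7 when no threshold is given."""
    return {
        name: score >= thresholds.get(name, default)  # assumption: inclusive comparison
        for name, score in scores.items()
    }


results = passing_evaluators(
    scores={"ToolCallAccuracy": 0.5, "TaskCompletion": 0.9},
    thresholds={"TaskCompletion": 0.85},
)
print(results)  # {'ToolCallAccuracy': False, 'TaskCompletion': True}
```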
ToolArgumentAccuracy
LLM judge that evaluates whether the arguments passed to tools were appropriate and well-formed.
When to use: Catching argument-level bugs — wrong field names, missing required params, or semantically incorrect values — that tool call name checks miss.
from multivon_eval import EvalCase, AgentStep, ToolCall, ToolArgumentAccuracy
case = EvalCase(
input="Search for quarterly reports from 2024",
agent_trace=[
AgentStep(tool_calls=[
ToolCall(name="search", arguments={"query": "quarterly reports 2024", "limit": 10})
])
],
)
ToolArgumentAccuracy()
ToolArgumentAccuracy(threshold=0.8)
Evaluates up to 8 tool calls. Requires case.agent_trace.
| Parameter | Type | Default | Description |
|---|---|---|---|
| threshold | float | 0.7 | Minimum score to pass |
PlanQuality
LLM judge that evaluates the overall quality of the agent’s plan — logic, completeness, and efficiency.
When to use: Evaluating the agent’s reasoning process, not just the outcome. Useful when debugging why an agent succeeds or fails on complex multi-step tasks.
from multivon_eval import EvalCase, AgentStep, ToolCall, PlanQuality
case = EvalCase(
input="Research competitors and draft a comparison table",
agent_trace=[
AgentStep(thought="I'll search for each competitor first", tool_calls=[ToolCall(name="search")]),
AgentStep(thought="Now I'll compile the results", tool_calls=[ToolCall(name="format_table")]),
],
)
PlanQuality()
PlanQuality(threshold=0.8)
Assesses:
- Does the plan address the task?
- Are the steps in a logical order?
- Are there unnecessary or redundant steps?
- Is anything missing?
Requires case.agent_trace.
| Parameter | Type | Default | Description |
|---|---|---|---|
| threshold | float | 0.7 | Minimum score to pass |
TaskCompletion
LLM judge that evaluates whether the agent’s final output actually satisfies the original task.
When to use: End-to-end success metric — the primary evaluator for whether the agent delivered. Use alongside trajectory evaluators to distinguish “right answer, bad process” from “wrong answer”.
from multivon_eval import EvalCase, AgentStep, ToolCall, TaskCompletion
case = EvalCase(
input="Book a meeting for Monday at 2pm",
agent_trace=[
AgentStep(tool_calls=[ToolCall(name="create_event", arguments={"day": "Monday", "time": "14:00"})]),
],
)
TaskCompletion()
TaskCompletion(threshold=0.9)
Works with or without agent_trace — if no trace is attached, evaluates the final output alone.
| Parameter | Type | Default | Description |
|---|---|---|---|
| threshold | float | 0.7 | Minimum score to pass |
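To separate “right answer, bad process” from “wrong answer”, read the TaskCompletion score next to a trajectory score such as PlanQuality or StepFaithfulness. The following is a rough sketch of that triage, assuming you already have both scores as floats. The function name, the labels, and the shared threshold are illustrative and not part of the library.

```python
def triage(task_score, trajectory_score, threshold=0.7):
    """Label a run by combining an outcome score with a process score."""
    delivered = task_score >= threshold
    clean_process = trajectory_score >= threshold
    if delivered and clean_process:
        return "right answer, good process"
    if delivered:
        return "right answer, bad process"
    return "wrong answer"


print(triage(0.9, 0.4))  # right answer, bad process
print(triage(0.3, 0.9))  # wrong answer
```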
StepFaithfulness
LLM judge that checks whether each step follows logically from the prior steps and the original task.
When to use: Catching hallucinated reasoning steps — agents that invent observations, skip over failures, or take actions that contradict earlier tool results.
from multivon_eval import EvalCase, AgentStep, ToolCall, StepFaithfulness
case = EvalCase(
input="Find the cheapest flight to NYC and book it",
agent_trace=[
AgentStep(thought="Searching for flights", tool_calls=[ToolCall(name="search_flights")]),
AgentStep(thought="Booking the cheapest result found", tool_calls=[ToolCall(name="book_flight")]),
],
)
StepFaithfulness()
StepFaithfulness(threshold=0.8)
Evaluates up to 8 steps. Requires case.agent_trace.
| Parameter | Type | Default | Description |
|---|---|---|---|
| threshold | float | 0.7 | Minimum score to pass |
ToolCallNecessity
Evaluates whether each tool call was actually needed or was redundant given the context and what had already been done.
from multivon_eval import ToolCallNecessity
ToolCallNecessity()
For each tool call, the judge sees all prior calls and asks: was this strictly necessary? Catches agents that over-call tools, re-fetch data they already have, or take “defensive” actions that add no value.
Scoring
score = count(tool calls judged necessary) / count(all tool calls)
Returns 1.0 if the agent made no tool calls. Capped at 8 tool calls per trace to control cost.
Each tool call is evaluated independently with full prior-call context. The judge prompt is:
“Given what the agent already knows from prior tool calls, was calling {tool_name} with these arguments strictly necessary to make progress on the task?”
| Score | Meaning |
|---|---|
| 1.0 | All tool calls were necessary |
| 0.5 | Half of tool calls were redundant |
| 0.0 | Every tool call was unnecessary |
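The scoring rule can be sketched with a pluggable judge. Here a simple duplicate check stands in for the LLM judge, purely so the example runs offline. The function names are hypothetical, and truncating to the first 8 calls is an assumption about how the cap is applied.

```python
MAX_CALLS = 8  # cap stated above; which calls are kept is an assumption


def necessity_score(calls, judge):
    """Share of tool calls the judge deems necessary, given all prior calls."""
    calls = calls[:MAX_CALLS]
    if not calls:
        return 1.0  # no tool calls means nothing was unnecessary
    necessary = sum(1 for i, call in enumerate(calls) if judge(call, calls[:i]))
    return necessary / len(calls)


def duplicate_judge(call, prior_calls):
    """Stand-in for the LLM judge: a call is redundant if an identical one came before."""
    return call not in prior_calls


calls = [
    ("search", {"query": "Python latest release"}),
    ("search", {"query": "Python latest release"}),  # same call again
]
print(necessity_score(calls, duplicate_judge))  # 0.5
```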
Example
from multivon_eval import EvalCase, AgentStep, ToolCall
# Agent calls search() twice with the same query, so the second call is redundant
case = EvalCase(
input="Find the latest Python release",
agent_trace=[
AgentStep(tool_calls=[ToolCall(name="search", arguments={"query": "Python latest release"})]),
AgentStep(tool_calls=[ToolCall(name="search", arguments={"query": "Python latest release"})]),
AgentStep(output="Python 3.13 was released in October 2024."),
],
)
# Expected ToolCallNecessity score: 0.5 (1 of 2 calls necessary)
TrajectoryEfficiency
Evaluates whether the agent took the most efficient path to the answer, and whether it recovered correctly from tool failures.
from multivon_eval import TrajectoryEfficiency
TrajectoryEfficiency()
Scoring
Step 1 — Base score (average of 3 binary QAG questions):
| Question | What it catches |
|---|---|
| Did the agent complete the task without unnecessary detours? | Off-topic steps, tangential tool calls |
| Is the step count proportionate to task complexity? | Over-engineered solutions |
| Did the agent avoid repeating steps it had already completed? | Duplicate fetches, re-running the same computation |
Each question is answered yes/no by an LLM judge. Base score = mean of the three answers (0.0–1.0).
Step 2 — Error recovery penalty (runs only when failed tool calls are detected):
“Did the agent handle tool failures by retrying with different arguments, switching to an alternative approach, or signalling a clear failure — rather than silently continuing as if the call succeeded?”
If no: score = max(0.0, base_score - 0.2)
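The two steps combine as follows. This sketch takes the judge's yes/no answers as plain booleans and reproduces only the arithmetic described above; the function and parameter names are hypothetical.

```python
def trajectory_efficiency(answers, had_failed_calls=False, recovered=True):
    """Mean of the binary answers, minus 0.2 if tool failures were not handled."""
    base_score = sum(answers) / len(answers)  # True counts as 1, False as 0
    if had_failed_calls and not recovered:
        return max(0.0, base_score - 0.2)
    return base_score


# Two of three questions answered "yes", plus an unhandled tool failure
score = trajectory_efficiency([True, True, False], had_failed_calls=True, recovered=False)
print(round(score, 2))  # 0.47
```

The `max(0.0, ...)` floor matters only when the base score is below 0.2, which with three binary questions means a base score of 0.0.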
Example
from multivon_eval import EvalCase, AgentStep, ToolCall
# Agent hits a 404 on the first tool call and tries the same call again unchanged
case = EvalCase(
input="Get the current stock price of AAPL",
agent_trace=[
AgentStep(tool_calls=[ToolCall(name="get_price", arguments={"ticker": "AAPL"}, result="Error: 404")]),
AgentStep(tool_calls=[ToolCall(name="get_price", arguments={"ticker": "AAPL"}, result="Error: 404")]),
AgentStep(output="I was unable to retrieve the price."),
],
)
# Base score might be 0.67 (completes task, proportionate steps, but repeats a failed call)
# Recovery penalty: -0.2 (agent re-ran the same failing call without modification)
# Final TrajectoryEfficiency score: 0.47
AgentMemoryEval
Evaluates whether a multi-session agent uses prior context correctly — retrieving accurately, not hallucinating past context, and forgetting appropriately.
When to use: Multi-session assistants, long-running agents, or any system that must carry state across separate conversations.
Requires case.context (prior session summary or log) and case.input (current query that needs memory).
from multivon_eval import AgentMemoryEval, EvalCase
case = EvalCase(
input="What did I ask you to prioritize last session?",
context="Prior session (2025-11-10): User asked to prioritize the auth module. They mentioned the deadline is end of November.",
expected_output="auth module",
)
AgentMemoryEval()
AgentMemoryEval(threshold=0.8)
Assesses:
- Does the response correctly use information from the prior context?
- Does it avoid hallucinating facts not in the prior context?
- Does it ignore superseded or stale information?
- If expected_output is provided, does the response include it?
| Parameter | Type | Default | Description |
|---|---|---|---|
| threshold | float | 0.7 | Minimum score to pass |
Aligned with AMA-Bench (2026), a benchmark for evaluating long-horizon memory in agentic applications.
Full agent eval example
from multivon_eval import (
EvalSuite, EvalCase, AgentStep, ToolCall,
ToolCallAccuracy, ToolArgumentAccuracy,
ToolCallNecessity, TrajectoryEfficiency,
PlanQuality, TaskCompletion,
)
def run_agent(task: str) -> str:
    # Your agent here: returns the final output
    ...
# Build traces from your agent framework, then wrap each one in an EvalCase
cases = []  # placeholder: fill with your own EvalCase objects
suite = EvalSuite("Agent Eval")
suite.add_cases(cases)
suite.add_evaluators(
ToolCallAccuracy(require_order=False),
ToolArgumentAccuracy(),
ToolCallNecessity(),
TrajectoryEfficiency(),
PlanQuality(),
TaskCompletion(threshold=0.85),
)
report = suite.run(run_agent)