Agent Evaluators

Agent evaluators work with agent_trace, a structured record of what your agent did. They are framework-agnostic and work with LangChain, LlamaIndex, CrewAI, or any custom agent.

Setting up an agent trace

from multivon_eval import EvalCase, AgentStep, ToolCall

case = EvalCase(
    input="Search for recent AI papers and write a summary",
    agent_trace=[
        AgentStep(
            thought="I need to search for recent papers first",
            tool_calls=[
                ToolCall(
                    name="search",
                    arguments={"query": "AI papers 2025"},
                    result=["Paper A", "Paper B", "Paper C"],
                )
            ],
        ),
        AgentStep(
            thought="Now I'll summarize what I found",
            tool_calls=[ToolCall(name="summarize")],
            output="Here are the key AI papers from 2025...",
        ),
    ],
    expected_tool_calls=["search", "summarize"],
)
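To make the relationship between agent_trace and expected_tool_calls concrete, the sketch below flattens a trace into the ordered list of tool names it contains. It is illustrative only: `called_tool_names` is a hypothetical helper, not part of multivon_eval, and it assumes that steps expose `tool_calls` and calls expose `name` as attributes, mirroring the constructor keywords above. `SimpleNamespace` objects stand in for AgentStep and ToolCall so the snippet runs without the library.

```python
from types import SimpleNamespace


def called_tool_names(trace):
    """Return tool names in call order across all steps of a trace."""
    names = []
    for step in trace:
        # A step may have no tool calls (for example, a final answer step).
        for call in getattr(step, "tool_calls", None) or []:
            names.append(call.name)
    return names


# Stand-in objects shaped like the trace above (assumed attribute names).
trace = [
    SimpleNamespace(tool_calls=[SimpleNamespace(name="search")]),
    SimpleNamespace(tool_calls=[SimpleNamespace(name="summarize")]),
    SimpleNamespace(tool_calls=None),
]
print(called_tool_names(trace))  # ['search', 'summarize']
```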

ToolCallAccuracy

Checks that the agent called the expected tools. By default, order doesn’t matter (set match).

When to use: Regression testing. Confirm that new model versions still call the right tools for standard tasks.

from multivon_eval import ToolCallAccuracy

ToolCallAccuracy()                    # unordered set match
ToolCallAccuracy(require_order=True)  # must match in exact order

Score = fraction of expected tools that were called.

`require_order=False`	Score
All expected tools called	1.0
Half of the expected tools called	0.5
No expected tools called	0.0
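As a rough illustration of the table above, the unordered score can be pictured as a set intersection. This is a minimal sketch rather than the library's implementation; the function name `unordered_tool_score` and the handling of an empty expected list are assumptions.

```python
def unordered_tool_score(expected, called):
    """Fraction of distinct expected tools that appear anywhere in the called list."""
    expected_set = set(expected)
    if not expected_set:
        # Assumption: with nothing expected, nothing can be missed.
        return 1.0
    return len(expected_set & set(called)) / len(expected_set)


print(unordered_tool_score(["search", "summarize"], ["summarize", "search"]))  # 1.0
print(unordered_tool_score(["search", "summarize"], ["search"]))               # 0.5
print(unordered_tool_score(["search", "summarize"], ["lookup"]))               # 0.0
```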

When require_order=True, the evaluator uses sequence alignment, so a partially correct order scores between 0 and 1.

Requires case.agent_trace and case.expected_tool_calls.
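The text does not say which alignment algorithm the ordered mode uses. One common choice is the longest common subsequence (LCS), sketched below as an assumption: the score is the LCS length divided by the number of expected calls. The names `lcs_length` and `ordered_tool_score` are hypothetical, and the library's actual scores may differ.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two sequences."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def ordered_tool_score(expected, called):
    """Share of expected calls that appear in the called list in the right relative order."""
    if not expected:
        # Assumption: with nothing expected, nothing can be out of order.
        return 1.0
    return lcs_length(expected, called) / len(expected)


print(ordered_tool_score(["search", "summarize"], ["search", "summarize"]))  # 1.0
print(ordered_tool_score(["search", "summarize"], ["summarize", "search"]))  # 0.5
```

Under this assumption, calling the right tools in the wrong order earns partial credit instead of failing outright.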

Parameter	Type	Default	Description
`require_order`	`bool`	`False`	If `True`, tools must be called in the exact listed order
`threshold`	`float`	`0.7`	Minimum score to pass

ToolArgumentAccuracy

LLM judge that evaluates whether the arguments passed to tools were appropriate and well-formed.

When to use: Catching argument-level bugs (wrong field names, missing required params, or semantically incorrect values) that checks on tool names alone would miss.

from multivon_eval import EvalCase, AgentStep, ToolCall, ToolArgumentAccuracy

case = EvalCase(
    input="Search for quarterly reports from 2024",
    agent_trace=[
        AgentStep(tool_calls=[
            ToolCall(name="search", arguments={"query": "quarterly reports 2024", "limit": 10})
        ])
    ],
)
ToolArgumentAccuracy()
ToolArgumentAccuracy(threshold=0.8)

Evaluates up to 8 tool calls. Requires case.agent_trace.

Parameter	Type	Default	Description
`threshold`	`float`	`0.7`	Minimum score to pass

PlanQuality

LLM judge that evaluates the overall quality of the agent’s plan: logic, completeness, and efficiency.

When to use: Evaluating the agent’s reasoning process, not just the outcome. Useful when debugging why an agent succeeds or fails on complex multi-step tasks.

from multivon_eval import EvalCase, AgentStep, ToolCall, PlanQuality

case = EvalCase(
    input="Research competitors and draft a comparison table",
    agent_trace=[
        AgentStep(thought="I'll search for each competitor first", tool_calls=[ToolCall(name="search")]),
        AgentStep(thought="Now I'll compile the results", tool_calls=[ToolCall(name="format_table")]),
    ],
)
PlanQuality()
PlanQuality(threshold=0.8)

Assesses:

Does the plan address the task?
Are the steps in a logical order?
Are there unnecessary or redundant steps?
Is anything missing?

Requires case.agent_trace.

Parameter	Type	Default	Description
`threshold`	`float`	`0.7`	Minimum score to pass

TaskCompletion

LLM judge that evaluates whether the agent’s final output actually satisfies the original task.

When to use: End-to-end success metric, and the primary evaluator for whether the agent delivered. Use alongside trajectory evaluators to distinguish “right answer, bad process” from “wrong answer”.

from multivon_eval import EvalCase, AgentStep, ToolCall, TaskCompletion

case = EvalCase(
    input="Book a meeting for Monday at 2pm",
    agent_trace=[
        AgentStep(tool_calls=[ToolCall(name="create_event", arguments={"day": "Monday", "time": "14:00"})]),
    ],
)
TaskCompletion()
TaskCompletion(threshold=0.9)

Works with or without agent_trace — if no trace is attached, evaluates the final output alone.

Parameter	Type	Default	Description
`threshold`	`float`	`0.7`	Minimum score to pass

StepFaithfulness

LLM judge that checks whether each step follows logically from the prior steps and the original task.

When to use: Catching hallucinated reasoning steps, such as agents that invent observations, skip over failures, or take actions that contradict earlier tool results.

from multivon_eval import EvalCase, AgentStep, ToolCall, StepFaithfulness

case = EvalCase(
    input="Find the cheapest flight to NYC and book it",
    agent_trace=[
        AgentStep(thought="Searching for flights", tool_calls=[ToolCall(name="search_flights")]),
        AgentStep(thought="Booking the cheapest result found", tool_calls=[ToolCall(name="book_flight")]),
    ],
)
StepFaithfulness()
StepFaithfulness(threshold=0.8)

Evaluates up to 8 steps. Requires case.agent_trace.

Parameter	Type	Default	Description
`threshold`	`float`	`0.7`	Minimum score to pass

ToolCallNecessity

Evaluates whether each tool call was actually needed, or if it was redundant given the context and what had already been done.

from multivon_eval import ToolCallNecessity

ToolCallNecessity()

For each tool call, the judge sees all prior calls and asks: was this strictly necessary? Catches agents that over-call tools, re-fetch data they already have, or take “defensive” actions that add no value.

Scoring

score = count(tool calls judged necessary) / count(all tool calls)

Returns 1.0 if the agent made no tool calls. Capped at 8 tool calls per trace to control cost. Each tool call is evaluated independently with full prior-call context. The judge prompt is:

“Given what the agent already knows from prior tool calls, was calling {tool_name} with these arguments strictly necessary to make progress on the task?”

Score	Meaning
1.0	All tool calls were necessary
0.5	Half of tool calls were redundant
0.0	Every tool call was unnecessary
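The scoring rule above can be sketched in a few lines. This is an illustration under stated assumptions, not the library's code: `necessity_score` and `not_a_repeat` are hypothetical names, the cap is assumed to keep the first 8 calls, and a simple duplicate check stands in for the LLM judge.

```python
MAX_JUDGED_CALLS = 8  # cap stated in the text; which calls are kept is an assumption


def necessity_score(calls, is_necessary):
    """Share of judged tool calls that the judge marks as necessary."""
    judged = list(calls)[:MAX_JUDGED_CALLS]
    if not judged:
        return 1.0  # no tool calls means nothing was redundant
    # Each call is judged with the calls that came before it as context.
    necessary = sum(
        1 for i, call in enumerate(judged) if is_necessary(call, judged[:i])
    )
    return necessary / len(judged)


def not_a_repeat(call, prior_calls):
    """Stand-in judge: a call is necessary unless an identical call was already made."""
    return call not in prior_calls


calls = [
    {"name": "search", "arguments": {"query": "Python latest release"}},
    {"name": "search", "arguments": {"query": "Python latest release"}},
]
print(necessity_score(calls, not_a_repeat))  # 0.5
```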

Example

from multivon_eval import EvalCase, AgentStep, ToolCall

# Agent calls search() twice with the same query, so the second call is redundant
case = EvalCase(
    input="Find the latest Python release",
    agent_trace=[
        AgentStep(tool_calls=[ToolCall(name="search", arguments={"query": "Python latest release"})]),
        AgentStep(tool_calls=[ToolCall(name="search", arguments={"query": "Python latest release"})]),
        AgentStep(output="Python 3.13 was released in October 2024."),
    ],
)
# Expected ToolCallNecessity score: 0.5 (1 of 2 calls necessary)

TrajectoryEfficiency

Evaluates whether the agent took the most efficient path to the answer, and whether it recovered correctly from tool failures.

from multivon_eval import TrajectoryEfficiency

TrajectoryEfficiency()

Scoring

Step 1 — Base score (average of 3 binary QAG questions):

Question	What it catches
Did the agent complete the task without unnecessary detours?	Off-topic steps, tangential tool calls
Is the step count proportionate to task complexity?	Over-engineered solutions
Did the agent avoid repeating steps it had already completed?	Duplicate fetches, re-running the same computation

Each question is answered yes/no by an LLM judge. Base score = mean of the three answers (0.0–1.0).

Step 2 — Error recovery penalty (runs only when failed tool calls are detected):

“Did the agent handle tool failures by retrying with different arguments, switching to an alternative approach, or signalling a clear failure — rather than silently continuing as if the call succeeded?”

If no: score = max(0.0, base_score - 0.2)

final_score ∈ [0.0, 1.0]
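The two steps combine as in the sketch below. It is a minimal illustration of the arithmetic described above, with the yes/no answers passed in directly instead of coming from an LLM judge; the function and parameter names are hypothetical.

```python
RECOVERY_PENALTY = 0.2  # penalty stated in the text


def trajectory_efficiency(answers, had_failed_calls=False, recovered_well=True):
    """Mean of the yes/no answers, minus a penalty for unhandled tool failures."""
    base_score = sum(1 for answer in answers if answer) / len(answers)
    # The recovery question only applies when a tool call actually failed.
    if had_failed_calls and not recovered_well:
        return max(0.0, base_score - RECOVERY_PENALTY)
    return base_score


# Two of three questions answered "yes", plus a failed call that was not handled well.
score = trajectory_efficiency([True, True, False], had_failed_calls=True, recovered_well=False)
print(round(score, 2))  # 0.47
```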

Example

from multivon_eval import EvalCase, AgentStep, ToolCall

# Agent hits a 404 on the first tool call and tries the same call again unchanged
case = EvalCase(
    input="Get the current stock price of AAPL",
    agent_trace=[
        AgentStep(tool_calls=[ToolCall(name="get_price", arguments={"ticker": "AAPL"}, result="Error: 404")]),
        AgentStep(tool_calls=[ToolCall(name="get_price", arguments={"ticker": "AAPL"}, result="Error: 404")]),
        AgentStep(output="I was unable to retrieve the price."),
    ],
)
# Base score might be 0.67 (completes task, proportionate steps, but repeats a failed call)
# Recovery penalty: -0.2 (agent re-ran the same failing call without modification)
# Final TrajectoryEfficiency score: 0.47

AgentMemoryEval

Evaluates whether a multi-session agent uses prior context correctly: retrieving accurately, not hallucinating past context, and forgetting appropriately.

When to use: Multi-session assistants, long-running agents, or any system that must carry state across separate conversations.

Requires case.context (prior session summary or log) and case.input (current query that needs memory).

from multivon_eval import AgentMemoryEval, EvalCase

case = EvalCase(
    input="What did I ask you to prioritize last session?",
    context="Prior session (2025-11-10): User asked to prioritize the auth module. They mentioned the deadline is end of November.",
    expected_output="auth module",
)
AgentMemoryEval()
AgentMemoryEval(threshold=0.8)

Assesses:

Does the response correctly use information from the prior context?
Does it avoid hallucinating facts not in the prior context?
Does it ignore superseded or stale information?
If expected_output is provided, does the response include it?

Parameter	Type	Default	Description
`threshold`	`float`	`0.7`	Minimum score to pass

Aligned with AMA-Bench (2026), a benchmark for evaluating long-horizon memory in agentic applications.

Full agent eval example

from multivon_eval import (
    EvalSuite, EvalCase, AgentStep, ToolCall,
    ToolCallAccuracy, ToolArgumentAccuracy,
    ToolCallNecessity, TrajectoryEfficiency,
    PlanQuality, TaskCompletion,
)

def run_agent(task: str) -> str:
    # Your agent here — returns final output
    ...

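# Build traces from your agent framework, then wrap each one in an EvalCase
cases = [
    EvalCase(
        input="Search for recent AI papers and write a summary",
        agent_trace=[
            AgentStep(tool_calls=[ToolCall(name="search", arguments={"query": "AI papers 2025"})]),
            AgentStep(tool_calls=[ToolCall(name="summarize")], output="Here are the key AI papers from 2025..."),
        ],
        expected_tool_calls=["search", "summarize"],
    ),
]
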
suite = EvalSuite("Agent Eval")
suite.add_cases(cases)
suite.add_evaluators(
    ToolCallAccuracy(require_order=False),
    ToolArgumentAccuracy(),
    ToolCallNecessity(),
    TrajectoryEfficiency(),
    PlanQuality(),
    TaskCompletion(threshold=0.85),
)

report = suite.run(run_agent)
