Conversation evaluators assess quality across a full conversation, not just a single response. They read case.conversation, a list of message dicts, each with "role" and "content" keys.
Setting up a conversation case
from multivon_eval import EvalCase

case = EvalCase(
    input="Help me plan a trip to Japan",
    conversation=[
        {"role": "user", "content": "I want to visit Japan in April"},
        {"role": "assistant", "content": "April is perfect for cherry blossoms. What cities interest you?"},
        {"role": "user", "content": "Tokyo and Kyoto"},
        {"role": "assistant", "content": "Great choices. Tokyo for 4 days, Kyoto for 3: here's an itinerary..."},
        {"role": "user", "content": "What's my budget for this?"},
        {"role": "assistant", "content": "For 7 days in Japan, budget around $150-250/day..."},
    ],
)
All conversation evaluators require case.conversation. The latest response (the assistant turn being evaluated) is passed in as the model’s output.
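Because every evaluator on this page depends on that message format, it can help to check a conversation's shape before building a case. The sketch below is plain Python for illustration; `validate_conversation` is a hypothetical helper, not part of multivon_eval, and the library may apply its own (possibly stricter) checks.

```python
from typing import Any, List


def validate_conversation(conversation: Any) -> List[str]:
    """Hypothetical check: return a list of problems; empty means well-formed."""
    if not isinstance(conversation, list):
        return ["conversation must be a list"]
    problems = []
    for index, message in enumerate(conversation):
        if not isinstance(message, dict):
            problems.append(f"message {index} is not a dict")
            continue
        # Each message needs string values under both keys.
        for key in ("role", "content"):
            if not isinstance(message.get(key), str):
                problems.append(f"message {index} is missing a string {key!r}")
    return problems


good = [
    {"role": "user", "content": "I want to visit Japan in April"},
    {"role": "assistant", "content": "April is perfect for cherry blossoms."},
]
bad = [{"role": "user"}, "not a message"]

print(validate_conversation(good))
print(validate_conversation(bad))
```

Returning a list of problems rather than raising on the first one makes it easier to fix a hand-written fixture in a single pass.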
ConversationRelevance
Checks that the latest assistant response stays on topic relative to the conversation history.
When to use: Long support sessions, multi-turn assistants, or any chat where the model must track an ongoing thread instead of resetting context each turn.
from multivon_eval import ConversationRelevance
ConversationRelevance()  # default threshold of 0.7
ConversationRelevance(threshold=0.8)  # stricter pass bar
Catches assistants that go off-topic, bring up unrelated information, or lose the thread of the conversation. Requires case.conversation.
| Parameter | Type | Default | Description |
|---|---|---|---|
| threshold | float | 0.7 | Minimum score to pass |
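The threshold parameter works the same way for every evaluator on this page. As a rough mental model, the sketch below assumes scores fall between 0 and 1 and that a score equal to the threshold passes; both are assumptions read from "minimum score to pass", and `passes` is a hypothetical function, not a multivon_eval API.

```python
def passes(score: float, threshold: float = 0.7) -> bool:
    """Hypothetical pass rule: a score at or above the threshold passes."""
    return score >= threshold


print(passes(0.75))                 # True: clears the default 0.7
print(passes(0.75, threshold=0.8))  # False: the same score misses a stricter bar
```

Raising the threshold therefore turns borderline responses from passes into failures without changing how they are scored.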
KnowledgeRetention
Checks that the assistant correctly recalls and applies information from earlier in the conversation.
When to use: Personal assistants, onboarding flows, or any session where the user provides facts (preferences, constraints, identifiers) that the model must respect later.
from multivon_eval import KnowledgeRetention
KnowledgeRetention()  # default threshold of 0.7
KnowledgeRetention(threshold=0.8)  # stricter pass bar
Example: if the user mentioned “I’m vegetarian” in turn 2, and the assistant recommends a steakhouse in turn 6, this fails. Requires case.conversation.
| Parameter | Type | Default | Description |
|---|---|---|---|
| threshold | float | 0.7 | Minimum score to pass |
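The vegetarian example can be written out as a conversation fixture. The messages below are invented for illustration and only show the kind of input this evaluator is meant to flag. They use the same message format as case.conversation, and nothing here calls multivon_eval.

```python
# Illustrative fixture: the user states a constraint, and a later reply ignores it.
conversation = [
    {"role": "user", "content": "Can you suggest somewhere for dinner tonight?"},
    {"role": "assistant", "content": "Sure. Any dietary preferences?"},
    {"role": "user", "content": "I'm vegetarian."},
    {"role": "assistant", "content": "Noted. What neighborhood are you in?"},
    {"role": "user", "content": "Downtown."},
    {"role": "assistant", "content": "Try the steakhouse on Main Street."},
]

# The fact the assistant should have retained, and the reply that forgets it.
stated_fact = conversation[2]["content"]
final_reply = conversation[-1]["content"]
print(stated_fact)
print(final_reply)
```

A list like this would be supplied as the conversation argument of an EvalCase, as in the setup example at the top of this page.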
ConversationCompleteness
Checks that the conversation, taken as a whole, resolves the user’s original goal.
When to use: Support bots, task-completion agents, or any session whose success is measured by whether the user got what they came for — not just whether individual turns were helpful.
from multivon_eval import ConversationCompleteness
ConversationCompleteness()  # default threshold of 0.7
ConversationCompleteness(threshold=0.9)  # stricter pass bar
Infers the user’s original goal from the first user turn and assesses whether the final response brings the dialogue to a satisfying resolution. Requires case.conversation.
| Parameter | Type | Default | Description |
|---|---|---|---|
| threshold | float | 0.7 | Minimum score to pass |
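To see which two messages this evaluator focuses on, the sketch below pulls the first user turn and the final assistant turn out of a conversation. It is plain Python under stated assumptions: `goal_and_final_response` is a hypothetical helper, and how multivon_eval infers the goal and judges the resolution is not shown here.

```python
from typing import Dict, List, Optional, Tuple

Message = Dict[str, str]


def goal_and_final_response(
    conversation: List[Message],
) -> Tuple[Optional[str], Optional[str]]:
    """Hypothetical helper: first user message and last assistant message."""
    first_user = next(
        (m["content"] for m in conversation if m["role"] == "user"), None
    )
    final_assistant = next(
        (m["content"] for m in reversed(conversation) if m["role"] == "assistant"),
        None,
    )
    return first_user, final_assistant


conversation = [
    {"role": "user", "content": "I want to visit Japan in April"},
    {"role": "assistant", "content": "April is perfect for cherry blossoms. What cities interest you?"},
    {"role": "user", "content": "Tokyo and Kyoto"},
    {"role": "assistant", "content": "For 7 days in Japan, budget around $150-250/day..."},
]

goal, final = goal_and_final_response(conversation)
print(goal)
print(final)
```

Either value is None when the conversation has no message with that role, which is a useful signal that a fixture is incomplete.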
TurnConsistency
Checks for contradictions between turns — the assistant shouldn’t say one thing and then contradict it later.
When to use: Long sessions where the model’s position can drift, or factual chat where flip-flopping erodes user trust.
from multivon_eval import TurnConsistency
TurnConsistency()  # default threshold of 0.8
TurnConsistency(threshold=0.9)  # stricter pass bar
Catches cases where the model’s stated facts, recommendations, or persona drift across the session. Requires case.conversation.
| Parameter | Type | Default | Description |
|---|---|---|---|
| threshold | float | 0.8 | Minimum score to pass (higher default reflects that contradictions are a hard quality bar) |
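One way to picture a contradiction check is that every pair of assistant turns is a candidate for disagreement. The sketch below only enumerates those pairs for an invented conversation. It is an illustration, not a description of how multivon_eval compares turns.

```python
from itertools import combinations

conversation = [
    {"role": "user", "content": "Is the museum open on Mondays?"},
    {"role": "assistant", "content": "Yes, it is open every day."},
    {"role": "user", "content": "Great, what time on Monday?"},
    {"role": "assistant", "content": "It is closed on Mondays."},
    {"role": "user", "content": "So which is it?"},
    {"role": "assistant", "content": "Apologies, it is closed on Mondays."},
]

# Keep only what the assistant said, in order.
assistant_turns = [m["content"] for m in conversation if m["role"] == "assistant"]

# Every unordered pair of assistant turns is a candidate contradiction.
candidate_pairs = list(combinations(assistant_turns, 2))
print(len(candidate_pairs))  # 3
print(candidate_pairs[0])
```

The number of pairs grows quadratically with the number of assistant turns, which is one reason long sessions give more room for drift.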
Full conversation eval example
from multivon_eval import (
    EvalSuite, EvalCase,
    ConversationRelevance, KnowledgeRetention,
    ConversationCompleteness, TurnConsistency,
)

suite = EvalSuite("Chatbot Eval")
suite.add_cases(conversation_cases)  # your EvalCase objects with conversation set, defined elsewhere
suite.add_evaluators(
    ConversationRelevance(),
    KnowledgeRetention(),
    ConversationCompleteness(threshold=0.85),  # stricter than the 0.7 default
    TurnConsistency(),
)
report = suite.run(my_chatbot_fn)  # your chatbot function under test, defined elsewhere