Conversation evaluators assess quality across a full conversation, not just a single response. They read case.conversation, a list of message dicts, each with "role" and "content" keys.

Setting up a conversation case

from multivon_eval import EvalCase

case = EvalCase(
    input="Help me plan a trip to Japan",
    conversation=[
        {"role": "user", "content": "I want to visit Japan in April"},
        {"role": "assistant", "content": "April is perfect for cherry blossoms. What cities interest you?"},
        {"role": "user", "content": "Tokyo and Kyoto"},
        {"role": "assistant", "content": "Great choices. Tokyo for 4 days, Kyoto for 3 — here's an itinerary..."},
        {"role": "user", "content": "What's my budget for this?"},
        {"role": "assistant", "content": "For 7 days in Japan, budget around $150-250/day..."},
    ],
)

All conversation evaluators require case.conversation. The latest response (the assistant turn being evaluated) is passed in as the model’s output.
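Because every evaluator on this page depends on the message list being well formed, it can help to check fixtures before running a suite. The following is a minimal sketch in plain Python; `find_conversation_problems` is a hypothetical helper written for this illustration and is not part of multivon_eval. It only checks the shape described above (a list of dicts with string "role" and "content" values).

```python
from typing import Any, List


def find_conversation_problems(conversation: Any) -> List[str]:
    """Hypothetical pre-flight check; not part of multivon_eval."""
    if not isinstance(conversation, list):
        return ["conversation must be a list of message dicts"]
    problems: List[str] = []
    if not conversation:
        problems.append("conversation is empty")
    for index, message in enumerate(conversation):
        if not isinstance(message, dict):
            problems.append(f"message {index} is not a dict")
            continue
        # Each message needs both keys, and both values must be strings.
        for key in ("role", "content"):
            if not isinstance(message.get(key), str):
                problems.append(f"message {index} is missing a string {key!r}")
    return problems


good = [
    {"role": "user", "content": "I want to visit Japan in April"},
    {"role": "assistant", "content": "April is perfect for cherry blossoms."},
]
bad = [{"role": "user"}, "not a dict"]

print(find_conversation_problems(good))  # no problems reported
print(find_conversation_problems(bad))   # one problem per malformed message
```

An empty result means the list matches the expected shape; it says nothing about whether the conversation is a good test case.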

ConversationRelevance

Checks that the latest assistant response stays on topic relative to the conversation history.

When to use: Long support sessions, multi-turn assistants, or any chat where the model must track an ongoing thread instead of resetting context each turn.

from multivon_eval import ConversationRelevance

ConversationRelevance()
ConversationRelevance(threshold=0.8)

Catches assistants that go off-topic, bring up unrelated information, or lose the thread of the conversation. Requires case.conversation.

Parameter | Type | Default | Description
threshold | float | 0.7 | Minimum score to pass

KnowledgeRetention

Checks that the assistant correctly recalls and applies information from earlier in the conversation.

When to use: Personal assistants, onboarding flows, or any session where the user provides facts (preferences, constraints, identifiers) that the model must respect later.

from multivon_eval import KnowledgeRetention

KnowledgeRetention()
KnowledgeRetention(threshold=0.8)

Example: if the user mentioned “I’m vegetarian” in turn 2, and the assistant recommends a steakhouse in turn 6, this fails. Requires case.conversation.

Parameter | Type | Default | Description
threshold | float | 0.7 | Minimum score to pass
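The vegetarian example can be written out as a fixture. The sketch below uses only plain Python data; the wording of each message is invented for illustration, and the conversation is shortened so the constraint and the violation are easy to spot. It is the kind of input one would expect KnowledgeRetention to fail, not a record of actual evaluator output.

```python
# Illustrative fixture (invented wording): the user states a dietary
# constraint early on, and the latest assistant reply ignores it.
conversation = [
    {"role": "user", "content": "Can you suggest places to eat in Kyoto? I'm vegetarian."},
    {"role": "assistant", "content": "Sure. Are you after lunch or dinner?"},
    {"role": "user", "content": "Dinner, somewhere near the station."},
    {"role": "assistant", "content": "Try the steakhouse next to the station; it is popular for dinner."},
]

stated_constraint = conversation[0]["content"]  # where the fact was given
latest_reply = conversation[-1]["content"]      # the turn being evaluated

print(stated_constraint)
print(latest_reply)
```

The list is in the same shape as the conversation argument shown in the setup example, so it can be passed to EvalCase in the same way.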

ConversationCompleteness

Checks that the conversation, taken as a whole, resolves the user’s original goal.

When to use: Support bots, task-completion agents, or any session whose success is measured by whether the user got what they came for, not just whether individual turns were helpful.

from multivon_eval import ConversationCompleteness

ConversationCompleteness()
ConversationCompleteness(threshold=0.9)

Infers the user’s original goal from the first user turn and assesses whether the final response brings the dialogue to a satisfying resolution. Requires case.conversation.

Parameter | Type | Default | Description
threshold | float | 0.7 | Minimum score to pass
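To make the two ends of that comparison concrete, the sketch below pulls the first user turn and the last assistant turn out of a message list. `goal_and_final_response` is a hypothetical helper for inspecting fixtures, not a multivon_eval function, and it does not reproduce how the evaluator infers the goal or scores the resolution. It also assumes the final response is the last assistant message in the list.

```python
from typing import Dict, List, Tuple


def goal_and_final_response(conversation: List[Dict[str, str]]) -> Tuple[str, str]:
    """Hypothetical helper: return (first user turn, last assistant turn)."""
    first_user = next(
        (m["content"] for m in conversation if m["role"] == "user"), None
    )
    final_assistant = next(
        (m["content"] for m in reversed(conversation) if m["role"] == "assistant"),
        None,
    )
    if first_user is None or final_assistant is None:
        raise ValueError("need at least one user turn and one assistant turn")
    return first_user, final_assistant


conversation = [
    {"role": "user", "content": "I want to visit Japan in April"},
    {"role": "assistant", "content": "April is perfect for cherry blossoms. What cities interest you?"},
    {"role": "user", "content": "Tokyo and Kyoto"},
    {"role": "assistant", "content": "For 7 days in Japan, budget around $150-250/day..."},
]

goal, final = goal_and_final_response(conversation)
print("Goal turn:  ", goal)
print("Final reply:", final)
```

Reading the two strings side by side is a quick way to judge whether a fixture is a fair completeness test before setting a high threshold.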

TurnConsistency

Checks for contradictions between turns: the assistant shouldn’t say one thing and then contradict it later.

When to use: Long sessions where the model’s position can drift, or factual chat where flip-flopping erodes user trust.

from multivon_eval import TurnConsistency

TurnConsistency()
TurnConsistency(threshold=0.9)

Catches cases where the model’s stated facts, recommendations, or persona drift across the session. Requires case.conversation.

Parameter | Type | Default | Description
threshold | float | 0.8 | Minimum score to pass (higher default reflects that contradictions are a hard quality bar)
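The effect of the different defaults can be sketched in a few lines. The threshold values below come from the parameter tables on this page; the comparison itself is an assumption, reading "minimum score to pass" as an inclusive `score >= threshold` check, and `verdicts` is a hypothetical helper rather than library code.

```python
# Default thresholds as listed in the parameter tables on this page.
DEFAULT_THRESHOLDS = {
    "ConversationRelevance": 0.7,
    "KnowledgeRetention": 0.7,
    "ConversationCompleteness": 0.7,
    "TurnConsistency": 0.8,
}


def verdicts(score: float, thresholds=DEFAULT_THRESHOLDS) -> dict:
    """Assumed pass rule: a score passes when it is at least the threshold."""
    return {name: score >= minimum for name, minimum in thresholds.items()}


# The same hypothetical score evaluated against each default.
result = verdicts(0.75)
for name, passed in result.items():
    print(f"{name}: {'pass' if passed else 'fail'}")
```

Under this reading, a score of 0.75 clears the three evaluators that default to 0.7 but not TurnConsistency, which is the practical consequence of its stricter default.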

Full conversation eval example

from multivon_eval import (
    EvalSuite, EvalCase,
    ConversationRelevance, KnowledgeRetention,
    ConversationCompleteness, TurnConsistency,
)

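# conversation_cases: your own collection of EvalCase objects, each with
# conversation set (not defined in this snippet).
suite = EvalSuite("Chatbot Eval")
suite.add_cases(conversation_cases)
suite.add_evaluators(
    ConversationRelevance(),
    KnowledgeRetention(),
    ConversationCompleteness(threshold=0.85),  # stricter than the 0.7 default
    TurnConsistency(),
)

# my_chatbot_fn: the chatbot callable under evaluation (not defined in this snippet).
report = suite.run(my_chatbot_fn)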
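The example does not show where the conversations in conversation_cases come from. If they start life as plain-text transcripts, one possible way to turn them into the message-dict format is sketched below. `transcript_to_messages` and the "User:" / "Assistant:" line prefixes are assumptions made for this illustration; nothing here is part of multivon_eval.

```python
from typing import Dict, List


def transcript_to_messages(transcript: str) -> List[Dict[str, str]]:
    """Hypothetical converter from 'User: ...' / 'Assistant: ...' lines."""
    messages: List[Dict[str, str]] = []
    for raw_line in transcript.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        speaker, separator, text = line.partition(":")
        role = speaker.strip().lower()
        if separator and role in ("user", "assistant"):
            # A recognised speaker prefix starts a new message.
            messages.append({"role": role, "content": text.strip()})
        elif messages:
            # Anything else is treated as a continuation of the previous message.
            messages[-1]["content"] += " " + line
        else:
            raise ValueError("transcript must start with 'User:' or 'Assistant:'")
    return messages


transcript = """
User: I want to visit Japan in April
Assistant: April is perfect for cherry blossoms.
  What cities interest you?
User: Tokyo and Kyoto
"""

messages = transcript_to_messages(transcript)
for message in messages:
    print(message)
```

Each resulting list has the same shape as the conversation argument in the setup example, so it can be supplied to EvalCase as shown there.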