Seven concrete use cases showing what an eval looks like from setup to CI block.
Support Bot
Scenario: A customer support bot answers questions from a help docs knowledge base. You ship a new knowledge base update and want to make sure responses stay faithful: the bot should never tell users to “contact billing” when the answer is in the docs.
RAG Pipeline
Scenario: You change your retrieval chunk size from 512 to 256 tokens to reduce latency. The change looks fine in manual testing, but experiment tracking reveals a 17-point drop in faithfulness: the smaller chunks miss critical context.
Coding Agent
Scenario: You’re building a coding agent that calls external APIs (Slack, GitHub, Google Calendar). It looks fine on single runs, but users report intermittent failures. Multi-run evaluation reveals that 3 of 5 tool call scenarios are flaky: the Slack and scheduling tools don’t have retry logic.
After adding retry logic, re-run with runs=5 and stability goes to 100%.
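The multi-run idea can be sketched in plain Python, independent of the library's own API. Everything below is an assumption made for illustration: the helper names (`classify_runs`, `run_suite`, `with_retry`), the scenario names, and the simulated tool that fails on every third call. Each scenario runs five times and is labelled by whether its results agree across runs.

```python
import itertools
from typing import Callable, Dict, List


def classify_runs(results: List[bool]) -> str:
    """Label a scenario from its per-run pass/fail results."""
    if all(results):
        return "STABLE"
    if not any(results):
        return "FAILING"
    return "FLAKY"  # passed on some runs and failed on others


def run_suite(scenarios: Dict[str, Callable[[], bool]], runs: int = 5) -> Dict[str, str]:
    """Run every scenario `runs` times and classify each one."""
    return {
        name: classify_runs([check() for _ in range(runs)])
        for name, check in scenarios.items()
    }


def stability(labels: Dict[str, str]) -> float:
    """Fraction of scenarios that passed on every run."""
    return sum(label == "STABLE" for label in labels.values()) / len(labels)


def make_flaky_tool(fail_every: int) -> Callable[[], bool]:
    """Hypothetical tool call that fails on every `fail_every`-th invocation."""
    calls = itertools.count(1)
    return lambda: next(calls) % fail_every != 0


def with_retry(tool: Callable[[], bool], attempts: int = 2) -> Callable[[], bool]:
    """Retry a tool call until it succeeds or `attempts` is exhausted."""
    return lambda: any(tool() for _ in range(attempts))


before = run_suite({
    "github_create_issue": lambda: True,
    "slack_post_message": make_flaky_tool(3),
})
after = run_suite({
    "github_create_issue": lambda: True,
    "slack_post_message": with_retry(make_flaky_tool(3)),
})
```

In this simulation the Slack scenario is labelled FLAKY without retries and STABLE with them, which is the same pattern a single run would hide.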
Multi-Run + Flakiness Detection
Scenario: A summarization model is non-deterministic. Most of the time it stays faithful to the source document, but sometimes it adds details that were never in the text. You want to run each case 5 times, flag flaky summaries, and compare a prompt change with statistical significance before you ship the update.
runs=5 surfaces cases that would look fine on a lucky single run. The FLAKY lines show exactly which prompts are unstable, and exp.compare() tells you whether the prompt change fixed a real regression or just moved noise around.
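To make "statistical significance" concrete, here is a minimal sketch of one common test, a pooled two-proportion z-test on pass rates, using only the standard library. It is not a description of how exp.compare() works internally; the function name and the pass counts are assumptions chosen for illustration. The test also treats every run as independent, which is a simplification when the same case is run several times.

```python
import math


def two_proportion_p_value(pass_a: int, n_a: int, pass_b: int, n_b: int) -> float:
    """Two-sided p-value for the difference between two pass rates."""
    rate_a, rate_b = pass_a / n_a, pass_b / n_b
    pooled = (pass_a + pass_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if std_err == 0:
        return 1.0  # both rates are 0% or 100%: no evidence of a difference
    z = (rate_b - rate_a) / std_err
    # Two-sided tail probability of the standard normal distribution.
    return math.erfc(abs(z) / math.sqrt(2))


# Hypothetical counts: 20 cases x 5 runs = 100 runs per prompt version.
p_real_fix = two_proportion_p_value(pass_a=70, n_a=100, pass_b=90, n_b=100)
p_noise = two_proportion_p_value(pass_a=70, n_a=100, pass_b=74, n_b=100)

ALPHA = 0.05  # conventional significance level, an assumption here
real_fix_is_significant = p_real_fix < ALPHA
noise_is_significant = p_noise < ALPHA
```

A jump from 70 to 90 passes clears the 0.05 threshold, while 70 to 74 does not, so only the first change would count as more than noise under this test.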
Document Intelligence with Schema Validation
Scenario: An invoice extractor should return JSON that matches a strict schema. Most invoices parse correctly, but one layout occasionally causes malformed JSON. You want schema validation to catch the bad output immediately so you can separate parsing failures from downstream faithfulness issues.
The malformed output fails schema_compliance with a concrete JSON parse error, so you can fix formatting before investigating semantic extraction quality. This keeps parse bugs, schema drift, and faithfulness errors from getting mixed together in the same debugging pass.
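The separation between parse failures and schema failures can be sketched with the standard `json` module. The schema, its field names, and the `check_schema` helper below are hypothetical and stand in for whatever the real evaluator uses; the point is only that a parse error is reported on its own, before any field-level checks run.

```python
import json
from typing import Dict, List

# Hypothetical invoice schema: field name -> required Python type.
INVOICE_SCHEMA: Dict[str, type] = {
    "invoice_number": str,
    "total": float,
    "currency": str,
}


def check_schema(raw: str, schema: Dict[str, type]) -> List[str]:
    """Return a list of problems; an empty list means the output complies."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        # Parse failures are reported alone so they are not mistaken
        # for missing fields or wrong types.
        return [f"JSON parse error: {exc.msg} (line {exc.lineno}, column {exc.colno})"]
    if not isinstance(data, dict):
        return ["top-level value is not a JSON object"]
    problems = []
    for field, expected in schema.items():
        if field not in data:
            problems.append(f"missing field: {field}")
        elif not isinstance(data[field], expected):
            actual = type(data[field]).__name__
            problems.append(f"{field}: expected {expected.__name__}, got {actual}")
    return problems


good = '{"invoice_number": "INV-1", "total": 120.5, "currency": "USD"}'
truncated = '{"invoice_number": "INV-1", "total": 120.5,'
wrong_type = '{"invoice_number": "INV-1", "total": "120.5", "currency": "USD"}'

good_problems = check_schema(good, INVOICE_SCHEMA)
truncated_problems = check_schema(truncated, INVOICE_SCHEMA)
wrong_type_problems = check_schema(wrong_type, INVOICE_SCHEMA)
```

Here the truncated payload yields a single parse-error entry, while the payload with a quoted total yields a type mismatch, so the two failure modes stay distinguishable.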
Regulated AI (HIPAA)
Scenario: A clinical triage assistant drafts intake summaries for a nurse review queue. The content still needs to be medically useful, but it also cannot leak regulated identifiers. You want HIPAA-specific PII checks, an audit trail for every run, and tamper-evident records you can hand to compliance.
PIIEvaluator flags the leaked MRN before the response reaches production, and the redacted reason is safe to store in logs. ComplianceReporter then writes an append-only NDJSON record plus a SHA-256 hash line, giving you both a human-readable audit trail and a tamper check.
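The two mechanisms described above, a redacted PII finding and a hash-protected log, can be illustrated with a small standard-library sketch. This is not the PIIEvaluator or ComplianceReporter implementation. The MRN pattern, the helper names, the record fields, and the choice to chain each hash to the previous one are all assumptions, and the log is held in a list here rather than appended to a file.

```python
import hashlib
import json
import re
from typing import List, Optional

# Hypothetical pattern: "MRN" followed by 6 to 10 digits.
MRN_PATTERN = re.compile(r"\bMRN[:#\s]*\d{6,10}\b", re.IGNORECASE)


def find_pii(text: str) -> Optional[str]:
    """Return a redacted reason if an MRN-like identifier is present."""
    match = MRN_PATTERN.search(text)
    if match is None:
        return None
    # The reason records where the match was, never the matched value.
    return f"MRN-like identifier at offset {match.start()} (value redacted)"


def append_audit(log: List[str], record: dict) -> None:
    """Append a JSON record line plus a hash line chained to the previous hash."""
    prev_hash = json.loads(log[-1])["sha256"] if log else ""
    line = json.dumps(record, sort_keys=True)
    digest = hashlib.sha256((prev_hash + line).encode("utf-8")).hexdigest()
    log.append(line)
    log.append(json.dumps({"sha256": digest}))


def verify_audit(log: List[str]) -> bool:
    """Recompute every hash line; any edited record breaks the chain."""
    prev_hash = ""
    for record_line, hash_line in zip(log[0::2], log[1::2]):
        expected = hashlib.sha256((prev_hash + record_line).encode("utf-8")).hexdigest()
        if json.loads(hash_line)["sha256"] != expected:
            return False
        prev_hash = expected
    return True


audit_log: List[str] = []
draft = "Patient MRN: 00482913 reports chest pain for two days."
reason = find_pii(draft)
append_audit(audit_log, {"case": "triage-001", "passed": reason is None, "reason": reason})
append_audit(audit_log, {"case": "triage-002", "passed": True, "reason": None})
```

Because each hash covers the previous hash as well as the current record, editing an earlier record invalidates every hash line that follows it.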
CI/CD Quality Gate
Scenario: You want GitHub Actions to block pull requests when model quality regresses. Some checks should fail immediately via fail_threshold, some should use direct assertions on pass rate, and one should compare the current run to last week’s baseline before merge.
fail_threshold is the fastest way to turn a suite into a deployment gate, while direct assertions give you room for custom policies. The experiment comparison test adds a historical baseline, so you are not just checking “good enough” in isolation; you are checking whether the pull request made the model worse than the last known good run.
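The two kinds of check, an absolute threshold and a comparison against a baseline, can be sketched as one gate function in plain Python. The `gate` helper, the `max_regression` tolerance, and all the numbers are hypothetical; the `fail_threshold` parameter only borrows the name used above and does not reflect the library's actual signature.

```python
from typing import List


def pass_rate(results: List[bool]) -> float:
    """Fraction of cases that passed."""
    return sum(results) / len(results)


def gate(
    current: List[bool],
    baseline_rate: float,
    fail_threshold: float = 0.90,
    max_regression: float = 0.02,
) -> List[str]:
    """Return the reasons to block the merge; an empty list means it can proceed."""
    rate = pass_rate(current)
    failures = []
    # Absolute check: is the suite good enough in isolation?
    if rate < fail_threshold:
        failures.append(f"pass rate {rate:.2f} is below threshold {fail_threshold:.2f}")
    # Relative check: did this change make things worse than the baseline?
    if baseline_rate - rate > max_regression:
        failures.append(f"pass rate {rate:.2f} regressed from baseline {baseline_rate:.2f}")
    return failures


# Hypothetical run: 93 of 100 cases pass, last week's baseline was 0.97.
current_results = [True] * 93 + [False] * 7
failures = gate(current_results, baseline_rate=0.97)

# A CI step would exit with this code so the pull request is blocked on failure.
exit_code = 1 if failures else 0
```

In this example the run clears the 0.90 threshold but still blocks the merge, because it is four points below the baseline. That is the case an absolute threshold alone would miss.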

