Experiment records every suite run locally and lets you compare results across model versions, prompt changes, or time. It needs no cloud service and no account: runs are stored as JSONL in ~/.multivon/experiments/.
Basic usage
from multivon_eval import EvalSuite, Experiment, Faithfulness, Relevance

# cases, model_v1, and model_v2 are assumed to be defined elsewhere:
# your test cases and the two models under comparison.
suite = EvalSuite("rag-pipeline")
suite.add_cases(cases)
suite.add_evaluators(Faithfulness(), Relevance())

exp = Experiment("rag-pipeline")

# Baseline run
report_v1 = suite.run(model_v1)
run_v1 = exp.record(report_v1, tags={"model": "gpt-4o", "prompt_v": "2"})

# After your prompt or model change
report_v2 = suite.run(model_v2)
run_v2 = exp.record(report_v2, tags={"model": "gpt-4o", "prompt_v": "3"})

# Compare
exp.compare(run_v1, run_v2)
Compare output
============================================================
Experiment comparison: a1b2c3d4 → e5f6g7h8
============================================================
Metric              Before      After
------------------------------------------------------------
Model               gpt-4o      gpt-4o
Pass rate           84.0%    →  91.0%    ↑ +7.0%
Avg score           0.8210   →  0.8890   ↑ +0.0680
Passed              42       →  46
Failed              8        →  4

Evaluator scores    Before      After
------------------------------------------------------------
faithfulness        0.7800   →  0.8600   ↑ +0.0800
relevance           0.9100   →  0.9300   ↑ +0.0200

Verdict: IMPROVED — pass rate up +7.0%
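Each row in the comparison is a before value, an after value, and a signed difference. The sketch below only illustrates that arithmetic and formatting. It is not the library's own code, and the helper name describe_change is made up for this example.

```python
def describe_change(before: float, after: float, digits: int = 4) -> str:
    """Format a before/after pair with a direction arrow and a signed delta.

    Hypothetical helper for illustration; not part of multivon_eval.
    """
    # Round first so float noise (e.g. 0.06800000000000006) cannot flip the arrow.
    delta = round(after - before, digits)
    if delta > 0:
        arrow = "↑"
    elif delta < 0:
        arrow = "↓"
    else:
        arrow = "="
    return f"{before:.{digits}f} → {after:.{digits}f} {arrow} {delta:+.{digits}f}"


print(describe_change(0.8210, 0.8890))  # 0.8210 → 0.8890 ↑ +0.0680
print(describe_change(0.7800, 0.8600))  # 0.7800 → 0.8600 ↑ +0.0800
```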
View run history
Experiment: rag-pipeline
Run ID      Timestamp             Model     Pass rate   Avg score   Tags
------------------------------------------------------------------------------------------
e5f6g7h8    2026-04-25 15:30:12   gpt-4o    91.0%       0.8890      prompt_v=3
a1b2c3d4    2026-04-24 10:15:44   gpt-4o    84.0%       0.8210      prompt_v=2
Tags are free-form key-value pairs — use them to track anything meaningful:
exp.record(report, tags={
    "model": "claude-sonnet-4-6",
    "prompt_v": "5",
    "dataset": "v2",
    "deployed": "false",
})
CLI
# List all experiments
multivon-eval experiments list
# Show run history
multivon-eval experiments history rag-pipeline
# Compare two runs
multivon-eval experiments compare rag-pipeline a1b2c3d4 e5f6g7h8
Storage
Runs are stored at ~/.multivon/experiments/<name>.jsonl. Each line is a JSON object with the run summary — not the full case-by-case output. Use report.save_json() separately if you want the full results.
# Override storage location
export MULTIVON_HOME=/your/custom/path
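Because each run file is plain JSONL, you can also read it with the standard library alone. The sketch below uses no multivon_eval API and assumes no field names; load_runs is a hypothetical helper that returns one dict per line. The example path is the default location described above for an experiment named rag-pipeline.

```python
import json
from pathlib import Path


def load_runs(path):
    """Return one dict per non-empty line of a JSONL run file.

    Hypothetical helper for illustration; not part of multivon_eval.
    """
    runs = []
    with Path(path).open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:  # tolerate blank lines
                runs.append(json.loads(line))
    return runs


if __name__ == "__main__":
    # Default location from this page; adjust if you set MULTIVON_HOME.
    run_file = Path.home() / ".multivon" / "experiments" / "rag-pipeline.jsonl"
    if run_file.exists():
        for run in load_runs(run_file):
            # Print the field names actually present rather than assuming a schema.
            print(sorted(run))
```

Printing the keys first is a cheap way to learn the record layout before writing anything that depends on specific fields.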
CI/CD integration
Track every CI run automatically:
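import os
from multivon_eval import Experiment

# suite is assumed to be built as in Basic usage; model_fn is the model under test.
exp = Experiment("prod-eval")
report = suite.run(model_fn, fail_threshold=0.85)

# Tag the run with the GitHub Actions context, falling back to local defaults.
exp.record(report, tags={
    "git_sha": os.getenv("GITHUB_SHA", "local"),
    "branch": os.getenv("GITHUB_REF_NAME", "local"),
    "run_number": os.getenv("GITHUB_RUN_NUMBER", "0"),
})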
Because every run carries its branch and commit SHA as tags, this builds a history of pass rates across CI runs that you can look up by branch or commit.
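One way to look up runs by branch is to filter the stored records on a tag value. The sketch below is only an illustration. It assumes each record keeps its tags in a dict under a "tags" key, which this page does not confirm, and both filter_by_tag and the sample records are hypothetical.

```python
def filter_by_tag(runs, key, value):
    """Keep the run records whose tags contain key == value.

    ASSUMPTION: each record stores its tags as a dict under "tags".
    Hypothetical helper for illustration; not part of multivon_eval.
    """
    return [r for r in runs if (r.get("tags") or {}).get(key) == value]


# Hypothetical records, shaped only for this illustration.
sample = [
    {"tags": {"branch": "main", "git_sha": "abc123"}},
    {"tags": {"branch": "feature-x", "git_sha": "def456"}},
    {"tags": {"branch": "main", "git_sha": "789abc"}},
]

main_runs = filter_by_tag(sample, "branch", "main")
print([r["tags"]["git_sha"] for r in main_runs])  # ['abc123', '789abc']
```

Check the actual record layout in your own JSONL file before relying on the "tags" key.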