Experiment records every suite run locally and lets you compare results across model versions, prompt changes, or time. It needs no cloud service and no account: runs are stored as JSONL files in ~/.multivon/experiments/.

Basic usage

from multivon_eval import EvalSuite, Experiment, Faithfulness, Relevance

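# cases, model_v1, and model_v2 are defined elsewhere:
# your test cases and the two model functions being compared
suite = EvalSuite("rag-pipeline")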
suite.add_cases(cases)
suite.add_evaluators(Faithfulness(), Relevance())

exp = Experiment("rag-pipeline")

# Baseline run
report_v1 = suite.run(model_v1)
run_v1 = exp.record(report_v1, tags={"model": "gpt-4o", "prompt_v": "2"})

# After your prompt or model change
report_v2 = suite.run(model_v2)
run_v2 = exp.record(report_v2, tags={"model": "gpt-4o", "prompt_v": "3"})

# Compare
exp.compare(run_v1, run_v2)

Compare output

============================================================
Experiment comparison: a1b2c3d4 → e5f6g7h8
============================================================

Metric                   Before           After
------------------------------------------------------------
Model                    gpt-4o           gpt-4o
Pass rate                  84.0%  →   91.0%  ↑   +7.0%
Avg score                 0.8210  →   0.8890  ↑  +0.0680
Passed                        42  →       46
Failed                         8  →        4

Evaluator scores         Before           After
------------------------------------------------------------
faithfulness             0.7800  →   0.8600  ↑  +0.0800
relevance                0.9100  →   0.9300  ↑  +0.0200

Verdict: IMPROVED — pass rate up +7.0%
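
The last column of the comparison is the "After" value minus the "Before" value, with an arrow for the direction of change. The library's own formatting code is not shown on this page, so the following is only a sketch that reproduces that column from the numbers in the table above. The helper name format_delta and the "=" marker for an unchanged metric are assumptions made for illustration.

```python
def format_delta(before, after, digits=4):
    """Return an arrow plus the signed change from before to after."""
    # Round first so floating-point noise does not leak into the output.
    change = round(after - before, digits)
    if change > 0:
        arrow = "↑"
    elif change < 0:
        arrow = "↓"
    else:
        arrow = "="  # assumption: marker for "no change"
    return f"{arrow} {change:+.{digits}f}"


# Values taken from the comparison table above.
print("Pass rate   ", format_delta(84.0, 91.0, digits=1))
print("Avg score   ", format_delta(0.8210, 0.8890))
print("faithfulness", format_delta(0.7800, 0.8600))
print("relevance   ", format_delta(0.9100, 0.9300))
```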

View run history

exp.print_history(n=10)

Experiment: rag-pipeline
Run ID       Timestamp              Model                 Pass rate  Avg score Tags
------------------------------------------------------------------------------------------
e5f6g7h8     2026-04-25 15:30:12    gpt-4o                   91.0%     0.8890 prompt_v=3
a1b2c3d4     2026-04-24 10:15:44    gpt-4o                   84.0%     0.8210 prompt_v=2

Tags

Tags are free-form key-value pairs. Use them to track anything that distinguishes one run from another:

exp.record(report, tags={
    "model": "claude-sonnet-4-6",
    "prompt_v": "5",
    "dataset": "v2",
    "deployed": "false",
})

CLI

# List all experiments
multivon-eval experiments list

# Show run history
multivon-eval experiments history rag-pipeline

# Compare two runs
multivon-eval experiments compare rag-pipeline a1b2c3d4 e5f6g7h8

Storage

Runs are stored at ~/.multivon/experiments/<name>.jsonl. Each line is a JSON object holding one run's summary, not the full case-by-case output. To keep the full results, call report.save_json() separately.

# Override storage location
export MULTIVON_HOME=/your/custom/path
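
Because each line of the file is a standalone JSON object, run summaries can be read back with the Python standard library alone. The sketch below is not part of multivon_eval: load_runs is a hypothetical helper. It assumes that MULTIVON_HOME replaces ~/.multivon while keeping the experiments/ subdirectory, which this page does not confirm. It makes no assumption about the field names inside each summary.

```python
import json
import os
from pathlib import Path


def load_runs(name, home=None):
    """Return the run summaries of one experiment as dicts, in file order."""
    # Assumption: MULTIVON_HOME (or the explicit `home` argument) stands in
    # for ~/.multivon and still contains an experiments/ subdirectory.
    base = Path(home or os.environ.get("MULTIVON_HOME") or Path.home() / ".multivon")
    path = base / "experiments" / f"{name}.jsonl"
    if not path.exists():
        return []

    runs = []
    with path.open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:  # skip blank lines
                runs.append(json.loads(line))
    return runs


if __name__ == "__main__":
    for run in load_runs("rag-pipeline"):
        print(run)
```

Inspecting the printed dicts for a real experiment is the quickest way to learn which keys the summaries actually carry.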

CI/CD integration

Track every CI run automatically:

import os
from multivon_eval import Experiment

# suite and model_fn are set up as in Basic usage
exp = Experiment("prod-eval")
report = suite.run(model_fn, fail_threshold=0.85)

exp.record(report, tags={
    "git_sha": os.getenv("GITHUB_SHA", "local"),
    "branch": os.getenv("GITHUB_REF_NAME", "local"),
    "run_number": os.getenv("GITHUB_RUN_NUMBER", "0"),
})

Every CI run is then recorded with its branch and commit, so the pass-rate history can be filtered by either tag.
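
One way to query the recorded history by branch or commit is to filter the stored summaries on their tags. This is a minimal sketch, assuming each summary keeps its tags under a "tags" key. The on-disk schema is not documented on this page, so check a real file first. The helper filter_by_tags, the "run_id" key, and the sample records are all hypothetical.

```python
def filter_by_tags(runs, **wanted):
    """Keep the runs whose tags contain every wanted key/value pair."""
    matches = []
    for run in runs:
        # Assumption: tags live under a "tags" key in each run summary.
        tags = run.get("tags") or {}
        if all(tags.get(key) == value for key, value in wanted.items()):
            matches.append(run)
    return matches


# Hypothetical summaries standing in for parsed JSONL lines.
sample_runs = [
    {"run_id": "a1b2c3d4", "tags": {"branch": "main", "git_sha": "local"}},
    {"run_id": "e5f6g7h8", "tags": {"branch": "feature-x", "git_sha": "local"}},
]

main_runs = filter_by_tags(sample_runs, branch="main")
print([run["run_id"] for run in main_runs])
```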