multivon-eval attribution is the structured-diff substrate that the
/eval-audit skill uses to figure out which prompts
actually changed in a PR before deciding what to re-run. It walks a Python
repo with an AST visitor, finds every anthropic.messages.create /
openai.chat.completions.create / litellm.completion call site, extracts
the string-literal prompts, fingerprints them, and emits a
PR-comment-ready Markdown diff.
Shipped in 0.9.4.
Implementation lives in
multivon_eval/attribution/
— five small modules, no extra dependencies, pure stdlib.
This is Phase 1 — descriptive, not causal. Output says “these N prompts
changed between base and head”; it never says “this prompt change caused this
regression.” The hardened calibration spike of 2026-05-30 showed Haiku-based
causal attribution failing catastrophically on mixed-cause regressions
(14% HIGH-confidence-and-wrong rate). Causal attribution is gated on a
future non-prompt-change sidecar signal — see the package docstring for
the gate.
When to use it
- In CI, on every PR: run `attribution diff base head` and post the Markdown output as a PR comment. Reviewers see exactly which prompts changed without scrolling the unified diff.
- From /eval-audit: the skill calls `scan` to identify which call sites the PR touched, then runs only the eval cases that stress those surfaces.
- One-off audits: `attribution scan .` lists every prompt literal in your codebase. Useful when onboarding to a repo you didn’t write.
What it captures
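The AST extractor in `ast_extractor.py` matches three call shapes:
- `anthropic.messages.create`
- `openai.chat.completions.create`
- `litellm.completion`
Calls with no `system` or `messages` kwarg are silently dropped.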
Literals captured as static:
- Plain string literals: `system="..."`
- f-strings with zero runtime interpolation (their content is fully known at parse time)
Anything else (Name references, Attribute lookups, runtime f-strings,
`.join(...)` calls) is recorded as a `PromptRecord` with
`is_dynamic=True` and a placeholder text. The count is preserved and the
gap is visible in the diff, but the actual text isn’t compared.
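
The static-versus-dynamic split described above can be sketched with the standard `ast` module. This is a minimal illustration, not the code in `ast_extractor.py`: the helper names, the `<dynamic>` placeholder string, and the decision to look only at the `system` kwarg (ignoring `messages`) are assumptions made to keep the example short.

```python
import ast

# Assumption: SDK clients are usually bound to a variable, so client calls are
# matched by the trailing attribute chain rather than by the module name.
CLIENT_SUFFIXES = (".messages.create", ".chat.completions.create")


def dotted_name(node):
    """Return 'a.b.c' for a Name/Attribute chain, or '' for anything else."""
    parts = []
    while isinstance(node, ast.Attribute):
        parts.append(node.attr)
        node = node.value
    if isinstance(node, ast.Name):
        parts.append(node.id)
        return ".".join(reversed(parts))
    return ""


def is_prompt_call(func):
    name = dotted_name(func)
    return name == "litellm.completion" or name.endswith(CLIENT_SUFFIXES)


def static_text(node):
    """Return the prompt text if it is fully known at parse time, else None."""
    if isinstance(node, ast.Constant) and isinstance(node.value, str):
        return node.value
    if isinstance(node, ast.JoinedStr):
        # An f-string without {placeholders} holds only Constant parts.
        if all(isinstance(part, ast.Constant) for part in node.values):
            return "".join(part.value for part in node.values)
    return None


def extract_system_prompts(source):
    """Return sorted (line, is_dynamic, text) tuples for each `system=` kwarg."""
    found = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Call) and is_prompt_call(node.func):
            for kw in node.keywords:
                if kw.arg == "system":
                    text = static_text(kw.value)
                    is_dynamic = text is None
                    found.append((node.lineno, is_dynamic, "<dynamic>" if is_dynamic else text))
    return sorted(found)
```

A Name reference such as `system=PROMPT` or an f-string with a placeholder comes back with `is_dynamic` set and the placeholder text, so the call site is still counted even though its content is unknown.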
attribution scan
Walks a Python repo and lists every prompt call site it finds.
The `[dynamic]` marker in the output flags call sites where the prompt is
built at runtime; the extractor records the location but can’t compare
the text across refs.
JSON output is a flat list of records — fingerprint, file/line, sdk,
role, role position, qualname, and a 200-char text preview. Suitable for
piping into jq or feeding into a CI step that wants to fail on
unfingerprinted prompts.
Walk skips .venv/, node_modules/, __pycache__/, and other build
directories automatically.
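
The directory skipping can be pictured as pruning `os.walk` in place. The sketch below is an assumption about the general approach, not the shipped walker: the function name is hypothetical, and the skip set lists only the three directories named above, while the real list is longer.

```python
import os

# Assumption: only the directories named in the text; the real skip list has more.
SKIP_DIRS = {".venv", "node_modules", "__pycache__"}


def iter_python_files(root):
    """Yield paths of .py files under root, relative to root, in a stable order."""
    for dirpath, dirnames, filenames in os.walk(root):
        # Assigning to dirnames[:] prunes the walk, so skipped trees are never entered.
        dirnames[:] = sorted(d for d in dirnames if d not in SKIP_DIRS)
        for name in sorted(filenames):
            if name.endswith(".py"):
                yield os.path.relpath(os.path.join(dirpath, name), root)
```

Pruning at walk time, rather than filtering the results afterwards, avoids descending into large dependency trees at all.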
attribution diff
Compute the structured prompt diff between two repo checkouts.
--format markdown emits a PR-comment-ready block. The
render.py
output truncates each prompt body to the first six lines or 400 chars
(whichever hits first) with an explicit … (truncated) marker — so the
PR comment stays readable on a long system prompt.
Diffs are ordered: modified first, then added, then removed, then
dynamic, sorted by call_site_id within each group. That order is
stable across runs and surfaces the highest-signal changes (modified
literals) first.
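
The truncation and ordering rules can be sketched in a few lines. This is an illustration rather than the code in `render.py`: the function names are hypothetical, putting the marker on its own line is an assumption, and diffs are modeled as plain `(change_type, call_site_id)` pairs.

```python
TRUNCATION_MARKER = "… (truncated)"

# Group order from the text: modified, added, removed, dynamic.
GROUP_ORDER = {"modified": 0, "added": 1, "removed": 2, "dynamic": 3}


def truncate_prompt(text, max_lines=6, max_chars=400):
    """Keep the first max_lines lines or max_chars characters, whichever hits first."""
    lines = text.splitlines()
    clipped = "\n".join(lines[:max_lines])
    truncated = len(lines) > max_lines
    if len(clipped) > max_chars:
        clipped = clipped[:max_chars]
        truncated = True
    # Assumption: the marker goes on its own line after the clipped body.
    return clipped + ("\n" + TRUNCATION_MARKER if truncated else "")


def sort_diffs(diffs):
    """Order (change_type, call_site_id) pairs by group, then by call_site_id."""
    return sorted(diffs, key=lambda d: (GROUP_ORDER[d[0]], d[1]))
```

Because the sort key is fully determined by the change type and the call site id, two runs over the same pair of refs produce the same comment body.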
How identity works
Two `PromptRecord`s are considered “the same call site across refs” iff
(file_path, line, sdk, role, role_position) all match. This is the
Tier-1 identity used by diff_records. A renamed file or a shifted line
will register as removed + added rather than modified — Tier-2
file-move / call-site-shift detection is a future addition (see
schema.py).
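
To make the consequence of Tier-1 identity concrete, here is a small sketch. `Record` is a hypothetical stand-in for `PromptRecord` that carries only the five identity fields plus a fingerprint; the real dataclass has more fields, and `match_call_sites` is not part of the package.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:  # hypothetical stand-in for PromptRecord
    file_path: str
    line: int
    sdk: str
    role: str
    role_position: int
    fingerprint: str


def identity(rec):
    """Tier-1 identity: the five-field tuple named in the text."""
    return (rec.file_path, rec.line, rec.sdk, rec.role, rec.role_position)


def match_call_sites(base, head):
    """Split identities into (shared, only_in_base, only_in_head)."""
    base_ids = {identity(r) for r in base}
    head_ids = {identity(r) for r in head}
    return sorted(base_ids & head_ids), sorted(base_ids - head_ids), sorted(head_ids - base_ids)
```

With a base record at `app.py` line 10 and an otherwise identical head record at line 12, `match_call_sites` reports nothing shared, one identity only in base, and one only in head. That is the removed + added pair described above.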
change_type values
| Type | Meaning |
|---|---|
| `modified` | Both refs have a static record at the same call site; fingerprints differ. |
| `added` | The call site exists only in head. |
| `removed` | The call site exists only in base. |
| `dynamic` | Either record has `is_dynamic=True` and the recorded text differs. The actual prompt cannot be reliably compared. |
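
The table reads as a small decision function over the base and head records at one call site. The sketch below is an interpretation of the table, not the body of `diff_records`: `Record` is a hypothetical stand-in, and the SHA-256 fingerprint is an assumption, since the actual fingerprint scheme is not described here.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Record:  # hypothetical stand-in for PromptRecord
    text: str
    is_dynamic: bool = False

    @property
    def fingerprint(self) -> str:
        # Assumption: SHA-256 of the UTF-8 text; the real scheme may differ.
        return hashlib.sha256(self.text.encode("utf-8")).hexdigest()


def classify(base: Optional[Record], head: Optional[Record]) -> Optional[str]:
    """Return the change_type for one call site, or None when nothing changed."""
    if base is None and head is None:
        raise ValueError("a call site must exist in at least one ref")
    if base is None:
        return "added"
    if head is None:
        return "removed"
    if base.is_dynamic or head.is_dynamic:
        return "dynamic" if base.text != head.text else None
    return "modified" if base.fingerprint != head.fingerprint else None
```

Note that a literal replaced by a runtime-built prompt lands in `dynamic` rather than `modified`, because one side can no longer be compared by content.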
Programmatic API
Same surface as the CLI, importable from any Python code.
`scan` returns `list[PromptRecord]`. `diff_records` returns
`list[PromptDiff]`. `render_markdown` emits the same string the CLI
prints with `--format markdown`. The dataclasses are frozen and exposed
from the package root, so you can build your own renderer if the bundled
one doesn’t fit your PR-comment style.
Wiring it into CI
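The two-line GitHub Actions step lives in `.github/workflows/eval-pr.yml`
and assumes the workflow already checks out base and head into separate
directories: run `attribution diff base head --format markdown`, then
post the output as a PR comment.
The eval-action GitHub Action wraps `attribution scan|diff` with the
suite runner and the calibrated threshold gate.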
What’s next
- The Phase 2 sidecar that gates causal attribution on a non-prompt-change signal is under design; see the package docstring in `attribution/__init__.py`.
- Tier-2 identity (file-move / line-shift detection) is on the roadmap; until it lands, a refactor that moves a prompt to a different line will show up as removed + added.
See also
- /eval-audit skill: the PR-gating workflow that consumes `attribution scan` to scope which cases to re-run.
- eval-action on GitHub: the GitHub Action that wraps this CLI plus the suite runner.
- Bootstrap workflow: generate the suite that /eval-audit runs.
- Source on GitHub: five small files, ~400 lines total.

