multivon-eval staleness is the detection layer: a committed
baseline snapshot of every prompt call site in your repo, a read-only report
that tells you exactly which prompts changed since your cases were authored,
and an opt-in provenance layer binding cases to the prompts they exercise.
Shipped in 0.10.0.
Built on the attribution scanner — attribution is the
scanner; staleness is the drift report built on it.
The design rule that survived every round of the pre-implementation
adversarial review: the tool never overclaims what static analysis can
know. Every report opens with a determinacy headline and closes with a
standing blind-spots footer. Those aren’t disclaimers bolted on — they’re
enforced in the output code.
The five drift modes (and which this covers)
Eval suites rot in five distinct ways:- Prompt drift — a prompt changed since the cases testing it were authored.
- Coverage gaps — new prompts shipped with no cases covering them.
- Dead cases — cases point at prompts that no longer exist.
- Shape drift — the suite’s structure (cases, evaluators) changed out from under a pinned run.
- Threshold staleness — calibrated thresholds aging out as models or data shift.
staleness covers modes 1–3 only. Shape drift and threshold staleness
are suite.lock territory (verify_suite_against_lock), and the CLI never
claims them. The two drift detectors stay orthogonal by construction:
stamping provenance never perturbs suite.lock, because the lockfile’s
cases hash excludes metadata by design — pinned by a regression test.
The three commands
multivon-eval staleness — the report
Read-only, zero-arg in a bootstrapped repo. Diffs a live attribution scan
against the committed prompt_baseline.json and joins in per-case
provenance:
--baseline FILE, --cases F.jsonl (repeatable), --suite module:attr
(reads runtime metadata from Python-inline cases), --format text|json|markdown,
--fail-on changed,removed,added, --include-tests, --ignore DIR
(repeatable). By default the scan skips tests/, examples/, vendor/, and
third_party/ — fixture SDK calls would flood the report.
multivon-eval staleness baseline — bless a snapshot
os.replace). It’s named baseline, not .lock, deliberately:
a blessed snapshot you consciously refresh, not a regenerated fingerprint that
must verify. Bootstrap writes one automatically.
multivon-eval staleness stamp — bind cases to sites
--site spec (FILE[::QUALNAME][.ROLE[#POS]]) is resolved against a live
scan — zero or multiple matches is an error listing candidates, never a
guess, and a duplicated prompt fingerprint requires an explicit qualname
anchor. Select cases with --index N (repeatable), --tag T, or --all;
--dry-run and --force (overwrite a malformed/newer existing stamp) round
it out.
The two artifacts
prompt_baseline.json — the repo-level scan snapshot, committed at the
repo root:
git show <sha>:<file> recovers it;
a stored copy would be a second prompt that itself drifts.
metadata["_provenance"] — per-case, inline in the existing free-form
metadata dict under a library-reserved underscore key (no new EvalCase
field):
targets: [] means “authored against repo state SHA X” — honest and
unbound. bound is always "manual" in v1: auto-binding is rejected,
because confidently-wrong links poison every downstream verdict. A
schema_version from a newer release makes the case unreadable — counted,
never fatal, exit code unaffected: a newer teammate’s stamp must not break
an older teammate’s CI.
How matching works, in plain words
- Content-first. A prompt’s identity is its fingerprint (a hash of its text). Line numbers and git SHAs are display-only, never matching inputs — a whitespace refactor of surrounding code or a rebase produces zero false staleness, and a reverted prompt is automatically unchanged again.
- The dynamic gate fires first. A prompt the scanner can’t statically read is UNKNOWN forever rather than fake-fresh — placeholder fingerprints prove only call shape, not content, so comparing them would report a totally rewritten constant as “fresh”. A formerly-static site that became dynamic is UNKNOWN (“prompt moved out of static reach”), never CHANGED, never REMOVED.
- Structural rescue, then honesty. If the fingerprint is gone, the matcher tries the structural anchor (file, qualname, sdk, role) → CHANGED; if only the loose (whitespace-collapsed) fingerprint still matches, it’s labeled formatting-only — flagged, never suppressed. If nothing matches, it’s REMOVED — always with the three-way caveat: feature removed, OR renamed+edited in one commit, OR moved beyond static reach. REMOVED is a prompt to investigate, never an auto-delete signal. There is no fuzzy text-similarity matching: rename+edit in one commit is statically unbridgeable, and the tool says so instead of guessing.
Getting cases stamped
Bootstrap cases — automatic
multivon-eval bootstrap --repo . writes prompt_baseline.json and
stamps every generated case with authored_by="bootstrap", the repo SHA,
and targets=[]. Bindings are never fabricated — bootstrap generates
cases from your product description and traces, and knows nothing about
call sites.Hand-written JSONL — `staleness stamp`
Explicit, opt-in binding via
--site, as above. This is what turns the
coverage number from zero into something meaningful.load_csv reads no metadata.
Documented limitation, not a roadmap item. Unstamped cases of any origin are
first-class: counted and reported with a stamp hint, never guessed at.
CI integration
Default exit is 0 even with findings — CHANGED means “authored against an older prompt, re-run recommended”, never “failing”. Gate per-category when you’re ready:0 clean or report-only, 1 a --fail-on category fired,
2 warn-only (no baseline, unreadable baseline, scanner-version mismatch).
Gating on added (uncovered new prompts) punishes adoption — possible, not
recommended.
The one-line warn-only recipe for GitHub Actions:
--format json emits the full machine-readable report (per-site verdicts
with a confidence field — exact, structural, moved, or ambiguous —
so CI consumers can filter surfaced ambiguity).
Honest scope — the blind spots
The static scan:- sees kwarg-only anthropic/openai/litellm Python call sites only
- does not see the OpenAI Responses API or positional message args
- does not see prompts in YAML/Jinja/templates/files or prompt hubs
- does not see non-Python services
source: "external" — they
report as UNVERIFIABLE, never orphaned. A file the scanner can see but not
parse (syntax error, non-UTF8 encoding) reports as UNSCANNABLE since
0.11.1 — “file exists but could not be parsed — verdict unknown, NOT
removed” — with a warning naming each file, a skipped_files list in the
JSON report, and no --fail-on removed trip. Skipped files are a
report-time concept, never written into baselines.
Runtime recordings
The determinacy gate (0.10.1) measured scanner v3 against five real repos: 20.9% of call sites are statically resolvable — the rest build prompts dynamically and are statically unbridgeable by construction. The runtime prompt recorder (shipped in 0.11.0, designed in multivon-eval#9) is the honest path past that ceiling: during an eval run, an opt-in interceptor records the rendered prompt text per call site, fingerprinted with the samefingerprint_text the static scanner uses. A
**kwargs unpack the scanner can only report as UNKNOWN is, at call time,
real kwargs with real text.
The three trust tiers
The honesty discipline survives the new power — three labeled tiers, never collapsed into one another (the report footer states all three verbatim):- static — the scan proves the prompt text.
- runtime — recordings prove only the renderings observed, not all renderings. Variable renderings per site are a fingerprint SET, and every verdict speaks in OBSERVED k/N language: “current recordings matched k of N previously observed renderings.” A site is never called fresh because one rendering matched.
- templates / external prompts — deferred, unverifiable.
Recording a run
- Opt-in only, zero overhead when off. Importing multivon_eval performs
NO patching — pinned by a fresh-interpreter subprocess test. Recording
method-wraps exactly the three SDK surfaces the static scanner knows
(anthropic
Messages.create, openaichat.completions.create,litellm.completion/acompletion): save original, wrap, restore byte-identical on exit. Missing SDKs are skipped silently. - Recordings stay local in
prompt_recordings.jsonl; no telemetry. Fingerprints only by default — rendered text is stored only behind an explicit--record-text. Storage is append-safe: duplicate (anchor, role, fingerprint) keys merge counts and case_uids on write. - Case binding by observation. A contextvar carries the active
case_uid:EvalSuitebinds it per case from_provenance.case_uid, and the pytest plugin binds the test nodeid per test. Recordings carry the case_uids observed per (anchor, role, fingerprint) — the run knows which sites fired for which case. - Capture scope v1 (honest): string
system=kwargs and stringcontententries inmessages=lists. Content-block lists (vision, tool results) and calls anchored outside the repo root are skipped, not guessed at.
Merging into the baseline
prompt_baseline.json as source:"runtime" records
with fingerprint SETS, stored under a separate runtime_records key.
Merge-only: it never rescans and NEVER touches static records; a static
rescan never discards the runtime tier; re-merging the same recordings file
is idempotent. Runtime-sourced sites then render as a distinct OBSERVED
tier in all three report formats — compared recordings-vs-recordings
(runtime-only sites cannot be compared against a static scan, and the
report says so), always in the k/N language. The determinacy headline gains
a third clause: “K sites observed at runtime.”
Stamping from recordings
--from-recordings prints observed case→site bindings as proposals
(case_uid → anchor + fingerprint with observation counts); it writes only
with explicit --apply --cases F.jsonl, landing targets as
source:"runtime", bound:"observed". Observation removes the fabrication
objection that blocked auto-binding in the 0.10.0 adversarial review — the
human confirmation stays. Runtime-bound targets are verified against
recordings, never against the static scan (where they report
unverifiable [runtime], by rule), and never enter the static coverage
denominator.
What’s deferred
sync— propose-and-review case refresh that consumes the staleness JSON. Never auto-commit, by design. Tracked in multivon-eval#8.- eval-action enforcement — the staleness gate as an Action input with
per-category fail-on; today the
GITHUB_STEP_SUMMARYline above is the documented warn-only path. Tracked in eval-action#1.
See also
- Prompt attribution — the scanner this is built on,
including the v3 detection fixes (aliased/
**kwargs/messages=<var>shapes) and the v4 hardening (NFC-normalized fingerprints, UNSCANNABLE). - Bootstrap — writes the baseline and stamps generated
cases automatically via
--repo. /eval-auditskill — the per-PR audit; staleness is the standing drift report between PRs.- CI/CD integration — wiring multivon-eval into GitHub Actions.

