eval-explain

eval-explain closes the “black box recommender” gap in developer experience (DX). eval-bootstrap picks evaluators based on the inferred shape of your product and traces; this skill explains why a specific evaluator showed up. The answer is exactly three sentences plus an optional one-case illustration: short enough to read inline, dense enough to act on.

When it auto-invokes

The skill auto-invokes in two situations:

Right after eval-bootstrap completes. It surfaces the rationale for the top one or two evaluators that were picked, so you understand what you just got without scrolling through DISCOVERY_REPORT.md.
On user phrases, any of:
- “why did multivon recommend X”
- “what does X evaluator do”
- “is X the right eval for my use case”
- “explain this threshold”

allowed-tools

allowed-tools: Read, Grep, WebFetch

The skill reads DISCOVERY_REPORT.md and seed_cases.jsonl locally, greps the evaluator class for its docstring, and optionally fetches the methodology or benchmark page for cross-reference. It never writes files and never executes the eval suite — explanation only.

What it does

Locate the source of truth

The skill first checks for DISCOVERY_REPORT.md next to eval_suite.py (written by eval-bootstrap) and reads the rationale block for the named evaluator. If no report exists, it falls back to the evaluator’s docstring:

python -c "from multivon_eval import Faithfulness; print(Faithfulness.__doc__)"
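The lookup order described above can be sketched in a few lines of Python. This is a minimal sketch, not the skill's implementation: `load_source_of_truth` and `DummyEvaluator` are hypothetical names, and the report is returned whole because the layout of its rationale blocks is not specified on this page.

```python
import inspect
import tempfile
from pathlib import Path


def load_source_of_truth(suite_dir: Path, evaluator_cls: type) -> tuple[str, str]:
    """Return (source, text): the report if it exists, else the class docstring.

    Hypothetical helper, written only to illustrate the fallback order.
    """
    report = suite_dir / "DISCOVERY_REPORT.md"
    if report.is_file():
        # Preferred source: the report that sits next to eval_suite.py.
        return "report", report.read_text(encoding="utf-8")
    # Fallback: the evaluator's docstring (inspect.getdoc also dedents it).
    return "docstring", inspect.getdoc(evaluator_cls) or ""


class DummyEvaluator:
    """Stand-in for a real evaluator class such as Faithfulness."""


with tempfile.TemporaryDirectory() as tmp:
    project = Path(tmp)
    # No report yet, so the docstring is used.
    source_before, text_before = load_source_of_truth(project, DummyEvaluator)
    # Once a report exists, it takes precedence. The content is made up.
    (project / "DISCOVERY_REPORT.md").write_text(
        "## Faithfulness\nPicked for RAG-shaped traces.\n", encoding="utf-8"
    )
    source_after, text_after = load_source_of_truth(project, DummyEvaluator)

print(source_before, "->", source_after)
```

Returning the source label alongside the text lets a caller say whether an explanation came from the report or from the docstring.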

Find an example case

The skill then locates one or two example cases that exercise the evaluator. It takes them from seed_cases.jsonl if present, or generates one via generate_hallucination_pairs for hallucination / faithfulness evaluators.
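As an illustration of the seed-case lookup, the sketch below scans a JSONL file and keeps the first one or two rows that carry every field the evaluator needs. The field names (`input`, `context`, `output`) mirror the example case on this page but are an assumption about the seed_cases.jsonl schema, and `find_example_cases` is a hypothetical helper, not part of multivon_eval.

```python
import json
import tempfile
from pathlib import Path


def find_example_cases(path: Path, required_fields: set[str], limit: int = 2) -> list[dict]:
    """Return up to `limit` rows that contain every required field."""
    cases: list[dict] = []
    if not path.is_file():
        return cases  # a caller would fall back to generating a case
    with path.open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines in the file
            row = json.loads(line)
            if required_fields.issubset(row.keys()):
                cases.append(row)
                if len(cases) == limit:
                    break
    return cases


# Made-up rows, only to exercise the helper.
rows = [
    {"input": "hi", "output": "hello"},  # no context: skipped
    {"input": "q1", "context": "c1", "output": "a1"},
    {"input": "q2", "context": "c2", "output": "a2"},
    {"input": "q3", "context": "c3", "output": "a3"},  # beyond the limit
]

with tempfile.TemporaryDirectory() as tmp:
    seed_path = Path(tmp) / "seed_cases.jsonl"
    seed_path.write_text("\n".join(json.dumps(r) for r in rows) + "\n", encoding="utf-8")
    found = find_example_cases(seed_path, {"input", "context", "output"})
    missing = find_example_cases(Path(tmp) / "absent.jsonl", {"input"})

print([case["input"] for case in found])
```

An empty result (for instance, when the file is absent) is the signal to generate a case instead.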

Answer in exactly 3 sentences

The skill is constrained to a three-sentence shape:

What the evaluator measures (paraphrase the docstring, do not quote it).
Why bootstrap picked it for this project — citing the trace pattern or product-shape signal that drove the pick.
What alternatives exist and when you would use them instead.
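One way to picture the three-sentence constraint is as a record with exactly three slots, one per sentence. The `Explanation` class below is a hypothetical illustration of that shape, not a type exposed by the skill; the sample sentences are shortened from the example output on this page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Explanation:
    measures: str      # sentence 1: what the evaluator measures
    why_picked: str    # sentence 2: the signal that drove the pick
    alternatives: str  # sentence 3: other evaluators and when to use them

    def render(self) -> str:
        """Join the three sentences into the inline answer."""
        return " ".join((self.measures, self.why_picked, self.alternatives))


answer = Explanation(
    measures="Faithfulness measures whether the answer is grounded in the retrieved context.",
    why_picked="Bootstrap picked it because your traces contain a context field on every row.",
    alternatives="If your context were short, Hallucination would be the better pick.",
)

print(answer.render())
```

Because every field is required, an answer that skips the alternatives, or adds a fourth point, cannot be constructed.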

Example prompt and output

You:

Why did multivon recommend Faithfulness here?

Skill output:

Faithfulness measures whether your agent’s answer is grounded in the retrieved context: it generates yes/no questions about claims in the answer and scores by the fraction the context supports. Bootstrap picked it because your traces contain a context field on every row (RAG shape) and your product description mentions “answers from our internal docs.” If your context were short (1-2 sentences), Hallucination would be the better pick; if it were long-form generated text without retrieval, Coherence + AnswerAccuracy would be the right pair instead.

Example case:
input: "what is the company's return policy?"
context: "Returns accepted within 30 days with receipt..."
output: "We accept returns within 60 days with no receipt."
expected: faithfulness LOW — output contradicts context on both
          the time window and the receipt requirement
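To make the scoring in the sample output concrete, here is a toy calculation for the example case. It assumes the score is simply the number of supported claims divided by the total number of claims, as the description above suggests. The claim list and verdicts are hand-written for illustration, whereas the evaluator is described as generating yes/no questions to obtain them, and `faithfulness_score` is a hypothetical name.

```python
def faithfulness_score(verdicts: list[bool]) -> float:
    """Fraction of claims in the answer that the context supports."""
    if not verdicts:
        raise ValueError("need at least one claim to score")
    return sum(verdicts) / len(verdicts)


# Claims taken from the output, each with a hand-labeled verdict:
# does the context support it?
claims = {
    "Returns are accepted within 60 days": False,  # context says 30 days
    "No receipt is required": False,               # context requires a receipt
}

score = faithfulness_score(list(claims.values()))
print(score)  # 0.0
```

Both claims contradict the context, so the fraction is 0 out of 2, which matches the expected LOW result.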

What it doesn’t do

Doesn’t argue with your choice. If you want to override bootstrap’s recommendation, the skill surfaces why bootstrap picked X and then steps back. You have context bootstrap doesn’t.
Doesn’t lecture about statistical methodology. Explanation stays scoped to the chosen evaluator’s behavior — not eval theory.
Doesn’t generate cases you didn’t ask for. The example case is illustrative, not material you have to keep.
