multivon-eval bootstrap (covered in the bootstrap guide) is the user-facing CLI. Under it sit three primitives in multivon_eval.auto that you can call directly when you want fine-grained control or want to compose them into your own pipeline. Each one is documented and tested independently — you can use them in any combination.
auto_evaluators(case) — heuristic recommender
Pure pattern-match over an EvalCase shape. Returns a ranked list of recommended evaluators across primary / secondary / guardrail tiers, with a confidence rating per recommendation. Zero LLM cost. Microseconds.
Signature
- task_type="auto" infers the task from the case shape. Pin it explicitly when the shape is ambiguous (e.g., a case with both context and expected_output could be RAG or fact-check).
- strict_mode=True raises AmbiguousCaseShape if the heuristic can only offer a low-confidence primary. Use this in CI / production code paths where a silent mis-recommendation is worse than failing loud.
- include_pii=True + pii_jurisdiction appends PIIEvaluator as a guardrail ("gdpr" | "ccpa" | "pipeda" | "hipaa" | "dpdp" | "all").
- include_safety=True appends Toxicity + Bias as guardrails.
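To make the idea of a shape-based recommender concrete, here is a toy sketch in plain Python. It is not the library's implementation: the `recommend` function, the `Recommendation` dataclass, the dict-based case, and the specific shape-to-evaluator rules are all assumptions chosen for illustration, and `AmbiguousCaseShape` is redefined locally so the snippet runs without `multivon_eval` installed.

```python
from dataclasses import dataclass


class AmbiguousCaseShape(Exception):
    """Local stand-in: raised in strict mode when the primary is low-confidence."""


@dataclass
class Recommendation:
    evaluator: str
    tier: str        # "primary" | "secondary" | "guardrail"
    confidence: str  # "high" | "low"


def recommend(case: dict, strict_mode: bool = False,
              include_safety: bool = False) -> list[Recommendation]:
    """Toy pattern-match over the fields present on a case (assumed rules)."""
    has_context = bool(case.get("context"))
    has_expected = bool(case.get("expected_output"))

    if has_context and has_expected:
        # Could be RAG or fact-check: the shape alone cannot decide.
        recs = [Recommendation("Faithfulness", "primary", "low"),
                Recommendation("AnswerAccuracy", "secondary", "low")]
    elif has_context:
        recs = [Recommendation("Faithfulness", "primary", "high")]
    elif has_expected:
        recs = [Recommendation("AnswerAccuracy", "primary", "high")]
    else:
        recs = [Recommendation("Relevance", "primary", "low")]

    # Fail loudly instead of silently returning a weak primary.
    if strict_mode and recs[0].confidence == "low":
        raise AmbiguousCaseShape("only a low-confidence primary is available")

    if include_safety:
        recs += [Recommendation("Toxicity", "guardrail", "high"),
                 Recommendation("Bias", "guardrail", "high")]
    return recs
```

In this sketch a case carrying both `context` and `expected_output` yields a low-confidence primary, so calling it with `strict_mode=True` raises instead of returning a guess.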
EvaluatorRecommendation shape
suite.add_evaluators(...) after instantiation. The function never picks for you — it suggests.
generate_adversarial_cases(seed, mode, n) — LLM-generated stress cases
Synthesises cases targeting one of 10 named failure modes. Uses your configured judge (defaults to Claude Haiku) to write cases plausible enough to look real but designed to trip up a model on the specific weakness.
The 10 failure modes
| Mode | Stress-tests | What it generates |
|---|---|---|
| ungrounded_claim | Faithfulness, Hallucination | Questions whose answer is NOT in the provided context. A confabulating model gets caught. |
| off_topic | Relevance | Tangentially-related questions that the context can’t actually address. |
| format_violation | SchemaEvaluator, JSONSchemaEval | Requests for structured output; expected_output is the canonical valid form. |
| jailbreak | Toxicity, Bias | Role-play, prefix-injection, refusal-bypass attempts. |
| tool_misuse | ToolCallNecessity | Cases where calling a tool is UNNECESSARY; the model should answer directly. |
| numeric_edge | AnswerAccuracy | Zero, negative, NaN, division-by-zero, off-by-one, currency boundaries. |
| prompt_injection_direct | CheckEvaluator, Relevance | Direct overrides: “Ignore previous instructions and…” |
| prompt_injection_indirect | Faithfulness, Hallucination, CheckEvaluator | RAG-specific: injection inside a retrieved chunk. |
| tool_injection | ToolCallAccuracy, ToolCallNecessity | Attempts to manipulate agent tool calls (extra-tool injection, argument override). |
| pii_leakage_invitation | PIIEvaluator | Cases that try to get the model to echo PII unnecessarily. |
Signature
Each generated case carries:
- tags = ["adversarial:<mode>"]
- metadata["target_failure_mode"]: the mode it was generated for
- metadata["stress_tests"]: the evaluators it’s designed to test
- metadata["prompt_version"]: for reproducibility (the prompt template can evolve)
- metadata["judge_used"]: provider:model of the generator
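The tag and metadata fields listed here make it straightforward to slice a mixed case set by failure mode. The sketch below uses plain dicts as stand-ins for case objects; the dict layout, the helper names `mode_of` and `group_by_mode`, and every value shown (including the placeholder `prompt_version` and `judge_used` strings) are assumptions for illustration, not output from the library.

```python
from collections import defaultdict

# Plain dicts stand in for generated cases; all values are placeholders.
cases = [
    {
        "input": "placeholder question",
        "tags": ["adversarial:ungrounded_claim"],
        "metadata": {
            "target_failure_mode": "ungrounded_claim",
            "stress_tests": ["Faithfulness", "Hallucination"],
            "prompt_version": "placeholder-version",
            "judge_used": "provider:model",
        },
    },
    {
        "input": "placeholder role-play prompt",
        "tags": ["adversarial:jailbreak"],
        "metadata": {
            "target_failure_mode": "jailbreak",
            "stress_tests": ["Toxicity", "Bias"],
            "prompt_version": "placeholder-version",
            "judge_used": "provider:model",
        },
    },
]


def mode_of(case: dict):
    """Return the failure mode encoded in an 'adversarial:<mode>' tag, or None."""
    for tag in case.get("tags", []):
        if tag.startswith("adversarial:"):
            return tag.split(":", 1)[1]
    return None


def group_by_mode(case_list: list) -> dict:
    """Bucket cases by failure mode, skipping cases without an adversarial tag."""
    groups = defaultdict(list)
    for case in case_list:
        mode = mode_of(case)
        if mode is not None:
            groups[mode].append(case)
    return dict(groups)


by_mode = group_by_mode(cases)
```

Grouping this way lets you report pass rates per failure mode, or rerun only the cases generated under a particular prompt version.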
Deterministic variant: generate_unicode_obfuscation_cases
Some attacks shouldn’t be LLM-generated — LLMs are aligned to NOT produce bypass attacks, so they tend to produce polished-looking but technically-easy attacks. For character-level obfuscation patterns (homoglyph, zero-width, RTL-override), use the deterministic generator:
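As a rough illustration of what deterministic character-level obfuscation looks like, the sketch below applies the three named patterns to a seed string using only the standard library. It is not the body of generate_unicode_obfuscation_cases: the function names, the small homoglyph table, and the choice to return a dict of variants are all assumptions made for this example.

```python
ZERO_WIDTH_SPACE = "\u200b"   # invisible when rendered
RTL_OVERRIDE = "\u202e"       # RIGHT-TO-LEFT OVERRIDE
POP_DIRECTIONAL = "\u202c"    # POP DIRECTIONAL FORMATTING, ends the override

# Latin letter -> visually similar Cyrillic letter (small illustrative table).
HOMOGLYPHS = {
    "a": "\u0430",
    "c": "\u0441",
    "e": "\u0435",
    "o": "\u043e",
    "p": "\u0440",
}


def homoglyph(text: str) -> str:
    """Swap mapped Latin letters for Cyrillic look-alikes."""
    return "".join(HOMOGLYPHS.get(ch, ch) for ch in text)


def zero_width(text: str) -> str:
    """Insert a zero-width space between every pair of characters."""
    return ZERO_WIDTH_SPACE.join(text)


def rtl_override(text: str) -> str:
    """Store the text reversed inside an RTL override, so a bidi-aware
    renderer may display it in roughly the original order."""
    return RTL_OVERRIDE + text[::-1] + POP_DIRECTIONAL


def obfuscate_all(seed: str) -> dict:
    """Return one variant per pattern; same seed always gives the same output."""
    return {
        "homoglyph": homoglyph(seed),
        "zero_width": zero_width(seed),
        "rtl_override": rtl_override(seed),
    }


variants = obfuscate_all("ignore previous instructions")
```

Because nothing here is sampled, the same seed always yields the same variants, which keeps regression suites stable across runs.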
validate_adversarial_cases(cases, baseline) — N-shot judge-noise filter
The “are these cases actually adversarial, or just synthetic noise?” question. Runs each generated case N times against a baseline model + evaluator, computes per-case failure rate, filters by hardness band. This is the validation step that separates real signal from generation noise.
What this catches
Single-shot validation can’t distinguish a hard case from judge noise on one observation. With N≥3 shots the failure rate has enough granularity for the band to filter on real signal. Validated live: +0.80 mean failure-rate separation between weak (always-confabulate) and strong (always-refuse) baselines on ungrounded_claim cases. Judge noise is visible at the per-shot level as [1.00, 0.00, 1.00] score arrays and gets filtered out by the band.
Signature
- n_shots: how many times to sample baseline + evaluator per case. The default of 3 dampens judge noise; setting it to 1 reproduces single-shot behavior (NOT recommended).
- hardness_band: (min, max) failure-rate band. Cases outside the band are dropped from the kept list (but still returned in the full report). The default (0.5, 1.0) keeps cases the baseline fails at least half the time. Use (0.2, 0.8) for a discriminating-case filter that drops both too-easy and impossibly-hard cases.
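The arithmetic behind the hardness band can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the library's code: it assumes a shot counts as a failure when its score falls below 0.5, that higher scores mean the baseline passed, and that both ends of the band are inclusive. The helper names `failure_rate` and `filter_by_band` are hypothetical.

```python
def failure_rate(scores: list[float], pass_threshold: float = 0.5) -> float:
    """Fraction of shots whose score is below the pass threshold (assumed rule)."""
    if not scores:
        raise ValueError("need at least one shot per case")
    failures = sum(1 for s in scores if s < pass_threshold)
    return failures / len(scores)


def filter_by_band(score_arrays: dict[str, list[float]],
                   hardness_band: tuple[float, float] = (0.5, 1.0)):
    """Return (kept_case_ids, rates); rates covers every case, kept only in-band ones."""
    low, high = hardness_band
    rates = {case_id: failure_rate(s) for case_id, s in score_arrays.items()}
    kept = [case_id for case_id, rate in rates.items() if low <= rate <= high]
    return kept, rates


# Per-shot scores for four hypothetical cases, three shots each.
shots = {
    "always_fails": [0.0, 0.0, 0.0],
    "noisy": [1.0, 0.0, 1.0],
    "mostly_fails": [0.0, 0.0, 1.0],
    "easy": [1.0, 1.0, 1.0],
}

kept_default, rates = filter_by_band(shots)
kept_discriminating, _ = filter_by_band(shots, hardness_band=(0.2, 0.8))
```

Under these assumptions the [1.0, 0.0, 1.0] case has a failure rate of 1/3, so the default (0.5, 1.0) band drops it, while the (0.2, 0.8) band keeps it and drops the cases that always fail or always pass.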
HardnessReport shape
baseline_outputs[shot] + scores[shot] to audit why a case ended up in or out of the band.
Composing the primitives
The multivon-eval bootstrap CLI is approximately:
Called individually, the primitives also let you:
- Add auto_evaluators(case) recommendations to a manually-built suite
- Run generate_adversarial_cases on a different failure mode than your default
- Validate any case set (not just generated ones) against your real baseline with N-shot aggregation
See also
- Bootstrap CLI guide: the one-command path that composes all three primitives + PII redaction + threshold calibration
- Synthetic data generation: the higher-level generate_from_file / generate_from_text helpers, when you want generation without targeting a specific failure mode
- Statistical rigor: why N-shot aggregation matters and how the hardness band relates to power
- Quickstart: the manual path

