multivon-eval simulate (0.12.0) drives the conversation
live instead: a persona LLM with a profile, a goal, and behavior traits
generates each user turn in response to what your system actually said.
What you get
One result per persona: the full transcript, a stop reason (goal_reached / max_turns / assistant_refused / budget_exceeded /
driver_error), a goal-completion verdict judged against the persona’s
success_criteria, and scores from the conversation evaluators
(ConversationRelevance, KnowledgeRetention, TurnConsistency) over the
transcript.
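As a rough illustration of how a live-driven run ends up with a transcript and a stop reason, here is a minimal sketch. The names simulate, persona_turn, and goal_reached are hypothetical and not part of multivon-eval. In the real tool the user turns come from a persona LLM and goal completion is judged against success_criteria; this sketch uses plain callables and models only two of the stop reasons.

```python
def simulate(persona_turn, model_fn, goal_reached, max_turns=5):
    """Drive one conversation live; all names here are illustrative assumptions."""
    transcript = []  # list of (role, text) pairs
    for _ in range(max_turns):
        # The persona reacts to what the system actually said so far.
        user_msg = persona_turn(transcript)
        rendered = "\n".join(
            f"{role}: {text}" for role, text in transcript + [("user", user_msg)]
        )
        reply = model_fn(rendered)
        transcript += [("user", user_msg), ("assistant", reply)]
        if goal_reached(transcript):
            return {"transcript": transcript, "stop_reason": "goal_reached"}
    # The turn limit ran out before the goal check passed.
    return {"transcript": transcript, "stop_reason": "max_turns"}
```

A scripted persona_turn and a stub model_fn are enough to exercise both exits without calling any LLM.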
Personas
A persona is four fields plus traits. --propose-from PRODUCT.md generates a diverse
set with one LLM call; the proposal prompt always demands at least one
persona with an adversarial trait.
The honesty contract
Simulation output is synthetic, and the tool never lets you forget it:
- Every result and report carries “simulated personas — measures behavior under synthetic users, not real traffic.” This is test-pinned behavior, not copy.
- Hard budget ceiling. The spend estimate prints before the first call; hitting --budget mid-run stops cleanly: completed transcripts are returned, and cut-off personas carry stop_reason="budget_exceeded". No partial work is ever lost to an exception.
- No determinism claims. Persona proposal is seeded; conversation turns are stochastic. The judge model and temperature are recorded in every result’s metadata so a run is at least describable.
- A driver_error on one persona never kills the run; it is recorded and the next persona proceeds.
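The budget ceiling and the per-persona error isolation described above can be pictured with a small loop. This is a sketch under stated assumptions, not the tool's implementation: SimResult, run_personas, converse, and the flat cost_per_turn are all hypothetical stand-ins for whatever multivon-eval does internally.

```python
from dataclasses import dataclass, field

@dataclass
class SimResult:  # hypothetical result shape, for illustration only
    persona: str
    transcript: list = field(default_factory=list)
    stop_reason: str = ""
    error: str = ""

def run_personas(personas, converse, cost_per_turn, budget, max_turns=3):
    """converse(persona, transcript) -> (user_msg, reply); it may raise."""
    spent = 0.0
    results = []
    for persona in personas:
        result = SimResult(persona=persona)
        try:
            for _ in range(max_turns):
                # Check the ceiling before spending, so it is never exceeded.
                if spent + cost_per_turn > budget:
                    result.stop_reason = "budget_exceeded"
                    break
                turn = converse(persona, result.transcript)
                spent += cost_per_turn
                result.transcript.append(turn)
            else:
                result.stop_reason = "max_turns"
        except Exception as exc:
            # One failing persona is recorded; the loop moves on to the next.
            result.stop_reason = "driver_error"
            result.error = str(exc)
        results.append(result)
    return results, spent
```

Because every persona appends a result whether it finished, was cut off, or failed, the caller always gets back the transcripts collected so far.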
Simulation with provenance
Each conversation binds its case_uid through the same contextvar the
runtime recorder uses. Run your simulation under
pytest --record-prompts (or wrap it in record_prompts()) and the prompts
your system rendered during each simulated conversation are captured and
bound to that conversation’s case — observed case→site bindings, the same
propose-only discipline as staleness stamp --from-recordings.
Python API
model_fn has the same contract as EvalSuite.run: one rendered-prompt
string in (the conversation so far, rendered like
EvalCase.conversation_str()), the assistant’s reply out.
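To make the string-in, string-out contract concrete, here is a minimal stub. The render_conversation helper and its "role: text" line format are assumptions made for illustration; the exact output of EvalCase.conversation_str() is not shown in this section, and a real model_fn would call your system rather than echo.

```python
def render_conversation(turns):
    """Assumed rendering: one 'role: text' line per turn (format is a guess)."""
    return "\n".join(f"{role}: {text}" for role, text in turns)

def model_fn(prompt: str) -> str:
    """One rendered-prompt string in, the assistant's reply out."""
    last_line = prompt.splitlines()[-1]
    # Drop the 'role: ' prefix from the most recent line and echo the rest.
    last_text = last_line.partition(": ")[2]
    return f"You said: {last_text}"
```

The key point is that model_fn keeps no state of its own: the whole conversation so far arrives in the single prompt string on every call.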
