eval-dataset

Generates realistic test cases from the eval.yaml dataset schema and judge criteria. Supports three strategies: bootstrap (builds a dataset from scratch, covering simple, complex, and edge cases), expand (fills gaps in an existing dataset by finding what the judges check that no case tests), and from-traces (extracts real inputs from MLflow production traces). It marks external-state fields with TODO_ placeholders, generates answers.yaml for interactive skills that use AskUserQuestion, and creates annotations.yaml for outcome-aware judges.
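
The gap analysis behind the expand strategy can be pictured as a set difference between what the judges check and what the existing cases exercise. The following is a minimal sketch under stated assumptions, not the skill's actual implementation: the function `find_uncovered_criteria`, the `covers` field, and the judge and criterion names are all hypothetical and do not come from the eval.yaml schema.

```python
from typing import Dict, List, Set


def find_uncovered_criteria(
    judges: Dict[str, Set[str]], cases: List[dict]
) -> Dict[str, Set[str]]:
    """Return, per judge, the criteria that no existing case exercises.

    `judges` maps a judge name to the criteria it checks; each case lists
    the criteria it exercises under a hypothetical `covers` key.
    """
    # Collect every criterion touched by at least one case.
    covered: Set[str] = set()
    for case in cases:
        covered.update(case.get("covers", []))

    # Keep only judges that still have untested criteria.
    gaps: Dict[str, Set[str]] = {}
    for name, criteria in judges.items():
        missing = criteria - covered
        if missing:
            gaps[name] = missing
    return gaps


# Hypothetical judges and cases, for illustration only.
judges = {
    "accuracy_judge": {"cites_sources", "handles_empty_input"},
    "tone_judge": {"polite"},
}
cases = [
    {"id": "case-001", "covers": ["cites_sources", "polite"]},
]

gaps = find_uncovered_criteria(judges, cases)
print(gaps)  # {'accuracy_judge': {'handles_empty_input'}}
```

In this sketch, each entry in `gaps` would be a candidate for a new test case; a judge with full coverage, such as `tone_judge` here, does not appear in the result.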

Plugin: agent-eval-harness | User-invocable

Diagram

eval-dataset diagram

Arguments

/eval-dataset [--config <path>] [--count <N>] [--strategy <type>]

| Argument | Default | Description |
|---|---|---|
| `--config` | `eval.yaml` | Path to eval config. |
| `--count` | `5` | Number of test cases to generate. |
| `--strategy` | `bootstrap` | Generation strategy. `bootstrap`: from scratch. `expand`: fill gaps in existing dataset. `from-traces`: extract from MLflow traces. |
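
To make the defaults concrete, the arguments can be modeled with Python's standard `argparse` module. This is only an illustrative sketch: `/eval-dataset` is a slash command, and nothing here claims the plugin parses its arguments this way. The `build_parser` helper is hypothetical; the flag names, defaults, and strategy values are the ones documented for the command.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented flags and defaults."""
    parser = argparse.ArgumentParser(prog="eval-dataset")
    parser.add_argument(
        "--config", default="eval.yaml", help="Path to eval config."
    )
    parser.add_argument(
        "--count", type=int, default=5,
        help="Number of test cases to generate.",
    )
    parser.add_argument(
        "--strategy",
        choices=["bootstrap", "expand", "from-traces"],
        default="bootstrap",
        help="Generation strategy.",
    )
    return parser


# Equivalent of: /eval-dataset --count 10 --strategy expand
args = build_parser().parse_args(["--count", "10", "--strategy", "expand"])
print(args.config, args.count, args.strategy)  # eval.yaml 10 expand
```

Omitting every flag in this sketch yields `eval.yaml`, `5`, and `bootstrap`, which matches the Default column.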

Usage

/eval-dataset
/eval-dataset --count 10 --strategy expand
/eval-dataset --strategy from-traces