Skip to content

eval-dataset

Generates realistic test cases based on the eval.yaml dataset schema and judge criteria. Supports three strategies: bootstrap (from scratch with simple/complex/edge case coverage), expand (fills gaps in existing datasets by analyzing what judges check that no case tests), and from-traces (extracts real inputs from MLflow production traces). Handles external-state fields with TODO_ placeholders, generates answers.yaml for interactive skills using AskUserQuestion, and creates annotations.yaml for outcome-aware judges.

Plugin: agent-eval-harness | User-invocable

Diagram

eval-dataset diagram

Arguments

/eval-dataset [--config <path>] [--count <N>] [--strategy <type>]
Argument Required Default Description
--config eval.yaml Path to eval config.
--count 5 Number of test cases to generate.
--strategy bootstrap Generation strategy. bootstrap: from scratch. expand: fill gaps in existing dataset. from-traces: extract from MLflow traces.

Usage

/eval-dataset
/eval-dataset --count 10 --strategy expand
/eval-dataset --strategy from-traces