eval-optimize

Automated skill-improvement loop. Each iteration runs the evaluation, identifies judge failures from summary.yaml, reads execution transcripts via Explore sub-agents to trace each failure to a specific SKILL.md instruction, makes surgical, evidence-grounded edits, and re-runs the evaluation with regression checks against the baseline. It also reads human feedback from review.yaml (produced by /eval-review) and from MLflow annotations, and prioritizes issues flagged by humans over automated judge failures. The loop stops when all judges pass or when the configured maximum number of iterations is reached.
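The control flow described above can be sketched roughly as follows. This is a minimal illustration, not the plugin's implementation: `EvalResult`, `prioritize`, `regressions`, `optimize`, and the `run_eval` / `apply_fix` callbacks are all hypothetical names, and the real formats of summary.yaml and review.yaml are not assumed here.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalResult:
    # Hypothetical shape: judge name -> passed?, standing in for summary.yaml.
    judges: dict[str, bool]

    def failing(self) -> list[str]:
        return [name for name, passed in self.judges.items() if not passed]


def prioritize(failing: list[str], human_flagged: set[str]) -> list[str]:
    """Put human-flagged judges first; the sort is stable otherwise."""
    return sorted(failing, key=lambda name: name not in human_flagged)


def regressions(baseline: EvalResult, current: EvalResult) -> list[str]:
    """Judges that passed in the baseline run but fail in the current run."""
    return [
        name
        for name, passed in baseline.judges.items()
        if passed and not current.judges.get(name, False)
    ]


def optimize(
    run_eval: Callable[[int], EvalResult],
    apply_fix: Callable[[str], None],
    human_flagged: set[str],
    max_iterations: int = 3,
) -> tuple[str, int, list[str]]:
    """Return (stop_reason, iterations_used, regressed_judges)."""
    baseline = run_eval(0)  # initial evaluation before any edits
    result = baseline
    regressed: list[str] = []
    used = 0
    while result.failing() and used < max_iterations:
        used += 1
        for judge in prioritize(result.failing(), human_flagged):
            apply_fix(judge)  # placeholder for an edit to SKILL.md
        result = run_eval(used)  # re-run and compare against the baseline
        regressed = regressions(baseline, result)
    reason = "all-judges-pass" if not result.failing() else "max-iterations"
    return reason, used, regressed
```

In this sketch the loop has exactly two exits, matching the stop conditions stated above: no failing judges remain, or the iteration budget is exhausted.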

Plugin: agent-eval-harness | User-invocable

Diagram

eval-optimize diagram

Arguments

/eval-optimize [--config <path>] [--model <model>] [--max-iterations <N>] [--run-id <id>] [--target-judge <name>]

| Argument | Default | Description |
|---|---|---|
| `--config` | `eval.yaml` | Path to the eval config. |
| `--model` | `models.skill` from config | Model used for skill execution across all iterations. |
| `--max-iterations` | `3` | Maximum number of optimization iterations before stopping. |
| `--run-id` | auto-generated | Base run ID. Each iteration appends `-iter-N`. |
| `--target-judge` | - | Focus optimization on one specific failing judge instead of all judges. |
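To make the defaults and the run-ID suffix concrete, here is a small sketch that mirrors the documented arguments with Python's `argparse`. It only illustrates the table; it is not how the plugin parses its arguments. `build_parser` and `iteration_run_id` are hypothetical helpers, `None` stands in for "resolved elsewhere" (`models.skill` from the config, or an auto-generated run ID), and whether N starts at 1 is an assumption.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented flags and defaults."""
    parser = argparse.ArgumentParser(prog="eval-optimize")
    parser.add_argument("--config", default="eval.yaml")
    # None here means: fall back to models.skill from the eval config.
    parser.add_argument("--model", default=None)
    parser.add_argument("--max-iterations", type=int, default=3)
    # None here means: a base run ID is auto-generated.
    parser.add_argument("--run-id", default=None)
    # None here means: consider all failing judges.
    parser.add_argument("--target-judge", default=None)
    return parser


def iteration_run_id(base_run_id: str, iteration: int) -> str:
    """Append the per-iteration suffix described in the table."""
    return f"{base_run_id}-iter-{iteration}"


if __name__ == "__main__":
    args = build_parser().parse_args(["--max-iterations", "5"])
    base = args.run_id or "example-run"  # placeholder base ID for the demo
    for n in range(1, args.max_iterations + 1):
        print(iteration_run_id(base, n))
```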

Usage

/eval-optimize
/eval-optimize --max-iterations 5 --model claude-opus-4-6
/eval-optimize --target-judge completeness