Skip to content

eval-run

Executes a skill against test cases, collects artifacts, scores with judges, and generates an HTML report. Orchestrates via scripts: preflight checks for stale artifacts, workspace creation with isolated per-case directories, headless skill execution (case mode: once per case; batch mode: single invocation), artifact collection into per-case dirs, scoring with inline checks and LLM judges, optional pairwise comparison against a baseline for regression detection, and report generation. Supports concurrent case execution via parallelism setting, tool interception for AskUserQuestion and external APIs, and configurable reasoning effort levels.

Plugin: agent-eval-harness | User-invocable

Diagram

eval-run diagram

Arguments

/eval-run [--config <path>] [--model <model>] [--run-id <id>] [--baseline <run-id>] [--case <filter>] [--no-judge] [--gold] [--effort <level>] [--subagent-model <model>] [--skill <name>]
Argument Required Default Description
--config eval.yaml Path to eval config.
--model models.skill from config Model for skill execution. Required if models.skill is unset in eval.yaml.
--subagent-model models.subagent, falls back to skill model Model for subagents (e.g., claude-sonnet-4-6 while main is claude-opus-4-7).
--run-id YYYY-MM-DD-<model> Identifier for this run.
--case - Substring match to select specific cases.
--baseline - Previous run to compare against for regression detection via pairwise comparison.
--no-judge false Skip LLM judges, run inline checks only.
--gold false Save outputs as gold references after the run.
--effort runner.effort from config Claude Code reasoning effort level.
--skill from config Override the skill to test.

Usage

/eval-run --model claude-opus-4-6
/eval-run --model claude-opus-4-6 --baseline 2026-05-01-opus
/eval-run --case case-001 --no-judge
/eval-run --gold