
agent-eval-harness

Generic agentic evaluation framework for Claude Code skills. Provides an end-to-end pipeline to analyze skills, generate test cases, execute evaluations, review results with human feedback, sync with MLflow, and iteratively optimize skill quality with regression checks.

The framework is schema-driven via eval.yaml, which defines execution mode (case-by-case or batch), dataset schemas, output descriptions, judges (inline checks, LLM prompts, external modules), model selection per role (skill, subagent, judge, hook), thresholds for regression detection, and tool interception handlers. Supports AskUserQuestion answering via 3-tier resolution (exact overrides, LLM with case context, fallback) and annotation-aware judges that adapt scoring based on expected outcomes per test case.
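The 3-tier AskUserQuestion resolution described above can be sketched roughly as follows. This is a minimal illustration, not the harness's own code: `resolve_answer`, `overrides`, `llm_answer`, and `fallback` are hypothetical names, and the LLM tier is stubbed with a plain callable rather than a real model call.

```python
from typing import Callable, Mapping, Optional


def resolve_answer(
    question: str,
    overrides: Mapping[str, str],
    llm_answer: Optional[Callable[[str, dict], Optional[str]]] = None,
    case_context: Optional[dict] = None,
    fallback: str = "",
) -> tuple[str, str]:
    """Return (answer, tier) for a question the skill asks during a run.

    All names are hypothetical; the real harness may be structured differently.
    """
    # Tier 1: an exact-match override always wins.
    if question in overrides:
        return overrides[question], "override"

    # Tier 2: ask an LLM, passing the test case as context.
    if llm_answer is not None:
        answer = llm_answer(question, case_context or {})
        if answer:
            return answer, "llm"

    # Tier 3: no earlier tier produced an answer.
    return fallback, "fallback"


# Stub standing in for a real model call (assumption for illustration).
def fake_llm(question: str, context: dict) -> Optional[str]:
    return context.get("preferred_language")


overrides = {"Proceed with deployment?": "yes"}
print(resolve_answer("Proceed with deployment?", overrides, fake_llm))
print(resolve_answer("Which language?", overrides, fake_llm,
                     {"preferred_language": "Python"}))
print(resolve_answer("Which region?", overrides, fake_llm, {}, fallback="skip"))
```

Returning the tier alongside the answer makes it easy to see in a transcript which questions were answered deterministically and which relied on the model.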

The harness also integrates with EvalHub for running evaluations on Red Hat OpenShift AI via a custom provider adapter, supporting S3-hosted datasets and containerized execution.

Plugin Details

Pipeline

[Figure: agent-eval-harness pipeline]

Skills

| Skill | Description | Invocable |
| --- | --- | --- |
| /eval-setup | One-time environment setup for evaluation (dependencies, MLflow, API keys) | |
| /eval-analyze | Deep-read a target skill and generate eval.yaml configuration with dataset schemas and judges | |
| /eval-dataset | Generate realistic test cases from eval.yaml schema (bootstrap, expand, from-traces) | |
| /eval-run | Execute skill against test cases, collect artifacts, run judges, and detect regressions | |
| /eval-review | Human-in-the-loop review of scores and outputs with qualitative feedback collection | |
| /eval-mlflow | Bidirectional MLflow sync for results, datasets, and feedback | |
| /eval-optimize | Automated improvement loop that identifies failures, edits SKILL.md, and re-runs with regression checks | |

Installation

/plugin install agent-eval-harness@opendatahub-skills

Architecture

Seven skills form a linear pipeline with feedback loops: setup (optional) -> analyze -> dataset -> run -> review/optimize, with mlflow available at any point after run. eval-run is the central hub -- it executes skills headlessly, runs judges (inline checks + LLM scoring + pairwise comparison), and produces summary.yaml consumed by review, optimize, and mlflow.
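To illustrate how judge scores could roll up into a per-run summary, here is a small sketch covering inline checks only. The judge signature, the two example checks, and the `summarize` function are assumptions made for illustration. The structure of the real summary.yaml is not documented here, and LLM scoring and pairwise comparison are omitted.

```python
from statistics import mean
from typing import Callable

# A judge maps one case's output to a score in [0, 1] (hypothetical shape).
Judge = Callable[[str], float]


def has_heading(output: str) -> float:
    """Inline check: pass if any line starts with a Markdown heading marker."""
    return 1.0 if any(line.startswith("#") for line in output.splitlines()) else 0.0


def under_200_chars(output: str) -> float:
    """Inline check: pass if the output is shorter than 200 characters."""
    return 1.0 if len(output) < 200 else 0.0


def summarize(outputs: dict[str, str], judges: dict[str, Judge]) -> dict:
    """Score every case with every judge and report the mean score per judge."""
    per_case = {
        case_id: {name: judge(text) for name, judge in judges.items()}
        for case_id, text in outputs.items()
    }
    means = {
        name: mean(scores[name] for scores in per_case.values())
        for name in judges
    }
    return {"cases": per_case, "mean_scores": means}


outputs = {"case-1": "# Title\nbody", "case-2": "no heading here"}
summary = summarize(outputs, {"has_heading": has_heading, "short": under_200_chars})
print(summary["mean_scores"])
```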

eval-optimize creates a closed loop by reading judge rationale and execution transcripts, forming hypotheses about SKILL.md deficiencies, making surgical edits, and re-running eval-run with regression baseline checks. eval-review complements this with human-in-the-loop feedback that catches qualitative issues judges miss. eval-mlflow provides bidirectional sync -- pushing datasets, results, and feedback to MLflow experiment tracking and pulling annotations back for optimization.
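The regression baseline check in this loop can be pictured as a comparison of per-judge mean scores before and after an edit. The sketch below assumes a simple absolute tolerance. The function name `find_regressions`, the `tolerance` parameter, and the score values are hypothetical, and the harness's actual thresholds are configured in eval.yaml.

```python
def find_regressions(
    baseline: dict[str, float],
    current: dict[str, float],
    tolerance: float = 0.05,
) -> dict[str, float]:
    """Return the score change for each judge that dropped by more than `tolerance`.

    A judge missing from `current` is treated as scoring 0.0.
    """
    return {
        name: round(current.get(name, 0.0) - base, 4)
        for name, base in baseline.items()
        if base - current.get(name, 0.0) > tolerance
    }


# Made-up scores: the edit improved correctness but hurt formatting.
baseline = {"correctness": 0.80, "format": 0.90}
candidate = {"correctness": 0.88, "format": 0.70}

regressions = find_regressions(baseline, candidate)
accepted = not regressions  # keep the SKILL.md edit only if nothing regressed
print(regressions, accepted)
```

An edit that raises one judge's score while lowering another beyond the tolerance is rejected, which is what keeps the loop from trading one failure for another.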

Scripts live alongside each skill and share the agent_eval Python package, which a SessionStart hook auto-installs into an isolated venv at .eval-venv/. The data flow is: eval.yaml config -> workspace creation (isolated per run) -> skill execution via EvalRunner -> artifact collection -> judge scoring -> summary.yaml + HTML report.
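The data flow above can be approximated in a few lines. This is a sketch under stated assumptions, not the EvalRunner API: `run_case` and `run_eval` are hypothetical, the skill is a plain Python callable instead of a headless Claude Code run, the workspace is isolated per case for simplicity, and the summary is printed as JSON so the example needs no YAML dependency.

```python
import json
import tempfile
from pathlib import Path


def run_case(case: dict, skill, judges: dict) -> dict:
    """Run one case in its own temporary workspace and score the artifact."""
    with tempfile.TemporaryDirectory(prefix="eval-run-") as workspace:
        artifact = Path(workspace) / "output.txt"
        # Skill execution: stand-in for the headless run (assumption).
        artifact.write_text(skill(case["input"]), encoding="utf-8")
        # Artifact collection, then judge scoring.
        output = artifact.read_text(encoding="utf-8")
        scores = {name: judge(output, case) for name, judge in judges.items()}
    return {"id": case["id"], "scores": scores}


def run_eval(cases: list, skill, judges: dict) -> dict:
    """Run every case and gather the results into one summary dict."""
    return {"results": [run_case(case, skill, judges) for case in cases]}


cases = [{"id": "c1", "input": "hello", "expected": "HELLO"}]
judges = {"exact_match": lambda out, case: float(out == case["expected"])}

summary = run_eval(cases, str.upper, judges)
print(json.dumps(summary, indent=2))
```

Each workspace is deleted when its `with` block exits, so one case cannot see files left behind by another.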