agent-eval-harness

Generic agentic evaluation framework for Claude Code skills. Provides an end-to-end pipeline to analyze skills, generate test cases, execute evaluations, review results with human feedback, sync with MLflow, and iteratively optimize skill quality with regression checks.

The framework is schema-driven via eval.yaml, which defines the execution mode (case-by-case or batch), dataset schemas, output descriptions, judges (inline checks, LLM prompts, external modules), model selection per role (skill, subagent, judge, hook), thresholds for regression detection, and tool interception handlers. It answers AskUserQuestion prompts through a 3-tier resolution (exact overrides, LLM with case context, fallback) and supports annotation-aware judges that adapt their scoring to the expected outcome of each test case.
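The 3-tier AskUserQuestion resolution can be sketched as follows. This is a minimal illustration, not the harness's actual implementation: `resolve_answer`, the `overrides` mapping, and the `llm_answer` callable are hypothetical names, and the convention that the LLM tier returns `None` when it cannot answer is an assumption.

```python
from typing import Callable, Mapping, Optional

# Hypothetical signature for the LLM tier: it receives the question and the
# test case context, and returns an answer or None if it cannot decide.
LlmAnswerFn = Callable[[str, Mapping[str, str]], Optional[str]]


def resolve_answer(
    question: str,
    case_context: Mapping[str, str],
    overrides: Mapping[str, str],
    llm_answer: LlmAnswerFn,
    fallback: str,
) -> tuple[str, str]:
    """Return (answer, tier) using exact override -> LLM -> fallback."""
    # Tier 1: an exact-match override always wins.
    if question in overrides:
        return overrides[question], "override"
    # Tier 2: ask the LLM, giving it the current test case as context.
    answer = llm_answer(question, case_context)
    if answer is not None:
        return answer, "llm"
    # Tier 3: nothing else matched, so use the configured fallback.
    return fallback, "fallback"


def stub_llm(question: str, context: Mapping[str, str]) -> Optional[str]:
    # Stand-in for a real model call: answers only questions about the branch.
    return context.get("branch") if "branch" in question else None


overrides = {"Proceed with deployment?": "yes"}
context = {"branch": "main"}
print(resolve_answer("Proceed with deployment?", context, overrides, stub_llm, "skip"))
print(resolve_answer("Which branch should I use?", context, overrides, stub_llm, "skip"))
print(resolve_answer("Pick a color", context, overrides, stub_llm, "skip"))
```

Returning the tier alongside the answer makes it easy to audit, per test case, how often the run relied on the fallback rather than on an override or the model.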

The harness also integrates with EvalHub for running evaluations on Red Hat OpenShift AI via a custom provider adapter, supporting S3-hosted datasets and containerized execution.

Plugin Details

Version: 0.1.0
Author: Antonin Stefanutti
Category: Evaluation & Testing
Repository: opendatahub-io/agent-eval-harness
Tags: evaluation testing skills agents mlflow optimization scoring

Pipeline

[Figure: agent-eval-harness pipeline]

Skills

| Skill | Description | Invocable |
|---|---|---|
| `/eval-setup` | One-time environment setup for evaluation (dependencies, MLflow, API keys) | |
| `/eval-analyze` | Deep-read a target skill and generate eval.yaml configuration with dataset schemas and judges | |
| `/eval-dataset` | Generate realistic test cases from eval.yaml schema (bootstrap, expand, from-traces) | |
| `/eval-run` | Execute skill against test cases, collect artifacts, run judges, and detect regressions | |
| `/eval-review` | Human-in-the-loop review of scores and outputs with qualitative feedback collection | |
| `/eval-mlflow` | Bidirectional MLflow sync for results, datasets, and feedback | |
| `/eval-optimize` | Automated improvement loop that identifies failures, edits SKILL.md, and re-runs with regression checks | |

Installation

/plugin install agent-eval-harness@opendatahub-skills

Architecture

Seven skills form a linear pipeline with feedback loops: setup (optional) -> analyze -> dataset -> run -> review/optimize, with mlflow available at any point after run. eval-run is the central hub: it executes skills headlessly, runs judges (inline checks, LLM scoring, and pairwise comparison), and produces the summary.yaml consumed by review, optimize, and mlflow.
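To make the regression-detection step concrete, here is a minimal sketch that compares per-judge scores from a current run against a baseline. The function name, the dictionary layout, and the `max_drop` tolerance are assumptions made for illustration; the real summary.yaml schema and the threshold semantics in eval.yaml may differ.

```python
from typing import Dict, List


def find_regressions(
    baseline: Dict[str, float],
    current: Dict[str, float],
    max_drop: float = 0.05,
) -> List[str]:
    """Return judges whose score fell more than max_drop below baseline."""
    regressed = []
    for judge, base_score in baseline.items():
        # A judge missing from the current run counts as a regression here.
        new_score = current.get(judge)
        if new_score is None or base_score - new_score > max_drop:
            regressed.append(judge)
    return sorted(regressed)


# Hypothetical judge names and scores, for illustration only.
baseline = {"format_check": 1.0, "llm_quality": 0.80, "pairwise": 0.60}
current = {"format_check": 1.0, "llm_quality": 0.70, "pairwise": 0.62}
print(find_regressions(baseline, current))
```

A tolerance such as `max_drop` keeps small score fluctuations from LLM judges from being flagged as regressions on every run.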

eval-optimize closes the loop by reading judge rationale and execution transcripts, forming hypotheses about SKILL.md deficiencies, making surgical edits, and re-running eval-run with regression baseline checks. eval-review complements this with human-in-the-loop feedback that catches qualitative issues judges miss. eval-mlflow provides bidirectional sync: it pushes datasets, results, and feedback to MLflow experiment tracking and pulls annotations back for optimization.
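The closed loop can be pictured as a propose, re-run, and accept-or-discard cycle. The sketch below is an illustration under stated assumptions: `propose_edit` and `evaluate` are hypothetical callables standing in for the hypothesis-and-edit step and for an eval-run invocation, and the acceptance rule (total score improves and no judge drops) is one plausible policy, not necessarily the one eval-optimize applies.

```python
from typing import Callable, Dict

Scores = Dict[str, float]


def optimize(
    skill_text: str,
    propose_edit: Callable[[str, Scores], str],
    evaluate: Callable[[str], Scores],
    max_iterations: int = 3,
) -> str:
    """Keep an edit only if the total score rises and no judge drops."""
    best_text = skill_text
    best_scores = evaluate(best_text)  # regression baseline
    for _ in range(max_iterations):
        candidate = propose_edit(best_text, best_scores)
        scores = evaluate(candidate)
        no_regression = all(scores[j] >= best_scores[j] for j in best_scores)
        improved = sum(scores.values()) > sum(best_scores.values())
        if no_regression and improved:
            best_text, best_scores = candidate, scores  # accept the edit
        # Otherwise the candidate is discarded and best_text stays unchanged.
    return best_text


# Toy stand-ins: each judge passes once the skill text mentions a keyword.
def toy_evaluate(text: str) -> Scores:
    return {
        "has_examples": float("example" in text),
        "has_limits": float("limit" in text),
    }


def toy_propose(text: str, scores: Scores) -> str:
    # Address the first failing judge by appending one sentence.
    if scores["has_examples"] < 1.0:
        return text + " Include an example."
    if scores["has_limits"] < 1.0:
        return text + " State the limit."
    return text


print(optimize("Summarize the ticket.", toy_propose, toy_evaluate))
```

Comparing every candidate against the best accepted version, rather than against the previous attempt, means a failed edit never becomes the new baseline.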

Scripts live alongside each skill and share the agent_eval Python package (auto-installed via a SessionStart hook into an isolated venv at .eval-venv/). The data flow is: eval.yaml config -> workspace creation (isolated per run) -> skill execution via EvalRunner -> artifact collection -> judge scoring -> summary.yaml + HTML report.
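The data flow above can be approximated in a few lines. This sketch does not use the agent_eval package or the real EvalRunner interface, whose signatures are not shown here. The `Skill` and `Judge` callables, `run_case`, and `summarize` are hypothetical, and a temporary directory stands in for the isolated per-run workspace.

```python
import tempfile
from pathlib import Path
from typing import Callable, Dict, List

# Hypothetical shapes: a "skill" writes files into a workspace, and a judge
# maps the collected artifacts to a score between 0 and 1.
Skill = Callable[[Path, dict], None]
Judge = Callable[[Dict[str, str]], float]


def run_case(skill: Skill, case: dict, judges: Dict[str, Judge]) -> Dict[str, float]:
    """Run one test case in a fresh temporary workspace and score it."""
    with tempfile.TemporaryDirectory(prefix="eval-run-") as tmp:
        workspace = Path(tmp)
        skill(workspace, case)  # stand-in for headless skill execution
        # Collect every file the skill produced, keyed by relative path.
        artifacts = {
            p.relative_to(workspace).as_posix(): p.read_text()
            for p in workspace.rglob("*")
            if p.is_file()
        }
    return {name: judge(artifacts) for name, judge in judges.items()}


def summarize(results: List[Dict[str, float]]) -> Dict[str, float]:
    """Average each judge's score across all cases."""
    names = results[0].keys()
    return {n: sum(r[n] for r in results) / len(results) for n in names}


def toy_skill(workspace: Path, case: dict) -> None:
    (workspace / "output.md").write_text(f"# {case['title']}\n")


judges = {
    "output_exists": lambda a: float("output.md" in a),
    "has_heading": lambda a: float(a.get("output.md", "").startswith("# ")),
}
cases = [{"title": "First"}, {"title": "Second"}]
summary = summarize([run_case(toy_skill, c, judges) for c in cases])
print(summary)
```

Reading the artifacts into memory before the workspace is removed keeps the judges independent of the filesystem, which is one simple way to ensure that runs cannot contaminate each other.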