CLI Reference
Global flags
Section titled “Global flags”These flags are available on every command:
| Flag | Alias | Type | Description |
|---|---|---|---|
--verbose | -v | boolean | Enable detailed output |
--quiet | -q | boolean | Suppress all but errors |
--no-color | — | boolean | Disable color output |
--config | -c | string | Path to config file or directory |
All logging goes to stderr. Stdout is reserved for reporter output (JSON, JUnit), keeping it clean for piping.
Exit codes
Section titled “Exit codes”| Code | Meaning |
|---|---|
0 | Success — all gates passed |
1 | Eval failure — a gate did not pass |
2 | Config error — invalid arguments or configuration |
3 | Runtime error — unexpected failure |
130 | Aborted — SIGINT (Ctrl+C) |
Commands
Section titled “Commands”Execute eval suites against the target function.
agent-eval-kit run [suite] [options]Suite name can be passed as a positional argument or via --suite/-s. If neither is provided, all suites run.
| Flag | Alias | Type | Default | Description |
|---|---|---|---|---|
--suite | -s | string | (all) | Comma-separated suite name filter |
--mode | — | string | config default | live, replay, or judge-only |
--record | — | boolean | false | Record fixtures during live mode |
--filter | -f | string | — | Comma-separated case ID filter |
--filter-failing | — | string | — | Re-run only failing cases from a previous run ID |
--run-id | — | string | — | Previous run ID (required for judge-only mode) |
--trials | -t | number | 1 | Number of trials per case |
--concurrency | — | number | — | Max parallel case executions |
--rate-limit | — | number | — | Max requests per minute (live mode only) |
--strict-fixtures | — | boolean | false | Fail on stale or missing fixtures |
--update-fixtures | — | boolean | false | Force --mode=live --record |
--reporter | -r | string | console | Reporter: console, json, junit, markdown |
--output | -o | string | — | Output file path for reporter |
--confirm-cost | — | boolean | false | Show cost estimate and confirm before running |
--auto-approve | — | boolean | false | Skip confirmation in non-interactive mode |
--no-progress | — | boolean | false | Disable progress display |
--watch | -w | boolean | false | Watch files and re-run on changes |
--filter and --filter-failing are mutually exclusive.
SIGINT (Ctrl+C) gracefully aborts the run. Press again to force exit.
Run results are saved to .eval-runs/<run-id>.json (e.g. run-20260302-143022-a7f3.json). In GitHub Actions, results are also written to $GITHUB_STEP_SUMMARY.
record
Section titled “record”Shorthand for run --mode=live --record.
agent-eval-kit record [suite] [options]Suite name can be passed as a positional argument or via --suite/-s.
Accepts: --suite, --concurrency, --rate-limit, and global flags.
compare
Section titled “compare”Compare two eval runs side-by-side.
agent-eval-kit compare --base=<run-id> --compare=<run-id>| Flag | Type | Default | Description |
|---|---|---|---|
--base | string | (required) | Base run ID (older, “before”) |
--compare | string | (required) | Compare run ID (newer, “after”) |
--fail-on-regression | boolean | false | Exit 1 if any regressions detected |
--score-threshold | number | 0.05 | Minimum score delta to count as a change |
--format | string | console | Output format: console or json |
The comparison shows per-case regressions, improvements, score deltas, per-grader changes, per-category breakdowns, cost deltas, and gate changes.
List recent eval runs.
agent-eval-kit list [suite] [options]Suite name can be passed as a positional argument or via --suite.
| Flag | Alias | Type | Default | Description |
|---|---|---|---|---|
--limit | -n | number | 10 | Maximum runs to show |
--suite | — | string | — | Filter by suite name |
Output is a table: ID | Suite | Mode | Pass Rate | Date. Pass rate is color-coded: green (≥90%), yellow (≥70%), red (below 70%).
Cache management commands.
cache stats
Section titled “cache stats”Show statistics for fixture and judge caches.
agent-eval-kit cache statsDisplays suites, fixture counts, disk size, age, and judge cache entries.
cache clear
Section titled “cache clear”Clear fixture or judge cache. In non-interactive environments (CI), the --yes flag is required for destructive operations.
agent-eval-kit cache clear # Clear fixture cache (prompts in TTY)agent-eval-kit cache clear --judge # Clear judge cache onlyagent-eval-kit cache clear --all # Clear all cachesagent-eval-kit cache clear --suite=foo # Clear fixtures for specific suiteagent-eval-kit cache clear --yes # Skip confirmation (required in CI)doctor
Section titled “doctor”Run health checks on your eval setup.
agent-eval-kit doctorChecks: Node.js version (>= 20.16), config validation, duplicate suite names, .eval-runs/ and .eval-fixtures/ directories, git hook manager detection, AGENTS.md presence. Exits 1 if any checks fail.
Initialize a new eval project with an interactive wizard.
agent-eval-kit init [options]| Flag | Alias | Type | Description |
|---|---|---|---|
--cwd | — | string | Working directory |
--yes | -y | boolean | Non-interactive mode with defaults |
The wizard auto-detects your framework (Vercel AI SDK, LangChain, Mastra, or custom), package manager, and git hook manager. It creates:
eval.config.ts— config with framework-specific target stubcases/smoke.jsonl— 3 starter cases.eval-fixtures/.gitkeep.github/workflows/evals.yml(optional)AGENTS.md(optional)
install-hooks
Section titled “install-hooks”Install git pre-push hooks for eval checks.
agent-eval-kit install-hooks [options]| Flag | Type | Description |
|---|---|---|
--cwd | string | Working directory |
--manager | string | Force hook manager: husky, lefthook, simple-git-hooks, or none (raw git hook) |
Auto-detects your hook manager if --manager is not specified. All installers are idempotent — safe to run multiple times. The hook runs <runner> agent-eval-kit run --mode=replay --quiet on pre-push, where the runner (npx, pnpm, etc.) is auto-detected.
Start the MCP server for AI assistant integration.
agent-eval-kit mcpSee MCP Server for setup instructions and the full tool reference.