CLI Reference

Global flags

These flags are available on every command:

Flag	Alias	Type	Description
`--verbose`	`-v`	boolean	Enable detailed output
`--quiet`	`-q`	boolean	Suppress all but errors
`--no-color`	—	boolean	Disable color output
`--config`	`-c`	string	Path to config file or directory

All logging goes to stderr. Stdout is reserved for reporter output (JSON, JUnit), keeping it clean for piping.

Exit codes

Code	Meaning
`0`	Success — all gates passed
`1`	Eval failure — a gate did not pass
`2`	Config error — invalid arguments or configuration
`3`	Runtime error — unexpected failure
`130`	Aborted — SIGINT (Ctrl+C)

Commands

`run`

Execute eval suites against the target function.

agent-eval-kit run [suite] [options]

Suite name can be passed as a positional argument or via --suite/-s. If neither is provided, all suites run.

Flag	Alias	Type	Default	Description
`--suite`	`-s`	string	(all)	Comma-separated suite name filter
`--mode`	—	string	config default	`live`, `replay`, or `judge-only`
`--record`	—	boolean	`false`	Record fixtures during live mode
`--filter`	`-f`	string	—	Comma-separated case ID filter
`--filter-failing`	—	string	—	Re-run only failing cases from a previous run ID
`--run-id`	—	string	—	Previous run ID (required for `judge-only` mode)
`--trials`	`-t`	number	`1`	Number of trials per case
`--concurrency`	—	number	—	Max parallel case executions
`--rate-limit`	—	number	—	Max requests per minute (live mode only)
`--strict-fixtures`	—	boolean	`false`	Fail on stale or missing fixtures
`--update-fixtures`	—	boolean	`false`	Force `--mode=live --record`
`--reporter`	`-r`	string	`console`	Reporter: `console`, `json`, `junit`, `markdown`
`--output`	`-o`	string	—	Output file path for reporter
`--confirm-cost`	—	boolean	`false`	Show cost estimate and confirm before running
`--auto-approve`	—	boolean	`false`	Skip confirmation in non-interactive mode
`--no-progress`	—	boolean	`false`	Disable progress display
`--watch`	`-w`	boolean	`false`	Watch files and re-run on changes

--filter and --filter-failing are mutually exclusive.

SIGINT (Ctrl+C) gracefully aborts the run. Press again to force exit.

Run results are saved to .eval-runs/<run-id>.json (e.g. run-20260302-143022-a7f3.json). In GitHub Actions, results are also written to $GITHUB_STEP_SUMMARY.

`record`

Shorthand for run --mode=live --record.

agent-eval-kit record [suite] [options]

Suite name can be passed as a positional argument or via --suite/-s.

Accepts: --suite, --concurrency, --rate-limit, and global flags.

`compare`

Compare two eval runs side-by-side.

agent-eval-kit compare --base=<run-id> --compare=<run-id>

Flag	Type	Default	Description
`--base`	string	(required)	Base run ID (older, “before”)
`--compare`	string	(required)	Compare run ID (newer, “after”)
`--fail-on-regression`	boolean	`false`	Exit 1 if any regressions detected
`--score-threshold`	number	`0.05`	Minimum score delta to count as a change
`--format`	string	`console`	Output format: `console` or `json`

The comparison shows per-case regressions, improvements, score deltas, per-grader changes, per-category breakdowns, cost deltas, and gate changes.

`list`

List recent eval runs.

agent-eval-kit list [suite] [options]

Suite name can be passed as a positional argument or via --suite.

Flag	Alias	Type	Default	Description
`--limit`	`-n`	number	`10`	Maximum runs to show
`--suite`	—	string	—	Filter by suite name

Output is a table: ID | Suite | Mode | Pass Rate | Date. Pass rate is color-coded: green (≥90%), yellow (≥70%), red (below 70%).

`cache`

Cache management commands.

`cache stats`

Show statistics for fixture and judge caches.

agent-eval-kit cache stats

Displays suites, fixture counts, disk size, age, and judge cache entries.

`cache clear`

Clear fixture or judge cache. In non-interactive environments (CI), the --yes flag is required for destructive operations.

agent-eval-kit cache clear              # Clear fixture cache (prompts in TTY)
agent-eval-kit cache clear --judge      # Clear judge cache only
agent-eval-kit cache clear --all        # Clear all caches
agent-eval-kit cache clear --suite=foo  # Clear fixtures for specific suite
agent-eval-kit cache clear --yes        # Skip confirmation (required in CI)

`doctor`

Run health checks on your eval setup.

agent-eval-kit doctor

Checks: Node.js version (>= 20.16), config validation, duplicate suite names, .eval-runs/ and .eval-fixtures/ directories, git hook manager detection, AGENTS.md presence. Exits 1 if any checks fail.

`init`

Initialize a new eval project with an interactive wizard.

agent-eval-kit init [options]

Flag	Alias	Type	Description
`--cwd`	—	string	Working directory
`--yes`	`-y`	boolean	Non-interactive mode with defaults

The wizard auto-detects your framework (Vercel AI SDK, LangChain, Mastra, or custom), package manager, and git hook manager. It creates:

eval.config.ts — config with framework-specific target stub
cases/smoke.jsonl — 3 starter cases
.eval-fixtures/.gitkeep
.github/workflows/evals.yml (optional)
AGENTS.md (optional)

`install-hooks`

Install git pre-push hooks for eval checks.

agent-eval-kit install-hooks [options]

Flag	Type	Description
`--cwd`	string	Working directory
`--manager`	string	Force hook manager: `husky`, `lefthook`, `simple-git-hooks`, or `none` (raw git hook)

Auto-detects your hook manager if --manager is not specified. All installers are idempotent — safe to run multiple times. The hook runs <runner> agent-eval-kit run --mode=replay --quiet on pre-push, where the runner (npx, pnpm, etc.) is auto-detected.

`mcp`

Start the MCP server for AI assistant integration.

agent-eval-kit mcp

See MCP Server for setup instructions and the full tool reference.