CLI Reference

These flags are available on every command:

| Flag | Alias | Type | Description |
| --- | --- | --- | --- |
| --verbose | -v | boolean | Enable detailed output |
| --quiet | -q | boolean | Suppress all but errors |
| --no-color | | boolean | Disable color output |
| --config | -c | string | Path to config file or directory |

All logging goes to stderr. Stdout is reserved for reporter output (JSON, JUnit), keeping it clean for piping.
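Because logs and reporter output travel on different streams, ordinary shell redirection is enough to separate them. The following is a minimal sketch, assuming the JSON reporter prints to stdout when --output is not given; the helper name run_json and the file arguments are hypothetical and not part of the CLI:

```shell
# Hypothetical helper (not part of agent-eval-kit):
# send the JSON report (stdout) to one file and the logs (stderr) to another.
# Usage: run_json <report-file> <log-file> [extra run flags...]
run_json() {
  report_file="$1"
  log_file="$2"
  shift 2
  agent-eval-kit run --reporter=json "$@" >"$report_file" 2>"$log_file"
}
```

For example, `run_json results.json eval.log --mode=replay` should leave only reporter output in results.json.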

| Code | Meaning |
| --- | --- |
| 0 | Success: all gates passed |
| 1 | Eval failure: a gate did not pass |
| 2 | Config error: invalid arguments or configuration |
| 3 | Runtime error: unexpected failure |
| 130 | Aborted: SIGINT (Ctrl+C) |
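A wrapper script can branch on these codes to tell a failed gate apart from a broken setup. The sketch below only restates the exit code table; the function name describe_exit_code is hypothetical:

```shell
# Hypothetical helper: map an exit code to the meaning listed in the exit code table.
describe_exit_code() {
  case "$1" in
    0)   echo "success" ;;
    1)   echo "eval failure" ;;
    2)   echo "config error" ;;
    3)   echo "runtime error" ;;
    130) echo "aborted" ;;
    *)   echo "unknown" ;;
  esac
}
```

It could be called right after a run, for example `agent-eval-kit run; describe_exit_code "$?"`.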

Execute eval suites against the target function.

agent-eval-kit run [suite] [options]

Suite name can be passed as a positional argument or via --suite/-s. If neither is provided, all suites run.

| Flag | Alias | Type | Default | Description |
| --- | --- | --- | --- | --- |
| --suite | -s | string | (all) | Comma-separated suite name filter |
| --mode | | string | config default | live, replay, or judge-only |
| --record | | boolean | false | Record fixtures during live mode |
| --filter | -f | string | | Comma-separated case ID filter |
| --filter-failing | | string | | Re-run only failing cases from a previous run ID |
| --run-id | | string | | Previous run ID (required for judge-only mode) |
| --trials | -t | number | 1 | Number of trials per case |
| --concurrency | | number | | Max parallel case executions |
| --rate-limit | | number | | Max requests per minute (live mode only) |
| --strict-fixtures | | boolean | false | Fail on stale or missing fixtures |
| --update-fixtures | | boolean | false | Force --mode=live --record |
| --reporter | -r | string | console | Reporter: console, json, junit, markdown |
| --output | -o | string | | Output file path for reporter |
| --confirm-cost | | boolean | false | Show cost estimate and confirm before running |
| --auto-approve | | boolean | false | Skip confirmation in non-interactive mode |
| --no-progress | | boolean | false | Disable progress display |
| --watch | -w | boolean | false | Watch files and re-run on changes |

--filter and --filter-failing are mutually exclusive.

SIGINT (Ctrl+C) gracefully aborts the run. Press again to force exit.

Run results are saved to .eval-runs/<run-id>.json (e.g. run-20260302-143022-a7f3.json). In GitHub Actions, results are also written to $GITHUB_STEP_SUMMARY.
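Since each run lands in its own file, a script can recover the most recent run ID from the directory listing, for instance to pass it to --filter-failing or --run-id. This is a rough sketch that assumes every run ID follows the run-YYYYMMDD-HHMMSS-xxxx pattern of the example above, so that sorting by name also sorts by time; the function name latest_run_id is hypothetical:

```shell
# Hypothetical helper: print the newest run ID found in a results directory.
# Assumes IDs look like run-YYYYMMDD-HHMMSS-xxxx, so name order equals time order.
# Usage: latest_run_id [dir]   (dir defaults to .eval-runs)
latest_run_id() {
  dir="${1:-.eval-runs}"
  latest=""
  # Glob results are sorted by name, so the last match is the newest run.
  for f in "$dir"/run-*.json; do
    [ -e "$f" ] || continue
    latest="$f"
  done
  [ -n "$latest" ] || return 1
  basename "$latest" .json
}
```

The function returns a non-zero status when no run files exist, so callers can skip the follow-up command.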

Shorthand for run --mode=live --record.

agent-eval-kit record [suite] [options]

Suite name can be passed as a positional argument or via --suite/-s.

Accepts: --suite, --concurrency, --rate-limit, and global flags.

Compare two eval runs side-by-side.

agent-eval-kit compare --base=<run-id> --compare=<run-id>

| Flag | Type | Default | Description |
| --- | --- | --- | --- |
| --base | string | (required) | Base run ID (older, “before”) |
| --compare | string | (required) | Compare run ID (newer, “after”) |
| --fail-on-regression | boolean | false | Exit 1 if any regressions detected |
| --score-threshold | number | 0.05 | Minimum score delta to count as a change |
| --format | string | console | Output format: console or json |

The comparison shows per-case regressions, improvements, score deltas, per-grader changes, per-category breakdowns, cost deltas, and gate changes.
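In CI, the comparison can act as a gate by combining --fail-on-regression with the JSON format. The sketch below uses only the flags documented for compare; the wrapper name gate_on_regression is hypothetical:

```shell
# Hypothetical CI gate: exits non-zero when the newer run regresses.
# Usage: gate_on_regression <base-run-id> <compare-run-id>
gate_on_regression() {
  agent-eval-kit compare \
    --base="$1" \
    --compare="$2" \
    --fail-on-regression \
    --format=json
}
```

The wrapper passes the exit status of compare straight through, so a CI step that calls it fails on any detected regression.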

List recent eval runs.

agent-eval-kit list [suite] [options]

Suite name can be passed as a positional argument or via --suite.

| Flag | Alias | Type | Default | Description |
| --- | --- | --- | --- | --- |
| --limit | -n | number | 10 | Maximum runs to show |
| --suite | | string | | Filter by suite name |

Output is a table: ID | Suite | Mode | Pass Rate | Date. Pass rate is color-coded: green (≥90%), yellow (≥70%), red (below 70%).
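The color bands reduce to two threshold checks. As an illustration only, assuming the pass rate is expressed as a percentage (the CLI's internal representation is not documented here), a hypothetical pass_rate_color helper could classify a value like this:

```shell
# Hypothetical helper: classify a pass rate (percentage, decimals allowed)
# into the documented bands: green (>= 90), yellow (>= 70), red (below 70).
pass_rate_color() {
  awk -v rate="$1" 'BEGIN {
    if (rate >= 90)      print "green"
    else if (rate >= 70) print "yellow"
    else                 print "red"
  }'
}
```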

Cache management commands.

Show statistics for fixture and judge caches.

agent-eval-kit cache stats

Displays suites, fixture counts, disk size, age, and judge cache entries.

Clear fixture or judge cache. In non-interactive environments (CI), the --yes flag is required for destructive operations.

agent-eval-kit cache clear # Clear fixture cache (prompts in TTY)
agent-eval-kit cache clear --judge # Clear judge cache only
agent-eval-kit cache clear --all # Clear all caches
agent-eval-kit cache clear --suite=foo # Clear fixtures for specific suite
agent-eval-kit cache clear --yes # Skip confirmation (required in CI)
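
A script shared between local use and CI can add --yes only when no prompt is possible. This is a sketch under two assumptions: that a set CI environment variable (a common convention, and set by GitHub Actions) or a non-terminal stdin marks a non-interactive environment, and that this matches closely enough how the CLI itself decides. The function name clear_fixture_cache is hypothetical:

```shell
# Hypothetical wrapper: pass --yes automatically when no prompt can be answered.
# Assumption: CI being set, or stdin not being a terminal, means "non-interactive".
# Usage: clear_fixture_cache [extra cache clear flags...]
clear_fixture_cache() {
  if [ -n "${CI:-}" ] || ! [ -t 0 ]; then
    agent-eval-kit cache clear --yes "$@"
  else
    agent-eval-kit cache clear "$@"
  fi
}
```

On an interactive terminal the confirmation prompt is kept, so destructive clears still require a deliberate answer.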

Run health checks on your eval setup.

agent-eval-kit doctor

Checks: Node.js version (>= 20.16), config validation, duplicate suite names, .eval-runs/ and .eval-fixtures/ directories, git hook manager detection, AGENTS.md presence. Exits 1 if any checks fail.
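
The Node.js requirement can also be checked by hand before installing anything. The following sketch is not how doctor is implemented; it is a hypothetical node_version_ok helper that applies the same >= 20.16 rule to a version string such as the one printed by `node --version`:

```shell
# Hypothetical helper: succeed if a Node.js version string (e.g. "v20.16.0")
# is at least 20.16. Only the major and minor components are compared.
node_version_ok() {
  version="${1#v}"          # drop the leading "v" printed by `node --version`
  major="${version%%.*}"    # text before the first dot
  rest="${version#*.}"      # text after the first dot
  minor="${rest%%.*}"       # text before the next dot
  [ "$major" -gt 20 ] || { [ "$major" -eq 20 ] && [ "$minor" -ge 16 ]; }
}
```

A typical call would be `node_version_ok "$(node --version)" || echo "Node.js 20.16 or newer is required"`.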

Initialize a new eval project with an interactive wizard.

agent-eval-kit init [options]

| Flag | Alias | Type | Description |
| --- | --- | --- | --- |
| --cwd | | string | Working directory |
| --yes | -y | boolean | Non-interactive mode with defaults |

The wizard auto-detects your framework (Vercel AI SDK, LangChain, Mastra, or custom), package manager, and git hook manager. It creates:

  • eval.config.ts — config with framework-specific target stub
  • cases/smoke.jsonl — 3 starter cases
  • .eval-fixtures/.gitkeep
  • .github/workflows/evals.yml (optional)
  • AGENTS.md (optional)

Install git pre-push hooks for eval checks.

agent-eval-kit install-hooks [options]

| Flag | Type | Description |
| --- | --- | --- |
| --cwd | string | Working directory |
| --manager | string | Force hook manager: husky, lefthook, simple-git-hooks, or none (raw git hook) |

If --manager is not specified, the hook manager is auto-detected. All installers are idempotent, so they are safe to run multiple times. On pre-push, the hook runs <runner> agent-eval-kit run --mode=replay --quiet, where the runner (npx, pnpm, etc.) is auto-detected.
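
To make the raw git hook case concrete, the sketch below writes a pre-push script containing the documented command. It is a hypothetical stand-in, not the installer's actual code or output: the function name write_pre_push_hook and the exact file contents are assumptions, and unlike a careful installer it overwrites any existing pre-push hook.

```shell
# Hypothetical stand-in for a raw git hook install (not the real installer).
# Writes the same file contents on every call, so repeated runs are harmless,
# but note that it overwrites an existing pre-push hook.
# Usage: write_pre_push_hook <repo-dir> <runner>
write_pre_push_hook() {
  hook="$1/.git/hooks/pre-push"
  mkdir -p "$1/.git/hooks"
  printf '#!/bin/sh\nexec %s agent-eval-kit run --mode=replay --quiet\n' "$2" >"$hook"
  chmod +x "$hook"
}
```

With npx as the runner, `write_pre_push_hook . npx` produces a hook that replays fixtures quietly before each push and blocks the push when the run exits non-zero.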

Start the MCP server for AI assistant integration.

agent-eval-kit mcp

See MCP Server for setup instructions and the full tool reference.