CI Integration

Add evals to your CI pipeline. The init wizard can generate this workflow for you (agent-eval-kit init).

.github/workflows/evals.yml
name: Eval Suite
on:
  pull_request:
    branches: [main]
jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: pnpm/action-setup@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 22
          cache: pnpm
      - run: pnpm install --frozen-lockfile
      # Replay from recorded fixtures (instant, $0)
      - name: Run evals
        run: pnpm agent-eval-kit run --mode=replay

To surface results in the PR checks UI, run with the JUnit reporter and publish the report:

      - name: Run evals
        run: pnpm agent-eval-kit run --mode=replay -r junit -o results.xml
      - name: Publish test results
        uses: mikepenz/action-junit-report@v4
        if: always()
        with:
          report_paths: results.xml

There is also a strict fixtures mode:

      - name: Run evals (strict)
        run: pnpm agent-eval-kit run --mode=replay --strict-fixtures

Gates determine the exit code. When a gate fails, agent-eval-kit run exits with code 1, failing the CI step.

eval.config.ts
gates: {
  passRate: 0.95,     // 95% pass rate required
  maxCost: 2.00,      // Total run cost under $2
  p95LatencyMs: 5000, // 95th percentile latency under 5s
}
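To illustrate how gates like these map to an exit code, here is a minimal sketch. The `Gates` and `RunStats` shapes and the `checkGates` function are hypothetical, for illustration only, not agent-eval-kit's actual internals:

```typescript
// Hypothetical shapes -- illustration only, not agent-eval-kit's real types.
interface Gates {
  passRate?: number;     // minimum fraction of passing cases
  maxCost?: number;      // maximum total run cost in dollars
  p95LatencyMs?: number; // maximum 95th-percentile latency in ms
}

interface RunStats {
  passRate: number;
  totalCost: number;
  p95LatencyMs: number;
}

// Returns the names of failed gates; an empty array means all gates passed.
function checkGates(gates: Gates, stats: RunStats): string[] {
  const failures: string[] = [];
  if (gates.passRate !== undefined && stats.passRate < gates.passRate) {
    failures.push("passRate");
  }
  if (gates.maxCost !== undefined && stats.totalCost > gates.maxCost) {
    failures.push("maxCost");
  }
  if (gates.p95LatencyMs !== undefined && stats.p95LatencyMs > gates.p95LatencyMs) {
    failures.push("p95LatencyMs");
  }
  return failures;
}

// A run below the required pass rate fails that gate:
const failures = checkGates(
  { passRate: 0.95, maxCost: 2.0, p95LatencyMs: 5000 },
  { passRate: 0.9, totalCost: 1.1, p95LatencyMs: 4200 },
);
console.log(failures); // -> [ 'passRate' ]
// A real runner would then set process.exitCode = failures.length > 0 ? 1 : 0;
```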

In GitHub Actions, eval results are automatically written to $GITHUB_STEP_SUMMARY when the environment variable is present. This shows formatted results directly in the PR checks UI without any additional configuration.
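The underlying GitHub Actions mechanism is simple: `GITHUB_STEP_SUMMARY` holds the path of a file that GitHub renders as markdown in the checks UI. A sketch of how a tool can write to it (the function and row shape are illustrative, not agent-eval-kit's API):

```typescript
import { appendFileSync } from "node:fs";

// Append a markdown table to the GitHub Actions step summary, if available.
// GITHUB_STEP_SUMMARY points to a file that GitHub renders in the checks UI.
function writeStepSummary(rows: Array<{ name: string; score: number }>): boolean {
  const summaryPath = process.env.GITHUB_STEP_SUMMARY;
  if (!summaryPath) return false; // not running in GitHub Actions
  const lines = [
    "| Eval | Score |",
    "| --- | --- |",
    ...rows.map((r) => `| ${r.name} | ${r.score.toFixed(2)} |`),
  ];
  appendFileSync(summaryPath, lines.join("\n") + "\n");
  return true;
}
```

Outside of GitHub Actions the variable is unset, so the function is a no-op and output stays on stdout.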

Run evals before every push:

# Auto-detect your hook manager and install
agent-eval-kit install-hooks
# Or specify a manager
agent-eval-kit install-hooks --manager=husky

Supported hook managers:

  • Husky — creates/appends .husky/pre-push
  • Lefthook — adds to lefthook.yml
  • simple-git-hooks — adds to package.json
  • Raw git hook — creates .git/hooks/pre-push

The hook runs <runner> agent-eval-kit run --mode=replay --quiet, where the runner (npx, pnpm, etc.) is auto-detected. Failed gates block the push. All installers are idempotent.
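Runner auto-detection typically keys off the project's lockfile. The detection logic below is a guess at how an installer like this might work, not agent-eval-kit's actual implementation (the lockfile names themselves are the standard ones):

```typescript
import { existsSync } from "node:fs";
import { join } from "node:path";

// Pick the command used to invoke agent-eval-kit in the pre-push hook,
// based on which lockfile the project directory contains.
function detectRunner(dir: string): string {
  if (existsSync(join(dir, "pnpm-lock.yaml"))) return "pnpm";
  if (existsSync(join(dir, "yarn.lock"))) return "yarn";
  if (existsSync(join(dir, "bun.lockb"))) return "bun";
  return "npx"; // fall back to npx for plain npm projects
}
```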

# After running evals on your PR branch
agent-eval-kit compare --base=<main-run-id> --compare=<pr-run-id>
# Fail CI if regressions are detected
agent-eval-kit compare --base=<main-run-id> --compare=<pr-run-id> --fail-on-regression

The comparison highlights regressions, improvements, score deltas, per-grader changes, and per-category breakdowns.
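Conceptually, the comparison pairs cases from the two runs by id and computes score deltas; a negative delta is a regression. A minimal sketch with a hypothetical result shape (not agent-eval-kit's real output format):

```typescript
// Hypothetical per-case results from a base run and a compare run.
interface CaseResult { id: string; score: number }

interface Delta { id: string; base: number; compare: number; delta: number }

// Pair cases by id and report score deltas; negative deltas are regressions.
function diffRuns(base: CaseResult[], compare: CaseResult[]): Delta[] {
  const baseById = new Map(base.map((c) => [c.id, c.score]));
  return compare
    .filter((c) => baseById.has(c.id))
    .map((c) => {
      const b = baseById.get(c.id)!;
      return { id: c.id, base: b, compare: c.score, delta: c.score - b };
    });
}

const deltas = diffRuns(
  [{ id: "qa-1", score: 0.9 }, { id: "qa-2", score: 0.8 }],
  [{ id: "qa-1", score: 0.7 }, { id: "qa-2", score: 0.85 }],
);
const regressions = deltas.filter((d) => d.delta < 0); // qa-1 regressed
// With --fail-on-regression, a non-empty list would make the process exit 1.
```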

For the initial fixture recording, run in live mode once:

# Record fixtures (requires API keys)
OPENAI_API_KEY=${{ secrets.OPENAI_API_KEY }} \
agent-eval-kit record --suite=smoke

Commit the .eval-fixtures/ directory to your repository. Subsequent CI runs use replay mode.

  • Commit fixtures: Check .eval-fixtures/ into git for reproducible CI runs
  • Use targetVersion: Bump the version when your agent changes to invalidate old fixtures
  • Separate suites: Use a fast smoke suite for pre-push and a thorough full suite for CI
  • Cache the judge cache: The .eval-cache/ directory can be cached between CI runs to avoid redundant LLM calls
  • Cost estimation: Use --confirm-cost to preview cost before live runs