CI Integration

Add evals to your CI pipeline. The init wizard can generate this workflow for you (agent-eval-kit init).

.github/workflows/evals.yml
name: Eval Suite
on:
  pull_request:
    branches: [main]
jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: pnpm/action-setup@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 22
          cache: pnpm
      - run: pnpm install --frozen-lockfile
      # Replay from recorded fixtures (instant, $0)
      - name: Run evals
        run: pnpm agent-eval-kit run --mode=replay

To surface results in the PR checks UI, run with the JUnit reporter and publish the report:

      - name: Run evals
        run: pnpm agent-eval-kit run --mode=replay -r junit -o results.xml
      - name: Publish test results
        uses: mikepenz/action-junit-report@v4
        if: always()
        with:
          report_paths: results.xml

There is also a strict fixtures mode:

      - name: Run evals (strict)
        run: pnpm agent-eval-kit run --mode=replay --strict-fixtures

Gates determine the exit code. When a gate fails, agent-eval-kit run exits with code 1, failing the CI step.

eval.config.ts
gates: {
  passRate: 0.95,     // 95% pass rate required
  maxCost: 2.00,      // Total run cost under $2
  p95LatencyMs: 5000, // 95th percentile latency under 5s
}
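To illustrate how gates like these map to an exit code, here is a minimal sketch. The `Gates` and `RunStats` shapes and the `checkGates` function are hypothetical, for illustration only, not agent-eval-kit's actual internals:

```typescript
// Hypothetical shapes -- illustration only, not agent-eval-kit's real types.
interface Gates {
  passRate?: number;     // minimum fraction of passing cases
  maxCost?: number;      // maximum total run cost in dollars
  p95LatencyMs?: number; // maximum 95th-percentile latency in ms
}

interface RunStats {
  passRate: number;
  totalCost: number;
  p95LatencyMs: number;
}

// Returns the names of failed gates; an empty array means all gates passed.
function checkGates(gates: Gates, stats: RunStats): string[] {
  const failures: string[] = [];
  if (gates.passRate !== undefined && stats.passRate < gates.passRate) {
    failures.push("passRate");
  }
  if (gates.maxCost !== undefined && stats.totalCost > gates.maxCost) {
    failures.push("maxCost");
  }
  if (gates.p95LatencyMs !== undefined && stats.p95LatencyMs > gates.p95LatencyMs) {
    failures.push("p95LatencyMs");
  }
  return failures;
}

// A run below the required pass rate fails that gate:
const failures = checkGates(
  { passRate: 0.95, maxCost: 2.0, p95LatencyMs: 5000 },
  { passRate: 0.9, totalCost: 1.1, p95LatencyMs: 4200 },
);
console.log(failures); // -> [ 'passRate' ]
// A real runner would then set process.exitCode = failures.length > 0 ? 1 : 0;
```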

In GitHub Actions, eval results are automatically written to $GITHUB_STEP_SUMMARY when the environment variable is present. This shows formatted results directly in the PR checks UI without any additional configuration.
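The underlying GitHub Actions mechanism is simple: `GITHUB_STEP_SUMMARY` holds the path of a file that GitHub renders as markdown in the checks UI. A sketch of how a tool can write to it (the function and row shape are illustrative, not agent-eval-kit's API):

```typescript
import { appendFileSync } from "node:fs";

// Append a markdown table to the GitHub Actions step summary, if available.
// GITHUB_STEP_SUMMARY points to a file that GitHub renders in the checks UI.
function writeStepSummary(rows: Array<{ name: string; score: number }>): boolean {
  const summaryPath = process.env.GITHUB_STEP_SUMMARY;
  if (!summaryPath) return false; // not running in GitHub Actions
  const lines = [
    "| Eval | Score |",
    "| --- | --- |",
    ...rows.map((r) => `| ${r.name} | ${r.score.toFixed(2)} |`),
  ];
  appendFileSync(summaryPath, lines.join("\n") + "\n");
  return true;
}
```

Outside of GitHub Actions the variable is unset, so the function is a no-op and output stays on stdout.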

Run evals before every push:

# Auto-detect your hook manager and install
agent-eval-kit install-hooks
# Or specify a manager
agent-eval-kit install-hooks --manager=husky

Supported hook managers:

  • Husky — creates/appends .husky/pre-push
  • Lefthook — adds to lefthook.yml
  • simple-git-hooks — adds to package.json
  • Raw git hook — creates .git/hooks/pre-push

The hook runs <runner> agent-eval-kit run --mode=replay --quiet, where the runner (npx, pnpm, etc.) is auto-detected. Failed gates block the push. All installers are idempotent.
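Runner auto-detection typically keys off the project's lockfile. The detection logic below is a guess at how an installer like this might work, not agent-eval-kit's actual implementation (the lockfile names themselves are the standard ones):

```typescript
import { existsSync } from "node:fs";
import { join } from "node:path";

// Pick the command used to invoke agent-eval-kit in the pre-push hook,
// based on which lockfile the project directory contains.
function detectRunner(dir: string): string {
  if (existsSync(join(dir, "pnpm-lock.yaml"))) return "pnpm";
  if (existsSync(join(dir, "yarn.lock"))) return "yarn";
  if (existsSync(join(dir, "bun.lockb"))) return "bun";
  return "npx"; // fall back to npx for plain npm projects
}
```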

# After running evals on your PR branch
agent-eval-kit compare --base=<main-run-id> --compare=<pr-run-id>
# Fail CI if regressions are detected
agent-eval-kit compare --base=<main-run-id> --compare=<pr-run-id> --fail-on-regression

The comparison highlights regressions, improvements, score deltas, per-grader changes, and per-category breakdowns.
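Conceptually, the comparison pairs cases from the two runs by id and computes score deltas; a negative delta is a regression. A minimal sketch with a hypothetical result shape (not agent-eval-kit's real output format):

```typescript
// Hypothetical per-case results from a base run and a compare run.
interface CaseResult { id: string; score: number }

interface Delta { id: string; base: number; compare: number; delta: number }

// Pair cases by id and report score deltas; negative deltas are regressions.
function diffRuns(base: CaseResult[], compare: CaseResult[]): Delta[] {
  const baseById = new Map(base.map((c) => [c.id, c.score]));
  return compare
    .filter((c) => baseById.has(c.id))
    .map((c) => {
      const b = baseById.get(c.id)!;
      return { id: c.id, base: b, compare: c.score, delta: c.score - b };
    });
}

const deltas = diffRuns(
  [{ id: "qa-1", score: 0.9 }, { id: "qa-2", score: 0.8 }],
  [{ id: "qa-1", score: 0.7 }, { id: "qa-2", score: 0.85 }],
);
const regressions = deltas.filter((d) => d.delta < 0); // qa-1 regressed
// With --fail-on-regression, a non-empty list would make the process exit 1.
```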

For the initial fixture recording, run in live mode once:

# Record fixtures (requires API keys)
OPENAI_API_KEY=${{ secrets.OPENAI_API_KEY }} \
agent-eval-kit record --suite=smoke

Commit the .eval-fixtures/ directory to your repository. Subsequent CI runs use replay mode.

  • Commit fixtures: Check .eval-fixtures/ into git for reproducible CI runs
  • Use targetVersion: Bump the version when your agent changes to invalidate old fixtures
  • Separate suites: Use a fast smoke suite for pre-push and a thorough full suite for CI
  • Cache the judge cache: The .eval-cache/ directory can be cached between CI runs to avoid redundant LLM calls
  • Cost estimation: Use --confirm-cost to preview cost before live runs