# CI Integration
## GitHub Actions

Add evals to your CI pipeline. The init wizard can generate this workflow for you (`agent-eval-kit init`).

```yaml
name: Eval Suite

on:
  pull_request:
    branches: [main]

jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: pnpm/action-setup@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 22
          cache: pnpm
      - run: pnpm install --frozen-lockfile
      # Replay from recorded fixtures (instant, $0)
      - name: Run evals
        run: pnpm agent-eval-kit run --mode=replay
```

### With JUnit reporting

```yaml
- name: Run evals
  run: pnpm agent-eval-kit run --mode=replay -r junit -o results.xml

- name: Publish test results
  uses: mikepenz/action-junit-report@v4
  if: always()
  with:
    report_paths: results.xml
```

### With strict fixtures

```yaml
- name: Run evals (strict)
  run: pnpm agent-eval-kit run --mode=replay --strict-fixtures
```
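Conceptually, replay mode serves recorded responses keyed by a hash of the request, and `--strict-fixtures` turns any fixture miss into a hard failure instead of a silent fallback. The sketch below is illustrative only, not agent-eval-kit's implementation; `fixtures`, `fixtureKey`, and `replay` are hypothetical names:

```typescript
import { createHash } from "node:crypto";

// Hypothetical fixture store: request hash -> recorded response.
const fixtures = new Map<string, string>();

function fixtureKey(request: object): string {
  // Hash the request so identical prompts replay the same recording.
  return createHash("sha256").update(JSON.stringify(request)).digest("hex");
}

function replay(request: object, strict: boolean): string | undefined {
  const key = fixtureKey(request);
  const hit = fixtures.get(key);
  if (hit === undefined && strict) {
    // Strict mode: any unrecorded request fails the run rather than
    // falling through to a live (and paid) model call.
    throw new Error(`No fixture recorded for request ${key.slice(0, 12)}`);
  }
  return hit;
}
```

Strict mode is useful in CI, where an unexpected live call would mean your fixtures no longer cover the suite.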
## Quality gates in CI

Gates determine the exit code. When a gate fails, `agent-eval-kit run` exits with code 1, failing the CI step.

```ts
gates: {
  passRate: 0.95,       // 95% pass rate required
  maxCost: 2.00,        // Total run cost under $2
  p95LatencyMs: 5000,   // 95th percentile latency under 5s
}
```
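How gates map to an exit code can be sketched as follows. This is a conceptual illustration, not the tool's code; the per-case fields (`passed`, `costUsd`, `latencyMs`) are assumed names:

```typescript
interface CaseResult { passed: boolean; costUsd: number; latencyMs: number }
interface Gates { passRate?: number; maxCost?: number; p95LatencyMs?: number }

// Nearest-rank 95th percentile of a list of latencies.
function p95(values: number[]): number {
  const sorted = [...values].sort((a, b) => a - b);
  return sorted[Math.min(sorted.length - 1, Math.ceil(sorted.length * 0.95) - 1)];
}

// Returns the process exit code: 0 when every configured gate holds, 1 otherwise.
function checkGates(results: CaseResult[], gates: Gates): 0 | 1 {
  const passRate = results.filter(r => r.passed).length / results.length;
  const totalCost = results.reduce((sum, r) => sum + r.costUsd, 0);
  const latency = p95(results.map(r => r.latencyMs));
  const ok =
    (gates.passRate === undefined || passRate >= gates.passRate) &&
    (gates.maxCost === undefined || totalCost <= gates.maxCost) &&
    (gates.p95LatencyMs === undefined || latency <= gates.p95LatencyMs);
  return ok ? 0 : 1;
}
```

Unset gates are simply skipped, so you can start with a single gate and tighten over time.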
## GitHub Step Summary

In GitHub Actions, eval results are automatically written to `$GITHUB_STEP_SUMMARY` when the environment variable is present. This shows formatted results directly in the PR checks UI without any additional configuration.
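The mechanism is easy to reuse in your own scripts: GitHub Actions sets `GITHUB_STEP_SUMMARY` to a file path, and anything appended to that file is rendered as Markdown in the job summary. A minimal sketch (not agent-eval-kit's code):

```typescript
import { appendFileSync } from "node:fs";

// Append Markdown to the job summary; returns false outside GitHub Actions.
function writeStepSummary(markdown: string): boolean {
  const path = process.env.GITHUB_STEP_SUMMARY;
  if (!path) return false; // env var absent: not running in Actions
  appendFileSync(path, markdown + "\n");
  return true;
}
```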
## Pre-push hooks

Run evals before every push:
```sh
# Auto-detect your hook manager and install
agent-eval-kit install-hooks

# Or specify a manager
agent-eval-kit install-hooks --manager=husky
```

Supported hook managers:

- **Husky** — creates/appends `.husky/pre-push`
- **Lefthook** — adds to `lefthook.yml`
- **simple-git-hooks** — adds to `package.json`
- **Raw git hook** — creates `.git/hooks/pre-push`
The hook runs `<runner> agent-eval-kit run --mode=replay --quiet`, where the runner (`npx`, `pnpm`, etc.) is auto-detected. Failed gates block the push. All installers are idempotent.
## Comparing across runs

```sh
# After running evals on your PR branch
agent-eval-kit compare --base=<main-run-id> --compare=<pr-run-id>

# Fail CI if regressions are detected
agent-eval-kit compare --base=<main-run-id> --compare=<pr-run-id> --fail-on-regression
```

The comparison highlights regressions, improvements, score deltas, per-grader changes, and per-category breakdowns.
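At its core, a regression is a case whose score dropped between the base and compare runs. A conceptual sketch of that check (the real comparison also covers graders and categories; `RunScores` is an assumed shape):

```typescript
interface RunScores { [caseId: string]: number }

// A case regresses when it exists in both runs and its score decreased.
function findRegressions(base: RunScores, compare: RunScores): string[] {
  return Object.keys(base).filter(
    id => id in compare && compare[id] < base[id],
  );
}
```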
## Recording fixtures in CI

For the initial fixture recording, run in live mode once:

```sh
# Record fixtures (requires API keys)
OPENAI_API_KEY=${{ secrets.OPENAI_API_KEY }} \
  agent-eval-kit record --suite=smoke
```

Commit the `.eval-fixtures/` directory to your repository. Subsequent CI runs use replay mode.
- **Commit fixtures**: Check `.eval-fixtures/` into git for reproducible CI runs
- **Use `targetVersion`**: Bump the version when your agent changes to invalidate old fixtures
- **Separate suites**: Use a fast `smoke` suite for pre-push and a thorough `full` suite for CI
- **Cache the judge cache**: The `.eval-cache/` directory can be cached between CI runs to avoid redundant LLM calls
- **Cost estimation**: Use `--confirm-cost` to preview cost before live runs
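For the judge-cache tip above, a standard `actions/cache` step in the workflow works; a sketch (the `.eval-cache/` path comes from the docs, the cache key is an example to adapt):

```yaml
- uses: actions/cache@v4
  with:
    path: .eval-cache/
    key: eval-judge-cache-${{ github.sha }}
    restore-keys: |
      eval-judge-cache-
```

The `restore-keys` prefix lets a run start from the most recent cache even when the exact key misses.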