# Quick Start
## 1. Initialize your project

After installation, use the init wizard to scaffold everything:

```sh
npx agent-eval-kit init
```

This creates `eval.config.ts`, starter cases in `cases/smoke.jsonl`, and optionally a GitHub Actions workflow and `AGENTS.md`.
Or create the files manually:
## 2. Create your config

Create `eval.config.ts` in your project root:
```ts
import { defineConfig } from "agent-eval-kit";
import { contains, latency } from "agent-eval-kit";

export default defineConfig({
  suites: [
    {
      name: "smoke",
      target: async (input) => {
        // Replace with your actual agent/API call
        const response = await myAgent(input.prompt);
        return { text: response.text, latencyMs: response.latencyMs };
      },
      cases: "cases/smoke.jsonl",
      defaultGraders: [
        { grader: contains("hello") },
        { grader: latency(5000) },
      ],
      gates: { passRate: 0.8 },
    },
  ],
});
```
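The `contains` and `latency` graders above score each case's output. If you need a custom check, a grader can be sketched as a plain function. Note that the types and interface below are assumptions for illustration, not agent-eval-kit's documented API:

```typescript
// Hypothetical sketch of a custom grader; the real agent-eval-kit
// grader interface is not shown above, so these types are assumed.
type AgentOutput = { text: string; latencyMs: number };
type GradeResult = { pass: boolean; score: number; reason?: string };

// A case-insensitive variant of contains("hello"): pass when the
// needle appears anywhere in the output text, ignoring case.
function containsIgnoreCase(needle: string) {
  return (output: AgentOutput): GradeResult => {
    const pass = output.text.toLowerCase().includes(needle.toLowerCase());
    return {
      pass,
      score: pass ? 1 : 0,
      reason: pass ? undefined : `output does not contain "${needle}"`,
    };
  };
}
```

Assuming the kit accepts plain functions, you could then list it alongside the built-ins, e.g. `{ grader: containsIgnoreCase("hello") }`.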
## 3. Define test cases

Create `cases/smoke.jsonl`:
```jsonl
{"id": "greeting", "input": {"prompt": "Say hello"}, "expected": {"text": "hello"}}
{"id": "farewell", "input": {"prompt": "Say goodbye"}, "expected": {"text": "goodbye"}}
```

Or use YAML (`cases/smoke.yaml`):
```yaml
- id: greeting
  input:
    prompt: "Say hello"
  expected:
    text: "hello"
- id: farewell
  input:
    prompt: "Say goodbye"
  expected:
    text: "goodbye"
```
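Both formats describe the same cases: JSONL holds one standalone JSON object per line. A tiny parser (a hypothetical helper for illustration, not part of agent-eval-kit) shows the record shape the examples above use:

```typescript
// Shape of one test case, matching the JSONL/YAML examples above.
type EvalCase = {
  id: string;
  input: { prompt: string };
  expected: { text: string };
};

// Parse JSONL text: one JSON-encoded case per non-empty line.
function parseCases(jsonl: string): EvalCase[] {
  return jsonl
    .split("\n")
    .filter((line) => line.trim().length > 0)
    .map((line) => JSON.parse(line) as EvalCase);
}
```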
## 4. Run in live mode

```sh
# Run against your real API
agent-eval-kit run --suite=smoke
```

This calls your target for each case, grades the output, and prints a summary.
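Whether the run succeeds is decided by the `gates` entry in the config: with `passRate: 0.8`, at least 80% of cases must pass. A minimal sketch of that check (the kit's actual gate logic may differ):

```typescript
// Hypothetical illustration of a passRate gate: the run meets the
// gate when the fraction of passing cases is at least the threshold.
function meetsPassRateGate(
  passed: number,
  total: number,
  threshold: number,
): boolean {
  if (total === 0) return false; // no cases: treat as a gate failure
  return passed / total >= threshold;
}
```

Under this rule, 4 of 5 passing cases meets the 0.8 gate, while 3 of 5 fails it.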
## 5. Record fixtures

```sh
# Record API responses for instant replay
agent-eval-kit run --mode=live --record --suite=smoke
```

Fixtures are saved to `.eval-fixtures/`.
## 6. Replay from fixtures

```sh
# Instant, zero-cost replay
agent-eval-kit run --mode=replay --suite=smoke
```

No API calls are made; the recorded responses are replayed and graded.
## 7. Watch mode

```sh
# Re-run on file changes
agent-eval-kit run --watch --mode=replay --suite=smoke
```

## 8. Compare runs
```sh
# List recent runs
agent-eval-kit list

# Compare two runs
agent-eval-kit compare --base=<run-id-1> --compare=<run-id-2>
```

## Next steps
- Learn about Concepts (modes, graders, fixtures, gates)
- Explore the Graders Guide for scoring strategies
- Set up CI Integration for automated quality gates