
Quick Start

After installation, use the init wizard to scaffold everything:

npx agent-eval-kit init

This creates eval.config.ts, starter cases in cases/smoke.jsonl, and optionally a GitHub Actions workflow and AGENTS.md.

Or create the files manually:

Create eval.config.ts in your project root:

import { defineConfig, contains, latency } from "agent-eval-kit";

export default defineConfig({
  suites: [
    {
      name: "smoke",
      target: async (input) => {
        // Replace with your actual agent/API call
        const response = await myAgent(input.prompt);
        return { text: response.text, latencyMs: response.latencyMs };
      },
      cases: "cases/smoke.jsonl",
      defaultGraders: [
        { grader: contains("hello") },
        { grader: latency(5000) },
      ],
      gates: { passRate: 0.8 },
    },
  ],
});
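The config leaves `myAgent` up to you. A minimal sketch of what such a wrapper might look like, assuming a stubbed response in place of a real API call (the `myAgent` name and `AgentResult` shape are illustrative, not part of agent-eval-kit):

```typescript
// Hypothetical agent wrapper. The stubbed response below stands in for
// a real API call; only the returned shape matters to the graders.
interface AgentResult {
  text: string;
  latencyMs: number;
}

async function myAgent(prompt: string): Promise<AgentResult> {
  const start = Date.now();
  // Replace this stub with your real call, e.g. a fetch() to your agent API.
  const text = `hello, you said: ${prompt}`;
  return { text, latencyMs: Date.now() - start };
}
```

The returned object lines up with the suite's graders: `contains("hello")` inspects `text`, and `latency(5000)` inspects `latencyMs`.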

Create cases/smoke.jsonl:

{"id": "greeting", "input": {"prompt": "Say hello"}, "expected": {"text": "hello"}}
{"id": "farewell", "input": {"prompt": "Say goodbye"}, "expected": {"text": "goodbye"}}

Or use YAML (cases/smoke.yaml):

- id: greeting
  input:
    prompt: "Say hello"
  expected:
    text: "hello"
- id: farewell
  input:
    prompt: "Say goodbye"
  expected:
    text: "goodbye"
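Whichever format you use, each case is a standalone record with a unique `id`. A quick sanity check for a JSONL file can parse each line and catch malformed JSON or duplicate ids early (a sketch, not part of the kit; the `EvalCase` shape mirrors the examples above):

```typescript
// Parse JSONL case text: one JSON object per non-empty line.
// Throws on malformed JSON or duplicate ids.
interface EvalCase {
  id: string;
  input: { prompt: string };
  expected?: { text: string };
}

function parseCases(jsonl: string): EvalCase[] {
  const cases = jsonl
    .split("\n")
    .filter((line) => line.trim())
    .map((line) => JSON.parse(line) as EvalCase);
  const seen = new Set<string>();
  for (const c of cases) {
    if (seen.has(c.id)) throw new Error(`duplicate case id: ${c.id}`);
    seen.add(c.id);
  }
  return cases;
}
```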
# Run against your real API
agent-eval-kit run --suite=smoke

This calls your target for each case, grades the output, and prints a summary.
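The `gates: { passRate: 0.8 }` setting from the config means at least 80% of cases must pass for the run to succeed. The check amounts to simple arithmetic (a sketch of the idea, not the kit's internal code):

```typescript
// Sketch of a pass-rate gate: the run succeeds only if the fraction
// of passing cases meets or exceeds the configured threshold.
function gatePasses(results: boolean[], passRateGate: number): boolean {
  const passed = results.filter(Boolean).length;
  return results.length > 0 && passed / results.length >= passRateGate;
}
```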

# Record API responses for instant replay
agent-eval-kit run --mode=live --record --suite=smoke

Fixtures are saved to .eval-fixtures/.

# Instant, zero-cost replay
agent-eval-kit run --mode=replay --suite=smoke

No API calls are made — the recorded responses are replayed and graded.
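Conceptually, replay swaps the live target for a lookup into the recorded fixtures. A minimal sketch of the idea (the fixture shape and keying here are hypothetical; agent-eval-kit's actual on-disk format in `.eval-fixtures/` may differ):

```typescript
// Sketch of replay: recorded responses are looked up by case id
// instead of calling the live API. Fixture shape is hypothetical.
type Fixture = { text: string; latencyMs: number };

function makeReplayTarget(fixtures: Map<string, Fixture>) {
  return async (caseId: string): Promise<Fixture> => {
    const hit = fixtures.get(caseId);
    if (!hit) throw new Error(`no recorded fixture for case "${caseId}"`);
    return hit; // no network call: instant and zero-cost
  };
}
```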

# Re-run on file changes
agent-eval-kit run --watch --mode=replay --suite=smoke
# List recent runs
agent-eval-kit list
# Compare two runs
agent-eval-kit compare --base=<run-id-1> --compare=<run-id-2>