
Quick Start

After installation, use the init wizard to scaffold everything:

npx agent-eval-kit init

This creates eval.config.ts, starter cases in cases/smoke.jsonl, and optionally a GitHub Actions workflow and AGENTS.md.

Or create the files manually:

Create eval.config.ts in your project root:

import { defineConfig, contains, latency } from "agent-eval-kit";

export default defineConfig({
  suites: [
    {
      name: "smoke",
      target: async (input) => {
        // Replace with your actual agent/API call
        const response = await myAgent(input.prompt);
        return { text: response.text, latencyMs: response.latencyMs };
      },
      cases: "cases/smoke.jsonl",
      defaultGraders: [
        { grader: contains("hello") },
        { grader: latency(5000) },
      ],
      gates: { passRate: 0.8 },
    },
  ],
});
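The config leaves `myAgent` up to you. A minimal sketch of what such a wrapper might look like, assuming a stubbed response in place of a real API call (the `myAgent` name and `AgentResult` shape are illustrative, not part of agent-eval-kit):

```typescript
// Hypothetical agent wrapper. The stubbed response below stands in for
// a real API call; only the returned shape matters to the graders.
interface AgentResult {
  text: string;
  latencyMs: number;
}

async function myAgent(prompt: string): Promise<AgentResult> {
  const start = Date.now();
  // Replace this stub with your real call, e.g. a fetch() to your agent API.
  const text = `hello, you said: ${prompt}`;
  return { text, latencyMs: Date.now() - start };
}
```

The returned object lines up with the suite's graders: `contains("hello")` inspects `text`, and `latency(5000)` inspects `latencyMs`.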

Create cases/smoke.jsonl:

{"id": "greeting", "input": {"prompt": "Say hello"}, "expected": {"text": "hello"}}
{"id": "farewell", "input": {"prompt": "Say goodbye"}, "expected": {"text": "goodbye"}}

Or use YAML (cases/smoke.yaml):

- id: greeting
  input:
    prompt: "Say hello"
  expected:
    text: "hello"
- id: farewell
  input:
    prompt: "Say goodbye"
  expected:
    text: "goodbye"
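Whichever format you use, each case is a standalone record with a unique `id`. A quick sanity check for a JSONL file can parse each line and catch malformed JSON or duplicate ids early (a sketch, not part of the kit; the `EvalCase` shape mirrors the examples above):

```typescript
// Parse JSONL case text: one JSON object per non-empty line.
// Throws on malformed JSON or duplicate ids.
interface EvalCase {
  id: string;
  input: { prompt: string };
  expected?: { text: string };
}

function parseCases(jsonl: string): EvalCase[] {
  const cases = jsonl
    .split("\n")
    .filter((line) => line.trim())
    .map((line) => JSON.parse(line) as EvalCase);
  const seen = new Set<string>();
  for (const c of cases) {
    if (seen.has(c.id)) throw new Error(`duplicate case id: ${c.id}`);
    seen.add(c.id);
  }
  return cases;
}
```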
# Run against your real API
agent-eval-kit run --suite=smoke

This calls your target for each case, grades the output, and prints a summary.
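The `gates: { passRate: 0.8 }` setting from the config means at least 80% of cases must pass for the run to succeed. The check amounts to simple arithmetic (a sketch of the idea, not the kit's internal code):

```typescript
// Sketch of a pass-rate gate: the run succeeds only if the fraction
// of passing cases meets or exceeds the configured threshold.
function gatePasses(results: boolean[], passRateGate: number): boolean {
  const passed = results.filter(Boolean).length;
  return results.length > 0 && passed / results.length >= passRateGate;
}
```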

# Record API responses for instant replay
agent-eval-kit run --mode=live --record --suite=smoke

Fixtures are saved to .eval-fixtures/.

# Instant, zero-cost replay
agent-eval-kit run --mode=replay --suite=smoke

No API calls are made — the recorded responses are replayed and graded.
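Conceptually, replay swaps the live target for a lookup into the recorded fixtures. A minimal sketch of the idea (the fixture shape and keying here are hypothetical; agent-eval-kit's actual on-disk format in `.eval-fixtures/` may differ):

```typescript
// Sketch of replay: recorded responses are looked up by case id
// instead of calling the live API. Fixture shape is hypothetical.
type Fixture = { text: string; latencyMs: number };

function makeReplayTarget(fixtures: Map<string, Fixture>) {
  return async (caseId: string): Promise<Fixture> => {
    const hit = fixtures.get(caseId);
    if (!hit) throw new Error(`no recorded fixture for case "${caseId}"`);
    return hit; // no network call: instant and zero-cost
  };
}
```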

# Re-run on file changes
agent-eval-kit run --watch --mode=replay --suite=smoke
# List recent runs
agent-eval-kit list
# Compare two runs
agent-eval-kit compare --base=<run-id-1> --compare=<run-id-2>