
agent-eval-kit

Record-replay, deterministic graders, LLM-as-judge, CI integration. Ship confident AI agents with $0 pre-push evals.

Record & Replay

Record API responses once, replay them instantly. Run evals in milliseconds with zero API costs during development.
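The record-replay pattern can be sketched in a few lines. This is an illustrative sketch of the general technique, not agent-eval-kit's actual internals: responses are keyed by a hash of the input, stored once in record mode, and served straight from the fixture map in replay mode.

```typescript
import { createHash } from "node:crypto";

type Mode = "live" | "record" | "replay";

// Stable fixture key derived from the input payload.
function keyFor(input: unknown): string {
  return createHash("sha256").update(JSON.stringify(input)).digest("hex");
}

// Wrap any async call so it records or replays against a fixture map.
function withFixtures<I, O>(
  call: (input: I) => Promise<O>,
  fixtures: Map<string, O>,
  mode: Mode,
) {
  return async (input: I): Promise<O> => {
    const key = keyFor(input);
    if (mode === "replay") {
      const hit = fixtures.get(key);
      if (hit === undefined) throw new Error(`no fixture for key ${key}`);
      return hit; // instant, no API call
    }
    const out = await call(input); // real call in live/record mode
    if (mode === "record") fixtures.set(key, out);
    return out;
  };
}
```

In practice the fixture map would be persisted to disk between runs; the in-memory map keeps the sketch self-contained.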

20 Built-in Graders

From exact-match to hallucination detection. Compose graders with boolean logic, weight scores, and set quality gates.
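Grader composition can be modeled simply: treat a grader as a function from an output to a 0–1 score, then combine with boolean and weighted combinators. The names below (`containsGrader`, `allOf`, `weighted`) are illustrative, not the library's exact API.

```typescript
type Grader = (output: { text: string }) => number;

// Deterministic substring check: 1 if present, 0 otherwise.
const containsGrader = (needle: string): Grader =>
  (o) => (o.text.includes(needle) ? 1 : 0);

// Boolean AND: passes only if every grader scores a full 1.
const allOf = (...graders: Grader[]): Grader =>
  (o) => (graders.every((g) => g(o) === 1) ? 1 : 0);

// Weighted blend: normalizes weights so the result stays in [0, 1].
const weighted = (parts: Array<{ grader: Grader; weight: number }>): Grader =>
  (o) => {
    const total = parts.reduce((sum, p) => sum + p.weight, 0);
    return parts.reduce((sum, p) => sum + p.weight * p.grader(o), 0) / total;
  };
```

A quality gate then reduces to comparing the composed score against a threshold per case.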

LLM-as-Judge

When deterministic graders aren’t enough, use LLM rubric scoring with built-in caching, bias mitigation, and cost tracking.
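Caching judge calls is what keeps rubric scoring affordable across repeated runs. A minimal sketch, assuming a generic `judge` function standing in for whatever model call you use (this is not the library's implementation): identical (rubric, output) pairs hit the cache, and a call counter gives crude cost tracking.

```typescript
type Judge = (rubric: string, text: string) => Promise<number>;

function cachedJudge(judge: Judge) {
  const cache = new Map<string, number>();
  let realCalls = 0; // crude cost tracking: count of uncached judge calls

  const score = async (rubric: string, text: string): Promise<number> => {
    const key = rubric + "\u0000" + text; // NUL-joined key avoids collisions
    const hit = cache.get(key);
    if (hit !== undefined) return hit; // cached: no tokens spent
    realCalls++;
    const s = await judge(rubric, text);
    cache.set(key, s);
    return s;
  };

  return { score, calls: () => realCalls };
}
```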

CI-Ready

Quality gates with configurable thresholds. Fail your CI pipeline when eval scores regress below acceptable levels.
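A pass-rate gate reduces to one comparison. The sketch below mirrors the `gates: { passRate: 0.9 }` setting in the config; the exit-code behavior is an assumption about typical CI wiring, not a documented detail of the tool.

```typescript
// Returns true when the fraction of passing cases meets the threshold.
function gatePassRate(results: boolean[], threshold: number): boolean {
  const rate = results.filter(Boolean).length / results.length;
  return rate >= threshold;
}

// In a CI entrypoint you might then fail the pipeline on regression:
// if (!gatePassRate(results, 0.9)) process.exit(1);
```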

import { defineConfig, contains, latency } from "agent-eval-kit";

export default defineConfig({
  suites: [
    {
      name: "smoke",
      target: async (input) => {
        const response = await myAgent(input.prompt);
        return { text: response.text, latencyMs: response.latencyMs };
      },
      cases: "cases/smoke.jsonl",
      defaultGraders: [
        { grader: contains("expected output") },
        { grader: latency(5000) },
      ],
      gates: { passRate: 0.9 },
    },
  ],
});
# Record fixtures (calls your agent once)
agent-eval-kit run --mode=live --record --suite=smoke
# Replay from fixtures (instant, $0)
agent-eval-kit run --mode=replay --suite=smoke
# Watch mode for development
agent-eval-kit run --watch --mode=replay --suite=smoke