
agent-eval-kit

Record-replay, deterministic graders, LLM-as-judge, CI integration. Ship confident AI agents with $0 pre-push evals.

Record & Replay

Record API responses once, replay them instantly. Run evals in milliseconds with zero API costs during development.
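The record-replay pattern can be sketched in a few lines. This is an illustrative sketch of the general technique, not agent-eval-kit's actual internals: responses are keyed by a hash of the input, stored once in record mode, and served straight from the fixture map in replay mode.

```typescript
import { createHash } from "node:crypto";

type Mode = "live" | "record" | "replay";

// Stable fixture key derived from the input payload.
function keyFor(input: unknown): string {
  return createHash("sha256").update(JSON.stringify(input)).digest("hex");
}

// Wrap any async call so it records or replays against a fixture map.
function withFixtures<I, O>(
  call: (input: I) => Promise<O>,
  fixtures: Map<string, O>,
  mode: Mode,
) {
  return async (input: I): Promise<O> => {
    const key = keyFor(input);
    if (mode === "replay") {
      const hit = fixtures.get(key);
      if (hit === undefined) throw new Error(`no fixture for key ${key}`);
      return hit; // instant, no API call
    }
    const out = await call(input); // real call in live/record mode
    if (mode === "record") fixtures.set(key, out);
    return out;
  };
}
```

In practice the fixture map would be persisted to disk between runs; the in-memory map keeps the sketch self-contained.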

20 Built-in Graders

From exact-match to hallucination detection. Compose graders with boolean logic, weight scores, and set quality gates.
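Grader composition can be modeled simply: treat a grader as a function from an output to a 0–1 score, then combine with boolean and weighted combinators. The names below (`containsGrader`, `allOf`, `weighted`) are illustrative, not the library's exact API.

```typescript
type Grader = (output: { text: string }) => number;

// Deterministic substring check: 1 if present, 0 otherwise.
const containsGrader = (needle: string): Grader =>
  (o) => (o.text.includes(needle) ? 1 : 0);

// Boolean AND: passes only if every grader scores a full 1.
const allOf = (...graders: Grader[]): Grader =>
  (o) => (graders.every((g) => g(o) === 1) ? 1 : 0);

// Weighted blend: normalizes weights so the result stays in [0, 1].
const weighted = (parts: Array<{ grader: Grader; weight: number }>): Grader =>
  (o) => {
    const total = parts.reduce((sum, p) => sum + p.weight, 0);
    return parts.reduce((sum, p) => sum + p.weight * p.grader(o), 0) / total;
  };
```

A quality gate then reduces to comparing the composed score against a threshold per case.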

LLM-as-Judge

When deterministic graders aren’t enough, use LLM rubric scoring with built-in caching, bias mitigation, and cost tracking.
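Caching judge calls is what keeps rubric scoring affordable across repeated runs. A minimal sketch, assuming a generic `judge` function standing in for whatever model call you use (this is not the library's implementation): identical (rubric, output) pairs hit the cache, and a call counter gives crude cost tracking.

```typescript
type Judge = (rubric: string, text: string) => Promise<number>;

function cachedJudge(judge: Judge) {
  const cache = new Map<string, number>();
  let realCalls = 0; // crude cost tracking: count of uncached judge calls

  const score = async (rubric: string, text: string): Promise<number> => {
    const key = rubric + "\u0000" + text; // NUL-joined key avoids collisions
    const hit = cache.get(key);
    if (hit !== undefined) return hit; // cached: no tokens spent
    realCalls++;
    const s = await judge(rubric, text);
    cache.set(key, s);
    return s;
  };

  return { score, calls: () => realCalls };
}
```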

CI-Ready

Quality gates with configurable thresholds. Fail your CI pipeline when eval scores regress below acceptable levels.
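A pass-rate gate reduces to one comparison. The sketch below mirrors the `gates: { passRate: 0.9 }` setting in the config; the exit-code behavior is an assumption about typical CI wiring, not a documented detail of the tool.

```typescript
// Returns true when the fraction of passing cases meets the threshold.
function gatePassRate(results: boolean[], threshold: number): boolean {
  const rate = results.filter(Boolean).length / results.length;
  return rate >= threshold;
}

// In a CI entrypoint you might then fail the pipeline on regression:
// if (!gatePassRate(results, 0.9)) process.exit(1);
```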

import { defineConfig, contains, latency } from "agent-eval-kit";

export default defineConfig({
  suites: [
    {
      name: "smoke",
      target: async (input) => {
        const response = await myAgent(input.prompt);
        return { text: response.text, latencyMs: response.latencyMs };
      },
      cases: "cases/smoke.jsonl",
      defaultGraders: [
        { grader: contains("expected output") },
        { grader: latency(5000) },
      ],
      gates: { passRate: 0.9 },
    },
  ],
});
# Record fixtures (calls your agent once)
agent-eval-kit run --mode=live --record --suite=smoke
# Replay from fixtures (instant, $0)
agent-eval-kit run --mode=replay --suite=smoke
# Watch mode for development
agent-eval-kit run --watch --mode=replay --suite=smoke