Record & Replay
Record API responses once, replay them instantly. Run evals in milliseconds with zero API costs during development.
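The record/replay idea can be sketched as a small wrapper (a hypothetical helper for illustration, not the library's actual API): in live mode the agent is called and its response is cached under a key derived from the input; in replay mode the cached response is returned with no API call.

```typescript
type Mode = "live" | "replay";

// Wrap an agent so its responses can be recorded once and replayed later.
function makeRecorder<I, O>(agent: (input: I) => Promise<O>) {
  const fixtures = new Map<string, O>(); // a real tool would persist these to disk

  return async function run(input: I, mode: Mode): Promise<O> {
    const key = JSON.stringify(input);
    if (mode === "replay") {
      const cached = fixtures.get(key);
      if (cached === undefined) throw new Error(`No recorded fixture for ${key}`);
      return cached; // instant, zero API cost
    }
    const output = await agent(input); // live call
    fixtures.set(key, output);         // record for later replays
    return output;
  };
}
```

Keying fixtures by the serialized input means a changed prompt automatically misses the cache and signals that a re-record is needed.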
20 Built-in Graders
From exact-match to hallucination detection. Compose graders with boolean logic, weight their scores, and set quality gates.
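Composition might look like the following sketch (the grader shapes and names here are illustrative assumptions, not the library's real signatures): each grader maps an output to a score in [0, 1], and combinators build boolean and weighted aggregates from them.

```typescript
type Grader = (output: string) => number; // score in [0, 1]

// Two simple deterministic graders.
const exactMatch = (expected: string): Grader =>
  (out) => (out === expected ? 1 : 0);
const containsText = (needle: string): Grader =>
  (out) => (out.includes(needle) ? 1 : 0);

// Boolean AND: passes only when every child grader passes.
const all = (...graders: Grader[]): Grader =>
  (out) => (graders.every((g) => g(out) === 1) ? 1 : 0);

// Weighted average of child grader scores.
const weighted = (parts: Array<{ grader: Grader; weight: number }>): Grader =>
  (out) => {
    const total = parts.reduce((sum, p) => sum + p.weight, 0);
    return parts.reduce((sum, p) => sum + p.grader(out) * p.weight, 0) / total;
  };
```

For example, `weighted([{ grader: containsText("Paris"), weight: 3 }, { grader: exactMatch("Paris"), weight: 1 }])` scores a verbose-but-correct answer 0.75 rather than failing it outright.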
LLM-as-Judge
When deterministic graders aren’t enough, use LLM rubric scoring with built-in caching, bias mitigation, and cost tracking.
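The caching half of that can be sketched as a memoizing wrapper around a judge function (illustrative only; the judge below is a stub standing in for a real model call, and the cost tracking is simply a count of uncached calls):

```typescript
type Judge = (output: string, rubric: string) => Promise<number>;

// Wrap a judge so identical (rubric, output) pairs are scored only once.
function cachedJudge(judge: Judge): Judge & { calls: () => number } {
  const cache = new Map<string, number>();
  let calls = 0; // uncached judge invocations, a proxy for API spend

  const wrapped = async (output: string, rubric: string): Promise<number> => {
    const key = `${rubric}\u0000${output}`;
    const hit = cache.get(key);
    if (hit !== undefined) return hit; // cache hit: no model call, no cost
    calls++;
    const score = await judge(output, rubric);
    cache.set(key, score);
    return score;
  };
  return Object.assign(wrapped, { calls: () => calls });
}
```

Re-running a suite against unchanged outputs then costs nothing, since every rubric judgment is a cache hit.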
CI-Ready
Quality gates with configurable thresholds. Fail your CI pipeline when eval scores regress below acceptable levels.
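A pass-rate gate reduces to a small check like this (the result shape is a hypothetical sketch; the 0.9 threshold mirrors the `gates: { passRate: 0.9 }` setting in the config below):

```typescript
interface CaseResult {
  id: string;
  passed: boolean;
}

// True when the fraction of passing cases meets the configured threshold.
function checkGate(results: CaseResult[], passRate: number): boolean {
  const passed = results.filter((r) => r.passed).length;
  return passed / results.length >= passRate;
}
```

In a CI step, the runner would exit nonzero when the gate fails, e.g. `process.exitCode = checkGate(results, 0.9) ? 0 : 1`, which is what fails the pipeline on regression.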
```typescript
import { defineConfig, contains, latency } from "agent-eval-kit";

export default defineConfig({
  suites: [
    {
      name: "smoke",
      target: async (input) => {
        const response = await myAgent(input.prompt);
        return { text: response.text, latencyMs: response.latencyMs };
      },
      cases: "cases/smoke.jsonl",
      defaultGraders: [
        { grader: contains("expected output") },
        { grader: latency(5000) },
      ],
      gates: { passRate: 0.9 },
    },
  ],
});
```

```jsonl
{"id": "greeting", "input": {"prompt": "Say hello"}, "expected": {"text": "hello"}}
{"id": "capital", "input": {"prompt": "What is the capital of France?"}, "expected": {"text": "Paris"}}
{"id": "math", "input": {"prompt": "What is 2 + 2?"}, "expected": {"text": "4"}}
```

```sh
# Record fixtures (calls your agent once)
agent-eval-kit run --mode=live --record --suite=smoke

# Replay from fixtures (instant, $0)
agent-eval-kit run --mode=replay --suite=smoke

# Watch mode for development
agent-eval-kit run --watch --mode=replay --suite=smoke
```