# Core Concepts

## Terminology

agent-eval-kit uses precise terminology. Understanding these terms makes the documentation and API clearer.
| Term | Definition |
|---|---|
| Suite | A collection of test cases with a target function, default graders, and quality gates |
| Case | A single test input with optional expected output |
| Trial | One execution of a case (multiple trials enable statistical analysis) |
| Run | A complete execution of a suite — all trials, all grades, all metadata |
| Grader | A function that scores a target’s output (deterministic or LLM-based) |
| Gate | A threshold that determines if a run passes or fails (e.g., 90% pass rate) |
| Fixture | A recorded target response for replay without API calls |
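The terminology above maps naturally onto data shapes. The following TypeScript sketch is illustrative only — these are not the library's actual type definitions, and the exact field layout is an assumption:

```typescript
// Hypothetical shapes mirroring the terminology table (illustrative, not the real API).

interface EvalCase {
  id: string;
  input: { prompt: string };
  expected?: string;   // optional expected output
  category?: string;   // e.g. "happy_path"
}

interface Trial {
  caseId: string;      // one execution of a case
  output: string;
  latencyMs: number;
}

interface Suite {
  name: string;
  targetVersion: string;  // used for fixture invalidation
  cases: EvalCase[];
  gates?: { passRate?: number; maxCost?: number; p95LatencyMs?: number };
}

const smoke: Suite = {
  name: "smoke",
  targetVersion: "v1",
  cases: [{ id: "greeting", input: { prompt: "hi" }, category: "happy_path" }],
};
```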
## Execution Modes

### Live mode

Calls your target function for every case. Use this for initial recording and when you need fresh responses.

```sh
agent-eval-kit run --mode=live --suite=smoke
```

### Replay mode

Loads pre-recorded fixtures instead of calling the target. Instant, deterministic, zero cost.

```sh
agent-eval-kit run --mode=replay --suite=smoke
```

### Judge-only mode

Re-grades a previous run with updated graders. Useful when tuning LLM judge criteria without re-running the target.

```sh
agent-eval-kit run --mode=judge-only --run-id=<previous-run-id> --suite=smoke
```

## Graders

Graders are functions that score outputs. agent-eval-kit ships 20 built-in graders in three tiers:
### Deterministic graders

Fast, free, reproducible. Pure functions with no I/O:

- Text: `contains`, `notContains`, `exactMatch`, `regex`, `jsonSchema`
- Tool call: `toolCalled`, `toolNotCalled`, `toolSequence`, `toolArgsMatch`
- Metric: `latency`, `cost`, `tokenCount`
- Safety: `safetyKeywords`, `noHallucinatedNumbers`
### LLM graders

Use an LLM judge to evaluate quality. These require a judge configuration:

- `llmRubric` — Score against natural-language criteria (1–4 scale)
- `factuality` — Check factual consistency against a reference
- `llmClassify` — Classify output into categories
### Composition operators

Combine graders with boolean logic: `all` (conjunction), `any` (disjunction), `not` (negation). These do not short-circuit — all results are collected for complete reporting.
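The non-short-circuiting behavior can be sketched as follows. This is a minimal illustration of the idea, not the library's actual implementation; the `GradeResult` shape is an assumption:

```typescript
// Hypothetical grader signature: a function from output to a pass/fail result.
type GradeResult = { pass: boolean; children?: GradeResult[] };
type Grader = (output: string) => GradeResult;

// Every child grader is evaluated (no short-circuit), so all results
// are available for complete reporting.
const all = (...graders: Grader[]): Grader => (out) => {
  const children = graders.map((g) => g(out));
  return { pass: children.every((r) => r.pass), children };
};

const any = (...graders: Grader[]): Grader => (out) => {
  const children = graders.map((g) => g(out));
  return { pass: children.some((r) => r.pass), children };
};

const not = (grader: Grader): Grader => (out) => {
  const child = grader(out);
  return { pass: !child.pass, children: [child] };
};

// A toy deterministic grader for demonstration.
const contains = (s: string): Grader => (out) => ({ pass: out.includes(s) });
```

Because `map` runs every child before the boolean is computed, a report can show exactly which sub-graders failed even when the combined result was already decided.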
## Weighted Scoring

Each grader in a suite’s `defaultGraders` array can have a `weight`, `required` flag, and `threshold`:

```js
defaultGraders: [
  { grader: contains("Paris"), weight: 0.3 },
  { grader: latency(5000), weight: 0.2 },
  { grader: llmRubric("Helpful response"), weight: 0.5 },
]
```

The final case score is a weighted average. Required graders that fail cause an immediate score of 0 regardless of other graders. The case passes if the weighted score meets the pass threshold (default: 0.5, or the minimum of all configured grader thresholds).
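The scoring rule reduces to a small computation. Here is a sketch, assuming each grader result has already been reduced to a 0–1 score; the function and field names are illustrative, not the library's API:

```typescript
// One grader's contribution to a case, after grading (illustrative shape).
interface WeightedResult {
  score: number;      // 0–1 score from the grader
  weight: number;
  required?: boolean; // a failing required grader zeroes the whole case
}

function caseScore(results: WeightedResult[]): number {
  // Required grader failed: immediate 0 regardless of other graders.
  if (results.some((r) => r.required && r.score === 0)) return 0;
  // Otherwise: weighted average of the individual scores.
  const totalWeight = results.reduce((sum, r) => sum + r.weight, 0);
  const weighted = results.reduce((sum, r) => sum + r.score * r.weight, 0);
  return totalWeight > 0 ? weighted / totalWeight : 0;
}
```

With the example weights above, a passing `contains` (0.3) and `llmRubric` (0.5) but failing `latency` (0.2) would yield a score of 0.8, which clears the default 0.5 pass threshold.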
## Suite-Level Graders

All grading configuration is at the suite level via `defaultGraders`. Every case in a suite is evaluated with the same graders. To use different graders for different cases, split them into separate suites.
## Gates

Gates are quality thresholds that make eval runs actionable. Three gate types are available:

```js
gates: {
  passRate: 0.9,       // 90% of cases must pass
  maxCost: 1.50,       // Total run cost must not exceed $1.50
  p95LatencyMs: 5000,  // 95th percentile latency under 5 seconds
}
```

All gates are optional. They are checked after all grading completes. A failed gate means the run’s `gateResult.pass` is false — the CLI exits with code 1, useful for CI pipelines.
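The gate check described above amounts to comparing run statistics against each configured threshold. A minimal sketch, assuming the run stats are already computed (names are illustrative):

```typescript
// Configured gates: all fields optional, matching the config shown above.
interface Gates { passRate?: number; maxCost?: number; p95LatencyMs?: number }

// Aggregate stats for a completed run (illustrative shape).
interface RunStats { passRate: number; totalCost: number; p95LatencyMs: number }

// Returns the overall pass/fail used to set the CLI exit code.
function checkGates(gates: Gates, stats: RunStats): boolean {
  if (gates.passRate !== undefined && stats.passRate < gates.passRate) return false;
  if (gates.maxCost !== undefined && stats.totalCost > gates.maxCost) return false;
  if (gates.p95LatencyMs !== undefined && stats.p95LatencyMs > gates.p95LatencyMs) return false;
  return true; // unset gates are skipped, so all gates are optional
}
```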
## Record-Replay (Fixtures)

The record-replay engine is the core differentiator:

- Record: Run in live mode with `--record` to capture target responses as fixtures
- Replay: Run in replay mode to grade from fixtures — no API calls, instant results
- Invalidation: Fixtures are keyed by a config hash of the suite name and `targetVersion`. Bump `targetVersion` when your target’s behavior changes
- Staleness: Fixtures older than the configured TTL (default: 14 days) generate warnings (or errors with `--strict-fixtures`)

This enables $0 pre-push evals — record once, replay thousands of times during development.
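The invalidation mechanism can be illustrated with a small sketch. The actual hashing scheme is not specified here; this only demonstrates the idea that bumping `targetVersion` changes the key, so stale fixtures stop matching:

```typescript
import { createHash } from "node:crypto";

// Hypothetical fixture key: a config hash over suite name + targetVersion.
// Any change to targetVersion produces a different key, invalidating
// previously recorded fixtures for that suite.
function fixtureKey(suiteName: string, targetVersion: string): string {
  return createHash("sha256")
    .update(`${suiteName}:${targetVersion}`)
    .digest("hex")
    .slice(0, 12);
}
```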
## Case Categories

Cases can be tagged with a `category` for organized reporting:

```json
{"id": "greeting", "input": {"prompt": "hi"}, "category": "happy_path"}
{"id": "empty-input", "input": {"prompt": ""}, "category": "edge_case"}
```

Built-in categories: `happy_path`, `edge_case`, `adversarial`, `multi_step`, `regression`. Run summaries include per-category pass rates.
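Per-category pass rates are a straightforward aggregation. A minimal sketch of how a run summary might compute them (the function name and input shape are illustrative):

```typescript
// Group case results by category and compute the pass rate for each.
function categoryPassRates(
  results: { category: string; pass: boolean }[],
): Record<string, number> {
  const totals: Record<string, { passed: number; n: number }> = {};
  for (const r of results) {
    const t = (totals[r.category] ??= { passed: 0, n: 0 });
    t.n += 1;
    if (r.pass) t.passed += 1;
  }
  return Object.fromEntries(
    Object.entries(totals).map(([cat, t]) => [cat, t.passed / t.n]),
  );
}
```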