# Core Concepts

## Terminology

agent-eval-kit uses precise terminology. Understanding these terms makes the documentation and API clearer.
| Term | Definition |
|---|---|
| Suite | A collection of test cases with a target function, default graders, and quality gates |
| Case | A single test input with optional expected output |
| Trial | One execution of a case (multiple trials enable statistical analysis) |
| Run | A complete execution of a suite — all trials, all grades, all metadata |
| Grader | A function that scores a target’s output (deterministic or LLM-based) |
| Gate | A threshold that determines if a run passes or fails (e.g., 90% pass rate) |
| Fixture | A recorded target response for replay without API calls |
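The terminology above maps naturally onto data shapes. The following TypeScript sketch is illustrative only — these are not the library's actual type definitions, and the exact field layout is an assumption:

```typescript
// Hypothetical shapes mirroring the terminology table (illustrative, not the real API).

interface EvalCase {
  id: string;
  input: { prompt: string };
  expected?: string;   // optional expected output
  category?: string;   // e.g. "happy_path"
}

interface Trial {
  caseId: string;      // one execution of a case
  output: string;
  latencyMs: number;
}

interface Suite {
  name: string;
  targetVersion: string;  // used for fixture invalidation
  cases: EvalCase[];
  gates?: { passRate?: number; maxCost?: number; p95LatencyMs?: number };
}

const smoke: Suite = {
  name: "smoke",
  targetVersion: "v1",
  cases: [{ id: "greeting", input: { prompt: "hi" }, category: "happy_path" }],
};
```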
## Execution Modes

### Live mode

Calls your target function for every case. Use this for initial recording and when you need fresh responses.

```sh
agent-eval-kit run --mode=live --suite=smoke
```

### Replay mode

Loads pre-recorded fixtures instead of calling the target. Instant, deterministic, zero cost.

```sh
agent-eval-kit run --mode=replay --suite=smoke
```

### Judge-only mode

Re-grades a previous run with updated graders. Useful when tuning LLM judge criteria without re-running the target.

```sh
agent-eval-kit run --mode=judge-only --run-id=<previous-run-id> --suite=smoke
```

## Graders

Graders are functions that score outputs. agent-eval-kit ships 20 built-in graders in three tiers:
### Deterministic graders

Fast, free, reproducible. Pure functions with no I/O:

- Text: `contains`, `notContains`, `exactMatch`, `regex`, `jsonSchema`
- Tool call: `toolCalled`, `toolNotCalled`, `toolSequence`, `toolArgsMatch`
- Metric: `latency`, `cost`, `tokenCount`
- Safety: `safetyKeywords`, `noHallucinatedNumbers`
### LLM graders

Use an LLM judge to evaluate quality. These require a judge configuration:

- `llmRubric` — Score against natural-language criteria (1–4 scale)
- `factuality` — Check factual consistency against a reference
- `llmClassify` — Classify output into categories
### Composition operators

Combine graders with boolean logic: `all` (conjunction), `any` (disjunction), `not` (negation). These do not short-circuit — all results are collected for complete reporting.
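The non-short-circuiting behavior can be sketched as follows. This is a minimal illustration of the idea, not the library's actual implementation; the `GradeResult` shape is an assumption:

```typescript
// Hypothetical grader signature: a function from output to a pass/fail result.
type GradeResult = { pass: boolean; children?: GradeResult[] };
type Grader = (output: string) => GradeResult;

// Every child grader is evaluated (no short-circuit), so all results
// are available for complete reporting.
const all = (...graders: Grader[]): Grader => (out) => {
  const children = graders.map((g) => g(out));
  return { pass: children.every((r) => r.pass), children };
};

const any = (...graders: Grader[]): Grader => (out) => {
  const children = graders.map((g) => g(out));
  return { pass: children.some((r) => r.pass), children };
};

const not = (grader: Grader): Grader => (out) => {
  const child = grader(out);
  return { pass: !child.pass, children: [child] };
};

// A toy deterministic grader for demonstration.
const contains = (s: string): Grader => (out) => ({ pass: out.includes(s) });
```

Because `map` runs every child before the boolean is computed, a report can show exactly which sub-graders failed even when the combined result was already decided.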
## Weighted Scoring

Each grader in a suite’s `defaultGraders` array can have a `weight`, `required` flag, and `threshold`:

```js
defaultGraders: [
  { grader: contains("Paris"), weight: 0.3 },
  { grader: latency(5000), weight: 0.2 },
  { grader: llmRubric("Helpful response"), weight: 0.5 },
]
```

The final case score is a weighted average. Required graders that fail cause an immediate score of 0 regardless of other graders. The case passes if the weighted score meets the pass threshold (default: 0.5, or the minimum of all configured grader thresholds).
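The scoring rule reduces to a small computation. Here is a sketch, assuming each grader result has already been reduced to a 0–1 score; the function and field names are illustrative, not the library's API:

```typescript
// One grader's contribution to a case, after grading (illustrative shape).
interface WeightedResult {
  score: number;      // 0–1 score from the grader
  weight: number;
  required?: boolean; // a failing required grader zeroes the whole case
}

function caseScore(results: WeightedResult[]): number {
  // Required grader failed: immediate 0 regardless of other graders.
  if (results.some((r) => r.required && r.score === 0)) return 0;
  // Otherwise: weighted average of the individual scores.
  const totalWeight = results.reduce((sum, r) => sum + r.weight, 0);
  const weighted = results.reduce((sum, r) => sum + r.score * r.weight, 0);
  return totalWeight > 0 ? weighted / totalWeight : 0;
}
```

With the example weights above, a passing `contains` (0.3) and `llmRubric` (0.5) but failing `latency` (0.2) would yield a score of 0.8, which clears the default 0.5 pass threshold.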
## Suite-Level Graders

All grading configuration is at the suite level via `defaultGraders`. Every case in a suite is evaluated with the same graders. To use different graders for different cases, split them into separate suites.
## Gates

Gates are quality thresholds that make eval runs actionable. Three gate types are available:

```js
gates: {
  passRate: 0.9,       // 90% of cases must pass
  maxCost: 1.50,       // Total run cost must not exceed $1.50
  p95LatencyMs: 5000,  // 95th percentile latency under 5 seconds
}
```

All gates are optional. They are checked after all grading completes. A failed gate means the run’s `gateResult.pass` is false — the CLI exits with code 1, useful for CI pipelines.
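The gate check described above amounts to comparing run statistics against each configured threshold. A minimal sketch, assuming the run stats are already computed (names are illustrative):

```typescript
// Configured gates: all fields optional, matching the config shown above.
interface Gates { passRate?: number; maxCost?: number; p95LatencyMs?: number }

// Aggregate stats for a completed run (illustrative shape).
interface RunStats { passRate: number; totalCost: number; p95LatencyMs: number }

// Returns the overall pass/fail used to set the CLI exit code.
function checkGates(gates: Gates, stats: RunStats): boolean {
  if (gates.passRate !== undefined && stats.passRate < gates.passRate) return false;
  if (gates.maxCost !== undefined && stats.totalCost > gates.maxCost) return false;
  if (gates.p95LatencyMs !== undefined && stats.p95LatencyMs > gates.p95LatencyMs) return false;
  return true; // unset gates are skipped, so all gates are optional
}
```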
## Record-Replay (Fixtures)

The record-replay engine is the core differentiator:

- Record: Run in live mode with `--record` to capture target responses as fixtures
- Replay: Run in replay mode to grade from fixtures — no API calls, instant results
- Invalidation: Fixtures are keyed by a config hash of the suite name and `targetVersion`. Bump `targetVersion` when your target’s behavior changes
- Staleness: Fixtures older than the configured TTL (default: 14 days) generate warnings (or errors with `--strict-fixtures`)

This enables $0 pre-push evals — record once, replay thousands of times during development.
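The invalidation mechanism can be illustrated with a small sketch. The actual hashing scheme is not specified here; this only demonstrates the idea that bumping `targetVersion` changes the key, so stale fixtures stop matching:

```typescript
import { createHash } from "node:crypto";

// Hypothetical fixture key: a config hash over suite name + targetVersion.
// Any change to targetVersion produces a different key, invalidating
// previously recorded fixtures for that suite.
function fixtureKey(suiteName: string, targetVersion: string): string {
  return createHash("sha256")
    .update(`${suiteName}:${targetVersion}`)
    .digest("hex")
    .slice(0, 12);
}
```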
## Case Categories

Cases can be tagged with a `category` for organized reporting:

```json
{"id": "greeting", "input": {"prompt": "hi"}, "category": "happy_path"}
{"id": "empty-input", "input": {"prompt": ""}, "category": "edge_case"}
```

Built-in categories: `happy_path`, `edge_case`, `adversarial`, `multi_step`, `regression`. Run summaries include per-category pass rates.
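Per-category pass rates are a straightforward aggregation. A minimal sketch of how a run summary might compute them (the function name and input shape are illustrative):

```typescript
// Group case results by category and compute the pass rate for each.
function categoryPassRates(
  results: { category: string; pass: boolean }[],
): Record<string, number> {
  const totals: Record<string, { passed: number; n: number }> = {};
  for (const r of results) {
    const t = (totals[r.category] ??= { passed: 0, n: 0 });
    t.n += 1;
    if (r.pass) t.passed += 1;
  }
  return Object.fromEntries(
    Object.entries(totals).map(([cat, t]) => [cat, t.passed / t.n]),
  );
}
```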