Core Concepts

agent-eval-kit uses precise terminology. Understanding these terms makes the documentation and API clearer.

  • Suite: a collection of test cases with a target function, default graders, and quality gates
  • Case: a single test input with optional expected output
  • Trial: one execution of a case (multiple trials enable statistical analysis)
  • Run: a complete execution of a suite — all trials, all grades, all metadata
  • Grader: a function that scores a target’s output (deterministic or LLM-based)
  • Gate: a threshold that determines whether a run passes or fails (e.g., a 90% pass rate)
  • Fixture: a recorded target response for replay without API calls
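As a rough sketch of how these terms relate, the plain object below wires a suite together. It is illustrative only: `defaultGraders` and `gates` appear later on this page, but the other field names and the overall shape are assumptions, not the library's documented API.

```typescript
// Illustrative shapes only; the real agent-eval-kit types may differ.
type Case = {
  id: string;
  input: { prompt: string };
  expected?: string;   // optional expected output
  category?: string;   // see categories below
};

const cases: Case[] = [
  { id: "greeting", input: { prompt: "hi" }, category: "happy_path" },
];

// A suite: cases + gates (graders omitted here for brevity).
const smokeSuite = {
  name: "smoke",
  targetVersion: "v1",   // bumping this invalidates fixtures (see record-replay)
  cases,
  gates: { passRate: 0.9 },
};

console.log(smokeSuite.cases.length); // 1
```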

Live mode (--mode=live) calls your target function for every case. Use it for initial recording and whenever you need fresh responses.

agent-eval-kit run --mode=live --suite=smoke

Replay mode (--mode=replay) loads pre-recorded fixtures instead of calling the target: instant, deterministic, zero cost.

agent-eval-kit run --mode=replay --suite=smoke

Judge-only mode (--mode=judge-only) re-grades a previous run with updated graders. It is useful when tuning LLM judge criteria without re-running the target.

agent-eval-kit run --mode=judge-only --run-id=<previous-run-id> --suite=smoke

Graders are functions that score outputs. agent-eval-kit ships 20 built-in graders in three tiers:

Deterministic graders are fast, free, and reproducible pure functions with no I/O:

  • Text: contains, notContains, exactMatch, regex, jsonSchema
  • Tool call: toolCalled, toolNotCalled, toolSequence, toolArgsMatch
  • Metric: latency, cost, tokenCount
  • Safety: safetyKeywords, noHallucinatedNumbers
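A deterministic grader is just a pure function from an output to a grade. The sketch below shows what a `regex`-style grader could look like; the signatures and the `Grade` shape are assumptions, not the library's actual types.

```typescript
// Sketch of a deterministic grader: pure, no I/O, fully reproducible.
// The real agent-eval-kit signatures may differ.
type Grade = { pass: boolean; score: number };
type Grader = (output: string) => Grade;

const regex = (pattern: RegExp): Grader => (output) => {
  const pass = pattern.test(output);
  return { pass, score: pass ? 1 : 0 };
};

const g = regex(/\bParis\b/);
console.log(g("Paris is the capital of France").pass); // true
```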

LLM-based graders use an LLM judge to evaluate quality and require a judge configuration:

  • llmRubric — Score against natural language criteria (1–4 scale)
  • factuality — Check factual consistency against a reference
  • llmClassify — Classify output into categories

Composite graders combine other graders with boolean logic: all (conjunction), any (disjunction), not (negation). They do not short-circuit; every child result is collected for complete reporting.
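The non-short-circuiting behavior can be sketched as follows. This is a minimal illustration of the semantics described above, not the library's source; the `Result` shape and signatures are assumptions.

```typescript
// Minimal sketch of a non-short-circuiting composite grader.
type Result = { name: string; pass: boolean };
type Grader = (output: string) => Result;

// Evaluate EVERY child grader, then combine — no short-circuiting,
// so the report always contains one Result per child.
const all =
  (...graders: Grader[]) =>
  (output: string): { pass: boolean; results: Result[] } => {
    const results = graders.map((g) => g(output));
    return { pass: results.every((r) => r.pass), results };
  };

const contains =
  (needle: string): Grader =>
  (output) => ({ name: `contains(${needle})`, pass: output.includes(needle) });

const combined = all(contains("Paris"), contains("France"));
const { pass, results } = combined("Paris is in France");
console.log(pass, results.length); // true 2
```

Because every child runs, a failing run reports exactly which children failed rather than stopping at the first failure.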

Each grader in a suite’s defaultGraders array can have a weight, a required flag, and a threshold:

defaultGraders: [
  { grader: contains("Paris"), weight: 0.3 },
  { grader: latency(5000), weight: 0.2 },
  { grader: llmRubric("Helpful response"), weight: 0.5 },
]

The final case score is a weighted average. Required graders that fail cause an immediate score of 0 regardless of other graders. The case passes if the weighted score meets the pass threshold (default: 0.5, or the minimum of all configured grader thresholds).
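The scoring rule above can be worked through with the example weights. The function below is a sketch of the described semantics (weighted average, required-failure override), not the library's implementation; the `Graded` shape is assumed.

```typescript
// Sketch of the case-scoring rule described above (assumed semantics).
type Graded = { score: number; weight: number; required?: boolean; pass: boolean };

function caseScore(results: Graded[], passThreshold = 0.5): { score: number; pass: boolean } {
  // A failed required grader forces the case score to 0.
  if (results.some((r) => r.required && !r.pass)) return { score: 0, pass: false };
  const totalWeight = results.reduce((s, r) => s + r.weight, 0);
  const score = results.reduce((s, r) => s + r.score * r.weight, 0) / totalWeight;
  return { score, pass: score >= passThreshold };
}

// With the defaultGraders weights above: contains passes (1.0),
// latency fails (0.0), llmRubric scores 0.75:
// 0.3 * 1.0 + 0.2 * 0.0 + 0.5 * 0.75 = 0.675
const { score, pass } = caseScore([
  { score: 1.0, weight: 0.3, pass: true },
  { score: 0.0, weight: 0.2, pass: false },
  { score: 0.75, weight: 0.5, pass: true },
]);
console.log(score.toFixed(3), pass); // 0.675 true
```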

All grading configuration is at the suite level via defaultGraders. Every case in a suite is evaluated with the same graders. To use different graders for different cases, split them into separate suites.

Gates are quality thresholds that make eval runs actionable. Three gate types are available:

gates: {
  passRate: 0.9,      // 90% of cases must pass
  maxCost: 1.50,      // total run cost must not exceed $1.50
  p95LatencyMs: 5000, // 95th percentile latency under 5 seconds
}

All gates are optional. They are checked after all grading completes. A failed gate means the run’s gateResult.pass is false — the CLI exits with code 1, useful for CI pipelines.
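The gate check described above amounts to a few threshold comparisons over run-level statistics. The sketch below assumes a `RunStats` shape; the library's actual report structure may differ.

```typescript
// Sketch of checking the three optional gates after grading completes.
type RunStats = { passRate: number; totalCost: number; p95LatencyMs: number };
type Gates = { passRate?: number; maxCost?: number; p95LatencyMs?: number };

function checkGates(stats: RunStats, gates: Gates): boolean {
  if (gates.passRate !== undefined && stats.passRate < gates.passRate) return false;
  if (gates.maxCost !== undefined && stats.totalCost > gates.maxCost) return false;
  if (gates.p95LatencyMs !== undefined && stats.p95LatencyMs > gates.p95LatencyMs) return false;
  return true; // all configured gates satisfied
}

const gatePass = checkGates(
  { passRate: 0.92, totalCost: 1.1, p95LatencyMs: 4200 },
  { passRate: 0.9, maxCost: 1.5, p95LatencyMs: 5000 },
);
console.log(gatePass); // true
```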

The record-replay engine is the core differentiator:

  1. Record: Run in live mode with --record to capture target responses as fixtures
  2. Replay: Run in replay mode to grade from fixtures — no API calls, instant results
  3. Invalidation: Fixtures are keyed by a config hash of the suite name and targetVersion. Bump targetVersion when your target’s behavior changes
  4. Staleness: Fixtures older than the configured TTL (default: 14 days) generate warnings (or errors with --strict-fixtures)

This enables $0 pre-push evals — record once, replay thousands of times during development.
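Step 3's fixture keying can be sketched as a hash over the suite name and targetVersion, which is why bumping targetVersion invalidates existing fixtures. The hash inputs and key format here are assumptions for illustration.

```typescript
import { createHash } from "node:crypto";

// Sketch of fixture keying: hash suite name + targetVersion, so a
// targetVersion bump produces a different key and misses old fixtures.
// (Illustrative; the library's actual hash inputs may differ.)
function fixtureKey(suiteName: string, targetVersion: string): string {
  return createHash("sha256")
    .update(`${suiteName}:${targetVersion}`)
    .digest("hex")
    .slice(0, 12);
}

console.log(fixtureKey("smoke", "v1") !== fixtureKey("smoke", "v2")); // true
```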

Cases can be tagged with a category for organized reporting:

{"id": "greeting", "input": {"prompt": "hi"}, "category": "happy_path"}
{"id": "empty-input", "input": {"prompt": ""}, "category": "edge_case"}

Built-in categories: happy_path, edge_case, adversarial, multi_step, regression. Run summaries include per-category pass rates.
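The per-category pass rates in run summaries amount to grouping case results by category. A minimal sketch, assuming a simple per-case result shape (not the library's report format):

```typescript
// Sketch of per-category pass-rate aggregation (assumed result shape).
type CaseResult = { category: string; pass: boolean };

function perCategoryPassRate(results: CaseResult[]): Record<string, number> {
  const byCat: Record<string, { pass: number; total: number }> = {};
  for (const r of results) {
    const c = (byCat[r.category] ??= { pass: 0, total: 0 });
    c.total += 1;
    if (r.pass) c.pass += 1;
  }
  return Object.fromEntries(
    Object.entries(byCat).map(([cat, v]) => [cat, v.pass / v.total]),
  );
}

const rates = perCategoryPassRate([
  { category: "happy_path", pass: true },
  { category: "happy_path", pass: true },
  { category: "edge_case", pass: false },
  { category: "edge_case", pass: true },
]);
console.log(JSON.stringify(rates)); // {"happy_path":1,"edge_case":0.5}
```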