
Record & Replay

The record-replay engine captures your target’s responses as fixtures and replays them in later eval runs without calling the target again, so replay runs are fast and cost nothing to execute. This is the core differentiator of agent-eval-kit: record once, replay thousands of times.

agent-eval-kit run --mode=live --record --suite=smoke
# Or use the shorthand
agent-eval-kit record --suite=smoke

This calls your target for each case and saves the response to .eval-fixtures/.

agent-eval-kit run --mode=replay --suite=smoke

In replay mode, the target function is never called. Responses are loaded from fixtures and graded.
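
The difference between the two modes can be sketched roughly as follows. This is a minimal illustration and not agent-eval-kit's implementation: `runCase`, `FixtureStore`, `RunOptions`, and the synchronous `Target` type are hypothetical names, and a real target would most likely be asynchronous.

```typescript
// Hypothetical types for illustration; not agent-eval-kit's actual API.
type TargetOutput = { text: string; latencyMs: number };
type Target = (input: string) => TargetOutput;
type FixtureStore = Record<string, TargetOutput>;

interface RunOptions {
  mode: "live" | "replay";
  record: boolean;
}

function runCase(
  caseId: string,
  input: string,
  target: Target,
  fixtures: FixtureStore,
  options: RunOptions,
): TargetOutput {
  if (options.mode === "replay") {
    // Replay: the target is never invoked; the recorded output is returned.
    const recorded = fixtures[caseId];
    if (recorded === undefined) {
      throw new Error(`No fixture recorded for case "${caseId}"`);
    }
    return recorded;
  }
  // Live: call the target, and optionally save the response as a fixture.
  const output = target(input);
  if (options.record) {
    fixtures[caseId] = output;
  }
  return output;
}
```

In this sketch the grader would receive the same `TargetOutput` shape in both modes, which is why replayed results can be graded exactly like live ones.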

Each fixture is a JSONL file with two lines:

  1. Meta line: Schema version, suite ID, case ID, config hash, framework version, timestamp
  2. Data line: The recorded TargetOutput

{"_meta":{"schemaVersion":"1.0.0","suiteId":"smoke","caseId":"greeting","configHash":"abc123","recordedAt":"2026-02-28T12:00:00.000Z","frameworkVersion":"0.0.2"}}
{"output":{"text":"Agent response","latencyMs":150,"toolCalls":[]}}
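
Reading such a file back could look something like the sketch below. The `FixtureMeta` and `Fixture` interfaces and the `parseFixture` function are hypothetical, with field names inferred from the example above; the kit's own loader may validate more or behave differently.

```typescript
// Hypothetical shapes inferred from the example fixture; not the kit's real types.
interface FixtureMeta {
  schemaVersion: string;
  suiteId: string;
  caseId: string;
  configHash: string;
  recordedAt: string;
  frameworkVersion: string;
}

interface Fixture {
  meta: FixtureMeta;
  data: unknown; // the recorded TargetOutput, left untyped in this sketch
}

function parseFixture(jsonl: string): Fixture {
  // Split into lines and ignore blank ones (e.g. a trailing newline).
  const lines = jsonl.split(/\r?\n/).filter((line) => line.trim() !== "");
  if (lines.length !== 2) {
    throw new Error(`Expected 2 lines in fixture, found ${lines.length}`);
  }
  const metaLine = JSON.parse(lines[0]);
  if (metaLine === null || typeof metaLine !== "object" || !("_meta" in metaLine)) {
    throw new Error("First line is missing the _meta object");
  }
  return { meta: metaLine._meta as FixtureMeta, data: JSON.parse(lines[1]) };
}
```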

Keys are sorted deterministically for clean git diffs.
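
One common way to get deterministic key order is to rebuild each object with sorted keys before serializing. The sketch below assumes plain alphabetical sorting; the exact ordering rule agent-eval-kit applies is not stated here, and `sortKeys` and `stableStringify` are hypothetical helper names.

```typescript
// Recursively rebuild objects with their keys in sorted order.
// Alphabetical ordering is an assumption made for this sketch.
function sortKeys(value: unknown): unknown {
  if (Array.isArray(value)) {
    // Arrays keep their element order; only nested objects are reordered.
    return value.map((item) => sortKeys(item));
  }
  if (value !== null && typeof value === "object") {
    const source = value as Record<string, unknown>;
    const sorted: Record<string, unknown> = {};
    for (const key of Object.keys(source).sort()) {
      sorted[key] = sortKeys(source[key]);
    }
    return sorted;
  }
  return value;
}

// Serialize to a single JSON line with a stable key order.
function stableStringify(value: unknown): string {
  return JSON.stringify(sortKeys(value));
}
```

With this approach, two objects holding the same data serialize identically regardless of the order in which their properties were assigned, so re-recording an unchanged response produces no diff.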

Fixtures are keyed by a config hash computed from the suite name and targetVersion. This is intentionally narrow: grader changes and case additions or removals do not invalidate fixtures, because they don’t affect the recorded target output. If you change your target’s behavior (prompt, model, tools), bump targetVersion to invalidate fixtures:

suites: [
  {
    name: "smoke",
    targetVersion: "v2.0.0", // Changing this invalidates fixtures
    // ...
  },
]
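
To make the narrow keying concrete, here is a minimal sketch of a hash that depends only on the suite name and targetVersion. The hash algorithm the kit actually uses is not specified here, so a small FNV-1a-style function stands in for it; `SuiteConfig`, `simpleHash`, and `configHash` are hypothetical names.

```typescript
// Illustrative stand-in: an FNV-1a-style 32-bit hash over UTF-16 code units.
// The real algorithm used by the kit is not stated in this document.
function simpleHash(input: string): string {
  let hash = 0x811c9dc5; // FNV offset basis
  for (let i = 0; i < input.length; i++) {
    const code = input.charCodeAt(i);
    // Mix in the low and high byte of each code unit.
    hash = Math.imul(hash ^ (code & 0xff), 0x01000193);
    hash = Math.imul(hash ^ (code >>> 8), 0x01000193);
  }
  // Render as an 8-character, zero-padded hex string.
  return ("00000000" + (hash >>> 0).toString(16)).slice(-8);
}

interface SuiteConfig {
  name: string;
  targetVersion: string;
  graders?: string[]; // deliberately ignored by the hash
  cases?: string[]; // deliberately ignored by the hash
}

function configHash(suite: SuiteConfig): string {
  // Only the suite name and targetVersion contribute to the key.
  return simpleHash(suite.name + "\u0000" + suite.targetVersion);
}
```

Because graders and cases never enter the hash input, editing them leaves the key unchanged, while bumping targetVersion produces a new key and so orphans the old fixtures.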

Fixtures have a configurable TTL (default: 14 days). Fixtures older than the TTL are considered stale and generate warnings:

replay: {
  ttlDays: 14, // Warn after 14 days
}
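
A staleness check along these lines compares the fixture's recordedAt timestamp with the current time. The sketch below is an assumption about how such a check might work: `fixtureAgeDays` and `isStale` are hypothetical names, and treating a fixture as stale only when its age strictly exceeds ttlDays is a guess at the boundary behavior.

```typescript
const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Age of a fixture in (possibly fractional) days, based on its recordedAt timestamp.
function fixtureAgeDays(recordedAt: string, now: Date): number {
  const recordedMs = Date.parse(recordedAt);
  if (isNaN(recordedMs)) {
    throw new Error(`Invalid recordedAt timestamp: ${recordedAt}`);
  }
  return (now.getTime() - recordedMs) / MS_PER_DAY;
}

// Assumption: a fixture is stale once its age strictly exceeds the TTL.
function isStale(recordedAt: string, ttlDays: number, now: Date): boolean {
  return fixtureAgeDays(recordedAt, now) > ttlDays;
}
```

Passing `now` explicitly rather than reading the clock inside the function keeps the check deterministic and easy to test.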

Use --strict-fixtures to fail on stale fixtures instead of warning:

agent-eval-kit run --mode=replay --strict-fixtures --suite=smoke
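
The effect of the flag can be pictured as the difference between a warning and a failing run. In the sketch below, `reportStaleFixtures`, the message wording, and the exit codes (0 for success, 1 for failure) are all assumptions for illustration; the document only states that strict mode fails where the default warns.

```typescript
interface FreshnessResult {
  exitCode: number; // 0 = success, 1 = failure (assumed convention)
  messages: string[];
}

function reportStaleFixtures(staleCaseIds: string[], strict: boolean): FreshnessResult {
  const level = strict ? "error" : "warning";
  const messages = staleCaseIds.map(
    (caseId) => `${level}: fixture for case "${caseId}" is stale`,
  );
  // Default: stale fixtures only warn. Strict: any stale fixture fails the run.
  const exitCode = strict && staleCaseIds.length > 0 ? 1 : 0;
  return { exitCode, messages };
}
```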

To refresh fixtures, re-record them:

# Re-record all fixtures
agent-eval-kit run --update-fixtures --suite=smoke
# Equivalent to:
agent-eval-kit run --mode=live --record --suite=smoke

By default, the raw field is stripped from fixtures to reduce size:

replay: {
  stripRaw: true, // default
}

Set stripRaw to false to preserve the full raw API response in fixtures.
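
Conceptually, stripping amounts to dropping one field before the output is written. The sketch below assumes `raw` is a top-level property of the recorded output, which this document does not confirm; `RecordedOutput` and `prepareForFixture` are hypothetical names.

```typescript
// Hypothetical output shape; the document only says a field named `raw` is stripped.
interface RecordedOutput {
  text: string;
  latencyMs: number;
  raw?: unknown; // full API response, potentially large
}

function prepareForFixture(output: RecordedOutput, stripRaw: boolean = true): RecordedOutput {
  if (!stripRaw) {
    return output;
  }
  // Work on a shallow copy so the caller's object is left untouched.
  const copy: RecordedOutput = { ...output };
  delete copy.raw;
  return copy;
}
```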

  • Commit .eval-fixtures/ to git for reproducible CI
  • Add .eval-fixtures/**/*.jsonl linguist-generated to .gitattributes to reduce PR noise
  • Use targetVersion to control when fixtures are invalidated
  • Re-record periodically to catch API behavior changes
  • Use --strict-fixtures in CI for maximum reliability