Record & Replay

Overview

The record-replay engine captures your target’s responses as fixtures and replays them, so eval runs finish quickly and incur no target API cost. This is the core differentiator of agent-eval-kit: record once, replay thousands of times.

Recording fixtures

agent-eval-kit run --mode=live --record --suite=smoke

# Or use the shorthand
agent-eval-kit record --suite=smoke

This calls your target for each case and saves the response to .eval-fixtures/.

Replaying from fixtures

agent-eval-kit run --mode=replay --suite=smoke

In replay mode, the target function is never called. Responses are loaded from fixtures and graded.
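The difference between the two modes can be illustrated with a minimal sketch. It is not agent-eval-kit's actual implementation: resolveOutput, RecordedOutput, and the in-memory Map standing in for .eval-fixtures/ are hypothetical names, and the target is synchronous purely for brevity.

```typescript
// Hypothetical sketch of live vs. replay dispatch; not the library's real API.
type Mode = "live" | "replay";

interface RecordedOutput {
  text: string;
}

// Synchronous for brevity; a real target would usually be async.
type Target = (input: string) => RecordedOutput;

function resolveOutput(
  mode: Mode,
  caseId: string,
  input: string,
  target: Target,
  fixtures: Map<string, RecordedOutput>, // stand-in for .eval-fixtures/
  record: boolean,
): RecordedOutput {
  if (mode === "replay") {
    // Replay: the target is never invoked, only the fixture store is read.
    const recorded = fixtures.get(caseId);
    if (recorded === undefined) {
      throw new Error(`no fixture recorded for case "${caseId}"`);
    }
    return recorded;
  }
  // Live: call the target and optionally save the response as a fixture.
  const output = target(input);
  if (record) {
    fixtures.set(caseId, output);
  }
  return output;
}
```

In this sketch a missing fixture is an error in replay mode; whether the real tool fails, skips, or falls back in that situation is not stated here.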

Fixture format

Each fixture is a JSONL file with two lines:

Meta line: Schema version, suite ID, case ID, config hash, recording timestamp, and framework version, stored under the _meta key
Data line: The recorded TargetOutput, stored under the output key

{"_meta":{"schemaVersion":"1.0.0","suiteId":"smoke","caseId":"greeting","configHash":"abc123","recordedAt":"2026-02-28T12:00:00.000Z","frameworkVersion":"0.0.2"}}
{"output":{"text":"Agent response","latencyMs":150,"toolCalls":[]}}

Keys are sorted deterministically for clean git diffs.
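To make the two-line layout concrete, here is a minimal sketch of reading and writing a fixture. The field names come from the example above; the function names (sortKeys, serializeFixture, parseFixture) are hypothetical, and alphabetical key ordering is only one possible deterministic order, assumed here for illustration.

```typescript
// Hypothetical types mirroring the example fixture; the real types may differ.
interface FixtureMeta {
  schemaVersion: string;
  suiteId: string;
  caseId: string;
  configHash: string;
  recordedAt: string;
  frameworkVersion: string;
}

interface Fixture {
  meta: FixtureMeta;
  data: unknown; // the recorded output payload
}

// Recursively sort object keys (alphabetically, as an assumption) so the
// serialized text is identical regardless of property insertion order.
function sortKeys(value: unknown): unknown {
  if (Array.isArray(value)) return value.map(sortKeys);
  if (value !== null && typeof value === "object") {
    const record = value as Record<string, unknown>;
    const sorted: Record<string, unknown> = {};
    for (const key of Object.keys(record).sort()) {
      sorted[key] = sortKeys(record[key]);
    }
    return sorted;
  }
  return value;
}

// Write the meta line first, then the data line, one JSON document per line.
function serializeFixture(fixture: Fixture): string {
  const metaLine = JSON.stringify(sortKeys({ _meta: fixture.meta }));
  const dataLine = JSON.stringify(sortKeys(fixture.data));
  return `${metaLine}\n${dataLine}\n`;
}

// Parse a fixture, rejecting files that do not have exactly two lines.
function parseFixture(text: string): Fixture {
  const lines = text.split("\n").filter((line) => line.trim() !== "");
  const [metaText, dataText] = lines;
  if (lines.length !== 2 || metaText === undefined || dataText === undefined) {
    throw new Error(`expected exactly 2 lines, got ${lines.length}`);
  }
  const parsedMeta = JSON.parse(metaText) as { _meta?: FixtureMeta };
  if (parsedMeta._meta === undefined) {
    throw new Error("first line is missing the _meta key");
  }
  return { meta: parsedMeta._meta, data: JSON.parse(dataText) as unknown };
}
```

Because keys are sorted before writing, re-recording an unchanged response produces a byte-identical file, which is what keeps git diffs quiet.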

Config hash invalidation

Fixtures are keyed by a config hash computed from the suite name and targetVersion. This is intentionally narrow: grader changes and case additions or removals do not invalidate fixtures, because those changes don’t affect the recorded target output. If you change your target’s behavior (prompt, model, tools), bump targetVersion to invalidate fixtures:

suites: [
  {
    name: "smoke",
    targetVersion: "v2.0.0", // Changing this invalidates fixtures
    // ...
  },
]
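As a rough illustration of why only these two fields matter, the sketch below derives a hash from the suite name and targetVersion alone. The function name computeConfigHash, the use of SHA-256, and the JSON encoding of the inputs are all assumptions; the hashing scheme agent-eval-kit actually uses is not described here.

```typescript
import { createHash } from "node:crypto";

// Hypothetical sketch: hash only the inputs that affect recorded target output.
// SHA-256 and the JSON payload encoding are assumptions for illustration.
function computeConfigHash(suiteName: string, targetVersion: string): string {
  // JSON-encode the pair so ("a", "bc") and ("ab", "c") cannot collide.
  const payload = JSON.stringify({ suiteName, targetVersion });
  return createHash("sha256").update(payload).digest("hex");
}

// A fixture is reusable only if its stored hash matches the current config.
function fixtureMatchesConfig(
  storedHash: string,
  suiteName: string,
  targetVersion: string,
): boolean {
  return storedHash === computeConfigHash(suiteName, targetVersion);
}
```

Graders and the case list are deliberately absent from the hash inputs, so editing them leaves existing fixtures valid, while bumping targetVersion changes the hash and forces a re-record.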

Staleness

Fixtures have a configurable TTL (default: 14 days). Stale fixtures generate warnings:

replay: {
  ttlDays: 14, // Warn after 14 days
}

Use --strict-fixtures to fail on stale fixtures instead of warning:

agent-eval-kit run --mode=replay --strict-fixtures --suite=smoke
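The staleness rule can be sketched as a comparison between the fixture's recordedAt timestamp and the configured TTL. This is an illustrative sketch only: checkFixtureAge and FixtureAgeStatus are hypothetical names, and treating a fixture that is exactly ttlDays old as still fresh is an assumption.

```typescript
// Hypothetical sketch of the TTL check; not the library's real API.
type FixtureAgeStatus = "fresh" | "warn" | "fail";

const MS_PER_DAY = 24 * 60 * 60 * 1000;

function checkFixtureAge(
  recordedAt: string, // ISO 8601 timestamp from the fixture's meta line
  ttlDays: number, // replay.ttlDays
  strictFixtures: boolean, // true when --strict-fixtures is passed
  now: Date = new Date(),
): FixtureAgeStatus {
  const recordedMs = Date.parse(recordedAt);
  if (Number.isNaN(recordedMs)) {
    throw new Error(`invalid recordedAt timestamp: ${recordedAt}`);
  }
  const ageDays = (now.getTime() - recordedMs) / MS_PER_DAY;
  // Assumption: a fixture exactly ttlDays old still counts as fresh.
  if (ageDays <= ttlDays) return "fresh";
  // Stale fixtures warn by default and fail under strict mode.
  return strictFixtures ? "fail" : "warn";
}
```

Passing now explicitly keeps the check deterministic in tests, for example checkFixtureAge("2026-02-28T12:00:00.000Z", 14, false, new Date("2026-03-20T12:00:00.000Z")) evaluates a fixture that is 20 days old.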

Re-recording

# Re-record all fixtures in the smoke suite
agent-eval-kit run --update-fixtures --suite=smoke

# Equivalent to:
agent-eval-kit run --mode=live --record --suite=smoke

Stripping raw responses

By default, the raw field is stripped from fixtures to reduce size:

replay: {
  stripRaw: true,  // default
}

Set to false to preserve the full raw API response in fixtures.
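The effect of stripRaw can be pictured as dropping one optional field before the output is written to disk. The sketch below is an assumption-laden illustration: TargetOutputSketch is a hypothetical type based on the fields in the fixture example plus raw, and prepareForFixture is not a real agent-eval-kit function.

```typescript
// Hypothetical shape based on the fixture example; the real TargetOutput may differ.
interface TargetOutputSketch {
  text: string;
  latencyMs: number;
  toolCalls: unknown[];
  raw?: unknown; // full provider response, often much larger than the rest
}

// Return the output as it would be stored in a fixture.
function prepareForFixture(
  output: TargetOutputSketch,
  stripRaw: boolean = true, // mirrors the documented default
): TargetOutputSketch {
  if (!stripRaw) return output;
  // Copy first so the caller's object is not mutated.
  const copy: TargetOutputSketch = { ...output };
  delete copy.raw;
  return copy;
}
```

Copying before deleting means graders running in the same live session can still inspect raw even though the saved fixture omits it.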

Best practices

Commit .eval-fixtures/ to git for reproducible CI
Add .eval-fixtures/**/*.jsonl linguist-generated to .gitattributes to reduce PR noise
Use targetVersion to control when fixtures are invalidated
Re-record periodically to catch API behavior changes
Use --strict-fixtures in CI for maximum reliability