Record & Replay
Overview
The record-replay engine captures your target’s responses as fixtures and replays them for instant, zero-cost eval runs. This is the core differentiator of agent-eval-kit: record once, replay thousands of times.
Recording fixtures
    agent-eval-kit run --mode=live --record --suite=smoke

    # Or use the shorthand
    agent-eval-kit record --suite=smoke

This calls your target for each case and saves the response to .eval-fixtures/.
Replaying from fixtures
    agent-eval-kit run --mode=replay --suite=smoke

In replay mode, the target function is never called. Responses are loaded from fixtures and graded.
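The difference between the two modes can be sketched in a few lines of TypeScript. This is an illustration only, not agent-eval-kit's implementation: the `runCase` function, the `Target` and `TargetOutput` shapes, and the in-memory `fixtures` map (standing in for .eval-fixtures/) are all assumptions, and the target is synchronous here purely for brevity.

```typescript
// Hypothetical types; the real agent-eval-kit types may differ.
type TargetOutput = { text: string; latencyMs: number };
type Target = (input: string) => TargetOutput;
type Mode = "live" | "replay";

// In-memory stand-in for the .eval-fixtures/ directory, keyed by "suiteId/caseId".
const fixtures = new Map<string, TargetOutput>();

function runCase(
  mode: Mode,
  record: boolean,
  suiteId: string,
  caseId: string,
  input: string,
  target: Target,
): TargetOutput {
  const key = `${suiteId}/${caseId}`;
  if (mode === "replay") {
    // Replay: load the recorded response; the target is never called.
    const recorded = fixtures.get(key);
    if (recorded === undefined) {
      throw new Error(`No fixture recorded for ${key}`);
    }
    return recorded;
  }
  // Live: call the target and optionally save its response as a fixture.
  const output = target(input);
  if (record) {
    fixtures.set(key, output);
  }
  return output;
}

// Usage: record once, then replay without touching the target.
let calls = 0;
const target: Target = (input) => {
  calls += 1;
  return { text: `echo: ${input}`, latencyMs: 150 };
};

const live = runCase("live", true, "smoke", "greeting", "hello", target);
const replayed = runCase("replay", false, "smoke", "greeting", "hello", target);
console.log(live.text === replayed.text, calls); // prints: true 1
```

The call counter stays at 1 after the replay run, which is the property that makes replayed evals free: graders see the same output, but the target is not invoked again.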
Fixture format
Each fixture is a JSONL file with two lines:
- Meta line: Schema version, suite ID, case ID, config hash, framework version, timestamp
- Data line: The recorded TargetOutput

    {"_meta":{"schemaVersion":"1.0.0","suiteId":"smoke","caseId":"greeting","configHash":"abc123","recordedAt":"2026-02-28T12:00:00.000Z","frameworkVersion":"0.0.2"}}
    {"output":{"text":"Agent response","latencyMs":150,"toolCalls":[]}}

Keys are sorted deterministically for clean git diffs.
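To make the two-line layout concrete, here is a minimal TypeScript sketch that reads a fixture and serializes an object with a stable key order. The `FixtureMeta` type is inferred from the example above, and `parseFixture`, `sortKeys`, and `stableStringify` are hypothetical helpers, not part of agent-eval-kit. The alphabetical ordering is just one way to get deterministic output and may not match the ordering the tool actually uses.

```typescript
// Hypothetical shape inferred from the example fixture; not an official type.
type FixtureMeta = {
  schemaVersion: string;
  suiteId: string;
  caseId: string;
  configHash: string;
  recordedAt: string;
  frameworkVersion: string;
};
type Fixture = { meta: FixtureMeta; data: unknown };

// Split the two JSONL lines into the meta record and the recorded output.
function parseFixture(jsonl: string): Fixture {
  const lines = jsonl.split("\n").filter((line) => line.trim() !== "");
  if (lines.length !== 2) {
    throw new Error(`Expected 2 lines, got ${lines.length}`);
  }
  const [metaText, dataText] = lines as [string, string];
  const metaLine = JSON.parse(metaText) as { _meta: FixtureMeta };
  return { meta: metaLine._meta, data: JSON.parse(dataText) };
}

// Recursively rebuild objects with alphabetically sorted keys (assumed ordering).
function sortKeys(value: unknown): unknown {
  if (Array.isArray(value)) {
    return value.map((item) => sortKeys(item));
  }
  if (value !== null && typeof value === "object") {
    const source = value as Record<string, unknown>;
    const sorted: Record<string, unknown> = {};
    for (const key of Object.keys(source).sort()) {
      sorted[key] = sortKeys(source[key]);
    }
    return sorted;
  }
  return value;
}

function stableStringify(value: unknown): string {
  return JSON.stringify(sortKeys(value));
}

// Usage with the example fixture from this page.
const rawFixture = [
  '{"_meta":{"schemaVersion":"1.0.0","suiteId":"smoke","caseId":"greeting","configHash":"abc123","recordedAt":"2026-02-28T12:00:00.000Z","frameworkVersion":"0.0.2"}}',
  '{"output":{"text":"Agent response","latencyMs":150,"toolCalls":[]}}',
].join("\n");

const fixture = parseFixture(rawFixture);
console.log(fixture.meta.caseId); // prints: greeting
console.log(stableStringify({ b: 1, a: { d: 2, c: 3 } })); // prints: {"a":{"c":3,"d":2},"b":1}
```

Because the same object always serializes to the same bytes, re-recording an unchanged response produces no diff in git.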
Config hash invalidation
Fixtures are keyed by a config hash computed from the suite name and targetVersion. This is intentionally narrow: grader changes and case additions/removals do not invalidate fixtures, because those changes don’t affect the recorded target output. If you change your target’s behavior (prompt, model, tools), bump targetVersion to invalidate fixtures:

    suites: [
      {
        name: "smoke",
        targetVersion: "v2.0.0", // Changing this invalidates fixtures
        // ...
      },
    ]

Staleness
Fixtures have a configurable TTL (default: 14 days). Stale fixtures generate warnings:

    replay: {
      ttlDays: 14, // Warn after 14 days
    }

Use --strict-fixtures to fail on stale fixtures instead of warning:

    agent-eval-kit run --mode=replay --strict-fixtures --suite=smoke

Re-recording
    # Re-record all fixtures
    agent-eval-kit run --update-fixtures --suite=smoke

    # Equivalent to:
    agent-eval-kit run --mode=live --record --suite=smoke

Stripping raw responses
By default, the raw field is stripped from fixtures to reduce size:

    replay: {
      stripRaw: true, // default
    }

Set to false to preserve the full raw API response in fixtures.
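As a rough illustration of what stripping might involve, the TypeScript sketch below drops a `raw` property before an output is written. The `RecordedOutput` shape and the `prepareForFixture` helper are assumptions made for this example; where exactly the raw field lives in the real output type is not specified here.

```typescript
// Hypothetical output shape; `raw` stands for the full provider API response.
type RecordedOutput = { text: string; latencyMs: number; raw?: unknown };

// Return the output as it would be stored, mirroring the stripRaw option.
function prepareForFixture(output: RecordedOutput, stripRaw = true): RecordedOutput {
  if (!stripRaw) {
    return output;
  }
  // Copy first so the caller's object is left untouched.
  const copy: RecordedOutput = { ...output };
  delete copy.raw;
  return copy;
}

// Usage: the default drops `raw`; passing false keeps it.
const output: RecordedOutput = {
  text: "Agent response",
  latencyMs: 150,
  raw: { id: "resp_1", choices: [] },
};
const stripped = prepareForFixture(output);
const kept = prepareForFixture(output, false);
console.log("raw" in stripped, "raw" in kept); // prints: false true
```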
Best practices
- Commit .eval-fixtures/ to git for reproducible CI
- Add .eval-fixtures/**/*.jsonl linguist-generated to .gitattributes to reduce PR noise
- Use targetVersion to control when fixtures are invalidated
- Re-record periodically to catch API behavior changes
- Use --strict-fixtures in CI for maximum reliability
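Two of the practices above, invalidating through targetVersion and failing on stale fixtures in CI, both come down to a check made before a fixture is replayed. The TypeScript sketch below shows one way such a check could work. It is not the tool's implementation: `configHash`, `checkFixture`, and the status names are hypothetical, and FNV-1a is used only as a dependency-free stand-in for whatever hash agent-eval-kit actually computes.

```typescript
// FNV-1a (32-bit), a stand-in hash chosen only to keep the sketch self-contained.
function fnv1a(text: string): string {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i += 1) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash.toString(16);
}

// Only the suite name and targetVersion feed the hash, so grader or case
// changes leave it untouched.
function configHash(suiteName: string, targetVersion: string): string {
  return fnv1a(JSON.stringify([suiteName, targetVersion]));
}

type FixtureStatus = "ok" | "stale" | "invalid";

function checkFixture(
  meta: { configHash: string; recordedAt: string },
  expectedHash: string,
  now: Date,
  ttlDays = 14,
): FixtureStatus {
  // A hash mismatch means targetVersion (or the suite name) changed.
  if (meta.configHash !== expectedHash) {
    return "invalid";
  }
  // Older than the TTL: a warning by default, a failure under --strict-fixtures.
  const ageMs = now.getTime() - new Date(meta.recordedAt).getTime();
  const ttlMs = ttlDays * 24 * 60 * 60 * 1000;
  return ageMs > ttlMs ? "stale" : "ok";
}

// Usage: a fixture recorded against v1.0.0 on 2026-02-28.
const hashV1 = configHash("smoke", "v1.0.0");
const hashV2 = configHash("smoke", "v2.0.0");
const meta = { configHash: hashV1, recordedAt: "2026-02-28T12:00:00.000Z" };

console.log(checkFixture(meta, hashV1, new Date("2026-03-05T12:00:00.000Z"))); // prints: ok
console.log(checkFixture(meta, hashV1, new Date("2026-03-20T12:00:00.000Z"))); // prints: stale
console.log(checkFixture(meta, hashV2, new Date("2026-03-05T12:00:00.000Z"))); // prints: invalid
```

Keeping the hash inputs this narrow is what lets you edit graders or add cases without re-recording; only a deliberate targetVersion bump marks existing fixtures as invalid.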