
# Config Reference

agent-eval-kit uses `jiti` for TypeScript config loading. Create one of:

- `eval.config.ts` (recommended)
- `eval.config.mts`
- `eval.config.js`
- `eval.config.mjs`

```ts
import { defineConfig } from "agent-eval-kit";

export default defineConfig({
  // ... configuration
});
```

`defineConfig` is a pure identity function that provides TypeScript type inference.

| Property | Type | Default | Description |
| --- | --- | --- | --- |
| `suites` | `SuiteConfig[]` | (required) | Array of eval suites |
| `judge` | `JudgeConfig` | — | Global LLM judge configuration |
| `fixtureDir` | `string` | `".eval-fixtures"` | Directory for recorded fixtures |
| `plugins` | `EvalPlugin[]` | `[]` | Plugins to apply |
| `reporters` | `ReporterConfig[]` | `[]` | Output reporters |
| `run` | `RunConfig` | — | Global run settings |

The `run` object (`RunConfig`) accepts:

| Property | Type | Default | Description |
| --- | --- | --- | --- |
| `defaultMode` | `"live" \| "replay" \| "judge-only"` | `"live"` | Default execution mode |
| `timeoutMs` | `number` | `30000` | Per-case timeout in milliseconds |
| `rateLimit` | `number` | — | Max requests per minute (live mode only) |
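
For example, a CI setup might pin replay mode and raise the timeout; the values below are illustrative, not defaults:

```ts
run: {
  defaultMode: "replay", // deterministic CI runs against recorded fixtures
  timeoutMs: 60000,      // allow slower cases up to 60s
  rateLimit: 30,         // applies only when running in "live" mode
},
```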

Each suite defines a target function, cases, graders, and gates.

```ts
import { defineConfig, contains, latency } from "agent-eval-kit";

export default defineConfig({
  suites: [
    {
      name: "smoke",
      description: "Basic smoke tests",
      target: async (input) => {
        const response = await myAgent(input.prompt);
        return { text: response.text, latencyMs: response.latencyMs };
      },
      cases: "cases/smoke.jsonl",
      defaultGraders: [
        { grader: contains("hello"), weight: 1.0 },
        { grader: latency(5000), weight: 0.5, required: true },
      ],
      gates: { passRate: 0.9 },
      concurrency: 5,
      targetVersion: "v1.2.0",
      replay: { ttlDays: 14, stripRaw: true },
      tags: ["ci", "fast"],
    },
  ],
});
```

Suite options (`SuiteConfig`):

| Property | Type | Default | Description |
| --- | --- | --- | --- |
| `name` | `string` | (required) | Suite identifier |
| `description` | `string` | — | Human-readable description |
| `target` | `(input: CaseInput) => Promise<TargetOutput>` | (required) | Function that calls your agent/API |
| `cases` | `Case[] \| string \| (Case \| string)[]` | (required) | Inline cases, file path(s), or a mix |
| `defaultGraders` | `GraderConfig[]` | `[]` | Graders applied to all cases in this suite |
| `gates` | `GateConfig` | — | Quality thresholds |
| `concurrency` | `number` | — | Max parallel case executions |
| `targetVersion` | `string` | — | Version identifier for fixture invalidation |
| `replay` | `ReplayConfig` | — | Replay-specific settings |
| `tags` | `string[]` | — | Tags for organization |

Cases can be specified as:

- File path: `"cases/smoke.jsonl"` or `"cases/smoke.yaml"` (resolved relative to the config file)
- Inline array: `[{ id: "test-1", input: { prompt: "hello" } }]`
- Mixed array: `["cases/base.jsonl", { id: "extra", input: { prompt: "edge case" } }]`

Supported formats: `.jsonl` (with `//` and `#` comment support), `.yaml`, `.yml`.

Case IDs must be unique within a file.
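
For illustration, a hypothetical `cases/smoke.jsonl` might look like this (the case IDs and prompts below are made up):

```jsonl
// Smoke-test cases (JSONL here supports // and # comments)
{"id": "greet-1", "input": {"prompt": "Say hello"}, "expected": {"text": "hello"}, "category": "happy_path", "tags": ["fast"]}
# Edge case: empty prompt
{"id": "empty-1", "input": {"prompt": ""}, "category": "edge_case"}
```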

Each entry in `defaultGraders` wraps a grader function with execution metadata:

| Property | Type | Default | Description |
| --- | --- | --- | --- |
| `grader` | `GraderFn` | (required) | The grader function (e.g., `contains("hello")`) |
| `weight` | `number` | `1.0` | Weight in the weighted average score |
| `required` | `boolean` | `false` | If `true`, failure causes an immediate score of 0 |
| `threshold` | `number` | `0.5` | Per-grader pass threshold |
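
A sketch of how these fields could combine into a case score, assuming the weighted-average and required-grader semantics described above (an illustration, not the library's actual implementation):

```typescript
type GraderResult = { score: number; weight: number; required: boolean; threshold: number };

// Hypothetical combiner: a failed required grader zeroes the case score;
// otherwise scores are averaged, weighted by each grader's weight.
function combineScore(results: GraderResult[]): number {
  if (results.some((r) => r.required && r.score < r.threshold)) return 0;
  const totalWeight = results.reduce((sum, r) => sum + r.weight, 0);
  const weighted = results.reduce((sum, r) => sum + r.score * r.weight, 0);
  return totalWeight > 0 ? weighted / totalWeight : 0;
}
```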
A single case in JSONL:

```jsonl
{"id": "unique-id", "input": {"prompt": "question"}, "expected": {"text": "answer"}, "category": "happy_path", "tags": ["important"]}
```

Case fields:

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| `id` | `string` | yes | Unique identifier within the file |
| `input` | `Record<string, unknown>` | yes | Arbitrary key-value pairs passed to your target |
| `expected` | `CaseExpected` | no | Expected output for grader comparison |
| `description` | `string` | no | Human-readable description |
| `category` | `CaseCategory` | no | One of: `happy_path`, `edge_case`, `adversarial`, `multi_step`, `regression` |
| `tags` | `string[]` | no | Tags for filtering and organization |

Fields of the `expected` object (`CaseExpected`):

| Field | Type | Description |
| --- | --- | --- |
| `text` | `string` | Expected text (used by `factuality`, `exactMatch`, etc.) |
| `toolCalls` | `ToolCall[]` | Expected tool calls |
| `metadata` | `Record<string, unknown>` | Arbitrary metadata (e.g., `classification` for `llmClassify`) |

Gate thresholds (`GateConfig`):

| Property | Type | Description |
| --- | --- | --- |
| `passRate` | `number` (0–1) | Minimum fraction of cases that must pass |
| `maxCost` | `number` | Maximum total run cost in USD |
| `p95LatencyMs` | `number` | Maximum 95th percentile latency in milliseconds |

All gates are optional. A failed gate sets `gateResult.pass = false` and causes CLI exit code 1.
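
An illustrative gate configuration combining all three thresholds (the specific numbers are examples, not recommendations):

```ts
gates: {
  passRate: 0.9,      // at least 90% of cases must pass
  maxCost: 2.5,       // fail the run if total cost exceeds $2.50
  p95LatencyMs: 8000, // fail if p95 latency exceeds 8 seconds
},
```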

Replay settings (`ReplayConfig`):

| Property | Type | Default | Description |
| --- | --- | --- | --- |
| `ttlDays` | `number` | `14` | Fixture staleness threshold in days |
| `stripRaw` | `boolean` | `true` | Remove the `raw` field from recorded fixtures |

The target function must return a `TargetOutput`:

```ts
interface TargetOutput {
  text?: string;           // Main text response
  latencyMs: number;       // Response latency (required)
  raw?: unknown;           // Raw API response (stripped in fixtures if configured)
  toolCalls?: ToolCall[];  // Tool/function calls made
  tokenUsage?: TokenUsage; // Token counts { input?: number, output?: number }
  cost?: number;           // Estimated cost in USD
}
```

Only `latencyMs` is required; all other fields are optional.
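
A sketch of a target that measures latency itself; `myAgent` and its response shape are placeholders for your own SDK:

```ts
target: async (input) => {
  const start = Date.now();
  const response = await myAgent(input.prompt); // your agent call
  return {
    text: response.text,
    latencyMs: Date.now() - start, // required field
    raw: response,                 // stripped from fixtures when stripRaw is true
    cost: response.costUsd,        // optional; enables the maxCost gate
  };
},
```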

Required for LLM graders (`llmRubric`, `factuality`, `llmClassify`). The judge is provider-agnostic: you implement the `call` function for your LLM SDK.

```ts
import Anthropic from "@anthropic-ai/sdk";
import { defineConfig } from "agent-eval-kit";

const client = new Anthropic();

export default defineConfig({
  judge: {
    call: async (messages, options) => {
      const response = await client.messages.create({
        model: options?.model ?? "claude-sonnet-4-20250514",
        max_tokens: options?.maxTokens ?? 1024,
        messages: messages
          .filter((m) => m.role !== "system")
          .map((m) => ({
            role: m.role as "user" | "assistant",
            content: m.content,
          })),
        system: messages.find((m) => m.role === "system")?.content,
      });
      return {
        text: response.content[0].type === "text" ? response.content[0].text : "",
        tokenUsage: {
          input: response.usage.input_tokens,
          output: response.usage.output_tokens,
        },
      };
    },
    model: "claude-sonnet-4-20250514",
    temperature: 0,
    maxTokens: 1024,
  },
  suites: [/* ... */],
});
```

Judge options (`JudgeConfig`):

| Property | Type | Required | Description |
| --- | --- | --- | --- |
| `call` | `JudgeCallFn` | yes | `(messages, options?) => Promise<JudgeResponse>` |
| `model` | `string` | no | Default model identifier |
| `temperature` | `number` | no | Default temperature |
| `maxTokens` | `number` | no | Default max tokens |

Messages passed to `call` have the shape:

```ts
{ role: "system" | "user" | "assistant", content: string }
```

and `call` must resolve to a `JudgeResponse` of the shape:

```ts
{ text: string, tokenUsage?: TokenUsage, cost?: number, modelId?: string }
```
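
Because the judge is provider-agnostic, you can swap in a stub for offline wiring tests. This hypothetical stub echoes the last user message instead of calling an LLM; the type aliases are local and simply mirror the shapes documented above:

```typescript
// Local aliases matching the documented message/response shapes.
type JudgeMessage = { role: "system" | "user" | "assistant"; content: string };
type JudgeResponse = { text: string; tokenUsage?: { input?: number; output?: number }; cost?: number };

// Hypothetical stub: returns the last user message as the "judgment",
// so config wiring can be exercised without spending tokens.
const stubJudgeCall = async (messages: JudgeMessage[]): Promise<JudgeResponse> => {
  const lastUser = [...messages].reverse().find((m) => m.role === "user");
  return { text: lastUser?.content ?? "", cost: 0 };
};
```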

Reporters can be specified as plain string names or as objects with options:

```ts
reporters: [
  "console",                                    // string name
  { reporter: "json", output: "results.json" }, // with output file
  { reporter: "junit", output: "results.xml" }, // JUnit XML
  "markdown",                                   // markdown table
]
```

See Reporters for details on each format.