
# Config Reference

agent-eval-kit uses `jiti` for TypeScript config loading. Create one of:

- `eval.config.ts` (recommended)
- `eval.config.mts`
- `eval.config.js`
- `eval.config.mjs`

```ts
import { defineConfig } from "agent-eval-kit";

export default defineConfig({
  // ... configuration
});
```

`defineConfig` is a pure identity function that provides TypeScript type inference.

| Property | Type | Default | Description |
| --- | --- | --- | --- |
| `suites` | `SuiteConfig[]` | (required) | Array of eval suites |
| `judge` | `JudgeConfig` | — | Global LLM judge configuration |
| `fixtureDir` | `string` | `".eval-fixtures"` | Directory for recorded fixtures |
| `plugins` | `EvalPlugin[]` | `[]` | Plugins to apply |
| `reporters` | `ReporterConfig[]` | `[]` | Output reporters |
| `run` | `RunConfig` | — | Global run settings |

The `run` object (`RunConfig`) accepts:

| Property | Type | Default | Description |
| --- | --- | --- | --- |
| `defaultMode` | `"live" \| "replay" \| "judge-only"` | `"live"` | Default execution mode |
| `timeoutMs` | `number` | `30000` | Per-case timeout in milliseconds |
| `rateLimit` | `number` | — | Max requests per minute (live mode only) |
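
For example, a CI setup might pin replay mode and raise the timeout; the values below are illustrative, not defaults:

```ts
run: {
  defaultMode: "replay", // deterministic CI runs against recorded fixtures
  timeoutMs: 60000,      // allow slower cases up to 60s
  rateLimit: 30,         // applies only when running in "live" mode
},
```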

Each suite defines a target function, cases, graders, and gates.

```ts
import { defineConfig, contains, latency } from "agent-eval-kit";

export default defineConfig({
  suites: [
    {
      name: "smoke",
      description: "Basic smoke tests",
      target: async (input) => {
        const response = await myAgent(input.prompt);
        return { text: response.text, latencyMs: response.latencyMs };
      },
      cases: "cases/smoke.jsonl",
      defaultGraders: [
        { grader: contains("hello"), weight: 1.0 },
        { grader: latency(5000), weight: 0.5, required: true },
      ],
      gates: { passRate: 0.9 },
      concurrency: 5,
      targetVersion: "v1.2.0",
      replay: { ttlDays: 14, stripRaw: true },
      tags: ["ci", "fast"],
    },
  ],
});
```

Suite options (`SuiteConfig`):

| Property | Type | Default | Description |
| --- | --- | --- | --- |
| `name` | `string` | (required) | Suite identifier |
| `description` | `string` | — | Human-readable description |
| `target` | `(input: CaseInput) => Promise<TargetOutput>` | (required) | Function that calls your agent/API |
| `cases` | `Case[] \| string \| (Case \| string)[]` | (required) | Inline cases, file path(s), or a mix |
| `defaultGraders` | `GraderConfig[]` | `[]` | Graders applied to all cases in this suite |
| `gates` | `GateConfig` | — | Quality thresholds |
| `concurrency` | `number` | — | Max parallel case executions |
| `targetVersion` | `string` | — | Version identifier for fixture invalidation |
| `replay` | `ReplayConfig` | — | Replay-specific settings |
| `tags` | `string[]` | — | Tags for organization |

Cases can be specified as:

- File path: `"cases/smoke.jsonl"` or `"cases/smoke.yaml"` (resolved relative to the config file)
- Inline array: `[{ id: "test-1", input: { prompt: "hello" } }]`
- Mixed array: `["cases/base.jsonl", { id: "extra", input: { prompt: "edge case" } }]`

Supported formats: `.jsonl` (with `//` and `#` comment support), `.yaml`, `.yml`.

Case IDs must be unique within a file.
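
For illustration, a hypothetical `cases/smoke.jsonl` might look like this (the case IDs and prompts below are made up):

```jsonl
// Smoke-test cases (JSONL here supports // and # comments)
{"id": "greet-1", "input": {"prompt": "Say hello"}, "expected": {"text": "hello"}, "category": "happy_path", "tags": ["fast"]}
# Edge case: empty prompt
{"id": "empty-1", "input": {"prompt": ""}, "category": "edge_case"}
```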

Each entry in `defaultGraders` wraps a grader function with execution metadata:

| Property | Type | Default | Description |
| --- | --- | --- | --- |
| `grader` | `GraderFn` | (required) | The grader function (e.g., `contains("hello")`) |
| `weight` | `number` | `1.0` | Weight in the weighted average score |
| `required` | `boolean` | `false` | If `true`, failure causes an immediate score of 0 |
| `threshold` | `number` | `0.5` | Per-grader pass threshold |
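
A sketch of how these fields could combine into a case score, assuming the weighted-average and required-grader semantics described above (an illustration, not the library's actual implementation):

```typescript
type GraderResult = { score: number; weight: number; required: boolean; threshold: number };

// Hypothetical combiner: a failed required grader zeroes the case score;
// otherwise scores are averaged, weighted by each grader's weight.
function combineScore(results: GraderResult[]): number {
  if (results.some((r) => r.required && r.score < r.threshold)) return 0;
  const totalWeight = results.reduce((sum, r) => sum + r.weight, 0);
  const weighted = results.reduce((sum, r) => sum + r.score * r.weight, 0);
  return totalWeight > 0 ? weighted / totalWeight : 0;
}
```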
A single case in JSONL:

```jsonl
{"id": "unique-id", "input": {"prompt": "question"}, "expected": {"text": "answer"}, "category": "happy_path", "tags": ["important"]}
```

Case fields:

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| `id` | `string` | yes | Unique identifier within the file |
| `input` | `Record<string, unknown>` | yes | Arbitrary key-value pairs passed to your target |
| `expected` | `CaseExpected` | no | Expected output for grader comparison |
| `description` | `string` | no | Human-readable description |
| `category` | `CaseCategory` | no | One of: `happy_path`, `edge_case`, `adversarial`, `multi_step`, `regression` |
| `tags` | `string[]` | no | Tags for filtering and organization |

Fields of the `expected` object (`CaseExpected`):

| Field | Type | Description |
| --- | --- | --- |
| `text` | `string` | Expected text (used by `factuality`, `exactMatch`, etc.) |
| `toolCalls` | `ToolCall[]` | Expected tool calls |
| `metadata` | `Record<string, unknown>` | Arbitrary metadata (e.g., `classification` for `llmClassify`) |

Gate thresholds (`GateConfig`):

| Property | Type | Description |
| --- | --- | --- |
| `passRate` | `number` (0–1) | Minimum fraction of cases that must pass |
| `maxCost` | `number` | Maximum total run cost in USD |
| `p95LatencyMs` | `number` | Maximum 95th percentile latency in milliseconds |

All gates are optional. A failed gate sets `gateResult.pass = false` and causes CLI exit code 1.
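
An illustrative gate configuration combining all three thresholds (the specific numbers are examples, not recommendations):

```ts
gates: {
  passRate: 0.9,      // at least 90% of cases must pass
  maxCost: 2.5,       // fail the run if total cost exceeds $2.50
  p95LatencyMs: 8000, // fail if p95 latency exceeds 8 seconds
},
```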

Replay settings (`ReplayConfig`):

| Property | Type | Default | Description |
| --- | --- | --- | --- |
| `ttlDays` | `number` | `14` | Fixture staleness threshold in days |
| `stripRaw` | `boolean` | `true` | Remove the `raw` field from recorded fixtures |

The target function must return a `TargetOutput`:

```ts
interface TargetOutput {
  text?: string;           // Main text response
  latencyMs: number;       // Response latency (required)
  raw?: unknown;           // Raw API response (stripped in fixtures if configured)
  toolCalls?: ToolCall[];  // Tool/function calls made
  tokenUsage?: TokenUsage; // Token counts { input?: number, output?: number }
  cost?: number;           // Estimated cost in USD
}
```

Only `latencyMs` is required; all other fields are optional.
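
A sketch of a target that measures latency itself; `myAgent` and its response shape are placeholders for your own SDK:

```ts
target: async (input) => {
  const start = Date.now();
  const response = await myAgent(input.prompt); // your agent call
  return {
    text: response.text,
    latencyMs: Date.now() - start, // required field
    raw: response,                 // stripped from fixtures when stripRaw is true
    cost: response.costUsd,        // optional; enables the maxCost gate
  };
},
```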

Required for LLM graders (`llmRubric`, `factuality`, `llmClassify`). The judge is provider-agnostic: you implement the `call` function for your LLM SDK.

```ts
import Anthropic from "@anthropic-ai/sdk";
import { defineConfig } from "agent-eval-kit";

const client = new Anthropic();

export default defineConfig({
  judge: {
    call: async (messages, options) => {
      const response = await client.messages.create({
        model: options?.model ?? "claude-sonnet-4-20250514",
        max_tokens: options?.maxTokens ?? 1024,
        messages: messages
          .filter((m) => m.role !== "system")
          .map((m) => ({
            role: m.role as "user" | "assistant",
            content: m.content,
          })),
        system: messages.find((m) => m.role === "system")?.content,
      });
      return {
        text: response.content[0].type === "text" ? response.content[0].text : "",
        tokenUsage: {
          input: response.usage.input_tokens,
          output: response.usage.output_tokens,
        },
      };
    },
    model: "claude-sonnet-4-20250514",
    temperature: 0,
    maxTokens: 1024,
  },
  suites: [/* ... */],
});
```

Judge options (`JudgeConfig`):

| Property | Type | Required | Description |
| --- | --- | --- | --- |
| `call` | `JudgeCallFn` | yes | `(messages, options?) => Promise<JudgeResponse>` |
| `model` | `string` | no | Default model identifier |
| `temperature` | `number` | no | Default temperature |
| `maxTokens` | `number` | no | Default max tokens |

Messages passed to `call` have the shape:

```ts
{ role: "system" | "user" | "assistant", content: string }
```

and `call` must resolve to a `JudgeResponse` of the shape:

```ts
{ text: string, tokenUsage?: TokenUsage, cost?: number, modelId?: string }
```
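
Because the judge is provider-agnostic, you can swap in a stub for offline wiring tests. This hypothetical stub echoes the last user message instead of calling an LLM; the type aliases are local and simply mirror the shapes documented above:

```typescript
// Local aliases matching the documented message/response shapes.
type JudgeMessage = { role: "system" | "user" | "assistant"; content: string };
type JudgeResponse = { text: string; tokenUsage?: { input?: number; output?: number }; cost?: number };

// Hypothetical stub: returns the last user message as the "judgment",
// so config wiring can be exercised without spending tokens.
const stubJudgeCall = async (messages: JudgeMessage[]): Promise<JudgeResponse> => {
  const lastUser = [...messages].reverse().find((m) => m.role === "user");
  return { text: lastUser?.content ?? "", cost: 0 };
};
```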

Reporters can be specified as plain string names or as objects with options:

```ts
reporters: [
  "console",                                    // string name
  { reporter: "json", output: "results.json" }, // with output file
  { reporter: "junit", output: "results.xml" }, // JUnit XML
  "markdown",                                   // markdown table
]
```

See Reporters for details on each format.