
Graders API

agent-eval-kit ships 20 built-in graders across three tiers: deterministic (pure functions — fast, free, reproducible), LLM (require a judge configuration), and composition (combine other graders with boolean logic).

All graders are available from both agent-eval-kit (root) and the agent-eval-kit/graders subpath.

Checks if output text contains a substring. Case-insensitive by default.

```ts
contains("Paris")
contains("Paris", { caseSensitive: true })
```

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| `caseSensitive` | `boolean` | `false` | Enable case-sensitive matching |

Checks that output text does NOT contain a substring. Case-insensitive by default.

```ts
notContains("sorry")
notContains("ERROR", { caseSensitive: true })
```

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| `caseSensitive` | `boolean` | `false` | Enable case-sensitive matching |

Checks that output text exactly equals the expected string. Trims whitespace by default.

```ts
exactMatch("42")
exactMatch("Hello World", { trim: false, caseSensitive: false })
```

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| `trim` | `boolean` | `true` | Trim whitespace before comparing |
| `caseSensitive` | `boolean` | `true` | Enable case-sensitive comparison |

Tests output against a regular expression. Accepts a string or RegExp. The regex is compiled at factory time, so invalid patterns fail when the grader is constructed rather than at grade time.

```ts
regex(/\d{3}-\d{4}/)
regex("\\d+\\.\\d{2}", { flags: "i" })
```

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| `flags` | `string` | (none) | Regex flags (only for string patterns) |
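As a rough illustration of the factory-time compilation described above, a grader factory can construct the RegExp immediately so that a malformed string pattern throws at configuration time. This is a sketch of the behavior, not the library's actual source; `makeRegexGrader` is a hypothetical name.

```ts
// Sketch: compiling the pattern in the factory surfaces errors early.
function makeRegexGrader(pattern: string | RegExp, flags?: string) {
  // Throws here (at factory time) if a string pattern is invalid.
  const re = typeof pattern === "string" ? new RegExp(pattern, flags) : pattern;
  return (text: string): boolean => re.test(text);
}
```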

Parses output text as JSON and validates against a Zod schema. Fails if text is empty, not valid JSON, or does not match the schema.

```ts
import { z } from "zod";

jsonSchema(z.object({ name: z.string(), age: z.number() }))
```

Checks that a specific tool was invoked in the output tool calls. No tool calls = fail.

```ts
toolCalled("search")
```

Checks that a specific tool was NOT invoked. No tool calls = pass.

```ts
toolNotCalled("deleteAll")
```

Checks that tool calls match an expected sequence. Four modes available.

```ts
toolSequence(["search", "summarize"])                      // unordered (default)
toolSequence(["search", "summarize"], "strict")            // exact order and count
toolSequence(["search"], "subset")                         // expected tools appear in actual
toolSequence(["search", "summarize", "save"], "superset")  // actual tools appear in expected
```

| Mode | Description |
| --- | --- |
| `unordered` | Same tools, any order, same count (default) |
| `strict` | Exact order and count |
| `subset` | All expected tools appear in actual (actual may have extras) |
| `superset` | All actual tools appear in expected (actual did fewer steps) |
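The four modes can be sketched as plain list comparisons. This is an illustration of the mode descriptions above under the assumption that tool calls are compared by name; the grader's actual implementation may differ.

```ts
type SequenceMode = "unordered" | "strict" | "subset" | "superset";

// Sketch of the four comparison modes for tool-name sequences.
function sequenceMatches(
  expected: string[],
  actual: string[],
  mode: SequenceMode = "unordered",
): boolean {
  switch (mode) {
    case "strict":
      // Exact order and count.
      return expected.length === actual.length && expected.every((t, i) => t === actual[i]);
    case "unordered": {
      // Same tools, any order, same count: compare sorted copies.
      if (expected.length !== actual.length) return false;
      const a = [...expected].sort();
      const b = [...actual].sort();
      return a.every((t, i) => t === b[i]);
    }
    case "subset":
      // Every expected tool appears in actual (actual may have extras).
      return expected.every((t) => actual.includes(t));
    case "superset":
      // Every actual tool appears in expected (actual did fewer steps).
      return actual.every((t) => expected.includes(t));
  }
}
```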

Checks that a tool call’s arguments match expected values. Matches the first tool call with the given name.

```ts
toolArgsMatch("search", { query: "weather" })              // subset (default)
toolArgsMatch("search", { query: "weather" }, "exact")     // deep equality
toolArgsMatch("search", { query: "weather" }, "contains")  // strings use .includes()
```

| Mode | Description |
| --- | --- |
| `subset` | Every expected key exists in actual with a matching value (default) |
| `exact` | Deep equality of the entire args object |
| `contains` | Like subset, but string values use `.includes()` for natural-language args |
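A minimal sketch of the three matching modes, assuming args are plain JSON objects (names here are illustrative, and JSON-stringify equality is a simplification of real deep equality):

```ts
type ArgsMode = "subset" | "exact" | "contains";

// Order-sensitive JSON comparison; adequate for this sketch only.
function deepEqual(a: unknown, b: unknown): boolean {
  return JSON.stringify(a) === JSON.stringify(b);
}

function argsMatch(
  expected: Record<string, unknown>,
  actual: Record<string, unknown>,
  mode: ArgsMode = "subset",
): boolean {
  if (mode === "exact") return deepEqual(expected, actual); // whole object must match
  return Object.entries(expected).every(([key, want]) => {
    const got = actual[key];
    if (mode === "contains" && typeof want === "string" && typeof got === "string") {
      return got.includes(want); // natural-language args: substring match
    }
    return deepEqual(want, got); // subset: each expected key must match
  });
}
```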

Checks that response latency (output.latencyMs) is within the allowed threshold. Always evaluates since latencyMs is required.

```ts
latency(5000)
```

Checks that response cost (output.cost) is within budget. Skips gracefully (pass) if output.cost is not reported.

```ts
cost(0.05)
```

Checks that total token usage (input + output) is within the allowed limit. Skips gracefully (pass) if output.tokenUsage is not reported.

```ts
tokenCount(4096)
```

Checks that output text does NOT contain any prohibited keywords. Case-insensitive matching.

```ts
safetyKeywords(["guaranteed returns", "buy now", "act fast"])
```

Checks that numbers in output text are grounded in tool call results. This catches fabricated statistics, one of the most dangerous agent failure modes.

```ts
noHallucinatedNumbers()
noHallucinatedNumbers({ tolerance: 0.01, skipSmallIntegers: false })
```

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| `tolerance` | `number` | `0.005` | Relative tolerance for matching (0.005 = 0.5%) |
| `skipSmallIntegers` | `boolean` | `true` | Skip integers with absolute value < 10 |

Always skips year-like numbers (1900–2100). Score is proportional: (checked - hallucinated) / checked. Returns metadata: { hallucinated: number[], totalChecked: number }.
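The proportional scoring rule can be sketched as follows, assuming a number counts as grounded when some tool-result number matches it within the relative tolerance. `hallucinationScore` is a hypothetical name; the library's extraction of numbers from text is not shown.

```ts
// Sketch: score = (checked - hallucinated) / checked.
function hallucinationScore(
  outputNumbers: number[],
  toolNumbers: number[],
  tolerance = 0.005,
): { score: number; hallucinated: number[]; totalChecked: number } {
  const grounded = (n: number) =>
    toolNumbers.some(
      (t) => Math.abs(n - t) <= tolerance * Math.max(Math.abs(n), Math.abs(t)),
    );
  const hallucinated = outputNumbers.filter((n) => !grounded(n));
  const totalChecked = outputNumbers.length;
  return {
    score: totalChecked === 0 ? 1 : (totalChecked - hallucinated.length) / totalChecked,
    hallucinated,
    totalChecked,
  };
}
```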

These require a judge configuration in your eval config. See LLM Judge Guide for setup.

Scores agent output against natural language criteria using an LLM judge. Judge scores 1–4 (poor to excellent), normalized to 0.25–1.0.

```ts
// String shorthand
llmRubric("Response is helpful, accurate, and well-formatted")

// With options
llmRubric({
  criteria: "Answer is concise",
  passThreshold: 0.75,
  examples: [
    { output: "Yes.", score: 4, reasoning: "Direct and concise" },
    { output: "Well, I think maybe...", score: 1, reasoning: "Rambling" },
  ],
})
```

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| `criteria` | `string` | (required) | Natural language evaluation criteria |
| `passThreshold` | `number` | `0.75` | Score threshold for passing (0-1) |
| `examples` | `array` | (none) | Few-shot calibration examples (`{ output, score: 1\|2\|3\|4, reasoning }`) |

Default pass threshold of 0.75 maps to a judge score of 3 (“Good”) or higher.
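The normalization implied here (1-4 mapped to 0.25-1.0, with 0.75 corresponding to a 3) is consistent with simply dividing the judge score by 4. A sketch under that assumption, with a hypothetical helper name:

```ts
// Sketch: normalize a 1-4 judge score and apply the pass threshold.
function rubricResult(judgeScore: 1 | 2 | 3 | 4, passThreshold = 0.75) {
  const score = judgeScore / 4; // 1 -> 0.25, 2 -> 0.5, 3 -> 0.75, 4 -> 1.0
  return { score, pass: score >= passThreshold };
}
```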

Specialized LLM judge that evaluates factual consistency against a reference. Requires `expected.text` on the case and fails immediately without it.

```ts
factuality()
factuality({ passThreshold: 0.9 })
```

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| `passThreshold` | `number` | `0.75` | Score threshold for passing (0-1) |

Evaluates accuracy, completeness, and absence of fabrication using a built-in rubric with 3 calibration examples.

Classifies agent output into one of N categories using an LLM judge. Requires at least 2 categories.

```ts
llmClassify({
  categories: {
    helpful: "Answers the question directly",
    unhelpful: "Does not answer the question",
  },
})

llmClassify({
  categories: {
    positive: "Positive sentiment",
    negative: "Negative sentiment",
    neutral: "Neutral sentiment",
  },
  criteria: "Classify based on overall tone",
})
```

| Option | Type | Required | Description |
| --- | --- | --- | --- |
| `categories` | `Record<string, string>` | yes | Category name → description (min 2) |
| `criteria` | `string` | no | Additional classification instructions |

Pass condition: If expected.metadata.classification is set on the case, the judge’s classification must match. If not set, runs in classification-only mode (always passes).

Returns metadata: { classification, reasoning, confidence, judgeCost }.

Combine graders with boolean logic. These operators do not short-circuit; they collect all results for complete reporting.

Conjunction: all graders must pass. Score is the minimum of all scores.

```ts
all([contains("Paris"), toolCalled("search"), latency(5000)])
```

Empty list returns pass (vacuous truth, score 1.0).

Disjunction: at least one grader must pass. Score is the maximum of all scores.

```ts
any([contains("capital of France"), contains("Paris")])
```

Empty list returns fail (score 0.0).

Negation: inverts a grader’s result. Score becomes 1 - original. graderName becomes not(<inner>).

```ts
not(contains("I don't know"))
```
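The three composition rules (minimum, maximum, inversion, plus the empty-list conventions) can be sketched directly. These are illustrative helpers, not the library's internals:

```ts
interface GradeResult { pass: boolean; score: number; }

// all: every grader must pass; score is the minimum.
function allOf(results: GradeResult[]): GradeResult {
  if (results.length === 0) return { pass: true, score: 1 }; // vacuous truth
  return {
    pass: results.every((r) => r.pass),
    score: Math.min(...results.map((r) => r.score)),
  };
}

// any: at least one grader must pass; score is the maximum.
function anyOf(results: GradeResult[]): GradeResult {
  if (results.length === 0) return { pass: false, score: 0 }; // empty list fails
  return {
    pass: results.some((r) => r.pass),
    score: Math.max(...results.map((r) => r.score)),
  };
}

// not: invert pass/fail; score becomes 1 - original.
function notOf(result: GradeResult): GradeResult {
  return { pass: !result.pass, score: 1 - result.score };
}
```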

Apply multiple graders per case with weights and a required flag. The final score is a weighted average. A case passes if the weighted score meets the pass threshold (default: 0.5).

```ts
defaultGraders: [
  { grader: contains("hello"), weight: 0.3 },
  { grader: latency(5000), weight: 0.2, required: true },
  { grader: llmRubric("Helpful response"), weight: 0.5 },
]
```

Required graders that fail cause an immediate score of 0 regardless of other graders.
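The aggregation rule described here can be sketched as a weighted average with a required-grader override. This is an assumption-level illustration (names hypothetical), including the convention of normalizing by the total weight:

```ts
interface WeightedResult {
  score: number;
  pass: boolean;
  weight: number;
  required?: boolean;
}

// Sketch: required failures short out to 0; otherwise weighted average vs threshold.
function caseScore(
  results: WeightedResult[],
  passThreshold = 0.5,
): { score: number; pass: boolean } {
  if (results.some((r) => r.required && !r.pass)) {
    return { score: 0, pass: false }; // a failing required grader zeroes the case
  }
  const totalWeight = results.reduce((sum, r) => sum + r.weight, 0);
  const score =
    totalWeight === 0
      ? 0
      : results.reduce((sum, r) => sum + r.score * r.weight, 0) / totalWeight;
  return { score, pass: score >= passThreshold };
}
```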