Graders API
agent-eval-kit ships 20 built-in graders across three tiers: deterministic (pure functions — fast, free, reproducible), LLM (require a judge configuration), and composition (combine other graders with boolean logic).
All graders are available from both agent-eval-kit (root) and the agent-eval-kit/graders subpath.
Text Graders
contains

Checks if output text contains a substring. Case-insensitive by default.

```typescript
contains("Paris")
contains("Paris", { caseSensitive: true })
```

| Option | Type | Default | Description |
|---|---|---|---|
| caseSensitive | boolean | false | Enable case-sensitive matching |
notContains
Checks that output text does NOT contain a substring. Case-insensitive by default.

```typescript
notContains("sorry")
notContains("ERROR", { caseSensitive: true })
```

| Option | Type | Default | Description |
|---|---|---|---|
| caseSensitive | boolean | false | Enable case-sensitive matching |
exactMatch
Checks that output text exactly equals the expected string. Trims whitespace by default.

```typescript
exactMatch("42")
exactMatch("Hello World", { trim: false, caseSensitive: false })
```

| Option | Type | Default | Description |
|---|---|---|---|
| trim | boolean | true | Trim whitespace before comparing |
| caseSensitive | boolean | true | Enable case-sensitive comparison |
regex

Tests output against a regular expression. Accepts a string or RegExp. The regex is compiled at factory time, so errors are caught early, not at grade time.

```typescript
regex(/\d{3}-\d{4}/)
regex("\\d+\\.\\d{2}", { flags: "i" })
```

| Option | Type | Default | Description |
|---|---|---|---|
| flags | string | — | Regex flags (only for string patterns) |
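Factory-time compilation can be sketched in a few lines of plain TypeScript. This is an illustration of the behavior described above, not the library's source; `regexGrader` and its return shape are hypothetical:

```typescript
// Sketch: compile the pattern when the grader is constructed, so an invalid
// pattern throws immediately instead of during a grading run.
function regexGrader(pattern: string | RegExp, opts?: { flags?: string }) {
  // Compilation happens here, at factory time; a bad pattern throws now.
  const re = typeof pattern === "string" ? new RegExp(pattern, opts?.flags) : pattern;
  return (outputText: string) => {
    const pass = re.test(outputText);
    return { pass, score: pass ? 1 : 0 };
  };
}

const priceGrader = regexGrader("\\d+\\.\\d{2}");
// regexGrader("(unclosed") would throw a SyntaxError right here,
// before any case is ever graded.
```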
jsonSchema
Parses output text as JSON and validates against a Zod schema. Fails if text is empty, not valid JSON, or does not match the schema.

```typescript
import { z } from "zod";

jsonSchema(z.object({ name: z.string(), age: z.number() }))
```

Tool Call Graders
toolCalled

Checks that a specific tool was invoked in the output tool calls. No tool calls = fail.

```typescript
toolCalled("search")
```

toolNotCalled
Checks that a specific tool was NOT invoked. No tool calls = pass.

```typescript
toolNotCalled("deleteAll")
```

toolSequence
Checks that tool calls match an expected sequence. Four modes are available.

```typescript
toolSequence(["search", "summarize"])                     // unordered (default)
toolSequence(["search", "summarize"], "strict")           // exact order and count
toolSequence(["search"], "subset")                        // expected tools appear in actual
toolSequence(["search", "summarize", "save"], "superset") // actual tools appear in expected
```

| Mode | Description |
|---|---|
| unordered | Same tools, any order, same count (default) |
| strict | Exact order and count |
| subset | All expected tools appear in actual (actual may have extras) |
| superset | All actual tools appear in expected (actual did fewer steps) |
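The four comparison modes can be sketched in plain TypeScript. This is an illustration of the semantics in the table above, not the library's implementation; `compareSequence` is a hypothetical name:

```typescript
// Sketch of the four sequence-comparison modes described above.
type SeqMode = "unordered" | "strict" | "subset" | "superset";

function compareSequence(expected: string[], actual: string[], mode: SeqMode = "unordered"): boolean {
  if (mode === "strict") {
    // Exact order and count.
    return expected.length === actual.length && expected.every((t, i) => t === actual[i]);
  }
  if (mode === "subset") {
    // Every expected tool appears somewhere in actual (actual may have extras).
    return expected.every((t) => actual.includes(t));
  }
  if (mode === "superset") {
    // Every actual tool appears in expected (actual did fewer steps).
    return actual.every((t) => expected.includes(t));
  }
  // unordered (default): same tools and counts, any order.
  const a = [...expected].sort();
  const b = [...actual].sort();
  return a.length === b.length && a.every((t, i) => t === b[i]);
}
```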
toolArgsMatch
Checks that a tool call’s arguments match expected values. Matches the first tool call with the given name.

```typescript
toolArgsMatch("search", { query: "weather" })             // subset (default)
toolArgsMatch("search", { query: "weather" }, "exact")    // deep equality
toolArgsMatch("search", { query: "weather" }, "contains") // strings use .includes()
```

| Mode | Description |
|---|---|
| subset | Every expected key exists in actual with matching value (default) |
| exact | Deep equality of entire args object |
| contains | Like subset, but string values use .includes() for natural language args |
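The three matching modes can be sketched as follows. This is an illustration of the table above under stated assumptions, not the library's code; `matchArgs` and `deepEqual` are hypothetical helpers:

```typescript
// Sketch of the three arg-matching modes described above.
type ArgsMode = "subset" | "exact" | "contains";

// Naive structural equality; good enough for a sketch.
function deepEqual(a: unknown, b: unknown): boolean {
  return JSON.stringify(a) === JSON.stringify(b);
}

function matchArgs(
  expected: Record<string, unknown>,
  actual: Record<string, unknown>,
  mode: ArgsMode = "subset",
): boolean {
  if (mode === "exact") {
    // Deep equality of the entire args object.
    return deepEqual(expected, actual);
  }
  // subset / contains: every expected key must be present with a matching value.
  return Object.entries(expected).every(([key, want]) => {
    const got = actual[key];
    if (mode === "contains" && typeof want === "string" && typeof got === "string") {
      // contains: string values use .includes() for natural-language args.
      return got.includes(want);
    }
    return deepEqual(got, want);
  });
}
```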
Metric Graders
latency

Checks that response latency (output.latencyMs) is within the allowed threshold. Always evaluates, since latencyMs is required.

```typescript
latency(5000)
```

cost

Checks that response cost (output.cost) is within budget. Skips gracefully (pass) if output.cost is not reported.

```typescript
cost(0.05)
```

tokenCount

Checks that total token usage (input + output) is within the allowed limit. Skips gracefully (pass) if output.tokenUsage is not reported.

```typescript
tokenCount(4096)
```

Safety Graders
safetyKeywords

Checks that output text does NOT contain any prohibited keywords. Case-insensitive matching.

```typescript
safetyKeywords(["guaranteed returns", "buy now", "act fast"])
```

noHallucinatedNumbers

Checks that numbers in output text are grounded in tool call results. Catches fabricated statistics, one of the most dangerous agent failure modes.

```typescript
noHallucinatedNumbers()
noHallucinatedNumbers({ tolerance: 0.01, skipSmallIntegers: false })
```

| Option | Type | Default | Description |
|---|---|---|---|
| tolerance | number | 0.005 | Relative tolerance for matching (0.005 = 0.5%) |
| skipSmallIntegers | boolean | true | Skip integers with absolute value < 10 |
Always skips year-like numbers (1900–2100). Score is proportional: (checked - hallucinated) / checked. Returns metadata: { hallucinated: number[], totalChecked: number }.
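The grounding check and proportional score can be sketched in plain TypeScript. This is a sketch of the documented behavior, not the library's implementation; in particular, which value the relative tolerance is measured against is an assumption here:

```typescript
// Sketch: a number in the output counts as grounded if some tool-result number
// lies within the relative tolerance; score = (checked - hallucinated) / checked.
function checkNumbers(
  outputNumbers: number[],
  toolResultNumbers: number[],
  opts: { tolerance?: number; skipSmallIntegers?: boolean } = {},
) {
  const { tolerance = 0.005, skipSmallIntegers = true } = opts;
  const checked = outputNumbers.filter((n) => {
    if (skipSmallIntegers && Number.isInteger(n) && Math.abs(n) < 10) return false;
    if (Number.isInteger(n) && n >= 1900 && n <= 2100) return false; // year-like
    return true;
  });
  const hallucinated = checked.filter(
    (n) => !toolResultNumbers.some((g) => Math.abs(n - g) <= tolerance * Math.max(Math.abs(g), 1)),
  );
  const score = checked.length === 0 ? 1 : (checked.length - hallucinated.length) / checked.length;
  return { score, hallucinated, totalChecked: checked.length };
}
```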
LLM Graders
These require a judge configuration in your eval config. See the LLM Judge Guide for setup.
llmRubric
Scores agent output against natural language criteria using an LLM judge. The judge scores 1–4 (poor to excellent), normalized to 0.25–1.0.
```typescript
// String shorthand
llmRubric("Response is helpful, accurate, and well-formatted")

// With options
llmRubric({
  criteria: "Answer is concise",
  passThreshold: 0.75,
  examples: [
    { output: "Yes.", score: 4, reasoning: "Direct and concise" },
    { output: "Well, I think maybe...", score: 1, reasoning: "Rambling" },
  ],
})
```

| Option | Type | Default | Description |
|---|---|---|---|
| criteria | string | (required) | Natural language evaluation criteria |
| passThreshold | number | 0.75 | Score threshold for passing (0–1) |
| examples | array | — | Few-shot calibration examples ({ output, score: 1\|2\|3\|4, reasoning }) |
Default pass threshold of 0.75 maps to a judge score of 3 (“Good”) or higher.
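The normalization can be sketched in one line. The exact formula is an assumption (judge score divided by 4), but it is consistent with the documented 1–4 range mapping to 0.25–1.0 and with the default threshold passing at a score of 3:

```typescript
// Sketch of the assumed normalization: judgeScore / 4 gives 0.25–1.0, and the
// default threshold of 0.75 passes exactly at a judge score of 3 ("Good").
function normalizeJudgeScore(judgeScore: 1 | 2 | 3 | 4, passThreshold = 0.75) {
  const score = judgeScore / 4;
  return { score, pass: score >= passThreshold };
}
```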
factuality
Specialized LLM judge that evaluates factual consistency against a reference. Requires expected.text on the case; fails immediately without it.
```typescript
factuality()
factuality({ passThreshold: 0.9 })
```

| Option | Type | Default | Description |
|---|---|---|---|
| passThreshold | number | 0.75 | Score threshold for passing (0–1) |
Evaluates accuracy, completeness, and absence of fabrication using a built-in rubric with 3 calibration examples.
llmClassify
Classifies agent output into one of N categories using an LLM judge. Requires at least 2 categories.
```typescript
llmClassify({
  categories: {
    helpful: "Answers the question directly",
    unhelpful: "Does not answer the question",
  },
})

llmClassify({
  categories: {
    positive: "Positive sentiment",
    negative: "Negative sentiment",
    neutral: "Neutral sentiment",
  },
  criteria: "Classify based on overall tone",
})
```

| Option | Type | Required | Description |
|---|---|---|---|
| categories | Record<string, string> | yes | Category name → description (min 2) |
| criteria | string | no | Additional classification instructions |
Pass condition: If expected.metadata.classification is set on the case, the judge’s classification must match. If not set, runs in classification-only mode (always passes).
Returns metadata: { classification, reasoning, confidence, judgeCost }.
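The pass condition reduces to a small check, sketched here with a hypothetical helper name:

```typescript
// Sketch of the pass condition described above: with an expected classification
// the judge's label must match; without one, classification-only mode always passes.
function classificationPasses(judgeLabel: string, expectedLabel?: string): boolean {
  if (expectedLabel === undefined) return true; // classification-only mode
  return judgeLabel === expectedLabel;
}
```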
Composition Operators
Combine graders with boolean logic. These operators do not short-circuit; they collect all results for complete reporting.

all

Conjunction: all graders must pass. Score is the minimum of all scores.

```typescript
all([contains("Paris"), toolCalled("search"), latency(5000)])
```

An empty list returns pass (vacuous truth, score 1.0).
any

Disjunction: at least one grader must pass. Score is the maximum of all scores.

```typescript
any([contains("capital of France"), contains("Paris")])
```

An empty list returns fail (score 0.0).
not

Negation: inverts a grader’s result. Score becomes 1 - original. graderName becomes not(<inner>).

```typescript
not(contains("I don't know"))
```

Weighted Scoring

Apply multiple graders per case with weights and a required flag. The final score is a weighted average. A case passes if the weighted score meets the pass threshold (default: 0.5).
```typescript
defaultGraders: [
  { grader: contains("hello"), weight: 0.3 },
  { grader: latency(5000), weight: 0.2, required: true },
  { grader: llmRubric("Helpful response"), weight: 0.5 },
]
```

Required graders that fail cause an immediate score of 0 regardless of other graders.
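The aggregation rule can be sketched in plain TypeScript. This is an illustration of the behavior described above, not the library's code; all names here are hypothetical, and normalizing by the total weight is an assumption:

```typescript
// Sketch of weighted aggregation: weighted average of grader scores, with any
// failing required grader forcing the case score to 0.
interface GraderResult {
  score: number;
  pass: boolean;
}
interface WeightedEntry {
  result: GraderResult;
  weight: number;
  required?: boolean;
}

function aggregate(entries: WeightedEntry[], passThreshold = 0.5) {
  if (entries.some((e) => e.required && !e.result.pass)) {
    return { score: 0, pass: false }; // failed required grader zeroes the case
  }
  const totalWeight = entries.reduce((sum, e) => sum + e.weight, 0);
  const score = entries.reduce((sum, e) => sum + e.result.score * e.weight, 0) / totalWeight;
  return { score, pass: score >= passThreshold };
}
```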