
Graders API

agent-eval-kit ships 20 built-in graders across three tiers: deterministic (pure functions — fast, free, reproducible), LLM (require a judge configuration), and composition (combine other graders with boolean logic).

All graders are available from both agent-eval-kit (root) and the agent-eval-kit/graders subpath.

Checks if output text contains a substring. Case-insensitive by default.

```ts
contains("Paris")
contains("Paris", { caseSensitive: true })
```

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| `caseSensitive` | `boolean` | `false` | Enable case-sensitive matching |

Checks that output text does NOT contain a substring. Case-insensitive by default.

```ts
notContains("sorry")
notContains("ERROR", { caseSensitive: true })
```

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| `caseSensitive` | `boolean` | `false` | Enable case-sensitive matching |

Checks that output text exactly equals the expected string. Trims whitespace by default.

```ts
exactMatch("42")
exactMatch("Hello World", { trim: false, caseSensitive: false })
```

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| `trim` | `boolean` | `true` | Trim whitespace before comparing |
| `caseSensitive` | `boolean` | `true` | Enable case-sensitive comparison |

Tests output against a regular expression. Accepts a string or RegExp. The regex is compiled at factory time, so invalid patterns fail when the grader is constructed rather than at grade time.

```ts
regex(/\d{3}-\d{4}/)
regex("\\d+\\.\\d{2}", { flags: "i" })
```

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| `flags` | `string` | (none) | Regex flags (only for string patterns) |
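As a rough illustration of the factory-time compilation described above, a grader factory can construct the RegExp immediately so that a malformed string pattern throws at configuration time. This is a sketch of the behavior, not the library's actual source; `makeRegexGrader` is a hypothetical name.

```ts
// Sketch: compiling the pattern in the factory surfaces errors early.
function makeRegexGrader(pattern: string | RegExp, flags?: string) {
  // Throws here (at factory time) if a string pattern is invalid.
  const re = typeof pattern === "string" ? new RegExp(pattern, flags) : pattern;
  return (text: string): boolean => re.test(text);
}
```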

Parses output text as JSON and validates against a Zod schema. Fails if text is empty, not valid JSON, or does not match the schema.

```ts
import { z } from "zod";

jsonSchema(z.object({ name: z.string(), age: z.number() }))
```

Checks that a specific tool was invoked in the output tool calls. No tool calls = fail.

```ts
toolCalled("search")
```

Checks that a specific tool was NOT invoked. No tool calls = pass.

```ts
toolNotCalled("deleteAll")
```

Checks that tool calls match an expected sequence. Four modes available.

```ts
toolSequence(["search", "summarize"])                      // unordered (default)
toolSequence(["search", "summarize"], "strict")            // exact order and count
toolSequence(["search"], "subset")                         // expected tools appear in actual
toolSequence(["search", "summarize", "save"], "superset")  // actual tools appear in expected
```

| Mode | Description |
| --- | --- |
| `unordered` | Same tools, any order, same count (default) |
| `strict` | Exact order and count |
| `subset` | All expected tools appear in actual (actual may have extras) |
| `superset` | All actual tools appear in expected (actual did fewer steps) |
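The four modes can be sketched as plain list comparisons. This is an illustration of the mode descriptions above under the assumption that tool calls are compared by name; the grader's actual implementation may differ.

```ts
type SequenceMode = "unordered" | "strict" | "subset" | "superset";

// Sketch of the four comparison modes for tool-name sequences.
function sequenceMatches(
  expected: string[],
  actual: string[],
  mode: SequenceMode = "unordered",
): boolean {
  switch (mode) {
    case "strict":
      // Exact order and count.
      return expected.length === actual.length && expected.every((t, i) => t === actual[i]);
    case "unordered": {
      // Same tools, any order, same count: compare sorted copies.
      if (expected.length !== actual.length) return false;
      const a = [...expected].sort();
      const b = [...actual].sort();
      return a.every((t, i) => t === b[i]);
    }
    case "subset":
      // Every expected tool appears in actual (actual may have extras).
      return expected.every((t) => actual.includes(t));
    case "superset":
      // Every actual tool appears in expected (actual did fewer steps).
      return actual.every((t) => expected.includes(t));
  }
}
```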

Checks that a tool call’s arguments match expected values. Matches the first tool call with the given name.

```ts
toolArgsMatch("search", { query: "weather" })              // subset (default)
toolArgsMatch("search", { query: "weather" }, "exact")     // deep equality
toolArgsMatch("search", { query: "weather" }, "contains")  // strings use .includes()
```

| Mode | Description |
| --- | --- |
| `subset` | Every expected key exists in actual with a matching value (default) |
| `exact` | Deep equality of the entire args object |
| `contains` | Like subset, but string values use `.includes()` for natural-language args |
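A minimal sketch of the three matching modes, assuming args are plain JSON objects (names here are illustrative, and JSON-stringify equality is a simplification of real deep equality):

```ts
type ArgsMode = "subset" | "exact" | "contains";

// Order-sensitive JSON comparison; adequate for this sketch only.
function deepEqual(a: unknown, b: unknown): boolean {
  return JSON.stringify(a) === JSON.stringify(b);
}

function argsMatch(
  expected: Record<string, unknown>,
  actual: Record<string, unknown>,
  mode: ArgsMode = "subset",
): boolean {
  if (mode === "exact") return deepEqual(expected, actual); // whole object must match
  return Object.entries(expected).every(([key, want]) => {
    const got = actual[key];
    if (mode === "contains" && typeof want === "string" && typeof got === "string") {
      return got.includes(want); // natural-language args: substring match
    }
    return deepEqual(want, got); // subset: each expected key must match
  });
}
```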

Checks that response latency (output.latencyMs) is within the allowed threshold. Always evaluates since latencyMs is required.

```ts
latency(5000)
```

Checks that response cost (output.cost) is within budget. Skips gracefully (pass) if output.cost is not reported.

```ts
cost(0.05)
```

Checks that total token usage (input + output) is within the allowed limit. Skips gracefully (pass) if output.tokenUsage is not reported.

```ts
tokenCount(4096)
```

Checks that output text does NOT contain any prohibited keywords. Case-insensitive matching.

```ts
safetyKeywords(["guaranteed returns", "buy now", "act fast"])
```

Checks that numbers in output text are grounded in tool call results. This catches fabricated statistics, one of the most dangerous agent failure modes.

```ts
noHallucinatedNumbers()
noHallucinatedNumbers({ tolerance: 0.01, skipSmallIntegers: false })
```

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| `tolerance` | `number` | `0.005` | Relative tolerance for matching (0.005 = 0.5%) |
| `skipSmallIntegers` | `boolean` | `true` | Skip integers with absolute value < 10 |

Always skips year-like numbers (1900–2100). Score is proportional: (checked - hallucinated) / checked. Returns metadata: { hallucinated: number[], totalChecked: number }.
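The proportional scoring rule can be sketched as follows, assuming a number counts as grounded when some tool-result number matches it within the relative tolerance. `hallucinationScore` is a hypothetical name; the library's extraction of numbers from text is not shown.

```ts
// Sketch: score = (checked - hallucinated) / checked.
function hallucinationScore(
  outputNumbers: number[],
  toolNumbers: number[],
  tolerance = 0.005,
): { score: number; hallucinated: number[]; totalChecked: number } {
  const grounded = (n: number) =>
    toolNumbers.some(
      (t) => Math.abs(n - t) <= tolerance * Math.max(Math.abs(n), Math.abs(t)),
    );
  const hallucinated = outputNumbers.filter((n) => !grounded(n));
  const totalChecked = outputNumbers.length;
  return {
    score: totalChecked === 0 ? 1 : (totalChecked - hallucinated.length) / totalChecked,
    hallucinated,
    totalChecked,
  };
}
```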

These require a judge configuration in your eval config. See LLM Judge Guide for setup.

Scores agent output against natural language criteria using an LLM judge. Judge scores 1–4 (poor to excellent), normalized to 0.25–1.0.

```ts
// String shorthand
llmRubric("Response is helpful, accurate, and well-formatted")

// With options
llmRubric({
  criteria: "Answer is concise",
  passThreshold: 0.75,
  examples: [
    { output: "Yes.", score: 4, reasoning: "Direct and concise" },
    { output: "Well, I think maybe...", score: 1, reasoning: "Rambling" },
  ],
})
```

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| `criteria` | `string` | (required) | Natural language evaluation criteria |
| `passThreshold` | `number` | `0.75` | Score threshold for passing (0-1) |
| `examples` | `array` | (none) | Few-shot calibration examples (`{ output, score: 1\|2\|3\|4, reasoning }`) |

Default pass threshold of 0.75 maps to a judge score of 3 (“Good”) or higher.
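The normalization implied here (1-4 mapped to 0.25-1.0, with 0.75 corresponding to a 3) is consistent with simply dividing the judge score by 4. A sketch under that assumption, with a hypothetical helper name:

```ts
// Sketch: normalize a 1-4 judge score and apply the pass threshold.
function rubricResult(judgeScore: 1 | 2 | 3 | 4, passThreshold = 0.75) {
  const score = judgeScore / 4; // 1 -> 0.25, 2 -> 0.5, 3 -> 0.75, 4 -> 1.0
  return { score, pass: score >= passThreshold };
}
```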

Specialized LLM judge that evaluates factual consistency against a reference. Requires `expected.text` on the case and fails immediately without it.

```ts
factuality()
factuality({ passThreshold: 0.9 })
```

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| `passThreshold` | `number` | `0.75` | Score threshold for passing (0-1) |

Evaluates accuracy, completeness, and absence of fabrication using a built-in rubric with 3 calibration examples.

Classifies agent output into one of N categories using an LLM judge. Requires at least 2 categories.

```ts
llmClassify({
  categories: {
    helpful: "Answers the question directly",
    unhelpful: "Does not answer the question",
  },
})

llmClassify({
  categories: {
    positive: "Positive sentiment",
    negative: "Negative sentiment",
    neutral: "Neutral sentiment",
  },
  criteria: "Classify based on overall tone",
})
```

| Option | Type | Required | Description |
| --- | --- | --- | --- |
| `categories` | `Record<string, string>` | yes | Category name → description (min 2) |
| `criteria` | `string` | no | Additional classification instructions |

Pass condition: If expected.metadata.classification is set on the case, the judge’s classification must match. If not set, runs in classification-only mode (always passes).

Returns metadata: { classification, reasoning, confidence, judgeCost }.

Combine graders with boolean logic. These operators do not short-circuit; they collect all results for complete reporting.

Conjunction: all graders must pass. Score is the minimum of all scores.

```ts
all([contains("Paris"), toolCalled("search"), latency(5000)])
```

Empty list returns pass (vacuous truth, score 1.0).

Disjunction: at least one grader must pass. Score is the maximum of all scores.

```ts
any([contains("capital of France"), contains("Paris")])
```

Empty list returns fail (score 0.0).

Negation: inverts a grader’s result. Score becomes 1 - original. graderName becomes not(<inner>).

```ts
not(contains("I don't know"))
```
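The three composition rules (minimum, maximum, inversion, plus the empty-list conventions) can be sketched directly. These are illustrative helpers, not the library's internals:

```ts
interface GradeResult { pass: boolean; score: number; }

// all: every grader must pass; score is the minimum.
function allOf(results: GradeResult[]): GradeResult {
  if (results.length === 0) return { pass: true, score: 1 }; // vacuous truth
  return {
    pass: results.every((r) => r.pass),
    score: Math.min(...results.map((r) => r.score)),
  };
}

// any: at least one grader must pass; score is the maximum.
function anyOf(results: GradeResult[]): GradeResult {
  if (results.length === 0) return { pass: false, score: 0 }; // empty list fails
  return {
    pass: results.some((r) => r.pass),
    score: Math.max(...results.map((r) => r.score)),
  };
}

// not: invert pass/fail; score becomes 1 - original.
function notOf(result: GradeResult): GradeResult {
  return { pass: !result.pass, score: 1 - result.score };
}
```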

Apply multiple graders per case with weights and a required flag. The final score is a weighted average. A case passes if the weighted score meets the pass threshold (default: 0.5).

```ts
defaultGraders: [
  { grader: contains("hello"), weight: 0.3 },
  { grader: latency(5000), weight: 0.2, required: true },
  { grader: llmRubric("Helpful response"), weight: 0.5 },
]
```

Required graders that fail cause an immediate score of 0 regardless of other graders.
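The aggregation rule described here can be sketched as a weighted average with a required-grader override. This is an assumption-level illustration (names hypothetical), including the convention of normalizing by the total weight:

```ts
interface WeightedResult {
  score: number;
  pass: boolean;
  weight: number;
  required?: boolean;
}

// Sketch: required failures short out to 0; otherwise weighted average vs threshold.
function caseScore(
  results: WeightedResult[],
  passThreshold = 0.5,
): { score: number; pass: boolean } {
  if (results.some((r) => r.required && !r.pass)) {
    return { score: 0, pass: false }; // a failing required grader zeroes the case
  }
  const totalWeight = results.reduce((sum, r) => sum + r.weight, 0);
  const score =
    totalWeight === 0
      ? 0
      : results.reduce((sum, r) => sum + r.score * r.weight, 0) / totalWeight;
  return { score, pass: score >= passThreshold };
}
```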