Graders Guide
Choosing graders
Start with deterministic graders: they’re fast, free, and reproducible. Reach for LLM graders only when deterministic checks can’t capture the quality dimension you need.
| If you need to check… | Use |
|---|---|
| Output contains specific text | contains / notContains |
| Output matches exactly | exactMatch |
| Output matches a pattern | regex |
| Output is valid structured data | jsonSchema |
| A tool was (or wasn’t) called | toolCalled / toolNotCalled |
| Tools were called in order | toolSequence |
| Tool arguments are correct | toolArgsMatch |
| Response time is acceptable | latency |
| Cost is within budget | cost / tokenCount |
| No prohibited content | safetyKeywords |
| Numbers are grounded in data | noHallucinatedNumbers |
| Overall quality (subjective) | llmRubric |
| Factual consistency | factuality |
| Output classification | llmClassify |
See the Graders API Reference for detailed parameters and options.
Configuring graders in suites
Graders are configured in a suite’s defaultGraders array. Each entry wraps a grader function with optional scoring metadata:

    import { contains, latency, llmRubric } from "agent-eval-kit";

    defaultGraders: [
      { grader: contains("Paris") },                              // weight 1.0 (default)
      { grader: latency(5000), weight: 0.5 },                     // lower weight
      { grader: llmRubric("Helpful response"), required: true },  // must pass
    ]

Weight
Controls the grader’s contribution to the weighted average score. Default: 1.0. A grader with weight: 2.0 has twice the influence of one with weight: 1.0.
Required
If true, the grader must pass for the case to pass. A failed required grader immediately sets the case score to 0, regardless of other graders.
Threshold
Per-grader pass threshold. The case-level threshold is inferred as the minimum across all grader thresholds (or 0.5 if none are set).
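The inference rule above can be sketched in a few lines. This is a minimal illustration, not the library's implementation: the GraderEntry shape and the inferCaseThreshold name are hypothetical, and only the "minimum of configured thresholds, else 0.5" rule comes from the text.

```typescript
// Hypothetical entry shape: only the optional per-grader threshold matters here.
interface GraderEntry {
  threshold?: number;
}

// Default stated in the guide when no grader sets a threshold.
const DEFAULT_THRESHOLD = 0.5;

// Returns the minimum configured threshold, or the default if none are set.
function inferCaseThreshold(entries: GraderEntry[]): number {
  const configured = entries
    .map((entry) => entry.threshold)
    .filter((t): t is number => t !== undefined);
  return configured.length > 0 ? Math.min(...configured) : DEFAULT_THRESHOLD;
}
```

One consequence of taking the minimum: under this rule, setting a low threshold on a single grader lowers the bar for the whole case, so it is worth checking every configured threshold when a case passes unexpectedly.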
Composition
Compose graders with boolean logic using all, any, and not:

    import { all, any, not, contains, toolCalled, latency } from "agent-eval-kit";

    // All must pass
    all([contains("Paris"), toolCalled("search"), latency(5000)])

    // At least one must pass
    any([contains("capital of France"), contains("Paris")])

    // Must not match
    not(contains("I don't know"))

Composition operators do not short-circuit: all inner graders run, and all results are collected for complete reporting.
| Operator | Score | Empty list |
|---|---|---|
| all | minimum of all scores | pass (vacuous truth) |
| any | maximum of all scores | fail |
| not | 1 - original | n/a |
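The score rules in the table can be modelled on already-computed results. The sketch below is standalone and does not use agent-eval-kit: GradeResult, allOf, anyOf, and notOf are hypothetical names. The table does not state the score of an empty list or how not derives pass, so the values used here (1 for an empty all, 0 for an empty any, inverted pass for not) are assumptions.

```typescript
// Hypothetical result shape for illustration only.
interface GradeResult {
  pass: boolean;
  score: number;
}

// all: every inner result must pass; score is the minimum inner score.
function allOf(results: GradeResult[]): GradeResult {
  if (results.length === 0) return { pass: true, score: 1 }; // vacuous truth; score 1 is assumed
  return {
    pass: results.every((r) => r.pass),
    score: Math.min(...results.map((r) => r.score)),
  };
}

// any: at least one inner result must pass; score is the maximum inner score.
function anyOf(results: GradeResult[]): GradeResult {
  if (results.length === 0) return { pass: false, score: 0 }; // empty list fails; score 0 is assumed
  return {
    pass: results.some((r) => r.pass),
    score: Math.max(...results.map((r) => r.score)),
  };
}

// not: score is 1 - original; inverting pass is an assumption.
function notOf(result: GradeResult): GradeResult {
  return { pass: !result.pass, score: 1 - result.score };
}
```

Because every inner result is already present when these functions run, the sketch mirrors the no-short-circuit behavior: nothing is skipped, and each inner result remains available for reporting.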
Suite-level graders
All grading configuration is at the suite level via defaultGraders. Every case in the suite is evaluated with the same set of graders:

    suites: [
      {
        name: "smoke",
        target: myTarget,
        cases: [
          { id: "greeting", input: { prompt: "Say hello" } },
          { id: "special", input: { prompt: "test" } },
        ],
        // Applied to every case in this suite
        defaultGraders: [
          { grader: contains("hello") },
          { grader: latency(5000) },
        ],
      },
    ]

To evaluate different cases with different graders, split them into separate suites.
Scoring algorithm
- Required graders are checked first. Any required grader that fails: pass = false, score = 0.
- Weighted average: score = sum(grade.score × weight) / sum(weight).
- Pass: score >= threshold (default 0.5, or minimum of configured grader thresholds).
Edge case: if no graders are configured, the case passes with score 1.0.
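The steps above, including the no-graders edge case, can be sketched as a single function. This is an illustration under stated assumptions rather than the library's code: the Graded shape and the scoreCase name are hypothetical, the threshold is passed in rather than inferred, and weights are assumed to be positive.

```typescript
// Hypothetical shape: one grader's outcome plus its scoring metadata.
interface Graded {
  score: number;
  pass: boolean;
  weight?: number;    // defaults to 1.0
  required?: boolean; // defaults to false
}

function scoreCase(grades: Graded[], threshold = 0.5): { pass: boolean; score: number } {
  // Edge case: no graders configured means the case passes with score 1.0.
  if (grades.length === 0) return { pass: true, score: 1 };

  // Step 1: any failed required grader fails the case with score 0.
  if (grades.some((g) => g.required === true && !g.pass)) {
    return { pass: false, score: 0 };
  }

  // Step 2: weighted average (assumes positive weights, so the divisor is non-zero).
  const totalWeight = grades.reduce((sum, g) => sum + (g.weight ?? 1), 0);
  const weightedSum = grades.reduce((sum, g) => sum + g.score * (g.weight ?? 1), 0);
  const score = weightedSum / totalWeight;

  // Step 3: compare against the case-level threshold.
  return { pass: score >= threshold, score };
}
```

As a worked example, a grader scoring 1 at weight 1.0 combined with a grader scoring 0 at weight 0.5 gives 1 / 1.5, roughly 0.667, which clears the default threshold of 0.5. If the failing grader were also marked required, the case would score 0 regardless of the average.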