Skip to content

Graders Guide

Start with deterministic graders — they’re fast, free, and reproducible. Reach for LLM graders only when deterministic checks can’t capture the quality dimension you need.

If you need to check…Use
Output contains specific textcontains / notContains
Output matches exactlyexactMatch
Output matches a patternregex
Output is valid structured datajsonSchema
A tool was (or wasn’t) calledtoolCalled / toolNotCalled
Tools were called in ordertoolSequence
Tool arguments are correcttoolArgsMatch
Response time is acceptablelatency
Cost is within budgetcost / tokenCount
No prohibited contentsafetyKeywords
Numbers are grounded in datanoHallucinatedNumbers
Overall quality (subjective)llmRubric
Factual consistencyfactuality
Output classificationllmClassify

See the Graders API Reference for detailed parameters and options.

Graders are configured in a suite’s defaultGraders array. Each entry wraps a grader function with optional scoring metadata:

import { contains, latency, llmRubric } from "agent-eval-kit";
defaultGraders: [
{ grader: contains("Paris") }, // weight 1.0 (default)
{ grader: latency(5000), weight: 0.5 }, // lower weight
{ grader: llmRubric("Helpful response"), required: true }, // must pass
]

Controls the grader’s contribution to the weighted average score. Default: 1.0. A grader with weight: 2.0 has twice the influence of one with weight: 1.0.

If true, the grader must pass for the case to pass. A failed required grader immediately sets the case score to 0, regardless of other graders.

Per-grader pass threshold. The case-level threshold is inferred as the minimum across all grader thresholds (or 0.5 if none are set).

Compose graders with boolean logic using all, any, and not:

import { all, any, not, contains, toolCalled, latency } from "agent-eval-kit";
// All must pass
all([contains("Paris"), toolCalled("search"), latency(5000)])
// At least one must pass
any([contains("capital of France"), contains("Paris")])
// Must not match
not(contains("I don't know"))

Composition operators do not short-circuit — all inner graders run, and all results are collected for complete reporting.

OperatorScoreEmpty list
allminimum of all scorespass (vacuous truth)
anymaximum of all scoresfail
not1 - original

All grading configuration is at the suite level via defaultGraders. Every case in the suite is evaluated with the same set of graders:

suites: [
{
name: "smoke",
target: myTarget,
cases: [
{ id: "greeting", input: { prompt: "Say hello" } },
{ id: "special", input: { prompt: "test" } },
],
defaultGraders: [
{ grader: contains("hello") },
{ grader: latency(5000) },
],
},
]

To evaluate different cases with different graders, split them into separate suites.

  1. Required graders are checked first. Any required grader that fails: pass = false, score = 0.
  2. Weighted average: score = sum(grade.score × weight) / sum(weight).
  3. Pass: score >= threshold (default 0.5, or minimum of configured grader thresholds).

Edge case: if no graders are configured, the case passes with score 1.0.