# LLM Judge

## Overview

When deterministic graders can’t capture the quality you need, LLM graders use a language model to evaluate agent output. agent-eval-kit provides three LLM graders: `llmRubric`, `factuality`, and `llmClassify`.
## Setting up a judge

LLM graders require a `judge` configuration at the config root. The judge is provider-agnostic: you implement a `call` function that wraps your LLM SDK:
```ts
import { defineConfig } from "agent-eval-kit";
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();

export default defineConfig({
  judge: {
    call: async (messages, options) => {
      const response = await client.messages.create({
        model: options?.model ?? "claude-sonnet-4-20250514",
        max_tokens: options?.maxTokens ?? 1024,
        messages: messages
          .filter((m) => m.role !== "system")
          .map((m) => ({ role: m.role as "user" | "assistant", content: m.content })),
        system: messages.find((m) => m.role === "system")?.content,
      });
      return {
        text: response.content[0].type === "text" ? response.content[0].text : "",
        tokenUsage: {
          input: response.usage.input_tokens,
          output: response.usage.output_tokens,
        },
      };
    },
  },
  suites: [/* ... */],
});
```
### OpenAI example

```ts
import OpenAI from "openai";

const client = new OpenAI();

judge: {
  call: async (messages, options) => {
    const response = await client.chat.completions.create({
      model: options?.model ?? "gpt-4o",
      max_tokens: options?.maxTokens ?? 1024,
      temperature: options?.temperature ?? 0,
      messages: messages.map((m) => ({ role: m.role, content: m.content })),
    });
    return {
      text: response.choices[0].message.content ?? "",
      tokenUsage: {
        input: response.usage?.prompt_tokens ?? 0,
        output: response.usage?.completion_tokens ?? 0,
      },
    };
  },
}
```
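Inferred from the provider examples above, the judge contract boils down to a single async function. The type names below (`JudgeMessage`, `JudgeOptions`, `JudgeCall`) are illustrative assumptions, not the library's actual exports, but the shape matches what both examples consume and return:

```typescript
// Rough shape of the judge contract, inferred from the provider
// examples above. Type names are illustrative, not library exports.
interface JudgeMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

interface JudgeOptions {
  model?: string;
  maxTokens?: number;
  temperature?: number;
}

interface JudgeResult {
  text: string;
  tokenUsage?: { input: number; output: number };
}

type JudgeCall = (
  messages: JudgeMessage[],
  options?: JudgeOptions
) => Promise<JudgeResult>;

// A stub judge that echoes the last user message: useful for
// wiring up a config without making real API calls.
const stubJudge: JudgeCall = async (messages) => {
  const users = messages.filter((m) => m.role === "user");
  return {
    text: users[users.length - 1]?.content ?? "",
    tokenUsage: { input: 0, output: 0 },
  };
};
```

A stub like this also makes grader behavior testable in CI without incurring judge API costs.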
## LLM graders

### llmRubric

Scores output against natural-language criteria on a 1–4 scale:
| Score | Meaning | Normalized |
|---|---|---|
| 1 | Poor | 0.25 |
| 2 | Fair | 0.50 |
| 3 | Good | 0.75 |
| 4 | Excellent | 1.00 |
Default pass threshold is 0.75 (score 3+).
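The table above is a straight linear normalization (score divided by 4). A minimal sketch of how a raw rubric score becomes a pass/fail decision; the helper names `normalize` and `passes` are illustrative, not part of the agent-eval-kit API:

```typescript
// Normalize a 1-4 rubric score to 0-1 and compare against a
// pass threshold. Helper names are illustrative only.
function normalize(score: number): number {
  return score / 4;
}

function passes(score: number, passThreshold = 0.75): boolean {
  return normalize(score) >= passThreshold;
}

passes(3);      // true: 0.75 meets the default threshold
passes(2);      // false: 0.50 is below 0.75
passes(2, 0.5); // true: lowered threshold accepts score 2+
```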
```ts
import { llmRubric } from "agent-eval-kit";

// Simple string shorthand
llmRubric("Response is helpful, accurate, and well-formatted")

// With options
llmRubric({
  criteria: "Answer is concise and actionable",
  passThreshold: 0.5, // Accept score 2+
  examples: [
    { output: "Yes.", score: 4, reasoning: "Direct and concise" },
    { output: "Well, I think maybe...", score: 1, reasoning: "Rambling" },
  ],
})
```

Few-shot examples calibrate the judge and are highly recommended for consistent scoring.
### factuality

Evaluates factual consistency against a reference text. Requires `expected.text` on the case.
```ts
import { factuality } from "agent-eval-kit";

factuality()
factuality({ passThreshold: 0.9 })
```

Checks three dimensions: accuracy, completeness, and absence of fabrication. Uses a built-in rubric with 3 calibration examples.
### llmClassify

Classifies output into categories. Requires at least 2 categories.
```ts
import { llmClassify } from "agent-eval-kit";

llmClassify({
  categories: {
    helpful: "Directly answers the question",
    partial: "Partially addresses the question",
    unhelpful: "Does not address the question",
  },
  criteria: "Focus on whether the core question is answered",
})
```

Pass condition: if `expected.metadata.classification` is set on the case, the judge’s classification must match it. If no expected classification is set, the grader always passes (classification-only mode).
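The pass condition described above reduces to a small comparison. A sketch with illustrative names (not the library's internals), assuming the judge's category label and the case's optional expected label are both strings:

```typescript
// Sketch of the llmClassify pass condition: with no expected
// classification the grader runs in classification-only mode and
// always passes; otherwise the judged label must match exactly.
// Function and parameter names are illustrative.
function classificationPasses(
  judged: string,
  expected?: string
): boolean {
  if (expected === undefined) return true; // classification-only mode
  return judged === expected;
}
```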
## Judge caching

LLM judge calls are expensive. agent-eval-kit provides two caching strategies to avoid redundant calls.
### In-memory cache

Caches results for the duration of the process:
```ts
import { createCachingJudge } from "agent-eval-kit";

judge: {
  call: createCachingJudge(myJudgeFn, { maxEntries: 1000 }),
}
```
### Disk cache

Persists the cache across runs in `.eval-cache/judge/`:
```ts
import { createDiskCachingJudge } from "agent-eval-kit";

judge: {
  call: createDiskCachingJudge(myJudgeFn, {
    cacheDir: ".eval-cache/judge", // default
    ttlDays: 7, // default
    maxEntries: 10_000, // default
  }),
}
```

Both caches only cache calls with `temperature === 0` (deterministic responses). Cache keys are SHA-256 hashes of the messages, model, and max tokens.
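A cache key along the lines described (SHA-256 over messages, model, and max tokens) could be computed as below. The serialization shown is an assumption; agent-eval-kit's exact key format may differ:

```typescript
import { createHash } from "node:crypto";

interface Message {
  role: string;
  content: string;
}

// Illustrative cache key: SHA-256 hex digest over the messages,
// model, and max tokens. The exact serialization agent-eval-kit
// uses internally may differ.
function judgeCacheKey(
  messages: Message[],
  model: string,
  maxTokens: number
): string {
  const payload = JSON.stringify({ messages, model, maxTokens });
  return createHash("sha256").update(payload).digest("hex");
}
```

Because the key covers everything that determines a temperature-0 response, identical calls hit the cache and any change to the prompt, model, or token budget produces a fresh entry.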
### Cache management

```sh
# View cache stats
agent-eval-kit cache stats

# Clear judge cache
agent-eval-kit cache clear --judge
```

Programmatically:
```ts
import { clearJudgeCache, judgeCacheStats } from "agent-eval-kit";

await clearJudgeCache(); // default dir
await clearJudgeCache(".my-cache");
const stats = await judgeCacheStats();
```
## Cost tracking

LLM grader results include cost metadata in `grade.metadata.judgeCost`. The console reporter aggregates and displays the total judge cost per run. Cost data flows from the `JudgeResponse.cost` and `JudgeResponse.tokenUsage` fields you return from your judge function.
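The per-run aggregation the console reporter performs amounts to summing `judgeCost` across grades. A sketch with an assumed minimal `Grade` shape; the real type carries more fields:

```typescript
// Sum judge cost across grades, treating graders that attached no
// cost metadata as zero. The Grade shape here is a minimal
// assumption for illustration.
interface Grade {
  metadata?: { judgeCost?: number };
}

function totalJudgeCost(grades: Grade[]): number {
  return grades.reduce((sum, g) => sum + (g.metadata?.judgeCost ?? 0), 0);
}
```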
## Judge-only mode

When tuning judge criteria, you don’t need to re-run your target. Judge-only mode re-grades a previous run:
```sh
# Run initially
agent-eval-kit run --suite=smoke

# List runs to find the ID
agent-eval-kit list

# Re-grade with updated graders
agent-eval-kit run --mode=judge-only --run-id=<run-id> --suite=smoke
```

This reuses the target outputs from the previous run and applies your current graders to them.
## Bias mitigation

The built-in judge prompt includes rules to mitigate common LLM judge biases:
- No length bias: longer responses are not automatically scored higher
- Chain-of-thought enforced: reasoning must precede scoring
- Structured JSON response format: prevents ambiguous parsing
The judge response parser has a 3-layer fallback: strict JSON, markdown code block extraction, and text pattern matching (`Score: N`).
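The three fallback layers could look roughly like the sketch below. This is an illustrative reimplementation, not the actual agent-eval-kit parser, and it assumes the judge is asked to return a numeric `score` field:

```typescript
// Illustrative 3-layer fallback for extracting a score from a judge
// response: (1) strict JSON, (2) JSON inside a fenced markdown code
// block, (3) a plain-text "Score: N" pattern. Not the actual parser.
function parseJudgeScore(text: string): number | undefined {
  // Layer 1: the whole response is strict JSON with a "score" field.
  try {
    const parsed = JSON.parse(text);
    if (typeof parsed.score === "number") return parsed.score;
  } catch {
    // fall through to layer 2
  }

  // Layer 2: JSON wrapped in a fenced markdown code block.
  const block = text.match(/```(?:json)?\s*([\s\S]*?)```/);
  if (block) {
    try {
      const parsed = JSON.parse(block[1]);
      if (typeof parsed.score === "number") return parsed.score;
    } catch {
      // fall through to layer 3
    }
  }

  // Layer 3: plain-text "Score: N" pattern.
  const match = text.match(/Score:\s*(\d+)/i);
  return match ? Number(match[1]) : undefined;
}
```

Layered parsing like this keeps grading resilient when a judge model wraps its JSON in prose or a code fence instead of returning it bare.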