MCP Server
Overview
agent-eval-kit includes an MCP (Model Context Protocol) server that lets AI coding assistants discover suites, run evals, inspect results, compare runs, and understand the full config, all without leaving the editor.
Add the server to your editor’s MCP configuration. The command is the same everywhere; only the config file location and format differ.
For Claude Code, run this from your project root:

    claude mcp add --scope project agent-eval-kit -- npx -y agent-eval-kit mcp

This writes a .mcp.json file at the project root that you can commit to share with your team. To add it only for yourself, use --scope local (the default) or --scope user (available in all your projects).
You can also create .mcp.json manually:

    {
      "mcpServers": {
        "agent-eval-kit": {
          "type": "stdio",
          "command": "npx",
          "args": ["-y", "agent-eval-kit", "mcp"]
        }
      }
    }

Add to your Claude Desktop config (~/Library/Application Support/Claude/claude_desktop_config.json on macOS):

    {
      "mcpServers": {
        "agent-eval-kit": {
          "command": "npx",
          "args": ["-y", "agent-eval-kit", "mcp"]
        }
      }
    }

Restart Claude Desktop after editing the config.
Add to .cursor/mcp.json in the project root:

    {
      "mcpServers": {
        "agent-eval-kit": {
          "command": "npx",
          "args": ["-y", "agent-eval-kit", "mcp"]
        }
      }
    }

Add to .vscode/mcp.json in the project root:

    {
      "servers": {
        "agent-eval-kit": {
          "type": "stdio",
          "command": "npx",
          "args": ["-y", "agent-eval-kit", "mcp"]
        }
      }
    }

Add to ~/.codeium/windsurf/mcp_config.json:

    {
      "mcpServers": {
        "agent-eval-kit": {
          "command": "npx",
          "args": ["-y", "agent-eval-kit", "mcp"]
        }
      }
    }

The MCP server exposes 8 tools and 3 resources. The server uses stdio transport: JSON-RPC messages are read from stdin and written to stdout, and all logging goes to stderr.
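To make the stdio framing concrete, here is a minimal sketch of how a client might encode a tools/call request for this server. It assumes the standard MCP conventions (JSON-RPC 2.0, one JSON message per line) and omits the initialize handshake a real client performs first. The helper names toolCallRequest, encodeMessage, and decodeMessages are illustrative, not part of agent-eval-kit.

```typescript
// JSON-RPC 2.0 request shape used by MCP.
interface JsonRpcRequest {
  jsonrpc: "2.0";
  id: number;
  method: string;
  params?: Record<string, unknown>;
}

// Build a tools/call request for one of the server's tools.
function toolCallRequest(
  id: number,
  name: string,
  args: Record<string, unknown> = {},
): JsonRpcRequest {
  return { jsonrpc: "2.0", id, method: "tools/call", params: { name, arguments: args } };
}

// The stdio transport frames each message as a single line of JSON.
function encodeMessage(message: JsonRpcRequest): string {
  return JSON.stringify(message) + "\n";
}

// Split a chunk of newline-delimited output into parsed messages, skipping blank lines.
function decodeMessages(chunk: string): unknown[] {
  return chunk
    .split("\n")
    .filter((line) => line.trim() !== "")
    .map((line) => JSON.parse(line));
}

const wire = encodeMessage(toolCallRequest(1, "list-suites", { verbose: true }));

// Diagnostics go to stderr (console.error in Node), never stdout,
// so they cannot corrupt the JSON-RPC stream.
console.error("sending:", wire.trim());
```

Keeping stdout reserved for protocol messages is why a stray console.log in server-side code can break a stdio MCP session.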
Recommended workflow
1. list-suites: Discover available suite names
2. list-graders: Understand which graders are available
3. describe-config: Inspect the full config structure
4. validate-config: Catch config errors before running
5. run-suite: Execute an eval suite
6. list-runs: Find run IDs
7. get-run-details: Inspect a specific run
8. compare-runs: Diff two runs
Discovery tools
list-suites
List all eval suites with case counts, categories, and gates.
| Parameter | Type | Default | Description |
|---|---|---|---|
| verbose | boolean | false | Include full case IDs in output |
Returns JSON with { suites, totalSuites }.
describe-config
Return the fully loaded eval config as structured JSON. Functions (target, judge, graders) are shown as metadata, not serialized.
No parameters.
list-graders
Enumerate all available graders with parameters, defaults, and usage examples.
| Parameter | Type | Default | Description |
|---|---|---|---|
| tier | string | (all) | Filter: deterministic, llm, or composition |
| category | string | (all) | Filter: text, tool-call, metric, safety, llm-judge, composition |
| includePlugins | boolean | true | Include plugin-contributed graders |
validate-config
Validate that eval.config.ts loads without errors and check for common issues (empty suites, missing default graders).
| Parameter | Type | Default | Description |
|---|---|---|---|
| configPath | string | (none) | Custom config file path |
Returns { valid, suiteCount, totalCases, warnings } or { valid: false, error }.
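The two result shapes can be modeled as a discriminated union on valid, which lets a caller handle the error case before touching the counts. This is a sketch only: the field types (numbers, a string array, a string error) are assumptions, since the documentation names the fields but not their types, and summarizeValidation is a hypothetical helper.

```typescript
// Assumed field types; only the field names come from the documentation.
type ValidateConfigResult =
  | { valid: true; suiteCount: number; totalCases: number; warnings: string[] }
  | { valid: false; error: string };

// Turn a validate-config result into a one-line summary.
function summarizeValidation(result: ValidateConfigResult): string {
  if (!result.valid) {
    // Narrowed to the failure shape: only `error` is available here.
    return `Config invalid: ${result.error}`;
  }
  const warningNote =
    result.warnings.length > 0 ? ` (${result.warnings.length} warning(s))` : "";
  return `Config OK: ${result.suiteCount} suite(s), ${result.totalCases} case(s)${warningNote}`;
}
```

An agent following the recommended workflow could use a check like this to decide whether to proceed to run-suite.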
Execution tools
run-suite
Execute an eval suite by name and return formatted results.
| Parameter | Type | Default | Description |
|---|---|---|---|
| suite | string | (required) | Suite name (use list-suites to discover) |
| mode | live or replay | replay | Execution mode |
| record | boolean | false | Record fixtures in live mode |
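The parameter table can be expressed as a typed argument object with the documented defaults applied. The sketch below is illustrative: resolveRunSuiteArgs is a hypothetical client-side helper, and rejecting record outside live mode is an assumption drawn from the description "Record fixtures in live mode"; how the server itself treats that combination is not documented here.

```typescript
type RunMode = "live" | "replay";

interface RunSuiteArgs {
  suite: string;
  mode?: RunMode;
  record?: boolean;
}

// Apply the documented defaults (mode: replay, record: false) before sending.
function resolveRunSuiteArgs(args: RunSuiteArgs): Required<RunSuiteArgs> {
  if (args.suite.trim() === "") {
    throw new Error("suite is required");
  }
  const mode: RunMode = args.mode ?? "replay";
  const record = args.record ?? false;
  // Assumption: recording only makes sense in live mode, so reject it elsewhere.
  if (record && mode !== "live") {
    throw new Error("record only applies in live mode");
  }
  return { suite: args.suite, mode, record };
}

// "smoke" is a made-up suite name; use list-suites to find real ones.
const replayArgs = resolveRunSuiteArgs({ suite: "smoke" });
const recordArgs = resolveRunSuiteArgs({ suite: "smoke", mode: "live", record: true });
```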
list-runs
List recent eval runs with IDs, suite names, modes, pass rates, and timestamps.
| Parameter | Type | Default | Description |
|---|---|---|---|
| limit | number | 10 | Maximum runs to return |
compare-runs
Compare two runs and show regressions, improvements, and score deltas per case.
| Parameter | Type | Description |
|---|---|---|
| baseRunId | string | Base run ID (older) |
| compareRunId | string | Compare run ID (newer) |
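To illustrate what a per-case comparison involves, the sketch below classifies score changes between an older and a newer run. It is not the tool's implementation or output format: the Scores shape (case ID to numeric score), the CaseDelta type, and diffScores are all hypothetical, and cases missing from the newer run are simply skipped.

```typescript
// Hypothetical shape: case ID -> score for a single run.
type Scores = Record<string, number>;

interface CaseDelta {
  caseId: string;
  delta: number;
  status: "regression" | "improvement" | "unchanged";
}

// Compare the newer run against the base run, case by case.
function diffScores(base: Scores, compare: Scores): CaseDelta[] {
  const deltas: CaseDelta[] = [];
  for (const [caseId, baseScore] of Object.entries(base)) {
    const newScore = compare[caseId];
    if (newScore === undefined) continue; // case absent from the newer run
    const delta = newScore - baseScore;
    const status: CaseDelta["status"] =
      delta < 0 ? "regression" : delta > 0 ? "improvement" : "unchanged";
    deltas.push({ caseId, delta, status });
  }
  return deltas;
}

// Made-up scores for illustration only.
const exampleDeltas = diffScores(
  { a: 1, b: 0.5, c: 0.25 },
  { a: 0.5, b: 1, c: 0.25, d: 1 },
);
```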
get-run-details
Get detailed results for a specific run including per-case grades, scores, and failure reasons.
| Parameter | Type | Description |
|---|---|---|
| runId | string | Run ID to inspect |
Resources
Resources are read-only reference data that agents can cache.
| URI | Type | Description |
|---|---|---|
| eval://schema/config | JSON Schema | Schema for the serializable portion of eval.config.ts |
| eval://schema/case | JSON Schema | Schema for individual eval cases (JSONL lines) |
| eval://reference/graders | Markdown | Complete grader reference with parameters and examples |
Tool annotations
All tools include MCP annotations for agent safety reasoning:
| Tool | readOnly | destructive | idempotent |
|---|---|---|---|
| list-suites | yes | no | yes |
| describe-config | yes | no | yes |
| list-graders | yes | no | yes |
| validate-config | yes | no | yes |
| run-suite | no | no | no |
| list-runs | yes | no | yes |
| compare-runs | yes | no | yes |
| get-run-details | yes | no | yes |
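A client could use these annotations to decide which calls are safe to run without asking the user. The sketch below assumes the table columns correspond to the MCP annotation fields readOnlyHint, destructiveHint, and idempotentHint; the needsConfirmation policy is a hypothetical example, not something agent-eval-kit or MCP prescribes.

```typescript
// Subset of the MCP tool annotation hints.
interface ToolAnnotations {
  readOnlyHint?: boolean;
  destructiveHint?: boolean;
  idempotentHint?: boolean;
}

// Two rows from the table above, expressed as annotation objects.
const toolAnnotations: Record<string, ToolAnnotations> = {
  "list-suites": { readOnlyHint: true, destructiveHint: false, idempotentHint: true },
  "run-suite": { readOnlyHint: false, destructiveHint: false, idempotentHint: false },
};

// Hypothetical policy: anything not explicitly marked read-only needs a confirmation.
function needsConfirmation(annotations: ToolAnnotations): boolean {
  return annotations.readOnlyHint !== true;
}
```

Under this policy only run-suite would prompt, which matches its role as the one tool that executes evals and produces a new run.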
Tool handlers are pure async functions (args, cwd) => Promise<ToolResult>, making them testable without the MCP SDK.
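Because a handler is just an async function of (args, cwd), it can be exercised directly in a unit test. The sketch below shows the idea with a made-up handler; the ToolResult shape is an assumption modeled on MCP's text-content tool results, and describeCwd is hypothetical rather than one of the server's handlers.

```typescript
// Assumed result shape, modeled on MCP tool results with text content.
interface ToolResult {
  content: { type: "text"; text: string }[];
  isError?: boolean;
}

// Handler signature as described: arguments plus the working directory.
type ToolHandler<Args> = (args: Args, cwd: string) => Promise<ToolResult>;

// Hypothetical handler used only to demonstrate direct invocation.
const describeCwd: ToolHandler<{ label: string }> = async (args, cwd) => ({
  content: [{ type: "text", text: `${args.label}: ${cwd}` }],
});

// Call the handler like any other function: no server, transport, or SDK involved.
async function checkDescribeCwd(): Promise<string> {
  const result = await describeCwd({ label: "cwd" }, "/tmp/project");
  return result.content[0].text;
}
```

Passing cwd explicitly, rather than reading it from the process, is what lets a test point a handler at a fixture directory.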