MCP Server

Overview

agent-eval-kit includes an MCP (Model Context Protocol) server that lets AI coding assistants discover suites, run evals, inspect results, compare runs, and understand the full config — all without leaving the editor.

Setup

Add the server to your editor’s MCP configuration. The command is the same everywhere — only the config file location and format differ.

For Claude Code, run this from your project root:

claude mcp add --scope project agent-eval-kit -- npx -y agent-eval-kit mcp

This writes a .mcp.json file at the project root that you can commit to share with your team. To register the server only for yourself, use --scope local (the default, current project only) or --scope user (all of your projects).

You can also create .mcp.json manually:

{
  "mcpServers": {
    "agent-eval-kit": {
      "type": "stdio",
      "command": "npx",
      "args": ["-y", "agent-eval-kit", "mcp"]
    }
  }
}

Add to your Claude Desktop config (~/Library/Application Support/Claude/claude_desktop_config.json on macOS):

{
  "mcpServers": {
    "agent-eval-kit": {
      "command": "npx",
      "args": ["-y", "agent-eval-kit", "mcp"]
    }
  }
}

Restart Claude Desktop after editing the config.

For Cursor, add to .cursor/mcp.json in the project root:

{
  "mcpServers": {
    "agent-eval-kit": {
      "command": "npx",
      "args": ["-y", "agent-eval-kit", "mcp"]
    }
  }
}

For VS Code, add to .vscode/mcp.json in the project root (note the top-level key is "servers", not "mcpServers"):

{
  "servers": {
    "agent-eval-kit": {
      "type": "stdio",
      "command": "npx",
      "args": ["-y", "agent-eval-kit", "mcp"]
    }
  }
}

For Windsurf, add to ~/.codeium/windsurf/mcp_config.json:

{
  "mcpServers": {
    "agent-eval-kit": {
      "command": "npx",
      "args": ["-y", "agent-eval-kit", "mcp"]
    }
  }
}
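The configurations above differ only in file location and in the top-level key. As a rough summary, the sketch below restates them in TypeScript. The `Editor` type and `editorConfig` helper are hypothetical names used for illustration; they are not part of agent-eval-kit.

```typescript
// Hypothetical helper, not part of agent-eval-kit: it only restates the
// file locations and JSON shapes listed in this section.
type Editor = "claude-code" | "claude-desktop" | "cursor" | "vscode" | "windsurf";

interface EditorConfig {
  path: string;                    // where the editor reads its MCP config
  config: Record<string, unknown>; // JSON to write there
}

// The launch command is identical for every editor.
const launch = { command: "npx", args: ["-y", "agent-eval-kit", "mcp"] };

function editorConfig(editor: Editor): EditorConfig {
  const plain = { mcpServers: { "agent-eval-kit": launch } };
  const typed = { type: "stdio", ...launch };
  switch (editor) {
    case "claude-code":
      return { path: ".mcp.json", config: { mcpServers: { "agent-eval-kit": typed } } };
    case "claude-desktop":
      // macOS location; other platforms are not covered in this section.
      return {
        path: "~/Library/Application Support/Claude/claude_desktop_config.json",
        config: plain,
      };
    case "cursor":
      return { path: ".cursor/mcp.json", config: plain };
    case "vscode":
      // VS Code uses a top-level "servers" key rather than "mcpServers".
      return { path: ".vscode/mcp.json", config: { servers: { "agent-eval-kit": typed } } };
    case "windsurf":
      return { path: "~/.codeium/windsurf/mcp_config.json", config: plain };
  }
}
```

Calling `JSON.stringify(editorConfig("vscode").config, null, 2)` would produce the same JSON shown for VS Code above.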

Usage

The MCP server exposes 8 tools and 3 resources. It uses the stdio transport: JSON-RPC messages travel over stdin and stdout, and all logging goes to stderr.
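To make the transport concrete, here is a minimal sketch of what a tool invocation looks like on the wire, assuming the standard MCP `tools/call` method and newline-delimited JSON framing. The `toolCallRequest` and `frame` helpers are illustrative names, not functions exported by agent-eval-kit; in practice your editor's MCP client builds these messages for you.

```typescript
// Shape of a JSON-RPC 2.0 request as used by MCP.
interface JsonRpcRequest {
  jsonrpc: "2.0";
  id: number;
  method: string;
  params?: Record<string, unknown>;
}

// Build a `tools/call` request for one of the server's tools.
function toolCallRequest(
  id: number,
  name: string,
  args: Record<string, unknown> = {},
): JsonRpcRequest {
  return { jsonrpc: "2.0", id, method: "tools/call", params: { name, arguments: args } };
}

// The stdio transport carries one JSON message per line, so a message is
// serialized without embedded newlines and terminated with "\n".
function frame(message: JsonRpcRequest): string {
  return JSON.stringify(message) + "\n";
}

// Example: the line a client would write to the server process for list-suites.
const line = frame(toolCallRequest(1, "list-suites", { verbose: true }));
```

Because protocol messages share the standard streams, anything the server logs has to go to stderr; a stray line on stdout would be read by the client as a malformed message.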

Recommended workflow

1. list-suites: Discover available suite names
2. list-graders: Understand which graders are available
3. describe-config: Inspect the full config structure
4. validate-config: Catch config errors before running
5. run-suite: Execute an eval suite
6. list-runs: Find run IDs
7. get-run-details: Inspect a specific run
8. compare-runs: Diff two runs
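The workflow above can be sketched as two ordered batches of tool calls: the first six need only a suite name, while the last two need run IDs taken from the `list-runs` output. The `ToolCall` shape and both function names are hypothetical; how the calls are actually dispatched depends on your MCP client.

```typescript
// One pending tool invocation: tool name plus its arguments.
interface ToolCall {
  name: string;
  arguments: Record<string, unknown>;
}

// First six steps: discover, validate, run, then list runs to obtain run IDs.
function discoverAndRun(suite: string): ToolCall[] {
  return [
    { name: "list-suites", arguments: {} },
    { name: "list-graders", arguments: {} },
    { name: "describe-config", arguments: {} },
    { name: "validate-config", arguments: {} },
    { name: "run-suite", arguments: { suite } },
    { name: "list-runs", arguments: {} },
  ];
}

// Last two steps: inspect the newer run and diff it against an older one.
function inspectAndCompare(baseRunId: string, compareRunId: string): ToolCall[] {
  return [
    { name: "get-run-details", arguments: { runId: compareRunId } },
    { name: "compare-runs", arguments: { baseRunId, compareRunId } },
  ];
}
```

Splitting the plan this way reflects the one real dependency in the sequence: run IDs are not known until a run exists and `list-runs` has reported it.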

Tools

Discovery tools

`list-suites`

List all eval suites with case counts, categories, and gates.

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `verbose` | boolean | `false` | Include full case IDs in output |

Returns JSON with { suites, totalSuites }.

`describe-config`

Return the fully loaded eval config as structured JSON. Functions (target, judge, graders) are shown as metadata, not serialized.

No parameters.

`list-graders`

Enumerate all available graders with parameters, defaults, and usage examples.

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `tier` | string | (all) | Filter: `deterministic`, `llm`, or `composition` |
| `category` | string | (all) | Filter: `text`, `tool-call`, `metric`, `safety`, `llm-judge`, `composition` |
| `includePlugins` | boolean | `true` | Include plugin-contributed graders |

`validate-config`

Validate that eval.config.ts loads without errors and check for common issues (empty suites, missing default graders).

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `configPath` | string | (none) | Custom config file path |

Returns { valid, suiteCount, totalCases, warnings } when the config loads, or { valid: false, error } when it does not.
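The two return shapes fit naturally into a discriminated union keyed on `valid`. The sketch below is one way a caller might handle them; the field types (`number`, `string[]`, `string`) are assumptions inferred from the field names, and `summarize` is a hypothetical helper.

```typescript
// Assumed field types; the documentation only names the fields.
type ValidateResult =
  | { valid: true; suiteCount: number; totalCases: number; warnings: string[] }
  | { valid: false; error: string };

// Turn a validate-config result into a one-line status message.
function summarize(result: ValidateResult): string {
  if (!result.valid) {
    return `Config invalid: ${result.error}`;
  }
  const counts = `${result.suiteCount} suites, ${result.totalCases} cases`;
  return result.warnings.length === 0
    ? `Config OK: ${counts}`
    : `Config OK with ${result.warnings.length} warning(s): ${counts}`;
}
```

Checking `valid` first lets the compiler narrow the type, so `error` and `warnings` are each accessible only on the branch where they exist.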

Execution tools

`run-suite`

Execute an eval suite by name and return formatted results.

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `suite` | string | (required) | Suite name (use `list-suites` to discover) |
| `mode` | `live \| replay` | `replay` | Execution mode |
| `record` | boolean | `false` | Record fixtures in live mode |
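As an illustration of how the defaults in this table combine, the sketch below types the arguments and fills in omitted values. `RunSuiteArgs`, `withDefaults`, and `recordsFixtures` are hypothetical names, and the assumption that `record` has an effect only in live mode is a literal reading of the table rather than confirmed server behavior.

```typescript
type Mode = "live" | "replay";

interface RunSuiteArgs {
  suite: string;    // required
  mode?: Mode;      // defaults to "replay"
  record?: boolean; // defaults to false
}

// Fill in the documented defaults for any omitted parameter.
function withDefaults(args: RunSuiteArgs): Required<RunSuiteArgs> {
  return {
    suite: args.suite,
    mode: args.mode ?? "replay",
    record: args.record ?? false,
  };
}

// Assumption: fixtures are recorded only when mode is "live" and record is true.
function recordsFixtures(args: RunSuiteArgs): boolean {
  const full = withDefaults(args);
  return full.mode === "live" && full.record;
}
```

Under that reading, a call that passes only `suite` replays existing fixtures, and recording fresh ones requires both `mode: "live"` and `record: true`.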

`list-runs`

List recent eval runs with IDs, suite names, modes, pass rates, and timestamps.

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `limit` | number | `10` | Maximum runs to return |

`compare-runs`

Compare two runs and show regressions, improvements, and score deltas per case.

| Parameter | Type | Description |
| --- | --- | --- |
| `baseRunId` | string | Base run ID (older) |
| `compareRunId` | string | Compare run ID (newer) |
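To illustrate what "regressions, improvements, and score deltas per case" means, here is a minimal sketch of a per-case diff. It is not the tool's actual implementation: the `Scores` shape (case ID mapped to a numeric score) and the rule that any score decrease counts as a regression are assumptions made for the example.

```typescript
// Hypothetical input shape: case ID mapped to that case's score in one run.
type Scores = Record<string, number>;

interface CaseDelta {
  caseId: string;
  delta: number; // compare score minus base score
  status: "regression" | "improvement" | "unchanged";
}

// Diff the cases present in both runs; cases missing from either run are skipped.
function diffRuns(base: Scores, compare: Scores): CaseDelta[] {
  const deltas: CaseDelta[] = [];
  for (const [caseId, baseScore] of Object.entries(base)) {
    const compareScore = compare[caseId];
    if (compareScore === undefined) continue;
    const delta = compareScore - baseScore;
    const status = delta < 0 ? "regression" : delta > 0 ? "improvement" : "unchanged";
    deltas.push({ caseId, delta, status });
  }
  return deltas;
}
```

The direction of the subtraction mirrors the parameter names: `baseRunId` is the older run, so a negative delta means the newer run scored worse.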

`get-run-details`

Get detailed results for a specific run including per-case grades, scores, and failure reasons.

| Parameter | Type | Description |
| --- | --- | --- |
| `runId` | string | Run ID to inspect |

Resources

Resources are read-only reference data that agents can cache.

| URI | Type | Description |
| --- | --- | --- |
| `eval://schema/config` | JSON Schema | Schema for the serializable portion of `eval.config.ts` |
| `eval://schema/case` | JSON Schema | Schema for individual eval cases (JSONL lines) |
| `eval://reference/graders` | Markdown | Complete grader reference with parameters and examples |
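Since these resources are read-only, a client can fetch each one once and reuse it. The sketch below shows one way to do that; `ReadResource` stands in for whatever function your MCP client uses to read a resource by URI, and `cached` is a hypothetical wrapper, not part of agent-eval-kit.

```typescript
// Stand-in for a client's "read resource by URI" function (an assumption).
type ReadResource = (uri: string) => Promise<string>;

// Wrap a reader so each URI is fetched at most once per session.
function cached(read: ReadResource): ReadResource {
  const cache = new Map<string, Promise<string>>();
  return (uri) => {
    let pending = cache.get(uri);
    if (pending === undefined) {
      pending = read(uri);
      cache.set(uri, pending);
      // Do not keep failures around: a later call should retry the read.
      pending.catch(() => {
        cache.delete(uri);
      });
    }
    return pending;
  };
}
```

Storing the promise rather than the resolved text means that two concurrent requests for `eval://schema/config` share a single underlying read.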

Tool annotations

All tools include MCP annotations for agent safety reasoning:

| Tool | readOnly | destructive | idempotent |
| --- | --- | --- | --- |
| `list-suites` | yes | no | yes |
| `describe-config` | yes | no | yes |
| `list-graders` | yes | no | yes |
| `validate-config` | yes | no | yes |
| `run-suite` | no | no | no |
| `list-runs` | yes | no | yes |
| `compare-runs` | yes | no | yes |
| `get-run-details` | yes | no | yes |
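As a sketch of how a client might act on these annotations, the example below encodes the table using MCP's hint field names (`readOnlyHint`, `destructiveHint`, `idempotentHint`) and applies a simple policy. It assumes the table's columns map directly onto those hints; the `canAutoRun` policy is hypothetical, not something the server enforces.

```typescript
// Field names follow the MCP tool annotation hints.
interface ToolAnnotations {
  readOnlyHint: boolean;
  destructiveHint: boolean;
  idempotentHint: boolean;
}

const readOnly: ToolAnnotations = {
  readOnlyHint: true,
  destructiveHint: false,
  idempotentHint: true,
};

// The annotation table above, restated as data.
const annotations: Record<string, ToolAnnotations> = {
  "list-suites": readOnly,
  "describe-config": readOnly,
  "list-graders": readOnly,
  "validate-config": readOnly,
  "run-suite": { readOnlyHint: false, destructiveHint: false, idempotentHint: false },
  "list-runs": readOnly,
  "compare-runs": readOnly,
  "get-run-details": readOnly,
};

// Hypothetical client policy: call read-only tools freely, confirm everything else.
function canAutoRun(tool: string): boolean {
  const hints = annotations[tool];
  return hints !== undefined && hints.readOnlyHint && !hints.destructiveHint;
}
```

Under this policy only `run-suite` would prompt for confirmation, which matches its status as the one tool that is neither read-only nor idempotent.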

Tool handlers are pure async functions (args, cwd) => Promise<ToolResult>, making them testable without the MCP SDK.
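A minimal sketch of what that testability looks like in practice follows. The `ToolResult` shape here is an assumption modeled on MCP's text content results, and `exampleHandler` is a made-up handler for illustration; the real handlers and their module paths are not shown in this document.

```typescript
// Assumed result shape, modeled on MCP text content; the real type may differ.
interface ToolResult {
  content: { type: "text"; text: string }[];
  isError?: boolean;
}

// Handler signature as described above: (args, cwd) => Promise<ToolResult>.
type ToolHandler<Args> = (args: Args, cwd: string) => Promise<ToolResult>;

// Made-up handler: echoes the working directory and an effective limit.
const exampleHandler: ToolHandler<{ limit?: number }> = async (args, cwd) => ({
  content: [{ type: "text", text: JSON.stringify({ cwd, limit: args.limit ?? 10 }) }],
});

// Call the handler directly: no server, transport, or MCP SDK is involved.
async function runHandlerDirectly(): Promise<string> {
  const result = await exampleHandler({}, "/tmp/project");
  return result.content.map((part) => part.text).join("");
}
```

Because `cwd` is an explicit argument rather than ambient process state, a test can point a handler at a fixture directory without changing the working directory of the test runner.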