
# MCP Server

agent-eval-kit includes an MCP (Model Context Protocol) server that lets AI coding assistants discover suites, run evals, inspect results, compare runs, and understand the full config — all without leaving the editor.

Add the server to your editor’s MCP configuration. The command is the same everywhere — only the config file location and format differ.

Run this from your project root:

```sh
claude mcp add --scope project agent-eval-kit -- npx -y agent-eval-kit mcp
```

This writes a `.mcp.json` file at the project root that you can commit to share with your team. To add it for only yourself instead, use `--scope local` (the default) or `--scope user` (all projects).

You can also create `.mcp.json` manually:

```json
{
  "mcpServers": {
    "agent-eval-kit": {
      "type": "stdio",
      "command": "npx",
      "args": ["-y", "agent-eval-kit", "mcp"]
    }
  }
}
```

The MCP server exposes 8 tools and 3 resources. It uses stdio transport: JSON-RPC requests arrive on stdin, responses go out on stdout, and all logging goes to stderr.

  1. list-suites — Discover available suite names
  2. list-graders — Understand which graders are available
  3. describe-config — Inspect the full config structure
  4. validate-config — Catch config errors before running
  5. run-suite — Execute an eval suite
  6. list-runs — Find run IDs
  7. get-run-details — Inspect a specific run
  8. compare-runs — Diff two runs
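Over stdio transport, each tool invocation is a JSON-RPC 2.0 `tools/call` request written to the server's stdin as newline-delimited JSON. A minimal sketch of building such a request (the envelope shape comes from the MCP specification; `buildToolCall` is an illustrative helper, not part of agent-eval-kit):

```typescript
// Sketch: construct the JSON-RPC 2.0 envelope an MCP client writes to
// the server's stdin to invoke a tool such as list-suites.
interface ToolCallRequest {
  jsonrpc: "2.0";
  id: number;
  method: "tools/call";
  params: { name: string; arguments: Record<string, unknown> };
}

function buildToolCall(
  id: number,
  name: string,
  args: Record<string, unknown>
): ToolCallRequest {
  return {
    jsonrpc: "2.0",
    id,
    method: "tools/call",
    params: { name, arguments: args },
  };
}

const request = buildToolCall(1, "list-suites", { verbose: true });
// One JSON object per line is what goes over the wire.
const wire = JSON.stringify(request) + "\n";
console.log(wire.trim());
```

The `verbose: true` argument here matches the `list-suites` parameter documented below; any of the 8 tools can be invoked the same way by swapping `name` and `arguments`.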

### list-suites

List all eval suites with case counts, categories, and gates.

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `verbose` | boolean | `false` | Include full case IDs in output |

Returns JSON with { suites, totalSuites }.

### describe-config

Return the fully loaded eval config as structured JSON. Functions (`target`, `judge`, `graders`) are shown as metadata, not serialized.

No parameters.

### list-graders

Enumerate all available graders with parameters, defaults, and usage examples.

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `tier` | string | (all) | Filter: `deterministic`, `llm`, or `composition` |
| `category` | string | (all) | Filter: `text`, `tool-call`, `metric`, `safety`, `llm-judge`, `composition` |
| `includePlugins` | boolean | `true` | Include plugin-contributed graders |

### validate-config

Validate that `eval.config.ts` loads without errors and check for common issues (empty suites, missing default graders).

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `configPath` | string | | Custom config file path |

Returns { valid, suiteCount, totalCases, warnings } or { valid: false, error }.

### run-suite

Execute an eval suite by name and return formatted results.

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `suite` | string | (required) | Suite name (use `list-suites` to discover) |
| `mode` | `live` \| `replay` | `replay` | Execution mode |
| `record` | boolean | `false` | Record fixtures in live mode |

### list-runs

List recent eval runs with IDs, suite names, modes, pass rates, and timestamps.

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `limit` | number | `10` | Maximum runs to return |

### compare-runs

Compare two runs and show regressions, improvements, and score deltas per case.

| Parameter | Type | Description |
| --- | --- | --- |
| `baseRunId` | string | Base run ID (older) |
| `compareRunId` | string | Compare run ID (newer) |

### get-run-details

Get detailed results for a specific run including per-case grades, scores, and failure reasons.

| Parameter | Type | Description |
| --- | --- | --- |
| `runId` | string | Run ID to inspect |

## Resources

Resources are read-only reference data that agents can cache.

| URI | Type | Description |
| --- | --- | --- |
| `eval://schema/config` | JSON Schema | Schema for the serializable portion of `eval.config.ts` |
| `eval://schema/case` | JSON Schema | Schema for individual eval cases (JSONL lines) |
| `eval://reference/graders` | Markdown | Complete grader reference with parameters and examples |
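Resources are fetched with the MCP `resources/read` method, which takes a `uri` parameter. A minimal request sketch (the envelope follows the MCP specification; the URI is one of the three listed above):

```typescript
// Sketch: the JSON-RPC request an MCP client sends to read a resource.
const readGraderReference = {
  jsonrpc: "2.0" as const,
  id: 2,
  method: "resources/read",
  params: { uri: "eval://reference/graders" },
};

console.log(JSON.stringify(readGraderReference));
```

Because these are read-only and stable within a session, an agent can fetch them once and reuse the contents instead of re-requesting.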

## Tool annotations

All tools include MCP annotations for agent safety reasoning:

| Tool | readOnly | destructive | idempotent |
| --- | --- | --- | --- |
| `list-suites` | yes | no | yes |
| `describe-config` | yes | no | yes |
| `list-graders` | yes | no | yes |
| `validate-config` | yes | no | yes |
| `run-suite` | no | no | no |
| `list-runs` | yes | no | yes |
| `compare-runs` | yes | no | yes |
| `get-run-details` | yes | no | yes |

Tool handlers are pure async functions `(args, cwd) => Promise<ToolResult>`, making them testable without the MCP SDK.
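That design means a handler can be unit-tested by calling it directly. A sketch under assumptions: `ToolResult` here follows the usual MCP text-content shape, and `fakeListSuites` is a hypothetical stand-in handler, not agent-eval-kit's real implementation:

```typescript
// Assumed MCP-style result shape: a list of text content blocks.
interface ToolResult {
  content: { type: "text"; text: string }[];
}

// The handler signature from the docs: plain args + working directory in,
// promise of a result out. No transport, no SDK.
type ToolHandler = (
  args: Record<string, unknown>,
  cwd: string
) => Promise<ToolResult>;

// Hypothetical handler standing in for a real one like list-suites.
const fakeListSuites: ToolHandler = async (_args, cwd) => ({
  content: [
    { type: "text", text: JSON.stringify({ suites: [], totalSuites: 0, cwd }) },
  ],
});

// Direct invocation in a test: just an async function call.
fakeListSuites({}, "/tmp/project").then((result) => {
  console.log(result.content[0].text);
});
```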