Statistics & Trials

Running multiple trials per case reveals flaky behavior and provides statistical confidence in pass rates:

```sh
agent-eval-kit run --suite=smoke --trials=5
```

Each case is executed N times (the value passed to `--trials`), and results include per-case trial statistics.

When trials > 1, each case gets a TrialStats object:

| Metric | Description |
| --- | --- |
| `trialCount` | Number of trials run |
| `passCount` | Number of passing trials |
| `failCount` | Number of failing trials |
| `errorCount` | Number of errored trials |
| `passRate` | `passCount / trialCount` |
| `meanScore` | Average score across trials |
| `scoreStdDev` | Sample standard deviation (Bessel's correction, n − 1) |
| `ci95Low` | Lower bound of the 95% confidence interval (Wilson score) |
| `ci95High` | Upper bound of the 95% confidence interval (Wilson score) |
| `flaky` | `true` if some trials pass and some fail |

A case passes only if all trials pass (pass^k semantics). This is intentionally strict — a case that passes 4 out of 5 times is flagged as flaky, not passing.
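The pass^k rule and the flaky flag can be sketched as a small reduction over trial outcomes. This is an illustrative re-implementation of the semantics described above, not the library's internal code:

```typescript
type TrialOutcome = "pass" | "fail" | "error";

// Illustrative aggregation: a case passes only if every trial passes
// (pass^k), and it is flaky when trials disagree (some pass, some fail).
function aggregateTrials(outcomes: TrialOutcome[]) {
  const passCount = outcomes.filter((o) => o === "pass").length;
  const failCount = outcomes.filter((o) => o === "fail").length;
  return {
    passed: outcomes.length > 0 && passCount === outcomes.length,
    flaky: passCount > 0 && failCount > 0,
    passRate: outcomes.length === 0 ? 0 : passCount / outcomes.length,
  };
}

// 4 of 5 trials pass: the case is flaky, not passing.
console.log(aggregateTrials(["pass", "pass", "pass", "pass", "fail"]));
// → { passed: false, flaky: true, passRate: 0.8 }
```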

Confidence intervals use the Wilson score interval (z = 1.96 for 95% confidence). This method is more accurate than simple proportion intervals for small sample sizes.
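The Wilson score interval itself is standard. A self-contained version (independent of the kit's exported `wilsonInterval` helper) looks like this:

```typescript
// Wilson score interval for a binomial proportion.
// z = 1.96 corresponds to 95% confidence.
function wilson(successes: number, total: number, z = 1.96) {
  if (total === 0) return { low: 0, high: 0 };
  const p = successes / total;
  const z2 = z * z;
  const denom = 1 + z2 / total;
  const center = p + z2 / (2 * total);
  const margin =
    z * Math.sqrt((p * (1 - p)) / total + z2 / (4 * total * total));
  return { low: (center - margin) / denom, high: (center + margin) / denom };
}

// 4 passes out of 5 trials → approximately [0.376, 0.964],
// the 0.37–0.96 interval shown in the console output below.
console.log(wilson(4, 5));
```

Note how the interval stays inside [0, 1] even at extreme proportions, which is what makes Wilson preferable to the simple normal approximation at small n.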

The console reporter displays these intervals:

case-1: 4/5 passed (80%) [95% CI: 0.37–0.96] ⚠ flaky

A case is marked flaky: true when it has at least one pass and at least one fail across trials. This surfaces non-determinism in your target — critical for AI agents where the same input can produce different outputs.

Practical guidance:

- **Flakiness detection:** run 3–5 trials to surface non-deterministic behavior.
- **Confidence intervals:** run 10+ trials for meaningful statistical bounds.
- **Cost:** each trial calls your target in live mode, so more trials means more API cost. Use replay mode for grader-only experimentation.
The same statistics are available programmatically:

```typescript
import { computeAllTrialStats, wilsonInterval } from "agent-eval-kit";

// Compute stats for all cases (trialCount is the configured number of trials)
const stats = computeAllTrialStats(trials, trialCount);

// Compute a single Wilson interval
const interval = wilsonInterval(successes, total, 1.96);
// interval.low, interval.high
```

The run summary always includes aggregate stats:

| Field | Description |
| --- | --- |
| `totalCases` | Total cases evaluated |
| `passed` | Cases that passed |
| `failed` | Cases that failed |
| `errors` | Cases that errored |
| `passRate` | `passed / totalCases` |
| `totalCost` | Aggregate cost in USD |
| `totalDurationMs` | Wall-clock duration |
| `p95LatencyMs` | 95th percentile latency |
| `gateResult` | Gate evaluation result |
| `byCategory` | Per-category pass rate breakdown |
| `trialStats` | Per-case trial statistics (when trials > 1) |
| `aborted` | Whether the run was interrupted by SIGINT |
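As an illustration of the `p95LatencyMs` field, here is a nearest-rank percentile computation. Nearest-rank is one common convention; the kit's exact percentile method is not documented here, so treat this as an assumption:

```typescript
// Nearest-rank percentile: sort the values and take the element at
// ceil(p * n) - 1. The kit's exact interpolation method is an
// assumption; this sketch shows the general idea.
function percentile(values: number[], p: number): number {
  if (values.length === 0) return NaN;
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil(p * sorted.length) - 1;
  return sorted[Math.max(0, rank)];
}

const latenciesMs = [120, 340, 95, 210, 480, 150, 610, 200, 330, 275];
console.log(percentile(latenciesMs, 0.95)); // → 610
```

With only 10 samples, the nearest-rank p95 is simply the maximum; with larger runs it converges on a stable tail-latency estimate, which is why 10+ trials are recommended for meaningful bounds.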