About needle-bench

needle-bench is an open benchmark for evaluating AI coding agents on real bugs. Developers submit frozen Docker snapshots containing actual production failures. Each agent receives one prompt—“find the needle”—and shell access. Scoring is deterministic: the test passes or it doesn’t.

View the leaderboard · Source on GitHub

How benchmarks work

Each benchmark is a directory with four files:

benchmarks/off-by-one-pagination/
  Dockerfile        # builds the broken environment
  Agentfile         # agent constraints (tools, turn limit, token budget)
  test.sh           # exit 0 = fixed, exit 1 = still broken
  .bench/
    solution.patch  # sealed ground truth — agent never sees this

The Dockerfile contains a real bug from a real codebase. Not synthetic, not contrived. The agent gets bash access and a failing test. That’s it.

A real Agentfile looks like this:

FROM off-by-one-pagination
TOOL bash
LIMIT turns 20
LIMIT tokens 100000
LIMIT wall_clock 300

40 benchmarks span easy, medium, and hard tiers across logic errors, concurrency bugs, data integrity issues, and more.

Three-arm experiment

Each model is tested in three conditions using an identical prompt:

31 models × 40 benchmarks × 3 arms = 3,641 runs, 1,945 solved. Identical system prompt, identical user message, identical benchmark set across all arms.

Scoring

Every run produces the same set of metrics. No subjective evaluation. No LLM-as-judge. The test passes or it doesn’t. Everything else is measured, not guessed.

Leaderboard ranking: resolved (desc) → turns_to_fix (asc) → token_cost (asc) → wall_clock (asc).

Get started

Clone, verify, run. Docker required.

git clone https://github.com/os-tack/find-the-needle
cd find-the-needle
make verify   # confirms all benchmarks have valid solutions
make bench    # runs all benchmarks (Docker required)

Run a single benchmark:

make run BENCH=off-by-one-pagination

Validate your own benchmark before submitting:

make validate BENCH=my-new-benchmark

Data format

Every run appends a record to scores.json. Raw numbers, no interpretation layer.

{
  "benchmark": "off-by-one-pagination",
  "agent": "claude-opus-4-6",
  "resolved": true,
  "turns_to_discovery": 1,
  "turns_to_fix": 4,
  "signal_to_noise": 0.85,
  "false_positives": 0,
  "token_cost": 32000,
  "tokens_per_correct_line": 32000,
  "recovery_events": 0,
  "recovery_rate": 1.0,
  "wall_clock": 45.2,
  "blind_discovery": true
}

Contribute a benchmark

Got a real bug worth preserving? Package it as a benchmark.

Contribution guide