About needle-bench
needle-bench is an open benchmark for evaluating AI coding agents on real bugs. Developers submit frozen Docker snapshots containing actual production failures. Each agent receives one prompt—“find the needle”—and shell access. Scoring is deterministic: the test passes or it doesn’t.
How benchmarks work
Each benchmark is a directory with four files:
```
benchmarks/off-by-one-pagination/
  Dockerfile          # builds the broken environment
  Agentfile           # agent constraints (tools, turn limit, token budget)
  test.sh             # exit 0 = fixed, exit 1 = still broken
  .bench/
    solution.patch    # sealed ground truth — agent never sees this
```

The Dockerfile contains a real bug from a real codebase. Not synthetic, not contrived. The agent gets bash access and a failing test. That’s it.
A real Agentfile looks like this:
```
FROM off-by-one-pagination
TOOL bash
LIMIT turns 20
LIMIT tokens 100000
LIMIT wall_clock 300
```

40 benchmarks span easy, medium, and hard tiers across logic errors, concurrency bugs, data integrity issues, and more.
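The Agentfile grammar shown above is small enough to parse in a few lines. A sketch covering only the three directives on this page (FROM, TOOL, LIMIT), written as an illustration rather than the harness's actual loader:

```python
def parse_agentfile(text):
    """Parse FROM/TOOL/LIMIT directives into a dict (illustrative sketch)."""
    spec = {"from": None, "tools": [], "limits": {}}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and comments
        directive, _, rest = line.partition(" ")
        if directive == "FROM":
            spec["from"] = rest.strip()
        elif directive == "TOOL":
            spec["tools"].append(rest.strip())
        elif directive == "LIMIT":
            name, value = rest.split()
            spec["limits"][name] = int(value)
        else:
            raise ValueError(f"unknown directive: {directive}")
    return spec

spec = parse_agentfile(
    "FROM off-by-one-pagination\nTOOL bash\nLIMIT turns 20\n"
    "LIMIT tokens 100000\nLIMIT wall_clock 300\n"
)
# spec["limits"] == {"turns": 20, "tokens": 100000, "wall_clock": 300}
```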
Three-arm experiment
Each model is tested in three conditions using an identical prompt:
- Native: the vendor's own CLI (Claude Code, Gemini CLI, Codex, Vibe, Kimi CLI) or an OpenCode fallback.
- Kernel: the ostk agent loop via OpenRouter — the controlled comparison, same harness for every model.
- CPU: the ostk agent loop via a native API driver (Anthropic, Google, OpenAI, Mistral). Models without a driver surface as “no driver” on the leaderboard.
The grid spans 31 models × 40 benchmarks × 3 arms; 3,641 runs were recorded, 1,945 of them solved. Identical system prompt, identical user message, identical benchmark set across all arms.
Scoring
Every run produces the same set of metrics. No subjective evaluation. No LLM-as-judge. The test passes or it doesn’t. Everything else is measured, not guessed.
- `resolved`: did `test.sh` pass? Boolean. The only metric that matters for the headline.
- `turns_to_discovery`: turns before the agent identifies the correct file and region.
- `turns_to_fix`: turns before the agent produces a passing patch.
- `signal_to_noise`: ratio of productive actions to total actions, 0.0–1.0.
- `false_positives`: files edited that aren’t in the solution. Fewer is better.
- `token_cost`: total tokens consumed (input + output).
- `tokens_per_correct_line`: efficiency: tokens consumed per line of the correct fix.
- `recovery_events`: times the agent went wrong, noticed, and self-corrected.
- `recovery_rate`: fraction of recovery events that led back to the fix.
- `wall_clock`: wall-clock seconds from start to score.
- `blind_discovery`: did the agent find the bug with no hints beyond test output?
Leaderboard ranking: resolved (desc) → turns_to_fix (asc) → token_cost (asc) → wall_clock (asc).
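The tie-break order above can be expressed as a single sort key. A sketch (field names taken from the metric list on this page; the harness's own ranking code may differ):

```python
def leaderboard_key(run):
    # resolved descending first (False sorts after True via `not`),
    # then turns_to_fix, token_cost, wall_clock, all ascending.
    return (not run["resolved"], run["turns_to_fix"],
            run["token_cost"], run["wall_clock"])

runs = [
    {"agent": "a", "resolved": True,  "turns_to_fix": 6,  "token_cost": 40000, "wall_clock": 60.0},
    {"agent": "b", "resolved": True,  "turns_to_fix": 4,  "token_cost": 32000, "wall_clock": 45.2},
    {"agent": "c", "resolved": False, "turns_to_fix": 20, "token_cost": 90000, "wall_clock": 290.0},
]
ranked = sorted(runs, key=leaderboard_key)
# "b" (fewer turns) ranks above "a"; unresolved "c" ranks last
```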
Get started
Clone, verify, run. Docker required.
```
git clone https://github.com/os-tack/find-the-needle
cd find-the-needle
make verify   # confirms all benchmarks have valid solutions
make bench    # runs all benchmarks (Docker required)
```

Run a single benchmark:

```
make run BENCH=off-by-one-pagination
```

Validate your own benchmark before submitting:

```
make validate BENCH=my-new-benchmark
```

Data format
Every run appends a record to scores.json. Raw numbers, no interpretation layer.
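Because the file is raw data, analysis needs no special tooling. A sketch that computes the headline solve rate (this assumes `scores.json` holds a JSON array of run records; adjust if the file is JSON Lines):

```python
import json
import os
import tempfile

def solve_rate(path):
    # Load every run record and report the fraction with resolved == true.
    with open(path) as f:
        runs = json.load(f)
    return sum(1 for r in runs if r["resolved"]) / len(runs)

# Demo against a throwaway file with two records.
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
    json.dump([{"resolved": True}, {"resolved": False}], f)
rate = solve_rate(f.name)
os.unlink(f.name)
# rate == 0.5
```

Over the published totals (1,945 solved of 3,641 runs) the same calculation gives a solve rate of roughly 0.53.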
```json
{
  "benchmark": "off-by-one-pagination",
  "agent": "claude-opus-4-6",
  "resolved": true,
  "turns_to_discovery": 1,
  "turns_to_fix": 4,
  "signal_to_noise": 0.85,
  "false_positives": 0,
  "token_cost": 32000,
  "tokens_per_correct_line": 32000,
  "recovery_events": 0,
  "recovery_rate": 1.0,
  "wall_clock": 45.2,
  "blind_discovery": true
}
```

Contribute a benchmark
Got a real bug worth preserving? Package it as a benchmark.
- A real bug from a real codebase — not synthetic, not contrived
- A Dockerfile that builds in under 5 minutes
- A `test.sh` that fails on the bug and passes on the fix
- A sealed `.bench/solution.patch` the agent never sees