Contribute — needle-bench

What makes a good benchmark

The best benchmarks come from the worst days. A bug that took you hours. A race condition that only showed up in production. A silent data corruption that passed every test except the one nobody wrote.

Every benchmark needs:

A real bug from a real codebase, not a synthetic or contrived one
A Dockerfile that builds the broken environment in under 5 minutes
A test.sh that fails on the bug and passes on the fix
A sealed .bench/solution.patch the agent never sees
Image size under 500 MB. Alpine preferred.
No network access required at test time
Deterministic: same input, same result, every time
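
As a rough illustration of the test.sh requirement above, the sketch below wraps one check in a helper that compares a command's output with an expected value and returns a status that maps directly onto the pass/fail exit codes. The helper name expect_output is hypothetical and not part of needle-bench; a real test.sh will look different depending on the bug.

```shell
#!/bin/sh
# Sketch of a test.sh helper (hypothetical; not part of needle-bench).
# expect_output EXPECTED CMD [ARGS...]
# Returns 0 only when CMD exits 0 and prints exactly EXPECTED.
expect_output() {
    expected=$1
    shift
    # A crash or non-zero exit from the command counts as a failure.
    actual=$("$@" 2>/dev/null) || return 1
    # The comparison result becomes the function's return status.
    [ "$actual" = "$expected" ]
}
```

A test.sh built on it might run `expect_output "3" ./paginate --total 25 --page-size 10 || exit 1` and then `exit 0`, where ./paginate and its flags are placeholders for whatever the buggy program is. Comparing against a fixed expected value, with no clock, network, or random input involved, is one way to keep the result the same on every run.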

Directory structure

benchmarks/<name>/
  Dockerfile          # builds the broken environment
  Agentfile           # agent constraints
  test.sh             # exit 0 = pass, exit 1 = fail
  README.md           # human description of the bug
  .bench/
    solution.patch    # sealed ground truth (git diff format)
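
Before opening a PR, it can help to confirm that a benchmark directory contains every file in the layout above. The function below is a minimal sketch of such a check; check_layout is a hypothetical name, and this is not the project's own validator (make validate may check more or different things).

```shell
#!/bin/sh
# check_layout DIR  (hypothetical helper, not part of needle-bench)
# Prints one line per required file missing from DIR.
# Returns 0 only when nothing is missing.
check_layout() {
    dir=$1
    missing=0
    for f in Dockerfile Agentfile test.sh README.md .bench/solution.patch; do
        if [ ! -f "$dir/$f" ]; then
            echo "missing: $f"
            missing=1
        fi
    done
    return "$missing"
}
```

Called as `check_layout benchmarks/your-bug-name`, it prints nothing and succeeds when the directory is complete.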

The Agentfile

The Agentfile declares what the agent gets: the base image, the allowed tools, and the resource limits. No more, no less.

FROM <benchmark-name>       # Docker image reference
TOOL sh_run                 # tools the agent may use (repeatable)
TOOL ss
LIMIT turns 20              # max conversation turns
LIMIT tokens 100000         # max token budget
LIMIT wall_clock 300        # max wall-clock seconds

Any limit not specified falls back to its default: 30 turns, 200,000 tokens, 600 seconds of wall-clock time.
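
To make the fallback behavior concrete, here is a small sketch that reads the LIMIT lines of an Agentfile and fills in the documented defaults for anything left out. It assumes that `#` starts a comment and that each limit is written as `LIMIT <name> <value>` on its own line, as in the example above; the actual Agentfile parser may behave differently, and agentfile_limits is a hypothetical name.

```shell
#!/bin/sh
# agentfile_limits FILE  (hypothetical helper, not part of needle-bench)
# Prints "turns tokens wall_clock" for an Agentfile, using the documented
# defaults (30 turns, 200000 tokens, 600 seconds) for any LIMIT not declared.
agentfile_limits() {
    awk '
        BEGIN { lim["turns"] = 30; lim["tokens"] = 200000; lim["wall_clock"] = 600 }
        { sub(/#.*/, "") }   # assumption: "#" starts a trailing comment
        $1 == "LIMIT" && NF >= 3 && ($2 in lim) { lim[$2] = $3 }
        END { print lim["turns"], lim["tokens"], lim["wall_clock"] }
    ' "$1"
}
```

For the example Agentfile shown above, this would print the three declared values; for a file with no LIMIT lines, it prints the three defaults.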

Naming convention

Lowercase, hyphenated. Format: <bug-type>-<project-or-domain>

off-by-one-pagination
race-condition-counter
null-deref-tokio
silent-data-corruption
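
The convention can be checked mechanically. The sketch below accepts names made of lowercase letters and digits in at least two hyphen-separated parts; allowing digits is an assumption, since the convention above only says "lowercase, hyphenated", and is_valid_name is a hypothetical helper.

```shell
#!/bin/sh
# is_valid_name NAME  (hypothetical helper, not part of needle-bench)
# Succeeds when NAME is lowercase and hyphenated, with at least two
# hyphen-separated parts, matching <bug-type>-<project-or-domain>.
is_valid_name() {
    # LC_ALL=C keeps the [a-z] range from matching uppercase in some locales.
    printf '%s\n' "$1" | LC_ALL=C grep -Eq '^[a-z0-9]+(-[a-z0-9]+)+$'
}
```

All four example names above pass this check, while names such as `Off-By-One`, `off_by_one`, or `pagination` are rejected.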

Submit

Clone the repo: git clone https://github.com/os-tack/find-the-needle
Copy benchmarks/_template/ to benchmarks/your-bug-name/
Package your bug: write the Dockerfile, test.sh, and solution.patch
Validate: make validate BENCH=your-bug-name
Test the solution: make verify (your solution patch must make test.sh pass)
Open a PR — CI validates everything automatically

See also: Full contribution guide, Spec v1.0

Rules

Benchmarks are immutable once merged. If you need to change one, create a v2 in a new directory.
All bugs must be real. Anonymize proprietary code, but the bug pattern must be genuine.
The solution patch must apply cleanly inside the container and make test.sh exit 0.
test.sh must be deterministic and complete in under 60 seconds.
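
One way to probe the determinism rule locally is to run the test several times and compare the results. The following is a minimal sketch under that assumption; check_deterministic is a hypothetical helper, and passing it does not prove a test is deterministic, only that no difference showed up in the runs tried.

```shell
#!/bin/sh
# check_deterministic RUNS CMD [ARGS...]  (hypothetical helper)
# Runs CMD the given number of times and succeeds only if every run
# produces the same exit status and the same output as the first run.
check_deterministic() {
    runs=$1
    shift
    first=
    i=1
    while [ "$i" -le "$runs" ]; do
        code=0
        out=$("$@" 2>&1) || code=$?   # capture output and exit status
        result="$code:$out"
        if [ "$i" -eq 1 ]; then
            first=$result
        elif [ "$result" != "$first" ]; then
            echo "run $i differs from run 1"
            return 1
        fi
        i=$((i + 1))
    done
    return 0
}
```

For example, `check_deterministic 5 ./test.sh` inside the container reruns the test five times. A test that fails consistently on the broken code still counts as deterministic here, which is the intended behavior before the fix is applied. On systems that provide the coreutils timeout command, wrapping the call as `timeout 60 ./test.sh` is one way to check the 60-second limit as well.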
