What makes a good benchmark
The best benchmarks come from the worst days. A bug that took you hours. A race condition that only showed up in production. A silent data corruption that passed every test except the one nobody wrote.
- A real bug from a real codebase — not synthetic, not contrived
- A Dockerfile that builds the broken environment in under 5 minutes
- A test.sh that fails on the bug and passes on the fix
- A sealed .bench/solution.patch the agent never sees
- Image size under 500 MB. Alpine preferred.
- No network access required at test time
- Deterministic: same input, same result, every time
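A minimal sketch of the test.sh contract, assuming the bug can be observed in a command's output. The helper name run_case and the ./bin/list-page command in the comments are hypothetical, not part of the repo. The check compares output against a fixed expectation, so it needs no network and does not depend on the clock.

```shell
#!/bin/sh
# run_case EXPECTED CMD [ARGS...]
# Runs CMD, captures its stdout and stderr, and compares the result to EXPECTED.
# Returns 0 on an exact match and 1 otherwise. The command's own exit
# status is ignored; only its output is compared.
run_case() {
    expected=$1
    shift
    actual=$("$@" 2>&1) || true
    if [ "$actual" = "$expected" ]; then
        return 0
    fi
    echo "FAIL: expected [$expected], got [$actual]" >&2
    return 1
}

# A test.sh built on this helper might end like so (the command and its
# flags are invented for illustration):
#   run_case "21 22 23 24 25" ./bin/list-page --page 3 --size 5 || exit 1
#   exit 0
```

On the broken image the comparison fails and the script exits 1. After the solution patch is applied, the same comparison should pass and the script exits 0.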
Directory structure
benchmarks/<name>/
  Dockerfile              # builds the broken environment
  Agentfile               # agent constraints
  test.sh                 # exit 0 = pass, exit 1 = fail
  README.md               # human description of the bug
  .bench/
    solution.patch        # sealed ground truth (git diff format)
The Agentfile
Declares what the agent gets. No more, no less.
FROM <benchmark-name> # Docker image reference
TOOL sh_run # tools the agent may use (repeatable)
TOOL ss
LIMIT turns 20 # max conversation turns
LIMIT tokens 100000 # max token budget
LIMIT wall_clock 300 # max wall-clock seconds
Defaults if not specified: 30 turns, 200k tokens, 600s wall clock.
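To show how the defaults interact with declared limits, here is a small sketch that reads an Agentfile and falls back to a default when a LIMIT is absent. It assumes the grammar is exactly what the example shows (one directive per line, # starts a comment) and that the last declaration of a key wins; neither is confirmed by the project. The function names agent_limit and agent_tools are hypothetical.

```shell
#!/bin/sh
# agent_limit FILE KEY DEFAULT
# Prints the value of "LIMIT KEY <value>" from FILE, or DEFAULT if the
# key is never declared. Assumption: the last declaration wins.
agent_limit() {
    awk -v key="$2" -v def="$3" '
        { sub(/#.*/, "") }                        # strip trailing comments
        $1 == "LIMIT" && $2 == key { val = $3 }   # remember the latest value
        END { print (val == "" ? def : val) }
    ' "$1"
}

# agent_tools FILE
# Prints one tool name per line, in declaration order.
agent_tools() {
    awk '{ sub(/#.*/, "") } $1 == "TOOL" { print $2 }' "$1"
}

# Usage, with the documented defaults (200k written out as 200000):
#   agent_limit Agentfile turns 30
#   agent_limit Agentfile tokens 200000
#   agent_limit Agentfile wall_clock 600
```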
Naming convention
Lowercase, hyphenated. Format: <bug-type>-<project-or-domain>
off-by-one-pagination
race-condition-counter
null-deref-tokio
silent-data-corruption
Submit
- Clone the repo: git clone https://github.com/os-tack/find-the-needle
- Copy benchmarks/_template/ to benchmarks/your-bug-name/
- Package your bug: write the Dockerfile, test.sh, and solution.patch
- Validate: make validate BENCH=your-bug-name
- Test the solution: make verify (your solution patch must make test.sh pass)
- Open a PR. CI validates everything automatically.
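Before running the make targets, a quick local sanity check can catch a misnamed directory or a missing file. The sketch below is not what make validate does (its checks are not described here); it only mirrors the documented layout and naming convention. The function names are hypothetical, and allowing digits in names is an assumption.

```shell
#!/bin/sh
# valid_name NAME
# Succeeds if NAME is lowercase words joined by single hyphens, with at
# least two words (matching <bug-type>-<project-or-domain>).
# Assumption: digits are allowed alongside lowercase letters.
valid_name() {
    printf '%s\n' "$1" | grep -Eq '^[a-z0-9]+(-[a-z0-9]+)+$'
}

# check_layout DIR
# Reports every file from the documented layout that is missing in DIR.
# Returns 0 if all are present, 1 otherwise.
check_layout() {
    dir=$1
    missing=0
    for f in Dockerfile Agentfile test.sh README.md .bench/solution.patch; do
        if [ ! -f "$dir/$f" ]; then
            echo "missing: $f" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Usage:
#   valid_name your-bug-name && check_layout benchmarks/your-bug-name
```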
Rules
- Benchmarks are immutable once merged. To change one, create a v2 as a new directory.
- All bugs must be real. Anonymize proprietary code, but the bug pattern must be genuine.
- The solution patch must apply cleanly inside the container and make test.sh exit 0.
- test.sh must be deterministic and complete in under 60 seconds.
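The last rule can be spot-checked by running test.sh several times and comparing exit codes and run time. This is a sketch with a hypothetical function name, check_stable. Agreement across a handful of runs is evidence of determinism, not proof, and date +%s only resolves whole seconds.

```shell
#!/bin/sh
# check_stable SCRIPT [RUNS] [LIMIT_SECONDS]
# Runs SCRIPT with sh RUNS times (default 5). Fails if any run takes
# LIMIT_SECONDS or longer (default 60), or if exit codes disagree.
# Uses `date +%s`, which GNU and BusyBox date both support.
check_stable() {
    script=$1
    runs=${2:-5}
    limit=${3:-60}
    first=""
    n=1
    while [ "$n" -le "$runs" ]; do
        started=$(date +%s)
        status=0
        sh "$script" >/dev/null 2>&1 || status=$?
        elapsed=$(( $(date +%s) - started ))
        if [ "$elapsed" -ge "$limit" ]; then
            echo "run $n took ${elapsed}s (limit ${limit}s)" >&2
            return 1
        fi
        if [ -z "$first" ]; then
            first=$status
        elif [ "$status" -ne "$first" ]; then
            echo "run $n exited $status, earlier runs exited $first" >&2
            return 1
        fi
        n=$((n + 1))
    done
    return 0
}

# Usage, inside the benchmark container:
#   check_stable ./test.sh 5 60
```

Run it once on the broken image, where every run should exit non-zero, and once after applying the solution patch, where every run should exit 0.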