What makes a good benchmark

The best benchmarks come from the worst days. A bug that took you hours to track down. A race condition that only showed up in production. A silent data corruption that passed every test except the one nobody wrote.

Directory structure

benchmarks/<name>/
  Dockerfile          # builds the broken environment
  Agentfile           # agent constraints
  test.sh             # exit 0 = pass, exit 1 = fail
  README.md           # human description of the bug
  .bench/
    solution.patch    # sealed ground truth (git diff format)

The Agentfile

Declares what the agent gets. No more, no less.

FROM <benchmark-name>       # Docker image reference
TOOL sh_run                 # tools the agent may use (repeatable)
TOOL ss
LIMIT turns 20              # max conversation turns
LIMIT tokens 100000         # max token budget
LIMIT wall_clock 300        # max wall-clock seconds

Defaults if not specified: 30 turns, 200k tokens, 600s wall clock.
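A benchmark that is happy with those defaults only needs the image and tool lines; for example (image name illustrative):

```
FROM off-by-one-pagination
TOOL sh_run
```

Omitted LIMIT lines fall back to 30 turns, 200k tokens, and 600 seconds.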

Naming convention

Lowercase, hyphenated. Format: <bug-type>-<project-or-domain>

off-by-one-pagination
race-condition-counter
null-deref-tokio
silent-data-corruption

Submit

  1. Clone the repo: git clone https://github.com/os-tack/find-the-needle
  2. Copy benchmarks/_template/ to benchmarks/your-bug-name/
  3. Package your bug: write the Dockerfile, test.sh, and solution.patch
  4. Validate: make validate BENCH=your-bug-name
  5. Test the solution: make verify (your solution patch must make test.sh pass)
  6. Open a PR — CI validates everything automatically
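Step 3's solution.patch is a plain git diff captured against the broken state. A self-contained sketch of the round trip (the file name and one-line bug are hypothetical; only the git mechanics are the point):

```shell
# Sketch: produce solution.patch in git-diff format, then prove it applies.
set -eu
work=$(mktemp -d) && cd "$work"
git init -q
git config user.email you@example.com && git config user.name you
printf 'limit = n - 1\n' > pager.py   # the planted bug (hypothetical)
git add pager.py && git commit -qm 'broken state'
printf 'limit = n\n' > pager.py       # the fix
git diff > solution.patch             # sealed ground truth
git checkout -q -- pager.py           # restore the broken state
git apply solution.patch              # apply the patch as the verifier would
grep -q '^limit = n$' pager.py        # the fix is in place again
```

If the patch applies cleanly to the broken tree and test.sh then passes, the benchmark is well-formed.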

See the full contribution guide and Spec v1.0.

Rules