needle-bench AI Coding Agent Leaderboard
Three-arm experiment: 31 models × 40 benchmarks × 3 arms, yielding 3,641 completed runs (model/arm combinations without a native driver are skipped, so the total falls short of the 3,720 possible).
Native: vendor CLI. Kernel: ostk via OpenRouter. CPU: ostk + optimized driver.
| Model | $/M | Native Solve | Native Cost | Native Tokens | Native Turns | Kernel Solve | Kernel Cost | Kernel Tokens | Kernel Turns | CPU Solve | CPU Cost | CPU Tokens | CPU Turns | Δ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Results
Key findings
- Kernel lift: Claude jumps from 63% (native) to 100% (kernel); vendor CLI models gain +28–42pp on the kernel arm.
- 100% club: gpt-5-codex and claude-haiku/sonnet/opus all achieve 100% on the kernel and CPU arms.
- Best value: kimi-k2.5 (95% solve, $0.71), devstral-small-latest (92%, $0.56).
- Non-agent models unlock: codestral-2508 solves 0% through Vibe but 35% through ostk kernel.
- Cheap models win: Haiku ($1/M) + kernel = Opus ($5/M) + kernel. The kernel equalizes.
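To make the "best value" comparison concrete, the sketch below ranks models by dollars spent per benchmark actually solved. The model names and figures are illustrative placeholders patterned on the findings above, not values read from the real score files:

```python
# Sketch: rank models by cost per solved benchmark.
# Model names and figures are illustrative placeholders, not leaderboard data.

N_BENCHMARKS = 40

def cost_per_solve(solve_rate: float, total_cost: float) -> float:
    """Dollars spent per benchmark actually solved."""
    solved = solve_rate * N_BENCHMARKS
    return total_cost / solved

models = {
    "model-a": (0.95, 0.71),  # (solve rate, total run cost in $)
    "model-b": (0.92, 0.56),
    "model-c": (1.00, 4.20),
}

ranked = sorted(models, key=lambda name: cost_per_solve(*models[name]))
for name in ranked:
    rate, cost = models[name]
    print(f"{name}: ${cost_per_solve(rate, cost):.3f} per solve")
```

Under this metric a cheap model with a slightly lower solve rate can still beat an expensive model that solves everything, which is the pattern the findings above describe.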
Methodology
- 40 Docker benchmarks with real bugs; a run counts as fixed when `test.sh` exits 0.
- 31 models × 3 arms × 40 benchmarks, yielding 3,641 score files.
- Native: vendor CLI (Claude Code, Gemini CLI, Codex, Vibe, Kimi CLI) or OpenCode fallback.
- Kernel: ostk agent loop via OpenRouter — fair comparison, same harness for all models.
- CPU: ostk agent loop via native API driver (Anthropic, Google, OpenAI, Mistral). Models without a driver show “no driver”.
- Identical prompt across all arms: same system prompt, same user message, same benchmarks.
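The pass/fail rule above (`test.sh` exits 0 = fixed) reduces each run to a boolean, so a model's solve rate is just the fraction of zero exit codes across its benchmarks. A minimal sketch of that aggregation (function and variable names are mine, not from the harness):

```python
# Sketch: turn per-benchmark test.sh exit codes into a solve rate.
# Exit code 0 means the agent's patch fixed the bug; anything else is a miss.

def solve_rate(exit_codes: dict[str, int]) -> float:
    """Fraction of benchmarks whose test.sh exited 0."""
    if not exit_codes:
        return 0.0
    solved = sum(1 for code in exit_codes.values() if code == 0)
    return solved / len(exit_codes)

# Illustrative run: 3 of 4 benchmarks fixed.
codes = {"bench-01": 0, "bench-02": 0, "bench-03": 1, "bench-04": 0}
print(f"solve rate: {solve_rate(codes):.0%}")
```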
Notes
- Native arm varies by vendor: different CLIs, different API constraints, different metric capture.
- CPU arm only available for providers with a native driver in ostk (Anthropic, Google, OpenAI, Mistral).
- Kernel arm (OpenRouter) is the controlled comparison — same harness, same API proxy, same tools.
- codestral-2508 native unavailable due to Mistral API rate limiting (1 RPS).
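The table's Δ column contrasts arms per model. One plausible reading, sketched below, is the kernel-minus-native lift in percentage points (the leaderboard does not define Δ explicitly, so treat this as an assumption):

```python
# Sketch: per-model lift in percentage points between two arms.
# Assumes delta = kernel solve % - native solve %; the leaderboard's
# exact definition of its Δ column may differ.

def lift_pp(native_solve: float, kernel_solve: float) -> float:
    """Percentage-point gain of the kernel arm over the native arm."""
    return round((kernel_solve - native_solve) * 100, 1)

# Claude example from the findings: native 63% -> kernel 100%.
print(lift_pp(0.63, 1.00))
```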