Open-source coding models on real bugs
Kimi K2.5 solves 38 of 40 real Docker bugs. Claude Opus solves 38 of 38. Devstral 2512 solves 34 of 38. All three run through the same agent loop. The input-price delta is 12.5×.
There is a common story that closed frontier models (Claude, GPT, Gemini) are categorically ahead of open-weight models on real coding work. We ran 31 models through an identical agent loop on a benchmark of 40 real Docker bugs, and the story doesn't hold up. Inside a shared loop, the top open-weight models land within single-digit percentage points of the top closed models — at a fraction of the price.
All numbers below come from the kernel arm — each model runs inside the ostk agent loop via OpenRouter, with identical prompts and tools. This is the controlled cross-vendor comparison. Solve rates from a model's native vendor CLI often look different; see the kernel lift article for that breakdown.
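For concreteness, here is a minimal sketch of what "identical prompts and tools" means in a setup like this. The endpoint and payload shape are OpenRouter's standard chat-completions API; everything else (the tool name, the prompt text) is illustrative rather than ostk's actual configuration, and the loop around it (tool execution, retries, turn budget) is omitted.

```python
import os
import requests

OPENROUTER_URL = "https://openrouter.ai/api/v1/chat/completions"

# Held constant across every model in the arm -- that is the whole point.
SYSTEM_PROMPT = "You are a coding agent. Fix the failing Docker build."
TOOLS = [{
    "type": "function",
    "function": {
        "name": "run_shell",  # illustrative tool, not ostk's real schema
        "description": "Run a shell command inside the bug's container.",
        "parameters": {
            "type": "object",
            "properties": {"cmd": {"type": "string"}},
            "required": ["cmd"],
        },
    },
}]

def one_turn(model: str, messages: list) -> dict:
    """Send one agent turn through OpenRouter; identical payload per model."""
    resp = requests.post(
        OPENROUTER_URL,
        headers={"Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}"},
        json={"model": model, "messages": messages, "tools": TOOLS},
        timeout=300,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]
```

Only the model string changes between runs, so any solve-rate gap is attributable to the model, not the harness.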
The top of the open-source leaderboard
- kimi-k2-5 — 38/40 solved (95%) · $0.40/M input
- qwen3-coder-plus — 36/40 (90%) · $0.65/M
- devstral-2512 — 34/38 (89.5%) · $0.40/M
- deepseek-r1 — 32/40 (80%) · $0.70/M
- deepseek-v3-2 — 29/40 (72.5%) · $0.26/M
- qwen3-coder-flash — 26/40 (65%) · $0.20/M
- grok-3-mini — 24/40 (60%) · $0.30/M
- qwen3-coder — 22/40 (55%) · $0.22/M
- llama-4-maverick — 21/40 (52.5%) · $0.15/M
- deepseek-r1-0528 — 19/40 (47.5%) · $0.45/M
- codestral-2508 — 14/40 (35%) · $0.30/M
The closed frontier, same arm, for comparison
- claude-opus-4-6 — 38/38 (100%) · $5.00/M
- claude-sonnet-4-6 — 38/38 (100%) · $3.00/M
- o3 — 38/40 (95%) · $2.00/M
- gemini-2-5-pro — 37/40 (92.5%) · $1.25/M
- gpt-5-codex — 37/40 (92.5%) · $1.25/M
- claude-haiku-4-5 — 37/39 (94.9%) · $1.00/M
- o4-mini — 36/40 (90%) · $1.10/M
- gemini-3-flash-preview — 36/38 (94.7%) · $0.50/M
- gemini-2-5-flash — 35/38 (92.1%) · $0.30/M
- gemini-3-1-pro-preview — 33/40 (82.5%) · $2.00/M
Three things that fall out of this
Kimi K2.5 is the story. 38/40 solved on the same run as Claude Opus 4.6's 38/38 is not a moral victory; it's a parity result. The price delta between them on input is 12.5×. Kimi is open-weight. The model you can run on your own hardware matches the model you can only rent.
Devstral 2512 at $0.40/M is among the best price-performance on the board: 34/38 (89.5%) solved at the same per-million price as Kimi, and cheaper than every closed model above 85% except Gemini 2.5 Flash at $0.30/M. Devstral was built by Mistral specifically for agent loops, and it shows.
DeepSeek and Qwen are close behind. Qwen3-Coder-Plus at 90% sits well inside the closed-model band; DeepSeek R1 at 80% lands just below its floor (Gemini 3.1 Pro Preview's 82.5%). DeepSeek V3.2 at 72.5% for 26 cents per million input tokens is the dark horse for bulk workloads where a 100% solve rate isn't required.
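The price claims above reduce to simple ratios over the input prices in the tables. A quick check:

```python
# Input prices in USD per million tokens, copied from the leaderboards.
prices = {"kimi-k2-5": 0.40, "claude-opus-4-6": 5.00, "deepseek-v3-2": 0.26}

# Opus vs Kimi: the headline 12.5x input-price delta.
print(prices["claude-opus-4-6"] / prices["kimi-k2-5"])                # 12.5

# Opus vs DeepSeek V3.2, the bulk-workload pick above.
print(round(prices["claude-opus-4-6"] / prices["deepseek-v3-2"], 1))  # 19.2
```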
Note that several of these models — DeepSeek R1, Qwen3-Coder, Llama 4, most Grok variants — post near-zero solve rates on their native arm. They aren't built to drive an agent loop by themselves. Put them inside one and they become competitive. That is a point in favour of shared agent infrastructure, not a point against the models.
What the numbers don't show
These solve rates are from the kernel arm, which routes all models through OpenRouter. Some open-weight models have a native driver in ostk that bypasses the proxy — notably Mistral models through the CPU arm. Where a driver exists, the CPU arm often pushes the model a few percentage points higher. Devstral Small latest is the most extreme case: 14/40 on kernel, 37/40 on CPU. See the CPU driver article.
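ostk's dispatch isn't shown here, but the kernel/CPU contrast is easy to picture. A hypothetical sketch, with invented class and registry names (nothing below is ostk's real API):

```python
class OpenRouterDriver:
    """Kernel arm: every model goes through the OpenRouter proxy."""
    def __init__(self, model: str):
        self.model = model

class MistralNativeDriver:
    """CPU arm: direct vendor API, bypassing the proxy."""
    def __init__(self, model: str):
        self.model = model

# Models with a native driver available (illustrative entry only).
NATIVE_DRIVERS = {"devstral-small-latest": MistralNativeDriver}

def pick_driver(model: str):
    # Prefer a native driver where one exists; per the numbers above,
    # it is often worth a few percentage points of solve rate.
    cls = NATIVE_DRIVERS.get(model, OpenRouterDriver)
    return cls(model)
```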
These numbers also don't capture speed, tool-call overhead, or cost per solve. The home leaderboard shows all of that — click the column headers to sort by $/M, tokens, or turns.
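Cost per solve, one of the metrics the leaderboard adds, is just total spend divided by bugs solved. A sketch of the calculation; the token count below is invented, and only the price and solve count come from the tables above:

```python
def cost_per_solve(price_per_m: float, input_tokens: int, solved: int) -> float:
    """Dollars spent on input tokens per bug solved."""
    return (price_per_m * input_tokens / 1_000_000) / solved

# e.g. if Kimi K2.5 consumed ~60M input tokens over the 40-bug run (assumed):
print(round(cost_per_solve(0.40, 60_000_000, 38), 2))  # 0.63
```

A cheap model that burns tokens in long loops can still lose to a pricier one on this metric, which is why the leaderboard exposes tokens and turns alongside $/M.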