Kernel lift: what happens when the same model runs through a shared agent loop
The vendor's own CLI (Claude Code, Gemini CLI, Codex) is not a neutral platform. Swap the same model into a shared agent loop and the solve rate can swing by ninety percentage points.
needle-bench runs each of 31 AI coding models through three arms on the same 40 Docker bug benchmarks: its native vendor CLI, the ostk kernel loop via OpenRouter, and an ostk CPU loop via a native API driver. Identical prompt. Identical benchmark set. The only thing that changes between arms is the agent loop wrapping the model.
This article is about the native → kernel delta. Call it the kernel lift: the change in solve rate when the model is pulled out of the vendor CLI and dropped into a generic, shared loop.
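As a rough sketch of how that number falls out (illustrative only, not needle-bench's actual code), the lift is simply the difference in solve rate between the two arms, expressed in percentage points. The worked example below uses the Claude Opus 4.6 counts quoted later in this article:

```python
# Illustrative sketch only -- not needle-bench's actual code.
# "Kernel lift" = solve-rate delta (in percentage points) between the
# shared kernel arm and the native vendor-CLI arm for one model.

def solve_rate(solved: int, attempted: int) -> float:
    """Solve rate as a percentage of attempted benchmark tasks."""
    return 100.0 * solved / attempted

def kernel_lift(native: tuple[int, int], kernel: tuple[int, int]) -> float:
    """Kernel-arm solve rate minus native-arm solve rate, in pp."""
    return solve_rate(*kernel) - solve_rate(*native)

# Claude Opus 4.6: native 22/38 (58%), kernel 38/38 (100%) -> roughly +42pp.
print(f"{kernel_lift((22, 38), (38, 38)):+.0f}pp")  # +42pp
```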
Claude: the small, consistent lift
Claude is already designed to sit inside an agent loop — Claude Code is the reference implementation — so you might expect the native arm to dominate. It doesn't.
- Claude Opus 4.6: native 22/38 (58%) → kernel 38/38 (100%). +42pp
- Claude Sonnet 4.6: native 24/38 (63%) → kernel 38/38 (100%). +37pp
- Claude Haiku 4.5: native 26/39 (67%) → kernel 37/39 (95%). +28pp
All three Claude models sit at 100% or near-100% once they are inside the kernel loop. The shared harness has cheaper error recovery, tighter tool surfaces, and less ambient state for the model to get confused by. It is the same model working from the same prompt, but the scaffolding around it does measurable work.
Non-agent models: from near-zero to near-perfect
The most dramatic lifts belong to models that weren't built to drive an agent loop at all. Without a vendor CLI that knows how to hold their hand, they post single-digit solve rates on the native arm. In the shared kernel loop they become competitive.
- grok-4-1-fast: native 1/40 (2.5%) → kernel 37/40 (92.5%). +90pp
- grok-4-20: native 0/39 (0%) → kernel 36/40 (90%). +90pp
- o4-mini: native 1/40 (2.5%) → kernel 36/40 (90%). +87.5pp
- qwen3-coder-plus: native 1/39 (2.6%) → kernel 36/40 (90%). +87.4pp
- deepseek-r1: native 0/39 (0%) → kernel 32/40 (80%). +80pp
- deepseek-v3-2: native 1/39 (2.6%) → kernel 29/40 (72.5%). +69.9pp
- qwen3-coder: native 1/39 (2.6%) → kernel 22/40 (55%). +52.4pp
- llama-4-maverick: native 1/40 (2.5%) → kernel 21/40 (52.5%). +50pp
- codestral-2508: native 0/0 (no CLI available) → kernel 14/40 (35%). Unlock.
The pattern is consistent: if a model can't drive its own shell in a loop, giving it a wrapper that does the driving for it recovers between 50 and 90 percentage points of solve rate. For several of these models, the kernel arm is the only way to get useful behaviour out of them on real bugs at all.
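As a quick arithmetic check of that 50-90pp range, here is an illustrative sketch (again, not needle-bench's code) that recomputes the lifts from the native/kernel counts listed above; codestral-2508 is left out because it has no native arm to compare against:

```python
# Illustrative check of the 50-90pp claim, using the counts quoted above.
# Each entry: model -> ((native_solved, native_total), (kernel_solved, kernel_total)).
NON_AGENT = {
    "grok-4-1-fast":    ((1, 40), (37, 40)),
    "grok-4-20":        ((0, 39), (36, 40)),
    "o4-mini":          ((1, 40), (36, 40)),
    "qwen3-coder-plus": ((1, 39), (36, 40)),
    "deepseek-r1":      ((0, 39), (32, 40)),
    "deepseek-v3-2":    ((1, 39), (29, 40)),
    "qwen3-coder":      ((1, 39), (22, 40)),
    "llama-4-maverick": ((1, 40), (21, 40)),
    # codestral-2508 omitted: no native CLI, so there is no lift to compute.
}

lifts = {
    model: 100.0 * k_s / k_n - 100.0 * n_s / n_n
    for model, ((n_s, n_n), (k_s, k_n)) in NON_AGENT.items()
}
print(f"min lift: {min(lifts.values()):.1f}pp, max lift: {max(lifts.values()):.1f}pp")
# -> roughly 50.0pp to 90.0pp
```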
Where the kernel loses
The lift isn't universal. A few models do worse when pulled out of their native CLI.
- grok-3-fast: native 25/39 (64%) → kernel 0/40 (0%). -64pp
- devstral-small-latest: native 33/40 (82.5%) → kernel 14/40 (35%). -47.5pp
- grok-4: native 24/40 (60%) → kernel 8/40 (20%). -40pp
- grok-code-fast-1: native 35/40 (87.5%) → kernel 30/40 (75%). -12.5pp
- gemini-3-1-pro-preview: native 38/40 (95%) → kernel 33/40 (82.5%). -12.5pp
- grok-4-fast: native 37/40 (92.5%) → kernel 34/40 (85%). -7.5pp
- gpt-5-codex: native 38/40 (95%) → kernel 37/40 (92.5%). -2.5pp
The biggest losses belong to models whose vendor CLI is doing something the shared loop can't replicate. Some of this is routing (the native CLI gets direct API access, the kernel goes through OpenRouter). Some of it is provider-specific prompting and tool-call formats the shared loop normalises away. The devstral-small-latest collapse is striking — and we'll come back to it in the CPU driver article, where it recovers all the way to 37/40 when given its own native driver.
What this says about benchmarking coding models
Most public leaderboards report a single score per model. That score is implicitly a score for the model plus its harness. On this benchmark, the harness's contribution ranges from a few percentage points to ninety.
If you care about picking a model, the kernel arm is the cleanest comparison: every model runs inside the same loop, with the same tools, the same prompt, and the same API proxy. If you care about picking a product, the native arm matches what users actually experience when they install the vendor CLI.
These are different questions. needle-bench runs all three arms so you can answer both.