Kernel lift: what happens when the same model runs through a shared agent loop
The vendor's own CLI (Claude Code, Gemini CLI, Codex) is not a neutral platform. Swap the same model into a shared agent loop and the solve rate can swing by ninety percentage points.
needle-bench runs each of 31 AI coding models through three arms on the same 40 Docker bug benchmarks: its native vendor CLI, the ostk kernel loop via OpenRouter, and an ostk CPU loop via a native API driver. Identical prompt. Identical benchmark set. The only thing that changes between arms is the agent loop wrapping the model.
This article is about the native → kernel delta. Call it the kernel lift: the change in solve rate when the model is pulled out of the vendor CLI and dropped into a generic, shared loop.
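As a rough sketch of how that number falls out (illustrative only, not needle-bench's actual code), the lift is simply the difference in solve rate between the two arms, expressed in percentage points. The worked example below uses the Claude Opus 4.6 counts quoted later in this article:

```python
# Illustrative sketch only -- not needle-bench's actual code.
# "Kernel lift" = solve-rate delta (in percentage points) between the
# shared kernel arm and the native vendor-CLI arm for one model.

def solve_rate(solved: int, attempted: int) -> float:
    """Solve rate as a percentage of attempted benchmark tasks."""
    return 100.0 * solved / attempted

def kernel_lift(native: tuple[int, int], kernel: tuple[int, int]) -> float:
    """Kernel-arm solve rate minus native-arm solve rate, in pp."""
    return solve_rate(*kernel) - solve_rate(*native)

# Claude Opus 4.6: native 22/38 (58%), kernel 38/38 (100%) -> roughly +42pp.
print(f"{kernel_lift((22, 38), (38, 38)):+.0f}pp")  # +42pp
```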
Claude: the small, consistent lift
Claude is already designed to sit inside an agent loop — Claude Code is the reference implementation — so you might expect the native arm to dominate. It doesn't.
- Claude Opus 4.6: native 22/38 (58%) → kernel 38/38 (100%). +42pp
- Claude Sonnet 4.6: native 24/38 (63%) → kernel 38/38 (100%). +37pp
- Claude Haiku 4.5: native 26/39 (67%) → kernel 37/39 (95%). +28pp
All three Claude models sit at 100% or near-100% once they are inside the kernel loop. The shared harness has cheaper error recovery, tighter tool surfaces, and less ambient state for the model to get confused by. It is the same model working from the same prompt, but the scaffolding around it does measurable work.
Non-agent models: from near-zero to near-perfect
The most dramatic lifts belong to models that weren't built to drive an agent loop at all. Without a vendor CLI that knows how to hold their hand, they post single-digit solve rates on the native arm. In the shared kernel loop they become competitive.
- grok-4-1-fast: native 1/40 (2.5%) → kernel 37/40 (92.5%). +90pp
- grok-4-20: native 0/39 (0%) → kernel 36/40 (90%). +90pp
- o4-mini: native 1/40 (2.5%) → kernel 36/40 (90%). +87.5pp
- qwen3-coder-plus: native 1/39 (2.6%) → kernel 36/40 (90%). +87.4pp
- deepseek-r1: native 0/39 (0%) → kernel 32/40 (80%). +80pp
- deepseek-v3-2: native 1/39 (2.6%) → kernel 29/40 (72.5%). +69.9pp
- qwen3-coder: native 1/39 (2.6%) → kernel 22/40 (55%). +52.4pp
- llama-4-maverick: native 1/40 (2.5%) → kernel 21/40 (52.5%). +50pp
- codestral-2508: native 0/0 (no CLI available) → kernel 14/40 (35%). Unlock.
The pattern is consistent: if a model can't drive its own shell in a loop, giving it a wrapper that does the driving for it recovers between 50 and 90 percentage points of solve rate. For several of these models, the kernel arm is the only way to get useful behaviour out of them on real bugs at all.
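As a quick arithmetic check of that 50-90pp range, here is an illustrative sketch (again, not needle-bench's code) that recomputes the lifts from the native/kernel counts listed above; codestral-2508 is left out because it has no native arm to compare against:

```python
# Illustrative check of the 50-90pp claim, using the counts quoted above.
# Each entry: model -> ((native_solved, native_total), (kernel_solved, kernel_total)).
NON_AGENT = {
    "grok-4-1-fast":    ((1, 40), (37, 40)),
    "grok-4-20":        ((0, 39), (36, 40)),
    "o4-mini":          ((1, 40), (36, 40)),
    "qwen3-coder-plus": ((1, 39), (36, 40)),
    "deepseek-r1":      ((0, 39), (32, 40)),
    "deepseek-v3-2":    ((1, 39), (29, 40)),
    "qwen3-coder":      ((1, 39), (22, 40)),
    "llama-4-maverick": ((1, 40), (21, 40)),
    # codestral-2508 omitted: no native CLI, so there is no lift to compute.
}

lifts = {
    model: 100.0 * k_s / k_n - 100.0 * n_s / n_n
    for model, ((n_s, n_n), (k_s, k_n)) in NON_AGENT.items()
}
print(f"min lift: {min(lifts.values()):.1f}pp, max lift: {max(lifts.values()):.1f}pp")
# -> roughly 50.0pp to 90.0pp
```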
Where the kernel loses
The lift isn't universal. A few models do worse when pulled out of their native CLI.
- grok-3-fast: native 25/39 (64%) → kernel 0/40 (0%). -64pp
- devstral-small-latest: native 33/40 (82.5%) → kernel 14/40 (35%). -47.5pp
- grok-4: native 24/40 (60%) → kernel 8/40 (20%). -40pp
- grok-code-fast-1: native 35/40 (87.5%) → kernel 30/40 (75%). -12.5pp
- gemini-3-1-pro-preview: native 38/40 (95%) → kernel 33/40 (82.5%). -12.5pp
- grok-4-fast: native 37/40 (92.5%) → kernel 34/40 (85%). -7.5pp
- gpt-5-codex: native 38/40 (95%) → kernel 37/40 (92.5%). -2.5pp
The biggest losses belong to models whose vendor CLI is doing something the shared loop can't replicate. Some of this is routing (the native CLI gets direct API access, the kernel goes through OpenRouter). Some of it is provider-specific prompting and tool-call formats the shared loop normalises away. The devstral-small-latest collapse is striking — and we'll come back to it in the CPU driver article, where it recovers all the way to 37/40 when given its own native driver.
What this says about benchmarking coding models
Most public leaderboards report a single score per model. That score is implicitly a score for the model plus its harness. On this benchmark, the harness's contribution ranges from a few percentage points to ninety.
If you care about picking a model, the kernel arm is the cleanest comparison: every model runs inside the same loop, with the same tools, the same prompt, and the same API proxy. If you care about picking a product, the native arm matches what users actually experience when they install the vendor CLI.
These are different questions. needle-bench runs all three arms so you can answer both.