Native driver vs kernel: does the CPU arm pay for itself?
Every model in the kernel arm talks to OpenRouter. The CPU arm skips the proxy and hands the model directly to a native API driver inside ostk. We can now measure exactly what that buys you.
The ostk agent loop runs in two configurations on this benchmark. The kernel arm is vendor-agnostic: every request routes through OpenRouter, so all 31 models in the experiment sit inside the same loop through the same proxy. The CPU arm is vendor-specific: the model is wired directly into a native API driver for its provider (Anthropic, Google, OpenAI, or Mistral today), bypassing OpenRouter entirely.
Building a native driver is work. Every provider has its own auth, streaming protocol, tool-call schema, retry semantics, and rate-limit quirks. The question this article answers is: for the providers where we've done that work, was it worth it?
Where CPU beats kernel
For four of the tested drivers, the CPU arm outperforms the kernel arm. Three of the gains are modest: enough to justify the driver, but not enough to change a model's ranking. The fourth, Devstral Small, is a different story (more on that below).
- devstral-small-latest — kernel 14/40 (35%) → CPU 37/40 (92.5%). +57.5pp
- gemini-3-1-pro-preview — kernel 33/40 (82.5%) → CPU 37/40 (92.5%). +10pp
- gpt-5-codex — kernel 37/40 (92.5%) → CPU 40/40 (100%). +7.5pp
- claude-haiku-4-5 — kernel 37/39 (94.9%) → CPU 39/39 (100%). +5.1pp
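The percentage-point deltas are plain arithmetic over the (solved, total) counts. A short sketch, using the counts from the list above, makes the computation explicit:

```python
# (kernel_solved, kernel_total, cpu_solved, cpu_total), from the list above
results = {
    "devstral-small-latest": (14, 40, 37, 40),
    "gemini-3-1-pro-preview": (33, 40, 37, 40),
    "gpt-5-codex": (37, 40, 40, 40),
    "claude-haiku-4-5": (37, 39, 39, 39),
}

def pp_delta(kernel_solved, kernel_total, cpu_solved, cpu_total):
    """Percentage-point difference between the CPU and kernel solve rates."""
    kernel_rate = 100 * kernel_solved / kernel_total
    cpu_rate = 100 * cpu_solved / cpu_total
    return round(cpu_rate - kernel_rate, 1)

for model, counts in results.items():
    print(f"{model}: {pp_delta(*counts):+}pp")
```

Note the totals differ per model (38, 39, or 40 benchmarks), which is why the deltas are computed on rates rather than raw counts.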
Two of the four entries above, GPT-5 Codex and Claude Haiku 4.5, hit a 100% solve rate on the CPU arm on their respective benchmark sets: 40/40 and 39/39 clean. That is not a rounding-error delta; it's the difference between a model that occasionally fails a bug and a model that doesn't.
Where CPU matches kernel
For the Claude Sonnet and Opus entries, the kernel arm is already at ceiling, so there is nowhere for the CPU arm to go.
- claude-opus-4-6 — kernel 38/38 (100%) ≡ CPU 38/38 (100%)
- claude-sonnet-4-6 — kernel 38/38 (100%) ≡ CPU 38/38 (100%)
If the kernel loop is already at 100%, the CPU arm can only prove it's at least as stable. Both Opus and Sonnet do that.
Where CPU loses to kernel
Two entries, one Gemini and one GPT, move the other way by a few points.
- gemini-3-flash-preview — kernel 36/38 (94.7%) → CPU 34/38 (89.5%). -5.3pp
- gpt-4-1 — kernel 20/40 (50%) → CPU 18/40 (45%). -5pp
These are small enough that a few runs going the other way would flip the sign. With more benchmarks and repeated trials, we'd expect them to land near parity.
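One way to see why these deltas sit within noise: at 38–40 benchmarks per model, a single flipped run moves the solve rate by roughly 2.5 points, and a plain binomial standard error on the rates is about the same size as the deltas themselves. A quick sketch (normal-approximation standard errors, not a formal hypothesis test):

```python
import math

def solve_rate_se(solved: int, total: int) -> tuple[float, float]:
    """Solve rate (in %) and its normal-approximation standard error."""
    p = solved / total
    se = math.sqrt(p * (1 - p) / total)
    return 100 * p, 100 * se

# gemini-3-flash-preview: kernel run, then CPU run
for solved, total in [(36, 38), (34, 38)]:
    rate, se = solve_rate_se(solved, total)
    print(f"{solved}/{total}: {rate:.1f}% ± {se:.1f}pp (1 s.e.)")
```

The standard errors come out around 3.6pp and 5.0pp respectively, so a -5.3pp gap between the two arms is within one standard error of zero.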
The Devstral Small anomaly
The single most dramatic number in the entire 3-arm experiment belongs to devstral-small-latest. On the native arm (via the Vibe CLI) it solves 33/40. Swap it into the kernel loop via OpenRouter and it collapses to 14/40. Wire it directly into the Mistral-native driver in the CPU arm and it recovers to 37/40 — better than either of the other arms.
Native 33, kernel 14, CPU 37. Same model weights in all three runs. Same 40 benchmarks. Same prompt. The routing is doing most of the work.
We haven't fully traced the kernel-arm regression yet. The most likely culprits: tool-call JSON schema differences that OpenRouter normalises differently than the Mistral API expects, or streaming-chunk boundary behaviour that interacts badly with the ostk agent loop. The CPU driver talks to Mistral directly in the format Mistral documents, and the regression disappears.
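As a hypothetical illustration of the class of mismatch meant here (not a traced root cause): one classic divergence is whether a tool call's `arguments` field arrives as a JSON-encoded string or as an already-parsed object. A loop that assumes one form silently mis-parses the other; a defensive driver normalises both. The field layout below follows the common OpenAI-style tool-call schema and is illustrative only, not ostk's actual code:

```python
import json

def normalize_tool_call(call: dict) -> dict:
    """Coerce a tool call's arguments into a parsed dict, whether the
    upstream sent them as a JSON string or as an object."""
    args = call["function"]["arguments"]
    if isinstance(args, str):  # some upstreams send a JSON string
        args = json.loads(args)
    return {"name": call["function"]["name"], "arguments": args}

# Both spellings of the same call normalise to the same result.
as_string = {"function": {"name": "read_file", "arguments": '{"path": "main.py"}'}}
as_object = {"function": {"name": "read_file", "arguments": {"path": "main.py"}}}
assert normalize_tool_call(as_string) == normalize_tool_call(as_object)
```

A proxy that re-encodes one form into the other, or a driver that only handles one, would produce exactly the kind of silent tool-call failures that could sink a solve rate without any visible error.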
This is the strongest argument for investing in native drivers: for some models, the proxy itself is the problem, and no amount of prompt engineering or agent-loop tuning inside the kernel arm will recover the missing points.
So, is CPU worth the work?
Based on the 3-arm data so far:
- Yes, if you're already at 90%+. The last few points are where the CPU driver earns its keep. Two models that were at 92.5% and 94.9% on kernel go to 100% on CPU. That is directly where the leaderboard frontier lives.
- Yes, if a proxy-level regression exists. Devstral Small is the existence proof: in at least one case, the kernel loop leaves 57.5 percentage points on the floor that a native driver recovers.
- Maybe not, if the kernel arm already works. Claude Opus and Sonnet are at 100% in both arms. Building a CPU driver for a model that's already at ceiling adds maintenance burden without moving the score.
The CPU arm currently covers Anthropic, Google, OpenAI, and Mistral. Open-weight models running through OpenRouter (DeepSeek, Qwen, Llama, Kimi) only have kernel-arm numbers because there is no provider API to wire directly into. For those models, the kernel arm is the best signal we have.