Insights
Analysis of 31 models × 40 benchmarks × 3 arms: 3,641 runs, 1,945 solved. Every claim cites specific data from experiment-scores.json. No marketing.
Kernel lift: what happens when the same model runs through a shared agent loop
Claude Opus goes from 58% native to 100% kernel. Grok-4-1-fast goes from 2.5% to 92.5%. For a surprising share of the models tested, the harness matters more than the model.
Open-source coding models on real bugs
Kimi K2.5, Devstral, DeepSeek R1, Qwen3-Coder. How open-weight models fare head-to-head against closed frontier models when both run inside the same agent loop.
Native driver vs kernel: does the CPU arm pay for itself?
The CPU arm wires the model straight into the ostk agent loop through a native API driver instead of OpenRouter. Sometimes the win is 2-3 points. Sometimes it's 57 points. We look at when the driver is worth building.
Methodology
Every insight article uses the same dataset: public/experiment-scores.json, built from the full 3-arm experiment run. Numbers are computed, not claimed. The leaderboard on the home page shows the same data without commentary.
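Recomputing the headline numbers is a one-file exercise. The sketch below assumes experiment-scores.json is a flat array of runs with `model`, `arm`, and `solved` fields; the real schema may differ.

```ts
// Minimal sketch of recomputing per-model, per-arm solve rates.
// ASSUMPTION: experiment-scores.json is a flat array of runs with
// `model`, `arm`, and `solved` fields; the actual schema may differ.
import { readFileSync } from "node:fs";

type Run = { model: string; arm: "native" | "kernel" | "cpu"; solved: boolean };

const runs: Run[] = JSON.parse(
  readFileSync("public/experiment-scores.json", "utf8"),
);

const totals = new Map<string, { solved: number; total: number }>();
for (const run of runs) {
  const key = `${run.model} (${run.arm})`;
  const t = totals.get(key) ?? { solved: 0, total: 0 };
  t.total += 1;
  if (run.solved) t.solved += 1;
  totals.set(key, t);
}

for (const [key, { solved, total }] of totals) {
  console.log(`${key}: ${((100 * solved) / total).toFixed(1)}% (${solved}/${total})`);
}
```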
Three arms per model, identical prompt, identical benchmark set:
- Native — the vendor's own CLI (Claude Code, Gemini CLI, Codex, Vibe, Kimi CLI) or an OpenCode fallback.
- Kernel — the ostk agent loop via OpenRouter, the controlled cross-vendor comparison.
- CPU — the ostk agent loop via a native API driver (Anthropic, Google, OpenAI, Mistral).
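The split is easiest to see as a dispatch over a single loop. This is a hypothetical sketch: `Driver`, `driverFor`, and the stub bodies are illustrative names, not the real ostk API.

```ts
// Illustrative only: Driver, driverFor, and the stubs are hypothetical,
// not the actual ostk API. The point is the design: one agent loop,
// three interchangeable transports per model.
type Arm = "native" | "kernel" | "cpu";

interface Driver {
  route: string;
  send(prompt: string): Promise<string>;
}

function driverFor(arm: Arm, model: string): Driver {
  const route = {
    native: `vendor CLI for ${model}`, // Claude Code, Gemini CLI, Codex, ...
    kernel: `OpenRouter -> ${model}`,  // the controlled cross-vendor arm
    cpu: `vendor API -> ${model}`,     // Anthropic, Google, OpenAI, Mistral
  }[arm];
  return {
    route,
    // Stub: the real arms talk to a CLI subprocess or an HTTP API.
    send: async (prompt) => `[${route}] ${prompt}`,
  };
}
```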
Scoring is deterministic: test.sh exits 0 on fix, 1 on fail. No LLM-as-judge, no subjective rubric.
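In code, the pass/fail check reduces to an exit-code test. A minimal sketch, assuming test.sh sits at the root of each benchmark's working copy:

```ts
// A run counts as solved iff the benchmark's test.sh exits 0.
// ASSUMPTION: test.sh lives at the root of the benchmark workspace.
import { spawnSync } from "node:child_process";

function solved(benchmarkDir: string): boolean {
  const result = spawnSync("bash", ["test.sh"], { cwd: benchmarkDir });
  return result.status === 0; // 0 = fix verified; anything else = fail
}
```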