Insights
Analysis of 31 models × 40 benchmarks × 3 arms: 3,641 runs, 1,945 solved. Every claim cites specific data from experiment-scores.json. No marketing.
Kernel lift: what happens when the same model runs through a shared agent loop
Claude Opus goes from 58% native to 100% kernel. Grok-4-1-fast goes from 2.5% to 92.5%. For a surprising share of the models tested, the harness matters more than the model.
Open-source coding models on real bugs
Kimi K2.5, Devstral, DeepSeek R1, Qwen3-Coder. How open-weight models fare head-to-head against closed frontier models when both run inside the same agent loop.
Native driver vs kernel: does the CPU arm pay for itself?
The CPU arm wires the model straight into the ostk agent loop through a native API driver instead of OpenRouter. Sometimes the win is 2-3 points. Sometimes it's 57 points. We look at when the driver is worth building.
Methodology
Every insight article uses the same dataset: public/experiment-scores.json, built from the full 3-arm experiment run. Numbers are computed, not claimed. The leaderboard on the home page shows the same data without commentary.
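Recomputing the headline numbers is a one-file exercise. The sketch below assumes experiment-scores.json is a flat array of runs with `model`, `arm`, and `solved` fields; the real schema may differ.

```ts
// Minimal sketch of recomputing per-model, per-arm solve rates.
// ASSUMPTION: experiment-scores.json is a flat array of runs with
// `model`, `arm`, and `solved` fields; the actual schema may differ.
import { readFileSync } from "node:fs";

type Run = { model: string; arm: "native" | "kernel" | "cpu"; solved: boolean };

const runs: Run[] = JSON.parse(
  readFileSync("public/experiment-scores.json", "utf8"),
);

const totals = new Map<string, { solved: number; total: number }>();
for (const run of runs) {
  const key = `${run.model} (${run.arm})`;
  const t = totals.get(key) ?? { solved: 0, total: 0 };
  t.total += 1;
  if (run.solved) t.solved += 1;
  totals.set(key, t);
}

for (const [key, { solved, total }] of totals) {
  console.log(`${key}: ${((100 * solved) / total).toFixed(1)}% (${solved}/${total})`);
}
```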
Three arms per model, identical prompt, identical benchmark set:
- Native — the vendor's own CLI (Claude Code, Gemini CLI, Codex, Vibe, Kimi CLI) or an OpenCode fallback.
- Kernel — the ostk agent loop via OpenRouter, the controlled cross-vendor comparison.
- CPU — the ostk agent loop via a native API driver (Anthropic, Google, OpenAI, Mistral).
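The split is easiest to see as a dispatch over a single loop. This is a hypothetical sketch: `Driver`, `driverFor`, and the stub bodies are illustrative names, not the real ostk API.

```ts
// Illustrative only: Driver, driverFor, and the stubs are hypothetical,
// not the actual ostk API. The point is the design: one agent loop,
// three interchangeable transports per model.
type Arm = "native" | "kernel" | "cpu";

interface Driver {
  route: string;
  send(prompt: string): Promise<string>;
}

function driverFor(arm: Arm, model: string): Driver {
  const route = {
    native: `vendor CLI for ${model}`, // Claude Code, Gemini CLI, Codex, ...
    kernel: `OpenRouter -> ${model}`,  // the controlled cross-vendor arm
    cpu: `vendor API -> ${model}`,     // Anthropic, Google, OpenAI, Mistral
  }[arm];
  return {
    route,
    // Stub: the real arms talk to a CLI subprocess or an HTTP API.
    send: async (prompt) => `[${route}] ${prompt}`,
  };
}
```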
Scoring is deterministic: test.sh exits 0 on fix, 1 on fail. No LLM-as-judge, no subjective rubric.
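In code, the pass/fail check reduces to an exit-code test. A minimal sketch, assuming test.sh sits at the root of each benchmark's working copy:

```ts
// A run counts as solved iff the benchmark's test.sh exits 0.
// ASSUMPTION: test.sh lives at the root of the benchmark workspace.
import { spawnSync } from "node:child_process";

function solved(benchmarkDir: string): boolean {
  const result = spawnSync("bash", ["test.sh"], { cwd: benchmarkDir });
  return result.status === 0; // 0 = fix verified; anything else = fail
}
```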