needle-bench AI Coding Agent Leaderboard

Three-arm experiment. 31 models × 40 benchmarks × 3 arms = 3,720 possible runs; 3,641 reported.
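
The run count can be checked against the size of the full grid. This is a minimal sketch that assumes every model would be run on every benchmark in every arm; the page does not say why the reported total differs from the product, so the gap is only computed here, not explained. The variable names are illustrative.

```python
# Figures taken from the experiment summary on this page.
MODELS = 31
BENCHMARKS = 40
ARMS = 3
REPORTED_RUNS = 3_641

# Assumption: a full grid means one run per (model, benchmark, arm) triple.
full_grid = MODELS * BENCHMARKS * ARMS
missing = full_grid - REPORTED_RUNS
coverage = REPORTED_RUNS / full_grid

print(f"full grid: {full_grid}")
print(f"reported:  {REPORTED_RUNS}")
print(f"gap:       {missing}")
print(f"coverage:  {coverage:.1%}")
```

Under that assumption the full grid is 3,720 runs, so the reported total covers all but 79 of them.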

Native: vendor CLI. Kernel: ostk via OpenRouter. CPU: ostk + optimized driver.

Columns:
Model, $/M
Native (vendor CLI): Solve, Cost, Tokens, Turns
Kernel (OpenRouter): Solve, Cost, Tokens, Turns
CPU (optimized driver): Solve, Cost, Tokens, Turns
Δ

Results

Key findings

Methodology

Notes