Loading board…

Native vendor CLI vs the ostk kernel on real-bug Docker snapshots, single-shot. Cost/token Δ are resolved-gated (cells both arms solved) and per-bucket priced. Every figure is shown as recorded — trust badges flag masked, invalid, floor-suppressed and mixed-basis cells, and the corrections log ↓ carries the full fault context.

Cost Δ / Token Δ = kernel vs native (negative = kernel cheaper/lighter). Click headers to sort, or jump to the per-cell matrix ↓

		Native	Kernel	Native (vendor CLI)		Kernel (ostk)
Model	Type	Solve	Solve	Cost	Tokens	Cost	Tokens	Cost Δ	Tok Δ
Loading...

^* B*: kernel solve sourced from the generic OpenRouter kernel (no hand-written native driver for this provider) — distinct from true B (kernel-cpu, native driver).

Per-cell results

Every model × benchmark. Each cell splits native │ kernel, colored by outcome. Benchmarks run hardest → easiest left to right. Tap any cell for the full token & cost breakdown.

solved failed timeout harness invalid (excluded) n/a

Loading matrix…

Corrections & known faults — read before comparing any two numbers

Every published cell is shown as recorded (as-was); numbers are never silently restated. The trust totals and faults below are the honest context — validity caveats, harness defects, and cost-basis issues — for the data above. Reverse-chronological.

Loading corrections…

Other eras

Versioned boards under boards/. Each ostk version is a different experimental condition (kernel, drivers, oracles, schema era) — not cross-comparable, and no cross-version delta is published anywhere. Newest first.

Loading eras…

needle-bench AI Coding Agent Leaderboard

Per-cell results

Other eras