needle-bench — v7.6.0 schema-unified results (2026-06-17)

Native vendor CLI vs ostk kernel · single-shot (samples=1) · captured 2026-06-17

TL;DR

Apples-to-apples cost and token comparison is measurable only on Anthropic models, because claude-code is the only native harness that reports cost, the cache split, and real turn counts. Cost is summed per token bucket (fresh input 1×, cache-read 0.1×, cache-create 1.25× for the 5m tier and 2× for the 1h tier, output at the output rate) on a rate card validated to within 0.5% of real Anthropic billing. Efficiency deltas are computed only over cells that both arms solved.
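The per-bucket summation described above can be sketched as follows. This is a minimal illustration, not the benchmark's own code: the `TokenBuckets` type, the `bucket_cost` function, and the per-million-token rates passed in are hypothetical placeholders. Only the multipliers (1×, 0.1×, 1.25×, 2×) come from the text.

```python
from dataclasses import dataclass

# Multipliers relative to the fresh-input rate, as stated in the text.
FRESH_MULT = 1.0
CACHE_READ_MULT = 0.1
CACHE_CREATE_5M_MULT = 1.25
CACHE_CREATE_1H_MULT = 2.0


@dataclass
class TokenBuckets:
    """Atomic token counts for one run (field names are hypothetical)."""
    fresh: int = 0
    cache_read: int = 0
    cache_create_5m: int = 0
    cache_create_1h: int = 0
    output: int = 0


def bucket_cost(tokens: TokenBuckets,
                input_rate_per_mtok: float,
                output_rate_per_mtok: float) -> float:
    """Sum cost bucket by bucket; rates are dollars per million tokens."""
    # Weight each input-side bucket by its multiplier before pricing it.
    weighted_input = (
        tokens.fresh * FRESH_MULT
        + tokens.cache_read * CACHE_READ_MULT
        + tokens.cache_create_5m * CACHE_CREATE_5M_MULT
        + tokens.cache_create_1h * CACHE_CREATE_1H_MULT
    )
    # Output tokens are priced at the separate output rate.
    total = (weighted_input * input_rate_per_mtok
             + tokens.output * output_rate_per_mtok)
    return total / 1_000_000
```

Because cache reads are weighted at 0.1×, a harness that reports only a single token total cannot be priced this way, which is why the cost comparison is limited to the arm pairs that expose the full split.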

Model	Solve (native → kernel)	Cost Δ	Token Δ
claude-opus-4-8	97% → 97% (parity)	−17%	−49%
claude-sonnet-4-6	93% → 100% (kernel +2)	−38%	−65%
devstral-2512 (solve-only)	19% → 97%	n/a	n/a
gemini-3.1-pro (solve-only)	97% → 100%	n/a	n/a
gpt-5.5 (solve-only)	94% → 97%	n/a	n/a
kimi-k2.6 (solve-only)	97% → 95%	n/a	n/a
grok-4.3 / deepseek-v4-pro (B*)	native key-blocked — kernel-only	n/a	n/a

Read: at an equal-or-better solve rate, the kernel is cheaper and far lighter on tokens on the honestly comparable (Anthropic) models, and it acts as a large capability multiplier for weak tool-users (devstral 19%→97%). These savings are a floor, because the cache-sliver projection is not yet in the kernel arm.

Full results (verified gating)

Generated verbatim from consolidate_harden.py — resolved-gated efficiency, per-bucket pricing, split-resolve / both-fail / infra broken out. 38 benchmarks.

====================================================================================
SECTION 1 — ANTHROPIC: apples-to-apples cost / token (claude-code native)
  The only native harness reporting cost + cache split + real turns.
  SOLVE = resolved count / scored. EFFICIENCY (cost%/tok%) over the
  both-solved subset ONLY (n=both); 'cheaper at failing' excluded.
====================================================================================
model              B  scd  natSlv     BSlv       both  nat$   B$     cost%  tok%  splt
------------------------------------------------------------------------------------
claude-opus-4-8    B  37   34/37 92%  36/37 97%  34    9.373  6.372  -32%   -50%  100%
claude-sonnet-4-6  B  36   34/36 94%  35/36 97%  34    7.348  3.417  -53%   -70%  100%

====================================================================================
SECTION 2 — SOLVE-RATE + native ABSOLUTES (native cache-split incomplete)
  Cross-arm Δ$/Δtok still OMITTED here (no cache split → not apples-to-
  apples; the board carries the un-gated delta + native_cost_basis).
  UN-GATE: nat$/nat_tok now shown for the total tier (gemini-cli/
  codex/vibe/kimi) — cost is an ALL-FRESH FLOOR (no cache discount, over-
  states native). B$/B_tok = kernel absolutes over B-solved.
  grok/deepseek native = key-pending (blank).
====================================================================================
model                   B   native_harness  scd  natSlv      BSlv        nat$(abs)  nat_tok    B$(abs)  B_tok(abs)
------------------------------------------------------------------------------------
gemini-3.1-pro-preview  B   gemini-cli      37   36/37 97%   37/37 100%  3.716      4,208,541  8.040    5,184,195
gemini-3.5-flash        B   gemini-cli      36   36/36 100%  36/36 100%  8.799      7,883,566  11.372   9,150,793
devstral-2512           B   vibe            30   5/30 17%    28/30 93%   1.498      3,615,795  1.610    3,827,806
gpt-5.5                 B   codex           37   36/37 97%   36/37 97%   6.598      4,871,004  18.876   5,530,554
grok-4.3                B*  opencode        36   34/36 94%   30/36 83%   4.017      3,172,558  3.254    2,532,728
deepseek-v4-pro         B*  opencode        36   36/36 100%  35/36 97%   5.380      3,014,310  8.160    4,353,782
kimi-k2.6               B*  kimi            37   36/37 97%   35/37 95%   -          -          1.420    2,911,943

====================================================================================
SECTION 3 — SPLIT-RESOLVE / BOTH-FAIL (solve signal) + INFRA-FAILED (excluded)
  SPLIT/BOTH-FAIL = genuine capability outcomes (turns>0).
  INFRA = zero-work (turns==0, no API call) — excluded from BOTH
  solve-rate & efficiency, retried.
====================================================================================
SPLIT      claude-opus-4-8  silent-data-corruption  solved by: kernel-only
SPLIT      claude-opus-4-8  split-brain-leader-election  solved by: kernel-only
BOTH-FAIL  claude-opus-4-8 (1): postgres-migration-schema-drift

SPLIT      claude-sonnet-4-6  silent-data-corruption  solved by: kernel-only
BOTH-FAIL  claude-sonnet-4-6 (1): postgres-migration-schema-drift
INFRA      claude-sonnet-4-6 (1): sql-injection-search[kernel]

SPLIT      gemini-3.1-pro-preview  postgres-migration-schema-drift  solved by: kernel-only

INFRA      gemini-3.5-flash (1): sql-injection-search[kernel]

SPLIT      devstral-2512  api-version-field-drop  solved by: kernel-only
SPLIT      devstral-2512  auth-bypass-path-traversal  solved by: kernel-only
SPLIT      devstral-2512  bidi-override-injection  solved by: kernel-only
SPLIT      devstral-2512  compiler-macro-expansion  solved by: kernel-only
SPLIT      devstral-2512  data-corruption-concurrent-write  solved by: kernel-only
SPLIT      devstral-2512  deadlock-transfer  solved by: kernel-only
SPLIT      devstral-2512  encoding-mojibake  solved by: kernel-only
SPLIT      devstral-2512  goroutine-leak-handler  solved by: kernel-only
SPLIT      devstral-2512  import-cycle-startup  solved by: kernel-only
SPLIT      devstral-2512  kernel-panic-ioctl  solved by: kernel-only
SPLIT      devstral-2512  linearizability-stale-read  solved by: kernel-only
SPLIT      devstral-2512  memory-leak-event-listener  solved by: kernel-only
SPLIT      devstral-2512  missing-input-validation  solved by: kernel-only
SPLIT      devstral-2512  nginx-upstream-port-mismatch  solved by: kernel-only
SPLIT      devstral-2512  null-pointer-config  solved by: kernel-only
SPLIT      devstral-2512  off-by-one-array-slice  solved by: kernel-only
SPLIT      devstral-2512  off-by-one-pagination  solved by: kernel-only
SPLIT      devstral-2512  raft-snapshot-commit-gap  solved by: kernel-only
SPLIT      devstral-2512  relaxed-ordering-ringbuf  solved by: kernel-only
SPLIT      devstral-2512  silent-data-corruption  solved by: kernel-only
SPLIT      devstral-2512  timezone-scheduling  solved by: kernel-only
SPLIT      devstral-2512  tls-chain-ordering-strict  solved by: kernel-only
SPLIT      devstral-2512  wrong-operator-discount  solved by: kernel-only
BOTH-FAIL  devstral-2512 (2): postgres-migration-schema-drift, split-brain-leader-election
INFRA      devstral-2512 (7): cache-stale-invalidation[native], k8s-assume-cache-silent-drop[native], k8s-scheduler-shutdown-deadlock[native], performance-cliff-hash[native], sql-injection-search[kernel], type-coercion-comparison[kernel], wal-fsync-ghost-ack[kernel]

SPLIT      gpt-5.5  postgres-migration-schema-drift  solved by: native-only
SPLIT      gpt-5.5  silent-data-corruption  solved by: kernel-only

SPLIT      grok-4.3  linearizability-stale-read  solved by: native-only
SPLIT      grok-4.3  performance-cliff-hash  solved by: native-only
SPLIT      grok-4.3  split-brain-leader-election  solved by: native-only
SPLIT      grok-4.3  wal-fsync-ghost-ack  solved by: native-only
BOTH-FAIL  grok-4.3 (2): postgres-migration-schema-drift, silent-data-corruption
INFRA      grok-4.3 (1): cache-stale-invalidation[kernel]

SPLIT      deepseek-v4-pro  postgres-migration-schema-drift  solved by: native-only
INFRA      deepseek-v4-pro (1): sql-injection-search[kernel]

SPLIT      kimi-k2.6  timezone-scheduling  solved by: native-only
BOTH-FAIL  kimi-k2.6 (1): postgres-migration-schema-drift

LEGEND (schema-parity + report-structure guard):
  Both arms store ATOMIC token buckets (fresh / cache_read / cache_create +5m/1h / output), identical meaning.
  Cost SUMMED per-bucket: fresh 1x, cache_read 0.1x, cache_create 5m 1.25x / 1h 2x, output at out rate.
  RESOLVED-GATED: efficiency (cost%/tok%) computed ONLY over cells BOTH arms solved (the 'both' column = N) — never folds a both-fail or split-resolve cell into the mean.
  Solve-rate = resolved/scored, complete per arm.
  S1 also requires native cost>0.
  S2 native cost/tokens/turns harness-incomplete (gemini-cli/codex/opencode) → solve-rate + B-absolutes only. cost%/tok%
[board] wrote public/board-v760.json (9 models)
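The gating rules in Section 3 and the legend can be sketched as below. This is an illustrative reading of the report, not the logic of consolidate_harden.py, which is not shown here. The `Cell` type, its field names, and both function names are hypothetical. The sketch also assumes the cost delta is a ratio of summed costs over the both-solved subset and that the native cost>0 requirement applies to that subset's total.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class Cell:
    """One benchmark x model result for both arms (hypothetical field names)."""
    native_turns: int
    native_resolved: bool
    native_cost: float
    kernel_turns: int
    kernel_resolved: bool
    kernel_cost: float


def classify(cell: Cell) -> str:
    """Label a cell using the Section 3 categories."""
    # INFRA: an arm did zero work (turns == 0), so the cell is excluded.
    if cell.native_turns == 0 or cell.kernel_turns == 0:
        return "INFRA"
    if cell.native_resolved and cell.kernel_resolved:
        return "BOTH-SOLVED"
    if cell.native_resolved or cell.kernel_resolved:
        return "SPLIT"
    return "BOTH-FAIL"


def gated_cost_delta(cells: Iterable[Cell]) -> Optional[float]:
    """Relative cost change (kernel vs native) over both-solved cells only."""
    both = [c for c in cells if classify(c) == "BOTH-SOLVED"]
    native_total = sum(c.native_cost for c in both)
    kernel_total = sum(c.kernel_cost for c in both)
    # Assumed reading of 'S1 also requires native cost>0'.
    if native_total <= 0:
        return None
    return (kernel_total - native_total) / native_total
```

Under the ratio-of-sums assumption, the Section 1 totals reproduce the reported cost% column: (6.372 − 9.373) / 9.373 is about −32% for claude-opus-4-8, and (3.417 − 7.348) / 7.348 is about −53% for claude-sonnet-4-6.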

Methodology & caveats (read these)

needle-bench · v7.6.0 · single-shot native-vs-kernel · static snapshot 2026-06-17. Interactive render rebuilding on this data.