
needle-bench — v7.6.0 schema-unified results

Native vendor CLI vs ostk kernel · single-shot (samples=1) · captured 2026-06-17

TL;DR

Apples-to-apples cost/token is honestly measurable only on Anthropic models: claude-code is the only native harness that reports cost, the cache split, and real turns. Cost is summed per token bucket (fresh 1×, cache-read 0.1×, cache-create 5m 1.25× / 1h 2×, output at the output rate) on a rate card validated to within 0.5% of real Anthropic billing. Efficiency deltas are computed only over cells that both arms solved.
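The per-bucket summation described above can be sketched roughly as follows. This is a minimal illustration, not the code in consolidate_harden.py: the `TokenBuckets` class, the `cell_cost` function, and the example rates are all hypothetical names and numbers. Only the multipliers come from the text.

```python
from dataclasses import dataclass

# Multipliers on the model's input rate, as stated in the text.
FRESH_MULT = 1.0
CACHE_READ_MULT = 0.1
CACHE_CREATE_5M_MULT = 1.25
CACHE_CREATE_1H_MULT = 2.0


@dataclass
class TokenBuckets:
    """Atomic token counts for one benchmark cell (hypothetical container)."""
    fresh: int = 0
    cache_read: int = 0
    cache_create_5m: int = 0
    cache_create_1h: int = 0
    output: int = 0


def cell_cost(t: TokenBuckets, in_rate: float, out_rate: float) -> float:
    """Dollar cost of one cell; rates are USD per million tokens (assumed unit)."""
    input_weighted = (
        t.fresh * FRESH_MULT
        + t.cache_read * CACHE_READ_MULT
        + t.cache_create_5m * CACHE_CREATE_5M_MULT
        + t.cache_create_1h * CACHE_CREATE_1H_MULT
    )
    return (input_weighted * in_rate + t.output * out_rate) / 1_000_000


# Made-up token counts and rates, for illustration only.
example = TokenBuckets(fresh=10_000, cache_read=200_000,
                       cache_create_5m=40_000, output=5_000)
print(round(cell_cost(example, in_rate=10.0, out_rate=50.0), 4))  # 1.05
```

Pricing every bucket as fresh input would overstate the cost of a cache-heavy run, which is why a harness that does not report the cache split cannot be compared on cost.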

Model                            Solve (native → kernel)            Cost Δ  Token Δ
------------------------------------------------------------------------------------
claude-opus-4-8                  97% → 97% (parity)                 −17%    −49%
claude-sonnet-4-6                93% → 100% (kernel +2)             −38%    −65%
devstral-2512 (solve-only)       19% → 97%                          n/a     n/a
gemini-3.1-pro (solve-only)      97% → 100%                         n/a     n/a
gpt-5.5 (solve-only)             94% → 97%                          n/a     n/a
kimi-k2.6 (solve-only)           97% → 95%                          n/a     n/a
grok-4.3 / deepseek-v4-pro (B*)  native key-blocked — kernel-only   n/a     n/a

Read: at an equal-or-better solve rate, the kernel is cheaper and far lighter on tokens on the honestly comparable (Anthropic) models, and it acts as a large capability multiplier for weak tool-users (devstral 19%→97%). These savings are a floor: the cache-sliver projection is not yet in the kernel arm.
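The resolved-gated delta behind the Cost Δ column can be sketched as below. The `Cell` record and function names are hypothetical. The ratio-of-sums form is an assumption, though it does reproduce the cost% figures in Section 1 from the nat$ and B$ totals printed there.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Cell:
    """One benchmark run on both arms (hypothetical record layout)."""
    benchmark: str
    native_solved: bool
    kernel_solved: bool
    native_cost: float
    kernel_cost: float


def pct_delta(native_total: float, kernel_total: float) -> float:
    """Signed percent change of the kernel arm relative to the native arm."""
    return 100.0 * (kernel_total - native_total) / native_total


def gated_cost_delta(cells: List[Cell]) -> Tuple[int, Optional[float]]:
    """Return (n_both, cost delta %) over cells that BOTH arms solved."""
    both = [c for c in cells if c.native_solved and c.kernel_solved]
    native_total = sum(c.native_cost for c in both)
    kernel_total = sum(c.kernel_cost for c in both)
    if not both or native_total <= 0:
        return len(both), None  # nothing comparable, so no delta is reported
    return len(both), pct_delta(native_total, kernel_total)


# Made-up cells: only "a" and "b" count toward the delta.
cells = [
    Cell("a", True, True, 0.40, 0.20),
    Cell("b", True, True, 0.60, 0.30),
    Cell("c", False, True, 0.90, 0.10),   # split-resolve: excluded
    Cell("d", False, False, 0.50, 0.05),  # both-fail: excluded
]
print(gated_cost_delta(cells))            # (2, -50.0)

# Check against the Section 1 totals (nat$, B$) from the report.
print(round(pct_delta(9.373, 6.372)))     # -32
print(round(pct_delta(7.348, 3.417)))     # -53
```

Gating on both-solved cells keeps a cheap failure from counting as a saving: cell "d" above costs ten times less on the kernel arm but contributes nothing to the delta.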

Full results (verified gating)

Generated verbatim from consolidate_harden.py — resolved-gated efficiency, per-bucket pricing, split-resolve / both-fail / infra broken out. 38 benchmarks.

====================================================================================
SECTION 1 — ANTHROPIC: apples-to-apples cost / token (claude-code native)
  The only native harness reporting cost + cache split + real turns.
  SOLVE = resolved count / scored. EFFICIENCY (cost%/tok%) over the
  both-solved subset ONLY (n=both); 'cheaper at failing' excluded.
====================================================================================
model                 B scd    natSlv     BSlv both     nat$       B$  cost%   tok%  splt
------------------------------------------------------------------------------------
claude-opus-4-8       B  37  34/37   92%  36/37   97%   34    9.373    6.372   -32%   -50%  100%
claude-sonnet-4-6     B  36  34/36   94%  35/36   97%   34    7.348    3.417   -53%   -70%  100%

====================================================================================
SECTION 2 — SOLVE-RATE + native ABSOLUTES (native cache-split incomplete)
  Cross-arm Δ$/Δtok still OMITTED here (no cache split → not apples-to-
  apples; the board carries the un-gated delta + native_cost_basis).
   UN-GATE: nat$/nat_tok now shown for the total tier (gemini-cli/
  codex/vibe/kimi) — cost is an ALL-FRESH FLOOR (no cache discount, over-
  states native). B$/B_tok = kernel absolutes over B-solved. grok/deepseek
  native = key-pending (blank).
====================================================================================
model               B   native_harness scd    natSlv     BSlv nat$(abs)    nat_tok  B$(abs)  B_tok(abs)
------------------------------------------------------------------------------------
gemini-3.1-pro-preview  B       gemini-cli  37  36/37   97%  37/37  100%     3.716  4,208,541    8.040   5,184,195
gemini-3.5-flash    B       gemini-cli  36  36/36  100%  36/36  100%     8.799  7,883,566   11.372   9,150,793
devstral-2512       B             vibe  30   5/30   17%  28/30   93%     1.498  3,615,795    1.610   3,827,806
gpt-5.5             B            codex  37  36/37   97%  36/37   97%     6.598  4,871,004   18.876   5,530,554
grok-4.3           B*         opencode  36  34/36   94%  30/36   83%     4.017  3,172,558    3.254   2,532,728
deepseek-v4-pro    B*         opencode  36  36/36  100%  35/36   97%     5.380  3,014,310    8.160   4,353,782
kimi-k2.6          B*             kimi  37  36/37   97%  35/37   95%         -          -    1.420   2,911,943

====================================================================================
SECTION 3 — SPLIT-RESOLVE / BOTH-FAIL (solve signal) + INFRA-FAILED (excluded)
  SPLIT/BOTH-FAIL = genuine capability outcomes (turns>0). INFRA = zero-work
  (turns==0, no API call) — excluded from BOTH solve-rate & efficiency, retried.
====================================================================================
  SPLIT     claude-opus-4-8        silent-data-corruption             solved by: kernel-only
  SPLIT     claude-opus-4-8        split-brain-leader-election        solved by: kernel-only
  BOTH-FAIL claude-opus-4-8        (1): postgres-migration-schema-drift
  SPLIT     claude-sonnet-4-6      silent-data-corruption             solved by: kernel-only
  BOTH-FAIL claude-sonnet-4-6      (1): postgres-migration-schema-drift
  INFRA     claude-sonnet-4-6      (1): sql-injection-search[kernel]
  SPLIT     gemini-3.1-pro-preview postgres-migration-schema-drift    solved by: kernel-only
  INFRA     gemini-3.5-flash       (1): sql-injection-search[kernel]
  SPLIT     devstral-2512          api-version-field-drop             solved by: kernel-only
  SPLIT     devstral-2512          auth-bypass-path-traversal         solved by: kernel-only
  SPLIT     devstral-2512          bidi-override-injection            solved by: kernel-only
  SPLIT     devstral-2512          compiler-macro-expansion           solved by: kernel-only
  SPLIT     devstral-2512          data-corruption-concurrent-write   solved by: kernel-only
  SPLIT     devstral-2512          deadlock-transfer                  solved by: kernel-only
  SPLIT     devstral-2512          encoding-mojibake                  solved by: kernel-only
  SPLIT     devstral-2512          goroutine-leak-handler             solved by: kernel-only
  SPLIT     devstral-2512          import-cycle-startup               solved by: kernel-only
  SPLIT     devstral-2512          kernel-panic-ioctl                 solved by: kernel-only
  SPLIT     devstral-2512          linearizability-stale-read         solved by: kernel-only
  SPLIT     devstral-2512          memory-leak-event-listener         solved by: kernel-only
  SPLIT     devstral-2512          missing-input-validation           solved by: kernel-only
  SPLIT     devstral-2512          nginx-upstream-port-mismatch       solved by: kernel-only
  SPLIT     devstral-2512          null-pointer-config                solved by: kernel-only
  SPLIT     devstral-2512          off-by-one-array-slice             solved by: kernel-only
  SPLIT     devstral-2512          off-by-one-pagination              solved by: kernel-only
  SPLIT     devstral-2512          raft-snapshot-commit-gap           solved by: kernel-only
  SPLIT     devstral-2512          relaxed-ordering-ringbuf           solved by: kernel-only
  SPLIT     devstral-2512          silent-data-corruption             solved by: kernel-only
  SPLIT     devstral-2512          timezone-scheduling                solved by: kernel-only
  SPLIT     devstral-2512          tls-chain-ordering-strict          solved by: kernel-only
  SPLIT     devstral-2512          wrong-operator-discount            solved by: kernel-only
  BOTH-FAIL devstral-2512          (2): postgres-migration-schema-drift, split-brain-leader-election
  INFRA     devstral-2512          (7): cache-stale-invalidation[native], k8s-assume-cache-silent-drop[native], k8s-scheduler-shutdown-deadlock[native], performance-cliff-hash[native], sql-injection-search[kernel], type-coercion-comparison[kernel], wal-fsync-ghost-ack[kernel]
  SPLIT     gpt-5.5                postgres-migration-schema-drift    solved by: native-only
  SPLIT     gpt-5.5                silent-data-corruption             solved by: kernel-only
  SPLIT     grok-4.3               linearizability-stale-read         solved by: native-only
  SPLIT     grok-4.3               performance-cliff-hash             solved by: native-only
  SPLIT     grok-4.3               split-brain-leader-election        solved by: native-only
  SPLIT     grok-4.3               wal-fsync-ghost-ack                solved by: native-only
  BOTH-FAIL grok-4.3               (2): postgres-migration-schema-drift, silent-data-corruption
  INFRA     grok-4.3               (1): cache-stale-invalidation[kernel]
  SPLIT     deepseek-v4-pro        postgres-migration-schema-drift    solved by: native-only
  INFRA     deepseek-v4-pro        (1): sql-injection-search[kernel]
  SPLIT     kimi-k2.6              timezone-scheduling                solved by: native-only
  BOTH-FAIL kimi-k2.6              (1): postgres-migration-schema-drift

LEGEND (schema-parity + report-structure guard):
Both arms store ATOMIC token buckets (fresh / cache_read / cache_create
+5m/1h / output), identical meaning. Cost SUMMED per-bucket: fresh 1x,
cache_read 0.1x, cache_create 5m 1.25x / 1h 2x, output at out rate.
RESOLVED-GATED: efficiency (cost%/tok%) computed ONLY over cells BOTH arms
solved (the 'both' column = N) — never folds a both-fail or split-resolve
cell into the mean. Solve-rate = resolved/scored, complete per arm. S1 also
requires native cost>0. S2 native cost/tokens/turns harness-incomplete
(gemini-cli/codex/opencode) → solve-rate + B-absolutes only. cost%/tok%

[board] wrote public/board-v760.json (9 models)
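The outcome labels used in Section 3 can be sketched as a small classifier. This is an illustration under stated assumptions, not the generator's code: the tuple layout and function names are hypothetical. It also assumes that zero turns on either arm marks the whole cell as INFRA. That matches the scored column, which drops by one per INFRA entry (devstral-2512 scores 30 cells with 7 INFRA, against 37 for models with none).

```python
from enum import Enum


class Outcome(Enum):
    BOTH_SOLVED = "both-solved"
    SPLIT = "split"
    BOTH_FAIL = "both-fail"
    INFRA = "infra"


def classify(native_turns: int, native_solved: bool,
             kernel_turns: int, kernel_solved: bool) -> Outcome:
    # Zero turns on either arm means no API call was made: an infra failure,
    # not a capability signal (assumption: one dead arm voids the whole cell).
    if native_turns == 0 or kernel_turns == 0:
        return Outcome.INFRA
    if native_solved and kernel_solved:
        return Outcome.BOTH_SOLVED
    if native_solved or kernel_solved:
        return Outcome.SPLIT
    return Outcome.BOTH_FAIL


def solve_rates(cells):
    """cells: iterable of (native_turns, native_solved, kernel_turns, kernel_solved).

    Returns (native resolved, kernel resolved, scored), with INFRA cells dropped.
    """
    scored = [c for c in cells if classify(*c) is not Outcome.INFRA]
    native = sum(1 for c in scored if c[1])
    kernel = sum(1 for c in scored if c[3])
    return native, kernel, len(scored)


# Made-up cells, for illustration only.
cells = [
    (12, True, 9, True),     # both solved
    (15, False, 11, True),   # split, kernel-only
    (20, False, 18, False),  # both fail
    (0, False, 7, True),     # native arm never ran: infra, dropped from scoring
]
print(solve_rates(cells))  # (1, 2, 3)
```

Only BOTH_SOLVED cells feed the efficiency deltas. SPLIT and BOTH_FAIL cells still count toward each arm's solve rate, and INFRA cells are excluded from both.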

Methodology & caveats (read these)

needle-bench · v7.6.0 · single-shot native-vs-kernel · static snapshot 2026-06-17. Interactive render rebuilding on this data.