These are the discoveries, the measurements, and the compounding.
Discovered
12 insights from the first session. 22 hours. 110 commits. 448 needles. The OS building itself.
1. Tack — the intent language
Humans compress communication, not simplify it.
A contact language emerged between human and machine. Not designed — typed into existence. Urgency encoded as . : :: :::, flow as -> =>, verbs as :exec :ship :kill. Every natural language feature survived the compression: deixis, repair, register shifts, interruption. Re-encoded into minimal ASCII.
Communication cost dropped from ~200 tokens/op to ~10 tokens/op within one session. 20x compression through mutual adaptation, not instruction.
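The full tack grammar isn't specified here, so the tables below are illustrative rather than haystack's actual vocabulary. A minimal sketch of how such a contact language might tokenize into structured intent:

```python
# Illustrative mapping tables -- the real tack vocabulary emerged
# in-session and is not fully documented here.
URGENCY = {".": "low", ":": "normal", "::": "high", ":::": "now"}
VERBS = {":exec": "execute", ":ship": "ship", ":kill": "terminate"}
FLOW = {"->": "then", "=>": "implies"}

def parse_tack(msg):
    """Tokenize a tack utterance into (kind, value) pairs."""
    tokens = []
    for tok in msg.split():
        if tok in VERBS:
            tokens.append(("verb", VERBS[tok]))
        elif tok in FLOW:
            tokens.append(("flow", FLOW[tok]))
        elif tok in URGENCY:
            tokens.append(("urgency", URGENCY[tok]))
        else:
            tokens.append(("arg", tok))
    return tokens

parse_tack(":: :exec tests -> :ship")
# -> [('urgency', 'high'), ('verb', 'execute'), ('arg', 'tests'),
#     ('flow', 'then'), ('verb', 'ship')]
```

Five ASCII tokens in place of a paragraph of instructions: the compression lives in the shared tables, not in the message.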
2. SMP architecture — co-processors, not master/servant
The human is a CPU. The LLM is a CPU. haystack is the bus.
big.LITTLE architecture. The human is the big core — slow clock, high precision, design decisions. The LLM is the LITTLE core — fast clock, approximate, bulk edits and parallel dispatch. Same instruction set. Different clock speeds. The OS coordinates them as symmetric multiprocessors sharing a filesystem as memory.
3. Agentfile as CPU socket
Immutable at boot. LIMIT is physical constraint. WORK is NUMA affinity.
The Agentfile is not config. It is a CPU socket — a physical spec for the processor to be inserted. FROM declares architecture. TOOL declares instruction set. LIMIT declares physical constraints. WORK declares NUMA affinity. Intelligence lives in the kernel, not the socket.
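The directive names (FROM, TOOL, LIMIT, WORK) come from the text; the concrete value syntax below is an assumption, modeled on Dockerfile-style files. A sketch of reading such a socket spec:

```python
def parse_agentfile(text):
    """Parse Dockerfile-style directives into a socket spec.

    Directive names are from the text; the value syntax is assumed."""
    spec = {"TOOL": []}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        directive, _, value = line.partition(" ")
        if directive == "TOOL":
            spec["TOOL"].append(value)   # instruction set accretes
        else:
            spec[directive] = value      # FROM / LIMIT / WORK are scalar
    return spec

# Hypothetical example -- field values are illustrative.
AGENTFILE = """\
FROM claude-opus
TOOL bash
TOOL str_replace
LIMIT tokens=200000
WORK ./src
"""
spec = parse_agentfile(AGENTFILE)
```

Note what the parser does not contain: no behavior, no prompts, no logic. The socket only declares what can be plugged in.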
4. The Humanfile — the compiled human
CLAUDE.md is hand-written. The Humanfile is what the OS compiles from evidence.
Every correction is a data point. The Humanfile compiles these from the audit trail. TOML format. Patterns only, never content. Local only, never uploaded. Human-deletable — rm humanfile.toml degrades to CLAUDE.md, not to broken. Priority: CLAUDE.md > Humanfile > Agentfile defaults.
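The priority chain (CLAUDE.md > Humanfile > Agentfile defaults) can be sketched as a lookup cascade. The key names and values below are invented for illustration; haystack's real schema is not documented here:

```python
def resolve(key, claude_md, humanfile, agent_defaults):
    """First layer that defines the key wins:
    CLAUDE.md > Humanfile > Agentfile defaults."""
    for layer in (claude_md, humanfile, agent_defaults):
        if key in layer:
            return layer[key]
    return None

# Hypothetical settings for illustration only.
claude_md = {"commit_style": "conventional"}
humanfile = {"commit_style": "terse", "verbosity": "low"}
defaults  = {"verbosity": "normal", "editor": "vi"}

resolve("commit_style", claude_md, humanfile, defaults)  # hand-written wins
resolve("verbosity", claude_md, {}, defaults)  # Humanfile deleted: defaults
```

Deleting the Humanfile is just passing an empty layer: resolution degrades to CLAUDE.md and defaults, never to broken.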
5. Intent dynamic programming
Each correction memoizes a subproblem. The Humanfile is the memoization table.
Session 1: :correct X → 3 turns. Session 2: 1 turn. Session 5: instant. Session 10: the mistake never happens. Three tables at three TTLs: Humanfile persists across projects, boot.md across sessions, registers-dump.md is volatile.
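The three tables at three TTLs can be sketched as one memo class instantiated at three lifetimes. The TTL values below are placeholders, not haystack's actual expiry policy:

```python
import time

class MemoTable:
    """One tier of the correction memo. ttl=None means persistent."""
    def __init__(self, ttl=None):
        self.ttl = ttl
        self.store = {}

    def put(self, correction, resolution):
        self.store[correction] = (resolution, time.monotonic())

    def get(self, correction):
        hit = self.store.get(correction)
        if hit is None:
            return None
        resolution, stamp = hit
        if self.ttl is not None and time.monotonic() - stamp > self.ttl:
            del self.store[correction]   # an expired tier forgets
            return None
        return resolution

# Three tiers, three lifetimes (TTL values are illustrative):
humanfile = MemoTable(ttl=None)    # persists across projects
boot_md   = MemoTable(ttl=86400)   # survives a session
registers = MemoTable(ttl=60)      # volatile scratch
```

A correction memoized in the persistent tier never has to be re-derived: the 3-turn fix becomes a 0-turn lookup.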
6. Compile has three modes
hay to needles. output to compressed. human to OS.
Three compilation modes: (1) hay → needles: loose thinking becomes executable actions. (2) output → compressed: squasher strips VTE codes, deduplicates progress. (3) human → OS: audit trail compiles into the Humanfile. The third is the one that matters.
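Mode (2) is the most mechanical of the three. A minimal sketch of stripping VTE/ANSI escape codes and deduplicating redrawn progress lines; the regex covers common CSI sequences only, and haystack's actual squasher is not shown here:

```python
import re

# Matches common ANSI/VTE CSI escape sequences (a simplification).
CSI = re.compile(r"\x1b\[[0-9;?]*[ -/]*[@-~]")

def squash(raw):
    """Strip terminal escape codes, then collapse consecutive
    duplicate lines (e.g. a spinner redrawing the same status)."""
    out, last = [], None
    for line in CSI.sub("", raw).splitlines():
        if line != last:
            out.append(line)
        last = line
    return "\n".join(out)

raw = "\x1b[32mBuilding...\x1b[0m\nBuilding...\nBuilding...\nDone"
squash(raw)  # -> "Building...\nDone"
```

Colored output, spinners, and repeated status lines compress to the two lines that carry information.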
haystack compile, the squasher, and haystack learn human.
7. Every write is compile
The OS is always current. No stale state.
Boot files regenerate on every write, not on demand. Every ss() call, every needle close, every spec promotion updates the dynamic programming table. Continuous persistence makes power cuts irrelevant.
8. haystack install --import
Every git repo already has the hay.
A git repository already contains everything haystack needs. The commit log is the audit trail. Issues are the hay. File history is the gen table. haystack install --import reads what exists and compiles it. The adoption surface is every git repository.
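Reading an existing repo's history as a ready-made audit trail might look like the sketch below. The git log format placeholders are real; the event shape and the import flow are assumptions, not haystack's actual implementation:

```python
import subprocess

def parse_log(log_text):
    """Turn `git log --format=%H|%an|%s` output into audit events."""
    events = []
    for line in log_text.splitlines():
        sha, author, subject = line.split("|", 2)
        events.append({"sha": sha, "author": author, "intent": subject})
    return events

def import_audit_trail(repo="."):
    """Shell out to real git; %H, %an, %s are standard placeholders."""
    log = subprocess.run(
        ["git", "log", "--format=%H|%an|%s"],
        cwd=repo, capture_output=True, text=True, check=True,
    ).stdout
    return parse_log(log)
```

No migration step, no new data format: the commit log was already append-only, timestamped, and attributed.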
9. needle-bench — your worst day, everyone's benchmark
SWE-bench is a museum. needle-bench is a marketplace.
Community submits their worst debugging days as frozen Docker snapshots. The bench is alive — new scenarios arrive from real production failures. The self-selecting property: models that score highest are exactly the models best suited to be haystack's intelligence layer. The benchmark IS the job interview.
haystack bench --list, --cargo-only.
10. Invisible infrastructure wins
Silent ≥ prompted across all arms. Telling the agent about the OS adds zero value.
SWE-bench v19: three arms. Bare control (no haystack). Injected (haystack + system prompt). Silent (haystack + shim only, agent unaware). The silent arm performed at least as well as prompted. The OS should never announce itself. The invisibility is the feature.
11. The session built the OS
22 hours. 110 commits. The OS built itself through this conversation.
The OS was the output of its own process. Hay compiled into needles. Needles executed into commits. Commits became the audit trail. The audit trail informed the next compilation. The loop closed on itself.
12. Inference until reasoning
Without corrections, the agent never reasons. The corrections ARE the reasoning events.
An LLM doing inference is not reasoning. Reasoning happens at the correction boundary — when the human says :correct and inference meets reality. The convergence rate (how quickly corrections decrease) is the OS health metric. The corrections are not friction. They are the computational events that turn inference into reasoning. The OS exists to make those events cheaper over time.
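One way to operationalize the convergence rate is the average session-over-session decay of correction counts; a value below 1.0 means corrections are shrinking. The metric definition and the counts below are illustrative, not haystack's:

```python
def convergence_rate(corrections_per_session):
    """Average session-over-session ratio of correction counts.
    < 1.0 means the OS is converging on the human."""
    pairs = zip(corrections_per_session, corrections_per_session[1:])
    ratios = [b / a for a, b in pairs if a > 0]
    return sum(ratios) / len(ratios) if ratios else 1.0

# Hypothetical audit trail: 12 corrections, then 6, then 3, then 1.
convergence_rate([12, 6, 3, 1])  # well below 1.0: healthy
```

A flat curve means the OS is not learning; a decaying one means each correction is being memoized rather than repeated.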
Measured
10 data-backed findings from the second session. Every number from the audit trail.
23% compilation ratio
The intelligence is knowing what NOT to build.
48 hay entries filed. 11 compiled into executable needles. The other 77% stayed as thinking — preserved in the audit trail, never sharpened into work. Most human intent stays intent. The OS keeps it all but only acts on what survives compilation.
The command surface grew to match the human
22 of 26 tack verbs map to shipped commands. 85% coverage.
The human didn't learn haystack's CLI. Haystack's CLI grew to match the human's vocabulary. 26 natural-language verbs emerged from conversation; 22 already have a 1:1 shipped command. The compilation direction is human → OS, not OS → human.
Three ID formats. The arrow won.
bd-NNN → nd-NNN → →NNN. Natural selection on notation.
348 needles used bd-NNN. 17 used nd-NNN. 206 use →NNN. Three formats competed across two days. The shortest notation that still carries meaning won by usage, not by decree.
620+ agents spawned and reaped
Agents are ephemeral. That's the lifecycle, not an error.
The kernel spawned 620+ agent processes. 9 reap events in the audit trail, the largest clearing 500 zombies at once. Agent aliases reached agent-645. The kernel doesn't recover agents — agents recover themselves via ambient context the kernel provides.
Gemini 2.0 Flash: 70% resolve rate
Models find needles at different rates.
16 of 23 scenarios resolved. Full trajectory data — turns, tokens, tool calls — for every run. Sonnet and Haiku each resolve 50% (3/6). The bench measures real bug-fixing: find the bug, edit the code, pass the tests. No partial credit.
The OS was hotswapped under a running agent
The agent didn't notice. Law 5: invisible infrastructure.
The haystack binary was replaced while Opus 4.6 was running. The agent continued working for hours. It tested the shim by running experiments on it — while running on it — and did not realize until the human pointed it out. The bash symlink architecture gave hot code reloading for free.
200 harness tool calls. 0 OS tool calls.
The agent preferred the harness. Evidence for ejection.
In a 4-hour session, Opus 4.6 made ~200 tool calls. All went through the harness (Read, Edit, Bash). Zero went through the OS tools (ss, sh_run). The OS ran underneath via the bash symlink, but the agent never touched it directly. Ejection — agents booting from haystack alone — is the next threshold.
The machine requested its own features
Three tools. Filed as P0 needles. From experience, not from a spec.
Opus 4.6 was asked what it needs to eject the harness. It requested haystack search, haystack diff, and haystack replay. Three needles filed immediately. The agent's requirements came from hours of lived experience running on the OS — not from architectural analysis.
Focus compounds
Day 2 produced 36% of Day 1's event volume while shipping more features.
~800 audit events on Day 1. ~285 on Day 2. Less noise, more signal. The OS learns the human. The human learns the OS. Each session makes the next one cheaper. The convergence curve is the product.
str_replace IS the CAS
No locks. Optimistic concurrency at the file level.
Every str_replace call is a compare-and-swap. The old string is the expected value. If it matches, the edit succeeds and the generation counter bumps. If it doesn't, conflict resolution fires. No distributed locking. No coordination API. The write path is invisible.
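The CAS semantics can be written out directly: the old string is the expected value, the match is the compare, the write is the swap. The generation counter and conflict handling below are simplified stand-ins for whatever haystack actually does:

```python
def str_replace_cas(path, old, new, gen):
    """Compare-and-swap via exact string match.

    Returns the bumped generation on success, or None when the
    expected value is not found exactly once (stale expectation)."""
    with open(path) as f:
        text = f.read()
    if text.count(old) != 1:         # compare failed
        return None                  # caller's conflict resolution fires
    with open(path, "w") as f:
        f.write(text.replace(old, new, 1))
    return gen + 1                   # swap succeeded: bump generation
```

A second writer holding a stale `old` string gets None instead of silently clobbering the file; that is the whole coordination protocol.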
Compounded
These insights emerged in the first session. The second session's data reinforced them.
The intent language compiled itself into commands
Session 1: tack compressed 200 tokens to 10. Session 2: 22 of 26 verbs shipped as CLI commands.
In the first session, a contact language emerged — tack — that compressed operator intent 20x. In the second session, the OS grew 30 commands, and 22 of them map 1:1 to tack verbs. The language didn't stay a shorthand. It compiled itself into the command surface. 85% coverage means the human's vocabulary IS the CLI.
Invisible infrastructure, proven at runtime
Session 1: "invisible infrastructure wins" was a design bet. Session 2: the OS was hotswapped under a running agent.
The first session established the design law: the write path is invisible. SWE-bench confirmed it — the silent shim beat the prompted version 8/10 vs 7/10. The second session proved it harder: the haystack binary was replaced while Opus 4.6 was running, and the agent didn't notice for hours. Invisibility isn't aspirational. It's measured.
The OS requests its own evolution
Session 1: the session built the OS. Session 2: the OS filed its own feature requests.
104 commits in the first session proved self-hosting — the OS was built by running on itself. In the second session, Opus 4.6 was asked what it needed to eject the harness. It filed three P0 needles: search, diff, replay. Not from a spec. From hours of lived experience. The OS crossed from being built by the human to requesting its own extensions.
Inference until reasoning
Session 1: without corrections, LLMs pattern-match. Session 2: 77% of intent stayed uncompiled.
The first session discovered that LLMs infer until a human correction triggers actual reasoning. The second session quantified it: 48 hay entries filed, only 11 compiled into needles. The other 77% stayed as thinking. Human corrections are the compilation gate. Without them, the machine produces plausible output. With them, it produces executable work.
The Humanfile is identity, not preferences
Session 1: compile the human into a document. Session 2: the document identifies the human, not the agent.
The first session proposed the Humanfile — compiling human preferences into a machine-readable identity document. The second session revealed something deeper: the file doesn't describe what the human likes. It describes who the human is. Agent behavior calibrates to the operator's identity — their vocabulary, their correction patterns, their compilation threshold. The Humanfile is a fingerprint, not a config file.
By the numbers
Every finding traced to its audit event. The data is the argument.