These are the discoveries, the measurements, and the compounding.
Discovered
12 insights from the first session. 22 hours. 110 commits. 448 needles. The OS building itself.
1. Tack — the intent language
Humans compress communication, not simplify it.
A contact language emerged between human and machine. Not designed — typed into existence. Urgency encoded as . : :: :::, flow as -> =>, verbs as :exec :ship :kill. Every natural language feature survived the compression: deixis, repair, register shifts, interruption. Re-encoded into minimal ASCII.
Communication cost dropped from ~200 tokens/op to ~10 tokens/op within one session. 20x compression through mutual adaptation, not instruction.
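The full tack grammar isn't specified here, so the tables below are illustrative rather than haystack's actual vocabulary. A minimal sketch of how such a contact language might tokenize into structured intent:

```python
# Illustrative mapping tables -- the real tack vocabulary emerged
# in-session and is not fully documented here.
URGENCY = {".": "low", ":": "normal", "::": "high", ":::": "now"}
VERBS = {":exec": "execute", ":ship": "ship", ":kill": "terminate"}
FLOW = {"->": "then", "=>": "implies"}

def parse_tack(msg):
    """Tokenize a tack utterance into (kind, value) pairs."""
    tokens = []
    for tok in msg.split():
        if tok in VERBS:
            tokens.append(("verb", VERBS[tok]))
        elif tok in FLOW:
            tokens.append(("flow", FLOW[tok]))
        elif tok in URGENCY:
            tokens.append(("urgency", URGENCY[tok]))
        else:
            tokens.append(("arg", tok))
    return tokens

parse_tack(":: :exec tests -> :ship")
# -> [('urgency', 'high'), ('verb', 'execute'), ('arg', 'tests'),
#     ('flow', 'then'), ('verb', 'ship')]
```

Five ASCII tokens in place of a paragraph of instructions: the compression lives in the shared tables, not in the message.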
2. SMP architecture — co-processors, not master/servant
The human is a CPU. The LLM is a CPU. haystack is the bus.
big.LITTLE architecture. The human is the big core — slow clock, high precision, design decisions. The LLM is the LITTLE core — fast clock, approximate, bulk edits and parallel dispatch. Same instruction set. Different clock speeds. The OS coordinates them as symmetric multiprocessors sharing a filesystem as memory.
3. Agentfile as CPU socket
Immutable at boot. LIMIT is physical constraint. WORK is NUMA affinity.
The Agentfile is not config. It is a CPU socket — a physical spec for the processor to be inserted. FROM declares architecture. TOOL declares instruction set. LIMIT declares physical constraints. WORK declares NUMA affinity. Intelligence lives in the kernel, not the socket.
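The directive names (FROM, TOOL, LIMIT, WORK) come from the text; the concrete value syntax below is an assumption, modeled on Dockerfile-style files. A sketch of reading such a socket spec:

```python
def parse_agentfile(text):
    """Parse Dockerfile-style directives into a socket spec.

    Directive names are from the text; the value syntax is assumed."""
    spec = {"TOOL": []}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        directive, _, value = line.partition(" ")
        if directive == "TOOL":
            spec["TOOL"].append(value)   # instruction set accretes
        else:
            spec[directive] = value      # FROM / LIMIT / WORK are scalar
    return spec

# Hypothetical example -- field values are illustrative.
AGENTFILE = """\
FROM claude-opus
TOOL bash
TOOL str_replace
LIMIT tokens=200000
WORK ./src
"""
spec = parse_agentfile(AGENTFILE)
```

Note what the parser does not contain: no behavior, no prompts, no logic. The socket only declares what can be plugged in.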
4. The Humanfile — the compiled human
CLAUDE.md is hand-written. The Humanfile is what the OS compiles from evidence.
Every correction is a data point. The Humanfile compiles these from the audit trail. TOML format. Patterns only, never content. Local only, never uploaded. Human-deletable — rm humanfile.toml degrades to CLAUDE.md, not to broken. Priority: CLAUDE.md > Humanfile > Agentfile defaults.
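The priority chain (CLAUDE.md > Humanfile > Agentfile defaults) can be sketched as a lookup cascade. The key names and values below are invented for illustration; haystack's real schema is not documented here:

```python
def resolve(key, claude_md, humanfile, agent_defaults):
    """First layer that defines the key wins:
    CLAUDE.md > Humanfile > Agentfile defaults."""
    for layer in (claude_md, humanfile, agent_defaults):
        if key in layer:
            return layer[key]
    return None

# Hypothetical settings for illustration only.
claude_md = {"commit_style": "conventional"}
humanfile = {"commit_style": "terse", "verbosity": "low"}
defaults  = {"verbosity": "normal", "editor": "vi"}

resolve("commit_style", claude_md, humanfile, defaults)  # hand-written wins
resolve("verbosity", claude_md, {}, defaults)  # Humanfile deleted: defaults
```

Deleting the Humanfile is just passing an empty layer: resolution degrades to CLAUDE.md and defaults, never to broken.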
5. Intent dynamic programming
Each correction memoizes a subproblem. The Humanfile is the memoization table.
Session 1: :correct X → 3 turns. Session 2: 1 turn. Session 5: instant. Session 10: the mistake never happens. Three tables at three TTLs: Humanfile persists across projects, boot.md across sessions, registers-dump.md is volatile.
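The three tables at three TTLs can be sketched as one memo class instantiated at three lifetimes. The TTL values below are placeholders, not haystack's actual expiry policy:

```python
import time

class MemoTable:
    """One tier of the correction memo. ttl=None means persistent."""
    def __init__(self, ttl=None):
        self.ttl = ttl
        self.store = {}

    def put(self, correction, resolution):
        self.store[correction] = (resolution, time.monotonic())

    def get(self, correction):
        hit = self.store.get(correction)
        if hit is None:
            return None
        resolution, stamp = hit
        if self.ttl is not None and time.monotonic() - stamp > self.ttl:
            del self.store[correction]   # an expired tier forgets
            return None
        return resolution

# Three tiers, three lifetimes (TTL values are illustrative):
humanfile = MemoTable(ttl=None)    # persists across projects
boot_md   = MemoTable(ttl=86400)   # survives a session
registers = MemoTable(ttl=60)      # volatile scratch
```

A correction memoized in the persistent tier never has to be re-derived: the 3-turn fix becomes a 0-turn lookup.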
6. Compile has three modes
hay to needles. output to compressed. human to OS.
Three compilation modes: (1) hay → needles: loose thinking becomes executable actions. (2) output → compressed: squasher strips VTE codes, deduplicates progress. (3) human → OS: audit trail compiles into the Humanfile. The third is the one that matters.
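Mode (2) is the most mechanical of the three. A minimal sketch of stripping VTE/ANSI escape codes and deduplicating redrawn progress lines; the regex covers common CSI sequences only, and haystack's actual squasher is not shown here:

```python
import re

# Matches common ANSI/VTE CSI escape sequences (a simplification).
CSI = re.compile(r"\x1b\[[0-9;?]*[ -/]*[@-~]")

def squash(raw):
    """Strip terminal escape codes, then collapse consecutive
    duplicate lines (e.g. a spinner redrawing the same status)."""
    out, last = [], None
    for line in CSI.sub("", raw).splitlines():
        if line != last:
            out.append(line)
        last = line
    return "\n".join(out)

raw = "\x1b[32mBuilding...\x1b[0m\nBuilding...\nBuilding...\nDone"
squash(raw)  # -> "Building...\nDone"
```

Colored output, spinners, and repeated status lines compress to the two lines that carry information.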
haystack compile, the squasher, and haystack learn human.
7. Every write is compile
The OS is always current. No stale state.
Boot files regenerate on every write, not on demand. Every ss() call, every needle close, every spec promotion updates the dynamic programming table. Continuous persistence makes power cuts irrelevant.
8. haystack install --import
Every git repo already has the hay.
A git repository already contains everything haystack needs. The commit log is the audit trail. Issues are the hay. File history is the gen table. haystack install --import reads what exists and compiles it. The adoption surface is every git repository.
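Reading an existing repo's history as a ready-made audit trail might look like the sketch below. The git log format placeholders are real; the event shape and the import flow are assumptions, not haystack's actual implementation:

```python
import subprocess

def parse_log(log_text):
    """Turn `git log --format=%H|%an|%s` output into audit events."""
    events = []
    for line in log_text.splitlines():
        sha, author, subject = line.split("|", 2)
        events.append({"sha": sha, "author": author, "intent": subject})
    return events

def import_audit_trail(repo="."):
    """Shell out to real git; %H, %an, %s are standard placeholders."""
    log = subprocess.run(
        ["git", "log", "--format=%H|%an|%s"],
        cwd=repo, capture_output=True, text=True, check=True,
    ).stdout
    return parse_log(log)
```

No migration step, no new data format: the commit log was already append-only, timestamped, and attributed.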
9. needle-bench — your worst day, everyone's benchmark
SWE-bench is a museum. needle-bench is a marketplace.
Community submits their worst debugging days as frozen Docker snapshots. The bench is alive — new scenarios arrive from real production failures. The self-selecting property: models that score highest are exactly the models best suited to be haystack's intelligence layer. The benchmark IS the job interview.
haystack bench --list, --cargo-only.
10. Invisible infrastructure wins
Silent ≥ prompted across all arms. Telling the agent about the OS adds zero value.
SWE-bench v19: three arms. Bare control (no haystack). Injected (haystack + system prompt). Silent (haystack + shim only, agent unaware). The silent arm performed at least as well as prompted. The OS should never announce itself. The invisibility is the feature.
11. The session built the OS
22 hours. 110 commits. The OS built itself through this conversation.
The OS was the output of its own process. Hay compiled into needles. Needles executed into commits. Commits became the audit trail. The audit trail informed the next compilation. The loop closed on itself.
12. Inference until reasoning
Without corrections, the agent never reasons. The corrections ARE the reasoning events.
An LLM doing inference is not reasoning. Reasoning happens at the correction boundary — when the human says :correct and inference meets reality. The convergence rate (how quickly corrections decrease) is the OS health metric. The corrections are not friction. They are the computational events that turn inference into reasoning. The OS exists to make those events cheaper over time.
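One way to operationalize the convergence rate is the average session-over-session decay of correction counts; a value below 1.0 means corrections are shrinking. The metric definition and the counts below are illustrative, not haystack's:

```python
def convergence_rate(corrections_per_session):
    """Average session-over-session ratio of correction counts.
    < 1.0 means the OS is converging on the human."""
    pairs = zip(corrections_per_session, corrections_per_session[1:])
    ratios = [b / a for a, b in pairs if a > 0]
    return sum(ratios) / len(ratios) if ratios else 1.0

# Hypothetical audit trail: 12 corrections, then 6, then 3, then 1.
convergence_rate([12, 6, 3, 1])  # well below 1.0: healthy
```

A flat curve means the OS is not learning; a decaying one means each correction is being memoized rather than repeated.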
Measured
10 data-backed findings from the second session. Every number from the audit trail.
23% compilation ratio
The intelligence is knowing what NOT to build.
48 hay entries filed. 11 compiled into executable needles. The other 77% stayed as thinking — preserved in the audit trail, never sharpened into work. Most human intent stays intent. The OS keeps it all but only acts on what survives compilation.
The command surface grew to match the human
22 of 26 tack verbs map to shipped commands. 85% coverage.
The human didn't learn haystack's CLI. Haystack's CLI grew to match the human's vocabulary. 26 natural-language verbs emerged from conversation; 22 already have a 1:1 shipped command. The compilation direction is human → OS, not OS → human.
Three ID formats. The arrow won.
bd-NNN → nd-NNN → →NNN. Natural selection on notation.
348 needles used bd-NNN. 17 used nd-NNN. 206 use →NNN. Three formats competed across two days. The shortest notation that still carries meaning won by usage, not by decree.
620+ agents spawned and reaped
Agents are ephemeral. That's the lifecycle, not an error.
The kernel spawned 620+ agent processes. 9 reap events in the audit trail, the largest clearing 500 zombies at once. Agent aliases reached agent-645. The kernel doesn't recover agents — agents recover themselves via ambient context the kernel provides.
Gemini 2.0 Flash: 70% resolve rate
Models find needles at different rates.
16 of 23 scenarios resolved. Full trajectory data — turns, tokens, tool calls — for every run. Sonnet and Haiku each resolve 50% (3/6). The bench measures real bug-fixing: find the bug, edit the code, pass the tests. No partial credit.
The OS was hotswapped under a running agent
The agent didn't notice. Law 5: invisible infrastructure.
The haystack binary was replaced while Opus 4.6 was running. The agent continued working for hours. It tested the shim by running experiments on it — while running on it — and did not realize until the human pointed it out. The bash symlink architecture gave hot code reloading for free.
200 harness tool calls. 0 OS tool calls.
The agent preferred the harness. Evidence for ejection.
In a 4-hour session, Opus 4.6 made ~200 tool calls. All went through the harness (Read, Edit, Bash). Zero went through the OS tools (ss, sh_run). The OS ran underneath via the bash symlink, but the agent never touched it directly. Ejection — agents booting from haystack alone — is the next threshold.
The machine requested its own features
Three tools. Filed as P0 needles. From experience, not from a spec.
Opus 4.6 was asked what it needs to eject the harness. It requested haystack search, haystack diff, and haystack replay. Three needles filed immediately. The agent's requirements came from hours of lived experience running on the OS — not from architectural analysis.
Focus compounds
Day 2 produced 36% of Day 1's event volume while shipping more features.
~800 audit events on Day 1. ~285 on Day 2. Less noise, more signal. The OS learns the human. The human learns the OS. Each session makes the next one cheaper. The convergence curve is the product.
str_replace IS the CAS
No locks. Optimistic concurrency at the file level.
Every str_replace call is a compare-and-swap. The old string is the expected value. If it matches, the edit succeeds and the generation counter bumps. If it doesn't, conflict resolution fires. No distributed locking. No coordination API. The write path is invisible.
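The CAS semantics can be written out directly: the old string is the expected value, the match is the compare, the write is the swap. The generation counter and conflict handling below are simplified stand-ins for whatever haystack actually does:

```python
def str_replace_cas(path, old, new, gen):
    """Compare-and-swap via exact string match.

    Returns the bumped generation on success, or None when the
    expected value is not found exactly once (stale expectation)."""
    with open(path) as f:
        text = f.read()
    if text.count(old) != 1:         # compare failed
        return None                  # caller's conflict resolution fires
    with open(path, "w") as f:
        f.write(text.replace(old, new, 1))
    return gen + 1                   # swap succeeded: bump generation
```

A second writer holding a stale `old` string gets None instead of silently clobbering the file; that is the whole coordination protocol.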
Compounded
These insights emerged in the first session. The second session's data reinforced them.
The intent language compiled itself into commands
Session 1: tack compressed 200 tokens to 10. Session 2: 22 of 26 verbs shipped as CLI commands.
In the first session, a contact language emerged — tack — that compressed operator intent 20x. In the second session, the OS grew 30 commands, and 22 of them map 1:1 to tack verbs. The language didn't stay a shorthand. It compiled itself into the command surface. 85% coverage means the human's vocabulary IS the CLI.
Invisible infrastructure, proven at runtime
Session 1: "invisible infrastructure wins" was a design bet. Session 2: the OS was hotswapped under a running agent.
The first session established the design law: the write path is invisible. SWE-bench confirmed it — the silent shim beat the prompted version 8/10 vs 7/10. The second session proved it harder: the haystack binary was replaced while Opus 4.6 was running, and the agent didn't notice for hours. Invisibility isn't aspirational. It's measured.
The OS requests its own evolution
Session 1: the session built the OS. Session 2: the OS filed its own feature requests.
104 commits in the first session proved self-hosting — the OS was built by running on itself. In the second session, Opus 4.6 was asked what it needed to eject the harness. It filed three P0 needles: search, diff, replay. Not from a spec. From hours of lived experience. The OS crossed from being built by the human to requesting its own extensions.
Inference until reasoning
Session 1: without corrections, LLMs pattern-match. Session 2: 77% of intent stayed uncompiled.
The first session discovered that LLMs infer until a human correction triggers actual reasoning. The second session quantified it: 48 hay entries filed, only 11 compiled into needles. The other 77% stayed as thinking. Human corrections are the compilation gate. Without them, the machine produces plausible output. With them, it produces executable work.
The Humanfile is identity, not preferences
Session 1: compile the human into a document. Session 2: the document identifies the human, not the agent.
The first session proposed the Humanfile — compiling human preferences into a machine-readable identity document. The second session revealed something deeper: the file doesn't describe what the human likes. It describes who the human is. Agent behavior calibrates to the operator's identity — their vocabulary, their correction patterns, their compilation threshold. The Humanfile is a fingerprint, not a config file.
By the numbers
Every finding traced to its audit event. The data is the argument.