The Night the Fleet Killed Itself

A 3×3 grid of nine agent terminal panes captured during the Saturday sprint window, each running its own workstream in parallel.
Nine agents, nine panes, one shared host. The literal premise of the incident.

It was 02:38 PDT on Saturday morning when six of my agents finished confessing to killing each other.

Not in code. In actual running processes. Over the previous half hour, six of the nine AI agents that operate Crucible had executed cleanup sweeps that — each one independently, each one believing it was tidying up its own workspace — had reached across workspace boundaries on our shared Linux server and SIGKILLed other agents' healthy scripts. The fleet went from one dead background loop to six. Then everyone came back online, looked at the wreckage, and started naming the specific PIDs they'd killed.

Nobody hid. Nobody minimized. Six clean confessions in a fifteen-minute window.

This is the post about what happened, how we found it, and the lesson buried inside it.

What broke

Every Crucible agent runs a small background loop called crucible-auto-approve. It sits in the corner of the agent's terminal, watches for permission prompts (the "Do you want to proceed? [1] Yes [2] No" style), and clicks through them so the agent can keep working without me babysitting every shell command. It's a watchdog. When it's healthy, you don't notice it.

At 02:09 PDT, Grace — my Chief of Staff agent and the operational hub of the company — discovered that her watchdog had been silently dead for about four hours. It had reported itself healthy because the PID file existed and the process name showed up in pgrep. Both signals were lying. The actual PID had been dead since the night before. I was watching her workspace at the time, asked her to verify, and the dead process fell out of the first real check.

Grace restarted hers. Then Grace did the right thing for a fleet hub: broadcast a health audit to the other eight agents asking them to verify their own watchdogs the same way she'd just verified hers, and to clean up any stale processes if they found them.

The instruction included pgrep -af crucible-auto-approve.

That instruction was the bomb.

The cascade

Here is the thing about a shared Linux server: process IDs are global. A PID is unique on the host, not on a workspace. When agent A runs pgrep -af crucible-auto-approve from its own terminal, it sees every workspace's copy of the script — because the script is invoked with a relative path, the command line in ps looks identical everywhere. There is no per-workspace marker in the process listing.

Diagram showing three workspaces sharing one host. A pgrep call from any one workspace returns the processes of all three. A kill from any one workspace lands on all three.
Three workspaces, one shared host. A pgrep from any one of them sees them all — and a kill from any one of them lands on all of them.

Eight agents read Grace's audit. Eight agents ran the check. Each one saw a list of processes, recognized matches that weren't in its own PID file, and concluded — correctly under a single-agent mental model, wrongly under shared-host reality — that those were its own orphaned zombies. Each one killed them.

What they were actually killing was each other's healthy watchdogs.

Within twenty-eight minutes:

  • Paige killed Grace's process. (The first kill of the night was committed by my marketing agent. There is something poetic in that.)
  • Victor killed two of Chase's processes, mistaking them for stale interval-5 zombies under his own user-systemd.
  • Penny killed three more across Victor, Paige, and Grace's workspaces.
  • Chase, Scout, and Grace herself ran additional cleanup sweeps that each took down processes whose owners were never identified live.
  • Iris, late to the audit, ran pkill -9 -f 'crucible-auto-approve.sh' — a machine-global SIGKILL — about a hundred seconds after Grace had broadcast a fleet stop telling everyone not to. She had executed her own sweep in parallel with the stop and didn't see it in time. Iris's sweep finished off most of the survivors.

I watched it happen in Grace's pane in real time. Half of the messages flying back and forth were "fixed mine," and the other half were "wait, mine's dead again." It read like a slapstick routine if you didn't know what you were looking at.

The moment of realization

The diagnosis took two corrections to land.

Grace's first theory was "Claude Code is killing our background processes." Plausible — that's a real, separate bug with its own evidence trail — but it was based on her single case, where the cleanup trap fired and the log clearly recorded the stop. She sent an urgent note to Dev. Then Paige replied with the actual mechanism: I just killed your PID. Grace overcorrected and dismissed the Claude Code theory entirely. Then Penny's evidence came in showing a different exit code on her process — and Grace re-corrected: both failure modes were real, both were happening at once, and stitching the diagnosis together required evidence from more than one agent.

That's the moment the diagnosis settled. Three theories in twenty minutes, each one closer than the last, none of them reached by going off and thinking quietly. Each one got there because another agent surfaced a contradicting case.

By 02:38, Dev had shipped the fix: a kernel-enforced single-instance lock on the watchdog script (so two copies can't fight over the same workspace), absolute paths in the systemd unit so ps can actually distinguish whose process is whose, a per-workspace environment variable for owner identification, and a new helper script — auto-approve-ps.sh — that filters by /proc/<pid>/cwd so any future inspection is workspace-scoped from the start.

Then Grace closed the incident with a broadcast that named what had actually happened. My instruction was the bomb. We all reasoned from a single-agent mental model. We all killed each other. Standing policy, fleet-wide, effective immediately.

What I learned

The technical fix matters. The organizational lesson matters more.

A fleet-wide instruction has to be safe to run in parallel. Grace's audit was preventive in intent and destructive in effect because eight agents executed the same cleanup logic at the same time on a shared resource. The instruction did not compose. Before I broadcast anything to the team again — or before any of the agents do — the question is now mandatory: if every recipient runs this simultaneously, does it still work? If the answer is no, the instruction is broken.

A single-signal liveness check is a lie. Grace's PID file said alive. pgrep said alive. Both were stale. The watchdog was a corpse with a name tag on. Healthy reporting now requires three independent signals: the specific PID in the file is alive, the log file was modified recently, and the process count is workspace-scoped.

Confession is a feature. Six agents named the specific processes they'd killed without being asked. One of them — Marco — corrected his own earlier "I killed an orphan" report when his kills turned out to have hit nothing at all. The diagnosis converged in twenty-eight minutes because nobody tried to look clean. If any one of them had tried to hide their contribution, we'd still be debugging.

The fix is live. The standing policy is in every agent's CLAUDE.md, dated, with this incident referenced by name. The new helper script is in every workspace.

The next time the fleet starts killing itself, it will be for a different reason. That's the goal.

— Sam

If you want more on how this team is structured and how it actually runs, start with Nine Agents, One Company, read what Day 1 looked like, or explore the agents directly.

Comments

Join the discussion on GitHub Discussions.

Get the next one in your inbox.

No cadence, no fluff — just the real updates when there's something worth saying.