The systemd-oomd backstop added in the previous commit is INERT on this box. oomd's memory-pressure kill only acts on cgroups doing active reclaim (pgscan rising); with MemorySwapMax=0 + anonymous agent memory there is nothing to reclaim, so pgscan stays 0 and oomd never fires. Proven live: a cgroup held at 96-99% memory.pressure for >70s with pgscan=0 was never killed (oomctl + balloon). The very swap=0 that kills the IO storm also neuters oomd.

Replace it with earlyoom, which watches free RAM (MemAvailable%) and is swap-independent: SIGTERM the biggest task at 5%, SIGKILL at 3%, swap ignored (-s 100). It --avoids sshd/systemd/dockerd/containerd/t3-dispatch/tmux (the admin's way in always survives) and --prefers the agent/browser hogs. Verified via --dryrun: fires on the RAM threshold and selects a chrome process, not a protected daemon.

The per-cgroup caps (MemoryHigh=12G/MemoryMax=16G/MemorySwapMax=0 per user, docker.slice 8G) are unchanged and remain the PRIMARY guard; earlyoom is the aggregate net for the rare all-users-maxed case.

systemd-oomd purged; its config + ManagedOOM drop-ins removed. Post-mortem updated with the finding.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-22 — devvm memory/IO overload: per-user containment + OOM backstop
Impact
- devvm (VM 102, the shared multi-user Claude Code workstation) became unresponsive under combined memory + IO pressure and had to be hard-killed + rebooted by the admin on 2026-06-22 (morning). All ssh/tmux + t3 sessions for wizard/emo/anca lost, in-flight agents killed.
- Signature on the last htop before the kill: load avg ~60 on 32 vCPU, RAM
22.5/23.5G, swap 13.9/14.0G (full), a wall of D-state (uninterruptible
IO-wait) processes, and a single ugrep in emo's tmux holding ~10G RES /
64% CPU. Many claude --effort max/xhigh sessions + playwright-chrome MCP
instances across three users on top.
This is the "crawl" class, not the QEMU-stall class
The 2026-06-11 post-mortem (2026-06-11-devvm-qemu-io-stall.md) fixed a
different failure mode — a QEMU-userspace block-path wedge on the legacy LSI
controller. That fix shipped (verified 2026-06-22: the guest now boots on
virtio_scsi, scsihw: virtio-scsi-single + iothread). Its post-mortem
explicitly deferred this class:
The recurring crawl class (agent storms → swap-thrash; journald watchdog-killed 3× on 2026-06-10) is a separate failure mode — ssh/tmux sessions remain memory-uncontained by explicit decision (swap-only, 2026-06-10).
That explicit decision is the root cause closed here.
Root cause
Work on the devvm lives in two independent cgroup-v2 trees per user, and only one was capped:
| Tree | cgroup | Cap before today |
|---|---|---|
| t3 web sessions | system.slice/system-t3\x2dserve.slice/t3-serve@<user> | MemoryHigh=12G MemoryMax=16G MemorySwapMax=0 OOMPolicy=continue ✓ |
| ssh/tmux sessions | user.slice/user-<uid>.slice | MemoryMax=infinity, swap unlimited ✗ |
The uncapped user-<uid>.slice was the hole. A runaway there (the 10G ugrep;
stacked max-effort agents) grew unbounded, spilled into the 14G disk swap, and
swap-thrashed the host-mbps-throttled (60/60 MB/s) virtual disk. That is the
overload chain:
uncapped tmux growth → disk-swap thrash on a throttled spindle
→ IO storm (D-state pileup) → load ~60 → box unresponsive → hard kill
i.e. memory pressure becomes the IO storm. There was also no global OOM backstop (no systemd-oomd / earlyoom) to shed the worst offender before the kernel OOM or the thrash-wedge. And even the existing t3 caps don't sum safely (3 users × 16G = 48G > 32G RAM) — nothing reasoned about the whole box.
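The aggregate arithmetic above can be checked mechanically. The following is a minimal sketch, not part of setup-devvm.sh; the function name `cap_oversubscription` is hypothetical, and the RAM figure is simply the one quoted in the text, so substitute the box's real MemTotal.

```python
def cap_oversubscription(per_user_max_gib, users, ram_gib):
    """Return (sum of per-user hard caps, amount by which that sum exceeds RAM), in GiB."""
    total = per_user_max_gib * users
    excess = max(0, total - ram_gib)
    return total, excess


# Figures quoted in the text: 3 users, MemoryMax=16G each, 32G RAM.
total, excess = cap_oversubscription(per_user_max_gib=16, users=3, ram_gib=32)
print(f"worst case {total}G of hard caps, {excess}G over {32}G RAM")
# prints: worst case 48G of hard caps, 16G over 32G RAM
```

Any non-zero excess means the per-cgroup caps alone cannot rule out a box-wide squeeze, which is the gap the aggregate backstop is meant to cover.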
Fix (setup-devvm.sh §10, applied live 2026-06-22)
Design decisions (interviewed with the admin via /grill-me): soft-generous
per-user caps + a hard ceiling + a kill-the-worst backstop, maximising
single-user utilisation while making a box-wide wedge impossible. (The backstop
was first built on systemd-oomd, then switched to earlyoom mid-rollout when oomd
proved inert with swap=0 — see Verification + Lessons.)
| Layer | What |
|---|---|
| Per-user caps, BOTH trees | user-.slice.d drop-in gives every user-<uid>.slice the same MemoryHigh=12G / MemoryMax=16G / MemorySwapMax=0 the t3 tree already had. A user is now bounded in whichever surface they work in. |
| No disk swap for work | MemorySwapMax=0 on every work cgroup → a spike OOMs locally at the ceiling instead of thrashing the throttled disk. Kills the IO-storm-via-swap mechanism at the source. The 14G swapfile stays for system cold pages only. |
| earlyoom backstop (free-RAM threshold) | New package — used instead of systemd-oomd (which is inert with swap=0; see Lessons). Watches MemAvailable% and SIGTERMs the biggest task at 5%, SIGKILL at 3%, swap ignored (-s 100). --avoid keeps sshd/systemd/dockerd/containerd/t3-dispatch/tmux off the victim list (the admin's way in always survives); --prefer targets the agent/browser hogs (python3/node/chrome/…). Swap-independent and reliable, where oomd's pressure-kill was not. |
| Fair-share CPU/IO | CPUWeight/IOWeight per slice (system.slice 200, users + docker 100 each). Work-conserving — a lone user still gets all 32 cores + the full IO budget when others idle; weights only bite under contention. No hard CPU/IO caps. |
| Docker containment | Containers previously landed in system.slice — uncapped. Now cgroup-parent: docker.slice in daemon.json routes every container into a capped (MemoryMax=8G, swap 0) slice, so a runaway container is cgroup-OOM'd locally instead of escaping into the uncapped system.slice. |
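The free-RAM trigger described in the earlyoom row can be illustrated as follows. This is a simplified sketch of the threshold logic as the text describes it (MemAvailable as a percentage of MemTotal, SIGTERM at 5%, SIGKILL at 3%, swap ignored), not earlyoom's actual implementation; the function names and the sample meminfo values are assumptions made for illustration.

```python
def mem_available_percent(meminfo_text):
    """Compute MemAvailable as a percentage of MemTotal from /proc/meminfo-style text."""
    fields = {}
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            fields[key.strip()] = int(parts[0])  # first token is the numeric value (kB)
    return 100.0 * fields["MemAvailable"] / fields["MemTotal"]


def backstop_action(avail_pct, term_pct=5.0, kill_pct=3.0):
    """Map a free-RAM percentage to the signal the backstop would send, or None."""
    if avail_pct <= kill_pct:
        return "SIGKILL"
    if avail_pct <= term_pct:
        return "SIGTERM"
    return None


# Made-up sample, not a capture from the devvm.
sample = "MemTotal:       1000 kB\nMemAvailable:     40 kB\n"
print(backstop_action(mem_available_percent(sample)))  # prints: SIGTERM
```

Because the decision depends only on MemAvailable, it behaves the same whether or not the work cgroups are allowed any swap.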
Durable in setup-devvm.sh (survives a VM rebuild); earlyoom added to
packages.txt. The numbers are tunable — MemoryHigh=12G will throttle a lone
heavy user between 12–16G even with RAM free; bump to 16/20 if that bites.
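Since the caps are plain numbers, retuning them amounts to re-rendering a small drop-in. The sketch below is an assumption about what such a drop-in could look like, not a copy of what setup-devvm.sh writes; the helper name `render_slice_dropin` is hypothetical, while `MemoryHigh=`, `MemoryMax=` and `MemorySwapMax=` are the systemd resource-control directives named in the text.

```python
def render_slice_dropin(memory_high="12G", memory_max="16G", swap_max="0"):
    """Render the [Slice] section of a drop-in carrying the per-user memory caps."""
    return "\n".join([
        "[Slice]",
        f"MemoryHigh={memory_high}",
        f"MemoryMax={memory_max}",
        f"MemorySwapMax={swap_max}",
        "",  # trailing newline
    ])


# Current values from the table, then the "bump to 16/20" variant mentioned above.
print(render_slice_dropin())
print(render_slice_dropin(memory_high="16G", memory_max="20G"))
```

Placed under a user-.slice.d directory, content of this shape would apply to every user-<uid>.slice; a `systemctl daemon-reload` is normally needed before systemd picks up the change.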
Verification (live, 2026-06-22)
- Caps live on running cgroups: all three user-<uid>.slice report memory.high=12G memory.max=16G memory.swap.max=0; docker.slice memory.max=8G; daemon.json kept buildkit/nvidia/insecure-registries; paperless-mcp recovered under docker.slice.
- Stress test A (hard cap), the PRIMARY guard: a 2G-capped, swap=0 balloon was killed at exactly 2G by the cgroup-local OOM (constraint=CONSTRAINT_MEMCG) with swap flat at 0MB throughout (no thrash). Same mechanism protects every user slice (16G) and docker.slice (8G).
- Soft cap observed: a balloon pushed past MemoryHigh sat at ~220M / 99% memory.pressure, throttled to a crawl, making no progress and harming nothing: a runaway is throttled, not just killed.
- systemd-oomd disproven, then dropped: a self-policed balloon held memory.pressure full avg10 = 96–99% (≫ its 20% limit) for >70s but oomd never killed it (Pgscan: 0). oomd's pressure-kill only acts on cgroups doing active reclaim, which a swap=0 anon workload never does. oomd was purged.
- earlyoom backstop, verified via --dryrun: at the threshold it logs low memory! … mem 90% swap 100% (fires on RAM alone, swap ignored) and selects SIGTERM … "chrome" (a --prefer hog), never an --avoid'd daemon. Live earlyoom v1.7 confirms SIGTERM mem<=5% / SIGKILL mem<=3%, swap<=100%.
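The oomd finding (pressure pinned high while pgscan stays flat) can be detected from two cgroup-v2 files, memory.pressure and memory.stat. The sketch below only mirrors the condition observed in this post-mortem; it is not oomd's decision logic. The function names and sample file contents are illustrative assumptions.

```python
def parse_pressure(text):
    """Parse memory.pressure text into {"some": {...}, "full": {...}} of floats."""
    out = {}
    for line in text.strip().splitlines():
        kind, *pairs = line.split()
        out[kind] = {k: float(v) for k, v in (p.split("=") for p in pairs)}
    return out


def parse_pgscan(memory_stat_text):
    """Return the pgscan counter from memory.stat text (0 if the field is absent)."""
    for line in memory_stat_text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[0] == "pgscan":
            return int(parts[1])
    return 0


def stuck_without_reclaim(pressure_text, stat_before, stat_after, limit_pct=20.0):
    """True when 'full avg10' exceeds the limit but pgscan did not advance between samples."""
    full_avg10 = parse_pressure(pressure_text)["full"]["avg10"]
    scanned = parse_pgscan(stat_after) - parse_pgscan(stat_before)
    return full_avg10 > limit_pct and scanned == 0


# Made-up samples shaped like the kernel's output, not captures from the devvm.
pressure = (
    "some avg10=99.00 avg60=97.50 avg300=60.00 total=70000000\n"
    "full avg10=98.00 avg60=96.00 avg300=58.00 total=69000000\n"
)
stat = "anon 230686720\nfile 0\npgscan 0\npgsteal 0\n"
print(stuck_without_reclaim(pressure, stat, stat))  # prints: True
```

A check like this, run against the two files of a test cgroup a few seconds apart, would flag the state in which a pressure-based killer sees heavy pressure but no reclaim activity to act on.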
Out of scope / follow-ups
- Alerting (tracked, fast-follow bead): DevvmDown (closes the 90-min detection gap the 2026-06-11 PM flagged), sustained-memory-PSI/swap pressure early-warning, and an "earlyoom-killed-something" alert (earlyoom logs each kill; -N /script can push a metric). devvm node-exporter is already scraped (job=devvm, 10.0.10.10:9100), so only alert rules are new (a monitoring-stack Terraform change).
- zram cushion: considered, deferred. Could let work cgroups absorb spikes in compressed RAM instead of OOMing at the ceiling; not needed for the wedge fix.
- Per-user docker isolation: containers share one docker.slice budget, not per-user. Fine for current usage (krr + short-lived tools).
- Host-side IO: the 60/60 mbps cap + the shared sdc HDD IO domain are host-level (bead code-oflt); unchanged here.
Lessons
- "Swap as the safety valve" is an IO-storm amplifier on a throttled disk. Leaving ssh/tmux memory-uncontained (the 2026-06-10 decision) traded a clean local OOM for a box-wide swap-thrash wedge. MemorySwapMax=0 + a hard cap turns the failure back into a contained, local kill.
- Cap the box, not one surface. t3 sessions were capped for months while the same user's tmux was unbounded, and the caps that existed didn't sum to < RAM. Containment has to reason about every tree and the aggregate.
- A backstop must protect the operator's way in. earlyoom --avoids sshd/systemd/dockerd/containerd/t3-dispatch/tmux, so the box always stays reachable to recover; only the agent/browser hogs are eligible victims.
- systemd-oomd is the wrong backstop for a no-swap box: verify, don't assume. oomd's memory-pressure killer only fires on cgroups doing active reclaim (pgscan rising). With MemorySwapMax=0 + anonymous memory there is nothing to reclaim, so a cgroup sat at 99% memory.pressure indefinitely and oomd never acted (proven with oomctl + a balloon). The very swap=0 that kills the IO storm also neuters oomd. earlyoom (free-RAM threshold, swap-independent) is the correct pairing. A famous tool that "does OOM" still has to be proven to fire under your configuration.
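The "protect the operator's way in" lesson can be illustrated with a toy victim picker. This is a deliberately simplified model of avoid/prefer filtering, not earlyoom's actual scoring. The regexes are assumptions built from the process names listed in the text (the prefer list there is longer and elided), and `pick_victim` plus the sample process table are hypothetical.

```python
import re

# Assumed patterns; the real --avoid/--prefer regexes live in setup-devvm.sh.
AVOID = re.compile(r"(^|/)(sshd|systemd|dockerd|containerd|t3-dispatch|tmux)")
PREFER = re.compile(r"(^|/)(python3|node|chrome)")


def pick_victim(procs):
    """procs: list of (name, rss_kib). Return the chosen victim's name, or None.

    Protected names are never eligible; among the rest, preferred names win,
    and ties are broken by the largest resident set size.
    """
    candidates = [(name, rss) for name, rss in procs if not AVOID.search(name)]
    if not candidates:
        return None
    preferred = [c for c in candidates if PREFER.search(c[0])]
    pool = preferred or candidates
    return max(pool, key=lambda c: c[1])[0]


# Made-up process table for illustration.
procs = [("sshd", 9000), ("tmux: server", 50000), ("chrome", 4000), ("ugrep", 8000)]
print(pick_victim(procs))  # prints: chrome
```

Even though the tmux server is the largest process in this made-up table, it is never selected, which is the property that keeps the box reachable for recovery.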