infra/docs/post-mortems/2026-06-22-devvm-mem-io-overload-containment.md

# 2026-06-22 — devvm memory/IO overload: per-user containment + OOM backstop

## Impact

- devvm (VM 102, the shared multi-user Claude Code workstation) became
  unresponsive under combined memory + IO pressure and had to be **hard-killed +
  rebooted** by the admin on 2026-06-22 (morning). All ssh/tmux + t3 sessions for
  wizard/emo/anca were lost and in-flight agents were killed.
- Signature on the last htop before the kill: **load avg ~60** on 32 vCPU, **RAM
  22.5/23.5G**, **swap 13.9/14.0G (full)**, a wall of **D-state** (uninterruptible
  IO-wait) processes, and a single `ugrep` in emo's tmux holding **~10G RES /
  64% CPU**. Many `claude --effort max/xhigh` sessions + playwright-chrome MCP
  instances across three users on top.

## This is the "crawl" class, not the QEMU-stall class

The 2026-06-11 post-mortem (`2026-06-11-devvm-qemu-io-stall.md`) fixed a
*different* failure mode — a QEMU-userspace block-path wedge on the legacy LSI
controller. That fix shipped (verified 2026-06-22: the guest now boots on
`virtio_scsi`, `scsihw: virtio-scsi-single + iothread`). Its post-mortem
explicitly deferred **this** class:

> The recurring *crawl* class (agent storms → swap-thrash; journald
> watchdog-killed 3× on 2026-06-10) is a separate failure mode — ssh/tmux
> sessions remain memory-uncontained by **explicit decision (swap-only,
> 2026-06-10)**.

That explicit decision is the root cause closed here.

## Root cause

Work on the devvm lives in **two independent cgroup-v2 trees per user**, and only
one was capped:

| Tree | cgroup | Cap before today |
|---|---|---|
| t3 web sessions | `system.slice/system-t3\x2dserve.slice/t3-serve@<user>` | `MemoryHigh=12G MemoryMax=16G MemorySwapMax=0 OOMPolicy=continue` ✓ |
| **ssh/tmux sessions** | `user.slice/user-<uid>.slice` | **`MemoryMax=infinity`, swap unlimited** ✗ |

The uncapped `user-<uid>.slice` was the hole. A runaway there (the 10G `ugrep`;
stacked max-effort agents) grew unbounded, spilled into the **14G disk swap**, and
swap-thrashed the **host-mbps-throttled (60/60 MB/s) virtual disk**. That is the
overload chain:

```
uncapped tmux growth → disk-swap thrash on a throttled spindle
   → IO storm (D-state pileup) → load ~60 → box unresponsive → hard kill
```

In other words, **memory pressure becomes the IO storm**. There was also **no global OOM
backstop** (no systemd-oomd / earlyoom) to shed the worst offender before the
kernel OOM or the thrash-wedge. And even the existing t3 caps don't sum safely
(3 users × 16G = 48G > 32G RAM) — nothing reasoned about the *whole box*.
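The missing whole-box check is simple arithmetic. A minimal sketch in shell, using the figures quoted above; the helper name and its arguments are hypothetical and not part of `setup-devvm.sh`:

```shell
#!/bin/sh
# Hypothetical helper (not from setup-devvm.sh): do N equal hard caps
# add up to more than physical RAM? All arguments are whole GiB.
caps_oversubscribed() {
    total=$(( $1 * $2 ))
    if [ "$total" -gt "$3" ]; then
        echo "oversubscribed: ${total}G of caps vs ${3}G RAM"
        return 0
    fi
    echo "ok: ${total}G of caps vs ${3}G RAM"
    return 1
}

# The t3 figures quoted above: 3 users, 16G MemoryMax each, 32G RAM.
caps_oversubscribed 3 16 32
```

With those inputs it prints `oversubscribed: 48G of caps vs 32G RAM`. It only sums equal caps; it does not read live cgroup state.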

## Fix (`setup-devvm.sh` §10, applied live 2026-06-22)

Design decisions (settled by interviewing the admin via `/grill-me`): **soft-generous
per-user caps + a hard ceiling + a kill-the-worst backstop**, maximising
single-user utilisation while making a box-wide wedge impossible. (The backstop
was first built on systemd-oomd, then switched to earlyoom mid-rollout when oomd
proved inert with `swap=0` — see Verification + Lessons.)

| Layer | What |
|---|---|
| **Per-user caps, BOTH trees** | `user-.slice.d` drop-in gives every `user-<uid>.slice` the same `MemoryHigh=12G / MemoryMax=16G / MemorySwapMax=0` the t3 tree already had. A user is now bounded in whichever surface they work in. |
| **No disk swap for work** | `MemorySwapMax=0` on every work cgroup → a spike OOMs **locally** at the ceiling instead of thrashing the throttled disk. Kills the IO-storm-via-swap mechanism at the source. The 14G swapfile stays for system cold pages only. |
| **earlyoom backstop (free-RAM threshold)** | New package — used **instead of systemd-oomd** (which is inert with `swap=0`; see Lessons). Watches `MemAvailable%` and SIGTERMs the biggest task at **5%**, SIGKILL at **3%**, swap ignored (`-s 100`). `--avoid` keeps sshd/systemd/dockerd/containerd/t3-dispatch/tmux off the victim list (**the admin's way in always survives**); `--prefer` targets the agent/browser hogs (python3/node/chrome/…). Swap-independent and reliable, where oomd's pressure-kill was not. |
| **Fair-share CPU/IO** | `CPUWeight`/`IOWeight` per slice (system.slice 200, users + docker 100 each). Work-conserving — a lone user still gets all 32 cores + the full IO budget when others idle; weights only bite under contention. No hard CPU/IO caps. |
| **Docker containment** | Containers previously landed in `system.slice` — uncapped. Now `cgroup-parent: docker.slice` in `daemon.json` routes every container into a capped (`MemoryMax=8G`, swap 0) slice, so a runaway container is cgroup-OOM'd locally instead of escaping into the uncapped `system.slice`. |
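For illustration, the per-user cap layer could be written as a systemd drop-in along these lines. This is a sketch, not the actual §10 of `setup-devvm.sh`: the function name, the `ROOT` prefix argument and the file name `50-memory-caps.conf` are assumptions; only the `user-.slice.d` directory and the three memory settings come from the table above.

```shell
#!/bin/sh
# Sketch: write a memory-cap drop-in that applies to every user-<uid>.slice.
# The first argument is a root prefix so the function can be tried against a
# scratch directory; pass "" on a real host. The file name is an assumption.
write_user_slice_caps() {
    root=${1:-}
    dir="$root/etc/systemd/system/user-.slice.d"
    mkdir -p "$dir"
    cat > "$dir/50-memory-caps.conf" <<'EOF'
[Slice]
MemoryHigh=12G
MemoryMax=16G
MemorySwapMax=0
EOF
    echo "$dir/50-memory-caps.conf"
}

# Dry run against a throwaway directory; prints the path it wrote.
scratch=$(mktemp -d)
write_user_slice_caps "$scratch"
```

On a real host the file only takes effect once systemd has re-read its configuration (`systemctl daemon-reload`).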

Durable in `setup-devvm.sh` (survives a VM rebuild); `earlyoom` added to
`packages.txt`. The numbers are tunable: `MemoryHigh=12G` will throttle a *lone*
heavy user between 12G and 16G even with RAM free; raise `MemoryHigh`/`MemoryMax` to
16G/20G if that bites.
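The earlyoom settings in the table map onto a short argument string. The sketch below is a reconstruction from the thresholds and names given there, not a copy of the live config: `-m 5,3` is inferred from the 5% / 3% figures, and the regexes and the `/etc/default/earlyoom` location (Debian-style packaging) are assumptions.

```shell
#!/bin/sh
# Sketch of the earlyoom flags described above. The regexes are illustrative;
# the real --avoid/--prefer lists live in setup-devvm.sh.
earlyoom_args() {
    avoid='^(sshd|systemd|dockerd|containerd|t3-dispatch|tmux)'
    prefer='^(python3|node|chrome)'
    # -m TERM,KILL sets the MemAvailable percentages for SIGTERM and SIGKILL;
    # -s 100 is the swap setting quoted in the table.
    printf "%s\n" "-m 5,3 -s 100 --avoid '$avoid' --prefer '$prefer'"
}

# Debian-style packaging reads EARLYOOM_ARGS from /etc/default/earlyoom
# (assumed here; adjust for other distributions).
printf 'EARLYOOM_ARGS="%s"\n' "$(earlyoom_args)"
```

Keeping the two regexes in one place makes it easy to review which processes are protected and which are preferred victims before a change goes live.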

## Verification (live, 2026-06-22)

- **Caps live on running cgroups**: all three `user-<uid>.slice` report
  `memory.high=12G memory.max=16G memory.swap.max=0`; `docker.slice` `memory.max=8G`;
  daemon.json kept buildkit/nvidia/insecure-registries; paperless-mcp recovered
  under `docker.slice`.
- **Stress test A (hard cap)** — the PRIMARY guard: a 2G-capped, swap=0 balloon was
  killed at exactly 2G by the cgroup-local OOM (`constraint=CONSTRAINT_MEMCG`) with
  **swap flat at 0MB throughout** — no thrash. Same mechanism protects every user
  slice (16G) and `docker.slice` (8G).
- **Soft cap observed**: a balloon pushed past `MemoryHigh` sat at ~220M / 99%
  memory.pressure, throttled to a crawl, making no progress and harming nothing —
  a runaway is throttled, not just killed.
- **systemd-oomd disproven, then dropped**: a self-policed balloon held
  `memory.pressure full avg10 = 96–99%` (≫ its 20% limit) for >70s but oomd never
  killed it — `Pgscan: 0`. oomd's pressure-kill only acts on cgroups doing active
  reclaim, which a `swap=0` anon workload never does. oomd was purged.
- **earlyoom backstop** — verified via `--dryrun`: at the threshold it logs
  `low memory! … mem 90% swap 100%` (fires on RAM alone, swap ignored) and selects
  `SIGTERM … "chrome"` (a `--prefer` hog), never an `--avoid`'d daemon. Live
  earlyoom v1.7 confirms `SIGTERM mem<=5% / SIGKILL mem<=3%, swap<=100%`.
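The oomd finding rests on two numbers: the cgroup's `full avg10` memory pressure and its `pgscan` counter. A minimal sketch of reading both from cgroup-v2 files follows. The helper names are hypothetical, the sample values are synthetic (shaped like the observation, not captured output), and oomd's own decision logic is not reproduced.

```shell
#!/bin/sh
# Sketch: read the two figures the oomd test hinged on from cgroup-v2 files.
# Paths are arguments so the helpers can be tried on copies of the files.

# "full avg10" percentage from a PSI file such as memory.pressure.
psi_full_avg10() {
    awk '$1 == "full" { sub(/^avg10=/, "", $2); print $2 }' "$1"
}

# Cumulative pgscan counter from memory.stat.
cgroup_pgscan() {
    awk '$1 == "pgscan" { print $2 }' "$1"
}

# Synthetic sample shaped like the observation (not captured output).
tmp=$(mktemp -d)
printf '%s\n' 'some avg10=99.00 avg60=60.00 avg300=20.00 total=1' \
              'full avg10=97.50 avg60=58.00 avg300=19.00 total=1' > "$tmp/memory.pressure"
printf '%s\n' 'anon 2147483648' 'pgscan 0' > "$tmp/memory.stat"
echo "full avg10=$(psi_full_avg10 "$tmp/memory.pressure") pgscan=$(cgroup_pgscan "$tmp/memory.stat")"
```

On the synthetic sample this prints `full avg10=97.50 pgscan=0`, the same combination of high pressure and zero scanning that left oomd idle.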

## Out of scope / follow-ups

- **Alerting** (tracked, fast-follow bead): `DevvmDown` (closes the 90-min
  detection gap the 2026-06-11 PM flagged), sustained-memory-PSI/swap pressure
  early-warning, and an "earlyoom-killed-something" alert (earlyoom logs each kill;
  `-N /script` can push a metric). devvm node-exporter is already scraped
  (`job=devvm`, `10.0.10.10:9100`), so only alert *rules* are new (a
  monitoring-stack Terraform change).
- **zram cushion**: considered, deferred. Could let work cgroups absorb spikes in
  compressed RAM instead of OOMing at the ceiling; not needed for the wedge fix.
- **Per-user docker isolation**: containers share one `docker.slice` budget, not
  per-user. Fine for current usage (krr + short-lived tools).
- **Host-side IO**: the 60/60 MB/s throttle and the shared `sdc` HDD IO domain are
  host-level (bead `code-oflt`); unchanged here.
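As a rough idea of the "earlyoom-killed-something" signal, a hook for earlyoom's `-N` option could drop a metric file for node-exporter's textfile collector. This is only a sketch under assumptions: the metric name, the output directory and the use of the textfile collector are not confirmed by anything above. The `EARLYOOM_NAME` and `EARLYOOM_UID` variables are the ones earlyoom documents for `-N` scripts.

```shell
#!/bin/sh
# Sketch of a hook for earlyoom's -N option. earlyoom exports EARLYOOM_PID,
# EARLYOOM_UID and EARLYOOM_NAME to the script; the metric name and the
# output directory are assumptions.
record_earlyoom_kill() {
    outdir=$1
    ts=$(date +%s)
    tmp="$outdir/earlyoom_kill.prom.$$"
    {
        echo '# TYPE devvm_earlyoom_last_kill_timestamp_seconds gauge'
        printf 'devvm_earlyoom_last_kill_timestamp_seconds{name="%s",uid="%s"} %s\n' \
            "${EARLYOOM_NAME:-unknown}" "${EARLYOOM_UID:-unknown}" "$ts"
    } > "$tmp"
    # Rename into place so a scrape never reads a half-written file.
    mv "$tmp" "$outdir/earlyoom_kill.prom"
}

# Dry run with made-up values against a scratch directory.
scratch=$(mktemp -d)
EARLYOOM_NAME=chrome EARLYOOM_UID=1001 record_earlyoom_kill "$scratch"
cat "$scratch/earlyoom_kill.prom"
```

A real hook would also need to escape process names that contain quotes or backslashes before using them as label values.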

## Lessons

- **"Swap as the safety valve" is an IO-storm amplifier on a throttled disk.**
  Leaving ssh/tmux memory-uncontained (the 2026-06-10 decision) traded a clean
  local OOM for a box-wide swap-thrash wedge. `MemorySwapMax=0` + a hard cap turns
  the failure back into a contained, local kill.
- **Cap the box, not one surface.** t3 sessions were capped for months while the
  same user's tmux was unbounded, and the caps that did exist summed to more than
  physical RAM. Containment has to reason about every tree and the aggregate.
- **A backstop must protect the operator's way in.** earlyoom `--avoid`s
  sshd/systemd/dockerd/containerd/t3-dispatch/tmux, so the box always stays
  reachable to recover; only the agent/browser hogs are eligible victims.
- **systemd-oomd is the wrong backstop for a no-swap box — verify, don't assume.**
  oomd's memory-pressure killer only fires on cgroups doing active reclaim
  (`pgscan` rising). With `MemorySwapMax=0` + anonymous memory there is nothing to
  reclaim, so a cgroup sat at 99% `memory.pressure` indefinitely and oomd never
  acted (proven with `oomctl` + a balloon). The very `swap=0` that kills the IO
  storm also neuters oomd. earlyoom (free-RAM threshold, swap-independent) is the
  correct pairing. A famous tool that "does OOM" still has to be proven to fire
  under *your* configuration.

workstation: per-user memory caps + systemd-oomd backstop on devvm

The shared devvm keeps overloading and had to be hard-killed again today
(2026-06-22): a runaway in one user's ssh/tmux session (a 10G ugrep, plus
stacked max-effort agents) grew unbounded, spilled into the disk swap, and
swap-thrashed the throttled virtual disk into an IO storm until the box wedged.

Root cause: ssh/tmux work runs under user-<uid>.slice, left memory-uncontained
by the explicit 2026-06-10 "swap-only" decision, while only the t3-serve tree
was capped. So one user could starve everyone.

This bounds every user on BOTH trees (MemoryHigh=12G, MemoryMax=16G,
MemorySwapMax=0 so work OOMs locally at its ceiling instead of thrashing swap),
adds a systemd-oomd PSI backstop that sheds the single worst work cgroup under
box-wide pressure while leaving system.slice (sshd/services/your way in)
protected, gives system.slice a fair-share CPU/IO priority edge, and routes
docker containers into a capped, oomd-policed docker.slice so they can't dodge
the caps or mis-target oomd. All durable in setup-devvm.sh so a VM rebuild
reproduces them; systemd-oomd added to packages.txt.

Applied live and verified: oomctl shows the backstop armed (not dry-run) on the
work slices with system.slice protected; a capped-balloon stress test OOM-killed
locally at the ceiling with swap flat (no thrash).

Post-mortem: docs/post-mortems/2026-06-22-devvm-mem-io-overload-containment.md

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

											
										
										
2026-06-22 10:25:09 +00:00
workstation: switch devvm OOM backstop from systemd-oomd to earlyoom

The systemd-oomd backstop added in the previous commit is INERT on this box.
oomd's memory-pressure kill only acts on cgroups doing active reclaim (pgscan
rising); with MemorySwapMax=0 + anonymous agent memory there is nothing to
reclaim, so pgscan stays 0 and oomd never fires. Proven live: a cgroup held at
96-99% memory.pressure for >70s with pgscan=0 was never killed (oomctl + balloon).
The very swap=0 that kills the IO storm also neuters oomd.

Replace it with earlyoom, which watches free RAM (MemAvailable%) and is
swap-independent: SIGTERM the biggest task at 5%, SIGKILL at 3%, swap ignored
(-s 100). It --avoids sshd/systemd/dockerd/containerd/t3-dispatch/tmux (the
admin's way in always survives) and --prefers the agent/browser hogs. Verified
via --dryrun: fires on the RAM threshold and selects a chrome process, not a
protected daemon.

The per-cgroup caps (MemoryHigh=12G/MemoryMax=16G/MemorySwapMax=0 per user,
docker.slice 8G) are unchanged and remain the PRIMARY guard — earlyoom is the
aggregate net for the rare all-users-maxed case. systemd-oomd purged; its config
+ ManagedOOM drop-ins removed. Post-mortem updated with the finding.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

											
										
										
2026-06-22 10:39:16 +00:00
+								## Fix (`setup-devvm.sh` §10, applied live 2026-06-22)
-												workstation: per-user memory caps + systemd-oomd backstop on devvm

The shared devvm keeps overloading and had to be hard-killed again today
(2026-06-22): a runaway in one user's ssh/tmux session (a 10G ugrep, plus
stacked max-effort agents) grew unbounded, spilled into the disk swap, and
swap-thrashed the throttled virtual disk into an IO storm until the box wedged.

Root cause: ssh/tmux work runs under user-<uid>.slice, left memory-uncontained
by the explicit 2026-06-10 "swap-only" decision, while only the t3-serve tree
was capped. So one user could starve everyone.

This bounds every user on BOTH trees (MemoryHigh=12G, MemoryMax=16G,
MemorySwapMax=0 so work OOMs locally at its ceiling instead of thrashing swap),
adds a systemd-oomd PSI backstop that sheds the single worst work cgroup under
box-wide pressure while leaving system.slice (sshd/services/your way in)
protected, gives system.slice a fair-share CPU/IO priority edge, and routes
docker containers into a capped, oomd-policed docker.slice so they can't dodge
the caps or mis-target oomd. All durable in setup-devvm.sh so a VM rebuild
reproduces them; systemd-oomd added to packages.txt.

Applied live and verified: oomctl shows the backstop armed (not dry-run) on the
work slices with system.slice protected; a capped-balloon stress test OOM-killed
locally at the ceiling with swap flat (no thrash).

Post-mortem: docs/post-mortems/2026-06-22-devvm-mem-io-overload-containment.md

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

											
										
										
											2026-06-22 10:25:09 +00:00
 								Design decisions (interviewed with the admin via `/grill-me`): **soft-generous
-												workstation: switch devvm OOM backstop from systemd-oomd to earlyoom

The systemd-oomd backstop added in the previous commit is INERT on this box.
oomd's memory-pressure kill only acts on cgroups doing active reclaim (pgscan
rising); with MemorySwapMax=0 + anonymous agent memory there is nothing to
reclaim, so pgscan stays 0 and oomd never fires. Proven live: a cgroup held at
96-99% memory.pressure for >70s with pgscan=0 was never killed (oomctl + balloon).
The very swap=0 that kills the IO storm also neuters oomd.

Replace it with earlyoom, which watches free RAM (MemAvailable%) and is
swap-independent: SIGTERM the biggest task at 5%, SIGKILL at 3%, swap ignored
(-s 100). It --avoids sshd/systemd/dockerd/containerd/t3-dispatch/tmux (the
admin's way in always survives) and --prefers the agent/browser hogs. Verified
via --dryrun: fires on the RAM threshold and selects a chrome process, not a
protected daemon.

The per-cgroup caps (MemoryHigh=12G/MemoryMax=16G/MemorySwapMax=0 per user,
docker.slice 8G) are unchanged and remain the PRIMARY guard — earlyoom is the
aggregate net for the rare all-users-maxed case. systemd-oomd purged; its config
+ ManagedOOM drop-ins removed. Post-mortem updated with the finding.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

											
										
										
											2026-06-22 10:39:16 +00:00
+								per-user caps + a hard ceiling + a kill-the-worst backstop**, maximising
 								single-user utilisation while making a box-wide wedge impossible. (The backstop
 								was first built on systemd-oomd, then switched to earlyoom mid-rollout when oomd
 								proved inert with `swap=0` — see Verification + Lessons.)
-												workstation: per-user memory caps + systemd-oomd backstop on devvm

The shared devvm keeps overloading and had to be hard-killed again today
(2026-06-22): a runaway in one user's ssh/tmux session (a 10G ugrep, plus
stacked max-effort agents) grew unbounded, spilled into the disk swap, and
swap-thrashed the throttled virtual disk into an IO storm until the box wedged.

Root cause: ssh/tmux work runs under user-<uid>.slice, left memory-uncontained
by the explicit 2026-06-10 "swap-only" decision, while only the t3-serve tree
was capped. So one user could starve everyone.

This bounds every user on BOTH trees (MemoryHigh=12G, MemoryMax=16G,
MemorySwapMax=0 so work OOMs locally at its ceiling instead of thrashing swap),
adds a systemd-oomd PSI backstop that sheds the single worst work cgroup under
box-wide pressure while leaving system.slice (sshd/services/your way in)
protected, gives system.slice a fair-share CPU/IO priority edge, and routes
docker containers into a capped, oomd-policed docker.slice so they can't dodge
the caps or mis-target oomd. All durable in setup-devvm.sh so a VM rebuild
reproduces them; systemd-oomd added to packages.txt.

Applied live and verified: oomctl shows the backstop armed (not dry-run) on the
work slices with system.slice protected; a capped-balloon stress test OOM-killed
locally at the ceiling with swap flat (no thrash).

Post-mortem: docs/post-mortems/2026-06-22-devvm-mem-io-overload-containment.md

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

											
										
										
											2026-06-22 10:25:09 +00:00
 								| Layer | What |
 								|---|---|
 								| **Per-user caps, BOTH trees** | `user-.slice.d` drop-in gives every `user-<uid>.slice` the same `MemoryHigh=12G / MemoryMax=16G / MemorySwapMax=0` the t3 tree already had. A user is now bounded in whichever surface they work in. |
 								| **No disk swap for work** | `MemorySwapMax=0` on every work cgroup → a spike OOMs **locally** at the ceiling instead of thrashing the throttled disk. Kills the IO-storm-via-swap mechanism at the source. The 14G swapfile stays for system cold pages only. |
-												workstation: switch devvm OOM backstop from systemd-oomd to earlyoom

The systemd-oomd backstop added in the previous commit is INERT on this box.
oomd's memory-pressure kill only acts on cgroups doing active reclaim (pgscan
rising); with MemorySwapMax=0 + anonymous agent memory there is nothing to
reclaim, so pgscan stays 0 and oomd never fires. Proven live: a cgroup held at
96-99% memory.pressure for >70s with pgscan=0 was never killed (oomctl + balloon).
The very swap=0 that kills the IO storm also neuters oomd.

Replace it with earlyoom, which watches free RAM (MemAvailable%) and is
swap-independent: SIGTERM the biggest task at 5%, SIGKILL at 3%, swap ignored
(-s 100). It --avoids sshd/systemd/dockerd/containerd/t3-dispatch/tmux (the
admin's way in always survives) and --prefers the agent/browser hogs. Verified
via --dryrun: fires on the RAM threshold and selects a chrome process, not a
protected daemon.

The per-cgroup caps (MemoryHigh=12G/MemoryMax=16G/MemorySwapMax=0 per user,
docker.slice 8G) are unchanged and remain the PRIMARY guard — earlyoom is the
aggregate net for the rare all-users-maxed case. systemd-oomd purged; its config
+ ManagedOOM drop-ins removed. Post-mortem updated with the finding.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

											
										
										
											2026-06-22 10:39:16 +00:00
+								| **earlyoom backstop (free-RAM threshold)** | New package — used **instead of systemd-oomd** (which is inert with `swap=0`; see Lessons). Watches `MemAvailable%` and SIGTERMs the biggest task at **5%**, SIGKILL at **3%**, swap ignored (`-s 100`). `--avoid` keeps sshd/systemd/dockerd/containerd/t3-dispatch/tmux off the victim list (**the admin's way in always survives**); `--prefer` targets the agent/browser hogs (python3/node/chrome/…). Swap-independent and reliable, where oomd's pressure-kill was not. |
-												workstation: per-user memory caps + systemd-oomd backstop on devvm

The shared devvm keeps overloading and had to be hard-killed again today
(2026-06-22): a runaway in one user's ssh/tmux session (a 10G ugrep, plus
stacked max-effort agents) grew unbounded, spilled into the disk swap, and
swap-thrashed the throttled virtual disk into an IO storm until the box wedged.

Root cause: ssh/tmux work runs under user-<uid>.slice, left memory-uncontained
by the explicit 2026-06-10 "swap-only" decision, while only the t3-serve tree
was capped. So one user could starve everyone.

This bounds every user on BOTH trees (MemoryHigh=12G, MemoryMax=16G,
MemorySwapMax=0 so work OOMs locally at its ceiling instead of thrashing swap),
adds a systemd-oomd PSI backstop that sheds the single worst work cgroup under
box-wide pressure while leaving system.slice (sshd/services/your way in)
protected, gives system.slice a fair-share CPU/IO priority edge, and routes
docker containers into a capped, oomd-policed docker.slice so they can't dodge
the caps or mis-target oomd. All durable in setup-devvm.sh so a VM rebuild
reproduces them; systemd-oomd added to packages.txt.

Applied live and verified: oomctl shows the backstop armed (not dry-run) on the
work slices with system.slice protected; a capped-balloon stress test OOM-killed
locally at the ceiling with swap flat (no thrash).

Post-mortem: docs/post-mortems/2026-06-22-devvm-mem-io-overload-containment.md

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

											
										
										
											2026-06-22 10:25:09 +00:00
+								| **Fair-share CPU/IO** | `CPUWeight`/`IOWeight` per slice (system.slice 200, users + docker 100 each). Work-conserving — a lone user still gets all 32 cores + the full IO budget when others idle; weights only bite under contention. No hard CPU/IO caps. |
-												workstation: switch devvm OOM backstop from systemd-oomd to earlyoom

The systemd-oomd backstop added in the previous commit is INERT on this box.
oomd's memory-pressure kill only acts on cgroups doing active reclaim (pgscan
rising); with MemorySwapMax=0 + anonymous agent memory there is nothing to
reclaim, so pgscan stays 0 and oomd never fires. Proven live: a cgroup held at
96-99% memory.pressure for >70s with pgscan=0 was never killed (oomctl + balloon).
The very swap=0 that kills the IO storm also neuters oomd.

Replace it with earlyoom, which watches free RAM (MemAvailable%) and is
swap-independent: SIGTERM the biggest task at 5%, SIGKILL at 3%, swap ignored
(-s 100). It --avoids sshd/systemd/dockerd/containerd/t3-dispatch/tmux (the
admin's way in always survives) and --prefers the agent/browser hogs. Verified
via --dryrun: fires on the RAM threshold and selects a chrome process, not a
protected daemon.

The per-cgroup caps (MemoryHigh=12G/MemoryMax=16G/MemorySwapMax=0 per user,
docker.slice 8G) are unchanged and remain the PRIMARY guard — earlyoom is the
aggregate net for the rare all-users-maxed case. systemd-oomd purged; its config
+ ManagedOOM drop-ins removed. Post-mortem updated with the finding.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

											
										
										
											2026-06-22 10:39:16 +00:00
+								| **Docker containment** | Containers previously landed in `system.slice` — uncapped. Now `cgroup-parent: docker.slice` in `daemon.json` routes every container into a capped (`MemoryMax=8G`, swap 0) slice, so a runaway container is cgroup-OOM'd locally instead of escaping into the uncapped `system.slice`. |
-												workstation: per-user memory caps + systemd-oomd backstop on devvm

The shared devvm keeps overloading and had to be hard-killed again today
(2026-06-22): a runaway in one user's ssh/tmux session (a 10G ugrep, plus
stacked max-effort agents) grew unbounded, spilled into the disk swap, and
swap-thrashed the throttled virtual disk into an IO storm until the box wedged.

Root cause: ssh/tmux work runs under user-<uid>.slice, left memory-uncontained
by the explicit 2026-06-10 "swap-only" decision, while only the t3-serve tree
was capped. So one user could starve everyone.

This bounds every user on BOTH trees (MemoryHigh=12G, MemoryMax=16G,
MemorySwapMax=0 so work OOMs locally at its ceiling instead of thrashing swap),
adds a systemd-oomd PSI backstop that sheds the single worst work cgroup under
box-wide pressure while leaving system.slice (sshd/services/your way in)
protected, gives system.slice a fair-share CPU/IO priority edge, and routes
docker containers into a capped, oomd-policed docker.slice so they can't dodge
the caps or mis-target oomd. All durable in setup-devvm.sh so a VM rebuild
reproduces them; systemd-oomd added to packages.txt.

Applied live and verified: oomctl shows the backstop armed (not dry-run) on the
work slices with system.slice protected; a capped-balloon stress test OOM-killed
locally at the ceiling with swap flat (no thrash).

Post-mortem: docs/post-mortems/2026-06-22-devvm-mem-io-overload-containment.md

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

											
										
										
											2026-06-22 10:25:09 +00:00
-												workstation: switch devvm OOM backstop from systemd-oomd to earlyoom

The systemd-oomd backstop added in the previous commit is INERT on this box.
oomd's memory-pressure kill only acts on cgroups doing active reclaim (pgscan
rising); with MemorySwapMax=0 + anonymous agent memory there is nothing to
reclaim, so pgscan stays 0 and oomd never fires. Proven live: a cgroup held at
96-99% memory.pressure for >70s with pgscan=0 was never killed (oomctl + balloon).
The very swap=0 that kills the IO storm also neuters oomd.

Replace it with earlyoom, which watches free RAM (MemAvailable%) and is
swap-independent: SIGTERM the biggest task at 5%, SIGKILL at 3%, swap ignored
(-s 100). It --avoids sshd/systemd/dockerd/containerd/t3-dispatch/tmux (the
admin's way in always survives) and --prefers the agent/browser hogs. Verified
via --dryrun: fires on the RAM threshold and selects a chrome process, not a
protected daemon.

The per-cgroup caps (MemoryHigh=12G/MemoryMax=16G/MemorySwapMax=0 per user,
docker.slice 8G) are unchanged and remain the PRIMARY guard — earlyoom is the
aggregate net for the rare all-users-maxed case. systemd-oomd purged; its config
+ ManagedOOM drop-ins removed. Post-mortem updated with the finding.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

											
										
										
											2026-06-22 10:39:16 +00:00
+								Durable in `setup-devvm.sh` (survives a VM rebuild); `earlyoom` added to
-												workstation: per-user memory caps + systemd-oomd backstop on devvm

The shared devvm keeps overloading and had to be hard-killed again today
(2026-06-22): a runaway in one user's ssh/tmux session (a 10G ugrep, plus
stacked max-effort agents) grew unbounded, spilled into the disk swap, and
swap-thrashed the throttled virtual disk into an IO storm until the box wedged.

Root cause: ssh/tmux work runs under user-<uid>.slice, left memory-uncontained
by the explicit 2026-06-10 "swap-only" decision, while only the t3-serve tree
was capped. So one user could starve everyone.

This bounds every user on BOTH trees (MemoryHigh=12G, MemoryMax=16G,
MemorySwapMax=0 so work OOMs locally at its ceiling instead of thrashing swap),
adds a systemd-oomd PSI backstop that sheds the single worst work cgroup under
box-wide pressure while leaving system.slice (sshd/services/your way in)
protected, gives system.slice a fair-share CPU/IO priority edge, and routes
docker containers into a capped, oomd-policed docker.slice so they can't dodge
the caps or mis-target oomd. All durable in setup-devvm.sh so a VM rebuild
reproduces them; systemd-oomd added to packages.txt.

Applied live and verified: oomctl shows the backstop armed (not dry-run) on the
work slices with system.slice protected; a capped-balloon stress test OOM-killed
locally at the ceiling with swap flat (no thrash).

Post-mortem: docs/post-mortems/2026-06-22-devvm-mem-io-overload-containment.md

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

											
										
										
											2026-06-22 10:25:09 +00:00
+								`packages.txt`. The numbers are tunable — `MemoryHigh=12G` will throttle a *lone*
 								heavy user between 12–16G even with RAM free; bump to 16/20 if that bites.
 								## Verification (live, 2026-06-22)
 								- **Caps live on running cgroups**: all three `user-<uid>.slice` report
 								  `memory.high=12G memory.max=16G memory.swap.max=0`; `docker.slice` `memory.max=8G`;
 								  daemon.json kept buildkit/nvidia/insecure-registries; paperless-mcp recovered
 								  under `docker.slice`.
-												workstation: switch devvm OOM backstop from systemd-oomd to earlyoom

The systemd-oomd backstop added in the previous commit is INERT on this box.
oomd's memory-pressure kill only acts on cgroups doing active reclaim (pgscan
rising); with MemorySwapMax=0 + anonymous agent memory there is nothing to
reclaim, so pgscan stays 0 and oomd never fires. Proven live: a cgroup held at
96-99% memory.pressure for >70s with pgscan=0 was never killed (oomctl + balloon).
The very swap=0 that kills the IO storm also neuters oomd.

Replace it with earlyoom, which watches free RAM (MemAvailable%) and is
swap-independent: SIGTERM the biggest task at 5%, SIGKILL at 3%, swap ignored
(-s 100). It --avoids sshd/systemd/dockerd/containerd/t3-dispatch/tmux (the
admin's way in always survives) and --prefers the agent/browser hogs. Verified
via --dryrun: fires on the RAM threshold and selects a chrome process, not a
protected daemon.

The per-cgroup caps (MemoryHigh=12G/MemoryMax=16G/MemorySwapMax=0 per user,
docker.slice 8G) are unchanged and remain the PRIMARY guard — earlyoom is the
aggregate net for the rare all-users-maxed case. systemd-oomd purged; its config
+ ManagedOOM drop-ins removed. Post-mortem updated with the finding.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

											
										
										
- **Stress test A (hard cap)**, the PRIMARY guard: a 2G-capped, swap=0 balloon was
  killed at exactly 2G by the cgroup-local OOM (`constraint=CONSTRAINT_MEMCG`) with
  **swap flat at 0MB throughout** (no thrash). The same mechanism protects every user
  slice (16G) and `docker.slice` (8G).
- **Soft cap observed**: a balloon pushed past `MemoryHigh` sat at ~220M / 99%
  memory.pressure, throttled to a crawl, making no progress and harming nothing.
  A runaway is throttled, not just killed.
- **systemd-oomd disproven, then dropped**: a self-policed balloon held
  `memory.pressure full avg10 = 96–99%` (≫ its 20% limit) for >70s, but oomd never
  killed it because `Pgscan` stayed at 0. oomd's pressure-kill only acts on cgroups
  doing active reclaim, which a `swap=0` anon workload never does. oomd was purged.
- **earlyoom backstop**, verified via `--dryrun`: at the threshold it logs
  `low memory! … mem 90% swap 100%` (fires on RAM alone, swap ignored) and selects
  `SIGTERM … "chrome"` (a `--prefer` hog), never an `--avoid`'d daemon. Live
  earlyoom v1.7 confirms `SIGTERM mem<=5% / SIGKILL mem<=3%, swap<=100%`.
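
The oomd finding above can be re-checked from the cgroup's own files. The following is a minimal Python sketch, not oomd's actual logic: it parses the PSI text format of `memory.pressure` and the `pgscan` counter from `memory.stat`, then flags the state observed in the test (pressure over the limit while `pgscan` does not advance). The function names and the sample values are assumptions made for illustration; the 20% limit is the one quoted above.

```python
def parse_psi(text: str) -> dict:
    """Parse cgroup-v2 PSI text into {"some": {...}, "full": {...}} of floats."""
    out = {}
    for line in text.strip().splitlines():
        kind, *fields = line.split()
        out[kind] = {k: float(v) for k, v in (f.split("=", 1) for f in fields)}
    return out


def parse_pgscan(memory_stat: str) -> int:
    """Return the pgscan counter from memory.stat text (0 if absent)."""
    for line in memory_stat.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[0] == "pgscan":
            return int(parts[1])
    return 0


def stalled_without_reclaim(psi_text: str, stat_before: str, stat_after: str,
                            limit_pct: float = 20.0) -> bool:
    """True when 'full avg10' exceeds the limit but pgscan did not advance."""
    over_limit = parse_psi(psi_text)["full"]["avg10"] > limit_pct
    reclaiming = parse_pgscan(stat_after) > parse_pgscan(stat_before)
    return over_limit and not reclaiming


if __name__ == "__main__":
    # Made-up sample in the kernel's PSI format, not captured from devvm.
    sample_psi = (
        "some avg10=99.10 avg60=80.00 avg300=30.00 total=123456\n"
        "full avg10=97.50 avg60=78.00 avg300=29.00 total=120000\n"
    )
    print(stalled_without_reclaim(sample_psi, "pgscan 0\n", "pgscan 0\n"))
```

On a live box the two `memory.stat` snapshots would be read a few seconds apart from the slice's directory under `/sys/fs/cgroup`.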
workstation: per-user memory caps + systemd-oomd backstop on devvm

The shared devvm keeps overloading and had to be hard-killed again today
(2026-06-22): a runaway in one user's ssh/tmux session (a 10G ugrep, plus
stacked max-effort agents) grew unbounded, spilled into the disk swap, and
swap-thrashed the throttled virtual disk into an IO storm until the box wedged.

Root cause: ssh/tmux work runs under user-<uid>.slice, left memory-uncontained
by the explicit 2026-06-10 "swap-only" decision, while only the t3-serve tree
was capped. So one user could starve everyone.

This bounds every user on BOTH trees (MemoryHigh=12G, MemoryMax=16G,
MemorySwapMax=0 so work OOMs locally at its ceiling instead of thrashing swap),
adds a systemd-oomd PSI backstop that sheds the single worst work cgroup under
box-wide pressure while leaving system.slice (sshd/services/your way in)
protected, gives system.slice a fair-share CPU/IO priority edge, and routes
docker containers into a capped, oomd-policed docker.slice so they can't dodge
the caps or mis-target oomd. All durable in setup-devvm.sh so a VM rebuild
reproduces them; systemd-oomd added to packages.txt.

Applied live and verified: oomctl shows the backstop armed (not dry-run) on the
work slices with system.slice protected; a capped-balloon stress test OOM-killed
locally at the ceiling with swap flat (no thrash).

Post-mortem: docs/post-mortems/2026-06-22-devvm-mem-io-overload-containment.md

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

											
										
										
## Out of scope / follow-ups

- **Alerting** (tracked, fast-follow bead): `DevvmDown` (closes the 90-min
  detection gap the 2026-06-11 PM flagged), sustained-memory-PSI/swap pressure
  early-warning, and an "earlyoom-killed-something" alert (earlyoom logs each kill;
  `-N /script` can push a metric). devvm node-exporter is already scraped
  (`job=devvm`, `10.0.10.10:9100`), so only alert *rules* are new (a
  monitoring-stack Terraform change).
- **zram cushion**: considered, deferred. Could let work cgroups absorb spikes in
  compressed RAM instead of OOMing at the ceiling; not needed for the wedge fix.
- **Per-user docker isolation**: containers share one `docker.slice` budget, not
  per-user. Fine for current usage (krr + short-lived tools).
- **Host-side IO**: the 60/60 mbps cap + the shared `sdc` HDD IO domain are
  host-level (bead `code-oflt`); unchanged here.
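
For the "earlyoom-killed-something" alert in the Alerting item, one possible shape for the `-N` hook is a script that bumps a counter in a node-exporter textfile. This is a hedged sketch, not what is deployed: the metric name and the output path are hypothetical, it assumes the textfile collector is enabled on devvm, and it assumes earlyoom exports `EARLYOOM_NAME` and `EARLYOOM_UID` to the hook as its man page describes (worth confirming against the installed v1.7).

```python
import os
import re
import sys
import tempfile

METRIC = "earlyoom_kills_total"  # hypothetical metric name


def bump(existing: str, name: str, uid: str) -> str:
    """Return textfile content with the counter for (name, uid) incremented."""
    pattern = re.compile(rf'^{METRIC}\{{name="([^"]*)",uid="([^"]*)"\}} (\d+)$')
    counts = {}
    for line in existing.splitlines():
        m = pattern.match(line)
        if m:
            counts[(m.group(1), m.group(2))] = int(m.group(3))
    # Drop characters that would need escaping inside a label value.
    key = (re.sub(r'["\\\n]', "", name), re.sub(r'["\\\n]', "", uid))
    counts[key] = counts.get(key, 0) + 1
    lines = [f"# TYPE {METRIC} counter"]
    lines += [f'{METRIC}{{name="{n}",uid="{u}"}} {c}'
              for (n, u), c in sorted(counts.items())]
    return "\n".join(lines) + "\n"


def main(path: str) -> None:
    """Read the victim from the environment and rewrite the textfile atomically."""
    name = os.environ.get("EARLYOOM_NAME", "unknown")
    uid = os.environ.get("EARLYOOM_UID", "unknown")
    try:
        with open(path) as fh:
            existing = fh.read()
    except FileNotFoundError:
        existing = ""
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    with os.fdopen(fd, "w") as fh:
        fh.write(bump(existing, name, uid))
    os.chmod(tmp, 0o644)   # readable by the exporter
    os.replace(tmp, path)  # atomic swap so a scrape never sees a partial file


if __name__ == "__main__" and len(sys.argv) > 1:
    main(sys.argv[1])  # e.g. a *.prom file in the textfile collector directory
```

An alert rule could then fire on any increase of the counter, which keeps the new work limited to alert rules as noted above.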

## Lessons

- **"Swap as the safety valve" is an IO-storm amplifier on a throttled disk.**
  Leaving ssh/tmux memory-uncontained (the 2026-06-10 decision) traded a clean
  local OOM for a box-wide swap-thrash wedge. `MemorySwapMax=0` + a hard cap turns
  the failure back into a contained, local kill.
- **Cap the box, not one surface.** t3 sessions were capped for months while the
  same user's tmux was unbounded, and the caps that did exist did not sum to less
  than RAM. Containment has to reason about every tree and the aggregate.
- **A backstop must protect the operator's way in.** earlyoom `--avoid`s
  sshd/systemd/dockerd/containerd/t3-dispatch/tmux, so the box always stays
  reachable to recover; only the agent/browser hogs are eligible victims.
- **systemd-oomd is the wrong backstop for a no-swap box; verify, don't assume.**
  oomd's memory-pressure killer only fires on cgroups doing active reclaim
  (`pgscan` rising). With `MemorySwapMax=0` + anonymous memory there is nothing to
  reclaim, so a cgroup sat at 99% `memory.pressure` indefinitely and oomd never
  acted (proven with `oomctl` + a balloon). The very `swap=0` that kills the IO
  storm also neuters oomd. earlyoom (free-RAM threshold, swap-independent) is the
  correct pairing. A famous tool that "does OOM" still has to be proven to fire
  under *your* configuration.
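
The "cap the box" lesson reduces to simple arithmetic, sketched below in Python. The inputs are taken from this document (about 23.5G of RAM, three users at `MemoryMax=16G`, `docker.slice` at 8G, earlyoom at 5%/3%); the function names are illustrative, and the sum deliberately ignores the separate per-user t3-serve caps, so the real aggregate is higher still.

```python
def aggregate_caps_gib(per_user_max_gib: float, users: int,
                       docker_max_gib: float) -> float:
    """Sum of the hard caps across user slices plus docker.slice."""
    return per_user_max_gib * users + docker_max_gib


def earlyoom_thresholds_gib(ram_gib: float, term_pct: float = 5.0,
                            kill_pct: float = 3.0) -> tuple:
    """Available-RAM levels (GiB) at which SIGTERM and SIGKILL would be sent."""
    return ram_gib * term_pct / 100, ram_gib * kill_pct / 100


if __name__ == "__main__":
    ram = 23.5
    caps = aggregate_caps_gib(per_user_max_gib=16, users=3, docker_max_gib=8)
    term, kill = earlyoom_thresholds_gib(ram)
    print(f"caps sum to {caps:.0f}G on {ram}G RAM ({caps / ram:.1f}x oversubscribed)")
    print(f"earlyoom: SIGTERM below ~{term:.2f}G available, SIGKILL below ~{kill:.2f}G")
```

Because the per-slice caps add up to more than twice the physical RAM, they cannot protect the box when every user is near their ceiling at once; that remaining case is what the earlyoom thresholds cover.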