diff --git a/docs/post-mortems/2026-06-22-devvm-mem-io-overload-containment.md b/docs/post-mortems/2026-06-22-devvm-mem-io-overload-containment.md new file mode 100644 index 00000000..4a4b1606 --- /dev/null +++ b/docs/post-mortems/2026-06-22-devvm-mem-io-overload-containment.md @@ -0,0 +1,113 @@ +# 2026-06-22 — devvm memory/IO overload: per-user containment + OOM backstop + +## Impact + +- devvm (VM 102, the shared multi-user Claude Code workstation) became + unresponsive under combined memory + IO pressure and had to be **hard-killed + + rebooted** by the admin on 2026-06-22 (morning). All ssh/tmux + t3 sessions for + wizard/emo/anca lost, in-flight agents killed. +- Signature on the last htop before the kill: **load avg ~60** on 32 vCPU, **RAM + 22.5/23.5G**, **swap 13.9/14.0G (full)**, a wall of **D-state** (uninterruptible + IO-wait) processes, and a single `ugrep` in emo's tmux holding **~10G RES / + 64% CPU**. Many `claude --effort max/xhigh` sessions + playwright-chrome MCP + instances across three users on top. + +## This is the "crawl" class, not the QEMU-stall class + +The 2026-06-11 post-mortem (`2026-06-11-devvm-qemu-io-stall.md`) fixed a +*different* failure mode — a QEMU-userspace block-path wedge on the legacy LSI +controller. That fix shipped (verified 2026-06-22: the guest now boots on +`virtio_scsi`, `scsihw: virtio-scsi-single + iothread`). Its post-mortem +explicitly deferred **this** class: + +> The recurring *crawl* class (agent storms → swap-thrash; journald +> watchdog-killed 3× on 2026-06-10) is a separate failure mode — ssh/tmux +> sessions remain memory-uncontained by **explicit decision (swap-only, +> 2026-06-10)**. + +That explicit decision is the root cause closed here. 
+ +## Root cause + +Work on the devvm lives in **two independent cgroup-v2 trees per user**, and only +one was capped: + +| Tree | cgroup | Cap before today | +|---|---|---| +| t3 web sessions | `system.slice/system-t3\x2dserve.slice/t3-serve@` | `MemoryHigh=12G MemoryMax=16G MemorySwapMax=0 OOMPolicy=continue` ✓ | +| **ssh/tmux sessions** | `user.slice/user-.slice` | **`MemoryMax=infinity`, swap unlimited** ✗ | + +The uncapped `user-.slice` was the hole. A runaway there (the 10G `ugrep`; +stacked max-effort agents) grew unbounded, spilled into the **14G disk swap**, and +swap-thrashed the **host-mbps-throttled (60/60 MB/s) virtual disk**. That is the +overload chain: + +``` +uncapped tmux growth → disk-swap thrash on a throttled spindle + → IO storm (D-state pileup) → load ~60 → box unresponsive → hard kill +``` + +i.e. **memory pressure becomes the IO storm**. There was also **no global OOM +backstop** (no systemd-oomd / earlyoom) to shed the worst offender before the +kernel OOM or the thrash-wedge. And even the existing t3 caps don't sum safely +(3 users × 16G = 48G, against the ~23.5G of RAM htop reported); nothing reasoned about the *whole box*. + +## Fix (shipped this commit — `setup-devvm.sh` §10, applied live 2026-06-22) + +Design decisions (interviewed with the admin via `/grill-me`): **soft-generous +per-user caps + a hard ceiling + an oomd backstop**, maximising single-user +utilisation while making a box-wide wedge impossible. + +| Layer | What | +|---|---| +| **Per-user caps, BOTH trees** | `user-.slice.d` drop-in gives every `user-.slice` the same `MemoryHigh=12G / MemoryMax=16G / MemorySwapMax=0` the t3 tree already had. A user is now bounded in whichever surface they work in. | +| **No disk swap for work** | `MemorySwapMax=0` on every work cgroup → a spike OOMs **locally** at the ceiling instead of thrashing the throttled disk. Kills the IO-storm-via-swap mechanism at the source. The 14G swapfile stays for system cold pages only. 
| +| **systemd-oomd backstop (PSI)** | New package. Kills the single worst-pressured descendant of a policed slice when memory-pressure (`full`) stays **>60% for 20s**; global swap guard **80%**. Polices `user.slice`, `system-t3\x2dserve.slice`, `docker.slice`. **`system.slice` is deliberately NOT policed** — sshd + services + the admin's way in always survive; only a runaway *user* session is ever sacrificed, locally, under genuine box-wide pressure. | +| **Fair-share CPU/IO** | `CPUWeight`/`IOWeight` per slice (system.slice 200, users + docker 100 each). Work-conserving — a lone user still gets all 32 cores + the full IO budget when others idle; weights only bite under contention. No hard CPU/IO caps. | +| **Docker containment** | Containers previously landed in `system.slice` — uncapped AND protected from oomd, so a ballooning container would mis-target oomd onto an innocent user. Now `cgroup-parent: docker.slice` in `daemon.json` routes every container into a capped (`MemoryMax=8G`, swap 0), oomd-policed slice. | + +Durable in `setup-devvm.sh` (survives a VM rebuild); `systemd-oomd` added to +`packages.txt`. The numbers are tunable — `MemoryHigh=12G` will throttle a *lone* +heavy user between 12–16G even with RAM free; bump to 16/20 if that bites. + +## Verification (live, 2026-06-22) + +- **Caps live on running cgroups**: all three `user-.slice` report + `memory.high=12G memory.max=16G memory.swap.max=0`; `docker.slice` `memory.max=8G`; + daemon.json kept buildkit/nvidia/insecure-registries; paperless-mcp recovered + under `docker.slice`. +- **oomd armed**: `oomctl` shows `Dry Run: no`, swap-limit 80%, pressure-limit + 60% / 20s, and the 5 policed cgroups — `system.slice` absent (protected). +- **Stress test A (hard cap)**: a 2G-capped, swap=0 balloon was killed at exactly + 2G by the cgroup-local OOM (`constraint=CONSTRAINT_MEMCG`) with **swap flat at + 0MB throughout** — no thrash. This is the mechanism protecting every slice. 
+- **Stress test B (oomd backstop)**: a self-policed balloon (256M soft / 20% + pressure limit) was killed by **systemd-oomd on memory pressure**, confirming + the backstop fires, not just arms. + +## Out of scope / follow-ups + +- **Alerting** (tracked, fast-follow bead): `DevvmDown` (closes the 90-min + detection gap the 2026-06-11 post-mortem flagged), sustained-memory-PSI/swap pressure + early-warning, and an "oomd-killed-something" alert. devvm node-exporter is + already scraped (`job=devvm`, `10.0.10.10:9100`), so only alert *rules* are new + (a monitoring-stack Terraform change). +- **zram cushion**: considered, deferred. Could let work cgroups absorb spikes in + compressed RAM instead of OOMing at the ceiling; not needed for the wedge fix. +- **Per-user docker isolation**: containers share one `docker.slice` budget, not + per-user. Fine for current usage (krr + short-lived tools). +- **Host-side IO**: the 60/60 MB/s host throttle + the shared `sdc` HDD IO domain are + host-level (bead `code-oflt`); unchanged here. + +## Lessons + +- **"Swap as the safety valve" is an IO-storm amplifier on a throttled disk.** + Leaving ssh/tmux memory-uncontained (the 2026-06-10 decision) traded a clean + local OOM for a box-wide swap-thrash wedge. `MemorySwapMax=0` + a hard cap turns + the failure back into a contained, local kill. +- **Cap the box, not one surface.** t3 sessions were capped for months while the + same user's tmux was unbounded — and the caps that existed didn't sum to < RAM. + Containment has to reason about every tree and the aggregate. +- **A backstop must protect the operator's way in.** oomd polices the work trees + only; `system.slice` (sshd, the daemons) is never a victim, so the box always + stays reachable to recover. 
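Reviewer note on the "cap the box, not one surface" lesson: the aggregate can be checked with simple arithmetic. The sketch below is illustrative only and is not part of this commit. `parse_size` and `overcommit_ratio` are hypothetical helpers, the 23.5G RAM figure is taken from the htop signature in the Impact section, and it assumes three users who each hold one t3 cap and one `user-.slice` cap.

```python
# Illustrative only: sum the hard memory caps named in the post-mortem and
# compare them with total RAM. Helper names are hypothetical.
from typing import Iterable

UNITS = {"K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}


def parse_size(text: str) -> float:
    """Convert a systemd-style size such as '16G' or '23.5G' to bytes."""
    text = text.strip().upper()
    if text and text[-1] in UNITS:
        return float(text[:-1]) * UNITS[text[-1]]
    return float(text)  # a bare number is already in bytes


def overcommit_ratio(caps: Iterable[str], ram: str) -> float:
    """Return (sum of hard caps) / RAM. A value above 1.0 means the caps
    alone cannot keep total usage within physical memory."""
    return sum(parse_size(c) for c in caps) / parse_size(ram)


USERS = 3        # assumption: the three users named in the Impact section
RAM = "23.5G"    # assumption: total RAM as shown by htop before the kill

# Before the fix only the t3 tree had a hard cap (MemoryMax=16G per user).
before = ["16G"] * USERS
# After the fix: t3 cap + user-.slice cap per user, plus docker.slice (8G).
after = ["16G"] * USERS + ["16G"] * USERS + ["8G"]

if __name__ == "__main__":
    print(f"t3 caps only:     {overcommit_ratio(before, RAM):.2f}x RAM")
    print(f"all capped trees: {overcommit_ratio(after, RAM):.2f}x RAM")
```

Both ratios come out above 1, so the hard caps bound any single user but not the sum. That is consistent with the design pairing the caps with a PSI-driven oomd backstop instead of relying on the caps adding up to less than RAM.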
diff --git a/scripts/workstation/packages.txt b/scripts/workstation/packages.txt index c1f89359..ba6a6ab0 100644 --- a/scripts/workstation/packages.txt +++ b/scripts/workstation/packages.txt @@ -24,6 +24,10 @@ rsync wget tree shellcheck +# resource containment — the systemd-oomd backstop (setup-devvm.sh §10, 2026-06-22): +# a PSI-based, cgroup-aware OOM killer that sheds the single worst work cgroup +# before the box swap-thrashes/wedges. Ships SEPARATELY from core systemd on Ubuntu. +systemd-oomd # --- installed by setup-devvm.sh via NON-apt paths (not apt-installable) --- # nodejs + npm -> NodeSource repo (claude-code needs node >= 18; distro nodejs is too old) diff --git a/scripts/workstation/setup-devvm.sh b/scripts/workstation/setup-devvm.sh index fa14954f..d8ab021a 100755 --- a/scripts/workstation/setup-devvm.sh +++ b/scripts/workstation/setup-devvm.sh @@ -226,6 +226,134 @@ systemctl enable --now t3-dispatch.service \ log "WARN: some units failed to enable (check: systemctl status t3-dispatch t3-*.timer)" log "service units installed + enabled (t3-dispatch + 3 timers; t3-serve@ per-user)" +# 10) RESOURCE CONTAINMENT (2026-06-22): bound per-user memory + an OOM backstop so +# ONE user's runaway can never IO/memory-overload the shared box. History: the +# 2026-06-10 "swap-only, ssh/tmux memory-uncontained" decision let a single +# user's runaway (a 10G `ugrep`; agent storms) swap-thrash the 60/60-throttled +# virtual disk into an IO storm + multi-minute freeze (hard-killed 2026-06-22). +# t3-serve@ was already capped (its [Service] block); the HOLE was the uncapped +# user-.slice (all ssh/tmux work). Design — per user, on BOTH trees: +# MemoryHigh=12G soft, MemoryMax=16G hard, MemorySwapMax=0 (work never touches +# disk swap → no thrash; it OOMs locally at the ceiling instead), fair-share +# CPU/IO weights, and systemd-oomd (PSI) killing the single worst work cgroup on +# sustained box-wide memory pressure. 
system.slice is NOT policed, so sshd + +# services + your way in always survive. Docker containers are routed into a +# capped, oomd-policed docker.slice so they can't dodge the caps or mis-target +# oomd onto an innocent user. systemd-oomd pkg comes from packages.txt (§1). +# Post-mortem: docs/post-mortems/2026-06-22-devvm-mem-io-overload-containment.md + +# 10a) per-user caps + weights + oomd policing on EVERY user-.slice (ssh/tmux) +install -d -m 0755 /etc/systemd/system/user-.slice.d +cat > /etc/systemd/system/user-.slice.d/50-devvm-resource.conf <<'SLICE_EOF' +# Per-user containment for the shared devvm (setup-devvm.sh §10, 2026-06-22). +# Applies to EACH user-.slice = all of one user's ssh/tmux work. Mirrors the +# t3-serve@.service caps so a user is bounded in whichever surface they work in. +[Slice] +MemoryAccounting=yes +MemoryHigh=12G +MemoryMax=16G +MemorySwapMax=0 +CPUAccounting=yes +CPUWeight=100 +IOAccounting=yes +IOWeight=100 +ManagedOOMMemoryPressure=kill +ManagedOOMSwap=kill +SLICE_EOF + +# 10b) systemd-oomd backstop (PSI-based). Kill the worst-pressured descendant of a +# policed slice when memory-pressure 'full' stays >60% for 20s; swap guard 80%. +install -d -m 0755 /etc/systemd/oomd.conf.d +cat > /etc/systemd/oomd.conf.d/10-devvm.conf <<'OOMD_EOF' +# devvm OOM backstop (setup-devvm.sh §10, 2026-06-22). Acts only on slices that +# opt in via ManagedOOM*=kill (user-.slice, system-t3\x2dserve.slice, +# docker.slice). system.slice is deliberately NOT policed. +[OOM] +SwapUsedLimit=80% +DefaultMemoryPressureLimit=60% +DefaultMemoryPressureDurationSec=20s +OOMD_EOF + +# 10c) capped, oomd-policed docker.slice (top-level sibling of system/user slices); +# daemon.json cgroup-parent (10d) makes EVERY container land here. 
+cat > /etc/systemd/system/docker.slice <<'DOCKER_SLICE_EOF' +# All docker containers live here (cgroup-parent in /etc/docker/daemon.json) so +# they share one bounded budget and a runaway container dies ITSELF instead of +# mis-targeting oomd onto an innocent user. setup-devvm.sh §10, 2026-06-22. +[Unit] +Description=Docker containers slice (capped + oomd-policed) +[Slice] +MemoryAccounting=yes +MemoryHigh=6G +MemoryMax=8G +MemorySwapMax=0 +CPUAccounting=yes +CPUWeight=100 +IOAccounting=yes +IOWeight=100 +ManagedOOMMemoryPressure=kill +ManagedOOMSwap=kill +DOCKER_SLICE_EOF + +# 10d) point dockerd at docker.slice (idempotent JSON merge; flag a needed restart). +# python preserves the rest of daemon.json (buildkit, nvidia runtime, etc.). +docker_restart=0 +# if-condition form so the deliberate non-zero exit (10=changed) does NOT trip the +# script's `set -e`; $? in the else branch is the python exit code. +if python3 - <<'PY' +import json, os, sys +p = "/etc/docker/daemon.json" +try: + d = json.load(open(p)) if os.path.exists(p) else {} +except Exception: + sys.exit(2) # malformed -> don't touch +if d.get("cgroup-parent") == "docker.slice": + sys.exit(0) # already correct -> no restart +d["cgroup-parent"] = "docker.slice" +json.dump(d, open(p, "w"), indent=4) +sys.exit(10) # changed -> restart needed +PY +then rc=0; else rc=$?; fi +case $rc in + 0) : ;; + 10) docker_restart=1 ;; + *) log "WARN: could not patch /etc/docker/daemon.json — docker.slice NOT wired" ;; +esac + +# 10e) ManagedOOM on the auto-generated t3-serve slice. Its name carries an escaped +# '-' (system-t3\x2dserve.slice); a static drop-in works whether or not an +# instance is currently running (set-property would need it loaded). +install -d -m 0755 '/etc/systemd/system/system-t3\x2dserve.slice.d' +cat > '/etc/systemd/system/system-t3\x2dserve.slice.d/50-devvm-oomd.conf' <<'T3SLICE_EOF' +# oomd policing for all t3-serve@ instances (per-service caps live in +# t3-serve@.service). 
setup-devvm.sh §10, 2026-06-22. +[Slice] +ManagedOOMMemoryPressure=kill +ManagedOOMSwap=kill +T3SLICE_EOF + +# 10f) give system.slice a priority edge so sshd/services stay snappy under +# contention (weights are work-conserving — users still get idle CPU/IO). +install -d -m 0755 /etc/systemd/system/system.slice.d +cat > /etc/systemd/system/system.slice.d/50-devvm-priority.conf <<'SYS_EOF' +# Keep the box's nervous system responsive under contention (setup-devvm.sh §10). +[Slice] +CPUAccounting=yes +CPUWeight=200 +IOAccounting=yes +IOWeight=200 +SYS_EOF + +# 10g) activate: reload, arm oomd, restart dockerd ONLY if daemon.json changed. +systemctl daemon-reload +systemctl enable --now systemd-oomd.service >/dev/null 2>&1 \ + || log "WARN: systemd-oomd failed to enable — is the package installed? (packages.txt §1)" +if [[ $docker_restart -eq 1 ]] && systemctl is-active --quiet docker; then + log "restarting dockerd to apply cgroup-parent=docker.slice (running containers bounce briefly)" + systemctl restart docker || log "WARN: docker restart failed" +fi +log "§10 resource containment: per-user 12G/16G swap=0, oomd PSI backstop, docker.slice" + # Run one foreground reconcile while the admin Vault token borrowed in section 8 # is still available. This is what mints new roster users' isolated periodic # Vault tokens; the hourly no-admin-token reconcile only maintains existing ones.
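Reviewer note on step 10d: its exit-code contract (0 = already correct, 10 = changed, 2 = malformed) can be exercised without touching `/etc/docker`. The following is a minimal sketch, not the committed code. `patch_daemon_json` is a hypothetical function that takes the file path as a parameter, and the temp-file-plus-rename write and the 0644 mode are assumptions added here; the script in this diff writes the file in place.

```python
import json
import os
import tempfile

UNCHANGED, MALFORMED, CHANGED = 0, 2, 10  # same exit codes as step 10d


def patch_daemon_json(path: str, parent: str = "docker.slice") -> int:
    """Set "cgroup-parent" in a daemon.json-style file, keeping other keys."""
    data = {}
    if os.path.exists(path):
        try:
            with open(path) as fh:
                data = json.load(fh)
        except (OSError, ValueError):
            return MALFORMED  # unreadable or invalid JSON: leave it alone
        if not isinstance(data, dict):
            return MALFORMED  # valid JSON, but not an object we can patch
    if data.get("cgroup-parent") == parent:
        return UNCHANGED      # idempotent: nothing to write, no restart
    data["cgroup-parent"] = parent
    # Write to a sibling temp file and rename it over the target, so an
    # interrupted write cannot leave a truncated file behind.
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    with os.fdopen(fd, "w") as fh:
        json.dump(data, fh, indent=4)
    os.chmod(tmp, 0o644)      # mkstemp creates 0600; assumed target mode
    os.replace(tmp, path)
    return CHANGED


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        p = os.path.join(d, "daemon.json")
        with open(p, "w") as fh:
            json.dump({"features": {"buildkit": True}}, fh)
        print(patch_daemon_json(p))  # first run rewrites the file
        print(patch_daemon_json(p))  # second run finds nothing to change
```

Returning a code instead of calling `sys.exit` keeps the merge logic testable; a thin wrapper could still pass the value to `sys.exit` to preserve the shell-side `case $rc` handling.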