infra/docs/post-mortems/2026-06-22-devvm-mem-io-overload-containment.md
Viktor Barzin 4c532dbf97
devvm containment: drop the MemoryHigh throttle band, straight to MemoryMax OOM
t3.viktorbarzin.me went down 2026-07-02 15:42-16:35 UTC: an agent-spawned
12.3G ugrep plateaued inside t3-serve@wizard's MemoryHigh(12G)..MemoryMax(16G)
band. With MemorySwapMax=0 its anon pages were unreclaimable, so the kernel
throttled every task in the cgroup indefinitely (memory.pressure full ~80%,
oom_kill never fired) - the t3 event loop starved, the accept queue rotted,
and the terminal was dead until the hog was SIGKILLed by hand.

The 2026-06-22 design assumed 'throttle to a crawl, then OOM locally'; a hog
that stabilises between high and max never OOMs, so the throttle band is a
livelock zone, not a safety layer. Viktor asked to close that gap: MemoryHigh
is now explicitly infinity on all three work cgroup definitions (t3-serve@
unit, user-<uid>.slice drop-in, docker.slice) so a runaway is cgroup-OOM-
killed at MemoryMax immediately - OOMPolicy=continue already keeps the t3
server alive when a child dies. MemoryMax/MemorySwapMax=0/earlyoom unchanged.
Applied live to the devvm the same day (daemon-reload + runtime set-property
on running cgroups, no session restarts). Post-mortem addendum + runbook
updated in the same commit.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-07-02 16:59:38 +00:00

# 2026-06-22 — devvm memory/IO overload: per-user containment + OOM backstop
## Impact
- devvm (VM 102, the shared multi-user Claude Code workstation) became
unresponsive under combined memory + IO pressure and had to be **hard-killed +
rebooted** by the admin on 2026-06-22 (morning). All ssh/tmux + t3 sessions for
wizard/emo/anca lost, in-flight agents killed.
- Signature on the last htop before the kill: **load avg ~60** on 32 vCPU, **RAM
22.5/23.5G**, **swap 13.9/14.0G (full)**, a wall of **D-state** (uninterruptible
IO-wait) processes, and a single `ugrep` in emo's tmux holding **~10G RES /
64% CPU**. Many `claude --effort max/xhigh` sessions + playwright-chrome MCP
instances across three users on top.
## This is the "crawl" class, not the QEMU-stall class
The 2026-06-11 post-mortem (`2026-06-11-devvm-qemu-io-stall.md`) fixed a
*different* failure mode — a QEMU-userspace block-path wedge on the legacy LSI
controller. That fix shipped (verified 2026-06-22: the guest now boots on
`virtio_scsi`, `scsihw: virtio-scsi-single + iothread`). Its post-mortem
explicitly deferred **this** class:
> The recurring *crawl* class (agent storms → swap-thrash; journald
> watchdog-killed 3× on 2026-06-10) is a separate failure mode — ssh/tmux
> sessions remain memory-uncontained by **explicit decision (swap-only,
> 2026-06-10)**.
That explicit decision is the root cause closed here.
## Root cause
Work on the devvm lives in **two independent cgroup-v2 trees per user**, and only
one was capped:
| Tree | cgroup | Cap before today |
|---|---|---|
| t3 web sessions | `system.slice/system-t3\x2dserve.slice/t3-serve@<user>` | `MemoryHigh=12G MemoryMax=16G MemorySwapMax=0 OOMPolicy=continue` ✓ |
| **ssh/tmux sessions** | `user.slice/user-<uid>.slice` | **`MemoryMax=infinity`, swap unlimited** ✗ |
The uncapped `user-<uid>.slice` was the hole. A runaway there (the 10G `ugrep`;
stacked max-effort agents) grew unbounded, spilled into the **14G disk swap**, and
swap-thrashed the **host-mbps-throttled (60/60 MB/s) virtual disk**. That is the
overload chain:
```
uncapped tmux growth → disk-swap thrash on a throttled spindle
→ IO storm (D-state pileup) → load ~60 → box unresponsive → hard kill
```
i.e. **memory pressure becomes the IO storm**. There was also **no global OOM
backstop** (no systemd-oomd / earlyoom) to shed the worst offender before the
kernel OOM or the thrash-wedge. And even the existing t3 caps don't sum safely
(3 users × 16G = 48G > 32G RAM) — nothing reasoned about the *whole box*.
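
The aggregate arithmetic above can be checked mechanically. The following is a minimal sketch, not part of `setup-devvm.sh`: the helper names `parse_size` and `overcommit` are hypothetical, and the 32G RAM figure is the one quoted in the text.

```python
GIB = 1024 ** 3

def parse_size(text: str) -> float:
    """Convert a systemd-style size such as '16G' or 'infinity' to bytes."""
    text = text.strip()
    if text == "infinity":
        return float("inf")
    units = {"K": 1024, "M": 1024 ** 2, "G": GIB, "T": 1024 ** 4}
    if text[-1] in units:
        return float(text[:-1]) * units[text[-1]]
    return float(text)  # plain byte count

def overcommit(caps: dict[str, str], ram: str) -> float:
    """Sum of all MemoryMax caps minus physical RAM, in GiB (positive = unsafe)."""
    total = sum(parse_size(cap) for cap in caps.values())
    return (total - parse_size(ram)) / GIB

# t3 caps before the fix: three users at MemoryMax=16G each.
t3_caps = {"wizard": "16G", "emo": "16G", "anca": "16G"}
print(overcommit(t3_caps, "32G"))  # 16.0

# An uncapped tree (MemoryMax=infinity) makes the aggregate unbounded.
print(overcommit({**t3_caps, "tmux": "infinity"}, "32G"))  # inf
```

A positive result means the caps alone cannot prevent box-wide exhaustion, which is why a global backstop is still needed on top of per-cgroup limits.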
## Fix (`setup-devvm.sh` §10, applied live 2026-06-22)
Design decisions (interviewed with the admin via `/grill-me`): **soft-generous
per-user caps + a hard ceiling + a kill-the-worst backstop**, maximising
single-user utilisation while making a box-wide wedge impossible. (The backstop
was first built on systemd-oomd, then switched to earlyoom mid-rollout when oomd
proved inert with `swap=0` — see Verification + Lessons.)
| Layer | What |
|---|---|
| **Per-user caps, BOTH trees** | `user-.slice.d` drop-in gives every `user-<uid>.slice` the same `MemoryHigh=12G / MemoryMax=16G / MemorySwapMax=0` the t3 tree already had. A user is now bounded in whichever surface they work in. |
| **No disk swap for work** | `MemorySwapMax=0` on every work cgroup → a spike OOMs **locally** at the ceiling instead of thrashing the throttled disk. Kills the IO-storm-via-swap mechanism at the source. The 14G swapfile stays for system cold pages only. |
| **earlyoom backstop (free-RAM threshold)** | New package — used **instead of systemd-oomd** (which is inert with `swap=0`; see Lessons). Watches `MemAvailable%` and SIGTERMs the biggest task at **5%**, SIGKILL at **3%**, swap ignored (`-s 100`). `--avoid` keeps sshd/systemd/dockerd/containerd/t3-dispatch/tmux off the victim list (**the admin's way in always survives**); `--prefer` targets the agent/browser hogs (python3/node/chrome/…). Swap-independent and reliable, where oomd's pressure-kill was not. |
| **Fair-share CPU/IO** | `CPUWeight`/`IOWeight` per slice (system.slice 200, users + docker 100 each). Work-conserving — a lone user still gets all 32 cores + the full IO budget when others idle; weights only bite under contention. No hard CPU/IO caps. |
| **Docker containment** | Containers previously landed in `system.slice` — uncapped. Now `cgroup-parent: docker.slice` in `daemon.json` routes every container into a capped (`MemoryMax=8G`, swap 0) slice, so a runaway container is cgroup-OOM'd locally instead of escaping into the uncapped `system.slice`. |
Durable in `setup-devvm.sh` (survives a VM rebuild); `earlyoom` added to
`packages.txt`. The numbers are tunable — `MemoryHigh=12G` will throttle a *lone*
heavy user between 12G and 16G even with RAM free; bump to 16/20 if that bites.
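
The "work-conserving" property of the weights in the table can be illustrated with a small model. This is a sketch under a simplifying assumption: it treats the slices as flat siblings competing directly, whereas on a real system the `user-<uid>.slice` units sit under `user.slice`. The keys and the function name `contended_shares` are hypothetical labels; only the weight values come from the table.

```python
def contended_shares(weights: dict[str, int], active: list[str]) -> dict[str, float]:
    """Fraction of CPU (or IO) each active sibling receives under contention.

    Idle siblings are simply left out of `active`: their weight is not
    reserved, which is what makes proportional weights work-conserving.
    """
    total = sum(weights[name] for name in active)
    return {name: weights[name] / total for name in active}

# Weights from the fix table: system.slice 200, users and docker 100 each.
weights = {"system": 200, "wizard": 100, "emo": 100, "anca": 100, "docker": 100}

# A lone user gets the whole machine.
print(contended_shares(weights, ["wizard"]))

# Everyone busy at once: system gets 200/600, each other slice 100/600.
print(contended_shares(weights, ["system", "wizard", "emo", "anca", "docker"]))
```

The point of the model is that weights divide capacity only among slices that are actually demanding it, so no hard CPU or IO cap is needed to keep a single user at full speed on an idle box.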
## Verification (live, 2026-06-22)
- **Caps live on running cgroups**: all three `user-<uid>.slice` report
`memory.high=12G memory.max=16G memory.swap.max=0`; `docker.slice` `memory.max=8G`;
daemon.json kept buildkit/nvidia/insecure-registries; paperless-mcp recovered
under `docker.slice`.
- **Stress test A (hard cap)** — the PRIMARY guard: a 2G-capped, swap=0 balloon was
killed at exactly 2G by the cgroup-local OOM (`constraint=CONSTRAINT_MEMCG`) with
**swap flat at 0MB throughout** — no thrash. Same mechanism protects every user
slice (16G) and `docker.slice` (8G).
- **Soft cap observed**: a balloon pushed past `MemoryHigh` sat at ~220M / 99%
memory.pressure, throttled to a crawl, making no progress and harming nothing —
a runaway is throttled, not just killed.
- **systemd-oomd disproven, then dropped**: a self-policed balloon held
`memory.pressure full avg10 = 96-99%` (≫ its 20% limit) for >70s but oomd never
killed it — `Pgscan: 0`. oomd's pressure-kill only acts on cgroups doing active
reclaim, which a `swap=0` anon workload never does. oomd was purged.
- **earlyoom backstop** — verified via `--dryrun`: at the threshold it logs
`low memory! … mem 90% swap 100%` (fires on RAM alone, swap ignored) and selects
`SIGTERM … "chrome"` (a `--prefer` hog), never an `--avoid`'d daemon. Live
earlyoom v1.7 confirms `SIGTERM mem<=5% / SIGKILL mem<=3%, swap<=100%`.
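
The threshold logic verified above can be sketched as follows. This is an illustrative model, not earlyoom's code: it computes the `MemAvailable` percentage from `/proc/meminfo`-style text and maps it to the 5% / 3% thresholds stated here. Victim selection (`--avoid` / `--prefer`) is not modelled, and the sample numbers are invented for the example.

```python
def mem_available_percent(meminfo: str) -> float:
    """Return MemAvailable as a percentage of MemTotal from /proc/meminfo text."""
    fields = {}
    for line in meminfo.splitlines():
        key, _, rest = line.partition(":")
        if rest.strip():
            fields[key.strip()] = int(rest.split()[0])  # values are in kB
    return 100.0 * fields["MemAvailable"] / fields["MemTotal"]

def backstop_action(percent: float, term_at: float = 5.0, kill_at: float = 3.0) -> str:
    """Map free-RAM percentage to the signal a threshold-based backstop would send."""
    if percent <= kill_at:
        return "SIGKILL"
    if percent <= term_at:
        return "SIGTERM"
    return "none"

# Hypothetical snapshot: 4% of RAM available, so inside the SIGTERM band.
sample = "MemTotal:        1000000 kB\nMemAvailable:      40000 kB\n"
pct = mem_available_percent(sample)
print(pct, backstop_action(pct))  # 4.0 SIGTERM
```

On a live system the same function could be fed `open("/proc/meminfo").read()`. Because the decision depends only on available RAM, it keeps working when swap is disabled for work cgroups.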
## Out of scope / follow-ups
- **Alerting** (tracked, fast-follow bead): `DevvmDown` (closes the 90-min
detection gap the 2026-06-11 PM flagged), sustained-memory-PSI/swap pressure
early-warning, and an "earlyoom-killed-something" alert (earlyoom logs each kill;
`-N /script` can push a metric). devvm node-exporter is already scraped
(`job=devvm`, `10.0.10.10:9100`), so only alert *rules* are new (a
monitoring-stack Terraform change).
- **zram cushion**: considered, deferred. Could let work cgroups absorb spikes in
compressed RAM instead of OOMing at the ceiling; not needed for the wedge fix.
- **Per-user docker isolation**: containers share one `docker.slice` budget, not
per-user. Fine for current usage (krr + short-lived tools).
- **Host-side IO**: the 60/60 MB/s cap + the shared `sdc` HDD IO domain are
host-level (bead `code-oflt`); unchanged here.
## Lessons
- **"Swap as the safety valve" is an IO-storm amplifier on a throttled disk.**
Leaving ssh/tmux memory-uncontained (the 2026-06-10 decision) traded a clean
local OOM for a box-wide swap-thrash wedge. `MemorySwapMax=0` + a hard cap turns
the failure back into a contained, local kill.
- **Cap the box, not one surface.** t3 sessions were capped for months while the
same user's tmux was unbounded — and the caps that existed didn't sum to < RAM.
Containment has to reason about every tree and the aggregate.
- **A backstop must protect the operator's way in.** earlyoom `--avoid`s
sshd/systemd/dockerd/containerd/t3-dispatch/tmux, so the box always stays
reachable to recover; only the agent/browser hogs are eligible victims.
- **systemd-oomd is the wrong backstop for a no-swap box: verify, don't assume.**
oomd's memory-pressure killer only fires on cgroups doing active reclaim
(`pgscan` rising). With `MemorySwapMax=0` + anonymous memory there is nothing to
reclaim, so a cgroup sat at 99% `memory.pressure` indefinitely and oomd never
acted (proven with `oomctl` + a balloon). The very `swap=0` that kills the IO
storm also neuters oomd. earlyoom (free-RAM threshold, swap-independent) is the
correct pairing. A famous tool that "does OOM" still has to be proven to fire
under *your* configuration.
## Addendum (2026-07-02): the MemoryHigh throttle band livelocks — removed
The soft-cap layer of this design was falsified in production on 2026-07-02
(~15:42-16:35 UTC): an agent-spawned `ugrep` (12.35G RSS; `-o` with wide
alternation captures over a multi-GB `.jsonl` transcript) **plateaued inside
t3-serve@wizard's `MemoryHigh=12G..MemoryMax=16G` band**. With
`MemorySwapMax=0` its anonymous pages were unreclaimable, so the kernel parked
every allocating task of the cgroup in `mem_cgroup_handle_over_high`
(`memory.pressure full avg60 ≈ 80%`, `memory.events high=882948`, `oom_kill=0`),
including the `t3 serve` event loop (~0.5G RSS, pure collateral). The accept
queue backed up (21 pending connections), t3-probe logged `t3serve: [Errno 104]
Connection reset by peer`, t3-dispatch logged `proxy error: context canceled`,
and t3.viktorbarzin.me was dead for its user until the hog was SIGKILLed by
hand (the D-state high-throttle sleep IS killable; the cgroup dropped 14G → 1.4G
and the service recovered in seconds with no restart).
The Verification bullet above (a soft-capped balloon "throttled to a crawl,
making no progress and **harming nothing**") holds only when the hog is alone
in its cgroup. Sharing the cgroup with a latency-sensitive server, the crawl
IS the harm: a hog that stabilises below `MemoryMax` never triggers the local
OOM the design counted on, so the band converts "runaway dies" into "everyone
in the cgroup stalls forever".
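
The signature described above (high stall pressure, a climbing `high` event counter, and no `oom_kill`) can be detected from the cgroup's own files. The sketch below parses text in the format of cgroup-v2 `memory.pressure` and `memory.events`. The function names and the 50% threshold are assumptions for illustration. In the sample, only `full avg60`, `high`, and `oom_kill` come from the incident; the other values are placeholders.

```python
def parse_psi(text: str) -> dict[str, dict[str, float]]:
    """Parse a PSI file into {'some': {...}, 'full': {...}}."""
    out = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        kind, *pairs = line.split()
        out[kind] = {k: float(v) for k, v in (p.split("=") for p in pairs)}
    return out

def parse_events(text: str) -> dict[str, int]:
    """Parse memory.events ('name count' per line) into a dict."""
    return {k: int(v) for k, v in (l.split() for l in text.splitlines() if l.strip())}

def looks_livelocked(pressure: str, events: str, full_avg60_limit: float = 50.0) -> bool:
    """True when tasks are stalled by memory.high but the cgroup OOM has not fired."""
    psi, ev = parse_psi(pressure), parse_events(events)
    return (psi["full"]["avg60"] >= full_avg60_limit
            and ev.get("high", 0) > 0
            and ev.get("oom_kill", 0) == 0)

pressure = ("some avg10=85.00 avg60=82.00 avg300=60.00 total=1\n"
            "full avg10=81.00 avg60=80.00 avg300=58.00 total=1\n")
events = "low 0\nhigh 882948\nmax 0\noom 0\noom_kill 0\n"
print(looks_livelocked(pressure, events))  # True
```

A single snapshot cannot tell "never OOMs" from "has not OOMed yet", so a real check would likely compare two samples some time apart and alert only if `high` keeps rising while `oom_kill` stays flat.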
**Fix (same day, admin-approved): `MemoryHigh=infinity` on all three work
cgroup definitions**: `scripts/t3-serve@.service`, the `user-.slice.d`
drop-in, and `docker.slice` (`setup-devvm.sh` §10a-10c). A runaway now runs
unthrottled into `MemoryMax` and is cgroup-OOM-killed immediately
(`OOMPolicy=continue` keeps t3-serve itself alive; in slices the kernel kills
the biggest task). `MemoryMax`, `MemorySwapMax=0`, and earlyoom (the layers
the stress tests actually validated) are unchanged. Applied live via
`daemon-reload` + runtime `set-property` on the running cgroups; no session
restarts.
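
For orientation, the resulting limits can be written out as systemd resource-control directives. The sketch below renders the text of such a fragment. The function name `render_limits` is hypothetical and the actual contents of `setup-devvm.sh` are not reproduced here; only the directive names and the values (16G for user slices, 8G for `docker.slice`, swap 0, `MemoryHigh=infinity`) are taken from this document.

```python
def render_limits(section: str, memory_max: str,
                  memory_high: str = "infinity", swap_max: str = "0") -> str:
    """Render a unit-file fragment that caps memory without a throttle band."""
    lines = [
        f"[{section}]",
        f"MemoryHigh={memory_high}",   # no soft throttle below the hard cap
        f"MemoryMax={memory_max}",     # cgroup-local OOM kill at this ceiling
        f"MemorySwapMax={swap_max}",   # work cgroups may not use disk swap
    ]
    return "\n".join(lines) + "\n"

# Per-user slice drop-in and docker.slice, using the caps from the fix table.
print(render_limits("Slice", "16G"))
print(render_limits("Slice", "8G"))
```

Setting `MemoryHigh=infinity` explicitly, rather than omitting the line, makes the intent visible in the unit and overrides any value inherited from an earlier drop-in.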
Lesson: **with `swap=0`, `memory.high` is not a gentler `memory.max`; it is
an unbounded stall injector for everything sharing the cgroup.** Cap-and-kill
beats throttle-and-pray for multi-tenant interactive services.