From de163aa6afdc9dc220d53b09addf5102aaa95373 Mon Sep 17 00:00:00 2001
From: Viktor Barzin
Date: Mon, 22 Jun 2026 10:39:16 +0000
Subject: [PATCH] workstation: switch devvm OOM backstop from systemd-oomd to earlyoom
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

The systemd-oomd backstop added in the previous commit is INERT on this box.
oomd's memory-pressure kill only acts on cgroups doing active reclaim (pgscan
rising); with MemorySwapMax=0 + anonymous agent memory there is nothing to
reclaim, so pgscan stays 0 and oomd never fires. Proven live: a cgroup held at
96-99% memory.pressure for >70s with pgscan=0 was never killed (oomctl +
balloon). The very swap=0 that kills the IO storm also neuters oomd.

Replace it with earlyoom, which watches free RAM (MemAvailable%) and is
swap-independent: SIGTERM the biggest task at 5%, SIGKILL at 3%, swap ignored
(-s 100). It --avoids sshd/systemd/dockerd/containerd/t3-dispatch/tmux (the
admin's way in always survives) and --prefers the agent/browser hogs. Verified
via --dryrun: fires on the RAM threshold and selects a chrome process, not a
protected daemon.

The per-cgroup caps (MemoryHigh=12G/MemoryMax=16G/MemorySwapMax=0 per user,
docker.slice 8G) are unchanged and remain the PRIMARY guard — earlyoom is the
aggregate net for the rare all-users-maxed case.

systemd-oomd purged; its config + ManagedOOM drop-ins removed. Post-mortem
updated with the finding.

Co-Authored-By: Claude Opus 4.8
---
 ...06-22-devvm-mem-io-overload-containment.md | 58 +++++++-----
 scripts/workstation/packages.txt              | 10 ++-
 scripts/workstation/setup-devvm.sh            | 88 +++++++++----------
 3 files changed, 88 insertions(+), 68 deletions(-)

diff --git a/docs/post-mortems/2026-06-22-devvm-mem-io-overload-containment.md b/docs/post-mortems/2026-06-22-devvm-mem-io-overload-containment.md
index 4a4b1606..664869fa 100644
--- a/docs/post-mortems/2026-06-22-devvm-mem-io-overload-containment.md
+++ b/docs/post-mortems/2026-06-22-devvm-mem-io-overload-containment.md
@@ -52,21 +52,23 @@ backstop** (no systemd-oomd / earlyoom) to shed the worst offender before
 the kernel OOM or the thrash-wedge. And even the existing t3 caps don't sum
 safely (3 users × 16G = 48G > 32G RAM) — nothing reasoned about the *whole box*.
 
-## Fix (shipped this commit — `setup-devvm.sh` §10, applied live 2026-06-22)
+## Fix (`setup-devvm.sh` §10, applied live 2026-06-22)
 
 Design decisions (interviewed with the admin via `/grill-me`): **soft-generous
-per-user caps + a hard ceiling + an oomd backstop**, maximising single-user
-utilisation while making a box-wide wedge impossible.
+per-user caps + a hard ceiling + a kill-the-worst backstop**, maximising
+single-user utilisation while making a box-wide wedge impossible. (The backstop
+was first built on systemd-oomd, then switched to earlyoom mid-rollout when oomd
+proved inert with `swap=0` — see Verification + Lessons.)
 
 | Layer | What |
 |---|---|
 | **Per-user caps, BOTH trees** | `user-.slice.d` drop-in gives every `user-.slice` the same `MemoryHigh=12G / MemoryMax=16G / MemorySwapMax=0` the t3 tree already had. A user is now bounded in whichever surface they work in. |
 | **No disk swap for work** | `MemorySwapMax=0` on every work cgroup → a spike OOMs **locally** at the ceiling instead of thrashing the throttled disk. Kills the IO-storm-via-swap mechanism at the source. The 14G swapfile stays for system cold pages only. |
-| **systemd-oomd backstop (PSI)** | New package. Kills the single worst-pressured descendant of a policed slice when memory-pressure (`full`) stays **>60% for 20s**; global swap guard **80%**. Polices `user.slice`, `system-t3\x2dserve.slice`, `docker.slice`. **`system.slice` is deliberately NOT policed** — sshd + services + the admin's way in always survive; only a runaway *user* session is ever sacrificed, locally, under genuine box-wide pressure. |
+| **earlyoom backstop (free-RAM threshold)** | New package — used **instead of systemd-oomd** (which is inert with `swap=0`; see Lessons). Watches `MemAvailable%` and SIGTERMs the biggest task at **5%**, SIGKILL at **3%**, swap ignored (`-s 100`). `--avoid` keeps sshd/systemd/dockerd/containerd/t3-dispatch/tmux off the victim list (**the admin's way in always survives**); `--prefer` targets the agent/browser hogs (python3/node/chrome/…). Swap-independent and reliable, where oomd's pressure-kill was not. |
 | **Fair-share CPU/IO** | `CPUWeight`/`IOWeight` per slice (system.slice 200, users + docker 100 each). Work-conserving — a lone user still gets all 32 cores + the full IO budget when others idle; weights only bite under contention. No hard CPU/IO caps. |
-| **Docker containment** | Containers previously landed in `system.slice` — uncapped AND protected from oomd, so a ballooning container would mis-target oomd onto an innocent user. Now `cgroup-parent: docker.slice` in `daemon.json` routes every container into a capped (`MemoryMax=8G`, swap 0), oomd-policed slice. |
+| **Docker containment** | Containers previously landed in `system.slice` — uncapped. Now `cgroup-parent: docker.slice` in `daemon.json` routes every container into a capped (`MemoryMax=8G`, swap 0) slice, so a runaway container is cgroup-OOM'd locally instead of escaping into the uncapped `system.slice`. |
 
-Durable in `setup-devvm.sh` (survives a VM rebuild); `systemd-oomd` added to
+Durable in `setup-devvm.sh` (survives a VM rebuild); `earlyoom` added to
 `packages.txt`. The numbers are tunable — `MemoryHigh=12G` will throttle a *lone*
 heavy user between 12–16G even with RAM free; bump to 16/20 if that bites.
 
@@ -76,22 +78,30 @@ heavy user between 12–16G even with RAM free; bump to 16/20 if that bites.
   `memory.high=12G memory.max=16G memory.swap.max=0`; `docker.slice`
   `memory.max=8G`; daemon.json kept buildkit/nvidia/insecure-registries;
   paperless-mcp recovered under `docker.slice`.
-- **oomd armed**: `oomctl` shows `Dry Run: no`, swap-limit 80%, pressure-limit
-  60% / 20s, and the 5 policed cgroups — `system.slice` absent (protected).
-- **Stress test A (hard cap)**: a 2G-capped, swap=0 balloon was killed at exactly
-  2G by the cgroup-local OOM (`constraint=CONSTRAINT_MEMCG`) with **swap flat at
-  0MB throughout** — no thrash. This is the mechanism protecting every slice.
-- **Stress test B (oomd backstop)**: a self-policed balloon (256M soft / 20%
-  pressure limit) was killed by **systemd-oomd on memory pressure**, confirming
-  the backstop fires, not just arms.
+- **Stress test A (hard cap)** — the PRIMARY guard: a 2G-capped, swap=0 balloon was
+  killed at exactly 2G by the cgroup-local OOM (`constraint=CONSTRAINT_MEMCG`) with
+  **swap flat at 0MB throughout** — no thrash. Same mechanism protects every user
+  slice (16G) and `docker.slice` (8G).
+- **Soft cap observed**: a balloon pushed past `MemoryHigh` sat at ~220M / 99%
+  memory.pressure, throttled to a crawl, making no progress and harming nothing —
+  a runaway is throttled, not just killed.
+- **systemd-oomd disproven, then dropped**: a self-policed balloon held
+  `memory.pressure full avg10 = 96–99%` (≫ its 20% limit) for >70s but oomd never
+  killed it — `Pgscan: 0`. oomd's pressure-kill only acts on cgroups doing active
+  reclaim, which a `swap=0` anon workload never does. oomd was purged.
+- **earlyoom backstop** — verified via `--dryrun`: at the threshold it logs
+  `low memory! … mem 90% swap 100%` (fires on RAM alone, swap ignored) and selects
+  `SIGTERM … "chrome"` (a `--prefer` hog), never an `--avoid`'d daemon. Live
+  earlyoom v1.7 confirms `SIGTERM mem<=5% / SIGKILL mem<=3%, swap<=100%`.
 
 ## Out of scope / follow-ups
 
 - **Alerting** (tracked, fast-follow bead): `DevvmDown` (closes the 90-min
   detection gap the 2026-06-11 PM flagged), sustained-memory-PSI/swap pressure
-  early-warning, and an "oomd-killed-something" alert. devvm node-exporter is
-  already scraped (`job=devvm`, `10.0.10.10:9100`), so only alert *rules* are new
-  (a monitoring-stack Terraform change).
+  early-warning, and an "earlyoom-killed-something" alert (earlyoom logs each kill;
+  `-N /script` can push a metric). devvm node-exporter is already scraped
+  (`job=devvm`, `10.0.10.10:9100`), so only alert *rules* are new (a
+  monitoring-stack Terraform change).
 - **zram cushion**: considered, deferred. Could let work cgroups absorb spikes in
   compressed RAM instead of OOMing at the ceiling; not needed for the wedge fix.
 - **Per-user docker isolation**: containers share one `docker.slice` budget, not
@@ -108,6 +118,14 @@ heavy user between 12–16G even with RAM free; bump to 16/20 if that bites.
 - **Cap the box, not one surface.** t3 sessions were capped for months while the
   same user's tmux was unbounded — and the caps that existed didn't sum to < RAM.
   Containment has to reason about every tree and the aggregate.
-- **A backstop must protect the operator's way in.** oomd polices the work trees
-  only; `system.slice` (sshd, the daemons) is never a victim, so the box always
-  stays reachable to recover.
+- **A backstop must protect the operator's way in.** earlyoom `--avoid`s
+  sshd/systemd/dockerd/containerd/t3-dispatch/tmux, so the box always stays
+  reachable to recover; only the agent/browser hogs are eligible victims.
+- **systemd-oomd is the wrong backstop for a no-swap box — verify, don't assume.**
+  oomd's memory-pressure killer only fires on cgroups doing active reclaim
+  (`pgscan` rising). With `MemorySwapMax=0` + anonymous memory there is nothing to
+  reclaim, so a cgroup sat at 99% `memory.pressure` indefinitely and oomd never
+  acted (proven with `oomctl` + a balloon). The very `swap=0` that kills the IO
+  storm also neuters oomd. earlyoom (free-RAM threshold, swap-independent) is the
+  correct pairing. A famous tool that "does OOM" still has to be proven to fire
+  under *your* configuration.
diff --git a/scripts/workstation/packages.txt b/scripts/workstation/packages.txt
index ba6a6ab0..f578ee90 100644
--- a/scripts/workstation/packages.txt
+++ b/scripts/workstation/packages.txt
@@ -24,10 +24,12 @@ rsync
 wget
 tree
 shellcheck
-# resource containment — the systemd-oomd backstop (setup-devvm.sh §10, 2026-06-22):
-# a PSI-based, cgroup-aware OOM killer that sheds the single worst work cgroup
-# before the box swap-thrashes/wedges. Ships SEPARATELY from core systemd on Ubuntu.
-systemd-oomd
+# resource containment — earlyoom backstop (setup-devvm.sh §10, 2026-06-22): a
+# free-RAM-threshold OOM killer used INSTEAD of systemd-oomd, which is inert with
+# swap=0 (its pressure-kill needs reclaim/pgscan that no-swap anon workloads never
+# produce; verified live — 99% mem.pressure, pgscan=0, no kill). earlyoom watches
+# MemAvailable% and is swap-independent.
+earlyoom
 
 # --- installed by setup-devvm.sh via NON-apt paths (not apt-installable) ---
 # nodejs + npm -> NodeSource repo (claude-code needs node >= 18; distro nodejs is too old)
diff --git a/scripts/workstation/setup-devvm.sh b/scripts/workstation/setup-devvm.sh
index d8ab021a..e4265269 100755
--- a/scripts/workstation/setup-devvm.sh
+++ b/scripts/workstation/setup-devvm.sh
@@ -233,16 +233,20 @@ log "service units installed + enabled (t3-dispatch + 3 timers; t3-serve@ per-us
 # virtual disk into an IO storm + multi-minute freeze (hard-killed 2026-06-22).
 # t3-serve@ was already capped (its [Service] block); the HOLE was the uncapped
 # user-.slice (all ssh/tmux work). Design — per user, on BOTH trees:
-# MemoryHigh=12G soft, MemoryMax=16G hard, MemorySwapMax=0 (work never touches
-# disk swap → no thrash; it OOMs locally at the ceiling instead), fair-share
-# CPU/IO weights, and systemd-oomd (PSI) killing the single worst work cgroup on
-# sustained box-wide memory pressure. system.slice is NOT policed, so sshd +
-# services + your way in always survive. Docker containers are routed into a
-# capped, oomd-policed docker.slice so they can't dodge the caps or mis-target
-# oomd onto an innocent user. systemd-oomd pkg comes from packages.txt (§1).
+# MemoryHigh=12G soft (throttles a runaway to a crawl), MemoryMax=16G hard,
+# MemorySwapMax=0 (work never touches disk swap → no thrash; it OOMs locally at
+# the ceiling instead), plus fair-share CPU/IO weights.
+# BACKSTOP = earlyoom, NOT systemd-oomd. We first shipped systemd-oomd but it is
+# INERT with swap=0: its pressure-kill only acts on cgroups doing active reclaim
+# (pgscan rising), and a no-swap anon workload never reclaims — verified live, a
+# cgroup at 99% memory.pressure / pgscan=0 was never killed. earlyoom instead
+# watches FREE RAM (MemAvailable%) and SIGTERMs the biggest process at 5% / -k 3%,
+# swap-independent and reliable. It --avoids sshd/systemd/dockerd (your way in
+# stays alive) and --prefers the agent/browser hogs. earlyoom pkg = packages.txt
+# (§1). Per-cgroup MemoryMax is the PRIMARY guard; earlyoom is the aggregate net.
 # Post-mortem: docs/post-mortems/2026-06-22-devvm-mem-io-overload-containment.md
 
-# 10a) per-user caps + weights + oomd policing on EVERY user-.slice (ssh/tmux)
+# 10a) per-user caps + fair-share weights on EVERY user-.slice (ssh/tmux)
 install -d -m 0755 /etc/systemd/system/user-.slice.d
 cat > /etc/systemd/system/user-.slice.d/50-devvm-resource.conf <<'SLICE_EOF'
 # Per-user containment for the shared devvm (setup-devvm.sh §10, 2026-06-22).
@@ -257,31 +261,31 @@ CPUAccounting=yes
 CPUWeight=100
 IOAccounting=yes
 IOWeight=100
-ManagedOOMMemoryPressure=kill
-ManagedOOMSwap=kill
 SLICE_EOF
 
-# 10b) systemd-oomd backstop (PSI-based). Kill the worst-pressured descendant of a
-# policed slice when memory-pressure 'full' stays >60% for 20s; swap guard 80%.
-install -d -m 0755 /etc/systemd/oomd.conf.d
-cat > /etc/systemd/oomd.conf.d/10-devvm.conf <<'OOMD_EOF'
-# devvm OOM backstop (setup-devvm.sh §10, 2026-06-22). Acts only on slices that
-# opt in via ManagedOOM*=kill (user-.slice, system-t3\x2dserve.slice,
-# docker.slice). system.slice is deliberately NOT policed.
-[OOM]
-SwapUsedLimit=80%
-DefaultMemoryPressureLimit=60%
-DefaultMemoryPressureDurationSec=20s
-OOMD_EOF
+# 10b) earlyoom backstop config — RAM-threshold, swap-INDEPENDENT (see header note
+# on why systemd-oomd is inert with swap=0). The Debian unit reads /etc/default.
+cat > /etc/default/earlyoom <<'EARLYOOM_EOF'
+# devvm aggregate OOM backstop (setup-devvm.sh §10, 2026-06-22). Watches FREE RAM
+# (MemAvailable%) and kills the biggest task before the box exhausts. Unlike
+# systemd-oomd it needs NO swap/reclaim, so it works with our swap=0 work cgroups.
+#   -m 5,3      SIGTERM the victim at MemAvailable<5%, SIGKILL at <3%
+#   -s 100,100  ignore swap in the decision (RAM-only; work cgroups are swap=0)
+#   --avoid     never the box's nervous system / your way back in
+#   --prefer    target the agent/browser/build hogs that actually exhaust RAM
+#   -r 3600     hourly memory report (the 60s default is log spam)
+EARLYOOM_ARGS="-m 5,3 -s 100,100 -r 3600 --avoid ^(systemd|systemd-.*|sshd|dockerd|containerd|init|t3-dispatch|tmux.*)$ --prefer ^(python3|node|chrome|chromium|ugrep|rg|go|claude)$"
+EARLYOOM_EOF
 
-# 10c) capped, oomd-policed docker.slice (top-level sibling of system/user slices);
-# daemon.json cgroup-parent (10d) makes EVERY container land here.
+# 10c) capped docker.slice (top-level sibling of system/user slices); daemon.json
+# cgroup-parent (10d) makes EVERY container land here under one bounded budget.
 cat > /etc/systemd/system/docker.slice <<'DOCKER_SLICE_EOF'
 # All docker containers live here (cgroup-parent in /etc/docker/daemon.json) so
-# they share one bounded budget and a runaway container dies ITSELF instead of
-# mis-targeting oomd onto an innocent user. setup-devvm.sh §10, 2026-06-22.
+# they share one bounded budget and a runaway container is capped at MemoryMax
+# (cgroup-OOM'd locally) instead of escaping into the uncapped system.slice.
+# setup-devvm.sh §10, 2026-06-22.
 [Unit]
-Description=Docker containers slice (capped + oomd-policed)
+Description=Docker containers slice (capped)
 [Slice]
 MemoryAccounting=yes
 MemoryHigh=6G
@@ -291,8 +295,6 @@ CPUAccounting=yes
 CPUWeight=100
 IOAccounting=yes
 IOWeight=100
-ManagedOOMMemoryPressure=kill
-ManagedOOMSwap=kill
 DOCKER_SLICE_EOF
 
 # 10d) point dockerd at docker.slice (idempotent JSON merge; flag a needed restart).
@@ -320,17 +322,9 @@ case $rc in
   *) log "WARN: could not patch /etc/docker/daemon.json — docker.slice NOT wired" ;;
 esac
 
-# 10e) ManagedOOM on the auto-generated t3-serve slice. Its name carries an escaped
-# '-' (system-t3\x2dserve.slice); a static drop-in works whether or not an
-# instance is currently running (set-property would need it loaded).
-install -d -m 0755 '/etc/systemd/system/system-t3\x2dserve.slice.d'
-cat > '/etc/systemd/system/system-t3\x2dserve.slice.d/50-devvm-oomd.conf' <<'T3SLICE_EOF'
-# oomd policing for all t3-serve@ instances (per-service caps live in
-# t3-serve@.service). setup-devvm.sh §10, 2026-06-22.
-[Slice]
-ManagedOOMMemoryPressure=kill
-ManagedOOMSwap=kill
-T3SLICE_EOF
+# 10e) t3-serve@ instances need no extra drop-in: their per-instance MemoryMax /
+# MemorySwapMax caps live in t3-serve@.service [Service]; earlyoom (10b) is the
+# box-wide net. (The earlier oomd slice-policing drop-in was removed — inert.)
 
 # 10f) give system.slice a priority edge so sshd/services stay snappy under
 # contention (weights are work-conserving — users still get idle CPU/IO).
@@ -344,15 +338,21 @@ IOAccounting=yes
 IOWeight=200
 SYS_EOF
 
-# 10g) activate: reload, arm oomd, restart dockerd ONLY if daemon.json changed.
+# 10g) activate: reload, arm earlyoom, restart dockerd ONLY if daemon.json changed.
 systemctl daemon-reload
-systemctl enable --now systemd-oomd.service >/dev/null 2>&1 \
-  || log "WARN: systemd-oomd failed to enable — is the package installed? (packages.txt §1)"
+# earlyoom reads /etc/default/earlyoom (10b); enable + restart so new args take effect
+# even on a re-run where it was already running.
+systemctl enable --now earlyoom.service >/dev/null 2>&1 \
+  || log "WARN: earlyoom failed to enable — is the package installed? (packages.txt §1)"
+systemctl restart earlyoom.service 2>/dev/null || true
+# systemd-oomd is inert with swap=0 (see header) — ensure it isn't also running from
+# an earlier iteration of this section. No-op if the package was never installed.
+systemctl disable --now systemd-oomd.service >/dev/null 2>&1 || true
 if [[ $docker_restart -eq 1 ]] && systemctl is-active --quiet docker; then
   log "restarting dockerd to apply cgroup-parent=docker.slice (running containers bounce briefly)"
   systemctl restart docker || log "WARN: docker restart failed"
 fi
-log "§10 resource containment: per-user 12G/16G swap=0, oomd PSI backstop, docker.slice"
+log "§10 resource containment: per-user 12G/16G swap=0, earlyoom RAM backstop, docker.slice"
 
 # Run one foreground reconcile while the admin Vault token borrowed in section 8
 # is still available. This is what mints new roster users' isolated periodic