diff --git a/docs/post-mortems/2026-06-22-devvm-mem-io-overload-containment.md b/docs/post-mortems/2026-06-22-devvm-mem-io-overload-containment.md
index 664869fa..27a4484a 100644
--- a/docs/post-mortems/2026-06-22-devvm-mem-io-overload-containment.md
+++ b/docs/post-mortems/2026-06-22-devvm-mem-io-overload-containment.md
@@ -129,3 +129,40 @@ heavy user between 12–16G even with RAM free; bump to 16/20 if that bites.
 storm also neuters oomd. earlyoom (free-RAM threshold, swap-independent) is the
 correct pairing. A famous tool that "does OOM" still has to be proven to fire
 under *your* configuration.
+
+## Addendum (2026-07-02): the MemoryHigh throttle band livelocks — removed
+
+The soft-cap layer of this design was falsified in production on 2026-07-02
+(~15:42–16:35 UTC): an agent-spawned `ugrep` (12.35G RSS; `-o` with wide
+alternation captures over a multi-GB `.jsonl` transcript) **plateaued inside
+t3-serve@wizard's `MemoryHigh=12G..MemoryMax=16G` band**. With
+`MemorySwapMax=0` its anonymous pages were unreclaimable, so the kernel parked
+every allocating task of the cgroup in `mem_cgroup_handle_over_high`
+(`memory.pressure full avg60 ≈ 80%`, `memory.events high=882948`, `oom_kill=0`)
+— including the `t3 serve` event loop (~0.5G RSS, pure collateral). The accept
+queue backed up (21 pending connections), t3-probe logged `t3serve: [Errno 104]
+Connection reset by peer`, t3-dispatch logged `proxy error: context canceled`,
+and t3.viktorbarzin.me was dead for its user until the hog was SIGKILLed by
+hand (the D-state high-throttle sleep IS killable; the cgroup dropped 14G→1.4G
+and the service recovered in seconds with no restart).
+
+The Verification bullet above — a soft-capped balloon "throttled to a crawl,
+making no progress and **harming nothing**" — holds only when the hog is alone
+in its cgroup. Sharing the cgroup with a latency-sensitive server, the crawl
+IS the harm: a hog that stabilises below `MemoryMax` never triggers the local
+OOM the design counted on, so the band converts "runaway dies" into "everyone
+in the cgroup stalls forever".
+
+**Fix (same day, admin-approved): `MemoryHigh=infinity` on all three work
+cgroup definitions** — `scripts/t3-serve@.service`, the `user-.slice.d`
+drop-in, and `docker.slice` (`setup-devvm.sh` §10a/§10c). A runaway now runs
+unthrottled into `MemoryMax` and is cgroup-OOM-killed immediately
+(`OOMPolicy=continue` keeps t3-serve itself alive; in slices the kernel kills
+the biggest task). `MemoryMax`, `MemorySwapMax=0`, and earlyoom — the layers
+the stress tests actually validated — are unchanged. Applied live via
+`daemon-reload` + runtime `set-property` on the running cgroups; no session
+restarts.
+
+Lesson: **with `swap=0`, `memory.high` is not a gentler `memory.max` — it is
+an unbounded stall injector for everything sharing the cgroup.** Cap-and-kill
+beats throttle-and-pray for multi-tenant interactive services.
diff --git a/docs/runbooks/t3-drop-attribution.md b/docs/runbooks/t3-drop-attribution.md
index df4cef09..e05f163b 100644
--- a/docs/runbooks/t3-drop-attribution.md
+++ b/docs/runbooks/t3-drop-attribution.md
@@ -109,10 +109,17 @@ rate(node_pressure_memory_stalled_seconds_total{instance="devvm"}[5m])
 node_memory_SwapFree_bytes{instance="devvm"}
 ```
 
-Guardrails in place (2026-06-10, `scripts/t3-serve@.service`): per-unit
-`MemoryHigh=12G`, `MemoryMax=16G`, `MemorySwapMax=0`, `OOMPolicy=continue` —
-a runaway agent now OOMs alone inside the cgroup instead of taking the box
-(and the WS server) with it.
+Guardrails in place (2026-06-10, hardened 2026-07-02; `scripts/t3-serve@.service`):
+per-unit `MemoryMax=16G`, `MemorySwapMax=0`, `OOMPolicy=continue`, and
+`MemoryHigh=infinity` — deliberately NO soft throttle band. With swap=0, a hog
+plateauing between high and max never OOMs and the kernel high-throttle stalls
+the whole unit: a 12.3G agent `ugrep` livelocked t3-serve@wizard for ~50min on
+2026-07-02 (signature: probe `t3serve` leg `Connection reset by peer`, dispatch
+`proxy error: context canceled`, server D-state in `mem_cgroup_handle_over_high`,
+`ss` backlog on the serve port; fix: SIGKILL the hog — the D-state is killable).
+A runaway agent now OOMs alone at 16G inside the cgroup instead of throttling
+the WS server with it. Post-mortem addendum:
+`docs/post-mortems/2026-06-22-devvm-mem-io-overload-containment.md`.
 
 ## 4. Known root causes (2026-06-10 investigation)
 
diff --git a/scripts/t3-serve@.service b/scripts/t3-serve@.service
index 7f3d765d..0ab84e74 100644
--- a/scripts/t3-serve@.service
+++ b/scripts/t3-serve@.service
@@ -21,12 +21,19 @@ WorkingDirectory=/home/%i
 ExecStart=/usr/bin/t3 serve --host 0.0.0.0 --port ${T3_PORT} --base-dir /home/%i/.t3
 Restart=on-failure
 RestartSec=5
-# Memory containment (2026-06-10): agent children live in this cgroup; a
-# runaway agent (10.8G anon on a 23G host) swap-thrashed the whole devvm —
-# every >20s stall fires the t3 client watchdog (visible "disconnects") —
-# then global-OOMed. Cap the cgroup so a runaway OOMs early and locally,
-# and forbid swap so stalls can't smear into minutes-long freezes.
-MemoryHigh=12G
+# Memory containment (2026-06-10, amended 2026-07-02): agent children live in
+# this cgroup; a runaway agent (10.8G anon on a 23G host) swap-thrashed the
+# whole devvm — every >20s stall fires the t3 client watchdog (visible
+# "disconnects") — then global-OOMed. Cap the cgroup so a runaway OOMs early
+# and locally, and forbid swap so stalls can't smear into minutes-long freezes.
+# MemoryHigh is DELIBERATELY infinity — do not add a soft band below MemoryMax:
+# with swap=0 a hog that plateaus between high and max is unreclaimable but
+# never OOMs, and the kernel's high-throttle stalls EVERY task in the cgroup
+# (the t3 event loop included) indefinitely. A 12.3G agent ugrep livelocked
+# this unit for ~50min on 2026-07-02 exactly this way. Straight-to-OOM at
+# MemoryMax is the containment; OOMPolicy=continue below keeps the server up.
+# See docs/post-mortems/2026-06-22-devvm-mem-io-overload-containment.md addendum.
+MemoryHigh=infinity
 MemoryMax=16G
 MemorySwapMax=0
 # Default OOMPolicy=stop kills the WHOLE unit (8.5min outage 2026-06-10
diff --git a/scripts/workstation/setup-devvm.sh b/scripts/workstation/setup-devvm.sh
index 02bd9257..3e05b8a0 100755
--- a/scripts/workstation/setup-devvm.sh
+++ b/scripts/workstation/setup-devvm.sh
@@ -244,9 +244,15 @@ log "service units installed + enabled (t3-dispatch + 3 timers; t3-serve@ per-us
 # virtual disk into an IO storm + multi-minute freeze (hard-killed 2026-06-22).
 # t3-serve@ was already capped (its [Service] block); the HOLE was the uncapped
 # user-.slice (all ssh/tmux work). Design — per user, on BOTH trees:
-# MemoryHigh=12G soft (throttles a runaway to a crawl), MemoryMax=16G hard,
-# MemorySwapMax=0 (work never touches disk swap → no thrash; it OOMs locally at
-# the ceiling instead), plus fair-share CPU/IO weights.
+# MemoryMax=16G hard + MemorySwapMax=0 (work never touches disk swap → no
+# thrash; a runaway is cgroup-OOM-killed locally at the ceiling), plus
+# fair-share CPU/IO weights.
+# NO MemoryHigh soft band (removed 2026-07-02; was 12G "throttle to a crawl"):
+# with swap=0, a hog that PLATEAUS between high and max is unreclaimable but
+# never OOMs — the kernel parks every task of the cgroup in
+# mem_cgroup_handle_over_high and the whole tree stalls indefinitely. A 12.3G
+# agent ugrep livelocked t3-serve@wizard (t3 down ~50min) exactly this way.
+# Cap-and-kill, never throttle-and-pray — see the post-mortem addendum.
 # BACKSTOP = earlyoom, NOT systemd-oomd. We first shipped systemd-oomd but it is
 # INERT with swap=0: its pressure-kill only acts on cgroups doing active reclaim
 # (pgscan rising), and a no-swap anon workload never reclaims — verified live, a
@@ -260,12 +266,16 @@ log "service units installed + enabled (t3-dispatch + 3 timers; t3-serve@ per-us
 # 10a) per-user caps + fair-share weights on EVERY user-.slice (ssh/tmux)
 install -d -m 0755 /etc/systemd/system/user-.slice.d
 cat > /etc/systemd/system/user-.slice.d/50-devvm-resource.conf <<'SLICE_EOF'
-# Per-user containment for the shared devvm (setup-devvm.sh §10, 2026-06-22).
-# Applies to EACH user-.slice = all of one user's ssh/tmux work. Mirrors the
-# t3-serve@.service caps so a user is bounded in whichever surface they work in.
+# Per-user containment for the shared devvm (setup-devvm.sh §10, 2026-06-22;
+# MemoryHigh dropped 2026-07-02). Applies to EACH user-.slice = all of one
+# user's ssh/tmux work. Mirrors the t3-serve@.service caps so a user is bounded
+# in whichever surface they work in. MemoryHigh stays infinity: with swap=0 a
+# hog plateauing in a high..max band livelocks the entire slice (every ssh/tmux
+# session of that user) instead of dying — straight-to-OOM at MemoryMax is the
+# containment (see post-mortem addendum 2026-07-02).
 [Slice]
 MemoryAccounting=yes
-MemoryHigh=12G
+MemoryHigh=infinity
 MemoryMax=16G
 MemorySwapMax=0
 CPUAccounting=yes
@@ -294,12 +304,14 @@ cat > /etc/systemd/system/docker.slice <<'DOCKER_SLICE_EOF'
 # All docker containers live here (cgroup-parent in /etc/docker/daemon.json) so
 # they share one bounded budget and a runaway container is capped at MemoryMax
 # (cgroup-OOM'd locally) instead of escaping into the uncapped system.slice.
-# setup-devvm.sh §10, 2026-06-22.
+# setup-devvm.sh §10, 2026-06-22; MemoryHigh dropped 2026-07-02 — a container
+# plateauing in the high..max band would throttle-livelock EVERY container in
+# the slice (see post-mortem addendum); MemoryMax OOM is the containment.
 [Unit]
 Description=Docker containers slice (capped)
 [Slice]
 MemoryAccounting=yes
-MemoryHigh=6G
+MemoryHigh=infinity
 MemoryMax=8G
 MemorySwapMax=0
 CPUAccounting=yes
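
Reviewer note: the throttle signature the addendum records (a finite `memory.high`, a non-zero `high` count in `memory.events`, `oom_kill` still at 0, and a large `full avg60` in `memory.pressure`) can be checked mechanically against any cgroup v2 directory. The sketch below is hypothetical and not part of this change: the function name `high_throttle_suspect` and the default 50% threshold are assumptions made for illustration, and only the three cgroup v2 interface files it reads are real. Both counters in `memory.events` are cumulative since the cgroup was created, so treat a hit as a hint to investigate, not as proof of a livelock.

```shell
# Hypothetical helper (assumption, not in this change): flag a cgroup v2
# directory that shows the high-throttle signature from the addendum.
# Usage: high_throttle_suspect <cgroup-dir> [full-avg60-threshold-percent]
# Returns 0 when nothing looks wrong, 1 when the signature matches,
# 2 when the directory cannot be read.
high_throttle_suspect() {
    local cg="$1" threshold="${2:-50}"
    local high_limit high_events oom_kills full_avg60

    # memory.high contains "max" when no soft limit is set
    # (the state MemoryHigh=infinity produces).
    high_limit=$(cat "$cg/memory.high") || return 2
    if [ "$high_limit" = "max" ]; then
        echo "ok: no memory.high band on $cg"
        return 0
    fi

    # memory.events is one "key value" pair per line.
    high_events=$(awk '$1 == "high" { print $2 }' "$cg/memory.events")
    oom_kills=$(awk '$1 == "oom_kill" { print $2 }' "$cg/memory.events")

    # memory.pressure has a line "full avg10=.. avg60=.. avg300=.. total=..".
    full_avg60=$(awk '$1 == "full" { sub(/^avg60=/, "", $3); print $3 }' \
        "$cg/memory.pressure")

    # Signature: heavy full-stall pressure, throttle events, and no OOM kill.
    if awk -v p="${full_avg60:-0}" -v t="$threshold" 'BEGIN { exit !(p >= t) }' \
        && [ "${high_events:-0}" -gt 0 ] \
        && [ "${oom_kills:-0}" -eq 0 ]; then
        echo "SUSPECT: $cg high=$high_events oom_kill=$oom_kills full_avg60=$full_avg60"
        return 1
    fi

    echo "ok: $cg high=${high_events:-0} oom_kill=${oom_kills:-0} full_avg60=${full_avg60:-0}"
    return 0
}
```

To try it, pass the unit's cgroup directory: `/sys/fs/cgroup` followed by the path that `systemctl show -p ControlGroup --value <unit>` reports. With `MemoryHigh=infinity` applied as in this change, the function should take the early `ok: no memory.high band` exit.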