devvm containment: drop the MemoryHigh throttle band, straight to MemoryMax OOM

t3.viktorbarzin.me went down 2026-07-02 15:42-16:35 UTC: an agent-spawned
12.3G ugrep plateaued inside t3-serve@wizard's MemoryHigh(12G)..MemoryMax(16G)
band. With MemorySwapMax=0 its anon pages were unreclaimable, so the kernel
throttled every task in the cgroup indefinitely (memory.pressure full ~80%,
oom_kill never fired) - the t3 event loop starved, the accept queue rotted,
and the terminal was dead until the hog was SIGKILLed by hand.

The 2026-06-22 design assumed 'throttle to a crawl, then OOM locally'; a hog
that stabilises between high and max never OOMs, so the throttle band is a
livelock zone, not a safety layer. Viktor asked to close that gap: MemoryHigh
is now explicitly infinity on all three work cgroup definitions (t3-serve@
unit, user-<uid>.slice drop-in, docker.slice) so a runaway is cgroup-OOM-
killed at MemoryMax immediately - OOMPolicy=continue already keeps the t3
server alive when a child dies. MemoryMax/MemorySwapMax=0/earlyoom unchanged.
Applied live to the devvm the same day (daemon-reload + runtime set-property
on running cgroups, no session restarts). Post-mortem addendum + runbook
updated in the same commit.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Viktor Barzin 2026-07-02 16:59:38 +00:00
parent 684ca4527c
commit 4c532dbf97
4 changed files with 82 additions and 19 deletions


@@ -129,3 +129,40 @@ heavy user between 12–16G even with RAM free; bump to 16/20 if that bites.
storm also neuters oomd. earlyoom (free-RAM threshold, swap-independent) is the
correct pairing. A famous tool that "does OOM" still has to be proven to fire
under *your* configuration.
## Addendum (2026-07-02): the MemoryHigh throttle band livelocks — removed
The soft-cap layer of this design was falsified in production on 2026-07-02
(~15:42–16:35 UTC): an agent-spawned `ugrep` (12.35G RSS; `-o` with wide
alternation captures over a multi-GB `.jsonl` transcript) **plateaued inside
t3-serve@wizard's `MemoryHigh=12G..MemoryMax=16G` band**. With
`MemorySwapMax=0` its anonymous pages were unreclaimable, so the kernel parked
every allocating task of the cgroup in `mem_cgroup_handle_over_high`
(`memory.pressure full avg60 ≈ 80%`, `memory.events high=882948`, `oom_kill=0`)
— including the `t3 serve` event loop (~0.5G RSS, pure collateral). The accept
queue backed up (21 pending connections), t3-probe logged `t3serve: [Errno 104]
Connection reset by peer`, t3-dispatch logged `proxy error: context canceled`,
and t3.viktorbarzin.me was dead for its user until the hog was SIGKILLed by
hand (the D-state high-throttle sleep IS killable; the cgroup dropped 14G→1.4G
and the service recovered in seconds with no restart).
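
The signature above (a large `high` counter, `oom_kill=0`, and a high `full avg60`) can be checked mechanically. The following is a minimal sketch, not the tooling used during the incident: the function names and the 50% threshold are hypothetical, and the sample file contents reuse only the three figures quoted above, with every other number a placeholder.

```python
# Sample cgroup v2 file contents. high=882948, oom_kill=0 and full avg60=80
# come from the incident; all other numbers are illustrative placeholders.
SAMPLE_EVENTS = """low 0
high 882948
max 0
oom 0
oom_kill 0
"""
SAMPLE_PRESSURE = """some avg10=85.00 avg60=82.00 avg300=60.00 total=123456789
full avg10=81.00 avg60=80.00 avg300=55.00 total=98765432
"""


def parse_memory_events(text):
    """Parse memory.events: one "key value" pair per line."""
    events = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            events[parts[0]] = int(parts[1])
    return events


def parse_pressure(text):
    """Parse a PSI file (e.g. memory.pressure) into {"some": {...}, "full": {...}}."""
    result = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        pairs = (field.split("=", 1) for field in fields[1:])
        result[fields[0]] = {key: float(value) for key, value in pairs}
    return result


def looks_like_high_throttle_stall(events_text, pressure_text, full_avg60_threshold=50.0):
    """True when the cgroup is being throttled at memory.high without ever OOM-killing.

    The threshold is an assumption for illustration, not a tuned value.
    """
    events = parse_memory_events(events_text)
    full_avg60 = parse_pressure(pressure_text).get("full", {}).get("avg60", 0.0)
    return (
        events.get("high", 0) > 0
        and events.get("oom_kill", 0) == 0
        and full_avg60 >= full_avg60_threshold
    )


if __name__ == "__main__":
    print(looks_like_high_throttle_stall(SAMPLE_EVENTS, SAMPLE_PRESSURE))
```

On a live system the two strings would be read from the unit's `memory.events` and `memory.pressure` files under the cgroup v2 hierarchy; the exact path depends on the unit and is not shown here.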
The Verification bullet above — a soft-capped balloon "throttled to a crawl,
making no progress and **harming nothing**" — holds only when the hog is alone
in its cgroup. Sharing the cgroup with a latency-sensitive server, the crawl
IS the harm: a hog that stabilises below `MemoryMax` never triggers the local
OOM the design counted on, so the band converts "runaway dies" into "everyone
in the cgroup stalls forever".
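
The livelock condition described above reduces to one comparison: usage at or above the soft limit but still below the hard limit. A minimal sketch follows, assuming the limits are given in the cgroup v2 file format (a byte count, or the literal `max` for unlimited); the helper names are hypothetical.

```python
GIB = 2**30


def parse_limit(raw):
    """cgroup v2 limit files hold a byte count or the literal 'max' (unlimited)."""
    raw = raw.strip()
    return None if raw == "max" else int(raw)


def in_throttle_band(current_bytes, high_raw, max_raw):
    """True when usage sits at or above memory.high but below memory.max.

    In that range the kernel throttles allocating tasks and the cgroup OOM
    killer is not invoked, which is the zone the addendum calls a livelock.
    """
    high = parse_limit(high_raw)
    hard = parse_limit(max_raw)
    if high is None:
        return False  # no soft limit configured, so there is no band
    if current_bytes < high:
        return False  # below the soft limit: no throttling
    return hard is None or current_bytes < hard


if __name__ == "__main__":
    # 14G usage with the old 12G/16G limits: inside the band.
    print(in_throttle_band(14 * GIB, str(12 * GIB), str(16 * GIB)))
    # Same usage once the soft limit is unlimited: no band to sit in.
    print(in_throttle_band(14 * GIB, "max", str(16 * GIB)))
```

With `MemoryHigh=infinity`, systemd writes `max` to `memory.high`, so the second call models the post-fix configuration.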
**Fix (same day, admin-approved): `MemoryHigh=infinity` on all three work
cgroup definitions** — `scripts/t3-serve@.service`, the `user-.slice.d`
drop-in, and `docker.slice` (`setup-devvm.sh` §10a/§10c). A runaway now runs
unthrottled into `MemoryMax` and is cgroup-OOM-killed immediately
(`OOMPolicy=continue` keeps t3-serve itself alive; in slices the kernel kills
the biggest task). `MemoryMax`, `MemorySwapMax=0`, and earlyoom — the layers
the stress tests actually validated — are unchanged. Applied live via
`daemon-reload` + runtime `set-property` on the running cgroups; no session
restarts.
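
To keep the throttle band from creeping back in, a unit file or drop-in can be linted for the settings named above. This is a rough sketch under stated assumptions: the two `[Service]` fragments are hypothetical reconstructions from the values quoted in this document (the real unit files are not reproduced here), the function names are invented, and the check looks at a single file without resolving drop-in overrides.

```python
WATCHED = ("MemoryHigh", "MemoryMax", "MemorySwapMax", "OOMPolicy")

# Hypothetical fragments built only from values quoted in this document.
HARDENED = """[Service]
MemoryHigh=infinity
MemoryMax=16G
MemorySwapMax=0
OOMPolicy=continue
"""
LEGACY = """[Service]
MemoryHigh=12G
MemoryMax=16G
MemorySwapMax=0
OOMPolicy=continue
"""


def memory_settings(unit_text):
    """Collect the watched Key=Value assignments from unit-file text."""
    settings = {}
    for raw in unit_text.splitlines():
        line = raw.strip()
        if not line or line[0] in "#;[":
            continue  # skip blanks, comments, and section headers
        key, sep, value = line.partition("=")
        key = key.strip()
        if sep and key in WATCHED:
            settings[key] = value.strip()  # a later assignment overrides an earlier one
    return settings


def throttle_band_problems(unit_text):
    """Return a list of deviations from the cap-and-kill policy (empty means OK)."""
    settings = memory_settings(unit_text)
    problems = []
    if settings.get("MemoryHigh") != "infinity":
        problems.append("MemoryHigh is not explicitly infinity")
    if settings.get("MemoryMax") in (None, "infinity"):
        problems.append("MemoryMax is missing or unlimited")
    if settings.get("MemorySwapMax") != "0":
        problems.append("MemorySwapMax is not 0")
    return problems


if __name__ == "__main__":
    print(throttle_band_problems(HARDENED))
    print(throttle_band_problems(LEGACY))
```

The check insists on an explicit `MemoryHigh=infinity`, which is stricter than systemd requires, so that the intent stays visible in the file.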
Lesson: **with `swap=0`, `memory.high` is not a gentler `memory.max` — it is
an unbounded stall injector for everything sharing the cgroup.** Cap-and-kill
beats throttle-and-pray for multi-tenant interactive services.


@@ -109,10 +109,17 @@ rate(node_pressure_memory_stalled_seconds_total{instance="devvm"}[5m])
node_memory_SwapFree_bytes{instance="devvm"}
```
Guardrails in place (2026-06-10, `scripts/t3-serve@.service`): per-unit
`MemoryHigh=12G`, `MemoryMax=16G`, `MemorySwapMax=0`, `OOMPolicy=continue`;
a runaway agent now OOMs alone inside the cgroup instead of taking the box
(and the WS server) with it.
Guardrails in place (2026-06-10, hardened 2026-07-02; `scripts/t3-serve@.service`):
per-unit `MemoryMax=16G`, `MemorySwapMax=0`, `OOMPolicy=continue`, and
`MemoryHigh=infinity` — deliberately NO soft throttle band. With swap=0, a hog
plateauing between high and max never OOMs and the kernel high-throttle stalls
the whole unit: a 12.3G agent `ugrep` livelocked t3-serve@wizard for ~50min on
2026-07-02 (signature: probe `t3serve` leg `Connection reset by peer`, dispatch
`proxy error: context canceled`, server D-state in `mem_cgroup_handle_over_high`,
`ss` backlog on the serve port; fix: SIGKILL the hog — the D-state is killable).
A runaway agent now OOMs alone at 16G inside the cgroup instead of throttling
the WS server with it. Post-mortem addendum:
`docs/post-mortems/2026-06-22-devvm-mem-io-overload-containment.md`.
## 4. Known root causes (2026-06-10 investigation)