devvm containment: drop the MemoryHigh throttle band, straight to MemoryMax OOM

t3.viktorbarzin.me went down 2026-07-02 15:42-16:35 UTC: an agent-spawned 12.3G ugrep plateaued inside t3-serve@wizard's MemoryHigh(12G)..MemoryMax(16G) band. With MemorySwapMax=0 its anon pages were unreclaimable, so the kernel throttled every task in the cgroup indefinitely (memory.pressure full ~80%, oom_kill never fired) - the t3 event loop starved, the accept queue rotted, and the terminal was dead until the hog was SIGKILLed by hand.

The 2026-06-22 design assumed 'throttle to a crawl, then OOM locally'; a hog that stabilises between high and max never OOMs, so the throttle band is a livelock zone, not a safety layer. Viktor asked to close that gap: MemoryHigh is now explicitly infinity on all three work cgroup definitions (t3-serve@ unit, user-<uid>.slice drop-in, docker.slice) so a runaway is cgroup-OOM-killed at MemoryMax immediately - OOMPolicy=continue already keeps the t3 server alive when a child dies. MemoryMax/MemorySwapMax=0/earlyoom unchanged.

Applied live to the devvm the same day (daemon-reload + runtime set-property on running cgroups, no session restarts). Post-mortem addendum + runbook updated in the same commit.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
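
The live apply mentioned in the message (daemon-reload plus a runtime set-property on the running cgroups) could look roughly like the sketch below. It is an illustration, not the commands that were actually run: the function name `drop_memory_high` and the `APPLY` switch are hypothetical, and by default the function only prints what it would execute.

```shell
#!/usr/bin/env bash
# Sketch: clear MemoryHigh on already-running cgroups without restarting them.
# Hypothetical helper; prints the commands unless APPLY=1 is set.
drop_memory_high() {
  local runner=echo
  if [ "${APPLY:-0}" = "1" ]; then runner=; fi
  # Re-read unit files and drop-ins from disk first.
  $runner systemctl daemon-reload
  local unit
  for unit in "$@"; do
    # --runtime changes the live cgroup only; the persistent value
    # still comes from the unit file / drop-in on disk.
    $runner systemctl set-property --runtime "$unit" MemoryHigh=infinity
  done
}
```

A dry run would be `drop_memory_high t3-serve@wizard.service user-1000.slice docker.slice`, where uid 1000 is only an example; the resulting value can be read back with `systemctl show -p MemoryHigh <unit>`.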
parent 684ca4527c
commit 4c532dbf97
4 changed files with 82 additions and 19 deletions
@@ -129,3 +129,40 @@ heavy user between 12–16G even with RAM free; bump to 16/20 if that bites.
 storm also neuters oomd. earlyoom (free-RAM threshold, swap-independent) is the
 correct pairing. A famous tool that "does OOM" still has to be proven to fire
 under *your* configuration.
+
+## Addendum (2026-07-02): the MemoryHigh throttle band livelocks — removed
+
+The soft-cap layer of this design was falsified in production on 2026-07-02
+(~15:42–16:35 UTC): an agent-spawned `ugrep` (12.35G RSS; `-o` with wide
+alternation captures over a multi-GB `.jsonl` transcript) **plateaued inside
+t3-serve@wizard's `MemoryHigh=12G..MemoryMax=16G` band**. With
+`MemorySwapMax=0` its anonymous pages were unreclaimable, so the kernel parked
+every allocating task of the cgroup in `mem_cgroup_handle_over_high`
+(`memory.pressure full avg60 ≈ 80%`, `memory.events high=882948`, `oom_kill=0`)
+— including the `t3 serve` event loop (~0.5G RSS, pure collateral). The accept
+queue backed up (21 pending connections), t3-probe logged `t3serve: [Errno 104]
+Connection reset by peer`, t3-dispatch logged `proxy error: context canceled`,
+and t3.viktorbarzin.me was dead for its user until the hog was SIGKILLed by
+hand (the D-state high-throttle sleep IS killable; the cgroup dropped 14G→1.4G
+and the service recovered in seconds with no restart).
+
+The Verification bullet above — a soft-capped balloon "throttled to a crawl,
+making no progress and **harming nothing**" — holds only when the hog is alone
+in its cgroup. Sharing the cgroup with a latency-sensitive server, the crawl
+IS the harm: a hog that stabilises below `MemoryMax` never triggers the local
+OOM the design counted on, so the band converts "runaway dies" into "everyone
+in the cgroup stalls forever".
+
+**Fix (same day, admin-approved): `MemoryHigh=infinity` on all three work
+cgroup definitions** — `scripts/t3-serve@.service`, the `user-.slice.d`
+drop-in, and `docker.slice` (`setup-devvm.sh` §10a/§10c). A runaway now runs
+unthrottled into `MemoryMax` and is cgroup-OOM-killed immediately
+(`OOMPolicy=continue` keeps t3-serve itself alive; in slices the kernel kills
+the biggest task). `MemoryMax`, `MemorySwapMax=0`, and earlyoom — the layers
+the stress tests actually validated — are unchanged. Applied live via
+`daemon-reload` + runtime `set-property` on the running cgroups; no session
+restarts.
+
+Lesson: **with `swap=0`, `memory.high` is not a gentler `memory.max` — it is
+an unbounded stall injector for everything sharing the cgroup.** Cap-and-kill
+beats throttle-and-pray for multi-tenant interactive services.
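
The livelock signature recorded in the addendum (a climbing `high` counter in `memory.events`, `oom_kill` stuck at 0, and `memory.pressure` `full avg60` near 80%) can be checked for a single cgroup v2 directory with a sketch like the one below. The function name `throttle_signature` and the 50% stall threshold are illustrative assumptions, not part of the commit; the cgroup path has to be supplied by the caller.

```shell
#!/usr/bin/env bash
# Sketch: report the throttle-livelock signature for one cgroup v2 directory.
# Prints the three values and returns 0 when the pattern matches.
throttle_signature() {
  local cg="$1" high oom_kill full60
  # memory.events holds "name value" pairs, one per line.
  high=$(awk '$1 == "high" {print $2}' "$cg/memory.events")
  oom_kill=$(awk '$1 == "oom_kill" {print $2}' "$cg/memory.events")
  # memory.pressure: "full avg10=... avg60=... avg300=... total=..."
  full60=$(awk '$1 == "full" {
    for (i = 2; i <= NF; i++)
      if ($i ~ /^avg60=/) { sub(/^avg60=/, "", $i); print $i }
  }' "$cg/memory.pressure")
  echo "high=$high oom_kill=$oom_kill full_avg60=$full60"
  # Suspect: throttling happened, nothing was OOM-killed, and tasks are
  # mostly stalled. The 50 (percent) cut-off is an arbitrary example value.
  awk -v h="$high" -v k="$oom_kill" -v f="$full60" \
    'BEGIN { exit !(h > 0 && k == 0 && f >= 50) }'
}
```

With `MemoryHigh=infinity` in place the `high` counter should stop increasing, so a match after this commit would point at a cgroup that still carries a soft band.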
@@ -109,10 +109,17 @@ rate(node_pressure_memory_stalled_seconds_total{instance="devvm"}[5m])
 node_memory_SwapFree_bytes{instance="devvm"}
 ```
 
-Guardrails in place (2026-06-10, `scripts/t3-serve@.service`): per-unit
-`MemoryHigh=12G`, `MemoryMax=16G`, `MemorySwapMax=0`, `OOMPolicy=continue` —
-a runaway agent now OOMs alone inside the cgroup instead of taking the box
-(and the WS server) with it.
+Guardrails in place (2026-06-10, hardened 2026-07-02; `scripts/t3-serve@.service`):
+per-unit `MemoryMax=16G`, `MemorySwapMax=0`, `OOMPolicy=continue`, and
+`MemoryHigh=infinity` — deliberately NO soft throttle band. With swap=0, a hog
+plateauing between high and max never OOMs and the kernel high-throttle stalls
+the whole unit: a 12.3G agent `ugrep` livelocked t3-serve@wizard for ~50min on
+2026-07-02 (signature: probe `t3serve` leg `Connection reset by peer`, dispatch
+`proxy error: context canceled`, server D-state in `mem_cgroup_handle_over_high`,
+`ss` backlog on the serve port; fix: SIGKILL the hog — the D-state is killable).
+A runaway agent now OOMs alone at 16G inside the cgroup instead of throttling
+the WS server with it. Post-mortem addendum:
+`docs/post-mortems/2026-06-22-devvm-mem-io-overload-containment.md`.
 
 ## 4. Known root causes (2026-06-10 investigation)
 
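
The runbook's manual fix ("SIGKILL the hog") first needs the hog identified. One way to do that, sketched under assumptions: walk the cgroup's `cgroup.procs` and compare `VmRSS` from each `/proc/<pid>/status`. The helper name `biggest_task` is hypothetical, and it only sees processes attached directly to the given cgroup, not to child cgroups.

```shell
#!/usr/bin/env bash
# Sketch: print "<pid> <rss_kB>" for the largest-RSS process in one cgroup.
# $1 = cgroup v2 directory, $2 = proc root (defaults to /proc).
biggest_task() {
  local cg="$1" proc="${2:-/proc}" pid rss best_pid="" best_rss=0
  while read -r pid; do
    # VmRSS is missing for kernel threads and for processes that just exited.
    rss=$(awk '$1 == "VmRSS:" {print $2}' "$proc/$pid/status" 2>/dev/null) || rss=0
    if [ "${rss:-0}" -gt "$best_rss" ]; then
      best_pid=$pid
      best_rss=$rss
    fi
  done < "$cg/cgroup.procs"
  if [ -n "$best_pid" ]; then echo "$best_pid $best_rss"; fi
}
```

After confirming the reported PID really is the runaway and not the server itself, `kill -KILL <pid>` is the step the runbook describes; the throttled D-state sleep is killable.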
@@ -21,12 +21,19 @@ WorkingDirectory=/home/%i
 ExecStart=/usr/bin/t3 serve --host 0.0.0.0 --port ${T3_PORT} --base-dir /home/%i/.t3
 Restart=on-failure
 RestartSec=5
-# Memory containment (2026-06-10): agent children live in this cgroup; a
-# runaway agent (10.8G anon on a 23G host) swap-thrashed the whole devvm —
-# every >20s stall fires the t3 client watchdog (visible "disconnects") —
-# then global-OOMed. Cap the cgroup so a runaway OOMs early and locally,
-# and forbid swap so stalls can't smear into minutes-long freezes.
-MemoryHigh=12G
+# Memory containment (2026-06-10, amended 2026-07-02): agent children live in
+# this cgroup; a runaway agent (10.8G anon on a 23G host) swap-thrashed the
+# whole devvm — every >20s stall fires the t3 client watchdog (visible
+# "disconnects") — then global-OOMed. Cap the cgroup so a runaway OOMs early
+# and locally, and forbid swap so stalls can't smear into minutes-long freezes.
+# MemoryHigh is DELIBERATELY infinity — do not add a soft band below MemoryMax:
+# with swap=0 a hog that plateaus between high and max is unreclaimable but
+# never OOMs, and the kernel's high-throttle stalls EVERY task in the cgroup
+# (the t3 event loop included) indefinitely. A 12.3G agent ugrep livelocked
+# this unit for ~50min on 2026-07-02 exactly this way. Straight-to-OOM at
+# MemoryMax is the containment; OOMPolicy=continue below keeps the server up.
+# See docs/post-mortems/2026-06-22-devvm-mem-io-overload-containment.md addendum.
+MemoryHigh=infinity
 MemoryMax=16G
 MemorySwapMax=0
 # Default OOMPolicy=stop kills the WHOLE unit (8.5min outage 2026-06-10
@@ -244,9 +244,15 @@ log "service units installed + enabled (t3-dispatch + 3 timers; t3-serve@ per-us
 # virtual disk into an IO storm + multi-minute freeze (hard-killed 2026-06-22).
 # t3-serve@ was already capped (its [Service] block); the HOLE was the uncapped
 # user-<uid>.slice (all ssh/tmux work). Design — per user, on BOTH trees:
-#   MemoryHigh=12G soft (throttles a runaway to a crawl), MemoryMax=16G hard,
-#   MemorySwapMax=0 (work never touches disk swap → no thrash; it OOMs locally at
-#   the ceiling instead), plus fair-share CPU/IO weights.
+#   MemoryMax=16G hard + MemorySwapMax=0 (work never touches disk swap → no
+#   thrash; a runaway is cgroup-OOM-killed locally at the ceiling), plus
+#   fair-share CPU/IO weights.
+# NO MemoryHigh soft band (removed 2026-07-02; was 12G "throttle to a crawl"):
+# with swap=0, a hog that PLATEAUS between high and max is unreclaimable but
+# never OOMs — the kernel parks every task of the cgroup in
+# mem_cgroup_handle_over_high and the whole tree stalls indefinitely. A 12.3G
+# agent ugrep livelocked t3-serve@wizard (t3 down ~50min) exactly this way.
+# Cap-and-kill, never throttle-and-pray — see the post-mortem addendum.
 # BACKSTOP = earlyoom, NOT systemd-oomd. We first shipped systemd-oomd but it is
 # INERT with swap=0: its pressure-kill only acts on cgroups doing active reclaim
 # (pgscan rising), and a no-swap anon workload never reclaims — verified live, a
@@ -260,12 +266,16 @@ log "service units installed + enabled (t3-dispatch + 3 timers; t3-serve@ per-us
 # 10a) per-user caps + fair-share weights on EVERY user-<uid>.slice (ssh/tmux)
 install -d -m 0755 /etc/systemd/system/user-.slice.d
 cat > /etc/systemd/system/user-.slice.d/50-devvm-resource.conf <<'SLICE_EOF'
-# Per-user containment for the shared devvm (setup-devvm.sh §10, 2026-06-22).
-# Applies to EACH user-<uid>.slice = all of one user's ssh/tmux work. Mirrors the
-# t3-serve@.service caps so a user is bounded in whichever surface they work in.
+# Per-user containment for the shared devvm (setup-devvm.sh §10, 2026-06-22;
+# MemoryHigh dropped 2026-07-02). Applies to EACH user-<uid>.slice = all of one
+# user's ssh/tmux work. Mirrors the t3-serve@.service caps so a user is bounded
+# in whichever surface they work in. MemoryHigh stays infinity: with swap=0 a
+# hog plateauing in a high..max band livelocks the entire slice (every ssh/tmux
+# session of that user) instead of dying — straight-to-OOM at MemoryMax is the
+# containment (see post-mortem addendum 2026-07-02).
 [Slice]
 MemoryAccounting=yes
-MemoryHigh=12G
+MemoryHigh=infinity
 MemoryMax=16G
 MemorySwapMax=0
 CPUAccounting=yes
@@ -294,12 +304,14 @@ cat > /etc/systemd/system/docker.slice <<'DOCKER_SLICE_EOF'
 # All docker containers live here (cgroup-parent in /etc/docker/daemon.json) so
 # they share one bounded budget and a runaway container is capped at MemoryMax
 # (cgroup-OOM'd locally) instead of escaping into the uncapped system.slice.
-# setup-devvm.sh §10, 2026-06-22.
+# setup-devvm.sh §10, 2026-06-22; MemoryHigh dropped 2026-07-02 — a container
+# plateauing in the high..max band would throttle-livelock EVERY container in
+# the slice (see post-mortem addendum); MemoryMax OOM is the containment.
 [Unit]
 Description=Docker containers slice (capped)
 [Slice]
 MemoryAccounting=yes
-MemoryHigh=6G
+MemoryHigh=infinity
 MemoryMax=8G
 MemorySwapMax=0
 CPUAccounting=yes