devvm containment: drop the MemoryHigh throttle band, straight to MemoryMax OOM
t3.viktorbarzin.me went down 2026-07-02 15:42-16:35 UTC: an agent-spawned 12.3G ugrep plateaued inside t3-serve@wizard's MemoryHigh(12G)..MemoryMax(16G) band. With MemorySwapMax=0 its anon pages were unreclaimable, so the kernel throttled every task in the cgroup indefinitely (memory.pressure full ~80%, oom_kill never fired) - the t3 event loop starved, the accept queue rotted, and the terminal was dead until the hog was SIGKILLed by hand.

The 2026-06-22 design assumed 'throttle to a crawl, then OOM locally'; a hog that stabilises between high and max never OOMs, so the throttle band is a livelock zone, not a safety layer.

Viktor asked to close that gap: MemoryHigh is now explicitly infinity on all three work cgroup definitions (t3-serve@ unit, user-<uid>.slice drop-in, docker.slice) so a runaway is cgroup-OOM-killed at MemoryMax immediately - OOMPolicy=continue already keeps the t3 server alive when a child dies. MemoryMax/MemorySwapMax=0/earlyoom unchanged.

Applied live to the devvm the same day (daemon-reload + runtime set-property on running cgroups, no session restarts). Post-mortem addendum + runbook updated in the same commit.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
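The reasoning in the message (a hog that settles between high and max is throttled but never killed) can be illustrated with a deliberately simplified model. This is a sketch, not kernel behaviour: the function `classify`, its state names, and the `None`-means-infinity convention are all hypothetical, and real reclaim and OOM decisions depend on far more than these four numbers.

```python
from typing import Optional

GIB = 1024 ** 3


def classify(current: int, high: Optional[int], maximum: int, swap_max: int) -> str:
    """Toy classification of a cgroup's memory state (hypothetical helper).

    `high=None` stands for MemoryHigh=infinity. All sizes are in bytes.
    """
    if current >= maximum:
        # Simplified: in the kernel, the cgroup OOM killer runs when a charge
        # would exceed memory.max and reclaim cannot free enough memory.
        return "oom-kill"
    if high is not None and current > high:
        # Above the soft limit the kernel throttles tasks and tries to reclaim.
        # With no swap, anonymous pages cannot be reclaimed, so usage does not
        # drop and the throttling can persist for as long as the hog lives.
        return "throttled-stuck" if swap_max == 0 else "throttled-reclaiming"
    return "ok"


hog = int(12.3 * GIB)

# Old settings from the incident: MemoryHigh=12G, MemoryMax=16G, MemorySwapMax=0.
print(classify(hog, 12 * GIB, 16 * GIB, 0))

# New settings: MemoryHigh=infinity, so there is no band to get stuck in.
print(classify(hog, None, 16 * GIB, 0))
print(classify(16 * GIB, None, 16 * GIB, 0))
```

Under this model the 12.3G plateau is the stuck case with the old limits, while with `MemoryHigh=infinity` the same process either runs unthrottled or is killed once it reaches `MemoryMax`.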
parent 684ca4527c
commit 4c532dbf97
4 changed files with 82 additions and 19 deletions
@@ -109,10 +109,17 @@ rate(node_pressure_memory_stalled_seconds_total{instance="devvm"}[5m])
 node_memory_SwapFree_bytes{instance="devvm"}
 ```
 
-Guardrails in place (2026-06-10, `scripts/t3-serve@.service`): per-unit
-`MemoryHigh=12G`, `MemoryMax=16G`, `MemorySwapMax=0`, `OOMPolicy=continue` —
-a runaway agent now OOMs alone inside the cgroup instead of taking the box
-(and the WS server) with it.
+Guardrails in place (2026-06-10, hardened 2026-07-02; `scripts/t3-serve@.service`):
+per-unit `MemoryMax=16G`, `MemorySwapMax=0`, `OOMPolicy=continue`, and
+`MemoryHigh=infinity` — deliberately NO soft throttle band. With swap=0, a hog
+plateauing between high and max never OOMs and the kernel high-throttle stalls
+the whole unit: a 12.3G agent `ugrep` livelocked t3-serve@wizard for ~50min on
+2026-07-02 (signature: probe `t3serve` leg `Connection reset by peer`, dispatch
+`proxy error: context canceled`, server D-state in `mem_cgroup_handle_over_high`,
+`ss` backlog on the serve port; fix: SIGKILL the hog — the D-state is killable).
+A runaway agent now OOMs alone at 16G inside the cgroup instead of throttling
+the WS server with it. Post-mortem addendum:
+`docs/post-mortems/2026-06-22-devvm-mem-io-overload-containment.md`.
 
 ## 4. Known root causes (2026-06-10 investigation)
 
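The commit message describes the cgroup-level signature of the incident as `memory.pressure` full at roughly 80% while `oom_kill` never fired. A minimal sketch of checking for that pattern follows, assuming the standard cgroup v2 text formats of `memory.pressure` and `memory.events`. The helper names and the 50% threshold are hypothetical, the functions take file contents as strings so no cgroup path is assumed, and the sample values are made up for illustration rather than taken from the incident.

```python
def parse_pressure(text: str) -> dict:
    """Parse PSI text into {"some": {"avg10": ...}, "full": {...}}."""
    out = {}
    for line in text.strip().splitlines():
        kind, *fields = line.split()
        out[kind] = {k: float(v) for k, v in (f.split("=") for f in fields)}
    return out


def parse_events(text: str) -> dict:
    """Parse memory.events ("key value" per line) into a dict of counters."""
    return {k: int(v) for k, v in (line.split() for line in text.strip().splitlines())}


def looks_livelocked(pressure: str, events_before: str, events_after: str,
                     full_threshold: float = 50.0) -> bool:
    """True when 'full' pressure is high but no OOM kill happened in between.

    memory.events counters are cumulative, so two samples are compared
    instead of testing for zero. The threshold is an arbitrary assumption.
    """
    full_avg10 = parse_pressure(pressure)["full"]["avg10"]
    before = parse_events(events_before).get("oom_kill", 0)
    after = parse_events(events_after).get("oom_kill", 0)
    return full_avg10 >= full_threshold and after == before


# Illustrative, made-up samples in the cgroup v2 file formats.
pressure = (
    "some avg10=85.00 avg60=80.00 avg300=60.00 total=123456\n"
    "full avg10=80.00 avg60=75.00 avg300=55.00 total=100000\n"
)
events_t0 = "low 0\nhigh 4200\nmax 0\noom 0\noom_kill 0\n"
events_t1 = "low 0\nhigh 9100\nmax 0\noom 0\noom_kill 0\n"

print(looks_livelocked(pressure, events_t0, events_t1))
```

With `MemoryHigh=infinity` in place this pattern should no longer arise from the throttle band, so a check like this is mainly useful for confirming the change on other cgroups that still set a finite `MemoryHigh`.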