From b5c663927212838f5c689431a2be65e35c214999 Mon Sep 17 00:00:00 2001
From: Viktor Barzin
Date: Wed, 10 Jun 2026 21:00:06 +0000
Subject: [PATCH] t3-serve@: contain agent memory storms; survive child OOM kills

Same t3-disconnect root-cause work: a runaway claude agent child grew to
10.8G anon RSS inside t3-serve@wizard's cgroup, swap-thrashed devvm off
its spinning disk (system-wide multi-10s freezes = every t3 client's 20s
watchdog firing = the 'frequent disconnects that self-recover'), then
the global OOM at 2026-06-10 19:56 took the whole unit down for 8.5min
because the default OOMPolicy=stop fails the unit when ANY cgroup child
is OOM-killed.

Cap the cgroup (MemoryHigh=12G, MemoryMax=16G), forbid swap so stalls
can't smear into minute-long freezes, and OOMPolicy=continue so a
runaway agent dies alone while the WS server keeps serving.
---
 scripts/t3-serve@.service | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/scripts/t3-serve@.service b/scripts/t3-serve@.service
index 9cdb4ad9..4109b36b 100644
--- a/scripts/t3-serve@.service
+++ b/scripts/t3-serve@.service
@@ -15,6 +15,17 @@ WorkingDirectory=/home/%i
 ExecStart=/usr/bin/t3 serve --host 0.0.0.0 --port ${T3_PORT} --base-dir /home/%i/.t3
 Restart=on-failure
 RestartSec=5
+# Memory containment (2026-06-10): agent children live in this cgroup; a
+# runaway agent (10.8G anon on a 23G host) swap-thrashed the whole devvm —
+# every >20s stall fires the t3 client watchdog (visible "disconnects") —
+# then global-OOMed. Cap the cgroup so a runaway OOMs early and locally,
+# and forbid swap so stalls can't smear into minutes-long freezes.
+MemoryHigh=12G
+MemoryMax=16G
+MemorySwapMax=0
+# Default OOMPolicy=stop kills the WHOLE unit (8.5min outage 2026-06-10
+# 19:56) when ANY child is OOM-killed; continue = runaway dies, server stays.
+OOMPolicy=continue
 
 [Install]
 WantedBy=multi-user.target