t3-serve@: contain agent memory storms; survive child OOM kills

Part of the same t3-disconnect root-cause work. A runaway claude agent child
grew to 10.8G anon RSS inside t3-serve@wizard's cgroup and swap-thrashed
devvm on its spinning disk. The resulting system-wide freezes of tens of
seconds tripped every t3 client's 20s watchdog, which is the 'frequent
disconnects that self-recover' symptom. The global OOM at 2026-06-10 19:56
then took the whole unit down for 8.5min, because the default
OOMPolicy=stop fails the unit when ANY process in its cgroup is OOM-killed.

Cap the cgroup (MemoryHigh=12G, MemoryMax=16G), forbid swap
(MemorySwapMax=0) so stalls can't smear into minute-long freezes, and set
OOMPolicy=continue so a runaway agent dies alone while the WS server keeps
serving.
Viktor Barzin 2026-06-10 21:00:06 +00:00
parent d5fdc7ffe9
commit b5c6639272

@@ -15,6 +15,17 @@ WorkingDirectory=/home/%i
ExecStart=/usr/bin/t3 serve --host 0.0.0.0 --port ${T3_PORT} --base-dir /home/%i/.t3
Restart=on-failure
RestartSec=5
# Memory containment (2026-06-10): agent children live in this cgroup; a
# runaway agent (10.8G anon on a 23G host) swap-thrashed the whole devvm —
# every >20s stall fires the t3 client watchdog (visible "disconnects") —
# then global-OOMed. Cap the cgroup so a runaway OOMs early and locally,
# and forbid swap so stalls can't smear into minutes-long freezes.
MemoryHigh=12G
MemoryMax=16G
MemorySwapMax=0
# Default OOMPolicy=stop kills the WHOLE unit (8.5min outage 2026-06-10
# 19:56) when ANY child is OOM-killed; continue = runaway dies, server stays.
OOMPolicy=continue

[Install]
WantedBy=multi-user.target
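
A minimal sketch, not part of this commit, for checking which containment layer has fired after the change is deployed. It reads the unit's cgroup v2 `memory.events` counters (`high`, `max`, `oom_kill`). The function names `parse_memory_events`, `summarize`, and `read_unit_events` are hypothetical, and the sketch assumes the unified cgroup v2 hierarchy is mounted at `/sys/fs/cgroup`.

```python
import subprocess
from pathlib import Path


def parse_memory_events(text: str) -> dict[str, int]:
    """Parse the 'key value' lines of a cgroup v2 memory.events file."""
    events: dict[str, int] = {}
    for line in text.splitlines():
        fields = line.split()
        # Skip anything that is not a simple "name count" pair.
        if len(fields) == 2 and fields[1].isdigit():
            events[fields[0]] = int(fields[1])
    return events


def summarize(events: dict[str, int]) -> str:
    """Report the most severe memory event recorded so far."""
    if events.get("oom_kill", 0) > 0:
        return "oom_kill: a process in the cgroup was OOM-killed"
    if events.get("max", 0) > 0:
        return "max: usage reached MemoryMax, no OOM kill recorded"
    if events.get("high", 0) > 0:
        return "high: usage crossed MemoryHigh and was throttled"
    return "quiet: no memory limit has been hit"


def read_unit_events(unit: str) -> dict[str, int]:
    """Locate the unit's cgroup via systemctl and read its memory.events."""
    cgroup = subprocess.run(
        ["systemctl", "show", "--property=ControlGroup", "--value", unit],
        check=True, capture_output=True, text=True,
    ).stdout.strip()
    # Assumption: cgroup v2 unified hierarchy mounted at /sys/fs/cgroup.
    path = Path("/sys/fs/cgroup") / cgroup.lstrip("/") / "memory.events"
    return parse_memory_events(path.read_text())
```

On the host, `print(summarize(read_unit_events("t3-serve@wizard.service")))` would show the current state. A nonzero `oom_kill` count while the unit is still active would suggest that OOMPolicy=continue is doing its job, though the counter also includes kills by the global OOM killer.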