Same t3-disconnect root-cause work: a runaway claude agent child grew to 10.8G anon RSS inside t3-serve@wizard's cgroup and drove the devvm into swap thrashing on its spinning disk (system-wide freezes lasting tens of seconds = every t3 client's 20s watchdog firing = the 'frequent disconnects that self-recover'). The global OOM at 2026-06-10 19:56 then took the whole unit down for 8.5min, because the default OOMPolicy=stop fails the unit when ANY process in its cgroup is OOM-killed. Cap the cgroup (MemoryHigh=12G, MemoryMax=16G), forbid swap (MemorySwapMax=0) so stalls can't smear into minute-long freezes, and set OOMPolicy=continue so a runaway agent dies alone while the WS server keeps serving.
[Unit]
Description=T3 Code server for %i (t3 serve, per-user)
Documentation=https://github.com/pingdotgg/t3code
After=network.target

[Service]
Type=simple
User=%i
Group=%i
Environment=HOME=/home/%i
Environment=PATH=/usr/local/bin:/usr/bin:/bin:/home/%i/.local/bin
Environment=NODE_ENV=production
EnvironmentFile=/etc/t3-serve/%i.env
WorkingDirectory=/home/%i
ExecStart=/usr/bin/t3 serve --host 0.0.0.0 --port ${T3_PORT} --base-dir /home/%i/.t3
Restart=on-failure
RestartSec=5
# Memory containment (2026-06-10): agent children live in this cgroup. A
# runaway agent (10.8G anon on a 23G host) swap-thrashed the whole devvm,
# and every >20s stall fired the t3 client watchdog (the visible
# "disconnects") before the host finally hit a global OOM. Cap the cgroup
# so a runaway is contained early and locally: above MemoryHigh the kernel
# throttles and reclaims, and at MemoryMax the OOM killer runs inside this
# cgroup only. Forbid swap so stalls can't smear into minutes-long freezes.
MemoryHigh=12G
MemoryMax=16G
MemorySwapMax=0
# Default OOMPolicy=stop kills the WHOLE unit (8.5min outage 2026-06-10
# 19:56) when ANY child is OOM-killed; continue = runaway dies, server stays.
OOMPolicy=continue

[Install]
WantedBy=multi-user.target
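After `systemctl daemon-reload` and a restart of the unit, it is worth confirming that the limits actually landed and then watching whether they fire. Below is a minimal Python sketch of that check. It assumes a cgroup v2 (unified) hierarchy mounted at /sys/fs/cgroup and a systemd new enough to support `systemctl show --value`. The script and its helpers (`parse_memory_events`, `unit_cgroup`, `read_opt`, `report`) are hypothetical and not part of t3. The script reads the limit files that systemd derives from MemoryHigh=, MemoryMax= and MemorySwapMax=, plus the kernel's `memory.events` counters for the unit's cgroup.

```python
#!/usr/bin/env python3
"""Hypothetical helper: show a systemd unit's cgroup v2 memory limits and events.

Assumes the unified cgroup v2 hierarchy is mounted at /sys/fs/cgroup.
"""
import subprocess
import sys
from pathlib import Path

CGROUP_ROOT = Path("/sys/fs/cgroup")


def parse_memory_events(text: str) -> dict[str, int]:
    """Parse memory.events ("key value" per line) into a dict of counters."""
    events: dict[str, int] = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            events[parts[0]] = int(parts[1])
    return events


def unit_cgroup(unit: str) -> Path:
    """Ask systemd for the unit's cgroup path and map it under CGROUP_ROOT."""
    cg = subprocess.run(
        ["systemctl", "show", "--value", "-p", "ControlGroup", unit],
        check=True, capture_output=True, text=True,
    ).stdout.strip()
    if not cg:
        raise RuntimeError(f"{unit} has no cgroup; is it running?")
    return CGROUP_ROOT / cg.lstrip("/")


def read_opt(path: Path) -> str:
    """Read a cgroup file; memory.swap.max is absent without swap accounting."""
    try:
        return path.read_text().strip()
    except FileNotFoundError:
        return "n/a"


def report(unit: str) -> None:
    cg = unit_cgroup(unit)
    # Limits in bytes as written by systemd ("max" means no limit).
    for name in ("memory.current", "memory.high", "memory.max", "memory.swap.max"):
        print(f"{name}: {read_opt(cg / name)}")
    events = parse_memory_events((cg / "memory.events").read_text())
    # high: throttling events above MemoryHigh; max: times usage was about
    # to exceed MemoryMax; oom_kill: processes in this cgroup killed by any
    # OOM killer (cgroup-local or global).
    print(f"high={events.get('high', 0)} max={events.get('max', 0)} "
          f"oom_kill={events.get('oom_kill', 0)}")


# Only runs when a unit name is passed on the command line.
if __name__ == "__main__" and len(sys.argv) > 1:
    report(sys.argv[1])
```

Run it as `python3 cgmem.py t3-serve@wizard.service` (cgmem.py is a hypothetical file name). A climbing `high` count means the cgroup is being throttled at MemoryHigh. With OOMPolicy=continue, a killed agent should show up as an `oom_kill` increment while `systemctl status` still reports the server as active.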