infra/scripts/t3-serve@.service

[Unit]
Description=T3 Code server for %i (t3 serve, per-user)
Documentation=https://github.com/pingdotgg/t3code
After=network.target

[Service]
Type=simple
User=%i
Group=%i
Environment=HOME=/home/%i
Environment=PATH=/usr/local/bin:/usr/bin:/bin:/home/%i/.local/bin
Environment=NODE_ENV=production
EnvironmentFile=/etc/t3-serve/%i.env
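# Sketch of the expected env file (an illustration, not copied from any real
# host): one KEY=VALUE per line. T3_PORT is the only variable this unit
# references (in ExecStart below); the port number shown is a placeholder.
#   T3_PORT=3773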
# Optional per-user long-lived CLAUDE_CODE_OAUTH_TOKEN, materialized by
# claude-auth-sync from the user's own Vault path. Non-rotating, so t3's
# concurrent agent sessions can't race on OAuth refresh-token rotation and wipe
# the shared ~/.claude/.credentials.json. Leading '-' = optional (absent for
# users on the normal per-user Enterprise-SSO credential flow).
EnvironmentFile=-/home/%i/.config/claude-auth-sync/claude-oauth.env
WorkingDirectory=/home/%i
ExecStart=/usr/bin/t3 serve --host 0.0.0.0 --port ${T3_PORT} --base-dir /home/%i/.t3
Restart=on-failure
RestartSec=5
# Memory containment (2026-06-10, amended 2026-07-02): agent children live in
# this cgroup; a runaway agent (10.8G anon on a 23G host) swap-thrashed the
# whole devvm — every >20s stall fires the t3 client watchdog (visible
# "disconnects") — then global-OOMed. Cap the cgroup so a runaway OOMs early
# and locally, and forbid swap so stalls can't smear into minutes-long freezes.
# MemoryHigh is DELIBERATELY infinity — do not add a soft band below MemoryMax:
# with swap=0 a hog that plateaus between high and max is unreclaimable but
# never OOMs, and the kernel's high-throttle stalls EVERY task in the cgroup
# (the t3 event loop included) indefinitely. A 12.3G agent ugrep livelocked
# this unit for ~50min on 2026-07-02 exactly this way. Straight-to-OOM at
# MemoryMax is the containment; OOMPolicy=continue below keeps the server up.
# See docs/post-mortems/2026-06-22-devvm-mem-io-overload-containment.md addendum.
MemoryHigh=infinity
MemoryMax=16G
MemorySwapMax=0
# Default OOMPolicy=stop kills the WHOLE unit (8.5min outage 2026-06-10
# 19:56) when ANY child is OOM-killed; continue = runaway dies, server stays.
OOMPolicy=continue
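# Verification sketch (an illustration; <user> is a placeholder instance name).
# After a daemon-reload, the effective limits can be read back with:
#   systemctl show -p MemoryHigh,MemoryMax,MemorySwapMax,OOMPolicy t3-serve@<user>.service
# Cgroup-local OOM kills are counted in the oom_kill field of memory.events
# under the cgroup path reported by:
#   systemctl show -p ControlGroup t3-serve@<user>.service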

[Install]
WantedBy=multi-user.target