infra/scripts/workstation
Viktor Barzin 4c532dbf97
All checks were successful
ci/woodpecker/push/postmortem-todos Pipeline was successful
ci/woodpecker/push/default Pipeline was successful
devvm containment: drop the MemoryHigh throttle band, straight to MemoryMax OOM
t3.viktorbarzin.me went down 2026-07-02 15:42-16:35 UTC: an agent-spawned
12.3G ugrep plateaued inside t3-serve@wizard's MemoryHigh(12G)..MemoryMax(16G)
band. With MemorySwapMax=0 its anon pages were unreclaimable, so the kernel
throttled every task in the cgroup indefinitely (memory.pressure full ~80%,
oom_kill never fired) - the t3 event loop starved, the accept queue rotted,
and the terminal was dead until the hog was SIGKILLed by hand.

The 2026-06-22 design assumed 'throttle to a crawl, then OOM locally'; a hog
that stabilises between high and max never OOMs, so the throttle band is a
livelock zone, not a safety layer. Viktor asked to close that gap: MemoryHigh
is now explicitly infinity on all three work cgroup definitions (t3-serve@
unit, user-<uid>.slice drop-in, docker.slice) so a runaway is cgroup-OOM-
killed at MemoryMax immediately - OOMPolicy=continue already keeps the t3
server alive when a child dies. MemoryMax/MemorySwapMax=0/earlyoom unchanged.
Applied live to the devvm the same day (daemon-reload + runtime set-property
on running cgroups, no session restarts). Post-mortem addendum + runbook
updated in the same commit.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-07-02 16:59:38 +00:00
..
claude-hooks homelab v0.8.2: fix memory recall truncating multibyte UTF-8 mid-character 2026-06-28 09:40:51 +00:00
claude-skills devvm: personalize emo's cluster-health skill for ha-sofia 2026-06-26 16:03:14 +00:00
playwright workstation: per-user playwright browser MCP for all users, reproducible from git 2026-06-16 20:33:47 +00:00
skel workstation: per-user long-lived Claude token to end concurrent-refresh logout 2026-06-28 08:07:43 +00:00
.gitignore fix: restore tree dropped by 6d224861; land stem95su gdrive-sync (10m) [ci skip] 2026-06-09 08:45:33 +00:00
claude-auth-sync.sh workstation: per-user long-lived Claude token to end concurrent-refresh logout 2026-06-28 08:07:43 +00:00
managed-settings.json fix(workstation): carry OS/sudo authz policy into managed-settings source + multi-tenancy doc 2026-06-26 08:25:33 +00:00
packages.txt workstation: switch devvm OOM backstop from systemd-oomd to earlyoom 2026-06-22 10:39:16 +00:00
roster.yaml workstation: per-user code_layout — workspace puts project repos under ~/code (ancamilea + tripit) 2026-06-10 18:05:31 +00:00
roster_engine.py workstation: per-user playwright browser MCP for all users, reproducible from git 2026-06-16 20:33:47 +00:00
setup-devvm.sh devvm containment: drop the MemoryHigh throttle band, straight to MemoryMax OOM 2026-07-02 16:59:38 +00:00
test_roster_engine.py workstation: per-user playwright browser MCP for all users, reproducible from git 2026-06-16 20:33:47 +00:00