infra/scripts/workstation
Viktor Barzin 3a59f4a8bf
workstation: per-user memory caps + systemd-oomd backstop on devvm
The shared devvm keeps overloading and had to be hard-killed again today
(2026-06-22): a runaway in one user's ssh/tmux session (a 10G ugrep, plus
stacked max-effort agents) grew unbounded, spilled into the disk swap, and
swap-thrashed the throttled virtual disk into an IO storm until the box wedged.

Root cause: ssh/tmux work runs under user-<uid>.slice, left memory-uncontained
by the explicit 2026-06-10 "swap-only" decision, while only the t3-serve tree
was capped. So one user could starve everyone.

This bounds every user on BOTH trees (MemoryHigh=12G, MemoryMax=16G,
MemorySwapMax=0), so work OOMs locally at its ceiling instead of thrashing
swap. It adds a systemd-oomd PSI backstop that sheds the single worst work
cgroup under box-wide pressure while leaving system.slice (sshd and services,
i.e. your way in) protected. It gives system.slice a fair-share CPU/IO
priority edge, and routes docker containers into a capped, oomd-policed
docker.slice so they can't dodge the caps or mis-target oomd. All of this is
durable in setup-devvm.sh, so a VM rebuild reproduces it; systemd-oomd is
added to packages.txt.
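
For illustration, the per-user half of these caps could be expressed as a
drop-in for user-.slice, which systemd applies to every user-<uid>.slice.
This is a minimal sketch, not the contents of setup-devvm.sh: the file name
50-memory-caps.conf and the UNIT_ROOT staging variable are assumptions, and
only the three values come from this commit message. The t3-serve tree, the
oomd policy and docker.slice are not covered here.

```shell
#!/bin/sh
set -eu

# Hypothetical staging root so the sketch runs without root privileges.
# On a real host this would be /etc/systemd/system.
UNIT_ROOT="${UNIT_ROOT:-$(mktemp -d)}"

# A drop-in directory named user-.slice.d applies to all user-<uid>.slice units.
DROPIN_DIR="$UNIT_ROOT/user-.slice.d"
DROPIN_FILE="$DROPIN_DIR/50-memory-caps.conf"   # assumed file name

mkdir -p "$DROPIN_DIR"

# Values taken from the commit message:
#   MemoryHigh    = throttle/reclaim threshold
#   MemoryMax     = hard ceiling (OOM kill inside the slice)
#   MemorySwapMax = 0 keeps the slice out of swap entirely
cat > "$DROPIN_FILE" <<'EOF'
[Slice]
MemoryHigh=12G
MemoryMax=16G
MemorySwapMax=0
EOF

echo "wrote $DROPIN_FILE"
```

When writing to the real unit directory, a `systemctl daemon-reload` would be
needed before the new limits are picked up.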

Applied live and verified: oomctl shows the backstop armed (not dry-run) on the
work slices with system.slice protected; a capped-balloon stress test OOM-killed
locally at the ceiling with swap flat (no thrash).
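
One way to spot-check that the caps are in effect, independent of oomctl, is
to read the cgroup v2 limit files for a user slice. The sketch below is an
assumption about how such a check might look, not the verification that was
run for this commit; the show_limits helper and the CG_ROOT variable are
hypothetical names, and it assumes a unified cgroup v2 hierarchy mounted at
/sys/fs/cgroup.

```shell
#!/bin/sh
set -eu

# Print the memory limits of one cgroup v2 directory.
# $1: path to the cgroup directory
show_limits() {
    cg="$1"
    for f in memory.high memory.max memory.swap.max; do
        if [ -r "$cg/$f" ]; then
            printf '%s=%s\n' "$f" "$(cat "$cg/$f")"
        else
            # File missing: slice not active, or not a cgroup v2 host.
            printf '%s=unavailable\n' "$f"
        fi
    done
}

# Assumed mount point; the slice checked is the caller's own user slice.
CG_ROOT="${CG_ROOT:-/sys/fs/cgroup}"
show_limits "$CG_ROOT/user.slice/user-$(id -u).slice"
```

With the limits from this commit applied, memory.high should read
12884901888 (12 GiB), memory.max 17179869184 (16 GiB), and memory.swap.max 0.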

Post-mortem: docs/post-mortems/2026-06-22-devvm-mem-io-overload-containment.md

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-22 10:25:09 +00:00
claude-hooks workstation: harden memory hooks — prune dead plugin refs + homelab-CLI-only store 2026-06-22 09:24:42 +00:00
playwright workstation: per-user playwright browser MCP for all users, reproducible from git 2026-06-16 20:33:47 +00:00
skel workstation: put ~/.local/bin on PATH so the launcher finds native claude 2026-06-15 17:20:03 +00:00
.gitignore fix: restore tree dropped by 6d224861; land stem95su gdrive-sync (10m) [ci skip] 2026-06-09 08:45:33 +00:00
claude-auth-sync.sh Add per-user Claude auth renewal 2026-06-20 20:10:40 +00:00
managed-settings.json workstation: default Claude model fable-5 → opus-4-8 for all devvm users 2026-06-12 20:59:03 +00:00
packages.txt workstation: per-user memory caps + systemd-oomd backstop on devvm 2026-06-22 10:25:09 +00:00
roster.yaml workstation: per-user code_layout — workspace puts project repos under ~/code (ancamilea + tripit) 2026-06-10 18:05:31 +00:00
roster_engine.py workstation: per-user playwright browser MCP for all users, reproducible from git 2026-06-16 20:33:47 +00:00
setup-devvm.sh workstation: per-user memory caps + systemd-oomd backstop on devvm 2026-06-22 10:25:09 +00:00
test_roster_engine.py workstation: per-user playwright browser MCP for all users, reproducible from git 2026-06-16 20:33:47 +00:00