infra/docs/post-mortems
Viktor Barzin de163aa6af
workstation: switch devvm OOM backstop from systemd-oomd to earlyoom
The systemd-oomd backstop added in the previous commit is INERT on this box.
oomd's memory-pressure kill only acts on cgroups doing active reclaim (pgscan
rising); with MemorySwapMax=0 + anonymous agent memory there is nothing to
reclaim, so pgscan stays 0 and oomd never fires. Proven live: a cgroup held at
96-99% memory.pressure for >70s with pgscan=0 was never killed (oomctl + balloon).
The very swap=0 that kills the IO storm also neuters oomd.
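
The "high pressure, no reclaim" condition described above can be checked from a cgroup's own files. The following is a minimal diagnostic sketch, not oomd's actual logic: it assumes cgroup v2 `memory.pressure` and `memory.stat` text as input, takes two `memory.stat` samples some seconds apart, and uses the `some avg10` figure and a 90% cutoff as illustrative choices (the commit does not say which pressure line was read). The function names are hypothetical.

```python
def parse_pressure(text):
    """Parse cgroup v2 memory.pressure text into {"some": {...}, "full": {...}}."""
    out = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        kind, *pairs = line.split()
        out[kind] = {k: float(v) for k, v in (p.split("=") for p in pairs)}
    return out


def parse_stat(text):
    """Parse cgroup v2 memory.stat text ("key value" per line) into a dict."""
    return {
        key: int(value)
        for key, value in (line.split() for line in text.splitlines() if line.strip())
    }


def pressured_without_reclaim(pressure_text, stat_before, stat_after,
                              min_avg10=90.0):  # cutoff is an assumption
    """True when pressure is high but pgscan did not move between two samples."""
    avg10 = parse_pressure(pressure_text)["some"]["avg10"]
    scanned = parse_stat(stat_after)["pgscan"] - parse_stat(stat_before)["pgscan"]
    return avg10 >= min_avg10 and scanned == 0
```

Feeding it the contents of a cgroup's `memory.pressure` and two reads of its `memory.stat` would flag the state this commit reports: sustained pressure with pgscan stuck at its earlier value.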

Replace it with earlyoom, which watches free RAM (MemAvailable%) and is
swap-independent: SIGTERM the biggest task at 5%, SIGKILL at 3%, swap ignored
(-s 100). It --avoids sshd/systemd/dockerd/containerd/t3-dispatch/tmux (the
admin's way in always survives) and --prefers the agent/browser hogs. Verified
via --dryrun: fires on the RAM threshold and selects a chrome process, not a
protected daemon.
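
The threshold rule above can be illustrated with a short sketch. This is not earlyoom's source code: it only models the two percentages named in this commit (SIGTERM at 5%, SIGKILL at 3% of MemAvailable), treats the comparison as inclusive by assumption, and leaves out victim selection and the --avoid/--prefer handling entirely. The function names are hypothetical.

```python
import signal

TERM_PERCENT = 5.0  # SIGTERM threshold named in the commit message
KILL_PERCENT = 3.0  # SIGKILL threshold named in the commit message


def mem_available_percent(meminfo_text):
    """Return MemAvailable as a percentage of MemTotal from /proc/meminfo text."""
    fields = {}
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            fields[key.strip()] = int(parts[0])  # first token is the numeric value
    return 100.0 * fields["MemAvailable"] / fields["MemTotal"]


def choose_signal(available_percent):
    """Map the available-RAM percentage to a signal, or None when above both thresholds."""
    if available_percent <= KILL_PERCENT:  # inclusive comparison is an assumption
        return signal.SIGKILL
    if available_percent <= TERM_PERCENT:
        return signal.SIGTERM
    return None
```

On a Linux host, passing the contents of `/proc/meminfo` to `mem_available_percent` and the result to `choose_signal` shows which action this rule would take at that moment. Swap never enters the calculation, which is the property the commit relies on.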

The per-cgroup caps (MemoryHigh=12G/MemoryMax=16G/MemorySwapMax=0 per user,
docker.slice 8G) are unchanged and remain the PRIMARY guard — earlyoom is the
aggregate net for the rare all-users-maxed case. systemd-oomd purged; its config
+ ManagedOOM drop-ins removed. Post-mortem updated with the finding.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-22 10:39:16 +00:00
2026-03-16-kured-containerd-cascade-outage.html fix: restore tree dropped by 6d224861; land stem95su gdrive-sync (10m) [ci skip] 2026-06-09 08:45:33 +00:00
2026-03-16-nfs-csi-cascade-failure.md fix: restore tree dropped by 6d224861; land stem95su gdrive-sync (10m) [ci skip] 2026-06-09 08:45:33 +00:00
2026-04-14-nfs-fsid0-dns-vault-outage.md fix: restore tree dropped by 6d224861; land stem95su gdrive-sync (10m) [ci skip] 2026-06-09 08:45:33 +00:00
2026-04-14-postmortem-pipeline-test.md fix: restore tree dropped by 6d224861; land stem95su gdrive-sync (10m) [ci skip] 2026-06-09 08:45:33 +00:00
2026-04-18-authentik-outpost-shm-full.md fix: restore tree dropped by 6d224861; land stem95su gdrive-sync (10m) [ci skip] 2026-06-09 08:45:33 +00:00
2026-04-19-registry-orphan-index.md fix: restore tree dropped by 6d224861; land stem95su gdrive-sync (10m) [ci skip] 2026-06-09 08:45:33 +00:00
2026-04-22-vault-raft-leader-deadlock.md fix: restore tree dropped by 6d224861; land stem95su gdrive-sync (10m) [ci skip] 2026-06-09 08:45:33 +00:00
2026-05-09-io-pressure-stale-nfs.md fix: restore tree dropped by 6d224861; land stem95su gdrive-sync (10m) [ci skip] 2026-06-09 08:45:33 +00:00
2026-05-16-kured-stalled-and-anubis-ha.md fix: restore tree dropped by 6d224861; land stem95su gdrive-sync (10m) [ci skip] 2026-06-09 08:45:33 +00:00
2026-05-17-gpu-driver-ubuntu2604-mismatch.md fix: restore tree dropped by 6d224861; land stem95su gdrive-sync (10m) [ci skip] 2026-06-09 08:45:33 +00:00
2026-05-17-nfs-csi-keel-upgrade-master-port-conflict.md fix: restore tree dropped by 6d224861; land stem95su gdrive-sync (10m) [ci skip] 2026-06-09 08:45:33 +00:00
2026-05-25-immich-anca-elements-io-storm.md immich: remove one-shot anca-elements-import Job + its PVC 2026-06-16 22:11:27 +00:00
2026-05-30-redis-split-brain.md fix: restore tree dropped by 6d224861; land stem95su gdrive-sync (10m) [ci skip] 2026-06-09 08:45:33 +00:00
2026-05-31-kured-sentinel-gate-oom.md fix: restore tree dropped by 6d224861; land stem95su gdrive-sync (10m) [ci skip] 2026-06-09 08:45:33 +00:00
2026-06-01-cloudflared-stale-traefik-origin.md fix: restore tree dropped by 6d224861; land stem95su gdrive-sync (10m) [ci skip] 2026-06-09 08:45:33 +00:00
2026-06-01-keel-match-tag-image-swap.md fix: restore tree dropped by 6d224861; land stem95su gdrive-sync (10m) [ci skip] 2026-06-09 08:45:33 +00:00
2026-06-02-immich-ml-ttl-gpu-oom-recruiter.md fix: restore tree dropped by 6d224861; land stem95su gdrive-sync (10m) [ci skip] 2026-06-09 08:45:33 +00:00
2026-06-09-t3-nightly-autoupdate-auth-outage.md t3: docs for the gated nightly tracker (runbook, post-mortem, service-catalog) 2026-06-16 11:33:49 +00:00
2026-06-10-authentik-downgrade-boot-storm.md authentik: incident hardening after the signin-speedup rollout storm 2026-06-11 00:26:52 +00:00
2026-06-10-forgejo-retention-orphaned-indexes.md forgejo retention: revert to DRY_RUN — first live run orphaned OCI indexes [ci skip] 2026-06-10 09:22:47 +00:00
2026-06-10-tuya-bridge-forgejo-pull-hairpin.md coredns: pods get internal split-horizon answers for viktorbarzin.me [ci skip] 2026-06-10 16:21:34 +00:00
2026-06-11-devvm-qemu-io-stall.md apply-mbps-caps: compare normalized option sets (true idempotency) + devvm I/O-stall post-mortem [ci skip] 2026-06-11 18:00:08 +00:00
2026-06-22-devvm-mem-io-overload-containment.md workstation: switch devvm OOM backstop from systemd-oomd to earlyoom 2026-06-22 10:39:16 +00:00
index.html fix: restore tree dropped by 6d224861; land stem95su gdrive-sync (10m) [ci skip] 2026-06-09 08:45:33 +00:00