infra

Viktor Barzin 9c68d147e0	2026-06-25 15:23:15 +00:00
CI: ci/woodpecker/push/postmortem-todos pipeline was successful; ci/woodpecker/push/default pipeline failed.

k8s-upgrade: reclaim+auto-prune kubeadm /etc/kubernetes/tmp leak; correct crash root cause to etcd IO (not OIDC)

Digging into "why did the apiserver crash" disproved the earlier OIDC explanation. An isolated v1.35.6 apiserver repro with authentik reachable initialises OIDC cleanly (oidc.go:313, no error) and runs fine, so the --authentication-config -> --oidc-* revert is NOT what crashed it.

etcd's surviving crash-window log is the real cause: 1180 "apply request took too long" warnings in 16 min, individual applies up to 4.3s (healthy <100ms) right as kubeadm tried to bring up the new apiserver. That's etcd IO starvation on the shared sdc HDD (beads code-oflt).

A big contributor + the reason master root fs sat at 73%: kubeadm dumps a full ~400MB etcd DB backup into /etc/kubernetes/tmp/kubeadm-backup-etcd-<ts>/ before every etcd upgrade and never cleans it up. 145 dirs / 28GB had accumulated, driving image-GC churn and extra write-IO onto etcd's spindle. Reclaimed live (73% -> 23%) and added a preflight prune (>3 days) so it can't re-accumulate.

Also corrected the OIDC handling: the kubeadm-config drift is real but only breaks dashboard/kubectl SSO AFTER a successful upgrade (recoverable via the chain's restore.sh + the kubeadm-config reconciliation); it does not crash the apiserver. So the preflight check is now an ALERT, not a block (was added on the wrong hypothesis). Post-mortem, runbook, and apiserver-oidc.tf header corrected.

Per Viktor: reclaim the disk and automate so the manual cleanup never recurs; the durable IO fix remains code-oflt (etcd off the shared HDD).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
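
The preflight prune mentioned in the commit message is not shown in this listing. As a rough illustration only, a prune of kubeadm etcd backup directories older than 3 days could look like the Python sketch below. The function names (`stale_backups`, `prune`), the use of directory mtime as the age signal (rather than parsing the `<ts>` suffix, whose format is not given here), and the dry-run default are all assumptions for this sketch, not the repository's actual implementation.

```python
import shutil
import time
from pathlib import Path

# Path and prefix come from the commit message; everything else is hypothetical.
BACKUP_ROOT = Path("/etc/kubernetes/tmp")
PREFIX = "kubeadm-backup-etcd-"
MAX_AGE_DAYS = 3


def stale_backups(root, max_age_days=MAX_AGE_DAYS, now=None):
    """Return backup dirs under root whose mtime is older than max_age_days.

    Assumption: directory mtime approximates the backup's age.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    if not root.is_dir():
        return []
    return sorted(
        p
        for p in root.iterdir()
        if p.is_dir()
        and not p.is_symlink()
        and p.name.startswith(PREFIX)
        and p.stat().st_mtime < cutoff
    )


def prune(root=BACKUP_ROOT, max_age_days=MAX_AGE_DAYS, dry_run=True, now=None):
    """Delete stale backup dirs; with dry_run=True only report them."""
    selected = stale_backups(root, max_age_days, now)
    for path in selected:
        if not dry_run:
            shutil.rmtree(path)
    return selected
```

Calling `prune()` with the defaults only lists candidates; `prune(dry_run=False)` would delete them. Matching on the `kubeadm-backup-etcd-` prefix keeps other contents of /etc/kubernetes/tmp untouched.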
2026-03-16-kured-containerd-cascade-outage.html	fix: restore tree dropped by `6d224861`; land stem95su gdrive-sync (10m) [ci skip]	2026-06-09 08:45:33 +00:00
2026-03-16-nfs-csi-cascade-failure.md	fix: restore tree dropped by `6d224861`; land stem95su gdrive-sync (10m) [ci skip]	2026-06-09 08:45:33 +00:00
2026-04-14-nfs-fsid0-dns-vault-outage.md	fix: restore tree dropped by `6d224861`; land stem95su gdrive-sync (10m) [ci skip]	2026-06-09 08:45:33 +00:00
2026-04-14-postmortem-pipeline-test.md	fix: restore tree dropped by `6d224861`; land stem95su gdrive-sync (10m) [ci skip]	2026-06-09 08:45:33 +00:00
2026-04-18-authentik-outpost-shm-full.md	fix: restore tree dropped by `6d224861`; land stem95su gdrive-sync (10m) [ci skip]	2026-06-09 08:45:33 +00:00
2026-04-19-registry-orphan-index.md	fix: restore tree dropped by `6d224861`; land stem95su gdrive-sync (10m) [ci skip]	2026-06-09 08:45:33 +00:00
2026-04-22-vault-raft-leader-deadlock.md	fix: restore tree dropped by `6d224861`; land stem95su gdrive-sync (10m) [ci skip]	2026-06-09 08:45:33 +00:00
2026-05-09-io-pressure-stale-nfs.md	fix: restore tree dropped by `6d224861`; land stem95su gdrive-sync (10m) [ci skip]	2026-06-09 08:45:33 +00:00
2026-05-16-kured-stalled-and-anubis-ha.md	fix: restore tree dropped by `6d224861`; land stem95su gdrive-sync (10m) [ci skip]	2026-06-09 08:45:33 +00:00
2026-05-17-gpu-driver-ubuntu2604-mismatch.md	fix: restore tree dropped by `6d224861`; land stem95su gdrive-sync (10m) [ci skip]	2026-06-09 08:45:33 +00:00
2026-05-17-nfs-csi-keel-upgrade-master-port-conflict.md	fix: restore tree dropped by `6d224861`; land stem95su gdrive-sync (10m) [ci skip]	2026-06-09 08:45:33 +00:00
2026-05-25-immich-anca-elements-io-storm.md	immich: remove one-shot anca-elements-import Job + its PVC	2026-06-16 22:11:27 +00:00
2026-05-30-redis-split-brain.md	fix: restore tree dropped by `6d224861`; land stem95su gdrive-sync (10m) [ci skip]	2026-06-09 08:45:33 +00:00
2026-05-31-kured-sentinel-gate-oom.md	fix: restore tree dropped by `6d224861`; land stem95su gdrive-sync (10m) [ci skip]	2026-06-09 08:45:33 +00:00
2026-06-01-cloudflared-stale-traefik-origin.md	fix: restore tree dropped by `6d224861`; land stem95su gdrive-sync (10m) [ci skip]	2026-06-09 08:45:33 +00:00
2026-06-01-keel-match-tag-image-swap.md	fix: restore tree dropped by `6d224861`; land stem95su gdrive-sync (10m) [ci skip]	2026-06-09 08:45:33 +00:00
2026-06-02-immich-ml-ttl-gpu-oom-recruiter.md	fix: restore tree dropped by `6d224861`; land stem95su gdrive-sync (10m) [ci skip]	2026-06-09 08:45:33 +00:00
2026-06-09-t3-nightly-autoupdate-auth-outage.md	t3: docs for the gated nightly tracker (runbook, post-mortem, service-catalog)	2026-06-16 11:33:49 +00:00
2026-06-10-authentik-downgrade-boot-storm.md	authentik: incident hardening after the signin-speedup rollout storm	2026-06-11 00:26:52 +00:00
2026-06-10-forgejo-retention-orphaned-indexes.md	forgejo retention: revert to DRY_RUN — first live run orphaned OCI indexes [ci skip]	2026-06-10 09:22:47 +00:00
2026-06-10-tuya-bridge-forgejo-pull-hairpin.md	coredns: pods get internal split-horizon answers for viktorbarzin.me [ci skip]	2026-06-10 16:21:34 +00:00
2026-06-11-devvm-qemu-io-stall.md	apply-mbps-caps: compare normalized option sets (true idempotency) + devvm I/O-stall post-mortem [ci skip]	2026-06-11 18:00:08 +00:00
2026-06-22-devvm-mem-io-overload-containment.md	workstation: switch devvm OOM backstop from systemd-oomd to earlyoom	2026-06-22 10:39:16 +00:00
2026-06-24-kubeadm-oidc-drift-apiserver-upgrade-stall.md	k8s-upgrade: reclaim+auto-prune kubeadm /etc/kubernetes/tmp leak; correct crash root cause to etcd IO (not OIDC)	2026-06-25 15:23:15 +00:00
index.html	fix: restore tree dropped by `6d224861`; land stem95su gdrive-sync (10m) [ci skip]	2026-06-09 08:45:33 +00:00