infra/docs/post-mortems
Viktor Barzin 0480477f44 nfs-csi: pin chart v4.13.1 + controller affinity (post-mortem)
Keel rolled csi-driver-nfs 4.13.1→4.13.2 today. The 4.13.2 chart dropped
control-plane exclusion from the controller Deployment, so both replicas
landed on k8s-master, fought for hostNetwork ports 19809/29653, and one
went CrashLoopBackOff. Helm rollback left orphan containerd sandboxes
holding the ports — only a kubelet restart on master cleared them.

- Pin helm_release.version = "4.13.1" so terraform apply can't drift to
  the broken chart (defense in depth; nfs-csi namespace is already in the
  Kyverno-Keel exclude list)
- Add controller.affinity: podAntiAffinity between replicas +
  nodeAffinity excluding node-role.kubernetes.io/control-plane
- docs/post-mortems/2026-05-17-nfs-csi-keel-upgrade-master-port-conflict.md
  captures the root cause + recovery procedure (kubelet restart via
  nsenter is the escalation path when crictl rmp -f fails)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:16:56 +00:00
..
2026-03-16-kured-containerd-cascade-outage.html docs: consolidate all post-mortems under docs/post-mortems/ 2026-04-14 08:24:36 +00:00
2026-03-16-nfs-csi-cascade-failure.md docs: move post-mortems to docs/post-mortems/ 2026-04-14 08:20:09 +00:00
2026-04-14-nfs-fsid0-dns-vault-outage.md docs: update post-mortem follow-up implementation [PM-2026-04-14] [ci skip] 2026-04-14 18:09:11 +00:00
2026-04-14-postmortem-pipeline-test.md fix: use full path to claude CLI for non-interactive SSH 2026-04-14 17:44:50 +00:00
2026-04-18-authentik-outpost-shm-full.md docs/authentik: document postgres session backend + close out 2026-04-18 post-mortem items 2026-05-22 14:16:41 +00:00
2026-04-19-registry-orphan-index.md [registry] bulk-clean 34 orphan manifests + beads-server image bump 2026-04-19 23:16:34 +00:00
2026-04-22-vault-raft-leader-deadlock.md vault: complete Phase 2 NFS-hostile migration; remove nfs-proxmox SC 2026-04-25 17:10:00 +00:00
2026-05-09-io-pressure-stale-nfs.md mysql: bump to 4Gi limit / 3Gi request; grow /srv/nfs LV to 3 TiB 2026-05-10 11:12:38 +00:00
2026-05-16-kured-stalled-and-anubis-ha.md docs/pm: kured silently stalled 6 days + Anubis HA lift (2026-05-16) 2026-05-22 14:16:48 +00:00
2026-05-17-nfs-csi-keel-upgrade-master-port-conflict.md nfs-csi: pin chart v4.13.1 + controller affinity (post-mortem) 2026-05-22 14:16:56 +00:00
index.html docs: consolidate all post-mortems under docs/post-mortems/ 2026-04-14 08:24:36 +00:00