infra/stacks/nfs-csi
Viktor Barzin a3bebd5c0d nfs-csi: pin chart v4.13.1 + controller affinity (post-mortem)
Keel rolled csi-driver-nfs 4.13.1→4.13.2 today. The 4.13.2 chart dropped
control-plane exclusion from the controller Deployment, so both replicas
landed on k8s-master, fought for hostNetwork ports 19809/29653, and one
went CrashLoopBackOff. Helm rollback left orphan containerd sandboxes
holding the ports — only a kubelet restart on master cleared them.

- Pin helm_release.version = "4.13.1" so terraform apply can't drift to
  the broken chart (defense in depth; nfs-csi namespace is already in the
  Kyverno-Keel exclude list)
- Add controller.affinity: podAntiAffinity between replicas +
  nodeAffinity excluding node-role.kubernetes.io/control-plane
- docs/post-mortems/2026-05-17-nfs-csi-keel-upgrade-master-port-conflict.md
  captures the root cause + recovery procedure (kubelet restart via
  nsenter is the escalation path when crictl rmp -f fails)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-17 09:11:09 +00:00
..
modules/nfs-csi nfs-csi: pin chart v4.13.1 + controller affinity (post-mortem) 2026-05-17 09:11:09 +00:00
main.tf extract remaining 19 modules from platform, complete stack split [ci skip] 2026-03-17 21:42:16 +00:00
secrets extract remaining 19 modules from platform, complete stack split [ci skip] 2026-03-17 21:42:16 +00:00
terragrunt.hcl extract remaining 19 modules from platform, complete stack split [ci skip] 2026-03-17 21:42:16 +00:00