Keel rolled csi-driver-nfs 4.13.1→4.13.2 today. The 4.13.2 chart dropped control-plane exclusion from the controller Deployment, so both replicas landed on k8s-master, fought for hostNetwork ports 19809/29653, and one went CrashLoopBackOff. Helm rollback left orphan containerd sandboxes holding the ports — only a kubelet restart on master cleared them. - Pin helm_release.version = "4.13.1" so terraform apply can't drift to the broken chart (defense in depth; nfs-csi namespace is already in the Kyverno-Keel exclude list) - Add controller.affinity: podAntiAffinity between replicas + nodeAffinity excluding node-role.kubernetes.io/control-plane - docs/post-mortems/2026-05-17-nfs-csi-keel-upgrade-master-port-conflict.md captures the root cause + recovery procedure (kubelet restart via nsenter is the escalation path when crictl rmp -f fails) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> |
||
|---|---|---|
| .. | ||
| 2026-03-16-kured-containerd-cascade-outage.html | ||
| 2026-03-16-nfs-csi-cascade-failure.md | ||
| 2026-04-14-nfs-fsid0-dns-vault-outage.md | ||
| 2026-04-14-postmortem-pipeline-test.md | ||
| 2026-04-18-authentik-outpost-shm-full.md | ||
| 2026-04-19-registry-orphan-index.md | ||
| 2026-04-22-vault-raft-leader-deadlock.md | ||
| 2026-05-09-io-pressure-stale-nfs.md | ||
| 2026-05-16-kured-stalled-and-anubis-ha.md | ||
| 2026-05-17-nfs-csi-keel-upgrade-master-port-conflict.md | ||
| index.html | ||