From 0480477f44663f2e6e1cb4d3f6e7290a84a4b946 Mon Sep 17 00:00:00 2001 From: Viktor Barzin Date: Sun, 17 May 2026 09:11:09 +0000 Subject: [PATCH] nfs-csi: pin chart v4.13.1 + controller affinity (post-mortem) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Keel rolled csi-driver-nfs 4.13.1→4.13.2 today. The 4.13.2 chart dropped control-plane exclusion from the controller Deployment, so both replicas landed on k8s-master, fought for hostNetwork ports 19809/29653, and one went CrashLoopBackOff. Helm rollback left orphan containerd sandboxes holding the ports — only a kubelet restart on master cleared them. - Pin helm_release.version = "4.13.1" so terraform apply can't drift to the broken chart (defense in depth; nfs-csi namespace is already in the Kyverno-Keel exclude list) - Add controller.affinity: podAntiAffinity between replicas + nodeAffinity excluding node-role.kubernetes.io/control-plane - docs/post-mortems/2026-05-17-nfs-csi-keel-upgrade-master-port-conflict.md captures the root cause + recovery procedure (kubelet restart via nsenter is the escalation path when crictl rmp -f fails) Co-Authored-By: Claude Opus 4.7 --- ...s-csi-keel-upgrade-master-port-conflict.md | 133 ++++++++++++++++++ stacks/nfs-csi/modules/nfs-csi/main.tf | 44 ++++++ 2 files changed, 177 insertions(+) create mode 100644 docs/post-mortems/2026-05-17-nfs-csi-keel-upgrade-master-port-conflict.md diff --git a/docs/post-mortems/2026-05-17-nfs-csi-keel-upgrade-master-port-conflict.md b/docs/post-mortems/2026-05-17-nfs-csi-keel-upgrade-master-port-conflict.md new file mode 100644 index 00000000..df9e8c0a --- /dev/null +++ b/docs/post-mortems/2026-05-17-nfs-csi-keel-upgrade-master-port-conflict.md @@ -0,0 +1,133 @@ +# Post-Mortem: nfs-csi Keel-Triggered Upgrade Broke Master Node CSI + +**Date:** 2026-05-17 +**Author:** Viktor Barzin / Claude (incident response) +**Severity:** SEV-3 (1 of 5 CSI node DaemonSet pods stuck CrashLoopBackOff; controller pair flapping) +**Duration:** ~2 hours from first detection to all-green + +## Summary + +The Keel auto-update operator polled the `csi-driver-nfs` Helm chart and rolled +`v4.13.1 → v4.13.2`. The new chart's controller Deployment scheduled both +replicas onto `k8s-master` (no built-in control-plane exclusion). Both replicas +used `hostNetwork: true` and tried to bind the same host ports +(`19809` for `node-driver-registrar`, `29653` for `liveness-probe`), so one +controller pod CrashLoopBackOff'd with `bind: address already in use`. The +upgrade also left behind multiple orphan controller pods in containerd that +kubelet could no longer reconcile — they held the host ports even after the +helm rollback removed them from K8s state. + +The `csi-nfs-node` DaemonSet pod on master then could not start either: its +own `node-driver-registrar` and `liveness-probe` containers tried to bind +the same host ports and lost to the zombies. + +## Impact + +- 1× `csi-nfs-node` pod on `k8s-master` stuck CrashLoopBackOff (16+ restarts) +- CSI plugin unregistered on master → no NFS volumes could be mounted on + master-hosted pods (calico-typha cert mount failed, etcd backup CronJob + failed) +- Controller flap (2 replicas fighting) → intermittent + `csi-resizer`/`csi-snapshotter` failure for the whole cluster +- Cascade: kured-sentinel, node-local-dns, prometheus-node-exporter, + csi-node-driver (Calico) all bounced on master while kubelet thrashed + +No data loss; no production-facing outages observed (CSI mounts on the four +worker nodes kept working). + +## Timeline (Europe/Sofia, UTC+3) + +- ~07:46 — Keel polls forgejo + DockerHub manifests, sees a new digest under + the `csi-driver-nfs` `4.13.x` channel, triggers Helm upgrade +- 07:46:16 — `helm upgrade csi-driver-nfs` runs; new controller Deployment + scheduled (no `affinity` block → both replicas land on `k8s-master`) +- ~07:50 — Controller replicas fight for ports `19809`, `29653`; one stays in + CrashLoopBackOff +- ~08:00 — User notices "CSI issue ... due to the upgrade"; investigation + begins +- 08:15 — `helm rollback csi-driver-nfs` to revision 8 (v4.13.1) — controllers + on master deleted via K8s, but containerd retains them as live sandboxes +- 08:30 — Live `podAntiAffinity` + `nodeAffinity: control-plane DoesNotExist` + added to the controller Deployment via patch (controllers now correctly + schedule on node1+node3) +- 08:40 — `csi-nfs-node` master pod still CrashLoopBackOff; ports 19809/29653 + held by orphan PIDs (livenessprobe PID 1816, csi-node-driver PID 1944, + plus 5× csi-provisioner from zombie controller pods) +- 09:00 — Privileged pkill via `hostPID: true` pod failed + (`permission denied` from runc — containerd refused to signal init in the + zombie containers) +- 09:03 — `nsenter -t 1 -m -p -u systemctl restart kubelet` on master cleared + the orphan containers via cgroup GC; ports freed +- 09:04 — `csi-nfs-node` master pod reaches 3/3 Ready; cluster green +- 09:09 — Terraform `apply`: pin `helm_release.version = "4.13.1"`, add + `controller.affinity` to values + +## Root Causes + +1. **`csi-driver-nfs` Helm chart in TF was unpinned.** The `helm_release` had + no `version = ...` field, so it floated to whatever the chart repo + advertised. Keel polled this and rolled forward. +2. **Chart v4.13.2 dropped the implicit control-plane exclusion** that v4.13.1 + shipped with. Without it, the K8s scheduler chose master for both + controller replicas. +3. **Two controller replicas + hostNetwork = port conflict on the same node.** + The chart did not add `podAntiAffinity` between the replicas. Live state + has it now; TF now does too. +4. **Helm rollback does not always clean containerd sandboxes.** When the + prior revision's pods are abandoned mid-flight (image-pull-pending, etc.), + containerd can keep multiple sandbox instances for the same pod-UID. + Kubelet GC is the only thing that reliably reaps these — restarting it + forces a reconciliation pass that drops orphans. + +## What We Fixed + +- **`stacks/nfs-csi/modules/nfs-csi/main.tf`** (this commit): + - `version = "4.13.1"` pin on the `helm_release` (defense in depth — namespace + is already excluded from Kyverno-Keel injection, but the chart could still + drift on a `terraform apply` without a pin) + - `controller.affinity` block with `podAntiAffinity` (different hosts for + replicas) and `nodeAffinity` (exclude `node-role.kubernetes.io/control-plane`) + - Inline comments explaining both decisions +- **Kyverno keel-annotations**: `nfs-csi` was already in the namespace exclude + list (decision from authentik incident 2026-05-17). Verified still there + in `stacks/kyverno/modules/kyverno/keel-annotations.tf:91`. + +## Recovery Procedure (next time) + +If `csi-nfs-node` on a node CrashLoopBackOff with `bind: address already in use`: + +1. **Find which host ports are bound** — `lsof -i :19809`, `lsof -i :29653` + (from a privileged hostPID pod on the affected node). +2. **Try `crictl rmp -f `** on zombie pods (those K8s no longer + tracks). Will fail with `unable to signal init: permission denied` if + the containers are sufficiently stuck. +3. **Restart kubelet on the affected node** via `nsenter -t 1 -m -p -u + systemctl restart kubelet` (privileged hostPID pod). Kubelet's GC + reconciles containerd state and reaps the orphans. +4. **Force-delete the DaemonSet pod** to clear the back-off + (`kubectl delete pod -n nfs-csi csi-nfs-node-XXXX --force --grace-period=0`). + DaemonSet recreates it; with the ports free, containers start cleanly. + +## Action Items + +- [x] Pin `csi-driver-nfs` chart version in TF +- [x] Add `controller.affinity` to TF (podAntiAffinity + control-plane exclude) +- [x] Document recovery procedure (this post-mortem) +- [ ] Audit other unpinned `helm_release` blocks — every chart used in + Kyverno-excluded namespaces should still be pinned to prevent + `terraform apply` drift. (Filed as follow-up — not blocking.) +- [ ] Consider adding a `kured` or daily script that detects orphan + containerd sandboxes whose pod-UID is unknown to the apiserver and + reaps them automatically. (Filed as follow-up — not blocking.) + +## Lessons + +- **Keel exclusion ≠ chart pin.** The namespace was already excluded from + Keel injection, but the helm_release was unpinned — so a `terraform apply` + alone could re-trigger the same break. Both layers needed locking down. +- **`crictl rmp -f` is not always sufficient.** When containerd refuses to + signal init, kubelet restart is the next escalation step before SSH/reboot. +- **The Keel rollout phase 2-6 design ASSUMED stateful operators were + excluded.** CSI was correctly excluded — but the chart version itself was + still a moving target via plain `terraform apply`. The exclude-list catches + Keel; the version pin catches everything else. diff --git a/stacks/nfs-csi/modules/nfs-csi/main.tf b/stacks/nfs-csi/modules/nfs-csi/main.tf index 22fd6a27..d030be2f 100644 --- a/stacks/nfs-csi/modules/nfs-csi/main.tf +++ b/stacks/nfs-csi/modules/nfs-csi/main.tf @@ -23,10 +23,54 @@ resource "helm_release" "nfs_csi_driver" { repository = "https://raw.githubusercontent.com/kubernetes-csi/csi-driver-nfs/master/charts" chart = "csi-driver-nfs" + # Pinned 2026-05-17. Keel polled and rolled csi-driver-nfs 4.13.1 → 4.13.2, + # which broke the cluster: + # * Controller pods ended up on k8s-master because the new chart removed + # control-plane exclusion from the default node selector. + # * Two controller replicas on the same node fought over hostNetwork ports + # 19809 (node-driver-registrar) and 29653 (liveness-probe). One replica + # CrashLoopBackOff'd with `bind: address already in use`. + # * Rolling back live (helm rollback) left zombie containerd containers + # holding the ports — only a kubelet restart cleared them. + # nfs-csi namespace is in the Kyverno keel exclude list (keel-annotations.tf) + # so Keel will not touch it again. This version pin is the second line of + # defense against accidental floating-version drift on `terraform apply`. + version = "4.13.1" values = [yamlencode({ controller = { replicas = 2 + # Required to coexist with the v4.13.1 chart on a 1-master + 4-worker + # cluster: + # * podAntiAffinity forces the 2 controller replicas onto DIFFERENT + # hosts (host network ports 19809/29653 are per-host). + # * nodeAffinity excludes the control-plane node entirely so the + # scheduler can't pick master when a worker is briefly NotReady. + # Without these, Kubernetes can schedule both replicas on the same node + # (port conflict) or on master itself (which already runs the DaemonSet + # pod and would conflict with it). + affinity = { + nodeAffinity = { + requiredDuringSchedulingIgnoredDuringExecution = { + nodeSelectorTerms = [{ + matchExpressions = [{ + key = "node-role.kubernetes.io/control-plane" + operator = "DoesNotExist" + }] + }] + } + } + podAntiAffinity = { + requiredDuringSchedulingIgnoredDuringExecution = [{ + labelSelector = { + matchLabels = { + app = "csi-nfs-controller" + } + } + topologyKey = "kubernetes.io/hostname" + }] + } + } livenessProbe = { httpPort = 29653 }