6d224861 came from a --no-checkout worktree whose empty index made the
commit drop every file except two. This restores 05b50d2b's full tree and
correctly adds stacks/stem95su/gdrive-sync.tf + the service-catalog stem95su
entry. Forward-only (parent=6d224861, no force-push); [ci skip] since the
live infra was never applied from the broken commit.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Post-Mortem: nfs-csi Keel-Triggered Upgrade Broke Master Node CSI
Date: 2026-05-17
Author: Viktor Barzin / Claude (incident response)
Severity: SEV-3 (1 of 5 CSI node DaemonSet pods stuck in CrashLoopBackOff; controller pair flapping)
Duration: ~2 hours from first detection to all-green
Summary
The Keel auto-update operator polled the csi-driver-nfs Helm chart and rolled
v4.13.1 → v4.13.2. The new chart's controller Deployment scheduled both
replicas onto k8s-master (no built-in control-plane exclusion). Both replicas
used hostNetwork: true and tried to bind the same host ports
(19809 for node-driver-registrar, 29653 for liveness-probe), so one
controller pod CrashLoopBackOff'd with bind: address already in use. The
upgrade also left behind multiple orphan controller pods in containerd that
kubelet could no longer reconcile — they held the host ports even after the
helm rollback removed them from K8s state.
The csi-nfs-node DaemonSet pod on master then could not start either: its
own node-driver-registrar and liveness-probe containers tried to bind
the same host ports and lost to the zombies.
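The failure mode above is the ordinary kernel rule that only one socket can own a given address and port within a network namespace, and hostNetwork: true puts every such pod in the node's namespace. A minimal sketch that reproduces the error locally, assuming nothing about the CSI sidecars themselves (the function name second_bind_errno is hypothetical, and it uses a kernel-chosen port rather than 19809 or 29653):

```python
import errno
import socket
from typing import Optional


def second_bind_errno(host: str = "127.0.0.1") -> Optional[int]:
    """Bind two TCP sockets to one port; return the errno raised by the second bind."""
    first = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    second = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        first.bind((host, 0))  # port 0 lets the kernel pick a free port
        first.listen(1)
        port = first.getsockname()[1]
        try:
            second.bind((host, port))  # same address and port as `first`
        except OSError as exc:
            return exc.errno
        return None  # the second bind unexpectedly succeeded
    finally:
        first.close()
        second.close()


if __name__ == "__main__":
    err = second_bind_errno()
    # Print the symbolic errno name when one is known.
    print(errno.errorcode.get(err, err))
```

On Linux the second bind should report EADDRINUSE, which is the same condition the sidecars logged as bind: address already in use.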
Impact
- 1× csi-nfs-node pod on k8s-master stuck in CrashLoopBackOff (16+ restarts)
- CSI plugin unregistered on master → no NFS volumes could be mounted on master-hosted pods (calico-typha cert mount failed, etcd backup CronJob failed)
- Controller flap (2 replicas fighting) → intermittent csi-resizer/csi-snapshotter failure for the whole cluster
- Cascade: kured-sentinel, node-local-dns, prometheus-node-exporter, csi-node-driver (Calico) all bounced on master while kubelet thrashed
No data loss; no production-facing outages observed (CSI mounts on the four worker nodes kept working).
Timeline (Europe/Sofia, UTC+3)
- ~07:46 - Keel polls forgejo + DockerHub manifests, sees a new digest under the csi-driver-nfs 4.13.x channel, triggers Helm upgrade
- 07:46:16 - helm upgrade csi-driver-nfs runs; new controller Deployment scheduled (no affinity block → both replicas land on k8s-master)
- ~07:50 - Controller replicas fight for ports 19809, 29653; one stays in CrashLoopBackOff
- ~08:00 - User notices "CSI issue ... due to the upgrade"; investigation begins
- 08:15 - helm rollback csi-driver-nfs to revision 8 (v4.13.1); controllers on master deleted via K8s, but containerd retains them as live sandboxes
- 08:30 - Live podAntiAffinity + nodeAffinity: control-plane DoesNotExist added to the controller Deployment via patch (controllers now correctly schedule on node1+node3)
- 08:40 - csi-nfs-node master pod still in CrashLoopBackOff; ports 19809/29653 held by orphan PIDs (livenessprobe PID 1816, csi-node-driver PID 1944, plus 5× csi-provisioner from zombie controller pods)
- 09:00 - Privileged pkill via hostPID: true pod failed (permission denied from runc; containerd refused to signal init in the zombie containers)
- 09:03 - nsenter -t 1 -m -p -u systemctl restart kubelet on master cleared the orphan containers via cgroup GC; ports freed
- 09:04 - csi-nfs-node master pod reaches 3/3 Ready; cluster green
- 09:09 - Terraform apply: pin helm_release.version = "4.13.1", add controller.affinity to values
Root Causes
- csi-driver-nfs Helm chart in TF was unpinned. The helm_release had no version = ... field, so it floated to whatever the chart repo advertised. Keel polled this and rolled forward.
- Chart v4.13.2 dropped the implicit control-plane exclusion that v4.13.1 shipped with. Without it, the K8s scheduler chose master for both controller replicas.
- Two controller replicas + hostNetwork = port conflict on the same node. The chart did not add podAntiAffinity between the replicas. Live state has it now; TF now does too.
- Helm rollback does not always clean containerd sandboxes. When the prior revision's pods are abandoned mid-flight (image-pull-pending, etc.), containerd can keep multiple sandbox instances for the same pod-UID. Kubelet GC is the only thing that reliably reaps these; restarting it forces a reconciliation pass that drops orphans.
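The third root cause can be checked mechanically before a rollout. The sketch below is an illustration, not part of the repo: it takes a Deployment as a dict (for example the parsed output of kubectl get deployment -o json) and flags the combination that caused this incident. The function name host_port_conflict_risk is hypothetical, and the check only looks for the presence of a required anti-affinity rule; it does not verify that the rule's selector actually matches the replicas.

```python
from typing import Any, Dict


def host_port_conflict_risk(deployment: Dict[str, Any]) -> bool:
    """Return True if replicas of a hostNetwork Deployment may share a node."""
    spec = deployment.get("spec", {})
    pod_spec = spec.get("template", {}).get("spec", {})
    if not pod_spec.get("hostNetwork", False):
        return False  # each pod gets its own network namespace
    if spec.get("replicas", 1) < 2:
        return False  # a single replica cannot collide with itself
    anti = (pod_spec.get("affinity") or {}).get("podAntiAffinity") or {}
    # Only the "required" form guarantees separate nodes; "preferred" is best effort.
    return not anti.get("requiredDuringSchedulingIgnoredDuringExecution")


if __name__ == "__main__":
    controller = {"spec": {"replicas": 2, "template": {"spec": {"hostNetwork": True}}}}
    print(host_port_conflict_risk(controller))  # True
```

A check like this would have flagged the v4.13.2 controller Deployment as described above: two replicas, hostNetwork, and no anti-affinity.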
What We Fixed
- stacks/nfs-csi/modules/nfs-csi/main.tf (this commit):
  - version = "4.13.1" pin on the helm_release (defense in depth; the namespace is already excluded from Kyverno-Keel injection, but the chart could still drift on a terraform apply without a pin)
  - controller.affinity block with podAntiAffinity (different hosts for replicas) and nodeAffinity (exclude node-role.kubernetes.io/control-plane)
  - Inline comments explaining both decisions
- Kyverno keel-annotations: nfs-csi was already in the namespace exclude list (decision from authentik incident 2026-05-17). Verified still there in stacks/kyverno/modules/kyverno/keel-annotations.tf:91.
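For readers who want to see the shape of the affinity block, here is a rough sketch that builds it as plain data and prints it as JSON (which is also valid YAML, so it can be pasted into Helm values). It is not copied from main.tf. Assumptions: the pod label app=csi-nfs-controller, the use of the required (not preferred) scheduling form, and the helper name controller_affinity are all illustrative; only the controller.affinity key, the anti-affinity between replicas, and the control-plane exclusion come from the description above.

```python
import json
from typing import Any, Dict


def controller_affinity(app_label: str = "csi-nfs-controller") -> Dict[str, Any]:
    """Build an affinity block: keep off control-plane nodes, one replica per host."""
    return {
        "nodeAffinity": {
            "requiredDuringSchedulingIgnoredDuringExecution": {
                "nodeSelectorTerms": [
                    {
                        "matchExpressions": [
                            {
                                "key": "node-role.kubernetes.io/control-plane",
                                "operator": "DoesNotExist",
                            }
                        ]
                    }
                ]
            }
        },
        "podAntiAffinity": {
            "requiredDuringSchedulingIgnoredDuringExecution": [
                {
                    # Assumed label; check the labels on the real controller pods.
                    "labelSelector": {"matchLabels": {"app": app_label}},
                    "topologyKey": "kubernetes.io/hostname",
                }
            ]
        },
    }


if __name__ == "__main__":
    values = {"controller": {"affinity": controller_affinity()}}
    print(json.dumps(values, indent=2))
```

Using the required form means a replica stays Pending if no eligible worker is free, which is arguably preferable to a second port fight on one node.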
Recovery Procedure (next time)
If csi-nfs-node on a node is in CrashLoopBackOff with bind: address already in use:
- Find which host ports are bound: lsof -i :19809, lsof -i :29653 (from a privileged hostPID pod on the affected node).
- Try crictl rmp -f <pod-id> on zombie pods (those K8s no longer tracks). Will fail with unable to signal init: permission denied if the containers are sufficiently stuck.
- Restart kubelet on the affected node via nsenter -t 1 -m -p -u systemctl restart kubelet (privileged hostPID pod). Kubelet's GC reconciles containerd state and reaps the orphans.
- Force-delete the DaemonSet pod to clear the back-off (kubectl delete pod -n nfs-csi csi-nfs-node-XXXX --force --grace-period=0). DaemonSet recreates it; with the ports free, containers start cleanly.
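If lsof is not available in the debug image, the first step can be approximated by reading the kernel's socket table directly. The sketch below parses the /proc/net/tcp format and reports which of the two ports have a listener. It is an illustration under assumptions: the function names are hypothetical, it does not map ports to PIDs (lsof or ss is still needed for that), and /proc/net/tcp shows the reader's own network namespace, so the pod must either use hostNetwork or, with hostPID, read /proc/1/net/tcp and /proc/1/net/tcp6 instead.

```python
import sys
from typing import Iterable, Set

LISTEN = "0A"  # state code for a listening socket in /proc/net/tcp and tcp6


def listening_ports(proc_net_tcp: str) -> Set[int]:
    """Return the local ports in LISTEN state from /proc/net/tcp-style text."""
    ports: Set[int] = set()
    for line in proc_net_tcp.splitlines()[1:]:  # first line is the column header
        fields = line.split()
        if len(fields) < 4 or fields[3] != LISTEN:
            continue
        # local_address is HEXADDR:HEXPORT; rsplit also handles long IPv6 addresses
        ports.add(int(fields[1].rsplit(":", 1)[1], 16))
    return ports


def held_ports(proc_net_tcp: str, wanted: Iterable[int] = (19809, 29653)) -> Set[int]:
    """Return the subset of `wanted` ports that currently have a listener."""
    return listening_ports(proc_net_tcp) & set(wanted)


if __name__ == "__main__":
    # Optional argument: base directory, e.g. /proc/1/net from a hostPID pod.
    base = sys.argv[1] if len(sys.argv) > 1 else "/proc/net"
    found: Set[int] = set()
    for name in ("tcp", "tcp6"):
        try:
            with open(f"{base}/{name}") as fh:
                found |= held_ports(fh.read())
        except FileNotFoundError:
            pass  # e.g. IPv6 disabled
    print(sorted(found))
```

An empty list after the kubelet restart suggests the ports are free and the DaemonSet pod can be force-deleted.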
Action Items
- Pin csi-driver-nfs chart version in TF
- Add controller.affinity to TF (podAntiAffinity + control-plane exclude)
- Document recovery procedure (this post-mortem)
- Audit other unpinned helm_release blocks: every chart used in Kyverno-excluded namespaces should still be pinned to prevent terraform apply drift. (Filed as follow-up, not blocking.)
- Consider adding a kured or daily script that detects orphan containerd sandboxes whose pod-UID is unknown to the apiserver and reaps them automatically. (Filed as follow-up, not blocking.)
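As a starting point for the last follow-up, the detection half could look roughly like the sketch below. It compares the JSON from crictl pods -o json on a node with the JSON from kubectl get pods -A -o json and lists sandboxes whose pod UID the apiserver does not know. Everything here is an assumption rather than existing tooling: the function names are hypothetical, the script only reports and never reaps, and static pods are handled by also accepting the kubernetes.io/config.hash and kubernetes.io/config.mirror annotation values as known UIDs, since a static pod's sandbox carries the config hash rather than the mirror pod's UID. That last behavior should be confirmed on a real control-plane node before any automated reaping is built on top.

```python
import json
from typing import Any, Dict, List, Set

STATIC_POD_ANNOTATIONS = ("kubernetes.io/config.hash", "kubernetes.io/config.mirror")


def known_pod_uids(kubectl_pods_json: str) -> Set[str]:
    """Collect pod UIDs (and static-pod config hashes) known to the apiserver."""
    known: Set[str] = set()
    for pod in json.loads(kubectl_pods_json).get("items", []):
        meta = pod.get("metadata", {})
        if meta.get("uid"):
            known.add(meta["uid"])
        annotations = meta.get("annotations") or {}
        for key in STATIC_POD_ANNOTATIONS:
            if annotations.get(key):
                known.add(annotations[key])
    return known


def orphan_sandboxes(crictl_pods_json: str, kubectl_pods_json: str) -> List[Dict[str, Any]]:
    """Return sandboxes whose pod UID is unknown to the apiserver (report only)."""
    known = known_pod_uids(kubectl_pods_json)
    orphans = []
    for sandbox in json.loads(crictl_pods_json).get("items", []):
        meta = sandbox.get("metadata", {})
        if meta.get("uid") not in known:
            orphans.append(
                {"id": sandbox.get("id"), "name": meta.get("name"), "uid": meta.get("uid")}
            )
    return orphans
```

Keeping detection separate from reaping leaves room for a human to review the list first, which seems prudent given that crictl rmp -f failed during this incident.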
Lessons
- Keel exclusion ≠ chart pin. The namespace was already excluded from Keel injection, but the helm_release was unpinned, so a terraform apply alone could re-trigger the same break. Both layers needed locking down.
- crictl rmp -f is not always sufficient. When containerd refuses to signal init, kubelet restart is the next escalation step before SSH/reboot.
- The Keel rollout phase 2-6 design ASSUMED stateful operators were excluded. CSI was correctly excluded, but the chart version itself was still a moving target via plain terraform apply. The exclude-list catches Keel; the version pin catches everything else.
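The version-pin lesson lends itself to a simple repo-wide check. The following is a minimal sketch, not an existing tool: it scans Terraform source for resource "helm_release" blocks and reports those with no top-level version argument. The function name unpinned_helm_releases is hypothetical, and the brace counting is naive (braces inside strings or comments would confuse it), so a proper HCL parser would be the safer choice for anything beyond a quick audit. A version set from a variable counts as pinned here.

```python
import pathlib
import re
import sys
from typing import List, Optional

BLOCK_START = re.compile(r'resource\s+"helm_release"\s+"([^"]+)"\s*\{')
VERSION_ATTR = re.compile(r"^\s*version\s*=")


def unpinned_helm_releases(tf_source: str) -> List[str]:
    """Return names of helm_release blocks that lack a top-level `version`."""
    unpinned: List[str] = []
    name: Optional[str] = None  # block currently being scanned, if any
    depth = 0
    pinned = False
    for line in tf_source.splitlines():
        if name is None:
            match = BLOCK_START.search(line)
            if not match:
                continue
            name, depth, pinned = match.group(1), 0, False
        elif depth == 1 and VERSION_ATTR.match(line):
            pinned = True  # nested `version` keys (depth > 1) are ignored
        depth += line.count("{") - line.count("}")
        if depth <= 0:  # closing brace of the resource block
            if not pinned:
                unpinned.append(name)
            name = None
    return unpinned


if __name__ == "__main__":
    root = pathlib.Path(sys.argv[1] if len(sys.argv) > 1 else ".")
    for path in sorted(root.rglob("*.tf")):
        for release in unpinned_helm_releases(path.read_text()):
            print(f"{path}: helm_release.{release} has no version pin")
```

Run against the stacks directory, a script like this would give a first list for the audit follow-up; each hit still needs a human to decide which chart version to pin.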