Viktor Barzin fd0f4a0365 fix: restore tree dropped by 6d224861; land stem95su gdrive-sync (10m) [ci skip]

6d224861 came from a --no-checkout worktree whose empty index made the
commit drop every file except two. This restores 05b50d2b's full tree and
correctly adds stacks/stem95su/gdrive-sync.tf + the service-catalog stem95su
entry. Forward-only (parent=6d224861, no force-push); [ci skip] since the
live infra was never applied from the broken commit.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

2026-06-09 08:45:33 +00:00

6.9 KiB

Raw Blame History

Post-Mortem: nfs-csi Keel-Triggered Upgrade Broke Master Node CSI

Date: 2026-05-17 Author: Viktor Barzin / Claude (incident response) Severity: SEV-3 (1 of 5 CSI node DaemonSet pods stuck CrashLoopBackOff; controller pair flapping) Duration: ~2 hours from first detection to all-green

Summary

The Keel auto-update operator polled the csi-driver-nfs Helm chart and rolled v4.13.1 → v4.13.2. The new chart's controller Deployment scheduled both replicas onto k8s-master (no built-in control-plane exclusion). Both replicas used hostNetwork: true and tried to bind the same host ports (19809 for node-driver-registrar, 29653 for liveness-probe), so one controller pod CrashLoopBackOff'd with bind: address already in use. The upgrade also left behind multiple orphan controller pods in containerd that kubelet could no longer reconcile — they held the host ports even after the helm rollback removed them from K8s state.

The csi-nfs-node DaemonSet pod on master then could not start either: its own node-driver-registrar and liveness-probe containers tried to bind the same host ports and lost to the zombies.

Impact

1× csi-nfs-node pod on k8s-master stuck CrashLoopBackOff (16+ restarts)
CSI plugin unregistered on master → no NFS volumes could be mounted on master-hosted pods (calico-typha cert mount failed, etcd backup CronJob failed)
Controller flap (2 replicas fighting) → intermittent csi-resizer/csi-snapshotter failure for the whole cluster
Cascade: kured-sentinel, node-local-dns, prometheus-node-exporter, csi-node-driver (Calico) all bounced on master while kubelet thrashed

No data loss; no production-facing outages observed (CSI mounts on the four worker nodes kept working).

Timeline (Europe/Sofia, UTC+3)

~07:46 — Keel polls forgejo + DockerHub manifests, sees a new digest under the csi-driver-nfs 4.13.x channel, triggers Helm upgrade
07:46:16 — helm upgrade csi-driver-nfs runs; new controller Deployment scheduled (no affinity block → both replicas land on k8s-master)
~07:50 — Controller replicas fight for ports 19809, 29653; one stays in CrashLoopBackOff
~08:00 — User notices "CSI issue ... due to the upgrade"; investigation begins
08:15 — helm rollback csi-driver-nfs to revision 8 (v4.13.1) — controllers on master deleted via K8s, but containerd retains them as live sandboxes
08:30 — Live podAntiAffinity + nodeAffinity: control-plane DoesNotExist added to the controller Deployment via patch (controllers now correctly schedule on node1+node3)
08:40 — csi-nfs-node master pod still CrashLoopBackOff; ports 19809/29653 held by orphan PIDs (livenessprobe PID 1816, csi-node-driver PID 1944, plus 5× csi-provisioner from zombie controller pods)
09:00 — Privileged pkill via hostPID: true pod failed (permission denied from runc — containerd refused to signal init in the zombie containers)
09:03 — nsenter -t 1 -m -p -u systemctl restart kubelet on master cleared the orphan containers via cgroup GC; ports freed
09:04 — csi-nfs-node master pod reaches 3/3 Ready; cluster green
09:09 — Terraform apply: pin helm_release.version = "4.13.1", add controller.affinity to values

Root Causes

csi-driver-nfs Helm chart in TF was unpinned. The helm_release had no version = ... field, so it floated to whatever the chart repo advertised. Keel polled this and rolled forward.
Chart v4.13.2 dropped the implicit control-plane exclusion that v4.13.1 shipped with. Without it, the K8s scheduler chose master for both controller replicas.
Two controller replicas + hostNetwork = port conflict on the same node. The chart did not add podAntiAffinity between the replicas. Live state has it now; TF now does too.
Helm rollback does not always clean containerd sandboxes. When the prior revision's pods are abandoned mid-flight (image-pull-pending, etc.), containerd can keep multiple sandbox instances for the same pod-UID. Kubelet GC is the only thing that reliably reaps these — restarting it forces a reconciliation pass that drops orphans.

What We Fixed

stacks/nfs-csi/modules/nfs-csi/main.tf (this commit):
- version = "4.13.1" pin on the helm_release (defense in depth — namespace is already excluded from Kyverno-Keel injection, but the chart could still drift on a terraform apply without a pin)
- controller.affinity block with podAntiAffinity (different hosts for replicas) and nodeAffinity (exclude node-role.kubernetes.io/control-plane)
- Inline comments explaining both decisions
Kyverno keel-annotations: nfs-csi was already in the namespace exclude list (decision from authentik incident 2026-05-17). Verified still there in stacks/kyverno/modules/kyverno/keel-annotations.tf:91.

Recovery Procedure (next time)

If csi-nfs-node on a node CrashLoopBackOff with bind: address already in use:

Find which host ports are bound — lsof -i :19809, lsof -i :29653 (from a privileged hostPID pod on the affected node).
Try crictl rmp -f <pod-id> on zombie pods (those K8s no longer tracks). Will fail with unable to signal init: permission denied if the containers are sufficiently stuck.
Restart kubelet on the affected node via nsenter -t 1 -m -p -u systemctl restart kubelet (privileged hostPID pod). Kubelet's GC reconciles containerd state and reaps the orphans.
Force-delete the DaemonSet pod to clear the back-off (kubectl delete pod -n nfs-csi csi-nfs-node-XXXX --force --grace-period=0). DaemonSet recreates it; with the ports free, containers start cleanly.

Action Items

Pin csi-driver-nfs chart version in TF
Add controller.affinity to TF (podAntiAffinity + control-plane exclude)
Document recovery procedure (this post-mortem)
Audit other unpinned helm_release blocks — every chart used in Kyverno-excluded namespaces should still be pinned to prevent terraform apply drift. (Filed as follow-up — not blocking.)
Consider adding a kured or daily script that detects orphan containerd sandboxes whose pod-UID is unknown to the apiserver and reaps them automatically. (Filed as follow-up — not blocking.)

Lessons

Keel exclusion ≠ chart pin. The namespace was already excluded from Keel injection, but the helm_release was unpinned — so a terraform apply alone could re-trigger the same break. Both layers needed locking down.
crictl rmp -f is not always sufficient. When containerd refuses to signal init, kubelet restart is the next escalation step before SSH/reboot.
The Keel rollout phase 2-6 design ASSUMED stateful operators were excluded. CSI was correctly excluded — but the chart version itself was still a moving target via plain terraform apply. The exclude-list catches Keel; the version pin catches everything else.

6.9 KiB Raw Blame History Unescape Escape