nfs-csi: pin chart v4.13.1 + controller affinity (post-mortem)
Keel rolled csi-driver-nfs 4.13.1→4.13.2 today. The 4.13.2 chart dropped control-plane exclusion from the controller Deployment, so both replicas landed on k8s-master, fought for hostNetwork ports 19809/29653, and one went CrashLoopBackOff. Helm rollback left orphan containerd sandboxes holding the ports — only a kubelet restart on master cleared them. - Pin helm_release.version = "4.13.1" so terraform apply can't drift to the broken chart (defense in depth; nfs-csi namespace is already in the Kyverno-Keel exclude list) - Add controller.affinity: podAntiAffinity between replicas + nodeAffinity excluding node-role.kubernetes.io/control-plane - docs/post-mortems/2026-05-17-nfs-csi-keel-upgrade-master-port-conflict.md captures the root cause + recovery procedure (kubelet restart via nsenter is the escalation path when crictl rmp -f fails) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
This commit is contained in:
parent
98b7ef40fd
commit
a3bebd5c0d
2 changed files with 177 additions and 0 deletions
|
|
@ -0,0 +1,133 @@
|
||||||
|
# Post-Mortem: nfs-csi Keel-Triggered Upgrade Broke Master Node CSI
|
||||||
|
|
||||||
|
**Date:** 2026-05-17
|
||||||
|
**Author:** Viktor Barzin / Claude (incident response)
|
||||||
|
**Severity:** SEV-3 (1 of 5 CSI node DaemonSet pods stuck CrashLoopBackOff; controller pair flapping)
|
||||||
|
**Duration:** ~2 hours from first detection to all-green
|
||||||
|
|
||||||
|
## Summary
|
||||||
|
|
||||||
|
The Keel auto-update operator polled the `csi-driver-nfs` Helm chart and rolled
|
||||||
|
`v4.13.1 → v4.13.2`. The new chart's controller Deployment scheduled both
|
||||||
|
replicas onto `k8s-master` (no built-in control-plane exclusion). Both replicas
|
||||||
|
used `hostNetwork: true` and tried to bind the same host ports
|
||||||
|
(`19809` for `node-driver-registrar`, `29653` for `liveness-probe`), so one
|
||||||
|
controller pod CrashLoopBackOff'd with `bind: address already in use`. The
|
||||||
|
upgrade also left behind multiple orphan controller pods in containerd that
|
||||||
|
kubelet could no longer reconcile — they held the host ports even after the
|
||||||
|
helm rollback removed them from K8s state.
|
||||||
|
|
||||||
|
The `csi-nfs-node` DaemonSet pod on master then could not start either: its
|
||||||
|
own `node-driver-registrar` and `liveness-probe` containers tried to bind
|
||||||
|
the same host ports and lost to the zombies.
|
||||||
|
|
||||||
|
## Impact
|
||||||
|
|
||||||
|
- 1× `csi-nfs-node` pod on `k8s-master` stuck CrashLoopBackOff (16+ restarts)
|
||||||
|
- CSI plugin unregistered on master → no NFS volumes could be mounted on
|
||||||
|
master-hosted pods (calico-typha cert mount failed, etcd backup CronJob
|
||||||
|
failed)
|
||||||
|
- Controller flap (2 replicas fighting) → intermittent
|
||||||
|
`csi-resizer`/`csi-snapshotter` failure for the whole cluster
|
||||||
|
- Cascade: kured-sentinel, node-local-dns, prometheus-node-exporter,
|
||||||
|
csi-node-driver (Calico) all bounced on master while kubelet thrashed
|
||||||
|
|
||||||
|
No data loss; no production-facing outages observed (CSI mounts on the four
|
||||||
|
worker nodes kept working).
|
||||||
|
|
||||||
|
## Timeline (Europe/Sofia, UTC+3)
|
||||||
|
|
||||||
|
- ~07:46 — Keel polls forgejo + DockerHub manifests, sees a new digest under
|
||||||
|
the `csi-driver-nfs` `4.13.x` channel, triggers Helm upgrade
|
||||||
|
- 07:46:16 — `helm upgrade csi-driver-nfs` runs; new controller Deployment
|
||||||
|
scheduled (no `affinity` block → both replicas land on `k8s-master`)
|
||||||
|
- ~07:50 — Controller replicas fight for ports `19809`, `29653`; one stays in
|
||||||
|
CrashLoopBackOff
|
||||||
|
- ~08:00 — User notices "CSI issue ... due to the upgrade"; investigation
|
||||||
|
begins
|
||||||
|
- 08:15 — `helm rollback csi-driver-nfs` to revision 8 (v4.13.1) — controllers
|
||||||
|
on master deleted via K8s, but containerd retains them as live sandboxes
|
||||||
|
- 08:30 — Live `podAntiAffinity` + `nodeAffinity: control-plane DoesNotExist`
|
||||||
|
added to the controller Deployment via patch (controllers now correctly
|
||||||
|
schedule on node1+node3)
|
||||||
|
- 08:40 — `csi-nfs-node` master pod still CrashLoopBackOff; ports 19809/29653
|
||||||
|
held by orphan PIDs (livenessprobe PID 1816, csi-node-driver PID 1944,
|
||||||
|
plus 5× csi-provisioner from zombie controller pods)
|
||||||
|
- 09:00 — Privileged pkill via `hostPID: true` pod failed
|
||||||
|
(`permission denied` from runc — containerd refused to signal init in the
|
||||||
|
zombie containers)
|
||||||
|
- 09:03 — `nsenter -t 1 -m -p -u systemctl restart kubelet` on master cleared
|
||||||
|
the orphan containers via cgroup GC; ports freed
|
||||||
|
- 09:04 — `csi-nfs-node` master pod reaches 3/3 Ready; cluster green
|
||||||
|
- 09:09 — Terraform `apply`: pin `helm_release.version = "4.13.1"`, add
|
||||||
|
`controller.affinity` to values
|
||||||
|
|
||||||
|
## Root Causes
|
||||||
|
|
||||||
|
1. **`csi-driver-nfs` Helm chart in TF was unpinned.** The `helm_release` had
|
||||||
|
no `version = ...` field, so it floated to whatever the chart repo
|
||||||
|
advertised. Keel polled this and rolled forward.
|
||||||
|
2. **Chart v4.13.2 dropped the implicit control-plane exclusion** that v4.13.1
|
||||||
|
shipped with. Without it, the K8s scheduler chose master for both
|
||||||
|
controller replicas.
|
||||||
|
3. **Two controller replicas + hostNetwork = port conflict on the same node.**
|
||||||
|
The chart did not add `podAntiAffinity` between the replicas. Live state
|
||||||
|
has it now; TF now does too.
|
||||||
|
4. **Helm rollback does not always clean containerd sandboxes.** When the
|
||||||
|
prior revision's pods are abandoned mid-flight (image-pull-pending, etc.),
|
||||||
|
containerd can keep multiple sandbox instances for the same pod-UID.
|
||||||
|
Kubelet GC is the only thing that reliably reaps these — restarting it
|
||||||
|
forces a reconciliation pass that drops orphans.
|
||||||
|
|
||||||
|
## What We Fixed
|
||||||
|
|
||||||
|
- **`stacks/nfs-csi/modules/nfs-csi/main.tf`** (this commit):
|
||||||
|
- `version = "4.13.1"` pin on the `helm_release` (defense in depth — namespace
|
||||||
|
is already excluded from Kyverno-Keel injection, but the chart could still
|
||||||
|
drift on a `terraform apply` without a pin)
|
||||||
|
- `controller.affinity` block with `podAntiAffinity` (different hosts for
|
||||||
|
replicas) and `nodeAffinity` (exclude `node-role.kubernetes.io/control-plane`)
|
||||||
|
- Inline comments explaining both decisions
|
||||||
|
- **Kyverno keel-annotations**: `nfs-csi` was already in the namespace exclude
|
||||||
|
list (decision from authentik incident 2026-05-17). Verified still there
|
||||||
|
in `stacks/kyverno/modules/kyverno/keel-annotations.tf:91`.
|
||||||
|
|
||||||
|
## Recovery Procedure (next time)
|
||||||
|
|
||||||
|
If `csi-nfs-node` on a node CrashLoopBackOff with `bind: address already in use`:
|
||||||
|
|
||||||
|
1. **Find which host ports are bound** — `lsof -i :19809`, `lsof -i :29653`
|
||||||
|
(from a privileged hostPID pod on the affected node).
|
||||||
|
2. **Try `crictl rmp -f <pod-id>`** on zombie pods (those K8s no longer
|
||||||
|
tracks). Will fail with `unable to signal init: permission denied` if
|
||||||
|
the containers are sufficiently stuck.
|
||||||
|
3. **Restart kubelet on the affected node** via `nsenter -t 1 -m -p -u
|
||||||
|
systemctl restart kubelet` (privileged hostPID pod). Kubelet's GC
|
||||||
|
reconciles containerd state and reaps the orphans.
|
||||||
|
4. **Force-delete the DaemonSet pod** to clear the back-off
|
||||||
|
(`kubectl delete pod -n nfs-csi csi-nfs-node-XXXX --force --grace-period=0`).
|
||||||
|
DaemonSet recreates it; with the ports free, containers start cleanly.
|
||||||
|
|
||||||
|
## Action Items
|
||||||
|
|
||||||
|
- [x] Pin `csi-driver-nfs` chart version in TF
|
||||||
|
- [x] Add `controller.affinity` to TF (podAntiAffinity + control-plane exclude)
|
||||||
|
- [x] Document recovery procedure (this post-mortem)
|
||||||
|
- [ ] Audit other unpinned `helm_release` blocks — every chart used in
|
||||||
|
Kyverno-excluded namespaces should still be pinned to prevent
|
||||||
|
`terraform apply` drift. (Filed as follow-up — not blocking.)
|
||||||
|
- [ ] Consider adding a `kured` or daily script that detects orphan
|
||||||
|
containerd sandboxes whose pod-UID is unknown to the apiserver and
|
||||||
|
reaps them automatically. (Filed as follow-up — not blocking.)
|
||||||
|
|
||||||
|
## Lessons
|
||||||
|
|
||||||
|
- **Keel exclusion ≠ chart pin.** The namespace was already excluded from
|
||||||
|
Keel injection, but the helm_release was unpinned — so a `terraform apply`
|
||||||
|
alone could re-trigger the same break. Both layers needed locking down.
|
||||||
|
- **`crictl rmp -f` is not always sufficient.** When containerd refuses to
|
||||||
|
signal init, kubelet restart is the next escalation step before SSH/reboot.
|
||||||
|
- **The Keel rollout phase 2-6 design ASSUMED stateful operators were
|
||||||
|
excluded.** CSI was correctly excluded — but the chart version itself was
|
||||||
|
still a moving target via plain `terraform apply`. The exclude-list catches
|
||||||
|
Keel; the version pin catches everything else.
|
||||||
|
|
@ -23,10 +23,54 @@ resource "helm_release" "nfs_csi_driver" {
|
||||||
|
|
||||||
repository = "https://raw.githubusercontent.com/kubernetes-csi/csi-driver-nfs/master/charts"
|
repository = "https://raw.githubusercontent.com/kubernetes-csi/csi-driver-nfs/master/charts"
|
||||||
chart = "csi-driver-nfs"
|
chart = "csi-driver-nfs"
|
||||||
|
# Pinned 2026-05-17. Keel polled and rolled csi-driver-nfs 4.13.1 → 4.13.2,
|
||||||
|
# which broke the cluster:
|
||||||
|
# * Controller pods ended up on k8s-master because the new chart removed
|
||||||
|
# control-plane exclusion from the default node selector.
|
||||||
|
# * Two controller replicas on the same node fought over hostNetwork ports
|
||||||
|
# 19809 (node-driver-registrar) and 29653 (liveness-probe). One replica
|
||||||
|
# CrashLoopBackOff'd with `bind: address already in use`.
|
||||||
|
# * Rolling back live (helm rollback) left zombie containerd containers
|
||||||
|
# holding the ports — only a kubelet restart cleared them.
|
||||||
|
# nfs-csi namespace is in the Kyverno keel exclude list (keel-annotations.tf)
|
||||||
|
# so Keel will not touch it again. This version pin is the second line of
|
||||||
|
# defense against accidental floating-version drift on `terraform apply`.
|
||||||
|
version = "4.13.1"
|
||||||
|
|
||||||
values = [yamlencode({
|
values = [yamlencode({
|
||||||
controller = {
|
controller = {
|
||||||
replicas = 2
|
replicas = 2
|
||||||
|
# Required to coexist with the v4.13.1 chart on a 1-master + 4-worker
|
||||||
|
# cluster:
|
||||||
|
# * podAntiAffinity forces the 2 controller replicas onto DIFFERENT
|
||||||
|
# hosts (host network ports 19809/29653 are per-host).
|
||||||
|
# * nodeAffinity excludes the control-plane node entirely so the
|
||||||
|
# scheduler can't pick master when a worker is briefly NotReady.
|
||||||
|
# Without these, Kubernetes can schedule both replicas on the same node
|
||||||
|
# (port conflict) or on master itself (which already runs the DaemonSet
|
||||||
|
# pod and would conflict with it).
|
||||||
|
affinity = {
|
||||||
|
nodeAffinity = {
|
||||||
|
requiredDuringSchedulingIgnoredDuringExecution = {
|
||||||
|
nodeSelectorTerms = [{
|
||||||
|
matchExpressions = [{
|
||||||
|
key = "node-role.kubernetes.io/control-plane"
|
||||||
|
operator = "DoesNotExist"
|
||||||
|
}]
|
||||||
|
}]
|
||||||
|
}
|
||||||
|
}
|
||||||
|
podAntiAffinity = {
|
||||||
|
requiredDuringSchedulingIgnoredDuringExecution = [{
|
||||||
|
labelSelector = {
|
||||||
|
matchLabels = {
|
||||||
|
app = "csi-nfs-controller"
|
||||||
|
}
|
||||||
|
}
|
||||||
|
topologyKey = "kubernetes.io/hostname"
|
||||||
|
}]
|
||||||
|
}
|
||||||
|
}
|
||||||
livenessProbe = {
|
livenessProbe = {
|
||||||
httpPort = 29653
|
httpPort = 29653
|
||||||
}
|
}
|
||||||
|
|
|
||||||
Loading…
Add table
Add a link
Reference in a new issue