From a1b4a2106d88b50d77c29590b5bf173bce3ed177 Mon Sep 17 00:00:00 2001 From: Viktor Barzin Date: Mon, 23 Mar 2026 02:25:11 +0200 Subject: [PATCH] add NFS/CSI cascade failure post-mortem [ci skip] --- .../2026-03-16-nfs-csi-cascade-failure.md | 130 ++++++++++++++++++ 1 file changed, 130 insertions(+) create mode 100644 .claude/post-mortems/2026-03-16-nfs-csi-cascade-failure.md diff --git a/.claude/post-mortems/2026-03-16-nfs-csi-cascade-failure.md b/.claude/post-mortems/2026-03-16-nfs-csi-cascade-failure.md new file mode 100644 index 00000000..4706abd6 --- /dev/null +++ b/.claude/post-mortems/2026-03-16-nfs-csi-cascade-failure.md @@ -0,0 +1,130 @@ +# Post-Mortem: NFS CSI Cascade Failure + +| Field | Value | +|-------|-------| +| **Date** | 2026-03-16 | +| **Duration** | ~47h (ongoing) | +| **Severity** | SEV1 | +| **Affected Services** | 40+ pods across 20+ namespaces | +| **Status** | Draft | + +## Summary + +The NFS CSI driver entered a crash-loop due to a liveness-probe port conflict (~47h ago), causing all NFS-backed PV mounts to fail. This cascaded into 15+ pods stuck in ContainerCreating, MySQL InnoDB data file lock failures, Vault Raft storage timeouts, and MetalLB speaker port conflicts. Critical path services (Traefik, PostgreSQL) remained partially operational. + +## Impact + +- **User-facing**: Services dependent on NFS storage (calibre, forgejo, pgadmin, uptime-kuma, etc.) completely unavailable. Grafana down (MySQL dependency). Vault unavailable. +- **Services affected**: 40+ pods across 20+ namespaces — storage layer, databases, monitoring, CI/CD +- **Duration**: ~47h and ongoing at time of investigation +- **Data loss**: None confirmed; MySQL ibdata1 lock issue may indicate risk if not resolved cleanly + +## Timeline (UTC) + +| Time | Event | Source | +|------|-------|--------| +| ~47h ago | NFS CSI driver starts crash-looping — liveness-probe port 29653 conflict | cluster-health-checker (pod age + restart count) | +| ~30h ago | iSCSI CSI controller deployed (stable) | pod age | +| ~27h ago | Vault, MySQL, Headscale, Woodpecker, CrowdSec agents start failing | pod ages, events | +| ~26h ago | mysql-cluster-0 enters ContainerStatusUnknown | sre investigation | +| ~20-22h ago | Cascade of service restarts across cluster | pod restart timestamps | +| ~1h ago | Latest wave of pod rescheduling/restarts | events | + +## Root Cause + +**NFS CSI driver liveness-probe port conflict**: The liveness-probe containers on all worker nodes fail with `listen tcp 127.0.0.1:29653: bind: address already in use`. The port conflict suggests a previous liveness-probe process did not cleanly terminate, or pods were restarted while old processes lingered in the network namespace. + +**Impact chain**: NFS CSI not registered on nodes → all NFS PV mounts fail → 15+ pods stuck in ContainerCreating with "driver name nfs.csi.k8s.io not found in the list of registered CSI drivers" + +## Contributing Factors + +- **MySQL InnoDB data corruption**: Cannot open `ibdata1` (OS error 11 — EAGAIN). Likely caused by NFS storage instability or stale lock from mysql-cluster-0 in ContainerStatusUnknown +- **Vault Raft lock timeout**: vault-0 fails with "failed to open bolt file: timeout" — BoltDB locked by previous instance. vault-2 cannot mount NFS volume at all +- **MetalLB speaker port conflicts**: 3/4 speakers fail with memberlist port 7946 already in use — same pattern as NFS CSI, suggesting containerd instability ~47h ago +- **Node3 memory pressure**: 80% utilization, hosting both mysql-cluster-1 and mysql-cluster-2 + +## Detection + +- **How detected**: Manual investigation (this post-mortem) +- **Time to detect**: ~47h from start of NFS CSI crash-loop +- **Gap analysis**: No alerting on CSI driver health. Existing pod alerts likely firing but root cause (CSI driver) not surfaced. Need CSI-specific health alerts. + +## Resolution + +Not yet resolved at time of investigation. Recommended steps: + +1. Delete NFS CSI node pods one at a time (DaemonSet will recreate with clean port allocation) +2. Delete MetalLB crash-looping speaker pods (same approach) +3. Force-delete mysql-cluster-0 (ContainerStatusUnknown) to release ibdata1 lock +4. Once NFS healthy, vault-2 should start; vault-0 may need BoltDB lock file cleared + +## Action Items + +### Preventive (stop recurrence) + +| Priority | Action | Type | Details | +|----------|--------|------|---------| +| P1 | Investigate containerd health on worker nodes | Investigation | Port conflicts across NFS CSI + MetalLB suggest containerd restart/instability ~47h ago | +| P1 | Fix NFS CSI liveness probe port allocation | Config | Use ephemeral ports or add port uniqueness checks to avoid stale port conflicts | +| P2 | Add node anti-affinity for MySQL replicas | Terraform | mysql-cluster-1 and -2 both on node3 (80% memory) — spread across nodes | + +### Detective (catch faster) + +| Priority | Action | Type | Details | +|----------|--------|------|---------| +| P1 | Add CSI driver health alerting | Alert | PrometheusRule for CSI driver pod crash-loops — `kube_pod_container_status_waiting_reason{reason="CrashLoopBackOff", namespace="democratic-csi"}` | +| P1 | Add NFS mount failure alerting | Alert | Alert on pods stuck in ContainerCreating > 10min with volume mount errors | +| P2 | Add Uptime Kuma monitor for Vault | Monitor | Vault health endpoint check | + +### Mitigative (reduce blast radius) + +| Priority | Action | Type | Details | +|----------|--------|------|---------| +| P2 | Add PDB for NFS CSI node DaemonSet | Config | Ensure at least N-1 nodes always have healthy CSI driver | +| P2 | Document NFS CSI recovery runbook | Runbook | Steps to recover from port conflict scenario | +| P3 | Evaluate moving MySQL off NFS to iSCSI | Architecture | iSCSI remained stable throughout; MySQL on NFS is fragile | + +## Lessons Learned + +- **Went well**: Critical path services (Traefik 3/3, Authentik 2/3, PostgreSQL) survived due to not depending on NFS CSI +- **Went poorly**: 47h detection gap — no alerting on CSI driver health despite it being a single point of failure for all NFS workloads +- **Got lucky**: iSCSI CSI remained stable, keeping PostgreSQL (CNPG) operational. If both CSI drivers had failed, the entire cluster would have been fully down + +## Raw Investigation Data + +
+Cluster State Summary + +- **Nodes**: All 5 Ready. k8s-node2 metrics ``. k8s-node3 at 80% memory. +- **Tier 1 (Critical)**: Traefik OK (3/3), Authentik degraded (2/3), PostgreSQL degraded (1/2), Vault DOWN, Redis starting, MetalLB degraded (1/4) +- **Tier 2 (Storage)**: NFS CSI FAILING (port conflict on all workers), iSCSI CSI OK +- **Tier 3 (Apps)**: 15+ ContainerCreating (NFS), 5+ Pending (GPU/memory), 10+ CrashLoopBackOff (DB deps) +- **Tier 4 (Databases)**: PostgreSQL recovering, MySQL DOWN (ibdata1 lock) + +
+ +
+NFS CSI Error Details + +Controller pods (2): CrashLoopBackOff — `listen tcp 127.0.0.1:29653: bind: address already in use` +Node DaemonSet: 4/5 crash-looping (same error). Only k8s-master healthy. +Age: ~47h with 96+ restarts. + +
+ +
+MySQL Error Details + +mysql-cluster-0: ContainerStatusUnknown +mysql-cluster-1, mysql-cluster-2: CrashLoopBackOff — `Cannot open datafile './ibdata1'` (OS error 11: Resource temporarily unavailable) +mysql-cluster-router: crash-looping (no healthy backend) + +
+ +
+Vault Error Details + +vault-0: CrashLoopBackOff — "failed to open bolt file: timeout" (Raft BoltDB lock) +vault-2: ContainerCreating — NFS mount failure + +