Viktor Barzin bdba15a387 docs: move post-mortems to docs/post-mortems/

Consolidate all outage reports under docs/ for better discoverability.
Moved from .claude/post-mortems/ (agent-internal) to docs/post-mortems/
(repo documentation).

[ci skip]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

2026-04-14 08:20:09 +00:00

6.7 KiB

Raw Blame History

Post-Mortem: NFS CSI Cascade Failure

Field	Value
Date	2026-03-16
Duration	~47h (ongoing)
Severity	SEV1
Affected Services	40+ pods across 20+ namespaces
Status	Draft

Summary

The NFS CSI driver entered a crash-loop due to a liveness-probe port conflict (~47h ago), causing all NFS-backed PV mounts to fail. This cascaded into 15+ pods stuck in ContainerCreating, MySQL InnoDB data file lock failures, Vault Raft storage timeouts, and MetalLB speaker port conflicts. Critical path services (Traefik, PostgreSQL) remained partially operational.

Impact

User-facing: Services dependent on NFS storage (calibre, forgejo, pgadmin, uptime-kuma, etc.) completely unavailable. Grafana down (MySQL dependency). Vault unavailable.
Services affected: 40+ pods across 20+ namespaces — storage layer, databases, monitoring, CI/CD
Duration: ~47h and ongoing at time of investigation
Data loss: None confirmed; MySQL ibdata1 lock issue may indicate risk if not resolved cleanly

Timeline (UTC)

Time	Event	Source
~47h ago	NFS CSI driver starts crash-looping — liveness-probe port 29653 conflict	cluster-health-checker (pod age + restart count)
~30h ago	iSCSI CSI controller deployed (stable)	pod age
~27h ago	Vault, MySQL, Headscale, Woodpecker, CrowdSec agents start failing	pod ages, events
~26h ago	mysql-cluster-0 enters ContainerStatusUnknown	sre investigation
~20-22h ago	Cascade of service restarts across cluster	pod restart timestamps
~1h ago	Latest wave of pod rescheduling/restarts	events

Root Cause

NFS CSI driver liveness-probe port conflict: The liveness-probe containers on all worker nodes fail with listen tcp 127.0.0.1:29653: bind: address already in use. The port conflict suggests a previous liveness-probe process did not cleanly terminate, or pods were restarted while old processes lingered in the network namespace.

Impact chain: NFS CSI not registered on nodes → all NFS PV mounts fail → 15+ pods stuck in ContainerCreating with "driver name nfs.csi.k8s.io not found in the list of registered CSI drivers"

Contributing Factors

MySQL InnoDB data corruption: Cannot open ibdata1 (OS error 11 — EAGAIN). Likely caused by NFS storage instability or stale lock from mysql-cluster-0 in ContainerStatusUnknown
Vault Raft lock timeout: vault-0 fails with "failed to open bolt file: timeout" — BoltDB locked by previous instance. vault-2 cannot mount NFS volume at all
MetalLB speaker port conflicts: 3/4 speakers fail with memberlist port 7946 already in use — same pattern as NFS CSI, suggesting containerd instability ~47h ago
Node3 memory pressure: 80% utilization, hosting both mysql-cluster-1 and mysql-cluster-2

Detection

How detected: Manual investigation (this post-mortem)
Time to detect: ~47h from start of NFS CSI crash-loop
Gap analysis: No alerting on CSI driver health. Existing pod alerts likely firing but root cause (CSI driver) not surfaced. Need CSI-specific health alerts.

Resolution

Not yet resolved at time of investigation. Recommended steps:

Delete NFS CSI node pods one at a time (DaemonSet will recreate with clean port allocation)
Delete MetalLB crash-looping speaker pods (same approach)
Force-delete mysql-cluster-0 (ContainerStatusUnknown) to release ibdata1 lock
Once NFS healthy, vault-2 should start; vault-0 may need BoltDB lock file cleared

Action Items

Preventive (stop recurrence)

Priority	Action	Type	Details
P1	Investigate containerd health on worker nodes	Investigation	Port conflicts across NFS CSI + MetalLB suggest containerd restart/instability ~47h ago
P1	Fix NFS CSI liveness probe port allocation	Config	Use ephemeral ports or add port uniqueness checks to avoid stale port conflicts
P2	Add node anti-affinity for MySQL replicas	Terraform	mysql-cluster-1 and -2 both on node3 (80% memory) — spread across nodes

Detective (catch faster)

Priority	Action	Type	Details
P1	Add CSI driver health alerting	Alert	PrometheusRule for CSI driver pod crash-loops — `kube_pod_container_status_waiting_reason{reason="CrashLoopBackOff", namespace="democratic-csi"}`
P1	Add NFS mount failure alerting	Alert	Alert on pods stuck in ContainerCreating > 10min with volume mount errors
P2	Add Uptime Kuma monitor for Vault	Monitor	Vault health endpoint check

Mitigative (reduce blast radius)

Priority	Action	Type	Details
P2	Add PDB for NFS CSI node DaemonSet	Config	Ensure at least N-1 nodes always have healthy CSI driver
P2	Document NFS CSI recovery runbook	Runbook	Steps to recover from port conflict scenario
P3	Evaluate moving MySQL off NFS to iSCSI	Architecture	iSCSI remained stable throughout; MySQL on NFS is fragile

Lessons Learned

Went well: Critical path services (Traefik 3/3, Authentik 2/3, PostgreSQL) survived due to not depending on NFS CSI
Went poorly: 47h detection gap — no alerting on CSI driver health despite it being a single point of failure for all NFS workloads
Got lucky: iSCSI CSI remained stable, keeping PostgreSQL (CNPG) operational. If both CSI drivers had failed, the entire cluster would have been fully down

Raw Investigation Data

Cluster State Summary

Nodes: All 5 Ready. k8s-node2 metrics <unknown>. k8s-node3 at 80% memory.
Tier 1 (Critical): Traefik OK (3/3), Authentik degraded (2/3), PostgreSQL degraded (1/2), Vault DOWN, Redis starting, MetalLB degraded (1/4)
Tier 2 (Storage): NFS CSI FAILING (port conflict on all workers), iSCSI CSI OK
Tier 3 (Apps): 15+ ContainerCreating (NFS), 5+ Pending (GPU/memory), 10+ CrashLoopBackOff (DB deps)
Tier 4 (Databases): PostgreSQL recovering, MySQL DOWN (ibdata1 lock)

NFS CSI Error Details

Controller pods (2): CrashLoopBackOff — listen tcp 127.0.0.1:29653: bind: address already in use Node DaemonSet: 4/5 crash-looping (same error). Only k8s-master healthy. Age: ~47h with 96+ restarts.

MySQL Error Details

mysql-cluster-0: ContainerStatusUnknown mysql-cluster-1, mysql-cluster-2: CrashLoopBackOff — Cannot open datafile './ibdata1' (OS error 11: Resource temporarily unavailable) mysql-cluster-router: crash-looping (no healthy backend)

Vault Error Details

vault-0: CrashLoopBackOff — "failed to open bolt file: timeout" (Raft BoltDB lock) vault-2: ContainerCreating — NFS mount failure

6.7 KiB Raw Blame History