docs: move post-mortems to docs/post-mortems/

Consolidate all outage reports under docs/ for better discoverability. Moved from .claude/post-mortems/ (agent-internal) to docs/post-mortems/ (repo documentation). [ci skip] Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-14 08:20:09 +00:00 · 2026-04-14 08:20:09 +00:00 · bdba15a387
commit bdba15a387
parent 68c8c5b4a0
2 changed files with 0 additions and 0 deletions
--- a/docs/post-mortems/2026-03-16-nfs-csi-cascade-failure.md
+++ b/docs/post-mortems/2026-03-16-nfs-csi-cascade-failure.md
@ -0,0 +1,130 @@
+# Post-Mortem: NFS CSI Cascade Failure
+
+| Field | Value |
+|-------|-------|
+| **Date** | 2026-03-16 |
+| **Duration** | ~47h (ongoing) |
+| **Severity** | SEV1 |
+| **Affected Services** | 40+ pods across 20+ namespaces |
+| **Status** | Draft |
+
+## Summary
+
+The NFS CSI driver entered a crash-loop due to a liveness-probe port conflict (~47h ago), causing all NFS-backed PV mounts to fail. This cascaded into 15+ pods stuck in ContainerCreating, MySQL InnoDB data file lock failures, Vault Raft storage timeouts, and MetalLB speaker port conflicts. Critical path services (Traefik, PostgreSQL) remained partially operational.
+
+## Impact
+
+- **User-facing**: Services dependent on NFS storage (calibre, forgejo, pgadmin, uptime-kuma, etc.) completely unavailable. Grafana down (MySQL dependency). Vault unavailable.
+- **Services affected**: 40+ pods across 20+ namespaces — storage layer, databases, monitoring, CI/CD
+- **Duration**: ~47h and ongoing at time of investigation
+- **Data loss**: None confirmed; MySQL ibdata1 lock issue may indicate risk if not resolved cleanly
+
+## Timeline (UTC)
+
+| Time | Event | Source |
+|------|-------|--------|
+| ~47h ago | NFS CSI driver starts crash-looping — liveness-probe port 29653 conflict | cluster-health-checker (pod age + restart count) |
+| ~30h ago | iSCSI CSI controller deployed (stable) | pod age |
+| ~27h ago | Vault, MySQL, Headscale, Woodpecker, CrowdSec agents start failing | pod ages, events |
+| ~26h ago | mysql-cluster-0 enters ContainerStatusUnknown | sre investigation |
+| ~20-22h ago | Cascade of service restarts across cluster | pod restart timestamps |
+| ~1h ago | Latest wave of pod rescheduling/restarts | events |
+
+## Root Cause
+
+**NFS CSI driver liveness-probe port conflict**: The liveness-probe containers on all worker nodes fail with `listen tcp 127.0.0.1:29653: bind: address already in use`. The port conflict suggests a previous liveness-probe process did not cleanly terminate, or pods were restarted while old processes lingered in the network namespace.
+
+**Impact chain**: NFS CSI not registered on nodes → all NFS PV mounts fail → 15+ pods stuck in ContainerCreating with "driver name nfs.csi.k8s.io not found in the list of registered CSI drivers"
+
+## Contributing Factors
+
+- **MySQL InnoDB data corruption**: Cannot open `ibdata1` (OS error 11 — EAGAIN). Likely caused by NFS storage instability or stale lock from mysql-cluster-0 in ContainerStatusUnknown
+- **Vault Raft lock timeout**: vault-0 fails with "failed to open bolt file: timeout" — BoltDB locked by previous instance. vault-2 cannot mount NFS volume at all
+- **MetalLB speaker port conflicts**: 3/4 speakers fail with memberlist port 7946 already in use — same pattern as NFS CSI, suggesting containerd instability ~47h ago
+- **Node3 memory pressure**: 80% utilization, hosting both mysql-cluster-1 and mysql-cluster-2
+
+## Detection
+
+- **How detected**: Manual investigation (this post-mortem)
+- **Time to detect**: ~47h from start of NFS CSI crash-loop
+- **Gap analysis**: No alerting on CSI driver health. Existing pod alerts likely firing but root cause (CSI driver) not surfaced. Need CSI-specific health alerts.
+
+## Resolution
+
+Not yet resolved at time of investigation. Recommended steps:
+
+1. Delete NFS CSI node pods one at a time (DaemonSet will recreate with clean port allocation)
+2. Delete MetalLB crash-looping speaker pods (same approach)
+3. Force-delete mysql-cluster-0 (ContainerStatusUnknown) to release ibdata1 lock
+4. Once NFS healthy, vault-2 should start; vault-0 may need BoltDB lock file cleared
+
+## Action Items
+
+### Preventive (stop recurrence)
+
+| Priority | Action | Type | Details |
+|----------|--------|------|---------|
+| P1 | Investigate containerd health on worker nodes | Investigation | Port conflicts across NFS CSI + MetalLB suggest containerd restart/instability ~47h ago |
+| P1 | Fix NFS CSI liveness probe port allocation | Config | Use ephemeral ports or add port uniqueness checks to avoid stale port conflicts |
+| P2 | Add node anti-affinity for MySQL replicas | Terraform | mysql-cluster-1 and -2 both on node3 (80% memory) — spread across nodes |
+
+### Detective (catch faster)
+
+| Priority | Action | Type | Details |
+|----------|--------|------|---------|
+| P1 | Add CSI driver health alerting | Alert | PrometheusRule for CSI driver pod crash-loops — `kube_pod_container_status_waiting_reason{reason="CrashLoopBackOff", namespace="democratic-csi"}` |
+| P1 | Add NFS mount failure alerting | Alert | Alert on pods stuck in ContainerCreating > 10min with volume mount errors |
+| P2 | Add Uptime Kuma monitor for Vault | Monitor | Vault health endpoint check |
+
+### Mitigative (reduce blast radius)
+
+| Priority | Action | Type | Details |
+|----------|--------|------|---------|
+| P2 | Add PDB for NFS CSI node DaemonSet | Config | Ensure at least N-1 nodes always have healthy CSI driver |
+| P2 | Document NFS CSI recovery runbook | Runbook | Steps to recover from port conflict scenario |
+| P3 | Evaluate moving MySQL off NFS to iSCSI | Architecture | iSCSI remained stable throughout; MySQL on NFS is fragile |
+
+## Lessons Learned
+
+- **Went well**: Critical path services (Traefik 3/3, Authentik 2/3, PostgreSQL) survived due to not depending on NFS CSI
+- **Went poorly**: 47h detection gap — no alerting on CSI driver health despite it being a single point of failure for all NFS workloads
+- **Got lucky**: iSCSI CSI remained stable, keeping PostgreSQL (CNPG) operational. If both CSI drivers had failed, the entire cluster would have been fully down
+
+## Raw Investigation Data
+
+<details>
+<summary>Cluster State Summary</summary>
+
+- **Nodes**: All 5 Ready. k8s-node2 metrics `<unknown>`. k8s-node3 at 80% memory.
+- **Tier 1 (Critical)**: Traefik OK (3/3), Authentik degraded (2/3), PostgreSQL degraded (1/2), Vault DOWN, Redis starting, MetalLB degraded (1/4)
+- **Tier 2 (Storage)**: NFS CSI FAILING (port conflict on all workers), iSCSI CSI OK
+- **Tier 3 (Apps)**: 15+ ContainerCreating (NFS), 5+ Pending (GPU/memory), 10+ CrashLoopBackOff (DB deps)
+- **Tier 4 (Databases)**: PostgreSQL recovering, MySQL DOWN (ibdata1 lock)
+
+</details>
+
+<details>
+<summary>NFS CSI Error Details</summary>
+
+Controller pods (2): CrashLoopBackOff — `listen tcp 127.0.0.1:29653: bind: address already in use`
+Node DaemonSet: 4/5 crash-looping (same error). Only k8s-master healthy.
+Age: ~47h with 96+ restarts.
+
+</details>
+
+<details>
+<summary>MySQL Error Details</summary>
+
+mysql-cluster-0: ContainerStatusUnknown
+mysql-cluster-1, mysql-cluster-2: CrashLoopBackOff — `Cannot open datafile './ibdata1'` (OS error 11: Resource temporarily unavailable)
+mysql-cluster-router: crash-looping (no healthy backend)
+
+</details>
+
+<details>
+<summary>Vault Error Details</summary>
+
+vault-0: CrashLoopBackOff — "failed to open bolt file: timeout" (Raft BoltDB lock)
+vault-2: ContainerCreating — NFS mount failure
+
+</details>
--- a/docs/post-mortems/2026-04-14-nfs-fsid0-dns-vault-outage.md
+++ b/docs/post-mortems/2026-04-14-nfs-fsid0-dns-vault-outage.md
@ -0,0 +1,142 @@
+# Post-Mortem: NFS fsid=0 Cascade — DNS + Vault + Multi-Service Outage
+
+| Field | Value |
+|-------|-------|
+| **Date** | 2026-04-14 |
+| **Duration** | ~5h (NFS broken for k8s: 05:37–10:40 EEST). Stale mounts persisted after fix until pod restarts. |
+| **Severity** | SEV1 |
+| **Affected Services** | 25+ pods across 15+ namespaces. DNS (primary), Vault (2/3 pods), Alertmanager, Grafana, ebooks, phpipam, poison-fountain, email monitoring, status page, backup CronJobs |
+| **Status** | Complete |
+
+## Summary
+
+An `fsid=0` flag in the PVE host's `/etc/exports` caused all NFSv4 subdirectory mount paths from k8s to fail. Combined with a `lockd` configuration failure that broke NFSv3 fallback, this made ALL new NFS mounts impossible and ALL existing NFS mounts stale after the NFS server had been restarted on Apr 11. The failure cascaded into DNS (primary unreachable), Vault (lost Raft quorum), Alertmanager (monitoring blind), and 20+ other services.
+
+## Impact
+
+- **User-facing**: DNS intermittent failures for 192.168.1.x network users (primary down, secondary/tertiary covered ~66% of queries). Vault-dependent services unable to rotate secrets.
+- **Blast radius**: 57,405 NFS error messages across 4 k8s nodes + PVE host. 53 NFS-backed PVs at risk.
+- **Duration**: ~5h of active NFS failure. Some services (ebooks) were in CreateContainerError for 2d22h before detection.
+- **Data loss**: None. NFS data remained intact on disk; only the NFS service was broken.
+- **Monitoring gap**: Alertmanager itself was on NFS storage, so it couldn't alert about the NFS failure.
+
+## Timeline (EEST)
+
+| Time | Event |
+|------|-------|
+| **Apr 11 00:44:52** | NFS server restarts on PVE. Logs `exportfs: can't open /etc/exports for reading` and `nfsdctl: lockd configuration failure`. Server starts with broken export configuration. |
+| **Apr 11 (later)** | `/etc/exports` recreated with `fsid=0` on `/srv/nfs` and `fsid=1` on `/srv/nfs-ssd`. Exports reloaded. Existing k8s NFS mounts continue working (cached file handles). |
+| **Apr 13 23:52** | TrueNAS (10.0.10.15) goes completely unreachable. PVE host's hard NFS mounts to TrueNAS start blocking with D-state kernel process. |
+| **Apr 14 05:04** | `daily-backup.service` starts on PVE. 25/88 LUKS PVC snapshots fail to mount. |
+| **Apr 14 05:37** | **CASCADE BEGINS**: k8s-node1, node3, node4 simultaneously start reporting `nfs: server 192.168.1.127 not responding, timed out`. Existing cached file handles expire; new mount operations hit the fsid=0 + lockd issues. |
+| **Apr 14 ~07:30** | DNS outage reported by users on 192.168.1.x network. Investigation begins. |
+| **Apr 14 10:40** | **FIX**: `fsid=0` removed from `/etc/exports`, NFS server restarted. New mounts work. |
+| **Apr 14 10:41** | Vault pods restarted — all 3 come up 2/2 Running. Raft quorum restored. |
+| **Apr 14 10:42** | Technitium primary pod restarted with fresh NFS mount. DNS fully restored. |
+| **Apr 14 10:44–11:04** | Technitium primary migrated from NFS to `proxmox-lvm-encrypted` PVC. Data restored (104.9M including all 11 zones). Terraform state reconciled. |
+
+## Root Cause Chain
+
+```
+[1] fsid=0 in /etc/exports (introduced Apr 11)
+ └─> NFSv4 treats /srv/nfs as pseudo-root
+     └─> Subdirectory paths (/srv/nfs/technitium) resolve incorrectly → ENOENT
+         ├─> [2] lockd configuration failure (since Apr 11)
+         │    └─> NFSv3 fallback fails (statd not running, no 'nolock' in CSI mount options)
+         │         └─> ALL new NFS mounts fail with "No such file or directory"
+         └─> [3] Existing mounts go stale (cached file handles expire ~Apr 14 05:37)
+              ├─> Technitium primary: I/O errors on /etc/dns → degraded DNS
+              ├─> Vault 0+1: CreateContainerError → lost Raft quorum → Vault down
+              ├─> Alertmanager: I/O errors → monitoring blind spot
+              ├─> Grafana: CrashLoopBackOff (MySQL password rotation failed during Vault outage)
+              ├─> ebooks: CreateContainerError (2d22h undetected)
+              └─> 20+ CronJobs and services: Error/FailedMount
+```
+
+### Why fsid=0 was there
+
+During the TrueNAS → Proxmox NFS migration on Apr 11, the NFS server was restarted. At startup, `/etc/exports` was missing/unreadable. When the exports file was recreated, `fsid=0` was included — likely copied from an NFSv4 example. This flag is only appropriate for dedicated NFSv4-only exports, not for NFS CSI dynamic provisioning which mounts subdirectories.
+
+### Why the failure was delayed 3 days
+
+Existing NFS mounts from before the Apr 11 restart continued working because:
+- NFSv3 mounts cache file handles and don't re-negotiate protocol on every I/O
+- The `soft,timeo=30` mount options return errors after timeout, but cached operations succeed
+- Only when cached handles expired or new mounts were needed did the failure manifest
+
+The trigger was likely the `daily-backup.service` at 05:04 on Apr 14, which accesses NFS exports and may have caused the NFS server to recycle state, invalidating cached handles cluster-wide.
+
+## Contributing Factors
+
+1. **TrueNAS unreachable since Apr 13 23:52**: PVE host has `hard` NFS mounts to TrueNAS that will retry forever. D-state kernel process stuck since Apr 13. This may have contributed to NFS thread contention on the PVE host.
+
+2. **Alertmanager on NFS storage**: The very system meant to alert about storage failures was stored on the failing storage. Circular dependency.
+
+3. **`/etc/exports` not managed by Terraform or git**: Changes to this critical configuration file are untracked, making it impossible to audit when `fsid=0` was introduced.
+
+4. **No NFS-specific health alerts**: While CLAUDE.md mentions "NFS responsiveness" alerts, no alert fired during this 5+ hour outage. The Prometheus rule may not cover mount failures from the k8s node perspective.
+
+5. **CSI mount options lack `nfsvers=3` and `nolock`**: The NFS CSI driver uses default mount options that rely on version auto-negotiation. When NFSv4 fails (fsid=0) and NFSv3 fails (lockd), there's no fallback path.
+
+## Detection Gaps
+
+| Gap | Impact | Fix |
+|-----|--------|-----|
+| No alert on NFS mount failures from k8s nodes | 5h to detection | Add PrometheusRule: `node_nfs_requests_total` error rate |
+| Alertmanager on NFS storage | Alerting blind during NFS outage | Move Alertmanager to `proxmox-lvm-encrypted` |
+| `/etc/exports` not in git/Terraform | Can't audit config changes | Manage via Ansible or TF `remote-exec` |
+| No TrueNAS reachability alert | 11h unnoticed before cascade | Add ping/ICMP monitor in Uptime Kuma |
+| No CSI mount failure alert | Pods stuck for days unnoticed | Alert on `kube_pod_container_status_waiting_reason{reason="ContainerCreating"}` > 10m |
+| ebooks pods broken for 2d22h | Zero notification | Above alert covers this |
+| Grafana down 37h | Dashboard monitoring blind | Uptime Kuma HTTP check already exists; verify it alerts |
+
+## Prevention Plan
+
+### P0 — Prevent this exact failure from recurring
+
+| Action | Owner | Status |
+|--------|-------|--------|
+| Remove `fsid=0` from `/etc/exports` on PVE host | Done | Completed Apr 14 |
+| Fix `lockd configuration failure` on PVE NFS server | TODO | Check `nfs-common` package, `/etc/nfs.conf` lockd settings |
+| Force-unmount hung TrueNAS NFS mounts on PVE (`umount -f -l /mnt/truenas-src /mnt/truenas-ssd`) | TODO | TrueNAS is sunset; remove mounts entirely |
+| Manage `/etc/exports` in git (add to `infra/scripts/` and deploy via PVE provisioning) | TODO | Prevents untracked config drift |
+| Add NFS CSI mount options `nfsvers=3,nolock` to `nfs-proxmox` StorageClass | TODO | Prevents NFSv4 pseudo-root issues + lockd dependency |
+
+### P1 — Eliminate the NFS single point of failure for critical services
+
+| Action | Owner | Status |
+|--------|-------|--------|
+| Migrate Technitium primary to `proxmox-lvm-encrypted` | Done | Completed Apr 14 |
+| Migrate Alertmanager PV from NFS to `proxmox-lvm-encrypted` | TODO | Prevents circular alerting dependency |
+| Migrate Vault PVCs from `nfs-proxmox` to `proxmox-lvm-encrypted` | TODO | Vault is too critical for NFS dependency |
+| Review all 53 NFS PVs — identify which are critical-path and migrate | TODO | Reduce NFS blast radius |
+
+### P2 — Detect NFS failures before users notice
+
+| Action | Owner | Status |
+|--------|-------|--------|
+| Add PrometheusRule: NFS mount errors from node kernel logs | TODO | `node_nfs_rpc_retransmissions_total` rate > threshold |
+| Add PrometheusRule: Pods in ContainerCreating > 10 minutes | TODO | Catches CSI mount failures |
+| Add Uptime Kuma monitor: TrueNAS ping (10.0.10.15) | TODO | Catches TrueNAS outage early |
+| Add Uptime Kuma monitor: PVE NFS port 2049 TCP check | TODO | Catches NFS service failures |
+| Verify Grafana Uptime Kuma alert actually fires | TODO | Was down 37h unnoticed |
+
+### P3 — Improve NFS resilience
+
+| Action | Owner | Status |
+|--------|-------|--------|
+| Remove hung TrueNAS `hard` mounts from PVE fstab (TrueNAS is sunset) | TODO | Eliminates D-state kernel process risk |
+| Add NFS export health check to daily-backup script | TODO | Backup script should verify NFS before starting |
+| Document NFS CSI mount option requirements in CLAUDE.md | TODO | Prevents future misconfigurations |
+
+## Lessons Learned
+
+1. **NFSv4 `fsid=0` is dangerous for CSI subdirectory mounts**: It changes path resolution semantics in non-obvious ways. Never use it on exports that serve dynamic subdirectory mounts.
+
+2. **Critical monitoring infrastructure must not depend on the thing it monitors**: Alertmanager on NFS cannot alert about NFS failures. This is the same anti-pattern as "DNS depends on DNS" or "monitoring depends on the monitored database".
+
+3. **Stale NFS mounts have delayed-action failure modes**: The 3-day gap between the config change (Apr 11) and the outage (Apr 14) made root cause analysis much harder. Cached file handles mask configuration errors.
+
+4. **`/etc/exports` is a single-point-of-configuration**: Unmanaged, unversioned, no review process. A single flag caused a cluster-wide outage.
+
+5. **This is the SECOND DNS outage related to NFS migration** (first was Apr 6 — unbound PVC). Storage migrations for DNS infrastructure need extra scrutiny and pre-migration testing.