From 57250cfda202bebbb47d3ac6a467977ebff00b26 Mon Sep 17 00:00:00 2001
From: Viktor Barzin
Date: Sat, 9 May 2026 17:01:57 +0000
Subject: [PATCH] mysql: bump to 4Gi limit / 3Gi request; grow /srv/nfs LV to 3 TiB

mysql-standalone OOMKilled May 8 18:05 (anon-rss 2 GB at the 2 Gi limit).
innodb_buffer_pool_size=1Gi plus connection buffers and InnoDB internals
don't fit in 2 Gi. Bumping the limit to 4 Gi (request 3 Gi) leaves headroom
without changing the buffer pool config.

/srv/nfs was at 90% (1.7T / 2T); grew the underlying pve/nfs-data LV by
1 TiB online and ran resize2fs (now 60% used). The thinpool had ~4.6 TiB
free.

Both changes surfaced during the 2026-05-09 IO-pressure post-mortem, which
also covers the stale-NFS-client trigger (the legacy
/usr/local/bin/weekly-backup script pointing at the decommissioned TrueNAS
IP) and the resulting wedged kthread on the PVE host. The script was
removed and node_exporter restarted out-of-band; the kthread will clear at
the next PVE reboot. See
docs/post-mortems/2026-05-09-io-pressure-stale-nfs.md.

Co-Authored-By: Claude Opus 4.7
---
 docs/architecture/storage.md | 8 +--
 .../2026-05-09-io-pressure-stale-nfs.md | 56 +++++++++++++++++++
 stacks/dbaas/modules/dbaas/main.tf | 4 +-
 3 files changed, 62 insertions(+), 6 deletions(-)
 create mode 100644 docs/post-mortems/2026-05-09-io-pressure-stale-nfs.md

diff --git a/docs/architecture/storage.md b/docs/architecture/storage.md
index df1e89f9..de0b2111 100644
--- a/docs/architecture/storage.md
+++ b/docs/architecture/storage.md
@@ -1,6 +1,6 @@
 # Storage Architecture
 
-Last updated: 2026-04-15
+Last updated: 2026-05-09
 
 ## Overview
 
@@ -13,7 +13,7 @@ The cluster uses two storage backends: **Proxmox CSI** for database block storag
 All services storing sensitive data were migrated to `proxmox-lvm-encrypted` on 2026-04-15. This eliminates the previous double-CoW (ZFS + LVM-thin) path and ensures data-at-rest encryption.
 
 **NFS storage (Proxmox host)**: ~100 NFS shares for media libraries (Immich, audiobookshelf, servarr, navidrome), backup targets (`*-backup/` directories), and app data are served directly from the Proxmox host at `192.168.1.127`. Two NFS export roots exist:
-- **HDD NFS**: `/srv/nfs` on ext4 LV `pve/nfs-data` (2TB) — bulk media and backup targets
+- **HDD NFS**: `/srv/nfs` on ext4 LV `pve/nfs-data` (3TB) — bulk media and backup targets
 - **SSD NFS**: `/srv/nfs-ssd` on ext4 LV `ssd/nfs-ssd-data` (100GB) — high-performance data (Immich ML)
 
 Both `StorageClass: nfs-truenas` and `StorageClass: nfs-proxmox` point to the Proxmox host and are functionally identical. The `nfs-truenas` name is historical — it was retained because StorageClass names are immutable on bound PVs (48 PVs reference it) and renaming would force mass PV churn across the cluster.
@@ -31,7 +31,7 @@ graph TB
     subgraph Proxmox["Proxmox Host (192.168.1.127)"]
         sdc["sdc: 10.7TB RAID1 HDD<br/>VG pve, LV data (thin pool)<br/>~67 proxmox-lvm PVCs<br/>~28 proxmox-lvm-encrypted PVCs"]
         sda["sda: 1.1TB RAID1 SAS<br/>VG backup, LV data (ext4)<br/>/mnt/backup"]
-        NFS_HDD["LV pve/nfs-data (2TB ext4)<br/>/srv/nfs<br/>~100 NFS shares<br/>Media + backup targets"]
+        NFS_HDD["LV pve/nfs-data (3TB ext4)<br/>/srv/nfs<br/>~100 NFS shares<br/>Media + backup targets"]
         NFS_SSD["LV ssd/nfs-ssd-data (100GB ext4)<br/>/srv/nfs-ssd<br/>High-performance data<br/>(Immich ML)"]
         NFS_Exports["NFS Exports<br/>managed by /etc/exports"]
         NFS_HDD --> NFS_Exports
@@ -74,7 +74,7 @@ graph TB
 | **Proxmox CSI plugin** | Helm chart | Namespace: proxmox-csi | Block storage via LVM-thin hotplug |
 | **StorageClass `proxmox-lvm`** | RWO, WaitForFirstConsumer | Cluster-wide | Non-sensitive stateful apps |
 | **StorageClass `proxmox-lvm-encrypted`** | RWO, WaitForFirstConsumer, LUKS2 | Cluster-wide | **All sensitive data** (databases, auth, email, passwords, git) |
-| Proxmox NFS (HDD) | LV `pve/nfs-data`, 2TB ext4 | 192.168.1.127:/srv/nfs | Bulk NFS data for all services |
+| Proxmox NFS (HDD) | LV `pve/nfs-data`, 3TB ext4 | 192.168.1.127:/srv/nfs | Bulk NFS data for all services |
 | Proxmox NFS (SSD) | LV `ssd/nfs-ssd-data`, 100GB ext4 | 192.168.1.127:/srv/nfs-ssd | High-performance data (Immich ML) |
 | nfs-csi | Helm chart | Namespace: nfs-csi | NFS CSI driver |
 | StorageClass `nfs-proxmox` | RWX, soft mount | Cluster-wide | NFS storage, points to Proxmox host |
diff --git a/docs/post-mortems/2026-05-09-io-pressure-stale-nfs.md b/docs/post-mortems/2026-05-09-io-pressure-stale-nfs.md
new file mode 100644
index 00000000..de0b5719
--- /dev/null
+++ b/docs/post-mortems/2026-05-09-io-pressure-stale-nfs.md
@@ -0,0 +1,56 @@
+# Post-Mortem: IO Pressure Stalls from Stale NFS Client to Decommissioned TrueNAS

| Field | Value |
|-------|-------|
| **Date** | 2026-05-09 (issue first observable in journal at 2026-05-08 00:00:04) |
| **Duration** | Intermittent IO PSI stalls and kubectl TLS handshake timeouts during the session; PVE host loadavg sustained at ~15. No user-visible outage. |
| **Severity** | SEV3 (degraded host I/O, no service down) |
| **Affected Components** | PVE host (192.168.1.127), `node_exporter` (PID 1479, D-state), kernel NFS kthread `[10.0.10.15-manager]`, k8s-node3 (downstream IO PSI). |
| **Status** | Resolved structurally. Stale connection source removed; recurring trigger eliminated. Wedged kthread persists in kernel queue — clears on next PVE reboot. |

## Summary

The PVE host's NFS client was retaining a wedged connection to `10.0.10.15` — the IP of the TrueNAS VM that was operationally decommissioned on 2026-04-13 (storage migrated to `192.168.1.127:/srv/nfs`). The connection was created by `/usr/local/bin/weekly-backup`, a legacy script from before the NFS migration that had never been removed. Its kernel kthread `[10.0.10.15-manager]` parked itself in `rpc_wait_bit_killable` and stayed there. Any process that touched `/proc/mountstats` — including `node_exporter` — got dragged into D-state alongside it, which in turn fed back into the IO pressure metrics. cluster-health surfaced this as `k8s-node3 full avg10=23%`, with PVE loadavg sustained at ~15.

## Impact

- **User-facing**: None directly. Intermittent kubectl TLS handshake timeouts during the session, attributable to the elevated PVE loadavg.
- **Blast radius**: Single PVE host. node_exporter (PID 1479) wedged in D-state with the kthread. k8s-node3 downstream IO PSI peaked at `full avg10=23%`.
- **Data loss**: None.
- **Observability gap**: No alert fired for "stale NFS connection to decommissioned host". The IO PSI watchdog caught the symptom, not the cause.

## Root Cause

`/usr/local/bin/weekly-backup` was an artifact of the pre-2026-04-13 backup pipeline (when TrueNAS at `10.0.10.15` was the NFS server). After the TrueNAS decommission and migration to host NFS at `192.168.1.127`, the script was never deleted. It executed at least once recently (manually, or via a cron entry that has since been pruned), opening an NFS RPC session to `10.0.10.15`. With no peer answering, the kernel's RPC retry timer parked the manager kthread in `rpc_wait_bit_killable`. The kthread holds a lock that any reader of `/proc/mountstats` must take; `node_exporter` reads that file on every scrape, so the thread servicing its scrape wedged in D-state too.
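For future triage, this failure mode has a recognizable signature. A minimal detection sketch (standard procps options; the PIDs and IP in the illustrative output are the ones from this incident):

```sh
# List uninterruptible (D-state) tasks plus the kernel function each is
# parked in; the stale-NFS signature is wchan=rpc_wait_bit_killable.
ps -eo pid,stat,wchan:30,comm | awk '$2 ~ /^D/'
# Illustrative output for this incident:
#  3796184 D   rpc_wait_bit_killable   10.0.10.15-manager
#     1479 Dl  rpc_wait_bit_killable   node_exporter

# Do NOT cat /proc/mountstats to "confirm" the hang: the read takes the
# same contended lock, and the investigating shell joins the D-state
# pile-up.
```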
## Resolution

1. `lvextend -L +1T /dev/pve/nfs-data` + `resize2fs` — `/srv/nfs` 2 TiB → 3 TiB (90% → 60% used). Unrelated to the IO issue but bundled because `/srv/nfs` was at 90% and the user picked "grow LV" over "diet Immich". Thinpool (sdc) had ~4.6 TiB free. (Full sequence with verification in the appendix below.)
2. `rm /usr/local/bin/weekly-backup` — eliminates the trigger. The backup pipeline is now `daily-backup.service` + `offsite-sync-backup.service` + per-app CronJobs (mysql/postgres/vault/etc.); `weekly-backup` was fully redundant.
3. `systemctl restart node_exporter` — replaces the wedged process. New PID 183319 healthy, `:9100/metrics` responsive.
4. `mysql-standalone` memory bump: limit 2 Gi → 4 Gi, request 1.5 Gi → 3 Gi (this commit). Coincident May 8 18:05 OOM, not caused by this incident — `innodb_buffer_pool_size=1Gi` plus connection buffers and InnoDB internals didn't fit in 2 Gi.

## Open / Out-of-Scope

- **Wedged kthread `[10.0.10.15-manager]` (PID 3796184)** persists in the kernel queue. The kernel will eventually reap it once the RPC retry timer gives up, or it clears at the next PVE reboot. With the script gone, no new ops queue against it. **Plan**: if PVE host PSI does not fully clear within 24 h, fold a PVE reboot into the next maintenance window. Not done in this change.
- **Transient OOMs unrelated to this incident**:
  - `mysql-standalone-0` May 8 18:05 (anon-rss 2 GB at the 2 Gi limit) — addressed by the limit bump above.
  - postgres helpers May 9 12:37 — anon-rss <8 MB, pods no longer exist, no recurrence. No action.
  - python pod May 9 13:36 (anon-rss 518 MB on k8s-node2) — pod no longer exists, no recurrence. No action.
- **Pre-existing TF drift**: `null_resource.pg_job_hunter_db` in `stacks/dbaas/modules/dbaas/main.tf` execs against `pg-cluster-1`, but the current CNPG primary is `pg-cluster-2`. Unrelated to this incident; surfaced during the targeted MySQL apply. The fix is a separate ticket — it should resolve the primary dynamically (e.g., via the `cnpg.io/instanceRole=primary` selector) instead of hardcoding the pod ordinal (see the sketch after the `main.tf` diff below).

## Action Items

- [x] Delete `/usr/local/bin/weekly-backup` on the PVE host.
- [x] Restart `node_exporter.service` on the PVE host.
- [x] Grow `pve/nfs-data` LV to 3 TiB; online `resize2fs`.
- [x] Bump `mysql-standalone` memory request/limit to 3 Gi / 4 Gi.
- [x] Update `docs/architecture/storage.md` to record the new LV size.
- [ ] Reboot the PVE host at the next maintenance window if the `[10.0.10.15-manager]` kthread does not clear within 24 h.
- [ ] (Separate ticket) Fix `null_resource.pg_*_db` resources to target the actual CNPG primary instead of hardcoding `pg-cluster-1`.

## Related

- TrueNAS decommission: memory `id=674` (2026-04-13).
- Prior out-of-band grow of `pve/nfs-data` to 2 TiB: memory `id=691` (2026-04-12).
- Architecture: `docs/architecture/storage.md`, `docs/architecture/backup-dr.md`.
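## Appendix: Online LV Grow Sequence

For reference, the Resolution step 1 commands with verification added. A sketch: the device, mountpoint, and sizes are from this incident; `resize2fs` grows ext4 in place, so no unmount was needed.

```sh
# Grow the LV by 1 TiB; the pve thinpool (sdc) had ~4.6 TiB free.
lvextend -L +1T /dev/pve/nfs-data

# Grow ext4 to fill the LV; safe while the filesystem is mounted.
resize2fs /dev/pve/nfs-data

# Verify the result.
lvs pve/nfs-data   # LSize should now show ~3 TiB
df -h /srv/nfs     # Use% should drop from ~90% to ~60%
```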
diff --git a/stacks/dbaas/modules/dbaas/main.tf b/stacks/dbaas/modules/dbaas/main.tf
index 1ae6f415..43a128de 100644
--- a/stacks/dbaas/modules/dbaas/main.tf
+++ b/stacks/dbaas/modules/dbaas/main.tf
@@ -188,10 +188,10 @@ resource "kubernetes_stateful_set_v1" "mysql_standalone" {
           resources {
             requests = {
               cpu    = "250m"
-              memory = "1536Mi"
+              memory = "3Gi"
             }
             limits = {
-              memory = "2Gi"
+              memory = "4Gi"
             }
           }
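For the separate ticket on the `pg_*_db` drift, one possible shape for resolving the primary dynamically, using the `cnpg.io/instanceRole=primary` selector the post-mortem suggests. A hypothetical sketch: the `dbaas` namespace and the `cnpg.io/cluster=pg-cluster` label value are assumptions, not confirmed module values.

```sh
# Resolve the current CNPG primary by label instead of hardcoding pg-cluster-1.
# Namespace and cluster label value are assumed; take them from the module.
PRIMARY=$(kubectl get pod -n dbaas \
  -l cnpg.io/cluster=pg-cluster,cnpg.io/instanceRole=primary \
  -o jsonpath='{.items[0].metadata.name}')

# The null_resource provisioner would then exec against "$PRIMARY":
kubectl exec -n dbaas "$PRIMARY" -- psql -c 'SELECT 1'
```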