fix: NFS outage recovery — migrate to NFSv4, add alerting

NFS server restart broke NFSv3 (lockd kernel bug on PVE 6.14). All 52 NFS PVs
patched to nfsvers=4, NFSv3 disabled on PVE.

Changes:
- nfs_volume module: add nfsvers=4 mount option
- nfs-csi StorageClass: add nfsvers=4 mount option
- dbaas: MySQL serverInstances 3→1, mysql-native-password=ON
- monitoring: add NFSCSINodeDown and NFSMountFailures alerts

[ci skip]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
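
As a rough illustration of the StorageClass change listed above, a csi-driver-nfs StorageClass that pins NFSv4 might look like the sketch below. The `server` and `share` values are placeholders rather than values from this repository, and the remaining fields are assumptions; only the `nfsvers=4` mount option comes from the commit message.

```yaml
# Sketch only: server/share are hypothetical placeholders, not repo values.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: nfs-csi
provisioner: nfs.csi.k8s.io        # provisioner name used by csi-driver-nfs
parameters:
  server: nfs.example.internal     # hypothetical NFS server address
  share: /export/k8s               # hypothetical export path
reclaimPolicy: Delete              # assumption; not stated in the commit
mountOptions:
  - nfsvers=4                      # force NFSv4, which does not rely on the NFSv3 lockd service
```

StorageClass `mountOptions` are copied onto a PersistentVolume only when it is dynamically provisioned, which is presumably why the existing PVs were patched separately.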
2026-04-14 10:28:27 +00:00 · ea18116da9
commit ea18116da9
parent 92900b5e08
4 changed files with 21 additions and 1 deletion
--- a/stacks/monitoring/modules/monitoring/prometheus_chart_values.tpl
+++ b/stacks/monitoring/modules/monitoring/prometheus_chart_values.tpl
@@ -1700,6 +1700,23 @@ serverFiles:
            annotations:
              summary: "NFS CSI controller down — new NFS volume provisioning broken"
          # ISCSICSIControllerDown alert removed — democratic-csi replaced by proxmox-csi (2026-04-05)
+          - alert: NFSCSINodeDown
+            expr: kube_daemonset_status_number_unavailable{namespace="nfs-csi", daemonset="csi-nfs-node"} > 0
+            for: 10m
+            labels:
+              severity: critical
+            annotations:
+              summary: "{{ $value }} NFS CSI node pod(s) unavailable — NFS mounts will fail on affected nodes"
+          - alert: NFSMountFailures
+            expr: |
+              count(kube_pod_container_status_waiting_reason{reason="ContainerCreating"} == 1) > 5
+              and on()
+              count(kube_pod_container_status_waiting_reason{reason="ContainerCreating"} == 1) > 2 * count(kube_pod_container_status_waiting_reason{reason="ContainerCreating"} offset 10m == 1 or on() vector(0))
+            for: 10m
+            labels:
+              severity: critical
+            annotations:
+              summary: ">5 pods stuck in ContainerCreating with sudden increase — possible NFS or storage outage"
      - name: "Application Health"
        rules:
          - alert: MailServerDown
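
One way to sanity-check the NFSMountFailures expression is a `promtool test rules` unit test. The sketch below assumes the rule above has been copied into a standalone rule file named `nfs_alerts.yml` (a hypothetical name), and feeds in six synthetic pods that enter ContainerCreating at minute 11 and stay there. By my reading of the expression, the alert is pending from minute 11 but should not reach firing: at minute 21, when `for: 10m` would elapse, the `offset 10m` baseline also sees six pods, so `6 > 2 * 6` is false. The empty `exp_alerts` lists encode that reading and are worth confirming by running the test.

```yaml
# nfs_alerts_test.yml (hypothetical file name)
# Run with: promtool test rules nfs_alerts_test.yml
rule_files:
  - nfs_alerts.yml   # assumed: NFSMountFailures copied into a plain rule group

evaluation_interval: 1m

tests:
  - interval: 1m
    # Six synthetic pods: value 0 for minutes 0-10, then 1 for minutes 11-41.
    input_series:
      - series: 'kube_pod_container_status_waiting_reason{pod="p1", reason="ContainerCreating"}'
        values: '0x10 1x30'
      - series: 'kube_pod_container_status_waiting_reason{pod="p2", reason="ContainerCreating"}'
        values: '0x10 1x30'
      - series: 'kube_pod_container_status_waiting_reason{pod="p3", reason="ContainerCreating"}'
        values: '0x10 1x30'
      - series: 'kube_pod_container_status_waiting_reason{pod="p4", reason="ContainerCreating"}'
        values: '0x10 1x30'
      - series: 'kube_pod_container_status_waiting_reason{pod="p5", reason="ContainerCreating"}'
        values: '0x10 1x30'
      - series: 'kube_pod_container_status_waiting_reason{pod="p6", reason="ContainerCreating"}'
        values: '0x10 1x30'
    alert_rule_test:
      # Minute 15: the condition holds, but the alert is still pending (not firing).
      - eval_time: 15m
        alertname: NFSMountFailures
        exp_alerts: []
      # Minute 30: the count has plateaued at 6, so "6 > 2 * 6" no longer holds.
      - eval_time: 30m
        alertname: NFSMountFailures
        exp_alerts: []
```

If the intent is to catch a jump that then persists, one option would be a baseline offset longer than the `for` duration, so the pre-outage count is still in view when the hold period ends; that is a suggestion, not something this commit does.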