fix(post-mortem): add /etc/exports to git, NFS health check in daily-backup, document CSI requirements [PM-2026-04-14]
- scripts/pve-nfs-exports: git-managed copy of PVE host /etc/exports with detailed comments explaining fsid=0 danger and NFSv3 disable rationale. Deploy: scp scripts/pve-nfs-exports root@192.168.1.127:/etc/exports && ssh root@192.168.1.127 exportfs -ra - scripts/daily-backup.sh: add check_nfs_exports() that runs before backup starts. Detects: missing /etc/exports, dangerous fsid=0 on /srv/nfs, nfs-server not running, no active exports. Warns but doesn't abort (block-storage PVC backups can still run). - .claude/CLAUDE.md: document NFS CSI mount option requirements — nfsvers=4 mandatory, fsid=0 forbidden, /etc/exports is git-managed, critical services must use proxmox-lvm-encrypted. Co-Authored-By: postmortem-todo-resolver <noreply@anthropic.com>
This commit is contained in:
parent
ca2680c189
commit
4498f61402
3 changed files with 78 additions and 0 deletions
|
|
@ -158,6 +158,13 @@ Choose storage class based on workload type:
|
|||
|
||||
**Migration note**: CSI PV `volumeAttributes` are immutable — cannot update NFS server in place. New PV/PVC pairs required (convention: append `-host` to PV name).
|
||||
|
||||
**NFS CSI mount option requirements** (learned from [PM-2026-04-14]):
|
||||
- **ALWAYS set `nfsvers=4`** in CSI mount options. NFSv3 is disabled on the PVE host (`vers3=n` in `/etc/nfs.conf`). Without this, mounts fail silently if kernel NFS client state is corrupt.
|
||||
- **NEVER use `fsid=0`** in `/etc/exports` on `/srv/nfs`. `fsid=0` designates the NFSv4 pseudo-root, which breaks subdirectory path resolution for all CSI mounts. Only `fsid=1` (unique ID) is safe on `/srv/nfs-ssd`.
|
||||
- **`/etc/exports` is git-managed** at `infra/scripts/pve-nfs-exports`. Deploy: `scp scripts/pve-nfs-exports root@192.168.1.127:/etc/exports && ssh root@192.168.1.127 exportfs -ra`
|
||||
- **Critical services MUST NOT use NFS storage** — circular dependency risk. Alertmanager, Prometheus, and any monitoring that should alert about NFS must use `proxmox-lvm-encrypted`. Technitium DNS primary uses `proxmox-lvm-encrypted` (migrated 2026-04-14).
|
||||
- **NFS PV template** (in `modules/kubernetes/nfs_volume/`): always include `mountOptions: ["nfsvers=4", "soft", "actimeo=5", "retrans=3", "timeo=30"]`
|
||||
|
||||
**proxmox-lvm PVC template** (Terraform):
|
||||
```hcl
|
||||
resource "kubernetes_persistent_volume_claim" "data_proxmox" {
|
||||
|
|
|
|||
|
|
@ -50,6 +50,45 @@ resolve_pvc_name() {
|
|||
' "${MAPPING_CACHE}" 2>/dev/null
|
||||
}
|
||||
|
||||
# --- NFS Export Health Check ---
|
||||
# Verify NFS exports are healthy before starting backup.
|
||||
# Detects: missing /etc/exports, incorrect fsid=0 flag, unexpected exports.
|
||||
# Added 2026-04-14 [PM-2026-04-14]: backup script accessed NFS causing stale handle
|
||||
# propagation during the fsid=0 outage. Early check prevents cascading failures.
|
||||
check_nfs_exports() {
|
||||
local exports_file="/etc/exports"
|
||||
local status=0
|
||||
|
||||
if [ ! -f "${exports_file}" ]; then
|
||||
log "WARN: ${exports_file} does not exist — NFS exports may be unconfigured"
|
||||
return 1
|
||||
fi
|
||||
|
||||
# Check for dangerous fsid=0 on /srv/nfs (breaks NFSv4 subdirectory path resolution)
|
||||
if grep -E '^/srv/nfs[[:space:]].*fsid=0' "${exports_file}" 2>/dev/null; then
|
||||
log "ERROR: /etc/exports contains fsid=0 on /srv/nfs — this will break all k8s NFS mounts!"
|
||||
log "ERROR: Remove fsid=0 and run: exportfs -ra && systemctl restart nfs-server"
|
||||
return 1
|
||||
fi
|
||||
|
||||
# Verify NFS server is active
|
||||
if ! systemctl is-active --quiet nfs-server 2>/dev/null; then
|
||||
log "WARN: nfs-server is not running — NFS mounts will fail"
|
||||
return 1
|
||||
fi
|
||||
|
||||
# Verify exports are actually loaded (exportfs -s lists active exports)
|
||||
local active_exports
|
||||
active_exports=$(exportfs -s 2>/dev/null | grep -c '/srv/nfs' || true)
|
||||
if [ "${active_exports:-0}" -eq 0 ]; then
|
||||
log "WARN: No /srv/nfs exports active in kernel — run: exportfs -ra"
|
||||
return 1
|
||||
fi
|
||||
|
||||
log "NFS export health check passed (${active_exports} /srv/nfs export(s) active)"
|
||||
return 0
|
||||
}
|
||||
|
||||
# --- Main ---
|
||||
log "=== Weekly backup starting ==="
|
||||
|
||||
|
|
@ -57,6 +96,12 @@ if ! mountpoint -q "${BACKUP_ROOT}"; then
|
|||
die "${BACKUP_ROOT} is not mounted"
|
||||
fi
|
||||
|
||||
# NFS export health check — warn but don't abort (backup can proceed with block storage PVCs)
|
||||
check_nfs_exports || {
|
||||
log "WARN: NFS export health check failed — NFS-backed PVC backups may fail"
|
||||
STATUS=1
|
||||
}
|
||||
|
||||
STATUS=0
|
||||
TOTAL_BYTES=0
|
||||
|
||||
|
|
|
|||
26
scripts/pve-nfs-exports
Normal file
26
scripts/pve-nfs-exports
Normal file
|
|
@ -0,0 +1,26 @@
|
|||
# /etc/exports — NFS export configuration for Proxmox VE host
|
||||
# Managed in git: infra/scripts/pve-nfs-exports
|
||||
# Deploy: scp scripts/pve-nfs-exports root@192.168.1.127:/etc/exports && ssh root@192.168.1.127 exportfs -ra
|
||||
#
|
||||
# CRITICAL NOTES (learned from 2026-04-14 outage [PM-2026-04-14]):
|
||||
# - NEVER add fsid=0 to /srv/nfs or /srv/nfs-ssd exports. fsid=0 designates the
|
||||
# NFSv4 pseudo-root which changes path resolution for ALL subdirectory mounts.
|
||||
# When CSI mounts use paths like /srv/nfs/technitium, fsid=0 makes them resolve
|
||||
# as the root itself, causing ENOENT on all subdirectory mounts.
|
||||
# - fsid=1 is acceptable on /srv/nfs-ssd (unique ID, not root).
|
||||
# - The NFS CSI driver mounts subdirectories — never use fsid=0 on any export
|
||||
# that serves dynamic path mounts.
|
||||
# - NFSv3 is disabled on this host (vers3=n in /etc/nfs.conf) — all k8s mounts
|
||||
# must use nfsvers=4 mount option.
|
||||
#
|
||||
# Mount options explanation:
|
||||
# rw — read/write access (required for PVCs)
|
||||
# async — async writes safe: UPS protects host + Vault Raft replication +
|
||||
# databases on block storage. Only NFS metadata at risk.
|
||||
# no_subtree_check — disable subtree checking for performance and reliability
|
||||
# no_root_squash — k8s CSI driver runs as root; squashing breaks PVC writes
|
||||
# insecure — allow source ports >1024 (required: pfSense VLAN NAT uses
|
||||
# unprivileged ports for VLAN 10 → 192.168.1.x traffic)
|
||||
#
|
||||
/srv/nfs *(rw,async,no_subtree_check,no_root_squash,insecure)
|
||||
/srv/nfs-ssd *(rw,sync,no_subtree_check,no_root_squash,insecure,fsid=1)
|
||||
Loading…
Add table
Add a link
Reference in a new issue