# NFS CSI Driver Migration: Inline Volumes → PV/PVC with Soft Mounts

**Date**: 2026-03-01
**Status**: Draft
**Complements**: `2026-02-28-storage-reliability-design.md` (databases → local disk)
**Goal**: Eliminate stale NFS mount hangs, add mount health checking, and create a storage abstraction layer for all NFS-dependent services

## Problem

56 services use inline NFS volumes (`nfs {}` in pod specs). This pattern has three compounding issues:

1. **Stale mounts hang forever**: Inline NFS defaults to `hard,timeo=600` mount options. When TrueNAS is unreachable (reboot, network blip, NFS export change), the kernel retries indefinitely. Pods show `Running 1/1` but are completely frozen with zero listening sockets. The only fix is force-deleting the pod.
2. **No mount health checking**: kubelet has no visibility into NFS mount health. Liveness probes only check application health, not filesystem access. A stale mount is invisible to the scheduler.
3. **No storage abstraction**: NFS server IP and export paths are hardcoded into every pod spec via `var.nfs_server`. Changing the backend (different NFS server, different protocol) requires editing 56 stacks.

## Constraints

- Zero data migration: same NFS paths, same TrueNAS server, same directories
- Services must keep working during migration (no downtime per service beyond a pod restart)
- Must work with existing Terragrunt architecture (per-stack state isolation)
- Must not break services that will later move to local disk (per storage-reliability design)

## Design

### Architecture

```
BEFORE:
  Pod spec → inline nfs {} block → kubelet mount -t nfs (hard,timeo=600) → TrueNAS
  (no health check, hangs on stale mount, server IP in every stack)

AFTER:
  Terraform module → PV (CSI driver ref) + PVC → Pod spec references PVC
  CSI driver mounts with soft,timeo=30,retrans=3 → TrueNAS
  (health-checked, fails fast on stale mount, server IP in module only)
```

### Component 1: NFS CSI Driver (Helm chart in platform stack)

Deploy `csi-driver-nfs` v4.11+ via Helm in `stacks/platform/modules/nfs-csi/`. The driver runs as:

- **Controller**: 1 replica (handles PV provisioning)
- **Node DaemonSet**: 1 per node (handles mount/unmount operations)

Resource footprint: ~50MB RAM per node, ~10m CPU idle.

The driver itself does not change NFS behavior; it delegates to the kernel NFS client.
The value is:

- Mount options are configurable per-StorageClass (not hardcoded kernel defaults)
- CSI health checking can detect unhealthy volumes
- Standard K8s storage API (PV/PVC/StorageClass) instead of inline volumes

### Component 2: StorageClass

```hcl
resource "kubernetes_storage_class" "nfs_truenas" {
  metadata {
    name = "nfs-truenas"
  }

  # The Terraform kubernetes provider names this argument storage_provisioner.
  storage_provisioner = "nfs.csi.k8s.io"
  reclaim_policy      = "Retain"
  volume_binding_mode = "Immediate"

  mount_options = [
    "soft",      # Return -EIO instead of hanging forever
    "timeo=30",  # 3-second timeout per NFS RPC call (timeo is in tenths of a second)
    "retrans=3", # Retry 3 times before giving up (~9 sec total)
    "actimeo=5", # 5-second attribute cache (balance freshness vs perf)
  ]

  parameters = {
    server = var.nfs_server
    share  = "/mnt/main"
  }
}
```

Key mount option differences vs current defaults:

| Option | Current (inline) | New (CSI) | Effect |
|--------|------------------|-----------|--------|
| `hard` vs `soft` | `hard` (default) | `soft` | I/O errors instead of infinite hang |
| `timeo` | 600 (60 sec) | 30 (3 sec) | Faster failure detection |
| `retrans` | 3 | 3 | Same retry count, but 3s per attempt not 60s |
| `actimeo` | unset (kernel defaults, 3 to 60 sec depending on file type and age) | 5 (5 sec) | Fresher attribute cache, capped at 5 sec |
| Total stale detection (`timeo` × `retrans`) | **~3 minutes** | **~9 seconds** | 20x faster |

### Component 3: Shared Terraform Module (`modules/kubernetes/nfs_volume/`)

Creates a PV + PVC pair for each NFS mount point. Hides boilerplate.

**Interface**:

```hcl
module "nfs_data" {
  source     = "../../modules/kubernetes/nfs_volume"
  name       = "myservice-data"      # PV and PVC name (must be unique cluster-wide)
  namespace  = "myservice"           # PVC namespace
  nfs_server = var.nfs_server        # From terraform.tfvars
  nfs_path   = "/mnt/main/myservice" # NFS export path

  # Optional:
  # storage      = "10Gi"            # Default: 10Gi (informational for NFS)
  # access_modes = ["ReadWriteMany"] # Default: RWX
}
```

**Outputs**:

- `claim_name`: PVC name to reference in pod spec

**Module creates**:

1. `kubernetes_persistent_volume`: CSI-backed, carrying the same mount options as the StorageClass. The options must be set on the PV itself, because Kubernetes applies a StorageClass's `mountOptions` only to dynamically provisioned volumes; a statically created PV does not inherit them.
2. `kubernetes_persistent_volume_claim`: bound to the PV, namespaced

PVs are cluster-scoped, so `name` must be globally unique. Convention: `<service>-<volume>` (e.g., `openclaw-tools`, `privatebin-data`).

### Component 4: Stack Migration (Mechanical Change)

Each stack changes from:

```hcl
# OLD: inline NFS
volume {
  name = "data"
  nfs {
    server = var.nfs_server
    path   = "/mnt/main/myservice"
  }
}
```

To:

```hcl
# NEW: module call (outside pod spec)
module "nfs_data" {
  source     = "../../modules/kubernetes/nfs_volume"
  name       = "myservice-data"
  namespace  = "myservice"
  nfs_server = var.nfs_server
  nfs_path   = "/mnt/main/myservice"
}

# NEW: PVC reference (in pod spec, replaces nfs {} block)
volume {
  name = "data"
  persistent_volume_claim {
    claim_name = module.nfs_data.claim_name
  }
}
```

Volume mount blocks (`volume_mount {}`) are **completely unchanged**.

### Component 5: Platform Module Migration

Platform modules (redis, dbaas, monitoring, etc.) that use NFS follow the same pattern, but the module path is `../../../modules/kubernetes/nfs_volume` (one extra level deep). The `nfs_server` variable is already passed through `stacks/platform/main.tf`.

Some platform modules (Loki, Prometheus) already use explicit PV/PVC. These get updated to use the CSI driver backend instead of the native NFS PV source.

### What Does NOT Change

- NFS export paths on TrueNAS (no `nfs_directories.txt` changes)
- NFS server configuration
- Volume mount paths inside containers
- Sub-path usage patterns
- Container images or application config
- Services that will move to local disk later (per storage-reliability design): they get CSI mounts as an interim improvement, then move off NFS entirely

## Migration Order

Services grouped by risk. Each batch: apply → verify pods running → verify app accessible → next batch.

### Phase 0: Infrastructure

1. Deploy NFS CSI driver Helm chart (platform module)
2. Create `nfs-truenas` StorageClass
3. Create `modules/kubernetes/nfs_volume/` shared module

### Phase 1: Low-Risk Pilot (3 services)

Pick 3 simple, single-volume services to validate the pattern:

- `privatebin` (1 volume, low traffic)
- `resume` (1 volume, personal site), chosen instead of `echo`, which is stateless and has nothing to migrate
- `speedtest` (1 volume, low traffic)

### Phase 2: Simple Services (single NFS volume each, ~20 services)

Mechanical migration of all single-volume stacks. Can be parallelized.

### Phase 3: Multi-Volume Services (~15 services)

Services with 2-4 NFS volumes (openclaw, servarr, immich, etc.). More module calls but same pattern.

### Phase 4: Platform Modules (~9 modules)

Monitoring stack, Redis, dbaas PVs, etc. These live in `stacks/platform/modules/` and need the module path adjusted.

### Phase 5: Cleanup

- Update CLAUDE.md documentation (new NFS volume pattern)
- Update `setup-project` skill to use module pattern for new services
- Verify all services healthy

## Rollback

Per-service rollback: revert the stack to inline `nfs {}` and run `terragrunt apply`. The data never moved; it is the same NFS path. Terraform destroys the PV/PVC objects and the pod remounts inline. Takes about 1 minute per service.

Full rollback: remove CSI driver and StorageClass from platform stack, revert all stacks. No data impact.

## Risks

1. **`soft` mount I/O errors**: Apps that don't handle I/O errors gracefully may crash instead of hanging. For most services this is preferable: a crash triggers a restart with a fresh mount, whereas a hang lasts forever. Two caveats: some apps may log noisy errors during brief NFS blips, and `nfs(5)` warns that a soft timeout can silently corrupt data if a write is abandoned mid-flight, so write-heavy services deserve extra attention during verification.
2. **PV naming conflicts**: PV names are cluster-global. Must ensure uniqueness. The `<service>-<volume>` convention handles this.
3. **Terraform state churn**: Each service gains 2 new resources (PV + PVC) per NFS volume and loses the inline volume (implicit, not tracked). The `terragrunt apply` will show resource additions but no deletions (inline volumes aren't separate TF resources). Pod will be recreated.
4. **CSI driver resource overhead**: ~50MB RAM + 10m CPU per node (5 nodes = ~250MB cluster-wide). Acceptable.

## Success Criteria

- [ ] NFS CSI driver deployed and healthy on all 5 nodes
- [ ] `nfs-truenas` StorageClass created with soft mount options
- [ ] `modules/kubernetes/nfs_volume/` module created and tested
- [ ] All 56 NFS-dependent services migrated from inline to PV/PVC
- [ ] No service downtime beyond a single pod restart during migration
- [ ] Simulated NFS outage (TrueNAS NFS service pause) results in pod restart (not hang)
- [ ] Documentation and skills updated for new pattern
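
The third criterion depends on the `nfs_volume` module from Component 3, whose internals this document does not spell out. Below is a minimal sketch of what it might contain, not a definitive implementation. The resource label `this`, the `volume_handle` format, the hardcoded class name, and the `local.mount_options` list are assumptions for illustration; the variable names and the `claim_name` output follow the interface shown in Component 3. The mount options are repeated on the PV on the assumption that statically created PVs do not pick them up from the StorageClass.

```hcl
# Hypothetical modules/kubernetes/nfs_volume/main.tf (sketch, not the final module)

variable "name"       { type = string }
variable "namespace"  { type = string }
variable "nfs_server" { type = string }
variable "nfs_path"   { type = string }

variable "storage" {
  type    = string
  default = "10Gi" # Informational for NFS; the export is not size-limited by this
}

variable "access_modes" {
  type    = list(string)
  default = ["ReadWriteMany"]
}

locals {
  # Assumption: mirrors the nfs-truenas StorageClass options, set per PV
  # because static PVs are not expected to inherit them from the class.
  mount_options = ["soft", "timeo=30", "retrans=3", "actimeo=5"]
}

resource "kubernetes_persistent_volume" "this" {
  metadata {
    name = var.name # Cluster-scoped, so it must be globally unique
  }

  spec {
    capacity = {
      storage = var.storage
    }
    access_modes                     = var.access_modes
    persistent_volume_reclaim_policy = "Retain"
    storage_class_name               = "nfs-truenas"
    mount_options                    = local.mount_options

    persistent_volume_source {
      csi {
        driver = "nfs.csi.k8s.io"
        # Assumption: any cluster-unique string; this format is illustrative.
        volume_handle = "${var.nfs_server}#${var.nfs_path}#${var.name}"
        volume_attributes = {
          server = var.nfs_server
          share  = var.nfs_path
        }
      }
    }
  }
}

resource "kubernetes_persistent_volume_claim" "this" {
  metadata {
    name      = var.name
    namespace = var.namespace
  }

  spec {
    access_modes       = var.access_modes
    storage_class_name = "nfs-truenas"
    # Pin the claim to the PV created above instead of relying on matching.
    volume_name = kubernetes_persistent_volume.this.metadata[0].name

    resources {
      requests = {
        storage = var.storage
      }
    }
  }
}

output "claim_name" {
  value = kubernetes_persistent_volume_claim.this.metadata[0].name
}
```

Pinning `volume_name` keeps each claim tied to its own export even though every PV shares the same class and size. One way to check the result after an apply is to read the mount flags on a node (for example in `/proc/mounts`) and confirm that `soft` and `timeo=30` actually appear.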