NFS CSI Driver Migration: Inline Volumes → PV/PVC with Soft Mounts
Date: 2026-03-01
Status: Draft
Complements: 2026-02-28-storage-reliability-design.md (databases → local disk)
Goal: Eliminate stale NFS mount hangs, add mount health checking, and create a storage abstraction layer for all NFS-dependent services
Problem
56 services use inline NFS volumes (`nfs {}` blocks in pod specs). This pattern has three compounding issues:
- Stale mounts hang forever: Inline NFS defaults to `hard,timeo=600` mount options. When TrueNAS is unreachable (reboot, network blip, NFS export change), the kernel retries indefinitely. Pods show `Running 1/1` but are completely frozen with zero listening sockets. The only fix is force-deleting the pod.
- No mount health checking: kubelet has no visibility into NFS mount health. Liveness probes only check application health, not filesystem access. A stale mount is invisible to the scheduler.
- No storage abstraction: NFS server IP and export paths are hardcoded into every pod spec via `var.nfs_server`. Changing the backend (different NFS server, different protocol) requires editing 56 stacks.
Constraints
- Zero data migration — same NFS paths, same TrueNAS server, same directories
- Services must keep working during migration (no downtime per service beyond a pod restart)
- Must work with existing Terragrunt architecture (per-stack state isolation)
- Must not break services that will later move to local disk (per storage-reliability design)
Design
Architecture
BEFORE:
Pod spec → inline nfs {} block → kubelet mount -t nfs (hard,timeo=600) → TrueNAS
(no health check, hangs on stale mount, server IP in every stack)
AFTER:
Terraform module → PV (CSI driver ref) + PVC → Pod spec references PVC
CSI driver mounts with soft,timeo=30,retrans=3 → TrueNAS
(health-checked, fails fast on stale mount, server IP in module only)
Component 1: NFS CSI Driver (Helm chart in platform stack)
Deploy `csi-driver-nfs` v4.11+ via Helm in `stacks/platform/modules/nfs-csi/`.
The driver runs as:
- Controller: 1 replica (handles PV provisioning)
- Node DaemonSet: 1 per node (handles mount/unmount operations)
Resource footprint: ~50MB RAM per node, ~10m CPU idle.
The driver itself does not change NFS behavior — it delegates to the kernel NFS client. The value is:
- Mount options are configurable per-StorageClass (not hardcoded kernel defaults)
- CSI health checking can detect unhealthy volumes
- Standard K8s storage API (PV/PVC/StorageClass) instead of inline volumes
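The Helm release itself is a one-resource Terraform wrapper. A minimal sketch, assuming the Terraform Helm provider: the chart repository URL follows the upstream csi-driver-nfs install docs, while the pinned version and target namespace here are assumptions.
```hcl
# Sketch: stacks/platform/modules/nfs-csi/main.tf
resource "helm_release" "csi_driver_nfs" {
  name       = "csi-driver-nfs"
  repository = "https://raw.githubusercontent.com/kubernetes-csi/csi-driver-nfs/master/charts"
  chart      = "csi-driver-nfs"
  version    = "v4.11.0"     # assumption: pin to the v4.11+ line named above
  namespace  = "kube-system" # assumption: co-locate with other node-level plugins
}
```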
Component 2: StorageClass
resource "kubernetes_storage_class" "nfs_truenas" {
metadata { name = "nfs-truenas" }
provisioner = "nfs.csi.k8s.io"
reclaim_policy = "Retain"
volume_binding_mode = "Immediate"
mount_options = [
"soft", # Return -EIO instead of hanging forever
"timeo=30", # 3-second timeout per NFS RPC call
"retrans=3", # Retry 3 times before giving up (~9 sec total)
"actimeo=5", # 5-second attribute cache (balance freshness vs perf)
]
parameters = {
server = var.nfs_server
share = "/mnt/main"
}
}
Key mount option differences vs current defaults:
| Option | Current (inline) | New (CSI) | Effect |
|---|---|---|---|
| `hard` vs `soft` | `hard` (default) | `soft` | I/O errors instead of infinite hang |
| `timeo` | 600 (60 sec) | 30 (3 sec) | Faster failure detection |
| `retrans` | 3 | 3 | Same retry count, but 3 sec per attempt, not 60 sec |
| `actimeo` | 3600 (1 hour, varies) | 5 (5 sec) | Fresher attribute cache |
| Total stale detection | ~3 minutes | ~9 seconds | 20x faster |
Component 3: Shared Terraform Module (modules/kubernetes/nfs_volume/)
Creates a PV + PVC pair for each NFS mount point. Hides boilerplate.
Interface:
module "nfs_data" {
source = "../../modules/kubernetes/nfs_volume"
name = "myservice-data" # PV and PVC name (must be unique cluster-wide)
namespace = "myservice" # PVC namespace
nfs_server = var.nfs_server # From terraform.tfvars
nfs_path = "/mnt/main/myservice" # NFS export path
# Optional:
# storage = "10Gi" # Default: 10Gi (informational for NFS)
# access_modes = ["ReadWriteMany"] # Default: RWX
}
Outputs:
- `claim_name`: the PVC name to reference in the pod spec

Module creates:
- `kubernetes_persistent_volume`: CSI-backed, references the StorageClass mount options
- `kubernetes_persistent_volume_claim`: bound to the PV, namespaced

PVs are cluster-scoped, so `name` must be globally unique. Convention: `<service>-<purpose>` (e.g., `openclaw-tools`, `privatebin-data`).
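Internally, the module is little more than those two resources wired together. A minimal sketch, assuming the Terraform Kubernetes provider; the `volume_handle` convention (reusing the PV name) and the variable defaults are assumptions, not a definitive implementation.
```hcl
# modules/kubernetes/nfs_volume/main.tf (sketch)
resource "kubernetes_persistent_volume" "this" {
  metadata { name = var.name }
  spec {
    capacity                         = { storage = var.storage } # informational for NFS
    access_modes                     = var.access_modes
    storage_class_name               = "nfs-truenas"             # picks up the soft mount options
    persistent_volume_reclaim_policy = "Retain"
    persistent_volume_source {
      csi {
        driver        = "nfs.csi.k8s.io"
        volume_handle = var.name # must be unique; assumption: reuse the PV name
        volume_attributes = {
          server = var.nfs_server
          share  = var.nfs_path
        }
      }
    }
  }
}

resource "kubernetes_persistent_volume_claim" "this" {
  metadata {
    name      = var.name
    namespace = var.namespace
  }
  spec {
    access_modes       = var.access_modes
    storage_class_name = "nfs-truenas"
    volume_name        = kubernetes_persistent_volume.this.metadata[0].name # static bind to the PV
    resources { requests = { storage = var.storage } }
  }
}

output "claim_name" {
  value = kubernetes_persistent_volume_claim.this.metadata[0].name
}
```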
Component 4: Stack Migration (Mechanical Change)
Each stack changes from:
```hcl
# OLD: inline NFS
volume {
  name = "data"
  nfs {
    server = var.nfs_server
    path   = "/mnt/main/myservice"
  }
}
```
To:
```hcl
# NEW: module call (outside pod spec)
module "nfs_data" {
  source     = "../../modules/kubernetes/nfs_volume"
  name       = "myservice-data"
  namespace  = "myservice"
  nfs_server = var.nfs_server
  nfs_path   = "/mnt/main/myservice"
}

# NEW: PVC reference (in pod spec, replaces the nfs {} block)
volume {
  name = "data"
  persistent_volume_claim {
    claim_name = module.nfs_data.claim_name
  }
}
```
Volume mount blocks (`volume_mount {}`) are completely unchanged.
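For reference, a typical mount block that stays exactly as-is (the mount path here is illustrative; real paths vary per service):
```hcl
volume_mount {
  name       = "data" # matches the volume name above
  mount_path = "/data"
}
```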
Component 5: Platform Module Migration
Platform modules (redis, dbaas, monitoring, etc.) that use NFS follow the same pattern, but the module path is `../../../modules/kubernetes/nfs_volume` (one extra level deep). The `nfs_server` variable is already passed through `stacks/platform/main.tf`.
Some platform modules use explicit PV/PVC already (Loki, Prometheus). These get updated to use the CSI driver backend instead of the native NFS PV source.
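A sketch of the same module call from inside a platform module; the redis name and export path are illustrative:
```hcl
# Inside stacks/platform/modules/redis/ (illustrative service)
module "nfs_data" {
  source     = "../../../modules/kubernetes/nfs_volume" # one extra ../ vs service stacks
  name       = "redis-data"
  namespace  = "redis"
  nfs_server = var.nfs_server # already passed through stacks/platform/main.tf
  nfs_path   = "/mnt/main/redis"
}
```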
What Does NOT Change
- NFS export paths on TrueNAS (no `nfs_directories.txt` changes)
- NFS server configuration
- Volume mount paths inside containers
- Sub-path usage patterns
- Container images or application config
- Services that will move to local disk later (per storage-reliability design): they get CSI mounts as an interim improvement, then move off NFS entirely
Migration Order
Services grouped by risk. Each batch: apply → verify pods running → verify app accessible → next batch.
Phase 0: Infrastructure
- Deploy the NFS CSI driver Helm chart (platform module)
- Create the `nfs-truenas` StorageClass
- Create the `modules/kubernetes/nfs_volume/` shared module
Phase 1: Low-Risk Pilot (3 services)
Pick 3 simple, single-volume services to validate the pattern:
- `privatebin` (1 volume, low traffic)
- `echo`: actually stateless, so skip it. Use `resume` instead (1 volume, personal site)
- `speedtest` (1 volume, low traffic)
Phase 2: Simple Services (single NFS volume each, ~20 services)
Mechanical migration of all single-volume stacks. Can be parallelized.
Phase 3: Multi-Volume Services (~15 services)
Services with 2-4 NFS volumes (openclaw, servarr, immich, etc.). More module calls but same pattern.
Phase 4: Platform Modules (~9 modules)
Monitoring stack, Redis, dbaas PVs, etc. These live in `stacks/platform/modules/` and need the module path adjusted.
Phase 5: Cleanup
- Update CLAUDE.md documentation (new NFS volume pattern)
- Update the `setup-project` skill to use the module pattern for new services
- Verify all services healthy
Rollback
Per-service rollback: revert the stack to the inline `nfs {}` block and `terragrunt apply`. The data never moved; it's the same NFS path. The PV/PVC objects get destroyed by Terraform, and the pod remounts the inline volume. Takes about 1 minute per service.
Full rollback: remove CSI driver and StorageClass from platform stack, revert all stacks. No data impact.
Risks
- `soft` mount I/O errors: Apps that don't handle I/O errors gracefully may crash instead of hanging. This is strictly better: a crash triggers a restart with a fresh mount, versus hanging forever. But some apps may log noisy errors during brief NFS blips.
- PV naming conflicts: PV names are cluster-global, so uniqueness must be ensured. The `<service>-<purpose>` convention handles this.
- Terraform state churn: Each service gains 2 new resources (PV + PVC) and loses the inline volume (implicit, not tracked). The `terragrunt apply` will show resource additions but no deletions (inline volumes aren't separate Terraform resources). The pod will be recreated.
- CSI driver resource overhead: ~50MB RAM + 10m CPU per node (5 nodes = ~250MB cluster-wide). Acceptable.
Success Criteria
- NFS CSI driver deployed and healthy on all 5 nodes
- `nfs-truenas` StorageClass created with soft mount options
- `modules/kubernetes/nfs_volume/` module created and tested
- All 56 NFS-dependent services migrated from inline volumes to PV/PVC
- No service downtime beyond a single pod restart during migration
- Simulated NFS outage (TrueNAS NFS service pause) results in a pod restart (not a hang)
- Documentation and skills updated for the new pattern