NFS CSI Driver Migration: Inline Volumes → PV/PVC with Soft Mounts
Date: 2026-03-01
Status: Draft
Complements: 2026-02-28-storage-reliability-design.md (databases → local disk)
Goal: Eliminate stale NFS mount hangs, add mount health checking, and create a storage abstraction layer for all NFS-dependent services
Problem
56 services use inline NFS volumes (`nfs {}` blocks in pod specs). This pattern has three compounding issues:
- Stale mounts hang forever: Inline NFS defaults to `hard,timeo=600` mount options. When TrueNAS is unreachable (reboot, network blip, NFS export change), the kernel retries indefinitely. Pods show `Running 1/1` but are completely frozen with zero listening sockets. The only fix is force-deleting the pod.
- No mount health checking: kubelet has no visibility into NFS mount health. Liveness probes only check application health, not filesystem access. A stale mount is invisible to the scheduler.
- No storage abstraction: NFS server IP and export paths are hardcoded into every pod spec via `var.nfs_server`. Changing the backend (different NFS server, different protocol) requires editing 56 stacks.
Constraints
- Zero data migration — same NFS paths, same TrueNAS server, same directories
- Services must keep working during migration (no downtime per service beyond a pod restart)
- Must work with existing Terragrunt architecture (per-stack state isolation)
- Must not break services that will later move to local disk (per storage-reliability design)
Design
Architecture
BEFORE:
Pod spec → inline nfs {} block → kubelet mount -t nfs (hard,timeo=600) → TrueNAS
(no health check, hangs on stale mount, server IP in every stack)
AFTER:
Terraform module → PV (CSI driver ref) + PVC → Pod spec references PVC
CSI driver mounts with soft,timeo=30,retrans=3 → TrueNAS
(health-checked, fails fast on stale mount, server IP in module only)
Component 1: NFS CSI Driver (Helm chart in platform stack)
Deploy `csi-driver-nfs` v4.11+ via Helm in `stacks/platform/modules/nfs-csi/`.
The driver runs as:
- Controller: 1 replica (handles PV provisioning)
- Node DaemonSet: 1 per node (handles mount/unmount operations)
Resource footprint: ~50MB RAM per node, ~10m CPU idle.
The driver itself does not change NFS behavior — it delegates to the kernel NFS client. The value is:
- Mount options are configurable per-StorageClass (not hardcoded kernel defaults)
- CSI health checking can detect unhealthy volumes
- Standard K8s storage API (PV/PVC/StorageClass) instead of inline volumes
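The Helm release itself is a one-resource Terraform wrapper. A minimal sketch, assuming the Terraform Helm provider: the chart repository URL follows the upstream csi-driver-nfs install docs, while the pinned version and target namespace here are assumptions.
```hcl
# Sketch: stacks/platform/modules/nfs-csi/main.tf
resource "helm_release" "csi_driver_nfs" {
  name       = "csi-driver-nfs"
  repository = "https://raw.githubusercontent.com/kubernetes-csi/csi-driver-nfs/master/charts"
  chart      = "csi-driver-nfs"
  version    = "v4.11.0"     # assumption: pin to the v4.11+ line named above
  namespace  = "kube-system" # assumption: co-locate with other node-level plugins
}
```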
Component 2: StorageClass
resource "kubernetes_storage_class" "nfs_truenas" {
metadata { name = "nfs-truenas" }
provisioner = "nfs.csi.k8s.io"
reclaim_policy = "Retain"
volume_binding_mode = "Immediate"
mount_options = [
"soft", # Return -EIO instead of hanging forever
"timeo=30", # 3-second timeout per NFS RPC call
"retrans=3", # Retry 3 times before giving up (~9 sec total)
"actimeo=5", # 5-second attribute cache (balance freshness vs perf)
]
parameters = {
server = var.nfs_server
share = "/mnt/main"
}
}
Key mount option differences vs current defaults:
| Option | Current (inline) | New (CSI) | Effect |
|---|---|---|---|
| `hard` vs `soft` | `hard` (default) | `soft` | I/O errors instead of infinite hang |
| `timeo` | 600 (60 sec) | 30 (3 sec) | Faster failure detection |
| `retrans` | 3 | 3 | Same retry count, but 3 sec per attempt, not 60 sec |
| `actimeo` | 3600 (1 hour, varies) | 5 (5 sec) | Fresher attribute cache |
| Total stale detection | ~3 minutes | ~9 seconds | 20x faster |
Component 3: Shared Terraform Module (modules/kubernetes/nfs_volume/)
Creates a PV + PVC pair for each NFS mount point. Hides boilerplate.
Interface:
module "nfs_data" {
source = "../../modules/kubernetes/nfs_volume"
name = "myservice-data" # PV and PVC name (must be unique cluster-wide)
namespace = "myservice" # PVC namespace
nfs_server = var.nfs_server # From terraform.tfvars
nfs_path = "/mnt/main/myservice" # NFS export path
# Optional:
# storage = "10Gi" # Default: 10Gi (informational for NFS)
# access_modes = ["ReadWriteMany"] # Default: RWX
}
Outputs:
- `claim_name`: the PVC name to reference in the pod spec

Module creates:
- `kubernetes_persistent_volume`: CSI-backed, references the StorageClass mount options
- `kubernetes_persistent_volume_claim`: bound to the PV, namespaced

PVs are cluster-scoped, so `name` must be globally unique. Convention: `<service>-<purpose>` (e.g., `openclaw-tools`, `privatebin-data`).
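Internally, the module is little more than those two resources wired together. A minimal sketch, assuming the Terraform Kubernetes provider; the `volume_handle` convention (reusing the PV name) and the variable defaults are assumptions, not a definitive implementation.
```hcl
# modules/kubernetes/nfs_volume/main.tf (sketch)
resource "kubernetes_persistent_volume" "this" {
  metadata { name = var.name }
  spec {
    capacity                         = { storage = var.storage } # informational for NFS
    access_modes                     = var.access_modes
    storage_class_name               = "nfs-truenas"             # picks up the soft mount options
    persistent_volume_reclaim_policy = "Retain"
    persistent_volume_source {
      csi {
        driver        = "nfs.csi.k8s.io"
        volume_handle = var.name # must be unique; assumption: reuse the PV name
        volume_attributes = {
          server = var.nfs_server
          share  = var.nfs_path
        }
      }
    }
  }
}

resource "kubernetes_persistent_volume_claim" "this" {
  metadata {
    name      = var.name
    namespace = var.namespace
  }
  spec {
    access_modes       = var.access_modes
    storage_class_name = "nfs-truenas"
    volume_name        = kubernetes_persistent_volume.this.metadata[0].name # static bind to the PV
    resources { requests = { storage = var.storage } }
  }
}

output "claim_name" {
  value = kubernetes_persistent_volume_claim.this.metadata[0].name
}
```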
Component 4: Stack Migration (Mechanical Change)
Each stack changes from:
```hcl
# OLD: inline NFS
volume {
  name = "data"
  nfs {
    server = var.nfs_server
    path   = "/mnt/main/myservice"
  }
}
```
To:
```hcl
# NEW: module call (outside pod spec)
module "nfs_data" {
  source     = "../../modules/kubernetes/nfs_volume"
  name       = "myservice-data"
  namespace  = "myservice"
  nfs_server = var.nfs_server
  nfs_path   = "/mnt/main/myservice"
}

# NEW: PVC reference (in pod spec, replaces the nfs {} block)
volume {
  name = "data"
  persistent_volume_claim {
    claim_name = module.nfs_data.claim_name
  }
}
```
Volume mount blocks (`volume_mount {}`) are completely unchanged.
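For reference, a typical mount block that stays exactly as-is (the mount path here is illustrative; real paths vary per service):
```hcl
volume_mount {
  name       = "data" # matches the volume name above
  mount_path = "/data"
}
```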
Component 5: Platform Module Migration
Platform modules (redis, dbaas, monitoring, etc.) that use NFS follow the same pattern, but the module path is `../../../modules/kubernetes/nfs_volume` (one extra level deep). The `nfs_server` variable is already passed through `stacks/platform/main.tf`.
Some platform modules use explicit PV/PVC already (Loki, Prometheus). These get updated to use the CSI driver backend instead of the native NFS PV source.
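A sketch of the same module call from inside a platform module; the redis name and export path are illustrative:
```hcl
# Inside stacks/platform/modules/redis/ (illustrative service)
module "nfs_data" {
  source     = "../../../modules/kubernetes/nfs_volume" # one extra ../ vs service stacks
  name       = "redis-data"
  namespace  = "redis"
  nfs_server = var.nfs_server # already passed through stacks/platform/main.tf
  nfs_path   = "/mnt/main/redis"
}
```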
What Does NOT Change
- NFS export paths on TrueNAS (no `nfs_directories.txt` changes)
- NFS server configuration
- Volume mount paths inside containers
- Sub-path usage patterns
- Container images or application config
- Services that will move to local disk later (per storage-reliability design): they get CSI mounts as an interim improvement, then move off NFS entirely
Migration Order
Services grouped by risk. Each batch: apply → verify pods running → verify app accessible → next batch.
Phase 0: Infrastructure
- Deploy the NFS CSI driver Helm chart (platform module)
- Create the `nfs-truenas` StorageClass
- Create the `modules/kubernetes/nfs_volume/` shared module
Phase 1: Low-Risk Pilot (3 services)
Pick 3 simple, single-volume services to validate the pattern:
- `privatebin` (1 volume, low traffic)
- `echo`: actually stateless, so skip it. Use `resume` instead (1 volume, personal site)
- `speedtest` (1 volume, low traffic)
Phase 2: Simple Services (single NFS volume each, ~20 services)
Mechanical migration of all single-volume stacks. Can be parallelized.
Phase 3: Multi-Volume Services (~15 services)
Services with 2-4 NFS volumes (openclaw, servarr, immich, etc.). More module calls but same pattern.
Phase 4: Platform Modules (~9 modules)
Monitoring stack, Redis, dbaas PVs, etc. These live in `stacks/platform/modules/` and need the module path adjusted.
Phase 5: Cleanup
- Update CLAUDE.md documentation (new NFS volume pattern)
- Update the `setup-project` skill to use the module pattern for new services
- Verify all services healthy
Rollback
Per-service rollback: revert the stack to the inline `nfs {}` block and `terragrunt apply`. The data never moved; it's the same NFS path. The PV/PVC objects get destroyed by Terraform, and the pod remounts the inline volume. Takes about 1 minute per service.
Full rollback: remove CSI driver and StorageClass from platform stack, revert all stacks. No data impact.
Risks
- `soft` mount I/O errors: Apps that don't handle I/O errors gracefully may crash instead of hanging. This is strictly better: a crash triggers a restart with a fresh mount, versus hanging forever. But some apps may log noisy errors during brief NFS blips.
- PV naming conflicts: PV names are cluster-global, so uniqueness must be ensured. The `<service>-<purpose>` convention handles this.
- Terraform state churn: Each service gains 2 new resources (PV + PVC) and loses the inline volume (implicit, not tracked). The `terragrunt apply` will show resource additions but no deletions (inline volumes aren't separate Terraform resources). The pod will be recreated.
- CSI driver resource overhead: ~50MB RAM + 10m CPU per node (5 nodes = ~250MB cluster-wide). Acceptable.
Success Criteria
- NFS CSI driver deployed and healthy on all 5 nodes
- `nfs-truenas` StorageClass created with soft mount options
- `modules/kubernetes/nfs_volume/` module created and tested
- All 56 NFS-dependent services migrated from inline volumes to PV/PVC
- No service downtime beyond a single pod restart during migration
- Simulated NFS outage (TrueNAS NFS service pause) results in a pod restart (not a hang)
- Documentation and skills updated for the new pattern