# NFS CSI Driver Migration: Inline Volumes → PV/PVC with Soft Mounts
**Date**: 2026-03-01
**Status**: Draft
**Complements**: `2026-02-28-storage-reliability-design.md` (databases → local disk)
**Goal**: Eliminate stale NFS mount hangs, add mount health checking, and create a storage abstraction layer for all NFS-dependent services
## Problem
56 services use inline NFS volumes (`nfs {}` in pod specs). This pattern has three compounding issues:
1. **Stale mounts hang forever**: Inline NFS volumes cannot specify mount options, so they get the kernel defaults (`hard,timeo=600`). When TrueNAS is unreachable (reboot, network blip, NFS export change), the kernel retries indefinitely. Pods show `Running 1/1` but are completely frozen with zero listening sockets. The only fix is force-deleting the pod.
2. **No mount health checking**: kubelet has no visibility into NFS mount health. Liveness probes only check application health, not filesystem access. A stale mount is invisible to Kubernetes, so nothing restarts the affected pod.
3. **No storage abstraction**: NFS server IP and export paths are hardcoded into every pod spec via `var.nfs_server`. Changing the backend (different NFS server, different protocol) requires editing 56 stacks.
## Constraints
- Zero data migration — same NFS paths, same TrueNAS server, same directories
- Services must keep working during migration (no downtime per service beyond a pod restart)
- Must work with existing Terragrunt architecture (per-stack state isolation)
- Must not break services that will later move to local disk (per storage-reliability design)
## Design
### Architecture
```
BEFORE:
  Pod spec → inline nfs {} block → kubelet mount -t nfs (hard,timeo=600) → TrueNAS
  (no health check, hangs on stale mount, server IP in every stack)

AFTER:
  Terraform module → PV (CSI driver ref) + PVC → Pod spec references PVC
  CSI driver mounts with soft,timeo=30,retrans=3 → TrueNAS
  (health-checked, fails fast on stale mount, server IP in module only)
```
### Component 1: NFS CSI Driver (Helm chart in platform stack)
Deploy `csi-driver-nfs` v4.11+ via Helm in `stacks/platform/modules/nfs-csi/`.
The driver runs as:
- **Controller**: 1 replica (handles PV provisioning)
- **Node DaemonSet**: 1 per node (handles mount/unmount operations)
Resource footprint: ~50MB RAM per node, ~10m CPU idle.
The driver itself does not change NFS behavior — it delegates to the kernel NFS client. The value is:
- Mount options can be set per StorageClass or per PV (inline `nfs {}` volumes accept none and always get the kernel defaults)
- CSI health checking can detect unhealthy volumes
- Standard K8s storage API (PV/PVC/StorageClass) instead of inline volumes
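
A minimal sketch of how the chart might be declared in `stacks/platform/modules/nfs-csi/`, assuming the Terraform `helm` provider is already configured in the platform stack. The release name, the `kube-system` namespace, the `csi_driver_nfs_version` variable, and the `controller.replicas` value are illustrative assumptions, not settled choices. The repository URL is the one given in the upstream install instructions and should be checked against the version that gets pinned.

```hcl
# Hypothetical variable: pin the exact chart version chosen for rollout (4.11 or later).
variable "csi_driver_nfs_version" {
  type        = string
  description = "csi-driver-nfs Helm chart version to deploy"
}

resource "helm_release" "csi_driver_nfs" {
  name       = "csi-driver-nfs"
  namespace  = "kube-system" # assumption: the namespace the upstream docs install into
  repository = "https://raw.githubusercontent.com/kubernetes-csi/csi-driver-nfs/master/charts"
  chart      = "csi-driver-nfs"
  version    = var.csi_driver_nfs_version

  # Assumed values key; verify against the chart's values.yaml for the pinned version.
  values = [yamlencode({
    controller = {
      replicas = 1 # matches the single controller replica described above
    }
  })]
}
```

Passing values through `yamlencode` rather than `set` blocks keeps the sketch independent of the `helm` provider's major version.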
### Component 2: StorageClass
```hcl
resource "kubernetes_storage_class" "nfs_truenas" {
  metadata { name = "nfs-truenas" }

  # The kubernetes provider names this argument storage_provisioner
  storage_provisioner = "nfs.csi.k8s.io"
  reclaim_policy      = "Retain"
  volume_binding_mode = "Immediate"

  mount_options = [
    "soft",      # Return -EIO instead of hanging forever
    "timeo=30",  # 3-second timeout per NFS RPC call (timeo is in tenths of a second)
    "retrans=3", # Retry 3 times before giving up (~9 sec total)
    "actimeo=5", # 5-second attribute cache (balance freshness vs perf)
  ]

  parameters = {
    server = var.nfs_server
    share  = "/mnt/main"
  }
}
```
Key mount option differences vs current defaults:
| Option | Current (inline) | New (CSI) | Effect |
|--------|-----------------|-----------|--------|
| `hard` vs `soft` | `hard` (default) | `soft` | I/O errors instead of infinite hang |
| `timeo` | 600 (60 sec) | 30 (3 sec) | Faster failure detection |
| `retrans` | 2 (kernel default over TCP) | 3 | One more retry, but starting from a 3 sec timeout instead of 60 sec |
| `actimeo` | unset (kernel defaults: 3-60 sec for files, 30-60 sec for directories) | 5 (5 sec) | Attribute cache lifetime fixed at 5 sec |
| Total stale detection | **~3 minutes** | **~9 seconds** | 20x faster |
### Component 3: Shared Terraform Module (`modules/kubernetes/nfs_volume/`)
Creates a PV + PVC pair for each NFS mount point. Hides boilerplate.
**Interface**:
```hcl
module "nfs_data" {
  source     = "../../modules/kubernetes/nfs_volume"
  name       = "myservice-data"      # PV and PVC name (must be unique cluster-wide)
  namespace  = "myservice"           # PVC namespace
  nfs_server = var.nfs_server        # From terraform.tfvars
  nfs_path   = "/mnt/main/myservice" # NFS export path

  # Optional:
  # storage      = "10Gi"            # Default: 10Gi (informational for NFS)
  # access_modes = ["ReadWriteMany"] # Default: RWX
}
```
**Outputs**:
- `claim_name` — PVC name to reference in pod spec
**Module creates**:
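1. `kubernetes_persistent_volume`: CSI-backed, with the same mount options as the StorageClass set directly on the PV (statically created PVs do not inherit `mountOptions` from their StorageClass)
2. `kubernetes_persistent_volume_claim`: bound to the PV, namespaced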
PVs are cluster-scoped, so `name` must be globally unique. Convention: `<service>-<purpose>` (e.g., `openclaw-tools`, `privatebin-data`).
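
The module body could look roughly like the sketch below. It is an illustration of the interface above, not the final implementation. Kubernetes applies a StorageClass's `mountOptions` only to dynamically provisioned volumes, so the sketch repeats the options on the PV. The resource label `this`, the use of `var.name` as the CSI `volume_handle`, and the hardcoded class name are assumptions.

```hcl
variable "name" { type = string }
variable "namespace" { type = string }
variable "nfs_server" { type = string }
variable "nfs_path" { type = string }

variable "storage" {
  type    = string
  default = "10Gi"
}

variable "access_modes" {
  type    = list(string)
  default = ["ReadWriteMany"]
}

resource "kubernetes_persistent_volume" "this" {
  metadata {
    name = var.name
  }

  spec {
    capacity                         = { storage = var.storage }
    access_modes                     = var.access_modes
    persistent_volume_reclaim_policy = "Retain"
    storage_class_name               = "nfs-truenas"

    # Static PVs do not inherit mountOptions from the StorageClass, so repeat them here.
    mount_options = ["soft", "timeo=30", "retrans=3", "actimeo=5"]

    persistent_volume_source {
      csi {
        driver        = "nfs.csi.k8s.io"
        volume_handle = var.name # assumption: any cluster-unique ID is acceptable
        volume_attributes = {
          server = var.nfs_server
          share  = var.nfs_path
        }
      }
    }
  }
}

resource "kubernetes_persistent_volume_claim" "this" {
  metadata {
    name      = var.name
    namespace = var.namespace
  }

  spec {
    access_modes       = var.access_modes
    storage_class_name = "nfs-truenas"
    # Pin the claim to the PV created above instead of relying on class matching.
    volume_name = kubernetes_persistent_volume.this.metadata[0].name

    resources {
      requests = { storage = var.storage }
    }
  }
}

output "claim_name" {
  value = kubernetes_persistent_volume_claim.this.metadata[0].name
}
```

Setting `volume_name` on the claim binds it to its own PV, so two services with the same size and access mode cannot end up with each other's volume.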
### Component 4: Stack Migration (Mechanical Change)
Each stack changes from:
```hcl
# OLD: inline NFS
volume {
  name = "data"
  nfs {
    server = var.nfs_server
    path   = "/mnt/main/myservice"
  }
}
```
To:
```hcl
# NEW: module call (outside pod spec)
module "nfs_data" {
  source     = "../../modules/kubernetes/nfs_volume"
  name       = "myservice-data"
  namespace  = "myservice"
  nfs_server = var.nfs_server
  nfs_path   = "/mnt/main/myservice"
}

# NEW: PVC reference (in pod spec, replaces nfs {} block)
volume {
  name = "data"
  persistent_volume_claim {
    claim_name = module.nfs_data.claim_name
  }
}
```
Volume mount blocks (`volume_mount {}`) are **completely unchanged**.
### Component 5: Platform Module Migration
Platform modules (redis, dbaas, monitoring, etc.) that use NFS follow the same pattern but the module path is `../../../modules/kubernetes/nfs_volume` (one extra level deep). The `nfs_server` variable is already passed through `stacks/platform/main.tf`.
Some platform modules use explicit PV/PVC already (Loki, Prometheus). These get updated to use the CSI driver backend instead of the native NFS PV source.
### What Does NOT Change
- NFS export paths on TrueNAS (no `nfs_directories.txt` changes)
- NFS server configuration
- Volume mount paths inside containers
- Sub-path usage patterns
- Container images or application config
- Services that will move to local disk later (per storage-reliability design) — they get CSI mounts as an interim improvement, then move off NFS entirely
## Migration Order
Services grouped by risk. Each batch: apply → verify pods running → verify app accessible → next batch.
### Phase 0: Infrastructure
1. Deploy NFS CSI driver Helm chart (platform module)
2. Create `nfs-truenas` StorageClass
3. Create `modules/kubernetes/nfs_volume/` shared module
### Phase 1: Low-Risk Pilot (3 services)
Pick 3 simple, single-volume services to validate the pattern:
- `privatebin` (1 volume, low traffic)
- `resume` (1 volume, personal site), chosen in place of `echo`, which is stateless and has nothing to migrate
- `speedtest` (1 volume, low traffic)
### Phase 2: Simple Services (single NFS volume each, ~20 services)
Mechanical migration of all single-volume stacks. Can be parallelized.
### Phase 3: Multi-Volume Services (~15 services)
Services with 2-4 NFS volumes (openclaw, servarr, immich, etc.). More module calls but same pattern.
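
For these stacks, one way to avoid repeating the module call per volume is `for_each` on the module (Terraform 0.13 or later). This is a sketch only: the `config` and `media` sub-paths and the `nfs_volumes` local are hypothetical, and the `dynamic` block is a fragment that belongs inside the existing pod spec.

```hcl
locals {
  # Hypothetical volume map: key = volume name in the pod spec, value = NFS export path.
  nfs_volumes = {
    config = "/mnt/main/myservice/config"
    media  = "/mnt/main/myservice/media"
  }
}

module "nfs" {
  for_each   = local.nfs_volumes
  source     = "../../modules/kubernetes/nfs_volume"
  name       = "myservice-${each.key}" # follows the <service>-<purpose> convention
  namespace  = "myservice"
  nfs_server = var.nfs_server
  nfs_path   = each.value
}

# In the pod spec: one volume per module instance, replacing the nfs {} blocks.
dynamic "volume" {
  for_each = module.nfs
  content {
    name = volume.key
    persistent_volume_claim {
      claim_name = volume.value.claim_name
    }
  }
}
```

Because the map keys double as volume names, existing `volume_mount {}` blocks stay unchanged as long as the keys match the names already in use.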
### Phase 4: Platform Modules (~9 modules)
Monitoring stack, Redis, dbaas PVs, etc. These live in `stacks/platform/modules/` and need the module path adjusted.
### Phase 5: Cleanup
- Update CLAUDE.md documentation (new NFS volume pattern)
- Update `setup-project` skill to use module pattern for new services
- Verify all services healthy
## Rollback
Per-service rollback: revert the stack to inline `nfs {}` and run `terragrunt apply`. The data never moved; it is the same NFS path. Terraform destroys the PV/PVC objects and the pod remounts the export inline. This takes about 1 minute per service.
Full rollback: remove CSI driver and StorageClass from platform stack, revert all stacks. No data impact.
## Risks
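1. **`soft` mount I/O errors**: Apps that don't handle I/O errors gracefully may crash instead of hanging. For availability this is better: a crash triggers a restart with a fresh mount instead of a permanent hang. The trade-off is that a request that times out on a `soft` mount fails rather than being retried, so a write in flight during an outage can be lost. Some apps may also log noisy errors during brief NFS blips.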
2. **PV naming conflicts**: PV names are cluster-global. Must ensure uniqueness. Convention `<service>-<purpose>` handles this.
3. **Terraform state churn**: Each service gains 2 new resources (PV + PVC) and loses the inline volume (implicit, not tracked). The `terragrunt apply` will show resource additions but no deletions (inline volumes aren't separate TF resources). Pod will be recreated.
4. **CSI driver resource overhead**: ~50MB RAM + 10m CPU per node (5 nodes = ~250MB cluster-wide). Acceptable.
## Success Criteria
- [ ] NFS CSI driver deployed and healthy on all 5 nodes
- [ ] `nfs-truenas` StorageClass created with soft mount options
- [ ] `modules/kubernetes/nfs_volume/` module created and tested
- [ ] All 56 NFS-dependent services migrated from inline to PV/PVC
- [ ] No service downtime beyond a single pod restart during migration
- [ ] Simulated NFS outage (TrueNAS NFS service pause) results in pod restart (not hang)
- [ ] Documentation and skills updated for new pattern
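
The outage criterion assumes something notices the failed mount and restarts the pod. An app that crashes on `-EIO` does this by itself; for apps that keep running, one possible approach is an exec liveness probe that touches the mount. The fragment below is a sketch that belongs inside a `container {}` block. The `/data` mount path, the presence of `stat` in the image, and the timing values are all assumptions.

```hcl
# Inside container {}: fail the probe when the NFS mount stops answering.
liveness_probe {
  exec {
    # With soft mounts this returns an error once the RPC times out instead of hanging.
    command = ["stat", "/data"]
  }
  period_seconds    = 30
  timeout_seconds   = 10 # longer than the ~9 sec soft-mount timeout
  failure_threshold = 3
}
```

With `actimeo=5`, a `stat` can be answered from the attribute cache for up to 5 seconds after the server disappears, so the first probe after an outage may still pass.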