infra/docs/plans/2026-03-01-nfs-csi-migration-design.md

# NFS CSI Driver Migration: Inline Volumes → PV/PVC with Soft Mounts

**Date**: 2026-03-01
**Status**: Draft
**Complements**: `2026-02-28-storage-reliability-design.md` (databases → local disk)
**Goal**: Eliminate stale NFS mount hangs, add mount health checking, and create a storage abstraction layer for all NFS-dependent services

## Problem

56 services use inline NFS volumes (`nfs {}` in pod specs). This pattern has three compounding issues:

1. **Stale mounts hang forever**: Inline NFS defaults to `hard,timeo=600` mount options. When TrueNAS is unreachable (reboot, network blip, NFS export change), the kernel retries indefinitely. Pods show `Running 1/1` but are completely frozen with zero listening sockets. The only fix is force-deleting the pod.

2. **No mount health checking**: kubelet has no visibility into NFS mount health. Liveness probes only check application health, not filesystem access. A stale mount is invisible to the scheduler.

3. **No storage abstraction**: Every pod spec wires in the NFS server address (via `var.nfs_server`) and a literal export path. Changing the backend (different NFS server, different protocol) requires editing all 56 stacks.

## Constraints

- Zero data migration — same NFS paths, same TrueNAS server, same directories
- Services must keep working during migration (no downtime per service beyond a pod restart)
- Must work with existing Terragrunt architecture (per-stack state isolation)
- Must not break services that will later move to local disk (per storage-reliability design)

## Design

### Architecture

```
BEFORE:
  Pod spec → inline nfs {} block → kubelet mount -t nfs (hard,timeo=600) → TrueNAS
  (no health check, hangs on stale mount, server IP in every stack)

AFTER:
  Terraform module → PV (CSI driver ref) + PVC → Pod spec references PVC
  CSI driver mounts with soft,timeo=30,retrans=3 → TrueNAS
  (health-checked, fails fast on stale mount, server IP in module only)
```

### Component 1: NFS CSI Driver (Helm chart in platform stack)

Deploy `csi-driver-nfs` v4.11+ via Helm in `stacks/platform/modules/nfs-csi/`.

The driver runs as:
- **Controller**: 1 replica (handles PV provisioning)
- **Node DaemonSet**: 1 per node (handles mount/unmount operations)

Resource footprint: ~50MB RAM per node, ~10m CPU idle.

The driver itself does not change NFS behavior — it delegates to the kernel NFS client. The value is:
- Mount options are configurable per StorageClass or per PV (not hardcoded kernel defaults)
- CSI health checking can detect unhealthy volumes
- Standard K8s storage API (PV/PVC/StorageClass) instead of inline volumes
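
A minimal sketch of the Helm deployment, not a definitive implementation. The `kube-system` namespace, the `chart_version` variable, and the `controller.replicas` value key are assumptions that do not come from this design; the repository URL is the one published in the upstream project README and should be verified before use.

```hcl
# Hypothetical file: stacks/platform/modules/nfs-csi/main.tf

variable "chart_version" {
  description = "csi-driver-nfs chart version to pin (this design calls for v4.11 or newer)"
  type        = string
}

resource "helm_release" "csi_driver_nfs" {
  name      = "csi-driver-nfs"
  namespace = "kube-system" # assumption: the design does not name a namespace

  # Chart repository as listed in the upstream README; verify before use.
  repository = "https://raw.githubusercontent.com/kubernetes-csi/csi-driver-nfs/master/charts"
  chart      = "csi-driver-nfs"
  version    = var.chart_version

  # Assumption: the chart exposes controller.replicas; check its values.yaml.
  values = [yamlencode({
    controller = { replicas = 1 }
  })]
}
```

Passing values through `yamlencode` rather than `set` keeps the sketch independent of the helm provider's major version.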

### Component 2: StorageClass

```hcl
resource "kubernetes_storage_class" "nfs_truenas" {
  metadata {
    name = "nfs-truenas"
  }

  storage_provisioner = "nfs.csi.k8s.io"
  reclaim_policy      = "Retain"
  volume_binding_mode = "Immediate"

  mount_options = [
    "soft",      # Return -EIO instead of hanging forever
    "timeo=30",  # 3-second timeout per NFS RPC call (timeo is in tenths of a second)
    "retrans=3", # Retry 3 times before giving up (~9 sec total)
    "actimeo=5", # 5-second attribute cache (balance freshness vs perf)
  ]

  parameters = {
    server = var.nfs_server
    share  = "/mnt/main"
  }
}
```

Key mount option differences vs current defaults:

| Option | Current (inline) | New (CSI) | Effect |
|--------|------------------|-----------|--------|
| `hard` vs `soft` | `hard` (default) | `soft` | I/O errors instead of infinite hang |
| `timeo` | 600 (60 sec) | 30 (3 sec) | Faster failure detection |
| `retrans` | 2 (kernel default over TCP) | 3 | Comparable retry count, but each attempt starts at 3 sec instead of 60 sec |
| `actimeo` | unset (kernel defaults: 3-60 sec for files, 30-60 sec for directories) | 5 (5 sec) | Attribute cache lifetime fixed at 5 sec |
| Total stale detection (nominal) | **~3 minutes**, then keeps retrying | **~9 seconds**, then returns an error | Roughly 20x faster |

### Component 3: Shared Terraform Module (`modules/kubernetes/nfs_volume/`)

Creates a PV + PVC pair for each NFS mount point. Hides boilerplate.

**Interface**:
```hcl
module "nfs_data" {
  source     = "../../modules/kubernetes/nfs_volume"
  name       = "myservice-data"       # PV and PVC name (must be unique cluster-wide)
  namespace  = "myservice"            # PVC namespace
  nfs_server = var.nfs_server         # From terraform.tfvars
  nfs_path   = "/mnt/main/myservice"  # NFS export path
  # Optional:
  # storage      = "10Gi"             # Default: 10Gi (informational for NFS)
  # access_modes = ["ReadWriteMany"]  # Default: RWX
}
```

**Outputs**:
- `claim_name` — PVC name to reference in pod spec

**Module creates**:
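1. `kubernetes_persistent_volume`: CSI-backed, carrying the same soft-mount options as the StorageClass. The options must be set on the PV itself, because Kubernetes applies a StorageClass's `mountOptions` only to dynamically provisioned volumes, not to pre-created PVs.
2. `kubernetes_persistent_volume_claim`: namespaced, bound to the PV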

PVs are cluster-scoped, so `name` must be globally unique. Convention: `<service>-<purpose>` (e.g., `openclaw-tools`, `privatebin-data`).
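
The module body might look roughly like the sketch below. It is written to match the interface above, not taken from an existing implementation. The resource label `this`, the reuse of `name` as the CSI `volume_handle`, and the hardcoded class name are illustrative assumptions.

```hcl
# Hypothetical modules/kubernetes/nfs_volume/main.tf

variable "name" { type = string }
variable "namespace" { type = string }
variable "nfs_server" { type = string }
variable "nfs_path" { type = string }

variable "storage" {
  type    = string
  default = "10Gi" # informational for NFS
}

variable "access_modes" {
  type    = list(string)
  default = ["ReadWriteMany"]
}

resource "kubernetes_persistent_volume" "this" {
  metadata {
    name = var.name
  }
  spec {
    capacity                         = { storage = var.storage }
    access_modes                     = var.access_modes
    persistent_volume_reclaim_policy = "Retain"
    storage_class_name               = "nfs-truenas"
    # Set on the PV directly: a pre-created PV does not inherit the class's options.
    mount_options = ["soft", "timeo=30", "retrans=3", "actimeo=5"]

    persistent_volume_source {
      csi {
        driver        = "nfs.csi.k8s.io"
        volume_handle = var.name # assumption: any cluster-unique string works here
        volume_attributes = {
          server = var.nfs_server
          share  = var.nfs_path
        }
      }
    }
  }
}

resource "kubernetes_persistent_volume_claim" "this" {
  metadata {
    name      = var.name
    namespace = var.namespace
  }
  spec {
    access_modes       = var.access_modes
    storage_class_name = "nfs-truenas"
    # Pin the claim to the PV created above instead of relying on matching.
    volume_name = kubernetes_persistent_volume.this.metadata[0].name
    resources {
      requests = { storage = var.storage }
    }
  }
}

output "claim_name" {
  value = kubernetes_persistent_volume_claim.this.metadata[0].name
}
```

Setting `volume_name` on the claim binds it to the intended PV, which matters when several PVs share the same class and access modes.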

### Component 4: Stack Migration (Mechanical Change)

Each stack changes from:
```hcl
# OLD: inline NFS
volume {
  name = "data"
  nfs {
    server = var.nfs_server
    path   = "/mnt/main/myservice"
  }
}
```

To:
```hcl
# NEW: module call (outside pod spec)
module "nfs_data" {
  source     = "../../modules/kubernetes/nfs_volume"
  name       = "myservice-data"
  namespace  = "myservice"
  nfs_server = var.nfs_server
  nfs_path   = "/mnt/main/myservice"
}

# NEW: PVC reference (in pod spec, replaces nfs {} block)
volume {
  name = "data"
  persistent_volume_claim {
    claim_name = module.nfs_data.claim_name
  }
}
```

Volume mount blocks (`volume_mount {}`) are **completely unchanged**.

### Component 5: Platform Module Migration

Platform modules (redis, dbaas, monitoring, etc.) that use NFS follow the same pattern but the module path is `../../../modules/kubernetes/nfs_volume` (one extra level deep). The `nfs_server` variable is already passed through `stacks/platform/main.tf`.

Some platform modules use explicit PV/PVC already (Loki, Prometheus). These get updated to use the CSI driver backend instead of the native NFS PV source.
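
For those explicit PVs, the change is confined to the volume source block. The fragment below is a sketch only: the resource label, PV name, size, and export path are hypothetical and do not reflect the actual Loki or Prometheus definitions.

```hcl
# Hypothetical existing PV in a platform module (names and path are illustrative)
resource "kubernetes_persistent_volume" "loki" {
  metadata {
    name = "loki-data"
  }
  spec {
    capacity                         = { storage = "10Gi" }
    access_modes                     = ["ReadWriteMany"]
    persistent_volume_reclaim_policy = "Retain"
    mount_options                    = ["soft", "timeo=30", "retrans=3", "actimeo=5"]

    persistent_volume_source {
      # BEFORE: native NFS source
      # nfs {
      #   server = var.nfs_server
      #   path   = "/mnt/main/loki"
      # }

      # AFTER: same export, mounted through the CSI driver
      csi {
        driver        = "nfs.csi.k8s.io"
        volume_handle = "loki-data"
        volume_attributes = {
          server = var.nfs_server
          share  = "/mnt/main/loki"
        }
      }
    }
  }
}
```

Kubernetes treats a PV's volume source as immutable after creation, so this change will likely replace the PV rather than update it in place, and the bound PVC may need to be recreated alongside it. With the `Retain` policy the data on the export is not affected.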

### What Does NOT Change

- NFS export paths on TrueNAS (no `nfs_directories.txt` changes)
- NFS server configuration
- Volume mount paths inside containers
- Sub-path usage patterns
- Container images or application config
- Services that will move to local disk later (per storage-reliability design) — they get CSI mounts as an interim improvement, then move off NFS entirely

## Migration Order

Services grouped by risk. Each batch: apply → verify pods running → verify app accessible → next batch.

### Phase 0: Infrastructure
1. Deploy NFS CSI driver Helm chart (platform module)
2. Create `nfs-truenas` StorageClass
3. Create `modules/kubernetes/nfs_volume/` shared module

### Phase 1: Low-Risk Pilot (3 services)
Pick 3 simple, single-volume services to validate the pattern:
- `privatebin` (1 volume, low traffic)
- `resume` (1 volume, personal site), chosen instead of `echo`, which is stateless and needs no migration
- `speedtest` (1 volume, low traffic)

### Phase 2: Simple Services (single NFS volume each, ~20 services)
Mechanical migration of all single-volume stacks. Can be parallelized.

### Phase 3: Multi-Volume Services (~15 services)
Services with 2-4 NFS volumes (openclaw, servarr, immich, etc.). More module calls but same pattern.
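
One way to keep multi-volume stacks compact is a single module call with `for_each`, sketched below. The volume keys and paths are hypothetical, and the approach assumes a Terraform version that supports `for_each` on modules (0.13 or later).

```hcl
# Hypothetical multi-volume stack: one map entry per NFS mount
locals {
  nfs_volumes = {
    config = "/mnt/main/myservice/config"
    media  = "/mnt/main/myservice/media"
  }
}

module "nfs" {
  for_each   = local.nfs_volumes
  source     = "../../modules/kubernetes/nfs_volume"
  name       = "myservice-${each.key}" # follows the <service>-<purpose> convention
  namespace  = "myservice"
  nfs_server = var.nfs_server
  nfs_path   = each.value
}

# In the pod spec, replacing the individual nfs {} volume blocks:
dynamic "volume" {
  for_each = module.nfs
  content {
    name = volume.key
    persistent_volume_claim {
      claim_name = volume.value.claim_name
    }
  }
}
```

The volume names come from the map keys, so existing `volume_mount {}` blocks stay unchanged as long as the keys match the current volume names.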

### Phase 4: Platform Modules (~9 modules)
Monitoring stack, Redis, dbaas PVs, etc. These live in `stacks/platform/modules/` and need the module path adjusted.

### Phase 5: Cleanup
- Update CLAUDE.md documentation (new NFS volume pattern)
- Update `setup-project` skill to use module pattern for new services
- Verify all services healthy

## Rollback

Per-service rollback: revert the stack to inline `nfs {}` and run `terragrunt apply`. The data never moved; it is the same NFS path. Terraform destroys the PV/PVC objects (with the `Retain` reclaim policy this leaves the data on the export untouched) and the pod remounts inline. Takes about 1 minute per service.

Full rollback: remove CSI driver and StorageClass from platform stack, revert all stacks. No data impact.

## Risks

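1. **`soft` mount I/O errors**: Apps that don't handle I/O errors gracefully may crash instead of hanging. For most services this is preferable: a crash is visible to restart policy and probes, whereas a hung process is not. Two caveats apply. First, the `nfs(5)` man page warns that soft timeouts can cause silent data corruption when a write is interrupted, so write-heavy services need extra care (databases move to local disk per the storage-reliability design). Second, some apps may log noisy errors during brief NFS blips.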

2. **PV naming conflicts**: PV names are cluster-global. Must ensure uniqueness. Convention `<service>-<purpose>` handles this.

3. **Terraform state churn**: Each service gains 2 new resources per NFS volume (PV + PVC). The inline volume it replaces was never a separate Terraform resource, so `terragrunt apply` will show resource additions but no deletions. The pod will be recreated.

4. **CSI driver resource overhead**: ~50MB RAM + 10m CPU per node (5 nodes = ~250MB cluster-wide). Acceptable.

## Success Criteria

- [ ] NFS CSI driver deployed and healthy on all 5 nodes
- [ ] `nfs-truenas` StorageClass created with soft mount options
- [ ] `modules/kubernetes/nfs_volume/` module created and tested
- [ ] All 56 NFS-dependent services migrated from inline to PV/PVC
- [ ] No service downtime beyond a single pod restart during migration
- [ ] Simulated NFS outage (TrueNAS NFS service pause) results in pod restart (not hang)
- [ ] Documentation and skills updated for new pattern