# NFS CSI Driver Migration Implementation Plan > **For Claude:** REQUIRED SUB-SKILL: Use superpowers:executing-plans to implement this plan task-by-task. **Goal:** Replace all inline NFS volumes with CSI-backed PV/PVC using soft mount options to eliminate stale mount hangs. **Architecture:** Deploy the NFS CSI driver as a platform Helm module, create a shared Terraform module for PV/PVC boilerplate, then mechanically migrate all 56 NFS-dependent services from inline `nfs {}` to `persistent_volume_claim {}` referencing the shared module. **Tech Stack:** csi-driver-nfs (Helm), Terraform/Terragrunt, Kubernetes PV/PVC/StorageClass **Design doc:** `docs/plans/2026-03-01-nfs-csi-migration-design.md` --- ## Task 1: Create the NFS CSI Driver Platform Module **Files:** - Create: `stacks/platform/modules/nfs-csi/main.tf` - Modify: `stacks/platform/main.tf` (add module block) **Step 1: Create the module directory** ```bash mkdir -p stacks/platform/modules/nfs-csi ``` **Step 2: Write the NFS CSI module** Create `stacks/platform/modules/nfs-csi/main.tf`: ```hcl variable "tier" { type = string } variable "nfs_server" { type = string } resource "kubernetes_namespace" "nfs_csi" { metadata { name = "nfs-csi" labels = { tier = var.tier } } } resource "helm_release" "nfs_csi_driver" { namespace = kubernetes_namespace.nfs_csi.metadata[0].name create_namespace = false name = "csi-driver-nfs" atomic = true timeout = 300 repository = "https://raw.githubusercontent.com/kubernetes-csi/csi-driver-nfs/master/charts" chart = "csi-driver-nfs" values = [yamlencode({ controller = { replicas = 1 resources = { requests = { cpu = "10m", memory = "32Mi" } limits = { cpu = "100m", memory = "128Mi" } } } node = { resources = { requests = { cpu = "10m", memory = "32Mi" } limits = { cpu = "100m", memory = "128Mi" } } } storageClass = { create = false # We create it ourselves below for full control } })] } resource "kubernetes_storage_class" "nfs_truenas" { metadata { name = "nfs-truenas" } storage_provisioner = "nfs.csi.k8s.io" reclaim_policy = "Retain" volume_binding_mode = "Immediate" mount_options = [ "soft", "timeo=30", "retrans=3", "actimeo=5", ] parameters = { server = var.nfs_server share = "/mnt/main" } } ``` **Step 3: Wire the module into `stacks/platform/main.tf`** Add after the `cnpg` module block (around line 318): ```hcl module "nfs-csi" { source = "./modules/nfs-csi" tier = local.tiers.cluster nfs_server = var.nfs_server } ``` **Step 4: Verify with plan** ```bash cd stacks/platform && terragrunt plan --non-interactive 2>&1 | head -80 ``` Expected: Plan shows 3 new resources (`kubernetes_namespace`, `helm_release`, `kubernetes_storage_class`). No changes to existing resources. **Step 5: Apply** ```bash cd stacks/platform && terragrunt apply --non-interactive ``` **Step 6: Verify CSI driver is running** ```bash kubectl --kubeconfig $(pwd)/config get pods -n nfs-csi kubectl --kubeconfig $(pwd)/config get storageclass nfs-truenas ``` Expected: Controller pod + node DaemonSet pods (5 total) all Running. StorageClass `nfs-truenas` exists with provisioner `nfs.csi.k8s.io`. **Step 7: Commit** ```bash git add stacks/platform/modules/nfs-csi/ stacks/platform/main.tf git commit -m "[ci skip] add NFS CSI driver platform module with nfs-truenas StorageClass" ``` --- ## Task 2: Create the Shared `nfs_volume` Module **Files:** - Create: `modules/kubernetes/nfs_volume/main.tf` **Step 1: Write the module** Create `modules/kubernetes/nfs_volume/main.tf`: ```hcl variable "name" { description = "Unique name for PV and PVC (convention: -)" type = string } variable "namespace" { description = "Kubernetes namespace for the PVC" type = string } variable "nfs_server" { description = "NFS server address" type = string } variable "nfs_path" { description = "NFS export path (e.g. /mnt/main/myservice)" type = string } variable "storage" { description = "Storage capacity (informational for NFS)" type = string default = "10Gi" } variable "access_modes" { description = "PV/PVC access modes" type = list(string) default = ["ReadWriteMany"] } resource "kubernetes_persistent_volume" "this" { metadata { name = var.name } spec { capacity = { storage = var.storage } access_modes = var.access_modes persistent_volume_reclaim_policy = "Retain" storage_class_name = "nfs-truenas" volume_mode = "Filesystem" persistent_volume_source { csi { driver = "nfs.csi.k8s.io" volume_handle = var.name volume_attributes = { server = var.nfs_server share = var.nfs_path } } } } } resource "kubernetes_persistent_volume_claim" "this" { metadata { name = var.name namespace = var.namespace } spec { access_modes = var.access_modes storage_class_name = "nfs-truenas" volume_name = kubernetes_persistent_volume.this.metadata[0].name resources { requests = { storage = var.storage } } } } output "claim_name" { description = "PVC name to use in pod spec persistent_volume_claim blocks" value = kubernetes_persistent_volume_claim.this.metadata[0].name } ``` **Step 2: Format** ```bash terraform fmt modules/kubernetes/nfs_volume/main.tf ``` **Step 3: Commit** ```bash git add modules/kubernetes/nfs_volume/ git commit -m "[ci skip] add shared nfs_volume module for CSI-backed PV/PVC creation" ``` --- ## Task 3: Pilot Migration — `privatebin` **Files:** - Modify: `stacks/privatebin/main.tf` This is the first real migration. Validates the pattern end-to-end. **Step 1: Read current state** Current NFS volume in `stacks/privatebin/main.tf`: ```hcl # Lines 71-77 — volume block in pod spec volume { name = "data" nfs { path = "/mnt/main/privatebin" server = var.nfs_server } } ``` Volume mount (lines 54-58, UNCHANGED): ```hcl volume_mount { name = "data" mount_path = "/srv/data" sub_path = "data" } ``` **Step 2: Add module call** Add before the `kubernetes_deployment` resource (e.g., after the ingress_factory module, before the deployment): ```hcl module "nfs_data" { source = "../../modules/kubernetes/nfs_volume" name = "privatebin-data" namespace = kubernetes_namespace.privatebin.metadata[0].name nfs_server = var.nfs_server nfs_path = "/mnt/main/privatebin" } ``` **Step 3: Replace inline NFS volume with PVC reference** Replace the volume block (lines 71-77): ```hcl # OLD: volume { name = "data" nfs { path = "/mnt/main/privatebin" server = var.nfs_server } } # NEW: volume { name = "data" persistent_volume_claim { claim_name = module.nfs_data.claim_name } } ``` Do NOT touch the `volume_mount` block — it stays identical. **Step 4: Plan and verify** ```bash cd stacks/privatebin && terragrunt plan --non-interactive ``` Expected: 2 resources added (PV + PVC), deployment updated in-place (volume source changed). No resources destroyed (inline volumes aren't tracked as separate TF resources). **Step 5: Apply** ```bash cd stacks/privatebin && terragrunt apply --non-interactive ``` **Step 6: Verify the pod is running with CSI mount** ```bash kubectl --kubeconfig $(pwd)/config get pods -n privatebin kubectl --kubeconfig $(pwd)/config describe pod -n privatebin -l app=privatebin | grep -A5 "Volumes:" ``` Expected: Pod running. Volume shows `Type: PersistentVolumeClaim` with `ClaimName: privatebin-data`, NOT `Type: NFS`. **Step 7: Verify the app works** ```bash curl -sI https://privatebin.viktorbarzin.me | head -5 ``` Expected: HTTP 200 (or 302 redirect to the paste page). **Step 8: Verify mount options** ```bash # SSH to the node running the pod and check mount options NODE=$(kubectl --kubeconfig $(pwd)/config get pod -n privatebin -l app=privatebin -o jsonpath='{.items[0].spec.nodeName}') ssh wizard@$(kubectl --kubeconfig $(pwd)/config get node $NODE -o jsonpath='{.status.addresses[?(@.type=="InternalIP")].address}') "mount | grep privatebin" ``` Expected: Mount shows `soft,timeo=30,retrans=3,actimeo=5` (NOT the old `hard` default). **Step 9: Commit** ```bash cd /Users/viktorbarzin/code/infra git add stacks/privatebin/main.tf git commit -m "[ci skip] privatebin: migrate NFS volume to CSI-backed PV/PVC with soft mount" ``` --- ## Task 4: Pilot Migration — `resume` **Files:** - Modify: `stacks/resume/main.tf` Same pattern as privatebin. Single NFS volume. **Step 1: Add module call** Add before the `kubernetes_deployment.resume` resource: ```hcl module "nfs_data" { source = "../../modules/kubernetes/nfs_volume" name = "resume-data" namespace = kubernetes_namespace.resume.metadata[0].name nfs_server = var.nfs_server nfs_path = "/mnt/main/resume" } ``` **Step 2: Replace inline NFS volume with PVC reference** In the `resume` deployment's pod spec, replace: ```hcl # OLD: volume { name = "data" nfs { server = var.nfs_server path = "/mnt/main/resume" } } # NEW: volume { name = "data" persistent_volume_claim { claim_name = module.nfs_data.claim_name } } ``` **Step 3: Plan, apply, verify** ```bash cd stacks/resume && terragrunt plan --non-interactive cd stacks/resume && terragrunt apply --non-interactive kubectl --kubeconfig $(pwd)/config get pods -n resume curl -sI https://resume.viktorbarzin.me | head -5 ``` **Step 4: Commit** ```bash cd /Users/viktorbarzin/code/infra git add stacks/resume/main.tf git commit -m "[ci skip] resume: migrate NFS volume to CSI-backed PV/PVC with soft mount" ``` --- ## Task 5: Pilot Migration — `speedtest` **Files:** - Modify: `stacks/speedtest/main.tf` **Step 1: Add module call** ```hcl module "nfs_config" { source = "../../modules/kubernetes/nfs_volume" name = "speedtest-config" namespace = kubernetes_namespace.speedtest.metadata[0].name nfs_server = var.nfs_server nfs_path = "/mnt/main/speedtest" } ``` **Step 2: Replace inline NFS volume** ```hcl # OLD: volume { name = "config" nfs { server = var.nfs_server path = "/mnt/main/speedtest" } } # NEW: volume { name = "config" persistent_volume_claim { claim_name = module.nfs_config.claim_name } } ``` **Step 3: Plan, apply, verify** ```bash cd stacks/speedtest && terragrunt plan --non-interactive cd stacks/speedtest && terragrunt apply --non-interactive kubectl --kubeconfig $(pwd)/config get pods -n speedtest curl -sI https://speedtest.viktorbarzin.me | head -5 ``` **Step 4: Commit** ```bash cd /Users/viktorbarzin/code/infra git add stacks/speedtest/main.tf git commit -m "[ci skip] speedtest: migrate NFS volume to CSI-backed PV/PVC with soft mount" ``` --- ## Task 6: Batch Migration — Simple Single-Volume Stacks After pilots are verified, migrate the remaining single-volume stacks. These all follow the exact same mechanical pattern. **Files to modify** (one `main.tf` each — apply and verify each individually): | Stack | Volume Name | PV Name | NFS Path | |-------|------------|---------|----------| | `audiobookshelf` | `data` | `audiobookshelf-data` | `/mnt/main/audiobookshelf` | | `calibre` | `data` | `calibre-data` | `/mnt/main/calibre-web-automated` | | `changedetection` | `data` | `changedetection-data` | `/mnt/main/changedetection` | | `diun` | `data` | `diun-data` | `/mnt/main/diun` | | `excalidraw` | `data` | `excalidraw-data` | `/mnt/main/excalidraw` | | `forgejo` | `data` | `forgejo-data` | `/mnt/main/forgejo` | | `freshrss` | `data` | `freshrss-data` | `/mnt/main/freshrss` | | `hackmd` | `data` | `hackmd-data` | `/mnt/main/hackmd` | | `health` | `data` | `health-data` | `/mnt/main/health` | | `isponsorblocktv` | `data` | `isponsorblocktv-data` | `/mnt/main/isponsorblocktv` | | `meshcentral` | `data` | `meshcentral-data` | `/mnt/main/meshcentral` | | `n8n` | `data` | `n8n-data` | `/mnt/main/n8n` | | `navidrome` | `data` | `navidrome-data` | `/mnt/main/navidrome` | | `netbox` | `data` | `netbox-data` | `/mnt/main/netbox` | | `ntfy` | `data` | `ntfy-data` | `/mnt/main/ntfy` | | `onlyoffice` | `data` | `onlyoffice-data` | `/mnt/main/onlyoffice` | | `owntracks` | `data` | `owntracks-data` | `/mnt/main/owntracks` | | `privatebin` | _(done in Task 3)_ | | | | `resume` | _(done in Task 4)_ | | | | `send` | `data` | `send-data` | `/mnt/main/send` | | `speedtest` | _(done in Task 5)_ | | | | `tandoor` | `data` | `tandoor-data` | `/mnt/main/tandoor` | | `wealthfolio` | `data` | `wealthfolio-data` | `/mnt/main/wealthfolio` | | `whisper` | `data` | `whisper-data` | `/mnt/main/whisper` | | `atuin` | `data` | `atuin-data` | `/mnt/main/atuin` | | `matrix` | `data` | `matrix-data` | `/mnt/main/matrix` | | `ollama` | `data` | `ollama-data` | `/mnt/main/ollama` | | `poison-fountain` | `data` | `poison-fountain-data` | `/mnt/main/poison-fountain` | | `woodpecker` | `data` | `woodpecker-data` | `/mnt/main/woodpecker` | | `ytdlp` | `data` | `ytdlp-data` | `/mnt/main/ytdlp` | | `stirling-pdf` | `data` | `stirling-pdf-data` | `/mnt/main/stirling-pdf` | | `paperless-ngx` | `data` | `paperless-ngx-data` | `/mnt/main/paperless-ngx` | | `grampsweb` | `data` | `grampsweb-data` | `/mnt/main/grampsweb` | | `trading-bot` | `data` | `trading-bot-data` | `/mnt/main/trading-bot` | **For each stack, the pattern is identical:** 1. Read `stacks//main.tf` to find the exact NFS volume block and its volume name 2. Add `module "nfs_"` call with the correct PV name, namespace, and NFS path 3. Replace `nfs {}` block with `persistent_volume_claim { claim_name = module.nfs_.claim_name }` 4. `cd stacks/ && terragrunt apply --non-interactive` 5. Verify pod is running: `kubectl --kubeconfig $(pwd)/config get pods -n ` 6. Verify app is accessible: `curl -sI https://.viktorbarzin.me | head -5` **Important**: Read each `main.tf` first — volume names, NFS paths, and namespace references vary. The table above is a guide, not a source of truth. Some stacks may have different volume names or multiple NFS paths under a parent directory. **Commit after every 3-5 stacks:** ```bash git add stacks/audiobookshelf/main.tf stacks/calibre/main.tf stacks/changedetection/main.tf git commit -m "[ci skip] migrate audiobookshelf, calibre, changedetection NFS volumes to CSI PV/PVC" ``` --- ## Task 7: Multi-Volume Stack Migration These stacks have 2+ NFS volumes. Each needs multiple module calls. **Files to modify** (read each `main.tf` first to get exact volume names and paths): | Stack | Expected NFS Volumes | Notes | |-------|---------------------|-------| | `openclaw` | 4: tools, home, workspace, data | 3 containers share volumes | | `immich` | Multiple: library, upload, thumbs, etc. | Check exact paths from nfs_directories.txt | | `servarr` | Parent + 7 sub-stacks, each with NFS | Factory pattern, check each sub-module | | `frigate` | Multiple: config, media, recordings | GPU service | | `dawarich` | Multiple | Check main.tf | | `ebook2audiobook` | Multiple | GPU service | | `f1-stream` | Multiple | Check main.tf | | `real-estate-crawler` | Multiple | Check main.tf | | `nextcloud` | Multiple | Custom LimitRange, complex stack | | `rybbit` | Multiple: clickhouse data, etc. | Check main.tf | | `osm_routing` | Multiple | Check main.tf | | `affine` | Multiple | Check main.tf | **Pattern is the same — just more module calls:** ```hcl # Example for openclaw (4 volumes) module "nfs_tools" { source = "../../modules/kubernetes/nfs_volume" name = "openclaw-tools" namespace = kubernetes_namespace.openclaw.metadata[0].name nfs_server = var.nfs_server nfs_path = "/mnt/main/openclaw/tools" } module "nfs_home" { source = "../../modules/kubernetes/nfs_volume" name = "openclaw-home" namespace = kubernetes_namespace.openclaw.metadata[0].name nfs_server = var.nfs_server nfs_path = "/mnt/main/openclaw/home" } module "nfs_workspace" { source = "../../modules/kubernetes/nfs_volume" name = "openclaw-workspace" namespace = kubernetes_namespace.openclaw.metadata[0].name nfs_server = var.nfs_server nfs_path = "/mnt/main/openclaw/workspace" } module "nfs_data" { source = "../../modules/kubernetes/nfs_volume" name = "openclaw-data" namespace = kubernetes_namespace.openclaw.metadata[0].name nfs_server = var.nfs_server nfs_path = "/mnt/main/openclaw/data" } # Then in pod spec: volume { name = "tools" persistent_volume_claim { claim_name = module.nfs_tools.claim_name } } volume { name = "openclaw-home" persistent_volume_claim { claim_name = module.nfs_home.claim_name } } # ... etc ``` **Step for each**: Read main.tf → identify all `nfs {}` blocks → add module calls → replace volume blocks → plan → apply → verify. **Commit after each multi-volume stack** (these are more complex, commit individually): ```bash git add stacks/openclaw/main.tf git commit -m "[ci skip] openclaw: migrate 4 NFS volumes to CSI PV/PVC with soft mount" ``` --- ## Task 8: Platform Module Migration These modules are under `stacks/platform/modules/` and reference shared modules at `../../../../modules/kubernetes/nfs_volume`. **Files to modify:** | Module | Current Storage Pattern | Notes | |--------|----------------------|-------| | `monitoring/prometheus.tf` | Existing PV/PVC with native NFS source | Change PV source from `nfs {}` to `csi {}` | | `monitoring/loki.tf` | Existing PV/PVC with native NFS source | Same | | `monitoring/grafana.tf` | Existing PV (alertmanager) with native NFS | Same | | `redis/main.tf` | Inline NFS or PV | Check current pattern | | `dbaas/` | PV for PostgreSQL, MySQL backup | Check current pattern | | `technitium/` | Inline NFS | Standard migration | | `headscale/` | Inline NFS | Standard migration | | `vaultwarden/` | Inline NFS | Standard migration | | `uptime-kuma/` | Inline NFS | Standard migration | | `mailserver/` | Inline NFS | Standard migration | | `infra-maintenance/` | Inline NFS | Standard migration | **For existing PV/PVC resources** (monitoring stack), the change is different — replace the `persistent_volume_source` block: ```hcl # OLD (in prometheus.tf): persistent_volume_source { nfs { path = "/mnt/main/prometheus" server = var.nfs_server } } # NEW: persistent_volume_source { csi { driver = "nfs.csi.k8s.io" volume_handle = "prometheus-data" volume_attributes = { server = var.nfs_server share = "/mnt/main/prometheus" } } } ``` Also add `storage_class_name = "nfs-truenas"` to the PV spec to inherit mount options. **For inline NFS volumes** in platform modules, use the shared module with the longer path: ```hcl module "nfs_data" { source = "../../../../modules/kubernetes/nfs_volume" name = "technitium-data" namespace = kubernetes_namespace.technitium.metadata[0].name nfs_server = var.nfs_server nfs_path = "/mnt/main/technitium" } ``` **Apply as one platform apply:** ```bash cd stacks/platform && terragrunt apply --non-interactive ``` **Verify all platform services:** ```bash kubectl --kubeconfig $(pwd)/config get pods -n monitoring kubectl --kubeconfig $(pwd)/config get pods -n redis kubectl --kubeconfig $(pwd)/config get pods -n dbaas kubectl --kubeconfig $(pwd)/config get pods -n technitium # ... etc ``` **Commit:** ```bash git add stacks/platform/ git commit -m "[ci skip] platform: migrate all NFS volumes to CSI PV/PVC with soft mount" ``` --- ## Task 9: Update Documentation and Skills **Files:** - Modify: `.claude/CLAUDE.md` (update NFS Volume Pattern section) - Modify: `.claude/skills/setup-project/SKILL.md` (update new service template to use module) **Step 1: Update CLAUDE.md NFS Volume Pattern** Replace the existing NFS Volume Pattern section with: ```markdown ### NFS Volume Pattern **Use the `nfs_volume` shared module** for all NFS volumes. This creates CSI-backed PV/PVC with soft mount options (no stale mount hangs): \```hcl module "nfs_data" { source = "../../modules/kubernetes/nfs_volume" name = "-data" # Must be globally unique namespace = kubernetes_namespace..metadata[0].name nfs_server = var.nfs_server nfs_path = "/mnt/main/" } # In pod spec: volume { name = "data" persistent_volume_claim { claim_name = module.nfs_data.claim_name } } \``` For platform modules, use `source = "../../../../modules/kubernetes/nfs_volume"`. **Legacy pattern (DO NOT use for new services):** Inline `nfs {}` blocks mount with `hard,timeo=600` defaults which hang forever on stale mounts. ``` **Step 2: Update setup-project skill** Update the new service template in `.claude/skills/setup-project/SKILL.md` to use the module pattern instead of inline NFS. **Step 3: Commit** ```bash git add .claude/ git commit -m "[ci skip] update NFS volume documentation to use CSI-backed nfs_volume module" ``` --- ## Task 10: Validation — Simulate NFS Outage **This is a manual verification step. Do NOT automate.** After all services are migrated, simulate an NFS blip to confirm the stale mount fix works: 1. Pick a low-risk service (e.g., `privatebin`) 2. On TrueNAS, temporarily block NFS to the K8s network (iptables rule or pause NFS for 30 seconds) 3. Observe: pod should get I/O errors within ~9 seconds (not hang) 4. If the pod has a liveness probe that touches the filesystem, it should restart automatically 5. After NFS recovers, verify the pod re-mounts cleanly **Do NOT run this on production without a maintenance window.** This is a "when you're ready" validation, not part of the automated migration.