Viktor Barzin 910ea5d923 [ci skip] add NFS CSI migration design doc and implementation plan

2026-03-01 23:30:27 +00:00

NFS CSI Driver Migration Implementation Plan

For Claude: REQUIRED SUB-SKILL: Use superpowers:executing-plans to implement this plan task-by-task.

Goal: Replace all inline NFS volumes with CSI-backed PV/PVC using soft mount options to eliminate stale mount hangs.

Architecture: Deploy the NFS CSI driver as a platform Helm module, create a shared Terraform module for PV/PVC boilerplate, then mechanically migrate all 56 NFS-dependent services from inline nfs {} to persistent_volume_claim {} referencing the shared module.

Tech Stack: csi-driver-nfs (Helm), Terraform/Terragrunt, Kubernetes PV/PVC/StorageClass

Design doc: docs/plans/2026-03-01-nfs-csi-migration-design.md

Task 1: Create the NFS CSI Driver Platform Module

Files:

Create: stacks/platform/modules/nfs-csi/main.tf
Modify: stacks/platform/main.tf (add module block)

Step 1: Create the module directory

mkdir -p stacks/platform/modules/nfs-csi

Step 2: Write the NFS CSI module

Create stacks/platform/modules/nfs-csi/main.tf:

variable "tier" { type = string }
variable "nfs_server" { type = string }

resource "kubernetes_namespace" "nfs_csi" {
  metadata {
    name = "nfs-csi"
    labels = {
      tier = var.tier
    }
  }
}

resource "helm_release" "nfs_csi_driver" {
  namespace        = kubernetes_namespace.nfs_csi.metadata[0].name
  create_namespace = false
  name             = "csi-driver-nfs"
  atomic           = true
  timeout          = 300

  repository = "https://raw.githubusercontent.com/kubernetes-csi/csi-driver-nfs/master/charts"
  chart      = "csi-driver-nfs"

  values = [yamlencode({
    controller = {
      replicas = 1
      resources = {
        requests = { cpu = "10m", memory = "32Mi" }
        limits   = { cpu = "100m", memory = "128Mi" }
      }
    }
    node = {
      resources = {
        requests = { cpu = "10m", memory = "32Mi" }
        limits   = { cpu = "100m", memory = "128Mi" }
      }
    }
    storageClass = {
      create = false  # We create it ourselves below for full control
    }
  })]
}

resource "kubernetes_storage_class" "nfs_truenas" {
  metadata {
    name = "nfs-truenas"
  }
  storage_provisioner = "nfs.csi.k8s.io"
  reclaim_policy      = "Retain"
  volume_binding_mode = "Immediate"

  mount_options = [
    "soft",
    "timeo=30",
    "retrans=3",
    "actimeo=5",
  ]

  parameters = {
    server = var.nfs_server
    share  = "/mnt/main"
  }
}
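
For illustration only, and not a step in this plan: the mount options on the StorageClass take effect for volumes the class provisions dynamically. A minimal sketch of such a claim follows, assuming a hypothetical `scratch` claim in the `default` namespace. With csi-driver-nfs, a claim like this would normally get its own subdirectory under the class's `share` rather than an existing export path, which is why the migration below defines PVs explicitly instead.

```hcl
# Illustration only: a claim with no volume_name is provisioned
# dynamically by the nfs-truenas class and picks up its mount options.
resource "kubernetes_persistent_volume_claim" "scratch" {
  metadata {
    name      = "scratch" # hypothetical name
    namespace = "default" # hypothetical namespace
  }
  spec {
    access_modes       = ["ReadWriteMany"]
    storage_class_name = kubernetes_storage_class.nfs_truenas.metadata[0].name

    resources {
      requests = {
        storage = "1Gi" # placeholder size
      }
    }
  }
}
```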

Step 3: Wire the module into stacks/platform/main.tf

Add after the cnpg module block (around line 318):

module "nfs-csi" {
  source     = "./modules/nfs-csi"
  tier       = local.tiers.cluster
  nfs_server = var.nfs_server
}

Step 4: Verify with plan

cd stacks/platform && terragrunt plan --non-interactive 2>&1 | head -80

Expected: Plan shows 3 new resources (kubernetes_namespace, helm_release, kubernetes_storage_class). No changes to existing resources.

Step 5: Apply

cd stacks/platform && terragrunt apply --non-interactive

Step 6: Verify CSI driver is running

kubectl --kubeconfig $(pwd)/config get pods -n nfs-csi
kubectl --kubeconfig $(pwd)/config get storageclass nfs-truenas

Expected: Controller pod + node DaemonSet pods (5 total) all Running. StorageClass nfs-truenas exists with provisioner nfs.csi.k8s.io.

Step 7: Commit

git add stacks/platform/modules/nfs-csi/ stacks/platform/main.tf
git commit -m "[ci skip] add NFS CSI driver platform module with nfs-truenas StorageClass"

Task 2: Create the Shared `nfs_volume` Module

Files:

Create: modules/kubernetes/nfs_volume/main.tf

Step 1: Write the module

Create modules/kubernetes/nfs_volume/main.tf:

variable "name" {
  description = "Unique name for PV and PVC (convention: <service>-<purpose>)"
  type        = string
}

variable "namespace" {
  description = "Kubernetes namespace for the PVC"
  type        = string
}

variable "nfs_server" {
  description = "NFS server address"
  type        = string
}

variable "nfs_path" {
  description = "NFS export path (e.g. /mnt/main/myservice)"
  type        = string
}

variable "storage" {
  description = "Storage capacity (informational for NFS)"
  type        = string
  default     = "10Gi"
}

variable "access_modes" {
  description = "PV/PVC access modes"
  type        = list(string)
  default     = ["ReadWriteMany"]
}

resource "kubernetes_persistent_volume" "this" {
  metadata {
    name = var.name
  }
  spec {
    capacity = {
      storage = var.storage
    }
    access_modes                     = var.access_modes
    persistent_volume_reclaim_policy = "Retain"
    storage_class_name               = "nfs-truenas"
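    volume_mode                      = "Filesystem"

    # StorageClass mount options apply only to dynamically provisioned
    # volumes; a statically defined PV must carry them itself.
    mount_options = ["soft", "timeo=30", "retrans=3", "actimeo=5"]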

    persistent_volume_source {
      csi {
        driver        = "nfs.csi.k8s.io"
        volume_handle = var.name
        volume_attributes = {
          server = var.nfs_server
          share  = var.nfs_path
        }
      }
    }
  }
}

resource "kubernetes_persistent_volume_claim" "this" {
  metadata {
    name      = var.name
    namespace = var.namespace
  }
  spec {
    access_modes       = var.access_modes
    storage_class_name = "nfs-truenas"
    volume_name        = kubernetes_persistent_volume.this.metadata[0].name

    resources {
      requests = {
        storage = var.storage
      }
    }
  }
}

output "claim_name" {
  description = "PVC name to use in pod spec persistent_volume_claim blocks"
  value       = kubernetes_persistent_volume_claim.this.metadata[0].name
}
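
The `storage` and `access_modes` variables have defaults, so most callers can omit them. A minimal sketch of a caller that overrides both, assuming a hypothetical `example` stack whose namespace resource and `var.nfs_server` already exist (none of these names come from the repository):

```hcl
# Hypothetical caller; service name and path are placeholders.
module "nfs_media" {
  source     = "../../modules/kubernetes/nfs_volume"
  name       = "example-media"
  namespace  = kubernetes_namespace.example.metadata[0].name
  nfs_server = var.nfs_server
  nfs_path   = "/mnt/main/example/media"

  # Optional overrides of the module defaults.
  storage = "100Gi" # informational for NFS, per the variable description

  # Access modes are declarative: ReadOnlyMany does not by itself
  # write-protect the mount.
  access_modes = ["ReadOnlyMany"]
}
```

The PV and PVC share `access_modes`, so one override keeps them consistent.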

Step 2: Format

terraform fmt modules/kubernetes/nfs_volume/main.tf

Step 3: Commit

git add modules/kubernetes/nfs_volume/
git commit -m "[ci skip] add shared nfs_volume module for CSI-backed PV/PVC creation"

Task 3: Pilot Migration — `privatebin`

Files:

Modify: stacks/privatebin/main.tf

This is the first real migration; it validates the pattern end-to-end.

Step 1: Read current state

Current NFS volume in stacks/privatebin/main.tf:

# Lines 71-77 — volume block in pod spec
volume {
  name = "data"
  nfs {
    path   = "/mnt/main/privatebin"
    server = var.nfs_server
  }
}

Volume mount (lines 54-58, UNCHANGED):

volume_mount {
  name       = "data"
  mount_path = "/srv/data"
  sub_path   = "data"
}

Step 2: Add module call

Add before the kubernetes_deployment resource (e.g., after the ingress_factory module, before the deployment):

module "nfs_data" {
  source     = "../../modules/kubernetes/nfs_volume"
  name       = "privatebin-data"
  namespace  = kubernetes_namespace.privatebin.metadata[0].name
  nfs_server = var.nfs_server
  nfs_path   = "/mnt/main/privatebin"
}

Step 3: Replace inline NFS volume with PVC reference

Replace the volume block (lines 71-77):

# OLD:
volume {
  name = "data"
  nfs {
    path   = "/mnt/main/privatebin"
    server = var.nfs_server
  }
}

# NEW:
volume {
  name = "data"
  persistent_volume_claim {
    claim_name = module.nfs_data.claim_name
  }
}

Do NOT touch the volume_mount block — it stays identical.

Step 4: Plan and verify

cd stacks/privatebin && terragrunt plan --non-interactive

Expected: 2 resources added (PV + PVC), deployment updated in-place (volume source changed). No resources destroyed (inline volumes aren't tracked as separate TF resources).

Step 5: Apply

cd stacks/privatebin && terragrunt apply --non-interactive

Step 6: Verify the pod is running with CSI mount

kubectl --kubeconfig $(pwd)/config get pods -n privatebin
kubectl --kubeconfig $(pwd)/config describe pod -n privatebin -l app=privatebin | grep -A5 "Volumes:"

Expected: Pod running. Volume shows Type: PersistentVolumeClaim with ClaimName: privatebin-data, NOT Type: NFS.

Step 7: Verify the app works

curl -sI https://privatebin.viktorbarzin.me | head -5

Expected: HTTP 200 (or 302 redirect to the paste page).

Step 8: Verify mount options

# SSH to the node running the pod and check mount options
NODE=$(kubectl --kubeconfig $(pwd)/config get pod -n privatebin -l app=privatebin -o jsonpath='{.items[0].spec.nodeName}')
ssh wizard@$(kubectl --kubeconfig $(pwd)/config get node $NODE -o jsonpath='{.status.addresses[?(@.type=="InternalIP")].address}') "mount | grep privatebin"

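Expected: Mount shows soft,timeo=30,retrans=3 and a 5-second attribute cache (NOT the old hard default). The kernel typically reports actimeo=5 as acregmin=5,acregmax=5,acdirmin=5,acdirmax=5 rather than as a single option.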

Step 9: Commit

cd /Users/viktorbarzin/code/infra
git add stacks/privatebin/main.tf
git commit -m "[ci skip] privatebin: migrate NFS volume to CSI-backed PV/PVC with soft mount"

Task 4: Pilot Migration — `resume`

Files:

Modify: stacks/resume/main.tf

Same pattern as privatebin. Single NFS volume.

Step 1: Add module call

Add before the kubernetes_deployment.resume resource:

module "nfs_data" {
  source     = "../../modules/kubernetes/nfs_volume"
  name       = "resume-data"
  namespace  = kubernetes_namespace.resume.metadata[0].name
  nfs_server = var.nfs_server
  nfs_path   = "/mnt/main/resume"
}

Step 2: Replace inline NFS volume with PVC reference

In the resume deployment's pod spec, replace:

# OLD:
volume {
  name = "data"
  nfs {
    server = var.nfs_server
    path   = "/mnt/main/resume"
  }
}

# NEW:
volume {
  name = "data"
  persistent_volume_claim {
    claim_name = module.nfs_data.claim_name
  }
}

Step 3: Plan, apply, verify

# Subshells keep the working directory unchanged, so the second cd and $(pwd)/config still resolve
(cd stacks/resume && terragrunt plan --non-interactive)
(cd stacks/resume && terragrunt apply --non-interactive)
kubectl --kubeconfig $(pwd)/config get pods -n resume
curl -sI https://resume.viktorbarzin.me | head -5

Step 4: Commit

cd /Users/viktorbarzin/code/infra
git add stacks/resume/main.tf
git commit -m "[ci skip] resume: migrate NFS volume to CSI-backed PV/PVC with soft mount"

Task 5: Pilot Migration — `speedtest`

Files:

Modify: stacks/speedtest/main.tf

Step 1: Add module call

module "nfs_config" {
  source     = "../../modules/kubernetes/nfs_volume"
  name       = "speedtest-config"
  namespace  = kubernetes_namespace.speedtest.metadata[0].name
  nfs_server = var.nfs_server
  nfs_path   = "/mnt/main/speedtest"
}

Step 2: Replace inline NFS volume

# OLD:
volume {
  name = "config"
  nfs {
    server = var.nfs_server
    path   = "/mnt/main/speedtest"
  }
}

# NEW:
volume {
  name = "config"
  persistent_volume_claim {
    claim_name = module.nfs_config.claim_name
  }
}

Step 3: Plan, apply, verify

# Subshells keep the working directory unchanged, so the second cd and $(pwd)/config still resolve
(cd stacks/speedtest && terragrunt plan --non-interactive)
(cd stacks/speedtest && terragrunt apply --non-interactive)
kubectl --kubeconfig $(pwd)/config get pods -n speedtest
curl -sI https://speedtest.viktorbarzin.me | head -5

Step 4: Commit

cd /Users/viktorbarzin/code/infra
git add stacks/speedtest/main.tf
git commit -m "[ci skip] speedtest: migrate NFS volume to CSI-backed PV/PVC with soft mount"

Task 6: Batch Migration — Simple Single-Volume Stacks

After pilots are verified, migrate the remaining single-volume stacks. These all follow the exact same mechanical pattern.

Files to modify (one main.tf each — apply and verify each individually):

Stack	Volume Name	PV Name	NFS Path
`audiobookshelf`	`data`	`audiobookshelf-data`	`/mnt/main/audiobookshelf`
`calibre`	`data`	`calibre-data`	`/mnt/main/calibre-web-automated`
`changedetection`	`data`	`changedetection-data`	`/mnt/main/changedetection`
`diun`	`data`	`diun-data`	`/mnt/main/diun`
`excalidraw`	`data`	`excalidraw-data`	`/mnt/main/excalidraw`
`forgejo`	`data`	`forgejo-data`	`/mnt/main/forgejo`
`freshrss`	`data`	`freshrss-data`	`/mnt/main/freshrss`
`hackmd`	`data`	`hackmd-data`	`/mnt/main/hackmd`
`health`	`data`	`health-data`	`/mnt/main/health`
`isponsorblocktv`	`data`	`isponsorblocktv-data`	`/mnt/main/isponsorblocktv`
`meshcentral`	`data`	`meshcentral-data`	`/mnt/main/meshcentral`
`n8n`	`data`	`n8n-data`	`/mnt/main/n8n`
`navidrome`	`data`	`navidrome-data`	`/mnt/main/navidrome`
`netbox`	`data`	`netbox-data`	`/mnt/main/netbox`
`ntfy`	`data`	`ntfy-data`	`/mnt/main/ntfy`
`onlyoffice`	`data`	`onlyoffice-data`	`/mnt/main/onlyoffice`
`owntracks`	`data`	`owntracks-data`	`/mnt/main/owntracks`
`privatebin`	(done in Task 3)
`resume`	(done in Task 4)
`send`	`data`	`send-data`	`/mnt/main/send`
`speedtest`	(done in Task 5)
`tandoor`	`data`	`tandoor-data`	`/mnt/main/tandoor`
`wealthfolio`	`data`	`wealthfolio-data`	`/mnt/main/wealthfolio`
`whisper`	`data`	`whisper-data`	`/mnt/main/whisper`
`atuin`	`data`	`atuin-data`	`/mnt/main/atuin`
`matrix`	`data`	`matrix-data`	`/mnt/main/matrix`
`ollama`	`data`	`ollama-data`	`/mnt/main/ollama`
`poison-fountain`	`data`	`poison-fountain-data`	`/mnt/main/poison-fountain`
`woodpecker`	`data`	`woodpecker-data`	`/mnt/main/woodpecker`
`ytdlp`	`data`	`ytdlp-data`	`/mnt/main/ytdlp`
`stirling-pdf`	`data`	`stirling-pdf-data`	`/mnt/main/stirling-pdf`
`paperless-ngx`	`data`	`paperless-ngx-data`	`/mnt/main/paperless-ngx`
`grampsweb`	`data`	`grampsweb-data`	`/mnt/main/grampsweb`
`trading-bot`	`data`	`trading-bot-data`	`/mnt/main/trading-bot`

For each stack, the pattern is identical:

1. Read stacks/<service>/main.tf to find the exact NFS volume block and its volume name
2. Add a module "nfs_<volume_name>" call with the correct PV name, namespace, and NFS path
3. Replace the nfs {} block with persistent_volume_claim { claim_name = module.nfs_<volume_name>.claim_name }
4. Apply: (cd stacks/<service> && terragrunt apply --non-interactive)
5. Verify the pod is running: kubectl --kubeconfig $(pwd)/config get pods -n <service>
6. Verify the app is accessible: curl -sI https://<service>.viktorbarzin.me | head -5

Important: Read each main.tf first — volume names, NFS paths, and namespace references vary. The table above is a guide, not a source of truth. Some stacks may have different volume names or multiple NFS paths under a parent directory.

Commit after every 3-5 stacks:

git add stacks/audiobookshelf/main.tf stacks/calibre/main.tf stacks/changedetection/main.tf
git commit -m "[ci skip] migrate audiobookshelf, calibre, changedetection NFS volumes to CSI PV/PVC"

Task 7: Multi-Volume Stack Migration

These stacks have 2+ NFS volumes. Each needs multiple module calls.

Files to modify (read each main.tf first to get exact volume names and paths):

Stack	Expected NFS Volumes	Notes
`openclaw`	4: tools, home, workspace, data	3 containers share volumes
`immich`	Multiple: library, upload, thumbs, etc.	Check exact paths from nfs_directories.txt
`servarr`	Parent + 7 sub-stacks, each with NFS	Factory pattern, check each sub-module
`frigate`	Multiple: config, media, recordings	GPU service
`dawarich`	Multiple	Check main.tf
`ebook2audiobook`	Multiple	GPU service
`f1-stream`	Multiple	Check main.tf
`real-estate-crawler`	Multiple	Check main.tf
`nextcloud`	Multiple	Custom LimitRange, complex stack
`rybbit`	Multiple: clickhouse data, etc.	Check main.tf
`osm_routing`	Multiple	Check main.tf
`affine`	Multiple	Check main.tf

Pattern is the same — just more module calls:

# Example for openclaw (4 volumes)
module "nfs_tools" {
  source     = "../../modules/kubernetes/nfs_volume"
  name       = "openclaw-tools"
  namespace  = kubernetes_namespace.openclaw.metadata[0].name
  nfs_server = var.nfs_server
  nfs_path   = "/mnt/main/openclaw/tools"
}

module "nfs_home" {
  source     = "../../modules/kubernetes/nfs_volume"
  name       = "openclaw-home"
  namespace  = kubernetes_namespace.openclaw.metadata[0].name
  nfs_server = var.nfs_server
  nfs_path   = "/mnt/main/openclaw/home"
}

module "nfs_workspace" {
  source     = "../../modules/kubernetes/nfs_volume"
  name       = "openclaw-workspace"
  namespace  = kubernetes_namespace.openclaw.metadata[0].name
  nfs_server = var.nfs_server
  nfs_path   = "/mnt/main/openclaw/workspace"
}

module "nfs_data" {
  source     = "../../modules/kubernetes/nfs_volume"
  name       = "openclaw-data"
  namespace  = kubernetes_namespace.openclaw.metadata[0].name
  nfs_server = var.nfs_server
  nfs_path   = "/mnt/main/openclaw/data"
}

# Then in pod spec:
volume {
  name = "tools"
  persistent_volume_claim { claim_name = module.nfs_tools.claim_name }
}
volume {
  name = "openclaw-home"
  persistent_volume_claim { claim_name = module.nfs_home.claim_name }
}
# ... etc

Step for each: Read main.tf → identify all nfs {} blocks → add module calls → replace volume blocks → plan → apply → verify.
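
When every volume of a stack follows the same naming and path scheme, the repeated module calls can be collapsed with `for_each`. The sketch below is an alternative to the four explicit blocks above, not the plan's own method. It assumes each openclaw volume lives at /mnt/main/openclaw/<key>, which must be confirmed against stacks/openclaw/main.tf first.

```hcl
# Sketch only: one module instance per volume key.
module "nfs" {
  for_each = toset(["tools", "home", "workspace", "data"])

  source     = "../../modules/kubernetes/nfs_volume"
  name       = "openclaw-${each.key}"
  namespace  = kubernetes_namespace.openclaw.metadata[0].name
  nfs_server = var.nfs_server
  nfs_path   = "/mnt/main/openclaw/${each.key}"
}

# In the pod spec, index the module by key:
volume {
  name = "tools"
  persistent_volume_claim {
    claim_name = module.nfs["tools"].claim_name
  }
}
```

Pick one form before the first apply: switching between separate module blocks and `for_each` later changes the Terraform resource addresses.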

Commit after each multi-volume stack (these are more complex, commit individually):

git add stacks/openclaw/main.tf
git commit -m "[ci skip] openclaw: migrate 4 NFS volumes to CSI PV/PVC with soft mount"

Task 8: Platform Module Migration

These modules are under stacks/platform/modules/ and reference shared modules at ../../../../modules/kubernetes/nfs_volume.

Files to modify:

Module	Current Storage Pattern	Notes
`monitoring/prometheus.tf`	Existing PV/PVC with native NFS source	Change PV source from `nfs {}` to `csi {}`
`monitoring/loki.tf`	Existing PV/PVC with native NFS source	Same
`monitoring/grafana.tf`	Existing PV (alertmanager) with native NFS	Same
`redis/main.tf`	Inline NFS or PV	Check current pattern
`dbaas/`	PV for PostgreSQL, MySQL backup	Check current pattern
`technitium/`	Inline NFS	Standard migration
`headscale/`	Inline NFS	Standard migration
`vaultwarden/`	Inline NFS	Standard migration
`uptime-kuma/`	Inline NFS	Standard migration
`mailserver/`	Inline NFS	Standard migration
`infra-maintenance/`	Inline NFS	Standard migration

For existing PV/PVC resources (monitoring stack), the change is different — replace the persistent_volume_source block:

# OLD (in prometheus.tf):
persistent_volume_source {
  nfs {
    path   = "/mnt/main/prometheus"
    server = var.nfs_server
  }
}

# NEW:
persistent_volume_source {
  csi {
    driver        = "nfs.csi.k8s.io"
    volume_handle = "prometheus-data"
    volume_attributes = {
      server = var.nfs_server
      share  = "/mnt/main/prometheus"
    }
  }
}

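Also add storage_class_name = "nfs-truenas" to the PV spec, and set mount_options = ["soft", "timeo=30", "retrans=3", "actimeo=5"] on the PV itself. A statically defined PV does not inherit mount options from its StorageClass; those apply only to dynamically provisioned volumes.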
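
Put together, a migrated monitoring PV might look like the sketch below. The resource name, PV name, capacity, and access mode are placeholders, not values taken from prometheus.tf; keep whatever the existing resource already declares. The mount options sit on the PV because Kubernetes applies a StorageClass's mount options only to volumes the class provisions dynamically.

```hcl
# Sketch: placeholders are marked; reuse the values already in prometheus.tf.
resource "kubernetes_persistent_volume" "prometheus" {
  metadata {
    name = "prometheus-data" # placeholder
  }
  spec {
    capacity = {
      storage = "10Gi" # placeholder
    }
    access_modes                     = ["ReadWriteMany"] # placeholder
    persistent_volume_reclaim_policy = "Retain"
    storage_class_name               = "nfs-truenas"

    # Declared on the PV itself so the soft-mount behaviour applies.
    mount_options = ["soft", "timeo=30", "retrans=3", "actimeo=5"]

    persistent_volume_source {
      csi {
        driver        = "nfs.csi.k8s.io"
        volume_handle = "prometheus-data"
        volume_attributes = {
          server = var.nfs_server
          share  = "/mnt/main/prometheus"
        }
      }
    }
  }
}
```

The claim bound to this PV needs the same storage_class_name, since a claim only binds to a volume of the same class.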

For inline NFS volumes in platform modules, use the shared module with the longer path:

module "nfs_data" {
  source     = "../../../../modules/kubernetes/nfs_volume"
  name       = "technitium-data"
  namespace  = kubernetes_namespace.technitium.metadata[0].name
  nfs_server = var.nfs_server
  nfs_path   = "/mnt/main/technitium"
}

Apply as one platform apply:

cd stacks/platform && terragrunt apply --non-interactive

Verify all platform services:

kubectl --kubeconfig $(pwd)/config get pods -n monitoring
kubectl --kubeconfig $(pwd)/config get pods -n redis
kubectl --kubeconfig $(pwd)/config get pods -n dbaas
kubectl --kubeconfig $(pwd)/config get pods -n technitium
# ... etc

Commit:

git add stacks/platform/
git commit -m "[ci skip] platform: migrate all NFS volumes to CSI PV/PVC with soft mount"

Task 9: Update Documentation and Skills

Files:

Modify: .claude/CLAUDE.md (update NFS Volume Pattern section)
Modify: .claude/skills/setup-project/SKILL.md (update new service template to use module)

Step 1: Update CLAUDE.md NFS Volume Pattern

Replace the existing NFS Volume Pattern section with:

### NFS Volume Pattern
**Use the `nfs_volume` shared module** for all NFS volumes. This creates CSI-backed PV/PVC with soft mount options (no stale mount hangs):
\```hcl
module "nfs_data" {
  source     = "../../modules/kubernetes/nfs_volume"
  name       = "<service>-data"       # Must be globally unique
  namespace  = kubernetes_namespace.<service>.metadata[0].name
  nfs_server = var.nfs_server
  nfs_path   = "/mnt/main/<service>"
}

# In pod spec:
volume {
  name = "data"
  persistent_volume_claim {
    claim_name = module.nfs_data.claim_name
  }
}
\```
For platform modules, use `source = "../../../../modules/kubernetes/nfs_volume"`.

**Legacy pattern (DO NOT use for new services):** Inline `nfs {}` blocks mount with `hard,timeo=600` defaults which hang forever on stale mounts.

Step 2: Update setup-project skill

Update the new service template in .claude/skills/setup-project/SKILL.md to use the module pattern instead of inline NFS.

Step 3: Commit

git add .claude/
git commit -m "[ci skip] update NFS volume documentation to use CSI-backed nfs_volume module"

Task 10: Validation — Simulate NFS Outage

This is a manual verification step. Do NOT automate.

After all services are migrated, simulate an NFS blip to confirm the stale mount fix works:

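1. Pick a low-risk service (e.g., privatebin)
2. On TrueNAS, temporarily block NFS to the K8s network (iptables rule or pause NFS for 30 seconds)
3. Observe: the pod should get I/O errors instead of hanging. With timeo=30 (3 seconds, since timeo is in tenths of a second) and retrans=3, expect errors after roughly 9 seconds, or somewhat longer because the client lengthens the timeout on each retry over TCP
4. If the pod has a liveness probe that touches the filesystem, it should restart automatically
5. After NFS recovers, verify the pod re-mounts cleanly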

Do NOT run this on production without a maintenance window. This is a "when you're ready" validation, not part of the automated migration.
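
A liveness probe that touches the filesystem could look like the fragment below. It is a sketch, not taken from any stack in this repository. It assumes the NFS volume is mounted at /srv/data (as in privatebin) and that the container image provides `ls`; the timing values are illustrative.

```hcl
# Fragment for a container block inside a kubernetes_deployment.
# Assumes the NFS-backed volume is mounted at /srv/data.
liveness_probe {
  exec {
    # On a soft mount this returns an I/O error once the NFS retries
    # are exhausted, instead of blocking indefinitely.
    command = ["ls", "/srv/data"]
  }
  period_seconds    = 30
  timeout_seconds   = 15 # longer than the expected soft-mount timeout
  failure_threshold = 3
}
```

With these values the kubelet would restart the container after three consecutive failed probes.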
