22 KiB
NFS CSI Driver Migration Implementation Plan
For Claude: REQUIRED SUB-SKILL: Use superpowers:executing-plans to implement this plan task-by-task.
Goal: Replace all inline NFS volumes with CSI-backed PV/PVC using soft mount options to eliminate stale mount hangs.
Architecture: Deploy the NFS CSI driver as a platform Helm module, create a shared Terraform module for PV/PVC boilerplate, then mechanically migrate all 56 NFS-dependent services from inline nfs {} to persistent_volume_claim {} referencing the shared module.
Tech Stack: csi-driver-nfs (Helm), Terraform/Terragrunt, Kubernetes PV/PVC/StorageClass
Design doc: docs/plans/2026-03-01-nfs-csi-migration-design.md
Task 1: Create the NFS CSI Driver Platform Module
Files:
- Create:
stacks/platform/modules/nfs-csi/main.tf - Modify:
stacks/platform/main.tf(add module block)
Step 1: Create the module directory
mkdir -p stacks/platform/modules/nfs-csi
Step 2: Write the NFS CSI module
Create stacks/platform/modules/nfs-csi/main.tf:
variable "tier" { type = string }
variable "nfs_server" { type = string }
resource "kubernetes_namespace" "nfs_csi" {
metadata {
name = "nfs-csi"
labels = {
tier = var.tier
}
}
}
resource "helm_release" "nfs_csi_driver" {
namespace = kubernetes_namespace.nfs_csi.metadata[0].name
create_namespace = false
name = "csi-driver-nfs"
atomic = true
timeout = 300
repository = "https://raw.githubusercontent.com/kubernetes-csi/csi-driver-nfs/master/charts"
chart = "csi-driver-nfs"
values = [yamlencode({
controller = {
replicas = 1
resources = {
requests = { cpu = "10m", memory = "32Mi" }
limits = { cpu = "100m", memory = "128Mi" }
}
}
node = {
resources = {
requests = { cpu = "10m", memory = "32Mi" }
limits = { cpu = "100m", memory = "128Mi" }
}
}
storageClass = {
create = false # We create it ourselves below for full control
}
})]
}
resource "kubernetes_storage_class" "nfs_truenas" {
metadata {
name = "nfs-truenas"
}
storage_provisioner = "nfs.csi.k8s.io"
reclaim_policy = "Retain"
volume_binding_mode = "Immediate"
mount_options = [
"soft",
"timeo=30",
"retrans=3",
"actimeo=5",
]
parameters = {
server = var.nfs_server
share = "/mnt/main"
}
}
Step 3: Wire the module into stacks/platform/main.tf
Add after the cnpg module block (around line 318):
module "nfs-csi" {
source = "./modules/nfs-csi"
tier = local.tiers.cluster
nfs_server = var.nfs_server
}
Step 4: Verify with plan
cd stacks/platform && terragrunt plan --non-interactive 2>&1 | head -80
Expected: Plan shows 3 new resources (kubernetes_namespace, helm_release, kubernetes_storage_class). No changes to existing resources.
Step 5: Apply
cd stacks/platform && terragrunt apply --non-interactive
Step 6: Verify CSI driver is running
kubectl --kubeconfig $(pwd)/config get pods -n nfs-csi
kubectl --kubeconfig $(pwd)/config get storageclass nfs-truenas
Expected: Controller pod + node DaemonSet pods (5 total) all Running. StorageClass nfs-truenas exists with provisioner nfs.csi.k8s.io.
Step 7: Commit
git add stacks/platform/modules/nfs-csi/ stacks/platform/main.tf
git commit -m "[ci skip] add NFS CSI driver platform module with nfs-truenas StorageClass"
Task 2: Create the Shared nfs_volume Module
Files:
- Create:
modules/kubernetes/nfs_volume/main.tf
Step 1: Write the module
Create modules/kubernetes/nfs_volume/main.tf:
variable "name" {
description = "Unique name for PV and PVC (convention: <service>-<purpose>)"
type = string
}
variable "namespace" {
description = "Kubernetes namespace for the PVC"
type = string
}
variable "nfs_server" {
description = "NFS server address"
type = string
}
variable "nfs_path" {
description = "NFS export path (e.g. /mnt/main/myservice)"
type = string
}
variable "storage" {
description = "Storage capacity (informational for NFS)"
type = string
default = "10Gi"
}
variable "access_modes" {
description = "PV/PVC access modes"
type = list(string)
default = ["ReadWriteMany"]
}
resource "kubernetes_persistent_volume" "this" {
metadata {
name = var.name
}
spec {
capacity = {
storage = var.storage
}
access_modes = var.access_modes
persistent_volume_reclaim_policy = "Retain"
storage_class_name = "nfs-truenas"
volume_mode = "Filesystem"
persistent_volume_source {
csi {
driver = "nfs.csi.k8s.io"
volume_handle = var.name
volume_attributes = {
server = var.nfs_server
share = var.nfs_path
}
}
}
}
}
resource "kubernetes_persistent_volume_claim" "this" {
metadata {
name = var.name
namespace = var.namespace
}
spec {
access_modes = var.access_modes
storage_class_name = "nfs-truenas"
volume_name = kubernetes_persistent_volume.this.metadata[0].name
resources {
requests = {
storage = var.storage
}
}
}
}
output "claim_name" {
description = "PVC name to use in pod spec persistent_volume_claim blocks"
value = kubernetes_persistent_volume_claim.this.metadata[0].name
}
Step 2: Format
terraform fmt modules/kubernetes/nfs_volume/main.tf
Step 3: Commit
git add modules/kubernetes/nfs_volume/
git commit -m "[ci skip] add shared nfs_volume module for CSI-backed PV/PVC creation"
Task 3: Pilot Migration — privatebin
Files:
- Modify:
stacks/privatebin/main.tf
This is the first real migration. Validates the pattern end-to-end.
Step 1: Read current state
Current NFS volume in stacks/privatebin/main.tf:
# Lines 71-77 — volume block in pod spec
volume {
name = "data"
nfs {
path = "/mnt/main/privatebin"
server = var.nfs_server
}
}
Volume mount (lines 54-58, UNCHANGED):
volume_mount {
name = "data"
mount_path = "/srv/data"
sub_path = "data"
}
Step 2: Add module call
Add before the kubernetes_deployment resource (e.g., after the ingress_factory module, before the deployment):
module "nfs_data" {
source = "../../modules/kubernetes/nfs_volume"
name = "privatebin-data"
namespace = kubernetes_namespace.privatebin.metadata[0].name
nfs_server = var.nfs_server
nfs_path = "/mnt/main/privatebin"
}
Step 3: Replace inline NFS volume with PVC reference
Replace the volume block (lines 71-77):
# OLD:
volume {
name = "data"
nfs {
path = "/mnt/main/privatebin"
server = var.nfs_server
}
}
# NEW:
volume {
name = "data"
persistent_volume_claim {
claim_name = module.nfs_data.claim_name
}
}
Do NOT touch the volume_mount block — it stays identical.
Step 4: Plan and verify
cd stacks/privatebin && terragrunt plan --non-interactive
Expected: 2 resources added (PV + PVC), deployment updated in-place (volume source changed). No resources destroyed (inline volumes aren't tracked as separate TF resources).
Step 5: Apply
cd stacks/privatebin && terragrunt apply --non-interactive
Step 6: Verify the pod is running with CSI mount
kubectl --kubeconfig $(pwd)/config get pods -n privatebin
kubectl --kubeconfig $(pwd)/config describe pod -n privatebin -l app=privatebin | grep -A5 "Volumes:"
Expected: Pod running. Volume shows Type: PersistentVolumeClaim with ClaimName: privatebin-data, NOT Type: NFS.
Step 7: Verify the app works
curl -sI https://privatebin.viktorbarzin.me | head -5
Expected: HTTP 200 (or 302 redirect to the paste page).
Step 8: Verify mount options
# SSH to the node running the pod and check mount options
NODE=$(kubectl --kubeconfig $(pwd)/config get pod -n privatebin -l app=privatebin -o jsonpath='{.items[0].spec.nodeName}')
ssh wizard@$(kubectl --kubeconfig $(pwd)/config get node $NODE -o jsonpath='{.status.addresses[?(@.type=="InternalIP")].address}') "mount | grep privatebin"
Expected: Mount shows soft,timeo=30,retrans=3,actimeo=5 (NOT the old hard default).
Step 9: Commit
cd /Users/viktorbarzin/code/infra
git add stacks/privatebin/main.tf
git commit -m "[ci skip] privatebin: migrate NFS volume to CSI-backed PV/PVC with soft mount"
Task 4: Pilot Migration — resume
Files:
- Modify:
stacks/resume/main.tf
Same pattern as privatebin. Single NFS volume.
Step 1: Add module call
Add before the kubernetes_deployment.resume resource:
module "nfs_data" {
source = "../../modules/kubernetes/nfs_volume"
name = "resume-data"
namespace = kubernetes_namespace.resume.metadata[0].name
nfs_server = var.nfs_server
nfs_path = "/mnt/main/resume"
}
Step 2: Replace inline NFS volume with PVC reference
In the resume deployment's pod spec, replace:
# OLD:
volume {
name = "data"
nfs {
server = var.nfs_server
path = "/mnt/main/resume"
}
}
# NEW:
volume {
name = "data"
persistent_volume_claim {
claim_name = module.nfs_data.claim_name
}
}
Step 3: Plan, apply, verify
cd stacks/resume && terragrunt plan --non-interactive
cd stacks/resume && terragrunt apply --non-interactive
kubectl --kubeconfig $(pwd)/config get pods -n resume
curl -sI https://resume.viktorbarzin.me | head -5
Step 4: Commit
cd /Users/viktorbarzin/code/infra
git add stacks/resume/main.tf
git commit -m "[ci skip] resume: migrate NFS volume to CSI-backed PV/PVC with soft mount"
Task 5: Pilot Migration — speedtest
Files:
- Modify:
stacks/speedtest/main.tf
Step 1: Add module call
module "nfs_config" {
source = "../../modules/kubernetes/nfs_volume"
name = "speedtest-config"
namespace = kubernetes_namespace.speedtest.metadata[0].name
nfs_server = var.nfs_server
nfs_path = "/mnt/main/speedtest"
}
Step 2: Replace inline NFS volume
# OLD:
volume {
name = "config"
nfs {
server = var.nfs_server
path = "/mnt/main/speedtest"
}
}
# NEW:
volume {
name = "config"
persistent_volume_claim {
claim_name = module.nfs_config.claim_name
}
}
Step 3: Plan, apply, verify
cd stacks/speedtest && terragrunt plan --non-interactive
cd stacks/speedtest && terragrunt apply --non-interactive
kubectl --kubeconfig $(pwd)/config get pods -n speedtest
curl -sI https://speedtest.viktorbarzin.me | head -5
Step 4: Commit
cd /Users/viktorbarzin/code/infra
git add stacks/speedtest/main.tf
git commit -m "[ci skip] speedtest: migrate NFS volume to CSI-backed PV/PVC with soft mount"
Task 6: Batch Migration — Simple Single-Volume Stacks
After pilots are verified, migrate the remaining single-volume stacks. These all follow the exact same mechanical pattern.
Files to modify (one main.tf each — apply and verify each individually):
| Stack | Volume Name | PV Name | NFS Path |
|---|---|---|---|
audiobookshelf |
data |
audiobookshelf-data |
/mnt/main/audiobookshelf |
calibre |
data |
calibre-data |
/mnt/main/calibre-web-automated |
changedetection |
data |
changedetection-data |
/mnt/main/changedetection |
diun |
data |
diun-data |
/mnt/main/diun |
excalidraw |
data |
excalidraw-data |
/mnt/main/excalidraw |
forgejo |
data |
forgejo-data |
/mnt/main/forgejo |
freshrss |
data |
freshrss-data |
/mnt/main/freshrss |
hackmd |
data |
hackmd-data |
/mnt/main/hackmd |
health |
data |
health-data |
/mnt/main/health |
isponsorblocktv |
data |
isponsorblocktv-data |
/mnt/main/isponsorblocktv |
meshcentral |
data |
meshcentral-data |
/mnt/main/meshcentral |
n8n |
data |
n8n-data |
/mnt/main/n8n |
navidrome |
data |
navidrome-data |
/mnt/main/navidrome |
netbox |
data |
netbox-data |
/mnt/main/netbox |
ntfy |
data |
ntfy-data |
/mnt/main/ntfy |
onlyoffice |
data |
onlyoffice-data |
/mnt/main/onlyoffice |
owntracks |
data |
owntracks-data |
/mnt/main/owntracks |
privatebin |
(done in Task 3) | ||
resume |
(done in Task 4) | ||
send |
data |
send-data |
/mnt/main/send |
speedtest |
(done in Task 5) | ||
tandoor |
data |
tandoor-data |
/mnt/main/tandoor |
wealthfolio |
data |
wealthfolio-data |
/mnt/main/wealthfolio |
whisper |
data |
whisper-data |
/mnt/main/whisper |
atuin |
data |
atuin-data |
/mnt/main/atuin |
matrix |
data |
matrix-data |
/mnt/main/matrix |
ollama |
data |
ollama-data |
/mnt/main/ollama |
poison-fountain |
data |
poison-fountain-data |
/mnt/main/poison-fountain |
woodpecker |
data |
woodpecker-data |
/mnt/main/woodpecker |
ytdlp |
data |
ytdlp-data |
/mnt/main/ytdlp |
stirling-pdf |
data |
stirling-pdf-data |
/mnt/main/stirling-pdf |
paperless-ngx |
data |
paperless-ngx-data |
/mnt/main/paperless-ngx |
grampsweb |
data |
grampsweb-data |
/mnt/main/grampsweb |
trading-bot |
data |
trading-bot-data |
/mnt/main/trading-bot |
For each stack, the pattern is identical:
- Read
stacks/<service>/main.tfto find the exact NFS volume block and its volume name - Add
module "nfs_<volume_name>"call with the correct PV name, namespace, and NFS path - Replace
nfs {}block withpersistent_volume_claim { claim_name = module.nfs_<volume_name>.claim_name } cd stacks/<service> && terragrunt apply --non-interactive- Verify pod is running:
kubectl --kubeconfig $(pwd)/config get pods -n <service> - Verify app is accessible:
curl -sI https://<service>.viktorbarzin.me | head -5
Important: Read each main.tf first — volume names, NFS paths, and namespace references vary. The table above is a guide, not a source of truth. Some stacks may have different volume names or multiple NFS paths under a parent directory.
Commit after every 3-5 stacks:
git add stacks/audiobookshelf/main.tf stacks/calibre/main.tf stacks/changedetection/main.tf
git commit -m "[ci skip] migrate audiobookshelf, calibre, changedetection NFS volumes to CSI PV/PVC"
Task 7: Multi-Volume Stack Migration
These stacks have 2+ NFS volumes. Each needs multiple module calls.
Files to modify (read each main.tf first to get exact volume names and paths):
| Stack | Expected NFS Volumes | Notes |
|---|---|---|
openclaw |
4: tools, home, workspace, data | 3 containers share volumes |
immich |
Multiple: library, upload, thumbs, etc. | Check exact paths from nfs_directories.txt |
servarr |
Parent + 7 sub-stacks, each with NFS | Factory pattern, check each sub-module |
frigate |
Multiple: config, media, recordings | GPU service |
dawarich |
Multiple | Check main.tf |
ebook2audiobook |
Multiple | GPU service |
f1-stream |
Multiple | Check main.tf |
real-estate-crawler |
Multiple | Check main.tf |
nextcloud |
Multiple | Custom LimitRange, complex stack |
rybbit |
Multiple: clickhouse data, etc. | Check main.tf |
osm_routing |
Multiple | Check main.tf |
affine |
Multiple | Check main.tf |
Pattern is the same — just more module calls:
# Example for openclaw (4 volumes)
module "nfs_tools" {
source = "../../modules/kubernetes/nfs_volume"
name = "openclaw-tools"
namespace = kubernetes_namespace.openclaw.metadata[0].name
nfs_server = var.nfs_server
nfs_path = "/mnt/main/openclaw/tools"
}
module "nfs_home" {
source = "../../modules/kubernetes/nfs_volume"
name = "openclaw-home"
namespace = kubernetes_namespace.openclaw.metadata[0].name
nfs_server = var.nfs_server
nfs_path = "/mnt/main/openclaw/home"
}
module "nfs_workspace" {
source = "../../modules/kubernetes/nfs_volume"
name = "openclaw-workspace"
namespace = kubernetes_namespace.openclaw.metadata[0].name
nfs_server = var.nfs_server
nfs_path = "/mnt/main/openclaw/workspace"
}
module "nfs_data" {
source = "../../modules/kubernetes/nfs_volume"
name = "openclaw-data"
namespace = kubernetes_namespace.openclaw.metadata[0].name
nfs_server = var.nfs_server
nfs_path = "/mnt/main/openclaw/data"
}
# Then in pod spec:
volume {
name = "tools"
persistent_volume_claim { claim_name = module.nfs_tools.claim_name }
}
volume {
name = "openclaw-home"
persistent_volume_claim { claim_name = module.nfs_home.claim_name }
}
# ... etc
Step for each: Read main.tf → identify all nfs {} blocks → add module calls → replace volume blocks → plan → apply → verify.
Commit after each multi-volume stack (these are more complex, commit individually):
git add stacks/openclaw/main.tf
git commit -m "[ci skip] openclaw: migrate 4 NFS volumes to CSI PV/PVC with soft mount"
Task 8: Platform Module Migration
These modules are under stacks/platform/modules/ and reference shared modules at ../../../../modules/kubernetes/nfs_volume.
Files to modify:
| Module | Current Storage Pattern | Notes |
|---|---|---|
monitoring/prometheus.tf |
Existing PV/PVC with native NFS source | Change PV source from nfs {} to csi {} |
monitoring/loki.tf |
Existing PV/PVC with native NFS source | Same |
monitoring/grafana.tf |
Existing PV (alertmanager) with native NFS | Same |
redis/main.tf |
Inline NFS or PV | Check current pattern |
dbaas/ |
PV for PostgreSQL, MySQL backup | Check current pattern |
technitium/ |
Inline NFS | Standard migration |
headscale/ |
Inline NFS | Standard migration |
vaultwarden/ |
Inline NFS | Standard migration |
uptime-kuma/ |
Inline NFS | Standard migration |
mailserver/ |
Inline NFS | Standard migration |
infra-maintenance/ |
Inline NFS | Standard migration |
For existing PV/PVC resources (monitoring stack), the change is different — replace the persistent_volume_source block:
# OLD (in prometheus.tf):
persistent_volume_source {
nfs {
path = "/mnt/main/prometheus"
server = var.nfs_server
}
}
# NEW:
persistent_volume_source {
csi {
driver = "nfs.csi.k8s.io"
volume_handle = "prometheus-data"
volume_attributes = {
server = var.nfs_server
share = "/mnt/main/prometheus"
}
}
}
Also add storage_class_name = "nfs-truenas" to the PV spec to inherit mount options.
For inline NFS volumes in platform modules, use the shared module with the longer path:
module "nfs_data" {
source = "../../../../modules/kubernetes/nfs_volume"
name = "technitium-data"
namespace = kubernetes_namespace.technitium.metadata[0].name
nfs_server = var.nfs_server
nfs_path = "/mnt/main/technitium"
}
Apply as one platform apply:
cd stacks/platform && terragrunt apply --non-interactive
Verify all platform services:
kubectl --kubeconfig $(pwd)/config get pods -n monitoring
kubectl --kubeconfig $(pwd)/config get pods -n redis
kubectl --kubeconfig $(pwd)/config get pods -n dbaas
kubectl --kubeconfig $(pwd)/config get pods -n technitium
# ... etc
Commit:
git add stacks/platform/
git commit -m "[ci skip] platform: migrate all NFS volumes to CSI PV/PVC with soft mount"
Task 9: Update Documentation and Skills
Files:
- Modify:
.claude/CLAUDE.md(update NFS Volume Pattern section) - Modify:
.claude/skills/setup-project/SKILL.md(update new service template to use module)
Step 1: Update CLAUDE.md NFS Volume Pattern
Replace the existing NFS Volume Pattern section with:
### NFS Volume Pattern
**Use the `nfs_volume` shared module** for all NFS volumes. This creates CSI-backed PV/PVC with soft mount options (no stale mount hangs):
\```hcl
module "nfs_data" {
source = "../../modules/kubernetes/nfs_volume"
name = "<service>-data" # Must be globally unique
namespace = kubernetes_namespace.<service>.metadata[0].name
nfs_server = var.nfs_server
nfs_path = "/mnt/main/<service>"
}
# In pod spec:
volume {
name = "data"
persistent_volume_claim {
claim_name = module.nfs_data.claim_name
}
}
\```
For platform modules, use `source = "../../../../modules/kubernetes/nfs_volume"`.
**Legacy pattern (DO NOT use for new services):** Inline `nfs {}` blocks mount with `hard,timeo=600` defaults which hang forever on stale mounts.
Step 2: Update setup-project skill
Update the new service template in .claude/skills/setup-project/SKILL.md to use the module pattern instead of inline NFS.
Step 3: Commit
git add .claude/
git commit -m "[ci skip] update NFS volume documentation to use CSI-backed nfs_volume module"
Task 10: Validation — Simulate NFS Outage
This is a manual verification step. Do NOT automate.
After all services are migrated, simulate an NFS blip to confirm the stale mount fix works:
- Pick a low-risk service (e.g.,
privatebin) - On TrueNAS, temporarily block NFS to the K8s network (iptables rule or pause NFS for 30 seconds)
- Observe: pod should get I/O errors within ~9 seconds (not hang)
- If the pod has a liveness probe that touches the filesystem, it should restart automatically
- After NFS recovers, verify the pod re-mounts cleanly
Do NOT run this on production without a maintenance window. This is a "when you're ready" validation, not part of the automated migration.