diff --git a/docs/plans/2026-03-01-nfs-csi-migration-design.md b/docs/plans/2026-03-01-nfs-csi-migration-design.md new file mode 100644 index 00000000..03a84731 --- /dev/null +++ b/docs/plans/2026-03-01-nfs-csi-migration-design.md @@ -0,0 +1,219 @@ +# NFS CSI Driver Migration: Inline Volumes → PV/PVC with Soft Mounts + +**Date**: 2026-03-01 +**Status**: Draft +**Complements**: `2026-02-28-storage-reliability-design.md` (databases → local disk) +**Goal**: Eliminate stale NFS mount hangs, add mount health checking, and create a storage abstraction layer for all NFS-dependent services + +## Problem + +56 services use inline NFS volumes (`nfs {}` in pod specs). This pattern has three compounding issues: + +1. **Stale mounts hang forever**: Inline NFS defaults to `hard,timeo=600` mount options. When TrueNAS is unreachable (reboot, network blip, NFS export change), the kernel retries indefinitely. Pods show `Running 1/1` but are completely frozen with zero listening sockets. The only fix is force-deleting the pod. + +2. **No mount health checking**: kubelet has no visibility into NFS mount health. Liveness probes only check application health, not filesystem access. A stale mount is invisible to the scheduler. + +3. **No storage abstraction**: NFS server IP and export paths are hardcoded into every pod spec via `var.nfs_server`. Changing the backend (different NFS server, different protocol) requires editing 56 stacks. + +## Constraints + +- Zero data migration — same NFS paths, same TrueNAS server, same directories +- Services must keep working during migration (no downtime per service beyond a pod restart) +- Must work with existing Terragrunt architecture (per-stack state isolation) +- Must not break services that will later move to local disk (per storage-reliability design) + +## Design + +### Architecture + +``` +BEFORE: + Pod spec → inline nfs {} block → kubelet mount -t nfs (hard,timeo=600) → TrueNAS + (no health check, hangs on stale mount, server IP in every stack) + +AFTER: + Terraform module → PV (CSI driver ref) + PVC → Pod spec references PVC + CSI driver mounts with soft,timeo=30,retrans=3 → TrueNAS + (health-checked, fails fast on stale mount, server IP in module only) +``` + +### Component 1: NFS CSI Driver (Helm chart in platform stack) + +Deploy `csi-driver-nfs` v4.11+ via Helm in `stacks/platform/modules/nfs-csi/`. + +The driver runs as: +- **Controller**: 1 replica (handles PV provisioning) +- **Node DaemonSet**: 1 per node (handles mount/unmount operations) + +Resource footprint: ~50MB RAM per node, ~10m CPU idle. + +The driver itself does not change NFS behavior — it delegates to the kernel NFS client. 
The value is:
- Mount options are configurable per-StorageClass (not hardcoded kernel defaults)
- CSI health checking can detect unhealthy volumes
- Standard K8s storage API (PV/PVC/StorageClass) instead of inline volumes

### Component 2: StorageClass

```hcl
resource "kubernetes_storage_class" "nfs_truenas" {
  metadata { name = "nfs-truenas" }
  storage_provisioner = "nfs.csi.k8s.io"
  reclaim_policy      = "Retain"
  volume_binding_mode = "Immediate"

  mount_options = [
    "soft",      # Return -EIO instead of hanging forever
    "timeo=30",  # 3-second timeout per NFS RPC call
    "retrans=3", # Retry 3 times before giving up (~9 sec total)
    "actimeo=5", # 5-second attribute cache (balance freshness vs perf)
  ]

  parameters = {
    server = var.nfs_server
    share  = "/mnt/main"
  }
}
```

Note that `mount_options` on a StorageClass only flow into dynamically provisioned PVs; the shared module below therefore sets the same options directly on the PVs it creates statically.

Key mount option differences vs current defaults:

| Option | Current (inline) | New (CSI) | Effect |
|--------|-----------------|-----------|--------|
| `hard` vs `soft` | `hard` (default) | `soft` | I/O errors instead of infinite hang |
| `timeo` | 600 (60 sec) | 30 (3 sec) | Faster failure detection |
| `retrans` | 3 | 3 | Same retry count, but 3s per attempt not 60s |
| `actimeo` | 3-60 sec (kernel defaults, vary by file type) | 5 (5 sec) | Fresher attribute cache |
| Total stale detection | **~3 minutes** | **~9 seconds** | 20x faster |

### Component 3: Shared Terraform Module (`modules/kubernetes/nfs_volume/`)

Creates a PV + PVC pair for each NFS mount point. Hides boilerplate.

**Interface**:
```hcl
module "nfs_data" {
  source     = "../../modules/kubernetes/nfs_volume"
  name       = "myservice-data"      # PV and PVC name (must be unique cluster-wide)
  namespace  = "myservice"           # PVC namespace
  nfs_server = var.nfs_server        # From terraform.tfvars
  nfs_path   = "/mnt/main/myservice" # NFS export path
  # Optional:
  # storage      = "10Gi"              # Default: 10Gi (informational for NFS)
  # access_modes = ["ReadWriteMany"]   # Default: RWX
}
```

**Outputs**:
- `claim_name` — PVC name to reference in pod spec

**Module creates**:
1. `kubernetes_persistent_volume` — CSI-backed, with the soft mount options set on the PV itself (statically created PVs do not inherit them from the StorageClass)
2. `kubernetes_persistent_volume_claim` — bound to the PV, namespaced

PVs are cluster-scoped, so `name` must be globally unique. Convention: `<service>-<volume>` (e.g., `openclaw-tools`, `privatebin-data`).

### Component 4: Stack Migration (Mechanical Change)

Each stack changes from:
```hcl
# OLD: inline NFS
volume {
  name = "data"
  nfs {
    server = var.nfs_server
    path   = "/mnt/main/myservice"
  }
}
```

To:
```hcl
# NEW: module call (outside pod spec)
module "nfs_data" {
  source     = "../../modules/kubernetes/nfs_volume"
  name       = "myservice-data"
  namespace  = "myservice"
  nfs_server = var.nfs_server
  nfs_path   = "/mnt/main/myservice"
}

# NEW: PVC reference (in pod spec, replaces nfs {} block)
volume {
  name = "data"
  persistent_volume_claim {
    claim_name = module.nfs_data.claim_name
  }
}
```

Volume mount blocks (`volume_mount {}`) are **completely unchanged**.

### Component 5: Platform Module Migration

Platform modules (redis, dbaas, monitoring, etc.) that use NFS follow the same pattern, but the module path is `../../../../modules/kubernetes/nfs_volume` (two extra levels deep, since these modules live under `stacks/platform/modules/`). The `nfs_server` variable is already passed through `stacks/platform/main.tf`.

Some platform modules use explicit PV/PVC already (Loki, Prometheus). These get updated to use the CSI driver backend instead of the native NFS PV source.
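For those existing PVs, a minimal sketch of the source change (resource names and paths vary per module; Task 8 of the implementation plan walks through the real files):

```hcl
# OLD: native NFS PV source (mounts with kernel defaults)
persistent_volume_source {
  nfs {
    server = var.nfs_server
    path   = "/mnt/main/prometheus"
  }
}

# NEW: CSI-backed PV source (uses the soft mount options set on the PV)
persistent_volume_source {
  csi {
    driver        = "nfs.csi.k8s.io"
    volume_handle = "prometheus-data" # must be unique across PVs
    volume_attributes = {
      server = var.nfs_server
      share  = "/mnt/main/prometheus"
    }
  }
}
```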
### What Does NOT Change

- NFS export paths on TrueNAS (no `nfs_directories.txt` changes)
- NFS server configuration
- Volume mount paths inside containers
- Sub-path usage patterns
- Container images or application config
- Services that will move to local disk later (per storage-reliability design) — they get CSI mounts as an interim improvement, then move off NFS entirely

## Migration Order

Services grouped by risk. Each batch: apply → verify pods running → verify app accessible → next batch.

### Phase 0: Infrastructure
1. Deploy NFS CSI driver Helm chart (platform module)
2. Create `nfs-truenas` StorageClass
3. Create `modules/kubernetes/nfs_volume/` shared module

### Phase 1: Low-Risk Pilot (3 services)
Pick 3 simple, single-volume services to validate the pattern:
- `privatebin` (1 volume, low traffic)
- `resume` (1 volume, personal site) — `echo` was considered but is stateless, so it has nothing to migrate
- `speedtest` (1 volume, low traffic)

### Phase 2: Simple Services (single NFS volume each, ~20 services)
Mechanical migration of all single-volume stacks. Can be parallelized.

### Phase 3: Multi-Volume Services (~15 services)
Services with 2-4 NFS volumes (openclaw, servarr, immich, etc.). More module calls but the same pattern.

### Phase 4: Platform Modules (~9 modules)
Monitoring stack, Redis, dbaas PVs, etc. These live in `stacks/platform/modules/` and need the module path adjusted.

### Phase 5: Cleanup
- Update CLAUDE.md documentation (new NFS volume pattern)
- Update `setup-project` skill to use the module pattern for new services
- Verify all services healthy

## Rollback

Per-service rollback: revert the stack to inline `nfs {}` and `terragrunt apply`. The data never moved — it's the same NFS path. The PV/PVC objects are destroyed by Terraform and the pod remounts the inline volume. Takes about 1 minute per service.

Full rollback: remove the CSI driver and StorageClass from the platform stack, revert all stacks. No data impact.

## Risks

1. **`soft` mount I/O errors**: Apps that don't handle I/O errors gracefully may crash instead of hanging. This is strictly better — a crash triggers a restart with a fresh mount, whereas a hard mount hangs forever. But some apps may log noisy errors during brief NFS blips.

2. **PV naming conflicts**: PV names are cluster-global. Must ensure uniqueness. The `<service>-<volume>` convention handles this.

3. **Terraform state churn**: Each service gains 2 new resources (PV + PVC) and loses the inline volume (implicit, not tracked). The `terragrunt apply` will show resource additions but no deletions (inline volumes aren't separate TF resources). The pod is recreated.

4. **CSI driver resource overhead**: ~50MB RAM + 10m CPU per node (5 nodes = ~250MB cluster-wide). Acceptable.
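To make Risk 1 work in the service's favor, a pod can surface a stale mount through its liveness probe so kubelet restarts it instead of leaving it wedged. A minimal sketch for a Terraform pod spec, assuming an NFS mount at `/srv/data` (path and timings are illustrative, not part of this design):

```hcl
liveness_probe {
  exec {
    # A read on the NFS-backed path returns -EIO within ~9s on a stale soft mount,
    # failing the probe instead of hanging.
    command = ["sh", "-c", "ls /srv/data > /dev/null"]
  }
  initial_delay_seconds = 30
  period_seconds        = 30
  timeout_seconds       = 15 # longer than the ~9s soft-mount failure window
  failure_threshold     = 3
}
```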
+ +## Success Criteria + +- [ ] NFS CSI driver deployed and healthy on all 5 nodes +- [ ] `nfs-truenas` StorageClass created with soft mount options +- [ ] `modules/kubernetes/nfs_volume/` module created and tested +- [ ] All 56 NFS-dependent services migrated from inline to PV/PVC +- [ ] No service downtime beyond a single pod restart during migration +- [ ] Simulated NFS outage (TrueNAS NFS service pause) results in pod restart (not hang) +- [ ] Documentation and skills updated for new pattern diff --git a/docs/plans/2026-03-01-nfs-csi-migration-plan.md b/docs/plans/2026-03-01-nfs-csi-migration-plan.md new file mode 100644 index 00000000..57cd0614 --- /dev/null +++ b/docs/plans/2026-03-01-nfs-csi-migration-plan.md @@ -0,0 +1,774 @@ +# NFS CSI Driver Migration Implementation Plan + +> **For Claude:** REQUIRED SUB-SKILL: Use superpowers:executing-plans to implement this plan task-by-task. + +**Goal:** Replace all inline NFS volumes with CSI-backed PV/PVC using soft mount options to eliminate stale mount hangs. + +**Architecture:** Deploy the NFS CSI driver as a platform Helm module, create a shared Terraform module for PV/PVC boilerplate, then mechanically migrate all 56 NFS-dependent services from inline `nfs {}` to `persistent_volume_claim {}` referencing the shared module. + +**Tech Stack:** csi-driver-nfs (Helm), Terraform/Terragrunt, Kubernetes PV/PVC/StorageClass + +**Design doc:** `docs/plans/2026-03-01-nfs-csi-migration-design.md` + +--- + +## Task 1: Create the NFS CSI Driver Platform Module + +**Files:** +- Create: `stacks/platform/modules/nfs-csi/main.tf` +- Modify: `stacks/platform/main.tf` (add module block) + +**Step 1: Create the module directory** + +```bash +mkdir -p stacks/platform/modules/nfs-csi +``` + +**Step 2: Write the NFS CSI module** + +Create `stacks/platform/modules/nfs-csi/main.tf`: + +```hcl +variable "tier" { type = string } +variable "nfs_server" { type = string } + +resource "kubernetes_namespace" "nfs_csi" { + metadata { + name = "nfs-csi" + labels = { + tier = var.tier + } + } +} + +resource "helm_release" "nfs_csi_driver" { + namespace = kubernetes_namespace.nfs_csi.metadata[0].name + create_namespace = false + name = "csi-driver-nfs" + atomic = true + timeout = 300 + + repository = "https://raw.githubusercontent.com/kubernetes-csi/csi-driver-nfs/master/charts" + chart = "csi-driver-nfs" + + values = [yamlencode({ + controller = { + replicas = 1 + resources = { + requests = { cpu = "10m", memory = "32Mi" } + limits = { cpu = "100m", memory = "128Mi" } + } + } + node = { + resources = { + requests = { cpu = "10m", memory = "32Mi" } + limits = { cpu = "100m", memory = "128Mi" } + } + } + storageClass = { + create = false # We create it ourselves below for full control + } + })] +} + +resource "kubernetes_storage_class" "nfs_truenas" { + metadata { + name = "nfs-truenas" + } + storage_provisioner = "nfs.csi.k8s.io" + reclaim_policy = "Retain" + volume_binding_mode = "Immediate" + + mount_options = [ + "soft", + "timeo=30", + "retrans=3", + "actimeo=5", + ] + + parameters = { + server = var.nfs_server + share = "/mnt/main" + } +} +``` + +**Step 3: Wire the module into `stacks/platform/main.tf`** + +Add after the `cnpg` module block (around line 318): + +```hcl +module "nfs-csi" { + source = "./modules/nfs-csi" + tier = local.tiers.cluster + nfs_server = var.nfs_server +} +``` + +**Step 4: Verify with plan** + +```bash +cd stacks/platform && terragrunt plan --non-interactive 2>&1 | head -80 +``` + +Expected: Plan shows 3 new resources (`kubernetes_namespace`, 
`helm_release`, `kubernetes_storage_class`). No changes to existing resources.

**Step 5: Apply**

```bash
cd stacks/platform && terragrunt apply --non-interactive
```

**Step 6: Verify CSI driver is running**

```bash
kubectl --kubeconfig $(pwd)/config get pods -n nfs-csi
kubectl --kubeconfig $(pwd)/config get storageclass nfs-truenas
```

Expected: 1 controller pod plus 5 node DaemonSet pods (6 total), all Running. StorageClass `nfs-truenas` exists with provisioner `nfs.csi.k8s.io`.

**Step 7: Commit**

```bash
git add stacks/platform/modules/nfs-csi/ stacks/platform/main.tf
git commit -m "[ci skip] add NFS CSI driver platform module with nfs-truenas StorageClass"
```

---

## Task 2: Create the Shared `nfs_volume` Module

**Files:**
- Create: `modules/kubernetes/nfs_volume/main.tf`

**Step 1: Write the module**

Create `modules/kubernetes/nfs_volume/main.tf`:

```hcl
variable "name" {
  description = "Unique name for PV and PVC (convention: <service>-<volume>)"
  type        = string
}

variable "namespace" {
  description = "Kubernetes namespace for the PVC"
  type        = string
}

variable "nfs_server" {
  description = "NFS server address"
  type        = string
}

variable "nfs_path" {
  description = "NFS export path (e.g. /mnt/main/myservice)"
  type        = string
}

variable "storage" {
  description = "Storage capacity (informational for NFS)"
  type        = string
  default     = "10Gi"
}

variable "access_modes" {
  description = "PV/PVC access modes"
  type        = list(string)
  default     = ["ReadWriteMany"]
}

resource "kubernetes_persistent_volume" "this" {
  metadata {
    name = var.name
  }
  spec {
    capacity = {
      storage = var.storage
    }
    access_modes                     = var.access_modes
    persistent_volume_reclaim_policy = "Retain"
    storage_class_name               = "nfs-truenas"
    volume_mode                      = "Filesystem"

    # Statically created PVs do not inherit mount options from the StorageClass, so set them here.
    mount_options = [
      "soft",
      "timeo=30",
      "retrans=3",
      "actimeo=5",
    ]

    persistent_volume_source {
      csi {
        driver        = "nfs.csi.k8s.io"
        volume_handle = var.name
        volume_attributes = {
          server = var.nfs_server
          share  = var.nfs_path
        }
      }
    }
  }
}

resource "kubernetes_persistent_volume_claim" "this" {
  metadata {
    name      = var.name
    namespace = var.namespace
  }
  spec {
    access_modes       = var.access_modes
    storage_class_name = "nfs-truenas"
    volume_name        = kubernetes_persistent_volume.this.metadata[0].name

    resources {
      requests = {
        storage = var.storage
      }
    }
  }
}

output "claim_name" {
  description = "PVC name to use in pod spec persistent_volume_claim blocks"
  value       = kubernetes_persistent_volume_claim.this.metadata[0].name
}
```

**Step 2: Format**

```bash
terraform fmt modules/kubernetes/nfs_volume/main.tf
```

**Step 3: Commit**

```bash
git add modules/kubernetes/nfs_volume/
git commit -m "[ci skip] add shared nfs_volume module for CSI-backed PV/PVC creation"
```

---

## Task 3: Pilot Migration — `privatebin`

**Files:**
- Modify: `stacks/privatebin/main.tf`

This is the first real migration and validates the pattern end-to-end.
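Optionally, capture the current mount options first so the change is obvious when re-checking in Step 8. A sketch reusing the node lookup from Step 8 (assumes SSH access to the node as `wizard` and the kubeconfig at `$(pwd)/config`):

```bash
NODE=$(kubectl --kubeconfig $(pwd)/config get pod -n privatebin -l app=privatebin -o jsonpath='{.items[0].spec.nodeName}')
IP=$(kubectl --kubeconfig $(pwd)/config get node $NODE -o jsonpath='{.status.addresses[?(@.type=="InternalIP")].address}')
ssh wizard@$IP "mount | grep privatebin"   # expect the old hard/timeo=600 defaults before migration
```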
+ +**Step 1: Read current state** + +Current NFS volume in `stacks/privatebin/main.tf`: + +```hcl +# Lines 71-77 — volume block in pod spec +volume { + name = "data" + nfs { + path = "/mnt/main/privatebin" + server = var.nfs_server + } +} +``` + +Volume mount (lines 54-58, UNCHANGED): +```hcl +volume_mount { + name = "data" + mount_path = "/srv/data" + sub_path = "data" +} +``` + +**Step 2: Add module call** + +Add before the `kubernetes_deployment` resource (e.g., after the ingress_factory module, before the deployment): + +```hcl +module "nfs_data" { + source = "../../modules/kubernetes/nfs_volume" + name = "privatebin-data" + namespace = kubernetes_namespace.privatebin.metadata[0].name + nfs_server = var.nfs_server + nfs_path = "/mnt/main/privatebin" +} +``` + +**Step 3: Replace inline NFS volume with PVC reference** + +Replace the volume block (lines 71-77): + +```hcl +# OLD: +volume { + name = "data" + nfs { + path = "/mnt/main/privatebin" + server = var.nfs_server + } +} + +# NEW: +volume { + name = "data" + persistent_volume_claim { + claim_name = module.nfs_data.claim_name + } +} +``` + +Do NOT touch the `volume_mount` block — it stays identical. + +**Step 4: Plan and verify** + +```bash +cd stacks/privatebin && terragrunt plan --non-interactive +``` + +Expected: 2 resources added (PV + PVC), deployment updated in-place (volume source changed). No resources destroyed (inline volumes aren't tracked as separate TF resources). + +**Step 5: Apply** + +```bash +cd stacks/privatebin && terragrunt apply --non-interactive +``` + +**Step 6: Verify the pod is running with CSI mount** + +```bash +kubectl --kubeconfig $(pwd)/config get pods -n privatebin +kubectl --kubeconfig $(pwd)/config describe pod -n privatebin -l app=privatebin | grep -A5 "Volumes:" +``` + +Expected: Pod running. Volume shows `Type: PersistentVolumeClaim` with `ClaimName: privatebin-data`, NOT `Type: NFS`. + +**Step 7: Verify the app works** + +```bash +curl -sI https://privatebin.viktorbarzin.me | head -5 +``` + +Expected: HTTP 200 (or 302 redirect to the paste page). + +**Step 8: Verify mount options** + +```bash +# SSH to the node running the pod and check mount options +NODE=$(kubectl --kubeconfig $(pwd)/config get pod -n privatebin -l app=privatebin -o jsonpath='{.items[0].spec.nodeName}') +ssh wizard@$(kubectl --kubeconfig $(pwd)/config get node $NODE -o jsonpath='{.status.addresses[?(@.type=="InternalIP")].address}') "mount | grep privatebin" +``` + +Expected: Mount shows `soft,timeo=30,retrans=3,actimeo=5` (NOT the old `hard` default). + +**Step 9: Commit** + +```bash +cd /Users/viktorbarzin/code/infra +git add stacks/privatebin/main.tf +git commit -m "[ci skip] privatebin: migrate NFS volume to CSI-backed PV/PVC with soft mount" +``` + +--- + +## Task 4: Pilot Migration — `resume` + +**Files:** +- Modify: `stacks/resume/main.tf` + +Same pattern as privatebin. Single NFS volume. 
+ +**Step 1: Add module call** + +Add before the `kubernetes_deployment.resume` resource: + +```hcl +module "nfs_data" { + source = "../../modules/kubernetes/nfs_volume" + name = "resume-data" + namespace = kubernetes_namespace.resume.metadata[0].name + nfs_server = var.nfs_server + nfs_path = "/mnt/main/resume" +} +``` + +**Step 2: Replace inline NFS volume with PVC reference** + +In the `resume` deployment's pod spec, replace: + +```hcl +# OLD: +volume { + name = "data" + nfs { + server = var.nfs_server + path = "/mnt/main/resume" + } +} + +# NEW: +volume { + name = "data" + persistent_volume_claim { + claim_name = module.nfs_data.claim_name + } +} +``` + +**Step 3: Plan, apply, verify** + +```bash +cd stacks/resume && terragrunt plan --non-interactive +cd stacks/resume && terragrunt apply --non-interactive +kubectl --kubeconfig $(pwd)/config get pods -n resume +curl -sI https://resume.viktorbarzin.me | head -5 +``` + +**Step 4: Commit** + +```bash +cd /Users/viktorbarzin/code/infra +git add stacks/resume/main.tf +git commit -m "[ci skip] resume: migrate NFS volume to CSI-backed PV/PVC with soft mount" +``` + +--- + +## Task 5: Pilot Migration — `speedtest` + +**Files:** +- Modify: `stacks/speedtest/main.tf` + +**Step 1: Add module call** + +```hcl +module "nfs_config" { + source = "../../modules/kubernetes/nfs_volume" + name = "speedtest-config" + namespace = kubernetes_namespace.speedtest.metadata[0].name + nfs_server = var.nfs_server + nfs_path = "/mnt/main/speedtest" +} +``` + +**Step 2: Replace inline NFS volume** + +```hcl +# OLD: +volume { + name = "config" + nfs { + server = var.nfs_server + path = "/mnt/main/speedtest" + } +} + +# NEW: +volume { + name = "config" + persistent_volume_claim { + claim_name = module.nfs_config.claim_name + } +} +``` + +**Step 3: Plan, apply, verify** + +```bash +cd stacks/speedtest && terragrunt plan --non-interactive +cd stacks/speedtest && terragrunt apply --non-interactive +kubectl --kubeconfig $(pwd)/config get pods -n speedtest +curl -sI https://speedtest.viktorbarzin.me | head -5 +``` + +**Step 4: Commit** + +```bash +cd /Users/viktorbarzin/code/infra +git add stacks/speedtest/main.tf +git commit -m "[ci skip] speedtest: migrate NFS volume to CSI-backed PV/PVC with soft mount" +``` + +--- + +## Task 6: Batch Migration — Simple Single-Volume Stacks + +After pilots are verified, migrate the remaining single-volume stacks. These all follow the exact same mechanical pattern. 
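Before working through the list, it can help to regenerate it from the code. A rough inventory sketch (assumes the repo layout above and GNU grep; output should roughly match the table below):

```bash
# Stacks that still declare inline NFS volumes (already-migrated pilots drop out automatically).
grep -rl --include='main.tf' 'nfs {' stacks/ | sort
```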
**Files to modify** (one `main.tf` each — apply and verify each individually):

| Stack | Volume Name | PV Name | NFS Path |
|-------|------------|---------|----------|
| `audiobookshelf` | `data` | `audiobookshelf-data` | `/mnt/main/audiobookshelf` |
| `calibre` | `data` | `calibre-data` | `/mnt/main/calibre-web-automated` |
| `changedetection` | `data` | `changedetection-data` | `/mnt/main/changedetection` |
| `diun` | `data` | `diun-data` | `/mnt/main/diun` |
| `excalidraw` | `data` | `excalidraw-data` | `/mnt/main/excalidraw` |
| `forgejo` | `data` | `forgejo-data` | `/mnt/main/forgejo` |
| `freshrss` | `data` | `freshrss-data` | `/mnt/main/freshrss` |
| `hackmd` | `data` | `hackmd-data` | `/mnt/main/hackmd` |
| `health` | `data` | `health-data` | `/mnt/main/health` |
| `isponsorblocktv` | `data` | `isponsorblocktv-data` | `/mnt/main/isponsorblocktv` |
| `meshcentral` | `data` | `meshcentral-data` | `/mnt/main/meshcentral` |
| `n8n` | `data` | `n8n-data` | `/mnt/main/n8n` |
| `navidrome` | `data` | `navidrome-data` | `/mnt/main/navidrome` |
| `netbox` | `data` | `netbox-data` | `/mnt/main/netbox` |
| `ntfy` | `data` | `ntfy-data` | `/mnt/main/ntfy` |
| `onlyoffice` | `data` | `onlyoffice-data` | `/mnt/main/onlyoffice` |
| `owntracks` | `data` | `owntracks-data` | `/mnt/main/owntracks` |
| `privatebin` | _(done in Task 3)_ | | |
| `resume` | _(done in Task 4)_ | | |
| `send` | `data` | `send-data` | `/mnt/main/send` |
| `speedtest` | _(done in Task 5)_ | | |
| `tandoor` | `data` | `tandoor-data` | `/mnt/main/tandoor` |
| `wealthfolio` | `data` | `wealthfolio-data` | `/mnt/main/wealthfolio` |
| `whisper` | `data` | `whisper-data` | `/mnt/main/whisper` |
| `atuin` | `data` | `atuin-data` | `/mnt/main/atuin` |
| `matrix` | `data` | `matrix-data` | `/mnt/main/matrix` |
| `ollama` | `data` | `ollama-data` | `/mnt/main/ollama` |
| `poison-fountain` | `data` | `poison-fountain-data` | `/mnt/main/poison-fountain` |
| `woodpecker` | `data` | `woodpecker-data` | `/mnt/main/woodpecker` |
| `ytdlp` | `data` | `ytdlp-data` | `/mnt/main/ytdlp` |
| `stirling-pdf` | `data` | `stirling-pdf-data` | `/mnt/main/stirling-pdf` |
| `paperless-ngx` | `data` | `paperless-ngx-data` | `/mnt/main/paperless-ngx` |
| `grampsweb` | `data` | `grampsweb-data` | `/mnt/main/grampsweb` |
| `trading-bot` | `data` | `trading-bot-data` | `/mnt/main/trading-bot` |

**For each stack, the pattern is identical:**

1. Read `stacks/<stack>/main.tf` to find the exact NFS volume block and its volume name
2. Add a `module "nfs_<volume>"` call with the correct PV name, namespace, and NFS path
3. Replace the `nfs {}` block with `persistent_volume_claim { claim_name = module.nfs_<volume>.claim_name }`
4. `cd stacks/<stack> && terragrunt apply --non-interactive`
5. Verify the pod is running: `kubectl --kubeconfig $(pwd)/config get pods -n <namespace>`
6. Verify the app is accessible: `curl -sI https://<service>.viktorbarzin.me | head -5`

**Important**: Read each `main.tf` first — volume names, NFS paths, and namespace references vary. The table above is a guide, not a source of truth. Some stacks may have different volume names or multiple NFS paths under a parent directory.

**Commit after every 3-5 stacks:**

```bash
git add stacks/audiobookshelf/main.tf stacks/calibre/main.tf stacks/changedetection/main.tf
git commit -m "[ci skip] migrate audiobookshelf, calibre, changedetection NFS volumes to CSI PV/PVC"
```

---

## Task 7: Multi-Volume Stack Migration

These stacks have 2+ NFS volumes. Each needs multiple module calls.
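Where the module calls get repetitive, they can optionally be generated from a map with `for_each` rather than written out one by one. A sketch using the openclaw paths from the example below (purely optional; the explicit per-volume calls shown later are equally fine):

```hcl
locals {
  openclaw_nfs = {
    tools     = "/mnt/main/openclaw/tools"
    home      = "/mnt/main/openclaw/home"
    workspace = "/mnt/main/openclaw/workspace"
    data      = "/mnt/main/openclaw/data"
  }
}

module "nfs" {
  source   = "../../modules/kubernetes/nfs_volume"
  for_each = local.openclaw_nfs

  name       = "openclaw-${each.key}"
  namespace  = kubernetes_namespace.openclaw.metadata[0].name
  nfs_server = var.nfs_server
  nfs_path   = each.value
}

# Volumes then reference module.nfs["tools"].claim_name, module.nfs["home"].claim_name, etc.
```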
+ +**Files to modify** (read each `main.tf` first to get exact volume names and paths): + +| Stack | Expected NFS Volumes | Notes | +|-------|---------------------|-------| +| `openclaw` | 4: tools, home, workspace, data | 3 containers share volumes | +| `immich` | Multiple: library, upload, thumbs, etc. | Check exact paths from nfs_directories.txt | +| `servarr` | Parent + 7 sub-stacks, each with NFS | Factory pattern, check each sub-module | +| `frigate` | Multiple: config, media, recordings | GPU service | +| `dawarich` | Multiple | Check main.tf | +| `ebook2audiobook` | Multiple | GPU service | +| `f1-stream` | Multiple | Check main.tf | +| `real-estate-crawler` | Multiple | Check main.tf | +| `nextcloud` | Multiple | Custom LimitRange, complex stack | +| `rybbit` | Multiple: clickhouse data, etc. | Check main.tf | +| `osm_routing` | Multiple | Check main.tf | +| `affine` | Multiple | Check main.tf | + +**Pattern is the same — just more module calls:** + +```hcl +# Example for openclaw (4 volumes) +module "nfs_tools" { + source = "../../modules/kubernetes/nfs_volume" + name = "openclaw-tools" + namespace = kubernetes_namespace.openclaw.metadata[0].name + nfs_server = var.nfs_server + nfs_path = "/mnt/main/openclaw/tools" +} + +module "nfs_home" { + source = "../../modules/kubernetes/nfs_volume" + name = "openclaw-home" + namespace = kubernetes_namespace.openclaw.metadata[0].name + nfs_server = var.nfs_server + nfs_path = "/mnt/main/openclaw/home" +} + +module "nfs_workspace" { + source = "../../modules/kubernetes/nfs_volume" + name = "openclaw-workspace" + namespace = kubernetes_namespace.openclaw.metadata[0].name + nfs_server = var.nfs_server + nfs_path = "/mnt/main/openclaw/workspace" +} + +module "nfs_data" { + source = "../../modules/kubernetes/nfs_volume" + name = "openclaw-data" + namespace = kubernetes_namespace.openclaw.metadata[0].name + nfs_server = var.nfs_server + nfs_path = "/mnt/main/openclaw/data" +} + +# Then in pod spec: +volume { + name = "tools" + persistent_volume_claim { claim_name = module.nfs_tools.claim_name } +} +volume { + name = "openclaw-home" + persistent_volume_claim { claim_name = module.nfs_home.claim_name } +} +# ... etc +``` + +**Step for each**: Read main.tf → identify all `nfs {}` blocks → add module calls → replace volume blocks → plan → apply → verify. + +**Commit after each multi-volume stack** (these are more complex, commit individually): + +```bash +git add stacks/openclaw/main.tf +git commit -m "[ci skip] openclaw: migrate 4 NFS volumes to CSI PV/PVC with soft mount" +``` + +--- + +## Task 8: Platform Module Migration + +These modules are under `stacks/platform/modules/` and reference shared modules at `../../../../modules/kubernetes/nfs_volume`. 
**Files to modify:**

| Module | Current Storage Pattern | Notes |
|--------|----------------------|-------|
| `monitoring/prometheus.tf` | Existing PV/PVC with native NFS source | Change PV source from `nfs {}` to `csi {}` |
| `monitoring/loki.tf` | Existing PV/PVC with native NFS source | Same |
| `monitoring/grafana.tf` | Existing PV (alertmanager) with native NFS | Same |
| `redis/main.tf` | Inline NFS or PV | Check current pattern |
| `dbaas/` | PV for PostgreSQL, MySQL backup | Check current pattern |
| `technitium/` | Inline NFS | Standard migration |
| `headscale/` | Inline NFS | Standard migration |
| `vaultwarden/` | Inline NFS | Standard migration |
| `uptime-kuma/` | Inline NFS | Standard migration |
| `mailserver/` | Inline NFS | Standard migration |
| `infra-maintenance/` | Inline NFS | Standard migration |

**For existing PV/PVC resources** (monitoring stack), the change is different — replace the `persistent_volume_source` block:

```hcl
# OLD (in prometheus.tf):
persistent_volume_source {
  nfs {
    path   = "/mnt/main/prometheus"
    server = var.nfs_server
  }
}

# NEW:
persistent_volume_source {
  csi {
    driver        = "nfs.csi.k8s.io"
    volume_handle = "prometheus-data"
    volume_attributes = {
      server = var.nfs_server
      share  = "/mnt/main/prometheus"
    }
  }
}
```

Also add `storage_class_name = "nfs-truenas"` and the soft `mount_options` list to the PV spec (statically created PVs do not inherit mount options from the StorageClass).

**For inline NFS volumes** in platform modules, use the shared module with the longer path:

```hcl
module "nfs_data" {
  source     = "../../../../modules/kubernetes/nfs_volume"
  name       = "technitium-data"
  namespace  = kubernetes_namespace.technitium.metadata[0].name
  nfs_server = var.nfs_server
  nfs_path   = "/mnt/main/technitium"
}
```

**Apply as one platform apply:**

```bash
cd stacks/platform && terragrunt apply --non-interactive
```

**Verify all platform services:**

```bash
kubectl --kubeconfig $(pwd)/config get pods -n monitoring
kubectl --kubeconfig $(pwd)/config get pods -n redis
kubectl --kubeconfig $(pwd)/config get pods -n dbaas
kubectl --kubeconfig $(pwd)/config get pods -n technitium
# ... etc
```

**Commit:**

```bash
git add stacks/platform/
git commit -m "[ci skip] platform: migrate all NFS volumes to CSI PV/PVC with soft mount"
```

---

## Task 9: Update Documentation and Skills

**Files:**
- Modify: `.claude/CLAUDE.md` (update NFS Volume Pattern section)
- Modify: `.claude/skills/setup-project/SKILL.md` (update new service template to use module)

**Step 1: Update CLAUDE.md NFS Volume Pattern**

Replace the existing NFS Volume Pattern section with:

```markdown
### NFS Volume Pattern
**Use the `nfs_volume` shared module** for all NFS volumes. This creates CSI-backed PV/PVC with soft mount options (no stale mount hangs):
\```hcl
module "nfs_data" {
  source     = "../../modules/kubernetes/nfs_volume"
  name       = "<service>-data" # Must be globally unique
  namespace  = kubernetes_namespace.<service>.metadata[0].name
  nfs_server = var.nfs_server
  nfs_path   = "/mnt/main/<service>"
}

# In pod spec:
volume {
  name = "data"
  persistent_volume_claim {
    claim_name = module.nfs_data.claim_name
  }
}
\```
For platform modules, use `source = "../../../../modules/kubernetes/nfs_volume"`.

**Legacy pattern (DO NOT use for new services):** Inline `nfs {}` blocks mount with `hard,timeo=600` defaults which hang forever on stale mounts.
+``` + +**Step 2: Update setup-project skill** + +Update the new service template in `.claude/skills/setup-project/SKILL.md` to use the module pattern instead of inline NFS. + +**Step 3: Commit** + +```bash +git add .claude/ +git commit -m "[ci skip] update NFS volume documentation to use CSI-backed nfs_volume module" +``` + +--- + +## Task 10: Validation — Simulate NFS Outage + +**This is a manual verification step. Do NOT automate.** + +After all services are migrated, simulate an NFS blip to confirm the stale mount fix works: + +1. Pick a low-risk service (e.g., `privatebin`) +2. On TrueNAS, temporarily block NFS to the K8s network (iptables rule or pause NFS for 30 seconds) +3. Observe: pod should get I/O errors within ~9 seconds (not hang) +4. If the pod has a liveness probe that touches the filesystem, it should restart automatically +5. After NFS recovers, verify the pod re-mounts cleanly + +**Do NOT run this on production without a maintenance window.** This is a "when you're ready" validation, not part of the automated migration.
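While the blip is in effect, a couple of read-only commands make the behavior easy to observe (a sketch reusing the kubeconfig convention from earlier tasks; nothing here touches TrueNAS or automates the outage):

```bash
# Watch for restarts (expected) rather than a pod stuck Running with a frozen filesystem.
kubectl --kubeconfig $(pwd)/config get pods -n privatebin -w

# Recent events should show probe failures / restarts around the time NFS was blocked.
kubectl --kubeconfig $(pwd)/config get events -n privatebin --sort-by=.lastTimestamp | tail -20
```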