[ci skip] revise storage reliability design based on research agent findings

Key changes from v1:
- Drop 3-instance replication → 2-instance CNPG, single Redis/MySQL
- Remove Headscale from PG migration (project discourages it)
- Remove MeshCentral from PG migration (NeDB, not SQLite)
- Replace Redis Sentinel with single redis:7 on local disk (modules unused)
- Add RAM overcommit warning and mitigation
- Add explicit single-host limitation acknowledgment
- Add per-component rollback plans
- Fix backup strategy (CNPG can't archive WAL to NFS natively)
- Reorder migration: low-risk services first, authentik last
- Add research gate before each service migration
2026-02-28 14:38:01 +00:00 · 6cc1da4bd6
commit 6cc1da4bd6
parent 6381bcee40
1 changed file with 238 additions and 107 deletions
--- a/docs/plans/2026-02-28-storage-reliability-design.md
+++ b/docs/plans/2026-02-28-storage-reliability-design.md
@@ -1,7 +1,7 @@
 # Storage Reliability: Database Replication + SQLite Consolidation
 **Date**: 2026-02-28
-**Status**: Approved
+**Status**: Revised (v2) — incorporates research agent findings
 **Goal**: Eliminate data corruption risk from NFS outages by moving databases off NFS
 ## Problem
@@ -21,6 +21,21 @@ SQLite-over-NFS is fundamentally broken (advisory locking unreliable, WAL mode u
 - Stop-and-verify after each service migration
 - No data loss tolerance
 ## Single-Host Limitation (Explicit Acknowledgment)
 All K8s nodes are VMs on a single Proxmox host (192.168.1.127). This means:
 **Replication PROTECTS against**: individual VM crash/restart, NFS outage,
 individual node rebuild, pod OOM/eviction, software-level failures.
 **Replication does NOT protect against**: Proxmox host failure, physical
 disk failure, power loss — all replicas die simultaneously.
 Given this, the plan uses **minimal replication** (1 primary + 1 replica
 for PostgreSQL, single instance for Redis) rather than full 3-instance
 clusters. The primary reliability gain comes from moving off NFS to local
 disk with proper fsync semantics, not from replication count.
 ## Design
 ### Strategy Overview
@@ -28,106 +43,162 @@ SQLite-over-NFS is fundamentally broken (advisory locking unreliable, WAL mode u
 ```
 BEFORE:  All services → NFS (TrueNAS VM) → single point of failure
-AFTER:   Databases → local disk + native replication (HA, proper fsync)
+AFTER:   Databases → local disk (proper fsync, no NFS SPOF)
-         SQLite apps → migrated to shared PostgreSQL where possible
+         SQLite apps → migrated to shared PostgreSQL where supported
         Media/configs → NFS (TrueNAS, non-critical path)
         Backups → all consolidate to NFS → rsync to backup NAS
 ```
-### Component 1: PostgreSQL HA via CloudNativePG
+### Component 1: PostgreSQL via CloudNativePG
 **Current**: Single PostgreSQL 16 pod on NFS (`/mnt/main/postgresql/data`)
-**Target**: CloudNativePG operator with 3-instance cluster on local disk
+using custom image `viktorbarzin/postgres:16-master` (postgis + pgvector + pgvecto-rs).
-CloudNativePG is a CNCF K8s operator that manages PostgreSQL clusters with:
+**Target**: CloudNativePG operator with 2-instance cluster on local disk.
 CloudNativePG (CNCF project, v1.28+, supports K8s 1.34 and PG 14-18):
 - Automatic primary/replica failover
- Streaming replication across nodes
+- Streaming replication
- Continuous WAL archiving (to NFS for backup)
+- Declarative CRD-based management (Terraform/Terragrunt compatible)
- Local PVCs for data (proper fsync)
+- Built-in monolith import mode (better than manual pg_dumpall)
 - Built-in PgBouncer pooler CRD
 Architecture:
 ```
-CloudNativePG Cluster
+CloudNativePG Cluster (namespace: dbaas)
-├── Primary (node A) — local PVC
+├── Primary (worker node A) — local PVC via local-path-provisioner
-├── Replica (node B) — local PVC, streaming replication
+├── Replica (worker node B) — local PVC, streaming replication
-├── Replica (node C) — local PVC, streaming replication
+└── Services: <cluster>-rw (read-write), <cluster>-ro (read-only)
 └── WAL archive → NFS:/mnt/main/postgresql-wal-archive/ (backup only)
 ```
-Dependent services (unchanged connection, new reliable backend):
+**Migration approach**: Use CNPG's native monolith import mode, which
-authentik, n8n, dawarich, tandoor, linkwarden, netbox, woodpecker,
+connects to the running old PostgreSQL and imports databases + roles
-rybbit, affine, health, resume, trading-bot
+using pg_dump -Fd per database. Superior to manual pg_dumpall.
-Resource overhead: ~3GB RAM total, ~50GB local disk per node
+**Service endpoint strategy**: Create an ExternalName Service called
 `postgresql` in namespace `dbaas` pointing to the CNPG `-rw` service.
 This preserves `var.postgresql_host` = `postgresql.dbaas.svc.cluster.local`
 with zero changes to dependent services.
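As a rough illustration of the endpoint strategy above, the following Python sketch builds the ExternalName Service manifest as a plain dictionary and prints it as JSON. The cluster name `pg-cluster` is a hypothetical placeholder (the plan only refers to `<cluster>-rw`), and the sketch assumes the CNPG services live in the same `dbaas` namespace.

```python
import json


def external_name_service(name: str, namespace: str, target_service: str) -> dict:
    """Build a core/v1 Service of type ExternalName that aliases another Service."""
    return {
        "apiVersion": "v1",
        "kind": "Service",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "type": "ExternalName",
            # DNS name the alias resolves to (a CNAME at the cluster DNS level).
            "externalName": f"{target_service}.{namespace}.svc.cluster.local",
        },
    }


# "pg-cluster" is an assumed CNPG cluster name, not taken from the plan.
manifest = external_name_service("postgresql", "dbaas", "pg-cluster-rw")
print(json.dumps(manifest, indent=2))
```

Because an ExternalName Service is only a DNS alias, clients keep resolving `postgresql.dbaas.svc.cluster.local`. The existing `postgresql` Service would presumably have to be replaced rather than duplicated, since two Services cannot share a name in one namespace.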
-### Component 2: MySQL HA (or Migration to PostgreSQL)
+**Special cases**:
 - Authentik: Replace manual PgBouncer deployment with CNPG's built-in
  Pooler CRD, or update PgBouncer to point to CNPG's `-rw` service
 - Init containers (woodpecker, trading-bot): Enable `enableSuperuserAccess: true`
  in CNPG Cluster spec — CNPG strips SUPERUSER from imported roles by default
 - Custom image: Test `viktorbarzin/postgres:16-master` with CNPG first.
  Move `shared_preload_libraries=vectors.so` to CNPG `postgresql.parameters`
  (CNPG overrides container CMD). Tag format may need adjusting.
 **Backup**: Keep existing pg_dumpall CronJob, pointed at new CNPG endpoint.
 CNPG's native WAL archiving requires S3-compatible backend (not NFS) —
 adding MinIO is a future enhancement, not a blocker.
 Dependent services (12): authentik, n8n, dawarich, tandoor, linkwarden,
 netbox, woodpecker, rybbit, affine, health, resume, trading-bot
 Resource overhead: ~2GB RAM total (2 instances), ~50GB local disk per node
 ### Component 2: Redis — Single Instance on Local Disk
 **Current**: Single redis-stack pod on NFS (`/mnt/main/redis`).
 RDB background save takes 39 seconds on NFS (should be <1s on local disk).
 **Finding**: redis-stack modules (RedisJSON, RediSearch, RedisTimeSeries,
 RedisBloom, RedisGears) are completely unused. Zero module commands in
 `INFO commandstats`. All 11 services use plain Redis commands only
 (GET, SET, BullMQ queues, Celery broker, caching).
 **Finding**: No service stores critical primary data in Redis. All use it
 for job queues and caching. Losing Redis data means: users re-login,
 jobs retry, caches rebuild. Inconvenient but never catastrophic.
 **Finding**: None of the 11 services support Sentinel-aware connections.
 Redis Sentinel would require a proxy layer with no reliability gain on
 a single physical host.
 **Target**: Single `redis:7-alpine` (or `valkey:9`) on local PVC.
 Drop redis-stack — modules are unused overhead (~100MB RAM saved).
 Architecture:
 ```
 Redis 7 (single instance)
 ├── Local PVC via local-path-provisioner (fast RDB saves)
 ├── K8s Service: redis.redis.svc.cluster.local (unchanged)
 └── Hourly CronJob: cp dump.rdb → NFS:/mnt/main/redis-backup/
 ```
 No client changes needed. Same service endpoint. Same Redis commands.
 Resource overhead: ~550MB RAM (today's usage minus ~100MB of module overhead),
 ~1GB local disk
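The hourly `cp dump.rdb` step in the architecture above could be sketched roughly as follows. This is a minimal sketch, not the plan's actual CronJob: the timestamped file naming, the temp-file-then-rename step, and the `keep` retention count are all assumptions added for illustration.

```python
import shutil
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def backup_rdb(rdb_path: Path, backup_dir: Path, keep: int = 24) -> Path:
    """Copy an RDB snapshot into backup_dir under a timestamped name.

    The copy goes to a temporary name first and is renamed afterwards, so a
    reader of backup_dir never sees a half-written snapshot.
    """
    backup_dir.mkdir(parents=True, exist_ok=True)
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    final = backup_dir / f"dump-{stamp}.rdb"
    tmp = backup_dir / (final.name + ".tmp")
    shutil.copy2(rdb_path, tmp)
    tmp.replace(final)

    # Hypothetical retention policy: keep only the newest `keep` snapshots.
    snapshots = sorted(backup_dir.glob("dump-*.rdb"))
    if keep > 0:
        for old in snapshots[:-keep]:
            old.unlink()
    return final


# Demo against temporary directories standing in for the local PVC and NFS.
with tempfile.TemporaryDirectory() as workdir:
    source = Path(workdir) / "dump.rdb"
    source.write_bytes(b"placeholder snapshot bytes")
    print(backup_rdb(source, Path(workdir) / "redis-backup").name)
```

Copying the file directly is reasonable here because Redis writes each RDB snapshot to a temporary file and renames it into place, so `dump.rdb` is always a complete snapshot.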
 ### Component 3: MySQL — Single Instance on Local Disk
 **Current**: Single MySQL pod on NFS (`/mnt/main/mysql`)
-**Target**: Either MySQL Operator (InnoDB Cluster) or migrate MySQL services to PostgreSQL
+**Target**: Single MySQL on local PVC
-Services on MySQL: hackmd, speedtest, onlyoffice, crowdsec,
+Services on MySQL (8): hackmd, speedtest, onlyoffice, crowdsec,
 paperless-ngx, real-estate-crawler, url-shortener, grafana
-Many of these support PostgreSQL. Consolidating to one DB engine
+Evaluate per-service whether migration to PostgreSQL is feasible
-reduces operational complexity. Evaluate per-service during implementation.
+(reduces operational complexity to one DB engine). Do during
 implementation research phase.
-### Component 3: Redis HA via Sentinel
+**Backup**: Keep existing mysqldump CronJob.
-**Current**: Single redis-stack pod on NFS (`/mnt/main/redis`)
-**Target**: Redis Sentinel (3 instances) with data on local disk
-Architecture:
-```
-Redis Sentinel (3 instances)
-├── Primary (node A) — local PVC, RDB + AOF persistence
-├── Replica (node B) — local PVC
-├── Replica (node C) — local PVC
-└── Sentinel monitors (3) — automatic failover
-```
-Resource overhead: ~1.5GB RAM total, ~2GB local disk per node
 ### Component 4: Immich PostgreSQL
 **Current**: Dedicated PostgreSQL + pgvector on NFS
-**Target**: Migrate to CloudNativePG cluster (separate database in same cluster, or dedicated cluster with pgvector extension)
+(`ghcr.io/immich-app/postgres:15-vectorchord0.3.0-pgvectors0.2.0`)
-### Component 5: ClickHouse HA (Rybbit)
+**Target**: Move to local PVC (same image, same single instance).
 Immich's PG has specialized extensions (VectorChord, pgvectors) that
 may not be compatible with CNPG operand images. Simpler to keep as
 standalone PG on local disk.
 ### Component 5: ClickHouse (Rybbit)
 **Current**: Single ClickHouse on NFS (`/mnt/main/clickhouse`)
-**Target**: ClickHouse with native replication via ClickHouse Keeper, or accept risk (analytics data, rebuildable)
+**Target**: Move to local PVC (single instance). Analytics data is
 rebuildable. ClickHouse replication is not justified for a homelab.
 ### Component 6: SQLite App Consolidation to PostgreSQL
-Apps that support PostgreSQL — migrate to shared CloudNativePG cluster:
+**REVISED based on per-app research:**
-| App | Config mechanism | Priority |
+Apps confirmed safe to migrate:
-|-----|-----------------|----------|
-| Vaultwarden | `DATABASE_URL` env var | P0 (password vault) |
-| Headscale | `db_type: postgres` in config | P0 (VPN) |
-| Forgejo | `[database]` section in app.ini | P1 |
-| Open WebUI | `DATABASE_URL` env var | P2 |
-| Meshcentral | config.json `db` section | P2 |
-| FreshRSS | `db` config | P2 |
-Apps stuck on SQLite (accept risk or use Litestream for backup):
+| App | Config mechanism | Migration tool | Risk | Notes |
 |-----|-----------------|---------------|------|-------|
 | Forgejo | `[database]` in app.ini | `forgejo dump --database postgres` | Moderate | Git repos stay on NFS |
 | FreshRSS | `DB_HOST` env vars | OPML export/import (fresh install) | Low | PG is the recommended backend |
 | Open WebUI | `DATABASE_URL` env var | None (start fresh) | Low | Chat history is disposable |
 | Vaultwarden | `DATABASE_URL` env var | pgloader (unsupported by maintainers) | **HIGH** | Test extensively; attachments stay on NFS |
 **Apps REMOVED from migration plan:**
 | App | Reason |
 |-----|--------|
 | **Headscale** | Project EXPLICITLY DISCOURAGES PostgreSQL: "highly discouraged, only supported for legacy reasons. All new development and testing are SQLite." Migrating risks VPN stability. |
 | **MeshCentral** | Uses NeDB (document store), not SQLite. NeDB→PG migration path is poorly documented and risky. |
 Apps confirmed SQLite/NeDB/bbolt-only (stay on NFS):
 | App | Storage engine | Mitigation |
 |-----|---------------|------------|
-| Uptime Kuma | SQLite only | Litestream or accept |
+| Headscale | SQLite (recommended by project) | Accept (project-recommended config) |
-| Navidrome | SQLite only | Litestream or accept |
+| Uptime Kuma | SQLite (v2 adds MariaDB, not PG) | Accept or Litestream |
-| Audiobookshelf | SQLite only | Litestream or accept |
+| Navidrome | SQLite only | Accept or Litestream |
 | Audiobookshelf | SQLite only | Accept or Litestream |
 | Calibre-Web | SQLite (Calibre format) | Accept (format constraint) |
-| Wealthfolio | SQLite only | Litestream or accept |
+| Wealthfolio | SQLite only | Accept or Litestream |
-| Diun | BoltDB only | Accept (rebuildable state) |
+| MeshCentral | NeDB (document store) | Accept |
 | Diun | bbolt (BoltDB fork) | Accept (rebuildable state) |
 ### Component 7: Monitoring Stack
 Prometheus, Loki, Alertmanager use specialized storage (TSDB, BoltDB).
-Cannot migrate to PostgreSQL. Options:
+Cannot migrate to PostgreSQL. Prometheus WAL is already on tmpfs (good).
 - Move to local disk (emptyDir or local PVC)
 - Accept NFS risk (metrics/logs are ephemeral, loss is annoying not catastrophic)
 - Prometheus WAL is already on tmpfs (good)
 Recommendation: Move to local PVCs. Losing metrics history on node
 failure is acceptable for a homelab.
@@ -143,82 +214,142 @@ NFS failure for these = temporary unavailability, not corruption.
 ## Backup Strategy
 ```
-CloudNativePG     → continuous WAL archiving  → NFS:/mnt/main/postgresql-wal-archive/
+CNPG PostgreSQL  → pg_dumpall CronJob (daily) → NFS:/mnt/main/postgresql-backup/
-MySQL (if kept)   → automated mysqldump       → NFS:/mnt/main/mysql-backup/
+MySQL            → mysqldump CronJob (daily)  → NFS:/mnt/main/mysql-backup/
-Redis Sentinel    → periodic RDB snapshots    → NFS:/mnt/main/redis-backup/
+Redis            → RDB copy CronJob (hourly)  → NFS:/mnt/main/redis-backup/
-Litestream        → continuous SQLite backup   → NFS:/mnt/main/litestream/
+Immich PG        → pg_dump CronJob (daily)    → NFS:/mnt/main/immich-pg-backup/
-Media/configs     → already on NFS
+Litestream       → continuous SQLite backup   → NFS:/mnt/main/litestream/ (optional)
 Media/configs    → already on NFS
 NFS (TrueNAS) → rsync → Backup NAS  (unchanged)
 ```
-All backups still consolidate to TrueNAS. The rsync-to-backup-NAS
+All backups still consolidate to TrueNAS. Rsync-to-backup-NAS workflow
-workflow is completely unchanged.
+is completely unchanged.
 **Note**: CNPG's native WAL archiving requires S3-compatible storage
 (not NFS). Adding MinIO for PITR capability is a future enhancement.
 The pg_dumpall CronJob provides adequate backup for a homelab.
 ## Migration Order (Safety-First)
-Each phase: backup → migrate → verify → user confirms → next phase.
+Each phase: research → backup → migrate → verify → user confirms → next.
 Before each service migration, a research subagent will:
 1. Confirm current setup and configuration
 2. Research online best practices and documentation
 3. Scrutinize the migration plan for that specific service
 4. Present findings for review before execution
 ### Phase 0: Infrastructure Prerequisites
- Install CloudNativePG operator
+- Verify RAM headroom (current overcommit must be addressed first)
- Add local virtual disks to K8s nodes (via Proxmox)
+- Add dedicated local virtual disks to K8s worker nodes (via Proxmox)
- Set up local-path StorageClass
+- Verify local-path-provisioner is configured for new disks
- Install Redis Sentinel
+- Install CloudNativePG operator (Helm)
 - Test CNPG with custom PostgreSQL image (throwaway cluster)
-### Phase 1: PostgreSQL Migration (highest impact)
+### Phase 1: PostgreSQL Migration (highest impact, most preparation)
-1. Full pg_dumpall backup to NFS
+1. Deploy throwaway CNPG cluster to test image compatibility and import
-2. Deploy CloudNativePG cluster (empty)
+2. Full pg_dumpall backup to NFS
-3. Restore backup into CloudNativePG
+3. Deploy production CNPG cluster with monolith import from running PG
-4. Verify all 12 dependent services work
+4. Create ExternalName Service for backwards compatibility
-5. Decommission old PostgreSQL pod + NFS volume
+5. Migrate ONE low-risk service first (e.g., `resume` or `health`)
 6. Verify for 24-48 hours
 7. Migrate remaining services one at a time, verify each
 8. Migrate authentik LAST (identity provider — highest blast radius)
 9. Keep old PG pod scaled to 0 for one week as rollback safety net
 10. Decommission old PG only after stability confirmed
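The ordering rule in steps 5 to 8 (one low-risk pilot first, authentik last) can be expressed as a small helper. This is only a sketch: the choice of `resume` as the pilot follows the plan's example, while the order of the services in between is an assumption (here, simply the order in which the plan lists them).

```python
from typing import List

# The 12 dependent services, in the order the plan lists them.
SERVICES = [
    "authentik", "n8n", "dawarich", "tandoor", "linkwarden", "netbox",
    "woodpecker", "rybbit", "affine", "health", "resume", "trading-bot",
]


def migration_order(services: List[str], first: List[str], last: List[str]) -> List[str]:
    """Return services with `first` moved to the front and `last` to the end."""
    head = [s for s in first if s in services]
    tail = [s for s in last if s in services]
    middle = [s for s in services if s not in head and s not in tail]
    return head + middle + tail


order = migration_order(SERVICES, first=["resume"], last=["authentik"])
for step, service in enumerate(order, start=1):
    print(f"{step:2d}. {service}")
```

Keeping the pilot and the highest-blast-radius service as explicit arguments makes it easy to swap `resume` for `health` without touching the rest of the list.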
 ### Phase 2: Redis Migration
-1. RDB snapshot backup
+1. RDB snapshot backup to NFS
-2. Deploy Redis Sentinel cluster
+2. Deploy single redis:7-alpine on local PVC (same namespace, new pod)
-3. Restore data
+3. Restore RDB snapshot
-4. Update service connection strings
+4. Update redis Service to point to new pod
 5. Verify all 11 dependent services
 6. Add hourly RDB backup CronJob to NFS
 7. Decommission old redis-stack pod
-### Phase 3: Critical SQLite Apps → PostgreSQL
+### Phase 3: MySQL Migration
-Migrate one at a time, verify after each:
+1. mysqldump backup
-3a. Vaultwarden (password vault — most critical)
+2. Deploy single MySQL on local PVC
-3b. Headscale (VPN coordination)
+3. Restore dump
 4. Verify all 8 dependent services
 5. Research per-service PostgreSQL migration feasibility (future work)
-### Phase 4: MySQL Migration
+### Phase 4: Immich PostgreSQL
-Either deploy MySQL Operator or migrate services to PostgreSQL.
+1. pg_dump backup
-One service at a time.
+2. Move Immich PG to local PVC (same image, same config)
 3. Verify Immich functionality (upload, search, face recognition)
-### Phase 5: Immich PostgreSQL
+### Phase 5: SQLite Apps → PostgreSQL
-Migrate Immich's dedicated PostgreSQL to CloudNativePG.
+Migrate one at a time, safest first:
 5a. FreshRSS (lowest risk — fresh install with OPML import)
 5b. Open WebUI (low risk — start fresh, chat history disposable)
 5c. Forgejo (moderate risk — use forgejo dump, verify git operations)
 5d. Vaultwarden (HIGH risk — pgloader, test extensively on copy first)
-### Phase 6: Remaining SQLite Apps → PostgreSQL
+### Phase 6: ClickHouse + Monitoring
-One at a time: Forgejo, Open WebUI, Meshcentral, FreshRSS
+6a. ClickHouse → local PVC
 6b. Prometheus → local PVC
 6c. Loki → local PVC
 6d. Alertmanager → local PVC
-### Phase 7: Monitoring Stack
+### Phase 7: Cleanup + Optional Enhancements
-Move Prometheus, Loki, Alertmanager to local PVCs.
+- Remove old NFS directories from nfs_directories.txt
 - Update nfs_exports.sh
 - Optional: Add Litestream for SQLite-only apps
 - Optional: Add MinIO for CNPG WAL archiving (PITR capability)
 - Optional: Evaluate MySQL→PostgreSQL consolidation
-### Phase 8: ClickHouse + Remaining
+## Rollback Plan (per component)
-ClickHouse replication or accept risk.
+
-Litestream for SQLite-only apps (optional).
+**PostgreSQL**: Old pod kept scaled to 0 with NFS data intact. Rollback =
 scale old pod back up, revert ExternalName Service. Pre-migration
 pg_dumpall available if NFS data is stale.
 **Redis**: Old redis-stack pod kept scaled to 0. Rollback = scale up,
 revert Service. Pre-migration RDB snapshot on NFS.
 **MySQL**: Same pattern — old pod scaled to 0, mysqldump on NFS.
 **SQLite apps**: Original SQLite databases remain on NFS untouched.
 Rollback = remove DATABASE_URL env var, restart pod.
 ## Resource Budget
 | Component | RAM | Local Disk |
 |-----------|-----|-----------|
-| CloudNativePG (3 instances) | ~3GB | ~50GB/node |
+| CloudNativePG (2 instances) | ~2GB | ~50GB/node (2 nodes) |
-| Redis Sentinel (3 instances) | ~1.5GB | ~2GB/node |
+| Redis 7 (single instance) | ~550MB | ~1GB |
-| MySQL Operator (if kept) | ~2GB | ~20GB/node |
+| MySQL (single instance) | ~1GB | ~20GB |
-| Litestream (6 apps) | ~300MB | None |
+| Immich PG (single instance) | ~500MB | ~10GB |
-| **Total new** | **~7GB** | **~72GB/node** |
+| CNPG Operator | ~200MB | None |
 | **Total new overhead** | **~4.25GB** | **~81GB across 2 nodes** |
-Current cluster has 88GB RAM total. TrueNAS VM (16GB) could be
+**RAM WARNING**: Proxmox host has 142GB physical RAM with ~156GB
-downsized since it no longer serves database workloads, partially
+allocated to running VMs (already ~10% overcommitted). This plan adds
-offsetting the new overhead.
+~4.25GB but also frees ~1.5GB by dropping redis-stack modules and
 decommissioning the old DB pods (postgresql, mysql, redis-stack on NFS).
 Net increase: ~2.75GB. Monitor swap usage closely.
 Consider stopping unused VMs (PBS is already stopped, Windows10 uses
 8GB and may not need to run continuously).
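The figures in the table and the RAM warning can be cross-checked with a few lines of arithmetic. The sketch below uses only numbers stated in this document and treats 1GB as 1000MB, which is an approximation in keeping with the "~" estimates.

```python
# RAM estimates from the Resource Budget table, in MB (1GB taken as 1000MB).
NEW_RAM_MB = {
    "CloudNativePG (2 instances)": 2000,
    "Redis 7 (single instance)": 550,
    "MySQL (single instance)": 1000,
    "Immich PG (single instance)": 500,
    "CNPG Operator": 200,
}
FREED_MB = 1500          # redis-stack modules + old DB pods, per the RAM warning
HOST_PHYSICAL_GB = 142   # Proxmox host physical RAM
VM_ALLOCATED_GB = 156    # RAM allocated to running VMs

total_new_mb = sum(NEW_RAM_MB.values())
net_increase_mb = total_new_mb - FREED_MB
overcommit_pct = (VM_ALLOCATED_GB / HOST_PHYSICAL_GB - 1) * 100

print(f"Total new overhead: {total_new_mb / 1000:.2f}GB")
print(f"Net increase:       {net_increase_mb / 1000:.2f}GB")
print(f"Current overcommit: {overcommit_pct:.1f}%")
```

The overcommit percentage is computed at the VM allocation level only. The new pods run inside existing VMs, so they raise memory pressure within those VMs rather than changing the allocation figure itself.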
 ## Monitoring Additions
 After migration, add alerts for:
 - CNPG replication lag
 - CNPG instance count (< 2 = degraded)
 - Local disk space on `/opt/local-path-provisioner` per node
 - Redis RDB save failures
 - Backup CronJob failures (pg_dumpall, mysqldump, RDB copy)
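For the backup CronJob alert in the list above, one simple signal is the age of the newest file in each backup directory. The following is a sketch under assumptions: the function names are hypothetical, and the staleness threshold (for example, twice the CronJob interval) is a choice the plan does not specify.

```python
import os
import tempfile
import time
from pathlib import Path
from typing import Optional


def newest_backup_age(backup_dir: Path, now: Optional[float] = None) -> Optional[float]:
    """Age in seconds of the newest regular file in backup_dir, or None if empty."""
    now = time.time() if now is None else now
    mtimes = [p.stat().st_mtime for p in backup_dir.iterdir() if p.is_file()]
    return (now - max(mtimes)) if mtimes else None


def is_stale(backup_dir: Path, max_age_seconds: float) -> bool:
    """True if there is no backup at all or the newest one is too old."""
    age = newest_backup_age(backup_dir)
    return age is None or age > max_age_seconds


# Demo: a temporary directory stands in for e.g. /mnt/main/postgresql-backup/.
with tempfile.TemporaryDirectory() as workdir:
    backups = Path(workdir)
    dump = backups / "dump.sql"
    dump.write_text("-- placeholder dump")
    print("fresh backup stale?", is_stale(backups, max_age_seconds=2 * 24 * 3600))

    # Pretend the dump was written three days ago.
    three_days_ago = time.time() - 3 * 24 * 3600
    os.utime(dump, (three_days_ago, three_days_ago))
    print("old backup stale?  ", is_stale(backups, max_age_seconds=2 * 24 * 3600))
```

Treating an empty directory as stale is deliberate: a CronJob that has never succeeded should alert just like one that stopped succeeding.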
 ## Success Criteria
- [ ] No database runs on NFS
+- [ ] PostgreSQL, MySQL, Redis, Immich PG, ClickHouse all on local disk
 - [ ] TrueNAS VM restart causes zero data corruption
 - [ ] TrueNAS VM restart only affects media/config services (temporary unavailability)
 - [ ] All backups still consolidate to TrueNAS for rsync to backup NAS
 - [ ] Each migrated service verified working before proceeding to next
 - [ ] Rollback tested for PostgreSQL before decommissioning old pod