diff --git a/docs/plans/2026-02-28-storage-reliability-design.md b/docs/plans/2026-02-28-storage-reliability-design.md index 1aa6dc79..bcbefef3 100644 --- a/docs/plans/2026-02-28-storage-reliability-design.md +++ b/docs/plans/2026-02-28-storage-reliability-design.md @@ -1,7 +1,7 @@ # Storage Reliability: Database Replication + SQLite Consolidation **Date**: 2026-02-28 -**Status**: Approved +**Status**: Revised (v2) — incorporates research agent findings **Goal**: Eliminate data corruption risk from NFS outages by moving databases off NFS ## Problem @@ -21,6 +21,21 @@ SQLite-over-NFS is fundamentally broken (advisory locking unreliable, WAL mode u - Stop-and-verify after each service migration - No data loss tolerance +## Single-Host Limitation (Explicit Acknowledgment) + +All K8s nodes are VMs on a single Proxmox host (192.168.1.127). This means: + +**Replication PROTECTS against**: individual VM crash/restart, NFS outage, +individual node rebuild, pod OOM/eviction, software-level failures. + +**Replication does NOT protect against**: Proxmox host failure, physical +disk failure, power loss — all replicas die simultaneously. + +Given this, the plan uses **minimal replication** (1 primary + 1 replica +for PostgreSQL, single instance for Redis) rather than full 3-instance +clusters. The primary reliability gain comes from moving off NFS to local +disk with proper fsync semantics, not from replication count. + ## Design ### Strategy Overview @@ -28,106 +43,162 @@ SQLite-over-NFS is fundamentally broken (advisory locking unreliable, WAL mode u ``` BEFORE: All services → NFS (TrueNAS VM) → single point of failure -AFTER: Databases → local disk + native replication (HA, proper fsync) - SQLite apps → migrated to shared PostgreSQL where possible +AFTER: Databases → local disk (proper fsync, no NFS SPOF) + SQLite apps → migrated to shared PostgreSQL where supported Media/configs → NFS (TrueNAS, non-critical path) Backups → all consolidate to NFS → rsync to backup NAS ``` -### Component 1: PostgreSQL HA via CloudNativePG +### Component 1: PostgreSQL via CloudNativePG **Current**: Single PostgreSQL 16 pod on NFS (`/mnt/main/postgresql/data`) -**Target**: CloudNativePG operator with 3-instance cluster on local disk +using custom image `viktorbarzin/postgres:16-master` (postgis + pgvector + pgvecto-rs). -CloudNativePG is a CNCF K8s operator that manages PostgreSQL clusters with: +**Target**: CloudNativePG operator with 2-instance cluster on local disk. + +CloudNativePG (CNCF project, v1.28+, supports K8s 1.34 and PG 14-18): - Automatic primary/replica failover -- Streaming replication across nodes -- Continuous WAL archiving (to NFS for backup) -- Local PVCs for data (proper fsync) +- Streaming replication +- Declarative CRD-based management (Terraform/Terragrunt compatible) +- Built-in monolith import mode (better than manual pg_dumpall) +- Built-in PgBouncer pooler CRD Architecture: ``` -CloudNativePG Cluster -├── Primary (node A) — local PVC -├── Replica (node B) — local PVC, streaming replication -├── Replica (node C) — local PVC, streaming replication -└── WAL archive → NFS:/mnt/main/postgresql-wal-archive/ (backup only) +CloudNativePG Cluster (namespace: dbaas) +├── Primary (worker node A) — local PVC via local-path-provisioner +├── Replica (worker node B) — local PVC, streaming replication +└── Services: -rw (read-write), -ro (read-only) ``` -Dependent services (unchanged connection, new reliable backend): -authentik, n8n, dawarich, tandoor, linkwarden, netbox, woodpecker, -rybbit, affine, health, resume, trading-bot +**Migration approach**: Use CNPG's native monolith import mode, which +connects to the running old PostgreSQL and imports databases + roles +using pg_dump -Fd per database. Superior to manual pg_dumpall. -Resource overhead: ~3GB RAM total, ~50GB local disk per node +**Service endpoint strategy**: Create an ExternalName Service called +`postgresql` in namespace `dbaas` pointing to the CNPG `-rw` service. +This preserves `var.postgresql_host` = `postgresql.dbaas.svc.cluster.local` +with zero changes to dependent services. -### Component 2: MySQL HA (or Migration to PostgreSQL) +**Special cases**: +- Authentik: Replace manual PgBouncer deployment with CNPG's built-in + Pooler CRD, or update PgBouncer to point to CNPG's `-rw` service +- Init containers (woodpecker, trading-bot): Enable `enableSuperuserAccess: true` + in CNPG Cluster spec — CNPG strips SUPERUSER from imported roles by default +- Custom image: Test `viktorbarzin/postgres:16-master` with CNPG first. + Move `shared_preload_libraries=vectors.so` to CNPG `postgresql.parameters` + (CNPG overrides container CMD). Tag format may need adjusting. + +**Backup**: Keep existing pg_dumpall CronJob, pointed at new CNPG endpoint. +CNPG's native WAL archiving requires S3-compatible backend (not NFS) — +adding MinIO is a future enhancement, not a blocker. + +Dependent services (12): authentik, n8n, dawarich, tandoor, linkwarden, +netbox, woodpecker, rybbit, affine, health, resume, trading-bot + +Resource overhead: ~2GB RAM total (2 instances), ~50GB local disk per node + +### Component 2: Redis — Single Instance on Local Disk + +**Current**: Single redis-stack pod on NFS (`/mnt/main/redis`). +RDB background save takes 39 seconds on NFS (should be <1s on local disk). + +**Finding**: redis-stack modules (RedisJSON, RediSearch, RedisTimeSeries, +RedisBloom, RedisGears) are completely unused. Zero module commands in +`INFO commandstats`. All 11 services use plain Redis commands only +(GET, SET, BullMQ queues, Celery broker, caching). + +**Finding**: No service stores critical primary data in Redis. All use it +for job queues and caching. Losing Redis data means: users re-login, +jobs retry, caches rebuild. Inconvenient but never catastrophic. + +**Finding**: None of the 11 services support Sentinel-aware connections. +Redis Sentinel would require a proxy layer with no reliability gain on +a single physical host. + +**Target**: Single `redis:7-alpine` (or `valkey:9`) on local PVC. +Drop redis-stack — modules are unused overhead (~100MB RAM saved). + +Architecture: +``` +Redis 7 (single instance) +├── Local PVC via local-path-provisioner (fast RDB saves) +├── K8s Service: redis.redis.svc.cluster.local (unchanged) +└── Hourly CronJob: cp dump.rdb → NFS:/mnt/main/redis-backup/ +``` + +No client changes needed. Same service endpoint. Same Redis commands. + +Resource overhead: ~650MB RAM (same as today minus module overhead), +~1GB local disk + +### Component 3: MySQL — Single Instance on Local Disk **Current**: Single MySQL pod on NFS (`/mnt/main/mysql`) -**Target**: Either MySQL Operator (InnoDB Cluster) or migrate MySQL services to PostgreSQL +**Target**: Single MySQL on local PVC -Services on MySQL: hackmd, speedtest, onlyoffice, crowdsec, +Services on MySQL (8): hackmd, speedtest, onlyoffice, crowdsec, paperless-ngx, real-estate-crawler, url-shortener, grafana -Many of these support PostgreSQL. Consolidating to one DB engine -reduces operational complexity. Evaluate per-service during implementation. +Evaluate per-service whether migration to PostgreSQL is feasible +(reduces operational complexity to one DB engine). Do during +implementation research phase. -### Component 3: Redis HA via Sentinel - -**Current**: Single redis-stack pod on NFS (`/mnt/main/redis`) -**Target**: Redis Sentinel (3 instances) with data on local disk - -Architecture: -``` -Redis Sentinel (3 instances) -├── Primary (node A) — local PVC, RDB + AOF persistence -├── Replica (node B) — local PVC -├── Replica (node C) — local PVC -└── Sentinel monitors (3) — automatic failover -``` - -Resource overhead: ~1.5GB RAM total, ~2GB local disk per node +**Backup**: Keep existing mysqldump CronJob. ### Component 4: Immich PostgreSQL **Current**: Dedicated PostgreSQL + pgvector on NFS -**Target**: Migrate to CloudNativePG cluster (separate database in same cluster, or dedicated cluster with pgvector extension) +(`ghcr.io/immich-app/postgres:15-vectorchord0.3.0-pgvectors0.2.0`) -### Component 5: ClickHouse HA (Rybbit) +**Target**: Move to local PVC (same image, same single instance). +Immich's PG has specialized extensions (VectorChord, pgvectors) that +may not be compatible with CNPG operand images. Simpler to keep as +standalone PG on local disk. + +### Component 5: ClickHouse (Rybbit) **Current**: Single ClickHouse on NFS (`/mnt/main/clickhouse`) -**Target**: ClickHouse with native replication via ClickHouse Keeper, or accept risk (analytics data, rebuildable) +**Target**: Move to local PVC (single instance). Analytics data is +rebuildable. ClickHouse replication is not justified for a homelab. ### Component 6: SQLite App Consolidation to PostgreSQL -Apps that support PostgreSQL — migrate to shared CloudNativePG cluster: +**REVISED based on per-app research:** -| App | Config mechanism | Priority | -|-----|-----------------|----------| -| Vaultwarden | `DATABASE_URL` env var | P0 (password vault) | -| Headscale | `db_type: postgres` in config | P0 (VPN) | -| Forgejo | `[database]` section in app.ini | P1 | -| Open WebUI | `DATABASE_URL` env var | P2 | -| Meshcentral | config.json `db` section | P2 | -| FreshRSS | `db` config | P2 | +Apps confirmed safe to migrate: -Apps stuck on SQLite (accept risk or use Litestream for backup): +| App | Config mechanism | Migration tool | Risk | Notes | +|-----|-----------------|---------------|------|-------| +| Forgejo | `[database]` in app.ini | `forgejo dump --database postgres` | Moderate | Git repos stay on NFS | +| FreshRSS | `DB_HOST` env vars | OPML export/import (fresh install) | Low | PG is the recommended backend | +| Open WebUI | `DATABASE_URL` env var | None (start fresh) | Low | Chat history is disposable | +| Vaultwarden | `DATABASE_URL` env var | pgloader (unsupported by maintainers) | **HIGH** | Test extensively; attachments stay on NFS | + +**Apps REMOVED from migration plan:** + +| App | Reason | +|-----|--------| +| **Headscale** | Project EXPLICITLY DISCOURAGES PostgreSQL: "highly discouraged, only supported for legacy reasons. All new development and testing are SQLite." Migrating risks VPN stability. | +| **MeshCentral** | Uses NeDB (document store), not SQLite. NeDB→PG migration path is poorly documented and risky. | + +Apps confirmed SQLite/BoltDB-only (stay on NFS): | App | Storage engine | Mitigation | |-----|---------------|------------| -| Uptime Kuma | SQLite only | Litestream or accept | -| Navidrome | SQLite only | Litestream or accept | -| Audiobookshelf | SQLite only | Litestream or accept | +| Headscale | SQLite (recommended by project) | Accept (project-recommended config) | +| Uptime Kuma | SQLite (v2 adds MariaDB, not PG) | Accept or Litestream | +| Navidrome | SQLite only | Accept or Litestream | +| Audiobookshelf | SQLite only | Accept or Litestream | | Calibre-Web | SQLite (Calibre format) | Accept (format constraint) | -| Wealthfolio | SQLite only | Litestream or accept | -| Diun | BoltDB only | Accept (rebuildable state) | +| Wealthfolio | SQLite only | Accept or Litestream | +| MeshCentral | NeDB (document store) | Accept | +| Diun | bbolt (BoltDB fork) | Accept (rebuildable state) | ### Component 7: Monitoring Stack Prometheus, Loki, Alertmanager use specialized storage (TSDB, BoltDB). -Cannot migrate to PostgreSQL. Options: -- Move to local disk (emptyDir or local PVC) -- Accept NFS risk (metrics/logs are ephemeral, loss is annoying not catastrophic) -- Prometheus WAL is already on tmpfs (good) +Cannot migrate to PostgreSQL. Prometheus WAL is already on tmpfs (good). Recommendation: Move to local PVCs. Losing metrics history on node failure is acceptable for a homelab. @@ -143,82 +214,142 @@ NFS failure for these = temporary unavailability, not corruption. ## Backup Strategy ``` -CloudNativePG → continuous WAL archiving → NFS:/mnt/main/postgresql-wal-archive/ -MySQL (if kept) → automated mysqldump → NFS:/mnt/main/mysql-backup/ -Redis Sentinel → periodic RDB snapshots → NFS:/mnt/main/redis-backup/ -Litestream → continuous SQLite backup → NFS:/mnt/main/litestream/ -Media/configs → already on NFS +CNPG PostgreSQL → pg_dumpall CronJob (daily) → NFS:/mnt/main/postgresql-backup/ +MySQL → mysqldump CronJob (daily) → NFS:/mnt/main/mysql-backup/ +Redis → RDB copy CronJob (hourly) → NFS:/mnt/main/redis-backup/ +Immich PG → pg_dump CronJob (daily) → NFS:/mnt/main/immich-pg-backup/ +Litestream → continuous SQLite backup → NFS:/mnt/main/litestream/ (optional) +Media/configs → already on NFS NFS (TrueNAS) → rsync → Backup NAS (unchanged) ``` -All backups still consolidate to TrueNAS. The rsync-to-backup-NAS -workflow is completely unchanged. +All backups still consolidate to TrueNAS. Rsync-to-backup-NAS workflow +is completely unchanged. + +**Note**: CNPG's native WAL archiving requires S3-compatible storage +(not NFS). Adding MinIO for PITR capability is a future enhancement. +The pg_dumpall CronJob provides adequate backup for a homelab. ## Migration Order (Safety-First) -Each phase: backup → migrate → verify → user confirms → next phase. +Each phase: research → backup → migrate → verify → user confirms → next. + +Before each service migration, a research subagent will: +1. Confirm current setup and configuration +2. Research online best practices and documentation +3. Scrutinize the migration plan for that specific service +4. Present findings for review before execution ### Phase 0: Infrastructure Prerequisites -- Install CloudNativePG operator -- Add local virtual disks to K8s nodes (via Proxmox) -- Set up local-path StorageClass -- Install Redis Sentinel +- Verify RAM headroom (current overcommit must be addressed first) +- Add dedicated local virtual disks to K8s worker nodes (via Proxmox) +- Verify local-path-provisioner is configured for new disks +- Install CloudNativePG operator (Helm) +- Test CNPG with custom PostgreSQL image (throwaway cluster) -### Phase 1: PostgreSQL Migration (highest impact) -1. Full pg_dumpall backup to NFS -2. Deploy CloudNativePG cluster (empty) -3. Restore backup into CloudNativePG -4. Verify all 12 dependent services work -5. Decommission old PostgreSQL pod + NFS volume +### Phase 1: PostgreSQL Migration (highest impact, most preparation) +1. Deploy throwaway CNPG cluster to test image compatibility and import +2. Full pg_dumpall backup to NFS +3. Deploy production CNPG cluster with monolith import from running PG +4. Create ExternalName Service for backwards compatibility +5. Migrate ONE low-risk service first (e.g., `resume` or `health`) +6. Verify for 24-48 hours +7. Migrate remaining services one at a time, verify each +8. Migrate authentik LAST (identity provider — highest blast radius) +9. Keep old PG pod scaled to 0 for one week as rollback safety net +10. Decommission old PG only after stability confirmed ### Phase 2: Redis Migration -1. RDB snapshot backup -2. Deploy Redis Sentinel cluster -3. Restore data -4. Update service connection strings +1. RDB snapshot backup to NFS +2. Deploy single redis:7-alpine on local PVC (same namespace, new pod) +3. Restore RDB snapshot +4. Update redis Service to point to new pod 5. Verify all 11 dependent services +6. Add hourly RDB backup CronJob to NFS +7. Decommission old redis-stack pod -### Phase 3: Critical SQLite Apps → PostgreSQL -Migrate one at a time, verify after each: -3a. Vaultwarden (password vault — most critical) -3b. Headscale (VPN coordination) +### Phase 3: MySQL Migration +1. mysqldump backup +2. Deploy single MySQL on local PVC +3. Restore dump +4. Verify all 8 dependent services +5. Research per-service PostgreSQL migration feasibility (future work) -### Phase 4: MySQL Migration -Either deploy MySQL Operator or migrate services to PostgreSQL. -One service at a time. +### Phase 4: Immich PostgreSQL +1. pg_dump backup +2. Move Immich PG to local PVC (same image, same config) +3. Verify Immich functionality (upload, search, face recognition) -### Phase 5: Immich PostgreSQL -Migrate Immich's dedicated PostgreSQL to CloudNativePG. +### Phase 5: SQLite Apps → PostgreSQL +Migrate one at a time, safest first: +5a. FreshRSS (lowest risk — fresh install with OPML import) +5b. Open WebUI (low risk — start fresh, chat history disposable) +5c. Forgejo (moderate risk — use forgejo dump, verify git operations) +5d. Vaultwarden (HIGH risk — pgloader, test extensively on copy first) -### Phase 6: Remaining SQLite Apps → PostgreSQL -One at a time: Forgejo, Open WebUI, Meshcentral, FreshRSS +### Phase 6: ClickHouse + Monitoring +6a. ClickHouse → local PVC +6b. Prometheus → local PVC +6c. Loki → local PVC +6d. Alertmanager → local PVC -### Phase 7: Monitoring Stack -Move Prometheus, Loki, Alertmanager to local PVCs. +### Phase 7: Cleanup + Optional Enhancements +- Remove old NFS directories from nfs_directories.txt +- Update nfs_exports.sh +- Optional: Add Litestream for SQLite-only apps +- Optional: Add MinIO for CNPG WAL archiving (PITR capability) +- Optional: Evaluate MySQL→PostgreSQL consolidation -### Phase 8: ClickHouse + Remaining -ClickHouse replication or accept risk. -Litestream for SQLite-only apps (optional). +## Rollback Plan (per component) + +**PostgreSQL**: Old pod kept scaled to 0 with NFS data intact. Rollback = +scale old pod back up, revert ExternalName Service. Pre-migration +pg_dumpall available if NFS data is stale. + +**Redis**: Old redis-stack pod kept scaled to 0. Rollback = scale up, +revert Service. Pre-migration RDB snapshot on NFS. + +**MySQL**: Same pattern — old pod scaled to 0, mysqldump on NFS. + +**SQLite apps**: Original SQLite databases remain on NFS untouched. +Rollback = remove DATABASE_URL env var, restart pod. ## Resource Budget | Component | RAM | Local Disk | |-----------|-----|-----------| -| CloudNativePG (3 instances) | ~3GB | ~50GB/node | -| Redis Sentinel (3 instances) | ~1.5GB | ~2GB/node | -| MySQL Operator (if kept) | ~2GB | ~20GB/node | -| Litestream (6 apps) | ~300MB | None | -| **Total new** | **~7GB** | **~72GB/node** | +| CloudNativePG (2 instances) | ~2GB | ~50GB/node (2 nodes) | +| Redis 7 (single instance) | ~550MB | ~1GB | +| MySQL (single instance) | ~1GB | ~20GB | +| Immich PG (single instance) | ~500MB | ~10GB | +| CNPG Operator | ~200MB | None | +| **Total new overhead** | **~4.25GB** | **~81GB across 2 nodes** | -Current cluster has 88GB RAM total. TrueNAS VM (16GB) could be -downsized since it no longer serves database workloads, partially -offsetting the new overhead. +**RAM WARNING**: Proxmox host has 142GB physical RAM with ~156GB +allocated to running VMs (already ~10% overcommitted). This plan adds +~4.25GB but also frees ~1.5GB by dropping redis-stack modules and +removing old DB pods. Net increase: ~2.75GB. The old DB pods +(postgresql, mysql, redis-stack on NFS) will be decommissioned, +partially offsetting the new resource usage. Monitor swap usage closely. + +Consider stopping unused VMs (PBS is already stopped, Windows10 uses +8GB and may not need to run continuously). + +## Monitoring Additions + +After migration, add alerts for: +- CNPG replication lag +- CNPG instance count (< 2 = degraded) +- Local disk space on `/opt/local-path-provisioner` per node +- Redis RDB save failures +- Backup CronJob failures (pg_dumpall, mysqldump, RDB copy) ## Success Criteria -- [ ] No database runs on NFS +- [ ] PostgreSQL, MySQL, Redis, Immich PG, ClickHouse all on local disk - [ ] TrueNAS VM restart causes zero data corruption - [ ] TrueNAS VM restart only affects media/config services (temporary unavailability) - [ ] All backups still consolidate to TrueNAS for rsync to backup NAS - [ ] Each migrated service verified working before proceeding to next +- [ ] Rollback tested for PostgreSQL before decommissioning old pod