[ci skip] add storage reliability design: DB replication + SQLite consolidation
parent
71d4801cca
commit
415d8704d4
1 changed file with 224 additions and 0 deletions
224
docs/plans/2026-02-28-storage-reliability-design.md
Normal file
@@ -0,0 +1,224 @@
# Storage Reliability: Database Replication + SQLite Consolidation

**Date**: 2026-02-28
**Status**: Approved
**Goal**: Eliminate data corruption risk from NFS outages by moving databases off NFS

## Problem

All 70+ services store data on a single TrueNAS VM (10.0.10.15) via NFS. When this VM crashes or hangs:

- **22 services** risk **data corruption** (databases with WAL/fsync requirements on NFS)
- **12 services** experience downtime but no corruption (media, configs)
- The shared PostgreSQL alone backs 12 services — a single NFS hiccup can corrupt data for all of them

SQLite-over-NFS is fundamentally broken (advisory locking unreliable, WAL mode unsafe).

## Constraints

- Zero cost — all self-hosted, OSS
- Must preserve backup workflow (consolidate to TrueNAS → rsync to backup NAS)
- Stop-and-verify after each service migration
- Zero tolerance for data loss

## Design

### Strategy Overview

```
BEFORE: All services  → NFS (TrueNAS VM) → single point of failure

AFTER:  Databases     → local disk + native replication (HA, proper fsync)
        SQLite apps   → migrated to shared PostgreSQL where possible
        Media/configs → NFS (TrueNAS, non-critical path)
        Backups       → all consolidate to NFS → rsync to backup NAS
```

### Component 1: PostgreSQL HA via CloudNativePG

**Current**: Single PostgreSQL 16 pod on NFS (`/mnt/main/postgresql/data`)
**Target**: CloudNativePG operator with 3-instance cluster on local disk

CloudNativePG is a CNCF K8s operator that manages PostgreSQL clusters with:
- Automatic primary/replica failover
- Streaming replication across nodes
- Continuous WAL archiving (to NFS for backup)
- Local PVCs for data (proper fsync)

Architecture:
```
CloudNativePG Cluster
├── Primary (node A) — local PVC
├── Replica (node B) — local PVC, streaming replication
├── Replica (node C) — local PVC, streaming replication
└── WAL archive → NFS:/mnt/main/postgresql-wal-archive/ (backup only)
```
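
As an illustration only (the name, namespace, image tag, and sizes are placeholders, not decided values), a minimal CloudNativePG `Cluster` manifest for this shape could look like:

```yaml
# Sketch: 3-instance CloudNativePG cluster on local-disk PVCs.
# Name, namespace, image tag, and sizes are placeholders.
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: shared-postgres
  namespace: databases
spec:
  instances: 3                 # one primary + two streaming replicas, automatic failover
  imageName: ghcr.io/cloudnative-pg/postgresql:16
  storage:
    storageClass: local-path   # local disk, not NFS
    size: 50Gi
```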
Dependent services (unchanged connection, new reliable backend):
authentik, n8n, dawarich, tandoor, linkwarden, netbox, woodpecker,
rybbit, affine, health, resume, trading-bot

Resource overhead: ~3GB RAM total, ~50GB local disk per node

### Component 2: MySQL HA (or Migration to PostgreSQL)

**Current**: Single MySQL pod on NFS (`/mnt/main/mysql`)
**Target**: Either MySQL Operator (InnoDB Cluster) or migrate MySQL services to PostgreSQL

Services on MySQL: hackmd, speedtest, onlyoffice, crowdsec,
paperless-ngx, real-estate-crawler, url-shortener, grafana

Many of these support PostgreSQL. Consolidating to one DB engine
reduces operational complexity. Evaluate per-service during implementation;
a Grafana example is sketched below.

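
As one per-service example, Grafana (listed above) can be configured against the shared PostgreSQL purely through environment variables; the host, database, and secret names here are placeholders, and any data in the old MySQL database still needs a one-time migration:

```yaml
# Sketch: Grafana container env pointing at the shared PostgreSQL.
# Host, database, and secret names are placeholders.
env:
  - name: GF_DATABASE_TYPE
    value: postgres
  - name: GF_DATABASE_HOST
    value: shared-postgres-rw.databases.svc:5432
  - name: GF_DATABASE_NAME
    value: grafana
  - name: GF_DATABASE_USER
    value: grafana
  - name: GF_DATABASE_PASSWORD
    valueFrom:
      secretKeyRef:
        name: grafana-db-credentials
        key: password
```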
### Component 3: Redis HA via Sentinel

**Current**: Single redis-stack pod on NFS (`/mnt/main/redis`)
**Target**: Redis Sentinel (3 instances) with data on local disk

Architecture:
```
Redis Sentinel (3 instances)
├── Primary (node A) — local PVC, RDB + AOF persistence
├── Replica (node B) — local PVC
├── Replica (node C) — local PVC
└── Sentinel monitors (3) — automatic failover
```
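
A minimal Sentinel configuration for this topology might be shipped as a ConfigMap along these lines; the master name, address, and timeouts are placeholders to tune during rollout:

```yaml
# Sketch: Sentinel configuration watching the primary, quorum 2 of 3 sentinels.
# Master name, address, and timeouts are placeholders.
apiVersion: v1
kind: ConfigMap
metadata:
  name: redis-sentinel-config
  namespace: databases
data:
  sentinel.conf: |
    sentinel monitor main redis-0.redis.databases.svc 6379 2
    sentinel down-after-milliseconds main 5000
    sentinel failover-timeout main 60000
    sentinel parallel-syncs main 1
```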
Resource overhead: ~1.5GB RAM total, ~2GB local disk per node

### Component 4: Immich PostgreSQL

**Current**: Dedicated PostgreSQL + pgvector on NFS
**Target**: Migrate to CloudNativePG cluster (separate database in same cluster, or dedicated cluster with pgvector extension)

### Component 5: ClickHouse HA (Rybbit)

**Current**: Single ClickHouse on NFS (`/mnt/main/clickhouse`)
**Target**: ClickHouse with native replication via ClickHouse Keeper, or accept risk (analytics data, rebuildable)

### Component 6: SQLite App Consolidation to PostgreSQL

Apps that support PostgreSQL — migrate to shared CloudNativePG cluster (a Vaultwarden example is sketched after the table):

| App | Config mechanism | Priority |
|-----|-----------------|----------|
| Vaultwarden | `DATABASE_URL` env var | P0 (password vault) |
| Headscale | `db_type: postgres` in config | P0 (VPN) |
| Forgejo | `[database]` section in app.ini | P1 |
| Open WebUI | `DATABASE_URL` env var | P2 |
| Meshcentral | config.json `db` section | P2 |
| FreshRSS | `db` config | P2 |

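
For Vaultwarden, for instance, the configuration side of the switch is a single connection string (existing data still needs a one-time export/import); the service host, database name, and secret below are placeholders:

```yaml
# Sketch: Vaultwarden container env using PostgreSQL instead of SQLite.
# Keep the real connection string in a Secret; all names here are placeholders.
env:
  - name: DATABASE_URL
    valueFrom:
      secretKeyRef:
        name: vaultwarden-db-credentials
        key: database-url  # e.g. postgresql://vaultwarden:<password>@shared-postgres-rw.databases.svc:5432/vaultwarden
```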
Apps stuck on SQLite (accept risk or use Litestream for backup; a Litestream sketch follows the table):

| App | Storage engine | Mitigation |
|-----|---------------|------------|
| Uptime Kuma | SQLite only | Litestream or accept |
| Navidrome | SQLite only | Litestream or accept |
| Audiobookshelf | SQLite only | Litestream or accept |
| Calibre-Web | SQLite (Calibre format) | Accept (format constraint) |
| Wealthfolio | SQLite only | Litestream or accept |
| Diun | BoltDB only | Accept (rebuildable state) |

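
Where Litestream is chosen, it runs as a sidecar that continuously replicates the SQLite file to the NFS backup tree. A minimal config, using Uptime Kuma as a stand-in and placeholder paths, could look like:

```yaml
# Sketch: litestream.yml replicating one SQLite database to the mounted NFS backup path.
# Database path and replica path are placeholders (Uptime Kuma shown as an example).
dbs:
  - path: /app/data/kuma.db
    replicas:
      - type: file
        path: /backups/litestream/uptime-kuma
```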
### Component 7: Monitoring Stack

Prometheus, Loki, and Alertmanager use specialized storage (TSDB, BoltDB)
and cannot migrate to PostgreSQL. Options:
- Move to local disk (emptyDir or local PVC)
- Accept NFS risk (metrics/logs are ephemeral; loss is annoying, not catastrophic)
- Prometheus WAL is already on tmpfs (good)

Recommendation: Move to local PVCs. Losing metrics history on node
failure is acceptable for a homelab.

### Component 8: What Stays on NFS (unchanged)

All ~35 LOW risk services: media files, configs, caches, static content.
Immich photos, Jellyfin media, Audiobookshelf audiobooks, Calibre ebooks,
Frigate recordings, downloads, backups, model caches, etc.

NFS failure for these = temporary unavailability, not corruption.

## Backup Strategy

```
CloudNativePG   → continuous WAL archiving → NFS:/mnt/main/postgresql-wal-archive/
MySQL (if kept) → automated mysqldump      → NFS:/mnt/main/mysql-backup/
Redis Sentinel  → periodic RDB snapshots   → NFS:/mnt/main/redis-backup/
Litestream      → continuous SQLite backup → NFS:/mnt/main/litestream/
Media/configs   → already on NFS

NFS (TrueNAS) → rsync → Backup NAS (unchanged)
```
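
As a sketch of one of these flows (not a finalized job), the automated mysqldump could be a CronJob that writes straight onto the existing NFS backup share; schedule, host, image, credentials, and paths are placeholders:

```yaml
# Sketch: nightly mysqldump of all databases onto the NFS backup share.
# Schedule, host, credentials, and paths are placeholders.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: mysql-backup
  namespace: databases
spec:
  schedule: "0 3 * * *"   # nightly at 03:00
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: mysqldump
              image: mysql:8
              command:
                - sh
                - -c
                - mysqldump --all-databases -h mysql.databases.svc -u root > /backup/mysql-$(date +%F).sql
              env:
                - name: MYSQL_PWD
                  valueFrom:
                    secretKeyRef:
                      name: mysql-credentials
                      key: root-password
              volumeMounts:
                - name: backup
                  mountPath: /backup
          volumes:
            - name: backup
              nfs:
                server: 10.0.10.15
                path: /mnt/main/mysql-backup
```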
All backups still consolidate to TrueNAS. The rsync-to-backup-NAS
workflow is completely unchanged.

## Migration Order (Safety-First)

Each phase: backup → migrate → verify → user confirms → next phase.

### Phase 0: Infrastructure Prerequisites
- Install CloudNativePG operator
- Add local virtual disks to K8s nodes (via Proxmox)
- Set up local-path StorageClass (sketch below)
- Install Redis Sentinel

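
The StorageClass itself is tiny. Assuming the Rancher local-path-provisioner (one common zero-cost choice; the plan does not mandate a specific provisioner), it would look roughly like:

```yaml
# Sketch: StorageClass backed by the Rancher local-path-provisioner.
# Assumes the provisioner is already installed in the cluster.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: local-path
provisioner: rancher.io/local-path
volumeBindingMode: WaitForFirstConsumer   # bind once the consuming pod lands on a node
reclaimPolicy: Delete
```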
### Phase 1: PostgreSQL Migration (highest impact)
1. Full pg_dumpall backup to NFS (see the Job sketch after this list)
2. Deploy CloudNativePG cluster (empty)
3. Restore backup into CloudNativePG
4. Verify all 12 dependent services work
5. Decommission old PostgreSQL pod + NFS volume

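
Step 1 could run as a one-shot in-cluster Job along these lines; the service name, credentials Secret, and NFS path are placeholders:

```yaml
# Sketch: one-shot Job dumping the current PostgreSQL onto the NFS share before migration.
# Service name, credentials, and paths are placeholders.
apiVersion: batch/v1
kind: Job
metadata:
  name: pg-dumpall-pre-migration
  namespace: databases
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: dump
          image: postgres:16
          command:
            - sh
            - -c
            - pg_dumpall -h postgres.databases.svc -U postgres > /backup/pg-dumpall-$(date +%F).sql
          env:
            - name: PGPASSWORD
              valueFrom:
                secretKeyRef:
                  name: postgres-credentials
                  key: password
          volumeMounts:
            - name: backup
              mountPath: /backup
      volumes:
        - name: backup
          nfs:
            server: 10.0.10.15
            path: /mnt/main/postgresql-backup
```

Restoring into the new cluster (step 3) is the mirror image: feed the dump to `psql` against the CloudNativePG read-write service, and only point services at it after verification.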
### Phase 2: Redis Migration
1. RDB snapshot backup
2. Deploy Redis Sentinel cluster
3. Restore data
4. Update service connection strings
5. Verify all 11 dependent services

### Phase 3: Critical SQLite Apps → PostgreSQL
Migrate one at a time, verify after each:
3a. Vaultwarden (password vault — most critical)
3b. Headscale (VPN coordination)

### Phase 4: MySQL Migration
Either deploy MySQL Operator or migrate services to PostgreSQL.
One service at a time.

### Phase 5: Immich PostgreSQL
Migrate Immich's dedicated PostgreSQL to CloudNativePG.

### Phase 6: Remaining SQLite Apps → PostgreSQL
One at a time: Forgejo, Open WebUI, Meshcentral, FreshRSS.

### Phase 7: Monitoring Stack
Move Prometheus, Loki, Alertmanager to local PVCs.

### Phase 8: ClickHouse + Remaining
ClickHouse replication or accept risk.
Litestream for SQLite-only apps (optional).

## Resource Budget

| Component | RAM | Local Disk |
|-----------|-----|-----------|
| CloudNativePG (3 instances) | ~3GB | ~50GB/node |
| Redis Sentinel (3 instances) | ~1.5GB | ~2GB/node |
| MySQL Operator (if kept) | ~2GB | ~20GB/node |
| Litestream (6 apps) | ~300MB | None |
| **Total new** | **~7GB** | **~72GB/node** |

Current cluster has 88GB RAM total. TrueNAS VM (16GB) could be
downsized since it no longer serves database workloads, partially
offsetting the new overhead.

## Success Criteria

- [ ] No database runs on NFS
- [ ] TrueNAS VM restart causes zero data corruption
- [ ] TrueNAS VM restart only affects media/config services (temporary unavailability)
- [ ] All backups still consolidate to TrueNAS for rsync to backup NAS
- [ ] Each migrated service verified working before proceeding to the next