infra

Viktor Barzin b6cd83f85a [redis] Phase 3-7: cutover to redis-v2, Nextcloud HAProxy-only

Phase 3 - replication chain (old → v2):
- Discovered the v2 cluster was running redis:7.4-alpine, but the Bitnami old master ships redis 8.6.2, which writes RDB format 13; the 7.4 replicas rejected the stream with "Can't handle RDB format version 13". Bumped v2 image to redis:8-alpine (also 8.6.2) to restore PSYNC compatibility.
- Discovered that sentinel on BOTH v2 and old Bitnami clusters auto-discovered the cross-cluster replication chain when v2-0 REPLICAOF'd the old master, triggering a failover that reparented old-master to a v2 replica and took HAProxy's backend offline. Mitigation: `SENTINEL REMOVE mymaster` on all 5 sentinels (both clusters) during the REPLICAOF surgery, then re-MONITOR after cutover. This must be done on the OLD sentinels too, not just v2; they're the ones that kept fighting our REPLICAOF.
- Set up the chain: v2-0 REPLICAOF old-master; v2-{1,2} REPLICAOF v2-0. All 76 keys (db0:76, db1:22, db4:16) synced, including `immich_bull:` BullMQ queues and `_kombu.` Celery queues (the user-stated must-survive data class).

Phase 4 - HAProxy cutover:
- Updated `kubernetes_config_map.haproxy` to point at `redis-v2-{0,1,2}.redis-v2-headless` for both redis_master and redis_sentinel backends (removed redis-node-{0,1}).
- Promoted v2-0 (`REPLICAOF NO ONE`) at the same time as the ConfigMap apply so HAProxy's 1s health-check interval found a role:master within a few seconds. Cutover disruption on HAProxy rollout was brief; old clients naturally moved to new HAProxy pods within the rolling update window.
- Re-enabled sentinel monitoring on v2 with `SENTINEL MONITOR mymaster <hostname> 6379 2` after verifying `resolve-hostnames yes` + `announce-hostnames yes` were active. This ensures sentinel stores the hostname (not the resolved IP) in its rewritten config, so pod-IP churn on restart doesn't break failover.

Phase 5 - chaos:
- Round 1: killed master v2-0 mid-probe. First run exposed the sentinel IP-storage issue (stored 10.10.107.222, went stale on restart); ~12s probe disruption. Fixed hostname persistence and re-MONITORed.
- Round 2: killed new master v2-2 with hostnames correctly stored. Sentinel elected v2-0, HAProxy re-routed, 1/40 probe failures over 60s (target: <3s of actual user-visible disruption).

Phase 6 - Nextcloud simplification:
- `zzz-redis.config.php` no longer queries sentinel in-process; it just points at `redis-master.redis.svc.cluster.local`. Removed 20 lines of PHP. HAProxy handles master tracking transparently now that it's scaled to 3 + PDB minAvailable=2.

Phase 7 step 1:
- `kubectl scale statefulset/redis-node --replicas=0` (transient; TF removal in a 24h follow-up). Old PVCs `redis-data-redis-node-{0,1}` preserved as cold rollback.

Docs:
- Rewrote `databases.md` Redis section to reflect post-cutover reality and the sentinel hostname gotcha (so future sessions don't relearn it).
- `.claude/reference/service-catalog.md` entry updated.

The parallel-bootstrap race documented in the previous commit is still worth watching: the init container now defaults to pod-0 as master when no peer reports role:master-with-slaves, so fresh boots land in a deterministic topology.

Closes: code-7n4
Closes: code-9y6
Closes: code-cnf
Closes: code-tc4
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-04-19 16:13:43 +00:00
agent-task-tracking.md	Add agent task tracking documentation	2026-04-15 17:11:26 +00:00
authentication.md	fix: cluster healthcheck fixes + Authentik upgrade to 2026.2.2	2026-04-15 06:41:56 +00:00
automated-upgrades.md	[docs] automated-upgrades: document long-lived OAuth + expiry monitoring	2026-04-18 13:00:07 +00:00
backup-dr.md	[infra] Fix rewrite-body plugin + cleanup TrueNAS + version bumps	2026-04-17 05:51:52 +00:00
ci-cd.md	docs: comprehensive audit and update of all architecture docs and runbooks [ci skip]	2026-04-06 13:21:05 +03:00
compute.md	docs: comprehensive audit and update of all architecture docs and runbooks [ci skip]	2026-04-06 13:21:05 +03:00
databases.md	[redis] Phase 3-7: cutover to redis-v2, Nextcloud HAProxy-only	2026-04-19 16:13:43 +00:00
dns.md	[dns] Kea: multi-IP DHCP option 6 (10.0.10, 10.0.20) + TSIG-signed DDNS (WS E)	2026-04-19 16:12:23 +00:00
homepage.md	add homepage auto-discovery documentation [ci skip]	2026-03-25 13:06:43 +02:00
incident-response.md	[claude-agent-service] Migrate all pipelines from DevVM SSH to K8s HTTP	2026-04-18 10:12:02 +00:00
mailserver.md	[docs] Mailserver architecture — richer diagrams + steady-state accuracy [ci skip]	2026-04-19 12:40:53 +00:00
monitoring.md	[docs] Document external-monitor opt-out mechanism in monitoring.md	2026-04-19 15:19:06 +00:00
multi-tenancy.md	add architecture documentation for all infrastructure subsystems [ci skip]	2026-03-24 00:55:25 +02:00
networking.md	[dns] Kea: multi-IP DHCP option 6 (10.0.10, 10.0.20) + TSIG-signed DDNS (WS E)	2026-04-19 16:12:23 +00:00
overview.md	deprecate TrueNAS: migrate Immich NFS to Proxmox, remove all 10.0.10.15 references [ci skip]	2026-04-13 14:42:07 +00:00
secrets.md	docs: comprehensive audit and update of all architecture docs and runbooks [ci skip]	2026-04-06 13:21:05 +03:00
security.md	[docs] Update anti-AI and rybbit docs after rewrite-body removal	2026-04-17 21:43:13 +00:00
storage.md	docs(storage): add encrypted LVM documentation	2026-04-15 21:00:37 +00:00
vpn.md	docs: update Technitium DNS docs after cache optimization	2026-04-12 18:29:25 +01:00