[redis] Phase 7 step 2: remove Bitnami helm_release + orphan PVCs

Bringing the 2026-04-19 rework to its end-state. Cutover soaked for ~1h with 0 alerts firing and 127 ops/sec on the v2 master — skipped the nominal 24h rollback window per user direction. - Removed `helm_release.redis` (Bitnami chart v25.3.2) from TF. Helm destroy cleaned up the StatefulSet redis-node (already scaled to 0), ConfigMaps, ServiceAccount, RBAC, and the deprecated `redis` + `redis-headless` ClusterIP services that the chart owned. - Removed `null_resource.patch_redis_service` — the kubectl-patch hack that worked around the Bitnami chart's broken service selector. No Helm chart, no patch needed. - Removed the dead `depends_on = [helm_release.redis]` from the HAProxy deployment. - `kubectl delete pvc -n redis redis-data-redis-node-{0,1}` for the two orphan PVCs the StatefulSet template left behind (K8s doesn't cascade-delete). - Simplified the top-of-file comment and the redis-v2 architecture comment — they talked about the parallel-cluster migration state that no longer exists. Folded in the sentinel hostname gotcha, the redis 8.x image requirement, and the BGSAVE+AOF-rewrite memory reasoning so the rationale survives in the code rather than only in beads. - `RedisDown` alert no longer matches `redis-node|redis-v2` — just `redis-v2` since that's the only StatefulSet now. Kept the `or on() vector(0)` so the alert fires when kube_state_metrics has no sample (e.g. after accidental delete). - `docs/architecture/databases.md` trimmed: no more "pending TF removal" or "cold rollback for 24h" language. Verification after apply: - kubectl get all -n redis: redis-v2-{0,1,2} (3/3 Running) + redis-haproxy-* (3 pods, PDB minAvailable=2). Services: redis-master + redis-v2-headless only. - PVCs: data-redis-v2-{0,1,2} only (redis-data-redis-node-* deleted). - Sentinel: all 3 agree mymaster = redis-v2-0 hostname. - HAProxy: PING PONG, DBSIZE 92, 127 ops/sec on master. - Prometheus: 0 firing redis alerts. Closes: code-v2b Closes: code-2mw Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 16:32:14 +00:00 · 2026-04-19 16:32:14 +00:00 · e55c549c9a
commit e55c549c9a
parent c113be4d5e
3 changed files with 24 additions and 160 deletions
--- a/docs/architecture/databases.md
+++ b/docs/architecture/databases.md
@ -122,14 +122,10 @@ graph TB

 Single shared cluster for all 17 consumers (Immich, Authentik, Nextcloud, Paperless, Dawarich Sidekiq, Traefik, etc.). HAProxy (3 replicas, PDB minAvailable=2) is the sole client-facing path — clients talk only to `redis-master.redis.svc.cluster.local:6379` and HAProxy health-checks backends via `INFO replication`, routing only to `role:master`.

-**Current state (as of 2026-04-19, cutover complete)**:
-
-Active cluster: `redis-v2-*` — 3 pods, each co-locating redis + sentinel + redis_exporter, using `docker.io/library/redis:8-alpine` (8.6.2). HAProxy backends point at `redis-v2-{0,1,2}.redis-v2-headless.redis.svc.cluster.local`. DBSIZE matched between old master and new at cutover; all data (including `immich_bull:*` and `_kombu.*` queues) preserved via chained `REPLICAOF`. Steady-state probe: 45/45 PING OK. Two chaos drills (kill master, sentinel failover) passed — first drill ~12s disruption, second ~1s after hostname fix below.
-
-Legacy `redis-node-*` StatefulSet is scaled to 0 (kept as cold rollback for 24h). Helm release `helm_release.redis` + PVCs `redis-data-redis-node-{0,1}` are pending Terraform removal in a follow-up commit (see beads follow-up task).
-
 **Architecture**:

+3 pods in StatefulSet `redis-v2`, each co-locating redis + sentinel + redis_exporter, using `docker.io/library/redis:8-alpine` (8.6.2). HAProxy (3 replicas, PDB minAvailable=2) routes clients to the current master via 1s `INFO replication` tcp-checks. Full context behind the April 2026 rework in beads `code-v2b`.
+
 - 3 redis pods + 3 co-located sentinels (quorum=2). Odd sentinel count eliminates split-brain.
 - `podManagementPolicy=Parallel` + init container that regenerates `sentinel.conf` on every boot by probing peer sentinels for consensus master (priority: sentinel vote → peer role:master with slaves → deterministic pod-0 fallback). No persistent sentinel runtime state — can't drift out of sync with reality (root cause of 2026-04-19 PM incident).
 - redis.conf has `include /shared/replica.conf`; the init container writes either an empty file (master) or `replicaof <master> 6379` (replicas), so pods come up already in the right role — no bootstrap race.