[redis] stabilise against node-crash flap cascade — RC1-RC5 fixes

Five compounding factors produced the 2026-04-22 flap cascade: soft anti-affinity let 2/3 pods co-locate on k8s-node3 (which bounced NotReady→Ready at 11:42Z and took quorum), aggressive sentinel/probe timing amplified LUKS-encrypted LVM I/O stalls into spurious +switch-master loops, HAProxy's 1s polling raced sentinel failovers and routed writes to demoted masters, publish_not_ready_addresses=true fed not-yet-ready pods into HAProxy DNS, and realestate-crawler-celery CrashLoopBackOff closed the feedback loop.

Changes:
- Anti-affinity: preferred → required (one redis pod per node, hard)
- Sentinel down-after-ms 5000→15000, failover-timeout 30000→60000
- Redis + sentinel liveness: timeout 3→10, failure_threshold 3→5
- HAProxy: check inter 1s→2s / fall 2→3, timeout check 3s→5s
- Headless svc: publish_not_ready_addresses true→false

Post-rollout verification clean: 0 flaps, 0 +switch-master events, 0 celery ReadOnlyError in the 60s window after settle. Docs updated.
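
To make the timing changes concrete, here is a minimal sketch of the detection-window arithmetic for the old and new settings. The values come from this commit; the class and function names are hypothetical, the 6s stall is an illustrative figure rather than a measurement, and the HAProxy figure is an approximation (inter × fall) that ignores check timeouts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Timing:
    """One set of failure-detection settings, all in seconds."""
    sentinel_down_after_s: float  # sentinel down-after-milliseconds / 1000
    haproxy_inter_s: float        # HAProxy "check inter"
    haproxy_fall: int             # HAProxy "fall"

    def haproxy_mark_down_s(self) -> float:
        # Approximation: `fall` consecutive failed checks, one every `inter`.
        return self.haproxy_inter_s * self.haproxy_fall

    def stall_triggers_sdown(self, stall_s: float) -> bool:
        # Sentinel flags +sdown once the master has been unresponsive for
        # at least down-after-milliseconds.
        return stall_s >= self.sentinel_down_after_s


# Values taken from the diff in this commit.
OLD = Timing(sentinel_down_after_s=5.0, haproxy_inter_s=1.0, haproxy_fall=2)
NEW = Timing(sentinel_down_after_s=15.0, haproxy_inter_s=2.0, haproxy_fall=3)

# Hypothetical stall length, chosen only to sit above the old 5s threshold.
EXAMPLE_STALL_S = 6.0

if __name__ == "__main__":
    for name, t in (("old", OLD), ("new", NEW)):
        print(name, t.haproxy_mark_down_s(), t.stall_triggers_sdown(EXAMPLE_STALL_S))
```

Under these assumptions a stall just over 5s was enough for +sdown with the old settings, while the new 15s threshold absorbs it. HAProxy now needs roughly 6s of failed checks, instead of roughly 2s, before it marks a server down.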
2026-04-22 15:59:00 +00:00 · 7dfe89a6e0
commit 7dfe89a6e0
parent fdced7577b
2 changed files with 23 additions and 22 deletions
--- a/docs/architecture/databases.md
+++ b/docs/architecture/databases.md
@@ -127,9 +127,13 @@ Single shared cluster for all 17 consumers (Immich, Authentik, Nextcloud, Paperl
 3 pods in StatefulSet `redis-v2`, each co-locating redis + sentinel + redis_exporter, using `docker.io/library/redis:8-alpine` (8.6.2). HAProxy (3 replicas, PDB minAvailable=2) routes clients to the current master via 1s `INFO replication` tcp-checks. Full context behind the April 2026 rework in beads `code-v2b`.

 - 3 redis pods + 3 co-located sentinels (quorum=2). Odd sentinel count eliminates split-brain.
+- **Pod anti-affinity is `required` (hard)** — each redis pod must land on a distinct node. Soft anti-affinity previously let the scheduler co-locate 2/3 pods on the same node; when that node (`k8s-node3`) went `NotReady→Ready` at 11:42 UTC on 2026-04-22 it took 2 redis pods with it and the cluster lost quorum. Cluster-wide PV `nodeAffinity` matches one zone (`topology.kubernetes.io/region=pve, zone=pve`), so PVCs rebind freely on reschedule.
 - `podManagementPolicy=Parallel` + init container that regenerates `sentinel.conf` on every boot by probing peer sentinels for consensus master (priority: sentinel vote → peer role:master with slaves → deterministic pod-0 fallback). No persistent sentinel runtime state — can't drift out of sync with reality (root cause of 2026-04-19 PM incident).
 - redis.conf has `include /shared/replica.conf`; the init container writes either an empty file (master) or `replicaof <master> 6379` (replicas), so pods come up already in the right role — no bootstrap race.
 - **Sentinel hostname persistence**: `sentinel resolve-hostnames yes` + `sentinel announce-hostnames yes` in the init-generated sentinel.conf are mandatory — without them, sentinel stores resolved IPs in its rewritten config, and pod-IP churn on restart breaks failover. The MONITOR command itself must be issued with a hostname and the flags must be active before MONITOR, otherwise sentinel stores an IP that goes stale the next time the pod is deleted.
+- **Failover timing (tuned 2026-04-22)**: `sentinel down-after-milliseconds=15000` + `sentinel failover-timeout=60000`. Redis liveness probe `timeout_seconds=10, failure_threshold=5`; sentinel liveness probe same. LUKS-encrypted LVM + BGSAVE fork can briefly stall master I/O >5s, which under the old 5s/30s sentinel timings + 3s/3 probes induced spurious `+sdown`→`+odown`→`+switch-master` cycles every 1-2 minutes. The new values absorb normal BGSAVE pauses without triggering failover.
+- **HAProxy check smoothing (tuned 2026-04-22)**: `check inter 2s fall 3 rise 2` (was `1s / 2 / 2`) + `timeout check 5s` (was `3s`). The aggressive 1s polling used to race sentinel failovers — during a legitimate promote, HAProxy could catch the old master serving `role:slave` in the 1-3s window before re-probing the new master, leaving the backend empty and clients receiving `ReadOnlyError`.
+- **Headless service `publish_not_ready_addresses=false`** (flipped 2026-04-22). Previously `true` meant HAProxy's DNS resolver saw not-yet-ready pods during rollouts, compounding the check-race above. Sentinel peer discovery is unaffected because sentinels announce to each other explicitly via `sentinel announce-hostnames yes`.
 - Memory: master + replicas `requests=limits=768Mi`. Concurrent BGSAVE + AOF-rewrite fork can double RSS via COW, so headroom must cover it. `auto-aof-rewrite-percentage=200` + `auto-aof-rewrite-min-size=128mb` tune down rewrite frequency.
 - Persistence: RDB (`save 900 1 / 300 100 / 60 10000`) + AOF `appendfsync=everysec`. Disk-wear analysis on 2026-04-19 (sdb Samsung 850 EVO 1TB, 150 TBW): Redis contributes <1 GB/day cluster-wide → 40+ year runway at the 20% TBW budget.
 - `maxmemory=640mb` (83% of 768Mi limit), `maxmemory-policy=allkeys-lru`.
@@ -138,7 +142,7 @@ Single shared cluster for all 17 consumers (Immich, Authentik, Nextcloud, Paperl

 **Observability** (redis-v2 only): `oliver006/redis_exporter:v1.62.0` sidecar per pod on port 9121, auto-scraped via Prometheus pod annotation. Alerts: `RedisDown`, `RedisMemoryPressure`, `RedisEvictions`, `RedisReplicationLagHigh`, `RedisForkLatencyHigh`, `RedisAOFRewriteLong`, `RedisReplicasMissing`, `RedisBackupStale`, `RedisBackupNeverSucceeded`.

-**Why this design** — three incidents in April 2026 drove the rework: (a) 2026-04-04 service selector routed reads+writes to master+replica causing `READONLY` errors; (b) 2026-04-19 AM master OOMKilled during BGSAVE+PSYNC with the 256Mi limit too tight for a 204 MB working set under COW amplification; (c) 2026-04-19 PM sentinel runtime state drifted (only 2 sentinels, no majority) and routed writes to a slave. See beads epic `code-v2b` for the full plan and linked challenger analyses.
+**Why this design** — four incidents in April 2026 drove the rework: (a) 2026-04-04 service selector routed reads+writes to master+replica causing `READONLY` errors; (b) 2026-04-19 AM master OOMKilled during BGSAVE+PSYNC with the 256Mi limit too tight for a 204 MB working set under COW amplification; (c) 2026-04-19 PM sentinel runtime state drifted (only 2 sentinels, no majority) and routed writes to a slave; (d) 2026-04-22 five-factor flap cascade — soft anti-affinity let 2/3 pods co-locate on `k8s-node3`, node bounced NotReady→Ready and took quorum with it; aggressive sentinel/probe timing (5s/30s + 3s/3) amplified disk-I/O stalls under LUKS-encrypted LVM into spurious `+switch-master` loops; HAProxy's 1s polling raced sentinel failovers and routed writes to demoted masters; `publish_not_ready_addresses=true` fed not-yet-ready pods into HAProxy DNS; downstream `realestate-crawler-celery` CrashLoopBackOff closed the feedback loop. See beads epic `code-v2b` for the full plan and linked challenger analyses.

 ### SQLite (Per-App)

--- a/stacks/redis/modules/redis/main.tf
+++ b/stacks/redis/modules/redis/main.tf
@@ -43,7 +43,7 @@ resource "kubernetes_config_map" "haproxy" {
        timeout connect 5s
        timeout client  30s
        timeout server  30s
-        timeout check   3s
+        timeout check   5s

      # Dynamic DNS resolution via cluster CoreDNS. Without this, haproxy
      # resolves server hostnames once at startup and caches forever, so
@@ -82,9 +82,9 @@ resource "kubernetes_config_map" "haproxy" {
        tcp-check expect rstring role:master
        tcp-check send "QUIT\r\n"
        tcp-check expect string +OK
-        server redis-v2-0 redis-v2-0.redis-v2-headless.redis.svc.cluster.local:6379 check inter 1s fall 2 rise 2 resolvers kubernetes init-addr last,libc,none
-        server redis-v2-1 redis-v2-1.redis-v2-headless.redis.svc.cluster.local:6379 check inter 1s fall 2 rise 2 resolvers kubernetes init-addr last,libc,none
-        server redis-v2-2 redis-v2-2.redis-v2-headless.redis.svc.cluster.local:6379 check inter 1s fall 2 rise 2 resolvers kubernetes init-addr last,libc,none
+        server redis-v2-0 redis-v2-0.redis-v2-headless.redis.svc.cluster.local:6379 check inter 2s fall 3 rise 2 resolvers kubernetes init-addr last,libc,none
+        server redis-v2-1 redis-v2-1.redis-v2-headless.redis.svc.cluster.local:6379 check inter 2s fall 3 rise 2 resolvers kubernetes init-addr last,libc,none
+        server redis-v2-2 redis-v2-2.redis-v2-headless.redis.svc.cluster.local:6379 check inter 2s fall 3 rise 2 resolvers kubernetes init-addr last,libc,none

      backend redis_sentinel
        balance roundrobin
@@ -362,8 +362,8 @@ resource "kubernetes_config_map" "redis_v2_sentinel_bootstrap" {
      sentinel resolve-hostnames yes
      sentinel announce-hostnames yes
      sentinel monitor mymaster $MASTER_HOST 6379 2
-      sentinel down-after-milliseconds mymaster 5000
-      sentinel failover-timeout mymaster 30000
+      sentinel down-after-milliseconds mymaster 15000
+      sentinel failover-timeout mymaster 60000
      sentinel parallel-syncs mymaster 1
      EOF

@@ -396,7 +396,7 @@ resource "kubernetes_service" "redis_v2_headless" {
  }
  spec {
    cluster_ip                  = "None"
-    publish_not_ready_addresses = true
+    publish_not_ready_addresses = false
    selector = {
      app = "redis-v2"
    }
@@ -451,18 +451,15 @@ resource "kubernetes_stateful_set_v1" "redis_v2" {

        affinity {
          pod_anti_affinity {
-            preferred_during_scheduling_ignored_during_execution {
-              weight = 100
-              pod_affinity_term {
-                label_selector {
-                  match_expressions {
-                    key      = "app"
-                    operator = "In"
-                    values   = ["redis-v2"]
-                  }
+            required_during_scheduling_ignored_during_execution {
+              label_selector {
+                match_expressions {
+                  key      = "app"
+                  operator = "In"
+                  values   = ["redis-v2"]
                }
-                topology_key = "kubernetes.io/hostname"
              }
+              topology_key = "kubernetes.io/hostname"
            }
          }
        }
@@ -535,8 +532,8 @@ resource "kubernetes_stateful_set_v1" "redis_v2" {
            }
            initial_delay_seconds = 15
            period_seconds        = 10
-            timeout_seconds       = 3
-            failure_threshold     = 3
+            timeout_seconds       = 10
+            failure_threshold     = 5
          }
          readiness_probe {
            exec {
@@ -580,8 +577,8 @@ resource "kubernetes_stateful_set_v1" "redis_v2" {
            }
            initial_delay_seconds = 20
            period_seconds        = 10
-            timeout_seconds       = 3
-            failure_threshold     = 3
+            timeout_seconds       = 10
+            failure_threshold     = 5
          }
          readiness_probe {
            exec {