diff --git a/docs/architecture/databases.md b/docs/architecture/databases.md index b648012e..86b6f0c8 100644 --- a/docs/architecture/databases.md +++ b/docs/architecture/databases.md @@ -121,29 +121,23 @@ graph TB ### Redis -Single shared cluster for all 17 consumers (Immich, Authentik, Nextcloud, Paperless, Dawarich Sidekiq, Traefik, etc.). HAProxy (3 replicas, PDB minAvailable=2) is the sole client-facing path — clients talk only to `redis-master.redis.svc.cluster.local:6379` and HAProxy health-checks backends via `INFO replication`, routing only to `role:master`. +Single **standalone** instance shared by all consumers (Immich, Authentik, Nextcloud, Paperless, Dawarich Sidekiq, Celery apps, Traefik, etc.). Clients talk to `redis-master.redis.svc.cluster.local:6379`, which now selects the single redis pod directly. **No Sentinel, no HAProxy, no replicas** — reverted from 3-node HA on 2026-05-30 (see "Why standalone" below). **Architecture**: -3 pods in StatefulSet `redis-v2`, each co-locating redis + sentinel + redis_exporter, using `docker.io/library/redis:8-alpine` (8.6.2). HAProxy (3 replicas, PDB minAvailable=2) routes clients to the current master via 1s `INFO replication` tcp-checks. Full context behind the April 2026 rework in beads `code-v2b`. +1 pod in StatefulSet `redis-v2` (`replicas=1`, `podManagementPolicy=Parallel` retained for STS-field immutability), running `redis` + `redis_exporter` containers on `docker.io/library/redis:8-alpine` (8.6.2). Data on a `proxmox-lvm-encrypted` PVC (`data-redis-v2-0`, 5Gi→20Gi autoresize). -- 3 redis pods + 3 co-located sentinels (quorum=2). Odd sentinel count eliminates split-brain. -- **Pod anti-affinity is `required` (hard)** — each redis pod must land on a distinct node. Soft anti-affinity previously let the scheduler co-locate 2/3 pods on the same node; when that node (`k8s-node3`) went `NotReady→Ready` at 11:42 UTC on 2026-04-22 it took 2 redis pods with it and the cluster lost quorum. 
Cluster-wide PV `nodeAffinity` matches one zone (`topology.kubernetes.io/region=pve, zone=pve`), so PVCs rebind freely on reschedule. -- `podManagementPolicy=Parallel` + init container that regenerates `sentinel.conf` on every boot by probing peer sentinels for consensus master (priority: sentinel vote → peer role:master with slaves → deterministic pod-0 fallback). No persistent sentinel runtime state — can't drift out of sync with reality (root cause of 2026-04-19 PM incident). -- redis.conf has `include /shared/replica.conf`; the init container writes either an empty file (master) or `replicaof 6379` (replicas), so pods come up already in the right role — no bootstrap race. -- **Sentinel hostname persistence**: `sentinel resolve-hostnames yes` + `sentinel announce-hostnames yes` in the init-generated sentinel.conf are mandatory — without them, sentinel stores resolved IPs in its rewritten config, and pod-IP churn on restart breaks failover. The MONITOR command itself must be issued with a hostname and the flags must be active before MONITOR, otherwise sentinel stores an IP that goes stale the next time the pod is deleted. -- **Failover timing (tuned 2026-04-22)**: `sentinel down-after-milliseconds=15000` + `sentinel failover-timeout=60000`. Redis liveness probe `timeout_seconds=10, failure_threshold=5`; sentinel liveness probe same. LUKS-encrypted LVM + BGSAVE fork can briefly stall master I/O >5s, which under the old 5s/30s sentinel timings + 3s/3 probes induced spurious `+sdown`→`+odown`→`+switch-master` cycles every 1-2 minutes. The new values absorb normal BGSAVE pauses without triggering failover. -- **HAProxy check smoothing (tuned 2026-04-22)**: `check inter 2s fall 3 rise 2` (was `1s / 2 / 2`) + `timeout check 5s` (was `3s`). 
The aggressive 1s polling used to race sentinel failovers — during a legitimate promote, HAProxy could catch the old master serving `role:slave` in the 1-3s window before re-probing the new master, leaving the backend empty and clients receiving `ReadOnlyError`. -- **Headless service `publish_not_ready_addresses=false`** (flipped 2026-04-22). Previously `true` meant HAProxy's DNS resolver saw not-yet-ready pods during rollouts, compounding the check-race above. Sentinel peer discovery is unaffected because sentinels announce to each other explicitly via `sentinel announce-hostnames yes`. -- Memory: master + replicas `requests=limits=768Mi`. Concurrent BGSAVE + AOF-rewrite fork can double RSS via COW, so headroom must cover it. `auto-aof-rewrite-percentage=200` + `auto-aof-rewrite-min-size=128mb` tune down rewrite frequency. -- Persistence: RDB (`save 900 1 / 300 100 / 60 10000`) + AOF `appendfsync=everysec`. Disk-wear analysis on 2026-04-19 (sdb Samsung 850 EVO 1TB, 150 TBW): Redis contributes <1 GB/day cluster-wide → 40+ year runway at the 20% TBW budget. -- `maxmemory=640mb` (83% of 768Mi limit), `maxmemory-policy=allkeys-lru`. -- Weekly RDB backup to NFS (`/srv/nfs/redis-backup/`, Sunday 03:00, 28-day retention, pushes Pushgateway metrics). -- Auth disabled this phase — NetworkPolicy is the isolation layer. Enabling `requirepass` + rolling creds to all 17 clients is a planned follow-up. +- `maxmemory=640mb` (83% of the 768Mi pod limit), **`maxmemory-policy=volatile-lru`**. The instance is shared by two workload classes: CACHES (want LRU eviction of disposable keys) and QUEUES (Immich BullMQ `bull:*`, Celery `_kombu:*` — must never be evicted or jobs vanish). `volatile-lru` evicts only keys carrying a TTL (caches set them) and never touches TTL-less keys (queue jobs), serving both correctly in one instance. Backstop: alert `RedisMemoryPressure` at 80% — if it ever fills with non-volatile keys, writes error like `noeviction`. 
+- Persistence: RDB (`save 900 1 / 300 100 / 60 10000`) + AOF `appendfsync=everysec`. `aof-load-corrupt-tail-max-size=1024` tolerates ≤1KB of AOF tail garbage from an unclean reboot instead of crashlooping. Disk-wear (sdb Samsung 850 EVO, 150 TBW): Redis contributes <1 GB/day cluster-wide → 40+ year runway. +- Memory `requests=limits=768Mi`. BGSAVE + AOF-rewrite fork can double RSS via COW; `auto-aof-rewrite-percentage=200` + `auto-aof-rewrite-min-size=128mb` tune down rewrite frequency. +- Service `redis-master` (name/DNS unchanged across the HA teardown so no consumer needed editing). Keel opt-out (`keel.sh/policy=never`, label + annotation) — a prior patch-bump to `:8.0.6-alpine` rejected the AOF config and crashed it. +- Weekly RDB backup to NFS (`/srv/nfs/redis-backup/`, Sunday 03:00, 28-day retention, Pushgateway metrics). +- Auth disabled — NetworkPolicy is the isolation layer. `requirepass` + creds rollout to all clients remains a planned follow-up. +- **Downtime model**: a single instance means a pod restart (image bump, node drain, OOM) is a few-seconds cluster-wide Redis blip. Explicitly accepted (Viktor, 2026-05-30) as the price of eliminating the HA failure modes below. There is no PDB (a single-replica PDB would only block node drains). -**Observability** (redis-v2 only): `oliver006/redis_exporter:v1.62.0` sidecar per pod on port 9121, auto-scraped via Prometheus pod annotation. Alerts: `RedisDown`, `RedisMemoryPressure`, `RedisEvictions`, `RedisReplicationLagHigh`, `RedisForkLatencyHigh`, `RedisAOFRewriteLong`, `RedisReplicasMissing`, `RedisBackupStale`, `RedisBackupNeverSucceeded`. +**Observability**: `oliver006/redis_exporter:v1.62.0` sidecar on port 9121, auto-scraped. Alerts: `RedisDown`, `RedisMemoryPressure` (>80%), `RedisEvictions`, `RedisForkLatencyHigh`, `RedisAOFRewriteLong`, `RedisBackupStale`, `RedisBackupNeverSucceeded`. (`RedisReplicationLagHigh` + `RedisReplicasMissing` removed with the replicas.) 
-**Why this design** — four incidents in April 2026 drove the rework: (a) 2026-04-04 service selector routed reads+writes to master+replica causing `READONLY` errors; (b) 2026-04-19 AM master OOMKilled during BGSAVE+PSYNC with the 256Mi limit too tight for a 204 MB working set under COW amplification; (c) 2026-04-19 PM sentinel runtime state drifted (only 2 sentinels, no majority) and routed writes to a slave; (d) 2026-04-22 five-factor flap cascade — soft anti-affinity let 2/3 pods co-locate on `k8s-node3`, node bounced NotReady→Ready and took quorum with it; aggressive sentinel/probe timing (5s/30s + 3s/3) amplified disk-I/O stalls under LUKS-encrypted LVM into spurious `+switch-master` loops; HAProxy's 1s polling raced sentinel failovers and routed writes to demoted masters; `publish_not_ready_addresses=true` fed not-yet-ready pods into HAProxy DNS; downstream `realestate-crawler-celery` CrashLoopBackOff closed the feedback loop. See beads epic `code-v2b` for the full plan and linked challenger analyses. +**Why standalone** — HA Redis caused more outages than it prevented in this homelab. Five incidents: (a) 2026-04-04 service selector routed writes to a replica → `READONLY`; (b) 2026-04-19 AM master OOMKilled during BGSAVE+PSYNC (256Mi too tight); (c) 2026-04-19 PM sentinel quorum drift (2 sentinels, no majority) routed writes to a slave; (d) 2026-04-22 five-factor flap cascade (soft anti-affinity co-located pods + aggressive sentinel/probe timing + HAProxy polling race); (e) **2026-05-30 split-brain** — `redis-v2-0` booted during a network partition, hit the init script's deterministic "pod-0 is bootstrap master" fallback, and became a SECOND master alongside the sentinel-elected `redis-v2-2`; HAProxy's `expect rstring role:master` matched both and round-robined client connections across them, so Immich enqueued BullMQ jobs on one master while its workers blocked-popped on the other → every queue wedged, new-upload thumbnails 404'd cluster-wide. 
The 3-sentinel design (beads `code-v2b`) was built specifically to prevent split-brain after incident (c), yet the bootstrap fallback manufactured one anyway. Conclusion: for a homelab cache/broker, a single instance with a few-seconds restart blip is strictly simpler and more reliable than chasing Sentinel correctness. Mirrors the MySQL InnoDB-Cluster → standalone reversion (2026-04-16). Post-mortem: `docs/post-mortems/2026-05-30-redis-split-brain.md`. ### SQLite (Per-App) diff --git a/docs/post-mortems/2026-05-30-redis-split-brain.md b/docs/post-mortems/2026-05-30-redis-split-brain.md new file mode 100644 index 00000000..098a2c58 --- /dev/null +++ b/docs/post-mortems/2026-05-30-redis-split-brain.md @@ -0,0 +1,93 @@ +# Post-mortem: Redis split-brain wedged BullMQ/Celery queues (2026-05-30) + +**Severity:** SEV2 (degraded — no data loss in Redis; queue processing stalled +cluster-wide). **Status:** Resolved. + +## Summary + +The 3-node Sentinel HA Redis (`redis-v2`) split-brained: two pods both held +`role:master`. HAProxy — which routes to any backend reporting `role:master` — +round-robined client connections across **both** masters. Immich enqueued +BullMQ jobs on one master while its workers blocked-popped on the other, so +every queue stalled. User-visible symptom: **newly uploaded Immich photos +returned HTTP 404 for their thumbnails** (the generation job never ran). Celery +apps (real-estate-crawler, trading-bot, paperless) and other queue users were +affected the same way. + +## Impact + +- Immich: thumbnail/preview/face/ML jobs not processing. `facialRecognition` + backlog reached ~30k waiting; new uploads showed broken images in the web UI. +- All ~15 shared-Redis consumers had inconsistent reads/writes (connections + split across two diverging masters). +- No Redis data lost — the larger dataset (`redis-v2-0`, ~30k keys) was + preserved through the fix. 
+ +## Timeline (UTC+1 local) + +- **~2026-05-26/27**: `redis-v2` pods recreated (node2 unclean reboot era). + `redis-v2-0` came up partitioned; its Sentinel saw 0 peers and it declared + itself master via the init script's deterministic "pod-0 = bootstrap master" + fallback. Sentinels on `-1`/`-2` independently elected `redis-v2-2`. + Split-brain formed and persisted (~3-4 days) as the network healed but the + topology never reconciled. +- **2026-05-30 ~16:58**: investigating "Immich images with no thumbnails." + Found thumbnail jobs failing on missing/zeroed originals (separate pre-existing + data-loss issue) AND a stuck job queue. +- **2026-05-30 ~17:00**: user manually restarted immich-server; namespace + `tier-quota` (24Gi) briefly blocked the replacement pod → ~1 min Immich + outage. Recovered. (Red herring — not the root cause.) +- **2026-05-30 ~17:1x**: identified two `role:master` redis pods + (`redis-v2-0` dbsize 30320, isolated, 0 connected slaves; `redis-v2-2` dbsize + 442, quorum master). HAProxy fan-out across both = wedged queues. Ruled out + IPv6 (cluster is single-stack IPv4) and eviction (`evicted_keys=0`). +- **2026-05-30 ~17:30**: reverted `redis-v2` to a single standalone instance. + Queues drained immediately; newest Immich assets served HTTP 200. + +## Root cause + +`redis-v2`'s init container (`generate-sentinel-conf`) falls through to +"Priority 3: pod-0 is always the bootstrap master" when it cannot reach peer +Sentinels/Redis. During a network partition, `redis-v2-0` hit that fallback and +became a second master. HAProxy's health check (`tcp-check expect rstring +role:master`) matches **any** master, so with two masters it placed both in +rotation and round-robined writes/reads across diverging datasets. BullMQ's +enqueue (LPUSH) and worker consume (BRPOPLPUSH) landed on different instances → +jobs never consumed. + +This is the **third** Sentinel-class incident (after 2026-04-19 PM quorum drift +and 2026-04-22 flap cascade). 
The 3-sentinel design was built to *prevent* +split-brain, but the bootstrap fallback re-introduced it. + +## Resolution + +Reverted `redis-v2` to a **single standalone instance** (`replicas=1`, Sentinel ++ HAProxy removed), collapsing onto `redis-v2-0`'s dataset (preserved Immich's +queued jobs). Eviction policy changed `allkeys-lru` → **`volatile-lru`** so the +shared cache+queue workload is served correctly by one instance (evict only +TTL'd cache keys; never TTL-less queue keys). `redis-master` service name/DNS +unchanged → no consumer edits. Decision rationale: a homelab cache/broker does +not need HA; a few-seconds restart blip beats chasing Sentinel correctness. +Mirrors the 2026-04-16 MySQL InnoDB-Cluster → standalone reversion. + +## Follow-ups + +- [ ] Re-upload the ~99 Immich images + 12 timeline videos whose **originals** + are missing/zero-filled on disk (pre-existing data loss, unrelated to the + split-brain — re-running jobs can't regenerate them). Owner: Viktor. +- [ ] `requirepass` auth on Redis + creds rollout to all consumers (carried over + from the 2026-04-19 rework; still open). +- [ ] Consider whether any queue user (Immich/Celery) warrants its own dedicated + Redis if the shared instance's memory ever becomes contended (currently + ~30MB / 640MB — not a concern). + +## Lessons + +- HA that re-introduces its own failure class is worse than no HA. For a + single-node-tolerant homelab, prefer a standalone instance + a small accepted + downtime window. +- `allkeys-lru` on a shared cache+queue Redis silently drops queue jobs under + pressure; `volatile-lru` is the correct single-instance policy (Immich even + logs `IMPORTANT! Eviction policy ... should be "noeviction"`). +- A "bootstrap master" fallback that fires under partition is a split-brain + generator — avoid deterministic self-promotion when peers are unreachable. 
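The last lesson implies a cheap invariant: at most one pod should ever report `role:master`. Below is a minimal sketch of that check in Python. It assumes the raw `INFO replication` text from each pod has already been collected; the helper names and sample replies are hypothetical illustrations, not code from this repo.

```python
def parse_role(info_text):
    """Return the value of the `role:` field from INFO replication output, or None."""
    for line in info_text.splitlines():
        line = line.strip()
        if line.startswith("role:"):
            return line.split(":", 1)[1]
    return None


def find_masters(info_by_pod):
    """Return the sorted names of pods whose INFO replication reports role:master."""
    return sorted(
        pod for pod, info in info_by_pod.items() if parse_role(info) == "master"
    )


def is_split_brain(info_by_pod):
    """True when more than one pod claims to be master."""
    return len(find_masters(info_by_pod)) > 1


# Hypothetical replies shaped like the 2026-05-30 state: two masters, one replica.
sample = {
    "redis-v2-0": "# Replication\r\nrole:master\r\nconnected_slaves:0\r\n",
    "redis-v2-1": "# Replication\r\nrole:slave\r\nmaster_host:redis-v2-2\r\n",
    "redis-v2-2": "# Replication\r\nrole:master\r\nconnected_slaves:1\r\n",
}
print(find_masters(sample))
print(is_split_brain(sample))
```

A router built on this idea would refuse to pick a backend when the count exceeds one, rather than accepting any backend that matches `role:master`.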
diff --git a/stacks/monitoring/modules/monitoring/prometheus_chart_values.tpl b/stacks/monitoring/modules/monitoring/prometheus_chart_values.tpl index 12fd8884..da01bc92 100755 --- a/stacks/monitoring/modules/monitoring/prometheus_chart_values.tpl +++ b/stacks/monitoring/modules/monitoring/prometheus_chart_values.tpl @@ -1676,30 +1676,28 @@ serverFiles: labels: severity: critical annotations: - summary: "Redis has no ready replicas" + summary: "Redis is down — statefulset redis-v2 has no ready pod" - alert: RedisMemoryPressure - expr: redis_memory_used_bytes{namespace="redis"} / redis_memory_max_bytes{namespace="redis"} > 0.85 + # Single instance, volatile-lru (2026-05-30): at maxmemory, TTL'd + # (cache) keys are evicted but TTL-less keys (Immich BullMQ + Celery + # jobs) are NOT — so once cache headroom is gone, queue writes start + # erroring. 80% is the backstop to intervene (bump maxmemory) first. + expr: redis_memory_used_bytes{namespace="redis"} / redis_memory_max_bytes{namespace="redis"} > 0.80 for: 5m labels: severity: warning annotations: - summary: "Redis pod {{ $labels.pod }} using {{ $value | humanizePercentage }} of maxmemory — eviction imminent" + summary: "Redis pod {{ $labels.pod }} using {{ $value | humanizePercentage }} of maxmemory — volatile-lru evicting cache keys; queue writes at risk" - alert: RedisEvictions - # allkeys-lru is configured so evictions under cache pressure are - # expected, but sustained evictions mean we're thrashing — raise it. + # volatile-lru evicts only TTL'd (cache) keys under pressure — an + # occasional eviction is by design, but a sustained rate means we're + # near maxmemory and should raise it before queue writes error. 
expr: rate(redis_evicted_keys_total{namespace="redis"}[5m]) > 0 for: 5m labels: severity: warning annotations: - summary: "Redis pod {{ $labels.pod }} evicting keys ({{ $value }} keys/s)" - - alert: RedisReplicationLagHigh - expr: redis_connected_slave_lag_seconds{namespace="redis"} > 30 - for: 3m - labels: - severity: warning - annotations: - summary: "Redis replica {{ $labels.slave_ip }} lagging {{ $value }}s behind master" + summary: "Redis pod {{ $labels.pod }} evicting keys ({{ $value }} keys/s) — near maxmemory" - alert: RedisForkLatencyHigh # latest_fork_usec > 500ms means BGSAVE fork is stalling the main # thread long enough to drop client requests. COW pressure or @@ -1717,16 +1715,6 @@ serverFiles: severity: warning annotations: summary: "Redis pod {{ $labels.pod }} AOF rewrite running >10m — COW memory risk, investigate" - - alert: RedisReplicasMissing - # redis-v2 StatefulSet should always have 3 replicas connected to - # the master (2 replicas + itself). <2 connected_slaves means one - # replica is unreachable or still syncing. - expr: redis_connected_slaves{namespace="redis", pod=~"redis-v2-.*"} < 2 and redis_instance_info{namespace="redis", pod=~"redis-v2-.*", role="master"} == 1 - for: 10m - labels: - severity: warning - annotations: - summary: "Redis master {{ $labels.pod }} has only {{ $value }} connected replicas (expected 2)" - alert: HeadscaleReplicasMismatch expr: (kube_deployment_status_replicas_available{namespace="headscale"} or on() vector(0)) < 1 for: 5m diff --git a/stacks/redis/modules/redis/main.tf b/stacks/redis/modules/redis/main.tf index 898fab34..f8a81c53 100644 --- a/stacks/redis/modules/redis/main.tf +++ b/stacks/redis/modules/redis/main.tf @@ -22,222 +22,38 @@ module "tls_secret" { tls_secret_name = var.tls_secret_name } -# HAProxy-based master-only proxy for the 17 Redis consumers. -# Health-checks each redis-v2 pod via `INFO replication` and only routes -# to the current master. 
On Sentinel failover, HAProxy detects the new -# master within ~3s via its 1s tcp-check interval. 3 replicas + PDB -# minAvailable=2 — HAProxy is the sole client-facing path since the -# 2026-04-19 redis rework (see beads code-v2b), so it needs its own HA. - -resource "kubernetes_config_map" "haproxy" { - metadata { - name = "redis-haproxy" - namespace = kubernetes_namespace.redis.metadata[0].name - } - data = { - "haproxy.cfg" = <<-EOT - global - maxconn 256 - - defaults - mode tcp - timeout connect 5s - timeout client 30s - timeout server 30s - timeout check 5s - - # Dynamic DNS resolution via cluster CoreDNS. Without this, haproxy - # resolves server hostnames once at startup and caches forever, so - # when redis-node-X pods restart and get new IPs, haproxy keeps - # connecting to the old (dead) IPs and returns "Connection refused" - # until haproxy itself is restarted. This caused an immich outage - # on 2026-04-19 after a redis pod cycle. - resolvers kubernetes - nameserver coredns kube-dns.kube-system.svc.cluster.local:53 - resolve_retries 3 - timeout resolve 1s - timeout retry 1s - hold other 10s - hold refused 10s - hold nx 10s - hold timeout 10s - hold valid 10s - hold obsolete 10s - - frontend redis_front - bind *:6379 - default_backend redis_master - - frontend sentinel_front - bind *:26379 - default_backend redis_sentinel - - backend redis_master - option tcp-check - tcp-check connect - tcp-check send "PING\r\n" - tcp-check expect string +PONG - tcp-check send "INFO replication\r\n" - # Match "role:master" only — cannot appear in slave responses - # (slave has "role:slave" then "master_host:..." 
which doesn't match) - tcp-check expect rstring role:master - tcp-check send "QUIT\r\n" - tcp-check expect string +OK - server redis-v2-0 redis-v2-0.redis-v2-headless.redis.svc.cluster.local:6379 check inter 2s fall 3 rise 2 resolvers kubernetes init-addr last,libc,none - server redis-v2-1 redis-v2-1.redis-v2-headless.redis.svc.cluster.local:6379 check inter 2s fall 3 rise 2 resolvers kubernetes init-addr last,libc,none - server redis-v2-2 redis-v2-2.redis-v2-headless.redis.svc.cluster.local:6379 check inter 2s fall 3 rise 2 resolvers kubernetes init-addr last,libc,none - - backend redis_sentinel - balance roundrobin - server redis-v2-0 redis-v2-0.redis-v2-headless.redis.svc.cluster.local:26379 check inter 5s resolvers kubernetes init-addr last,libc,none - server redis-v2-1 redis-v2-1.redis-v2-headless.redis.svc.cluster.local:26379 check inter 5s resolvers kubernetes init-addr last,libc,none - server redis-v2-2 redis-v2-2.redis-v2-headless.redis.svc.cluster.local:26379 check inter 5s resolvers kubernetes init-addr last,libc,none - EOT - } -} - -resource "kubernetes_deployment" "haproxy" { - metadata { - name = "redis-haproxy" - namespace = kubernetes_namespace.redis.metadata[0].name - labels = { - app = "redis-haproxy" - } - } - spec { - # 3 replicas + PDB minAvailable=2 (see kubernetes_pod_disruption_budget_v1.redis_haproxy). - # After Nextcloud drops its sentinel fallback in Phase 6 of the 2026-04-19 redis - # rework, HAProxy is the sole client-facing path for all 17 redis consumers, so - # it needs HA equivalent to other critical-path pods (Traefik, Authentik, PgBouncer). - replicas = 3 - selector { - match_labels = { - app = "redis-haproxy" - } - } - template { - metadata { - labels = { - app = "redis-haproxy" - } - annotations = { - # Roll the deployment whenever haproxy.cfg content changes so a - # config update (e.g. DNS resolver tweaks) actually takes effect. 
- "checksum/config" = sha256(kubernetes_config_map.haproxy.data["haproxy.cfg"]) - } - } - spec { - container { - name = "haproxy" - image = "docker.io/library/haproxy:3.1-alpine" - port { - container_port = 6379 - name = "redis" - } - port { - container_port = 26379 - name = "sentinel" - } - volume_mount { - name = "config" - mount_path = "/usr/local/etc/haproxy" - read_only = true - } - resources { - requests = { - cpu = "10m" - memory = "32Mi" - } - limits = { - memory = "64Mi" - } - } - liveness_probe { - tcp_socket { - port = 6379 - } - initial_delay_seconds = 5 - period_seconds = 10 - } - } - volume { - name = "config" - config_map { - name = kubernetes_config_map.haproxy.metadata[0].name - } - } - } - } - } - - lifecycle { - # KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2 - ignore_changes = [spec[0].template[0].spec[0].dns_config] - } -} - -# Dedicated service for HAProxy master-only routing. -# Clients should use redis-master.redis.svc.cluster.local for write-safe connections. -# HAProxy health-checks Redis nodes and only routes to the current master. 
-resource "kubernetes_service" "redis_master" { - metadata { - name = "redis-master" - namespace = kubernetes_namespace.redis.metadata[0].name - labels = { - app = "redis-haproxy" - } - } - spec { - selector = { - app = "redis-haproxy" - } - port { - name = "redis" - port = 6379 - target_port = 6379 - } - port { - name = "sentinel" - port = 26379 - target_port = 26379 - } - } - - depends_on = [kubernetes_deployment.haproxy] -} - -module "nfs_backup_host" { - source = "../../../../modules/kubernetes/nfs_volume" - name = "redis-backup-host" - namespace = kubernetes_namespace.redis.metadata[0].name - nfs_server = "192.168.1.127" - nfs_path = "/srv/nfs/redis-backup" -} - -#### Redis — 3-node raw StatefulSet (MySQL standalone precedent) +#### Redis — SINGLE standalone instance (reverted from 3-node Sentinel HA 2026-05-30) # -# Pattern: raw `kubernetes_stateful_set_v1` + official image, no Bitnami Helm -# chart (deprecated by Broadcom Aug 2025; the atomic-Helm trap broke the old -# cluster's memory-bump roll during the 2026-04-19 AM incident). +# History: a 3-node StatefulSet + Sentinel + HAProxy (the "redis-v2" rework of +# 2026-04-19, beads code-v2b) was built to eliminate the 2-sentinel split-brain +# of the 2026-04-19 PM incident. It STILL split-brained on 2026-05-30: +# redis-v2-0 booted during a network partition, hit the init script's +# "pod-0 is always the bootstrap master" fallback, and became a SECOND master +# alongside the sentinel-elected redis-v2-2. HAProxy's `expect rstring +# role:master` matched BOTH, so it round-robined client connections across +# both masters — Immich enqueued BullMQ jobs on one instance while its workers +# blocked-popped on the other, wedging every queue (new-upload thumbnails 404'd +# cluster-wide). Third Redis HA incident in ~6 weeks. 
# -# Design driven by three April 2026 incidents (see beads code-v2b): -# - 3 sentinels (odd count, quorum=2) — eliminates the split-brain class -# that caused the 2026-04-19 PM incident (2 sentinels, no majority). -# - Init container regenerates sentinel.conf on every boot by probing -# peers for role:master — no persistent sentinel runtime state, so stale -# entries can never resurface across pod restarts. -# - `sentinel resolve-hostnames yes` + `sentinel announce-hostnames yes` -# — without both set BEFORE the first MONITOR, sentinel stores resolved -# IPs and failover breaks when pod IPs churn on restart. -# - podManagementPolicy=Parallel — all 3 pods start together, avoiding the -# "sentinel-0 elects before -2 booted" ordering bug. -# - Memory 768Mi (up from the old 256→512Mi) — concurrent BGSAVE + AOF-rewrite -# fork can double RSS via COW. auto-aof-rewrite-percentage 200 + min-size -# 128mb tune down rewrite frequency. -# - Persistence: RDB snapshots + AOF everysec. Measured <1 GB/day write -# volume (2026-04-19 disk-wear analysis) → 40+ year SSD runway. -# - Image `redis:8-alpine` (8.6.2) — must match what the Bitnami legacy -# cluster ran, otherwise PSYNC fails with "Can't handle RDB format 13". +# Decision (Viktor, 2026-05-30): revert to a SINGLE instance. A homelab +# cache/broker does not need HA; a few seconds of downtime on a pod restart is +# an acceptable trade for structurally removing the entire split-brain class +# (no sentinel quorum, no second master, no HAProxy master fan-out). +# +# eviction policy `volatile-lru` (was `allkeys-lru`): the instance is shared by +# ~15 consumers split between CACHES (want LRU eviction of disposable keys) and +# QUEUES (Immich BullMQ `bull:*`, Celery `_kombu:*` — must NEVER be evicted or +# jobs vanish). `volatile-lru` evicts only keys that carry a TTL (caches set +# them) and never touches TTL-less keys (queue jobs), so it serves both +# correctly in one instance. 
Backstop: PrometheusRule RedisMemoryPressure (>80%) +# in the monitoring stack — if it ever fills with non-volatile keys, writes +# error like noeviction, and we want to know before that happens. +# +# Service name `redis-master.redis.svc.cluster.local:6379` is UNCHANGED so all +# ~15 consumers keep working without edits — it now selects the redis pod +# directly instead of HAProxy. Confirmed (2026-05-30) no consumer used the +# Sentinel port (26379); Nextcloud dropped its in-process sentinel query in the +# 2026-04-19 rework. Pattern mirrors the MySQL standalone (memory 711). resource "kubernetes_config_map" "redis_v2_conf" { metadata { @@ -252,7 +68,11 @@ resource "kubernetes_config_map" "redis_v2_conf" { dir /data maxmemory 640mb - maxmemory-policy allkeys-lru + # volatile-lru: evict only keys WITH a TTL (caches) under memory + # pressure; never evict TTL-less keys (Immich BullMQ + Celery jobs). + # See the header comment for the full rationale. Was allkeys-lru, which + # could silently evict queue jobs. + maxmemory-policy volatile-lru save 900 1 save 300 100 @@ -269,126 +89,16 @@ resource "kubernetes_config_map" "redis_v2_conf" { aof-load-truncated yes aof-use-rdb-preamble yes # Allow loading an AOF with up to 1KB of garbage at the tail (post-2026-05-26 - # node2 unclean reboot corrupted redis-v2-2's incremental AOF at offset - # 84799139; without this, redis-v2-2 crashlooped). Redis truncates the - # corrupted tail and continues. Default is 0 (refuse to load any corruption). + # node2 unclean reboot corrupted an incremental AOF; without this redis + # crashlooped). Redis truncates the corrupted tail and continues. aof-load-corrupt-tail-max-size 1024 - - replica-read-only yes - replica-serve-stale-data yes - timeout 0 tcp-keepalive 300 tcp-backlog 511 databases 16 loglevel notice - - # Included last so `replicaof` directive written by the init container - # overrides the "standalone master" default.
Prevents the parallel-
-      # bootstrap race where all 3 pods claim role:master simultaneously.
-      include /shared/replica.conf
-    EOT
-  }
-}
-
-resource "kubernetes_config_map" "redis_v2_sentinel_bootstrap" {
-  metadata {
-    name      = "redis-v2-sentinel-bootstrap"
-    namespace = kubernetes_namespace.redis.metadata[0].name
-  }
-  data = {
-    "init.sh" = <<-EOT
-      #!/bin/sh
-      set -eu
-
-      HOSTNAME=$(hostname)
-      MY_NUM=$${HOSTNAME##*-}
-      MY_DNS="$HOSTNAME.redis-v2-headless.redis.svc.cluster.local"
-      MASTER_HOST=""
-
-      echo "=== Redis v2 bootstrap ==="
-      echo "hostname: $HOSTNAME (index $MY_NUM)"
-
-      # Priority 1: ask peer sentinels for the consensus master. Covers the
-      # "steady-state pod restart" case — sentinels already agree on reality
-      # and a restarting pod should join that topology.
-      votes_0=0; votes_1=0; votes_2=0; votes_total=0
-      for i in 0 1 2; do
-        if [ "$i" = "$MY_NUM" ]; then continue; fi
-        peer="redis-v2-$i.redis-v2-headless.redis.svc.cluster.local"
-        reply=$(redis-cli -h "$peer" -p 26379 -t 2 SENTINEL get-master-addr-by-name mymaster 2>/dev/null | head -n1 || true)
-        echo "sentinel probe $peer: master=$${reply:-unreachable}"
-        case "$reply" in
-          *redis-v2-0*) votes_0=$((votes_0 + 1)); votes_total=$((votes_total + 1)) ;;
-          *redis-v2-1*) votes_1=$((votes_1 + 1)); votes_total=$((votes_total + 1)) ;;
-          *redis-v2-2*) votes_2=$((votes_2 + 1)); votes_total=$((votes_total + 1)) ;;
-        esac
-      done
-      if [ "$votes_total" -gt 0 ]; then
-        if [ "$votes_0" -ge "$votes_1" ] && [ "$votes_0" -ge "$votes_2" ] && [ "$votes_0" -gt 0 ]; then
-          MASTER_HOST="redis-v2-0.redis-v2-headless.redis.svc.cluster.local"
-        elif [ "$votes_1" -ge "$votes_2" ] && [ "$votes_1" -gt 0 ]; then
-          MASTER_HOST="redis-v2-1.redis-v2-headless.redis.svc.cluster.local"
-        elif [ "$votes_2" -gt 0 ]; then
-          MASTER_HOST="redis-v2-2.redis-v2-headless.redis.svc.cluster.local"
-        fi
-        [ -n "$MASTER_HOST" ] && echo "sentinel vote winner: $MASTER_HOST"
-      fi
-
-      # Priority 2: look for a peer redis that's a master WITH at least one
-      # replica connected. "Standalone master" peers (bootstrap race) are
-      # skipped — connected_slaves=0 is ambiguous.
-      if [ -z "$MASTER_HOST" ]; then
-        for i in 0 1 2; do
-          if [ "$i" = "$MY_NUM" ]; then continue; fi
-          peer="redis-v2-$i.redis-v2-headless.redis.svc.cluster.local"
-          info=$(redis-cli -h "$peer" -t 2 INFO replication 2>/dev/null || true)
-          role=$(echo "$info" | awk -F: '/^role:/ {gsub(/\r/,""); print $2; exit}')
-          slaves=$(echo "$info" | awk -F: '/^connected_slaves:/ {gsub(/\r/,""); print $2; exit}')
-          echo "redis probe $peer: role=$${role:-unreachable} slaves=$${slaves:-0}"
-          if [ "$role" = "master" ] && [ "$${slaves:-0}" -gt 0 ]; then
-            MASTER_HOST="$peer"
-            break
-          fi
-        done
-      fi
-
-      # Priority 3: deterministic fallback — pod -0 is always the bootstrap
-      # master on a fresh cluster. All sentinels converge here, no race.
-      if [ -z "$MASTER_HOST" ]; then
-        MASTER_HOST="redis-v2-0.redis-v2-headless.redis.svc.cluster.local"
-        echo "no master found via probes — bootstrap default: $MASTER_HOST"
-      fi
-
-      cat > /shared/sentinel.conf <`.
-      # This way pods come up already in the right role — no post-start race.
-      if [ "$MY_DNS" = "$MASTER_HOST" ]; then
-        : > /shared/replica.conf
-        echo "role: master"
-      else
-        echo "replicaof $MASTER_HOST 6379" > /shared/replica.conf
-        echo "role: replica of $MASTER_HOST"
-      fi
-
-      echo "=== bootstrap complete ==="
-      cat /shared/sentinel.conf
-      echo "--- replica.conf ---"
-      cat /shared/replica.conf
     EOT
   }
 }
@@ -411,10 +121,6 @@ resource "kubernetes_service" "redis_v2_headless" {
       name = "redis"
       port = 6379
     }
-    port {
-      name = "sentinel"
-      port = 26379
-    }
     port {
       name = "exporter"
       port = 9121
@@ -422,18 +128,46 @@ resource "kubernetes_service" "redis_v2_headless" {
   }
 }

+# Stable client-facing service for all ~15 Redis consumers.
+# Name/DNS (redis-master.redis.svc.cluster.local) unchanged across the HA
+# teardown; now selects the redis pod directly (HAProxy removed).
+resource "kubernetes_service" "redis_master" {
+  metadata {
+    name      = "redis-master"
+    namespace = kubernetes_namespace.redis.metadata[0].name
+    labels = {
+      app = "redis-v2"
+    }
+  }
+  spec {
+    selector = {
+      app = "redis-v2"
+    }
+    port {
+      name        = "redis"
+      port        = 6379
+      target_port = 6379
+    }
+  }
+}
+
+module "nfs_backup_host" {
+  source     = "../../../../modules/kubernetes/nfs_volume"
+  name       = "redis-backup-host"
+  namespace  = kubernetes_namespace.redis.metadata[0].name
+  nfs_server = "192.168.1.127"
+  nfs_path   = "/srv/nfs/redis-backup"
+}
+
 resource "kubernetes_stateful_set_v1" "redis_v2" {
   metadata {
     name      = "redis-v2"
     namespace = kubernetes_namespace.redis.metadata[0].name
     labels = {
       app = "redis-v2"
-      # 2026-05-26: Keel patch-bumped :8-alpine → :8.0.6-alpine, which
-      # rejected the `aof-load-corrupt-tail-max-size` config and crashed
-      # redis-v2-2. The bump is also semantically a downgrade (8-alpine is
-      # 8.6.2, 8.0.6 is older). Both LABEL + ANNOTATION are required for
-      # full opt-out: label drives Kyverno's selector exclude, annotation
-      # drives Keel's own gate.
+      # Keel opt-out: a :8-alpine -> :8.0.6-alpine patch bump (also a
+      # semantic downgrade) rejected `aof-load-corrupt-tail-max-size` and
+      # crashed redis. Both LABEL + ANNOTATION required for full opt-out.
       "keel.sh/policy" = "never"
     }
     annotations = {
@@ -441,8 +175,11 @@ resource "kubernetes_stateful_set_v1" "redis_v2" {
     }
   }
   spec {
-    service_name          = kubernetes_service.redis_v2_headless.metadata[0].name
-    replicas              = 3
+    service_name = kubernetes_service.redis_v2_headless.metadata[0].name
+    replicas     = 1
+    # pod_management_policy is immutable on a StatefulSet — kept as "Parallel"
+    # (unchanged from the 3-node era) so this revert does NOT force a
+    # destroy/recreate of the STS (which would detach the data PVC).
     pod_management_policy = "Parallel"

     selector {
@@ -460,53 +197,11 @@ resource "kubernetes_stateful_set_v1" "redis_v2" {
           "prometheus.io/scrape" = "true"
           "prometheus.io/port"   = "9121"
           "checksum/conf"        = sha256(kubernetes_config_map.redis_v2_conf.data["redis.conf"])
-          "checksum/bootstrap"   = sha256(kubernetes_config_map.redis_v2_sentinel_bootstrap.data["init.sh"])
         }
       }
       spec {
         termination_grace_period_seconds = 30

-        affinity {
-          pod_anti_affinity {
-            required_during_scheduling_ignored_during_execution {
-              label_selector {
-                match_expressions {
-                  key      = "app"
-                  operator = "In"
-                  values   = ["redis-v2"]
-                }
-              }
-              topology_key = "kubernetes.io/hostname"
-            }
-          }
-        }
-
-        init_container {
-          name    = "generate-sentinel-conf"
-          image   = "docker.io/library/redis:8-alpine"
-          command = ["/bin/sh", "/bootstrap/init.sh"]
-
-          resources {
-            requests = {
-              cpu    = "10m"
-              memory = "32Mi"
-            }
-            limits = {
-              memory = "32Mi"
-            }
-          }
-
-          volume_mount {
-            name       = "bootstrap"
-            mount_path = "/bootstrap"
-            read_only  = true
-          }
-          volume_mount {
-            name       = "shared"
-            mount_path = "/shared"
-          }
-        }
-
         container {
           name  = "redis"
           image = "docker.io/library/redis:8-alpine"
@@ -536,12 +231,6 @@ resource "kubernetes_stateful_set_v1" "redis_v2" {
             mount_path = "/etc/redis"
             read_only  = true
           }
-          volume_mount {
-            # redis.conf `include /shared/replica.conf` — written by init container.
-            name       = "shared"
-            mount_path = "/shared"
-            read_only  = true
-          }

           liveness_probe {
             exec {
@@ -563,51 +252,6 @@ resource "kubernetes_stateful_set_v1" "redis_v2" {
           }
         }

-        container {
-          name    = "sentinel"
-          image   = "docker.io/library/redis:8-alpine"
-          command = ["redis-sentinel", "/shared/sentinel.conf"]
-
-          port {
-            container_port = 26379
-            name           = "sentinel"
-          }
-
-          resources {
-            requests = {
-              cpu    = "20m"
-              memory = "64Mi"
-            }
-            limits = {
-              memory = "64Mi"
-            }
-          }
-
-          volume_mount {
-            name       = "shared"
-            mount_path = "/shared"
-          }
-
-          liveness_probe {
-            exec {
-              command = ["redis-cli", "-p", "26379", "PING"]
-            }
-            initial_delay_seconds = 20
-            period_seconds        = 10
-            timeout_seconds       = 10
-            failure_threshold     = 5
-          }
-          readiness_probe {
-            exec {
-              command = ["redis-cli", "-p", "26379", "PING"]
-            }
-            initial_delay_seconds = 10
-            period_seconds        = 5
-            timeout_seconds       = 3
-            failure_threshold     = 3
-          }
-        }
-
         container {
           name  = "exporter"
           image = "docker.io/oliver006/redis_exporter:v1.62.0"
@@ -649,17 +293,6 @@ resource "kubernetes_stateful_set_v1" "redis_v2" {
             name = kubernetes_config_map.redis_v2_conf.metadata[0].name
           }
         }
-        volume {
-          name = "bootstrap"
-          config_map {
-            name         = kubernetes_config_map.redis_v2_sentinel_bootstrap.metadata[0].name
-            default_mode = "0755"
-          }
-        }
-        volume {
-          name = "shared"
-          empty_dir {}
-        }
       }
     }

@@ -667,7 +300,10 @@ resource "kubernetes_stateful_set_v1" "redis_v2" {
       metadata {
         name = "data"
         annotations = {
-          "resize.topolvm.io/threshold"     = "10%"
+          # NOTE: VCT is immutable on a live StatefulSet — this must match the
+          # live value (drifted to 80% out-of-band) or apply fails with
+          # "updates to statefulset spec ... forbidden". Don't "fix" to 10%.
+          "resize.topolvm.io/threshold"     = "80%"
           "resize.topolvm.io/increase"      = "100%"
           "resize.topolvm.io/storage_limit" = "20Gi"
         }
@@ -690,37 +326,7 @@ resource "kubernetes_stateful_set_v1" "redis_v2" {
   }
 }

-resource "kubernetes_pod_disruption_budget_v1" "redis_v2" {
-  metadata {
-    name      = "redis-v2"
-    namespace = kubernetes_namespace.redis.metadata[0].name
-  }
-  spec {
-    min_available = 2
-    selector {
-      match_labels = {
-        app = "redis-v2"
-      }
-    }
-  }
-}
-
-resource "kubernetes_pod_disruption_budget_v1" "redis_haproxy" {
-  metadata {
-    name      = "redis-haproxy"
-    namespace = kubernetes_namespace.redis.metadata[0].name
-  }
-  spec {
-    min_available = 2
-    selector {
-      match_labels = {
-        app = "redis-haproxy"
-      }
-    }
-  }
-}
-
-# Hourly backup: copy RDB snapshot from master to NFS
+# Weekly backup: copy RDB snapshot to NFS
 resource "kubernetes_cron_job_v1" "redis-backup" {
   metadata {
     name = "redis-backup"