redis: revert 3-node Sentinel HA to single standalone instance [ci skip]
The redis-v2 Sentinel cluster split-brained: redis-v2-0 booted during a network partition, hit the init script's deterministic "pod-0 = bootstrap master" fallback, and became a SECOND master alongside the sentinel-elected redis-v2-2. HAProxy's `expect rstring role:master` matched both and round-robined client connections across the two diverging masters, so Immich enqueued BullMQ jobs on one while its workers blocked-popped on the other -> every queue wedged and new-upload thumbnails 404'd cluster-wide.

Third Sentinel-class incident in ~6 weeks (after the 2026-04-19 PM quorum drift and 2026-04-22 flap cascade).

Revert to a single standalone instance: replicas=1; drop Sentinel + HAProxy + init bootstrap configmap + both PDBs; redis container only (+ exporter). maxmemory-policy allkeys-lru -> volatile-lru so one shared instance serves both workload classes correctly: evict only TTL'd cache keys, never TTL-less Immich BullMQ / Celery job keys. redis-master service name/DNS unchanged -> no consumer edits; collapsed onto redis-v2-0's existing dataset (queued jobs preserved).

Applied via tg (Tier 1 / PG-authoritative state); this commit syncs source + docs only, hence [ci skip].

Monitoring: drop RedisReplicationLagHigh + RedisReplicasMissing (no replicas now; the latter would false-fire), RedisMemoryPressure 85%->80% volatile-lru backstop.

Docs: rewrite databases.md Redis section (single-instance design + incident history); add post-mortem 2026-05-30-redis-split-brain.md.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
This commit is contained in:
parent
5bcb4525a4
commit
e1ab23193d
4 changed files with 196 additions and 515 deletions
|
|
@ -121,29 +121,23 @@ graph TB
|
|||
|
||||
### Redis
|
||||
|
||||
Single shared cluster for all 17 consumers (Immich, Authentik, Nextcloud, Paperless, Dawarich Sidekiq, Traefik, etc.). HAProxy (3 replicas, PDB minAvailable=2) is the sole client-facing path — clients talk only to `redis-master.redis.svc.cluster.local:6379` and HAProxy health-checks backends via `INFO replication`, routing only to `role:master`.
|
||||
Single **standalone** instance shared by all consumers (Immich, Authentik, Nextcloud, Paperless, Dawarich Sidekiq, Celery apps, Traefik, etc.). Clients talk to `redis-master.redis.svc.cluster.local:6379`, which now selects the single redis pod directly. **No Sentinel, no HAProxy, no replicas** — reverted from 3-node HA on 2026-05-30 (see "Why standalone" below).
|
||||
|
||||
**Architecture**:
|
||||
|
||||
3 pods in StatefulSet `redis-v2`, each co-locating redis + sentinel + redis_exporter, using `docker.io/library/redis:8-alpine` (8.6.2). HAProxy (3 replicas, PDB minAvailable=2) routes clients to the current master via 1s `INFO replication` tcp-checks. Full context behind the April 2026 rework in beads `code-v2b`.
|
||||
1 pod in StatefulSet `redis-v2` (`replicas=1`, `podManagementPolicy=Parallel` retained for STS-field immutability), running `redis` + `redis_exporter` containers on `docker.io/library/redis:8-alpine` (8.6.2). Data on a `proxmox-lvm-encrypted` PVC (`data-redis-v2-0`, 5Gi→20Gi autoresize).
|
||||
|
||||
- 3 redis pods + 3 co-located sentinels (quorum=2). Odd sentinel count eliminates split-brain.
|
||||
- **Pod anti-affinity is `required` (hard)** — each redis pod must land on a distinct node. Soft anti-affinity previously let the scheduler co-locate 2/3 pods on the same node; when that node (`k8s-node3`) went `NotReady→Ready` at 11:42 UTC on 2026-04-22 it took 2 redis pods with it and the cluster lost quorum. Cluster-wide PV `nodeAffinity` matches one zone (`topology.kubernetes.io/region=pve, zone=pve`), so PVCs rebind freely on reschedule.
|
||||
- `podManagementPolicy=Parallel` + init container that regenerates `sentinel.conf` on every boot by probing peer sentinels for consensus master (priority: sentinel vote → peer role:master with slaves → deterministic pod-0 fallback). No persistent sentinel runtime state — can't drift out of sync with reality (root cause of 2026-04-19 PM incident).
|
||||
- redis.conf has `include /shared/replica.conf`; the init container writes either an empty file (master) or `replicaof <master> 6379` (replicas), so pods come up already in the right role — no bootstrap race.
|
||||
- **Sentinel hostname persistence**: `sentinel resolve-hostnames yes` + `sentinel announce-hostnames yes` in the init-generated sentinel.conf are mandatory — without them, sentinel stores resolved IPs in its rewritten config, and pod-IP churn on restart breaks failover. The MONITOR command itself must be issued with a hostname and the flags must be active before MONITOR, otherwise sentinel stores an IP that goes stale the next time the pod is deleted.
|
||||
- **Failover timing (tuned 2026-04-22)**: `sentinel down-after-milliseconds=15000` + `sentinel failover-timeout=60000`. Redis liveness probe `timeout_seconds=10, failure_threshold=5`; sentinel liveness probe same. LUKS-encrypted LVM + BGSAVE fork can briefly stall master I/O >5s, which under the old 5s/30s sentinel timings + 3s/3 probes induced spurious `+sdown`→`+odown`→`+switch-master` cycles every 1-2 minutes. The new values absorb normal BGSAVE pauses without triggering failover.
|
||||
- **HAProxy check smoothing (tuned 2026-04-22)**: `check inter 2s fall 3 rise 2` (was `1s / 2 / 2`) + `timeout check 5s` (was `3s`). The aggressive 1s polling used to race sentinel failovers — during a legitimate promote, HAProxy could catch the old master serving `role:slave` in the 1-3s window before re-probing the new master, leaving the backend empty and clients receiving `ReadOnlyError`.
|
||||
- **Headless service `publish_not_ready_addresses=false`** (flipped 2026-04-22). Previously `true` meant HAProxy's DNS resolver saw not-yet-ready pods during rollouts, compounding the check-race above. Sentinel peer discovery is unaffected because sentinels announce to each other explicitly via `sentinel announce-hostnames yes`.
|
||||
- Memory: master + replicas `requests=limits=768Mi`. Concurrent BGSAVE + AOF-rewrite fork can double RSS via COW, so headroom must cover it. `auto-aof-rewrite-percentage=200` + `auto-aof-rewrite-min-size=128mb` tune down rewrite frequency.
|
||||
- Persistence: RDB (`save 900 1 / 300 100 / 60 10000`) + AOF `appendfsync=everysec`. Disk-wear analysis on 2026-04-19 (sdb Samsung 850 EVO 1TB, 150 TBW): Redis contributes <1 GB/day cluster-wide → 40+ year runway at the 20% TBW budget.
|
||||
- `maxmemory=640mb` (83% of 768Mi limit), `maxmemory-policy=allkeys-lru`.
|
||||
- Weekly RDB backup to NFS (`/srv/nfs/redis-backup/`, Sunday 03:00, 28-day retention, pushes Pushgateway metrics).
|
||||
- Auth disabled this phase — NetworkPolicy is the isolation layer. Enabling `requirepass` + rolling creds to all 17 clients is a planned follow-up.
|
||||
- `maxmemory=640mb` (83% of the 768Mi pod limit), **`maxmemory-policy=volatile-lru`**. The instance is shared by two workload classes: CACHES (want LRU eviction of disposable keys) and QUEUES (Immich BullMQ `bull:*`, Celery `_kombu:*`; these must never be evicted or jobs vanish). `volatile-lru` evicts only keys carrying a TTL (caches set them) and never touches TTL-less keys (queue jobs), serving both correctly in one instance. Backstop: alert `RedisMemoryPressure` at 80%. If memory ever fills with TTL-less keys, nothing is eligible for eviction and writes fail with an OOM error, exactly as under `noeviction`.
|
||||
- Persistence: RDB (`save 900 1 / 300 100 / 60 10000`) + AOF `appendfsync=everysec`. `aof-load-corrupt-tail-max-size=1024` tolerates ≤1KB of AOF tail garbage from an unclean reboot instead of crashlooping. Disk-wear (sdb Samsung 850 EVO, 150 TBW): Redis contributes <1 GB/day cluster-wide → 40+ year runway.
|
||||
- Memory `requests=limits=768Mi`. BGSAVE + AOF-rewrite fork can double RSS via COW; `auto-aof-rewrite-percentage=200` + `auto-aof-rewrite-min-size=128mb` tune down rewrite frequency.
|
||||
- Service `redis-master` (name/DNS unchanged across the HA teardown so no consumer needed editing). Keel opt-out (`keel.sh/policy=never`, label + annotation) — a prior patch-bump to `:8.0.6-alpine` rejected the AOF config and crashed it.
|
||||
- Weekly RDB backup to NFS (`/srv/nfs/redis-backup/`, Sunday 03:00, 28-day retention, Pushgateway metrics).
|
||||
- Auth disabled — NetworkPolicy is the isolation layer. `requirepass` + creds rollout to all clients remains a planned follow-up.
|
||||
- **Downtime model**: a single instance means a pod restart (image bump, node drain, OOM) is a few-seconds cluster-wide Redis blip. Explicitly accepted (Viktor, 2026-05-30) as the price of eliminating the HA failure modes below. There is no PDB (a single-replica PDB would only block node drains).
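
The difference between `allkeys-lru` and `volatile-lru` on a shared cache+queue instance can be illustrated with a small simulation. This is a minimal sketch, not Redis's implementation: real Redis approximates LRU by sampling, and the `Key` class, `pick_victim` function, and key names below are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Key:
    name: str
    has_ttl: bool      # True for cache entries that were written with an expiry
    last_access: int   # logical clock; smaller means less recently used


def pick_victim(keys: List[Key], policy: str) -> Optional[str]:
    """Return the name of the key that would be evicted next, or None.

    None means nothing is eligible, which is the case where Redis
    rejects writes with an OOM error instead of evicting.
    """
    if policy == "allkeys-lru":
        candidates = list(keys)                      # every key is fair game
    elif policy == "volatile-lru":
        candidates = [k for k in keys if k.has_ttl]  # only keys with a TTL
    elif policy == "noeviction":
        candidates = []
    else:
        raise ValueError(f"unknown policy: {policy}")
    if not candidates:
        return None
    # Exact LRU for clarity; Redis itself samples a few keys instead.
    return min(candidates, key=lambda k: k.last_access).name


# Hypothetical keyspace: an old queue job without a TTL and a fresh cache entry.
keys = [
    Key("bull:thumbnail:1", has_ttl=False, last_access=1),
    Key("cache:session:abc", has_ttl=True, last_access=5),
]
print(pick_victim(keys, "allkeys-lru"))   # the idle queue job is the victim
print(pick_victim(keys, "volatile-lru"))  # only the cache entry is eligible
```

Under `allkeys-lru` the idle queue job is the least recently used key and is evicted first, which is the silent job loss the policy change is meant to prevent. Under `volatile-lru` a keyspace with no TTL'd keys yields no victim at all, which is why the `RedisMemoryPressure` alert serves as the backstop.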
|
||||
|
||||
**Observability** (redis-v2 only): `oliver006/redis_exporter:v1.62.0` sidecar per pod on port 9121, auto-scraped via Prometheus pod annotation. Alerts: `RedisDown`, `RedisMemoryPressure`, `RedisEvictions`, `RedisReplicationLagHigh`, `RedisForkLatencyHigh`, `RedisAOFRewriteLong`, `RedisReplicasMissing`, `RedisBackupStale`, `RedisBackupNeverSucceeded`.
|
||||
**Observability**: `oliver006/redis_exporter:v1.62.0` sidecar on port 9121, auto-scraped. Alerts: `RedisDown`, `RedisMemoryPressure` (>80%), `RedisEvictions`, `RedisForkLatencyHigh`, `RedisAOFRewriteLong`, `RedisBackupStale`, `RedisBackupNeverSucceeded`. (`RedisReplicationLagHigh` + `RedisReplicasMissing` removed with the replicas.)
|
||||
|
||||
**Why this design** — four incidents in April 2026 drove the rework: (a) 2026-04-04 service selector routed reads+writes to master+replica causing `READONLY` errors; (b) 2026-04-19 AM master OOMKilled during BGSAVE+PSYNC with the 256Mi limit too tight for a 204 MB working set under COW amplification; (c) 2026-04-19 PM sentinel runtime state drifted (only 2 sentinels, no majority) and routed writes to a slave; (d) 2026-04-22 five-factor flap cascade — soft anti-affinity let 2/3 pods co-locate on `k8s-node3`, node bounced NotReady→Ready and took quorum with it; aggressive sentinel/probe timing (5s/30s + 3s/3) amplified disk-I/O stalls under LUKS-encrypted LVM into spurious `+switch-master` loops; HAProxy's 1s polling raced sentinel failovers and routed writes to demoted masters; `publish_not_ready_addresses=true` fed not-yet-ready pods into HAProxy DNS; downstream `realestate-crawler-celery` CrashLoopBackOff closed the feedback loop. See beads epic `code-v2b` for the full plan and linked challenger analyses.
|
||||
**Why standalone** — HA Redis caused more outages than it prevented in this homelab. Five incidents: (a) 2026-04-04 service selector routed writes to a replica → `READONLY`; (b) 2026-04-19 AM master OOMKilled during BGSAVE+PSYNC (256Mi too tight); (c) 2026-04-19 PM sentinel quorum drift (2 sentinels, no majority) routed writes to a slave; (d) 2026-04-22 five-factor flap cascade (soft anti-affinity co-located pods + aggressive sentinel/probe timing + HAProxy polling race); (e) **2026-05-30 split-brain** — `redis-v2-0` booted during a network partition, hit the init script's deterministic "pod-0 is bootstrap master" fallback, and became a SECOND master alongside the sentinel-elected `redis-v2-2`; HAProxy's `expect rstring role:master` matched both and round-robined client connections across them, so Immich enqueued BullMQ jobs on one master while its workers blocked-popped on the other → every queue wedged, new-upload thumbnails 404'd cluster-wide. The 3-sentinel design (beads `code-v2b`) was built specifically to prevent split-brain after incident (c), yet the bootstrap fallback manufactured one anyway. Conclusion: for a homelab cache/broker, a single instance with a few-seconds restart blip is strictly simpler and more reliable than chasing Sentinel correctness. Mirrors the MySQL InnoDB-Cluster → standalone reversion (2026-04-16). Post-mortem: `docs/post-mortems/2026-05-30-redis-split-brain.md`.
|
||||
|
||||
### SQLite (Per-App)
|
||||
|
||||
|
|
|
|||
93
docs/post-mortems/2026-05-30-redis-split-brain.md
Normal file
|
|
@ -0,0 +1,93 @@
|
|||
# Post-mortem: Redis split-brain wedged BullMQ/Celery queues (2026-05-30)

**Severity:** SEV2 (degraded: no data loss in Redis; queue processing stalled
cluster-wide). **Status:** Resolved.

## Summary

The 3-node Sentinel HA Redis (`redis-v2`) split-brained: two pods both held
`role:master`. HAProxy, which routes to any backend reporting `role:master`,
round-robined client connections across **both** masters. Immich enqueued
BullMQ jobs on one master while its workers blocked-popped on the other, so
every queue stalled. User-visible symptom: **newly uploaded Immich photos
returned HTTP 404 for their thumbnails** (the generation job never ran). Celery
apps (real-estate-crawler, trading-bot, paperless) and other queue users were
affected the same way.
|
||||
|
||||
## Impact

- Immich: thumbnail/preview/face/ML jobs not processing. `facialRecognition`
  backlog reached ~30k waiting; new uploads showed broken images in the web UI.
- All ~15 shared-Redis consumers had inconsistent reads/writes (connections
  split across two diverging masters).
- No Redis data lost: the larger dataset (`redis-v2-0`, ~30k keys) was
  preserved through the fix.
|
||||
|
||||
## Timeline (UTC+1 local)

- **~2026-05-26/27**: `redis-v2` pods recreated around the time of the node2
  unclean reboot. `redis-v2-0` came up partitioned; its Sentinel saw 0 peers
  and the pod declared itself master via the init script's deterministic
  "pod-0 = bootstrap master" fallback. Sentinels on `-1`/`-2` independently
  elected `redis-v2-2`. The split-brain formed and persisted (~3-4 days): the
  network healed but the topology never reconciled.
- **2026-05-30 ~16:58**: investigating "Immich images with no thumbnails."
  Found thumbnail jobs failing on missing/zeroed originals (separate pre-existing
  data-loss issue) AND a stuck job queue.
- **2026-05-30 ~17:00**: user manually restarted immich-server; namespace
  `tier-quota` (24Gi) briefly blocked the replacement pod → ~1 min Immich
  outage. Recovered. (Red herring, not the root cause.)
- **2026-05-30 ~17:1x**: identified two `role:master` redis pods
  (`redis-v2-0` dbsize 30320, isolated, 0 connected slaves; `redis-v2-2` dbsize
  442, quorum master). HAProxy fan-out across both = wedged queues. Ruled out
  IPv6 (cluster is single-stack IPv4) and eviction (`evicted_keys=0`).
- **2026-05-30 ~17:30**: reverted `redis-v2` to a single standalone instance.
  Queues drained immediately; newest Immich assets served HTTP 200.
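
The detection step at ~17:1x amounts to collecting `INFO replication` from every pod and counting how many report `role:master`. The following is a minimal sketch of that check, assuming the raw `INFO replication` text has already been collected per pod. The function names and the sample payloads are illustrative, not captured output from the incident.

```python
from typing import Dict, List


def parse_info(text: str) -> Dict[str, str]:
    """Parse Redis INFO output ("key:value" lines, "#" section headers)."""
    fields: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition(":")
        if sep:
            fields[key] = value
    return fields


def find_masters(info_by_pod: Dict[str, str]) -> List[str]:
    """Return the sorted names of all pods whose INFO reports role:master."""
    return sorted(
        pod for pod, text in info_by_pod.items()
        if parse_info(text).get("role") == "master"
    )


# Illustrative payloads shaped like the state described in the timeline.
sample = {
    "redis-v2-0": "# Replication\nrole:master\nconnected_slaves:0\n",
    "redis-v2-1": "# Replication\nrole:slave\n",
    "redis-v2-2": "# Replication\nrole:master\n",
}

masters = find_masters(sample)
if len(masters) > 1:
    print(f"split-brain suspected: {masters}")
```

Any result with more than one master is a split-brain signal. A check like this could have run as a periodic probe, since the condition persisted for days before it surfaced as a user-visible symptom.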
|
||||
|
||||
## Root cause

`redis-v2`'s init container (`generate-sentinel-conf`) falls through to
"Priority 3: pod-0 is always the bootstrap master" when it cannot reach peer
Sentinels/Redis. During a network partition, `redis-v2-0` hit that fallback and
became a second master. HAProxy's health check (`tcp-check expect rstring
role:master`) matches **any** master, so with two masters it placed both in
rotation and round-robined writes/reads across diverging datasets. BullMQ's
enqueue (LPUSH) and worker consume (BRPOPLPUSH) landed on different instances →
jobs never consumed.

This is the **third** Sentinel-class incident (after 2026-04-19 PM quorum drift
and 2026-04-22 flap cascade). The 3-sentinel design was built to *prevent*
split-brain, but the bootstrap fallback re-introduced it.
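
The failure mechanism can be reproduced in a toy model: a health check that admits every backend reporting `role:master`, plus round-robin connection assignment, puts the producer and the worker on different datasets as soon as two masters exist. This is a simplified sketch under stated assumptions. `FakeRedis`, `healthy_backends`, and `simulate` are hypothetical names, each client holds exactly one connection, and a non-blocking pop stands in for the blocking pop.

```python
from collections import deque
from itertools import cycle
from typing import List


class FakeRedis:
    """In-memory stand-in for one Redis pod holding a single job list."""

    def __init__(self, name: str, role: str) -> None:
        self.name = name
        self.role = role
        self.jobs: deque = deque()


def healthy_backends(backends: List[FakeRedis]) -> List[FakeRedis]:
    # Same acceptance rule as a "role:master" string match: it admits
    # every backend claiming to be master, with no uniqueness check.
    return [b for b in backends if b.role == "master"]


def simulate(backends: List[FakeRedis], n_jobs: int) -> int:
    """Return how many of n_jobs the worker manages to consume."""
    rr = cycle(healthy_backends(backends))
    producer_conn = next(rr)   # first connection goes to the first master
    worker_conn = next(rr)     # second connection goes to the next one
    for i in range(n_jobs):
        producer_conn.jobs.appendleft(f"job-{i}")  # enqueue (LPUSH analogue)
    consumed = 0
    while worker_conn.jobs:
        worker_conn.jobs.pop()                     # consume from the tail
        consumed += 1
    return consumed


split = [FakeRedis("redis-v2-0", "master"),
         FakeRedis("redis-v2-1", "slave"),
         FakeRedis("redis-v2-2", "master")]
single = [FakeRedis("redis-v2-0", "master")]

print(simulate(split, 5))   # worker sits on the other master
print(simulate(single, 5))  # both connections share one dataset
```

With two masters the worker drains an empty list while the jobs accumulate on the other pod. With a single backend both connections land on the same dataset and every job is consumed, which matches the queues draining immediately after the revert.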
|
||||
|
||||
## Resolution

Reverted `redis-v2` to a **single standalone instance** (`replicas=1`, Sentinel
+ HAProxy removed), collapsing onto `redis-v2-0`'s dataset (preserved Immich's
queued jobs). Eviction policy changed `allkeys-lru` → **`volatile-lru`** so the
shared cache+queue workload is served correctly by one instance (evict only
TTL'd cache keys; never TTL-less queue keys). `redis-master` service name/DNS
unchanged → no consumer edits. Decision rationale: a homelab cache/broker does
not need HA; a few-seconds restart blip beats chasing Sentinel correctness.
Mirrors the 2026-04-16 MySQL InnoDB-Cluster → standalone reversion.
|
||||
|
||||
## Follow-ups

- [ ] Re-upload the ~99 Immich images + 12 timeline videos whose **originals**
  are missing/zero-filled on disk (pre-existing data loss, unrelated to the
  split-brain; re-running jobs can't regenerate them). Owner: Viktor.
- [ ] `requirepass` auth on Redis + creds rollout to all consumers (carried over
  from the 2026-04-19 rework; still open).
- [ ] Consider whether any queue user (Immich/Celery) warrants its own dedicated
  Redis if the shared instance's memory ever becomes contended (currently
  ~30MB / 640MB, not a concern).
|
||||
|
||||
## Lessons

- HA that re-introduces its own failure class is worse than no HA. For a
  single-node-tolerant homelab, prefer a standalone instance + a small accepted
  downtime window.
- `allkeys-lru` on a shared cache+queue Redis silently drops queue jobs under
  pressure; `volatile-lru` is the correct single-instance policy (Immich even
  logs `IMPORTANT! Eviction policy ... should be "noeviction"`).
- A "bootstrap master" fallback that fires under partition is a split-brain
  generator: avoid deterministic self-promotion when peers are unreachable.
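
The last lesson can be made concrete with a role-selection rule that distinguishes "peers answered and none is master" from "peers did not answer". The sketch below is a hypothetical variant, not the removed `generate-sentinel-conf` script: `decide_role`, the `Decision` enum, and the convention that an unreachable peer is recorded as `None` are all assumptions made for illustration.

```python
from enum import Enum
from typing import Dict, Optional, Tuple


class Decision(Enum):
    MASTER = "master"
    REPLICA = "replica"
    WAIT = "wait"      # retry later instead of guessing


def decide_role(
    pod_name: str,
    sentinel_vote: Optional[str],
    peer_roles: Dict[str, Optional[str]],
) -> Tuple[Decision, Optional[str]]:
    """Pick a boot-time role; returns (decision, master name if known).

    peer_roles maps each peer to "master", "slave", or None (unreachable).
    """
    # Priority 1: trust a master already agreed on by Sentinels.
    if sentinel_vote is not None:
        if sentinel_vote == pod_name:
            return Decision.MASTER, pod_name
        return Decision.REPLICA, sentinel_vote
    # Priority 2: follow any peer that already reports itself as master.
    masters = sorted(p for p, role in peer_roles.items() if role == "master")
    if masters:
        return Decision.REPLICA, masters[0]
    # Guard: an unreachable peer may be a master on the far side of a
    # partition, so self-promotion is refused until every peer has answered.
    if any(role is None for role in peer_roles.values()):
        return Decision.WAIT, None
    # Priority 3: all peers answered and none is master; only pod-0 bootstraps.
    if pod_name.endswith("-0"):
        return Decision.MASTER, pod_name
    return Decision.WAIT, None


partitioned = {"redis-v2-1": None, "redis-v2-2": None}
print(decide_role("redis-v2-0", None, partitioned))
```

The trade-off is availability at first boot: with `podManagementPolicy=Parallel` the peers may not be reachable yet, so pod-0 would wait and retry rather than come up immediately. That is the cost of never promoting on missing information.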
|
||||