The redis-v2 Sentinel cluster split-brained: redis-v2-0 booted during a network partition, hit the init script's deterministic "pod-0 = bootstrap master" fallback, and became a SECOND master alongside the sentinel-elected redis-v2-2. HAProxy's `expect rstring role:master` matched both and round-robined client connections across the two diverging masters, so Immich enqueued BullMQ jobs on one while its workers blocked-popped on the other -> every queue wedged and new-upload thumbnails 404'd cluster-wide. Third Sentinel-class incident in ~6 weeks (after the 2026-04-19 PM quorum drift and 2026-04-22 flap cascade). Revert to a single standalone instance: replicas=1; drop Sentinel + HAProxy + init bootstrap configmap + both PDBs; redis container only (+ exporter). maxmemory-policy allkeys-lru -> volatile-lru so one shared instance serves both workload classes correctly: evict only TTL'd cache keys, never TTL-less Immich BullMQ / Celery job keys. redis-master service name/DNS unchanged -> no consumer edits; collapsed onto redis-v2-0's existing dataset (queued jobs preserved). Applied via tg (Tier 1 / PG-authoritative state); this commit syncs source + docs only, hence [ci skip]. Monitoring: drop RedisReplicationLagHigh + RedisReplicasMissing (no replicas now; the latter would false-fire), RedisMemoryPressure 85%->80% volatile-lru backstop. Docs: rewrite databases.md Redis section (single-instance design + incident history); add post-mortem 2026-05-30-redis-split-brain.md. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
93 lines
4.9 KiB
Markdown
# Post-mortem: Redis split-brain wedged BullMQ/Celery queues (2026-05-30)

**Severity:** SEV2 (degraded; no data loss in Redis, but queue processing stalled
cluster-wide). **Status:** Resolved.

## Summary

The 3-node Sentinel HA Redis (`redis-v2`) split-brained: two pods both held
`role:master`. HAProxy, which routes to any backend reporting `role:master`,
round-robined client connections across **both** masters. Immich enqueued
BullMQ jobs on one master while its workers issued blocking pops against the
other, so every queue stalled. User-visible symptom: **newly uploaded Immich
photos returned HTTP 404 for their thumbnails** (the generation job never ran).
Celery apps (real-estate-crawler, trading-bot, paperless) and other queue users
were affected the same way.

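The failure mode described above can be reproduced in a few lines. This is a minimal sketch, not the real stack: `FakeRedis` and `simulate` are hypothetical names, the list operations only mimic a push on one end and a pop on the other, and the proxy is reduced to "each new connection goes to the next backend in rotation".

```python
from collections import deque
from itertools import cycle
from typing import Optional


class FakeRedis:
    """In-memory stand-in for one Redis master (hypothetical, illustration only)."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.wait_list: deque = deque()

    def lpush(self, job: str) -> None:
        # Push onto the left end of the list, like LPUSH.
        self.wait_list.appendleft(job)

    def rpop(self) -> Optional[str]:
        # Pop from the right end, or None when the list is empty.
        return self.wait_list.pop() if self.wait_list else None


def simulate(masters: list, jobs: list) -> list:
    """Round-robin two new connections over `masters`; return what the worker consumed."""
    rotation = cycle(masters)
    producer_conn = next(rotation)  # first connection: the enqueuing side
    worker_conn = next(rotation)    # second connection: the consuming side
    for job in jobs:
        producer_conn.lpush(job)
    consumed = []
    while (job := worker_conn.rpop()) is not None:
        consumed.append(job)
    return consumed
```

With a single backend in rotation, `simulate` returns every job in FIFO order. With two backends, the producer and the worker land on different instances, the worker sees an empty list, and the jobs stay queued on the other master.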
## Impact

- Immich: thumbnail/preview/face/ML jobs not processing. `facialRecognition`
  backlog reached ~30k waiting; new uploads showed broken images in the web UI.
- All ~15 shared-Redis consumers had inconsistent reads/writes (connections
  split across two diverging masters).
- No Redis data lost: the larger dataset (`redis-v2-0`, ~30k keys) was
  preserved through the fix.

## Timeline (UTC+1 local)

- **~2026-05-26/27**: `redis-v2` pods recreated (node2 unclean reboot era).
  `redis-v2-0` came up partitioned; its Sentinel saw 0 peers and it declared
  itself master via the init script's deterministic "pod-0 = bootstrap master"
  fallback. Sentinels on `-1`/`-2` independently elected `redis-v2-2`.
  The split-brain formed and then persisted (~3-4 days): the network healed but
  the topology never reconciled.
- **2026-05-30 ~16:58**: started investigating "Immich images with no
  thumbnails." Found thumbnail jobs failing on missing/zeroed originals (a
  separate, pre-existing data-loss issue) AND a stuck job queue.
- **2026-05-30 ~17:00**: user manually restarted immich-server; namespace
  `tier-quota` (24Gi) briefly blocked the replacement pod → ~1 min Immich
  outage. Recovered. (Red herring, not the root cause.)
- **2026-05-30 ~17:1x**: identified two `role:master` redis pods
  (`redis-v2-0` dbsize 30320, isolated, 0 connected slaves; `redis-v2-2` dbsize
  442, quorum master). HAProxy fan-out across both = wedged queues. Ruled out
  IPv6 (cluster is single-stack IPv4) and eviction (`evicted_keys=0`).
- **2026-05-30 ~17:30**: reverted `redis-v2` to a single standalone instance.
  Queues drained immediately; newest Immich assets served HTTP 200.

## Root cause

`redis-v2`'s init container (`generate-sentinel-conf`) falls through to
"Priority 3: pod-0 is always the bootstrap master" when it cannot reach peer
Sentinels/Redis. During a network partition, `redis-v2-0` hit that fallback and
became a second master. HAProxy's health check
(`tcp-check expect rstring role:master`) matches **any** master, so with two
masters it placed both in rotation and round-robined writes/reads across
diverging datasets. BullMQ's enqueue (LPUSH) and worker consume (BRPOPLPUSH)
landed on different instances → jobs were never consumed.

This is the **third** Sentinel-class incident (after the 2026-04-19 PM quorum
drift and the 2026-04-22 flap cascade). The 3-sentinel design was built to
*prevent* split-brain, but the bootstrap fallback re-introduced it.

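The health check above asks each backend "are you a master?" but never asks "is there exactly one master?". A minimal sketch of that second question follows. It assumes the text of `INFO replication` has already been collected from each pod by some other means; `parse_role`, `find_masters`, and `assert_single_master` are hypothetical helper names, and the sample data is illustrative (the role of `redis-v2-1` and the replica count of `redis-v2-2` are assumptions, not values recorded in this incident).

```python
def parse_role(info_replication: str) -> str:
    """Extract the `role` field from the text returned by `INFO replication`."""
    for line in info_replication.splitlines():
        key, sep, value = line.strip().partition(":")
        if sep and key == "role":
            return value
    raise ValueError("no 'role' field in INFO replication output")


def find_masters(backends: dict) -> list:
    """Return the sorted names of all backends that report role:master."""
    return sorted(name for name, info in backends.items()
                  if parse_role(info) == "master")


def assert_single_master(backends: dict) -> str:
    """Return the sole master's name, or raise if there are zero or several."""
    masters = find_masters(backends)
    if len(masters) != 1:
        raise RuntimeError(f"expected exactly 1 master, found {masters}")
    return masters[0]


# Illustrative input only; see the assumptions stated above.
SAMPLE = {
    "redis-v2-0": "# Replication\r\nrole:master\r\nconnected_slaves:0\r\n",
    "redis-v2-1": "# Replication\r\nrole:slave\r\n",
    "redis-v2-2": "# Replication\r\nrole:master\r\nconnected_slaves:1\r\n",
}
```

Calling `assert_single_master(SAMPLE)` raises, because both `redis-v2-0` and `redis-v2-2` report `role:master`. A per-backend check of the kind HAProxy ran would accept both.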
## Resolution

Reverted `redis-v2` to a **single standalone instance** (`replicas=1`, Sentinel
+ HAProxy removed), collapsing onto `redis-v2-0`'s dataset, which preserved
Immich's queued jobs. Eviction policy changed `allkeys-lru` → **`volatile-lru`**
so the shared cache+queue workload is served correctly by one instance (evict
only TTL'd cache keys; never TTL-less queue keys). `redis-master` service
name/DNS unchanged → no consumer edits. Decision rationale: a homelab
cache/broker does not need HA; a few-seconds restart blip beats chasing Sentinel
correctness. Mirrors the 2026-04-16 MySQL InnoDB-Cluster → standalone reversion.

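The difference between the two eviction policies comes down to which keys are eligible for eviction. The sketch below models only that eligibility rule. It is a simplification: it picks the exact least-recently-used key, whereas Redis approximates LRU by sampling, and the `Key` type, the `pick_victim` function, and the key names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Key:
    name: str
    ttl_seconds: Optional[int]  # None means the key has no expiry set
    idle_seconds: int           # time since last access; larger = less recently used


def pick_victim(keys: list, policy: str) -> Optional[Key]:
    """Return the key this simplified model would evict, or None if none is eligible."""
    if policy == "allkeys-lru":
        candidates = list(keys)  # every key is eligible
    elif policy == "volatile-lru":
        candidates = [k for k in keys if k.ttl_seconds is not None]  # TTL'd keys only
    else:
        raise ValueError(f"policy not modelled in this sketch: {policy}")
    return max(candidates, key=lambda k: k.idle_seconds, default=None)


# Hypothetical keys: one TTL-less queue key that has sat idle, one TTL'd cache key.
KEYS = [
    Key("bull:example-queue:wait", ttl_seconds=None, idle_seconds=900),
    Key("cache:example-page", ttl_seconds=300, idle_seconds=60),
]
```

Under `allkeys-lru` the idle queue key is the victim; under `volatile-lru` only the cache key is eligible. If no key carries a TTL, `volatile-lru` has nothing to evict, and Redis then rejects writes that need more memory, the same as `noeviction`.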
## Follow-ups

- [ ] Re-upload the ~99 Immich images + 12 timeline videos whose **originals**
  are missing/zero-filled on disk (pre-existing data loss, unrelated to the
  split-brain; re-running jobs can't regenerate them). Owner: Viktor.
- [ ] `requirepass` auth on Redis + creds rollout to all consumers (carried over
  from the 2026-04-19 rework; still open).
- [ ] Consider whether any queue user (Immich/Celery) warrants its own dedicated
  Redis if the shared instance's memory ever becomes contended (currently
  ~30MB / 640MB, so not a concern).

## Lessons

- HA that re-introduces its own failure class is worse than no HA. For a
  single-node-tolerant homelab, prefer a standalone instance + a small accepted
  downtime window.
- `allkeys-lru` on a shared cache+queue Redis can silently evict queue keys
  under memory pressure. `volatile-lru` protects them as long as they carry no
  TTL, which makes it the appropriate policy for this single shared instance
  (Immich even logs `IMPORTANT! Eviction policy ... should be "noeviction"`).
- A "bootstrap master" fallback that fires under partition is a split-brain
  generator: avoid deterministic self-promotion when peers are unreachable.
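The last lesson can be expressed as a startup decision that fails closed. This is a sketch of one possible rule, not the logic of the original `generate-sentinel-conf` script: `Action`, `decide_startup`, and the `first_time_bootstrap` opt-in flag are all hypothetical, and the flag stands in for an explicit operator action taken only when a cluster is created for the first time.

```python
from enum import Enum
from typing import Optional


class Action(Enum):
    FOLLOW = "follow"        # replicate from the master that was discovered
    BOOTSTRAP = "bootstrap"  # start as master (explicit opt-in only)
    WAIT = "wait"            # retry discovery later; never self-promote


def decide_startup(pod_ordinal: int,
                   discovered_master: Optional[str],
                   first_time_bootstrap: bool) -> Action:
    """Decide how a pod should start, given what discovery found."""
    if discovered_master is not None:
        return Action.FOLLOW
    # No master visible. This may be a fresh cluster or a network partition,
    # and the pod cannot tell the two apart on its own.
    if first_time_bootstrap and pod_ordinal == 0:
        return Action.BOOTSTRAP
    return Action.WAIT
```

With the flag left off, a partitioned pod-0 that cannot see any master returns `Action.WAIT` rather than promoting itself, which is the behaviour the lesson asks for. The cost is that first-time cluster creation needs one deliberate step.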