infra/docs/post-mortems/2026-05-30-redis-split-brain.md
Viktor Barzin e1ab23193d redis: revert 3-node Sentinel HA to single standalone instance [ci skip]
The redis-v2 Sentinel cluster split-brained: redis-v2-0 booted during a network
partition, hit the init script's deterministic "pod-0 = bootstrap master"
fallback, and became a SECOND master alongside the sentinel-elected redis-v2-2.
HAProxy's `expect rstring role:master` matched both and round-robined client
connections across the two diverging masters, so Immich enqueued BullMQ jobs on
one while its workers blocked-popped on the other -> every queue wedged and
new-upload thumbnails 404'd cluster-wide. Third Sentinel-class incident in ~6
weeks (after the 2026-04-19 PM quorum drift and 2026-04-22 flap cascade).

Revert to a single standalone instance: replicas=1; drop Sentinel + HAProxy +
init bootstrap configmap + both PDBs; redis container only (+ exporter).
maxmemory-policy allkeys-lru -> volatile-lru so one shared instance serves both
workload classes correctly: evict only TTL'd cache keys, never TTL-less Immich
BullMQ / Celery job keys. redis-master service name/DNS unchanged -> no consumer
edits; collapsed onto redis-v2-0's existing dataset (queued jobs preserved).
Applied via tg (Tier 1 / PG-authoritative state); this commit syncs source +
docs only, hence [ci skip].

Monitoring: drop RedisReplicationLagHigh + RedisReplicasMissing (no replicas
now; the latter would false-fire), RedisMemoryPressure 85%->80% volatile-lru backstop.

Docs: rewrite databases.md Redis section (single-instance design + incident
history); add post-mortem 2026-05-30-redis-split-brain.md.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-30 17:49:43 +00:00

Post-mortem: Redis split-brain wedged BullMQ/Celery queues (2026-05-30)

Severity: SEV2 (degraded — no data loss in Redis; queue processing stalled cluster-wide). Status: Resolved.

Summary

The 3-node Sentinel HA Redis (redis-v2) split-brained: two pods both held role:master. HAProxy — which routes to any backend reporting role:master — round-robined client connections across both masters. Immich enqueued BullMQ jobs on one master while its workers blocked-popped on the other, so every queue stalled. User-visible symptom: newly uploaded Immich photos returned HTTP 404 for their thumbnails (the generation job never ran). Celery apps (real-estate-crawler, trading-bot, paperless) and other queue users were affected the same way.
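The failure mode described above can be reproduced with a toy model. The sketch below is illustrative only: `FakeMaster` and `simulate` are hypothetical names, each master is reduced to a single in-memory list, and it assumes the producer and the worker each open one connection and that the load balancer hands consecutive connections to alternating backends.

```python
from collections import deque
from itertools import cycle


class FakeMaster:
    """In-memory stand-in for one Redis master holding a single job list."""

    def __init__(self, name):
        self.name = name
        self.jobs = deque()


def simulate(num_jobs):
    """Return (jobs consumed by the worker, jobs left stranded on the other master)."""
    masters = [FakeMaster("redis-v2-0"), FakeMaster("redis-v2-2")]
    backends = cycle(masters)      # round-robin: one backend per new connection

    producer = next(backends)      # producer's connection lands on redis-v2-0
    worker = next(backends)        # worker's connection lands on redis-v2-2

    for i in range(num_jobs):
        producer.jobs.appendleft(f"job-{i}")   # enqueue, like LPUSH

    consumed = 0
    while worker.jobs:             # the worker only ever sees its own master
        worker.jobs.pop()
        consumed += 1

    return consumed, len(producer.jobs)


print(simulate(5))
```

With two independent masters the worker's list stays empty no matter how many jobs are enqueued, which matches the wedged queues seen during the incident.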

Impact

  • Immich: thumbnail/preview/face/ML jobs not processing. facialRecognition backlog reached ~30k waiting; new uploads showed broken images in the web UI.
  • All ~15 shared-Redis consumers had inconsistent reads/writes (connections split across two diverging masters).
  • No Redis data lost — the larger dataset (redis-v2-0, ~30k keys) was preserved through the fix.

Timeline (UTC+1 local)

  • ~2026-05-26/27: redis-v2 pods were recreated (around the time of node2's unclean reboot). redis-v2-0 came up partitioned: its Sentinel saw 0 peers, so the pod declared itself master via the init script's deterministic "pod-0 = bootstrap master" fallback. The Sentinels on redis-v2-1 and redis-v2-2 independently elected redis-v2-2. The split-brain then persisted for ~3-4 days: the network healed, but the topology never reconciled.
  • 2026-05-30 ~16:58: investigating "Immich images with no thumbnails." Found thumbnail jobs failing on missing/zeroed originals (separate pre-existing data-loss issue) AND a stuck job queue.
  • 2026-05-30 ~17:00: user manually restarted immich-server; namespace tier-quota (24Gi) briefly blocked the replacement pod → ~1 min Immich outage. Recovered. (Red herring — not the root cause.)
  • 2026-05-30 ~17:1x: identified two role:master redis pods (redis-v2-0 dbsize 30320, isolated, 0 connected slaves; redis-v2-2 dbsize 442, quorum master). HAProxy fan-out across both = wedged queues. Ruled out IPv6 (cluster is single-stack IPv4) and eviction (evicted_keys=0).
  • 2026-05-30 ~17:30: reverted redis-v2 to a single standalone instance. Queues drained immediately; newest Immich assets served HTTP 200.

Root cause

redis-v2's init container (generate-sentinel-conf) falls through to "Priority 3: pod-0 is always the bootstrap master" when it cannot reach peer Sentinels/Redis. During a network partition, redis-v2-0 hit that fallback and became a second master. HAProxy's health check (tcp-check expect rstring role:master) matches any master, so with two masters it placed both in rotation and round-robined writes/reads across diverging datasets. BullMQ's enqueue (LPUSH) and worker consume (BRPOPLPUSH) landed on different instances → jobs never consumed.
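Because the health check accepts any backend that reports role:master, a check that counts masters would have flagged this state. The following is a minimal sketch of such a check, not something that existed in the cluster. It parses the text returned by `INFO replication` for each pod. The function names are hypothetical, and the sample strings are illustrative rather than captured output (the `connected_slaves` value for redis-v2-2 and the role of redis-v2-1 are assumptions).

```python
def parse_role(info_text):
    """Extract the value of the `role:` field from INFO replication output."""
    for line in info_text.splitlines():
        line = line.strip()
        if line.startswith("role:"):
            return line.split(":", 1)[1]
    return None


def find_masters(info_by_pod):
    """Return the sorted names of all pods that report role:master."""
    return sorted(
        pod for pod, info in info_by_pod.items() if parse_role(info) == "master"
    )


def is_split_brain(info_by_pod):
    """More than one self-declared master means the topology has diverged."""
    return len(find_masters(info_by_pod)) > 1


# Illustrative INFO replication snippets, not real captured output.
sample = {
    "redis-v2-0": "# Replication\r\nrole:master\r\nconnected_slaves:0\r\n",
    "redis-v2-1": "# Replication\r\nrole:slave\r\n",
    "redis-v2-2": "# Replication\r\nrole:master\r\nconnected_slaves:1\r\n",
}

print(find_masters(sample), is_split_brain(sample))
```

In practice the per-pod text would come from running `INFO replication` against each pod directly rather than through the load-balanced service, since the service itself may route to either master.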

This is the third Sentinel-class incident (after 2026-04-19 PM quorum drift and 2026-04-22 flap cascade). The 3-sentinel design was built to prevent split-brain, but the bootstrap fallback re-introduced it.

Resolution

Reverted redis-v2 to a single standalone instance (replicas=1, Sentinel + HAProxy removed), collapsing onto redis-v2-0's dataset (preserved Immich's queued jobs). Eviction policy changed allkeys-lru → volatile-lru so the shared cache+queue workload is served correctly by one instance (evict only TTL'd cache keys; never TTL-less queue keys). redis-master service name/DNS unchanged → no consumer edits. Decision rationale: a homelab cache/broker does not need HA; a few-seconds restart blip beats chasing Sentinel correctness. Mirrors the 2026-04-16 MySQL InnoDB-Cluster → standalone reversion.
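The effect of the policy change can be shown with a small model of which keys each policy is allowed to evict. This is a sketch of the eligibility rule only, not of Redis's approximate LRU sampling, and the key names are hypothetical examples rather than keys taken from the instance.

```python
def eviction_candidates(keys, policy):
    """Return the key names a maxmemory policy may evict.

    keys maps key name -> TTL in seconds, or None when the key has no TTL.
    """
    if policy == "allkeys-lru":
        return sorted(keys)                                   # every key is fair game
    if policy == "volatile-lru":
        return sorted(k for k, ttl in keys.items() if ttl is not None)
    if policy == "noeviction":
        return []                                             # writes fail instead
    raise ValueError(f"unsupported policy: {policy}")


# Hypothetical key names: one TTL'd cache entry and two TTL-less queue keys.
keys = {
    "cache:page:home": 300,
    "bull:thumbnails:wait": None,
    "celery": None,
}

print(eviction_candidates(keys, "allkeys-lru"))
print(eviction_candidates(keys, "volatile-lru"))
```

Under volatile-lru only the cache entry is eligible, so queue keys survive memory pressure as long as they carry no TTL.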

Follow-ups

  • Re-upload the ~99 Immich images + 12 timeline videos whose originals are missing/zero-filled on disk (pre-existing data loss, unrelated to the split-brain — re-running jobs can't regenerate them). Owner: Viktor.
  • requirepass auth on Redis + creds rollout to all consumers (carried over from the 2026-04-19 rework; still open).
  • Consider whether any queue user (Immich/Celery) warrants its own dedicated Redis if the shared instance's memory ever becomes contended (currently ~30MB / 640MB — not a concern).

Lessons

  • HA that re-introduces its own failure class is worse than no HA. For a single-node-tolerant homelab, prefer a standalone instance + a small accepted downtime window.
  • allkeys-lru on a shared cache+queue Redis can silently evict queue job keys under memory pressure; volatile-lru, which evicts only keys that carry a TTL, is the correct policy for this single shared instance (Immich even logs IMPORTANT! Eviction policy ... should be "noeviction").
  • A "bootstrap master" fallback that fires under partition is a split-brain generator — avoid deterministic self-promotion when peers are unreachable.
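One way to read the last lesson is as a rule for the init-time role decision. The sketch below is an assumption-laden illustration, not the actual generate-sentinel-conf logic: `choose_role` and its parameters are hypothetical, and the explicit `first_bootstrap` flag is one possible safeguard among several.

```python
def choose_role(ordinal, master_from_peers, first_bootstrap=False):
    """Decide how a pod should start.

    ordinal: StatefulSet ordinal of this pod (0 for redis-v2-0).
    master_from_peers: master address learned from reachable peers, or None.
    first_bootstrap: operator-set flag for the very first cluster start
                     (hypothetical safeguard).
    """
    if master_from_peers is not None:
        return "replica"           # a master already exists: follow it
    if first_bootstrap and ordinal == 0:
        return "master"            # only an explicit first start may self-promote
    return "wait"                  # peers unreachable: retry instead of promoting


print(choose_role(0, None))                         # partitioned pod-0
print(choose_role(0, None, first_bootstrap=True))   # deliberate initial bootstrap
print(choose_role(0, "redis-v2-2"))                 # master known from peers
```

The design choice here is that an unreachable peer set is treated as "unknown" rather than "empty", so a partitioned pod-0 blocks and retries instead of becoming a second master.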