redis: revert 3-node Sentinel HA to single standalone instance [ci skip]

The redis-v2 Sentinel cluster split-brained: redis-v2-0 booted during a network
partition, hit the init script's deterministic "pod-0 = bootstrap master"
fallback, and became a SECOND master alongside the sentinel-elected redis-v2-2.
HAProxy's `expect rstring role:master` matched both and round-robined client
connections across the two diverging masters, so Immich enqueued BullMQ jobs on
one while its workers blocked-popped on the other -> every queue wedged and
new-upload thumbnails 404'd cluster-wide. Third Sentinel-class incident in ~6
weeks (after the 2026-04-19 PM quorum drift and 2026-04-22 flap cascade).
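
The failure mode above can be sketched in a few lines. This is a minimal illustration, not the actual HAProxy config: the `INFO replication` replies are hypothetical and trimmed, redis-v2-1 being a replica is an assumption, and `backends_passing_check` / `round_robin` are made-up helper names. It only shows that a `role:master` regex check cannot tell two self-declared masters apart.

```python
import re

# Hypothetical, trimmed INFO replication replies during the partition.
# Only the roles of redis-v2-0 and redis-v2-2 come from the incident
# description; everything else here is an illustrative assumption.
INFO_REPLIES = {
    "redis-v2-0": "# Replication\r\nrole:master\r\n",
    "redis-v2-1": "# Replication\r\nrole:slave\r\n",
    "redis-v2-2": "# Replication\r\nrole:master\r\n",
}

# Same pattern as the health check quoted above: `expect rstring role:master`.
MASTER_CHECK = re.compile(r"role:master")


def backends_passing_check(replies):
    """Return the pods whose reply matches the master health check."""
    return sorted(pod for pod, reply in replies.items() if MASTER_CHECK.search(reply))


def round_robin(backends, n):
    """Assign n client connections to the passing backends in turn."""
    return [backends[i % len(backends)] for i in range(n)]


passing = backends_passing_check(INFO_REPLIES)
print(passing)
print(round_robin(passing, 4))
```

With two backends passing, consecutive connections alternate between them, which is how a producer and its workers can end up on different datasets.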

Revert to a single standalone instance: replicas=1; drop Sentinel + HAProxy +
init bootstrap configmap + both PDBs; redis container only (+ exporter).
maxmemory-policy allkeys-lru -> volatile-lru so one shared instance serves both
workload classes correctly: evict only TTL'd cache keys, never TTL-less Immich
BullMQ / Celery job keys. redis-master service name/DNS unchanged -> no consumer
edits; collapsed onto redis-v2-0's existing dataset (queued jobs preserved).
Applied via tg (Tier 1 / PG-authoritative state); this commit syncs source +
docs only, hence [ci skip].
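
A toy model of the policy change, to make the difference concrete. Assumptions: capacity is counted in keys rather than bytes, LRU is exact (Redis samples approximately), and `TinyRedis`, `OutOfMemory`, and the key names are hypothetical stand-ins, not Redis or BullMQ APIs.

```python
from collections import OrderedDict


class OutOfMemory(Exception):
    """Stands in for the OOM error Redis returns when a write cannot be given room."""


class TinyRedis:
    def __init__(self, policy, max_keys):
        self.policy = policy          # "allkeys-lru" or "volatile-lru"
        self.max_keys = max_keys      # toy maxmemory, counted in keys
        self.keys = OrderedDict()     # key -> has_ttl, least recently written first

    def set(self, key, has_ttl):
        self.keys.pop(key, None)      # rewriting a key makes it most recent
        while len(self.keys) >= self.max_keys:
            self._evict_one()
        self.keys[key] = has_ttl

    def _evict_one(self):
        # allkeys-lru may evict anything; volatile-lru only keys with a TTL.
        victim = next(
            (k for k, has_ttl in self.keys.items()
             if self.policy == "allkeys-lru" or has_ttl),
            None,
        )
        if victim is None:
            raise OutOfMemory("no evictable key under " + self.policy)
        del self.keys[victim]


def run(policy):
    r = TinyRedis(policy, max_keys=3)
    r.set("bull:thumbnail:job1", has_ttl=False)  # hypothetical TTL-less job key
    r.set("cache:a", has_ttl=True)
    r.set("cache:b", has_ttl=True)
    r.set("cache:c", has_ttl=True)               # instance is full, forces an eviction
    return sorted(r.keys)


print(run("allkeys-lru"))   # the job key is the oldest, so it is the one evicted
print(run("volatile-lru"))  # the job key survives; the oldest cache key goes
```

The flip side is that under volatile-lru an instance full of TTL-less keys has nothing to evict, so further writes fail instead.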

Monitoring: drop RedisReplicationLagHigh + RedisReplicasMissing (no replicas
now; the latter would false-fire); lower RedisMemoryPressure 85% -> 80% as the
volatile-lru backstop.
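
A rough sketch of why the old replica rule would false-fire and what the lower threshold changes. The two functions mirror the rule expressions in the diff as plain Python predicates; the function names and sample numbers are hypothetical, and the `for:` hold duration is ignored.

```python
def replicas_missing(connected_slaves, role):
    """Removed rule: fewer than 2 connected replicas on a pod whose role is master."""
    return connected_slaves < 2 and role == "master"


def memory_pressure(used_bytes, max_bytes, threshold=0.80):
    """RedisMemoryPressure: used / max above the threshold."""
    return used_bytes / max_bytes > threshold


# A standalone instance is a master with zero replicas, so the old rule is always true.
print(replicas_missing(connected_slaves=0, role="master"))
# The healthy 3-node layout (master + 2 replicas) did not trip it.
print(replicas_missing(connected_slaves=2, role="master"))

# At 82% of maxmemory the old 0.85 threshold stays quiet; the new 0.80 one fires.
print(memory_pressure(820, 1000, threshold=0.85))
print(memory_pressure(820, 1000))
```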

Docs: rewrite databases.md Redis section (single-instance design + incident
history); add post-mortem 2026-05-30-redis-split-brain.md.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Viktor Barzin 2026-05-30 17:49:43 +00:00
parent 5bcb4525a4
commit e1ab23193d
4 changed files with 196 additions and 515 deletions

@@ -1676,30 +1676,28 @@ serverFiles:
  labels:
  severity: critical
  annotations:
- summary: "Redis has no ready replicas"
+ summary: "Redis is down — statefulset redis-v2 has no ready pod"
  - alert: RedisMemoryPressure
- expr: redis_memory_used_bytes{namespace="redis"} / redis_memory_max_bytes{namespace="redis"} > 0.85
+ # Single instance, volatile-lru (2026-05-30): at maxmemory, TTL'd
+ # (cache) keys are evicted but TTL-less keys (Immich BullMQ + Celery
+ # jobs) are NOT — so once cache headroom is gone, queue writes start
+ # erroring. 80% is the backstop to intervene (bump maxmemory) first.
+ expr: redis_memory_used_bytes{namespace="redis"} / redis_memory_max_bytes{namespace="redis"} > 0.80
  for: 5m
  labels:
  severity: warning
  annotations:
- summary: "Redis pod {{ $labels.pod }} using {{ $value | humanizePercentage }} of maxmemory — eviction imminent"
+ summary: "Redis pod {{ $labels.pod }} using {{ $value | humanizePercentage }} of maxmemory — volatile-lru evicting cache keys; queue writes at risk"
  - alert: RedisEvictions
- # allkeys-lru is configured so evictions under cache pressure are
- # expected, but sustained evictions mean we're thrashing — raise it.
+ # volatile-lru evicts only TTL'd (cache) keys under pressure — an
+ # occasional eviction is by design, but a sustained rate means we're
+ # near maxmemory and should raise it before queue writes error.
  expr: rate(redis_evicted_keys_total{namespace="redis"}[5m]) > 0
  for: 5m
  labels:
  severity: warning
  annotations:
- summary: "Redis pod {{ $labels.pod }} evicting keys ({{ $value }} keys/s)"
- - alert: RedisReplicationLagHigh
- expr: redis_connected_slave_lag_seconds{namespace="redis"} > 30
- for: 3m
- labels:
- severity: warning
- annotations:
- summary: "Redis replica {{ $labels.slave_ip }} lagging {{ $value }}s behind master"
+ summary: "Redis pod {{ $labels.pod }} evicting keys ({{ $value }} keys/s) — near maxmemory"
  - alert: RedisForkLatencyHigh
  # latest_fork_usec > 500ms means BGSAVE fork is stalling the main
  # thread long enough to drop client requests. COW pressure or
@@ -1717,16 +1715,6 @@ serverFiles:
  severity: warning
  annotations:
  summary: "Redis pod {{ $labels.pod }} AOF rewrite running >10m — COW memory risk, investigate"
- - alert: RedisReplicasMissing
- # redis-v2 StatefulSet should always have 3 replicas connected to
- # the master (2 replicas + itself). <2 connected_slaves means one
- # replica is unreachable or still syncing.
- expr: redis_connected_slaves{namespace="redis", pod=~"redis-v2-.*"} < 2 and redis_instance_info{namespace="redis", pod=~"redis-v2-.*", role="master"} == 1
- for: 10m
- labels:
- severity: warning
- annotations:
- summary: "Redis master {{ $labels.pod }} has only {{ $value }} connected replicas (expected 2)"
  - alert: HeadscaleReplicasMismatch
  expr: (kube_deployment_status_replicas_available{namespace="headscale"} or on() vector(0)) < 1
  for: 5m