infra/docs/architecture
Viktor Barzin e1ab23193d redis: revert 3-node Sentinel HA to single standalone instance [ci skip]
The redis-v2 Sentinel cluster split-brained: redis-v2-0 booted during a network
partition, hit the init script's deterministic "pod-0 = bootstrap master"
fallback, and became a SECOND master alongside the sentinel-elected redis-v2-2.
HAProxy's `expect rstring role:master` matched both and round-robined client
connections across the two diverging masters, so Immich enqueued BullMQ jobs on
one while its workers blocked-popped on the other -> every queue wedged and
new-upload thumbnails 404'd cluster-wide. Third Sentinel-class incident in ~6
weeks (after the 2026-04-19 PM quorum drift and 2026-04-22 flap cascade).

Revert to a single standalone instance: replicas=1; drop Sentinel + HAProxy +
init bootstrap configmap + both PDBs; redis container only (+ exporter).
maxmemory-policy allkeys-lru -> volatile-lru so one shared instance serves both
workload classes correctly: evict only TTL'd cache keys, never TTL-less Immich
BullMQ / Celery job keys. redis-master service name/DNS unchanged -> no consumer
edits; collapsed onto redis-v2-0's existing dataset (queued jobs preserved).
Applied via tg (Tier 1 / PG-authoritative state); this commit syncs source +
docs only, hence [ci skip].

Monitoring: drop RedisReplicationLagHigh + RedisReplicasMissing (no replicas
now; the latter would false-fire), RedisMemoryPressure 85%->80% volatile-lru backstop.

Docs: rewrite databases.md Redis section (single-instance design + incident
history); add post-mortem 2026-05-30-redis-split-brain.md.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-30 17:49:43 +00:00
..
agent-task-tracking.md Add agent task tracking documentation 2026-04-15 17:11:26 +00:00
authentication.md docs/auth: sync to current auth enum (required/app/public/none) 2026-05-11 19:28:42 +00:00
automated-upgrades.md kured: drop Mon-Fri restriction, reboot any day 2026-05-16 12:29:01 +00:00
backup-dr.md backup-dr docs: refresh diagrams for daily/immich-only architecture 2026-05-26 20:00:31 +00:00
chrome-service.md chrome-service: open NP for Traefik → noVNC sidecar (port 6080) 2026-05-07 18:40:11 +00:00
ci-cd.md [forgejo] Phases 3+4+5: cutover, decommission, docs sweep 2026-05-07 18:30:02 +00:00
compute.md immich: GPU-accelerate video transcoding (NVENC + NVDEC) 2026-05-29 18:05:34 +00:00
databases.md redis: revert 3-node Sentinel HA to single standalone instance [ci skip] 2026-05-30 17:49:43 +00:00
dns.md phpipam-pfsense-import: every 5min → hourly 2026-04-26 22:48:43 +00:00
homepage.md add homepage auto-discovery documentation [ci skip] 2026-03-25 13:06:43 +02:00
incident-response.md [claude-agent-service] Migrate all pipelines from DevVM SSH to K8s HTTP 2026-04-18 10:12:02 +00:00
llama-cpp.md infra/llama-cpp: benchmark report + -fa flag fix 2026-05-10 15:03:16 +00:00
mailserver.md monitoring: bring EmailRoundtripStale threshold docs in sync with for:20m 2026-04-21 22:39:46 +00:00
monitoring.md security(wave1): W1.1 audit-log shipping LIVE + W1.5 trusted-registries Enforce LIVE 2026-05-19 06:37:54 +00:00
multi-tenancy.md add architecture documentation for all infrastructure subsystems [ci skip] 2026-03-24 00:55:25 +02:00
networking.md traefik: uncap download duration (writeTimeout 60s->0), upload window 3600s [ci skip] 2026-05-30 17:46:59 +00:00
overview.md gpu: schedule off NFD label, not k8s-node1 hostname 2026-04-22 13:43:07 +00:00
secrets.md docs: comprehensive audit and update of all architecture docs and runbooks [ci skip] 2026-04-06 13:21:05 +03:00
security.md security(wave1): W1.7 analysis snapshot — observation data → allowlist plan 2026-05-22 15:22:25 +00:00
storage.md storage docs: document the per-VM SCSI-LUN cap (proxmox-csi) 2026-05-26 02:56:27 +00:00
vpn.md docs: Technitium DNS IP — 10.0.20.101 → 10.0.20.201 2026-05-23 08:53:52 +00:00
wave1-egress-observation-2026-05-22.md security(wave1): W1.7 analysis snapshot — observation data → allowlist plan 2026-05-22 15:22:25 +00:00