[redis] Phase 3-7: cutover to redis-v2, Nextcloud HAProxy-only

Phase 3 — replication chain (old → v2):
 - Discovered the v2 cluster was running redis:7.4-alpine, but the
   Bitnami old master ships redis 8.6.2 which writes RDB format 13 —
   the 7.4 replicas rejected the stream with "Can't handle RDB format
   version 13". Bumped v2 image to redis:8-alpine (also 8.6.2) to
   restore PSYNC compatibility.
 - Discovered that sentinel on BOTH v2 and old Bitnami clusters
   auto-discovered the cross-cluster replication chain when v2-0
   REPLICAOF'd the old master, triggering a failover that reparented
   old-master to a v2 replica and took HAProxy's backend offline.
   Mitigation: `SENTINEL REMOVE mymaster` on all 5 sentinels (both
   clusters) during the REPLICAOF surgery, then re-MONITOR after
   cutover. This must be done on the OLD sentinels too, not just v2 —
   they're the ones that kept fighting our REPLICAOF.
 - Set up the chain: v2-0 REPLICAOF old-master; v2-{1,2} REPLICAOF v2-0.
   All keys (db0:76, db1:22, db4:16) synced, including `immich_bull:*`
   BullMQ queues and `_kombu.*` Celery queues — the user-stated
   must-survive data class.
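
One way to double-check the per-database counts above before promoting is to diff the `# Keyspace` section of `INFO` from the old master against the same section from v2-0. A minimal Python sketch, assuming the text was already captured (for example with `redis-cli INFO keyspace`). `parse_keyspace` and `keyspace_diff` are hypothetical helper names, and the `expires`/`avg_ttl` values in the sample are placeholders, not measured data:

```python
def parse_keyspace(info_text):
    """Parse `INFO keyspace` text into {db_index: key_count}."""
    counts = {}
    for raw in info_text.splitlines():
        name, sep, fields = raw.strip().partition(":")
        # Keyspace lines look like: db0:keys=76,expires=0,avg_ttl=0
        if not sep or not name.startswith("db") or not name[2:].isdigit():
            continue
        pairs = dict(f.split("=", 1) for f in fields.split(",") if "=" in f)
        counts[int(name[2:])] = int(pairs.get("keys", 0))
    return counts


def keyspace_diff(old, new):
    """Return {db: (old_count, new_count)} for databases whose counts differ."""
    return {
        db: (old.get(db, 0), new.get(db, 0))
        for db in sorted(set(old) | set(new))
        if old.get(db, 0) != new.get(db, 0)
    }


# Placeholder capture: only the keys= values come from the commit message.
SAMPLE = (
    "# Keyspace\r\n"
    "db0:keys=76,expires=0,avg_ttl=0\r\n"
    "db1:keys=22,expires=0,avg_ttl=0\r\n"
    "db4:keys=16,expires=0,avg_ttl=0\r\n"
)
```

An empty result from `keyspace_diff(old, new)` means every database has the same key count on both sides. It says nothing about key contents, so it is a sanity check rather than proof of a complete sync.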

Phase 4 — HAProxy cutover:
 - Updated `kubernetes_config_map.haproxy` to point at
   `redis-v2-{0,1,2}.redis-v2-headless` for both redis_master and
   redis_sentinel backends (removed redis-node-{0,1}).
 - Promoted v2-0 (`REPLICAOF NO ONE`) at the same time as the
   ConfigMap apply so HAProxy's 1s health-check interval found a
   role:master within a few seconds. Cutover disruption on HAProxy
   rollout was brief; old clients naturally moved to new HAProxy pods
   within the rolling update window.
 - Re-enabled sentinel monitoring on v2 with `SENTINEL MONITOR
   mymaster <hostname> 6379 2` after verifying `resolve-hostnames yes`
   + `announce-hostnames yes` were active — this ensures sentinel
   stores the hostname (not resolved IP) in its rewritten config, so
   pod-IP churn on restart doesn't break failover.
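
The HAProxy check routes on the `role:` line of `INFO replication`. The same test can be sketched in Python to confirm, right after `REPLICAOF NO ONE`, that exactly one backend reports `role:master`. This is an illustration only: `parse_role` and `pick_master` are hypothetical names, and the replies are assumed to have been collected already, one string per pod:

```python
def parse_role(info_replication):
    """Return the value of the `role:` line from INFO replication, or None."""
    for raw in info_replication.splitlines():
        key, sep, value = raw.strip().partition(":")
        if sep and key == "role":
            return value
    return None


def pick_master(replies):
    """Given {backend_name: INFO replication text}, return the single master.

    Returns None when zero or several backends claim role:master. That is a
    deliberately strict choice for a pre-cutover check, not HAProxy's own
    behaviour.
    """
    masters = sorted(
        name for name, text in replies.items() if parse_role(text) == "master"
    )
    return masters[0] if len(masters) == 1 else None
```

Treating "two masters" as a failure is the point of the sketch: during the chained-replication window, a second `role:master` would also pass the tcp-check and receive writes.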

Phase 5 — chaos:
 - Round 1: killed master v2-0 mid-probe. First run exposed the
   sentinel IP-storage issue (stored 10.10.107.222, went stale on
   restart) — ~12s probe disruption. Fixed hostname persistence and
   re-MONITORed.
 - Round 2: killed new master v2-2 with hostnames correctly stored.
   Sentinel elected v2-0 and HAProxy re-routed: 1/40 probes failed over
   60s, within the <3s target for user-visible disruption.
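
The Round 1 failure mode (sentinel holding a resolved IP instead of a hostname) can be detected ahead of a drill. A small Python sketch, assuming the `(host, port)` answers from `SENTINEL get-master-addr-by-name mymaster` have already been gathered per sentinel. `is_ip_literal` and `sentinels_storing_ip` are hypothetical names, and the sentinel labels in the usage note are illustrative:

```python
import ipaddress


def is_ip_literal(host):
    """True if host is an IPv4/IPv6 literal rather than a DNS name."""
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        return False


def sentinels_storing_ip(master_addrs):
    """Given {sentinel_name: (host, port)}, list sentinels that report an IP.

    An IP here suggests the hostname flags were not active when MONITOR was
    issued, so the address may go stale when the pod is recreated.
    """
    return sorted(
        name for name, (host, _port) in master_addrs.items() if is_ip_literal(host)
    )
```

For example, a sentinel answering `("10.10.107.222", "6379")` would be listed, while one answering with a `redis-v2-0.redis-v2-headless...` name would not.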

Phase 6 — Nextcloud simplification:
 - `zzz-redis.config.php` no longer queries sentinel in-process —
   just points at `redis-master.redis.svc.cluster.local`. Removed 20
   lines of PHP. HAProxy handles master tracking transparently now
   that it's scaled to 3 + PDB minAvailable=2.

Phase 7 step 1:
 - `kubectl scale statefulset/redis-node --replicas=0` (transient —
   TF removal in a 24h follow-up). Old PVCs `redis-data-redis-node-{0,1}`
   preserved as cold rollback.

Docs:
 - Rewrote `databases.md` Redis section to reflect post-cutover reality
   and the sentinel hostname gotcha (so future sessions don't relearn it).
 - `.claude/reference/service-catalog.md` entry updated.

The parallel-bootstrap race documented in the previous commit is still
worth watching — the init container now defaults to pod-0 as master
when no peer reports role:master-with-slaves, so fresh boots land in
a deterministic topology.
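
The fallback order described above (sentinel vote, then a peer reporting role:master with slaves, then pod-0) can be sketched as a pure function. This is a Python illustration of the decision order only, not the actual `init.sh`. `choose_master` is a hypothetical name, and taking the most common sentinel answer as "the vote" is an assumption:

```python
from collections import Counter


def choose_master(sentinel_votes, peer_info, pod0):
    """Pick the master for a booting pod.

    sentinel_votes: master names reported by reachable peer sentinels
                    (None or "" for sentinels that gave no answer).
    peer_info:      {peer_name: (role, connected_slaves)} from INFO replication.
    pod0:           name of the deterministic fallback pod.
    """
    # 1. Sentinel vote: most common non-empty answer (assumed tie-break:
    #    first one seen).
    votes = Counter(v for v in sentinel_votes if v)
    if votes:
        return votes.most_common(1)[0][0]
    # 2. A peer that claims master AND actually has replicas attached.
    for name in sorted(peer_info):
        role, connected_slaves = peer_info[name]
        if role == "master" and connected_slaves > 0:
            return name
    # 3. Fresh boot with no signal at all: fall back to pod-0.
    return pod0
```

Requiring `connected_slaves > 0` in step 2 is what keeps a freshly started, empty pod (which also reports `role:master` by default) from being mistaken for the real master.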

Closes: code-7n4
Closes: code-9y6
Closes: code-cnf
Closes: code-tc4

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Viktor Barzin 2026-04-19 16:13:43 +00:00
parent f6685a23a9
commit b6cd83f85a
4 changed files with 25 additions and 41 deletions


@ -19,7 +19,7 @@
| Service | Description | Stack |
|---------|-------------|-------|
| vaultwarden | Bitwarden-compatible password manager | platform |
| redis | Shared Redis at `redis.redis.svc.cluster.local` | redis |
| redis | Shared Redis 8.x via HAProxy at `redis-master.redis.svc.cluster.local` — 3-pod raw StatefulSet `redis-v2` (redis+sentinel+exporter per pod), quorum=2. Clients use HAProxy only, no sentinel fallback. | redis |
| immich | Photo management (GPU) | immich |
| nvidia | GPU device plugin | nvidia |
| metrics-server | K8s metrics | metrics-server |


@ -122,20 +122,18 @@ graph TB
Single shared cluster for all 17 consumers (Immich, Authentik, Nextcloud, Paperless, Dawarich Sidekiq, Traefik, etc.). HAProxy (3 replicas, PDB minAvailable=2) is the sole client-facing path — clients talk only to `redis-master.redis.svc.cluster.local:6379` and HAProxy health-checks backends via `INFO replication`, routing only to `role:master`.
**Current state (as of 2026-04-19, interim — parallel cluster during rework)**:
**Current state (as of 2026-04-19, cutover complete)**:
| Cluster | Pods | Source | Purpose |
|---|---|---|---|
| Legacy `redis-node-*` | 1 master + 1 replica (2 sentinels) | Bitnami Helm chart v25.3.2 | Serving live traffic via HAProxy |
| New `redis-v2-*` | 3 pods, each co-locating redis + sentinel + exporter | Raw `kubernetes_stateful_set_v1` with `redis:7.4-alpine` | Standing by for REPLICAOF-based cutover |
Active cluster: `redis-v2-*` — 3 pods, each co-locating redis + sentinel + redis_exporter, using `docker.io/library/redis:8-alpine` (8.6.2). HAProxy backends point at `redis-v2-{0,1,2}.redis-v2-headless.redis.svc.cluster.local`. DBSIZE matched between old master and new at cutover; all data (including `immich_bull:*` and `_kombu.*` queues) preserved via chained `REPLICAOF`. Steady-state probe: 45/45 PING OK. Two chaos drills (kill master, sentinel failover) passed — first drill ~12s disruption, second ~1s after hostname fix below.
Both clusters live in the `redis` namespace. See `infra/stacks/redis/modules/redis/main.tf` (end-state; legacy `helm_release.redis` + `kubernetes_stateful_set_v1.redis_v2` coexist until cutover).
Legacy `redis-node-*` StatefulSet is scaled to 0 (kept as cold rollback for 24h). Helm release `helm_release.redis` + PVCs `redis-data-redis-node-{0,1}` are pending Terraform removal in a follow-up commit (see beads follow-up task).
**Target architecture (post-cutover)**:
**Architecture**:
- 3 redis pods + 3 co-located sentinels (quorum=2). Odd sentinel count eliminates split-brain.
- `podManagementPolicy=Parallel` + init container that regenerates `sentinel.conf` on every boot by probing peer sentinels for consensus master. No persistent sentinel runtime state — can't drift out of sync with reality (root cause of 2026-04-19 PM incident).
- `podManagementPolicy=Parallel` + init container that regenerates `sentinel.conf` on every boot by probing peer sentinels for consensus master (priority: sentinel vote → peer role:master with slaves → deterministic pod-0 fallback). No persistent sentinel runtime state — can't drift out of sync with reality (root cause of 2026-04-19 PM incident).
- redis.conf has `include /shared/replica.conf`; the init container writes either an empty file (master) or `replicaof <master> 6379` (replicas), so pods come up already in the right role — no bootstrap race.
- **Sentinel hostname persistence**: `sentinel resolve-hostnames yes` + `sentinel announce-hostnames yes` in the init-generated sentinel.conf are mandatory — without them, sentinel stores resolved IPs in its rewritten config, and pod-IP churn on restart breaks failover. The MONITOR command itself must be issued with a hostname and the flags must be active before MONITOR, otherwise sentinel stores an IP that goes stale the next time the pod is deleted.
- Memory: master + replicas `requests=limits=768Mi`. Concurrent BGSAVE + AOF-rewrite fork can double RSS via COW, so headroom must cover it. `auto-aof-rewrite-percentage=200` + `auto-aof-rewrite-min-size=128mb` tune down rewrite frequency.
- Persistence: RDB (`save 900 1 / 300 100 / 60 10000`) + AOF `appendfsync=everysec`. Disk-wear analysis on 2026-04-19 (sdb Samsung 850 EVO 1TB, 150 TBW): Redis contributes <1 GB/day cluster-wide, which gives a 40+ year runway at the 20% TBW budget.
- `maxmemory=640mb` (83% of 768Mi limit), `maxmemory-policy=allkeys-lru`.
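
The memory and disk-wear figures in the list above can be re-derived with a few lines of arithmetic. A Python sketch under stated assumptions: decimal terabytes (1 TB = 1000 GB) for the TBW rating, and the 1 GB/day upper bound used as the write rate. The function names are hypothetical:

```python
def maxmemory_fraction(maxmemory_mib, limit_mib):
    """Fraction of the container memory limit that maxmemory occupies."""
    return maxmemory_mib / limit_mib


def runway_years(tbw_rating_tb, budget_fraction, gb_per_day):
    """Years until the given share of the drive's TBW rating is consumed."""
    budget_gb = tbw_rating_tb * 1000 * budget_fraction  # assumes 1 TB = 1000 GB
    return budget_gb / gb_per_day / 365


fraction = maxmemory_fraction(640, 768)   # maxmemory vs. 768Mi limit
headroom_mib = 768 - 640                  # left for fork/COW overhead
years = runway_years(150, 0.20, 1.0)      # 1 GB/day is the stated upper bound
```

With these inputs the fraction rounds to 83% and the headroom is 128 MiB. Because 1 GB/day is an upper bound on the write rate, the computed runway is a lower bound, and it comfortably exceeds the "40+ year" figure quoted above.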


@ -29,35 +29,19 @@ nextcloud:
configs:
zzz-redis.config.php: |
<?php
// Redis with Sentinel-based master discovery
// Queries Sentinel to find the current master, falls back to HAProxy service
// which health-checks Redis nodes and routes only to the master.
$sentinels = [
['redis-node-0.redis-headless.redis.svc.cluster.local', 26379],
['redis-node-1.redis-headless.redis.svc.cluster.local', 26379],
];
// Fallback: HAProxy master-only service (safe even if Sentinel is unavailable)
$redisHost = 'redis-master.redis.svc.cluster.local';
$redisPort = 6379;
foreach ($sentinels as [$sHost, $sPort]) {
try {
$s = new Redis();
if ($s->connect($sHost, $sPort, 0.5)) {
$master = $s->rawCommand('SENTINEL', 'get-master-addr-by-name', 'mymaster');
if ($master) {
$redisHost = $master[0];
$redisPort = (int)$master[1];
break;
}
}
} catch (\Exception $e) {}
}
// Redis via HAProxy master-only service. HAProxy (3 replicas, PDB
// minAvailable=2) health-checks all v2 pods via `INFO replication` and
// routes to the current role:master. Sentinel failover takes <30s, and
// HAProxy detects the new master via its 1s tcp-check interval and
// starts routing within ~3s of detection. Removed the old in-process
// sentinel-query loop on 2026-04-19 after the Redis rework — see
// beads code-v2b and infra/docs/architecture/databases.md.
$CONFIG = array(
'memcache.distributed' => '\\OC\\Memcache\\Redis',
'memcache.locking' => '\\OC\\Memcache\\Redis',
'redis' => array(
'host' => $redisHost,
'port' => $redisPort,
'host' => 'redis-master.redis.svc.cluster.local',
'port' => 6379,
'password' => '',
'timeout' => 1.5,
'read_timeout' => 1.5,


@ -186,13 +186,15 @@ resource "kubernetes_config_map" "haproxy" {
tcp-check expect rstring role:master
tcp-check send "QUIT\r\n"
tcp-check expect string +OK
server redis-node-0 redis-node-0.redis-headless.redis.svc.cluster.local:6379 check inter 1s fall 2 rise 2 resolvers kubernetes init-addr last,libc,none
server redis-node-1 redis-node-1.redis-headless.redis.svc.cluster.local:6379 check inter 1s fall 2 rise 2 resolvers kubernetes init-addr last,libc,none
server redis-v2-0 redis-v2-0.redis-v2-headless.redis.svc.cluster.local:6379 check inter 1s fall 2 rise 2 resolvers kubernetes init-addr last,libc,none
server redis-v2-1 redis-v2-1.redis-v2-headless.redis.svc.cluster.local:6379 check inter 1s fall 2 rise 2 resolvers kubernetes init-addr last,libc,none
server redis-v2-2 redis-v2-2.redis-v2-headless.redis.svc.cluster.local:6379 check inter 1s fall 2 rise 2 resolvers kubernetes init-addr last,libc,none
backend redis_sentinel
balance roundrobin
server redis-node-0 redis-node-0.redis-headless.redis.svc.cluster.local:26379 check inter 5s resolvers kubernetes init-addr last,libc,none
server redis-node-1 redis-node-1.redis-headless.redis.svc.cluster.local:26379 check inter 5s resolvers kubernetes init-addr last,libc,none
server redis-v2-0 redis-v2-0.redis-v2-headless.redis.svc.cluster.local:26379 check inter 5s resolvers kubernetes init-addr last,libc,none
server redis-v2-1 redis-v2-1.redis-v2-headless.redis.svc.cluster.local:26379 check inter 5s resolvers kubernetes init-addr last,libc,none
server redis-v2-2 redis-v2-2.redis-v2-headless.redis.svc.cluster.local:26379 check inter 5s resolvers kubernetes init-addr last,libc,none
EOT
}
}
@ -596,7 +598,7 @@ resource "kubernetes_stateful_set_v1" "redis_v2" {
init_container {
name = "generate-sentinel-conf"
image = "docker.io/library/redis:7.4-alpine"
image = "docker.io/library/redis:8-alpine"
command = ["/bin/sh", "/bootstrap/init.sh"]
resources {
@ -622,7 +624,7 @@ resource "kubernetes_stateful_set_v1" "redis_v2" {
container {
name = "redis"
image = "docker.io/library/redis:7.4-alpine"
image = "docker.io/library/redis:8-alpine"
command = ["redis-server", "/etc/redis/redis.conf"]
port {
@ -678,7 +680,7 @@ resource "kubernetes_stateful_set_v1" "redis_v2" {
container {
name = "sentinel"
image = "docker.io/library/redis:7.4-alpine"
image = "docker.io/library/redis:8-alpine"
command = ["redis-sentinel", "/shared/sentinel.conf"]
port {