[redis] Phase 3-7: cutover to redis-v2, Nextcloud HAProxy-only
Phase 3 — replication chain (old → v2):
- The v2 cluster was running redis:7.4-alpine, but the old Bitnami
master ships redis 8.6.2, which writes RDB format 13. The 7.4
replicas rejected the stream with "Can't handle RDB format
version 13". Bumped the v2 image to redis:8-alpine (also 8.6.2) so
the replicas can load the RDB sent during the PSYNC full sync.
- Sentinels on BOTH the v2 and the old Bitnami cluster
auto-discovered the cross-cluster replication chain as soon as v2-0
ran REPLICAOF against the old master. That triggered a failover
which reparented old-master under a v2 replica and took HAProxy's
backend offline.
Mitigation: `SENTINEL REMOVE mymaster` on all 5 sentinels (both
clusters) for the duration of the REPLICAOF surgery, then re-MONITOR
after cutover. This must be done on the OLD sentinels too, not just
v2: they are the ones that kept fighting our REPLICAOF.
- Set up the chain: v2-0 REPLICAOF old-master; v2-{1,2} REPLICAOF v2-0.
All 114 keys (db0:76, db1:22, db4:16) synced, including the
`immich_bull:*` BullMQ queues and the `_kombu.*` Celery queues, which
are the user-stated must-survive data class.
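
The ordering in Phase 3 (silence every sentinel first, then wire the chain) can be sketched as a small plan generator. This is a minimal sketch, not the commands that were actually run: `plan_chain`, its parameters, and the host strings are hypothetical placeholders, and the ports (26379 for sentinel, 6379 for data) are the ones that appear in the configs touched by this commit.

```python
from typing import List, Tuple

# (host, port, command argv); purely descriptive, nothing is executed here.
Step = Tuple[str, int, List[str]]


def plan_chain(old_master: str, v2_pods: List[str], sentinels: List[str],
               master_name: str = "mymaster") -> List[Step]:
    """Return the steps in a safe order: sentinels first, then the chain."""
    steps: List[Step] = []
    # 1. Drop the monitor on ALL sentinels (old and v2) so none of them can
    #    start a failover while the topology is being rewired.
    for sentinel in sentinels:
        steps.append((sentinel, 26379, ["SENTINEL", "REMOVE", master_name]))
    # 2. The first v2 pod replicates from the old master.
    head, rest = v2_pods[0], v2_pods[1:]
    steps.append((head, 6379, ["REPLICAOF", old_master, "6379"]))
    # 3. The remaining v2 pods replicate from the first v2 pod.
    for pod in rest:
        steps.append((pod, 6379, ["REPLICAOF", head, "6379"]))
    return steps


if __name__ == "__main__":
    plan = plan_chain(
        old_master="old-master",  # placeholder host name
        v2_pods=["v2-0", "v2-1", "v2-2"],
        sentinels=["old-s0", "old-s1", "v2-s0", "v2-s1", "v2-s2"],
    )
    for host, port, argv in plan:
        print(f"{host}:{port} {' '.join(argv)}")
```

Keeping the plan as data makes it easy to review the order before anything touches a live sentinel.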
Phase 4 — HAProxy cutover:
- Updated `kubernetes_config_map.haproxy` to point at
`redis-v2-{0,1,2}.redis-v2-headless` for both redis_master and
redis_sentinel backends (removed redis-node-{0,1}).
- Promoted v2-0 (`REPLICAOF NO ONE`) at the same time as the
ConfigMap apply so HAProxy's 1s health-check interval found a
role:master within a few seconds. Cutover disruption on HAProxy
rollout was brief; old clients naturally moved to new HAProxy pods
within the rolling update window.
- Re-enabled sentinel monitoring on v2 with `SENTINEL MONITOR
mymaster <hostname> 6379 2` after verifying `resolve-hostnames yes`
+ `announce-hostnames yes` were active — this ensures sentinel
stores the hostname (not resolved IP) in its rewritten config, so
pod-IP churn on restart doesn't break failover.
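
One way to confirm the hostname settings took effect is to check that the master address sentinel reports is not an IP literal. A minimal sketch using only the standard library; `stored_as_hostname` is a hypothetical helper, and the address passed in is assumed to come from `SENTINEL get-master-addr-by-name mymaster`.

```python
import ipaddress


def stored_as_hostname(addr: str) -> bool:
    """Return True if addr is a hostname, False if it is an IPv4/IPv6 literal."""
    try:
        ipaddress.ip_address(addr)
    except ValueError:
        return True   # not parseable as an IP, so treat it as a hostname
    return False      # an IP literal goes stale when the pod is rescheduled


if __name__ == "__main__":
    for addr in ("10.10.107.222", "redis-v2-0.redis-v2-headless"):
        print(addr, stored_as_hostname(addr))
```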
Phase 5 — chaos:
- Round 1: killed master v2-0 mid-probe. First run exposed the
sentinel IP-storage issue (stored 10.10.107.222, went stale on
restart) — ~12s probe disruption. Fixed hostname persistence and
re-MONITORed.
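- Round 2: killed new master v2-2 with hostnames correctly stored.
Sentinel elected v2-0 and HAProxy re-routed; 1 of 40 probes failed
over 60s. At that rate (one probe every ~1.5s, assuming even
spacing) a single failed probe bounds the outage to under 3s, which
meets the <3s target for user-visible disruption.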
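
The step from "N failed probes" to "seconds of disruption" can be made explicit. The probe script itself is not part of this commit, so the following is only a sketch under the assumption of evenly spaced probes; `outage_upper_bound_s` is a hypothetical name. A run of k failed probes sits between two successful ones, so the outage is shorter than (k + 1) probe intervals.

```python
from typing import Sequence


def outage_upper_bound_s(results: Sequence[bool], interval_s: float) -> float:
    """Upper bound, in seconds, on the longest outage seen by the probe.

    results holds one bool per probe (True = success), in time order.
    """
    longest = current = 0
    for ok in results:
        current = 0 if ok else current + 1
        longest = max(longest, current)
    if longest == 0:
        return 0.0
    # k consecutive failures lie between two successes (k + 1) intervals apart.
    return (longest + 1) * interval_s


if __name__ == "__main__":
    # Round 2 shape: 40 probes over 60s with exactly one failure.
    probes = [True] * 40
    probes[17] = False  # the position of the failure is arbitrary here
    print(outage_upper_bound_s(probes, interval_s=60 / 40))
```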
Phase 6 — Nextcloud simplification:
- `zzz-redis.config.php` no longer queries sentinel in-process —
just points at `redis-master.redis.svc.cluster.local`. Removed 20
lines of PHP. HAProxy handles master tracking transparently now
that it's scaled to 3 + PDB minAvailable=2.
Phase 7 step 1:
- `kubectl scale statefulset/redis-node --replicas=0` (transient —
TF removal in a 24h follow-up). Old PVCs `redis-data-redis-node-{0,1}`
preserved as cold rollback.
Docs:
- Rewrote `databases.md` Redis section to reflect post-cutover reality
and the sentinel hostname gotcha (so future sessions don't relearn it).
- `.claude/reference/service-catalog.md` entry updated.
The parallel-bootstrap race documented in the previous commit is still
worth watching — the init container now defaults to pod-0 as master
when no peer reports role:master-with-slaves, so fresh boots land in
a deterministic topology.
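
The default-to-pod-0 rule can be sketched as follows. The init container script is not shown in this commit, so this is an illustration of the rule as described, not the actual implementation: `parse_info_replication` and `pick_master` are hypothetical names, and "role:master-with-slaves" is read here as `role:master` with `connected_slaves` greater than zero in the `INFO replication` reply.

```python
from typing import Dict, Mapping


def parse_info_replication(text: str) -> Dict[str, str]:
    """Parse the key:value lines of a Redis `INFO replication` reply."""
    fields: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and the "# Replication" header
        key, sep, value = line.partition(":")
        if sep:
            fields[key] = value
    return fields


def pick_master(peer_info: Mapping[str, str], default: str) -> str:
    """Return the peer that is a master with replicas attached, else default.

    peer_info maps peer name to its raw INFO replication text; peers that
    could not be reached are simply absent from the mapping.
    """
    for peer in sorted(peer_info):
        info = parse_info_replication(peer_info[peer])
        if info.get("role") == "master" and int(info.get("connected_slaves", "0")) > 0:
            return peer
    return default


if __name__ == "__main__":
    fresh = "# Replication\r\nrole:master\r\nconnected_slaves:0\r\n"
    # On a fresh boot every pod reports itself as a lone master.
    print(pick_master({"redis-v2-1": fresh, "redis-v2-2": fresh}, "redis-v2-0"))
```

Requiring attached replicas is what breaks the tie on a parallel bootstrap, where every pod briefly reports `role:master`.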
Closes: code-7n4
Closes: code-9y6
Closes: code-cnf
Closes: code-tc4
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
parent f6685a23a9
commit b6cd83f85a
4 changed files with 25 additions and 41 deletions
@@ -29,35 +29,19 @@ nextcloud:
   configs:
     zzz-redis.config.php: |
       <?php
-      // Redis with Sentinel-based master discovery
-      // Queries Sentinel to find the current master, falls back to HAProxy service
-      // which health-checks Redis nodes and routes only to the master.
-      $sentinels = [
-        ['redis-node-0.redis-headless.redis.svc.cluster.local', 26379],
-        ['redis-node-1.redis-headless.redis.svc.cluster.local', 26379],
-      ];
-      // Fallback: HAProxy master-only service (safe even if Sentinel is unavailable)
-      $redisHost = 'redis-master.redis.svc.cluster.local';
-      $redisPort = 6379;
-      foreach ($sentinels as [$sHost, $sPort]) {
-        try {
-          $s = new Redis();
-          if ($s->connect($sHost, $sPort, 0.5)) {
-            $master = $s->rawCommand('SENTINEL', 'get-master-addr-by-name', 'mymaster');
-            if ($master) {
-              $redisHost = $master[0];
-              $redisPort = (int)$master[1];
-              break;
-            }
-          }
-        } catch (\Exception $e) {}
-      }
+      // Redis via HAProxy master-only service. HAProxy (3 replicas, PDB
+      // minAvailable=2) health-checks all v2 pods via `INFO replication` and
+      // routes to the current role:master. Sentinel failover takes <30s, and
+      // HAProxy detects the new master via its 1s tcp-check interval and
+      // starts routing within ~3s of detection. Removed the old in-process
+      // sentinel-query loop on 2026-04-19 after the Redis rework — see
+      // beads code-v2b and infra/docs/architecture/databases.md.
       $CONFIG = array(
         'memcache.distributed' => '\\OC\\Memcache\\Redis',
         'memcache.locking' => '\\OC\\Memcache\\Redis',
         'redis' => array(
-          'host' => $redisHost,
-          'port' => $redisPort,
+          'host' => 'redis-master.redis.svc.cluster.local',
+          'port' => 6379,
           'password' => '',
           'timeout' => 1.5,
           'read_timeout' => 1.5,