[redis] Phase 1+2: parallel redis-v2 StatefulSet + Prometheus alerts

Builds the target 3-node raw StatefulSet alongside the legacy Bitnami Helm
release so data can migrate via REPLICAOF during a future short maintenance
window (Phases 3-7). No traffic touches the new cluster yet; HAProxy still
points at redis-node-{0,1}.

Architecture:
- 3 redis pods, each co-locating redis + sentinel + oliver006/redis_exporter
- podManagementPolicy=Parallel + init container that writes a fresh
  sentinel.conf on every boot by probing peer sentinels and redis for the
  consensus master (priority: sentinel vote > role:master with slaves >
  pod-0 fallback). Kills the stale-state bug that broke sentinel on Apr 19 PM.
- redis.conf `include /shared/replica.conf`: the init container writes
  `replicaof <master> 6379` for non-master pods so they come up already in
  the correct role. No bootstrap race.
- master+replica memory 768Mi (was 512Mi) for concurrent BGSAVE+AOF fork
  COW headroom. auto-aof-rewrite-percentage=200 reduces rewrite churn.
- RDB (save 900 1 / 300 100 / 60 10000) + AOF appendfsync=everysec.
- PodDisruptionBudget minAvailable=2.
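
The master-selection priority described for the init container (sentinel vote > role:master with slaves > pod-0 fallback) can be sketched roughly as below. This is an illustration only, not the init container's actual script: the `PeerProbe` shape, the `pick_master` name, and the use of a simple plurality as the "consensus" are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Sequence


@dataclass(frozen=True)
class PeerProbe:
    """What a probe of one peer pod returned (hypothetical shape)."""
    host: str
    sentinel_master: Optional[str]  # master named by the peer's sentinel, if it answered
    role: Optional[str]             # "master" or "slave" from INFO replication, if redis answered
    connected_slaves: int = 0


def pick_master(probes: Sequence[PeerProbe], fallback: str) -> str:
    """Apply the priority: sentinel vote > role:master with slaves > fallback."""
    # 1. Sentinel vote: count which master address the reachable sentinels name.
    votes: Dict[str, int] = {}
    for probe in probes:
        if probe.sentinel_master:
            votes[probe.sentinel_master] = votes.get(probe.sentinel_master, 0) + 1
    if votes:
        # Plurality wins (an assumption); sorting first makes ties deterministic.
        return max(sorted(votes), key=lambda addr: votes[addr])

    # 2. No sentinel answered: trust a redis that says it is master and has replicas.
    for probe in probes:
        if probe.role == "master" and probe.connected_slaves > 0:
            return probe.host

    # 3. Nothing conclusive (for example a cold start): fall back to pod-0.
    return fallback
```

A pod that is not the chosen master would then write `replicaof <master> 6379` into /shared/replica.conf, while the chosen master leaves that file empty.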
Also:
- HAProxy scaled 2→3 replicas + PodDisruptionBudget minAvailable=2, since
  Phase 6 drops Nextcloud's sentinel-query fallback and HAProxy becomes
  the sole client-facing path for all 17 consumers.
- New Prometheus alerts: RedisMemoryPressure, RedisEvictions,
  RedisReplicationLagHigh, RedisForkLatencyHigh, RedisAOFRewriteLong,
  RedisReplicasMissing. Updated RedisDown to cover both StatefulSets
  during the migration.
- databases.md updated to describe the interim parallel-cluster state.
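
The updated RedisDown rule sums ready replicas over both StatefulSets and appends `or on() vector(0)`. A small model of why that fallback matters, with hypothetical function names: in PromQL, a `sum(...)` over no matching series returns an empty result, and an empty result compared with `< 1` never fires.

```python
from typing import Dict, Optional


def ready_total(ready_by_statefulset: Dict[str, float]) -> Optional[float]:
    """Model sum(...): None stands for PromQL's empty result when no series match."""
    if not ready_by_statefulset:
        return None
    return sum(ready_by_statefulset.values())


def redis_down(ready_by_statefulset: Dict[str, float], with_fallback: bool = True) -> bool:
    """Model (sum(...) or on() vector(0)) < 1."""
    total = ready_total(ready_by_statefulset)
    if total is None:
        if not with_fallback:
            # Without the fallback there is no sample to compare, so nothing fires.
            return False
        total = 0.0  # `or on() vector(0)` substitutes a zero sample
    return total < 1
```

In this model the alert stays quiet while either StatefulSet has a ready pod, and it fires both when every pod is unready and when the metric series disappear altogether.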
Verified live: redis-v2-0 master, redis-v2-{1,2} replicas, master_link_status
up, all 3 sentinels agree on get-master-addr-by-name. All new alerts loaded
into Prometheus and inactive.

Beads: code-v2b (still in progress; Phases 3-7 await the maintenance window).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
parent 6ee283c2f0
commit 150f196095
3 changed files with 578 additions and 6 deletions
@@ -1355,12 +1355,65 @@ serverFiles:
   annotations:
     summary: "PostgreSQL pod {{ $labels.pod }} is not ready"
 - alert: RedisDown
-  expr: kube_statefulset_status_replicas_ready{namespace="redis", statefulset="redis-node"} < 1
+  # Covers both the legacy Bitnami StatefulSet (redis-node) and the
+  # new raw StatefulSet (redis-v2) during the 2026-04-19 migration.
+  # Drop the redis-node branch after helm_release.redis is removed.
+  expr: (sum(kube_statefulset_status_replicas_ready{namespace="redis", statefulset=~"redis-node|redis-v2"}) or on() vector(0)) < 1
   for: 5m
   labels:
     severity: critical
   annotations:
-    summary: "Redis has no ready replicas"
+    summary: "Redis has no ready replicas across both clusters"
+- alert: RedisMemoryPressure
+  expr: redis_memory_used_bytes{namespace="redis"} / redis_memory_max_bytes{namespace="redis"} > 0.85
+  for: 5m
+  labels:
+    severity: warning
+  annotations:
+    summary: "Redis pod {{ $labels.pod }} using {{ $value | humanizePercentage }} of maxmemory — eviction imminent"
+- alert: RedisEvictions
+  # allkeys-lru is configured so evictions under cache pressure are
+  # expected, but sustained evictions mean we're thrashing — raise it.
+  expr: rate(redis_evicted_keys_total{namespace="redis"}[5m]) > 0
+  for: 5m
+  labels:
+    severity: warning
+  annotations:
+    summary: "Redis pod {{ $labels.pod }} evicting keys ({{ $value }} keys/s)"
+- alert: RedisReplicationLagHigh
+  expr: redis_connected_slave_lag_seconds{namespace="redis"} > 30
+  for: 3m
+  labels:
+    severity: warning
+  annotations:
+    summary: "Redis replica {{ $labels.slave_ip }} lagging {{ $value }}s behind master"
+- alert: RedisForkLatencyHigh
+  # latest_fork_usec > 500ms means BGSAVE fork is stalling the main
+  # thread long enough to drop client requests. COW pressure or
+  # constrained memory headroom are the usual causes.
+  expr: redis_latest_fork_usec{namespace="redis"} > 500000
+  for: 0m
+  labels:
+    severity: warning
+  annotations:
+    summary: "Redis pod {{ $labels.pod }} fork took {{ $value }}us (>500ms) — investigate memory headroom"
+- alert: RedisAOFRewriteLong
+  expr: redis_aof_rewrite_in_progress{namespace="redis"} == 1
+  for: 10m
+  labels:
+    severity: warning
+  annotations:
+    summary: "Redis pod {{ $labels.pod }} AOF rewrite running >10m — COW memory risk, investigate"
+- alert: RedisReplicasMissing
+  # redis-v2 StatefulSet should always have 3 replicas connected to
+  # the master (2 replicas + itself). <2 connected_slaves means one
+  # replica is unreachable or still syncing.
+  expr: redis_connected_slaves{namespace="redis", pod=~"redis-v2-.*"} < 2 and redis_instance_info{namespace="redis", pod=~"redis-v2-.*", role="master"} == 1
+  for: 10m
+  labels:
+    severity: warning
+  annotations:
+    summary: "Redis master {{ $labels.pod }} has only {{ $value }} connected replicas (expected 2)"
 - alert: HeadscaleDown
   expr: (kube_deployment_status_replicas_available{namespace="headscale"} or on() vector(0)) < 1
   for: 5m