[redis] Phase 1+2: parallel redis-v2 StatefulSet + Prometheus alerts

Builds the target 3-node raw StatefulSet alongside the legacy Bitnami Helm release so data can migrate via REPLICAOF during a future short maintenance window (Phase 3-7). No traffic touches the new cluster yet — HAProxy still points at redis-node-{0,1}. Architecture: - 3 redis pods, each co-locating redis + sentinel + oliver006/redis_exporter - podManagementPolicy=Parallel + init container that writes fresh sentinel.conf on every boot by probing peer sentinels and redis for consensus master (priority: sentinel vote > role:master with slaves > pod-0 fallback). Kills the stale-state bug that broke sentinel on Apr 19 PM. - redis.conf `include /shared/replica.conf` — init container writes `replicaof <master> 6379` for non-master pods so they come up already in the correct role. No bootstrap race. - master+replica memory 768Mi (was 512Mi) for concurrent BGSAVE+AOF fork COW headroom. auto-aof-rewrite-percentage=200 tunes down rewrite churn. - RDB (save 900 1 / 300 100 / 60 10000) + AOF appendfsync=everysec. - PodDisruptionBudget minAvailable=2. Also: - HAProxy scaled 2→3 replicas + PodDisruptionBudget minAvailable=2, since Phase 6 drops Nextcloud's sentinel-query fallback and HAProxy becomes the sole client-facing path for all 17 consumers. - New Prometheus alerts: RedisMemoryPressure, RedisEvictions, RedisReplicationLagHigh, RedisForkLatencyHigh, RedisAOFRewriteLong, RedisReplicasMissing. Updated RedisDown to cover both statefulsets during the migration. - databases.md updated to describe the interim parallel-cluster state. Verified live: redis-v2-0 master, redis-v2-{1,2} replicas, master_link_status up, all 3 sentinels agree on get-master-addr-by-name. All new alerts loaded into Prometheus and inactive. Beads: code-v2b (still in progress — Phase 3-7 await maintenance window). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 15:23:05 +00:00 · 2026-04-19 15:23:05 +00:00 · 150f196095
commit 150f196095
parent 6ee283c2f0
3 changed files with 578 additions and 6 deletions
--- a/stacks/monitoring/modules/monitoring/prometheus_chart_values.tpl
+++ b/stacks/monitoring/modules/monitoring/prometheus_chart_values.tpl
@ -1355,12 +1355,65 @@ serverFiles:
            annotations:
              summary: "PostgreSQL pod {{ $labels.pod }} is not ready"
          - alert: RedisDown
-            expr: kube_statefulset_status_replicas_ready{namespace="redis", statefulset="redis-node"} < 1
+            # Covers both the legacy Bitnami StatefulSet (redis-node) and the
+            # new raw StatefulSet (redis-v2) during the 2026-04-19 migration.
+            # Drop the redis-node branch after helm_release.redis is removed.
+            expr: (sum(kube_statefulset_status_replicas_ready{namespace="redis", statefulset=~"redis-node|redis-v2"}) or on() vector(0)) < 1
            for: 5m
            labels:
              severity: critical
            annotations:
-              summary: "Redis has no ready replicas"
+              summary: "Redis has no ready replicas across both clusters"
+          - alert: RedisMemoryPressure
+            expr: redis_memory_used_bytes{namespace="redis"} / redis_memory_max_bytes{namespace="redis"} > 0.85
+            for: 5m
+            labels:
+              severity: warning
+            annotations:
+              summary: "Redis pod {{ $labels.pod }} using {{ $value | humanizePercentage }} of maxmemory — eviction imminent"
+          - alert: RedisEvictions
+            # allkeys-lru is configured so evictions under cache pressure are
+            # expected, but sustained evictions mean we're thrashing — raise it.
+            expr: rate(redis_evicted_keys_total{namespace="redis"}[5m]) > 0
+            for: 5m
+            labels:
+              severity: warning
+            annotations:
+              summary: "Redis pod {{ $labels.pod }} evicting keys ({{ $value }} keys/s)"
+          - alert: RedisReplicationLagHigh
+            expr: redis_connected_slave_lag_seconds{namespace="redis"} > 30
+            for: 3m
+            labels:
+              severity: warning
+            annotations:
+              summary: "Redis replica {{ $labels.slave_ip }} lagging {{ $value }}s behind master"
+          - alert: RedisForkLatencyHigh
+            # latest_fork_usec > 500ms means BGSAVE fork is stalling the main
+            # thread long enough to drop client requests. COW pressure or
+            # constrained memory headroom are the usual causes.
+            expr: redis_latest_fork_usec{namespace="redis"} > 500000
+            for: 0m
+            labels:
+              severity: warning
+            annotations:
+              summary: "Redis pod {{ $labels.pod }} fork took {{ $value }}us (>500ms) — investigate memory headroom"
+          - alert: RedisAOFRewriteLong
+            expr: redis_aof_rewrite_in_progress{namespace="redis"} == 1
+            for: 10m
+            labels:
+              severity: warning
+            annotations:
+              summary: "Redis pod {{ $labels.pod }} AOF rewrite running >10m — COW memory risk, investigate"
+          - alert: RedisReplicasMissing
+            # redis-v2 StatefulSet should always have 3 replicas connected to
+            # the master (2 replicas + itself). <2 connected_slaves means one
+            # replica is unreachable or still syncing.
+            expr: redis_connected_slaves{namespace="redis", pod=~"redis-v2-.*"} < 2 and redis_instance_info{namespace="redis", pod=~"redis-v2-.*", role="master"} == 1
+            for: 10m
+            labels:
+              severity: warning
+            annotations:
+              summary: "Redis master {{ $labels.pod }} has only {{ $value }} connected replicas (expected 2)"
          - alert: HeadscaleDown
            expr: (kube_deployment_status_replicas_available{namespace="headscale"} or on() vector(0)) < 1
            for: 5m