From 150f19609562a0c284f2a42eb981556a9354d84b Mon Sep 17 00:00:00 2001 From: Viktor Barzin Date: Sun, 19 Apr 2026 15:23:05 +0000 Subject: [PATCH] [redis] Phase 1+2: parallel redis-v2 StatefulSet + Prometheus alerts MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Builds the target 3-node raw StatefulSet alongside the legacy Bitnami Helm release so data can migrate via REPLICAOF during a future short maintenance window (Phase 3-7). No traffic touches the new cluster yet — HAProxy still points at redis-node-{0,1}. Architecture: - 3 redis pods, each co-locating redis + sentinel + oliver006/redis_exporter - podManagementPolicy=Parallel + init container that writes fresh sentinel.conf on every boot by probing peer sentinels and redis for consensus master (priority: sentinel vote > role:master with slaves > pod-0 fallback). Kills the stale-state bug that broke sentinel on Apr 19 PM. - redis.conf `include /shared/replica.conf` — init container writes `replicaof <master> 6379` for non-master pods so they come up already in the correct role. No bootstrap race. - master+replica memory 768Mi (was 512Mi) for concurrent BGSAVE+AOF fork COW headroom. auto-aof-rewrite-percentage=200 tunes down rewrite churn. - RDB (save 900 1 / 300 100 / 60 10000) + AOF appendfsync=everysec. - PodDisruptionBudget minAvailable=2. Also: - HAProxy scaled 2→3 replicas + PodDisruptionBudget minAvailable=2, since Phase 6 drops Nextcloud's sentinel-query fallback and HAProxy becomes the sole client-facing path for all 17 consumers. - New Prometheus alerts: RedisMemoryPressure, RedisEvictions, RedisReplicationLagHigh, RedisForkLatencyHigh, RedisAOFRewriteLong, RedisReplicasMissing. Updated RedisDown to cover both statefulsets during the migration. - databases.md updated to describe the interim parallel-cluster state. Verified live: redis-v2-0 master, redis-v2-{1,2} replicas, master_link_status up, all 3 sentinels agree on get-master-addr-by-name. All new alerts loaded into Prometheus and inactive. Beads: code-v2b (still in progress — Phase 3-7 await maintenance window). Co-Authored-By: Claude Opus 4.7 (1M context) --- docs/architecture/databases.md | 28 +- .../monitoring/prometheus_chart_values.tpl | 57 +- stacks/redis/modules/redis/main.tf | 499 +++++++++++++++++- 3 files changed, 578 insertions(+), 6 deletions(-) diff --git a/docs/architecture/databases.md b/docs/architecture/databases.md index 5500ed32..b8fd20f8 100644 --- a/docs/architecture/databases.md +++ b/docs/architecture/databases.md @@ -120,9 +120,31 @@ graph TB ### Redis -- Shared instance at `redis.redis.svc.cluster.local` -- Used for caching and session storage -- No persistence (ephemeral) +Single shared cluster for all 17 consumers (Immich, Authentik, Nextcloud, Paperless, Dawarich Sidekiq, Traefik, etc.). HAProxy (3 replicas, PDB minAvailable=2) is the sole client-facing path — clients talk only to `redis-master.redis.svc.cluster.local:6379` and HAProxy health-checks backends via `INFO replication`, routing only to `role:master`. + +**Current state (as of 2026-04-19, interim — parallel cluster during rework)**: + +| Cluster | Pods | Source | Purpose | |---|---|---|---| | Legacy `redis-node-*` | 1 master + 1 replica (2 sentinels) | Bitnami Helm chart v25.3.2 | Serving live traffic via HAProxy | | New `redis-v2-*` | 3 pods, each co-locating redis + sentinel + exporter | Raw `kubernetes_stateful_set_v1` with `redis:7.4-alpine` | Standing by for REPLICAOF-based cutover | + +Both clusters live in the `redis` namespace.
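A quick way to inspect the interim two-cluster state from the CLI (a sketch; pod and container names assume the layout above):

```sh
# Ready counts for both StatefulSets
kubectl get statefulset -n redis

# Ask each pod directly which role it holds (the redis container is assumed to be named "redis" in both clusters)
for p in redis-node-0 redis-node-1 redis-v2-0 redis-v2-1 redis-v2-2; do
  printf '%s -> ' "$p"
  kubectl exec -n redis "$p" -c redis -- redis-cli INFO replication | grep '^role:'
done

# What clients see through HAProxy (for now this should be the legacy cluster's master)
kubectl run redis-probe -n redis --rm -i --image=redis:7.4-alpine --restart=Never -- \
  redis-cli -h redis-master.redis.svc.cluster.local INFO replication
```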
See `infra/stacks/redis/modules/redis/main.tf` (end-state; legacy `helm_release.redis` + `kubernetes_stateful_set_v1.redis_v2` coexist until cutover). + +**Target architecture (post-cutover)**: + +- 3 redis pods + 3 co-located sentinels (quorum=2). Odd sentinel count eliminates split-brain. +- `podManagementPolicy=Parallel` + init container that regenerates `sentinel.conf` on every boot by probing peer sentinels for consensus master. No persistent sentinel runtime state — can't drift out of sync with reality (root cause of 2026-04-19 PM incident). +- redis.conf has `include /shared/replica.conf`; the init container writes either an empty file (master) or `replicaof <master> 6379` (replicas), so pods come up already in the right role — no bootstrap race. +- Memory: master + replicas `requests=limits=768Mi`. Concurrent BGSAVE + AOF-rewrite fork can double RSS via COW, so headroom must cover it. `auto-aof-rewrite-percentage=200` + `auto-aof-rewrite-min-size=128mb` tune down rewrite frequency. +- Persistence: RDB (`save 900 1 / 300 100 / 60 10000`) + AOF `appendfsync=everysec`. Disk-wear analysis on 2026-04-19 (sdb Samsung 850 EVO 1TB, 150 TBW): Redis contributes <1 GB/day cluster-wide → 40+ year runway at the 20% TBW budget. +- `maxmemory=640mb` (83% of 768Mi limit), `maxmemory-policy=allkeys-lru`. +- Weekly RDB backup to NFS (`/srv/nfs/redis-backup/`, Sunday 03:00, 28-day retention, pushes Pushgateway metrics). +- Auth disabled this phase — NetworkPolicy is the isolation layer. Enabling `requirepass` + rolling creds to all 17 clients is a planned follow-up. + +**Observability** (redis-v2 only): `oliver006/redis_exporter:v1.62.0` sidecar per pod on port 9121, auto-scraped via Prometheus pod annotation. Alerts: `RedisDown`, `RedisMemoryPressure`, `RedisEvictions`, `RedisReplicationLagHigh`, `RedisForkLatencyHigh`, `RedisAOFRewriteLong`, `RedisReplicasMissing`, `RedisBackupStale`, `RedisBackupNeverSucceeded`. + +**Why this design** — three incidents in April 2026 drove the rework: (a) 2026-04-04 service selector routed reads+writes to master+replica causing `READONLY` errors; (b) 2026-04-19 AM master OOMKilled during BGSAVE+PSYNC with the 256Mi limit too tight for a 204 MB working set under COW amplification; (c) 2026-04-19 PM sentinel runtime state drifted (only 2 sentinels, no majority) and routed writes to a slave. See beads epic `code-v2b` for the full plan and linked challenger analyses. ### SQLite (Per-App) diff --git a/stacks/monitoring/modules/monitoring/prometheus_chart_values.tpl b/stacks/monitoring/modules/monitoring/prometheus_chart_values.tpl index 16fe8c3b..c202c589 100755 --- a/stacks/monitoring/modules/monitoring/prometheus_chart_values.tpl +++ b/stacks/monitoring/modules/monitoring/prometheus_chart_values.tpl @@ -1355,12 +1355,65 @@ serverFiles: annotations: summary: "PostgreSQL pod {{ $labels.pod }} is not ready" - alert: RedisDown - expr: kube_statefulset_status_replicas_ready{namespace="redis", statefulset="redis-node"} < 1 + # Covers both the legacy Bitnami StatefulSet (redis-node) and the + # new raw StatefulSet (redis-v2) during the 2026-04-19 migration. + # Drop the redis-node branch after helm_release.redis is removed.
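            # Note: the sum means this fires only when NEITHER StatefulSet has a ready
            # pod; `or on() vector(0)` keeps the alert usable even if the
            # kube_statefulset series disappear entirely.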
+ expr: (sum(kube_statefulset_status_replicas_ready{namespace="redis", statefulset=~"redis-node|redis-v2"}) or on() vector(0)) < 1 for: 5m labels: severity: critical annotations: - summary: "Redis has no ready replicas" + summary: "Redis has no ready replicas across both clusters" + - alert: RedisMemoryPressure + expr: redis_memory_used_bytes{namespace="redis"} / redis_memory_max_bytes{namespace="redis"} > 0.85 + for: 5m + labels: + severity: warning + annotations: + summary: "Redis pod {{ $labels.pod }} using {{ $value | humanizePercentage }} of maxmemory — eviction imminent" + - alert: RedisEvictions + # allkeys-lru is configured so evictions under cache pressure are + # expected, but sustained evictions mean we're thrashing — raise it. + expr: rate(redis_evicted_keys_total{namespace="redis"}[5m]) > 0 + for: 5m + labels: + severity: warning + annotations: + summary: "Redis pod {{ $labels.pod }} evicting keys ({{ $value }} keys/s)" + - alert: RedisReplicationLagHigh + expr: redis_connected_slave_lag_seconds{namespace="redis"} > 30 + for: 3m + labels: + severity: warning + annotations: + summary: "Redis replica {{ $labels.slave_ip }} lagging {{ $value }}s behind master" + - alert: RedisForkLatencyHigh + # latest_fork_usec > 500ms means BGSAVE fork is stalling the main + # thread long enough to drop client requests. COW pressure or + # constrained memory headroom are the usual causes. + expr: redis_latest_fork_usec{namespace="redis"} > 500000 + for: 0m + labels: + severity: warning + annotations: + summary: "Redis pod {{ $labels.pod }} fork took {{ $value }}us (>500ms) — investigate memory headroom" + - alert: RedisAOFRewriteLong + expr: redis_aof_rewrite_in_progress{namespace="redis"} == 1 + for: 10m + labels: + severity: warning + annotations: + summary: "Redis pod {{ $labels.pod }} AOF rewrite running >10m — COW memory risk, investigate" + - alert: RedisReplicasMissing + # redis-v2 StatefulSet should always have 3 replicas connected to + # the master (2 replicas + itself). <2 connected_slaves means one + # replica is unreachable or still syncing. + expr: redis_connected_slaves{namespace="redis", pod=~"redis-v2-.*"} < 2 and redis_instance_info{namespace="redis", pod=~"redis-v2-.*", role="master"} == 1 + for: 10m + labels: + severity: warning + annotations: + summary: "Redis master {{ $labels.pod }} has only {{ $value }} connected replicas (expected 2)" - alert: HeadscaleDown expr: (kube_deployment_status_replicas_available{namespace="headscale"} or on() vector(0)) < 1 for: 5m diff --git a/stacks/redis/modules/redis/main.tf b/stacks/redis/modules/redis/main.tf index 5aadb11c..efa6c8b7 100644 --- a/stacks/redis/modules/redis/main.tf +++ b/stacks/redis/modules/redis/main.tf @@ -206,7 +206,11 @@ resource "kubernetes_deployment" "haproxy" { } } spec { - replicas = 2 + # 3 replicas + PDB minAvailable=2 (see kubernetes_pod_disruption_budget_v1.redis_haproxy). + # After Nextcloud drops its sentinel fallback in Phase 6 of the 2026-04-19 redis + # rework, HAProxy is the sole client-facing path for all 17 redis consumers, so + # it needs HA equivalent to other critical-path pods (Traefik, Authentik, PgBouncer). + replicas = 3 selector { match_labels = { app = "redis-haproxy" @@ -336,6 +340,499 @@ module "nfs_backup_host" { nfs_path = "/srv/nfs/redis-backup" } +#### Redis v2 — parallel 3-node raw StatefulSet (target architecture) +# +# Built alongside the Bitnami helm_release.redis so data can migrate via +# REPLICAOF with <60s cutover downtime (see session plan / beads code-v2b). 
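# Cutover sketch (Phases 3-7). Illustrative only: the authoritative runbook lives in
# beads code-v2b, and the legacy hostnames below are assumptions based on Bitnami
# chart naming, not verified here.
#   1. On redis-v2-0: REPLICAOF redis-node-0.redis-headless.redis.svc.cluster.local 6379
#      (redis-v2-1/2 already replicate from redis-v2-0, so the whole v2 tree fills from legacy)
#   2. Wait until INFO replication on redis-v2-0 shows master_link_status:up and offsets converge.
#   3. Maintenance window: repoint HAProxy backends from redis-node-* to redis-v2-* and reload.
#   4. On redis-v2-0: REPLICAOF NO ONE to promote it, then verify clients reconnect through HAProxy.
#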
+# +# Pattern: MySQL standalone precedent (stacks/dbaas/modules/dbaas/main.tf, +# 2026-04-16 migration) — raw kubernetes_stateful_set_v1 + official image, +# no Bitnami Helm chart (deprecated by Broadcom Aug 2025; atomic-Helm trap +# caused the 2026-04-04 memory-bump deadlock). +# +# Design choices driven by incident cluster in April 2026: +# - 3 sentinels (odd count, quorum=2) — eliminates the split-brain class +# that caused the 2026-04-19 PM incident (2 sentinels, stale master state). +# - Init container regenerates sentinel.conf on every boot by probing +# peers for role:master — no persistent sentinel runtime state, so stale +# entries can never resurface across pod restarts. +# - podManagementPolicy=Parallel — all 3 pods start together, avoiding the +# "sentinel-0 elects before -2 booted" ordering bug. +# - Memory 768Mi (up from 512Mi) — concurrent BGSAVE + AOF-rewrite fork can +# double RSS via COW. auto-aof-rewrite-percentage 200 + min-size 128mb +# tune down rewrite frequency. +# - Persistence: RDB snapshots + AOF everysec. Measured <1 GB/day write +# volume (2026-04-19 disk-wear analysis) → 40+ year SSD runway. +# - HAProxy remains sole client-facing path for all 17 consumers. + +resource "kubernetes_config_map" "redis_v2_conf" { + metadata { + name = "redis-v2-conf" + namespace = kubernetes_namespace.redis.metadata[0].name + } + data = { + "redis.conf" = <<-EOT + bind 0.0.0.0 -::* + port 6379 + protected-mode no + dir /data + + maxmemory 640mb + maxmemory-policy allkeys-lru + + save 900 1 + save 300 100 + save 60 10000 + rdbcompression yes + rdbchecksum yes + stop-writes-on-bgsave-error no + + appendonly yes + appendfsync everysec + no-appendfsync-on-rewrite no + auto-aof-rewrite-percentage 200 + auto-aof-rewrite-min-size 128mb + aof-load-truncated yes + aof-use-rdb-preamble yes + + replica-read-only yes + replica-serve-stale-data yes + + timeout 0 + tcp-keepalive 300 + tcp-backlog 511 + databases 16 + + loglevel notice + + # Included last so `replicaof` directive written by the init container + # overrides the "standalone master" default. Prevents the parallel- + # bootstrap race where all 3 pods claim role:master simultaneously. + include /shared/replica.conf + EOT + } +} + +resource "kubernetes_config_map" "redis_v2_sentinel_bootstrap" { + metadata { + name = "redis-v2-sentinel-bootstrap" + namespace = kubernetes_namespace.redis.metadata[0].name + } + data = { + "init.sh" = <<-EOT + #!/bin/sh + set -eu + + HOSTNAME=$(hostname) + MY_NUM=$${HOSTNAME##*-} + MY_DNS="$HOSTNAME.redis-v2-headless.redis.svc.cluster.local" + MASTER_HOST="" + + echo "=== Redis v2 bootstrap ===" + echo "hostname: $HOSTNAME (index $MY_NUM)" + + # Priority 1: ask peer sentinels for the consensus master. Covers the + # "steady-state pod restart" case — sentinels already agree on reality + # and a restarting pod should join that topology. 
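      # A reachable peer answers SENTINEL get-master-addr-by-name with two lines when
      # piped (host, then port), e.g.:
      #   redis-v2-0.redis-v2-headless.redis.svc.cluster.local
      #   6379
      # head -n1 keeps only the host, which the case patterns below match. Ties in the
      # vote count break toward the lowest pod index.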
votes_0=0; votes_1=0; votes_2=0; votes_total=0 + for i in 0 1 2; do + if [ "$i" = "$MY_NUM" ]; then continue; fi + peer="redis-v2-$i.redis-v2-headless.redis.svc.cluster.local" + reply=$(redis-cli -h "$peer" -p 26379 -t 2 SENTINEL get-master-addr-by-name mymaster 2>/dev/null | head -n1 || true) + echo "sentinel probe $peer: master=$${reply:-unreachable}" + case "$reply" in + *redis-v2-0*) votes_0=$((votes_0 + 1)); votes_total=$((votes_total + 1)) ;; + *redis-v2-1*) votes_1=$((votes_1 + 1)); votes_total=$((votes_total + 1)) ;; + *redis-v2-2*) votes_2=$((votes_2 + 1)); votes_total=$((votes_total + 1)) ;; + esac + done + if [ "$votes_total" -gt 0 ]; then + if [ "$votes_0" -ge "$votes_1" ] && [ "$votes_0" -ge "$votes_2" ] && [ "$votes_0" -gt 0 ]; then + MASTER_HOST="redis-v2-0.redis-v2-headless.redis.svc.cluster.local" + elif [ "$votes_1" -ge "$votes_2" ] && [ "$votes_1" -gt 0 ]; then + MASTER_HOST="redis-v2-1.redis-v2-headless.redis.svc.cluster.local" + elif [ "$votes_2" -gt 0 ]; then + MASTER_HOST="redis-v2-2.redis-v2-headless.redis.svc.cluster.local" + fi + [ -n "$MASTER_HOST" ] && echo "sentinel vote winner: $MASTER_HOST" + fi + + # Priority 2: look for a peer redis that's a master WITH at least one + # replica connected. "Standalone master" peers (bootstrap race) are + # skipped — connected_slaves=0 is ambiguous. + if [ -z "$MASTER_HOST" ]; then + for i in 0 1 2; do + if [ "$i" = "$MY_NUM" ]; then continue; fi + peer="redis-v2-$i.redis-v2-headless.redis.svc.cluster.local" + info=$(redis-cli -h "$peer" -t 2 INFO replication 2>/dev/null || true) + role=$(echo "$info" | awk -F: '/^role:/ {gsub(/\r/,""); print $2; exit}') + slaves=$(echo "$info" | awk -F: '/^connected_slaves:/ {gsub(/\r/,""); print $2; exit}') + echo "redis probe $peer: role=$${role:-unreachable} slaves=$${slaves:-0}" + if [ "$role" = "master" ] && [ "$${slaves:-0}" -gt 0 ]; then + MASTER_HOST="$peer" + break + fi + done + fi + + # Priority 3: deterministic fallback — pod -0 is always the bootstrap + # master on a fresh cluster. All sentinels converge here, no race. + if [ -z "$MASTER_HOST" ]; then + MASTER_HOST="redis-v2-0.redis-v2-headless.redis.svc.cluster.local" + echo "no master found via probes — bootstrap default: $MASTER_HOST" + fi + + cat > /shared/sentinel.conf <<EOF + port 26379 + sentinel resolve-hostnames yes + sentinel announce-hostnames yes + sentinel monitor mymaster $MASTER_HOST 6379 2 + EOF + + # Also write replica.conf: empty for the master, `replicaof $MASTER_HOST 6379` for everyone else. + # This way pods come up already in the right role — no post-start race.
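      # Resulting /shared/replica.conf (illustrative):
      #   on the elected master: an empty file, so redis-server starts as master
      #   on every other pod:    a single "replicaof $MASTER_HOST 6379" line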
+ if [ "$MY_DNS" = "$MASTER_HOST" ]; then + : > /shared/replica.conf + echo "role: master" + else + echo "replicaof $MASTER_HOST 6379" > /shared/replica.conf + echo "role: replica of $MASTER_HOST" + fi + + echo "=== bootstrap complete ===" + cat /shared/sentinel.conf + echo "--- replica.conf ---" + cat /shared/replica.conf + EOT + } +} + +resource "kubernetes_service" "redis_v2_headless" { + metadata { + name = "redis-v2-headless" + namespace = kubernetes_namespace.redis.metadata[0].name + labels = { + app = "redis-v2" + } + } + spec { + cluster_ip = "None" + publish_not_ready_addresses = true + selector = { + app = "redis-v2" + } + port { + name = "redis" + port = 6379 + } + port { + name = "sentinel" + port = 26379 + } + port { + name = "exporter" + port = 9121 + } + } +} + +resource "kubernetes_stateful_set_v1" "redis_v2" { + metadata { + name = "redis-v2" + namespace = kubernetes_namespace.redis.metadata[0].name + labels = { + app = "redis-v2" + } + } + spec { + service_name = kubernetes_service.redis_v2_headless.metadata[0].name + replicas = 3 + pod_management_policy = "Parallel" + + selector { + match_labels = { + app = "redis-v2" + } + } + + template { + metadata { + labels = { + app = "redis-v2" + } + annotations = { + "prometheus.io/scrape" = "true" + "prometheus.io/port" = "9121" + "checksum/conf" = sha256(kubernetes_config_map.redis_v2_conf.data["redis.conf"]) + "checksum/bootstrap" = sha256(kubernetes_config_map.redis_v2_sentinel_bootstrap.data["init.sh"]) + } + } + spec { + termination_grace_period_seconds = 30 + + affinity { + pod_anti_affinity { + preferred_during_scheduling_ignored_during_execution { + weight = 100 + pod_affinity_term { + label_selector { + match_expressions { + key = "app" + operator = "In" + values = ["redis-v2"] + } + } + topology_key = "kubernetes.io/hostname" + } + } + } + } + + init_container { + name = "generate-sentinel-conf" + image = "docker.io/library/redis:7.4-alpine" + command = ["/bin/sh", "/bootstrap/init.sh"] + + resources { + requests = { + cpu = "10m" + memory = "32Mi" + } + limits = { + memory = "32Mi" + } + } + + volume_mount { + name = "bootstrap" + mount_path = "/bootstrap" + read_only = true + } + volume_mount { + name = "shared" + mount_path = "/shared" + } + } + + container { + name = "redis" + image = "docker.io/library/redis:7.4-alpine" + command = ["redis-server", "/etc/redis/redis.conf"] + + port { + container_port = 6379 + name = "redis" + } + + resources { + requests = { + cpu = "100m" + memory = "768Mi" + } + limits = { + memory = "768Mi" + } + } + + volume_mount { + name = "data" + mount_path = "/data" + } + volume_mount { + name = "conf" + mount_path = "/etc/redis" + read_only = true + } + volume_mount { + # redis.conf `include /shared/replica.conf` — written by init container. 
+ name = "shared" + mount_path = "/shared" + read_only = true + } + + liveness_probe { + exec { + command = ["redis-cli", "PING"] + } + initial_delay_seconds = 15 + period_seconds = 10 + timeout_seconds = 3 + failure_threshold = 3 + } + readiness_probe { + exec { + command = ["redis-cli", "PING"] + } + initial_delay_seconds = 5 + period_seconds = 5 + timeout_seconds = 3 + failure_threshold = 3 + } + } + + container { + name = "sentinel" + image = "docker.io/library/redis:7.4-alpine" + command = ["redis-sentinel", "/shared/sentinel.conf"] + + port { + container_port = 26379 + name = "sentinel" + } + + resources { + requests = { + cpu = "20m" + memory = "64Mi" + } + limits = { + memory = "64Mi" + } + } + + volume_mount { + name = "shared" + mount_path = "/shared" + } + + liveness_probe { + exec { + command = ["redis-cli", "-p", "26379", "PING"] + } + initial_delay_seconds = 20 + period_seconds = 10 + timeout_seconds = 3 + failure_threshold = 3 + } + readiness_probe { + exec { + command = ["redis-cli", "-p", "26379", "PING"] + } + initial_delay_seconds = 10 + period_seconds = 5 + timeout_seconds = 3 + failure_threshold = 3 + } + } + + container { + name = "exporter" + image = "docker.io/oliver006/redis_exporter:v1.62.0" + + port { + container_port = 9121 + name = "exporter" + } + + env { + name = "REDIS_ADDR" + value = "redis://localhost:6379" + } + + resources { + requests = { + cpu = "10m" + memory = "32Mi" + } + limits = { + memory = "32Mi" + } + } + + liveness_probe { + http_get { + path = "/" + port = 9121 + } + initial_delay_seconds = 15 + period_seconds = 30 + timeout_seconds = 5 + } + } + + volume { + name = "conf" + config_map { + name = kubernetes_config_map.redis_v2_conf.metadata[0].name + } + } + volume { + name = "bootstrap" + config_map { + name = kubernetes_config_map.redis_v2_sentinel_bootstrap.metadata[0].name + default_mode = "0755" + } + } + volume { + name = "shared" + empty_dir {} + } + } + } + + volume_claim_template { + metadata { + name = "data" + annotations = { + "resize.topolvm.io/threshold" = "80%" + "resize.topolvm.io/increase" = "100%" + "resize.topolvm.io/storage_limit" = "20Gi" + } + } + spec { + access_modes = ["ReadWriteOnce"] + storage_class_name = "proxmox-lvm-encrypted" + resources { + requests = { + storage = "5Gi" + } + } + } + } + } + + lifecycle { + # KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2 + ignore_changes = [spec[0].template[0].spec[0].dns_config] + } +} + +resource "kubernetes_pod_disruption_budget_v1" "redis_v2" { + metadata { + name = "redis-v2" + namespace = kubernetes_namespace.redis.metadata[0].name + } + spec { + min_available = 2 + selector { + match_labels = { + app = "redis-v2" + } + } + } +} + +resource "kubernetes_pod_disruption_budget_v1" "redis_haproxy" { + metadata { + name = "redis-haproxy" + namespace = kubernetes_namespace.redis.metadata[0].name + } + spec { + min_available = 2 + selector { + match_labels = { + app = "redis-haproxy" + } + } + } +} + # Hourly backup: copy RDB snapshot from master to NFS resource "kubernetes_cron_job_v1" "redis-backup" { metadata {