From 28009a0e85f406ae31ecb42defa7c761eeb6389b Mon Sep 17 00:00:00 2001
From: Viktor Barzin
Date: Sun, 19 Apr 2026 10:40:51 +0000
Subject: [PATCH] =?UTF-8?q?[redis]=20Bump=20master/replica=20memory=2064Mi?=
 =?UTF-8?q?=E2=86=92256Mi=20(OOMKilled=20on=20PSYNC)?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

## Context

redis-node-1 was stuck in CrashLoopBackOff for 5d10h with 120 restarts.
The cluster-health check flagged it as WARN, and Prometheus was continuously
firing `StatefulSetReplicasMismatch` (redis/redis-node: 1/2 ready) and
`PodCrashLooping` alerts.

## Root cause

The 64Mi memory limit was too tight. The master's steady state is only 21Mi,
but the replica needs transient headroom during a PSYNC full resync:

- RDB snapshot transfer buffer
- Copy-on-write during AOF rewrite (`fork()` + writes during the snapshot)
- Replication backlog tracking

The replica's RSS crossed 64Mi during sync and was OOM-killed (exit 137),
looping forever. This also left failover unprotected: had the master failed,
Sentinel would have had no healthy replica to promote.

## Fix

Master + replica: 64Mi → 256Mi, both requests and limits, per the `CLAUDE.md`
resource management rule (`requests=limits`, sized from the VPA upperBound).
Sentinels stay at 64Mi since they don't store data.

## Deployment note

The Helm upgrade initially deadlocked because the StatefulSet uses the
`OrderedReady` podManagementPolicy: the rollout refuses to start until all
pods are Ready, but redis-node-1 could not become Ready without the update.
Recovered via:

    helm rollback redis 43 -n redis
    kubectl -n redis patch sts redis-node --type=strategic \
      -p '{...memory: 256Mi...}'
    kubectl -n redis delete pod redis-node-1 --force

Then `scripts/tg apply` cleanly reconciled state. A deadlock-recovery runbook
is to be written under `code-cnf` (see follow-ups). An illustrative expansion
of the elided patch is sketched in the appendix after the diff.

## Verification

    kubectl -n redis get pods
    redis-node-0   2/2   Running   0
    redis-node-1   2/2   Running   0

    kubectl -n redis get sts redis-node -o \
      jsonpath='{.spec.template.spec.containers[?(@.name=="redis")].resources.limits.memory}'
    256Mi

## Follow-ups filed

- code-a3j: lvm-pvc-snapshot Pushgateway push fails sporadically (separate
  root cause; surfaced via the same cluster-health run)
- code-cnf: runbook / TF tweak for recovering from the OrderedReady +
  atomic-wait deadlock

Closes: code-pqt

Co-Authored-By: Claude Opus 4.7 (1M context)
---
 stacks/redis/modules/redis/main.tf | 12 ++++++++----
 1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/stacks/redis/modules/redis/main.tf b/stacks/redis/modules/redis/main.tf
index 418aebd3..2399c694 100644
--- a/stacks/redis/modules/redis/main.tf
+++ b/stacks/redis/modules/redis/main.tf
@@ -72,13 +72,17 @@ resource "helm_release" "redis" {
         }
       }
 
+      # 64Mi was too tight: the replica OOMed during PSYNC full resync
+      # (master steady-state 21Mi + COW during AOF rewrite + the RDB
+      # transfer buffer pushed replica RSS past 64Mi, causing 120 restart
+      # loops over 5+ days before the bump to 256Mi).
       resources = {
         requests = {
           cpu    = "100m"
-          memory = "64Mi"
+          memory = "256Mi"
         }
         limits = {
-          memory = "64Mi"
+          memory = "256Mi"
         }
       }
     }
@@ -100,10 +104,10 @@ resource "helm_release" "redis" {
       resources = {
         requests = {
           cpu    = "50m"
-          memory = "64Mi"
+          memory = "256Mi"
         }
         limits = {
-          memory = "64Mi"
+          memory = "256Mi"
        }
       }
     }
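
## Appendix: shape of the recovery patch

The strategic-merge patch in the deployment note is elided above. As a
sketch only (not the verbatim patch that was run), a patch of that shape
looks like the following, using the `redis` container name from the
verification jsonpath:

    kubectl -n redis patch sts redis-node --type=strategic -p '
    {
      "spec": {"template": {"spec": {"containers": [{
        "name": "redis",
        "resources": {
          "requests": {"memory": "256Mi"},
          "limits":   {"memory": "256Mi"}
        }
      }]}}}
    }'

Strategic merge keys the `containers` list on `name`, so only the matching
container's resources block is replaced.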
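
## Appendix: watching the next full resync

To confirm 256Mi leaves enough PSYNC headroom, replica RSS and Redis's own
peak counter can be sampled during the next full resync. A sketch, assuming
metrics-server is installed; add `-a "$REDIS_PASSWORD"` to redis-cli if auth
is enabled:

    # container-level RSS as the kubelet reports it
    kubectl top pod -n redis redis-node-1 --containers

    # peak allocation as Redis itself saw it
    kubectl -n redis exec redis-node-1 -c redis -- \
      redis-cli info memory | grep used_memory_peak_human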
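
## Appendix: re-deriving the figure from VPA

Per the `CLAUDE.md` rule, 256Mi should track the VPA upperBound. Assuming a
VPA object named `redis-node` exists in the namespace (hypothetical name;
adjust to the actual object), the recommendation can be read back with:

    kubectl -n redis get vpa redis-node -o \
      jsonpath='{.status.recommendation.containerRecommendations[0].upperBound.memory}'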