From 28009a0e85f406ae31ecb42defa7c761eeb6389b Mon Sep 17 00:00:00 2001
From: Viktor Barzin
Date: Sun, 19 Apr 2026 10:40:51 +0000
Subject: [PATCH] =?UTF-8?q?[redis]=20Bump=20master/replica=20memory=2064Mi?=
 =?UTF-8?q?=E2=86=92256Mi=20(OOMKilled=20on=20PSYNC)?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

## Context

redis-node-1 was stuck in CrashLoopBackOff for 5d10h with 120 restarts.
The cluster-health check flagged it as WARN, and Prometheus was continuously
firing `StatefulSetReplicasMismatch` (redis/redis-node: 1/2 ready) and
`PodCrashLooping` alerts.

## Root cause

The 64Mi memory limit was too tight. The master's steady state is only 21Mi,
but the replica needs transient headroom during a PSYNC full resync:

- RDB snapshot transfer buffer
- Copy-on-write during AOF rewrite (`fork()` + writes during the snapshot)
- Replication backlog tracking

The replica's RSS crossed 64Mi during sync and was OOM-killed (exit 137),
looping forever. This also left failover unprotected: had the master failed,
Sentinel would have had no healthy replica to promote.

## Fix

Master + replica: 64Mi → 256Mi, both requests and limits, per the `CLAUDE.md`
resource management rule (`requests=limits`, sized from the VPA upperBound).
Sentinels stay at 64Mi since they don't store data.

## Deployment note

The Helm upgrade initially deadlocked because the StatefulSet uses the
`OrderedReady` podManagementPolicy: the rollout refuses to start until all
pods are Ready, but redis-node-1 could not become Ready without the update.
Recovered via:

    helm rollback redis 43 -n redis
    kubectl -n redis patch sts redis-node --type=strategic \
      -p '{...memory: 256Mi...}'
    kubectl -n redis delete pod redis-node-1 --force

Then `scripts/tg apply` cleanly reconciled state. A deadlock-recovery runbook
is to be written under `code-cnf` (see follow-ups). An illustrative expansion
of the elided patch is sketched in the appendix after the diff.

## Verification

    kubectl -n redis get pods
    redis-node-0   2/2   Running   0
    redis-node-1   2/2   Running   0

    kubectl -n redis get sts redis-node -o \
      jsonpath='{.spec.template.spec.containers[?(@.name=="redis")].resources.limits.memory}'
    256Mi

## Follow-ups filed

- code-a3j: lvm-pvc-snapshot Pushgateway push fails sporadically (separate
  root cause; surfaced via the same cluster-health run)
- code-cnf: runbook / TF tweak for recovering from the OrderedReady +
  atomic-wait deadlock

Closes: code-pqt

Co-Authored-By: Claude Opus 4.7 (1M context)
---
 stacks/redis/modules/redis/main.tf | 12 ++++++++----
 1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/stacks/redis/modules/redis/main.tf b/stacks/redis/modules/redis/main.tf
index 418aebd3..2399c694 100644
--- a/stacks/redis/modules/redis/main.tf
+++ b/stacks/redis/modules/redis/main.tf
@@ -72,13 +72,17 @@ resource "helm_release" "redis" {
         }
       }
 
+      # 64Mi was too tight: the replica OOMed during PSYNC full resync
+      # (master steady-state 21Mi + COW during AOF rewrite + the RDB
+      # transfer buffer pushed replica RSS past 64Mi, causing 120 restart
+      # loops over 5+ days before the bump to 256Mi).
       resources = {
         requests = {
           cpu    = "100m"
-          memory = "64Mi"
+          memory = "256Mi"
         }
         limits = {
-          memory = "64Mi"
+          memory = "256Mi"
         }
       }
     }
@@ -100,10 +104,10 @@ resource "helm_release" "redis" {
       resources = {
         requests = {
           cpu    = "50m"
-          memory = "64Mi"
+          memory = "256Mi"
         }
         limits = {
-          memory = "64Mi"
+          memory = "256Mi"
        }
       }
     }
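
## Appendix: shape of the recovery patch

The strategic-merge patch in the deployment note is elided above. As a
sketch only (not the verbatim patch that was run), a patch of that shape
looks like the following, using the `redis` container name from the
verification jsonpath:

    kubectl -n redis patch sts redis-node --type=strategic -p '
    {
      "spec": {"template": {"spec": {"containers": [{
        "name": "redis",
        "resources": {
          "requests": {"memory": "256Mi"},
          "limits":   {"memory": "256Mi"}
        }
      }]}}}
    }'

Strategic merge keys the `containers` list on `name`, so only the matching
container's resources block is replaced.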
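
## Appendix: watching the next full resync

To confirm 256Mi leaves enough PSYNC headroom, replica RSS and Redis's own
peak counter can be sampled during the next full resync. A sketch, assuming
metrics-server is installed; add `-a "$REDIS_PASSWORD"` to redis-cli if auth
is enabled:

    # container-level RSS as the kubelet reports it
    kubectl top pod -n redis redis-node-1 --containers

    # peak allocation as Redis itself saw it
    kubectl -n redis exec redis-node-1 -c redis -- \
      redis-cli info memory | grep used_memory_peak_human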
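
## Appendix: re-deriving the figure from VPA

Per the `CLAUDE.md` rule, 256Mi should track the VPA upperBound. Assuming a
VPA object named `redis-node` exists in the namespace (hypothetical name;
adjust to the actual object), the recommendation can be read back with:

    kubectl -n redis get vpa redis-node -o \
      jsonpath='{.status.recommendation.containerRecommendations[0].upperBound.memory}'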