[redis] Bump master/replica memory 64Mi→256Mi (OOMKilled on PSYNC)

## Context
redis-node-1 was stuck in CrashLoopBackOff for 5d10h with 120 restarts.
Cluster-health check flagged it as WARN; Prometheus was firing
`StatefulSetReplicasMismatch` (redis/redis-node: 1/2 ready) and
`PodCrashLooping` alerts continuously.

## Root cause
Memory limit 64Mi is too tight. Master steady-state is only 21Mi, but
the replica needs transient headroom during PSYNC full resync:

- RDB snapshot transfer buffer
- Copy-on-write during AOF rewrite (`fork()` + writes during snapshot)
- Replication backlog tracking

The replica RSS crossed 64Mi during sync and was OOM-killed (exit 137),
looping forever. This also broke Sentinel quorum when master would
fail — no healthy replica to promote.

## Fix
Master + replica: 64Mi → 256Mi (both requests and limits, per
`CLAUDE.md` resource management rule: `requests=limits` based on
VPA upperBound).

Sentinels stay at 64Mi — they don't store data.

## Deployment note
Helm upgrade initially deadlocked because StatefulSet uses
`OrderedReady` podManagementPolicy: the update rollout refuses to start
until all pods Ready, but redis-node-1 could not be Ready without the
update. Recovered via:

  helm rollback redis 43 -n redis
  kubectl -n redis patch sts redis-node --type=strategic \
    -p '{...memory: 256Mi...}'
  kubectl -n redis delete pod redis-node-1 --force

Then `scripts/tg apply` cleanly reconciled state. Deadlock-recovery
runbook to be written under `code-cnf`.

## Verification
  kubectl -n redis get pods
    redis-node-0   2/2  Running  0  <bounce>
    redis-node-1   2/2  Running  0  <bounce>
  kubectl -n redis get sts redis-node -o jsonpath='{.spec.template.spec.containers[?(@.name=="redis")].resources.limits.memory}'
    256Mi

## Follow-ups filed
- code-a3j: lvm-pvc-snapshot Pushgateway push fails sporadically
  (separate root cause; surfaced via same cluster-health run)
- code-cnf: runbook / TF tweak for the OrderedReady + atomic-wait
  deadlock recovery

Closes: code-pqt

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
Viktor Barzin 2026-04-19 10:40:51 +00:00
parent 468a7a266b
commit 28009a0e85

View file

@ -72,13 +72,17 @@ resource "helm_release" "redis" {
}
}
# 64Mi was too tight: replica OOMed during PSYNC full resync
# (master steady-state 21Mi + COW during AOF rewrite + RDB transfer
# buffer pushed replica RSS past 64Mi, causing 120 restart loops over
# 5+ days before bump to 256Mi).
resources = {
requests = {
cpu = "100m"
memory = "64Mi"
memory = "256Mi"
}
limits = {
memory = "64Mi"
memory = "256Mi"
}
}
}
@ -100,10 +104,10 @@ resource "helm_release" "redis" {
resources = {
requests = {
cpu = "50m"
memory = "64Mi"
memory = "256Mi"
}
limits = {
memory = "64Mi"
memory = "256Mi"
}
}
}