## Context
`redis-node-1` was stuck in `CrashLoopBackOff` for 5d10h with 120 restarts.
Cluster-health check flagged it as WARN; Prometheus was firing
`StatefulSetReplicasMismatch` (redis/redis-node: 1/2 ready) and
`PodCrashLooping` alerts continuously.
## Root cause
The 64Mi memory limit is too tight. Master steady state is only 21Mi, but
the replica needs transient headroom during a PSYNC full resync (see the
sketch after this list):
- RDB snapshot transfer buffer
- Copy-on-write during AOF rewrite (`fork()` + writes during snapshot)
- Replication backlog tracking
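The spike can be watched live during a resync; a minimal sketch, assuming
`kubectl exec` access to the `redis` container (add auth, e.g.
`REDISCLI_AUTH`, if the instance requires it):

```shell
# Replica-side memory while a full resync is in flight.
kubectl -n redis exec redis-node-1 -c redis -- redis-cli info memory \
  | grep -E 'used_memory_rss_human|used_memory_peak_human'
# Whether a full sync is currently running (1 = in progress).
kubectl -n redis exec redis-node-1 -c redis -- redis-cli info replication \
  | grep master_sync_in_progress
```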
The replica's RSS crossed 64Mi during the sync and it was OOM-killed
(exit 137, SIGKILL), looping forever. This also removed Sentinel's failover
path: had the master failed, there would have been no healthy replica to
promote.
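The OOM kill is confirmable from the pod's last terminated state while it is
crash-looping (the `redis` container name matches the jsonpath used under
Verification below):

```shell
kubectl -n redis get pod redis-node-1 -o \
  jsonpath='{.status.containerStatuses[?(@.name=="redis")].lastState.terminated.reason}'
# -> OOMKilled (exitCode 137 sits in the same lastState block)
```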
## Fix
Master + replica: 64Mi → 256Mi, both requests and limits, per the
`CLAUDE.md` resource-management rule (`requests=limits`, sized from the
VPA upperBound). Sentinels stay at 64Mi; they don't store data.
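For reference, the change expressed as a Helm override; a sketch only, since
the value paths (`master.resources`, `replica.resources`) assume a
Bitnami-style chart layout and `<chart>` stands in for the actual chart
reference:

```shell
helm -n redis upgrade redis <chart> --reuse-values \
  --set master.resources.requests.memory=256Mi \
  --set master.resources.limits.memory=256Mi \
  --set replica.resources.requests.memory=256Mi \
  --set replica.resources.limits.memory=256Mi
```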
## Deployment note
Helm upgrade initially deadlocked because the StatefulSet uses the
`OrderedReady` podManagementPolicy: the rolling update refuses to proceed
while any pod is not Ready, but redis-node-1 could not become Ready without
the update. Recovered via:
```shell
helm rollback redis 43 -n redis
kubectl -n redis patch sts redis-node --type=strategic \
  -p '{...memory: 256Mi...}'
kubectl -n redis delete pod redis-node-1 --force
```
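The patch payload is elided above; for the runbook, a strategic-merge patch
of this shape does the job (a sketch; strategic merge matches entries in
`containers` by `name`):

```shell
kubectl -n redis patch sts redis-node --type=strategic -p '{
  "spec": {"template": {"spec": {"containers": [{
    "name": "redis",
    "resources": {
      "requests": {"memory": "256Mi"},
      "limits":   {"memory": "256Mi"}
    }
  }]}}}
}'
```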
Then `scripts/tg apply` reconciled state cleanly. A deadlock-recovery
runbook is to be written under `code-cnf` (see follow-ups).
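Until that runbook exists, a quick exposure check for other StatefulSets:

```shell
# StatefulSets with this policy are exposed to the update-vs-Ready deadlock.
kubectl -n redis get sts redis-node -o jsonpath='{.spec.podManagementPolicy}'
# -> OrderedReady
```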
## Verification
```shell
$ kubectl -n redis get pods
NAME           READY   STATUS    RESTARTS   AGE
redis-node-0   2/2     Running   0          <bounce>
redis-node-1   2/2     Running   0          <bounce>

$ kubectl -n redis get sts redis-node -o jsonpath='{.spec.template.spec.containers[?(@.name=="redis")].resources.limits.memory}'
256Mi
```
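Beyond pod Readiness, it's worth confirming the replica actually completed
the resync; a sketch, with the same auth caveat as above:

```shell
kubectl -n redis exec redis-node-1 -c redis -- redis-cli info replication \
  | grep -E '^(role|master_link_status)'
# expect: role:slave and master_link_status:up
```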
## Follow-ups filed
- code-a3j: lvm-pvc-snapshot Pushgateway push fails sporadically
(separate root cause; surfaced via same cluster-health run)
- code-cnf: runbook / TF tweak for the OrderedReady + atomic-wait
deadlock recovery
Closes: code-pqt
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>