infra/stacks/kured
Viktor Barzin 51313ee088 kured: fix sentinel-gate OOM — 256Mi limit + self-restart leak guard
The k8s-master gate pod OOM-killed child kubectls 149x/7d (accelerating:
0/day → 15 → 134) while master sat in pending-reboot. Root cause: only the
pending-reboot node's gate pod runs the kubectl-heavy hot path each cycle,
and the immortal bash loop slowly leaks (kubectl forks + Check-4 process
substitution) past the 64Mi cgroup limit. PID 1 bash survives each kill, so
the pod never restarts — just silent oom_events.

Fix: raise limit 64Mi→256Mi (headroom for ~30-50Mi kubectl forks) + add a
MAX_ITER=72 self-exit (~6h) so kubelet restarts the pod fresh and the leak
can never accumulate, regardless of how long a node stays pending-reboot.

Docs: post-mortem + automated-upgrades.md gate note.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-31 14:49:04 +00:00
..
main.tf kured: fix sentinel-gate OOM — 256Mi limit + self-restart leak guard 2026-05-31 14:49:04 +00:00
secrets [infra] Adopt kured + sentinel-gate into Terraform (Wave 5a) 2026-04-18 22:33:29 +00:00
terragrunt.hcl [infra] Adopt kured + sentinel-gate into Terraform (Wave 5a) 2026-04-18 22:33:29 +00:00