From 98d2b896140b602f576873d87d647ea74f50e9fd Mon Sep 17 00:00:00 2001
From: Viktor Barzin
Date: Tue, 23 Jun 2026 12:46:28 +0000
Subject: [PATCH] calico: bump tigera-operator mem limit 256Mi -> 512Mi (OOM crashloop fix)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

The operator OOM-crashlooped on 2026-06-23. It idles at ~246Mi with a
~266Mi startup spike (re-listing resources to build informer caches), so
steady state sat just under the 256Mi limit and the spike went over it.
The first time the pod restarted it could never finish startup (exit 137
OOMKilled, leader-elect, OOM, repeat). This was a latent landmine: the
limit was always too tight and only bit once the pod restarted.

Data plane was never affected (calico-node 7/7, tigerastatus green
throughout). 512Mi gives headroom (now ~246Mi steady, verified stable,
0 restarts).

NOT caused by the ESO migration (which never touched calico); cluster
churn was at most the trigger that exposed the tight limit.

Co-Authored-By: Claude Opus 4.8
---
 stacks/calico/main.tf | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/stacks/calico/main.tf b/stacks/calico/main.tf
index 878a594c..dd2ef2a4 100644
--- a/stacks/calico/main.tf
+++ b/stacks/calico/main.tf
@@ -162,6 +162,11 @@ resource "helm_release" "tigera_operator" {
     # are installed -> "ensure CRDs are installed first". Not needed here.
     goldmane = { enabled = false }
     whisker = { enabled = false }
-    resources = { limits = { memory = "256Mi" } }
+    # 512Mi (was 256Mi): the operator idles at ~246Mi but its STARTUP spike
+    # (~266Mi, re-listing resources to build informer caches) exceeded 256Mi
+    # and OOM-crashlooped on 2026-06-23 the first time the pod restarted (a
+    # latent landmine: any restart would have triggered it). 512Mi covers the
+    # spike; data plane (calico-node) is unaffected by an operator restart.
+    resources = { limits = { memory = "512Mi" } }
   })]
 }
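
Note (not part of the patch): the headroom arithmetic behind the new limit can be sketched in Terraform. This is a hypothetical illustration only. The local names and the check block are assumptions and do not exist in stacks/calico/main.tf. It uses only the figures quoted in the commit message (266Mi startup spike, 256Mi old limit, 512Mi new limit) and assumes Terraform 1.5 or later, which is where `check` blocks were introduced.

```hcl
locals {
  # Figures taken from the commit message, in MiB.
  operator_startup_spike_mib = 266
  operator_old_limit_mib     = 256
  operator_new_limit_mib     = 512

  # Headroom = limit minus observed startup spike.
  # A negative value means the spike cannot fit under the limit.
  operator_old_headroom_mib = local.operator_old_limit_mib - local.operator_startup_spike_mib # -10
  operator_new_headroom_mib = local.operator_new_limit_mib - local.operator_startup_spike_mib # 246
}

# Hypothetical guard: surfaces a warning at plan/apply time if the limit is
# ever set at or below the observed spike again.
check "tigera_operator_memory_headroom" {
  assert {
    condition     = local.operator_new_headroom_mib > 0
    error_message = "tigera-operator memory limit is at or below the observed startup spike."
  }
}
```

With the old limit the headroom is -10Mi, which is consistent with the crashloop described in the commit message; with 512Mi it is 246Mi. The spike figure is a single observation, so a guard like this would need the number kept up to date to stay meaningful.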