From 1bca799bb427a1a2b714fae13c36f5a215b7a086 Mon Sep 17 00:00:00 2001
From: Viktor Barzin <viktorbarzin@meta.com>
Date: Fri, 26 Jun 2026 09:06:31 +0000
Subject: [PATCH] monitoring: give kube-state-metrics a 512Mi memory limit
 (Burstable)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

kube-state-metrics had no explicit resources, so the monitoring-namespace
LimitRange pinned it to requests=limits=256Mi (Guaranteed QoS). KSM idles
around 45Mi but momentarily spikes past 256Mi during a full object relist
(450+ pods, 150+ jobs, all secrets/endpoints) and gets OOMKilled. Each OOM
blacks out the KSM-exported series that ~10 alert rules read, so they all
fire false "<svc>Down" criticals at once and self-resolve when KSM recovers
~5 min later. This matches the alert storm seen at 2026-06-26 08:42 UTC.

Set explicit Burstable resources: keep the memory request low (64Mi, just
above idle) so we don't reserve memory we don't use, and raise only the
memory limit to 512Mi to absorb the relist peak. The CPU request is 100m,
with no CPU limit, per the cluster-wide policy.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
---
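Note for reviewers: the LimitRange itself is not part of this patch. As a
rough sketch only, a namespace default that produces the memory
requests=limits=256Mi behaviour described above might look like the
following. The object name and the namespace name are assumptions, and
CPU defaults are left out because the commit message does not describe
them.

```yaml
# Hypothetical LimitRange, for illustration only; not taken from this repo.
apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits      # assumed name
  namespace: monitoring     # assumed namespace name
spec:
  limits:
    - type: Container
      default:              # applied as the limit when a container sets none
        memory: 256Mi
      defaultRequest:       # applied as the request when a container sets none
        memory: 256Mi
```

Because these defaults apply only to containers that omit the
corresponding field, the explicit requests and limits added in this patch
take precedence for kube-state-metrics.
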
 .../modules/monitoring/prometheus_chart_values.tpl  | 13 +++++++++++++
 1 file changed, 13 insertions(+)
diff --git a/stacks/monitoring/modules/monitoring/prometheus_chart_values.tpl b/stacks/monitoring/modules/monitoring/prometheus_chart_values.tpl
index f526e7ac..e98c9918 100755
--- a/stacks/monitoring/modules/monitoring/prometheus_chart_values.tpl
+++ b/stacks/monitoring/modules/monitoring/prometheus_chart_values.tpl
@@ -253,6 +253,19 @@ alertmanager:
       memory: 256Mi
     limits:
       memory: 256Mi
+# kube-state-metrics idles ~45Mi but briefly spikes past the monitoring-namespace
+# LimitRange default (256Mi) during a full object relist (450+ pods, 150+ jobs, all
+# secrets/endpoints), so it gets OOMKilled. Each OOM blacks out KSM-derived series
+# for ~5min and cascades into a wall of false "<svc>Down" criticals that self-resolve
+# (storm 2026-06-26 08:42). Burstable: low request (minimal reservation) + a 512Mi
+# limit to absorb the relist peak. No CPU limit (cluster-wide policy).
+kube-state-metrics:
+  resources:
+    requests:
+      cpu: 100m
+      memory: 64Mi
+    limits:
+      memory: 512Mi
 prometheus-node-exporter:
   enabled: true
   resources: