infra/stacks/monitoring
Viktor Barzin 1bca799bb4
monitoring: give kube-state-metrics a 512Mi memory limit (Burstable)
kube-state-metrics had no explicit resources, so the monitoring-namespace
LimitRange pinned it to requests=limits=256Mi (Guaranteed QoS). KSM idles
around 45Mi but momentarily spikes past 256Mi during a full object relist
(450+ pods, 150+ jobs, all secrets/endpoints) and gets OOMKilled. Each OOM
blacks out the KSM-exported series that ~10 alert rules read, so they all
fire false "<svc>Down" criticals at once and self-resolve when KSM recovers
~5 min later, exactly the alert storm seen at 2026-06-26 08:42 UTC.

Set explicit Burstable resources: keep the request low (64Mi, just above
idle) so we don't reserve memory we don't use, and raise only the limit to
512Mi to absorb the relist peak. No CPU limit, per the cluster-wide policy.
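
For illustration, the settings described above correspond to a container
resources stanza roughly like the one below. This is a sketch only: it uses
plain Kubernetes container-spec syntax, and how the values are actually wired
into modules/monitoring (Helm values, a Terraform resource, or something else)
is not shown here and is an assumption.

```yaml
# Sketch of the container resources described in this commit (placement assumed).
resources:
  requests:
    memory: 64Mi    # just above the ~45Mi idle footprint
  limits:
    memory: 512Mi   # headroom for the full-relist spike past 256Mi
    # no cpu limit, per the cluster-wide policy
```

Because the memory request is lower than the memory limit, Kubernetes
classifies the pod as Burstable rather than Guaranteed.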

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-26 09:06:31 +00:00
modules/monitoring monitoring: give kube-state-metrics a 512Mi memory limit (Burstable) 2026-06-26 09:06:31 +00:00
imports.tf monitoring: adopt orphaned alert-digest resources into TF state (unblocks apply) 2026-06-16 11:31:17 +00:00
main.tf fix: restore tree dropped by 6d224861; land stem95su gdrive-sync (10m) [ci skip] 2026-06-09 08:45:33 +00:00
secrets fix: restore tree dropped by 6d224861; land stem95su gdrive-sync (10m) [ci skip] 2026-06-09 08:45:33 +00:00
terragrunt.hcl fix: restore tree dropped by 6d224861; land stem95su gdrive-sync (10m) [ci skip] 2026-06-09 08:45:33 +00:00