monitoring: give kube-state-metrics a 512Mi memory limit (Burstable)
Some checks failed
ci/woodpecker/push/default Pipeline failed
kube-state-metrics had no explicit resources, so the monitoring-namespace LimitRange pinned it to requests=limits=256Mi (Guaranteed QoS). KSM idles around 45Mi but momentarily spikes past 256Mi during a full object relist (450+ pods, 150+ jobs, all secrets/endpoints) and gets OOMKilled.

Each OOM blacks out the KSM-exported series that ~10 alert rules read, so they all fire false "<svc>Down" criticals at once and self-resolve when KSM recovers ~5 min later. This is exactly the alert storm seen at 2026-06-26 08:42 UTC.

Set explicit Burstable resources: keep the request low (64Mi, just above idle) so we don't reserve memory we don't use, and raise only the limit to 512Mi to absorb the relist peak. No CPU limit, per the cluster-wide policy.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
parent d105713ae7
commit 1bca799bb4
1 changed file with 13 additions and 0 deletions
@@ -253,6 +253,19 @@ alertmanager:
         memory: 256Mi
       limits:
         memory: 256Mi
+# kube-state-metrics idles ~45Mi but briefly spikes past the monitoring-namespace
+# LimitRange default (256Mi) during a full object relist (450+ pods, 150+ jobs, all
+# secrets/endpoints), so it gets OOMKilled. Each OOM blacks out KSM-derived series
+# for ~5min and cascades into a wall of false "<svc>Down" criticals that self-resolve
+# (storm 2026-06-26 08:42). Burstable: low request (minimal reservation) + a 512Mi
+# limit to absorb the relist peak. No CPU limit (cluster-wide policy).
+kube-state-metrics:
+  resources:
+    requests:
+      cpu: 100m
+      memory: 64Mi
+    limits:
+      memory: 512Mi
 prometheus-node-exporter:
   enabled: true
   resources: