infra/stacks/platform/modules/monitoring
Viktor Barzin 9490f1afad fix: eliminate memory overcommit to prevent node OOM crashes
Set requests = limits (Guaranteed QoS) across LimitRange defaults and
explicit pod resources. Node2 crashed 2026-03-14 from 250% memory
overcommit (61GB limits on 24GB node).

Changes:
- LimitRange: default = defaultRequest for all 6 tiers
- Grafana: 3 → 2 replicas
- Grampsweb: document why replicas=0
- Prometheus: 1Gi/4Gi → 3Gi/3Gi
- OpenClaw: 512Mi/2Gi → 768Mi/768Mi
- Immich server: 256Mi/2Gi → 512Mi/512Mi
- Immich postgresql: 256Mi/1Gi → 512Mi/512Mi
- Calibre: 256Mi/1536Mi → 256Mi/256Mi
- Linkwarden: 256Mi/1536Mi → 768Mi/768Mi
- N8N: 256Mi/1Gi → 512Mi/512Mi
- MySQL cluster: 1Gi/3-4Gi → 2Gi/2Gi
- pg-cluster (CNPG): 512Mi/4Gi → 512Mi/512Mi
- DBaaS ResourceQuota limits.memory: 64Gi → 12Gi

[ci skip]
2026-03-18 08:03:59 +00:00
..
dashboards Add node hang instrumentation and scale down chromium services 2026-03-18 08:03:58 +00:00
server-power-cycle [ci skip] Move Terraform modules into stack directories 2026-02-22 14:38:14 +00:00
alloy.yaml Remove all CPU limits cluster-wide to eliminate CFS throttling 2026-03-18 08:03:58 +00:00
caretta.tf Monitoring overhaul: reduce noise, add coverage gaps, auto-load dashboards 2026-03-18 08:03:59 +00:00
Dockerfile [ci skip] Move Terraform modules into stack directories 2026-02-22 14:38:14 +00:00
goflow2.tf Remove all CPU limits cluster-wide to eliminate CFS throttling 2026-03-18 08:03:58 +00:00
grafana.tf Monitoring overhaul: reduce noise, add coverage gaps, auto-load dashboards 2026-03-18 08:03:59 +00:00
grafana_chart_values.yaml fix: eliminate memory overcommit to prevent node OOM crashes 2026-03-18 08:03:59 +00:00
idrac.tf [ci skip] platform: add ndots=2 dns_config to all deployment pod specs 2026-02-23 22:43:05 +00:00
k8s-monitoring-values.yaml [ci skip] Move Terraform modules into stack directories 2026-02-22 14:38:14 +00:00
loki.tf feat(monitoring): Disable Loki centralized logging while preserving configuration 2026-03-17 16:51:02 +00:00
loki.yaml Remove all CPU limits cluster-wide to eliminate CFS throttling 2026-03-18 08:03:58 +00:00
main.tf Remove all CPU limits cluster-wide to eliminate CFS throttling 2026-03-18 08:03:58 +00:00
prometheus.tf [ci skip] expand Prometheus PVC to 200Gi, increase retention to 180GB for 1-year history 2026-03-06 23:16:32 +00:00
prometheus_chart_values.tpl fix: eliminate memory overcommit to prevent node OOM crashes 2026-03-18 08:03:59 +00:00
prometheus_snmp_chart_values.yaml [ci skip] Move Terraform modules into stack directories 2026-02-22 14:38:14 +00:00
pve_exporter.tf [ci skip] platform: add ndots=2 dns_config to all deployment pod specs 2026-02-23 22:43:05 +00:00
snmp_exporter.tf [ci skip] platform: add ndots=2 dns_config to all deployment pod specs 2026-02-23 22:43:05 +00:00
ups_snmp_values.yaml [ci skip] Move Terraform modules into stack directories 2026-02-22 14:38:14 +00:00