infra/stacks/platform/modules/monitoring
Viktor Barzin 194281e527 right-size cluster memory: reduce overprovisioned, fix under-provisioned services
Phase 1 - Quick wins (~4.5 Gi saved):
- democratic-csi: add explicit sidecar resources (64-80Mi vs 256Mi LimitRange default)
- caretta: 768Mi → 600Mi (VPA upper 485Mi)
- immich-ml: 4Gi → 3584Mi (VPA upper 2.95Gi, GPU margin)
- onlyoffice: 3Gi → 2304Mi (VPA upper 1.82Gi)

Phase 2 - Safety fixes (prevent OOMKills):
- frigate: 2Gi/8Gi → 5Gi/10Gi (VPA upper 7.7Gi, was 4% headroom)
- openclaw: 1280Mi req → 2Gi req=limit (documented 2Gi requirement)

Phase 3 - Additional right-sizing:
- authentik workers: 1Gi → 896Mi x3 (VPA upper 722Mi)
- shlink: 512Mi/768Mi → 960Mi req=limit (VPA upper 780Mi, safety increase)

Phase 4 - Burstable QoS for lower tiers:
- tier-3-edge: 128Mi/128Mi → 96Mi req / 192Mi limit
- tier-4-aux: 128Mi/128Mi → 64Mi req / 256Mi limit

Phase 5 - Monitoring:
- Add ClusterMemoryRequestsHigh alert (>85% allocatable, 15m)
- Add ContainerNearOOM alert (>85% limit, 30m)
- Add PodUnschedulable alert (5m, critical)

Cluster: 92.7% → 90.8% memory requests. Stirling-pdf now schedulable.
2026-03-15 15:30:18 +00:00
..
dashboards Add node hang instrumentation and scale down chromium services 2026-03-13 22:20:28 +00:00
server-power-cycle [ci skip] Move Terraform modules into stack directories 2026-02-22 14:38:14 +00:00
alloy.yaml Remove all CPU limits cluster-wide to eliminate CFS throttling 2026-03-14 08:51:45 +00:00
caretta.tf right-size cluster memory: reduce overprovisioned, fix under-provisioned services 2026-03-15 15:30:18 +00:00
Dockerfile [ci skip] Move Terraform modules into stack directories 2026-02-22 14:38:14 +00:00
goflow2.tf equalize memory req=lim across 70+ containers using Prometheus 7d max data 2026-03-14 21:46:49 +00:00
grafana.tf Monitoring overhaul: reduce noise, add coverage gaps, auto-load dashboards 2026-03-14 10:25:31 +00:00
grafana_chart_values.yaml equalize memory req=lim across 70+ containers using Prometheus 7d max data 2026-03-14 21:46:49 +00:00
idrac.tf [ci skip] platform: add ndots=2 dns_config to all deployment pod specs 2026-02-23 22:43:05 +00:00
k8s-monitoring-values.yaml [ci skip] Move Terraform modules into stack directories 2026-02-22 14:38:14 +00:00
loki.tf feat(monitoring): Disable Loki centralized logging while preserving configuration 2026-03-13 08:41:23 +00:00
loki.yaml Remove all CPU limits cluster-wide to eliminate CFS throttling 2026-03-14 08:51:45 +00:00
main.tf Remove all CPU limits cluster-wide to eliminate CFS throttling 2026-03-14 08:51:45 +00:00
prometheus.tf [ci skip] expand Prometheus PVC to 200Gi, increase retention to 180GB for 1-year history 2026-03-06 23:16:32 +00:00
prometheus_chart_values.tpl right-size cluster memory: reduce overprovisioned, fix under-provisioned services 2026-03-15 15:30:18 +00:00
prometheus_snmp_chart_values.yaml [ci skip] Move Terraform modules into stack directories 2026-02-22 14:38:14 +00:00
pve_exporter.tf equalize memory req=lim across 70+ containers using Prometheus 7d max data 2026-03-14 21:46:49 +00:00
snmp_exporter.tf equalize memory req=lim across 70+ containers using Prometheus 7d max data 2026-03-14 21:46:49 +00:00
ups_snmp_values.yaml [ci skip] Move Terraform modules into stack directories 2026-02-22 14:38:14 +00:00