infra/stacks/monitoring/modules/monitoring
Viktor Barzin 12a51c4ffa right-size memory requests to unblock GPU workloads and fix dbaas quota [ci skip]
- nvidia: custom LimitRange (128Mi default, was 1Gi from Kyverno) to stop
  inflating GPU operator init containers; saves ~2.5Gi on GPU node
- nvidia: dcgm-exporter 1536Mi → 768Mi (actual usage 489Mi)
- monitoring: prometheus server 4Gi → 3Gi (actual usage 2.6Gi)
- onlyoffice: 2304Mi → 1536Mi (actual usage 1.3Gi)
- immich: frame explicit 64Mi resources (was getting 1Gi LimitRange default)
- dbaas: quota limits.memory 20Gi → 24Gi to fit 3rd MySQL replica

Root cause: Kyverno tier-2-gpu LimitRange injected 1Gi on every NVIDIA init
container (no explicit resources), wasting ~2.5Gi scheduling overhead on the
GPU node. Combined with over-requesting, frigate and immich-ml couldn't schedule.
2026-03-17 22:35:54 +00:00
..
dashboards extract monitoring, nvidia, mailserver, cloudflared, kyverno from platform [ci skip] 2026-03-17 21:34:11 +00:00
server-power-cycle extract monitoring, nvidia, mailserver, cloudflared, kyverno from platform [ci skip] 2026-03-17 21:34:11 +00:00
alloy.yaml extract monitoring, nvidia, mailserver, cloudflared, kyverno from platform [ci skip] 2026-03-17 21:34:11 +00:00
caretta.tf extract monitoring, nvidia, mailserver, cloudflared, kyverno from platform [ci skip] 2026-03-17 21:34:11 +00:00
Dockerfile extract monitoring, nvidia, mailserver, cloudflared, kyverno from platform [ci skip] 2026-03-17 21:34:11 +00:00
goflow2.tf extract monitoring, nvidia, mailserver, cloudflared, kyverno from platform [ci skip] 2026-03-17 21:34:11 +00:00
grafana.tf extract monitoring, nvidia, mailserver, cloudflared, kyverno from platform [ci skip] 2026-03-17 21:34:11 +00:00
grafana_chart_values.yaml extract monitoring, nvidia, mailserver, cloudflared, kyverno from platform [ci skip] 2026-03-17 21:34:11 +00:00
idrac.tf extract monitoring, nvidia, mailserver, cloudflared, kyverno from platform [ci skip] 2026-03-17 21:34:11 +00:00
k8s-monitoring-values.yaml extract monitoring, nvidia, mailserver, cloudflared, kyverno from platform [ci skip] 2026-03-17 21:34:11 +00:00
loki.tf extract monitoring, nvidia, mailserver, cloudflared, kyverno from platform [ci skip] 2026-03-17 21:34:11 +00:00
loki.yaml extract monitoring, nvidia, mailserver, cloudflared, kyverno from platform [ci skip] 2026-03-17 21:34:11 +00:00
main.tf extract monitoring, nvidia, mailserver, cloudflared, kyverno from platform [ci skip] 2026-03-17 21:34:11 +00:00
prometheus.tf extract monitoring, nvidia, mailserver, cloudflared, kyverno from platform [ci skip] 2026-03-17 21:34:11 +00:00
prometheus_chart_values.tpl right-size memory requests to unblock GPU workloads and fix dbaas quota [ci skip] 2026-03-17 22:35:54 +00:00
prometheus_snmp_chart_values.yaml extract monitoring, nvidia, mailserver, cloudflared, kyverno from platform [ci skip] 2026-03-17 21:34:11 +00:00
pve_exporter.tf extract monitoring, nvidia, mailserver, cloudflared, kyverno from platform [ci skip] 2026-03-17 21:34:11 +00:00
snmp_exporter.tf extract monitoring, nvidia, mailserver, cloudflared, kyverno from platform [ci skip] 2026-03-17 21:34:11 +00:00
ups_snmp_values.yaml extract monitoring, nvidia, mailserver, cloudflared, kyverno from platform [ci skip] 2026-03-17 21:34:11 +00:00