infra/stacks/platform/modules/monitoring
Viktor Barzin fca4e02c54 prometheus: increase memory to 4Gi and probe delays for TSDB compaction
Compaction of 5 years of TSDB blocks was getting the pod OOM-killed at
the 3Gi limit (18 restarts in 8h), causing sustained IO pressure on the
PVE host's spinning disk. Increase the liveness probe initial delay to
300s so WAL replay completes before the probe kills the pod.
2026-03-15 01:44:52 +00:00
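For orientation, the change described in this commit might look roughly like the values fragment below. This is a sketch, not the actual contents of prometheus_chart_values.tpl: the key names (`server.resources`, `server.livenessProbeInitialDelay`) assume the prometheus-community Helm chart layout, and setting the memory request equal to the limit is an assumption based on the req=lim convention mentioned in the other commits in this directory.

```yaml
# Hypothetical excerpt; key names assume the prometheus-community chart.
server:
  resources:
    requests:
      memory: 4Gi   # assumed equal to the limit (req=lim convention)
    limits:
      memory: 4Gi   # raised from 3Gi, where compaction was OOM-killed
  # Give WAL replay time to finish before the liveness probe can
  # restart the pod.
  livenessProbeInitialDelay: 300
```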
dashboards Add node hang instrumentation and scale down chromium services 2026-03-13 22:20:28 +00:00
server-power-cycle [ci skip] Move Terraform modules into stack directories 2026-02-22 14:38:14 +00:00
alloy.yaml Remove all CPU limits cluster-wide to eliminate CFS throttling 2026-03-14 08:51:45 +00:00
caretta.tf equalize memory req=lim across 70+ containers using Prometheus 7d max data 2026-03-14 21:46:49 +00:00
Dockerfile [ci skip] Move Terraform modules into stack directories 2026-02-22 14:38:14 +00:00
goflow2.tf equalize memory req=lim across 70+ containers using Prometheus 7d max data 2026-03-14 21:46:49 +00:00
grafana.tf Monitoring overhaul: reduce noise, add coverage gaps, auto-load dashboards 2026-03-14 10:25:31 +00:00
grafana_chart_values.yaml equalize memory req=lim across 70+ containers using Prometheus 7d max data 2026-03-14 21:46:49 +00:00
idrac.tf [ci skip] platform: add ndots=2 dns_config to all deployment pod specs 2026-02-23 22:43:05 +00:00
k8s-monitoring-values.yaml [ci skip] Move Terraform modules into stack directories 2026-02-22 14:38:14 +00:00
loki.tf feat(monitoring): Disable Loki centralized logging while preserving configuration 2026-03-13 08:41:23 +00:00
loki.yaml Remove all CPU limits cluster-wide to eliminate CFS throttling 2026-03-14 08:51:45 +00:00
main.tf Remove all CPU limits cluster-wide to eliminate CFS throttling 2026-03-14 08:51:45 +00:00
prometheus.tf [ci skip] expand Prometheus PVC to 200Gi, increase retention to 180GB for 1-year history 2026-03-06 23:16:32 +00:00
prometheus_chart_values.tpl prometheus: increase memory to 4Gi and probe delays for TSDB compaction 2026-03-15 01:44:52 +00:00
prometheus_snmp_chart_values.yaml [ci skip] Move Terraform modules into stack directories 2026-02-22 14:38:14 +00:00
pve_exporter.tf equalize memory req=lim across 70+ containers using Prometheus 7d max data 2026-03-14 21:46:49 +00:00
snmp_exporter.tf equalize memory req=lim across 70+ containers using Prometheus 7d max data 2026-03-14 21:46:49 +00:00
ups_snmp_values.yaml [ci skip] Move Terraform modules into stack directories 2026-02-22 14:38:14 +00:00