infra/stacks/monitoring/modules/monitoring
Viktor Barzin b92e1166a8 monitoring: prometheus global scrape 1m -> 2m + UPS pinned 30s
Halves sample volume on all default-scrape jobs (cAdvisor, node-exporter,
service-endpoints, etc.). Memory id 559's earlier scrape-2m tuning was
applied live but not codified — this restores the Helm template.

Companion changes to keep alerting fidelity:
- evaluation_interval kept at 1m (alerts evaluate every minute)
- snmp-ups job pinned to scrape_interval=30s so PowerOutage /
  LowUPSBattery detect within ~30s instead of 2m
- 3 alerts bumped from for:1m to for:3m (HighGPUTemp, LowUPSBattery,
  PowerOutage) for stability above the new 2m global cadence

Other jobs that already had per-job overrides (snmp-idrac 1m,
redfish-idrac 3m, kubernetes-pods 5m, kubernetes-services 5m) are unaffected.
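
For orientation, the intervals described above might look roughly like this in the rendered Prometheus config. This is a sketch under assumptions, not the contents of prometheus_chart_values.tpl: the key layout follows the standard Prometheus config and rule-file schema, while the rule group name and the alert expression are hypothetical placeholders.

```yaml
# Sketch only: assumed shape of the rendered config, not copied from the template.
global:
  scrape_interval: 2m       # was 1m; applies to jobs without their own override
  evaluation_interval: 1m   # unchanged, rules still evaluate every minute

scrape_configs:
  - job_name: snmp-ups
    scrape_interval: 30s    # per-job override, so UPS state is sampled every 30s
    # targets, relabeling, and SNMP module settings omitted
---
# Rule file fragment (a separate file in a standard Prometheus layout).
groups:
  - name: ups-example             # HYPOTHETICAL group name
    rules:
      - alert: PowerOutage
        expr: ups_on_battery == 1 # HYPOTHETICAL expression; the real one is not shown here
        for: 3m                   # was 1m
```

One likely reading of the for:3m choice: for a job on the 2m global cadence, a 3m pending window always contains at least one fresh scrape after the sample that first matched, so an alert cannot fire on a single sample.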

Expected: 50-150m (millicores) of sustained CPU savings across Prometheus
and the apiserver. Verification is ongoing: the apiserver takes a few
minutes to settle after a Prometheus config reload because of the
initial target-scrape burst.
2026-05-22 14:17:00 +00:00
dashboards monitoring(wealth): drop 6y timeFrom override on META vest cadence 2026-05-22 14:16:57 +00:00
server-power-cycle Add broker-sync Terraform stack (#7) 2026-04-17 21:17:45 +01:00
alloy.yaml alloy: switch pod log shipping from apiserver to file-tail 2026-05-22 14:17:00 +00:00
Dockerfile extract monitoring, nvidia, mailserver, cloudflared, kyverno from platform [ci skip] 2026-03-17 21:34:11 +00:00
goflow2.tf [infra] Sweep dns_config ignore_changes across all pod-owning resources [ci skip] 2026-04-18 21:19:48 +00:00
grafana.tf monitoring(grafana): swap python3 for jq in folder-ACL local-exec 2026-05-22 14:16:41 +00:00
grafana_chart_values.yaml monitoring: protect grafana ingress with authentik + disable anonymous 2026-05-22 14:16:41 +00:00
idrac.tf infra: document auth = "app|none" tier on every legacy ingress 2026-05-22 14:16:44 +00:00
k8s-monitoring-values.yaml cleanup: remove calibre and audiobookshelf stacks after ebooks migration [ci skip] 2026-03-25 23:56:07 +02:00
loki.tf security(wave1): W1.1 audit-log shipping LIVE + W1.5 trusted-registries Enforce LIVE 2026-05-22 14:16:59 +00:00
loki.yaml [infra] TrueNAS decommission — remove active references from Terraform + configs 2026-04-19 16:57:05 +00:00
main.tf keel: enroll 15 critical-path namespaces for digest-only auto-update 2026-05-22 14:16:56 +00:00
prometheus.tf fix: HA Sofia REST sensors + PVC drift safety 2026-05-22 14:16:43 +00:00
prometheus_chart_values.tpl monitoring: prometheus global scrape 1m -> 2m + UPS pinned 30s 2026-05-22 14:17:00 +00:00
prometheus_snmp_chart_values.yaml extract monitoring, nvidia, mailserver, cloudflared, kyverno from platform [ci skip] 2026-03-17 21:34:11 +00:00
pve_exporter.tf [infra] Sweep dns_config ignore_changes across all pod-owning resources [ci skip] 2026-04-18 21:19:48 +00:00
snmp_exporter.tf infra: document auth = "app|none" tier on every legacy ingress 2026-05-22 14:16:44 +00:00
ups_snmp_values.yaml extract monitoring, nvidia, mailserver, cloudflared, kyverno from platform [ci skip] 2026-03-17 21:34:11 +00:00