infra/modules/kubernetes/monitoring
Viktor Barzin 0c18a86a7b
[ci skip] Fix all active Prometheus alerts
- meshcentral: rename port from "https" to "http" — MeshCentral serves
  plain HTTP when REVERSE_PROXY=true, but Traefik inferred HTTPS from the
  port name, causing 100% 5xx errors
- osm-routing/otp: scale to 0 — TfL GTFS data expired, OTP crash-loops
  trying to build graph with no valid transit trips
- wireguard: add prometheus.io/port=9586 annotation — without it,
  Prometheus tried scraping all container ports (51820 UDP, 80)
- travel-blog: remove stale prometheus.io annotations and dead port 9113
  — nginx-exporter sidecar was commented out but annotations remained
- dawarich: remove prometheus.io annotations — exporter env vars are
  commented out so nothing listens on port 9394
- monitoring: raise CPU temp threshold 60°C→75°C (E5-2699 v4 Tcase is
  79°C), lower registry cache threshold 50%→25%, add minimum traffic
  floor (>0.1 req/s) to 4xx/5xx rate alerts to prevent false positives
  on low-traffic services
2026-02-11 22:40:56 +00:00
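The minimum traffic floor on the 4xx/5xx rate alerts can be illustrated with a small sketch. This is not the rule from prometheus_chart_values.tpl; it only models the logic in Python, assuming the alert compares an error ratio against a threshold. The 0.1 req/s floor comes from the commit message above. `ServiceTraffic`, `should_alert`, and the 5% value of `ERROR_RATIO_THRESHOLD` are hypothetical, chosen for illustration.

```python
from dataclasses import dataclass

# Floor taken from the commit message: ignore services below 0.1 req/s.
MIN_TRAFFIC_RPS = 0.1
# Hypothetical error-ratio threshold; the real value is not stated in the commit.
ERROR_RATIO_THRESHOLD = 0.05


@dataclass
class ServiceTraffic:
    """Request rates for one service, in requests per second."""
    total_rps: float
    error_rps: float


def should_alert(
    traffic: ServiceTraffic,
    min_rps: float = MIN_TRAFFIC_RPS,
    ratio_threshold: float = ERROR_RATIO_THRESHOLD,
) -> bool:
    """Return True only if the service has enough traffic AND a high error ratio."""
    # Traffic floor: below this rate a single failed request can dominate
    # the ratio, so the alert is suppressed entirely.
    if traffic.total_rps <= min_rps:
        return False
    return traffic.error_rps / traffic.total_rps > ratio_threshold


if __name__ == "__main__":
    quiet = ServiceTraffic(total_rps=0.02, error_rps=0.01)  # 50% errors, tiny volume
    busy = ServiceTraffic(total_rps=10.0, error_rps=1.0)    # 10% errors, real volume
    print(should_alert(quiet), should_alert(busy))
```

With the floor in place, a service that saw two requests in a hypothetical five-minute window (about 0.007 req/s), one of them failing, no longer fires, while a busy service with a comparable error ratio still does.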
dashboards [ci skip] Assorted pending changes: ollama API auth, nvidia dashboard, traefik rewrite-body plugin 2026-02-10 21:29:54 +00:00
server-power-cycle remove kubectl manifests bc drone is not happy running them :/ 2021-05-08 14:03:34 +01:00
alloy.yaml add loki + alloy deployments for logs collection [ci skip] 2025-05-04 11:25:39 +00:00
Dockerfile add repo for the dockerfile for the redfish exporter [ci skip] 2023-10-24 11:46:18 +00:00
grafana.tf replace hardcoded namespace with module reference [ci skip] 2025-12-29 10:23:42 +00:00
grafana_chart_values.yaml Migrate all service modules from nginx-ingress to Traefik 2026-02-07 13:25:49 +00:00
idrac.tf reduce the frequency of polling idrac and remove some duplicates [ci skip] 2026-01-24 18:47:22 +00:00
k8s-monitoring-values.yaml add loki + alloy deployments for logs collection [ci skip] 2025-05-04 11:25:39 +00:00
loki.tf replace hardcoded namespace with module reference [ci skip] 2025-12-29 10:23:42 +00:00
loki.yaml add loki + alloy deployments for logs collection [ci skip] 2025-05-04 11:25:39 +00:00
main.tf Migrate all service modules from nginx-ingress to Traefik 2026-02-07 13:25:49 +00:00
prometheus.tf replace hardcoded namespace with module reference [ci skip] 2025-12-29 10:23:42 +00:00
prometheus_chart_values.tpl [ci skip] Fix all active Prometheus alerts 2026-02-11 22:40:56 +00:00
prometheus_snmp_chart_values.yaml update redfish exporter to new implementation [ci skip] 2023-10-24 11:44:19 +00:00
pve_exporter.tf add tier to all deployments [ci skip] 2026-01-10 16:28:14 +00:00
snmp_exporter.tf reduce the frequency of polling idrac and remove some duplicates [ci skip] 2026-01-24 18:47:22 +00:00
ups_snmp_values.yaml add 2 more oids for ups to monitor active and reactive power consumption [ci skip] 2025-03-15 17:54:04 +00:00