infra/modules/kubernetes/monitoring
Viktor Barzin c8a41ac567 [ci skip] Add 12 Prometheus alert rules for monitoring gaps
Add 3 new alert groups and 1 rule to an existing group:
- Storage: NodeFilesystemFull (<10% free), PVFillingUp (>85% used)
- K8s Health: PodCrashLooping, ContainerOOMKilled, NodeNotReady,
  NodeConditionBad, JobFailed
- Infrastructure Health: CoreDNSErrors, ScrapeTargetDown,
  PrometheusStorageFull, PrometheusNotificationsFailing
- R730 Host: FanFailure (iDRAC Redfish fan health)
2026-02-11 22:14:30 +00:00
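For orientation, the two Storage thresholds named in the commit message could be expressed roughly as below. This is a minimal sketch, not the contents of prometheus_chart_values.tpl. It assumes node_exporter and kubelet volume metrics are being scraped. The group name, the `fstype` filter, the `for` durations, the severity labels, and the annotation text are all illustrative assumptions.

```yaml
groups:
  - name: storage            # assumed group name
    rules:
      # Fires when a node filesystem has less than 10% space available.
      - alert: NodeFilesystemFull
        expr: |
          node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"}
            / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"} < 0.10
        for: 10m             # assumed hold time
        labels:
          severity: warning  # assumed severity
        annotations:
          summary: "Filesystem {{ $labels.mountpoint }} on {{ $labels.instance }} has less than 10% free"

      # Fires when a persistent volume is more than 85% used.
      - alert: PVFillingUp
        expr: |
          kubelet_volume_stats_used_bytes
            / kubelet_volume_stats_capacity_bytes > 0.85
        for: 10m             # assumed hold time
        labels:
          severity: warning  # assumed severity
        annotations:
          summary: "PVC {{ $labels.persistentvolumeclaim }} in {{ $labels.namespace }} is more than 85% used"
```

Both expressions are plain ratios, so the thresholds map directly onto the "<10% free" and ">85% used" wording in the commit message.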
dashboards [ci skip] Assorted pending changes: ollama API auth, nvidia dashboard, traefik rewrite-body plugin 2026-02-10 21:29:54 +00:00
server-power-cycle remove kubectl manifests bc drone is not happy running them :/ 2021-05-08 14:03:34 +01:00
alloy.yaml add loki + alloy deployments for logs collection [ci skip] 2025-05-04 11:25:39 +00:00
Dockerfile add repo for the dockerfile for the redfish exporter [ci skip] 2023-10-24 11:46:18 +00:00
grafana.tf replace hardcoded namespace with module reference [ci skip] 2025-12-29 10:23:42 +00:00
grafana_chart_values.yaml Migrate all service modules from nginx-ingress to Traefik 2026-02-07 13:25:49 +00:00
idrac.tf reduce the frequency of polling idrac and remove some duplicates [ci skip] 2026-01-24 18:47:22 +00:00
k8s-monitoring-values.yaml add loki + alloy deployments for logs collection [ci skip] 2025-05-04 11:25:39 +00:00
loki.tf replace hardcoded namespace with module reference [ci skip] 2025-12-29 10:23:42 +00:00
loki.yaml add loki + alloy deployments for logs collection [ci skip] 2025-05-04 11:25:39 +00:00
main.tf Migrate all service modules from nginx-ingress to Traefik 2026-02-07 13:25:49 +00:00
prometheus.tf replace hardcoded namespace with module reference [ci skip] 2025-12-29 10:23:42 +00:00
prometheus_chart_values.tpl [ci skip] Add 12 Prometheus alert rules for monitoring gaps 2026-02-11 22:14:30 +00:00
pve_exporter.tf add tier to all deployments [ci skip] 2026-01-10 16:28:14 +00:00
snmp_exporter.tf reduce the frequency of polling idrac and remove some duplicates [ci skip] 2026-01-24 18:47:22 +00:00
ups_snmp_values.yaml add 2 more oids for ups to monitor active and reactive power consumption [ci skip] 2025-03-15 17:54:04 +00:00