infra

Viktor Barzin da4cf18d6d Add per-pod GPU memory metrics exporter		2026-01-31 16:58:14 +00:00
- Add DaemonSet that runs on GPU node and exposes Prometheus metrics
- Uses nvidia-smi to collect per-process GPU memory usage
- Maps PIDs to container IDs via /proc/<pid>/cgroup
- Exposes gpu_pod_memory_used_bytes metric at :9401/metrics
- Add Prometheus scrape config for gpu-pod-memory job
[ci skip]
dashboards	add registry low cache hit rate alert [ci skip]	2025-12-29 10:43:57 +00:00
server-power-cycle	remove kubectl manifests bc drone is not happy running them :/	2021-05-08 14:03:34 +01:00
alloy.yaml	add loki + alloy deployments for logs collection [ci skip]	2025-05-04 11:25:39 +00:00
Dockerfile	add repo for the dockerfile for the redfish exporter [ci skip]	2023-10-24 11:46:18 +00:00
grafana.tf	replace hardcoded namespace with module reference [ci skip]	2025-12-29 10:23:42 +00:00
grafana_chart_values.yaml	scale grafana to 3 pods for resilience [ci skip]	2026-01-12 18:27:54 +00:00
idrac.tf	reduce the frequency of polling idrac and remove some duplicates [ci skip]	2026-01-24 18:47:22 +00:00
k8s-monitoring-values.yaml	add loki + alloy deployments for logs collection [ci skip]	2025-05-04 11:25:39 +00:00
loki.tf	replace hardcoded namespace with module reference [ci skip]	2025-12-29 10:23:42 +00:00
loki.yaml	add loki + alloy deployments for logs collection [ci skip]	2025-05-04 11:25:39 +00:00
main.tf	add tier to all deployments [ci skip]	2026-01-10 16:28:14 +00:00
prometheus.tf	replace hardcoded namespace with module reference [ci skip]	2025-12-29 10:23:42 +00:00
prometheus_chart_values.tpl	Add per-pod GPU memory metrics exporter	2026-01-31 16:58:14 +00:00
pve_exporter.tf	add tier to all deployments [ci skip]	2026-01-10 16:28:14 +00:00
snmp_exporter.tf	reduce the frequency of polling idrac and remove some duplicates [ci skip]	2026-01-24 18:47:22 +00:00
ups_snmp_values.yaml	add 2 more oids for ups to monitor active and reactive power consumption [ci skip]	2025-03-15 17:54:04 +00:00
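
The latest commit ("Add per-pod GPU memory metrics exporter") describes a pipeline: read per-process GPU memory from nvidia-smi, map each PID to a container ID via /proc/<pid>/cgroup, and expose gpu_pod_memory_used_bytes. The exporter's source is not shown in this listing, so the following is only a minimal sketch of that idea, not the repository's implementation. The function names, the `container_id` label, the 64-hex-character container ID pattern, and the choice of Python are all assumptions made for illustration.

```python
import re
import subprocess
from pathlib import Path

# Assumption: container IDs appear in cgroup paths as 64 lowercase hex chars
# (common for containerd and Docker); other runtimes may differ.
CONTAINER_ID_RE = re.compile(r"([0-9a-f]{64})")

# Real nvidia-smi query flags; used_memory is reported in MiB with "nounits".
NVIDIA_SMI_CMD = [
    "nvidia-smi",
    "--query-compute-apps=pid,used_memory",
    "--format=csv,noheader,nounits",
]


def parse_nvidia_smi(csv_text):
    """Parse 'pid, used_memory' rows (MiB) into {pid: used_bytes}."""
    usage = {}
    for line in csv_text.splitlines():
        parts = [p.strip() for p in line.split(",")]
        # Skip blank rows and rows where the driver reports [N/A].
        if len(parts) != 2 or not all(p.isdigit() for p in parts):
            continue
        usage[int(parts[0])] = int(parts[1]) * 1024 * 1024
    return usage


def container_id_from_cgroup(cgroup_text):
    """Return the first container ID found in /proc/<pid>/cgroup text, or None."""
    for line in cgroup_text.splitlines():
        match = CONTAINER_ID_RE.search(line)
        if match:
            return match.group(1)
    return None


def render_metrics(usage_by_container):
    """Render {container_id: bytes} in the Prometheus text exposition format."""
    lines = [
        "# HELP gpu_pod_memory_used_bytes GPU memory used per container.",
        "# TYPE gpu_pod_memory_used_bytes gauge",
    ]
    for cid, used in sorted(usage_by_container.items()):
        # The label name "container_id" is hypothetical.
        lines.append(f'gpu_pod_memory_used_bytes{{container_id="{cid}"}} {used}')
    return "\n".join(lines) + "\n"


def collect(proc_root="/proc"):
    """Run nvidia-smi and sum GPU memory per container ID."""
    out = subprocess.run(
        NVIDIA_SMI_CMD, capture_output=True, text=True, check=True
    ).stdout
    totals = {}
    for pid, used in parse_nvidia_smi(out).items():
        try:
            cgroup_text = Path(proc_root, str(pid), "cgroup").read_text()
        except OSError:
            continue  # The process exited between the query and the read.
        cid = container_id_from_cgroup(cgroup_text) or "unknown"
        totals[cid] = totals.get(cid, 0) + used
    return totals
```

On a GPU node, `print(render_metrics(collect()))` would emit the metric text; serving it at :9401/metrics and translating container IDs into pod names are left out of this sketch. Reading /proc for other containers' processes generally requires the pod to share the host PID namespace, which is one reason a DaemonSet on the GPU node is a natural fit.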