infra

Author	SHA1	Message	Date
Viktor Barzin	da4cf18d6d	Add per-pod GPU memory metrics exporter - Add DaemonSet that runs on GPU node and exposes Prometheus metrics - Uses nvidia-smi to collect per-process GPU memory usage - Maps PIDs to container IDs via /proc/<pid>/cgroup - Exposes gpu_pod_memory_used_bytes metric at :9401/metrics - Add Prometheus scrape config for gpu-pod-memory job [ci skip]	2026-01-31 16:58:14 +00:00
Viktor Barzin	1eb3c30479	increase the number of nvidia slices to 20 [ci skip]	2026-01-26 20:41:59 +00:00
Viktor Barzin	f1e9fb9afe	add tier to all deployments [ci skip]	2026-01-10 16:28:14 +00:00
Viktor Barzin	f1dde96d80	replace hardcoded namespace with module reference [ci skip]	2025-12-29 10:23:42 +00:00
Viktor Barzin	8af9e6b5bd	set the time slicing config in the nvidia chart values [ci skip]	2025-12-28 08:35:44 +00:00
Viktor Barzin	308ce0019d	downgrade nvidia driver to work with cuda 12.8 [ci skip]	2025-12-14 19:09:20 +00:00
Viktor Barzin	58240d640b	add nvidia deployment [ci skip]	2025-12-14 09:50:26 +00:00
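
Commit da4cf18d6d describes an exporter that reads per-process GPU memory from nvidia-smi, maps each PID to a container ID via /proc/<pid>/cgroup, and publishes gpu_pod_memory_used_bytes. The following is a minimal Python sketch of that flow, not the code from the commit. It assumes that nvidia-smi output arrives as "pid, used_MiB" CSV rows, that the container ID is the 64-character hex string in the cgroup path, and that the metric carries a hypothetical container_id label. The function names are illustrative only.

```python
import re

# Assumption: container runtimes embed a 64-character hex container ID
# in the cgroup path read from /proc/<pid>/cgroup.
_CONTAINER_ID = re.compile(r"[0-9a-f]{64}")
MIB = 1024 * 1024


def container_id_from_cgroup(cgroup_text):
    """Return the last 64-hex ID found in the cgroup text, or None."""
    matches = _CONTAINER_ID.findall(cgroup_text)
    return matches[-1] if matches else None


def parse_smi_csv(csv_text):
    """Parse 'pid, used_MiB' rows (assumed format) into {pid: bytes}."""
    usage = {}
    for line in csv_text.splitlines():
        parts = [p.strip() for p in line.split(",")]
        if len(parts) != 2 or not parts[0].isdigit() or not parts[1].isdigit():
            continue  # skip blank or unparsable rows such as "[N/A]"
        pid = int(parts[0])
        usage[pid] = usage.get(pid, 0) + int(parts[1]) * MIB
    return usage


def render_metrics(usage_by_pid, cgroup_by_pid):
    """Sum bytes per container and render Prometheus text exposition lines."""
    per_container = {}
    for pid, used in usage_by_pid.items():
        cid = container_id_from_cgroup(cgroup_by_pid.get(pid, ""))
        if cid is None:
            continue  # process is not in a container, or has already exited
        per_container[cid] = per_container.get(cid, 0) + used
    lines = ["# TYPE gpu_pod_memory_used_bytes gauge"]
    for cid in sorted(per_container):
        # The container_id label is hypothetical; the real exporter may use
        # pod and namespace labels instead.
        lines.append(
            'gpu_pod_memory_used_bytes{container_id="%s"} %d'
            % (cid, per_container[cid])
        )
    return "\n".join(lines) + "\n"
```

In a real exporter the CSV text would come from running nvidia-smi and the cgroup text from reading /proc/<pid>/cgroup on the GPU node; both are passed in as strings here so the parsing logic can be exercised without a GPU.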