Viktor Barzin
|
4a857ebefd
|
Add per-pod GPU memory metrics exporter
- Add DaemonSet that runs on GPU node and exposes Prometheus metrics
- Uses nvidia-smi to collect per-process GPU memory usage
- Maps PIDs to container IDs via /proc/<pid>/cgroup
- Exposes gpu_pod_memory_used_bytes metric at :9401/metrics
- Add Prometheus scrape config for gpu-pod-memory job
[ci skip]
|
2026-01-31 16:58:14 +00:00 |
|
Viktor Barzin
|
92e58d3b62
|
increase the number of nvidia slices to 20 [ci skip]
|
2026-01-26 20:41:59 +00:00 |
|
Viktor Barzin
|
8abb8eddc0
|
add tier to all deployments [ci skip]
|
2026-01-10 16:28:14 +00:00 |
|
Viktor Barzin
|
a3624f80e0
|
replace hardcoded namespace with module reference [ci skip]
|
2025-12-29 10:23:42 +00:00 |
|
Viktor Barzin
|
7a88c26b5b
|
set the time slicing config in the nvidia chart values [ci skip]
|
2025-12-28 08:35:44 +00:00 |
|
Viktor Barzin
|
64f8eb1fe7
|
downgrade nvidia driver to work with CUDA 12.8 [ci skip]
|
2025-12-14 19:09:20 +00:00 |
|
Viktor Barzin
|
e17f10f9ee
|
add nvidia deployment [ci skip]
|
2025-12-14 09:50:26 +00:00 |
|
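Commit 4a857ebefd describes an exporter that reads per-process GPU memory from nvidia-smi, maps each PID to a container ID via /proc/<pid>/cgroup, and exposes gpu_pod_memory_used_bytes. The sketch below illustrates that flow on plain strings; it is not the code from the commit. The function names, the 64-hex-character container ID pattern, the container_id label, and the assumption that nvidia-smi is queried with --query-compute-apps=pid,used_memory --format=csv,noheader,nounits (memory reported in MiB) are all illustrative assumptions.

```python
import re

# Assumption: container IDs appear as 64 hex characters in the cgroup path,
# as is common with containerd/Docker; other runtimes may differ.
_CONTAINER_ID = re.compile(r"([0-9a-f]{64})")


def container_id_from_cgroup(cgroup_text):
    """Return a container ID from /proc/<pid>/cgroup contents, or None.

    Uses the last 64-hex match on the first line that contains one.
    """
    for line in cgroup_text.splitlines():
        matches = _CONTAINER_ID.findall(line)
        if matches:
            return matches[-1]
    return None


def parse_compute_apps(csv_text):
    """Parse 'pid, used_memory' CSV rows (MiB assumed) into {pid: bytes}."""
    usage = {}
    for line in csv_text.splitlines():
        parts = [p.strip() for p in line.split(",")]
        # Skip rows that are not exactly two integer fields.
        if len(parts) != 2 or not parts[0].isdigit() or not parts[1].isdigit():
            continue
        usage[int(parts[0])] = int(parts[1]) * 1024 * 1024
    return usage


def render_metrics(per_container_bytes):
    """Render a Prometheus text-format gauge; the label name is hypothetical."""
    lines = ["# TYPE gpu_pod_memory_used_bytes gauge"]
    for cid, used in sorted(per_container_bytes.items()):
        lines.append(
            'gpu_pod_memory_used_bytes{container_id="%s"} %d' % (cid, used)
        )
    return "\n".join(lines) + "\n"
```

In a running exporter, the cgroup text would come from reading /proc/<pid>/cgroup for each PID that nvidia-smi reports, and the rendered text would be served at :9401/metrics. Resolving a container ID to a pod name needs an extra lookup that this sketch leaves out.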