Add per-pod GPU memory metrics exporter

- Add a DaemonSet that runs on GPU nodes and exposes Prometheus metrics
- Use nvidia-smi to collect per-process GPU memory usage
- Map PIDs to container IDs via /proc/<pid>/cgroup
- Expose the gpu_pod_memory_used_bytes metric at :9401/metrics
- Add a Prometheus scrape config for the gpu-pod-memory job
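
The collection flow in the list above can be sketched roughly as follows. This is a minimal illustration, not the exporter added in this commit: the function names, the `container_id` label, and the 64-hex-character container ID pattern are assumptions made for the example. Only the metric name, the use of nvidia-smi, and the /proc/<pid>/cgroup lookup come from the commit message.

```python
import re
import subprocess

# Real nvidia-smi query flags; with "nounits", used_memory is reported in MiB.
NVIDIA_SMI_CMD = [
    "nvidia-smi",
    "--query-compute-apps=pid,used_memory",
    "--format=csv,noheader,nounits",
]

# Assumption: the container runtime embeds a 64-character hex ID in the cgroup path.
CONTAINER_ID_RE = re.compile(r"[0-9a-f]{64}")


def parse_nvidia_smi(output):
    """Parse 'pid, used_mib' CSV rows into a {pid: used_bytes} dict."""
    usage = {}
    for line in output.strip().splitlines():
        parts = [p.strip() for p in line.split(",")]
        if len(parts) != 2 or not parts[0].isdigit() or not parts[1].isdigit():
            continue  # skip malformed rows or "[N/A]" memory values
        pid, mib = int(parts[0]), int(parts[1])
        # A process may appear once per GPU, so sum its rows.
        usage[pid] = usage.get(pid, 0) + mib * 1024 * 1024
    return usage


def container_id_from_cgroup(cgroup_text):
    """Return the first 64-hex container ID in a cgroup file, or None."""
    match = CONTAINER_ID_RE.search(cgroup_text)
    return match.group(0) if match else None


def read_cgroup(pid):
    """Read /proc/<pid>/cgroup; return '' if the process has already exited."""
    try:
        with open(f"/proc/{pid}/cgroup") as f:
            return f.read()
    except OSError:
        return ""


def render_metrics(usage, cgroup_reader=read_cgroup):
    """Aggregate per-PID usage by container and render Prometheus text format."""
    per_container = {}
    for pid, used in usage.items():
        cid = container_id_from_cgroup(cgroup_reader(pid))
        if cid is None:
            continue  # process is not running inside a recognizable container
        per_container[cid] = per_container.get(cid, 0) + used
    lines = ["# TYPE gpu_pod_memory_used_bytes gauge"]
    for cid, used in sorted(per_container.items()):
        # The "container_id" label name is hypothetical.
        lines.append(f'gpu_pod_memory_used_bytes{{container_id="{cid}"}} {used}')
    return "\n".join(lines) + "\n"


def collect():
    """Run nvidia-smi on the node and return the rendered metrics text."""
    out = subprocess.run(
        NVIDIA_SMI_CMD, capture_output=True, text=True, check=True
    ).stdout
    return render_metrics(parse_nvidia_smi(out))
```

Keeping the parsing and rendering steps as pure functions lets them be exercised with canned nvidia-smi output and cgroup text, without a GPU. Resolving a container ID to a pod name would need an extra lookup (for example against the kubelet or the Kubernetes API), which this sketch leaves out.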

[ci skip]
Viktor Barzin 2026-01-31 16:58:14 +00:00
parent 09a5e3a273
commit 4a857ebefd
2 changed files with 294 additions and 0 deletions

@@ -623,4 +623,9 @@ extraScrapeConfigs: |
        action: replace
        regex: '(.*)'
        replacement: 'nvidia_tesla_t4_$${1}'
  - job_name: 'gpu-pod-memory'
    static_configs:
      - targets:
          - "gpu-pod-exporter.nvidia.svc.cluster.local"
    metrics_path: '/metrics'