Add per-pod GPU memory metrics exporter
- Add a DaemonSet that runs on GPU nodes and exposes Prometheus metrics
- Use nvidia-smi to collect per-process GPU memory usage
- Map PIDs to container IDs via /proc/<pid>/cgroup
- Expose the gpu_pod_memory_used_bytes metric at :9401/metrics
- Add a Prometheus scrape config for the gpu-pod-memory job

[ci skip]
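
The collection steps listed above can be sketched roughly as follows. This is a minimal illustration, not the exporter added by this commit: the function names, the `container_id` label, and the choice to sum memory per container are assumptions. It also assumes the exporter can see host PIDs (for example via `hostPID`) and that `nvidia-smi --query-compute-apps=pid,used_memory` reports MiB when `nounits` is set. Resolving a container ID to a pod name is left out.

```python
import re
import subprocess
from pathlib import Path
from typing import Optional

MIB = 1024 * 1024
# Container IDs written by common runtimes are 64 lowercase hex characters.
CONTAINER_ID_RE = re.compile(r"[0-9a-f]{64}")

# Assumed query; used_memory is printed in MiB when `nounits` is set.
NVIDIA_SMI_CMD = [
    "nvidia-smi",
    "--query-compute-apps=pid,used_memory",
    "--format=csv,noheader,nounits",
]


def parse_compute_apps(csv_text: str) -> list[tuple[int, int]]:
    """Parse 'pid, used_mib' rows into (pid, used_bytes) pairs."""
    rows = []
    for line in csv_text.splitlines():
        parts = [p.strip() for p in line.split(",")]
        if len(parts) != 2 or not (parts[0].isdigit() and parts[1].isdigit()):
            continue  # skip blank rows and values such as "[N/A]"
        rows.append((int(parts[0]), int(parts[1]) * MIB))
    return rows


def container_id_from_cgroup(cgroup_text: str) -> Optional[str]:
    """Return the first 64-hex container ID in /proc/<pid>/cgroup text."""
    match = CONTAINER_ID_RE.search(cgroup_text)
    return match.group(0) if match else None


def render_metrics(samples: dict[str, int]) -> str:
    """Render per-container totals in the Prometheus text format."""
    lines = [
        "# HELP gpu_pod_memory_used_bytes GPU memory used, summed per container.",
        "# TYPE gpu_pod_memory_used_bytes gauge",
    ]
    for cid, used in sorted(samples.items()):
        lines.append(f'gpu_pod_memory_used_bytes{{container_id="{cid}"}} {used}')
    return "\n".join(lines) + "\n"


def collect() -> str:
    """Run nvidia-smi, map each PID to a container, and render the metrics."""
    out = subprocess.run(
        NVIDIA_SMI_CMD, capture_output=True, text=True, check=True
    ).stdout
    totals: dict[str, int] = {}
    for pid, used in parse_compute_apps(out):
        try:
            cgroup = Path(f"/proc/{pid}/cgroup").read_text()
        except OSError:
            continue  # the process exited between the two reads
        cid = container_id_from_cgroup(cgroup)
        if cid:
            totals[cid] = totals.get(cid, 0) + used
    return render_metrics(totals)
```

Keeping the parsing and rendering functions free of I/O means they can be checked against captured `nvidia-smi` output and cgroup files without a GPU node; only `collect()` touches the host.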
parent 09a5e3a273
commit 4a857ebefd
2 changed files with 294 additions and 0 deletions
@@ -623,4 +623,9 @@ extraScrapeConfigs: |
       action: replace
       regex: '(.*)'
       replacement: 'nvidia_tesla_t4_$${1}'
+  - job_name: 'gpu-pod-memory'
+    static_configs:
+      - targets:
+        - "gpu-pod-exporter.nvidia.svc.cluster.local"
+    metrics_path: '/metrics'