9 commits

Author SHA1 Message Date
Viktor Barzin
5a81ce5774 [ci skip] allow 100 time slices of the nvidia gpu 2026-02-09 21:00:15 +00:00
Viktor Barzin
1275697f2b Add GPU node taint tolerations and enhance GPU memory exporter
Add nvidia.com/gpu toleration to all GPU workloads (frigate, ollama)
to support NoSchedule taint on GPU nodes. Update nvidia operator
helm values with daemonset tolerations. Enhance GPU pod memory
exporter with Kubernetes API integration to resolve container IDs
to pod names/namespaces, adding RBAC resources for API access.
2026-02-06 20:19:26 +00:00
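
The container-ID-to-pod resolution described in 1275697f2b can be sketched as follows. This is a minimal illustration, not the exporter's actual code: it assumes the exporter has already fetched a pod list in the JSON shape returned by the Kubernetes API (`items[].status.containerStatuses[].containerID`, formatted as `<runtime>://<id>`). The function name `index_pods_by_container_id` is hypothetical.

```python
from typing import Dict, Tuple


def index_pods_by_container_id(pod_list: dict) -> Dict[str, Tuple[str, str]]:
    """Map bare container IDs to (namespace, pod name).

    `pod_list` is assumed to be a decoded Kubernetes PodList object.
    """
    index: Dict[str, Tuple[str, str]] = {}
    for pod in pod_list.get("items", []):
        meta = pod.get("metadata", {})
        # containerStatuses may be absent or null for pods that have not started.
        statuses = pod.get("status", {}).get("containerStatuses") or []
        for status in statuses:
            raw = status.get("containerID") or ""
            # Drop the runtime prefix, e.g. "containerd://<id>" -> "<id>".
            _, sep, bare = raw.partition("://")
            if sep and bare:
                index[bare] = (meta.get("namespace", ""), meta.get("name", ""))
    return index
```

A lookup such as `index.get(container_id)` then yields the namespace and pod name to attach as metric labels, or `None` for processes that do not belong to a pod.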
Viktor Barzin
da4cf18d6d Add per-pod GPU memory metrics exporter
- Add DaemonSet that runs on GPU node and exposes Prometheus metrics
- Uses nvidia-smi to collect per-process GPU memory usage
- Maps PIDs to container IDs via /proc/<pid>/cgroup
- Exposes gpu_pod_memory_used_bytes metric at :9401/metrics
- Add Prometheus scrape config for gpu-pod-memory job

[ci skip]
2026-01-31 16:58:14 +00:00
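
The PID-to-container mapping via /proc/<pid>/cgroup mentioned in da4cf18d6d might look roughly like the sketch below. It rests on an assumption the commit message does not confirm: that the container runtime embeds a 64-character hexadecimal container ID in the cgroup path, as containerd and Docker commonly do. The helper names are hypothetical.

```python
import re
from typing import Optional

# Assumption: the container ID appears as a 64-character lowercase hex string
# somewhere in the cgroup path.
_CONTAINER_ID = re.compile(r"[0-9a-f]{64}")


def container_id_from_cgroup(cgroup_text: str) -> Optional[str]:
    """Return the first container ID found in /proc/<pid>/cgroup content."""
    for line in cgroup_text.splitlines():
        match = _CONTAINER_ID.search(line)
        if match:
            return match.group(0)
    return None


def read_container_id(pid: int) -> Optional[str]:
    """Read /proc/<pid>/cgroup; return None if the process is gone or unreadable."""
    try:
        with open(f"/proc/{pid}/cgroup") as fh:
            return container_id_from_cgroup(fh.read())
    except OSError:
        return None
```

Returning `None` instead of raising keeps a scrape from failing when a GPU process exits between the nvidia-smi query and the cgroup read.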
Viktor Barzin
1eb3c30479 increase the num of nvidia slices to 20 [ci skip] 2026-01-26 20:41:59 +00:00
Viktor Barzin
f1e9fb9afe add tier to all deployments [ci skip] 2026-01-10 16:28:14 +00:00
Viktor Barzin
f1dde96d80 replace hardcoded namespace with module reference [ci skip] 2025-12-29 10:23:42 +00:00
Viktor Barzin
8af9e6b5bd set the time slicing config in the nvidia chart values [ci skip] 2025-12-28 08:35:44 +00:00
Viktor Barzin
308ce0019d downgrade nvidia driver to work with 12.8 cuda [ci skip] 2025-12-14 19:09:20 +00:00
Viktor Barzin
58240d640b add nvidia deployment [ci skip] 2025-12-14 09:50:26 +00:00