Commit graph

9 commits

Author SHA1 Message Date
Viktor Barzin
cb761f90d7 [ci skip] allow 100 time slicing of nvidia gpu 2026-02-09 21:00:15 +00:00
Viktor Barzin
9689b67895 Add GPU node taint tolerations and enhance GPU memory exporter
Add nvidia.com/gpu toleration to all GPU workloads (frigate, ollama)
to support NoSchedule taint on GPU nodes. Update nvidia operator
helm values with daemonset tolerations. Enhance GPU pod memory
exporter with Kubernetes API integration to resolve container IDs
to pod names/namespaces, adding RBAC resources for API access.
2026-02-06 20:19:26 +00:00
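The container-ID-to-pod resolution described in this commit can be sketched as follows. This is a minimal illustration, not the repository's actual exporter code: it assumes the pod list has already been fetched from the Kubernetes API as a JSON document (the standard `PodList` shape, where each entry in `status.containerStatuses` carries a `containerID` such as `containerd://<id>`). The function name `index_pods_by_container_id` is hypothetical.

```python
def index_pods_by_container_id(pod_list):
    """Build {container_id: (namespace, pod_name)} from a PodList dict.

    Assumption: pod_list is the decoded JSON of a Kubernetes PodList,
    e.g. the response of GET /api/v1/pods.
    """
    index = {}
    for pod in pod_list.get("items", []):
        meta = pod.get("metadata", {})
        statuses = pod.get("status", {}).get("containerStatuses") or []
        for status in statuses:
            raw = status.get("containerID") or ""
            # Strip the runtime prefix, e.g. "containerd://abc..." -> "abc..."
            cid = raw.split("://", 1)[-1]
            if cid:
                index[cid] = (meta.get("namespace"), meta.get("name"))
    return index


# Hypothetical sample data for illustration only.
sample = {
    "items": [
        {
            "metadata": {"namespace": "ollama", "name": "ollama-0"},
            "status": {
                "containerStatuses": [
                    {"name": "ollama", "containerID": "containerd://" + "ab" * 32}
                ]
            },
        }
    ]
}
pods = index_pods_by_container_id(sample)
```

Reading all pods across namespaces requires list permission on pods, which is presumably why the commit adds RBAC resources alongside the exporter change.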
Viktor Barzin
4a857ebefd Add per-pod GPU memory metrics exporter
- Add DaemonSet that runs on GPU node and exposes Prometheus metrics
- Uses nvidia-smi to collect per-process GPU memory usage
- Maps PIDs to container IDs via /proc/<pid>/cgroup
- Exposes gpu_pod_memory_used_bytes metric at :9401/metrics
- Add Prometheus scrape config for gpu-pod-memory job

[ci skip]
2026-01-31 16:58:14 +00:00
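The pipeline listed in this commit (nvidia-smi output, PID to container ID via /proc/<pid>/cgroup, Prometheus metric) might look roughly like the sketch below. It is an illustration under stated assumptions, not the DaemonSet's actual script: the input is assumed to come from `nvidia-smi --query-compute-apps=pid,used_memory --format=csv,noheader,nounits` (memory in MiB), container IDs are assumed to appear as 64-character hex strings in the cgroup path, and the `container_id` label and all function names are hypothetical.

```python
import re

# Assumption: the runtime embeds a 64-hex-character container ID in the cgroup path.
CONTAINER_ID_RE = re.compile(r"[0-9a-f]{64}")


def parse_nvidia_smi(csv_text):
    """Parse 'pid, used_memory_mib' rows into {pid: used_bytes}."""
    usage = {}
    for line in csv_text.strip().splitlines():
        try:
            pid_str, mib_str = [field.strip() for field in line.split(",")]
            usage[int(pid_str)] = int(mib_str) * 1024 * 1024
        except ValueError:
            # Skip malformed rows and rows where memory is reported as "[N/A]".
            continue
    return usage


def container_id_from_cgroup(cgroup_text):
    """Return the last 64-hex ID found in /proc/<pid>/cgroup text, or None."""
    matches = CONTAINER_ID_RE.findall(cgroup_text)
    return matches[-1] if matches else None


def read_proc_cgroup(pid):
    """Read /proc/<pid>/cgroup; return '' if the process has already exited."""
    try:
        with open("/proc/%d/cgroup" % pid) as handle:
            return handle.read()
    except OSError:
        return ""


def usage_by_container(usage_by_pid, read_cgroup=read_proc_cgroup):
    """Sum per-process GPU memory into per-container totals."""
    totals = {}
    for pid, used in usage_by_pid.items():
        cid = container_id_from_cgroup(read_cgroup(pid))
        if cid is not None:
            totals[cid] = totals.get(cid, 0) + used
    return totals


def render_metrics(totals):
    """Render totals in the Prometheus text exposition format."""
    lines = ["# TYPE gpu_pod_memory_used_bytes gauge"]
    for cid, used in sorted(totals.items()):
        lines.append('gpu_pod_memory_used_bytes{container_id="%s"} %d' % (cid, used))
    return "\n".join(lines) + "\n"
```

Passing `read_cgroup` as a parameter keeps the aggregation testable without a real /proc filesystem; a real exporter would serve the output of `render_metrics` over HTTP on the port named in the commit.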
Viktor Barzin
92e58d3b62 increase the num of nvidia slices to 20 [ci skip] 2026-01-26 20:41:59 +00:00
Viktor Barzin
8abb8eddc0 add tier to all deployments [ci skip] 2026-01-10 16:28:14 +00:00
Viktor Barzin
a3624f80e0 replace hardcoded namespace with module reference [ci skip] 2025-12-29 10:23:42 +00:00
Viktor Barzin
7a88c26b5b set the time slicing config in the nvidia chart values [ci skip] 2025-12-28 08:35:44 +00:00
Viktor Barzin
64f8eb1fe7 downgrade nvidia driver to work with 12.8 cuda [ci skip] 2025-12-14 19:09:20 +00:00
Viktor Barzin
e17f10f9ee add nvidia deployment [ci skip] 2025-12-14 09:50:26 +00:00