Viktor Barzin
349fffc124
Cluster health remediation: cleanup CronJob, disable Collabora, fix GPU probe, add NFS exports [ci skip]
...
- Add daily CronJob to auto-clean Failed/Evicted pods cluster-wide (infra-maintenance)
- Disable Collabora in Nextcloud (broken HPA caused scaling storm; using OnlyOffice instead)
- Increase gpu-pod-exporter liveness probe timeout from 1s to 5s
- Add osm-routing NFS exports (osrm-data, otp-data)
2026-02-15 17:20:47 +00:00
Viktor Barzin
cb761f90d7
[ci skip] allow 100 time slices of nvidia gpu
2026-02-09 21:00:15 +00:00
Viktor Barzin
9689b67895
Add GPU node taint tolerations and enhance GPU memory exporter
...
Add nvidia.com/gpu toleration to all GPU workloads (frigate, ollama)
to support NoSchedule taint on GPU nodes. Update nvidia operator
helm values with daemonset tolerations. Enhance GPU pod memory
exporter with Kubernetes API integration to resolve container IDs
to pod names/namespaces, adding RBAC resources for API access.
2026-02-06 20:19:26 +00:00
Viktor Barzin
4a857ebefd
Add per-pod GPU memory metrics exporter
...
- Add DaemonSet that runs on GPU node and exposes Prometheus metrics
- Uses nvidia-smi to collect per-process GPU memory usage
- Maps PIDs to container IDs via /proc/<pid>/cgroup
- Exposes gpu_pod_memory_used_bytes metric at :9401/metrics
- Add Prometheus scrape config for gpu-pod-memory job
[ci skip]
2026-01-31 16:58:14 +00:00
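Note on commit 4a857ebefd: the PID-to-container mapping it describes can be sketched roughly as below. This is a minimal illustration, not the exporter's actual code. It assumes container IDs appear as 64 hex digits in the cgroup path (as containerd, CRI-O and Docker typically write them). The function names and the `container_id` label are hypothetical; only the metric name `gpu_pod_memory_used_bytes` comes from the commit message.

```python
import re

# Assumption: the runtime embeds a 64-hex-digit container ID in the cgroup path.
_CONTAINER_ID = re.compile(r"[0-9a-f]{64}")


def container_id_from_cgroup(cgroup_text):
    """Return the first container ID found in /proc/<pid>/cgroup text, or None."""
    for line in cgroup_text.splitlines():
        # Each line has the form "hierarchy-id:controllers:path".
        parts = line.split(":", 2)
        if len(parts) != 3:
            continue
        match = _CONTAINER_ID.search(parts[2])
        if match:
            return match.group(0)
    return None


def read_container_id(pid, proc_root="/proc"):
    """Read /proc/<pid>/cgroup and extract the container ID, or None on failure."""
    try:
        with open(f"{proc_root}/{pid}/cgroup") as handle:
            return container_id_from_cgroup(handle.read())
    except OSError:
        # The process may have exited, or the file may be unreadable.
        return None


def format_metric(container_id, used_bytes):
    """Render one sample in Prometheus text format (label name is hypothetical)."""
    return f'gpu_pod_memory_used_bytes{{container_id="{container_id}"}} {used_bytes}'
```

The per-process memory figures themselves would come from nvidia-smi, as the commit states; that step is omitted here because its exact invocation is not given in the log.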
Viktor Barzin
92e58d3b62
increase the number of nvidia time slices to 20 [ci skip]
2026-01-26 20:41:59 +00:00
Viktor Barzin
8abb8eddc0
add tier to all deployments [ci skip]
2026-01-10 16:28:14 +00:00
Viktor Barzin
a3624f80e0
replace hardcoded namespace with module reference [ci skip]
2025-12-29 10:23:42 +00:00
Viktor Barzin
7a88c26b5b
set the time slicing config in the nvidia chart values [ci skip]
2025-12-28 08:35:44 +00:00
Viktor Barzin
64f8eb1fe7
downgrade nvidia driver to work with CUDA 12.8 [ci skip]
2025-12-14 19:09:20 +00:00
Viktor Barzin
e17f10f9ee
add nvidia deployment [ci skip]
2025-12-14 09:50:26 +00:00