infra

Author	SHA1	Message	Date
Viktor Barzin	349fffc124	Cluster health remediation: cleanup CronJob, disable Collabora, fix GPU probe, add NFS exports [ci skip] - Add daily CronJob to auto-clean Failed/Evicted pods cluster-wide (infra-maintenance) - Disable Collabora in Nextcloud (broken HPA caused scaling storm; using OnlyOffice instead) - Increase gpu-pod-exporter liveness probe timeout from 1s to 5s - Add osm-routing NFS exports (osrm-data, otp-data)	2026-02-15 17:20:47 +00:00
Viktor Barzin	cb761f90d7	[ci skip] allow 100 time slicing of nvidia gpu	2026-02-09 21:00:15 +00:00
Viktor Barzin	9689b67895	Add GPU node taint tolerations and enhance GPU memory exporter. Add nvidia.com/gpu toleration to all GPU workloads (frigate, ollama) to support NoSchedule taint on GPU nodes. Update nvidia operator helm values with daemonset tolerations. Enhance GPU pod memory exporter with Kubernetes API integration to resolve container IDs to pod names/namespaces, adding RBAC resources for API access.	2026-02-06 20:19:26 +00:00
Viktor Barzin	4a857ebefd	Add per-pod GPU memory metrics exporter - Add DaemonSet that runs on GPU node and exposes Prometheus metrics - Uses nvidia-smi to collect per-process GPU memory usage - Maps PIDs to container IDs via /proc/<pid>/cgroup - Exposes gpu_pod_memory_used_bytes metric at :9401/metrics - Add Prometheus scrape config for gpu-pod-memory job [ci skip]	2026-01-31 16:58:14 +00:00
Viktor Barzin	92e58d3b62	increase the num of nvidia slices to 20 [ci skip]	2026-01-26 20:41:59 +00:00
Viktor Barzin	8abb8eddc0	add tier to all deployments [ci skip]	2026-01-10 16:28:14 +00:00
Viktor Barzin	a3624f80e0	replace hardcoded namespace with module reference [ci skip]	2025-12-29 10:23:42 +00:00
Viktor Barzin	7a88c26b5b	set the time slicing config in the nvidia chart values [ci skip]	2025-12-28 08:35:44 +00:00
Viktor Barzin	64f8eb1fe7	downgrade nvidia driver to work with 12.8 cuda [ci skip]	2025-12-14 19:09:20 +00:00
Viktor Barzin	e17f10f9ee	add nvidia deployment [ci skip]	2025-12-14 09:50:26 +00:00

10 commits
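
Commit 4a857ebefd describes mapping PIDs to container IDs via /proc/<pid>/cgroup. The exporter's actual code is not shown in this log, so the following is only a minimal sketch of that one step. It assumes the container ID appears in the cgroup path as a 64-character hex string, which is common for containerd and CRI-O but not guaranteed. The function names container_id_from_cgroup and container_id_for_pid are hypothetical and do not come from the repository.

```python
import re
from typing import Optional

# Assumption: the container ID is a 64-character lowercase hex string
# embedded somewhere in the cgroup path.
_CONTAINER_ID = re.compile(r"[0-9a-f]{64}")


def container_id_from_cgroup(cgroup_text: str) -> Optional[str]:
    """Return the first container ID found in the text of a cgroup file."""
    for line in cgroup_text.splitlines():
        # Each line has the form hierarchy-ID:controller-list:cgroup-path.
        parts = line.split(":", 2)
        if len(parts) != 3:
            continue
        match = _CONTAINER_ID.search(parts[2])
        if match:
            return match.group(0)
    return None


def container_id_for_pid(pid: int, proc_root: str = "/proc") -> Optional[str]:
    """Read <proc_root>/<pid>/cgroup and extract a container ID, if any."""
    try:
        with open(f"{proc_root}/{pid}/cgroup", "r", encoding="utf-8") as fh:
            return container_id_from_cgroup(fh.read())
    except OSError:
        # The process may have exited, or the file may not be readable.
        return None
```

The PIDs would come from nvidia-smi's per-process output, and the resulting container IDs would then be resolved to pod names and namespaces through the Kubernetes API, as commit 9689b67895 describes. Neither of those steps is sketched here.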