Add GPU node taint tolerations and enhance GPU memory exporter

Add an nvidia.com/gpu toleration to all GPU workloads (frigate, ollama)
so they can schedule onto GPU nodes carrying the NoSchedule taint.
Update the NVIDIA GPU operator Helm values with daemonset tolerations.
Enhance the GPU pod memory exporter with Kubernetes API integration to
resolve container IDs to pod names and namespaces, and add the RBAC
resources required for that API access.
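
As a rough illustration of the workload-side change (not the actual manifests from this commit), a pod spec that tolerates the taint might look like the sketch below. The Deployment name, labels, image, and GPU limit are assumptions made for the example; only the toleration key, value, and effect mirror the operator values changed here.

```yaml
# Hypothetical sketch: metadata, image, and resource limit are illustrative,
# not taken from this commit. Only the toleration fields match the commit.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ollama
spec:
  replicas: 1
  selector:
    matchLabels:
      app: ollama
  template:
    metadata:
      labels:
        app: ollama
    spec:
      # Allows scheduling onto nodes tainted nvidia.com/gpu=true:NoSchedule
      tolerations:
        - key: "nvidia.com/gpu"
          operator: "Equal"
          value: "true"
          effect: "NoSchedule"
      containers:
        - name: ollama
          image: ollama/ollama:latest
          resources:
            limits:
              nvidia.com/gpu: 1
```

A toleration only permits scheduling onto the tainted node; it does not steer the pod there. Assuming the taint is applied by hand, the matching command would be `kubectl taint nodes <gpu-node> nvidia.com/gpu=true:NoSchedule`.
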
Viktor Barzin 2026-02-06 20:19:26 +00:00
parent ffa80f0df6
commit 1275697f2b
5 changed files with 188 additions and 12 deletions

@@ -17,3 +17,11 @@ driver:
devicePlugin:
  config:
    name: time-slicing-config
# Tolerate GPU node taint for all GPU operator components
daemonsets:
  tolerations:
    - key: "nvidia.com/gpu"
      operator: "Equal"
      value: "true"
      effect: "NoSchedule"