Add GPU node taint tolerations and enhance GPU memory exporter

Add an nvidia.com/gpu toleration to all GPU workloads (frigate, ollama) so they can still be scheduled onto GPU nodes that carry a NoSchedule taint. Update the NVIDIA GPU Operator Helm values with daemonset tolerations so the operator's own components tolerate the same taint. Enhance the GPU pod memory exporter with Kubernetes API integration so it can resolve container IDs to pod names and namespaces, and add the RBAC resources it needs for API access.
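
For reference, a minimal sketch of how the taint and a workload toleration pair up. The node name is hypothetical, and the taint value "true" is an assumption inferred from the toleration added to the operator values in this commit; the actual frigate and ollama manifests are not reproduced here.

```yaml
# Node side (assumed): the taint the GPU node is expected to carry.
# "gpu-node-1" is a hypothetical name used only for illustration.
apiVersion: v1
kind: Node
metadata:
  name: gpu-node-1
spec:
  taints:
    - key: nvidia.com/gpu
      value: "true"
      effect: NoSchedule
---
# Workload side (fragment of a pod template spec, not a complete manifest):
# a toleration that matches the taint above on key, value, and effect.
spec:
  tolerations:
    - key: "nvidia.com/gpu"
      operator: "Equal"
      value: "true"
      effect: "NoSchedule"
```

With operator "Equal", the toleration matches only when the taint's key, value, and effect all agree. A toleration using operator "Exists" and no value would match any value for the key, which may be more forgiving if the taint value ever changes.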
2026-02-06 20:19:26 +00:00
commit 1275697f2b
parent ffa80f0df6
5 changed files with 188 additions and 12 deletions
--- a/modules/kubernetes/nvidia/values.yaml
+++ b/modules/kubernetes/nvidia/values.yaml
@@ -17,3 +17,11 @@ driver:
  devicePlugin:
    config:
      name: time-slicing-config
+
+# Tolerate GPU node taint for all GPU operator components
+daemonsets:
+  tolerations:
+    - key: "nvidia.com/gpu"
+      operator: "Equal"
+      value: "true"
+      effect: "NoSchedule"