Add GPU node taint tolerations and enhance GPU memory exporter

Add an nvidia.com/gpu toleration to all GPU workloads (frigate, ollama) so they can be scheduled onto GPU nodes that carry a NoSchedule taint. Update the NVIDIA operator Helm values with daemonset tolerations. Enhance the GPU pod memory exporter with Kubernetes API integration to resolve container IDs to pod names and namespaces, and add the RBAC resources it needs for API access.
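
For illustration, the workload-side toleration would look roughly like the sketch below. It assumes the GPU nodes are tainted with nvidia.com/gpu=true:NoSchedule (the same key, value, and effect used in the operator values in this commit) and that the workload is a Deployment-style resource. The nesting under spec.template.spec is an assumption, not copied from the frigate or ollama manifests.

```yaml
# Hypothetical fragment of a GPU workload manifest (e.g. ollama).
# Assumes the node taint is nvidia.com/gpu=true:NoSchedule.
spec:
  template:
    spec:
      tolerations:
        - key: "nvidia.com/gpu"
          operator: "Equal"
          value: "true"
          effect: "NoSchedule"
```

A toleration only permits scheduling onto the tainted node. It does not steer the pod there, so a GPU resource request or node selector is still what places the workload on a GPU node.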
2026-02-06 20:19:26 +00:00
commit 9689b67895
parent d9a4417257
5 changed files with 188 additions and 12 deletions
--- a/modules/kubernetes/nvidia/values.yaml
+++ b/modules/kubernetes/nvidia/values.yaml
@@ -17,3 +17,11 @@ driver:
  devicePlugin:
    config:
      name: time-slicing-config
+
+# Tolerate GPU node taint for all GPU operator components
+daemonsets:
+  tolerations:
+    - key: "nvidia.com/gpu"
+      operator: "Equal"
+      value: "true"
+      effect: "NoSchedule"