infra/stacks/nvidia/modules/nvidia/values.yaml
Viktor Barzin f5cf6ec051 nvidia: bump driver container memory limit 128Mi → 2Gi
After rolling back k8s-node1's kernel to 6.8.0-117 and spoofing
/etc/os-release to 24.04 so that the operator picked the matching
ubuntu24.04 driver image (both steps per the workaround documented in
docs/known-issues.md), the driver container still went into a restart
loop. Container status:

    lastState.terminated: { reason: "OOMKilled", exitCode: 137 }

The driver-installer was hitting the namespace LimitRange default
memory limit of 128Mi during
`apt-get install linux-headers-6.8.0-117-generic`: the last log line
on every restart was "Installing Linux kernel headers..." before the
SIGKILL. 2Gi gives apt + the DKMS compile step enough headroom; peak
usage observed during a successful compile in a test container was
~1.4Gi.
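
A minimal sketch of the kind of LimitRange that produces this behaviour,
assuming a namespace-wide container default. The object name, the
namespace, and the `defaultRequest` value are hypothetical; only the 128Mi
default limit comes from the incident described above.

```yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits        # hypothetical name
  namespace: nvidia           # assumption: namespace the GPU operator runs in
spec:
  limits:
    - type: Container
      default:
        memory: 128Mi         # limit applied to containers that set none
      defaultRequest:
        memory: 64Mi          # hypothetical default request
```

Any container in that namespace without its own `resources.limits.memory`
inherits the 128Mi limit, which is why the driver container needs an
explicit value in values.yaml.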

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:16:56 +00:00

driver:
  enabled: true
  # repository: nvcr.io/nvidia/driver
  # choose a driver version compatible with your GPU + CUDA 12.x (example)
  # NVIDIA GPU driver - https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/platform-support.html#known-issue
  # https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/
  #   CUDA 13.x: driver >= 580
  #   CUDA 12.x: driver >= 525, <580
  #   CUDA 11.x: driver >= 450, <525
  #
  # Delete the cluster policy before each change
  # version: "575.57.08" # CUDA 12.9
  #
  # 2026-05-17: tried bumping to 580.x with kernelModuleType=open but
  # NVIDIA has NOT published any nvcr.io/nvidia/driver:*-ubuntu26.04
  # images yet (skopeo list-tags shows 0 ubuntu26.04 tags vs 779 for
  # ubuntu22.04 and 206 for ubuntu24.04). Chart v26.3.1's operator
  # auto-detects the host OS (k8s-node1 was upgraded to Ubuntu 26.04
  # with kernel 7.0.0-15-generic) and picks `<version>-ubuntu26.04`,
  # which then 404s on pull. Rolled back to chart v25.10.1 + this
  # 570.195.03 pin, which uses the ubuntu24.04 image suffix. That
  # image still can't compile against kernel 7.0.0 (apt sources are
  # 24.04 noble, which doesn't ship linux-headers-7.0.0-15-generic),
  # so the host kernel needs to be rolled back to 6.8.0-117-generic
  # (still installed in /boot) before the driver can come up.
  # See post-mortem 2026-05-17-gpu-driver-ubuntu2604-mismatch.md.
  version: "570.195.03" # CUDA 12.8; pinned until NVIDIA ships ubuntu26.04 images
  upgradePolicy:
    autoUpgrade: false
  # 2026-05-17: bumped the memory limit from the namespace LimitRange
  # default of 128Mi. The driver-installer's
  # `apt-get install linux-headers-<kernel>` step exceeded 128Mi and was
  # OOMKilled (exit 137) before producing any visible output beyond
  # "Installing Linux kernel headers...". The 2Gi limit gives the apt +
  # module-compile phase enough headroom (peak observed ~1.4Gi while
  # DKMS builds the kernel module).
  resources:
    requests:
      cpu: "50m"
      memory: "256Mi"
    limits:
      memory: "2Gi"
devicePlugin:
  config:
    name: time-slicing-config
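# Sketch only, not part of these values: the ConfigMap referenced above is
# expected to hold a device-plugin sharing config. Assumptions: the
# namespace, the `any` key, and replicas: 4 are illustrative and are not
# taken from this repo.
#
#   apiVersion: v1
#   kind: ConfigMap
#   metadata:
#     name: time-slicing-config
#     namespace: nvidia          # assumption
#   data:
#     any: |-
#       version: v1
#       sharing:
#         timeSlicing:
#           resources:
#             - name: nvidia.com/gpu
#               replicas: 4      # hypothetical: advertise each GPU 4 times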
# DCGM Exporter - reduced to 768Mi (actual usage ~489Mi, 1.5x margin)
dcgmExporter:
  resources:
    requests:
      memory: "768Mi"
    limits:
      memory: "768Mi"
# CUDA Validator - reduced from 1024Mi to 256Mi (one-shot job)
validator:
  resources:
    requests:
      memory: "256Mi"
    limits:
      memory: "256Mi"
# Tolerate GPU node taint for all GPU operator components
daemonsets:
  tolerations:
    - key: "nvidia.com/gpu"
      operator: "Equal"
      value: "true"
      effect: "NoSchedule"
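# Sketch only, not part of these values: the toleration above matches a
# node taint of the following shape (assumption: the GPU node is tainted
# with exactly this key, value, and effect).
#
#   spec:
#     taints:
#       - key: "nvidia.com/gpu"
#         value: "true"
#         effect: "NoSchedule"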