From e2146e69168c38fad87c10b4558a7d3cef965d2a Mon Sep 17 00:00:00 2001
From: Viktor Barzin
Date: Wed, 22 Apr 2026 13:43:07 +0000
Subject: [PATCH] gpu: schedule off NFD label, not k8s-node1 hostname
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Remove every hardcoded reference to k8s-node1 that pinned GPU
scheduling to a specific host:

- GPU workload nodeSelectors: gpu=true -> nvidia.com/gpu.present=true
  (frigate, immich, whisper, piper, ytdlp, ebook2audiobook, audiblez,
  audiblez-web, nvidia-exporter, gpu-pod-exporter). The NFD label is
  auto-applied by gpu-feature-discovery on any node carrying an NVIDIA
  PCI device, so the selector follows the card.

- null_resource.gpu_node_config: rewrite to enumerate NFD-labeled
  nodes (feature.node.kubernetes.io/pci-10de.present=true) and taint
  each with nvidia.com/gpu=true:PreferNoSchedule. Drop the manual
  'kubectl label gpu=true' since NFD handles labeling.

- MySQL anti-affinity: kubernetes.io/hostname NotIn [k8s-node1] ->
  nvidia.com/gpu.present NotIn [true]. Same intent (keep MySQL off
  the GPU node) but portable when the card relocates.

Net effect: moving the GPU card between nodes no longer requires any
Terraform edit. Verified no-op for current scheduling — both old and
new labels resolve to node1 today.

Docs updated to match: AGENTS.md, compute.md, overview.md,
proxmox-inventory.md, k8s-portal agent-guidance string.
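One way to re-check the no-op claim (assumes kubectl is pointed at
this cluster, and that the old gpu=true label lingers on node1 until
removed by hand — this patch only stops re-applying it):

    # old and new selectors should resolve to the same node today
    kubectl get nodes -l gpu=true -o name
    kubectl get nodes -l nvidia.com/gpu.present=true -o name

    # the dynamic taint should sit on every NFD-discovered GPU node
    kubectl get nodes -l feature.node.kubernetes.io/pci-10de.present=true \
      -o custom-columns=NODE:.metadata.name,TAINTS:.spec.taints[*].key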
---
 .claude/reference/proxmox-inventory.md         | 11 ++++----
 AGENTS.md                                      |  4 +--
 docs/architecture/compute.md                   | 28 ++++++++++++-------
 docs/architecture/overview.md                  |  2 +-
 stacks/dbaas/modules/dbaas/main.tf             |  4 +--
 stacks/ebook2audiobook/main.tf                 |  6 ++--
 stacks/frigate/main.tf                         |  2 +-
 stacks/immich/main.tf                          |  2 +-
 .../files/src/routes/agent/+server.ts          |  2 +-
 stacks/nvidia/modules/nvidia/main.tf           | 21 +++++++++-----
 stacks/whisper/main.tf                         |  4 +--
 stacks/ytdlp/main.tf                           |  2 +-
 12 files changed, 52 insertions(+), 36 deletions(-)

diff --git a/.claude/reference/proxmox-inventory.md b/.claude/reference/proxmox-inventory.md
index 1d1ab9bb..60dfab0b 100644
--- a/.claude/reference/proxmox-inventory.md
+++ b/.claude/reference/proxmox-inventory.md
@@ -122,8 +122,9 @@ Channel 3: A4 [32G] ──── A8 [32G] ──── A12[ 8G ] = 72 GB
 | `offsite-sync-backup.timer` | Timer | Daily 06:00 | Two-step rsync to Synology (sda + NFS via inotify) |
 | `nfs-change-tracker.service` | Service | Continuous | inotifywait on `/srv/nfs` + `/srv/nfs-ssd`, logs to `/mnt/backup/.nfs-changes.log` |
 
-## GPU Node (k8s-node1)
-- **VMID**: 201, **PCIe**: `0000:06:00.0` (NVIDIA Tesla T4)
-- **Taint**: `nvidia.com/gpu=true:NoSchedule`, **Label**: `gpu=true`
-- GPU workloads need: `node_selector = { "gpu": "true" }` + nvidia toleration
-Taint applied via `null_resource.gpu_node_taint` in `modules/kubernetes/nvidia/main.tf`
+## GPU Node (currently k8s-node1)
+- **VMID**: 201, **PCIe**: `0000:06:00.0` (NVIDIA Tesla T4) — physical passthrough, no Terraform pin
+- **Taint**: `nvidia.com/gpu=true:PreferNoSchedule` (applied dynamically to every NFD-discovered GPU node)
+- **Label**: `nvidia.com/gpu.present=true` (auto-applied by gpu-feature-discovery; also `feature.node.kubernetes.io/pci-10de.present=true` from NFD)
+- GPU workloads need: `node_selector = { "nvidia.com/gpu.present" : "true" }` + nvidia toleration
+- Taint applied via `null_resource.gpu_node_config` in `stacks/nvidia/modules/nvidia/main.tf`; node discovery keyed on the NFD `pci-10de.present` label so the taint follows the card to whichever host is carrying it
diff --git a/AGENTS.md b/AGENTS.md
index 0f1794f1..5f9c0839 100644
--- a/AGENTS.md
+++ b/AGENTS.md
@@ -163,10 +163,10 @@ lifecycle {
 ## Infrastructure
 - **Proxmox**: 192.168.1.127 (Dell R730, 22c/44t, 142GB RAM)
 - **Nodes**: k8s-master (10.0.20.100), node1 (GPU, Tesla T4), node2-4
-- **GPU**: `node_selector = { "gpu": "true" }` + toleration `nvidia.com/gpu`
+- **GPU**: `node_selector = { "nvidia.com/gpu.present" : "true" }` + toleration `nvidia.com/gpu`. The label is auto-applied by NFD/gpu-feature-discovery on any node with an NVIDIA PCI device — nothing is hostname-pinned, so the GPU card can move between nodes without Terraform edits.
 - **Pull-through cache**: 10.0.20.10 — docker.io (:5000), ghcr.io (:5010) only. Caches stale manifests for :latest tags — use versioned tags or pre-pull with `ctr --hosts-dir ''` to bypass.
 - **pfSense**: 10.0.20.1 (gateway, firewall, DNS forwarding)
-- **MySQL InnoDB Cluster**: 1 instance on proxmox-lvm (scaled from 3 — only Uptime Kuma + phpIPAM remain), PriorityClass `mysql-critical` + PDB, anti-affinity excludes k8s-node1 (GPU node)
+- **MySQL InnoDB Cluster**: 1 instance on proxmox-lvm (scaled from 3 — only Uptime Kuma + phpIPAM remain), PriorityClass `mysql-critical` + PDB, anti-affinity excludes any GPU node (`nvidia.com/gpu.present=true`) so MySQL moves off the GPU host automatically if the card is relocated
 - **SMTP**: `var.mail_host` port 587 STARTTLS (not internal svc address — cert mismatch)
 
 ## Contributor Onboarding
diff --git a/docs/architecture/compute.md b/docs/architecture/compute.md
index bf456030..cc9c4786 100644
--- a/docs/architecture/compute.md
+++ b/docs/architecture/compute.md
@@ -18,7 +18,7 @@ graph TB
     subgraph Proxmox["Proxmox VE"]
         direction TB
         MASTER["VM 200: k8s-master<br/>8c / 32GB<br/>10.0.20.100"]
-        NODE1["VM 201: k8s-node1<br/>16c / 32GB<br/>GPU Passthrough<br/>nvidia.com/gpu=true:NoSchedule"]
+        NODE1["VM 201: k8s-node1<br/>16c / 32GB<br/>GPU Passthrough<br/>nvidia.com/gpu=true:PreferNoSchedule"]
         NODE2["VM 202: k8s-node2<br/>8c / 32GB"]
         NODE3["VM 203: k8s-node3<br/>8c / 32GB"]
         NODE4["VM 204: k8s-node4<br/>8c / 32GB"]
@@ -72,7 +72,7 @@ graph TB
 | VM | VMID | vCPUs | RAM | Network | Role | Taints |
 |----|------|-------|-----|---------|------|--------|
 | k8s-master | 200 | 8 | 32GB | vmbr1:vlan20 (10.0.20.100) | Control Plane | `node-role.kubernetes.io/control-plane:NoSchedule` |
-| k8s-node1 | 201 | 16 | 32GB | vmbr1:vlan20 | GPU Worker | `nvidia.com/gpu=true:NoSchedule` |
+| k8s-node1 | 201 | 16 | 32GB | vmbr1:vlan20 | GPU Worker | `nvidia.com/gpu=true:PreferNoSchedule` (applied dynamically to whichever node carries the GPU) |
 | k8s-node2 | 202 | 8 | 32GB | vmbr1:vlan20 | Worker | None |
 | k8s-node3 | 203 | 8 | 32GB | vmbr1:vlan20 | Worker | None |
 | k8s-node4 | 204 | 8 | 32GB | vmbr1:vlan20 | Worker | None |
@@ -85,9 +85,9 @@ graph TB
 |-----------|-------|
 | Device | NVIDIA Tesla T4 (16GB GDDR6) |
 | PCIe Address | 0000:06:00.0 |
-| Assigned VM | VMID 201 (k8s-node1) |
-| Node Label | `gpu=true` |
-| Node Taint | `nvidia.com/gpu=true:NoSchedule` |
+| Assigned VM | VMID 201 (k8s-node1) — physical location only, no Terraform pin |
+| Node Label | `nvidia.com/gpu.present=true` (auto-applied by gpu-feature-discovery; also `feature.node.kubernetes.io/pci-10de.present=true` from NFD) |
+| Node Taint | `nvidia.com/gpu=true:PreferNoSchedule` (applied by `null_resource.gpu_node_config` to every NFD-tagged GPU node) |
 | Driver | NVIDIA GPU Operator |
 | Resource Name | `nvidia.com/gpu` |
 
@@ -273,8 +273,8 @@ resources {
 ### GPU Resource Management
 
 **Node Selection**: GPU pods must:
-1. Tolerate `nvidia.com/gpu=true:NoSchedule` taint
-2. Select `gpu=true` label
+1. Tolerate `nvidia.com/gpu=true:PreferNoSchedule` taint
+2. Select `nvidia.com/gpu.present=true` label (auto-applied by gpu-feature-discovery wherever the card is)
 3. Request `nvidia.com/gpu: 1` resource
 
 **Example**:
@@ -286,7 +286,7 @@ spec:
     value: "true"
     effect: NoSchedule
   nodeSelector:
-    gpu: "true"
+    nvidia.com/gpu.present: "true"
   containers:
   - name: app
     resources:
@@ -294,6 +294,14 @@ spec:
       nvidia.com/gpu: 1
 ```
 
+**Portability**: No Terraform code references a specific hostname for
+GPU scheduling. If the GPU card is physically moved to a different
+node, gpu-feature-discovery moves the `nvidia.com/gpu.present=true`
+label with it, and `null_resource.gpu_node_config` re-applies the
+`nvidia.com/gpu=true:PreferNoSchedule` taint to the new host on the
+next apply (discovery keyed on
+`feature.node.kubernetes.io/pci-10de.present=true`).
+
 **GPU Workloads**:
 - Ollama (LLM inference)
 - ComfyUI (Stable Diffusion workflows)
@@ -529,7 +537,7 @@ kubectl describe pod -n
 ```
 0/5 nodes are available: 5 Insufficient nvidia.com/gpu.
 ```
-  **Fix**: Verify GPU node (201) is Ready and labeled `gpu=true`.
+  **Fix**: Verify the GPU-carrying node is Ready and has the `nvidia.com/gpu.present=true` label. Check `kubectl get nodes -l nvidia.com/gpu.present=true` — if empty, gpu-feature-discovery hasn't labeled any node (operator not running, driver not loaded, or PCI passthrough broken).
 
 ### Pods OOMKilled repeatedly
 
@@ -614,7 +622,7 @@ spec:
     value: "true"
     effect: NoSchedule
   nodeSelector:
-    gpu: "true"
+    nvidia.com/gpu.present: "true"
   containers:
   - name: app
     resources:
diff --git a/docs/architecture/overview.md b/docs/architecture/overview.md
index 9e0fe7be..cb0f8e6d 100644
--- a/docs/architecture/overview.md
+++ b/docs/architecture/overview.md
@@ -139,7 +139,7 @@ The Kubernetes cluster consists of 5 nodes:
 - **k8s-node1 (201)**: 16c/32GB GPU node with Tesla T4 passthrough, tainted for GPU workloads only
 - **k8s-node2-4 (202-204)**: 8c/32GB workers running general-purpose workloads
 
-GPU passthrough on node1 uses PCIe device 0000:06:00.0, with Kubernetes taint `nvidia.com/gpu=true:NoSchedule` and label `gpu=true` to ensure only GPU-requesting pods schedule there.
+GPU passthrough on node1 uses PCIe device 0000:06:00.0. The NVIDIA GPU Operator's gpu-feature-discovery auto-labels whichever node carries the card with `nvidia.com/gpu.present=true`; `null_resource.gpu_node_config` taints the same set of nodes with `nvidia.com/gpu=true:PreferNoSchedule`. No hostname is hardcoded — moving the card to a different node requires no Terraform edits.
 
 ### Service Organization
 
diff --git a/stacks/dbaas/modules/dbaas/main.tf b/stacks/dbaas/modules/dbaas/main.tf
index 8389aa93..d3e33617 100644
--- a/stacks/dbaas/modules/dbaas/main.tf
+++ b/stacks/dbaas/modules/dbaas/main.tf
@@ -157,9 +157,9 @@ resource "kubernetes_stateful_set_v1" "mysql_standalone" {
       required_during_scheduling_ignored_during_execution {
         node_selector_term {
           match_expressions {
-            key = "kubernetes.io/hostname"
+            key = "nvidia.com/gpu.present"
             operator = "NotIn"
-            values = ["k8s-node1"]
+            values = ["true"]
           }
         }
       }
diff --git a/stacks/ebook2audiobook/main.tf b/stacks/ebook2audiobook/main.tf
index f9871882..8492991f 100644
--- a/stacks/ebook2audiobook/main.tf
+++ b/stacks/ebook2audiobook/main.tf
@@ -72,7 +72,7 @@ resource "kubernetes_deployment" "ebook2audiobook" {
 
      spec {
        node_selector = {
-          "gpu" : "true"
+          "nvidia.com/gpu.present" : "true"
        }
        toleration {
          key = "nvidia.com/gpu"
@@ -290,7 +290,7 @@ resource "kubernetes_deployment" "audiblez" {
      }
      spec {
        node_selector = {
-          "gpu" : "true"
+          "nvidia.com/gpu.present" : "true"
        }
        toleration {
          key = "nvidia.com/gpu"
@@ -356,7 +356,7 @@ resource "kubernetes_deployment" "audiblez-web" {
      }
      spec {
        node_selector = {
-          "gpu" : "true"
+          "nvidia.com/gpu.present" : "true"
        }
        toleration {
          key = "nvidia.com/gpu"
diff --git a/stacks/frigate/main.tf b/stacks/frigate/main.tf
index 31079be9..489daa63 100644
--- a/stacks/frigate/main.tf
+++ b/stacks/frigate/main.tf
@@ -87,7 +87,7 @@ resource "kubernetes_deployment" "frigate" {
      }
      spec {
        node_selector = {
-          "gpu" : true
+          "nvidia.com/gpu.present" : "true"
        }
        toleration {
          key = "nvidia.com/gpu"
diff --git a/stacks/immich/main.tf b/stacks/immich/main.tf
index b17e7d55..8cf5162a 100644
--- a/stacks/immich/main.tf
+++ b/stacks/immich/main.tf
@@ -559,7 +559,7 @@ resource "kubernetes_deployment" "immich-machine-learning" {
      spec {
        priority_class_name = "gpu-workload"
        node_selector = {
-          "gpu" : "true"
+          "nvidia.com/gpu.present" : "true"
        }
        toleration {
          key = "nvidia.com/gpu"
diff --git a/stacks/k8s-portal/modules/k8s-portal/files/src/routes/agent/+server.ts b/stacks/k8s-portal/modules/k8s-portal/files/src/routes/agent/+server.ts
index f96f4d56..21405a94 100644
--- a/stacks/k8s-portal/modules/k8s-portal/files/src/routes/agent/+server.ts
+++ b/stacks/k8s-portal/modules/k8s-portal/files/src/routes/agent/+server.ts
@@ -138,7 +138,7 @@ Kyverno auto-generates LimitRange + ResourceQuota per namespace based on tier la
 
 - **Proxmox**: 192.168.1.127 (Dell R730, 22c/44t, 142GB RAM)
 - **Nodes**: k8s-master (10.0.20.100), node1 (GPU, Tesla T4), node2-4
-- **GPU workloads**: \`node_selector = { "gpu": "true" }\` + toleration \`nvidia.com/gpu\`
+- **GPU workloads**: \`node_selector = { "nvidia.com/gpu.present" : "true" }\` + toleration \`nvidia.com/gpu\` (label auto-applied by gpu-feature-discovery, no hostname pins)
 - **Pull-through cache**: 10.0.20.10 — use versioned image tags (cache serves stale :latest manifests)
 - **MySQL InnoDB Cluster**: 3 instances on iSCSI
 - **SMTP**: \`var.mail_host\` port 587 STARTTLS
diff --git a/stacks/nvidia/modules/nvidia/main.tf b/stacks/nvidia/modules/nvidia/main.tf
index f11bd2c3..720f6daf 100644
--- a/stacks/nvidia/modules/nvidia/main.tf
+++ b/stacks/nvidia/modules/nvidia/main.tf
@@ -63,18 +63,25 @@ resource "kubernetes_resource_quota" "nvidia_quota" {
   }
 }
 
-# Apply GPU taint and label to ensure only GPU workloads run on GPU node
+# Apply GPU taint dynamically based on NFD-discovered GPU nodes. The
+# NFD label `feature.node.kubernetes.io/pci-10de.present=true` is
+# auto-applied on any node with an NVIDIA PCI device (vendor 0x10de),
+# so the taint follows the card if it moves between nodes. Workload
+# nodeSelectors key off `nvidia.com/gpu.present=true` (applied by
+# gpu-feature-discovery once the operator is up).
 resource "null_resource" "gpu_node_config" {
   provisioner "local-exec" {
     command = <<-EOT
-      kubectl taint nodes k8s-node1 nvidia.com/gpu=true:PreferNoSchedule --overwrite
-      kubectl label nodes k8s-node1 gpu=true --overwrite
+      set -eu
+      for node in $(kubectl get nodes -l feature.node.kubernetes.io/pci-10de.present=true -o jsonpath='{.items[*].metadata.name}'); do
+        kubectl taint nodes "$node" nvidia.com/gpu=true:PreferNoSchedule --overwrite
+      done
     EOT
   }
 
-  # Re-run if namespace changes (proxy for cluster changes)
   triggers = {
-    namespace = kubernetes_namespace.nvidia.metadata[0].name
+    namespace    = kubernetes_namespace.nvidia.metadata[0].name
+    command_hash = "dynamic-taint-v1"
   }
 }
 
@@ -141,7 +148,7 @@ resource "kubernetes_deployment" "nvidia-exporter" {
      }
      spec {
        node_selector = {
-          "gpu" : "true"
+          "nvidia.com/gpu.present" : "true"
        }
        toleration {
          key = "nvidia.com/gpu"
@@ -604,7 +611,7 @@ resource "kubernetes_daemonset" "gpu_pod_exporter" {
        service_account_name = kubernetes_service_account.gpu_pod_exporter.metadata[0].name
 
        node_selector = {
-          "gpu" : "true"
+          "nvidia.com/gpu.present" : "true"
        }
 
        toleration {
diff --git a/stacks/whisper/main.tf b/stacks/whisper/main.tf
index 2858d680..27096f86 100644
--- a/stacks/whisper/main.tf
+++ b/stacks/whisper/main.tf
@@ -73,7 +73,7 @@ resource "kubernetes_deployment" "whisper" {
      }
      spec {
        node_selector = {
-          "gpu" : "true"
+          "nvidia.com/gpu.present" : "true"
        }
        toleration {
          key = "nvidia.com/gpu"
@@ -195,7 +195,7 @@ resource "kubernetes_deployment" "piper" {
      }
      spec {
        node_selector = {
-          "gpu" : "true"
+          "nvidia.com/gpu.present" : "true"
        }
        toleration {
          key = "nvidia.com/gpu"
diff --git a/stacks/ytdlp/main.tf b/stacks/ytdlp/main.tf
index a73f434a..d8dcfcc6 100644
--- a/stacks/ytdlp/main.tf
+++ b/stacks/ytdlp/main.tf
@@ -227,7 +227,7 @@ resource "kubernetes_deployment" "yt_highlights" {
      }
      spec {
        node_selector = {
-          "gpu" : "true"
+          "nvidia.com/gpu.present" : "true"
        }
        toleration {
          key = "nvidia.com/gpu"