gpu: schedule off NFD label, not k8s-node1 hostname

Remove every hardcoded reference to k8s-node1 that pinned GPU
scheduling to a specific host:

- GPU workload nodeSelectors: gpu=true -> nvidia.com/gpu.present=true
  (frigate, immich, whisper, piper, ytdlp, ebook2audiobook, audiblez,
  audiblez-web, nvidia-exporter, gpu-pod-exporter). The NFD label is
  auto-applied by gpu-feature-discovery on any node carrying an
  NVIDIA PCI device, so the selector follows the card.

- null_resource.gpu_node_config: rewrite to enumerate NFD-labeled
  nodes (feature.node.kubernetes.io/pci-10de.present=true) and taint
  each with nvidia.com/gpu=true:PreferNoSchedule. Drop the manual
  'kubectl label gpu=true' since NFD handles labeling.

- MySQL anti-affinity: kubernetes.io/hostname NotIn [k8s-node1] ->
  nvidia.com/gpu.present NotIn [true]. Same intent (keep MySQL off
  the GPU node) but portable when the card relocates.

Net effect: moving the GPU card between nodes no longer requires any
Terraform edit. Verified no-op for current scheduling — both old and
new labels resolve to node1 today.

Docs updated to match: AGENTS.md, compute.md, overview.md,
proxmox-inventory.md, k8s-portal agent-guidance string.
This commit is contained in:
Viktor Barzin 2026-04-22 13:43:07 +00:00
parent 134d6b9a82
commit e2146e6916
12 changed files with 52 additions and 36 deletions

View file

@ -122,8 +122,9 @@ Channel 3: A4 [32G] ──── A8 [32G] ──── A12[ 8G ] = 72 GB
| `offsite-sync-backup.timer` | Timer | Daily 06:00 | Two-step rsync to Synology (sda + NFS via inotify) |
| `nfs-change-tracker.service` | Service | Continuous | inotifywait on `/srv/nfs` + `/srv/nfs-ssd`, logs to `/mnt/backup/.nfs-changes.log` |
## GPU Node (k8s-node1)
- **VMID**: 201, **PCIe**: `0000:06:00.0` (NVIDIA Tesla T4)
- **Taint**: `nvidia.com/gpu=true:NoSchedule`, **Label**: `gpu=true`
- GPU workloads need: `node_selector = { "gpu": "true" }` + nvidia toleration
- Taint applied via `null_resource.gpu_node_taint` in `modules/kubernetes/nvidia/main.tf`
## GPU Node (currently k8s-node1)
- **VMID**: 201, **PCIe**: `0000:06:00.0` (NVIDIA Tesla T4) — physical passthrough, no Terraform pin
- **Taint**: `nvidia.com/gpu=true:PreferNoSchedule` (applied dynamically to every NFD-discovered GPU node)
- **Label**: `nvidia.com/gpu.present=true` (auto-applied by gpu-feature-discovery; also `feature.node.kubernetes.io/pci-10de.present=true` from NFD)
- GPU workloads need: `node_selector = { "nvidia.com/gpu.present" : "true" }` + nvidia toleration
- Taint applied via `null_resource.gpu_node_config` in `stacks/nvidia/modules/nvidia/main.tf`; node discovery keyed on the NFD `pci-10de.present` label so the taint follows the card to whichever host is carrying it

View file

@ -163,10 +163,10 @@ lifecycle {
## Infrastructure
- **Proxmox**: 192.168.1.127 (Dell R730, 22c/44t, 142GB RAM)
- **Nodes**: k8s-master (10.0.20.100), node1 (GPU, Tesla T4), node2-4
- **GPU**: `node_selector = { "gpu": "true" }` + toleration `nvidia.com/gpu`
- **GPU**: `node_selector = { "nvidia.com/gpu.present" : "true" }` + toleration `nvidia.com/gpu`. The label is auto-applied by NFD/gpu-feature-discovery on any node with an NVIDIA PCI device — nothing is hostname-pinned, so the GPU card can move between nodes without Terraform edits.
- **Pull-through cache**: 10.0.20.10 — docker.io (:5000), ghcr.io (:5010) only. Caches stale manifests for :latest tags — use versioned tags or pre-pull with `ctr --hosts-dir ''` to bypass.
- **pfSense**: 10.0.20.1 (gateway, firewall, DNS forwarding)
- **MySQL InnoDB Cluster**: 1 instance on proxmox-lvm (scaled from 3 — only Uptime Kuma + phpIPAM remain), PriorityClass `mysql-critical` + PDB, anti-affinity excludes k8s-node1 (GPU node)
- **MySQL InnoDB Cluster**: 1 instance on proxmox-lvm (scaled from 3 — only Uptime Kuma + phpIPAM remain), PriorityClass `mysql-critical` + PDB, anti-affinity excludes any GPU node (`nvidia.com/gpu.present=true`) so MySQL moves off the GPU host automatically if the card is relocated
- **SMTP**: `var.mail_host` port 587 STARTTLS (not internal svc address — cert mismatch)
## Contributor Onboarding

View file

@ -18,7 +18,7 @@ graph TB
subgraph Proxmox["Proxmox VE"]
direction TB
MASTER["VM 200: k8s-master<br/>8c / 32GB<br/>10.0.20.100"]
NODE1["VM 201: k8s-node1<br/>16c / 32GB<br/>GPU Passthrough<br/>nvidia.com/gpu=true:NoSchedule"]
NODE1["VM 201: k8s-node1<br/>16c / 32GB<br/>GPU Passthrough<br/>nvidia.com/gpu=true:PreferNoSchedule"]
NODE2["VM 202: k8s-node2<br/>8c / 32GB"]
NODE3["VM 203: k8s-node3<br/>8c / 32GB"]
NODE4["VM 204: k8s-node4<br/>8c / 32GB"]
@ -72,7 +72,7 @@ graph TB
| VM | VMID | vCPUs | RAM | Network | Role | Taints |
|----|------|-------|-----|---------|------|--------|
| k8s-master | 200 | 8 | 32GB | vmbr1:vlan20 (10.0.20.100) | Control Plane | `node-role.kubernetes.io/control-plane:NoSchedule` |
| k8s-node1 | 201 | 16 | 32GB | vmbr1:vlan20 | GPU Worker | `nvidia.com/gpu=true:NoSchedule` |
| k8s-node1 | 201 | 16 | 32GB | vmbr1:vlan20 | GPU Worker | `nvidia.com/gpu=true:PreferNoSchedule` (applied dynamically to whichever node carries the GPU) |
| k8s-node2 | 202 | 8 | 32GB | vmbr1:vlan20 | Worker | None |
| k8s-node3 | 203 | 8 | 32GB | vmbr1:vlan20 | Worker | None |
| k8s-node4 | 204 | 8 | 32GB | vmbr1:vlan20 | Worker | None |
@ -85,9 +85,9 @@ graph TB
|-----------|-------|
| Device | NVIDIA Tesla T4 (16GB GDDR6) |
| PCIe Address | 0000:06:00.0 |
| Assigned VM | VMID 201 (k8s-node1) |
| Node Label | `gpu=true` |
| Node Taint | `nvidia.com/gpu=true:NoSchedule` |
| Assigned VM | VMID 201 (k8s-node1) — physical location only, no Terraform pin |
| Node Label | `nvidia.com/gpu.present=true` (auto-applied by gpu-feature-discovery; also `feature.node.kubernetes.io/pci-10de.present=true` from NFD) |
| Node Taint | `nvidia.com/gpu=true:PreferNoSchedule` (applied by `null_resource.gpu_node_config` to every NFD-tagged GPU node) |
| Driver | NVIDIA GPU Operator |
| Resource Name | `nvidia.com/gpu` |
@ -273,8 +273,8 @@ resources {
### GPU Resource Management
**Node Selection**: GPU pods must:
1. Tolerate `nvidia.com/gpu=true:NoSchedule` taint
2. Select `gpu=true` label
1. Tolerate `nvidia.com/gpu=true:PreferNoSchedule` taint
2. Select `nvidia.com/gpu.present=true` label (auto-applied by gpu-feature-discovery wherever the card is)
3. Request `nvidia.com/gpu: 1` resource
**Example**:
@ -286,7 +286,7 @@ spec:
value: "true"
effect: NoSchedule
nodeSelector:
gpu: "true"
nvidia.com/gpu.present: "true"
containers:
- name: app
resources:
@ -294,6 +294,14 @@ spec:
nvidia.com/gpu: 1
```
**Portability**: No Terraform code references a specific hostname for
GPU scheduling. If the GPU card is physically moved to a different
node, gpu-feature-discovery moves the `nvidia.com/gpu.present=true`
label with it, and `null_resource.gpu_node_config` re-applies the
`nvidia.com/gpu=true:PreferNoSchedule` taint to the new host on the
next apply (discovery keyed on
`feature.node.kubernetes.io/pci-10de.present=true`).
**GPU Workloads**:
- Ollama (LLM inference)
- ComfyUI (Stable Diffusion workflows)
@ -529,7 +537,7 @@ kubectl describe pod <pod-name> -n <namespace>
```
0/5 nodes are available: 5 Insufficient nvidia.com/gpu.
```
**Fix**: Verify GPU node (201) is Ready and labeled `gpu=true`.
**Fix**: Verify the GPU-carrying node is Ready and has the `nvidia.com/gpu.present=true` label. Check `kubectl get nodes -l nvidia.com/gpu.present=true` — if empty, gpu-feature-discovery hasn't labeled any node (operator not running, driver not loaded, or PCI passthrough broken).
### Pods OOMKilled repeatedly
@ -614,7 +622,7 @@ spec:
value: "true"
effect: NoSchedule
nodeSelector:
gpu: "true"
nvidia.com/gpu.present: "true"
containers:
- name: app
resources:

View file

@ -139,7 +139,7 @@ The Kubernetes cluster consists of 5 nodes:
- **k8s-node1 (201)**: 16c/32GB GPU node with Tesla T4 passthrough, tainted for GPU workloads only
- **k8s-node2-4 (202-204)**: 8c/32GB workers running general-purpose workloads
GPU passthrough on node1 uses PCIe device 0000:06:00.0, with Kubernetes taint `nvidia.com/gpu=true:NoSchedule` and label `gpu=true` to ensure only GPU-requesting pods schedule there.
GPU passthrough on node1 uses PCIe device 0000:06:00.0. The NVIDIA GPU Operator's gpu-feature-discovery auto-labels whichever node carries the card with `nvidia.com/gpu.present=true`; `null_resource.gpu_node_config` taints the same set of nodes with `nvidia.com/gpu=true:PreferNoSchedule`. No hostname is hardcoded — moving the card to a different node requires no Terraform edits.
### Service Organization

View file

@ -157,9 +157,9 @@ resource "kubernetes_stateful_set_v1" "mysql_standalone" {
required_during_scheduling_ignored_during_execution {
node_selector_term {
match_expressions {
key = "kubernetes.io/hostname"
key = "nvidia.com/gpu.present"
operator = "NotIn"
values = ["k8s-node1"]
values = ["true"]
}
}
}

View file

@ -72,7 +72,7 @@ resource "kubernetes_deployment" "ebook2audiobook" {
spec {
node_selector = {
"gpu" : "true"
"nvidia.com/gpu.present" : "true"
}
toleration {
key = "nvidia.com/gpu"
@ -290,7 +290,7 @@ resource "kubernetes_deployment" "audiblez" {
}
spec {
node_selector = {
"gpu" : "true"
"nvidia.com/gpu.present" : "true"
}
toleration {
key = "nvidia.com/gpu"
@ -356,7 +356,7 @@ resource "kubernetes_deployment" "audiblez-web" {
}
spec {
node_selector = {
"gpu" : "true"
"nvidia.com/gpu.present" : "true"
}
toleration {
key = "nvidia.com/gpu"

View file

@ -87,7 +87,7 @@ resource "kubernetes_deployment" "frigate" {
}
spec {
node_selector = {
"gpu" : true
"nvidia.com/gpu.present" : "true"
}
toleration {
key = "nvidia.com/gpu"

View file

@ -559,7 +559,7 @@ resource "kubernetes_deployment" "immich-machine-learning" {
spec {
priority_class_name = "gpu-workload"
node_selector = {
"gpu" : "true"
"nvidia.com/gpu.present" : "true"
}
toleration {
key = "nvidia.com/gpu"

View file

@ -138,7 +138,7 @@ Kyverno auto-generates LimitRange + ResourceQuota per namespace based on tier la
- **Proxmox**: 192.168.1.127 (Dell R730, 22c/44t, 142GB RAM)
- **Nodes**: k8s-master (10.0.20.100), node1 (GPU, Tesla T4), node2-4
- **GPU workloads**: \`node_selector = { "gpu": "true" }\` + toleration \`nvidia.com/gpu\`
- **GPU workloads**: \`node_selector = { "nvidia.com/gpu.present" : "true" }\` + toleration \`nvidia.com/gpu\` (label auto-applied by gpu-feature-discovery, no hostname pins)
- **Pull-through cache**: 10.0.20.10 use versioned image tags (cache serves stale :latest manifests)
- **MySQL InnoDB Cluster**: 3 instances on iSCSI
- **SMTP**: \`var.mail_host\` port 587 STARTTLS

View file

@ -63,18 +63,25 @@ resource "kubernetes_resource_quota" "nvidia_quota" {
}
}
# Apply GPU taint and label to ensure only GPU workloads run on GPU node
# Apply GPU taint dynamically based on NFD-discovered GPU nodes. The
# NFD label `feature.node.kubernetes.io/pci-10de.present=true` is
# auto-applied on any node with an NVIDIA PCI device (vendor 0x10de),
# so the taint follows the card if it moves between nodes. Workload
# nodeSelectors key off `nvidia.com/gpu.present=true` (applied by
# gpu-feature-discovery once the operator is up).
resource "null_resource" "gpu_node_config" {
provisioner "local-exec" {
command = <<-EOT
kubectl taint nodes k8s-node1 nvidia.com/gpu=true:PreferNoSchedule --overwrite
kubectl label nodes k8s-node1 gpu=true --overwrite
set -euo pipefail
for node in $(kubectl get nodes -l feature.node.kubernetes.io/pci-10de.present=true -o jsonpath='{.items[*].metadata.name}'); do
kubectl taint nodes "$node" nvidia.com/gpu=true:PreferNoSchedule --overwrite
done
EOT
}
# Re-run if namespace changes (proxy for cluster changes)
triggers = {
namespace = kubernetes_namespace.nvidia.metadata[0].name
namespace = kubernetes_namespace.nvidia.metadata[0].name
command_hash = "dynamic-taint-v1"
}
}
@ -141,7 +148,7 @@ resource "kubernetes_deployment" "nvidia-exporter" {
}
spec {
node_selector = {
"gpu" : "true"
"nvidia.com/gpu.present" : "true"
}
toleration {
key = "nvidia.com/gpu"
@ -604,7 +611,7 @@ resource "kubernetes_daemonset" "gpu_pod_exporter" {
service_account_name = kubernetes_service_account.gpu_pod_exporter.metadata[0].name
node_selector = {
"gpu" : "true"
"nvidia.com/gpu.present" : "true"
}
toleration {

View file

@ -73,7 +73,7 @@ resource "kubernetes_deployment" "whisper" {
}
spec {
node_selector = {
"gpu" : "true"
"nvidia.com/gpu.present" : "true"
}
toleration {
key = "nvidia.com/gpu"
@ -195,7 +195,7 @@ resource "kubernetes_deployment" "piper" {
}
spec {
node_selector = {
"gpu" : "true"
"nvidia.com/gpu.present" : "true"
}
toleration {
key = "nvidia.com/gpu"

View file

@ -227,7 +227,7 @@ resource "kubernetes_deployment" "yt_highlights" {
}
spec {
node_selector = {
"gpu" : "true"
"nvidia.com/gpu.present" : "true"
}
toleration {
key = "nvidia.com/gpu"