gpu: schedule off NFD label, not k8s-node1 hostname

Remove every hardcoded reference to k8s-node1 that pinned GPU
scheduling to a specific host:

- GPU workload nodeSelectors: gpu=true -> nvidia.com/gpu.present=true
  (frigate, immich, whisper, piper, ytdlp, ebook2audiobook, audiblez,
  audiblez-web, nvidia-exporter, gpu-pod-exporter). The NFD label is
  auto-applied by gpu-feature-discovery on any node carrying an
  NVIDIA PCI device, so the selector follows the card.

- null_resource.gpu_node_config: rewrite to enumerate NFD-labeled
  nodes (feature.node.kubernetes.io/pci-10de.present=true) and taint
  each with nvidia.com/gpu=true:PreferNoSchedule. Drop the manual
  'kubectl label gpu=true' since NFD handles labeling.

- MySQL anti-affinity: kubernetes.io/hostname NotIn [k8s-node1] ->
  nvidia.com/gpu.present NotIn [true]. Same intent (keep MySQL off
  the GPU node) but portable when the card relocates.
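
For illustration, a minimal sketch of how the selector and anti-affinity
changes above might look with the Terraform kubernetes provider. Where
these fragments sit, the toleration operator, and the use of a required
(rather than preferred) node affinity are assumptions, not taken from
the actual modules.

```hcl
# Fragment of a GPU workload's pod spec (spec.template.spec in a
# kubernetes_deployment). Placement is an assumption for illustration.
node_selector = {
  "nvidia.com/gpu.present" = "true"
}

toleration {
  key      = "nvidia.com/gpu"
  operator = "Exists" # assumed; "Equal" with value = "true" also matches the taint
  effect   = "PreferNoSchedule"
}

# Fragment of the MySQL pod spec: keep the pod off any node that
# carries the GPU label. Whether the real rule is required or
# preferred is an assumption here.
affinity {
  node_affinity {
    required_during_scheduling_ignored_during_execution {
      node_selector_term {
        match_expressions {
          key      = "nvidia.com/gpu.present"
          operator = "NotIn"
          values   = ["true"]
        }
      }
    }
  }
}
```

NotIn also matches nodes that lack the label entirely, so non-GPU nodes
stay eligible without needing a label of their own.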

Net effect: moving the GPU card between nodes no longer requires any
Terraform edit. Verified no-op for current scheduling — both old and
new labels resolve to node1 today.
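
A hedged sketch of the label-driven taint step, assuming
null_resource.gpu_node_config shells out to kubectl through a local-exec
provisioner. The real resource may loop over node names or define
triggers; this version uses kubectl's label selector to the same effect.

```hcl
resource "null_resource" "gpu_node_config" {
  # Assumption: kubectl is on PATH and already pointed at the cluster.
  provisioner "local-exec" {
    command = <<-EOT
      kubectl taint nodes \
        -l feature.node.kubernetes.io/pci-10de.present=true \
        nvidia.com/gpu=true:PreferNoSchedule --overwrite
    EOT
  }
}
```

With --overwrite the command is safe to re-run, and because it selects by
label rather than hostname, it follows the card to whichever node NFD
reports it on.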

Docs updated to match: AGENTS.md, compute.md, overview.md,
proxmox-inventory.md, k8s-portal agent-guidance string.
Viktor Barzin 2026-04-22 13:43:07 +00:00
parent 134d6b9a82
commit e2146e6916
12 changed files with 52 additions and 36 deletions

@@ -163,10 +163,10 @@ lifecycle {
## Infrastructure
- **Proxmox**: 192.168.1.127 (Dell R730, 22c/44t, 142GB RAM)
- **Nodes**: k8s-master (10.0.20.100), node1 (GPU, Tesla T4), node2-4
-- **GPU**: `node_selector = { "gpu": "true" }` + toleration `nvidia.com/gpu`
+- **GPU**: `node_selector = { "nvidia.com/gpu.present" : "true" }` + toleration `nvidia.com/gpu`. The label is auto-applied by NFD/gpu-feature-discovery on any node with an NVIDIA PCI device — nothing is hostname-pinned, so the GPU card can move between nodes without Terraform edits.
- **Pull-through cache**: 10.0.20.10 — docker.io (:5000), ghcr.io (:5010) only. Caches stale manifests for :latest tags — use versioned tags or pre-pull with `ctr --hosts-dir ''` to bypass.
- **pfSense**: 10.0.20.1 (gateway, firewall, DNS forwarding)
-- **MySQL InnoDB Cluster**: 1 instance on proxmox-lvm (scaled from 3 — only Uptime Kuma + phpIPAM remain), PriorityClass `mysql-critical` + PDB, anti-affinity excludes k8s-node1 (GPU node)
+- **MySQL InnoDB Cluster**: 1 instance on proxmox-lvm (scaled from 3 — only Uptime Kuma + phpIPAM remain), PriorityClass `mysql-critical` + PDB, anti-affinity excludes any GPU node (`nvidia.com/gpu.present=true`) so MySQL moves off the GPU host automatically if the card is relocated
- **SMTP**: `var.mail_host` port 587 STARTTLS (not internal svc address — cert mismatch)
## Contributor Onboarding