gpu: schedule off NFD label, not k8s-node1 hostname
Remove every hardcoded reference to k8s-node1 that pinned GPU scheduling to a specific host:

- GPU workload nodeSelectors: gpu=true -> nvidia.com/gpu.present=true (frigate, immich, whisper, piper, ytdlp, ebook2audiobook, audiblez, audiblez-web, nvidia-exporter, gpu-pod-exporter). The NFD label is auto-applied by gpu-feature-discovery on any node carrying an NVIDIA PCI device, so the selector follows the card.
- null_resource.gpu_node_config: rewrite to enumerate NFD-labeled nodes (feature.node.kubernetes.io/pci-10de.present=true) and taint each with nvidia.com/gpu=true:PreferNoSchedule. Drop the manual 'kubectl label gpu=true' since NFD handles labeling.
- MySQL anti-affinity: kubernetes.io/hostname NotIn [k8s-node1] -> nvidia.com/gpu.present NotIn [true]. Same intent (keep MySQL off the GPU node) but portable when the card relocates.

Net effect: moving the GPU card between nodes no longer requires any Terraform edit. Verified no-op for current scheduling — both old and new labels resolve to node1 today.

Docs updated to match: AGENTS.md, compute.md, overview.md, proxmox-inventory.md, k8s-portal agent-guidance string.
This commit is contained in:
parent 134d6b9a82
commit e2146e6916
12 changed files with 52 additions and 36 deletions
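The MySQL anti-affinity swap summarized in the message can be sketched as a standard nodeAffinity term (the surrounding field layout is the stock Kubernetes API shape, not taken from this repo; only the keys, operator, and values come from the commit message):

```yaml
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        # Before: pinned by hostname
        # - key: kubernetes.io/hostname
        #   operator: NotIn
        #   values: ["k8s-node1"]
        # After: portable; avoid whichever node carries the GPU
        - key: nvidia.com/gpu.present
          operator: NotIn
          values: ["true"]
```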
@@ -18,7 +18,7 @@ graph TB
     subgraph Proxmox["Proxmox VE"]
         direction TB
         MASTER["VM 200: k8s-master<br/>8c / 32GB<br/>10.0.20.100"]
-        NODE1["VM 201: k8s-node1<br/>16c / 32GB<br/>GPU Passthrough<br/>nvidia.com/gpu=true:NoSchedule"]
+        NODE1["VM 201: k8s-node1<br/>16c / 32GB<br/>GPU Passthrough<br/>nvidia.com/gpu=true:PreferNoSchedule"]
         NODE2["VM 202: k8s-node2<br/>8c / 32GB"]
         NODE3["VM 203: k8s-node3<br/>8c / 32GB"]
         NODE4["VM 204: k8s-node4<br/>8c / 32GB"]
@@ -72,7 +72,7 @@ graph TB
 | VM | VMID | vCPUs | RAM | Network | Role | Taints |
 |----|------|-------|-----|---------|------|--------|
 | k8s-master | 200 | 8 | 32GB | vmbr1:vlan20 (10.0.20.100) | Control Plane | `node-role.kubernetes.io/control-plane:NoSchedule` |
-| k8s-node1 | 201 | 16 | 32GB | vmbr1:vlan20 | GPU Worker | `nvidia.com/gpu=true:NoSchedule` |
+| k8s-node1 | 201 | 16 | 32GB | vmbr1:vlan20 | GPU Worker | `nvidia.com/gpu=true:PreferNoSchedule` (applied dynamically to whichever node carries the GPU) |
 | k8s-node2 | 202 | 8 | 32GB | vmbr1:vlan20 | Worker | None |
 | k8s-node3 | 203 | 8 | 32GB | vmbr1:vlan20 | Worker | None |
 | k8s-node4 | 204 | 8 | 32GB | vmbr1:vlan20 | Worker | None |
@@ -85,9 +85,9 @@ graph TB
 |-----------|-------|
 | Device | NVIDIA Tesla T4 (16GB GDDR6) |
 | PCIe Address | 0000:06:00.0 |
-| Assigned VM | VMID 201 (k8s-node1) |
-| Node Label | `gpu=true` |
-| Node Taint | `nvidia.com/gpu=true:NoSchedule` |
+| Assigned VM | VMID 201 (k8s-node1) — physical location only, no Terraform pin |
+| Node Label | `nvidia.com/gpu.present=true` (auto-applied by gpu-feature-discovery; also `feature.node.kubernetes.io/pci-10de.present=true` from NFD) |
+| Node Taint | `nvidia.com/gpu=true:PreferNoSchedule` (applied by `null_resource.gpu_node_config` to every NFD-tagged GPU node) |
 | Driver | NVIDIA GPU Operator |
 | Resource Name | `nvidia.com/gpu` |
 
@@ -273,8 +273,8 @@ resources {
 ### GPU Resource Management
 
 **Node Selection**: GPU pods must:
-1. Tolerate `nvidia.com/gpu=true:NoSchedule` taint
-2. Select `gpu=true` label
+1. Tolerate `nvidia.com/gpu=true:PreferNoSchedule` taint
+2. Select `nvidia.com/gpu.present=true` label (auto-applied by gpu-feature-discovery wherever the card is)
 3. Request `nvidia.com/gpu: 1` resource
 
 **Example**:
@ -286,7 +286,7 @@ spec:
|
|||
value: "true"
|
||||
effect: NoSchedule
|
||||
nodeSelector:
|
||||
gpu: "true"
|
||||
nvidia.com/gpu.present: "true"
|
||||
containers:
|
||||
- name: app
|
||||
resources:
|
||||
|
|
@ -294,6 +294,14 @@ spec:
|
|||
nvidia.com/gpu: 1
|
||||
```
|
||||
|
||||
**Portability**: No Terraform code references a specific hostname for
|
||||
GPU scheduling. If the GPU card is physically moved to a different
|
||||
node, gpu-feature-discovery moves the `nvidia.com/gpu.present=true`
|
||||
label with it, and `null_resource.gpu_node_config` re-applies the
|
||||
`nvidia.com/gpu=true:PreferNoSchedule` taint to the new host on the
|
||||
next apply (discovery keyed on
|
||||
`feature.node.kubernetes.io/pci-10de.present=true`).
|
||||
|
||||
**GPU Workloads**:
|
||||
- Ollama (LLM inference)
|
||||
- ComfyUI (Stable Diffusion workflows)
|
||||
|
|
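The added Portability note hinges on `null_resource.gpu_node_config`. Its rewritten shape is presumably something like the following sketch — the resource name and the NFD discovery label come from the commit, but the trigger and provisioner body are assumptions:

```hcl
# Sketch only: enumerate NFD-labeled GPU nodes and taint each one.
# Resource name from the commit; everything inside is an assumption.
resource "null_resource" "gpu_node_config" {
  triggers = {
    always = timestamp() # re-evaluate node set on every apply
  }

  provisioner "local-exec" {
    command = <<-EOT
      for node in $(kubectl get nodes \
          -l feature.node.kubernetes.io/pci-10de.present=true \
          -o name); do
        kubectl taint "$node" nvidia.com/gpu=true:PreferNoSchedule --overwrite
      done
    EOT
  }
}
```

Keying on the NFD PCI-vendor label (10de is NVIDIA's PCI vendor ID) rather than a hostname is what lets the taint follow the card.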
@@ -529,7 +537,7 @@ kubectl describe pod <pod-name> -n <namespace>
 ```
 0/5 nodes are available: 5 Insufficient nvidia.com/gpu.
 ```
-**Fix**: Verify GPU node (201) is Ready and labeled `gpu=true`.
+**Fix**: Verify the GPU-carrying node is Ready and has the `nvidia.com/gpu.present=true` label. Check `kubectl get nodes -l nvidia.com/gpu.present=true` — if empty, gpu-feature-discovery hasn't labeled any node (operator not running, driver not loaded, or PCI passthrough broken).
 
 ### Pods OOMKilled repeatedly
@@ -614,7 +622,7 @@ spec:
       value: "true"
     effect: NoSchedule
   nodeSelector:
-    gpu: "true"
+    nvidia.com/gpu.present: "true"
  containers:
  - name: app
    resources:
@@ -139,7 +139,7 @@ The Kubernetes cluster consists of 5 nodes:
 - **k8s-node1 (201)**: 16c/32GB GPU node with Tesla T4 passthrough, tainted for GPU workloads only
 - **k8s-node2-4 (202-204)**: 8c/32GB workers running general-purpose workloads
 
-GPU passthrough on node1 uses PCIe device 0000:06:00.0, with Kubernetes taint `nvidia.com/gpu=true:NoSchedule` and label `gpu=true` to ensure only GPU-requesting pods schedule there.
+GPU passthrough on node1 uses PCIe device 0000:06:00.0. The NVIDIA GPU Operator's gpu-feature-discovery auto-labels whichever node carries the card with `nvidia.com/gpu.present=true`; `null_resource.gpu_node_config` taints the same set of nodes with `nvidia.com/gpu=true:PreferNoSchedule`. No hostname is hardcoded — moving the card to a different node requires no Terraform edits.
 
 ### Service Organization
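Taken together, the new scheduling contract for a GPU workload reduces to three pieces: the NFD-driven selector, the soft taint toleration, and the device request. A minimal pod sketch (names such as `gpu-app` and the image are illustrative, not from the repo):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-app            # illustrative name
spec:
  nodeSelector:
    nvidia.com/gpu.present: "true"   # follows the card via gpu-feature-discovery
  tolerations:
  - key: nvidia.com/gpu
    operator: Equal
    value: "true"
    effect: PreferNoSchedule         # matches the taint null_resource applies
  containers:
  - name: app
    image: nvidia/cuda:12.4.1-base-ubuntu22.04   # illustrative image
    resources:
      limits:
        nvidia.com/gpu: 1
```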