infra/docs/architecture
Viktor Barzin e2146e6916 gpu: schedule off NFD label, not k8s-node1 hostname
Remove every hardcoded reference to k8s-node1 that pinned GPU
scheduling to a specific host:

- GPU workload nodeSelectors: gpu=true -> nvidia.com/gpu.present=true
  (frigate, immich, whisper, piper, ytdlp, ebook2audiobook, audiblez,
  audiblez-web, nvidia-exporter, gpu-pod-exporter). The NFD label is
  auto-applied by gpu-feature-discovery on any node carrying an
  NVIDIA PCI device, so the selector follows the card.

- null_resource.gpu_node_config: rewrite to enumerate NFD-labeled
  nodes (feature.node.kubernetes.io/pci-10de.present=true) and taint
  each with nvidia.com/gpu=true:PreferNoSchedule. Drop the manual
  'kubectl label gpu=true' since NFD handles labeling.

- MySQL anti-affinity: kubernetes.io/hostname NotIn [k8s-node1] ->
  nvidia.com/gpu.present NotIn [true]. Same intent (keep MySQL off
  the GPU node) but portable when the card relocates.

Net effect: moving the GPU card between nodes no longer requires any
Terraform edit. Verified no-op for current scheduling — both old and
new labels resolve to node1 today.

Docs updated to match: AGENTS.md, compute.md, overview.md,
proxmox-inventory.md, k8s-portal agent-guidance string.
2026-04-22 13:43:07 +00:00
..
agent-task-tracking.md Add agent task tracking documentation 2026-04-15 17:11:26 +00:00
authentication.md fix: cluster healthcheck fixes + Authentik upgrade to 2026.2.2 2026-04-15 06:41:56 +00:00
automated-upgrades.md [docs] automated-upgrades: document long-lived OAuth + expiry monitoring 2026-04-18 13:00:07 +00:00
backup-dr.md monitoring: tighten LVMSnapshotStale to 30h for daily-cadence detection 2026-04-22 08:54:37 +00:00
ci-cd.md [docs] Architecture docs: registry integrity probe, pin, new CI pipelines 2026-04-19 17:51:26 +00:00
compute.md gpu: schedule off NFD label, not k8s-node1 hostname 2026-04-22 13:43:07 +00:00
databases.md [redis] Phase 7 step 2: remove Bitnami helm_release + orphan PVCs 2026-04-19 16:32:14 +00:00
dns.md [dns] Kea: multi-IP DHCP option 6 (10.0.10, 10.0.20) + TSIG-signed DDNS (WS E) 2026-04-19 16:12:23 +00:00
homepage.md add homepage auto-discovery documentation [ci skip] 2026-03-25 13:06:43 +02:00
incident-response.md [claude-agent-service] Migrate all pipelines from DevVM SSH to K8s HTTP 2026-04-18 10:12:02 +00:00
mailserver.md monitoring: bring EmailRoundtripStale threshold docs in sync with for:20m 2026-04-21 22:39:46 +00:00
monitoring.md monitoring: bring EmailRoundtripStale threshold docs in sync with for:20m 2026-04-21 22:39:46 +00:00
multi-tenancy.md add architecture documentation for all infrastructure subsystems [ci skip] 2026-03-24 00:55:25 +02:00
networking.md [docs] TrueNAS decommission cleanup — remove references from active docs 2026-04-19 16:55:43 +00:00
overview.md gpu: schedule off NFD label, not k8s-node1 hostname 2026-04-22 13:43:07 +00:00
secrets.md docs: comprehensive audit and update of all architecture docs and runbooks [ci skip] 2026-04-06 13:21:05 +03:00
security.md [docs] Update anti-AI and rybbit docs after rewrite-body removal 2026-04-17 21:43:13 +00:00
storage.md [docs] TrueNAS decommission cleanup — remove references from active docs 2026-04-19 16:55:43 +00:00
vpn.md [docs] TrueNAS decommission cleanup — remove references from active docs 2026-04-19 16:55:43 +00:00