- Add gpu=true label to Terraform (nvidia null_resource alongside taint) - Improve API server OIDC config to detect value changes, not just flag presence - Add policy_hash trigger to audit-policy so rule changes auto-reapply - Enable prometheus-node-exporter sub-chart, delete unused Ansible playbook - Document full node rebuild procedure in CLAUDE.md - Save Talos Linux migration evaluation for future reference
11 KiB
Talos Linux Migration Evaluation
Date: 2026-02-22 Status: Parked (evaluating ROI) Decision: Not yet decided — saved for future reference
Problem Statement
The Kubernetes cluster nodes (Ubuntu 24.04) are configured through a mix of:
- Cloud-init (packages, repos, containerd, kubelet, kubeadm join)
- Terraform
null_resourcewith SSH (containerd mirrors, API server OIDC, audit policy, GPU taint) - Ansible playbook (node exporter — optional)
- DaemonSets (sysctl inotify limits)
- Manual steps (GPU label, node upgrades, containerd mirror fixes)
This creates a drift surface and makes full from-scratch reprovisioning non-trivial.
Goals:
- Prevent configuration drift — ensure nodes match what's declared in code
- Single-command bootstrap — recover from complete node/cluster failure
- Everything managed as code in the infra repository
Options Evaluated
Option 1: Chef on Ubuntu — Rejected
- Chef is effectively dead (Progress acquisition, shrinking ecosystem)
- Adds Ruby DSL, Chef server/zero, cookbook management — a parallel config system
- Drift detection is reactive (periodic convergence), not preventive
- Doesn't simplify the provisioning chain, just replaces SSH commands with recipes
Option 2: NixOS — Not pursued
- Strongest drift guarantees (entire OS derived from Nix expression)
- Steep learning curve (functional language, unhelpful error messages)
- NVIDIA + containerd + K8s on NixOS is a niche combination
- Proxmox cloud-init integration less mature than Ubuntu
- Significant migration effort for marginal benefit over Talos
Option 3: Talos Linux — Preferred candidate (if migrating)
Purpose-built immutable K8s OS. No SSH, no shell, no package manager. Entire node config is a single YAML document applied via gRPC API. Read-only filesystem makes drift structurally impossible.
Option 4: Improve current setup — Low-cost alternative
Consolidate existing null_resource SSH blocks, fix the GPU label gap, and accept the small drift surface. See "Quick Wins" section below.
Talos Linux — Detailed Assessment
What Maps Cleanly
| Current (Ubuntu) | Talos Equivalent | Complexity |
|---|---|---|
| cloud_init.yaml packages | Eliminated (no packages needed) | None |
| containerd registry mirrors | machine.registries.mirrors in machine config |
Simple |
kubeadm join |
Talos manages K8s lifecycle natively | Simple |
| sysctl DaemonSet (inotify) | machine.sysctls in machine config |
Simple |
| API server OIDC flags (SSH+sed) | cluster.apiServer.extraArgs |
Simple |
| Audit policy (SSH+sed) | cluster.apiServer.extraArgs + extraVolumes |
Simple |
| GPU label (manual) | machine.nodeLabels |
Simple |
| GPU taint (null_resource) | machine.nodeTaints or machine config |
Simple |
| Static IPs | machine.network.interfaces |
Simple |
| QEMU guest agent | qemu-guest-agent system extension |
Simple |
What Has Friction
| Component | Issue | Severity |
|---|---|---|
| NFS volumes | nfs-utils extension is "contrib" tier (community-maintained) |
Medium |
| NVIDIA GPU | Extensions must version-lock to Talos release; Tesla T4 needs open kernel modules | Medium |
| No SSH | Debugging via talosctl only (dmesg, logs, dashboard, pcap) |
Low-Medium |
| Not kubeadm | Cannot in-place migrate; must build parallel cluster | High (one-time) |
| Proxmox templates | Different provisioning model (ISO boot vs cloud-init clone) | Medium |
| No arbitrary packages | No tcpdump, htop, vim on nodes; use talosctl equivalents or debug containers | Low |
Terraform Integration
Official provider: siderolabs/talos v0.10.1
# Key resources:
# - talos_machine_secrets — cluster-wide secrets (generated once)
# - talos_machine_configuration — per-node machine config (data source)
# - talos_machine_configuration_apply — apply config to a node
# - talos_machine_bootstrap — bootstrap control plane (once)
# - talos_cluster_kubeconfig — retrieve kubeconfig
Would fit as stacks/talos/ alongside existing stacks/infra/.
Example Machine Configs
Worker node (e.g., k8s-node2)
version: v1alpha1
machine:
type: worker
network:
hostname: k8s-node2
interfaces:
- interface: eth0
addresses:
- 10.0.20.102/24
routes:
- network: 0.0.0.0/0
gateway: 10.0.20.1
nameservers:
- 10.0.20.101 # Technitium
- 1.1.1.1
registries:
mirrors:
docker.io:
endpoints: ["http://10.0.20.10:5000"]
ghcr.io:
endpoints: ["http://10.0.20.10:5010"]
quay.io:
endpoints: ["http://10.0.20.10:5020"]
registry.k8s.io:
endpoints: ["http://10.0.20.10:5030"]
reg.kyverno.io:
endpoints: ["http://10.0.20.10:5040"]
sysctls:
fs.inotify.max_user_watches: "1048576"
fs.inotify.max_user_instances: "8192"
net.ipv4.ip_forward: "1"
kubelet:
extraConfig:
serializeImagePulls: false
maxParallelImagePulls: 50
install:
disk: /dev/sda
extensions:
- image: ghcr.io/siderolabs/nfs-utils:v2.7.2
- image: ghcr.io/siderolabs/qemu-guest-agent:v10.2.0
cluster:
controlPlane:
endpoint: https://10.0.20.100:6443
GPU node (k8s-node1) — additional config
machine:
kernel:
modules:
- name: nvidia
- name: nvidia_uvm
- name: nvidia_drm
- name: nvidia_modeset
nodeLabels:
gpu: "true"
nodeTaints:
nvidia.com/gpu: "true:NoSchedule"
install:
extensions:
- image: ghcr.io/siderolabs/nfs-utils:v2.7.2
- image: ghcr.io/siderolabs/qemu-guest-agent:v10.2.0
- image: ghcr.io/siderolabs/nvidia-open-gpu-kernel-modules:550.x-v1.9.5
- image: ghcr.io/siderolabs/nvidia-container-toolkit:550.x-v1.17.x
Control plane (k8s-master) — OIDC + audit
cluster:
apiServer:
extraArgs:
oidc-issuer-url: https://authentik.viktorbarzin.me/application/o/kubernetes/
oidc-client-id: kubernetes
oidc-username-claim: email
oidc-groups-claim: groups
audit-policy-file: /etc/kubernetes/policies/audit-policy.yaml
audit-log-path: /var/log/kubernetes/audit.log
audit-log-maxage: "7"
audit-log-maxbackup: "3"
audit-log-maxsize: "100"
extraVolumes:
- hostPath: /etc/kubernetes/policies
mountPath: /etc/kubernetes/policies
readOnly: true
- hostPath: /var/log/kubernetes
mountPath: /var/log/kubernetes
Migration Path (if proceeding)
This is NOT an in-place migration. Talos replaces kubeadm entirely.
- Build Talos machine configs in the repo (YAML per node, templated via Terraform)
- Create
stacks/talos/stack — Proxmox VM creation + Talos provider resources - Download Talos ISO with extensions (nfs-utils, qemu-guest-agent, nvidia) from Image Factory
- Stand up parallel cluster — new Talos VMs on unused IPs (Proxmox has ~46GB RAM headroom)
- Bootstrap control plane via
talosctl bootstrap - Point existing Terraform service stacks at new cluster kubeconfig
- Apply all service stacks — NFS-backed services point at same data, no data migration
- Validate everything works — run cluster healthcheck, test all services
- Tear down old Ubuntu VMs
- Reassign IPs if desired (reconfigure Talos nodes to use original IPs)
What Gets Eliminated
If migrated, these files/patterns become unnecessary:
modules/create-template-vm/cloud_init.yamlmodules/create-template-vm/(entire module)modules/create-vm/(replaced by Talos provider)scripts/setup_containerd_mirrors.shstacks/platform/modules/rbac/apiserver-oidc.tf(SSH+sed block)stacks/platform/modules/rbac/audit-policy.tf(SSH+sed block)stacks/platform/modules/monitoring/loki.tfsysctl-inotify DaemonSetplaybooks/deploy_node_exporter.yamlnull_resource.gpu_node_taintin nvidia module- The undocumented GPU label manual step
ROI Analysis
Costs
| Cost | Estimate |
|---|---|
| Learn Talos + talosctl workflow | Significant (new paradigm, no SSH) |
| Build Talos Terraform stack | Medium (new stack, provider, machine configs) |
| Build custom Talos ISO with extensions | Low (Image Factory makes this easy) |
| Parallel cluster setup + validation | Medium-High (must test every service) |
| NVIDIA driver testing on Talos | Medium (version-locking, open kernel modules) |
| Loss of SSH node access | Ongoing (workflow change) |
| Ongoing: Talos upgrades require extension version alignment | Low-Medium |
Benefits
| Benefit | Value |
|---|---|
| Zero configuration drift (structural guarantee) | High (but current drift risk is actually low) |
| Single-command node rebuild | High |
| Eliminates ~10 files/patterns of provisioning code | Medium |
| Atomic OS upgrades with rollback | Medium |
| Declarative API server config (no SSH+sed) | Medium |
| GPU label/taint properly codified | Low (could fix this today in 5 minutes) |
| Immutable, minimal attack surface | Low-Medium (nodes aren't internet-exposed) |
Honest Assessment
The current drift surface is small and well-understood. The highest-risk items are:
- API server OIDC/audit config — SSH+sed is fragile but rarely changes
- containerd mirrors — baked into template, stable once set
- GPU label — missing from code but trivially fixable
Most node config only runs at provisioning time (cloud-init) and doesn't drift because nobody SSHes into nodes to change things in practice.
Talos solves a real problem, but the problem isn't causing real pain today. The migration cost is high relative to the current risk. It would make sense to revisit if:
- Adding more nodes frequently (scaling the cluster)
- Experiencing actual drift incidents
- Rebuilding the cluster for other reasons (K8s major version upgrade, hardware change)
- The current SSH+sed patterns break and need rework anyway
Quick Wins (Do Instead / Do Now)
These close most of the drift gap without changing the OS:
- Add GPU label to Terraform —
kubectl labelin existing nvidianull_resourceor akubernetes_labelsresource - Make API server OIDC config idempotent — improve the grep-before-sed checks
- Move node-exporter to K8s DaemonSet — instead of Ansible playbook on host
- Document the full node rebuild procedure — cloud-init template → clone → join → verify
References
- Talos docs: https://docs.siderolabs.com/talos/v1.9/
- Talos Proxmox guide: https://docs.siderolabs.com/talos/v1.9/platform-specific-installations/virtualized-platforms/proxmox/
- Talos NVIDIA GPU: https://docs.siderolabs.com/talos/v1.9/configure-your-talos-cluster/hardware-and-drivers/nvidia-gpu
- Talos Terraform provider: https://registry.terraform.io/providers/siderolabs/talos/latest (v0.10.1)
- Talos system extensions: https://github.com/siderolabs/extensions
- Talos Image Factory: https://factory.talos.dev/