# Talos Linux Migration Evaluation

**Date**: 2026-02-22

**Status**: Parked (evaluating ROI)

**Decision**: Not yet decided; saved for future reference

## Problem Statement

The Kubernetes cluster nodes (Ubuntu 24.04) are configured through a mix of:

- Cloud-init (packages, repos, containerd, kubelet, kubeadm join)
- Terraform `null_resource` with SSH (containerd mirrors, API server OIDC, audit policy, GPU taint)
- Ansible playbook (node exporter, optional)
- DaemonSets (sysctl inotify limits)
- Manual steps (GPU label, node upgrades, containerd mirror fixes)

This creates a drift surface and makes full from-scratch reprovisioning non-trivial.

**Goals:**

1. Prevent configuration drift: ensure nodes match what's declared in code
2. Single-command bootstrap: recover from complete node/cluster failure
3. Everything managed as code in the infra repository

## Options Evaluated

### Option 1: Chef on Ubuntu - Rejected

- Chef is effectively dead (Progress acquisition, shrinking ecosystem)
- Adds Ruby DSL, Chef server/zero, cookbook management: a parallel config system
- Drift detection is reactive (periodic convergence), not preventive
- Doesn't simplify the provisioning chain, just replaces SSH commands with recipes

### Option 2: NixOS - Not pursued

- Strongest drift guarantees (entire OS derived from Nix expression)
- Steep learning curve (functional language, unhelpful error messages)
- NVIDIA + containerd + K8s on NixOS is a niche combination
- Proxmox cloud-init integration less mature than Ubuntu
- Significant migration effort for marginal benefit over Talos

### Option 3: Talos Linux - Preferred candidate (if migrating)

Purpose-built immutable K8s OS. No SSH, no shell, no package manager. Entire node config is a single YAML document applied via gRPC API. Read-only filesystem makes drift structurally impossible.

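To make the "single YAML document applied via API" model concrete, the following is a minimal sketch using the `siderolabs/talos` Terraform provider. It is an illustration under assumptions, not a tested stack: the cluster name `homelab`, the resource labels (`this`, `worker`), and the patch contents are hypothetical, while the endpoint and node addresses are borrowed from the example addresses used elsewhere in this document.

```hcl
terraform {
  required_providers {
    talos = {
      source  = "siderolabs/talos"
      version = "0.10.1" # version cited in this evaluation
    }
  }
}

# Cluster-wide secrets, generated once and kept in Terraform state.
resource "talos_machine_secrets" "this" {}

# Render the base worker machine config (one YAML document).
data "talos_machine_configuration" "worker" {
  cluster_name     = "homelab" # hypothetical name
  cluster_endpoint = "https://10.0.20.100:6443"
  machine_type     = "worker"
  machine_secrets  = talos_machine_secrets.this.machine_secrets
}

# Send the rendered config, plus node-specific patches, to the node over the Talos API.
resource "talos_machine_configuration_apply" "worker" {
  client_configuration        = talos_machine_secrets.this.client_configuration
  machine_configuration_input = data.talos_machine_configuration.worker.machine_configuration
  node                        = "10.0.20.102"

  # Hypothetical patch: only the fields that differ per node.
  config_patches = [
    yamlencode({
      machine = {
        network = { hostname = "k8s-node2" }
        sysctls = { "fs.inotify.max_user_watches" = "1048576" }
      }
    })
  ]
}
```

In this model, editing the patch and re-running `terraform apply` would re-send the document to the node, which is the step that stands in for the current SSH-based changes.
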
### Option 4: Improve current setup - Low-cost alternative

Consolidate existing `null_resource` SSH blocks, fix the GPU label gap, and accept the small drift surface. See the "Quick Wins" section below.

## Talos Linux - Detailed Assessment

### What Maps Cleanly

| Current (Ubuntu) | Talos Equivalent | Complexity |
|---|---|---|
| cloud_init.yaml packages | Eliminated (no packages needed) | None |
| containerd registry mirrors | `machine.registries.mirrors` in machine config | Simple |
| `kubeadm join` | Talos manages K8s lifecycle natively | Simple |
| sysctl DaemonSet (inotify) | `machine.sysctls` in machine config | Simple |
| API server OIDC flags (SSH+sed) | `cluster.apiServer.extraArgs` | Simple |
| Audit policy (SSH+sed) | `cluster.apiServer.extraArgs` + `extraVolumes` | Simple |
| GPU label (manual) | `machine.nodeLabels` | Simple |
| GPU taint (null_resource) | `machine.nodeTaints` or machine config | Simple |
| Static IPs | `machine.network.interfaces` | Simple |
| QEMU guest agent | `qemu-guest-agent` system extension | Simple |

### What Has Friction

| Component | Issue | Severity |
|---|---|---|
| NFS volumes | `nfs-utils` extension is "contrib" tier (community-maintained) | Medium |
| NVIDIA GPU | Extensions must version-lock to Talos release; Tesla T4 needs open kernel modules | Medium |
| No SSH | Debugging via `talosctl` only (dmesg, logs, dashboard, pcap) | Low-Medium |
| Not kubeadm | Cannot in-place migrate; must build parallel cluster | High (one-time) |
| Proxmox templates | Different provisioning model (ISO boot vs cloud-init clone) | Medium |
| No arbitrary packages | No tcpdump, htop, vim on nodes; use talosctl equivalents or debug containers | Low |

### Terraform Integration

Official provider: `siderolabs/talos` v0.10.1

```hcl
# Key resources:
# - talos_machine_secrets: cluster-wide secrets (generated once)
# - talos_machine_configuration: per-node machine config (data source)
# - talos_machine_configuration_apply: apply config to a node
# - talos_machine_bootstrap: bootstrap control plane (once)
# - talos_cluster_kubeconfig: retrieve kubeconfig
```

Would fit as `stacks/talos/` alongside existing `stacks/infra/`.

### Example Machine Configs

#### Worker node (e.g., k8s-node2)

```yaml
version: v1alpha1
machine:
  type: worker
  network:
    hostname: k8s-node2
    interfaces:
      - interface: eth0
        addresses:
          - 10.0.20.102/24
        routes:
          - network: 0.0.0.0/0
            gateway: 10.0.20.1
    nameservers:
      - 10.0.20.101  # Technitium
      - 1.1.1.1
  registries:
    mirrors:
      docker.io:
        endpoints: ["http://10.0.20.10:5000"]
      ghcr.io:
        endpoints: ["http://10.0.20.10:5010"]
      quay.io:
        endpoints: ["http://10.0.20.10:5020"]
      registry.k8s.io:
        endpoints: ["http://10.0.20.10:5030"]
      reg.kyverno.io:
        endpoints: ["http://10.0.20.10:5040"]
  sysctls:
    fs.inotify.max_user_watches: "1048576"
    fs.inotify.max_user_instances: "8192"
    net.ipv4.ip_forward: "1"
  kubelet:
    extraConfig:
      serializeImagePulls: false
      maxParallelImagePulls: 50
  install:
    disk: /dev/sda
    extensions:
      - image: ghcr.io/siderolabs/nfs-utils:v2.7.2
      - image: ghcr.io/siderolabs/qemu-guest-agent:v10.2.0
cluster:
  controlPlane:
    endpoint: https://10.0.20.100:6443
```

#### GPU node (k8s-node1) - additional config

```yaml
machine:
  kernel:
    modules:
      - name: nvidia
      - name: nvidia_uvm
      - name: nvidia_drm
      - name: nvidia_modeset
  nodeLabels:
    gpu: "true"
  nodeTaints:
    nvidia.com/gpu: "true:NoSchedule"
  install:
    extensions:
      - image: ghcr.io/siderolabs/nfs-utils:v2.7.2
      - image: ghcr.io/siderolabs/qemu-guest-agent:v10.2.0
      - image: ghcr.io/siderolabs/nvidia-open-gpu-kernel-modules:550.x-v1.9.5
      - image: ghcr.io/siderolabs/nvidia-container-toolkit:550.x-v1.17.x
```

#### Control plane (k8s-master) - OIDC + audit

```yaml
cluster:
  apiServer:
    extraArgs:
      oidc-issuer-url: https://authentik.viktorbarzin.me/application/o/kubernetes/
      oidc-client-id: kubernetes
      oidc-username-claim: email
      oidc-groups-claim: groups
      audit-policy-file: /etc/kubernetes/policies/audit-policy.yaml
      audit-log-path: /var/log/kubernetes/audit.log
      audit-log-maxage: "7"
      audit-log-maxbackup: "3"
      audit-log-maxsize: "100"
    extraVolumes:
      - hostPath: /etc/kubernetes/policies
        mountPath: /etc/kubernetes/policies
        readonly: true
      - hostPath: /var/log/kubernetes
        mountPath: /var/log/kubernetes
```

### Migration Path (if proceeding)

This is NOT an in-place migration. Talos replaces kubeadm entirely.

1. **Build Talos machine configs** in the repo (YAML per node, templated via Terraform)
2. **Create `stacks/talos/` stack**: Proxmox VM creation + Talos provider resources
3. **Download Talos ISO** with extensions (nfs-utils, qemu-guest-agent, nvidia) from Image Factory
4. **Stand up parallel cluster**: new Talos VMs on unused IPs (Proxmox has ~46GB RAM headroom)
5. **Bootstrap control plane** via `talosctl bootstrap`
6. **Point existing Terraform service stacks** at new cluster kubeconfig
7. **Apply all service stacks**: NFS-backed services point at same data, no data migration
8. **Validate everything works**: run cluster healthcheck, test all services
9. **Tear down old Ubuntu VMs**
10. **Reassign IPs** if desired (reconfigure Talos nodes to use original IPs)

### What Gets Eliminated

If migrated, these files/patterns become unnecessary:

- `modules/create-template-vm/cloud_init.yaml`
- `modules/create-template-vm/` (entire module)
- `modules/create-vm/` (replaced by Talos provider)
- `scripts/setup_containerd_mirrors.sh`
- `stacks/platform/modules/rbac/apiserver-oidc.tf` (SSH+sed block)
- `stacks/platform/modules/rbac/audit-policy.tf` (SSH+sed block)
- `stacks/platform/modules/monitoring/loki.tf` sysctl-inotify DaemonSet
- `playbooks/deploy_node_exporter.yaml`
- `null_resource.gpu_node_taint` in nvidia module
- The undocumented GPU label manual step

## ROI Analysis

### Costs

| Cost | Estimate |
|---|---|
| Learn Talos + talosctl workflow | Significant (new paradigm, no SSH) |
| Build Talos Terraform stack | Medium (new stack, provider, machine configs) |
| Build custom Talos ISO with extensions | Low (Image Factory makes this easy) |
| Parallel cluster setup + validation | Medium-High (must test every service) |
| NVIDIA driver testing on Talos | Medium (version-locking, open kernel modules) |
| Loss of SSH node access | Ongoing (workflow change) |
| Ongoing: Talos upgrades require extension version alignment | Low-Medium |

### Benefits

| Benefit | Value |
|---|---|
| Zero configuration drift (structural guarantee) | High (but current drift risk is actually low) |
| Single-command node rebuild | High |
| Eliminates ~10 files/patterns of provisioning code | Medium |
| Atomic OS upgrades with rollback | Medium |
| Declarative API server config (no SSH+sed) | Medium |
| GPU label/taint properly codified | Low (could fix this today in 5 minutes) |
| Immutable, minimal attack surface | Low-Medium (nodes aren't internet-exposed) |

### Honest Assessment

The current drift surface is small and well-understood. The highest-risk items are:

1. **API server OIDC/audit config**: SSH+sed is fragile but rarely changes
2. **containerd mirrors**: baked into template, stable once set
3. **GPU label**: missing from code but trivially fixable

Most node config only runs at provisioning time (cloud-init) and doesn't drift because nobody SSHes into nodes to change things in practice.

**Talos solves a real problem, but the problem isn't causing real pain today.** The migration cost is high relative to the current risk. It would make sense to revisit if:

- Adding more nodes frequently (scaling the cluster)
- Experiencing actual drift incidents
- Rebuilding the cluster for other reasons (K8s major version upgrade, hardware change)
- The current SSH+sed patterns break and need rework anyway

## Quick Wins (Do Instead / Do Now)

These close most of the drift gap without changing the OS:

1. **Add GPU label to Terraform**: `kubectl label` in existing nvidia `null_resource` or a `kubernetes_labels` resource
2. **Make API server OIDC config idempotent**: improve the grep-before-sed checks
3. **Move node-exporter to K8s DaemonSet** instead of Ansible playbook on host
4. **Document the full node rebuild procedure**: cloud-init template → clone → join → verify

## References

- Talos docs: https://docs.siderolabs.com/talos/v1.9/
- Talos Proxmox guide: https://docs.siderolabs.com/talos/v1.9/platform-specific-installations/virtualized-platforms/proxmox/
- Talos NVIDIA GPU: https://docs.siderolabs.com/talos/v1.9/configure-your-talos-cluster/hardware-and-drivers/nvidia-gpu
- Talos Terraform provider: https://registry.terraform.io/providers/siderolabs/talos/latest (v0.10.1)
- Talos system extensions: https://github.com/siderolabs/extensions
- Talos Image Factory: https://factory.talos.dev/
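
## Appendix: Sketch for Quick Win 1

Quick Win 1 mentions a `kubernetes_labels` resource as one way to codify the GPU label. A minimal sketch follows, assuming the `hashicorp/kubernetes` provider is already configured in the stack, that the GPU node is registered as `k8s-node1`, and that the label is `gpu=true` (the key used in the example GPU machine config earlier in this document). The resource label `gpu_node` is hypothetical.

```hcl
# Sketch only: codifies the GPU label that is currently applied by hand.
# Assumes the hashicorp/kubernetes provider is configured elsewhere in the stack.
resource "kubernetes_labels" "gpu_node" {
  api_version = "v1"
  kind        = "Node"

  metadata {
    name = "k8s-node1" # GPU node name as used in this document
  }

  # Only the keys listed here are managed by this resource.
  labels = {
    gpu = "true"
  }
}
```

Since the resource manages only the listed keys, it should be able to coexist with labels set by the kubelet or other controllers, though that is worth confirming against the provider documentation before removing the manual step.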