infra/docs/plans/2026-02-22-talos-linux-migration-evaluation.md
Viktor Barzin cc7f119578 [ci skip] Reduce node config drift: GPU label, OIDC idempotency, node-exporter, rebuild docs
- Add gpu=true label to Terraform (nvidia null_resource alongside taint)
- Improve API server OIDC config to detect value changes, not just flag presence
- Add policy_hash trigger to audit-policy so rule changes auto-reapply
- Enable prometheus-node-exporter sub-chart, delete unused Ansible playbook
- Document full node rebuild procedure in CLAUDE.md
- Save Talos Linux migration evaluation for future reference
2026-02-22 22:59:38 +00:00

11 KiB

Talos Linux Migration Evaluation

Date: 2026-02-22 Status: Parked (evaluating ROI) Decision: Not yet decided — saved for future reference

Problem Statement

The Kubernetes cluster nodes (Ubuntu 24.04) are configured through a mix of:

  • Cloud-init (packages, repos, containerd, kubelet, kubeadm join)
  • Terraform null_resource with SSH (containerd mirrors, API server OIDC, audit policy, GPU taint)
  • Ansible playbook (node exporter — optional)
  • DaemonSets (sysctl inotify limits)
  • Manual steps (GPU label, node upgrades, containerd mirror fixes)

This creates a drift surface and makes full from-scratch reprovisioning non-trivial.

Goals:

  1. Prevent configuration drift — ensure nodes match what's declared in code
  2. Single-command bootstrap — recover from complete node/cluster failure
  3. Everything managed as code in the infra repository

Options Evaluated

Option 1: Chef on Ubuntu — Rejected

  • Chef is effectively dead (Progress acquisition, shrinking ecosystem)
  • Adds Ruby DSL, Chef server/zero, cookbook management — a parallel config system
  • Drift detection is reactive (periodic convergence), not preventive
  • Doesn't simplify the provisioning chain, just replaces SSH commands with recipes

Option 2: NixOS — Not pursued

  • Strongest drift guarantees (entire OS derived from Nix expression)
  • Steep learning curve (functional language, unhelpful error messages)
  • NVIDIA + containerd + K8s on NixOS is a niche combination
  • Proxmox cloud-init integration less mature than Ubuntu
  • Significant migration effort for marginal benefit over Talos

Option 3: Talos Linux — Preferred candidate (if migrating)

Purpose-built immutable K8s OS. No SSH, no shell, no package manager. Entire node config is a single YAML document applied via gRPC API. Read-only filesystem makes drift structurally impossible.

Option 4: Improve current setup — Low-cost alternative

Consolidate existing null_resource SSH blocks, fix the GPU label gap, and accept the small drift surface. See "Quick Wins" section below.

Talos Linux — Detailed Assessment

What Maps Cleanly

Current (Ubuntu) Talos Equivalent Complexity
cloud_init.yaml packages Eliminated (no packages needed) None
containerd registry mirrors machine.registries.mirrors in machine config Simple
kubeadm join Talos manages K8s lifecycle natively Simple
sysctl DaemonSet (inotify) machine.sysctls in machine config Simple
API server OIDC flags (SSH+sed) cluster.apiServer.extraArgs Simple
Audit policy (SSH+sed) cluster.apiServer.extraArgs + extraVolumes Simple
GPU label (manual) machine.nodeLabels Simple
GPU taint (null_resource) machine.nodeTaints or machine config Simple
Static IPs machine.network.interfaces Simple
QEMU guest agent qemu-guest-agent system extension Simple

What Has Friction

Component Issue Severity
NFS volumes nfs-utils extension is "contrib" tier (community-maintained) Medium
NVIDIA GPU Extensions must version-lock to Talos release; Tesla T4 needs open kernel modules Medium
No SSH Debugging via talosctl only (dmesg, logs, dashboard, pcap) Low-Medium
Not kubeadm Cannot in-place migrate; must build parallel cluster High (one-time)
Proxmox templates Different provisioning model (ISO boot vs cloud-init clone) Medium
No arbitrary packages No tcpdump, htop, vim on nodes; use talosctl equivalents or debug containers Low

Terraform Integration

Official provider: siderolabs/talos v0.10.1

# Key resources:
# - talos_machine_secrets        — cluster-wide secrets (generated once)
# - talos_machine_configuration  — per-node machine config (data source)
# - talos_machine_configuration_apply — apply config to a node
# - talos_machine_bootstrap      — bootstrap control plane (once)
# - talos_cluster_kubeconfig     — retrieve kubeconfig

Would fit as stacks/talos/ alongside existing stacks/infra/.

Example Machine Configs

Worker node (e.g., k8s-node2)

version: v1alpha1
machine:
  type: worker
  network:
    hostname: k8s-node2
    interfaces:
      - interface: eth0
        addresses:
          - 10.0.20.102/24
        routes:
          - network: 0.0.0.0/0
            gateway: 10.0.20.1
    nameservers:
      - 10.0.20.101  # Technitium
      - 1.1.1.1
  registries:
    mirrors:
      docker.io:
        endpoints: ["http://10.0.20.10:5000"]
      ghcr.io:
        endpoints: ["http://10.0.20.10:5010"]
      quay.io:
        endpoints: ["http://10.0.20.10:5020"]
      registry.k8s.io:
        endpoints: ["http://10.0.20.10:5030"]
      reg.kyverno.io:
        endpoints: ["http://10.0.20.10:5040"]
  sysctls:
    fs.inotify.max_user_watches: "1048576"
    fs.inotify.max_user_instances: "8192"
    net.ipv4.ip_forward: "1"
  kubelet:
    extraConfig:
      serializeImagePulls: false
      maxParallelImagePulls: 50
  install:
    disk: /dev/sda
    extensions:
      - image: ghcr.io/siderolabs/nfs-utils:v2.7.2
      - image: ghcr.io/siderolabs/qemu-guest-agent:v10.2.0
cluster:
  controlPlane:
    endpoint: https://10.0.20.100:6443

GPU node (k8s-node1) — additional config

machine:
  kernel:
    modules:
      - name: nvidia
      - name: nvidia_uvm
      - name: nvidia_drm
      - name: nvidia_modeset
  nodeLabels:
    gpu: "true"
  nodeTaints:
    nvidia.com/gpu: "true:NoSchedule"
  install:
    extensions:
      - image: ghcr.io/siderolabs/nfs-utils:v2.7.2
      - image: ghcr.io/siderolabs/qemu-guest-agent:v10.2.0
      - image: ghcr.io/siderolabs/nvidia-open-gpu-kernel-modules:550.x-v1.9.5
      - image: ghcr.io/siderolabs/nvidia-container-toolkit:550.x-v1.17.x

Control plane (k8s-master) — OIDC + audit

cluster:
  apiServer:
    extraArgs:
      oidc-issuer-url: https://authentik.viktorbarzin.me/application/o/kubernetes/
      oidc-client-id: kubernetes
      oidc-username-claim: email
      oidc-groups-claim: groups
      audit-policy-file: /etc/kubernetes/policies/audit-policy.yaml
      audit-log-path: /var/log/kubernetes/audit.log
      audit-log-maxage: "7"
      audit-log-maxbackup: "3"
      audit-log-maxsize: "100"
    extraVolumes:
      - hostPath: /etc/kubernetes/policies
        mountPath: /etc/kubernetes/policies
        readOnly: true
      - hostPath: /var/log/kubernetes
        mountPath: /var/log/kubernetes

Migration Path (if proceeding)

This is NOT an in-place migration. Talos replaces kubeadm entirely.

  1. Build Talos machine configs in the repo (YAML per node, templated via Terraform)
  2. Create stacks/talos/ stack — Proxmox VM creation + Talos provider resources
  3. Download Talos ISO with extensions (nfs-utils, qemu-guest-agent, nvidia) from Image Factory
  4. Stand up parallel cluster — new Talos VMs on unused IPs (Proxmox has ~46GB RAM headroom)
  5. Bootstrap control plane via talosctl bootstrap
  6. Point existing Terraform service stacks at new cluster kubeconfig
  7. Apply all service stacks — NFS-backed services point at same data, no data migration
  8. Validate everything works — run cluster healthcheck, test all services
  9. Tear down old Ubuntu VMs
  10. Reassign IPs if desired (reconfigure Talos nodes to use original IPs)

What Gets Eliminated

If migrated, these files/patterns become unnecessary:

  • modules/create-template-vm/cloud_init.yaml
  • modules/create-template-vm/ (entire module)
  • modules/create-vm/ (replaced by Talos provider)
  • scripts/setup_containerd_mirrors.sh
  • stacks/platform/modules/rbac/apiserver-oidc.tf (SSH+sed block)
  • stacks/platform/modules/rbac/audit-policy.tf (SSH+sed block)
  • stacks/platform/modules/monitoring/loki.tf sysctl-inotify DaemonSet
  • playbooks/deploy_node_exporter.yaml
  • null_resource.gpu_node_taint in nvidia module
  • The undocumented GPU label manual step

ROI Analysis

Costs

Cost Estimate
Learn Talos + talosctl workflow Significant (new paradigm, no SSH)
Build Talos Terraform stack Medium (new stack, provider, machine configs)
Build custom Talos ISO with extensions Low (Image Factory makes this easy)
Parallel cluster setup + validation Medium-High (must test every service)
NVIDIA driver testing on Talos Medium (version-locking, open kernel modules)
Loss of SSH node access Ongoing (workflow change)
Ongoing: Talos upgrades require extension version alignment Low-Medium

Benefits

Benefit Value
Zero configuration drift (structural guarantee) High (but current drift risk is actually low)
Single-command node rebuild High
Eliminates ~10 files/patterns of provisioning code Medium
Atomic OS upgrades with rollback Medium
Declarative API server config (no SSH+sed) Medium
GPU label/taint properly codified Low (could fix this today in 5 minutes)
Immutable, minimal attack surface Low-Medium (nodes aren't internet-exposed)

Honest Assessment

The current drift surface is small and well-understood. The highest-risk items are:

  1. API server OIDC/audit config — SSH+sed is fragile but rarely changes
  2. containerd mirrors — baked into template, stable once set
  3. GPU label — missing from code but trivially fixable

Most node config only runs at provisioning time (cloud-init) and doesn't drift because nobody SSHes into nodes to change things in practice.

Talos solves a real problem, but the problem isn't causing real pain today. The migration cost is high relative to the current risk. It would make sense to revisit if:

  • Adding more nodes frequently (scaling the cluster)
  • Experiencing actual drift incidents
  • Rebuilding the cluster for other reasons (K8s major version upgrade, hardware change)
  • The current SSH+sed patterns break and need rework anyway

Quick Wins (Do Instead / Do Now)

These close most of the drift gap without changing the OS:

  1. Add GPU label to Terraformkubectl label in existing nvidia null_resource or a kubernetes_labels resource
  2. Make API server OIDC config idempotent — improve the grep-before-sed checks
  3. Move node-exporter to K8s DaemonSet — instead of Ansible playbook on host
  4. Document the full node rebuild procedure — cloud-init template → clone → join → verify

References