Viktor Barzin cc7f119578 [ci skip] Reduce node config drift: GPU label, OIDC idempotency, node-exporter, rebuild docs

- Add gpu=true label to Terraform (nvidia null_resource alongside taint)
- Improve API server OIDC config to detect value changes, not just flag presence
- Add policy_hash trigger to audit-policy so rule changes auto-reapply
- Enable prometheus-node-exporter sub-chart, delete unused Ansible playbook
- Document full node rebuild procedure in CLAUDE.md
- Save Talos Linux migration evaluation for future reference

2026-02-22 22:59:38 +00:00

11 KiB

Raw Blame History

Talos Linux Migration Evaluation

Date: 2026-02-22 Status: Parked (evaluating ROI) Decision: Not yet decided — saved for future reference

Problem Statement

The Kubernetes cluster nodes (Ubuntu 24.04) are configured through a mix of:

Cloud-init (packages, repos, containerd, kubelet, kubeadm join)
Terraform null_resource with SSH (containerd mirrors, API server OIDC, audit policy, GPU taint)
Ansible playbook (node exporter — optional)
DaemonSets (sysctl inotify limits)
Manual steps (GPU label, node upgrades, containerd mirror fixes)

This creates a drift surface and makes full from-scratch reprovisioning non-trivial.

Goals:

Prevent configuration drift — ensure nodes match what's declared in code
Single-command bootstrap — recover from complete node/cluster failure
Everything managed as code in the infra repository

Options Evaluated

Option 1: Chef on Ubuntu — Rejected

Chef is effectively dead (Progress acquisition, shrinking ecosystem)
Adds Ruby DSL, Chef server/zero, cookbook management — a parallel config system
Drift detection is reactive (periodic convergence), not preventive
Doesn't simplify the provisioning chain, just replaces SSH commands with recipes

Option 2: NixOS — Not pursued

Strongest drift guarantees (entire OS derived from Nix expression)
Steep learning curve (functional language, unhelpful error messages)
NVIDIA + containerd + K8s on NixOS is a niche combination
Proxmox cloud-init integration less mature than Ubuntu
Significant migration effort for marginal benefit over Talos

Option 3: Talos Linux — Preferred candidate (if migrating)

Purpose-built immutable K8s OS. No SSH, no shell, no package manager. Entire node config is a single YAML document applied via gRPC API. Read-only filesystem makes drift structurally impossible.

Option 4: Improve current setup — Low-cost alternative

Consolidate existing null_resource SSH blocks, fix the GPU label gap, and accept the small drift surface. See "Quick Wins" section below.

Talos Linux — Detailed Assessment

What Maps Cleanly

Current (Ubuntu)	Talos Equivalent	Complexity
cloud_init.yaml packages	Eliminated (no packages needed)	None
containerd registry mirrors	`machine.registries.mirrors` in machine config	Simple
`kubeadm join`	Talos manages K8s lifecycle natively	Simple
sysctl DaemonSet (inotify)	`machine.sysctls` in machine config	Simple
API server OIDC flags (SSH+sed)	`cluster.apiServer.extraArgs`	Simple
Audit policy (SSH+sed)	`cluster.apiServer.extraArgs` + `extraVolumes`	Simple
GPU label (manual)	`machine.nodeLabels`	Simple
GPU taint (null_resource)	`machine.nodeTaints` or machine config	Simple
Static IPs	`machine.network.interfaces`	Simple
QEMU guest agent	`qemu-guest-agent` system extension	Simple

What Has Friction

Component	Issue	Severity
NFS volumes	`nfs-utils` extension is "contrib" tier (community-maintained)	Medium
NVIDIA GPU	Extensions must version-lock to Talos release; Tesla T4 needs open kernel modules	Medium
No SSH	Debugging via `talosctl` only (dmesg, logs, dashboard, pcap)	Low-Medium
Not kubeadm	Cannot in-place migrate; must build parallel cluster	High (one-time)
Proxmox templates	Different provisioning model (ISO boot vs cloud-init clone)	Medium
No arbitrary packages	No tcpdump, htop, vim on nodes; use talosctl equivalents or debug containers	Low

Terraform Integration

Official provider: siderolabs/talos v0.10.1

# Key resources:
# - talos_machine_secrets        — cluster-wide secrets (generated once)
# - talos_machine_configuration  — per-node machine config (data source)
# - talos_machine_configuration_apply — apply config to a node
# - talos_machine_bootstrap      — bootstrap control plane (once)
# - talos_cluster_kubeconfig     — retrieve kubeconfig

Would fit as stacks/talos/ alongside existing stacks/infra/.

Example Machine Configs

Worker node (e.g., k8s-node2)

version: v1alpha1
machine:
  type: worker
  network:
    hostname: k8s-node2
    interfaces:
      - interface: eth0
        addresses:
          - 10.0.20.102/24
        routes:
          - network: 0.0.0.0/0
            gateway: 10.0.20.1
    nameservers:
      - 10.0.20.101  # Technitium
      - 1.1.1.1
  registries:
    mirrors:
      docker.io:
        endpoints: ["http://10.0.20.10:5000"]
      ghcr.io:
        endpoints: ["http://10.0.20.10:5010"]
      quay.io:
        endpoints: ["http://10.0.20.10:5020"]
      registry.k8s.io:
        endpoints: ["http://10.0.20.10:5030"]
      reg.kyverno.io:
        endpoints: ["http://10.0.20.10:5040"]
  sysctls:
    fs.inotify.max_user_watches: "1048576"
    fs.inotify.max_user_instances: "8192"
    net.ipv4.ip_forward: "1"
  kubelet:
    extraConfig:
      serializeImagePulls: false
      maxParallelImagePulls: 50
  install:
    disk: /dev/sda
    extensions:
      - image: ghcr.io/siderolabs/nfs-utils:v2.7.2
      - image: ghcr.io/siderolabs/qemu-guest-agent:v10.2.0
cluster:
  controlPlane:
    endpoint: https://10.0.20.100:6443

GPU node (k8s-node1) — additional config

machine:
  kernel:
    modules:
      - name: nvidia
      - name: nvidia_uvm
      - name: nvidia_drm
      - name: nvidia_modeset
  nodeLabels:
    gpu: "true"
  nodeTaints:
    nvidia.com/gpu: "true:NoSchedule"
  install:
    extensions:
      - image: ghcr.io/siderolabs/nfs-utils:v2.7.2
      - image: ghcr.io/siderolabs/qemu-guest-agent:v10.2.0
      - image: ghcr.io/siderolabs/nvidia-open-gpu-kernel-modules:550.x-v1.9.5
      - image: ghcr.io/siderolabs/nvidia-container-toolkit:550.x-v1.17.x

Control plane (k8s-master) — OIDC + audit

cluster:
  apiServer:
    extraArgs:
      oidc-issuer-url: https://authentik.viktorbarzin.me/application/o/kubernetes/
      oidc-client-id: kubernetes
      oidc-username-claim: email
      oidc-groups-claim: groups
      audit-policy-file: /etc/kubernetes/policies/audit-policy.yaml
      audit-log-path: /var/log/kubernetes/audit.log
      audit-log-maxage: "7"
      audit-log-maxbackup: "3"
      audit-log-maxsize: "100"
    extraVolumes:
      - hostPath: /etc/kubernetes/policies
        mountPath: /etc/kubernetes/policies
        readOnly: true
      - hostPath: /var/log/kubernetes
        mountPath: /var/log/kubernetes

Migration Path (if proceeding)

This is NOT an in-place migration. Talos replaces kubeadm entirely.

Build Talos machine configs in the repo (YAML per node, templated via Terraform)
Create stacks/talos/ stack — Proxmox VM creation + Talos provider resources
Download Talos ISO with extensions (nfs-utils, qemu-guest-agent, nvidia) from Image Factory
Stand up parallel cluster — new Talos VMs on unused IPs (Proxmox has ~46GB RAM headroom)
Bootstrap control plane via talosctl bootstrap
Point existing Terraform service stacks at new cluster kubeconfig
Apply all service stacks — NFS-backed services point at same data, no data migration
Validate everything works — run cluster healthcheck, test all services
Tear down old Ubuntu VMs
Reassign IPs if desired (reconfigure Talos nodes to use original IPs)

What Gets Eliminated

If migrated, these files/patterns become unnecessary:

modules/create-template-vm/cloud_init.yaml
modules/create-template-vm/ (entire module)
modules/create-vm/ (replaced by Talos provider)
scripts/setup_containerd_mirrors.sh
stacks/platform/modules/rbac/apiserver-oidc.tf (SSH+sed block)
stacks/platform/modules/rbac/audit-policy.tf (SSH+sed block)
stacks/platform/modules/monitoring/loki.tf sysctl-inotify DaemonSet
playbooks/deploy_node_exporter.yaml
null_resource.gpu_node_taint in nvidia module
The undocumented GPU label manual step

ROI Analysis

Costs

Cost	Estimate
Learn Talos + talosctl workflow	Significant (new paradigm, no SSH)
Build Talos Terraform stack	Medium (new stack, provider, machine configs)
Build custom Talos ISO with extensions	Low (Image Factory makes this easy)
Parallel cluster setup + validation	Medium-High (must test every service)
NVIDIA driver testing on Talos	Medium (version-locking, open kernel modules)
Loss of SSH node access	Ongoing (workflow change)
Ongoing: Talos upgrades require extension version alignment	Low-Medium

Benefits

Benefit	Value
Zero configuration drift (structural guarantee)	High (but current drift risk is actually low)
Single-command node rebuild	High
Eliminates ~10 files/patterns of provisioning code	Medium
Atomic OS upgrades with rollback	Medium
Declarative API server config (no SSH+sed)	Medium
GPU label/taint properly codified	Low (could fix this today in 5 minutes)
Immutable, minimal attack surface	Low-Medium (nodes aren't internet-exposed)

Honest Assessment

The current drift surface is small and well-understood. The highest-risk items are:

API server OIDC/audit config — SSH+sed is fragile but rarely changes
containerd mirrors — baked into template, stable once set
GPU label — missing from code but trivially fixable

Most node config only runs at provisioning time (cloud-init) and doesn't drift because nobody SSHes into nodes to change things in practice.

Talos solves a real problem, but the problem isn't causing real pain today. The migration cost is high relative to the current risk. It would make sense to revisit if:

Adding more nodes frequently (scaling the cluster)
Experiencing actual drift incidents
Rebuilding the cluster for other reasons (K8s major version upgrade, hardware change)
The current SSH+sed patterns break and need rework anyway

Quick Wins (Do Instead / Do Now)

These close most of the drift gap without changing the OS:

Add GPU label to Terraform — kubectl label in existing nvidia null_resource or a kubernetes_labels resource
Make API server OIDC config idempotent — improve the grep-before-sed checks
Move node-exporter to K8s DaemonSet — instead of Ansible playbook on host
Document the full node rebuild procedure — cloud-init template → clone → join → verify

References

Talos docs: https://docs.siderolabs.com/talos/v1.9/
Talos Proxmox guide: https://docs.siderolabs.com/talos/v1.9/platform-specific-installations/virtualized-platforms/proxmox/
Talos NVIDIA GPU: https://docs.siderolabs.com/talos/v1.9/configure-your-talos-cluster/hardware-and-drivers/nvidia-gpu
Talos Terraform provider: https://registry.terraform.io/providers/siderolabs/talos/latest (v0.10.1)
Talos system extensions: https://github.com/siderolabs/extensions
Talos Image Factory: https://factory.talos.dev/

11 KiB Raw Blame History