diff --git a/.claude/CLAUDE.md b/.claude/CLAUDE.md
index 612343d0..09570fdd 100755
--- a/.claude/CLAUDE.md
+++ b/.claude/CLAUDE.md
@@ -112,6 +112,23 @@ terraform fmt -recursive # Format all
 - Docker registry pull-through cache at `10.0.20.10` (ports 5000/5010/5020/5030/5040)
 - GPU workloads need: `node_selector = { "gpu": "true" }` + `toleration { key = "nvidia.com/gpu", value = "true", effect = "NoSchedule" }`
 
+### Node Rebuild Procedure
+To rebuild a K8s worker node from scratch (e.g., after disk failure or corruption):
+
+1. **Drain the node** (if still reachable): `kubectl drain k8s-nodeX --ignore-daemonsets --delete-emptydir-data`
+2. **Delete the node from K8s**: `kubectl delete node k8s-nodeX`
+3. **Destroy the VM in Proxmox** (or via Terraform: remove from `stacks/infra/main.tf` and apply)
+4. **Ensure K8s template exists**: The template `ubuntu-2404-cloudinit-k8s-template` (VMID 2000) must exist. If not, apply `stacks/infra/` to recreate it.
+5. **Get a fresh join command**: `ssh wizard@10.0.20.100 'sudo kubeadm token create --print-join-command'`
+6. **Update `k8s_join_command`** in `terraform.tfvars` with the new join command
+7. **Create the new VM**: Add it back in `stacks/infra/main.tf` and `cd stacks/infra && terragrunt apply --non-interactive`
+8. **Wait for cloud-init**: The VM will install packages, configure containerd mirrors, and join the cluster automatically via cloud-init
+9. **Verify the node joined**: `kubectl get nodes` — should show the new node as `Ready`
+10. **For GPU node (k8s-node1) only**: Apply the platform stack to re-apply GPU label and taint: `cd stacks/platform && terragrunt apply --non-interactive` (the `null_resource.gpu_node_config` in the nvidia module handles this)
+11. **Verify containerd mirrors**: `ssh wizard@<node-ip> 'ls /etc/containerd/certs.d/'` — should show docker.io, ghcr.io, quay.io, registry.k8s.io, reg.kyverno.io
+
+**Note**: kubeadm tokens expire after 24h by default. Generate a fresh one just before creating the VM.
+
 ## Git Operations
 - **Git is slow** — commands can take 30+ seconds. Use `GIT_OPTIONAL_LOCKS=0` if git hangs.
 - Commit only specific files. **ALWAYS ask user before pushing**.
diff --git a/docs/plans/2026-02-22-node-drift-quick-wins-design.md b/docs/plans/2026-02-22-node-drift-quick-wins-design.md
new file mode 100644
index 00000000..dce0c491
--- /dev/null
+++ b/docs/plans/2026-02-22-node-drift-quick-wins-design.md
@@ -0,0 +1,29 @@
+# Node Configuration Drift Quick Wins — Design
+
+**Date**: 2026-02-22
+**Status**: Approved
+**Context**: From Talos Linux evaluation — these close 95% of the drift gap without changing the OS
+
+## Quick Win 1: Add GPU Label to Terraform
+
+**File**: `stacks/platform/modules/nvidia/main.tf`
+
+Extend the existing `null_resource.gpu_node_taint` to also apply the `gpu=true` label. Rename to `gpu_node_config`. Both commands are idempotent (`--overwrite` on both the taint and the label).
+
+## Quick Win 2: Improve API Server OIDC/Audit Idempotency
+
+**Files**: `stacks/platform/modules/rbac/apiserver-oidc.tf`, `audit-policy.tf`
+
+The current grep-before-sed checks prevent duplicate entries but don't handle value changes. Improve the OIDC check to compare the actual issuer URL value, not just the flag name. The audit policy file is always re-uploaded (good); the manifest edit is skipped if already configured (acceptable).
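+
+A rough sketch of the value-aware check (plain shell with placeholder variables; the real version in `apiserver-oidc.tf` interpolates `var.oidc_issuer_url` / `var.oidc_client_id` via Terraform):
+
+```sh
+# Pass only when the flags carry the expected values, not merely when they exist.
+if grep -q "oidc-issuer-url=${OIDC_ISSUER_URL}" /etc/kubernetes/manifests/kube-apiserver.yaml \
+   && grep -q "oidc-client-id=${OIDC_CLIENT_ID}" /etc/kubernetes/manifests/kube-apiserver.yaml; then
+  echo 'OIDC flags already configured with correct values'
+  exit 0
+fi
+# Values changed (or flags missing): strip stale OIDC flags before re-inserting them.
+sudo sed -i '/--oidc-issuer-url/d; /--oidc-client-id/d; /--oidc-username-claim/d; /--oidc-groups-claim/d' \
+  /etc/kubernetes/manifests/kube-apiserver.yaml
+```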
+
+## Quick Win 3: Enable Node-Exporter via Prometheus Helm Chart
+
+**File**: `stacks/platform/modules/monitoring/prometheus_chart_values.tpl`
+
+Uncomment `prometheus-node-exporter: enabled: true`. Delete `playbooks/deploy_node_exporter.yaml` (unused, superseded by DaemonSet).
+
+## Quick Win 4: Document Node Rebuild Procedure
+
+**File**: `.claude/CLAUDE.md`
+
+Add a "Node Rebuild Procedure" section documenting the full sequence: VM creation from template → cloud-init → kubeadm join → verify mirrors/labels/taints.
diff --git a/docs/plans/2026-02-22-talos-linux-migration-evaluation.md b/docs/plans/2026-02-22-talos-linux-migration-evaluation.md
new file mode 100644
index 00000000..699a8762
--- /dev/null
+++ b/docs/plans/2026-02-22-talos-linux-migration-evaluation.md
@@ -0,0 +1,272 @@
+# Talos Linux Migration Evaluation
+
+**Date**: 2026-02-22
+**Status**: Parked (evaluating ROI)
+**Decision**: Not yet decided — saved for future reference
+
+## Problem Statement
+
+The Kubernetes cluster nodes (Ubuntu 24.04) are configured through a mix of:
+- Cloud-init (packages, repos, containerd, kubelet, kubeadm join)
+- Terraform `null_resource` with SSH (containerd mirrors, API server OIDC, audit policy, GPU taint)
+- Ansible playbook (node exporter — optional)
+- DaemonSets (sysctl inotify limits)
+- Manual steps (GPU label, node upgrades, containerd mirror fixes)
+
+This creates a drift surface and makes full from-scratch reprovisioning non-trivial.
+
+**Goals:**
+1. Prevent configuration drift — ensure nodes match what's declared in code
+2. Single-command bootstrap — recover from complete node/cluster failure
+3. Everything managed as code in the infra repository
+
+## Options Evaluated
+
+### Option 1: Chef on Ubuntu — Rejected
+
+- Chef is effectively dead (Progress acquisition, shrinking ecosystem)
+- Adds Ruby DSL, Chef server/zero, cookbook management — a parallel config system
+- Drift detection is reactive (periodic convergence), not preventive
+- Doesn't simplify the provisioning chain, just replaces SSH commands with recipes
+
+### Option 2: NixOS — Not pursued
+
+- Strongest drift guarantees (entire OS derived from Nix expression)
+- Steep learning curve (functional language, unhelpful error messages)
+- NVIDIA + containerd + K8s on NixOS is a niche combination
+- Proxmox cloud-init integration less mature than Ubuntu
+- Significant migration effort for marginal benefit over Talos
+
+### Option 3: Talos Linux — Preferred candidate (if migrating)
+
+Purpose-built immutable K8s OS. No SSH, no shell, no package manager. Entire node config is a single YAML document applied via gRPC API. Read-only filesystem makes drift structurally impossible.
+
+### Option 4: Improve current setup — Low-cost alternative
+
+Consolidate existing `null_resource` SSH blocks, fix the GPU label gap, and accept the small drift surface. See "Quick Wins" section below.
+
+## Talos Linux — Detailed Assessment
+
+### What Maps Cleanly
+
+| Current (Ubuntu) | Talos Equivalent | Complexity |
+|---|---|---|
+| cloud_init.yaml packages | Eliminated (no packages needed) | None |
+| containerd registry mirrors | `machine.registries.mirrors` in machine config | Simple |
+| `kubeadm join` | Talos manages K8s lifecycle natively | Simple |
+| sysctl DaemonSet (inotify) | `machine.sysctls` in machine config | Simple |
+| API server OIDC flags (SSH+sed) | `cluster.apiServer.extraArgs` | Simple |
+| Audit policy (SSH+sed) | `cluster.apiServer.extraArgs` + `extraVolumes` | Simple |
+| GPU label (manual) | `machine.nodeLabels` | Simple |
+| GPU taint (null_resource) | `machine.nodeTaints` or machine config | Simple |
+| Static IPs | `machine.network.interfaces` | Simple |
+| QEMU guest agent | `qemu-guest-agent` system extension | Simple |
+
+### What Has Friction
+
+| Component | Issue | Severity |
+|---|---|---|
+| NFS volumes | `nfs-utils` extension is "contrib" tier (community-maintained) | Medium |
+| NVIDIA GPU | Extensions must version-lock to Talos release; Tesla T4 needs open kernel modules | Medium |
+| No SSH | Debugging via `talosctl` only (dmesg, logs, dashboard, pcap) | Low-Medium |
+| Not kubeadm | Cannot in-place migrate; must build parallel cluster | High (one-time) |
+| Proxmox templates | Different provisioning model (ISO boot vs cloud-init clone) | Medium |
+| No arbitrary packages | No tcpdump, htop, vim on nodes; use talosctl equivalents or debug containers | Low |
+
+### Terraform Integration
+
+Official provider: `siderolabs/talos` v0.10.1
+
+```hcl
+# Key resources:
+# - talos_machine_secrets — cluster-wide secrets (generated once)
+# - talos_machine_configuration — per-node machine config (data source)
+# - talos_machine_configuration_apply — apply config to a node
+# - talos_machine_bootstrap — bootstrap control plane (once)
+# - talos_cluster_kubeconfig — retrieve kubeconfig
+```
+
+Would fit as `stacks/talos/` alongside existing `stacks/infra/`.
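+
+A minimal sketch of how such a stack might wire these together (cluster name, node IPs, and patch paths below are illustrative, not decided):
+
+```hcl
+resource "talos_machine_secrets" "this" {}
+
+data "talos_machine_configuration" "worker" {
+  cluster_name     = "homelab" # placeholder name
+  cluster_endpoint = "https://10.0.20.100:6443"
+  machine_type     = "worker"
+  machine_secrets  = talos_machine_secrets.this.machine_secrets
+}
+
+resource "talos_machine_configuration_apply" "k8s_node2" {
+  client_configuration        = talos_machine_secrets.this.client_configuration
+  machine_configuration_input = data.talos_machine_configuration.worker.machine_configuration
+  node                        = "10.0.20.102"
+  # Per-node overrides (hostname, static IP, GPU extras) as machine config patches
+  config_patches = [file("${path.module}/patches/k8s-node2.yaml")]
+}
+
+# Run once against the control plane node, then fetch the kubeconfig
+resource "talos_machine_bootstrap" "this" {
+  client_configuration = talos_machine_secrets.this.client_configuration
+  node                 = "10.0.20.100"
+}
+
+resource "talos_cluster_kubeconfig" "this" {
+  client_configuration = talos_machine_secrets.this.client_configuration
+  node                 = "10.0.20.100"
+}
+```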
+
+### Example Machine Configs
+
+#### Worker node (e.g., k8s-node2)
+
+```yaml
+version: v1alpha1
+machine:
+  type: worker
+  network:
+    hostname: k8s-node2
+    interfaces:
+      - interface: eth0
+        addresses:
+          - 10.0.20.102/24
+        routes:
+          - network: 0.0.0.0/0
+            gateway: 10.0.20.1
+    nameservers:
+      - 10.0.20.101 # Technitium
+      - 1.1.1.1
+  registries:
+    mirrors:
+      docker.io:
+        endpoints: ["http://10.0.20.10:5000"]
+      ghcr.io:
+        endpoints: ["http://10.0.20.10:5010"]
+      quay.io:
+        endpoints: ["http://10.0.20.10:5020"]
+      registry.k8s.io:
+        endpoints: ["http://10.0.20.10:5030"]
+      reg.kyverno.io:
+        endpoints: ["http://10.0.20.10:5040"]
+  sysctls:
+    fs.inotify.max_user_watches: "1048576"
+    fs.inotify.max_user_instances: "8192"
+    net.ipv4.ip_forward: "1"
+  kubelet:
+    extraConfig:
+      serializeImagePulls: false
+      maxParallelImagePulls: 50
+  install:
+    disk: /dev/sda
+    extensions:
+      - image: ghcr.io/siderolabs/nfs-utils:v2.7.2
+      - image: ghcr.io/siderolabs/qemu-guest-agent:v10.2.0
+cluster:
+  controlPlane:
+    endpoint: https://10.0.20.100:6443
+```
+
+#### GPU node (k8s-node1) — additional config
+
+```yaml
+machine:
+  kernel:
+    modules:
+      - name: nvidia
+      - name: nvidia_uvm
+      - name: nvidia_drm
+      - name: nvidia_modeset
+  nodeLabels:
+    gpu: "true"
+  nodeTaints:
+    nvidia.com/gpu: "true:NoSchedule"
+  install:
+    extensions:
+      - image: ghcr.io/siderolabs/nfs-utils:v2.7.2
+      - image: ghcr.io/siderolabs/qemu-guest-agent:v10.2.0
+      - image: ghcr.io/siderolabs/nvidia-open-gpu-kernel-modules:550.x-v1.9.5
+      - image: ghcr.io/siderolabs/nvidia-container-toolkit:550.x-v1.17.x
+```
+
+#### Control plane (k8s-master) — OIDC + audit
+
+```yaml
+cluster:
+  apiServer:
+    extraArgs:
+      oidc-issuer-url: https://authentik.viktorbarzin.me/application/o/kubernetes/
+      oidc-client-id: kubernetes
+      oidc-username-claim: email
+      oidc-groups-claim: groups
+      audit-policy-file: /etc/kubernetes/policies/audit-policy.yaml
+      audit-log-path: /var/log/kubernetes/audit.log
+      audit-log-maxage: "7"
+      audit-log-maxbackup: "3"
+      audit-log-maxsize: "100"
+    extraVolumes:
+      - hostPath: /etc/kubernetes/policies
+        mountPath: /etc/kubernetes/policies
+        readOnly: true
+      - hostPath: /var/log/kubernetes
+        mountPath: /var/log/kubernetes
+```
+
+### Migration Path (if proceeding)
+
+This is NOT an in-place migration. Talos replaces kubeadm entirely.
+
+1. **Build Talos machine configs** in the repo (YAML per node, templated via Terraform)
+2. **Create `stacks/talos/` stack** — Proxmox VM creation + Talos provider resources
+3. **Download Talos ISO** with extensions (nfs-utils, qemu-guest-agent, nvidia) from Image Factory
+4. **Stand up parallel cluster** — new Talos VMs on unused IPs (Proxmox has ~46GB RAM headroom)
+5. **Bootstrap control plane** via `talosctl bootstrap`
+6. **Point existing Terraform service stacks** at new cluster kubeconfig
+7. **Apply all service stacks** — NFS-backed services point at same data, no data migration
+8. **Validate everything works** — run cluster healthcheck, test all services
+9. **Tear down old Ubuntu VMs**
+10. **Reassign IPs** if desired (reconfigure Talos nodes to use original IPs)
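+
+For steps 5–8, the manual `talosctl` flow would look roughly like this (the Terraform provider can perform the same bootstrap; `10.0.20.110` stands in for whatever unused IP the parallel control plane gets, and the config files are assumed to have been rendered already):
+
+```sh
+# Push the rendered machine config to the brand-new, unconfigured control plane node
+talosctl apply-config --insecure --nodes 10.0.20.110 --file controlplane.yaml
+# Bootstrap etcd/Kubernetes on the control plane (run exactly once per cluster)
+talosctl bootstrap --nodes 10.0.20.110 --endpoints 10.0.20.110 --talosconfig ./talosconfig
+# Fetch the new cluster's kubeconfig, then check overall health
+talosctl kubeconfig --nodes 10.0.20.110 --endpoints 10.0.20.110 --talosconfig ./talosconfig
+talosctl health --nodes 10.0.20.110 --endpoints 10.0.20.110 --talosconfig ./talosconfig
+```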
+
+### What Gets Eliminated
+
+If migrated, these files/patterns become unnecessary:
+- `modules/create-template-vm/cloud_init.yaml`
+- `modules/create-template-vm/` (entire module)
+- `modules/create-vm/` (replaced by Talos provider)
+- `scripts/setup_containerd_mirrors.sh`
+- `stacks/platform/modules/rbac/apiserver-oidc.tf` (SSH+sed block)
+- `stacks/platform/modules/rbac/audit-policy.tf` (SSH+sed block)
+- `stacks/platform/modules/monitoring/loki.tf` sysctl-inotify DaemonSet
+- `playbooks/deploy_node_exporter.yaml`
+- `null_resource.gpu_node_taint` in nvidia module
+- The undocumented GPU label manual step
+
+## ROI Analysis
+
+### Costs
+
+| Cost | Estimate |
+|---|---|
+| Learn Talos + talosctl workflow | Significant (new paradigm, no SSH) |
+| Build Talos Terraform stack | Medium (new stack, provider, machine configs) |
+| Build custom Talos ISO with extensions | Low (Image Factory makes this easy) |
+| Parallel cluster setup + validation | Medium-High (must test every service) |
+| NVIDIA driver testing on Talos | Medium (version-locking, open kernel modules) |
+| Loss of SSH node access | Ongoing (workflow change) |
+| Ongoing: Talos upgrades require extension version alignment | Low-Medium |
+
+### Benefits
+
+| Benefit | Value |
+|---|---|
+| Zero configuration drift (structural guarantee) | High (but current drift risk is actually low) |
+| Single-command node rebuild | High |
+| Eliminates ~10 files/patterns of provisioning code | Medium |
+| Atomic OS upgrades with rollback | Medium |
+| Declarative API server config (no SSH+sed) | Medium |
+| GPU label/taint properly codified | Low (could fix this today in 5 minutes) |
+| Immutable, minimal attack surface | Low-Medium (nodes aren't internet-exposed) |
+
+### Honest Assessment
+
+The current drift surface is small and well-understood. The highest-risk items are:
+1. **API server OIDC/audit config** — SSH+sed is fragile but rarely changes
+2. **containerd mirrors** — baked into template, stable once set
+3. **GPU label** — missing from code but trivially fixable
+
+Most node config only runs at provisioning time (cloud-init) and doesn't drift because nobody SSHes into nodes to change things in practice.
+
+**Talos solves a real problem, but the problem isn't causing real pain today.** The migration cost is high relative to the current risk. It would make sense to revisit if:
+- Adding more nodes frequently (scaling the cluster)
+- Experiencing actual drift incidents
+- Rebuilding the cluster for other reasons (K8s major version upgrade, hardware change)
+- The current SSH+sed patterns break and need rework anyway
+
+## Quick Wins (Do Instead / Do Now)
+
+These close most of the drift gap without changing the OS:
+
+1. **Add GPU label to Terraform** — `kubectl label` in existing nvidia `null_resource` or a `kubernetes_labels` resource
+2. **Make API server OIDC config idempotent** — improve the grep-before-sed checks
+3. **Move node-exporter to K8s DaemonSet** — instead of Ansible playbook on host
+4. **Document the full node rebuild procedure** — cloud-init template → clone → join → verify
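+
+For item 1, the `kubernetes_labels` variant would look roughly like this (illustrative only; the simpler route is extending the existing `null_resource` in the nvidia module):
+
+```hcl
+# Manages only the `gpu` label on the node; other labels are left untouched.
+resource "kubernetes_labels" "gpu_node" {
+  api_version = "v1"
+  kind        = "Node"
+  metadata {
+    name = "k8s-node1"
+  }
+  labels = {
+    gpu = "true"
+  }
+}
+```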
+
+## References
+
+- Talos docs: https://docs.siderolabs.com/talos/v1.9/
+- Talos Proxmox guide: https://docs.siderolabs.com/talos/v1.9/platform-specific-installations/virtualized-platforms/proxmox/
+- Talos NVIDIA GPU: https://docs.siderolabs.com/talos/v1.9/configure-your-talos-cluster/hardware-and-drivers/nvidia-gpu
+- Talos Terraform provider: https://registry.terraform.io/providers/siderolabs/talos/latest (v0.10.1)
+- Talos system extensions: https://github.com/siderolabs/extensions
+- Talos Image Factory: https://factory.talos.dev/
diff --git a/playbooks/deploy_node_exporter.yaml b/playbooks/deploy_node_exporter.yaml
deleted file mode 100644
index de3c3937..00000000
--- a/playbooks/deploy_node_exporter.yaml
+++ /dev/null
@@ -1,70 +0,0 @@
----
-- name: Install Prometheus Node Exporter
-  hosts: all
-  become: true
-  vars:
-    node_exporter_version: "1.10.2"
-    architecture: "linux-amd64"
-    # Defines where the binary is downloaded/extracted
-    download_url: "https://github.com/prometheus/node_exporter/releases/download/v{{ node_exporter_version }}/node_exporter-{{ node_exporter_version }}.{{ architecture }}.tar.gz"
-
-  tasks:
-    - name: Create node_exporter group
-      group:
-        name: node_exporter
-        state: present
-
-    - name: Create node_exporter user
-      user:
-        name: node_exporter
-        group: node_exporter
-        shell: /bin/false
-        create_home: no
-
-    - name: Download and unarchive Node Exporter
-      unarchive:
-        src: "{{ download_url }}"
-        dest: /tmp/
-        remote_src: yes
-
-    - name: Move binary to /usr/local/bin
-      copy:
-        src: "/tmp/node_exporter-{{ node_exporter_version }}.{{ architecture }}/node_exporter"
-        dest: /usr/local/bin/node_exporter
-        mode: '0755'
-        owner: node_exporter
-        group: node_exporter
-        remote_src: yes
-
-    - name: Create Systemd service file
-      copy:
-        dest: /etc/systemd/system/node_exporter.service
-        content: |
-          [Unit]
-          Description=Node Exporter
-          Wants=network-online.target
-          After=network-online.target
-
-          [Service]
-          User=node_exporter
-          Group=node_exporter
-          Type=simple
-          ExecStart=/usr/local/bin/node_exporter
-
-          [Install]
-          WantedBy=multi-user.target
-
-    - name: Force systemd to reread configs
-      systemd:
-        daemon_reload: yes
-
-    - name: Enable and start Node Exporter
-      systemd:
-        name: node_exporter
-        state: started
-        enabled: yes
-
-    - name: Clean up temporary files
-      file:
-        path: "/tmp/node_exporter-{{ node_exporter_version }}.{{ architecture }}"
-        state: absent
diff --git a/stacks/platform/modules/monitoring/prometheus_chart_values.tpl b/stacks/platform/modules/monitoring/prometheus_chart_values.tpl
index 0cd47b85..40a688c0 100755
--- a/stacks/platform/modules/monitoring/prometheus_chart_values.tpl
+++ b/stacks/platform/modules/monitoring/prometheus_chart_values.tpl
@@ -101,8 +101,8 @@ alertmanager:
 # web.external-url seems to be hardcoded, edited deployment manually
 # extraArgs:
 #   web.external-url: "https://prometheus.viktorbarzin.me"
-# prometheus-node-exporter:
-#   enabled: true
+prometheus-node-exporter:
+  enabled: true
 server:
   # Enable me to delete metrics
   extraFlags:
diff --git a/stacks/platform/modules/nvidia/main.tf b/stacks/platform/modules/nvidia/main.tf
index 4f26768e..885a9e7e 100644
--- a/stacks/platform/modules/nvidia/main.tf
+++ b/stacks/platform/modules/nvidia/main.tf
@@ -17,10 +17,13 @@ resource "kubernetes_namespace" "nvidia" {
   }
 }
 
-# Apply GPU taint to ensure only GPU workloads run on GPU node
-resource "null_resource" "gpu_node_taint" {
+# Apply GPU taint and label to ensure only GPU workloads run on GPU node
+resource "null_resource" "gpu_node_config" {
   provisioner "local-exec" {
-    command = "kubectl taint nodes k8s-node1 nvidia.com/gpu=true:NoSchedule --overwrite"
+    command = <<-EOT
+      kubectl taint nodes k8s-node1 nvidia.com/gpu=true:NoSchedule --overwrite
+      kubectl label nodes k8s-node1 gpu=true --overwrite
+    EOT
   }
 
   # Re-run if namespace changes (proxy for cluster changes)
diff --git a/stacks/platform/modules/rbac/apiserver-oidc.tf b/stacks/platform/modules/rbac/apiserver-oidc.tf
index d7fce93c..6e54d203 100644
--- a/stacks/platform/modules/rbac/apiserver-oidc.tf
+++ b/stacks/platform/modules/rbac/apiserver-oidc.tf
@@ -32,8 +32,11 @@ resource "null_resource" "apiserver_oidc_config" {
 
   provisioner "remote-exec" {
     inline = [
-      # Check if OIDC flags already present
-      "if grep -q 'oidc-issuer-url' /etc/kubernetes/manifests/kube-apiserver.yaml; then echo 'OIDC flags already configured'; exit 0; fi",
+      # Check if OIDC flags already configured with the correct values
+      "if grep -q 'oidc-issuer-url=${var.oidc_issuer_url}' /etc/kubernetes/manifests/kube-apiserver.yaml && grep -q 'oidc-client-id=${var.oidc_client_id}' /etc/kubernetes/manifests/kube-apiserver.yaml; then echo 'OIDC flags already configured with correct values'; exit 0; fi",
+
+      # Remove any existing OIDC flags (in case values changed)
+      "sudo sed -i '/--oidc-issuer-url/d; /--oidc-client-id/d; /--oidc-username-claim/d; /--oidc-groups-claim/d' /etc/kubernetes/manifests/kube-apiserver.yaml",
 
       # Backup the manifest
       "sudo cp /etc/kubernetes/manifests/kube-apiserver.yaml /etc/kubernetes/manifests/kube-apiserver.yaml.bak",
diff --git a/stacks/platform/modules/rbac/audit-policy.tf b/stacks/platform/modules/rbac/audit-policy.tf
index 2ec82796..1a718e98 100644
--- a/stacks/platform/modules/rbac/audit-policy.tf
+++ b/stacks/platform/modules/rbac/audit-policy.tf
@@ -88,7 +88,44 @@ resource "null_resource" "audit_policy" {
   }
 
   triggers = {
-    policy_version = "v1" # Bump to re-apply
+    policy_version = "v1" # Bump to force re-apply of manifest flags
+    policy_hash = sha256(yamlencode({
+      apiVersion = "audit.k8s.io/v1"
+      kind = "Policy"
+      rules = [
+        {
+          level = "None"
+          resources = [{
+            group = ""
+            resources = ["endpoints", "services", "services/status"]
+          }]
+          users = ["system:kube-proxy"]
+        },
+        {
+          level = "None"
+          verbs = ["watch"]
+        },
+        {
+          level = "None"
+          nonResourceURLs = ["/healthz*", "/readyz*", "/livez*"]
+        },
+        {
+          level = "Metadata"
+          resources = [{
+            group = ""
+            resources = ["secrets"]
+          }]
+        },
+        {
+          level = "RequestResponse"
+          verbs = ["create", "update", "patch", "delete"]
+        },
+        {
+          level = "Metadata"
+          verbs = ["get", "list"]
+        },
+      ]
+    }))
   }
 
   depends_on = [null_resource.apiserver_oidc_config]