# Node Configuration Drift Quick Wins — Design
**Date**: 2026-02-22
**Status**: Approved
**Context**: From the Talos Linux evaluation. These quick wins close 95% of the drift gap without changing the OS.
## Quick Win 1: Add GPU Label to Terraform
**File**: `stacks/platform/modules/nvidia/main.tf`
Extend the existing `null_resource.gpu_node_taint` so that it also applies the `gpu=true` label, and rename it to `gpu_node_config`. Both commands are idempotent: the taint is applied with `--overwrite`, and re-applying a label that is already set is a no-op.
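
The two commands the provisioner would run can be sketched as below. This is a minimal sketch, not the module's actual script: the node name `gpu-node-1` and the taint `nvidia.com/gpu=true:NoSchedule` are hypothetical placeholders, and the `KUBECTL` variable exists only so the commands can be dry-run with `echo`.

```shell
#!/bin/sh
# Hypothetical defaults; substitute the node name and taint used by the module.
NODE="${NODE:-gpu-node-1}"
TAINT="${TAINT:-nvidia.com/gpu=true:NoSchedule}"
KUBECTL="${KUBECTL:-kubectl}"

configure_gpu_node() {
  # --overwrite lets the command succeed when the taint already exists.
  $KUBECTL taint nodes "$NODE" "$TAINT" --overwrite
  # Re-applying an identical label changes nothing and exits 0.
  # Changing an existing label to a different value would need --overwrite.
  $KUBECTL label nodes "$NODE" gpu=true
}

# Example dry run: KUBECTL=echo configure_gpu_node
```

Running the function twice in a row should leave the node in the same state, which is the property the Terraform resource relies on.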
## Quick Win 2: Improve API Server OIDC/Audit Idempotency
**Files**: `stacks/platform/modules/rbac/apiserver-oidc.tf`, `audit-policy.tf`
The current grep-before-sed checks prevent duplicate entries but do not handle value changes. Improve the OIDC check to compare the actual issuer URL value, not just the flag name. The audit policy file is always re-uploaded (good); the manifest edit is skipped if already configured (acceptable).
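
A value-aware check could look roughly like the following. It is a sketch under stated assumptions, not the existing script: the helper name `ensure_flag` is hypothetical, it assumes GNU sed, and it assumes the kubeadm-style manifest layout in which flags are list items indented by four spaces under a `- kube-apiserver` line.

```shell
#!/bin/sh
# ensure_flag FILE FLAG VALUE  (hypothetical helper)
# Ensures FILE contains exactly one "- --FLAG=VALUE" line, comparing the value
# and not just the flag name. Prints "unchanged", "updated", or "added".
# VALUE must not contain "|" or "&", which are special in the sed expressions.
ensure_flag() {
  file=$1 flag=$2 value=$3
  if grep -qxF -- "    - --${flag}=${value}" "$file"; then
    # Flag present with the desired value: nothing to do.
    echo unchanged
  elif grep -q -- "^    - --${flag}=" "$file"; then
    # Flag present with a different value: rewrite it in place.
    sed -i "s|^    - --${flag}=.*|    - --${flag}=${value}|" "$file"
    echo updated
  else
    # Flag absent: insert it right after the kube-apiserver command line.
    sed -i "s|^    - kube-apiserver$|&\n    - --${flag}=${value}|" "$file"
    echo added
  fi
}
```

For example, `ensure_flag /etc/kubernetes/manifests/kube-apiserver.yaml oidc-issuer-url "$ISSUER_URL"` would rewrite the manifest only when the issuer URL differs, where `ISSUER_URL` stands for whatever value the Terraform module supplies.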
## Quick Win 3: Enable Node-Exporter via Prometheus Helm Chart
**File**: `stacks/platform/modules/monitoring/prometheus_chart_values.tpl`
Uncomment `prometheus-node-exporter: enabled: true`. Delete `playbooks/deploy_node_exporter.yaml`, which is unused and superseded by the DaemonSet.
## Quick Win 4: Document Node Rebuild Procedure
**File**: `.claude/CLAUDE.md`
Add a "Node Rebuild Procedure" section documenting the full sequence: VM creation from template → cloud-init → kubeadm join → verify mirrors/labels/taints.
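
The final verification step for a GPU node might be scripted along these lines. This is only a sketch: the function name `verify_gpu_node` and the taint key passed to it are hypothetical, and the mirror check is left out because the source does not say how mirrors are configured.

```shell
#!/bin/sh
KUBECTL="${KUBECTL:-kubectl}"

# verify_gpu_node NODE TAINT_KEY  (hypothetical helper)
# Returns 0 only if NODE carries the gpu=true label and a taint keyed TAINT_KEY.
verify_gpu_node() {
  node=$1 taint_key=$2
  # Value of the "gpu" label, empty if the label is not set.
  label="$($KUBECTL get node "$node" -o jsonpath='{.metadata.labels.gpu}')"
  # Space-separated list of taint keys on the node.
  taints="$($KUBECTL get node "$node" -o jsonpath='{.spec.taints[*].key}')"
  if [ "$label" != "true" ]; then
    echo "missing label gpu=true on $node"
    return 1
  fi
  case " $taints " in
    *" $taint_key "*) ;;
    *) echo "missing taint $taint_key on $node"; return 1 ;;
  esac
  echo "ok: $node"
}
```

A non-zero exit status after a rebuild would indicate that the label or taint from Quick Win 1 has not been re-applied to the new node.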