infra

Viktor Barzin ed53b34bf4 All checks were successful ci/woodpecker/push/default Pipeline was successful Details k8s-version-upgrade: dynamic worker enumeration + IP-based SSH (auto-cover all/new nodes) The chain hardcoded master→node4→node3→node2→node1→postflight and SSHed by FQDN. It silently SKIPPED node5/node6 (added 2026-05-26) — postflight would have failed even if reachable — and node5/node6 had no .viktorbarzin.lan DNS records, so the chain couldn't SSH to them at all. Refactor (upgrade-step.sh): - Worker set + order derived live from `kubectl get nodes` (worker_nodes / next_pending_worker), so EVERY worker still off-target is upgraded and a newly-joined node is covered with zero script change. - SSH targets are node InternalIPs (ssh_target), removing the dependency on node DNS records entirely — a new node is reachable the moment it joins. - The two remaining hardcoded loops (containerd skew, apt-repo rewrite) now enumerate workers/all-nodes dynamically too. - Topology preserved: master-drain Job runs on the first worker; every worker-drain Job runs on the already-upgraded k8s-master (self-preemption invariant intact). - next_pending_worker returns 0 explicitly on the no-match path — the `while read … done < <(…)` loop exits 1 at EOF, which under set -e would abort the LAST worker's Job before it spawns postflight (cluster upgraded but no cleanup / in_flight reset). Caught in review. Docs (runbook + architecture + headers) updated to the dynamic topology. NOTE: nodes still need the k8s-upgrade SSH public key in authorized_keys; it was deployed to node4/5/6 by hand this session. Baking it into node provisioning (so new nodes get it automatically) is the remaining follow-up. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>		2026-06-17 16:56:02 +00:00
..
scripts	k8s-version-upgrade: dynamic worker enumeration + IP-based SSH (auto-cover all/new nodes)	2026-06-17 16:56:02 +00:00
job-template.yaml	fix: restore tree dropped by `6d224861`; land stem95su gdrive-sync (10m) [ci skip]	2026-06-09 08:45:33 +00:00
main.tf	k8s-version-upgrade: dynamic worker enumeration + IP-based SSH (auto-cover all/new nodes)	2026-06-17 16:56:02 +00:00
terragrunt.hcl	fix: restore tree dropped by `6d224861`; land stem95su gdrive-sync (10m) [ci skip]	2026-06-09 08:45:33 +00:00