infra/docs/runbooks
Viktor Barzin ed53b34bf4
k8s-version-upgrade: dynamic worker enumeration + IP-based SSH (auto-cover all/new nodes)
The chain hardcoded master→node4→node3→node2→node1→postflight and SSHed by
FQDN. It silently SKIPPED node5/node6 (added 2026-05-26), so postflight would
have failed even if they had been reachable. They also had no
.viktorbarzin.lan DNS records, so the chain couldn't SSH to them at all.

Refactor (upgrade-step.sh):
  - Worker set + order derived live from `kubectl get nodes` (worker_nodes /
    next_pending_worker), so EVERY worker still off-target is upgraded and a
    newly-joined node is covered with zero script change.
  - SSH targets are node InternalIPs (ssh_target), removing the dependency on
    node DNS records entirely — a new node is reachable the moment it joins.
  - The two remaining hardcoded loops (containerd skew, apt-repo rewrite) now
    enumerate workers/all-nodes dynamically too.
  - Topology preserved: master-drain Job runs on the first worker; every
    worker-drain Job runs on the already-upgraded k8s-master (self-preemption
    invariant intact).
  - next_pending_worker returns 0 explicitly on the no-match path — the
    `while read … done < <(…)` loop exits 1 at EOF, which under set -e would
    abort the LAST worker's Job before it spawns postflight (cluster upgraded
    but no cleanup / in_flight reset). Caught in review.
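
A minimal sketch of the enumeration and the explicit-return fix described in
the list above. upgrade-step.sh itself is not shown here, so the function
bodies, the node_table helper, and every version and IP below are assumptions
for illustration; only the names next_pending_worker and ssh_target come from
the commit. It shows one way the failure can arise: a while loop's status is
that of the last command run in its body, so a body ending in a failed test
leaves the function returning 1 when no worker matches.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Stand-in for the live node query so the sketch runs without a cluster.
# Rows are "name kubelet-version internal-ip"; all values are made up.
# On a real cluster the same fields appear in `kubectl get nodes -o wide`
# (VERSION and INTERNAL-IP columns).
NODE_TABLE='k8s-master v1.34.2 192.0.2.10
node1 v1.34.2 192.0.2.11
node2 v1.33.5 192.0.2.12
node3 v1.33.5 192.0.2.13'

node_table() { printf '%s\n' "$NODE_TABLE"; }

# Buggy shape: when no worker is off-target, the last command run in the
# loop body is a failed [[ ]] test, so the loop and the function return 1.
next_pending_worker_buggy() {
  local target=$1 name ver ip
  while read -r name ver ip; do
    [[ $name == k8s-master ]] && continue
    [[ $ver != "$target" ]] && { printf '%s\n' "$name"; return 0; }
  done < <(node_table)
}

# Fixed shape: same loop, but the no-match path returns 0 explicitly, so
# "nothing left to upgrade" is empty output with a success status.
next_pending_worker() {
  local target=$1 name ver ip
  while read -r name ver ip; do
    [[ $name == k8s-master ]] && continue
    [[ $ver != "$target" ]] && { printf '%s\n' "$name"; return 0; }
  done < <(node_table)
  return 0
}

# Resolve a node name to its InternalIP so SSH never depends on DNS.
ssh_target() {
  local want=$1 name ver ip
  while read -r name ver ip; do
    if [[ $name == "$want" ]]; then
      printf '%s\n' "$ip"
      return 0
    fi
  done < <(node_table)
  return 1  # unknown node
}

pending=$(next_pending_worker v1.34.2)
echo "next pending worker: $pending at $(ssh_target "$pending")"

# Simulate a fully upgraded cluster and compare the two exit statuses.
NODE_TABLE='k8s-master v1.34.2 192.0.2.10
node1 v1.34.2 192.0.2.11'
rc=0
next_pending_worker_buggy v1.34.2 || rc=$?
echo "buggy variant, nothing pending: exit $rc"
next_pending_worker v1.34.2
echo "fixed variant, nothing pending: exit $?"
```

Under set -e, a bare `next=$(next_pending_worker_buggy "$target")` on the last
worker would abort the script at that assignment, which matches the symptom
described above: the cluster is upgraded but postflight never spawns.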

Docs (runbook + architecture + headers) updated to the dynamic topology.

NOTE: nodes still need the k8s-upgrade SSH public key in authorized_keys; it was
deployed to node4/5/6 by hand this session. Baking it into node provisioning
(so new nodes get it automatically) is the remaining follow-up.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-17 16:56:02 +00:00
apiserver-audit-logging.md
beads-auto-dispatch.md
breakglass-ssh.md
breakglass-ui.md
chrome-service-snapshot.md
fan-control.md
forgejo-registry-breakglass.md
forgejo-registry-rebuild-image.md
forgejo-registry-setup.md
grow-pve-nfs-lv.md
immich-transcode-bitrate.md
job-hunter.md
k8s-node-auto-upgrades.md
k8s-version-upgrade.md
kms-public-exposure.md
mailserver-pfsense-haproxy.md
mailserver-proxy-protocol.md
nextcloud-add-archive.md
nfs-prerequisites.md
offboard-user.md
pfsense-unbound.md
proxmox-host.md
r730-ram-upgrade-272gb.md
registry-rebuild-image.md
registry-vm.md
restore-etcd.md
restore-full-cluster.md
restore-lvm-snapshot.md
restore-mysql.md
restore-postgresql.md
restore-pvc-from-backup.md
restore-vault.md
restore-vaultwarden.md
scale-k8s-cluster.md
security-incident.md
synology-storage.md
t3-drop-attribution.md
t3-version-bump.md
technitium-apply.md
tripit-external-signup.md
vault-raft-leader-deadlock.md
vault-token-renew-devvm.md
woodpecker-onboard-forgejo-repo.md