k8s-version-upgrade: dynamic worker enumeration + IP-based SSH (auto-cover all/new nodes)

The chain hardcoded master→node4→node3→node2→node1→postflight and SSHed by
FQDN. It silently SKIPPED node5/node6 (added 2026-05-26), so postflight would
have failed even if they had been reachable, and node5/node6 had no
.viktorbarzin.lan DNS records, so the chain couldn't SSH to them at all.

Refactor (upgrade-step.sh):
- Worker set + order derived live from `kubectl get nodes` (worker_nodes /
  next_pending_worker), so EVERY worker still off-target is upgraded and a
  newly-joined node is covered with zero script change.
- SSH targets are node InternalIPs (ssh_target), removing the dependency on
  node DNS records entirely: a new node is reachable the moment it joins.
- The two remaining hardcoded loops (containerd skew, apt-repo rewrite) now
  enumerate workers/all-nodes dynamically too.
- Topology preserved: master-drain Job runs on the first worker; every
  worker-drain Job runs on the already-upgraded k8s-master (self-preemption
  invariant intact).
- next_pending_worker returns 0 explicitly on the no-match path: the
  `while read … done < <(…)` loop can end with status 1 at EOF (from the last
  test in its body), which under set -e would abort the LAST worker's Job
  before it spawns postflight (cluster upgraded but no cleanup / in_flight
  reset). Caught in review.

Docs (runbook + architecture + headers) updated to the dynamic topology.

NOTE: nodes still need the k8s-upgrade SSH public key in authorized_keys; it
was deployed to node4/5/6 by hand this session. Baking it into node
provisioning (so new nodes get it automatically) is the remaining follow-up.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
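
The explicit `return 0` can be reproduced in isolation. The sketch below is not taken from upgrade-step.sh: `fake_nodes`, `pending_unsafe` and `pending_safe` are hypothetical names, and the node list is made up. It assumes bash and shows that a `while read` loop whose last executed body command is a failed test leaves the function with status 1, which `set -e` would treat as fatal in a plain `next_w="$(…)"` assignment.

```bash
#!/usr/bin/env bash
set -euo pipefail

TARGET=1.34.9
# Hypothetical stand-in for `kubectl get nodes`: "name version" pairs,
# every node already on the target version (the no-match path).
fake_nodes() { printf '%s\n' 'node-a 1.34.9' 'node-b 1.34.9'; }

# No trailing `return 0`: the function's status is the loop's status, i.e.
# the status of the last command that ran in the loop body.
pending_unsafe() {
  local n v
  while read -r n v; do
    [ "$v" != "$TARGET" ] && { echo "$n"; return 0; }
  done < <(fake_nodes)
}

# Same loop, but "nothing pending" is reported as success with empty output.
pending_safe() {
  local n v
  while read -r n v; do
    [ "$v" != "$TARGET" ] && { echo "$n"; return 0; }
  done < <(fake_nodes)
  return 0
}

# Probe the unsafe variant inside `if`, where errexit is suppressed.
if pending_unsafe >/dev/null; then unsafe_rc=0; else unsafe_rc=$?; fi

# A bare assignment like this would abort the script with pending_unsafe.
next="$(pending_safe)"
echo "unsafe_rc=$unsafe_rc safe_next='$next'"
```

Calling `next="$(pending_unsafe)"` outside an `if` or `||` context is the failing shape: the assignment inherits the non-zero status and the script stops before it can pick the postflight branch.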
2026-06-17 16:56:02 +00:00 · ed53b34bf4
commit ed53b34bf4
parent 0c5a9b5f44
4 changed files with 112 additions and 79 deletions
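
As a rough illustration of the dynamic chain this commit describes, the sketch below walks a simulated cluster and prints the Jobs that would be spawned. It is a simplified model, not the real script: the `VERSIONS` map is a made-up replacement for `kubectl get nodes`, `plan_chain` is a hypothetical helper, and a successful upgrade is simulated by overwriting the map entry. It assumes bash 4+ for associative arrays.

```bash
#!/usr/bin/env bash
set -eu

TARGET_VERSION=1.34.9
# Hypothetical cluster state (worker -> kubelet version); replaces kubectl.
declare -A VERSIONS=(
  [k8s-node1]=1.34.8 [k8s-node2]=1.34.8 [k8s-node5]=1.34.8 [k8s-node6]=1.34.9
)

worker_nodes() { printf '%s\n' "${!VERSIONS[@]}" | sort; }

# First worker not on target, skipping $1; empty output means all done.
next_pending_worker() {
  local n exclude="${1:-}"
  while read -r n; do
    [ "$n" = "$exclude" ] && continue
    [ "${VERSIONS[$n]}" != "$TARGET_VERSION" ] && { echo "$n"; return 0; }
  done < <(worker_nodes)
  return 0
}

# Print one line per Job: phase, the node it is pinned to, and its drain target.
plan_chain() {
  local runner next
  runner="$(worker_nodes | head -1)"
  echo "preflight on=$runner"
  echo "master on=$runner drains=k8s-master"
  next="$(next_pending_worker)"
  while [ -n "$next" ]; do
    echo "worker on=k8s-master drains=$next"
    VERSIONS[$next]="$TARGET_VERSION"   # simulate a successful upgrade
    next="$(next_pending_worker "$next")"
  done
  echo "postflight"
}

PLAN="$(plan_chain)"
echo "$PLAN"
```

With this sample state the plan contains three worker Jobs (node1, node2, node5), all pinned to k8s-master, and skips node6 because it already reports the target version. Adding a key to `VERSIONS` lengthens the plan without touching the functions, which is the property the refactor is after.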
--- a/docs/architecture/automated-upgrades.md
+++ b/docs/architecture/automated-upgrades.md
@@ -4,7 +4,7 @@ This doc covers three independent automation paths:

 1. **Service-level upgrades** — Container image bumps for OSS apps (DIUN → n8n → claude-agent → Terraform). Most of this doc.
 2. **OS-level upgrades on K8s nodes** — `unattended-upgrades` + `kured` with sentinel-gate + Prometheus halt-on-alert. See "K8s Node OS Upgrades" section and the runbook at `docs/runbooks/k8s-node-auto-upgrades.md`.
-3. **K8s component version upgrades** (kubeadm/kubelet/kubectl) — weekly detection CronJob → chain of phase Jobs (preflight → master → worker × 4 → postflight). See "K8s Version Upgrades" section and the runbook at `docs/runbooks/k8s-version-upgrade.md`.
+3. **K8s component version upgrades** (kubeadm/kubelet/kubectl) — daily detection CronJob → chain of phase Jobs (preflight → master → one worker Job per worker, enumerated live → postflight). See "K8s Version Upgrades" section and the runbook at `docs/runbooks/k8s-version-upgrade.md`.

 ## Overview

@@ -262,13 +262,14 @@ envsubst on /template/job-template.yaml | kubectl apply -f -
  │ spawns Job 0 = k8s-upgrade-preflight-<target_version>
  ▼

-Job 0 — preflight       (pinned: k8s-node1)
-Job 1 — master upgrade  (pinned: k8s-node1)        drains k8s-master
-Job 2 — worker          (pinned: k8s-node1)        drains k8s-node4
-Job 3 — worker          (pinned: k8s-node1)        drains k8s-node3
-Job 4 — worker          (pinned: k8s-node1)        drains k8s-node2
-Job 5 — worker          (pinned: k8s-master)       drains k8s-node1  ← control-plane toleration
-Job 6 — postflight      (no pinning)
+Job 0 — preflight       (pinned: first worker)
+Job 1 — master upgrade  (pinned: first worker)     drains k8s-master
+Job 2..N — worker       (pinned: k8s-master)       drains each worker still off-target
+                                                   ← control-plane toleration; one Job
+                                                     per worker, enumerated live from
+                                                     `kubectl get nodes` (covers node5/6
+                                                     + any future node automatically)
+Job N+1 — postflight    (no pinning)
 ```

 Each Job runs `scripts/upgrade-step.sh`, which dispatches on `$PHASE` and ends
@@ -344,7 +345,7 @@ The cluster has a single control plane (no HA). A failed `kubeadm upgrade apply`

 - **Mandatory etcd snapshot before every run** (even patch). Recovery point if master breaks.
 - **Halt-on-alert before every drain**. Reuses the same Prometheus ignore-list regex kured uses — any unrelated cluster-health alert blocks. Three gate alerts catch upgrade-specific half-states (version skew, missing snapshot, stalled chain).
- **Job pinning eliminates self-preemption**. Each Job's pod runs on a node that is NOT its drain target. k8s-node1 hosts every Job except the one that drains it (which runs on k8s-master with a control-plane toleration).
+- **Job pinning eliminates self-preemption**. Each Job's pod runs on a node that is NOT its drain target: the master-drain Job runs on the first worker; every worker-drain Job runs on k8s-master (already upgraded, control-plane toleration). The worker set is enumerated live from `kubectl get nodes`, so new nodes are covered with no script change; SSH targets are node InternalIPs (no DNS dependency).
 - **Sequential workers with 10-min inter-node soak**. Same risk-bounding as the 24h OS-reboot soak, but tightened because kubelet failures surface within minutes — not hours.
 - **Master upgrade goes first, workers last**. If master breaks, the cluster is already degraded so further worker upgrades would just delay recovery. By upgrading master first, we either succeed (workers can roll afterward) or fail loud (operator triages before any worker is touched).
 - **No auto-rollback**. kubeadm doesn't support clean downgrade; the snapshot + manual apt rollback in the runbook is the recovery path.
--- a/docs/runbooks/k8s-version-upgrade.md
+++ b/docs/runbooks/k8s-version-upgrade.md
@@ -10,7 +10,7 @@ drain target** — so no pod in the chain can preempt itself.
 The chain (Sun 12:00 UTC weekly):

 ```
-detection CronJob → preflight Job → master Job → worker × 4 Jobs → postflight Job
+detection CronJob → preflight Job → master Job → one worker Job per worker (enumerated live) → postflight Job
 ```

 This is **independent** of the OS-side `unattended-upgrades + kured`
@@ -76,14 +76,16 @@ Job 6 — postflight       (no pinning)
  └── Slack: ✅ K8s upgrade complete
 ```

-**Pin choices summarised:**
- k8s-node1 hosts every Job that drains master or another worker. k8s-node1
-  itself is upgraded **last**.
- k8s-master hosts Job 5 (which drains k8s-node1). Job 5's spec includes a
-  toleration for `node-role.kubernetes.io/control-plane:NoSchedule`.
- If anyone reorders the worker sequence, the pin for Job 5 needs to track
-  whatever worker is upgraded last. The mapping is in `scripts/upgrade-step.sh`
-  → the `case "${PHASE}:${TARGET_NODE:-}"` block.
+**Pin choices summarised** (dynamic since 2026-06-17 — no hardcoded node list):
+- The **master-drain Job** runs on the first worker (the "runner"); since it
+  drains the control-plane node, it must not run there.
+- **Every worker-drain Job** runs on **k8s-master** (already upgraded by then),
+  with a `node-role.kubernetes.io/control-plane:NoSchedule` toleration — so a
+  Job never runs on the node it drains (self-preemption invariant).
+- The worker set + order come from `kubectl get nodes` at runtime
+  (`worker_nodes` / `next_pending_worker` in `scripts/upgrade-step.sh`), so
+  **adding a node needs no change** — the chain upgrades every worker still
+  off-target, then runs postflight. SSH uses node InternalIPs (no DNS needed).

 ## Components

--- a/stacks/k8s-version-upgrade/main.tf
+++ b/stacks/k8s-version-upgrade/main.tf
@@ -4,11 +4,12 @@
 # Job's pod runs on a node that is NOT its drain target — eliminates the
 # self-preemption bug that killed the agent-based v1 (2026-05-11 incident).
 #
-# Chain (Job 0 → Job 6):
-#   preflight  (pinned: k8s-node1)
-#   master     (pinned: k8s-node1; drains k8s-master)
-#   worker     (pinned: k8s-node1; drains k8s-node4 → 3 → 2)
-#   worker     (pinned: k8s-master + control-plane toleration; drains k8s-node1 last)
+# Chain (dynamic length — one worker Job per worker node, enumerated live):
+#   preflight  (pinned: first worker)
+#   master     (pinned: first worker; drains k8s-master)
+#   worker     (pinned: k8s-master + control-plane toleration; drains each worker)
+#              ↳ repeats for EVERY worker still off-target (kubectl-derived, so
+#                new nodes like node5/node6 are included automatically)
 #   postflight (no pinning)
 #
 # Each phase Job's container runs scripts/upgrade-step.sh which:
--- a/stacks/k8s-version-upgrade/scripts/upgrade-step.sh
+++ b/stacks/k8s-version-upgrade/scripts/upgrade-step.sh
@@ -1,25 +1,23 @@
 #!/usr/bin/env bash
 #
 # Universal upgrade-step body. Each Job in the k8s-version-upgrade chain runs
-# this once, dispatching on $PHASE. On success it computes the next phase and
-# spawns the next Job. The chain is:
+# this once, dispatching on $PHASE. On success it computes the next phase from
+# LIVE cluster state and spawns the next Job. The chain is:
 #
-#   preflight  (run on k8s-node1)
+#   preflight  (run on the first worker)
 #     ↓
-#   master     (drains k8s-master; run on k8s-node1)
-#     ↓
-#   worker k8s-node4   (run on k8s-node1)
-#     ↓
-#   worker k8s-node3   (run on k8s-node1)
-#     ↓
-#   worker k8s-node2   (run on k8s-node1)
-#     ↓
-#   worker k8s-node1   (drains k8s-node1; run on k8s-master with control-plane toleration)
+#   master     (drains k8s-master; run on the first worker = the "runner")
 #     ↓
+#   worker <W> (drains <W>; run on k8s-master w/ control-plane toleration)   ── repeats
+#     ↓         for EVERY worker still off-target, enumerated from kubectl
 #   postflight (no node pinning)
 #
-# k8s-node1 hosts every Job except the one that drains k8s-node1 itself.
-# k8s-node1 is therefore upgraded LAST.
+# The worker list is derived dynamically (worker_nodes / next_pending_worker),
+# so newly-added nodes are upgraded with no script change — the old hardcoded
+# master→node4→3→2→1 chain silently skipped node5/node6 (added 2026-05-26).
+# Self-preemption invariant: a Job never runs on the node it drains — the
+# master-drain Job runs on a worker; each worker-drain Job runs on the
+# already-upgraded master. SSH targets are node InternalIPs (no DNS dependency).
 #
 # Required env vars (set on the Job pod by job-template.yaml):
 #   PHASE              preflight | master | worker | postflight
@@ -39,13 +37,11 @@ KUBECTL=kubectl
 JOB_TEMPLATE=/template/job-template.yaml
 UPDATE_K8S_SH=/scripts/update_k8s.sh

-# Pod-side DNS: the cluster's CoreDNS has search domains
-# `<ns>.svc.cluster.local svc.cluster.local cluster.local` (plus ndots=2 via
-# Kyverno mutation). Unqualified `k8s-master` falls through all of these and
-# then queries the upstream DNS (Technitium) for bare `k8s-master`, which
-# returns NXDOMAIN. The FQDN `k8s-master.viktorbarzin.lan` is what Technitium
-# actually serves. Suffix every node SSH target with this domain.
-NODE_DOMAIN=".viktorbarzin.lan"
+# SSH targets are node InternalIPs, resolved live from `kubectl get nodes` (see
+# ssh_target() below) — the pipeline has NO dependency on node DNS records
+# (`k8s-node<N>.viktorbarzin.lan`). This is what lets a freshly-joined node be
+# upgraded with zero DNS/Kea provisioning. (Was FQDN-based until 2026-06-17:
+# node5/node6 had no .viktorbarzin.lan records, which blocked the 1.34.9 chain.)

 # SSH key must be 0400 — refresh from secret mount (defaultMode does this but
 # bind-mount semantics can preserve loose perms; chmod is idempotent).
@@ -183,36 +179,66 @@ drain_node() {
 }

 # ---------------------------------------------------------------------------
-# Chain definition — what comes after the current phase
+# Cluster topology — derived live so new nodes are covered automatically
+# ---------------------------------------------------------------------------
+# The old chain hardcoded master→node4→3→2→1 and silently skipped node5/node6
+# (added 2026-05-26). Everything below enumerates nodes from `kubectl get nodes`
+# instead, and SSHes by InternalIP (no .viktorbarzin.lan DNS record needed), so
+# a freshly-joined worker is upgraded with ZERO pipeline changes.
+#
+# Self-preemption invariant preserved: a phase Job never runs on the node it
+# drains. The master-drain Job runs on a worker (the "runner"); every
+# worker-drain Job runs on the (already-upgraded) master.
+worker_nodes() { $KUBECTL get nodes -l '!node-role.kubernetes.io/control-plane' \
+  -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}{end}' | sort; }
+all_nodes() { { echo k8s-master; worker_nodes; } | sed '/^$/d'; }
+# SSH target = the node's InternalIP (avoids any DNS dependency).
+ssh_target() { echo "wizard@$($KUBECTL get node "$1" \
+  -o jsonpath='{.status.addresses[?(@.type=="InternalIP")].address}')"; }
+# First worker not yet on $TARGET_VERSION, excluding $1 (the node the current
+# phase is upgrading — still pending when this runs). Empty → all workers done.
+next_pending_worker() {
+  local n v exclude="${1:-}"
+  while read -r n; do
+    [ -z "$n" ] && continue
+    [ "$n" = "$exclude" ] && continue
+    v=$($KUBECTL get node "$n" -o jsonpath='{.status.nodeInfo.kubeletVersion}' 2>/dev/null | tr -d v)
+    [ "$v" != "$TARGET_VERSION" ] && { echo "$n"; return 0; }
+  done < <(worker_nodes)
+  # No worker left off-target. Explicit success — the loop's final `read` exits
+  # 1 at EOF, and `next_w="$(next_pending_worker …)"` under `set -e` would abort
+  # the chain BEFORE the postflight branch (cluster upgraded but no cleanup /
+  # in_flight reset / success). Verified blocker — keep this return 0.
+  return 0
+}
+
+# ---------------------------------------------------------------------------
+# Chain definition — what comes after the current phase (dynamic)
 # ---------------------------------------------------------------------------

 NEXT_PHASE=""
 NEXT_TARGET_NODE=""
 NEXT_RUN_ON=""

-case "${PHASE}:${TARGET_NODE:-}" in
-  preflight:)
+case "$PHASE" in
+  preflight)
+    # master upgrade drains the control-plane node → its Job runs on a worker.
    NEXT_PHASE=master
-    NEXT_RUN_ON=k8s-node1 ;;
-  master:)
-    NEXT_PHASE=worker; NEXT_TARGET_NODE=k8s-node4
-    NEXT_RUN_ON=k8s-node1 ;;
-  worker:k8s-node4)
-    NEXT_PHASE=worker; NEXT_TARGET_NODE=k8s-node3
-    NEXT_RUN_ON=k8s-node1 ;;
-  worker:k8s-node3)
-    NEXT_PHASE=worker; NEXT_TARGET_NODE=k8s-node2
-    NEXT_RUN_ON=k8s-node1 ;;
-  worker:k8s-node2)
-    NEXT_PHASE=worker; NEXT_TARGET_NODE=k8s-node1
-    NEXT_RUN_ON=k8s-master ;;  # control-plane toleration required
-  worker:k8s-node1)
-    NEXT_PHASE=postflight
-    NEXT_RUN_ON="" ;;          # no node pinning for postflight
-  postflight:)
-    NEXT_PHASE="" ;;           # end of chain
+    NEXT_RUN_ON="$(worker_nodes | head -1)" ;;
+  master | worker)
+    # Next worker still off-target (excluding the one this phase handled). Its
+    # Job runs on the already-upgraded master (k8s-master, control-plane
+    # toleration). Dynamic → every worker incl. node5/6 + any future node.
+    next_w="$(next_pending_worker "${TARGET_NODE:-}")"
+    if [ -n "$next_w" ]; then
+      NEXT_PHASE=worker; NEXT_TARGET_NODE="$next_w"; NEXT_RUN_ON=k8s-master
+    else
+      NEXT_PHASE=postflight; NEXT_RUN_ON=""   # all workers on target
+    fi ;;
+  postflight)
+    NEXT_PHASE="" ;;                          # end of chain
  *)
-    echo "ERROR: unknown phase/target combo: ${PHASE}/${TARGET_NODE:-}" >&2
+    echo "ERROR: unknown phase: $PHASE" >&2
    exit 2 ;;
 esac

@@ -331,7 +357,7 @@ phase_preflight() {
  # aborts this whole check. Ignore the two CoreDNS checks here too so plan
  # still emits its "kubeadm upgrade apply vX.Y.Z" line. (See update_k8s.sh.)
  local plan_target
-  plan_target=$(ssh "${SSH_OPTS[@]}" "wizard@k8s-master$NODE_DOMAIN" 'sudo kubeadm upgrade plan --ignore-preflight-errors=CoreDNSMigration,CoreDNSUnsupportedPlugins' \
+  plan_target=$(ssh "${SSH_OPTS[@]}" "$(ssh_target k8s-master)" 'sudo kubeadm upgrade plan --ignore-preflight-errors=CoreDNSMigration,CoreDNSUnsupportedPlugins' \
    | grep -oE 'kubeadm upgrade apply v[0-9]+\.[0-9]+\.[0-9]+' \
    | grep -oE 'v[0-9]+\.[0-9]+\.[0-9]+' | head -1 | tr -d v)
  if [ "$plan_target" != "$TARGET_VERSION" ]; then
@@ -372,16 +398,18 @@ phase_preflight() {

  # 7. Containerd skew fix on master (if master < workers)
  local master_ctr worker_max=0.0.0
-  master_ctr=$(ssh "${SSH_OPTS[@]}" "wizard@k8s-master$NODE_DOMAIN" "containerd --version | awk '{print \$3}' | tr -d v")
-  for n in k8s-node1 k8s-node2 k8s-node3 k8s-node4; do
+  master_ctr=$(ssh "${SSH_OPTS[@]}" "$(ssh_target k8s-master)" "containerd --version | awk '{print \$3}' | tr -d v")
+  local n
+  while read -r n; do
+    [ -z "$n" ] && continue
    local v
-    v=$(ssh "${SSH_OPTS[@]}" "wizard@$n$NODE_DOMAIN" "containerd --version | awk '{print \$3}' | tr -d v")
+    v=$(ssh "${SSH_OPTS[@]}" "$(ssh_target "$n")" "containerd --version | awk '{print \$3}' | tr -d v")
    [ "$(printf '%s\n%s' "$v" "$worker_max" | sort -V | tail -1)" = "$v" ] && worker_max="$v"
-  done
+  done < <(worker_nodes)
  if [ "$(printf '%s\n%s' "$master_ctr" "$worker_max" | sort -V | head -1)" = "$master_ctr" ] \
     && [ "$master_ctr" != "$worker_max" ]; then
    slack "Master containerd $master_ctr < workers $worker_max — bumping"
-    ssh "${SSH_OPTS[@]}" "wizard@k8s-master$NODE_DOMAIN" \
+    ssh "${SSH_OPTS[@]}" "$(ssh_target k8s-master)" \
      "sudo apt-mark unhold containerd.io && sudo apt-get install -y containerd.io='$worker_max-1' \
       && sudo apt-mark hold containerd.io && sudo systemctl restart containerd"
    wait_for_node_ready k8s-master "$($KUBECTL get node k8s-master -o jsonpath='{.status.nodeInfo.kubeletVersion}' | tr -d v)" \
@@ -392,14 +420,15 @@ phase_preflight() {
  # 8. Apt repo URL rewrite (minor only)
  if [ "$KIND" = "minor" ]; then
    local target_minor="${TARGET_VERSION%.*}"
-    for n in k8s-master k8s-node1 k8s-node2 k8s-node3 k8s-node4; do
-      ssh "${SSH_OPTS[@]}" "wizard@$n$NODE_DOMAIN" \
+    while read -r n; do
+      [ -z "$n" ] && continue
+      ssh "${SSH_OPTS[@]}" "$(ssh_target "$n")" \
        "echo 'deb [signed-by=/etc/apt/keyrings/kubernetes-apt-keyring.gpg] https://pkgs.k8s.io/core:/stable:/v$target_minor/deb/ /' | sudo tee /etc/apt/sources.list.d/kubernetes.list \
         && curl -fsSL 'https://pkgs.k8s.io/core:/stable:/v$target_minor/deb/Release.key' \
              | sudo gpg --dearmor -o /etc/apt/keyrings/kubernetes-apt-keyring.gpg --batch --yes \
         && sudo apt-get update"
-    done
-    slack "Apt repo rewritten to v$target_minor/deb on all 5 nodes"
+    done < <(all_nodes)
+    slack "Apt repo rewritten to v$target_minor/deb on all nodes"
  fi

  slack "Preflight clean. Snapshot at nfs://...$snap_file ($size bytes). Dispatching master Job."
@@ -449,7 +478,7 @@ phase_master() {
  drain_node k8s-master

  slack "Running update_k8s.sh on k8s-master (--role master --release $TARGET_VERSION)"
-  ssh "${SSH_OPTS[@]}" "wizard@k8s-master$NODE_DOMAIN" 'bash -s' \
+  ssh "${SSH_OPTS[@]}" "$(ssh_target k8s-master)" 'bash -s' \
    < "$UPDATE_K8S_SH" -- --role master --release "$TARGET_VERSION"

  $KUBECTL uncordon k8s-master
@@ -507,7 +536,7 @@ phase_worker() {
  drain_node "$TARGET_NODE"

  slack "Running update_k8s.sh on $TARGET_NODE (--role worker --release $TARGET_VERSION)"
-  ssh "${SSH_OPTS[@]}" "wizard@$TARGET_NODE$NODE_DOMAIN" 'bash -s' \
+  ssh "${SSH_OPTS[@]}" "$(ssh_target "$TARGET_NODE")" 'bash -s' \
    < "$UPDATE_K8S_SH" -- --role worker --release "$TARGET_VERSION"

  $KUBECTL uncordon "$TARGET_NODE"
@@ -539,7 +568,7 @@ phase_worker() {
 phase_postflight() {
  slack "Running postflight"

-  # All 5 nodes at target
+  # All nodes at target
  local versions wrong
  versions=$($KUBECTL get nodes -o jsonpath='{range .items[*]}{.metadata.name}:{.status.nodeInfo.kubeletVersion}{"\n"}{end}')
  # `grep -v` returns 1 when all nodes are on target (the happy path —