k8s-version-upgrade: preflight skips kubeadm-plan gate when master already at target

The autonomous 1.34.9 version-upgrade chain has been failing its preflight every
night. A prior run left k8s-master and k8s-node1 on 1.34.9 while node2-6 stayed
on 1.34.8. Preflight's gate-4 runs `kubeadm upgrade plan` on master, and on an
already-at-target master kubeadm prints no "kubeadm upgrade apply vX.Y.Z" line,
so the parsed target came back empty and the `!= requested` check aborted the
whole chain before any worker was touched. The failure was deterministic: each
run cleaned up after itself and then failed identically the next night, so it
would have failed again tonight, leaving node2-6 stuck on the old patch.
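
The empty parse can be reproduced offline with a minimal sketch. It reuses the
gate's grep/head/tr pipeline, but `parse_plan_target` is a hypothetical wrapper
and both sample strings are illustrative stand-ins, not verbatim kubeadm output:

```shell
#!/usr/bin/env bash
# Hypothetical helper (not in the repo): the same grep/head/tr pipeline the
# preflight gate uses, reading `kubeadm upgrade plan` output from stdin.
parse_plan_target() {
  grep -oE 'kubeadm upgrade apply v[0-9]+\.[0-9]+\.[0-9]+' \
    | grep -oE 'v[0-9]+\.[0-9]+\.[0-9]+' | head -1 | tr -d v
}

# ASSUMPTION: illustrative text only; real kubeadm output is longer and worded
# differently. What matters is whether the "kubeadm upgrade apply vX.Y.Z" line
# is present.
plan_with_upgrade='You can now apply the upgrade by executing the following command:
    kubeadm upgrade apply v1.34.9'
plan_at_target='(illustrative) no upgrade available for this control plane'

target_pending=$(printf '%s\n' "$plan_with_upgrade" | parse_plan_target)
target_at_target=$(printf '%s\n' "$plan_at_target" | parse_plan_target)

echo "upgrade pending:   '${target_pending}'"
echo "already at target: '${target_at_target}'"
```

With the apply line present the parse yields 1.34.9; without it the result is
the empty string, which can never equal a requested version.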

Fix: skip the kubeadm-plan-target gate when master is already on TARGET_VERSION
— the same at-target self-skip that phase_master and phase_worker already do.
The remaining workers are still validated by their own per-node phases, and the
detector already confirmed the target is installable via apt-cache. This lets
tonight's unattended chain resume and finish node2-6 -> 1.34.9.
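
The skip decision can be sketched as a small predicate. This is an
illustration under assumptions, not the code in the script:
`plan_gate_should_skip` and `describe_gate` are hypothetical names, the explicit
non-empty check is an addition for clarity, and the leading "v" is stripped with
a parameter expansion rather than `tr -d v`:

```shell
#!/usr/bin/env bash
# Hypothetical predicate (not in the repo): succeed (skip the gate) only when
# the master's kubelet version is known and equals the requested target.
plan_gate_should_skip() {
  local master_version="${1#v}"   # strip one leading "v", e.g. v1.34.9 -> 1.34.9
  local target_version="$2"
  [ -n "$master_version" ] && [ "$master_version" = "$target_version" ]
}

# Hypothetical wrapper that prints the decision for a given pair of versions.
describe_gate() {
  if plan_gate_should_skip "$1" "$2"; then
    echo "skip"
  else
    echo "run"
  fi
}

describe_gate "v1.34.9" "1.34.9"   # master already at target
describe_gate "v1.34.8" "1.34.9"   # master still behind
describe_gate ""        "1.34.9"   # version lookup failed or returned nothing
```

An empty lookup result falls through to "run", so a failed version query still
leaves the kubeadm-plan-target gate in force rather than silently skipping it.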

Runbook updated: node count 5 -> 7, the gate skip note, and a Past Incidents
writeup (incl. the collateral apiserver OIDC wipe, restored via the rbac stack).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Viktor Barzin 2026-06-18 09:17:46 +00:00
parent 8787d361dc
commit 70e217db24
2 changed files with 36 additions and 12 deletions

@@ -356,13 +356,30 @@ phase_preflight() {
     # on a Keel-drifted CoreDNS (start version unsupported) and, under pipefail,
     # aborts this whole check. Ignore the two CoreDNS checks here too so plan
     # still emits its "kubeadm upgrade apply vX.Y.Z" line. (See update_k8s.sh.)
-    local plan_target
-    plan_target=$(ssh "${SSH_OPTS[@]}" "$(ssh_target k8s-master)" 'sudo kubeadm upgrade plan --ignore-preflight-errors=CoreDNSMigration,CoreDNSUnsupportedPlugins' \
-        | grep -oE 'kubeadm upgrade apply v[0-9]+\.[0-9]+\.[0-9]+' \
-        | grep -oE 'v[0-9]+\.[0-9]+\.[0-9]+' | head -1 | tr -d v)
-    if [ "$plan_target" != "$TARGET_VERSION" ]; then
-        slack "ABORT preflight — kubeadm plan target $plan_target ≠ requested $TARGET_VERSION"
-        exit 1
+    #
+    # SKIP this gate when k8s-master is ALREADY on TARGET_VERSION — a partial-chain
+    # resume (master + earlier workers done, later workers still pending). `kubeadm
+    # upgrade plan` run on an at-target master prints NO "kubeadm upgrade apply
+    # vX.Y.Z" line, so the parse below yields an EMPTY plan_target and the `!=`
+    # check aborts every run — even though the chain just needs to finish the
+    # remaining workers (phase_master self-skips an at-target master the same way,
+    # below). Confirmed root cause of the 1.34.9 preflight aborts (2026-06-18):
+    # master was already on 1.34.9 while node2-6 lagged on 1.34.8, so every nightly
+    # preflight died here with an empty `plan target ≠ requested 1.34.9`.
+    local master_kubelet_v
+    master_kubelet_v=$($KUBECTL get node k8s-master -o jsonpath='{.status.nodeInfo.kubeletVersion}' 2>/dev/null | tr -d v)
+    if [ "$master_kubelet_v" = "$TARGET_VERSION" ]; then
+        slack "preflight — k8s-master already on v$TARGET_VERSION; skipping kubeadm-plan-target gate (workers still pending)"
+        echo "k8s-master already on v$TARGET_VERSION — skipping kubeadm-plan-target gate"
+    else
+        local plan_target
+        plan_target=$(ssh "${SSH_OPTS[@]}" "$(ssh_target k8s-master)" 'sudo kubeadm upgrade plan --ignore-preflight-errors=CoreDNSMigration,CoreDNSUnsupportedPlugins' \
+            | grep -oE 'kubeadm upgrade apply v[0-9]+\.[0-9]+\.[0-9]+' \
+            | grep -oE 'v[0-9]+\.[0-9]+\.[0-9]+' | head -1 | tr -d v)
+        if [ "$plan_target" != "$TARGET_VERSION" ]; then
+            slack "ABORT preflight — kubeadm plan target $plan_target ≠ requested $TARGET_VERSION"
+            exit 1
+        fi
     fi
     # 5. Push in-flight + started_timestamp metrics + ns annotations