k8s-version-upgrade: skip phase_master/phase_worker if node already on target

The chain wasn't idempotent — re-running on a partially-upgraded cluster
would re-drain + re-kubeadm + re-apt an already-upgraded node, causing
unnecessary disruption (5-10 min per no-op node) and risking alert
re-fires during the unnecessary drain.

Today's chain hit this twice: after fixing the version-detection bug
(commit a0f3e155), the chain correctly resumed but re-did master AND
node4 even though both were already on v1.34.8. node4 got cordoned,
drained, and is now soaking for 10 min for no reason.

Fix: at the top of phase_master and phase_worker, read the node's
current kubelet version. If it equals TARGET_VERSION, skip the whole
phase (return 0 — spawn_next will fire downstream). Chain advances
without disturbing the already-upgraded node.

In-flight effect: the current node4 worker pod has the old script
mounted from configmap snapshot, so it'll continue. If it fails and
retries, the new pod will see node4 on v1.34.8 and short-circuit.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
This commit is contained in:
Viktor Barzin 2026-05-23 09:53:57 +00:00
parent a0f3e15562
commit 02ea5da8dc

View file

@ -380,6 +380,18 @@ phase_preflight() {
}
phase_master() {
# Idempotency: skip the whole phase if k8s-master is already on target.
# The chain can re-run after a partial failure (e.g. workers got cut
# short); without this short-circuit we re-drain and re-kubeadm an
# already-upgraded master for no reason. Added 2026-05-23.
local current_v
current_v=$($KUBECTL get node k8s-master -o jsonpath='{.status.nodeInfo.kubeletVersion}' 2>/dev/null | tr -d v)
if [ "$current_v" = "$TARGET_VERSION" ]; then
slack "k8s-master already on v$TARGET_VERSION (kubelet=$current_v) — skipping master phase"
echo "k8s-master already on v$TARGET_VERSION — skipping"
return 0
fi
slack "Draining k8s-master"
# Re-check halt-on-alert before drain. Always ignore RecentNodeReboot —
@ -441,6 +453,18 @@ phase_master() {
phase_worker() {
[ -z "$TARGET_NODE" ] && { echo "ERROR: worker phase requires TARGET_NODE"; exit 2; }
# Idempotency: skip if target node is already on target version. Same
# rationale as phase_master — chains re-running after partial completion
# shouldn't re-drain an already-upgraded worker. Added 2026-05-23.
local current_v
current_v=$($KUBECTL get node "$TARGET_NODE" -o jsonpath='{.status.nodeInfo.kubeletVersion}' 2>/dev/null | tr -d v)
if [ "$current_v" = "$TARGET_VERSION" ]; then
slack "$TARGET_NODE already on v$TARGET_VERSION (kubelet=$current_v) — skipping worker phase"
echo "$TARGET_NODE already on v$TARGET_VERSION — skipping"
return 0
fi
slack "Draining $TARGET_NODE"
# Halt-on-alert wait (up to 30 min). Ignore RecentNodeReboot — the chain