From 2dc7e001bd08de433b4c8f6defca13164ceec003 Mon Sep 17 00:00:00 2001
From: Viktor Barzin <vbarzin@gmail.com>
Date: Thu, 21 May 2026 09:32:29 +0000
Subject: [PATCH] k8s-version-upgrade: retry kubeadm apply on static-pod-hash
 timeout
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

kubeadm's `upgrade apply` waits 5min for each static-pod manifest swap
to be picked up by the kubelet (it polls the pod's
`kubernetes.io/config.hash` annotation via apiserver). On a freshly-rebooted
master with apiserver-to-kubelet status sync lagging, that 5min isn't
enough — kubeadm declares the upgrade failed and rolls back.

The thing is: the etcd container HAS already been swapped to the new
image by then (verified live — pod is on registry.k8s.io/etcd:3.6.5-0
when this fires). kubeadm's check is just slow to notice. The 2nd
attempt sees etcd already on target, skips it, and proceeds cleanly.

Wrap `kubeadm upgrade apply` in a 3-attempt loop with 30s between.
Worker phase doesn't need this — `kubeadm upgrade node` has no
static-pod-hash waits.

Today's autonomous-pipeline session: master phase Failed at 5m on
attempt #1 with this exact error, retried, hit same timeout, gave up
(backoffLimit=1). The wrapper turns this from a fatal pipeline halt
into a "wait a bit, try again" that usually completes on attempt #2.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
---
 scripts/update_k8s.sh | 18 +++++++++++++++++-
 1 file changed, 17 insertions(+), 1 deletion(-)
diff --git a/scripts/update_k8s.sh b/scripts/update_k8s.sh
index b8abf06f..f98ea30a 100755
--- a/scripts/update_k8s.sh
+++ b/scripts/update_k8s.sh
@@ -83,7 +83,23 @@ sudo apt-get install -y "kubeadm=$RELEASE-*"
 if [[ "$ROLE" == "master" ]]; then
     echo "==> Master path: kubeadm upgrade plan + apply"
     sudo kubeadm upgrade plan
-    sudo kubeadm upgrade apply "v$RELEASE" -y
+    # The first apply may fail with "static Pod hash for component <X> did
+    # not change after 5m0s" — kubeadm's 5min wait for the kubelet to reload
+    # a static pod is too tight on our cluster (apiserver-to-kubelet status
+    # sync latency post-master-reboot can exceed it). The etcd image IS
+    # actually updated by then, so a 2nd attempt sees etcd already on
+    # target and skips it. Up to 3 attempts with a 30s delay between.
+    attempt=1
+    while ! sudo kubeadm upgrade apply "v$RELEASE" -y; do
+        if (( attempt >= 3 )); then
+            echo "ERROR: kubeadm upgrade apply failed after 3 attempts" >&2
+            exit 1
+        fi
+        echo "==> kubeadm apply attempt $attempt failed (likely static-pod-hash 5m timeout). Sleeping 30s then retrying — the previous attempt's manifest writes usually take hold on the 2nd try."
+        sleep 30
+        attempt=$(( attempt + 1 ))
+    done
+    echo "==> kubeadm upgrade apply succeeded on attempt $attempt"
 else
     echo "==> Worker path: kubeadm upgrade node"
     sudo kubeadm upgrade node