k8s-version-upgrade: retry kubeadm apply on static-pod-hash timeout
kubeadm's `upgrade apply` waits 5min for each static-pod manifest swap to be picked up by the kubelet (it polls the pod's `kubernetes.io/config.hash` annotation via apiserver). On a freshly-rebooted master with apiserver-to-kubelet status sync lagging, that 5min isn't enough — kubeadm declares the upgrade failed and rolls back. The thing is: the etcd container HAS already been swapped to the new image by then (verified live — pod is on registry.k8s.io/etcd:3.6.5-0 when this fires). kubeadm's check is just slow to notice. The 2nd attempt sees etcd already on target, skips it, and proceeds cleanly. Wrap `kubeadm upgrade apply` in a 3-attempt loop with 30s between. Worker phase doesn't need this — `kubeadm upgrade node` has no static-pod-hash waits. Today's autonomous-pipeline session: master phase Failed at 5m on attempt #1 with this exact error, retried, hit same timeout, gave up (backoffLimit=1). The wrapper turns this from a fatal pipeline halt into a "wait a bit, try again" that usually completes on attempt #2. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
This commit is contained in:
parent
6dd1f15881
commit
eca1cc7e2e
1 changed files with 17 additions and 1 deletions
|
|
@ -83,7 +83,23 @@ sudo apt-get install -y "kubeadm=$RELEASE-*"
|
|||
if [[ "$ROLE" == "master" ]]; then
|
||||
echo "==> Master path: kubeadm upgrade plan + apply"
|
||||
sudo kubeadm upgrade plan
|
||||
sudo kubeadm upgrade apply "v$RELEASE" -y
|
||||
# The first apply may fail with "static Pod hash for component <X> did
|
||||
# not change after 5m0s" — kubeadm's 5min wait for the kubelet to reload
|
||||
# a static pod is too tight on our cluster (apiserver-to-kubelet status
|
||||
# sync latency post-master-reboot can exceed it). The etcd image IS
|
||||
# actually updated by then, so a 2nd attempt sees etcd already on
|
||||
# target and skips it. Up to 3 attempts with a 30s delay between.
|
||||
attempt=1
|
||||
while ! sudo kubeadm upgrade apply "v$RELEASE" -y; do
|
||||
if (( attempt >= 3 )); then
|
||||
echo "ERROR: kubeadm upgrade apply failed after 3 attempts" >&2
|
||||
exit 1
|
||||
fi
|
||||
echo "==> kubeadm apply attempt $attempt failed (likely static-pod-hash 5m timeout). Sleeping 30s then retrying — the previous attempt's manifest writes usually take hold on the 2nd try."
|
||||
sleep 30
|
||||
attempt=$(( attempt + 1 ))
|
||||
done
|
||||
echo "==> kubeadm upgrade apply succeeded on attempt $attempt"
|
||||
else
|
||||
echo "==> Worker path: kubeadm upgrade node"
|
||||
sudo kubeadm upgrade node
|
||||
|
|
|
|||
Loading…
Add table
Add a link
Reference in a new issue