k8s-version-upgrade: tigera quiesce + etcd-skip retry + IO-wait alert ignore

Three changes unblocking the autonomous chain for k8s patch upgrades: 1. **phase_master quiesces tigera-operator before drain, restores after.** Tigera crashes immediately if apiserver is unreachable (no retry logic) and crashlooping it during master static-pod swaps generates ~500MB/s disk I/O that pushes kubeadm's 5-min static-pod-hash watch past its limit. Quiesce removes the storm contributor; calico data plane keeps running unchanged (data plane is the DaemonSet+Typha, operator is just the reconciler). 2. **update_k8s.sh retries with --etcd-upgrade=false on the 2nd attempt.** For patch upgrades (1.34.7→1.34.8), etcd's image doesn't change — kubeadm writes an identical manifest, hash doesn't update, watch times out and rolls back forever. The skip-etcd retry sidesteps it for the legitimate no-change case while still doing a full etcd upgrade on the first attempt (correct for minor-version bumps). 3. **halt_on_alert_query also ignores IngressTTFBHigh + NodeHighIOWait.** Both are symptoms-not-causes: ingress latency spikes briefly during any pod-restart wave; high IOwait is exactly what upgrade activity causes (chicken-and-egg). The inline quiet-baseline check (Ready transition <10min) is the real cluster-churn gate. RBAC: k8s-upgrade-job ClusterRole gains `patch` on deployments + scale subresource so the chain can do the scale-to-0/back-to-1 on tigera. These three together get the chain past the cascade that's been blocking 1.34.7→1.34.8 for a week. Long-term fix is still HA control plane (beads code-n0ow); these are the bridge. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-23 08:40:11 +00:00 · 2026-05-23 08:40:11 +00:00 · 4713c3a6d9
commit 4713c3a6d9
parent 6f4a569d1c
3 changed files with 61 additions and 4 deletions
--- a/scripts/update_k8s.sh
+++ b/scripts/update_k8s.sh
@ -89,17 +89,26 @@ if [[ "$ROLE" == "master" ]]; then
    # sync latency post-master-reboot can exceed it). The etcd image IS
    # actually updated by then, so a 2nd attempt sees etcd already on
    # target and skips it. Up to 3 attempts with a 30s delay between.
+    # First attempt: full kubeadm upgrade (incl. etcd). On the static-pod-
+    # hash 5min-timeout failure, retry with --etcd-upgrade=false. The
+    # timeout happens reliably for patch upgrades where etcd's image
+    # doesn't change (kubeadm writes identical manifest → hash doesn't
+    # change → kubeadm waits forever for a change that will never come).
+    # Skipping the etcd phase on retry is safe IF etcd is already on the
+    # right version (which is the only case where this timeout fires).
    attempt=1
-    while ! sudo kubeadm upgrade apply "v$RELEASE" -y; do
+    extra_flags=""
+    while ! sudo kubeadm upgrade apply "v$RELEASE" -y $extra_flags; do
        if (( attempt >= 3 )); then
            echo "ERROR: kubeadm upgrade apply failed after 3 attempts" >&2
            exit 1
        fi
-        echo "==> kubeadm apply attempt $attempt failed (likely static-pod-hash 5m timeout). Sleeping 30s then retrying — the previous attempt's manifest writes usually take hold on the 2nd try."
+        echo "==> kubeadm apply attempt $attempt failed. Retrying with --etcd-upgrade=false (etcd image is unchanged for patch upgrades; kubeadm's static-pod-hash watch is the only thing failing)."
+        extra_flags="--etcd-upgrade=false"
        sleep 30
        attempt=$(( attempt + 1 ))
    done
-    echo "==> kubeadm upgrade apply succeeded on attempt $attempt"
+    echo "==> kubeadm upgrade apply succeeded on attempt $attempt (flags: '$extra_flags')"
 else
    echo "==> Worker path: kubeadm upgrade node"
    sudo kubeadm upgrade node