Three changes unblocking the autonomous chain for k8s patch upgrades:
1. **phase_master quiesces tigera-operator before drain, restores after.**
Tigera crashes immediately if apiserver is unreachable (no retry logic)
and crashlooping it during master static-pod swaps generates ~500MB/s
disk I/O that pushes kubeadm's 5-min static-pod-hash watch past its
limit. Quiesce removes the storm contributor; calico data plane keeps
running unchanged (data plane is the DaemonSet+Typha, operator is just
the reconciler).
2. **update_k8s.sh retries with --etcd-upgrade=false on the 2nd attempt.**
For patch upgrades (1.34.7→1.34.8), etcd's image doesn't change — kubeadm
writes an identical manifest, hash doesn't update, watch times out and
rolls back forever. The skip-etcd retry sidesteps it for the legitimate
no-change case while still doing a full etcd upgrade on the first
attempt (correct for minor-version bumps).
3. **halt_on_alert_query also ignores IngressTTFBHigh + NodeHighIOWait.**
Both are symptoms-not-causes: ingress latency spikes briefly during any
pod-restart wave; high IOwait is exactly what upgrade activity causes
(chicken-and-egg). The inline quiet-baseline check (Ready transition
<10min) is the real cluster-churn gate.
RBAC: k8s-upgrade-job ClusterRole gains `patch` on deployments + scale
subresource so the chain can do the scale-to-0/back-to-1 on tigera.
These three together get the chain past the cascade that's been blocking
1.34.7→1.34.8 for a week. Long-term fix is still HA control plane
(beads code-n0ow); these are the bridge.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>