Three changes unblocking the autonomous chain for k8s patch upgrades:
1. **phase_master quiesces tigera-operator before drain, restores after.**
Tigera crashes immediately if apiserver is unreachable (no retry logic)
and crashlooping it during master static-pod swaps generates ~500MB/s
disk I/O that pushes kubeadm's 5-min static-pod-hash watch past its
limit. Quiesce removes the storm contributor; calico data plane keeps
running unchanged (data plane is the DaemonSet+Typha, operator is just
the reconciler).
2. **update_k8s.sh retries with --etcd-upgrade=false on the 2nd attempt.**
For patch upgrades (1.34.7→1.34.8), etcd's image doesn't change — kubeadm
writes an identical manifest, hash doesn't update, watch times out and
rolls back forever. The skip-etcd retry sidesteps it for the legitimate
no-change case while still doing a full etcd upgrade on the first
attempt (correct for minor-version bumps).
3. **halt_on_alert_query also ignores IngressTTFBHigh + NodeHighIOWait.**
Both are symptoms-not-causes: ingress latency spikes briefly during any
pod-restart wave; high IOwait is exactly what upgrade activity causes
(chicken-and-egg). The inline quiet-baseline check (Ready transition
<10min) is the real cluster-churn gate.
RBAC: k8s-upgrade-job ClusterRole gains `patch` on deployments + scale
subresource so the chain can do the scale-to-0/back-to-1 on tigera.
These three together get the chain past the cascade that's been blocking
1.34.7→1.34.8 for a week. Long-term fix is still HA control plane
(beads code-n0ow); these are the bridge.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
kubeadm's `upgrade apply` waits 5min for each static-pod manifest swap
to be picked up by the kubelet (it polls the pod's
`kubernetes.io/config.hash` annotation via apiserver). On a freshly-rebooted
master with apiserver-to-kubelet status sync lagging, that 5min isn't
enough — kubeadm declares the upgrade failed and rolls back.
The thing is: the etcd container HAS already been swapped to the new
image by then (verified live — pod is on registry.k8s.io/etcd:3.6.5-0
when this fires). kubeadm's check is just slow to notice. The 2nd
attempt sees etcd already on target, skips it, and proceeds cleanly.
Wrap `kubeadm upgrade apply` in a 3-attempt loop with 30s between.
Worker phase doesn't need this — `kubeadm upgrade node` has no
static-pod-hash waits.
Today's autonomous-pipeline session: master phase Failed at 5m on
attempt #1 with this exact error, retried, hit same timeout, gave up
(backoffLimit=1). The wrapper turns this from a fatal pipeline halt
into a "wait a bit, try again" that usually completes on attempt #2.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Adds a weekly detection CronJob (Sun 12:00 UTC) that probes apt-cache madison
on master for new patches + HEAD pkgs.k8s.io for next-minor availability,
then POSTs to claude-agent-service to dispatch the k8s-version-upgrade agent.
The agent (.claude/agents/k8s-version-upgrade.md) orchestrates:
pre-flight (5 nodes Ready + halt-on-alert + 24h-quiet + plan target match)
-> etcd snapshot save
-> optional master containerd skew fix
-> apt repo URL rewrite (minor bumps only)
-> drain/upgrade/uncordon master via ssh < update_k8s.sh
-> sequential workers k8s-node4 -> 3 -> 2 -> 1 with 10-min soak each
-> post-flight verification
Two new Upgrade Gates alerts catch failure modes:
- K8sVersionSkew (kubelet/apiserver gitVersion mismatch >30m)
- EtcdPreUpgradeSnapshotMissing (in_flight without snapshot_taken >10m)
update_k8s.sh refactored to take --role / --release args; the agent shells
it into each node via SSH pipe. update_node.sh annotated as OS-major path.
Operator-facing docs: docs/runbooks/k8s-version-upgrade.md and a new section
in docs/architecture/automated-upgrades.md.
Secrets: secret/k8s-upgrade/{ssh_key,ssh_key_pub,slack_webhook} (ed25519
keypair distributed to all 5 nodes via authorized_keys; slack_webhook
reuses kured webhook URL on initial deploy).