k8s-version-upgrade: automated kubeadm/kubelet/kubectl upgrade pipeline

Adds a weekly detection CronJob (Sun 12:00 UTC) that runs apt-cache madison
on master to find new patch releases and sends a HEAD request to pkgs.k8s.io
to check next-minor availability, then POSTs to claude-agent-service to
dispatch the k8s-version-upgrade agent.
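
The detection step can be sketched roughly as below. This is an illustration only, not the CronJob's actual script: the function names are hypothetical, and the Release URL layout is an assumption about how the pkgs.k8s.io apt repos are laid out that should be checked before use.

```shell
# Pick the newest version from `apt-cache madison <pkg>` output on stdin.
# Each input line looks like: " kubeadm | 1.34.2-1.1 | https://... Packages"
latest_madison_version() {
  awk -F'|' 'NF >= 2 { gsub(/[ \t]/, "", $2); print $2 }' | sort -V | tail -n 1
}

# Next minor after a "MAJOR.MINOR" string (input must not include a patch part).
next_minor() {
  major="${1%%.*}"
  minor="${1#*.}"
  echo "${major}.$((minor + 1))"
}

# ASSUMPTION: per-minor repo layout used as the target of the HEAD probe.
release_url() {
  echo "https://pkgs.k8s.io/core:/stable:/v$1/deb/Release"
}
```

On the node this might be used as `apt-cache madison kubeadm | latest_madison_version`, and the availability probe as `curl -fsSIL "$(release_url "$(next_minor 1.34)")"`, where a zero exit status would suggest the next minor has been published.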
The agent (.claude/agents/k8s-version-upgrade.md) orchestrates:
  pre-flight (5 nodes Ready + halt-on-alert + 24h-quiet + plan target match)
  -> etcd snapshot save
  -> optional master containerd skew fix
  -> apt repo URL rewrite (minor bumps only)
  -> drain/upgrade/uncordon master via ssh < update_k8s.sh
  -> workers one at a time (k8s-node4, k8s-node3, k8s-node2, k8s-node1),
     with a 10-min soak after each
  -> post-flight verification
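
The master-then-workers ordering above could look something like the following sketch. It is a simplified illustration, not the agent's real logic: `upgrade_node` is a hypothetical placeholder for the drain/upgrade/uncordon work, `SOAK_SECONDS` is a hypothetical knob, and the master hostname is passed in because the commit does not name it.

```shell
# Worker order and soak time as stated in the commit message.
WORKERS="k8s-node4 k8s-node3 k8s-node2 k8s-node1"
SOAK_SECONDS="${SOAK_SECONDS:-600}"   # 10-minute soak after each worker

# Hypothetical placeholder: the real step drains, upgrades and uncordons the node.
upgrade_node() {
  echo "upgrade $1 role=$2"
}

# Upgrade the master first, then each worker in turn; stop at the first failure
# so a bad upgrade never spreads to the remaining nodes.
rollout() {
  master="$1"
  upgrade_node "$master" master || return 1
  for node in $WORKERS; do
    upgrade_node "$node" worker || { echo "halting at $node" >&2; return 1; }
    sleep "$SOAK_SECONDS"
  done
}
```

Running workers strictly in sequence keeps at most one node drained at a time, and the soak gives alerts a chance to fire before the next node is touched.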
Two new Upgrade Gates alerts catch failure modes:
- K8sVersionSkew (kubelet/apiserver gitVersion mismatch for >30m)
- EtcdPreUpgradeSnapshotMissing (k8s_upgrade_in_flight=1 with
  k8s_upgrade_snapshot_taken=0 for >10m)
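
The snapshot gate depends on the agent publishing two gauges, k8s_upgrade_in_flight and k8s_upgrade_snapshot_taken. A minimal sketch of how they might be rendered is shown below. It assumes a Prometheus Pushgateway is the transport, which the commit does not state; the function names are hypothetical.

```shell
# Render both gauges in Prometheus text exposition format.
# $1 = in_flight (0 or 1), $2 = snapshot_taken (0 or 1)
upgrade_metrics() {
  printf 'k8s_upgrade_in_flight %s\n' "$1"
  printf 'k8s_upgrade_snapshot_taken %s\n' "$2"
}

# Local mirror of the alert condition, for reasoning about states only;
# the real alert also requires the condition to hold for 10 minutes.
snapshot_missing() {
  [ "$1" -eq 1 ] && [ "$2" -eq 0 ]
}
```

With a Pushgateway, the payload could be sent as `upgrade_metrics 1 0 | curl --data-binary @- "$PUSHGATEWAY_URL/metrics/job/k8s-upgrade"`, where both `PUSHGATEWAY_URL` and the job name are placeholders. Publishing snapshot_taken=0 explicitly at the start matters because an alert that compares against 0 can only fire if the series exists.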
update_k8s.sh is refactored to take --role / --release args; the agent pipes
it to each node over SSH. update_node.sh is annotated as the OS-major path.
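
As an illustration of the --role / --release interface, argument handling along these lines would work when the script arrives on stdin. This is a sketch, not the contents of update_k8s.sh: the accepted role values (master, worker), the release format, and the helper names are assumptions.

```shell
# Parse --role and --release into ROLE and RELEASE; return 2 on bad input.
parse_args() {
  ROLE="" RELEASE=""
  while [ "$#" -gt 0 ]; do
    case "$1" in
      --role)    [ "$#" -ge 2 ] || return 2; ROLE="$2"; shift 2 ;;
      --release) [ "$#" -ge 2 ] || return 2; RELEASE="$2"; shift 2 ;;
      *) echo "unknown argument: $1" >&2; return 2 ;;
    esac
  done
  case "$ROLE" in
    master|worker) ;;   # ASSUMPTION: these are the only roles
    *) echo "--role must be master or worker" >&2; return 2 ;;
  esac
  [ -n "$RELEASE" ] || { echo "--release is required" >&2; return 2; }
}

# Print (rather than run) the SSH pipe invocation for a node.
# `bash -s -- ARGS` reads the script from stdin and passes ARGS as "$@".
remote_cmd() {
  printf "ssh %s 'bash -s -- --role %s --release %s' < update_k8s.sh\n" "$1" "$2" "$3"
}
```

Piping the script over SSH means nothing has to be copied to the node beforehand, so every node always runs the version of the script that is checked in.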
Operator-facing docs: docs/runbooks/k8s-version-upgrade.md and a new section
in docs/architecture/automated-upgrades.md.

Secrets: secret/k8s-upgrade/{ssh_key,ssh_key_pub,slack_webhook} (ed25519
keypair; public key installed in authorized_keys on all 5 nodes;
slack_webhook reuses the kured webhook URL on initial deploy).
parent 09f83b4e83
commit e75bcaf394
8 changed files with 1379 additions and 34 deletions
@@ -1890,14 +1890,13 @@ serverFiles:
     annotations:
       summary: "Kubelet/apiserver gitVersion skew detected — possible half-done k8s upgrade. Inspect: kubectl get nodes -o jsonpath='{.items[*].status.nodeInfo.kubeletVersion}'"
   # EtcdPreUpgradeSnapshotMissing: the k8s-version-upgrade agent pushes
-  # k8s_upgrade_in_flight=1 when it starts, and k8s_upgrade_snapshot_taken=1
-  # after the etcdctl snapshot is verified. If we see in_flight=1 with no
-  # corresponding snapshot_taken=1 after 10 min, the agent has skipped or
-  # failed the snapshot — that's a critical safety hole.
+  # `k8s_upgrade_in_flight=1` + `k8s_upgrade_snapshot_taken=0` at Stage 0,
+  # then sets snapshot_taken=1 in Stage 2 after etcdctl confirms the
+  # snapshot file size. Anywhere in_flight=1 with snapshot_taken=0
+  # lasting >10m means the agent skipped or failed Stage 2 — a critical
+  # safety hole (no recovery point if master upgrade hangs).
   - alert: EtcdPreUpgradeSnapshotMissing
-    expr: |
-      k8s_upgrade_in_flight == 1
-      unless on() k8s_upgrade_snapshot_taken == 1
+    expr: k8s_upgrade_in_flight == 1 and k8s_upgrade_snapshot_taken == 0
     for: 10m
     labels:
       severity: critical