K8s Version Upgrade Pipeline
Overview
Kubernetes component versions (kubeadm/kubelet/kubectl) on the 5 K8s
VMs are upgraded automatically by a weekly detection CronJob that seeds a
chain of small phase Jobs. Each Job is pinned to a node that is NOT its
drain target — so no pod in the chain can preempt itself.
The chain (Sun 12:00 UTC weekly):
detection CronJob → preflight Job → master Job → worker × 4 Jobs → postflight Job
This is independent of the OS-side unattended-upgrades + kured
pipeline (see k8s-node-auto-upgrades.md). They do not share rollouts.
Schedules can overlap (kured runs daily 02:00-06:00 London; detection
here runs Sun 12:00 UTC). When a kured reboot lands within 24h of the
Sunday detection, the RecentNodeReboot alert in the Upgrade Gates
group blocks the version-upgrade preflight, so the chain self-defers
to the next Sunday rather than rolling on top of a freshly rebooted node.
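
The 24h self-deferral window can be illustrated with a small shell sketch. This is not the pipeline's actual check, which goes through the RecentNodeReboot alert; `quiet_for_24h` and its epoch-second arguments are hypothetical names used only to show the arithmetic.

```shell
# Hypothetical helper (not taken from upgrade-step.sh): succeeds when the
# last Ready transition is at least 24h (86400 s) older than "now".
# Both arguments are epoch seconds.
quiet_for_24h() {
  now=$1
  last_transition=$2
  [ $(( now - last_transition )) -ge 86400 ]
}

# Example: a reboot 6h (21600 s) before detection defers the chain.
if quiet_for_24h 1000000 $(( 1000000 - 21600 )); then
  echo "proceed"
else
  echo "defer to next Sunday"
fi
```

With the example values the sketch takes the deferral branch, which matches the behaviour described above for a Sunday-morning kured reboot.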
Architecture
k8s-version-check CronJob (Sun 12:00 UTC, k8s-upgrade ns, SA: k8s-upgrade-job)
│ kubectl get nodes → running version
│ ssh master 'apt-cache madison kubeadm' → latest patch (within current minor)
│ HEAD pkgs.k8s.io/.../v<NEXT_MINOR>/deb/Release → next minor available?
│ push k8s_upgrade_available{kind,running,target} → Pushgateway
│
▼ if a target is detected
envsubst on /template/job-template.yaml | kubectl apply -f -
│ creates k8s-upgrade-preflight-<target_version>
▼
Job 0 — preflight (pinned: k8s-node1)
├── All nodes Ready + no Mem/Disk pressure
├── halt-on-alert (kured-style ignore-list)
├── 24h-quiet baseline (no Ready transitions <24h ago)
├── kubeadm upgrade plan matches target
├── Push k8s_upgrade_in_flight=1, k8s_upgrade_started_timestamp=$(date +%s)
├── Trigger backup-etcd Job, wait, verify snapshot byte count
├── SSH master: containerd skew fix (if master < workers)
├── SSH all 5 nodes: apt repo URL rewrite (only kind=minor)
└── spawn_next → k8s-upgrade-master-<target_version>
▼
Job 1 — master upgrade (pinned: k8s-node1)
├── halt-on-alert recheck (no firing alerts)
├── drain k8s-master (predrain_unstick deletes PDB-blocked pods)
├── ssh wizard@k8s-master 'bash -s' < /scripts/update_k8s.sh -- --role master --release X.Y.Z
├── kubectl uncordon k8s-master; wait Ready + version match
├── verify control-plane pods Running
├── halt-on-alert recheck (allows RecentNodeReboot)
└── spawn_next → k8s-upgrade-worker-<v>-k8s-node4
▼
Job 2 — worker k8s-node4 (pinned: k8s-node1)
Job 3 — worker k8s-node3 (pinned: k8s-node1)
Job 4 — worker k8s-node2 (pinned: k8s-node1)
(identical pattern: halt-on-alert wait 30m → drain → ssh script → uncordon → 10-min soak → spawn_next)
▼
Job 5 — worker k8s-node1 (pinned: k8s-master + control-plane toleration)
└── spawn_next → k8s-upgrade-postflight-<target_version>
▼
Job 6 — postflight (no pinning)
├── Verify all 5 nodes at target version
├── Verify no firing Upgrade Gates alerts
├── Compute pod-ready ratio (should be ≥ 0.9)
├── Clear k8s-upgrade-* annotations on namespace
├── Push k8s_upgrade_in_flight=0, k8s_upgrade_snapshot_taken=0, k8s_upgrade_started_timestamp=0
└── Slack: ✅ K8s upgrade complete
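
The postflight pod-ready threshold (≥ 0.9) can be checked without floating point. The sketch below is an assumption about how such a gate could be written; `ready_ratio_ok` and its two count arguments are hypothetical and do not come from `upgrade-step.sh`.

```shell
# Hypothetical gate: succeeds when ready/total >= 0.9.
# Uses the integer form ready*10 >= total*9 to avoid floating point.
ready_ratio_ok() {
  ready=$1
  total=$2
  # Refuse to pass when there are no pods to count.
  [ "$total" -gt 0 ] || return 1
  [ $(( ready * 10 )) -ge $(( total * 9 )) ]
}

# Example: 183 of 200 pods Ready is 0.915, which passes.
ready_ratio_ok 183 200 && echo "pod-ready ratio OK"
```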
Pin choices summarised:
- k8s-node1 hosts every Job that drains master or another worker. k8s-node1 itself is upgraded last.
- k8s-master hosts Job 5 (which drains k8s-node1). Job 5's spec includes a
  toleration for `node-role.kubernetes.io/control-plane:NoSchedule`.
- If anyone reorders the worker sequence, the pin for Job 5 needs to track
  whatever worker is upgraded last. The mapping is in
  `scripts/upgrade-step.sh` → the `case "${PHASE}:${TARGET_NODE:-}"` block.
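
A minimal sketch of what such a pin mapping might look like, assuming phase names `preflight`, `master`, `worker`, and `postflight`. The function name `pin_for` and these phase strings are illustrative guesses, not the contents of the real `case` block.

```shell
# Hypothetical pin lookup mirroring the pin choices listed above.
# Prints the node the Job should be pinned to (empty = no pinning).
pin_for() {
  phase=$1
  target_node=${2:-}
  case "${phase}:${target_node}" in
    postflight:*)     echo "" ;;            # postflight drains nothing, so no pin
    worker:k8s-node1) echo "k8s-master" ;;  # last worker; needs the control-plane toleration
    *)                echo "k8s-node1" ;;   # preflight, master, and every other worker
  esac
}

pin_for worker k8s-node4   # prints k8s-node1
pin_for worker k8s-node1   # prints k8s-master
```

If the worker order changes, only the `worker:<last-node>` arm of a mapping like this would need to move.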
Components
Shared resources (one-time, Terraform-managed)
| Resource | Purpose |
|---|---|
| ConfigMap `k8s-upgrade-scripts` | Mounts `/scripts/upgrade-step.sh` (universal phase body, dispatches on `$PHASE`) and `/scripts/update_k8s.sh` (per-node kubeadm/kubelet/kubectl upgrade body, the same script the old manual loop used) in every Job pod. |
| ConfigMap `k8s-upgrade-job-template` | Mounts `/template/job-template.yaml`, the universal Job manifest with envsubst placeholders. Rendered by `upgrade-step.sh` and the detection CronJob via `envsubst` piped into `kubectl apply -f -`. |
| ServiceAccount `k8s-upgrade-job` | Used by both the detection CronJob and every chain Job. ClusterRole binding grants: nodes get/list/patch, pods/eviction create, pods delete, batch/jobs CRUD, PDB list (for `predrain_unstick`), CronJob get (snapshot trigger), namespaces patch on `k8s-upgrade` only. Namespace-scoped Role binding grants secrets:get on `k8s-upgrade-creds`. |
| ExternalSecret `k8s-upgrade-creds` | Syncs `secret/k8s-upgrade/{ssh_key, slack_webhook}` from Vault. Mounted into every Job at `/secrets/k8s-upgrade`. |
| CronJob `k8s-version-check` | Sun 12:00 UTC. Probes apt + pkgs.k8s.io for target. If found, renders Job 0 from `job-template.yaml` and applies it. |
Pushgateway metrics
Pushed by upgrade-step.sh during phase execution; observed by the
Upgrade Gates alert group in stacks/monitoring/.../prometheus_chart_values.tpl:
| Metric | Pushed by | Cleared by |
|---|---|---|
| `k8s_upgrade_in_flight` (1/0) | preflight Job (set to 1) | postflight Job (set to 0) |
| `k8s_upgrade_started_timestamp` (epoch s) | preflight Job | postflight Job (set to 0) |
| `k8s_upgrade_snapshot_taken` (1/0) | preflight Job (set to 1 after Job=`pre-upgrade-etcd-*` completes with `Backup done:` log of ≥1 KiB) | postflight Job (0) |
| `k8s_upgrade_available{kind,running,target}` | detection CronJob | next detection run (overwrite) |
| `k8s_version_check_last_run_timestamp` | detection CronJob | (cumulative) |
Upgrade Gates alerts (Upgrade Gates group in prometheus_chart_values.tpl)
- `K8sVersionSkew`: distinct kubelet/apiserver `gitVersion` count > 1 for 30m. Catches a half-done rollout.
- `EtcdPreUpgradeSnapshotMissing`: `k8s_upgrade_in_flight==1 && k8s_upgrade_snapshot_taken==0` for 10m. Catches preflight Stage 2 failing silently.
- `K8sUpgradeStalled`: `k8s_upgrade_in_flight==1 && time()-k8s_upgrade_started_timestamp > 5400` for 5m. Catches a Job in the chain dying without spawning its successor.
- All three alerts ALSO block kured (same `--prometheus-url` halt-on-alert mechanism) so the OS-reboot pipeline can't run on top of a half-done version upgrade.
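
To make the `K8sUpgradeStalled` condition concrete, here is the same predicate written as a shell function. This is only an illustration of the expression; the real evaluation happens in Prometheus, and `upgrade_stalled` with its three arguments is a hypothetical name.

```shell
# Hypothetical re-statement of the K8sUpgradeStalled expression:
#   k8s_upgrade_in_flight == 1 && time() - k8s_upgrade_started_timestamp > 5400
# Arguments: in_flight (1/0), started (epoch s), now (epoch s).
upgrade_stalled() {
  in_flight=$1
  started=$2
  now=$3
  [ "$in_flight" -eq 1 ] && [ $(( now - started )) -gt 5400 ]
}

# Example: a chain started 2h (7200 s) ago and still in flight counts as stalled.
upgrade_stalled 1 1000000 1007200 && echo "stalled"
```

The sketch covers only the expression. The alert's `for 5m` clause means the condition must also hold for a further five minutes before it fires.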
Vault secrets
- `secret/k8s-upgrade/ssh_key`: ed25519 PRIVATE key, used by Jobs to SSH `wizard@<node>`
- `secret/k8s-upgrade/ssh_key_pub`: matching PUBLIC key, deployed to nodes' `~/.ssh/authorized_keys`
- `secret/k8s-upgrade/slack_webhook`: Slack incoming-webhook URL

Exposed in K8s via ExternalSecret `k8s-upgrade-creds` in the `k8s-upgrade` namespace. The previous `api_bearer_token` entry is GONE; the chain does not POST to claude-agent-service.
Common Operations
Verify the pipeline is healthy
# CronJob present + not suspended
kubectl -n k8s-upgrade get cronjob k8s-version-check
# Latest detection run output
kubectl -n k8s-upgrade get jobs -l app=k8s-version-upgrade
kubectl -n k8s-upgrade logs -l app=k8s-version-upgrade --tail=200
# Chain Jobs from the last run (retained 7 days via ttlSecondsAfterFinished)
kubectl -n k8s-upgrade get jobs -l app=k8s-upgrade-chain
# Pushgateway — running detection metric
kubectl -n monitoring exec deploy/prometheus-server -c prometheus-server -- \
wget -q -O- 'http://prometheus-prometheus-pushgateway.monitoring:9091/metrics' | \
grep -E '^(k8s_upgrade_(available|in_flight|started_timestamp|snapshot_taken)|k8s_version_check_last_run_timestamp)'
# Upgrade Gates rules loaded
kubectl -n monitoring exec deploy/prometheus-server -c prometheus-server -- \
wget -q -O- 'http://localhost:9090/api/v1/rules' | \
jq -r '.data.groups[] | select(.name == "Upgrade Gates") | .rules[] | " \(.name): \(.state)"'
Manually trigger detection (no upgrade)
Use detection_dry_run=true to short-circuit before spawning Job 0:
# Toggle var in TF, apply, and trigger
# (in stacks/k8s-version-upgrade/main.tf)
# variable "detection_dry_run" { default = true }
# scripts/tg apply
kubectl -n k8s-upgrade create job --from=cronjob/k8s-version-check version-check-test
kubectl -n k8s-upgrade logs -l job-name=version-check-test -f
# When done, flip back to false.
Manually trigger the chain (skip detection)
Useful for testing or to force a specific target. Render Job 0 directly:
# envsubst only substitutes exported variables, so export every placeholder.
export TARGET=1.34.7
export KIND=patch
export IMAGE=$(kubectl -n k8s-upgrade get cronjob k8s-version-check \
-o jsonpath='{.spec.jobTemplate.spec.template.spec.containers[0].image}')
# Note: export JOB_NAME, PHASE_NEXT, etc. as well; see the CronJob's command for
# the full env block. Easier: just trigger detection with the right inputs.
kubectl -n k8s-upgrade get cm k8s-upgrade-job-template -o jsonpath='{.data.job-template\.yaml}' \
| envsubst | kubectl apply -f -
Kill a stuck Job (chain halted mid-flight)
The chain stalls if any Job dies without spawning its successor. K8sUpgradeStalled
fires after 90 min. Recovery:
# 1. Identify the failed Job
kubectl -n k8s-upgrade get jobs -l app=k8s-upgrade-chain
kubectl -n k8s-upgrade describe job/<failed-job-name> | tail -50
kubectl -n k8s-upgrade logs job/<failed-job-name>
# 2. Diagnose. Common causes:
# - drain stuck on PDB-violating pod (predrain_unstick should handle this;
# but a brand-new PDB pattern could escape it — manually delete the pod)
# - SSH from Job pod failing (node restarted? known_hosts mismatch?)
# - kubeadm upgrade failed on a node (check journalctl + apt history on that node)
# 3. Fix the root cause first.
# 4. Delete the failed Job + re-spawn it. Naming is deterministic so
# `kubectl apply` of the same name reconciles to a single Job.
kubectl -n k8s-upgrade delete job/<failed-job-name>
# 5. Manually render + apply the same Job. Pull the template + spec from the
# next-Job-creation block in upgrade-step.sh; easiest is to copy from a
# sibling Job's YAML. If the API server rejects the copy because of
# auto-generated fields (spec.selector, controller-uid / job-name labels),
# add those to the del() list as well.
kubectl -n k8s-upgrade get job/<sibling-job-name> -o yaml \
| yq 'del(.metadata.resourceVersion, .metadata.uid, .metadata.creationTimestamp, .metadata.managedFields, .status)' \
| yq '.metadata.name = "<failed-job-name>"' \
| yq '(.spec.template.spec.containers[0].env[] | select(.name == "PHASE")).value = "<right-phase>"' \
| kubectl apply -f -
# The chain will continue from there. The next-Job-creation step in upgrade-step.sh
# is idempotent (deterministic name) so re-running won't duplicate downstream.
Skip a phase (advanced; use sparingly)
If you've already done the work for a phase manually and want the chain to jump past it, manually create the NEXT phase's Job with the deterministic name. The previous phase's spawn-next will see the Job already exists and short-circuit. Example: master already on target; jump straight to worker:
TARGET=1.34.7
TGT_LBL=${TARGET//./-}
# (compose Job from upgrade-step.sh spawn_next code, name=k8s-upgrade-worker-$TGT_LBL-k8s-node4, run on k8s-node1)
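
The deterministic naming that makes this short-circuit work can be sketched as below. The pattern follows the Job names shown in this runbook (`k8s-upgrade-<phase>-<target-with-dashes>[-<node>]`); the helper `job_name` itself is hypothetical, and it uses `tr` rather than `${TARGET//./-}` only so the sketch also runs in a plain POSIX shell.

```shell
# Hypothetical helper: build a chain Job name from phase, target version,
# and (for worker phases) the node being upgraded.
job_name() {
  phase=$1
  target=$2
  node=${3:-}
  # Dots are not valid in this label form, so 1.34.7 becomes 1-34-7.
  label=$(printf '%s' "$target" | tr '.' '-')
  if [ -n "$node" ]; then
    printf 'k8s-upgrade-%s-%s-%s\n' "$phase" "$label" "$node"
  else
    printf 'k8s-upgrade-%s-%s\n' "$phase" "$label"
  fi
}

job_name worker 1.34.7 k8s-node4   # prints k8s-upgrade-worker-1-34-7-k8s-node4
```

Because the name depends only on phase, target, and node, creating the Job by hand and letting the previous phase attempt to create it again resolve to the same object.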
Halt the pipeline in an emergency
# Option 1: suspend the detection CronJob (won't stop an in-flight chain)
kubectl -n k8s-upgrade patch cronjob k8s-version-check \
-p '{"spec":{"suspend":true}}' --type=merge
# Re-enable: -p '{"spec":{"suspend":false}}'
# Option 2: delete all in-flight chain Jobs
kubectl -n k8s-upgrade delete jobs -l app=k8s-upgrade-chain
# This leaves the in-flight annotation + Pushgateway gauge intact —
# K8sUpgradeStalled will fire to surface the halt.
# Option 3: force a blocker alert (same regex kured uses)
# — see k8s-node-auto-upgrades.md "Force halt by adding a custom blocker alert"
Clear orphaned in-flight state
After deciding NOT to retry a halted chain:
kubectl annotate ns k8s-upgrade \
viktorbarzin.me/k8s-upgrade-in-flight- \
viktorbarzin.me/k8s-upgrade-target- \
viktorbarzin.me/k8s-upgrade-snapshot-path-
# Reset Pushgateway gauges so K8sUpgradeStalled / EtcdPreUpgradeSnapshotMissing clear:
kubectl -n monitoring port-forward svc/prometheus-prometheus-pushgateway 9091:9091 &
printf '# TYPE k8s_upgrade_in_flight gauge\nk8s_upgrade_in_flight 0\n# TYPE k8s_upgrade_snapshot_taken gauge\nk8s_upgrade_snapshot_taken 0\n# TYPE k8s_upgrade_started_timestamp gauge\nk8s_upgrade_started_timestamp 0\n' \
| curl --data-binary @- http://localhost:9091/metrics/job/k8s-version-upgrade
kill %1
Rollback paths
kubeadm does not support in-place downgrade. If a run fails:
Master broke during/after kubeadm upgrade
- Identify the etcd snapshot:

      kubectl get ns k8s-upgrade -o jsonpath='{.metadata.annotations.viktorbarzin\.me/k8s-upgrade-snapshot-path}'

- Restore etcd per `infra/docs/runbooks/restore-etcd.md`.
- Manually downgrade master `kubeadm`/`kubelet`/`kubectl` to the pre-upgrade version. Find versions in `/var/log/apt/history.log` on the node:

      ssh wizard@k8s-master 'sudo cat /var/log/apt/history.log | tail -40'
      # Pre-upgrade versions are in the most recent "Commandline: apt-get install"
      sudo apt-mark unhold kubeadm kubelet kubectl
      sudo apt-get install --allow-downgrades -y \
        kubeadm=<OLD>-1.1 kubelet=<OLD>-1.1 kubectl=<OLD>-1.1
      sudo apt-mark hold kubeadm kubelet kubectl
      sudo systemctl daemon-reload && sudo systemctl restart kubelet

Worker broke
- `kubectl drain <node> --ignore-daemonsets --delete-emptydir-data --force --grace-period=300`
- Downgrade apt packages on that node only (see above)
- `kubectl uncordon <node>`
- The cluster continues running on the master + remaining workers throughout
One-shot SSH key rotation
- Generate new keypair:

      ssh-keygen -t ed25519 -f /tmp/k8s-upgrade -N ""

- Update Vault:

      vault kv patch secret/k8s-upgrade \
        ssh_key=@/tmp/k8s-upgrade \
        ssh_key_pub=@/tmp/k8s-upgrade.pub

- Push the new pubkey to every node:

      for n in k8s-master k8s-node1 k8s-node2 k8s-node3 k8s-node4; do
        ssh wizard@$n 'sed -i "/k8s-upgrade-key$/d" ~/.ssh/authorized_keys'
        # Double quotes: the pubkey must be read locally, since /tmp/k8s-upgrade.pub does not exist on the node.
        ssh wizard@$n "echo '$(cat /tmp/k8s-upgrade.pub) k8s-upgrade-key' >> ~/.ssh/authorized_keys"
      done

- ESO refreshes within 15 min, or force:

      kubectl -n k8s-upgrade annotate externalsecret k8s-upgrade-creds force-sync=$(date +%s) --overwrite

Past Incidents
2026-05-11 — Self-preemption (agent → Job-chain rewrite)
- The v1 agent ran inside the `claude-agent-service` Deployment (replicas=1, no nodeSelector) and was scheduled to k8s-node4.
- During Stage 6 (first worker drain) the agent ran `kubectl drain k8s-node4`, evicting itself.
- The bash process died after the drain but before the SSH-pipe to install kubeadm on node4.
- Node4 was left cordoned; cluster stuck at master v1.34.7, workers v1.34.2 until manual recovery.
- Mitigation: rewrote the pipeline as a chain of Jobs, each `nodeSelector`-pinned to a non-target node. New `predrain_unstick` step deletes PDB-blocked single-replica pods (Anubis pattern) before drain so they don't loop forever. Added `K8sUpgradeStalled` alert (in-flight + started_timestamp > 90 min).
File Pointers
| What | Where |
|---|---|
| Stack (CronJob + ConfigMaps + SA/RBAC + ExternalSecret) | infra/stacks/k8s-version-upgrade/main.tf |
| Universal phase body | infra/stacks/k8s-version-upgrade/scripts/upgrade-step.sh |
| Job template | infra/stacks/k8s-version-upgrade/job-template.yaml |
| Per-node upgrade script | infra/scripts/update_k8s.sh |
| Upgrade Gates alerts | infra/stacks/monitoring/modules/monitoring/prometheus_chart_values.tpl (group "Upgrade Gates") |
| Vault secrets | secret/k8s-upgrade/{ssh_key, ssh_key_pub, slack_webhook} |
| Architecture doc | infra/docs/architecture/automated-upgrades.md (K8s Version Upgrades section) |
| Related (OS reboots) | infra/docs/runbooks/k8s-node-auto-upgrades.md |
| Deprecated agent prompt (reference) | infra/.claude/agents/k8s-version-upgrade.deprecated.md |