k8s-version-upgrade: decompose into Job chain to fix self-preemption

The agent-based v1 ran inside claude-agent-service (replicas=1, no
nodeSelector) and evicted itself when it tried to drain its own host
(k8s-node4, on 2026-05-11). The cluster was left half-upgraded (master
v1.34.7, workers v1.34.2) until manual recovery.

Rewrite the pipeline as a chain of nodeSelector-pinned Jobs:

    preflight (k8s-node1)
    → master (k8s-node1) drains k8s-master
    → worker × 3 (k8s-node1) drains k8s-node{4,3,2}, one Job per node
    → worker (k8s-master + control-plane toleration) drains k8s-node1
    → postflight (no pinning)

Each Job runs scripts/upgrade-step.sh (a `case` on $PHASE) and ends by
rendering job-template.yaml with envsubst to create the next Job.
Deterministic names (k8s-upgrade-<phase>-<target_version>[-<node>]) make
`kubectl apply` idempotent: a failed Job can be re-created without
duplicating downstream Jobs.
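
The naming and hand-off step described above could look roughly like the
sketch below. It is not the code from scripts/upgrade-step.sh: the helper
names and the template variables (JOB_NAME, PHASE, TARGET_VERSION, NODE)
are assumptions made for illustration. Only the name pattern and the
k8s-upgrade namespace come from this commit.

```shell
#!/usr/bin/env bash
# Hypothetical sketch; helper and variable names are assumptions.

# Build the deterministic Job name: k8s-upgrade-<phase>-<target_version>[-<node>]
job_name() {
  local phase target node name
  phase="$1"
  target="$2"
  node="${3:-}"
  name="k8s-upgrade-${phase}-${target}"
  if [ -n "$node" ]; then
    name="${name}-${node}"
  fi
  printf '%s\n' "$name"
}

# Render the template and apply it. Because the name is deterministic,
# re-running this against an existing Job applies the same object again
# instead of creating a second Job.
create_next_job() {
  local template phase target node
  template="$1"
  phase="$2"
  target="$3"
  node="${4:-}"
  # Restrict envsubst to the listed variables so other "$" text in the
  # template is left untouched.
  JOB_NAME="$(job_name "$phase" "$target" "$node")" \
  PHASE="$phase" TARGET_VERSION="$target" NODE="$node" \
    envsubst '${JOB_NAME} ${PHASE} ${TARGET_VERSION} ${NODE}' < "$template" \
    | kubectl -n k8s-upgrade apply -f -
}
```

For example, `job_name worker v1.34.7 k8s-node4` prints
`k8s-upgrade-worker-v1.34.7-k8s-node4`, and calling `create_next_job`
twice with the same arguments targets the same Job object both times.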

Also lands `predrain_unstick`, which deletes pods on the target node
whose PodDisruptionBudget reports 0 disruptionsAllowed. Without it,
`kubectl drain` keeps retrying the eviction indefinitely for
single-replica Deployments (e.g. every Anubis instance, discovered the
hard way during the 2026-05-11 manual recovery of k8s-node3).
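
A minimal sketch of what a `predrain_unstick`-style helper might do is
shown below. It is not the implementation from this commit. It assumes
`kubectl` and `jq` are on PATH and that the blocking PDBs select pods via
`matchLabels` only; the function names are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical sketch; not the commit's predrain_unstick.

# Print "<namespace> <label-selector>" for each PDB that currently
# allows 0 disruptions (matchLabels selectors only).
blocked_pdb_selectors() {
  kubectl get pdb --all-namespaces -o json | jq -r '
    .items[]
    | select(.status.disruptionsAllowed == 0)
    | select(.spec.selector.matchLabels != null)
    | [ .metadata.namespace,
        (.spec.selector.matchLabels
         | to_entries | map("\(.key)=\(.value)") | join(",")) ]
    | join(" ")'
}

# Delete the pods covered by those PDBs, but only on the node about to
# be drained. A plain delete bypasses the eviction API, so the PDB does
# not block it; the owning controller recreates the pod elsewhere.
predrain_unstick_sketch() {
  local node
  node="$1"
  blocked_pdb_selectors | while read -r ns selector; do
    [ -n "$selector" ] || continue
    kubectl -n "$ns" delete pod -l "$selector" \
      --field-selector "spec.nodeName=$node" --wait=false
  done
}
```

Deleting a single-replica pod causes a brief outage for that workload,
which is the trade-off accepted here in exchange for a drain that can
finish.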
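
Adds the K8sUpgradeStalled alert, which fires when
k8s_upgrade_in_flight == 1 and more than 90 min have passed since
k8s_upgrade_started_timestamp (condition held for 5m).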
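
The alert condition can be reproduced by hand when debugging. The sketch
below mirrors the rule's arithmetic in shell (5400 s = 90 min); the
function name and its arguments are hypothetical, and it does not model
the rule's additional `for: 5m` hold time.

```shell
#!/usr/bin/env bash
# Hypothetical helper mirroring:
#   k8s_upgrade_in_flight == 1 and (time() - k8s_upgrade_started_timestamp) > 5400

# Usage: is_stalled <in_flight> <started_epoch_seconds> [now_epoch_seconds]
# Returns 0 (true) when the upgrade counts as stalled, 1 otherwise.
is_stalled() {
  local in_flight started now
  in_flight="$1"
  started="$2"
  now="${3:-$(date +%s)}"
  [ "$in_flight" -eq 1 ] && [ $((now - started)) -gt 5400 ]
}
```

For instance, `is_stalled 1 "$started"` succeeds only once more than
5400 seconds have elapsed since `$started` while the in-flight flag is
still 1.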

Deprecates the agent prompt (renamed to *.deprecated.md with a header
pointer to the new code).

Apply order: k8s-version-upgrade first (consumes new SA + ConfigMaps),
then monitoring (loads the new alert). Both applied 2026-05-11.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
parent 8e13f1528e
commit 448bc0c0f6
7 changed files with 1063 additions and 394 deletions
@@ -1917,6 +1917,21 @@ serverFiles:
               severity: critical
             annotations:
               summary: "K8s upgrade is in flight but no etcd snapshot was recorded — pipeline pre-flight failed silently"
+          # K8sUpgradeStalled: the v2 Job-chain pushes `k8s_upgrade_started_timestamp`
+          # in preflight and resets `k8s_upgrade_in_flight=0` in postflight. If
+          # in_flight=1 persists for >90 min, a Job in the chain failed
+          # (backoffLimit=1), got preempted/evicted, or is hung. Manual recovery:
+          # `kubectl -n k8s-upgrade get jobs` → identify failed/stuck Job → delete
+          # it → fix root cause → re-create the same Job. Next-Job creation in each
+          # phase is idempotent (deterministic name = `k8s-upgrade-<phase>-<target>`)
+          # so re-running won't duplicate downstream Jobs.
+          - alert: K8sUpgradeStalled
+            expr: k8s_upgrade_in_flight == 1 and (time() - k8s_upgrade_started_timestamp) > 5400
+            for: 5m
+            labels:
+              severity: critical
+            annotations:
+              summary: "K8s upgrade has been in flight for >90 min — chain is stuck. Check: kubectl -n k8s-upgrade get jobs"
       - name: "Traefik Ingress"
         rules:
           - alert: TraefikDown