k8s-version-upgrade: scope chain-fail alert to terminal reasons + sync docs

Refines the new K8sUpgradeChainJobFailed alert from a bare failed-pod count to the terminal job-condition reasons (BackoffLimitExceeded|DeadlineExceeded). A phase whose first pod failed but whose retry SUCCEEDED must NOT fire: every firing alert also halts kured, so a bare-count false-positive would block all OS node reboots for the Job's 7-day TTL. Verified against kube-state-metrics: the stuck preflight reports reason="BackoffLimitExceeded"; a Complete job has 0 for the terminal reasons. Docs updated to match the behaviour change (per the same-commit docs rule): - docs/runbooks/k8s-version-upgrade.md — new alert in the gates list; the "kill a stuck Job" recovery now leads with retry-on-failure self-heal. - docs/architecture/automated-upgrades.md — fourth Upgrade Gates alert; retry-on-failure note on the deterministic-naming paragraph. - .claude/skills/upgrade-state/SKILL.md — new "chain failed" status, legend entry, and drill-down (also copied to the active ~/.claude copy). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-17 13:10:18 +00:00 · 2026-06-17 13:10:18 +00:00 · fb638cd8ec
commit fb638cd8ec
parent dfa1a12a86
4 changed files with 71 additions and 26 deletions
--- a/docs/architecture/automated-upgrades.md
+++ b/docs/architecture/automated-upgrades.md
@ -274,8 +274,13 @@ Job 6 — postflight      (no pinning)
 Each Job runs `scripts/upgrade-step.sh`, which dispatches on `$PHASE` and ends
 by spawning the next Job (`envsubst < /template/job-template.yaml | kubectl
 apply -f -`). Job names are deterministic (`k8s-upgrade-<phase>-<target_version>[-<node>]`)
-so `apply` reconciles to a single Job per run — re-running a failed Job
-won't duplicate downstream Jobs.
+so `apply` reconciles to a single Job per run — re-running won't duplicate
+downstream Jobs. The detection CronJob and `spawn_next` additionally delete +
+re-spawn a terminally-**Failed** Job of the same name (rather than skipping it
+on existence), so a transient preflight gate self-heals on the next cycle
+instead of wedging the pipeline until the dead Job's 7d TTL expires
+(retry-on-failure, added 2026-06-17 after a spurious critical alert stalled
+1.34.9 for 5 days).

 ### Self-preemption history (the reason for the Job-chain rewrite)

@ -305,10 +310,11 @@ each Job's pod and its drain target are always different nodes.
 - **Per-node script**: `infra/scripts/update_k8s.sh`. Caller passes
  `--role master|worker --release X.Y.Z`. Piped via SSH into each node by
  upgrade-step.sh.
- **Three Upgrade Gates alerts**:
+- **Four Upgrade Gates alerts**:
  - `K8sVersionSkew` — kubelet/apiserver `gitVersion` count >1 for 30m. Catches a half-done rollout.
  - `EtcdPreUpgradeSnapshotMissing` — `k8s_upgrade_in_flight==1 && k8s_upgrade_snapshot_taken==0` for 10m. Catches preflight failing silently.
  - `K8sUpgradeStalled` — `k8s_upgrade_in_flight==1 && time()-k8s_upgrade_started_timestamp > 5400` for 5m. Catches a chain Job dying without spawning its successor.
+  - `K8sUpgradeChainJobFailed` — `kube_job_status_failed{namespace="k8s-upgrade",job_name=~"k8s-upgrade-.*",reason=~"BackoffLimitExceeded|DeadlineExceeded"} > 0` for 15m (warning). Catches a phase Job that terminally failed **before `in_flight` was set** (the preflight gates exit pre-metric) — invisible to the two `in_flight`-based alerts above; this was the blind spot behind the 5-day 1.34.9 preflight wedge. Reason-scoped so a retry-success doesn't false-positive (and so it doesn't needlessly block kured).
 - **Pushgateway metrics**:
  - `k8s_upgrade_in_flight` (set in preflight, cleared in postflight)
  - `k8s_upgrade_snapshot_taken` (set after etcd snapshot Job completes with ≥1 KiB)