k8s-version-upgrade: scope chain-fail alert to terminal reasons + sync docs
Some checks failed
ci/woodpecker/push/default Pipeline failed
Some checks failed
ci/woodpecker/push/default Pipeline failed
Refines the new K8sUpgradeChainJobFailed alert from a bare failed-pod count to
the terminal job-condition reasons (BackoffLimitExceeded|DeadlineExceeded). A
phase whose first pod failed but whose retry SUCCEEDED must NOT fire: every
firing alert also halts kured, so a bare-count false-positive would block all
OS node reboots for the Job's 7-day TTL. Verified against kube-state-metrics:
the stuck preflight reports reason="BackoffLimitExceeded"; a Complete job has 0
for the terminal reasons.
Docs updated to match the behaviour change (per the same-commit docs rule):
- docs/runbooks/k8s-version-upgrade.md — new alert in the gates list; the
"kill a stuck Job" recovery now leads with retry-on-failure self-heal.
- docs/architecture/automated-upgrades.md — fourth Upgrade Gates alert;
retry-on-failure note on the deterministic-naming paragraph.
- .claude/skills/upgrade-state/SKILL.md — new "chain failed" status, legend
entry, and drill-down (also copied to the active ~/.claude copy).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
This commit is contained in:
parent
dfa1a12a86
commit
fb638cd8ec
4 changed files with 71 additions and 26 deletions
|
|
@ -274,8 +274,13 @@ Job 6 — postflight (no pinning)
|
|||
Each Job runs `scripts/upgrade-step.sh`, which dispatches on `$PHASE` and ends
|
||||
by spawning the next Job (`envsubst < /template/job-template.yaml | kubectl
|
||||
apply -f -`). Job names are deterministic (`k8s-upgrade-<phase>-<target_version>[-<node>]`)
|
||||
so `apply` reconciles to a single Job per run — re-running a failed Job
|
||||
won't duplicate downstream Jobs.
|
||||
so `apply` reconciles to a single Job per run — re-running won't duplicate
|
||||
downstream Jobs. The detection CronJob and `spawn_next` additionally delete +
|
||||
re-spawn a terminally-**Failed** Job of the same name (rather than skipping it
|
||||
on existence), so a transient preflight gate self-heals on the next cycle
|
||||
instead of wedging the pipeline until the dead Job's 7d TTL expires
|
||||
(retry-on-failure, added 2026-06-17 after a spurious critical alert stalled
|
||||
1.34.9 for 5 days).
|
||||
|
||||
### Self-preemption history (the reason for the Job-chain rewrite)
|
||||
|
||||
|
|
@ -305,10 +310,11 @@ each Job's pod and its drain target are always different nodes.
|
|||
- **Per-node script**: `infra/scripts/update_k8s.sh`. Caller passes
|
||||
`--role master|worker --release X.Y.Z`. Piped via SSH into each node by
|
||||
upgrade-step.sh.
|
||||
- **Three Upgrade Gates alerts**:
|
||||
- **Four Upgrade Gates alerts**:
|
||||
- `K8sVersionSkew` — kubelet/apiserver `gitVersion` count >1 for 30m. Catches a half-done rollout.
|
||||
- `EtcdPreUpgradeSnapshotMissing` — `k8s_upgrade_in_flight==1 && k8s_upgrade_snapshot_taken==0` for 10m. Catches preflight failing silently.
|
||||
- `K8sUpgradeStalled` — `k8s_upgrade_in_flight==1 && time()-k8s_upgrade_started_timestamp > 5400` for 5m. Catches a chain Job dying without spawning its successor.
|
||||
- `K8sUpgradeChainJobFailed` — `kube_job_status_failed{namespace="k8s-upgrade",job_name=~"k8s-upgrade-.*",reason=~"BackoffLimitExceeded|DeadlineExceeded"} > 0` for 15m (warning). Catches a phase Job that terminally failed **before `in_flight` was set** (the preflight gates exit pre-metric) — invisible to the two `in_flight`-based alerts above; this was the blind spot behind the 5-day 1.34.9 preflight wedge. Reason-scoped so a retry-success doesn't false-positive (and so it doesn't needlessly block kured).
|
||||
- **Pushgateway metrics**:
|
||||
- `k8s_upgrade_in_flight` (set in preflight, cleared in postflight)
|
||||
- `k8s_upgrade_snapshot_taken` (set after etcd snapshot Job completes with ≥1 KiB)
|
||||
|
|
|
|||
|
|
@ -115,7 +115,8 @@ Pushed by upgrade-step.sh during phase execution; observed by the
|
|||
- **`K8sVersionSkew`** — distinct kubelet/apiserver `gitVersion` count > 1 for 30m. Catches a half-done rollout.
|
||||
- **`EtcdPreUpgradeSnapshotMissing`** — `k8s_upgrade_in_flight==1 && k8s_upgrade_snapshot_taken==0` for 10m. Catches preflight Stage 2 failing silently.
|
||||
- **`K8sUpgradeStalled`** — `k8s_upgrade_in_flight==1 && time()-k8s_upgrade_started_timestamp > 5400` for 5m. Catches a Job in the chain dying without spawning its successor.
|
||||
- All three alerts ALSO block kured (same `--prometheus-url` halt-on-alert mechanism) so the OS-reboot pipeline can't run on top of a half-done version upgrade.
|
||||
- **`K8sUpgradeChainJobFailed`** — `kube_job_status_failed{namespace="k8s-upgrade",job_name=~"k8s-upgrade-.*",reason=~"BackoffLimitExceeded|DeadlineExceeded"} > 0` for 15m (warning). Catches a phase Job that **terminally failed before `k8s_upgrade_in_flight` was set** — the preflight gates exit pre-metric, so the two `in_flight`-based alerts above are blind to a failed preflight (this is what hid the 5-day 1.34.9 wedge on 2026-06-12). Reason-scoped to terminal job conditions so a retry-success doesn't false-positive (a bare failed-pod-count would otherwise also block kured for the Job's 7d TTL).
|
||||
- All four alerts ALSO block kured (same `--prometheus-url` halt-on-alert mechanism) so the OS-reboot pipeline can't run on top of a half-done version upgrade.
|
||||
|
||||
### Vault secrets
|
||||
|
||||
|
|
@ -202,8 +203,18 @@ EOF
|
|||
```
|
||||
|
||||
### Kill a stuck Job (chain halted mid-flight)
|
||||
The chain stalls if any Job dies without spawning its successor. `K8sUpgradeStalled`
|
||||
fires after 90 min. Recovery:
|
||||
A phase Job that dies without spawning its successor halts the chain. Two alerts
|
||||
surface it: `K8sUpgradeStalled` (a mid-chain Job that died with `in_flight=1`,
|
||||
after 90 min) and `K8sUpgradeChainJobFailed` (any phase that terminally failed,
|
||||
after 15 min — including a **preflight** that aborted before `in_flight` was set,
|
||||
which `K8sUpgradeStalled` cannot see).
|
||||
|
||||
**Preflight failures now self-heal** (since 2026-06-17): the detection CronJob and
|
||||
`spawn_next` delete + re-spawn a terminally-Failed Job instead of skipping it on
|
||||
name-existence (retry-on-failure), so a transient preflight gate — e.g. a spurious
|
||||
critical alert like the ttyd web-terminal probe that wedged 1.34.9 for 5 days —
|
||||
clears on the next daily cycle. A mid-chain phase that keeps failing still needs
|
||||
manual recovery: fix the root cause, then:
|
||||
|
||||
```bash
|
||||
# 1. Identify the failed Job
|
||||
|
|
|
|||
Loading…
Add table
Add a link
Reference in a new issue