k8s-version-upgrade: scope chain-fail alert to terminal reasons + sync docs
Some checks failed
ci/woodpecker/push/default Pipeline failed
Some checks failed
ci/woodpecker/push/default Pipeline failed
Refines the new K8sUpgradeChainJobFailed alert from a bare failed-pod count to
the terminal job-condition reasons (BackoffLimitExceeded|DeadlineExceeded). A
phase whose first pod failed but whose retry SUCCEEDED must NOT fire: every
firing alert also halts kured, so a bare-count false-positive would block all
OS node reboots for the Job's 7-day TTL. Verified against kube-state-metrics:
the stuck preflight reports reason="BackoffLimitExceeded"; a Complete job has 0
for the terminal reasons.
Docs updated to match the behaviour change (per the same-commit docs rule):
- docs/runbooks/k8s-version-upgrade.md — new alert in the gates list; the
"kill a stuck Job" recovery now leads with retry-on-failure self-heal.
- docs/architecture/automated-upgrades.md — fourth Upgrade Gates alert;
retry-on-failure note on the deterministic-naming paragraph.
- .claude/skills/upgrade-state/SKILL.md — new "chain failed" status, legend
entry, and drill-down (also copied to the active ~/.claude copy).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
This commit is contained in:
parent
dfa1a12a86
commit
fb638cd8ec
4 changed files with 71 additions and 26 deletions
|
|
@ -61,8 +61,11 @@ Pushgateway (`prometheus-prometheus-pushgateway.monitoring:9091`):
|
|||
- `k8s_upgrade_in_flight` — 0/1
|
||||
- `k8s_upgrade_started_timestamp` — when the current chain started (0 when idle)
|
||||
|
||||
`K8sUpgradeStalled` alert fires when `in_flight=1` and the chain has
|
||||
been running >90 minutes. The script raises `✗` in the same window.
|
||||
`K8sUpgradeStalled` fires when `in_flight=1` and the chain has been running
|
||||
>90 minutes. `K8sUpgradeChainJobFailed` fires when a phase Job terminally
|
||||
failed — including a **preflight that aborted before `in_flight` was set**
|
||||
(the gates exit pre-metric). The script raises `✗` for either, and reads the
|
||||
Jobs directly, so it also catches a Failed preflight that left no metric.
|
||||
|
||||
## Status-icon legend
|
||||
|
||||
|
|
@ -72,7 +75,7 @@ been running >90 minutes. The script raises `✗` in the same window.
|
|||
| `→` | Update available, not yet applied (K8s patch/minor) |
|
||||
| `…` | In flight — chain currently running |
|
||||
| `⚠` | Attention: held-with-bumps, recent errors, pending approvals |
|
||||
| `✗` | Broken: pod down, alert firing, chain stalled |
|
||||
| `✗` | Broken: pod down, alert firing, chain stalled, or a chain Job failed |
|
||||
|
||||
## Drill-down — when a row trips, what to do
|
||||
|
||||
|
|
@ -177,6 +180,31 @@ kubectl -n monitoring exec deploy/prometheus-server -c prometheus-server -- sh -
|
|||
--header='Content-Type: text/plain'"
|
||||
```
|
||||
|
||||
### K8s `✗ chain failed` — a phase Job terminally failed
|
||||
|
||||
`K8sUpgradeChainJobFailed` would fire. Most often a **preflight** that aborted
|
||||
on a gate (a critical alert firing, a node not Ready, a kubeadm-plan mismatch) —
|
||||
these exit before `in_flight` is set, so `K8sUpgradeStalled` never sees them, and
|
||||
the deterministic name + 7d TTL blocked re-spawn (the 2026-06-12 5-day wedge).
|
||||
|
||||
```bash
|
||||
kubectl -n k8s-upgrade get jobs
|
||||
kubectl -n k8s-upgrade describe job <failed-job> # check the Failed reason
|
||||
# Preflight abort reasons post to Slack ONLY (not stdout), so Loki won't have
|
||||
# them. Replay the gate instead — which critical alerts were firing at the
|
||||
# failure time? (ALERTS{severity="critical"} in Prometheus, query at that ts.)
|
||||
```
|
||||
|
||||
Recovery is now mostly automatic: the detection CronJob and `spawn_next`
|
||||
re-spawn a terminally-Failed Job on the next cycle (retry-on-failure), so a
|
||||
transient gate clears within ~24h. To expedite, delete the Failed Job and
|
||||
trigger detection:
|
||||
|
||||
```bash
|
||||
kubectl -n k8s-upgrade delete job <failed-job>
|
||||
kubectl -n k8s-upgrade create job --from=cronjob/k8s-version-check manual-detect-$(date +%s)
|
||||
```
|
||||
|
||||
### K8s `✗ detection stale` — last detection >9 days
|
||||
|
||||
```bash
|
||||
|
|
|
|||
Loading…
Add table
Add a link
Reference in a new issue