|
Some checks failed
ci/woodpecker/push/default Pipeline failed
Refines the new K8sUpgradeChainJobFailed alert from a bare failed-pod count to
the terminal job-condition reasons (BackoffLimitExceeded|DeadlineExceeded). A
phase whose first pod failed but whose retry SUCCEEDED must NOT fire: every
firing alert also halts kured, so a bare-count false-positive would block all
OS node reboots for the Job's 7-day TTL. Verified against kube-state-metrics:
the stuck preflight reports reason="BackoffLimitExceeded"; a Complete job has 0
for the terminal reasons.
Docs updated to match the behaviour change (per the same-commit docs rule):
- docs/runbooks/k8s-version-upgrade.md — new alert in the gates list; the
"kill a stuck Job" recovery now leads with retry-on-failure self-heal.
- docs/architecture/automated-upgrades.md — fourth Upgrade Gates alert;
retry-on-failure note on the deterministic-naming paragraph.
- .claude/skills/upgrade-state/SKILL.md — new "chain failed" status, legend
entry, and drill-down (also copied to the active ~/.claude copy).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
|
||
|---|---|---|
| .. | ||
| agent-task-tracking.md | ||
| authentication.md | ||
| automated-upgrades.md | ||
| backup-dr.md | ||
| chrome-service.md | ||
| ci-cd.md | ||
| compute.md | ||
| databases.md | ||
| dns.md | ||
| homepage.md | ||
| incident-response.md | ||
| llama-cpp.md | ||
| mailserver.md | ||
| monitoring.md | ||
| multi-tenancy.md | ||
| networking.md | ||
| overview.md | ||
| secrets.md | ||
| security.md | ||
| storage.md | ||
| vpn.md | ||
| wave1-egress-observation-2026-05-22.md | ||