k8s-version-upgrade: retry failed phases + surface wedged chain (fix 5-day silent stall)
The 1.34.9 patch auto-upgrade sat stuck for 5 days without anyone knowing.
On 2026-06-12 a transient critical alert (the ttyd web-terminal probe on the
devvm) was firing when the daily detection ran; the preflight's "halt on any
critical alert" gate aborted it, so the preflight Job Failed (backoffLimit=1).
Two design gaps then turned that blip into a multi-day wedge:
* the detection guard and spawn_next only checked whether the phase Job
EXISTED, not whether it succeeded — and the Failed Job lingers 7 days via
ttlSecondsAfterFinished, so every daily run skipped re-spawning it;
* the abort happens before the in-flight metric is pushed, so neither
K8sUpgradeStalled nor upgrade_state.sh could see it — the pipeline reported
"never ran" while actually being stuck.
Fixes:
D1 retry-on-failure: detection CronJob (main.tf) and spawn_next
(upgrade-step.sh) now delete + re-spawn a terminally-Failed phase Job
instead of skipping it, so a transient gate self-corrects next cycle
rather than wedging the pipeline for a week.
D2 WebterminalTtydUnreachable critical -> warning: a devvm developer
web-terminal is not cluster infrastructure and must not block upgrades.
D3 observability: new K8sUpgradeChainJobFailed alert
(kube_job_status_failed in k8s-upgrade ns) and upgrade_state.sh now flags
a Failed chain Job as "chain failed" — closing the pre-in-flight blind
spot so a wedge is visible immediately.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
This commit is contained in:
parent
7e7e41cbef
commit
dfa1a12a86
4 changed files with 75 additions and 5 deletions
|
|
@ -445,6 +445,17 @@ collect_k8s() {
|
|||
|
||||
K8S_NEXT="$(next_daily_noon_utc)"
|
||||
|
||||
# Failed chain-Job detection. A preflight/phase Job can abort BEFORE pushing
|
||||
# k8s_upgrade_in_flight=1 (the preflight gates exit pre-metric), so in-flight
|
||||
# / stalled stay clean while the pipeline is actually wedged: the
|
||||
# deterministic-name + 7d-TTL Job blocks re-spawn. Surface it directly.
|
||||
# (2026-06-17: a transient critical alert wedged the 1.34.9 preflight for 5
|
||||
# days, invisible to every metric-based check.)
|
||||
local failed_jobs
|
||||
failed_jobs=$($KUBECTL -n k8s-upgrade get jobs \
|
||||
-o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.conditions[?(@.type=="Failed")].status}{"\n"}{end}' 2>/dev/null \
|
||||
| awk -F'\t' '$2=="True" && $1 ~ /^k8s-upgrade-/{print $1}' | paste -sd' ' - || true)
|
||||
|
||||
# Status logic.
|
||||
local stalled=0
|
||||
if [[ "${in_flight:-0}" == "1" && "$started_int" -gt 0 ]]; then
|
||||
|
|
@ -463,6 +474,10 @@ collect_k8s() {
|
|||
K8S_STATUS_ICON="✗"; K8S_STATUS_TEXT="detection stale"
|
||||
K8S_NOTES="last detection >9d ago"
|
||||
raise_exit 2
|
||||
elif [[ -n "$failed_jobs" ]]; then
|
||||
K8S_STATUS_ICON="✗"; K8S_STATUS_TEXT="chain failed"
|
||||
K8S_NOTES="failed upgrade Job(s): $failed_jobs — pipeline wedged. Inspect: kubectl -n k8s-upgrade describe job <name> (the retry-on-failure guard re-spawns on the next detection cycle)"
|
||||
raise_exit 2
|
||||
elif [[ "${in_flight:-0}" == "1" ]]; then
|
||||
K8S_STATUS_ICON="…"; K8S_STATUS_TEXT="in-flight"
|
||||
K8S_NOTES="upgrade chain running"
|
||||
|
|
|
|||
Loading…
Add table
Add a link
Reference in a new issue