k8s-version-upgrade: retry failed phases + surface wedged chain (fix 5-day silent stall)
The 1.34.9 patch auto-upgrade sat stuck for 5 days without anyone knowing.
On 2026-06-12 a transient critical alert (the ttyd web-terminal probe on the
devvm) was firing when the daily detection ran; the preflight's "halt on any
critical alert" gate aborted it, so the preflight Job Failed (backoffLimit=1).
Two design gaps then turned that blip into a multi-day wedge:
* the detection guard and spawn_next only checked whether the phase Job
EXISTED, not whether it succeeded — and the Failed Job lingers 7 days via
ttlSecondsAfterFinished, so every daily run skipped re-spawning it;
* the abort happens before the in-flight metric is pushed, so neither
K8sUpgradeStalled nor upgrade_state.sh could see it — the pipeline reported
"never ran" while actually being stuck.
Fixes:
D1 retry-on-failure: detection CronJob (main.tf) and spawn_next
(upgrade-step.sh) now delete + re-spawn a terminally-Failed phase Job
instead of skipping it, so a transient gate self-corrects next cycle
rather than wedging the pipeline for a week.
D2 WebterminalTtydUnreachable critical -> warning: a devvm developer
web-terminal is not cluster infrastructure and must not block upgrades.
D3 observability: new K8sUpgradeChainJobFailed alert
(kube_job_status_failed in k8s-upgrade ns) and upgrade_state.sh now flags
a Failed chain Job as "chain failed" — closing the pre-in-flight blind
spot so a wedge is visible immediately.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
This commit is contained in:
parent
7e7e41cbef
commit
dfa1a12a86
4 changed files with 75 additions and 5 deletions
|
|
@ -451,9 +451,22 @@ resource "kubernetes_cron_job_v1" "k8s_version_check" {
|
|||
# Idempotency: deterministic name reconciles via `apply`.
|
||||
JOB_NAME="k8s-upgrade-preflight-$${TARGET//./-}"
|
||||
|
||||
# Retry-on-failure idempotency: skip only if an existing preflight
|
||||
# Job is Active/Complete. A *Failed* preflight (aborted on a
|
||||
# transient gate, e.g. a spurious critical alert) is deleted and
|
||||
# re-spawned — otherwise its deterministic name + 7d TTL wedges
|
||||
# the entire pipeline until it ages out. (Stuck-pipeline fix
|
||||
# 2026-06-17: a transient critical alert wedged 1.34.9 for 5 days.)
|
||||
if /usr/local/bin/kubectl -n k8s-upgrade get job "$JOB_NAME" >/dev/null 2>&1; then
|
||||
slack "Preflight Job $JOB_NAME already exists (rerunning detection mid-flight?)"
|
||||
exit 0
|
||||
JOB_FAILED=$(/usr/local/bin/kubectl -n k8s-upgrade get job "$JOB_NAME" \
|
||||
-o jsonpath='{.status.conditions[?(@.type=="Failed")].status}' 2>/dev/null || true)
|
||||
if [ "$JOB_FAILED" = "True" ]; then
|
||||
slack "Preflight Job $JOB_NAME exists but FAILED — deleting and re-spawning"
|
||||
/usr/local/bin/kubectl -n k8s-upgrade delete job "$JOB_NAME" --wait=true >/dev/null 2>&1 || true
|
||||
else
|
||||
slack "Preflight Job $JOB_NAME already exists (active/complete) — skipping"
|
||||
exit 0
|
||||
fi
|
||||
fi
|
||||
|
||||
export JOB_NAME PHASE_NEXT=preflight TARGET_NODE_NEXT="" \
|
||||
|
|
|
|||
|
|
@ -222,9 +222,23 @@ spawn_next() {
|
|||
local job_name="k8s-upgrade-${NEXT_PHASE}-${TARGET_VERSION//./-}"
|
||||
[ -n "${NEXT_TARGET_NODE:-}" ] && job_name="${job_name}-${NEXT_TARGET_NODE}"
|
||||
|
||||
# Retry-on-failure idempotency: skip an existing next-Job ONLY if it is
|
||||
# Active or Complete. A *Failed* Job (a phase that aborted on a transient
|
||||
# gate) is deleted and re-created — otherwise its deterministic name plus
|
||||
# ttlSecondsAfterFinished (7d) would block the whole chain from re-running
|
||||
# that phase until the dead Job aged out. (Stuck-pipeline fix 2026-06-17:
|
||||
# a transient critical alert wedged the 1.34.9 preflight for 5 days.)
|
||||
if $KUBECTL -n "$NS" get job "$job_name" >/dev/null 2>&1; then
|
||||
echo "Next Job $job_name already exists; idempotent skip."
|
||||
return 0
|
||||
local job_failed
|
||||
job_failed=$($KUBECTL -n "$NS" get job "$job_name" \
|
||||
-o jsonpath='{.status.conditions[?(@.type=="Failed")].status}' 2>/dev/null || true)
|
||||
if [ "$job_failed" = "True" ]; then
|
||||
echo "Next Job $job_name exists but FAILED — deleting and re-spawning."
|
||||
$KUBECTL -n "$NS" delete job "$job_name" --wait=true >/dev/null 2>&1 || true
|
||||
else
|
||||
echo "Next Job $job_name already exists (active/complete); idempotent skip."
|
||||
return 0
|
||||
fi
|
||||
fi
|
||||
|
||||
local scheduling_block=""
|
||||
|
|
|
|||
Loading…
Add table
Add a link
Reference in a new issue