k8s-version-upgrade: retry failed phases + surface wedged chain (fix 5-day silent stall)

The 1.34.9 patch auto-upgrade sat stuck for 5 days without anyone knowing. On 2026-06-12 a transient critical alert (the ttyd web-terminal probe on the devvm) was firing when the daily detection ran; the preflight's "halt on any critical alert" gate aborted it, so the preflight Job Failed (backoffLimit=1). Two design gaps then turned that blip into a multi-day wedge: * the detection guard and spawn_next only checked whether the phase Job EXISTED, not whether it succeeded — and the Failed Job lingers 7 days via ttlSecondsAfterFinished, so every daily run skipped re-spawning it; * the abort happens before the in-flight metric is pushed, so neither K8sUpgradeStalled nor upgrade_state.sh could see it — the pipeline reported "never ran" while actually being stuck. Fixes: D1 retry-on-failure: detection CronJob (main.tf) and spawn_next (upgrade-step.sh) now delete + re-spawn a terminally-Failed phase Job instead of skipping it, so a transient gate self-corrects next cycle rather than wedging the pipeline for a week. D2 WebterminalTtydUnreachable critical -> warning: a devvm developer web-terminal is not cluster infrastructure and must not block upgrades. D3 observability: new K8sUpgradeChainJobFailed alert (kube_job_status_failed in k8s-upgrade ns) and upgrade_state.sh now flags a Failed chain Job as "chain failed" — closing the pre-in-flight blind spot so a wedge is visible immediately. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-17 13:07:36 +00:00 · 2026-06-17 13:07:36 +00:00 · dfa1a12a86
commit dfa1a12a86
parent 7e7e41cbef
4 changed files with 75 additions and 5 deletions
--- a/stacks/k8s-version-upgrade/main.tf
+++ b/stacks/k8s-version-upgrade/main.tf
@ -451,9 +451,22 @@ resource "kubernetes_cron_job_v1" "k8s_version_check" {
                #    Idempotency: deterministic name reconciles via `apply`.
                JOB_NAME="k8s-upgrade-preflight-$${TARGET//./-}"

+                # Retry-on-failure idempotency: skip only if an existing preflight
+                # Job is Active/Complete. A *Failed* preflight (aborted on a
+                # transient gate, e.g. a spurious critical alert) is deleted and
+                # re-spawned — otherwise its deterministic name + 7d TTL wedges
+                # the entire pipeline until it ages out. (Stuck-pipeline fix
+                # 2026-06-17: a transient critical alert wedged 1.34.9 for 5 days.)
                if /usr/local/bin/kubectl -n k8s-upgrade get job "$JOB_NAME" >/dev/null 2>&1; then
-                  slack "Preflight Job $JOB_NAME already exists (rerunning detection mid-flight?)"
-                  exit 0
+                  JOB_FAILED=$(/usr/local/bin/kubectl -n k8s-upgrade get job "$JOB_NAME" \
+                    -o jsonpath='{.status.conditions[?(@.type=="Failed")].status}' 2>/dev/null || true)
+                  if [ "$JOB_FAILED" = "True" ]; then
+                    slack "Preflight Job $JOB_NAME exists but FAILED — deleting and re-spawning"
+                    /usr/local/bin/kubectl -n k8s-upgrade delete job "$JOB_NAME" --wait=true >/dev/null 2>&1 || true
+                  else
+                    slack "Preflight Job $JOB_NAME already exists (active/complete) — skipping"
+                    exit 0
+                  fi
                fi

                export JOB_NAME PHASE_NEXT=preflight TARGET_NODE_NEXT="" \