k8s-version-upgrade: decompose into Job chain to fix self-preemption

The agent-based v1 ran inside claude-agent-service (replicas=1, no nodeSelector) and self-evicted when it tried to drain its own host (k8s-node4 on 2026-05-11). The cluster was left half-upgraded (master v1.34.7, workers v1.34.2) until manual recovery.

Rewrite the pipeline as a chain of nodeSelector-pinned Jobs:

  preflight    (k8s-node1)
  → master     (k8s-node1) drains k8s-master
  → worker × 3 (k8s-node1) drains k8s-node{4,3,2}
  → worker     (k8s-master + control-plane toleration) drains k8s-node1
  → postflight (no pinning)

Each Job runs scripts/upgrade-step.sh (case-on-$PHASE) and ends by envsubst-ing job-template.yaml into the next Job. Deterministic names (k8s-upgrade-<phase>-<target_version>[-<node>]) make `kubectl apply` idempotent: a failed Job can be re-created without duplicating downstream Jobs.

Also lands `predrain_unstick`: deletes pods on the target node whose PDB has 0 disruptionsAllowed. Without this, drain loops indefinitely on single-replica deployments (e.g. every Anubis instance, discovered the hard way during the 2026-05-11 manual recovery of k8s-node3).

Adds K8sUpgradeStalled alert (in_flight + started_timestamp > 90 min). Deprecates the agent prompt (renamed to *.deprecated.md with a header pointer to the new code).

Apply order: k8s-version-upgrade first (consumes new SA + ConfigMaps), then monitoring (loads the new alert). Both applied 2026-05-11.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-11 23:54:05 +00:00 · 448bc0c0f6
commit 448bc0c0f6
parent 8e13f1528e
7 changed files with 1063 additions and 394 deletions
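
The chain order and deterministic naming described in the commit message can be sketched in a few lines of Bash. This is an illustrative sketch, not the contents of upgrade-step.sh: the helper names `job_name`, `next_step` and `walk_chain` are hypothetical, the `-` placeholder for "no node" is an invention of this sketch, and it assumes that dots in the version become dashes (as the detection CronJob does for the preflight Job) and that the master Job's name carries its node suffix.

```shell
#!/usr/bin/env bash
# Sketch only: hypothetical helpers, not the real dispatch logic.
set -euo pipefail

# Deterministic Job name: k8s-upgrade-<phase>-<version>[-<node>].
# Dots become dashes so the result is a valid Kubernetes object name.
job_name() {
  local phase="$1" version="$2" node="${3:-}"
  printf 'k8s-upgrade-%s-%s%s\n' "$phase" "${version//./-}" "${node:+-$node}"
}

# Print "<next_phase> <next_node> <pin_host>" for the step after (phase, node).
# "-" means "none"; nothing is printed after postflight (end of chain).
next_step() {
  local phase="$1" node="${2:-}"
  case "$phase:$node" in
    preflight:*)      echo "master k8s-master k8s-node1" ;;
    master:*)         echo "worker k8s-node4 k8s-node1" ;;
    worker:k8s-node4) echo "worker k8s-node3 k8s-node1" ;;
    worker:k8s-node3) echo "worker k8s-node2 k8s-node1" ;;
    # k8s-node1 is drained last, so that Job must run somewhere else.
    worker:k8s-node2) echo "worker k8s-node1 k8s-master" ;;
    worker:k8s-node1) echo "postflight - -" ;;
    postflight:*)     ;;
    *) echo "unknown step: $phase $node" >&2; return 1 ;;
  esac
}

# Walk the chain from preflight, printing each Job name and where it runs.
walk_chain() {
  local version="$1" phase=preflight node="-" pin=k8s-node1 step name host
  while :; do
    if [ "$node" = "-" ]; then
      name="$(job_name "$phase" "$version")"
    else
      name="$(job_name "$phase" "$version" "$node")"
    fi
    if [ "$pin" = "-" ]; then host="unpinned"; else host="$pin"; fi
    printf '%s on %s\n' "$name" "$host"
    step="$(next_step "$phase" "$node")"
    [ -n "$step" ] || break
    read -r phase node pin <<<"$step"
  done
}

walk_chain 1.34.7
```

Because the name depends only on phase, version and node, re-rendering the same step always targets the same Job object, which is the property that lets `kubectl apply` reconcile instead of duplicating. No Job in the walk is pinned to the node it drains.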
--- a/stacks/k8s-version-upgrade/job-template.yaml
+++ b/stacks/k8s-version-upgrade/job-template.yaml
@@ -0,0 +1,88 @@
+# k8s-upgrade-chain Job template.
+#
+# Rendered by `envsubst` inside upgrade-step.sh (and the detection CronJob)
+# before `kubectl apply`. All ${VAR} placeholders are envsubst-side; this file
+# is NOT processed by Terraform.
+#
+# Required environment for envsubst:
+#   JOB_NAME            unique-per-(phase, target_version[, target_node])
+#   PHASE_NEXT          phase the Job runs (preflight|master|worker|postflight)
+#   TARGET_NODE_NEXT    node the Job operates on (empty for preflight/postflight)
+#   TARGET_VERSION      X.Y.Z
+#   TARGET_VERSION_LABEL  X-Y-Z (label-safe)
+#   KIND                patch | minor
+#   IMAGE               container image to run upgrade-step.sh
+#   SCHEDULING_BLOCK    YAML fragment with nodeSelector/tolerations (may be empty)
+#
+# Idempotency: name is deterministic per (phase, target_version[, target_node])
+# so `kubectl apply` reconciles to a single Job per run.
+apiVersion: batch/v1
+kind: Job
+metadata:
+  name: ${JOB_NAME}
+  namespace: k8s-upgrade
+  labels:
+    app: k8s-upgrade-chain
+    phase: ${PHASE_NEXT}
+    target-version: "${TARGET_VERSION_LABEL}"
+spec:
+  ttlSecondsAfterFinished: 604800   # 7 days for postmortem review
+  backoffLimit: 1
+  template:
+    metadata:
+      labels:
+        app: k8s-upgrade-chain
+        phase: ${PHASE_NEXT}
+    spec:
+      serviceAccountName: k8s-upgrade-job
+      restartPolicy: Never
+${SCHEDULING_BLOCK}
+      imagePullSecrets:
+        - name: registry-credentials
+      containers:
+        - name: upgrade-step
+          image: ${IMAGE}
+          env:
+            - name: PHASE
+              value: "${PHASE_NEXT}"
+            - name: TARGET_NODE
+              value: "${TARGET_NODE_NEXT}"
+            - name: TARGET_VERSION
+              value: "${TARGET_VERSION}"
+            - name: KIND
+              value: "${KIND}"
+            - name: IMAGE
+              value: "${IMAGE}"
+            - name: HOME
+              value: "/tmp"
+          command: ["/bin/bash", "/scripts/upgrade-step.sh"]
+          volumeMounts:
+            - name: creds
+              mountPath: /secrets/k8s-upgrade
+              readOnly: true
+            - name: scripts
+              mountPath: /scripts
+              readOnly: true
+            - name: template
+              mountPath: /template
+              readOnly: true
+          resources:
+            requests:
+              cpu: "100m"
+              memory: "256Mi"
+            limits:
+              memory: "512Mi"
+      volumes:
+        - name: creds
+          secret:
+            secretName: k8s-upgrade-creds
+            # 0444 so the non-root container can read; upgrade-step.sh copies
+            # the SSH key to /tmp/ssh_key with mode 0400 for openssh.
+            defaultMode: 0444
+        - name: scripts
+          configMap:
+            name: k8s-upgrade-scripts
+            defaultMode: 0755
+        - name: template
+          configMap:
+            name: k8s-upgrade-job-template
--- a/stacks/k8s-version-upgrade/main.tf
+++ b/stacks/k8s-version-upgrade/main.tf
@@ -1,44 +1,48 @@
 # k8s-version-upgrade — Automated K8s component (kubeadm/kubelet/kubectl) upgrade
 #
-# Detects new patch/minor versions via a weekly CronJob, then dispatches the
-# `k8s-version-upgrade` agent (infra/.claude/agents/k8s-version-upgrade.md)
-# through claude-agent-service for the actual rolling upgrade.
+# Architecture: detection CronJob → chain of small Jobs, one per phase. Each
+# Job's pod runs on a node that is NOT its drain target — eliminates the
+# self-preemption bug that killed the agent-based v1 (2026-05-11 incident).
+#
+# Chain (Job 0 → Job 6):
+#   preflight  (pinned: k8s-node1)
+#   master     (pinned: k8s-node1; drains k8s-master)
+#   worker     (pinned: k8s-node1; drains k8s-node4 → 3 → 2)
+#   worker     (pinned: k8s-master + control-plane toleration; drains k8s-node1 last)
+#   postflight (no pinning)
+#
+# Each phase Job's container runs scripts/upgrade-step.sh which:
+#   - dispatches on $PHASE
+#   - spawns the next Job via envsubst on job-template.yaml
+#   - uses deterministic naming (k8s-upgrade-${phase}-${target_version}[-${node}])
+#     so re-running on failure reconciles to a single Job per run.
 #
 # Reuse points:
-#   - claude-agent-service.claude-agent.svc:8080 — agent job runner
-#   - Vault secret/k8s-upgrade/* — operator populates ssh_key + slack_webhook
-#   - Prometheus + Pushgateway + Upgrade Gates alert group (in monitoring stack)
-#   - update_k8s.sh — library script the agent shells into nodes with
-#
-# Notes:
-#   - Schedule is Sun 12:00 UTC — well outside the kured Mon-Fri 02:00-06:00
-#     London window so OS reboots and K8s version rollouts can't overlap.
-#   - Patch detection uses `apt-cache madison kubeadm` on master via SSH.
-#     Minor detection probes the next-minor apt repo URL with HEAD.
+#   - claude-agent-service image (kubectl + ssh + jq + curl + envsubst)
+#   - Vault secret/k8s-upgrade/* (ssh_key, slack_webhook)
+#   - Prometheus + Pushgateway + Upgrade Gates alerts
+#   - default/backup-etcd CronJob (snapshot trigger)
+#   - infra/scripts/update_k8s.sh (per-node upgrade body)

 variable "schedule" {
  type    = string
-  default = "0 12 * * 0" # Sunday 12:00 UTC
+  default = "0 12 * * 0" # Sunday 12:00 UTC — outside kured window
 }

-# Toggle to suspend the detection CronJob without dropping the stack.
 variable "enabled" {
  type    = bool
  default = true
 }

-# Mirrors `local.image_tag` in stacks/claude-agent-service/main.tf — keep in
-# sync when the claude-agent-service image is rebuilt. Reused here because the
-# detection CronJob only needs kubectl, ssh-client, curl, jq — all of which
-# the claude-agent-service image already ships.
-variable "claude_agent_service_image_tag" {
+# Mirrors `local.image_tag` in stacks/claude-agent-service/main.tf — bump
+# in lockstep with claude-agent-service rebuilds. The image ships kubectl,
+# ssh-client, curl, jq, envsubst — everything the upgrade Jobs need.
+variable "image_tag" {
  type    = string
  default = "2fd7670d"
 }

-# If true, the CronJob runs the detection sequence but does NOT POST to
-# claude-agent-service. Used for Test 1 to confirm detection works without
-# firing a real upgrade.
+# When true, detection runs but does NOT spawn the preflight Job.
 variable "detection_dry_run" {
  type    = bool
  default = false
@@ -46,9 +50,9 @@ variable "detection_dry_run" {

 locals {
  namespace = "k8s-upgrade"
-  ca_image  = "forgejo.viktorbarzin.me/viktor/claude-agent-service:${var.claude_agent_service_image_tag}"
+  image     = "forgejo.viktorbarzin.me/viktor/claude-agent-service:${var.image_tag}"
  labels = {
-    app = "k8s-version-check"
+    app = "k8s-version-upgrade"
  }
 }

@@ -62,21 +66,19 @@ resource "kubernetes_namespace" "k8s_upgrade" {
    }
  }
  lifecycle {
-    # KYVERNO_LIFECYCLE_V1: goldilocks-vpa-auto-mode ClusterPolicy stamps this label on every namespace
+    # KYVERNO_LIFECYCLE_V1: goldilocks-vpa-auto-mode ClusterPolicy stamps this label
    ignore_changes = [metadata[0].labels["goldilocks.fairwinds.com/vpa-update-mode"]]
  }
 }

-# --- ExternalSecret: ssh_key + slack_webhook + agent-service bearer ---
+# --- ExternalSecret: SSH key + Slack webhook ---
 #
 # Operator populates Vault `secret/k8s-upgrade/` with:
-#   - ssh_key         (PEM-encoded ed25519 private key)
-#   - ssh_key_pub     (the matching public key — distributed to nodes' authorized_keys)
-#   - slack_webhook   (Slack incoming-webhook URL, separate channel from kured for clean alerting)
+#   - ssh_key       (ed25519 PRIVATE key, used to SSH wizard@<node> from Jobs)
+#   - ssh_key_pub   (matching public key, deployed to nodes' authorized_keys)
+#   - slack_webhook (incoming-webhook URL)
 #
-# The claude-agent-service bearer token comes from secret/claude-agent-service
-# (reused — no parallel token needed).
-
+# No claude-agent bearer needed — the chain no longer POSTs to that service.
 resource "kubernetes_manifest" "external_secret" {
  manifest = {
    apiVersion = "external-secrets.io/v1beta1"
@@ -109,191 +111,157 @@ resource "kubernetes_manifest" "external_secret" {
            property = "slack_webhook"
          }
        },
-        {
-          secretKey = "api_bearer_token"
-          remoteRef = {
-            key      = "claude-agent-service"
-            property = "api_bearer_token"
-          }
-        },
      ]
    }
  }
 }

-# --- ServiceAccount + RBAC for the detection CronJob ---
-
-resource "kubernetes_service_account" "k8s_version_check" {
-  metadata {
-    name      = "k8s-version-check"
-    namespace = kubernetes_namespace.k8s_upgrade.metadata[0].name
-  }
-}
-
-# Cluster-wide read on nodes (for kubeletVersion comparison)
-resource "kubernetes_cluster_role" "k8s_version_check" {
-  metadata {
-    name = "k8s-version-check"
-  }
-  rule {
-    api_groups = [""]
-    resources  = ["nodes"]
-    verbs      = ["get", "list"]
-  }
-}
-
-resource "kubernetes_cluster_role_binding" "k8s_version_check" {
-  metadata {
-    name = "k8s-version-check"
-  }
-  role_ref {
-    api_group = "rbac.authorization.k8s.io"
-    kind      = "ClusterRole"
-    name      = kubernetes_cluster_role.k8s_version_check.metadata[0].name
-  }
-  subject {
-    kind      = "ServiceAccount"
-    name      = kubernetes_service_account.k8s_version_check.metadata[0].name
-    namespace = kubernetes_namespace.k8s_upgrade.metadata[0].name
-  }
-}
-
-# Namespace-scoped: detection CronJob reads its own creds Secret.
-resource "kubernetes_role" "k8s_version_check_secrets" {
-  metadata {
-    name      = "k8s-version-check-secrets"
-    namespace = kubernetes_namespace.k8s_upgrade.metadata[0].name
-  }
-  rule {
-    api_groups     = [""]
-    resources      = ["secrets"]
-    resource_names = ["k8s-upgrade-creds"]
-    verbs          = ["get"]
-  }
-}
-
-resource "kubernetes_role_binding" "k8s_version_check_secrets" {
-  metadata {
-    name      = "k8s-version-check-secrets"
-    namespace = kubernetes_namespace.k8s_upgrade.metadata[0].name
-  }
-  role_ref {
-    api_group = "rbac.authorization.k8s.io"
-    kind      = "Role"
-    name      = kubernetes_role.k8s_version_check_secrets.metadata[0].name
-  }
-  subject {
-    kind      = "ServiceAccount"
-    name      = kubernetes_service_account.k8s_version_check.metadata[0].name
-    namespace = kubernetes_namespace.k8s_upgrade.metadata[0].name
-  }
-}
-
-# --- Cross-namespace RBAC: claude-agent SA reads k8s-upgrade-creds + annotates ns ---
+# --- Unified ServiceAccount + RBAC ---
 #
-# The k8s-version-upgrade agent runs inside the claude-agent-service pod (SA
-# `claude-agent` in `claude-agent` ns). It needs:
-#   - GET on this namespace's k8s-upgrade-creds Secret (to fetch ssh_key + slack)
-#   - PATCH on the k8s-upgrade Namespace annotations (in-flight marker)
+# One SA serves BOTH the detection CronJob and every phase Job:
+#   - detection CronJob: needs nodes:get/list + secrets:get + jobs:create
+#     (to spawn Job 0 = preflight)
+#   - phase Jobs: same + pods/eviction:create + pods:delete + namespaces:patch
+#
+# Cluster-scoped because the chain spans the whole cluster (drain works on
+# any node, and the preflight Job creates a Job in `default` ns from
+# `cronjob/backup-etcd`).

-resource "kubernetes_role" "claude_agent_reads_creds" {
+resource "kubernetes_service_account" "k8s_upgrade_job" {
  metadata {
-    name      = "claude-agent-reads-creds"
+    name      = "k8s-upgrade-job"
    namespace = kubernetes_namespace.k8s_upgrade.metadata[0].name
  }
-  rule {
-    api_groups     = [""]
-    resources      = ["secrets"]
-    resource_names = ["k8s-upgrade-creds"]
-    verbs          = ["get"]
-  }
 }

-resource "kubernetes_role_binding" "claude_agent_reads_creds" {
+resource "kubernetes_cluster_role" "k8s_upgrade_job" {
  metadata {
-    name      = "claude-agent-reads-creds"
-    namespace = kubernetes_namespace.k8s_upgrade.metadata[0].name
+    name = "k8s-upgrade-job"
  }
-  role_ref {
-    api_group = "rbac.authorization.k8s.io"
-    kind      = "Role"
-    name      = kubernetes_role.claude_agent_reads_creds.metadata[0].name
-  }
-  subject {
-    kind      = "ServiceAccount"
-    name      = "claude-agent"
-    namespace = "claude-agent"
-  }
-}
-
-# The base claude-agent ClusterRole grants get/list/watch on most resources
-# but not the mutating verbs the upgrade agent needs. Rather than fork the
-# upstream stack, we add a sibling ClusterRole here scoped to exactly the
-# verbs+resources required:
-#   - patch on namespace k8s-upgrade (in-flight annotation)
-#   - create on batch/jobs (trigger etcd snapshot Job from cronjob/backup-etcd)
-#   - patch on nodes (cordon/uncordon — drain needs this)
-#   - create on pods/eviction (drain evicts pods)
-resource "kubernetes_cluster_role" "claude_agent_upgrade_ops" {
-  metadata {
-    name = "claude-agent-upgrade-ops"
-  }
-  # Annotate the k8s-upgrade namespace
-  rule {
-    api_groups     = [""]
-    resources      = ["namespaces"]
-    resource_names = ["k8s-upgrade"]
-    verbs          = ["patch", "update"]
-  }
-  # Trigger etcd snapshot Jobs (from cronjob/backup-etcd in default ns).
-  # Cluster-scoped because we may also create test Jobs in k8s-upgrade ns.
-  rule {
-    api_groups = ["batch"]
-    resources  = ["jobs"]
-    verbs      = ["create", "delete"]
-  }
-  # Cordon / uncordon nodes
+  # Nodes: read (version comparison + readiness check) and patch (cordon/uncordon for drain)
  rule {
    api_groups = [""]
    resources  = ["nodes"]
-    verbs      = ["patch", "update"]
+    verbs      = ["get", "list", "patch", "update"]
  }
-  # Drain (evict pods)
+  # Drain — evict pods
  rule {
    api_groups = [""]
    resources  = ["pods/eviction"]
    verbs      = ["create"]
  }
-  # Delete pods stuck during drain (sometimes evict isn't enough)
+  # Drain fallback — direct delete (predrain_unstick bypasses PDBs)
  rule {
    api_groups = [""]
    resources  = ["pods"]
-    verbs      = ["delete"]
+    verbs      = ["get", "list", "delete"]
+  }
+  # Read PDBs to find drain-blocking pods
+  rule {
+    api_groups = ["policy"]
+    resources  = ["poddisruptionbudgets"]
+    verbs      = ["get", "list"]
+  }
+  # Chain dispatch: create the next Job; reconcile via apply on retry.
+  # Also covers `default` ns, where preflight creates the etcd-snapshot Job from cronjob/backup-etcd.
+  rule {
+    api_groups = ["batch"]
+    resources  = ["jobs"]
+    verbs      = ["create", "get", "list", "delete", "patch", "watch"]
+  }
+  # Pull CronJob spec for `kubectl create job --from=cronjob/backup-etcd`
+  rule {
+    api_groups = ["batch"]
+    resources  = ["cronjobs"]
+    verbs      = ["get", "list"]
+  }
+  # Annotate the k8s-upgrade namespace (in-flight marker + snapshot path)
+  rule {
+    api_groups     = [""]
+    resources      = ["namespaces"]
+    resource_names = [local.namespace]
+    verbs          = ["get", "patch", "update"]
  }
 }

-resource "kubernetes_cluster_role_binding" "claude_agent_upgrade_ops" {
+resource "kubernetes_cluster_role_binding" "k8s_upgrade_job" {
  metadata {
-    name = "claude-agent-upgrade-ops"
+    name = "k8s-upgrade-job"
  }
  role_ref {
    api_group = "rbac.authorization.k8s.io"
    kind      = "ClusterRole"
-    name      = kubernetes_cluster_role.claude_agent_upgrade_ops.metadata[0].name
+    name      = kubernetes_cluster_role.k8s_upgrade_job.metadata[0].name
  }
  subject {
    kind      = "ServiceAccount"
-    name      = "claude-agent"
-    namespace = "claude-agent"
+    name      = kubernetes_service_account.k8s_upgrade_job.metadata[0].name
+    namespace = kubernetes_namespace.k8s_upgrade.metadata[0].name
+  }
+}
+
+# Namespaced: read the credentials Secret in k8s-upgrade (SSH key + Slack URL)
+resource "kubernetes_role" "k8s_upgrade_job_ns" {
+  metadata {
+    name      = "k8s-upgrade-job-ns"
+    namespace = kubernetes_namespace.k8s_upgrade.metadata[0].name
+  }
+  rule {
+    api_groups     = [""]
+    resources      = ["secrets"]
+    resource_names = ["k8s-upgrade-creds"]
+    verbs          = ["get"]
+  }
+}
+
+resource "kubernetes_role_binding" "k8s_upgrade_job_ns" {
+  metadata {
+    name      = "k8s-upgrade-job-ns"
+    namespace = kubernetes_namespace.k8s_upgrade.metadata[0].name
+  }
+  role_ref {
+    api_group = "rbac.authorization.k8s.io"
+    kind      = "Role"
+    name      = kubernetes_role.k8s_upgrade_job_ns.metadata[0].name
+  }
+  subject {
+    kind      = "ServiceAccount"
+    name      = kubernetes_service_account.k8s_upgrade_job.metadata[0].name
+    namespace = kubernetes_namespace.k8s_upgrade.metadata[0].name
+  }
+}
+
+# --- ConfigMaps: scripts + Job template ---
+
+resource "kubernetes_config_map" "k8s_upgrade_scripts" {
+  metadata {
+    name      = "k8s-upgrade-scripts"
+    namespace = kubernetes_namespace.k8s_upgrade.metadata[0].name
+    labels    = local.labels
+  }
+  data = {
+    "upgrade-step.sh" = file("${path.module}/scripts/upgrade-step.sh")
+    "update_k8s.sh"   = file("${path.module}/../../scripts/update_k8s.sh")
+  }
+}
+
+resource "kubernetes_config_map" "k8s_upgrade_job_template" {
+  metadata {
+    name      = "k8s-upgrade-job-template"
+    namespace = kubernetes_namespace.k8s_upgrade.metadata[0].name
+    labels    = local.labels
+  }
+  data = {
+    "job-template.yaml" = file("${path.module}/job-template.yaml")
  }
 }

 # --- Detection CronJob ---
 #
-# Weekly: compares running cluster version against latest available patch
-# (apt-cache madison kubeadm on master) and latest available minor (HEAD on
-# next-minor pkgs.k8s.io repo). When a target is detected, POSTs to
-# claude-agent-service to kick the upgrade agent.
+# Probes for available patch/minor targets weekly. When one is found, renders
+# Job 0 (preflight) from the same job-template the chain uses. The CronJob no
+# longer POSTs to claude-agent-service; the whole pipeline now runs inside the
+# cluster via Job-chaining.

 resource "kubernetes_cron_job_v1" "k8s_version_check" {
  metadata {
@@ -320,33 +288,36 @@ resource "kubernetes_cron_job_v1" "k8s_version_check" {
            labels = local.labels
          }
          spec {
-            service_account_name = kubernetes_service_account.k8s_version_check.metadata[0].name
+            service_account_name = kubernetes_service_account.k8s_upgrade_job.metadata[0].name
            restart_policy       = "Never"
            image_pull_secrets {
              name = "registry-credentials"
            }
+            volume {
+              name = "creds"
+              secret {
+                secret_name = "k8s-upgrade-creds"
+                # 0444 — non-root container needs read; SSH key gets re-installed
+                # with mode 0400 in the inline command before any ssh call.
+                default_mode = "0444"
+              }
+            }
+            volume {
+              name = "template"
+              config_map {
+                name = kubernetes_config_map.k8s_upgrade_job_template.metadata[0].name
+              }
+            }
            container {
              name  = "version-check"
-              image = local.ca_image
+              image = local.image
              command = ["/bin/bash", "-c", <<-EOT
                set -euo pipefail
                echo "==> k8s-version-check ($(date -u +%FT%TZ))"

-                # 1. Load SSH key from K8s Secret
-                mkdir -p /tmp
-                /usr/local/bin/kubectl get secret k8s-upgrade-creds \
-                  -o jsonpath='{.data.ssh_key}' | base64 -d > /tmp/k8s-upgrade-ssh-key
-                chmod 400 /tmp/k8s-upgrade-ssh-key
-
-                SLACK=$(/usr/local/bin/kubectl get secret k8s-upgrade-creds \
-                  -o jsonpath='{.data.slack_webhook}' | base64 -d)
-
-                AGENT_TOKEN=$(/usr/local/bin/kubectl get secret k8s-upgrade-creds \
-                  -o jsonpath='{.data.api_bearer_token}' | base64 -d)
-
-                SSH="ssh -i /tmp/k8s-upgrade-ssh-key \
-                  -o StrictHostKeyChecking=accept-new \
-                  -o UserKnownHostsFile=/tmp/known_hosts"
+                SLACK=$(cat /secrets/k8s-upgrade/slack_webhook)
+                install -m 0400 /secrets/k8s-upgrade/ssh_key /tmp/ssh_key
+                SSH="ssh -i /tmp/ssh_key -o StrictHostKeyChecking=accept-new -o UserKnownHostsFile=/tmp/known_hosts -o ConnectTimeout=10"

                slack() {
                  curl -sS -X POST -H 'Content-Type: application/json' \
@@ -354,17 +325,13 @@ resource "kubernetes_cron_job_v1" "k8s_version_check" {
                    "$SLACK" || true
                }

-                # 2. Detect running version
+                # 1. Detect running version
                RUNNING=$(/usr/local/bin/kubectl get nodes \
-                  -o jsonpath='{.items[0].status.nodeInfo.kubeletVersion}' \
-                  | tr -d v)
+                  -o jsonpath='{.items[0].status.nodeInfo.kubeletVersion}' | tr -d v)
                RUNNING_MINOR=$(echo "$RUNNING" | awk -F. '{print $1"."$2}')
                echo "Running version: v$RUNNING (minor $RUNNING_MINOR)"

-                # 3. Detect highest available patch within the running minor track.
-                # Refresh the local apt cache first — without this, a newly-published
-                # patch won't show up via `apt-cache madison` until something else
-                # triggers an `apt-get update`.
+                # 2. Latest patch within current minor (refresh master's apt cache)
                LATEST_PATCH=$($SSH wizard@k8s-master \
                  "sudo apt-get update -qq -o Dir::Etc::sourcelist='sources.list.d/kubernetes.list' -o Dir::Etc::sourceparts='-' -o APT::Get::List-Cleanup='0' >/dev/null 2>&1 ; \
                   apt-cache madison kubeadm 2>/dev/null \
@@ -372,9 +339,9 @@ resource "kubernetes_cron_job_v1" "k8s_version_check" {
                    | sed 's/-.*//' \
                    | grep '^$RUNNING_MINOR\\.' \
                    | sort -V | tail -1" || echo "")
-                echo "Latest patch (apt): v$LATEST_PATCH"
+                echo "Latest patch: v$LATEST_PATCH"

-                # 4. Detect next available minor by probing the apt repo URL.
+                # 3. Next-minor probe
                NEXT_MINOR_NUM=$(( $(echo "$RUNNING_MINOR" | cut -d. -f2) + 1 ))
                NEXT_MINOR="1.$NEXT_MINOR_NUM"
                NEXT_MINOR_AVAILABLE="no"
@@ -385,14 +352,13 @@ resource "kubernetes_cron_job_v1" "k8s_version_check" {
                fi
                echo "Next minor v$NEXT_MINOR available: $NEXT_MINOR_AVAILABLE"

-                # 5. Decide what to do
+                # 4. Choose target
                TARGET=""
                KIND=""
                if [ -n "$LATEST_PATCH" ] && [ "$LATEST_PATCH" != "$RUNNING" ]; then
                  TARGET="$LATEST_PATCH"
                  KIND="patch"
                elif [ "$NEXT_MINOR_AVAILABLE" = "yes" ]; then
-                  # Probe the minor track to get its latest patch.
                  NEXT_MINOR_PATCH=$($SSH wizard@k8s-master \
                    "curl -sf 'https://pkgs.k8s.io/core:/stable:/v$NEXT_MINOR/deb/Packages' \
                      | grep -oE 'Version: [0-9.-]+' \
@@ -404,7 +370,7 @@ resource "kubernetes_cron_job_v1" "k8s_version_check" {
                  fi
                fi

-                # 6. Push the discovery metric to Pushgateway
+                # 5. Pushgateway discovery metric
                PG='http://prometheus-prometheus-pushgateway.monitoring:9091/metrics/job/k8s-version-check'
                {
                  echo "# TYPE k8s_upgrade_available gauge"
@@ -417,64 +383,61 @@ resource "kubernetes_cron_job_v1" "k8s_version_check" {
                  echo "k8s_version_check_last_run_timestamp $(date +%s)"
                } | curl -sS --data-binary @- "$PG" || echo "warn: pushgateway push failed"

-                # 7. Decide whether to dispatch
+                # 6. Decide whether to spawn Job 0
                if [ -z "$TARGET" ]; then
-                  echo "No upgrade needed (running=$RUNNING, latest_patch=$LATEST_PATCH, next_minor_available=$NEXT_MINOR_AVAILABLE)"
+                  echo "No upgrade needed"
                  exit 0
                fi

                slack "K8s upgrade available: v$RUNNING → v$TARGET ($KIND)"

-                # DRY_RUN_OVERRIDE wins over DRY_RUN — but a Job copied from
-                # this CronJob can't add new env vars (spec is immutable). The
-                # operator path for "trigger detection without dispatch" is
-                # toggling the CronJob's `var.detection_dry_run` then applying.
-                # Documented in the runbook.
-                EFFECTIVE_DRY_RUN="$${DRY_RUN_OVERRIDE:-$DRY_RUN}"
-                if [ "$EFFECTIVE_DRY_RUN" = "true" ]; then
-                  echo "dry_run=true — not POSTing to claude-agent-service"
-                  slack "DRY_RUN — skipping agent dispatch"
+                if [ "$DRY_RUN" = "true" ]; then
+                  slack "DRY_RUN — not spawning preflight Job"
                  exit 0
                fi

-                # 8. POST to claude-agent-service
-                PAYLOAD=$(jq -nc \
-                  --arg target "$TARGET" \
-                  --arg kind "$KIND" \
-                  '{
-                    prompt: ("Run the k8s-version-upgrade agent. Inputs: " + ({target_version: $target, kind: $kind, dry_run: false, stages: "all"} | tostring)),
-                    agent: ".claude/agents/k8s-version-upgrade",
-                    max_budget_usd: 30
-                  }')
+                # 7. Spawn Job 0 (preflight) via envsubst on the job-template
+                #    Idempotency: deterministic name reconciles via `apply`.
+                JOB_NAME="k8s-upgrade-preflight-$${TARGET//./-}"

-                echo "Dispatching agent: $PAYLOAD"
-                RESP=$(curl -sS -w '\n%%{http_code}' -X POST \
-                  -H "Authorization: Bearer $AGENT_TOKEN" \
-                  -H 'Content-Type: application/json' \
-                  -d "$PAYLOAD" \
-                  http://claude-agent-service.claude-agent.svc.cluster.local:8080/execute)
-                CODE=$(printf '%s' "$RESP" | tail -n1)
-                BODY=$(printf '%s' "$RESP" | sed '$d')
-
-                if [ "$CODE" = "200" ] || [ "$CODE" = "202" ]; then
-                  JOB_ID=$(printf '%s' "$BODY" | jq -r '.job_id // .id // "unknown"')
-                  slack "Agent dispatched: job=$JOB_ID (target=v$TARGET kind=$KIND)"
-                  echo "OK — job=$JOB_ID"
-                else
-                  slack "ERROR dispatching agent: HTTP $CODE — $BODY"
-                  echo "dispatch failed: HTTP $CODE — $BODY" >&2
-                  exit 1
+                if /usr/local/bin/kubectl -n k8s-upgrade get job "$JOB_NAME" >/dev/null 2>&1; then
+                  slack "Preflight Job $JOB_NAME already exists (rerunning detection mid-flight?)"
+                  exit 0
                fi
+
+                export JOB_NAME PHASE_NEXT=preflight TARGET_NODE_NEXT="" \
+                       TARGET_VERSION="$TARGET" TARGET_VERSION_LABEL="$${TARGET//./-}" \
+                       KIND="$KIND" IMAGE="$${IMAGE}" \
+                       SCHEDULING_BLOCK=$'      nodeSelector:\n        kubernetes.io/hostname: k8s-node1'
+
+                envsubst < /template/job-template.yaml \
+                  | /usr/local/bin/kubectl apply -f -
+
+                slack "Spawned $JOB_NAME (target=v$TARGET kind=$KIND)"
              EOT
              ]
              env {
                name  = "DRY_RUN"
                value = tostring(var.detection_dry_run)
              }
+              env {
+                name  = "IMAGE"
+                value = local.image
+              }
              env {
                name  = "HOME"
                value = "/tmp"
              }
+              volume_mount {
+                name       = "creds"
+                mount_path = "/secrets/k8s-upgrade"
+                read_only  = true
+              }
+              volume_mount {
+                name       = "template"
+                mount_path = "/template"
+                read_only  = true
+              }
              resources {
                requests = {
                  cpu    = "50m"
--- a/stacks/k8s-version-upgrade/scripts/upgrade-step.sh
+++ b/stacks/k8s-version-upgrade/scripts/upgrade-step.sh
@@ -0,0 +1,438 @@
+#!/usr/bin/env bash
+#
+# Universal upgrade-step body. Each Job in the k8s-version-upgrade chain runs
+# this once, dispatching on $PHASE. On success it computes the next phase and
+# spawns the next Job. The chain is:
+#
+#   preflight  (run on k8s-node1)
+#     ↓
+#   master     (drains k8s-master; run on k8s-node1)
+#     ↓
+#   worker k8s-node4   (run on k8s-node1)
+#     ↓
+#   worker k8s-node3   (run on k8s-node1)
+#     ↓
+#   worker k8s-node2   (run on k8s-node1)
+#     ↓
+#   worker k8s-node1   (drains k8s-node1; run on k8s-master with control-plane toleration)
+#     ↓
+#   postflight (no node pinning)
+#
+# k8s-node1 hosts every Job except the one that drains k8s-node1 itself.
+# k8s-node1 is therefore upgraded LAST.
+#
+# Required env vars (set on the Job pod by job-template.yaml):
+#   PHASE              preflight | master | worker | postflight
+#   TARGET_NODE        k8s-master | k8s-nodeN  (empty for preflight/postflight)
+#   TARGET_VERSION     X.Y.Z
+#   KIND               patch | minor
+#   IMAGE              container image to use for next Job in the chain
+
+set -euo pipefail
+
+NS=k8s-upgrade
+SSH_KEY=/secrets/k8s-upgrade/ssh_key
+SLACK_FILE=/secrets/k8s-upgrade/slack_webhook
+PG='http://prometheus-prometheus-pushgateway.monitoring:9091/metrics/job/k8s-version-upgrade'
+PROM='http://prometheus-server.monitoring.svc.cluster.local:80'
+KUBECTL=kubectl
+JOB_TEMPLATE=/template/job-template.yaml
+UPDATE_K8S_SH=/scripts/update_k8s.sh
+
+# openssh refuses private keys with loose permissions, and the Secret is
+# mounted 0444 so the non-root container can read it; copy it out as 0400.
+install -m 0400 "$SSH_KEY" /tmp/ssh_key
+SSH_KEY=/tmp/ssh_key
+
+SSH_OPTS=(-i "$SSH_KEY"
+          -o StrictHostKeyChecking=accept-new
+          -o UserKnownHostsFile=/tmp/known_hosts
+          -o ConnectTimeout=10)
+
+SLACK_URL="$(cat "$SLACK_FILE")"
+
+slack() {
+  local msg=${1//\\n/$'\n'}  # callers embed a literal \n; turn it into a real newline for Slack
+  curl -sS -X POST -H 'Content-Type: application/json' \
+    --data "$(jq -nc --arg t "[k8s-upgrade-${PHASE}${TARGET_NODE:+:$TARGET_NODE}] $msg" \
+              '{text: $t}')" \
+    "$SLACK_URL" >/dev/null || echo "warn: slack post failed"
+}
+
+push() {
+  printf '# TYPE %s gauge\n%s %s\n' "$1" "$1" "$2" \
+    | curl -sS --data-binary @- "$PG" || echo "warn: pushgateway push failed"
+}
+
+halt_on_alert_query() {
+  local extra_ignore="${1:-}"
+  local regex='^(Watchdog|RebootRequired|KuredNodeWasNotDrained|InfoInhibitor'
+  [ -n "$extra_ignore" ] && regex="$regex|$extra_ignore"
+  regex="$regex)$"
+
+  curl -sf "$PROM/api/v1/alerts" \
+    | jq -r '.data.alerts[] | select(.state == "firing") | .labels.alertname' \
+    | { grep -vE "$regex" || true; } | sort -u  # grep exits 1 when nothing is firing; don't trip pipefail
+}
+
+wait_for_node_ready() {
+  local node="$1" want_version="$2" deadline=$(( $(date +%s) + 900 ))  # 15 min
+  while [ "$(date +%s)" -lt "$deadline" ]; do
+    local status kubelet
+    status=$($KUBECTL get node "$node" -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}' 2>/dev/null || true)
+    kubelet=$($KUBECTL get node "$node" -o jsonpath='{.status.nodeInfo.kubeletVersion}' 2>/dev/null | tr -d v || true)
+    if [ "$status" = "True" ] && [ "$kubelet" = "$want_version" ]; then
+      return 0
+    fi
+    sleep 15
+  done
+  return 1
+}
+
+# Pre-drain: find pods on $node whose PDB has zero disruptionsAllowed and
+# delete them directly. Drain's eviction API respects PDBs and will loop
+# forever on single-replica deployments with `minAvailable: 1` — common
+# pattern on this cluster (e.g. Anubis instances default to replicas=1). A
+# direct delete bypasses eviction; the parent Deployment recreates the pod
+# elsewhere (the node is already cordoned by drain).
+predrain_unstick() {
+  local node="$1"
+  $KUBECTL get pdb -A -o json | jq -r '
+    .items[]
+    | select(.status.disruptionsAllowed == 0)
+    | "\(.metadata.namespace) \((.spec.selector.matchLabels // {}) | to_entries | map("\(.key)=\(.value)") | join(","))"
+  ' | while read -r ns selector; do
+    [ -z "$selector" ] && continue
+    $KUBECTL -n "$ns" get pods --field-selector "spec.nodeName=$node,status.phase=Running" \
+      -l "$selector" -o name 2>/dev/null \
+      | while read -r pod; do
+          echo "predrain_unstick: deleting PDB-blocked $ns/$pod (drain would loop on it)"
+          $KUBECTL -n "$ns" delete "$pod" --wait=false || true
+        done
+  done
+}
+
+# Drain wrapper: kick predrain_unstick before drain, then again every 60s in
+# the background while drain runs (in case new pods land mid-drain). Drain
+# exits when the node has no non-daemonset workload.
+drain_node() {
+  local node="$1"
+  $KUBECTL cordon "$node"   # cordon first so pods deleted below cannot be rescheduled here
+  predrain_unstick "$node"
+  ( while kill -0 $$ 2>/dev/null; do sleep 60; predrain_unstick "$node"; done ) &
+  local watcher=$!
+  trap "kill $watcher 2>/dev/null || true" EXIT
+  $KUBECTL drain "$node" --ignore-daemonsets --delete-emptydir-data --force --grace-period=300
+  kill $watcher 2>/dev/null || true
+  trap - EXIT
+}
+
+# ---------------------------------------------------------------------------
+# Chain definition — what comes after the current phase
+# ---------------------------------------------------------------------------
+
+NEXT_PHASE=""
+NEXT_TARGET_NODE=""
+NEXT_RUN_ON=""
+
+case "${PHASE}:${TARGET_NODE:-}" in
+  preflight:)
+    NEXT_PHASE=master
+    NEXT_RUN_ON=k8s-node1 ;;
+  master:)
+    NEXT_PHASE=worker; NEXT_TARGET_NODE=k8s-node4
+    NEXT_RUN_ON=k8s-node1 ;;
+  worker:k8s-node4)
+    NEXT_PHASE=worker; NEXT_TARGET_NODE=k8s-node3
+    NEXT_RUN_ON=k8s-node1 ;;
+  worker:k8s-node3)
+    NEXT_PHASE=worker; NEXT_TARGET_NODE=k8s-node2
+    NEXT_RUN_ON=k8s-node1 ;;
+  worker:k8s-node2)
+    NEXT_PHASE=worker; NEXT_TARGET_NODE=k8s-node1
+    NEXT_RUN_ON=k8s-master ;;  # control-plane toleration required
+  worker:k8s-node1)
+    NEXT_PHASE=postflight
+    NEXT_RUN_ON="" ;;          # no node pinning for postflight
+  postflight:)
+    NEXT_PHASE="" ;;           # end of chain
+  *)
+    echo "ERROR: unknown phase/target combo: ${PHASE}/${TARGET_NODE:-}" >&2
+    exit 2 ;;
+esac
+
+spawn_next() {
+  [ -z "$NEXT_PHASE" ] && { echo "End of chain."; return 0; }
+
+  local job_name="k8s-upgrade-${NEXT_PHASE}-${TARGET_VERSION//./-}"
+  [ -n "${NEXT_TARGET_NODE:-}" ] && job_name="${job_name}-${NEXT_TARGET_NODE}"
+
+  if $KUBECTL -n "$NS" get job "$job_name" >/dev/null 2>&1; then
+    echo "Next Job $job_name already exists; idempotent skip."
+    return 0
+  fi
+
+  local scheduling_block=""
+  case "${NEXT_RUN_ON:-}" in
+    k8s-master)
+      scheduling_block=$'      nodeSelector:\n        kubernetes.io/hostname: k8s-master\n      tolerations:\n        - key: node-role.kubernetes.io/control-plane\n          operator: Exists\n          effect: NoSchedule' ;;
+    "")
+      scheduling_block="" ;;
+    *)
+      scheduling_block=$'      nodeSelector:\n        kubernetes.io/hostname: '"$NEXT_RUN_ON" ;;
+  esac
+
+  export JOB_NAME="$job_name"
+  export PHASE_NEXT="$NEXT_PHASE"
+  export TARGET_NODE_NEXT="${NEXT_TARGET_NODE:-}"
+  export TARGET_VERSION_LABEL="${TARGET_VERSION//./-}"
+  export SCHEDULING_BLOCK="$scheduling_block"
+  # TARGET_VERSION, KIND, IMAGE inherited from current env
+
+  echo "Spawning next Job: $job_name (phase=$NEXT_PHASE target=${NEXT_TARGET_NODE:-} run_on=${NEXT_RUN_ON:-anywhere})"
+  envsubst <"$JOB_TEMPLATE" | $KUBECTL apply -f -
+}
+
+# ---------------------------------------------------------------------------
+# Phase bodies
+# ---------------------------------------------------------------------------
+
+phase_preflight() {
+  slack "Starting preflight (target v$TARGET_VERSION, kind=$KIND)"
+
+  # 1. All nodes Ready + no pressure
+  local bad_nodes
+  bad_nodes=$($KUBECTL get nodes -o json | jq -r '
+    .items[]
+    | select(
+        (.status.conditions[] | select(.type=="Ready").status) != "True"
+        or (.status.conditions[] | select(.type=="MemoryPressure").status) == "True"
+        or (.status.conditions[] | select(.type=="DiskPressure").status) == "True")
+    | .metadata.name')
+  if [ -n "$bad_nodes" ]; then
+    slack "ABORT preflight — nodes unhealthy: $bad_nodes"
+    exit 1
+  fi
+
+  # 2. Halt-on-alert
+  local alerts
+  alerts=$(halt_on_alert_query)
+  if [ -n "$alerts" ]; then
+    slack "ABORT preflight — firing alerts:\n$alerts"
+    exit 1
+  fi
+
+  # 3. 24h-quiet baseline
+  local recent=0
+  while IFS= read -r ts; do
+    [ -z "$ts" ] && continue
+    local diff=$(( $(date +%s) - $(date -d "$ts" +%s) ))
+    if [ "$diff" -lt 86400 ]; then recent=1; break; fi
+  done < <($KUBECTL get nodes -o jsonpath='{range .items[*]}{range .status.conditions[?(@.type=="Ready")]}{.lastTransitionTime}{"\n"}{end}{end}')
+  if [ "$recent" -eq 1 ]; then
+    slack "ABORT preflight — node transitioned Ready <24h ago (soak window)"
+    exit 1
+  fi
+
+  # 4. kubeadm upgrade plan matches target
+  local plan_target
+  plan_target=$(ssh "${SSH_OPTS[@]}" wizard@k8s-master 'sudo kubeadm upgrade plan' \
+    | grep -oE 'kubeadm upgrade apply v[0-9]+\.[0-9]+\.[0-9]+' \
+    | grep -oE 'v[0-9]+\.[0-9]+\.[0-9]+' | head -1 | tr -d v || true)  # empty result falls through to the ABORT below
+  if [ "$plan_target" != "$TARGET_VERSION" ]; then
+    slack "ABORT preflight — kubeadm plan target $plan_target ≠ requested $TARGET_VERSION"
+    exit 1
+  fi
+
+  # 5. Push in-flight + started_timestamp metrics + ns annotations
+  $KUBECTL annotate ns "$NS" \
+    "viktorbarzin.me/k8s-upgrade-in-flight=$(date -u +%FT%TZ)" \
+    "viktorbarzin.me/k8s-upgrade-target=$TARGET_VERSION" \
+    --overwrite
+  push k8s_upgrade_in_flight 1
+  push k8s_upgrade_started_timestamp "$(date +%s)"
+  push k8s_upgrade_snapshot_taken 0
+
+  # 6. Trigger backup-etcd Job, wait, verify size
+  local snap_job="pre-upgrade-etcd-${TARGET_VERSION//./-}-$(date +%s)"
+  $KUBECTL -n default create job --from=cronjob/backup-etcd "$snap_job"
+  if ! $KUBECTL -n default wait --for=condition=complete --timeout=600s "job/$snap_job"; then
+    $KUBECTL -n default describe "job/$snap_job" | tail -30
+    slack "ABORT preflight — etcd snapshot Job did not complete in 10 min"
+    exit 1
+  fi
+  local snap_log size snap_file
+  snap_log=$($KUBECTL -n default logs "job/$snap_job" -c backup-manage --tail=20 || \
+             $KUBECTL -n default logs "job/$snap_job" --tail=20)
+  size=$(echo "$snap_log" | grep -E '^Backup done:' | grep -oE '\([0-9]+ bytes\)' | grep -oE '[0-9]+' || true)
+  snap_file=$(echo "$snap_log" | grep -E '^Backup done:' | awk '{print $3}' || true)
+  if [ -z "$size" ] || [ "$size" -lt 1024 ]; then
+    slack "ABORT preflight — etcd snapshot empty (size='${size:-unknown}')"
+    exit 1
+  fi
+  $KUBECTL annotate ns "$NS" \
+    "viktorbarzin.me/k8s-upgrade-snapshot-path=nfs://192.168.1.127:/srv/nfs/etcd-backup/$snap_file" \
+    --overwrite
+  push k8s_upgrade_snapshot_taken 1
+
+  # 7. Containerd skew fix on master (if master < workers)
+  local master_ctr worker_max=0.0.0
+  master_ctr=$(ssh "${SSH_OPTS[@]}" wizard@k8s-master "containerd --version | awk '{print \$3}' | tr -d v")
+  for n in k8s-node1 k8s-node2 k8s-node3 k8s-node4; do
+    local v
+    v=$(ssh "${SSH_OPTS[@]}" "wizard@$n" "containerd --version | awk '{print \$3}' | tr -d v")
+    [ "$(printf '%s\n%s' "$v" "$worker_max" | sort -V | tail -1)" = "$v" ] && worker_max="$v"
+  done
+  if [ "$(printf '%s\n%s' "$master_ctr" "$worker_max" | sort -V | head -1)" = "$master_ctr" ] \
+     && [ "$master_ctr" != "$worker_max" ]; then
+    slack "Master containerd $master_ctr < workers $worker_max — bumping"
+    ssh "${SSH_OPTS[@]}" wizard@k8s-master \
+      "sudo apt-mark unhold containerd.io && sudo apt-get install -y containerd.io='$worker_max-1' \
+       && sudo apt-mark hold containerd.io && sudo systemctl restart containerd"
+    wait_for_node_ready k8s-master "$($KUBECTL get node k8s-master -o jsonpath='{.status.nodeInfo.kubeletVersion}' | tr -d v)" \
+      || { slack "ABORT — k8s-master not Ready after containerd bump"; exit 1; }
+    slack "Master containerd: $master_ctr → $worker_max. Master Ready."
+  fi
+
+  # 8. Apt repo URL rewrite (minor only)
+  if [ "$KIND" = "minor" ]; then
+    local target_minor="${TARGET_VERSION%.*}"
+    for n in k8s-master k8s-node1 k8s-node2 k8s-node3 k8s-node4; do
+      ssh "${SSH_OPTS[@]}" "wizard@$n" \
+        "echo 'deb [signed-by=/etc/apt/keyrings/kubernetes-apt-keyring.gpg] https://pkgs.k8s.io/core:/stable:/v$target_minor/deb/ /' | sudo tee /etc/apt/sources.list.d/kubernetes.list \
+         && curl -fsSL 'https://pkgs.k8s.io/core:/stable:/v$target_minor/deb/Release.key' \
+              | sudo gpg --dearmor -o /etc/apt/keyrings/kubernetes-apt-keyring.gpg --batch --yes \
+         && sudo apt-get update"
+    done
+    slack "Apt repo rewritten to v$target_minor/deb on all 5 nodes"
+  fi
+
+  slack "Preflight clean. Snapshot at nfs://...$snap_file ($size bytes). Dispatching master Job."
+}
+
+phase_master() {
+  slack "Draining k8s-master"
+
+  # Re-check halt-on-alert before drain
+  local alerts
+  alerts=$(halt_on_alert_query)
+  [ -n "$alerts" ] && { slack "ABORT master — alerts firing pre-drain: $alerts"; exit 1; }
+
+  drain_node k8s-master
+
+  slack "Running update_k8s.sh on k8s-master (--role master --release $TARGET_VERSION)"
+  ssh "${SSH_OPTS[@]}" wizard@k8s-master 'bash -s' \
+    < "$UPDATE_K8S_SH" -- --role master --release "$TARGET_VERSION"
+
+  $KUBECTL uncordon k8s-master
+
+  wait_for_node_ready k8s-master "$TARGET_VERSION" \
+    || { slack "ABORT — k8s-master not Ready or wrong version after upgrade"; exit 1; }
+
+  local not_ready
+  not_ready=$($KUBECTL -n kube-system get pods -l 'tier=control-plane' --no-headers 2>/dev/null \
+    | grep -vc Running || true)  # grep exits 1 when every pod is Running; don't trip pipefail
+  if [ "$not_ready" -gt 0 ]; then
+    slack "ABORT — $not_ready control-plane pods not Running after master upgrade"
+    exit 1
+  fi
+
+  alerts=$(halt_on_alert_query RecentNodeReboot)
+  [ -n "$alerts" ] && { slack "ABORT master — alerts firing post-upgrade: $alerts"; exit 1; }
+
+  slack "Master on v$TARGET_VERSION, control-plane Running. Dispatching worker chain."
+}
+
+phase_worker() {
+  [ -z "$TARGET_NODE" ] && { echo "ERROR: worker phase requires TARGET_NODE"; exit 2; }
+  slack "Draining $TARGET_NODE"
+
+  # Halt-on-alert wait (up to 30 min)
+  local attempt alerts
+  for attempt in $(seq 1 30); do
+    alerts=$(halt_on_alert_query)
+    [ -z "$alerts" ] && break
+    echo "Waiting for alerts to clear (attempt $attempt/30): $alerts"
+    sleep 60
+  done
+  [ -n "$alerts" ] && { slack "ABORT $TARGET_NODE — alerts firing after 30min: $alerts"; exit 1; }
+
+  drain_node "$TARGET_NODE"
+
+  slack "Running update_k8s.sh on $TARGET_NODE (--role worker --release $TARGET_VERSION)"
+  ssh "${SSH_OPTS[@]}" "wizard@$TARGET_NODE" 'bash -s' \
+    < "$UPDATE_K8S_SH" -- --role worker --release "$TARGET_VERSION"
+
+  $KUBECTL uncordon "$TARGET_NODE"
+
+  wait_for_node_ready "$TARGET_NODE" "$TARGET_VERSION" \
+    || { slack "ABORT — $TARGET_NODE not Ready or wrong version"; exit 1; }
+
+  # Daemonsets back on the node
+  local missing=0
+  for ds in calico-node kube-proxy; do
+    local count
+    count=$($KUBECTL get pods -A -o wide --field-selector "spec.nodeName=$TARGET_NODE,status.phase=Running" --no-headers \
+      | awk -v d="$ds" '$2 ~ d {n++} END{print n+0}')
+    [ "$count" -lt 1 ] && missing=$((missing+1))
+  done
+  [ "$missing" -gt 0 ] && { slack "WARN $TARGET_NODE — $missing daemonset(s) missing"; }
+
+  # 10-min soak with halt-on-alert (RecentNodeReboot ignored — we know we restarted it)
+  echo "Soaking $TARGET_NODE for 10 min..."
+  for i in $(seq 1 10); do
+    alerts=$(halt_on_alert_query RecentNodeReboot)
+    [ -n "$alerts" ] && { slack "ABORT $TARGET_NODE mid-soak — alerts: $alerts"; exit 1; }
+    sleep 60
+  done
+
+  slack "$TARGET_NODE on v$TARGET_VERSION. Soaked clean (10 min)."
+}
+
+phase_postflight() {
+  slack "Running postflight"
+
+  # All 5 nodes at target
+  local versions wrong
+  versions=$($KUBECTL get nodes -o jsonpath='{range .items[*]}{.metadata.name}:{.status.nodeInfo.kubeletVersion}{"\n"}{end}')
+  wrong=$(echo "$versions" | grep -vc ":v${TARGET_VERSION}\$" || true)  # grep exits 1 when all nodes match
+  if [ "$wrong" -ne 0 ]; then
+    slack "ABORT postflight — $wrong node(s) off target:\n$versions"
+    exit 1
+  fi
+
+  # No alerts firing
+  local alerts
+  alerts=$(halt_on_alert_query)
+  [ -n "$alerts" ] && slack "Postflight WARN — alerts still firing (cluster on target, please check):\n$alerts"
+
+  # Pod-ready ratio
+  local ratio
+  ratio=$(curl -sf "$PROM/api/v1/query" \
+            --data-urlencode 'query=sum(kube_pod_status_ready{condition="true"}) / sum(kube_pod_status_phase{phase="Running"})' \
+          | jq -r '.data.result[0].value[1] // "0"' || echo unknown)  # a failed query must not block cleanup below
+
+  # Clear annotations + gauges
+  $KUBECTL annotate ns "$NS" \
+    'viktorbarzin.me/k8s-upgrade-in-flight-' \
+    'viktorbarzin.me/k8s-upgrade-target-' \
+    'viktorbarzin.me/k8s-upgrade-snapshot-path-' || true
+  push k8s_upgrade_in_flight 0
+  push k8s_upgrade_snapshot_taken 0
+  push k8s_upgrade_started_timestamp 0
+
+  slack ":white_check_mark: K8s upgrade complete: cluster on v$TARGET_VERSION (pod-ready ratio $ratio)"
+}
+
+# ---------------------------------------------------------------------------
+# Dispatch
+# ---------------------------------------------------------------------------
+
+case "$PHASE" in
+  preflight)  phase_preflight ;;
+  master)     phase_master ;;
+  worker)     phase_worker ;;
+  postflight) phase_postflight ;;
+  *) echo "ERROR: unknown PHASE: $PHASE" >&2; exit 2 ;;
+esac
+
+spawn_next
--- a/stacks/monitoring/modules/monitoring/prometheus_chart_values.tpl
+++ b/stacks/monitoring/modules/monitoring/prometheus_chart_values.tpl
@ -1917,6 +1917,21 @@ serverFiles:
              severity: critical
            annotations:
              summary: "K8s upgrade is in flight but no etcd snapshot was recorded — pipeline pre-flight failed silently"
+          # K8sUpgradeStalled: the v2 Job-chain pushes `k8s_upgrade_started_timestamp`
+          # in preflight and resets `k8s_upgrade_in_flight=0` in postflight. If
+          # in_flight=1 persists for >90 min, a Job in the chain failed
+          # (backoffLimit=1), got preempted/evicted, or is hung. Manual recovery:
+          # `kubectl -n k8s-upgrade get jobs` → identify failed/stuck Job → delete
+          # it → fix root cause → re-create the same Job. Next-Job creation in each
+          # phase is idempotent (deterministic name = `k8s-upgrade-<phase>-<target>[-<node>]`)
+          # so re-running won't duplicate downstream Jobs.
+          - alert: K8sUpgradeStalled
+            expr: k8s_upgrade_in_flight == 1 and (time() - k8s_upgrade_started_timestamp) > 5400
+            for: 5m
+            labels:
+              severity: critical
+            annotations:
+              summary: "K8s upgrade has been in flight for >90 min — chain is stuck. Check: kubectl -n k8s-upgrade get jobs"
      - name: "Traefik Ingress"
        rules:
          - alert: TraefikDown