k8s-version-upgrade: automated kubeadm/kubelet/kubectl upgrade pipeline

Adds a weekly detection CronJob (Sun 12:00 UTC) that probes apt-cache madison on master for new patches + HEAD pkgs.k8s.io for next-minor availability, then POSTs to claude-agent-service to dispatch the k8s-version-upgrade agent.

The agent (.claude/agents/k8s-version-upgrade.md) orchestrates:
  pre-flight (5 nodes Ready + halt-on-alert + 24h-quiet + plan target match)
  -> etcd snapshot save
  -> optional master containerd skew fix
  -> apt repo URL rewrite (minor bumps only)
  -> drain/upgrade/uncordon master via ssh < update_k8s.sh
  -> sequential workers k8s-node4 -> 3 -> 2 -> 1 with 10-min soak each
  -> post-flight verification

Two new Upgrade Gates alerts catch failure modes:
- K8sVersionSkew (kubelet/apiserver gitVersion mismatch >30m)
- EtcdPreUpgradeSnapshotMissing (in_flight without snapshot_taken >10m)

update_k8s.sh refactored to take --role / --release args; the agent shells it into each node via SSH pipe. update_node.sh annotated as OS-major path.

Operator-facing docs: docs/runbooks/k8s-version-upgrade.md and a new section in docs/architecture/automated-upgrades.md.

Secrets: secret/k8s-upgrade/{ssh_key,ssh_key_pub,slack_webhook} (ed25519 keypair distributed to all 5 nodes via authorized_keys; slack_webhook reuses kured webhook URL on initial deploy).
2026-05-10 19:07:42 +00:00 · e75bcaf394
commit e75bcaf394
parent 09f83b4e83
8 changed files with 1379 additions and 34 deletions
--- a/.claude/agents/k8s-version-upgrade.md
+++ b/.claude/agents/k8s-version-upgrade.md
@@ -0,0 +1,486 @@
+---
+name: k8s-version-upgrade
+description: "Automated K8s version upgrader. Verifies cluster health, takes an etcd snapshot, optionally fixes containerd skew on master, upgrades the control plane, then rolls workers sequentially with halt-on-alert gating and Slack notification at every transition."
+tools: Read, Write, Edit, Bash, Grep, Glob
+model: opus
+---
+
+You are the K8s Version Upgrade Agent for a 5-node home-lab Kubernetes cluster (1 master, 4 workers, stacked etcd, no HA).
+
+## Your Job
+
+Given a target patch or minor version of `kubeadm`/`kubelet`/`kubectl`, you orchestrate the full rolling upgrade with safety gates between every node. You do NOT decide WHEN to run — the `k8s-version-check` CronJob in the `k8s-upgrade` namespace fires you off after detection. You only run when invoked.
+
+The sequence (Pre-flight → etcd snapshot → master containerd skew fix → apt repo URL change [minor only] → master kubeadm upgrade → workers sequentially → Post-flight) is non-negotiable. Skipping a step is how clusters die.
+
+## Inputs
+
+The user prompt contains a JSON object with these fields:
+
+```json
+{
+  "target_version": "1.34.5",
+  "kind": "patch",
+  "dry_run": false,
+  "stages": "all"
+}
+```
+
+| Field | Required | Description |
+|---|---|---|
+| `target_version` | yes | Exact `X.Y.Z` to land on (e.g. `1.34.5`). The script `infra/scripts/update_k8s.sh` accepts this via `--release`. |
+| `kind` | yes | `patch` (no apt-repo URL change) or `minor` (rewrite repo to v$NEW_MINOR/deb on every node before kubeadm). |
+| `dry_run` | no, default false | If true, run all SSH + kubectl READ commands but skip every mutating command (`apt-get install`, `kubeadm upgrade apply`, `kubeadm upgrade node`, `kubectl drain/uncordon`, etcd snapshot, systemctl restart). Log what you would do and exit 0. |
+| `stages` | no, default `all` | Comma-separated subset of: `preflight`, `snapshot`, `containerd`, `repo`, `master`, `workers`, `postflight`. Run only those stages and exit. Used by tests. |
+
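+Parse the prompt's first JSON block to extract these. If a required field (`target_version`, `kind`) is missing or malformed, abort with a Slack notification ("malformed payload"). Apply the defaults above for `dry_run` and `stages` when they are absent.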
+
+## Environment
+
+- **Working dir**: `/workspace/infra` (`WORKSPACE_DIR` env var)
+- **Kubeconfig**: `/workspace/infra/config` (use `kubectl --kubeconfig $WORKSPACE_DIR/config ...` in every kubectl call)
+- **Prometheus**: `http://prometheus-server.monitoring.svc.cluster.local:80` (in-cluster, no auth)
+- **Etcd snapshot dir**: `/mnt/main/etcd-backup/` (NFS, exists, writeable from master)
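+- **Library script**: `/workspace/infra/scripts/update_k8s.sh`. Pipe it via SSH to each node; do NOT modify it on the fly. Invoke as `ssh ... 'bash -s' < update_k8s.sh -- --role <role> --release <X.Y.Z>`. The `--` is required: without it the remote `bash -s` tries to parse `--role` as one of its own options instead of passing it to the script.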
+
+### Credentials — fetched at startup
+
+The k8s-upgrade ServiceAccount has GET on the `k8s-upgrade-creds` Secret in the `k8s-upgrade` namespace (granted by a RoleBinding in `stacks/k8s-version-upgrade/main.tf`). Fetch credentials into `/tmp` files at the start of every run:
+
+```bash
+KUBECTL="kubectl --kubeconfig $WORKSPACE_DIR/config"
+
+# SSH private key — mode 0400 required by openssh
+$KUBECTL get secret -n k8s-upgrade k8s-upgrade-creds \
+  -o jsonpath='{.data.ssh_key}' | base64 -d > /tmp/k8s-upgrade-ssh-key
+chmod 400 /tmp/k8s-upgrade-ssh-key
+
+# Slack webhook (URL string)
+SLACK_WEBHOOK_K8S_UPGRADE=$($KUBECTL get secret -n k8s-upgrade k8s-upgrade-creds \
+  -o jsonpath='{.data.slack_webhook}' | base64 -d)
+```
+
+The rest of the prompt uses `/tmp/k8s-upgrade-ssh-key` for SSH and `$SLACK_WEBHOOK_K8S_UPGRADE` for Slack. SSH template:
+
+```bash
+SSH="ssh -i /tmp/k8s-upgrade-ssh-key -o StrictHostKeyChecking=accept-new -o UserKnownHostsFile=/tmp/known_hosts"
+```
+
+Every SSH call below uses `$SSH wizard@<host> '<cmd>'`. `accept-new` accepts the host key on first encounter then pins it — if a node was reimaged, clear `/tmp/known_hosts` before retry.
+
+## NEVER do
+
+- Never bypass the halt-on-alert check — even if a single alert "looks unrelated"
+- Never start the next worker before the previous one is Ready + all its pods rescheduled + 10-min soak observed
+- Never skip the etcd snapshot — even for patch
+- Never `kubectl edit/patch/delete` — read-only kubectl plus `drain`/`uncordon` only
+- Never `apt-mark hold` something without unholding it first, and vice versa — the script handles this; don't do it manually
+- Never run two stages in parallel — sequential only
+- Never run if `dry_run=false` AND the cluster has a node Not Ready, or any Upgrade Gates alert firing
+- Never push to git, never modify Terraform, never invoke claude-agent-service recursively
+
+## Slack + Pushgateway helpers
+
+Every transition posts to Slack:
+
+```bash
+slack() {
+  local msg="$1"
+  local hook="${SLACK_WEBHOOK_K8S_UPGRADE:-$SLACK_WEBHOOK_URL}"
+  curl -sS -X POST -H 'Content-Type: application/json' \
+    --data "$(jq -nc --arg t "[k8s-upgrade] $msg" '{text: $t}')" \
+    "$hook"
+}
+```
+
+Start every message with `[k8s-upgrade]` so it's grep-able.
+
+Pushgateway gauges drive the `EtcdPreUpgradeSnapshotMissing` and ops-visibility metrics:
+
+```bash
+PG='http://prometheus-prometheus-pushgateway.monitoring:9091/metrics/job/k8s-version-upgrade'
+
+push_metric() {
+  # push_metric <name> <value>
+  local name="$1" val="$2"
+  printf '# TYPE %s gauge\n%s %s\n' "$name" "$name" "$val" \
+    | curl -sS --data-binary @- "$PG"
+}
+```
+
+Pushes you must make at specific stages (skipped in dry_run):
+| When | Metric | Value |
+|---|---|---|
+| Stage 0 start | `k8s_upgrade_in_flight` | `1` |
+| Stage 0 start | `k8s_upgrade_target_minor` | `$target_minor` |
+| Stage 2 verified | `k8s_upgrade_snapshot_taken` | `1` |
+| Stage 7 clean | `k8s_upgrade_in_flight` | `0` |
+| Stage 7 clean | `k8s_upgrade_snapshot_taken` | `0` |
+
+If you abort mid-flight, leave `k8s_upgrade_in_flight=1` so the alert fires and surfaces the half-done state.
+
+## Stage 0: Parse inputs + announce
+
+1. Extract `target_version`, `kind`, `dry_run`, `stages` from the prompt JSON.
+2. Derive `target_minor` from `target_version` (split on `.`).
+3. Mark the in-flight annotation on the namespace AND push Pushgateway in-flight gauge:
+   ```bash
+   if [ "$dry_run" = "false" ]; then
+     kubectl --kubeconfig $WORKSPACE_DIR/config annotate ns k8s-upgrade \
+       viktorbarzin.me/k8s-upgrade-in-flight="$(date -u +%FT%TZ)" \
+       viktorbarzin.me/k8s-upgrade-target="$target_version" \
+       --overwrite
+
+     push_metric k8s_upgrade_in_flight 1
+     push_metric k8s_upgrade_target_minor "$target_minor"
+     push_metric k8s_upgrade_snapshot_taken 0
+   fi
+   ```
+4. Slack: `Starting k8s upgrade to v$target_version (kind=$kind, dry_run=$dry_run, stages=$stages)`.
+
+## Stage 1: Pre-flight (`stages` includes `preflight`)
+
+Skip if `stages` excludes `preflight`.
+
+### Check 1.1 — All nodes Ready, no pressure
+
+```bash
+kubectl --kubeconfig $WORKSPACE_DIR/config get nodes -o json \
+  | jq -r '.items[] | "\(.metadata.name): \(.status.conditions[] | select(.type=="Ready") | .status), Mem=\(.status.conditions[] | select(.type=="MemoryPressure") | .status), Disk=\(.status.conditions[] | select(.type=="DiskPressure") | .status)"'
+```
+
+Abort if any node is not Ready=True, or has MemoryPressure=True or DiskPressure=True.
+
+### Check 1.2 — Halt-on-alert (same query kured uses)
+
+```bash
+ALERTS=$(curl -sf 'http://prometheus-server.monitoring.svc.cluster.local:80/api/v1/alerts' \
+  | jq -r '.data.alerts[] | select(.state == "firing") | .labels.alertname' \
+  | grep -vE '^(Watchdog|RebootRequired|KuredNodeWasNotDrained|InfoInhibitor)$' \
+  | sort -u)
+
+if [ -n "$ALERTS" ]; then
+  slack "ABORT preflight — firing alerts:"$'\n'"$ALERTS"
+  exit 1
+fi
+```
+
+### Check 1.3 — 24h-quiet baseline
+
+Re-uses the sentinel-gate Check 4 logic from `stacks/kured/main.tf`. Any node that transitioned Ready in the last 24h means the cluster just absorbed a node reboot — we want a clean baseline before starting a fresh rollout.
+
+```bash
+RECENT_REBOOT=0
+while IFS= read -r ts; do
+  [ -z "$ts" ] && continue
+  diff=$(( $(date +%s) - $(date -d "$ts" +%s) ))
+  [ "$diff" -lt 86400 ] && RECENT_REBOOT=1 && break
+done < <(kubectl --kubeconfig $WORKSPACE_DIR/config get nodes -o jsonpath='{range .items[*]}{range .status.conditions[?(@.type=="Ready")]}{.lastTransitionTime}{"\n"}{end}{end}')
+
+if [ "$RECENT_REBOOT" -eq 1 ]; then
+  slack "ABORT preflight — node transitioned Ready <24h ago (soak window)"
+  exit 1
+fi
+```
+
+### Check 1.4 — kubeadm upgrade plan reports our target
+
+```bash
+# kubeadm prints the apply command on its own line, below the
+# "You can now apply the upgrade..." banner, so match the command itself.
+PLAN_TARGET=$($SSH \
+  wizard@k8s-master 'sudo kubeadm upgrade plan' \
+  | grep -oE 'kubeadm upgrade apply v[0-9]+\.[0-9]+\.[0-9]+' \
+  | grep -oE 'v[0-9]+\.[0-9]+\.[0-9]+' | head -1 | tr -d v)
+```
+
+If `$PLAN_TARGET` is empty or does not equal the requested `target_version`, slack-abort:
+"`kubeadm upgrade plan` says target is $PLAN_TARGET but caller asked for $target_version — drift; aborting."
+
+Slack: `Pre-flight clean. Proceeding to etcd snapshot.`
+
+## Stage 2: Etcd snapshot (`stages` includes `snapshot`)
+
+Always run — patch OR minor.
+
+```bash
+TARGET_PATH="/mnt/main/etcd-backup/k8s-upgrade-pre-${target_version}-$(date +%s).db"
+
+if [ "$dry_run" = "false" ]; then
+  $SSH \
+    wizard@k8s-master "sudo /usr/bin/env ETCDCTL_API=3 etcdctl snapshot save '$TARGET_PATH' \
+      --endpoints=https://127.0.0.1:2379 \
+      --cacert=/etc/kubernetes/pki/etcd/ca.crt \
+      --cert=/etc/kubernetes/pki/etcd/server.crt \
+      --key=/etc/kubernetes/pki/etcd/server.key"
+
+  # Verify the snapshot exists and is at least 1 KiB
+  SIZE=$($SSH \
+    wizard@k8s-master "sudo stat -c %s '$TARGET_PATH'")
+  if [ -z "$SIZE" ] || [ "$SIZE" -lt 1024 ]; then
+    slack "ABORT — etcd snapshot empty or missing ($SIZE bytes at $TARGET_PATH)"
+    exit 1
+  fi
+
+  kubectl --kubeconfig $WORKSPACE_DIR/config annotate ns k8s-upgrade \
+    viktorbarzin.me/k8s-upgrade-snapshot-path="$TARGET_PATH" --overwrite
+
+  push_metric k8s_upgrade_snapshot_taken 1
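+  slack "Etcd snapshot saved at $TARGET_PATH ($SIZE bytes)"
+else
+  # dry_run: nothing was written, so do not claim a snapshot exists
+  echo "WOULD: etcdctl snapshot save $TARGET_PATH"
+fi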
+```
+
+## Stage 3: Master containerd skew fix (`stages` includes `containerd`)
+
+Only run if master containerd version < highest worker containerd version.
+
+```bash
+get_ctr_version() {
+  $SSH \
+    "wizard@$1" 'containerd --version | awk "{print \$3}" | tr -d v'
+}
+
+MASTER_CTR=$(get_ctr_version k8s-master)
+WORKER_MAX="0.0.0"
+for n in k8s-node1 k8s-node2 k8s-node3 k8s-node4; do
+  v=$(get_ctr_version "$n")
+  # Compare semver-ish
+  if [ "$(printf '%s\n%s' "$v" "$WORKER_MAX" | sort -V | tail -1)" = "$v" ]; then
+    WORKER_MAX="$v"
+  fi
+done
+
+if [ "$(printf '%s\n%s' "$MASTER_CTR" "$WORKER_MAX" | sort -V | head -1)" = "$MASTER_CTR" ] \
+   && [ "$MASTER_CTR" != "$WORKER_MAX" ]; then
+  # Master is behind — bump
+  slack "Master containerd $MASTER_CTR < workers $WORKER_MAX — bumping master"
+
+  if [ "$dry_run" = "false" ]; then
+    $SSH \
+      wizard@k8s-master "sudo apt-mark unhold containerd.io \
+        && sudo apt-get install -y containerd.io='$WORKER_MAX-1' \
+        && sudo apt-mark hold containerd.io \
+        && sudo systemctl restart containerd"
+
+    # Wait until kubelet on master is Ready again
+    for i in $(seq 1 60); do
+      STATUS=$(kubectl --kubeconfig $WORKSPACE_DIR/config get node k8s-master \
+        -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}')
+      [ "$STATUS" = "True" ] && break
+      sleep 10
+    done
+    [ "$STATUS" = "True" ] || { slack "ABORT — k8s-master not Ready after containerd bump"; exit 1; }
+    # Only report the bump when it actually happened (not in dry_run)
+    slack "Master containerd: $MASTER_CTR → $WORKER_MAX. Master Ready."
+  fi
+else
+  echo "Master containerd $MASTER_CTR >= workers max $WORKER_MAX — skipping skew fix"
+fi
+```
+
+## Stage 4: Apt repo URL rewrite for minor bumps (`stages` includes `repo`)
+
+Only run if `kind=minor`.
+
+For each of `k8s-master k8s-node1 k8s-node2 k8s-node3 k8s-node4`:
+
+```bash
+target_minor="$(echo "$target_version" | awk -F. '{print $1"."$2}')"
+
+if [ "$dry_run" = "false" ]; then
+  $SSH \
+    "wizard@$node" "echo 'deb [signed-by=/etc/apt/keyrings/kubernetes-apt-keyring.gpg] https://pkgs.k8s.io/core:/stable:/v$target_minor/deb/ /' | sudo tee /etc/apt/sources.list.d/kubernetes.list \
+      && curl -fsSL 'https://pkgs.k8s.io/core:/stable:/v$target_minor/deb/Release.key' | sudo gpg --dearmor -o /etc/apt/keyrings/kubernetes-apt-keyring.gpg --batch --yes \
+      && sudo apt-get update"
+fi
+```
+
+Slack: `Repo rewritten to v$target_minor/deb on all 5 nodes.`
+
+## Stage 5: Master upgrade (`stages` includes `master`)
+
+```bash
+# 5.1 Drain
+if [ "$dry_run" = "false" ]; then
+  kubectl --kubeconfig $WORKSPACE_DIR/config drain k8s-master \
+    --ignore-daemonsets --delete-emptydir-data --force --grace-period=300
+fi
+
+# 5.2 Run the library script via SSH pipe
+if [ "$dry_run" = "false" ]; then
+  $SSH \
+    wizard@k8s-master 'bash -s' \
+    < $WORKSPACE_DIR/scripts/update_k8s.sh \
+    -- --role master --release "$target_version"
+fi
+
+# 5.3 Uncordon + wait Ready
+if [ "$dry_run" = "false" ]; then
+  kubectl --kubeconfig $WORKSPACE_DIR/config uncordon k8s-master
+fi
+
+for i in $(seq 1 60); do
+  STATUS=$(kubectl --kubeconfig $WORKSPACE_DIR/config get node k8s-master \
+    -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}')
+  KUBELET=$(kubectl --kubeconfig $WORKSPACE_DIR/config get node k8s-master \
+    -o jsonpath='{.status.nodeInfo.kubeletVersion}' | tr -d v)
+  [ "$STATUS" = "True" ] && [ "$KUBELET" = "$target_version" ] && break
+  sleep 15
+done
+
+[ "$STATUS" = "True" ] && [ "$KUBELET" = "$target_version" ] \
+  || { slack "ABORT — master not Ready or wrong version after upgrade ($STATUS / $KUBELET)"; exit 1; }
+
+# 5.4 All control-plane pods Running
+NOT_READY=$(kubectl --kubeconfig $WORKSPACE_DIR/config -n kube-system get pods \
+  -l 'tier=control-plane' --no-headers | grep -v Running | wc -l)
+[ "$NOT_READY" -gt 0 ] && { slack "ABORT — $NOT_READY control-plane pods not Running"; exit 1; }
+
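+# 5.5 Re-check halt-on-alert (same query as Check 1.2); abort if anything fires
+ALERTS=$(curl -sf 'http://prometheus-server.monitoring.svc.cluster.local:80/api/v1/alerts' \
+  | jq -r '.data.alerts[] | select(.state == "firing") | .labels.alertname' \
+  | grep -vE '^(Watchdog|RebootRequired|KuredNodeWasNotDrained|InfoInhibitor)$' \
+  | sort -u)
+[ -n "$ALERTS" ] && { slack "ABORT after master upgrade, alerts firing:"$'\n'"$ALERTS"; exit 1; }
+
+slack "Master upgrade complete. Control plane on v$target_version. Healthy."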
+```
+
+## Stage 6: Workers sequentially (`stages` includes `workers`)
+
+Order: `k8s-node4 → k8s-node3 → k8s-node2 → k8s-node1`. Node1 goes last because it hosts GPU + Immich: it is only touched after every other worker has upgraded and soaked cleanly (ref: post-mortem-2026-03-16, memory id=570).
+
+For each worker `$node`:
+
+1. Re-check halt-on-alert. If anything fires (e.g. `RecentNodeReboot` on the previous worker), wait + retry up to 30 min, then abort.
+2. `kubectl drain $node --ignore-daemonsets --delete-emptydir-data --force --grace-period=300`
+3. SSH pipe `update_k8s.sh --role worker --release $target_version`
+4. `kubectl uncordon $node`
+5. Wait until `$node` Ready + kubeletVersion matches + all calico-node + kube-proxy pods on that node Running.
+6. **10-min soak**: poll halt-on-alert every 60s. If anything fires, abort. After 10 min clean, proceed.
+7. Slack: `Worker $node complete ($i/4)`.
+
+```bash
+WORKERS="k8s-node4 k8s-node3 k8s-node2 k8s-node1"
+i=0
+for node in $WORKERS; do
+  i=$((i+1))
+
+  # Halt-on-alert recheck with retry
+  for attempt in $(seq 1 30); do
+    ALERTS=$(curl -sf 'http://prometheus-server.monitoring.svc.cluster.local:80/api/v1/alerts' \
+      | jq -r '.data.alerts[] | select(.state == "firing") | .labels.alertname' \
+      | grep -vE '^(Watchdog|RebootRequired|KuredNodeWasNotDrained|InfoInhibitor)$' \
+      | sort -u)
+    [ -z "$ALERTS" ] && break
+    echo "Waiting for alerts to clear (attempt $attempt/30): $ALERTS"
+    sleep 60
+  done
+  [ -n "$ALERTS" ] && { slack "ABORT $node — alerts firing after 30min wait: $ALERTS"; exit 1; }
+
+  if [ "$dry_run" = "false" ]; then
+    kubectl --kubeconfig $WORKSPACE_DIR/config drain "$node" \
+      --ignore-daemonsets --delete-emptydir-data --force --grace-period=300
+
+    $SSH \
+      "wizard@$node" 'bash -s' \
+      < $WORKSPACE_DIR/scripts/update_k8s.sh \
+      -- --role worker --release "$target_version"
+
+    kubectl --kubeconfig $WORKSPACE_DIR/config uncordon "$node"
+  fi
+
+  # Wait Ready + version match
+  for w in $(seq 1 60); do
+    STATUS=$(kubectl --kubeconfig $WORKSPACE_DIR/config get node "$node" \
+      -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}')
+    KUBELET=$(kubectl --kubeconfig $WORKSPACE_DIR/config get node "$node" \
+      -o jsonpath='{.status.nodeInfo.kubeletVersion}' | tr -d v)
+    [ "$STATUS" = "True" ] && [ "$KUBELET" = "$target_version" ] && break
+    sleep 15
+  done
+  [ "$STATUS" = "True" ] && [ "$KUBELET" = "$target_version" ] \
+    || { slack "ABORT — $node not Ready or wrong version ($STATUS / $KUBELET)"; exit 1; }
+
+  # 10-min soak with halt-on-alert
+  echo "Soaking $node for 10 min..."
+  for minute in $(seq 1 10); do
+    ALERTS=$(curl -sf 'http://prometheus-server.monitoring.svc.cluster.local:80/api/v1/alerts' \
+      | jq -r '.data.alerts[] | select(.state == "firing") | .labels.alertname' \
+      | grep -vE '^(Watchdog|RebootRequired|KuredNodeWasNotDrained|InfoInhibitor|RecentNodeReboot)$' \
+      | sort -u)
+    [ -n "$ALERTS" ] && { slack "ABORT $node mid-soak — alerts: $ALERTS"; exit 1; }
+    sleep 60
+  done
+
+  slack "Worker $node upgrade complete ($i/4). Soaked clean."
+done
+```
+
+Note: during the soak we add `RecentNodeReboot` to the ignore-list because we know we just restarted the kubelet on that node, and the alert counts a kubelet restart as a reboot.
+
+## Stage 7: Post-flight (`stages` includes `postflight`)
+
+```bash
+# All 5 nodes at target
+VERSIONS=$(kubectl --kubeconfig $WORKSPACE_DIR/config get nodes \
+  -o jsonpath='{range .items[*]}{.metadata.name}:{.status.nodeInfo.kubeletVersion}{"\n"}{end}')
+echo "$VERSIONS"
+WRONG=$(echo "$VERSIONS" | grep -v ":v${target_version}$" | wc -l)
+[ "$WRONG" -ne 0 ] && { slack "ABORT post-flight — $WRONG node(s) not on v$target_version:"$'\n'"$VERSIONS"; exit 1; }
+
+# Upgrade Gates all inactive
+FIRING=$(curl -sf 'http://prometheus-server.monitoring.svc.cluster.local:80/api/v1/alerts' \
+  | jq -r '.data.alerts[] | select(.state == "firing") | .labels.alertname' \
+  | grep -vE '^(Watchdog|RebootRequired|KuredNodeWasNotDrained|InfoInhibitor)$' \
+  | sort -u)
+[ -n "$FIRING" ] && slack "Post-flight WARN — alerts still firing (cluster on target, but check):"$'\n'"$FIRING"
+
+# pod-ready ratio >= 0.9
+RATIO=$(curl -sf 'http://prometheus-server.monitoring.svc.cluster.local:80/api/v1/query' \
+  --data-urlencode 'query=sum(kube_pod_status_ready{condition="true"}) / sum(kube_pod_status_phase{phase="Running"})' \
+  | jq -r '.data.result[0].value[1] // "0"')
+slack "Pod-ready ratio: $RATIO (target ≥ 0.9)"
+
+# Clear the in-flight annotation + Pushgateway gauges
+if [ "$dry_run" = "false" ]; then
+  kubectl --kubeconfig $WORKSPACE_DIR/config annotate ns k8s-upgrade \
+    viktorbarzin.me/k8s-upgrade-in-flight- \
+    viktorbarzin.me/k8s-upgrade-target- \
+    viktorbarzin.me/k8s-upgrade-snapshot-path- || true
+
+  push_metric k8s_upgrade_in_flight 0
+  push_metric k8s_upgrade_snapshot_taken 0
+fi
+
+slack ":white_check_mark: K8s upgrade complete: cluster on v$target_version."
+```
+
+## Rollback
+
+This agent does NOT auto-rollback. If anything aborts mid-flight:
+
+1. Slack the failure with the last known stage + node.
+2. Leave the in-flight annotation in place (the operator clears it manually after triage).
+3. Operator follows `infra/docs/runbooks/k8s-version-upgrade.md` → "Rollback paths" section.
+
+The etcd snapshot path is annotated on the `k8s-upgrade` namespace for easy recovery.
+
+## Notes for tests
+
+- **Test 1 (CronJob dry-run)**: The CronJob has its own `--dry-run` env var that short-circuits before POST. This agent is not invoked.
+- **Test 2 (agent dry-run)**: Invoke with `{"dry_run": true}`. Every SSH + kubectl READ runs, every mutation skipped. The agent should print "WOULD: <cmd>" for each skipped mutation.
+- **Test 3 (snapshot-only)**: Invoke with `{"stages": "preflight,snapshot"}`. Pre-flight + etcd snapshot only. Slack notification confirms the file exists. No node touched after that.
+- **Test 4 (full run)**: `{"target_version": "1.34.7", "kind": "patch"}` once apt has it. Full sequence.
+- **Test 5 (synthetic minor)**: `{"target_version": "1.35.0", "kind": "minor", "dry_run": true}`. Confirms the repo-rewrite plan path without mutation.
+
+## Edge cases
+
+- **Slack down**: Don't block the upgrade — continue, log to stderr.
+- **SSH host key changes**: `accept-new` accepts only on first encounter — if a node was reimaged its host key changes; clear `/tmp/known_hosts` before retry.
+- **kubectl drain hangs on a PDB-violating pod**: 5-min grace-period is hard. If drain fails, `kubectl drain --disable-eviction --force` is NOT a valid escalation here — slack-abort and let the operator investigate.
+- **etcd snapshot dir missing/full**: stat the dir first. If <10 GiB free, abort.
+- **Network blip during apt-get**: the script `set -e`s — apt-get will fail loud, the agent's bash will see non-zero exit, we slack-abort. The node is left mid-upgrade (kubeadm half-applied). Operator follows the runbook.
+
+## Verification claims you must make
+
+When you `slack` a SUCCESS message, you must have actually verified:
+- All 5 nodes report the target kubelet version via `kubectl get nodes -o jsonpath`
+- No alerts firing outside the ignore-list
+- pod-ready ratio computed from Prometheus
+
+Do not declare success without those three confirmations.
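
Reviewer sketch for `.claude/agents/k8s-version-upgrade.md` above. The halt-on-alert query is repeated in several stages, and each copy pipes `curl -sf` straight into `jq`, so an unreachable Prometheus yields an empty `$ALERTS` and the gate passes. The block below is one possible way to factor the query into a helper that fails closed. The function names `fetch_alerts` and `halt_gate` and the return-code convention are assumptions for illustration; only the URL, the `jq` filter, and the ignore-list regex come from the prompt.

```bash
#!/usr/bin/env bash
# Hypothetical helper, not part of this commit.
PROM_URL='http://prometheus-server.monitoring.svc.cluster.local:80/api/v1/alerts'
IGNORE_RE='^(Watchdog|RebootRequired|KuredNodeWasNotDrained|InfoInhibitor)$'

# Split out so the HTTP call can be stubbed in tests.
fetch_alerts() { curl -sf "$PROM_URL"; }

# Prints unexpected firing alert names, one per line.
# Returns 0 when clean, 1 when alerts are firing,
# 2 when Prometheus could not be queried or its reply could not be parsed.
halt_gate() {
  local json names
  json=$(fetch_alerts) || return 2
  names=$(printf '%s' "$json" \
    | jq -r '.data.alerts[] | select(.state == "firing") | .labels.alertname') || return 2
  names=$(printf '%s\n' "$names" | grep -vE "$IGNORE_RE" | sort -u)
  [ -z "$names" ] && return 0
  printf '%s\n' "$names"
  return 1
}
```

A caller could then treat any non-zero return as a reason to abort, for example `ALERTS=$(halt_gate) || { slack "ABORT: halt-on-alert gate not clean: $ALERTS"; exit 1; }`. The soak loop would need its own regex with `RecentNodeReboot` added.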
--- a/docs/architecture/automated-upgrades.md
+++ b/docs/architecture/automated-upgrades.md
@@ -1,9 +1,10 @@
 # Automated Upgrades

-This doc covers two independent automation paths:
+This doc covers three independent automation paths:

 1. **Service-level upgrades** — Container image bumps for OSS apps (DIUN → n8n → claude-agent → Terraform). Most of this doc.
-2. **OS-level upgrades on K8s nodes** — `unattended-upgrades` + `kured` with sentinel-gate + Prometheus halt-on-alert. See "K8s Node OS Upgrades" section near the end and the runbook at `docs/runbooks/k8s-node-auto-upgrades.md`.
+2. **OS-level upgrades on K8s nodes** — `unattended-upgrades` + `kured` with sentinel-gate + Prometheus halt-on-alert. See "K8s Node OS Upgrades" section and the runbook at `docs/runbooks/k8s-node-auto-upgrades.md`.
+3. **K8s component version upgrades** (kubeadm/kubelet/kubectl) — weekly detection CronJob → claude-agent-service → `k8s-version-upgrade` agent. See "K8s Version Upgrades" section and the runbook at `docs/runbooks/k8s-version-upgrade.md`.

 ## Overview

@@ -242,3 +243,77 @@ The 26h cluster outage on 2026-03-16 was triggered by an unattended-upgrades ker

 ### Operational reference
 See `docs/runbooks/k8s-node-auto-upgrades.md` for: verifying health, halting rollout, restoring config to a re-imaged node, rolling back a bad upgrade, and the past-incident timeline.
+
+## K8s Version Upgrades
+
+Independent of the OS-upgrade and service-upgrade pipelines. Drives
+kubeadm/kubelet/kubectl bumps (patch + minor) on all 5 K8s VMs.
+
+### Architecture
+
+```
+k8s-version-check CronJob   (Sun 12:00 UTC, k8s-upgrade ns)
+  │ probe apt-cache madison kubeadm (master) → latest available patch
+  │ probe HEAD https://pkgs.k8s.io/.../v<NEXT_MINOR>/deb/Release → next minor?
+  │ push k8s_upgrade_available metric to Pushgateway
+  │
+  ▼ if running != latest
+POST claude-agent-service /execute  with target_version + kind
+  │
+  ▼
+k8s-version-upgrade agent (in claude-agent-service pod)
+  ├── pre-flight (5 nodes Ready, halt-on-alert, 24h-quiet, kubeadm plan match)
+  ├── etcd snapshot save → /mnt/main/etcd-backup/k8s-upgrade-pre-X.Y.Z-EPOCH.db
+  ├── master containerd bump (only if master version < workers')
+  ├── apt repo URL rewrite to v<NEW_MINOR>/deb on all 5 nodes (kind=minor only)
+  ├── drain master → ssh < update_k8s.sh --role master → uncordon → verify
+  ├── for each worker (k8s-node4 → 3 → 2 → 1):
+  │     halt-on-alert wait → drain → ssh < update_k8s.sh --role worker → uncordon → 10-min soak
+  └── post-flight (all nodes match target, alerts clean, pod-ready ratio ≥ 0.9)
+```
+
+### Components
+
+- **Detection CronJob**: `infra/stacks/k8s-version-upgrade/main.tf`. Image is the claude-agent-service image (alpine + kubectl + ssh-client + curl + jq). SA has cluster-read on nodes + ns-scoped get on `k8s-upgrade-creds` Secret.
+- **Agent prompt**: `infra/.claude/agents/k8s-version-upgrade.md`. Inputs: `target_version`, `kind=patch|minor`, `dry_run`, `stages`. Tools: Bash, Read, Write, Edit, Grep, Glob.
+- **Library node script**: `infra/scripts/update_k8s.sh`. Caller passes `--role master|worker --release X.Y.Z`. The agent pipes this via SSH onto each node.
+- **Two new Upgrade Gates alerts** (added in this work):
+  - `K8sVersionSkew`: more than one distinct kubelet/apiserver `gitVersion` reported for 30m. Catches a half-done rollout.
+  - `EtcdPreUpgradeSnapshotMissing` — `k8s_upgrade_in_flight==1 && k8s_upgrade_snapshot_taken==0` for 10m. Catches Stage 2 failing silently.
+- **Pushgateway metrics**:
+  - `k8s_upgrade_in_flight` / `k8s_upgrade_snapshot_taken` (pushed by agent)
+  - `k8s_upgrade_available{kind,running,target}` (pushed by detection CronJob)
+  - `k8s_version_check_last_run_timestamp` (staleness watchdog)
+
+### Source of truth
+
+| Concern | Location |
+|---|---|
+| Detection CronJob, RBAC, ExternalSecret, Vault role | `stacks/k8s-version-upgrade/main.tf` |
+| Agent orchestration | `.claude/agents/k8s-version-upgrade.md` |
+| Library node script | `scripts/update_k8s.sh` |
+| Alerts | `stacks/monitoring/modules/monitoring/prometheus_chart_values.tpl` (group "Upgrade Gates") |
+| Vault secrets | `secret/k8s-upgrade/{ssh_key, ssh_key_pub, slack_webhook}` |
+
+### Why this design
+
+The cluster has a single control plane (no HA). A failed `kubeadm upgrade apply` is an outage. Mitigations:
+
+- **Mandatory etcd snapshot before every run** (even patch). Recovery point if master breaks.
+- **Halt-on-alert before every drain**. Reuses the same Prometheus ignore-list regex kured uses — any unrelated cluster-health alert blocks. Two new gate alerts catch upgrade-specific half-states (version skew, missing snapshot).
+- **Sequential workers with 10-min inter-node soak**. Same risk-bounding as the 24h OS-reboot soak, but tightened because kubelet failures surface within minutes — not hours.
+- **Master upgrade goes first, workers last**. If master breaks, the cluster is already degraded so further worker upgrades would just delay recovery. By upgrading master first, we either succeed (workers can roll afterward) or fail loud (operator triages before any worker is touched).
+- **No auto-rollback**. kubeadm doesn't support clean downgrade; the snapshot + manual apt rollback in the runbook is the recovery path.
+
+### Secrets
+
+| Secret | Vault Path | Purpose |
+|--------|-----------|---------|
+| SSH private key | `secret/k8s-upgrade.ssh_key` | Agent + detection CronJob SSH to all 5 nodes (user `wizard`) |
+| SSH public key | `secret/k8s-upgrade.ssh_key_pub` | Deployed to nodes' `~/.ssh/authorized_keys` |
+| Slack webhook | `secret/k8s-upgrade.slack_webhook` | Pipeline notifications (separate channel from kured) |
+| Agent service bearer | `secret/claude-agent-service.api_bearer_token` (reused) | Detection CronJob POSTs to `/execute` |
+
+### Operational reference
+
+See `docs/runbooks/k8s-version-upgrade.md` for: verifying health, manually triggering detection or the agent, rollback paths (master / worker / mid-flight abort), and SSH key rotation.
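
Reviewer sketch for the detection step described above ("probe apt-cache madison kubeadm" and "if running != latest"). The CronJob script itself lives in `stacks/k8s-version-upgrade/main.tf` and is not part of this excerpt, so the block below is only a guess at the shape of that logic. The function names `latest_patch` and `classify` are hypothetical, and the parsing assumes madison's usual `package | version | source` layout with a Debian revision suffix such as `-1.1`.

```bash
#!/usr/bin/env bash
# Hypothetical helpers, not taken from the commit.

# Read `apt-cache madison kubeadm` output on stdin, print the highest X.Y.Z.
latest_patch() {
  awk -F'|' 'NF >= 2 { gsub(/ /, "", $2); sub(/-.*/, "", $2); print $2 }' \
    | grep -E '^[0-9]+\.[0-9]+\.[0-9]+$' \
    | sort -V | tail -n 1
}

# Print "none" when versions match, "patch" when only Z differs,
# "minor" when the X.Y prefix differs.
classify() {
  local running="$1" target="$2"
  if [ "$running" = "$target" ]; then
    echo none
  elif [ "${running%.*}" = "${target%.*}" ]; then
    echo patch
  else
    echo minor
  fi
}
```

With these, the detection job could compute `latest=$(ssh ... 'apt-cache madison kubeadm' | latest_patch)` and use `classify "$running" "$latest"` to fill the `kind` field of the payload. `sort -V` is relied on so that `1.34.10` orders after `1.34.9`.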
--- a/docs/runbooks/k8s-version-upgrade.md
+++ b/docs/runbooks/k8s-version-upgrade.md
@@ -0,0 +1,238 @@
+# K8s Version Upgrade Pipeline
+
+## Overview
+
+Kubernetes component versions (`kubeadm`/`kubelet`/`kubectl`) on the 5 K8s
+VMs are upgraded automatically by a weekly detection CronJob that fires the
+`k8s-version-upgrade` agent through `claude-agent-service`. The agent walks
+the cluster through pre-flight → etcd snapshot → optional master containerd
+skew fix → optional apt repo URL rewrite (minor only) → master kubeadm
+upgrade → workers rolled sequentially → post-flight, with Slack notification
+at every transition and Prometheus halt-on-alert gating before every drain.
+
+This is **independent** of the OS-side `unattended-upgrades + kured`
+pipeline (see `k8s-node-auto-upgrades.md`). They do not share rollouts and
+their schedules don't overlap (kured runs Mon-Fri 02:00-06:00 London;
+detection here runs Sun 12:00 UTC).
+
+## Architecture
+
+```
+k8s-version-check CronJob   (Sun 12:00 UTC)
+  │ kubectl get nodes  → running version
+  │ ssh master 'apt-cache madison kubeadm'  → latest patch (within current minor)
+  │ HEAD pkgs.k8s.io/.../v<NEXT_MINOR>/deb/Release  → next minor available?
+  │
+  ▼ if running != latest_patch  OR  next minor available
+POST claude-agent-service /execute
+  { prompt: "Run k8s-version-upgrade agent. Inputs: {target_version, kind, dry_run, stages}" }
+  │
+  ▼
+k8s-version-upgrade agent  (inside claude-agent-service pod)
+  ├── Stage 0: parse inputs, mark in-flight annotation + Pushgateway gauge
+  ├── Stage 1: pre-flight (5 nodes Ready + halt-on-alert + 24h-quiet + plan target match)
+  ├── Stage 2: etcd snapshot save → /mnt/main/etcd-backup/k8s-upgrade-pre-X.Y.Z-EPOCH.db
+  │            push k8s_upgrade_snapshot_taken=1
+  ├── Stage 3: master containerd bump (only if master < workers)
+  ├── Stage 4: apt repo URL rewrite to v<NEW_MINOR>/deb (only if kind=minor)
+  ├── Stage 5: drain master → ssh < update_k8s.sh --role master --release X.Y.Z → uncordon → verify
+  ├── Stage 6: each worker k8s-node4 → k8s-node3 → k8s-node2 → k8s-node1:
+  │            halt-on-alert wait → drain → ssh script --role worker → uncordon → 10-min soak
+  └── Stage 7: post-flight (all nodes match target, alerts clean, pod-ready ratio ≥ 0.9)
+               clear in-flight annotation, push k8s_upgrade_in_flight=0
+```
+
+## Components
+
+### Detection CronJob (`k8s-version-check`)
+- **Stack**: `infra/stacks/k8s-version-upgrade/main.tf`
+- **Image**: `forgejo.viktorbarzin.me/viktor/claude-agent-service` (ships kubectl, ssh-client, curl, jq)
+- **Schedule**: `0 12 * * 0` (Sunday 12:00 UTC). Outside kured window.
+- **SA**: `k8s-version-check` (cluster-read nodes, ns-scoped get on `k8s-upgrade-creds` Secret)
+- **Pushgateway metrics**:
+  - `k8s_upgrade_available{kind, running, target}` — 1 when a target is detected
+  - `k8s_version_check_last_run_timestamp` — staleness watchdog
+
+### Agent (`k8s-version-upgrade`)
+- **Prompt**: `infra/.claude/agents/k8s-version-upgrade.md`
+- **Runtime**: claude-agent-service pod (claude-agent ns)
+- **Inputs** (JSON in prompt): `target_version`, `kind` (patch|minor), `dry_run`, `stages`
+- **Library script**: `infra/scripts/update_k8s.sh` (run on each node via SSH pipe — `ssh ... 'bash -s' < update_k8s.sh -- --role master|worker --release X.Y.Z`)
+
+### Upgrade Gates alerts (additions for this pipeline)
+- **`K8sVersionSkew`** — distinct kubelet/apiserver `gitVersion` count >1 for 30m. Catches a half-done rollout where some nodes are upgraded and some aren't.
+- **`EtcdPreUpgradeSnapshotMissing`** — `k8s_upgrade_in_flight==1 && k8s_upgrade_snapshot_taken==0` for 10m. Catches Stage 2 failing silently.
+- Both join the existing 10 Upgrade Gates alerts (KubeAPIServerDown, RecentNodeReboot, etc.) — kured ALSO blocks rolling reboots whenever any of these are firing.
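+
+A by-hand approximation of the `K8sVersionSkew` condition, as a minimal sketch: it counts distinct kubelet versions only, whereas the alert also considers the apiserver. `distinct_versions` is a hypothetical helper name, not something shipped in this repo.
+
+```bash
+# Hypothetical helper: reads one version string per line on stdin and
+# prints how many distinct non-empty values it saw.
+distinct_versions() {
+  sort -u | grep -c . || true
+}
+
+# Usage sketch (assumes kubectl access to the cluster):
+#   kubectl get nodes -o jsonpath='{range .items[*]}{.status.nodeInfo.kubeletVersion}{"\n"}{end}' \
+#     | distinct_versions
+# A result greater than 1 means at least one node runs a different kubelet version.
+```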
+
+### Vault secrets
+- `secret/k8s-upgrade/ssh_key` — ed25519 PRIVATE key, used by detection CronJob + agent to SSH into all 5 nodes (user `wizard`)
+- `secret/k8s-upgrade/ssh_key_pub` — matching PUBLIC key, deployed to `/home/wizard/.ssh/authorized_keys` on every node
+- `secret/k8s-upgrade/slack_webhook` — Slack incoming-webhook URL (separate channel from kured for clean alerting)
+
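+`ssh_key` and `slack_webhook` are exposed in K8s via ExternalSecret `k8s-upgrade-creds` in the `k8s-upgrade` namespace, together with `api_bearer_token` from `secret/claude-agent-service`; `ssh_key_pub` is not synced.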
+
+## Common Operations
+
+### Verify the pipeline is healthy
+```bash
+# CronJob present + not suspended
+kubectl -n k8s-upgrade get cronjob k8s-version-check
+
+# Latest run output
+kubectl -n k8s-upgrade get jobs -l app=k8s-version-check
+kubectl -n k8s-upgrade logs -l app=k8s-version-check --tail=200
+
+# Pushgateway metric — fresh discovery?
+curl -s http://prometheus-prometheus-pushgateway.monitoring:9091/metrics | \
+  grep -E '^(k8s_upgrade_available|k8s_version_check_last_run_timestamp)'
+
+# Upgrade Gates rules loaded
+kubectl -n monitoring exec deploy/prometheus-server -c prometheus-server -- \
+  wget -q -O- 'http://localhost:9090/api/v1/rules' | \
+  jq -r '.data.groups[] | select(.name == "Upgrade Gates") | .rules[] | "  \(.name): \(.state)"'
+```
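+
+The staleness watchdog value can be turned into an age with a small filter. This is a sketch that assumes the Pushgateway text exposition format returned by the `curl` above; `last_run_age` is a hypothetical helper name.
+
+```bash
+# Hypothetical helper: reads Pushgateway /metrics text on stdin and prints
+# the age in seconds of the last detection run, given "now" (epoch seconds)
+# as the first argument. Returns non-zero if the metric is absent.
+last_run_age() {
+  awk -v now="$1" '
+    $1 ~ /^k8s_version_check_last_run_timestamp/ {
+      # awk parses scientific notation such as 1.7e+09 as a number
+      printf "%d\n", now - $NF
+      found = 1
+    }
+    END { exit !found }
+  '
+}
+
+# Usage sketch:
+#   curl -s http://prometheus-prometheus-pushgateway.monitoring:9091/metrics \
+#     | last_run_age "$(date +%s)"
+```
+
+With a weekly schedule, an age well above seven days suggests the CronJob has stopped running or its Pushgateway push is failing.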
+
+### Manually trigger a detection run (no upgrade)
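+`kubectl create job --from=cronjob/...` copies the CronJob's pod template, so the
+one-shot job only short-circuits before the POST to claude-agent-service if
+`detection_dry_run=true` has already been applied: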
+
+```bash
+# One-shot job from the cron. DRY_RUN is inherited from the CronJob spec, not overridden here:
+kubectl -n k8s-upgrade create job --from=cronjob/k8s-version-check version-check-test
+kubectl -n k8s-upgrade logs -l job-name=version-check-test -f
+```
+
+To change `detection_dry_run` (e.g. while debugging),
+toggle the var in `stacks/k8s-version-upgrade/main.tf` and run `scripts/tg apply`.
+
+### Manually dispatch the agent (skip detection)
+Useful when you want to force a run on a specific version without waiting for
+Sunday, or when testing.
+
+```bash
+TOKEN=$(vault kv get -field=api_bearer_token secret/claude-agent-service)
+
+# Dry-run (no mutations)
+curl -X POST http://claude-agent-service.claude-agent.svc.cluster.local:8080/execute \
+  -H "Authorization: Bearer $TOKEN" \
+  -H "Content-Type: application/json" \
+  -d '{
+    "prompt": "Run the k8s-version-upgrade agent. Inputs: {\"target_version\":\"1.34.5\",\"kind\":\"patch\",\"dry_run\":true,\"stages\":\"all\"}",
+    "agent": ".claude/agents/k8s-version-upgrade",
+    "max_budget_usd": 5
+  }'
+
+# Snapshot-only (Test 3 in the plan)
+curl -X POST ... -d '{
+    "prompt": "Run the k8s-version-upgrade agent. Inputs: {\"target_version\":\"1.34.5\",\"kind\":\"patch\",\"dry_run\":false,\"stages\":\"preflight,snapshot\"}",
+    ...
+}'
+
+# Real run
+curl -X POST ... -d '{
+    "prompt": "... Inputs: {\"target_version\":\"1.34.5\",\"kind\":\"patch\",\"dry_run\":false,\"stages\":\"all\"}",
+    ...
+}'
+```
+
+Poll job status:
+```bash
+curl -s -H "Authorization: Bearer $TOKEN" \
+  http://claude-agent-service.claude-agent.svc.cluster.local:8080/jobs/$JOB_ID | jq .
+```
+
+### Halt the pipeline in an emergency
+The pipeline is gated by Prometheus alerts — any firing Upgrade Gates alert
+blocks the next drain. To explicitly halt:
+
+```bash
+# Option 1: suspend the detection CronJob (won't stop an in-flight agent run)
+kubectl -n k8s-upgrade patch cronjob k8s-version-check \
+  -p '{"spec":{"suspend":true}}' --type=merge
+# Re-enable: --type=merge -p '{"spec":{"suspend":false}}'
+
+# Option 2: kill an in-flight agent job
+TOKEN=$(vault kv get -field=api_bearer_token secret/claude-agent-service)
+JOB_ID=$(curl -s -H "Authorization: Bearer $TOKEN" \
+  http://claude-agent-service.claude-agent.svc.cluster.local:8080/jobs | \
+  jq -r '.[] | select(.agent | test("k8s-version-upgrade")) | .id' | head -1)
+curl -X DELETE -H "Authorization: Bearer $TOKEN" \
+  http://claude-agent-service.claude-agent.svc.cluster.local:8080/jobs/$JOB_ID
+
+# Option 3: force a blocker alert (Upgrade Gates expression that always fires)
+# — see infra/docs/runbooks/k8s-node-auto-upgrades.md "Force halt by adding a custom blocker alert"
+```
+
+### Rollback paths
+
+`kubeadm` does **not** support in-place downgrade. If a run fails:
+
+#### Master broke during/after kubeadm upgrade
+1. Identify the etcd snapshot: `kubectl get ns k8s-upgrade -o jsonpath='{.metadata.annotations.viktorbarzin\.me/k8s-upgrade-snapshot-path}'`
+2. Restore etcd per `infra/docs/runbooks/restore-etcd.md`.
+3. Manually downgrade master `kubeadm`/`kubelet`/`kubectl` to the pre-upgrade version. Find versions in `/var/log/apt/history.log` on the node:
+   ```bash
+   ssh wizard@k8s-master 'sudo cat /var/log/apt/history.log | tail -40'
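+   # Pre-upgrade versions are the first value in each pair on the most recent
+   # "Upgrade: kubeadm:amd64 (<OLD>, <NEW>)" line.
+   # Run the remaining commands on the node itself (e.g. in an interactive ssh session):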
+   sudo apt-mark unhold kubeadm kubelet kubectl
+   sudo apt-get install --allow-downgrades -y \
+     kubeadm=<OLD>-1.1 kubelet=<OLD>-1.1 kubectl=<OLD>-1.1
+   sudo apt-mark hold kubeadm kubelet kubectl
+   sudo systemctl daemon-reload && sudo systemctl restart kubelet
+   ```
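+
+Picking `<OLD>` out of the log can be scripted. The sketch below assumes the usual apt `history.log` layout, where an upgrade is recorded as `Upgrade: kubeadm:amd64 (<OLD>, <NEW>), ...`; `prev_kubeadm_version` is a hypothetical helper name.
+
+```bash
+# Hypothetical helper: reads apt history.log text on stdin and prints the
+# first version of the most recent "kubeadm:<arch> (<A>, <B>)" pair, which
+# for an "Upgrade:" entry is the pre-upgrade version.
+prev_kubeadm_version() {
+  grep -oE 'kubeadm:[a-z0-9]+ \([^,]+, [^)]+\)' \
+    | tail -1 \
+    | sed -E 's/^[^(]*\(([^,]+), .*$/\1/'
+}
+
+# Usage sketch:
+#   ssh wizard@k8s-master 'sudo cat /var/log/apt/history.log' | prev_kubeadm_version
+```
+
+Check the result against the log by eye before downgrading: an earlier rollback also leaves a matching pair in the log, in which case the last entry is not the upgrade you want to undo.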
+
+#### Worker broke
+1. `kubectl drain <node> --ignore-daemonsets --delete-emptydir-data --force --grace-period=300`
+2. Downgrade apt packages on that node only (see above)
+3. `kubectl uncordon <node>`
+4. The cluster continues running on the master + remaining workers throughout
+
+#### Pipeline aborts mid-flight (halt-on-alert blocks >30 min)
+- The agent posts a Slack message with the blocking alert list and exits non-zero
+- The in-flight annotation on `ns/k8s-upgrade` stays set → `EtcdPreUpgradeSnapshotMissing` may fire if Stage 2 didn't complete
+- Operator: triage the blocker, clear the alert, re-dispatch the agent manually (see "Manually dispatch the agent")
+- After successful retry: the agent's Stage 7 clears the annotation. If you decide NOT to retry, clear by hand:
+  ```bash
+  kubectl annotate ns k8s-upgrade \
+    viktorbarzin.me/k8s-upgrade-in-flight- \
+    viktorbarzin.me/k8s-upgrade-target- \
+    viktorbarzin.me/k8s-upgrade-snapshot-path-
+  # Also reset the Pushgateway gauge so the alert clears:
+  printf '# TYPE k8s_upgrade_in_flight gauge\nk8s_upgrade_in_flight 0\n' | \
+    curl --data-binary @- http://prometheus-prometheus-pushgateway.monitoring:9091/metrics/job/k8s-version-upgrade
+  ```
+
+### One-shot SSH key rotation
+1. Generate new keypair: `ssh-keygen -t ed25519 -f /tmp/k8s-upgrade -N ""`
+2. Update Vault:
+   ```bash
+   vault kv patch secret/k8s-upgrade \
+     ssh_key=@/tmp/k8s-upgrade \
+     ssh_key_pub=@/tmp/k8s-upgrade.pub
+   ```
+3. Push the new pubkey to every node:
+   ```bash
+   for n in k8s-master k8s-node1 k8s-node2 k8s-node3 k8s-node4; do
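+     # Remove the old upgrade key (tagged "k8s-upgrade-key"), then append the new one.
+     # The pubkey is read locally and piped over: /tmp/k8s-upgrade.pub does not exist on the node.
+     ssh wizard@$n 'sed -i "/k8s-upgrade-key$/d" ~/.ssh/authorized_keys'
+     echo "$(cat /tmp/k8s-upgrade.pub) k8s-upgrade-key" | ssh wizard@$n 'cat >> ~/.ssh/authorized_keys'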
+   done
+   ```
+4. ESO refreshes the K8s Secret within 15 min — or force: `kubectl -n k8s-upgrade annotate externalsecret k8s-upgrade-creds force-sync=$(date +%s) --overwrite`
+
+## Past Incidents
+
+- (none yet — pipeline went live 2026-05-10)
+- Pre-pipeline manual upgrades are documented in commit history; the script used for those manual runs lives on as `infra/scripts/update_k8s.sh`, which the agent now pipes into each node over SSH.
+
+## File Pointers
+
+| What | Where |
+|------|-------|
+| Detection CronJob + RBAC + ExternalSecret | `infra/stacks/k8s-version-upgrade/main.tf` |
+| Agent prompt | `infra/.claude/agents/k8s-version-upgrade.md` |
+| Library node script | `infra/scripts/update_k8s.sh` |
+| Upgrade Gates alerts (incl. K8sVersionSkew + EtcdPreUpgradeSnapshotMissing) | `infra/stacks/monitoring/modules/monitoring/prometheus_chart_values.tpl` |
+| Vault secrets | `secret/k8s-upgrade/{ssh_key, ssh_key_pub, slack_webhook}` |
+| Architecture doc | `infra/docs/architecture/automated-upgrades.md` — "K8s Version Upgrades" section |
+| Related (OS reboots) | `infra/docs/runbooks/k8s-node-auto-upgrades.md` |
--- a/scripts/update_k8s.sh
+++ b/scripts/update_k8s.sh
@ -1,36 +1,98 @@
 #!/usr/bin/env bash
+#
+# K8s component upgrader. Run on a single node (master OR worker) at a time.
+# The caller is responsible for:
+#   - draining + uncordoning the node (this script does not touch kubectl)
+#   - sequencing nodes (master first, then workers one at a time)
+#   - pre-flight checks (etcd snapshot, halt-on-alert, etc)
+#
+# Used by:
+#   - the k8s-version-upgrade agent (infra/.claude/agents/k8s-version-upgrade.md)
+#   - manual operators following the runbook (infra/docs/runbooks/k8s-version-upgrade.md)
+#
+# Old manual orchestration loop (kept for reference — the agent does the
+# equivalent now):
+#   for n in $(kbn | grep 'k8s-node' | awk '{print $1}'); do
+#     kb drain $n --ignore-daemonsets --delete-emptydir-data
+#     s wizard@$n 'bash -s' < update_k8s.sh -- --role worker --release 1.34.5
+#     kb uncordon $n
+#   done

-# run for all nodes using :
-# for n in $(kbn | grep 'k8s-node' | awk '{print $1}'); do echo $n; kb drain $n --ignore-daemonsets --delete-emptydir-data; s wizard@$n 'bash -s' <update_k8s.sh; kb uncordon $n; done
+set -euo pipefail

-set -e
-export stable_version='1.34'  # change me
-export release="$stable_version.2"  # change me
+ROLE=""
+RELEASE=""

-echo "Upgrading to $stable_version"
+usage() {
+    cat <<EOF
+Usage: $0 --role <master|worker> --release <X.Y.Z>

-echo "deb [signed-by=/etc/apt/keyrings/kubernetes-apt-keyring.gpg] https://pkgs.k8s.io/core:/stable:/v$stable_version/deb/ /" | sudo tee /etc/apt/sources.list.d/kubernetes.list
-sudo mkdir -p /etc/apt/keyrings
-curl -fsSL "https://pkgs.k8s.io/core:/stable:/v$stable_version/deb/Release.key" | sudo gpg --dearmor -o /etc/apt/keyrings/kubernetes-apt-keyring.gpg --batch --yes
+  --role     master|worker  (required)
+  --release  kubeadm/kubelet/kubectl target patch version, e.g. 1.34.5

-sudo apt-mark unhold kubeadm kubelet kubectl
-sudo apt-get update 
-sudo apt-get install -y kubeadm="$release-*" 
+Behavior:
+  - Rewrites /etc/apt/sources.list.d/kubernetes.list to the v\$MINOR/deb repo
+    derived from --release (so a 1.34.x release uses v1.34/deb, 1.35.x uses
+    v1.35/deb, etc).
+  - apt-get install kubeadm=<release>-* (apt-mark unhold first).
+  - master: kubeadm upgrade plan && kubeadm upgrade apply v<release> -y
+  - worker: kubeadm upgrade node
+  - apt-get install kubelet=<release>-* kubectl=<release>-* then re-hold.
+  - systemctl daemon-reload && systemctl restart kubelet
+EOF
+}

-HOSTNAME=$(hostname)
-SEARCH_STR="master"
+while [[ $# -gt 0 ]]; do
+    case "$1" in
+        --role)    ROLE="$2"; shift 2;;
+        --release) RELEASE="$2"; shift 2;;
+        -h|--help) usage; exit 0;;
+        *) echo "Unknown arg: $1" >&2; usage; exit 2;;
+    esac
+done

-if [[ "$HOSTNAME" == *"$SEARCH_STR"* ]]; then
-    echo "Upgrading master"
-    sudo kubeadm upgrade plan && sudo kubeadm upgrade apply v$release -y
-else
-    echo "Upgrading worker"
-    sudo kubeadm upgrade node 
+if [[ -z "$ROLE" || -z "$RELEASE" ]]; then
+    echo "ERROR: --role and --release are required" >&2
+    usage
+    exit 2
 fi

-sudo apt-get install -y kubelet="$release-*" kubectl="$release-*"
-sudo apt-mark hold kubeadm kubelet kubectl
+if [[ "$ROLE" != "master" && "$ROLE" != "worker" ]]; then
+    echo "ERROR: --role must be 'master' or 'worker' (got: $ROLE)" >&2
+    exit 2
+fi

+# Derive minor track (e.g. 1.34.5 → 1.34)
+STABLE_VERSION="$(echo "$RELEASE" | awk -F. '{print $1"."$2}')"
+
+echo "==> Upgrading $(hostname) ($ROLE) to v$RELEASE (track v$STABLE_VERSION)"
+
+# Apt repo URL is pinned per minor track. Rewrite + re-import the signing key
+# every run — cheap, idempotent, and handles the minor-bump case where the
+# old track's repo no longer carries the target version.
+echo "deb [signed-by=/etc/apt/keyrings/kubernetes-apt-keyring.gpg] https://pkgs.k8s.io/core:/stable:/v$STABLE_VERSION/deb/ /" \
+    | sudo tee /etc/apt/sources.list.d/kubernetes.list
+sudo mkdir -p /etc/apt/keyrings
+curl -fsSL "https://pkgs.k8s.io/core:/stable:/v$STABLE_VERSION/deb/Release.key" \
+    | sudo gpg --dearmor -o /etc/apt/keyrings/kubernetes-apt-keyring.gpg --batch --yes
+
+sudo apt-mark unhold kubeadm kubelet kubectl
+sudo apt-get update
+sudo apt-get install -y "kubeadm=$RELEASE-*"
+
+if [[ "$ROLE" == "master" ]]; then
+    echo "==> Master path: kubeadm upgrade plan + apply"
+    sudo kubeadm upgrade plan
+    sudo kubeadm upgrade apply "v$RELEASE" -y
+else
+    echo "==> Worker path: kubeadm upgrade node"
+    sudo kubeadm upgrade node
+fi
+
+sudo apt-get install -y "kubelet=$RELEASE-*" "kubectl=$RELEASE-*"
+sudo apt-mark hold kubeadm kubelet kubectl

 sudo systemctl daemon-reload
 sudo systemctl restart kubelet
+
+echo "==> Done: $(hostname) is on v$RELEASE"
--- a/scripts/update_node.sh
+++ b/scripts/update_node.sh
@ -1,8 +1,14 @@
 #!/usr/bin/env bash
+#
+# OS-major upgrade (Ubuntu do-release-upgrade). NOT in the auto-upgrade
+# pipeline — minor apt patches are handled by unattended-upgrades + kured;
+# K8s component bumps are handled by the k8s-version-upgrade agent. Run this
+# script manually when bumping Ubuntu LTS major versions.
+#
+# See:
+#   - infra/docs/runbooks/k8s-node-auto-upgrades.md  (apt + reboot)
+#   - infra/docs/runbooks/k8s-version-upgrade.md     (kubeadm/kubelet/kubectl)

 # sudo apt update && sudo apt autoremove -y && sudo apt upgrade -y
 sudo do-release-upgrade
 sudo apt update && sudo apt autoremove -y && sudo apt upgrade -y
-
-
-
--- a/stacks/k8s-version-upgrade/main.tf
+++ b/stacks/k8s-version-upgrade/main.tf
@ -0,0 +1,456 @@
+# k8s-version-upgrade — Automated K8s component (kubeadm/kubelet/kubectl) upgrade
+#
+# Detects new patch/minor versions via a weekly CronJob, then dispatches the
+# `k8s-version-upgrade` agent (infra/.claude/agents/k8s-version-upgrade.md)
+# through claude-agent-service for the actual rolling upgrade.
+#
+# Reuse points:
+#   - claude-agent-service.claude-agent.svc:8080 — agent job runner
+#   - Vault secret/k8s-upgrade/* — operator populates ssh_key + slack_webhook
+#   - Prometheus + Pushgateway + Upgrade Gates alert group (in monitoring stack)
+#   - update_k8s.sh — library script the agent shells into nodes with
+#
+# Notes:
+#   - Schedule is Sun 12:00 UTC — well outside the kured Mon-Fri 02:00-06:00
+#     London window so OS reboots and K8s version rollouts can't overlap.
+#   - Patch detection uses `apt-cache madison kubeadm` on master via SSH.
+#     Minor detection probes the next-minor apt repo URL with HEAD.
+
+variable "schedule" {
+  type    = string
+  default = "0 12 * * 0" # Sunday 12:00 UTC
+}
+
+# Toggle to suspend the detection CronJob without dropping the stack.
+variable "enabled" {
+  type    = bool
+  default = true
+}
+
+# Mirrors `local.image_tag` in stacks/claude-agent-service/main.tf — keep in
+# sync when the claude-agent-service image is rebuilt. Reused here because the
+# detection CronJob only needs kubectl, ssh-client, curl, jq — all of which
+# the claude-agent-service image already ships.
+variable "claude_agent_service_image_tag" {
+  type    = string
+  default = "2fd7670d"
+}
+
+# If true, the CronJob runs the detection sequence but does NOT POST to
+# claude-agent-service. Used for Test 1 to confirm detection works without
+# firing a real upgrade.
+variable "detection_dry_run" {
+  type    = bool
+  default = false
+}
+
+locals {
+  namespace = "k8s-upgrade"
+  ca_image  = "forgejo.viktorbarzin.me/viktor/claude-agent-service:${var.claude_agent_service_image_tag}"
+  labels = {
+    app = "k8s-version-check"
+  }
+}
+
+# --- Namespace ---
+
+resource "kubernetes_namespace" "k8s_upgrade" {
+  metadata {
+    name = local.namespace
+    labels = {
+      tier = local.tiers.cluster
+    }
+  }
+  lifecycle {
+    # KYVERNO_LIFECYCLE_V1: goldilocks-vpa-auto-mode ClusterPolicy stamps this label on every namespace
+    ignore_changes = [metadata[0].labels["goldilocks.fairwinds.com/vpa-update-mode"]]
+  }
+}
+
+# --- ExternalSecret: ssh_key + slack_webhook + agent-service bearer ---
+#
+# Operator populates Vault `secret/k8s-upgrade/` with:
+#   - ssh_key         (PEM-encoded ed25519 private key)
+#   - ssh_key_pub     (the matching public key — distributed to nodes' authorized_keys)
+#   - slack_webhook   (Slack incoming-webhook URL, separate channel from kured for clean alerting)
+#
+# The claude-agent-service bearer token comes from secret/claude-agent-service
+# (reused — no parallel token needed).
+
+resource "kubernetes_manifest" "external_secret" {
+  manifest = {
+    apiVersion = "external-secrets.io/v1beta1"
+    kind       = "ExternalSecret"
+    metadata = {
+      name      = "k8s-upgrade-creds"
+      namespace = kubernetes_namespace.k8s_upgrade.metadata[0].name
+    }
+    spec = {
+      refreshInterval = "15m"
+      secretStoreRef = {
+        name = "vault-kv"
+        kind = "ClusterSecretStore"
+      }
+      target = {
+        name = "k8s-upgrade-creds"
+      }
+      data = [
+        {
+          secretKey = "ssh_key"
+          remoteRef = {
+            key      = "k8s-upgrade"
+            property = "ssh_key"
+          }
+        },
+        {
+          secretKey = "slack_webhook"
+          remoteRef = {
+            key      = "k8s-upgrade"
+            property = "slack_webhook"
+          }
+        },
+        {
+          secretKey = "api_bearer_token"
+          remoteRef = {
+            key      = "claude-agent-service"
+            property = "api_bearer_token"
+          }
+        },
+      ]
+    }
+  }
+}
+
+# --- ServiceAccount + RBAC for the detection CronJob ---
+
+resource "kubernetes_service_account" "k8s_version_check" {
+  metadata {
+    name      = "k8s-version-check"
+    namespace = kubernetes_namespace.k8s_upgrade.metadata[0].name
+  }
+}
+
+# Cluster-wide read on nodes (for kubeletVersion comparison)
+resource "kubernetes_cluster_role" "k8s_version_check" {
+  metadata {
+    name = "k8s-version-check"
+  }
+  rule {
+    api_groups = [""]
+    resources  = ["nodes"]
+    verbs      = ["get", "list"]
+  }
+}
+
+resource "kubernetes_cluster_role_binding" "k8s_version_check" {
+  metadata {
+    name = "k8s-version-check"
+  }
+  role_ref {
+    api_group = "rbac.authorization.k8s.io"
+    kind      = "ClusterRole"
+    name      = kubernetes_cluster_role.k8s_version_check.metadata[0].name
+  }
+  subject {
+    kind      = "ServiceAccount"
+    name      = kubernetes_service_account.k8s_version_check.metadata[0].name
+    namespace = kubernetes_namespace.k8s_upgrade.metadata[0].name
+  }
+}
+
+# Namespace-scoped: detection CronJob reads its own creds Secret.
+resource "kubernetes_role" "k8s_version_check_secrets" {
+  metadata {
+    name      = "k8s-version-check-secrets"
+    namespace = kubernetes_namespace.k8s_upgrade.metadata[0].name
+  }
+  rule {
+    api_groups     = [""]
+    resources      = ["secrets"]
+    resource_names = ["k8s-upgrade-creds"]
+    verbs          = ["get"]
+  }
+}
+
+resource "kubernetes_role_binding" "k8s_version_check_secrets" {
+  metadata {
+    name      = "k8s-version-check-secrets"
+    namespace = kubernetes_namespace.k8s_upgrade.metadata[0].name
+  }
+  role_ref {
+    api_group = "rbac.authorization.k8s.io"
+    kind      = "Role"
+    name      = kubernetes_role.k8s_version_check_secrets.metadata[0].name
+  }
+  subject {
+    kind      = "ServiceAccount"
+    name      = kubernetes_service_account.k8s_version_check.metadata[0].name
+    namespace = kubernetes_namespace.k8s_upgrade.metadata[0].name
+  }
+}
+
+# --- Cross-namespace RBAC: claude-agent SA reads k8s-upgrade-creds + annotates ns ---
+#
+# The k8s-version-upgrade agent runs inside the claude-agent-service pod (SA
+# `claude-agent` in `claude-agent` ns). It needs:
+#   - GET on this namespace's k8s-upgrade-creds Secret (to fetch ssh_key + slack)
+#   - PATCH on the k8s-upgrade Namespace annotations (in-flight marker)
+
+resource "kubernetes_role" "claude_agent_reads_creds" {
+  metadata {
+    name      = "claude-agent-reads-creds"
+    namespace = kubernetes_namespace.k8s_upgrade.metadata[0].name
+  }
+  rule {
+    api_groups     = [""]
+    resources      = ["secrets"]
+    resource_names = ["k8s-upgrade-creds"]
+    verbs          = ["get"]
+  }
+}
+
+resource "kubernetes_role_binding" "claude_agent_reads_creds" {
+  metadata {
+    name      = "claude-agent-reads-creds"
+    namespace = kubernetes_namespace.k8s_upgrade.metadata[0].name
+  }
+  role_ref {
+    api_group = "rbac.authorization.k8s.io"
+    kind      = "Role"
+    name      = kubernetes_role.claude_agent_reads_creds.metadata[0].name
+  }
+  subject {
+    kind      = "ServiceAccount"
+    name      = "claude-agent"
+    namespace = "claude-agent"
+  }
+}
+
+# The claude-agent ClusterRole already grants `get,list,watch` on namespaces
+# but NOT patch — so we need to extend it here for the annotation write.
+# Bound via a separate ClusterRoleBinding so we don't fork the upstream stack.
+resource "kubernetes_cluster_role" "claude_agent_annotates_ns" {
+  metadata {
+    name = "claude-agent-annotates-k8s-upgrade-ns"
+  }
+  rule {
+    api_groups     = [""]
+    resources      = ["namespaces"]
+    resource_names = ["k8s-upgrade"]
+    verbs          = ["patch", "update"]
+  }
+}
+
+resource "kubernetes_cluster_role_binding" "claude_agent_annotates_ns" {
+  metadata {
+    name = "claude-agent-annotates-k8s-upgrade-ns"
+  }
+  role_ref {
+    api_group = "rbac.authorization.k8s.io"
+    kind      = "ClusterRole"
+    name      = kubernetes_cluster_role.claude_agent_annotates_ns.metadata[0].name
+  }
+  subject {
+    kind      = "ServiceAccount"
+    name      = "claude-agent"
+    namespace = "claude-agent"
+  }
+}
+
+# --- Detection CronJob ---
+#
+# Weekly: compares running cluster version against latest available patch
+# (apt-cache madison kubeadm on master) and latest available minor (HEAD on
+# next-minor pkgs.k8s.io repo). When a target is detected, POSTs to
+# claude-agent-service to kick the upgrade agent.
+
+resource "kubernetes_cron_job_v1" "k8s_version_check" {
+  metadata {
+    name      = "k8s-version-check"
+    namespace = kubernetes_namespace.k8s_upgrade.metadata[0].name
+    labels    = local.labels
+  }
+  spec {
+    schedule                      = var.schedule
+    concurrency_policy            = "Forbid"
+    successful_jobs_history_limit = 3
+    failed_jobs_history_limit     = 3
+    starting_deadline_seconds     = 600
+    suspend                       = !var.enabled
+    job_template {
+      metadata {
+        labels = local.labels
+      }
+      spec {
+        backoff_limit              = 0
+        ttl_seconds_after_finished = 86400
+        template {
+          metadata {
+            labels = local.labels
+          }
+          spec {
+            service_account_name = kubernetes_service_account.k8s_version_check.metadata[0].name
+            restart_policy       = "Never"
+            image_pull_secrets {
+              name = "registry-credentials"
+            }
+            container {
+              name  = "version-check"
+              image = local.ca_image
+              command = ["/bin/bash", "-c", <<-EOT
+                set -euo pipefail
+                echo "==> k8s-version-check ($(date -u +%FT%TZ))"
+
+                # 1. Load SSH key from K8s Secret
+                mkdir -p /tmp
+                /usr/local/bin/kubectl get secret k8s-upgrade-creds \
+                  -o jsonpath='{.data.ssh_key}' | base64 -d > /tmp/k8s-upgrade-ssh-key
+                chmod 400 /tmp/k8s-upgrade-ssh-key
+
+                SLACK=$(/usr/local/bin/kubectl get secret k8s-upgrade-creds \
+                  -o jsonpath='{.data.slack_webhook}' | base64 -d)
+
+                AGENT_TOKEN=$(/usr/local/bin/kubectl get secret k8s-upgrade-creds \
+                  -o jsonpath='{.data.api_bearer_token}' | base64 -d)
+
+                SSH="ssh -i /tmp/k8s-upgrade-ssh-key \
+                  -o StrictHostKeyChecking=accept-new \
+                  -o UserKnownHostsFile=/tmp/known_hosts"
+
+                slack() {
+                  curl -sS -X POST -H 'Content-Type: application/json' \
+                    --data "$(jq -nc --arg t "[k8s-version-check] $1" '{text: $t}')" \
+                    "$SLACK" || true
+                }
+
+                # 2. Detect running version
+                RUNNING=$(/usr/local/bin/kubectl get nodes \
+                  -o jsonpath='{.items[0].status.nodeInfo.kubeletVersion}' \
+                  | tr -d v)
+                RUNNING_MINOR=$(echo "$RUNNING" | awk -F. '{print $1"."$2}')
+                echo "Running version: v$RUNNING (minor $RUNNING_MINOR)"
+
+                # 3. Detect highest available patch within the running minor track.
+                LATEST_PATCH=$($SSH wizard@k8s-master \
+                  "apt-cache madison kubeadm 2>/dev/null \
+                    | awk '{print \$3}' \
+                    | sed 's/-.*//' \
+                    | grep '^$RUNNING_MINOR\\.' \
+                    | sort -V | tail -1" || echo "")
+                echo "Latest patch (apt): v$LATEST_PATCH"
+
+                # 4. Detect next available minor by probing the apt repo URL.
+                #    -L follows any redirect so the final HTTP status is what gets checked.
+                NEXT_MINOR_NUM=$(( $(echo "$RUNNING_MINOR" | cut -d. -f2) + 1 ))
+                NEXT_MINOR="1.$NEXT_MINOR_NUM"
+                NEXT_MINOR_AVAILABLE="no"
+                if curl -sILo /dev/null -w '%%{http_code}' \
+                    "https://pkgs.k8s.io/core:/stable:/v$NEXT_MINOR/deb/Release" \
+                    | grep -q '^200$'; then
+                  NEXT_MINOR_AVAILABLE="yes"
+                fi
+                echo "Next minor v$NEXT_MINOR available: $NEXT_MINOR_AVAILABLE"
+
+                # 5. Decide what to do
+                TARGET=""
+                KIND=""
+                if [ -n "$LATEST_PATCH" ] && [ "$LATEST_PATCH" != "$RUNNING" ]; then
+                  TARGET="$LATEST_PATCH"
+                  KIND="patch"
+                elif [ "$NEXT_MINOR_AVAILABLE" = "yes" ]; then
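+                  # Probe the minor track to get its latest kubeadm patch. -L follows
+                  # redirects; the awk filter keeps only kubeadm stanzas, since the repo
+                  # also carries other packages with their own version numbers.
+                  NEXT_MINOR_PATCH=$($SSH wizard@k8s-master \
+                    "curl -sfL 'https://pkgs.k8s.io/core:/stable:/v$NEXT_MINOR/deb/Packages' \
+                      | awk '/^Package: kubeadm\$/ {p=1} p && /^Version:/ {print \$2; p=0}' \
+                      | sed 's/-.*//' \
+                      | sort -V | tail -1" || echo "")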
+                  if [ -n "$NEXT_MINOR_PATCH" ]; then
+                    TARGET="$NEXT_MINOR_PATCH"
+                    KIND="minor"
+                  fi
+                fi
+
+                # 6. Push the discovery metric to Pushgateway
+                PG='http://prometheus-prometheus-pushgateway.monitoring:9091/metrics/job/k8s-version-check'
+                {
+                  echo "# TYPE k8s_upgrade_available gauge"
+                  if [ -n "$TARGET" ]; then
+                    echo "k8s_upgrade_available{kind=\"$KIND\",running=\"$RUNNING\",target=\"$TARGET\"} 1"
+                  else
+                    echo "k8s_upgrade_available{kind=\"none\",running=\"$RUNNING\",target=\"$RUNNING\"} 0"
+                  fi
+                  echo "# TYPE k8s_version_check_last_run_timestamp gauge"
+                  echo "k8s_version_check_last_run_timestamp $(date +%s)"
+                } | curl -sS --data-binary @- "$PG" || echo "warn: pushgateway push failed"
+
+                # 7. Decide whether to dispatch
+                if [ -z "$TARGET" ]; then
+                  echo "No upgrade needed (running=$RUNNING, latest_patch=$LATEST_PATCH, next_minor_available=$NEXT_MINOR_AVAILABLE)"
+                  exit 0
+                fi
+
+                slack "K8s upgrade available: v$RUNNING → v$TARGET ($KIND)"
+
+                if [ "$DRY_RUN" = "true" ]; then
+                  echo "DRY_RUN=true — not POSTing to claude-agent-service"
+                  slack "DRY_RUN — skipping agent dispatch"
+                  exit 0
+                fi
+
+                # 8. POST to claude-agent-service
+                PAYLOAD=$(jq -nc \
+                  --arg target "$TARGET" \
+                  --arg kind "$KIND" \
+                  '{
+                    prompt: ("Run the k8s-version-upgrade agent. Inputs: " + ({target_version: $target, kind: $kind, dry_run: false, stages: "all"} | tostring)),
+                    agent: ".claude/agents/k8s-version-upgrade",
+                    max_budget_usd: 30
+                  }')
+
+                echo "Dispatching agent: $PAYLOAD"
+                RESP=$(curl -sS -w '\n%%{http_code}' -X POST \
+                  -H "Authorization: Bearer $AGENT_TOKEN" \
+                  -H 'Content-Type: application/json' \
+                  -d "$PAYLOAD" \
+                  http://claude-agent-service.claude-agent.svc.cluster.local:8080/execute)
+                CODE=$(printf '%s' "$RESP" | tail -n1)
+                BODY=$(printf '%s' "$RESP" | sed '$d')
+
+                if [ "$CODE" = "200" ] || [ "$CODE" = "202" ]; then
+                  JOB_ID=$(printf '%s' "$BODY" | jq -r '.job_id // .id // "unknown"')
+                  slack "Agent dispatched: job=$JOB_ID (target=v$TARGET kind=$KIND)"
+                  echo "OK — job=$JOB_ID"
+                else
+                  slack "ERROR dispatching agent: HTTP $CODE — $BODY"
+                  echo "dispatch failed: HTTP $CODE — $BODY" >&2
+                  exit 1
+                fi
+              EOT
+              ]
+              env {
+                name  = "DRY_RUN"
+                value = tostring(var.detection_dry_run)
+              }
+              env {
+                name  = "HOME"
+                value = "/tmp"
+              }
+              resources {
+                requests = {
+                  cpu    = "50m"
+                  memory = "128Mi"
+                }
+                limits = {
+                  memory = "256Mi"
+                }
+              }
+            }
+          }
+        }
+      }
+    }
+  }
+  lifecycle {
+    # KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
+    ignore_changes = [spec[0].job_template[0].spec[0].template[0].spec[0].dns_config]
+  }
+}
--- a/stacks/k8s-version-upgrade/terragrunt.hcl
+++ b/stacks/k8s-version-upgrade/terragrunt.hcl
@ -0,0 +1,23 @@
+include "root" {
+  path = find_in_parent_folders()
+}
+
+# ExternalSecret hits ESO which needs to be alive when the manifest applies.
+dependency "external_secrets" {
+  config_path  = "../external-secrets"
+  skip_outputs = true
+}
+
+# Upgrade Gates rules (incl. K8sVersionSkew + EtcdPreUpgradeSnapshotMissing)
+# live in the monitoring stack — make the relationship visible so reapplies
+# don't race the alerts being available.
+dependency "monitoring" {
+  config_path  = "../monitoring"
+  skip_outputs = true
+}
+
+# Note: stacks/claude-agent-service has no terragrunt.hcl yet (manual apply
+# pattern) — its ServiceAccount + Namespace are referenced by name from this
+# stack's RoleBindings, which is fine because RoleBindings allow forward
+# references. Apply order: claude-agent-service first (or already deployed),
+# then this stack.
--- a/stacks/monitoring/modules/monitoring/prometheus_chart_values.tpl
+++ b/stacks/monitoring/modules/monitoring/prometheus_chart_values.tpl
@ -1890,14 +1890,13 @@ serverFiles:
            annotations:
              summary: "Kubelet/apiserver gitVersion skew detected — possible half-done k8s upgrade. Inspect: kubectl get nodes -o jsonpath='{.items[*].status.nodeInfo.kubeletVersion}'"
          # EtcdPreUpgradeSnapshotMissing: the k8s-version-upgrade agent pushes
-          # k8s_upgrade_in_flight=1 when it starts, and k8s_upgrade_snapshot_taken=1
-          # after the etcdctl snapshot is verified. If we see in_flight=1 with no
-          # corresponding snapshot_taken=1 after 10 min, the agent has skipped or
-          # failed the snapshot — that's a critical safety hole.
+          # `k8s_upgrade_in_flight=1` + `k8s_upgrade_snapshot_taken=0` at Stage 0,
+          # then sets snapshot_taken=1 in Stage 2 after etcdctl confirms the
+          # snapshot file size. Anywhere in_flight=1 with snapshot_taken=0
+          # lasting >10m means the agent skipped or failed Stage 2 — a critical
+          # safety hole (no recovery point if master upgrade hangs).
          - alert: EtcdPreUpgradeSnapshotMissing
-            expr: |
-              k8s_upgrade_in_flight == 1
-              unless on() k8s_upgrade_snapshot_taken == 1
+            expr: k8s_upgrade_in_flight == 1 and k8s_upgrade_snapshot_taken == 0
            for: 10m
            labels:
              severity: critical