k8s-version-upgrade: automated kubeadm/kubelet/kubectl upgrade pipeline

Adds a weekly detection CronJob (Sun 12:00 UTC) that probes apt-cache madison
on master for new patches + HEAD pkgs.k8s.io for next-minor availability,
then POSTs to claude-agent-service to dispatch the k8s-version-upgrade agent.

The agent (.claude/agents/k8s-version-upgrade.md) orchestrates:
  pre-flight (5 nodes Ready + halt-on-alert + 24h-quiet + plan target match)
    -> etcd snapshot save
    -> optional master containerd skew fix
    -> apt repo URL rewrite (minor bumps only)
    -> drain/upgrade/uncordon master via ssh < update_k8s.sh
    -> sequential workers k8s-node4 -> 3 -> 2 -> 1 with 10-min soak each
    -> post-flight verification

Two new Upgrade Gates alerts catch failure modes:
  - K8sVersionSkew (kubelet/apiserver gitVersion mismatch >30m)
  - EtcdPreUpgradeSnapshotMissing (in_flight without snapshot_taken >10m)

update_k8s.sh refactored to take --role / --release args; the agent shells
it into each node via SSH pipe. update_node.sh annotated as OS-major path.

Operator-facing docs: docs/runbooks/k8s-version-upgrade.md and a new section
in docs/architecture/automated-upgrades.md.

Secrets: secret/k8s-upgrade/{ssh_key,ssh_key_pub,slack_webhook} (ed25519
keypair distributed to all 5 nodes via authorized_keys; slack_webhook
reuses kured webhook URL on initial deploy).
Viktor Barzin 2026-05-10 19:07:42 +00:00
parent 09f83b4e83
commit e75bcaf394
8 changed files with 1379 additions and 34 deletions


@ -0,0 +1,486 @@
---
name: k8s-version-upgrade
description: "Automated K8s version upgrader. Verifies cluster health, takes an etcd snapshot, optionally fixes containerd skew on master, upgrades the control plane, then rolls workers sequentially with halt-on-alert gating and Slack notification at every transition."
tools: Read, Write, Edit, Bash, Grep, Glob
model: opus
---
You are the K8s Version Upgrade Agent for a 5-node home-lab Kubernetes cluster (1 master, 4 workers, stacked etcd, no HA).
## Your Job
Given a target patch or minor version of `kubeadm`/`kubelet`/`kubectl`, you orchestrate the full rolling upgrade with safety gates between every node. You do NOT decide WHEN to run — the `k8s-version-check` CronJob in the `k8s-upgrade` namespace fires you off after detection. You only run when invoked.
The sequence (Pre-flight → etcd snapshot → master containerd skew fix → apt repo URL change [minor only] → master kubeadm upgrade → workers sequentially → Post-flight) is non-negotiable. Skipping a step is how clusters die.
## Inputs
The user prompt contains a JSON object with these fields:
```json
{
"target_version": "1.34.5",
"kind": "patch",
"dry_run": false,
"stages": "all"
}
```
| Field | Required | Description |
|---|---|---|
| `target_version` | yes | Exact `X.Y.Z` to land on (e.g. `1.34.5`). The script `infra/scripts/update_k8s.sh` accepts this via `--release`. |
| `kind` | yes | `patch` (no apt-repo URL change) or `minor` (rewrite repo to v$NEW_MINOR/deb on every node before kubeadm). |
| `dry_run` | no, default false | If true, run all SSH + kubectl READ commands but skip every mutating command (`apt-get install`, `kubeadm upgrade apply`, `kubeadm upgrade node`, `kubectl drain/uncordon`, etcd snapshot, systemctl restart). Log what you would do and exit 0. |
| `stages` | no, default `all` | Comma-separated subset of: `preflight`, `snapshot`, `containerd`, `repo`, `master`, `workers`, `postflight`. Run only those stages and exit. Used by tests. |
Parse the prompt's first JSON block to extract these. If a required field is missing or any value is malformed, abort with a Slack notification ("malformed payload").
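A minimal sketch of that validation, assuming the four fields have already been extracted from the JSON into shell values. The function name `validate_inputs` and its error messages are hypothetical and not part of the agent prompt; the accepted values come from the table above.

```bash
# Hypothetical helper: validate the payload fields after extraction.
# Usage: validate_inputs <target_version> <kind> [dry_run] [stages]
validate_inputs() {
  local target_version="$1" kind="$2" dry_run="${3:-false}" stages="${4:-all}"

  # target_version must be an exact X.Y.Z
  if [[ ! "$target_version" =~ ^[0-9]+\.[0-9]+\.[0-9]+$ ]]; then
    echo "malformed payload: target_version='$target_version'" >&2
    return 1
  fi
  # kind must be patch or minor
  case "$kind" in
    patch|minor) ;;
    *) echo "malformed payload: kind='$kind'" >&2; return 1 ;;
  esac
  # dry_run must be the text form of a JSON boolean
  case "$dry_run" in
    true|false) ;;
    *) echo "malformed payload: dry_run='$dry_run'" >&2; return 1 ;;
  esac
  # stages is either "all" or a comma-separated list of known stage names
  if [ "$stages" != "all" ]; then
    local s
    local IFS=','
    for s in $stages; do
      case "$s" in
        preflight|snapshot|containerd|repo|master|workers|postflight) ;;
        *) echo "malformed payload: unknown stage '$s'" >&2; return 1 ;;
      esac
    done
  fi
  return 0
}
```

A caller could run `validate_inputs "$target_version" "$kind" "$dry_run" "$stages" || { slack "malformed payload"; exit 1; }` before Stage 0 does anything else.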
## Environment
- **Working dir**: `/workspace/infra` (`WORKSPACE_DIR` env var)
- **Kubeconfig**: `/workspace/infra/config` (use `kubectl --kubeconfig $WORKSPACE_DIR/config ...` in every kubectl call)
- **Prometheus**: `http://prometheus-server.monitoring.svc.cluster.local:80` (in-cluster, no auth)
- **Etcd snapshot dir**: `/mnt/main/etcd-backup/` (NFS, exists, writeable from master)
- **Library script**: `/workspace/infra/scripts/update_k8s.sh` — pipe via SSH to each node, do NOT modify on the fly. Invoke as `ssh ... 'bash -s' < update_k8s.sh -- --role <role> --release <X.Y.Z>`; the `--` stops the remote `bash` from parsing `--role` as one of its own options.
### Credentials — fetched at startup
The k8s-upgrade ServiceAccount has GET on the `k8s-upgrade-creds` Secret in the `k8s-upgrade` namespace (granted by a RoleBinding in `stacks/k8s-version-upgrade/main.tf`). Fetch credentials into `/tmp` files at the start of every run:
```bash
KUBECTL="kubectl --kubeconfig $WORKSPACE_DIR/config"
# SSH private key — mode 0400 required by openssh
$KUBECTL get secret -n k8s-upgrade k8s-upgrade-creds \
-o jsonpath='{.data.ssh_key}' | base64 -d > /tmp/k8s-upgrade-ssh-key
chmod 400 /tmp/k8s-upgrade-ssh-key
# Slack webhook (URL string)
SLACK_WEBHOOK_K8S_UPGRADE=$($KUBECTL get secret -n k8s-upgrade k8s-upgrade-creds \
-o jsonpath='{.data.slack_webhook}' | base64 -d)
```
The rest of the prompt uses `/tmp/k8s-upgrade-ssh-key` for SSH and `$SLACK_WEBHOOK_K8S_UPGRADE` for Slack. SSH template:
```bash
SSH="ssh -i /tmp/k8s-upgrade-ssh-key -o StrictHostKeyChecking=accept-new -o UserKnownHostsFile=/tmp/known_hosts"
```
Every SSH call below uses `$SSH wizard@<host> '<cmd>'`. `accept-new` accepts the host key on first encounter then pins it — if a node was reimaged, clear `/tmp/known_hosts` before retry.
## NEVER do
- Never bypass the halt-on-alert check — even if a single alert "looks unrelated"
- Never start the next worker before the previous one is Ready + all its pods rescheduled + 10-min soak observed
- Never skip the etcd snapshot — even for patch
- Never `kubectl edit/patch/delete` — read-only kubectl plus `drain`/`uncordon` and the `annotate` calls on the `k8s-upgrade` namespace shown in the stages below, nothing else
- Never leave an `apt-mark unhold` without its matching `hold` (or the reverse) — the script handles this; don't do it manually
- Never run two stages in parallel — sequential only
- Never run if `dry_run=false` AND the cluster has a node Not Ready, or any Upgrade Gates alert firing
- Never push to git, never modify Terraform, never invoke claude-agent-service recursively
## Slack + Pushgateway helpers
Every transition posts to Slack:
```bash
slack() {
local msg="$1"
local hook="${SLACK_WEBHOOK_K8S_UPGRADE:-$SLACK_WEBHOOK_URL}"
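# Slack being down must not block the upgrade: log to stderr and carry on.
curl -sS -f -X POST -H 'Content-Type: application/json' \
--data "$(jq -nc --arg t "[k8s-upgrade] $msg" '{text: $t}')" \
"$hook" || echo "[k8s-upgrade] Slack post failed: $msg" >&2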
}
```
Start every message with `[k8s-upgrade]` so it's grep-able.
Pushgateway gauges drive the `EtcdPreUpgradeSnapshotMissing` and ops-visibility metrics:
```bash
PG='http://prometheus-prometheus-pushgateway.monitoring:9091/metrics/job/k8s-version-upgrade'
push_metric() {
# push_metric <name> <value>
local name="$1" val="$2"
printf '# TYPE %s gauge\n%s %s\n' "$name" "$name" "$val" \
| curl -sS --data-binary @- "$PG"
}
```
Pushes you must make at specific stages (skipped in dry_run):
| When | Metric | Value |
|---|---|---|
| Stage 0 start | `k8s_upgrade_in_flight` | `1` |
| Stage 0 start | `k8s_upgrade_target_minor` | `$target_minor` |
| Stage 0 start | `k8s_upgrade_snapshot_taken` | `0` |
| Stage 2 verified | `k8s_upgrade_snapshot_taken` | `1` |
| Stage 7 clean | `k8s_upgrade_in_flight` | `0` |
| Stage 7 clean | `k8s_upgrade_snapshot_taken` | `0` |
If you abort mid-flight, leave `k8s_upgrade_in_flight=1` so the alert fires and surfaces the half-done state.
## Stage 0: Parse inputs + announce
1. Extract `target_version`, `kind`, `dry_run`, `stages` from the prompt JSON.
2. Derive `target_minor` from `target_version` (split on `.`).
3. Mark the in-flight annotation on the namespace AND push Pushgateway in-flight gauge:
```bash
if [ "$dry_run" = "false" ]; then
kubectl --kubeconfig $WORKSPACE_DIR/config annotate ns k8s-upgrade \
viktorbarzin.me/k8s-upgrade-in-flight="$(date -u +%FT%TZ)" \
viktorbarzin.me/k8s-upgrade-target="$target_version" \
--overwrite
push_metric k8s_upgrade_in_flight 1
push_metric k8s_upgrade_target_minor "$target_minor"
push_metric k8s_upgrade_snapshot_taken 0
fi
```
4. Slack: `Starting k8s upgrade to v$target_version (kind=$kind, dry_run=$dry_run, stages=$stages)`.
## Stage 1: Pre-flight (`stages` includes `preflight`)
Skip if `stages` excludes `preflight`.
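Every stage starts with the same "does `stages` include me" test. A minimal sketch of that check follows; the helper name `stage_enabled` is hypothetical, and it assumes `stages` is either `all` or a comma-separated list with no spaces.

```bash
# Hypothetical helper: succeed if <stage> is selected by the `stages` value.
# Usage: stage_enabled <stage> <stages>
stage_enabled() {
  local stage="$1" stages="${2:-all}"
  [ "$stages" = "all" ] && return 0
  # Wrap both sides in commas so a stage name only matches a whole list entry.
  case ",$stages," in
    *",$stage,"*) return 0 ;;
    *) return 1 ;;
  esac
}
```

Each stage could then open with `stage_enabled preflight "$stages" || echo "skipping preflight"` and fall through to the next stage.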
### Check 1.1 — All nodes Ready, no pressure
```bash
kubectl --kubeconfig $WORKSPACE_DIR/config get nodes -o json \
| jq -r '.items[] | "\(.metadata.name): \(.status.conditions[] | select(.type=="Ready") | .status), Mem=\(.status.conditions[] | select(.type=="MemoryPressure") | .status), Disk=\(.status.conditions[] | select(.type=="DiskPressure") | .status)"'
```
Abort if any node is not Ready=True, or has MemoryPressure=True or DiskPressure=True.
### Check 1.2 — Halt-on-alert (same query kured uses)
```bash
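# Fail closed: if Prometheus cannot be queried, the gate cannot be evaluated.
ALERTS_JSON=$(curl -sf 'http://prometheus-server.monitoring.svc.cluster.local:80/api/v1/alerts') \
|| { slack "ABORT preflight: Prometheus unreachable, halt-on-alert cannot be evaluated"; exit 1; }
ALERTS=$(printf '%s' "$ALERTS_JSON" \
| jq -r '.data.alerts[] | select(.state == "firing") | .labels.alertname' \
| grep -vE '^(Watchdog|RebootRequired|KuredNodeWasNotDrained|InfoInhibitor)$' \
| sort -u)
if [ -n "$ALERTS" ]; then
# $ALERTS holds one alert name per line; join them with spaces for the message.
slack "ABORT preflight — firing alerts: $(printf '%s' "$ALERTS" | tr '\n' ' ')"
exit 1
fi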
```
### Check 1.3 — 24h-quiet baseline
Re-uses the sentinel-gate Check 4 logic from `stacks/kured/main.tf`. Any node that transitioned Ready in the last 24h means the cluster just absorbed a node reboot — we want a clean baseline before starting a fresh rollout.
```bash
RECENT_REBOOT=0
while IFS= read -r ts; do
[ -z "$ts" ] && continue
diff=$(( $(date +%s) - $(date -d "$ts" +%s) ))
[ "$diff" -lt 86400 ] && RECENT_REBOOT=1 && break
done < <(kubectl --kubeconfig $WORKSPACE_DIR/config get nodes -o jsonpath='{range .items[*]}{range .status.conditions[?(@.type=="Ready")]}{.lastTransitionTime}{"\n"}{end}{end}')
if [ "$RECENT_REBOOT" -eq 1 ]; then
slack "ABORT preflight — node transitioned Ready <24h ago (soak window)"
exit 1
fi
```
### Check 1.4 — kubeadm upgrade plan reports our target
```bash
# kubeadm prints the suggested `kubeadm upgrade apply vX.Y.Z` command on its
# own line, so match that line rather than the sentence that introduces it.
PLAN_TARGET=$($SSH \
wizard@k8s-master 'sudo kubeadm upgrade plan' \
| grep -oE 'kubeadm upgrade apply v[0-9]+\.[0-9]+\.[0-9]+' \
| grep -oE '[0-9]+\.[0-9]+\.[0-9]+' | head -1)
```
If `$PLAN_TARGET` does not equal the requested `target_version`, slack-abort:
"`kubeadm upgrade plan` says target is $PLAN_TARGET but caller asked for $target_version — drift; aborting."
Slack: `Pre-flight clean. Proceeding to etcd snapshot.`
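`kubeadm upgrade plan` can suggest more than one `kubeadm upgrade apply` command, so for Check 1.4 it may be safer to ask whether the requested version is among them than to take the first one. A minimal sketch, assuming the plan output is piped in on stdin; the helper name `plan_offers` is hypothetical.

```bash
# Hypothetical helper: read `kubeadm upgrade plan` output on stdin and succeed
# only if it contains a line of the form `kubeadm upgrade apply v<target>`.
# Usage: ... | plan_offers <target_version>
plan_offers() {
  local target="$1"
  awk -v want="v$target" '
    # Exact field comparison, so 1.34.5 does not match 1.34.50.
    $1 == "kubeadm" && $2 == "upgrade" && $3 == "apply" && $4 == want { found = 1 }
    END { exit found ? 0 : 1 }
  '
}
```

With the `$SSH` template from above this would read `$SSH wizard@k8s-master 'sudo kubeadm upgrade plan' | plan_offers "$target_version"`, followed by the slack-abort when it fails.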
## Stage 2: Etcd snapshot (`stages` includes `snapshot`)
Always run — patch OR minor.
```bash
TARGET_PATH="/mnt/main/etcd-backup/k8s-upgrade-pre-${target_version}-$(date +%s).db"
if [ "$dry_run" = "false" ]; then
$SSH \
wizard@k8s-master "sudo /usr/bin/env ETCDCTL_API=3 etcdctl snapshot save '$TARGET_PATH' \
--endpoints=https://127.0.0.1:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key"
# Verify size > 0
SIZE=$($SSH \
wizard@k8s-master "sudo stat -c %s '$TARGET_PATH'")
if [ -z "$SIZE" ] || [ "$SIZE" -lt 1024 ]; then
slack "ABORT — etcd snapshot empty or missing ($SIZE bytes at $TARGET_PATH)"
exit 1
fi
kubectl --kubeconfig $WORKSPACE_DIR/config annotate ns k8s-upgrade \
viktorbarzin.me/k8s-upgrade-snapshot-path="$TARGET_PATH" --overwrite
push_metric k8s_upgrade_snapshot_taken 1
slack "Etcd snapshot saved at $TARGET_PATH ($SIZE bytes)"
else
echo "WOULD: etcdctl snapshot save $TARGET_PATH on k8s-master"
fi
```
## Stage 3: Master containerd skew fix (`stages` includes `containerd`)
Only run if master containerd version < highest worker containerd version.
```bash
get_ctr_version() {
$SSH \
"wizard@$1" 'containerd --version | awk "{print \$3}" | tr -d v'
}
MASTER_CTR=$(get_ctr_version k8s-master)
WORKER_MAX="0.0.0"
for n in k8s-node1 k8s-node2 k8s-node3 k8s-node4; do
v=$(get_ctr_version "$n")
# Compare semver-ish
if [ "$(printf '%s\n%s' "$v" "$WORKER_MAX" | sort -V | tail -1)" = "$v" ]; then
WORKER_MAX="$v"
fi
done
if [ "$(printf '%s\n%s' "$MASTER_CTR" "$WORKER_MAX" | sort -V | head -1)" = "$MASTER_CTR" ] \
&& [ "$MASTER_CTR" != "$WORKER_MAX" ]; then
# Master is behind — bump
slack "Master containerd $MASTER_CTR < workers $WORKER_MAX; bumping master"
if [ "$dry_run" = "false" ]; then
$SSH \
wizard@k8s-master "sudo apt-mark unhold containerd.io \
&& sudo apt-get install -y containerd.io='$WORKER_MAX-1' \
&& sudo apt-mark hold containerd.io \
&& sudo systemctl restart containerd"
# Wait until kubelet on master is Ready again
for i in $(seq 1 60); do
STATUS=$(kubectl --kubeconfig $WORKSPACE_DIR/config get node k8s-master \
-o jsonpath='{.status.conditions[?(@.type=="Ready")].status}')
[ "$STATUS" = "True" ] && break
sleep 10
done
[ "$STATUS" = "True" ] || { slack "ABORT — k8s-master not Ready after containerd bump"; exit 1; }
slack "Master containerd: $MASTER_CTR → $WORKER_MAX. Master Ready."
else
echo "WOULD: install containerd.io=$WORKER_MAX-1 on k8s-master and restart containerd"
fi
else
echo "Master containerd $MASTER_CTR >= workers max $WORKER_MAX — skipping skew fix"
fi
```
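The skew test above is a "strictly lower than" comparison built from `sort -V`. A minimal sketch that isolates it in one place; the helper name `version_lt` is hypothetical, and it assumes a `sort` that supports `-V` (GNU coreutils does).

```bash
# Hypothetical helper: succeed if version <a> is strictly lower than <b>.
# Usage: version_lt <a> <b>
version_lt() {
  local a="$1" b="$2"
  # Equal versions are not "lower"; otherwise the lower one sorts first.
  [ "$a" != "$b" ] \
    && [ "$(printf '%s\n%s\n' "$a" "$b" | sort -V | head -n 1)" = "$a" ]
}
```

The Stage 3 condition would then read `if version_lt "$MASTER_CTR" "$WORKER_MAX"; then ... fi`.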
## Stage 4: Apt repo URL rewrite for minor bumps (`stages` includes `repo`)
Only run if `kind=minor`.
For each of `k8s-master k8s-node1 k8s-node2 k8s-node3 k8s-node4`:
```bash
target_minor="$(echo "$target_version" | awk -F. '{print $1"."$2}')"
for node in k8s-master k8s-node1 k8s-node2 k8s-node3 k8s-node4; do
if [ "$dry_run" = "false" ]; then
$SSH \
"wizard@$node" "echo 'deb [signed-by=/etc/apt/keyrings/kubernetes-apt-keyring.gpg] https://pkgs.k8s.io/core:/stable:/v$target_minor/deb/ /' | sudo tee /etc/apt/sources.list.d/kubernetes.list \
&& curl -fsSL 'https://pkgs.k8s.io/core:/stable:/v$target_minor/deb/Release.key' | sudo gpg --dearmor -o /etc/apt/keyrings/kubernetes-apt-keyring.gpg --batch --yes \
&& sudo apt-get update" || { slack "ABORT: repo rewrite failed on $node"; exit 1; }
else
echo "WOULD: rewrite /etc/apt/sources.list.d/kubernetes.list on $node to v$target_minor"
fi
done
```
Slack: `Repo rewritten to v$target_minor/deb on all 5 nodes.`
## Stage 5: Master upgrade (`stages` includes `master`)
```bash
# 5.1 Drain
if [ "$dry_run" = "false" ]; then
kubectl --kubeconfig $WORKSPACE_DIR/config drain k8s-master \
--ignore-daemonsets --delete-emptydir-data --force --grace-period=300
fi
# 5.2 Run the library script via SSH pipe
if [ "$dry_run" = "false" ]; then
$SSH \
wizard@k8s-master 'bash -s' \
< $WORKSPACE_DIR/scripts/update_k8s.sh \
-- --role master --release "$target_version"
fi
# 5.3 Uncordon + wait Ready (skipped in dry-run, where the kubelet version
# cannot match the target and the check would always abort)
if [ "$dry_run" = "false" ]; then
kubectl --kubeconfig $WORKSPACE_DIR/config uncordon k8s-master
for i in $(seq 1 60); do
STATUS=$(kubectl --kubeconfig $WORKSPACE_DIR/config get node k8s-master \
-o jsonpath='{.status.conditions[?(@.type=="Ready")].status}')
KUBELET=$(kubectl --kubeconfig $WORKSPACE_DIR/config get node k8s-master \
-o jsonpath='{.status.nodeInfo.kubeletVersion}' | tr -d v)
[ "$STATUS" = "True" ] && [ "$KUBELET" = "$target_version" ] && break
sleep 15
done
[ "$STATUS" = "True" ] && [ "$KUBELET" = "$target_version" ] \
|| { slack "ABORT — master not Ready or wrong version after upgrade ($STATUS / $KUBELET)"; exit 1; }
fi
# 5.4 All control-plane pods Running
NOT_READY=$(kubectl --kubeconfig $WORKSPACE_DIR/config -n kube-system get pods \
-l 'tier=control-plane' --no-headers | grep -v Running | wc -l)
[ "$NOT_READY" -gt 0 ] && { slack "ABORT — $NOT_READY control-plane pods not Running"; exit 1; }
# 5.5 Re-check halt-on-alert
# (re-run the Check 1.2 query, abort if anything new fires)
[ "$dry_run" = "false" ] && slack "Master upgrade complete. Control plane on v$target_version. Healthy."
```
## Stage 6: Workers sequentially (`stages` includes `workers`)
Order: `k8s-node4 → k8s-node3 → k8s-node2 → k8s-node1`. Node1 goes last because it hosts GPU + Immich: by the time it is touched, the upgrade has already soaked on the other three workers (ref: post-mortem-2026-03-16, memory id=570).
For each worker `$node`:
1. Re-check halt-on-alert. If anything fires (e.g. `RecentNodeReboot` on the previous worker), wait + retry up to 30 min, then abort.
2. `kubectl drain $node --ignore-daemonsets --delete-emptydir-data --force --grace-period=300`
3. SSH pipe `update_k8s.sh --role worker --release $target_version`
4. `kubectl uncordon $node`
5. Wait until `$node` Ready + kubeletVersion matches + all calico-node + kube-proxy pods on that node Running.
6. **10-min soak**: poll halt-on-alert every 60s. If anything fires, abort. After 10 min clean, proceed.
7. Slack: `Worker $node complete ($i/4)`.
```bash
WORKERS="k8s-node4 k8s-node3 k8s-node2 k8s-node1"
i=0
for node in $WORKERS; do
i=$((i+1))
# Halt-on-alert recheck with retry
for attempt in $(seq 1 30); do
ALERTS=$(curl -sf 'http://prometheus-server.monitoring.svc.cluster.local:80/api/v1/alerts' \
| jq -r '.data.alerts[] | select(.state == "firing") | .labels.alertname' \
| grep -vE '^(Watchdog|RebootRequired|KuredNodeWasNotDrained|InfoInhibitor)$' \
| sort -u)
[ -z "$ALERTS" ] && break
echo "Waiting for alerts to clear (attempt $attempt/30): $ALERTS"
sleep 60
done
[ -n "$ALERTS" ] && { slack "ABORT $node — alerts firing after 30min wait: $ALERTS"; exit 1; }
if [ "$dry_run" = "false" ]; then
kubectl --kubeconfig $WORKSPACE_DIR/config drain "$node" \
--ignore-daemonsets --delete-emptydir-data --force --grace-period=300
$SSH \
"wizard@$node" 'bash -s' \
< $WORKSPACE_DIR/scripts/update_k8s.sh \
-- --role worker --release "$target_version"
kubectl --kubeconfig $WORKSPACE_DIR/config uncordon "$node"
# Wait Ready + version match. Kept inside the dry_run guard: in a dry run the
# kubelet version cannot match the target, so this check would always abort.
for w in $(seq 1 60); do
STATUS=$(kubectl --kubeconfig $WORKSPACE_DIR/config get node "$node" \
-o jsonpath='{.status.conditions[?(@.type=="Ready")].status}')
KUBELET=$(kubectl --kubeconfig $WORKSPACE_DIR/config get node "$node" \
-o jsonpath='{.status.nodeInfo.kubeletVersion}' | tr -d v)
[ "$STATUS" = "True" ] && [ "$KUBELET" = "$target_version" ] && break
sleep 15
done
[ "$STATUS" = "True" ] && [ "$KUBELET" = "$target_version" ] \
|| { slack "ABORT — $node not Ready or wrong version ($STATUS / $KUBELET)"; exit 1; }
fi
# 10-min soak with halt-on-alert
echo "Soaking $node for 10 min..."
for sec in $(seq 1 10); do
ALERTS=$(curl -sf 'http://prometheus-server.monitoring.svc.cluster.local:80/api/v1/alerts' \
| jq -r '.data.alerts[] | select(.state == "firing") | .labels.alertname' \
| grep -vE '^(Watchdog|RebootRequired|KuredNodeWasNotDrained|InfoInhibitor|RecentNodeReboot)$' \
| sort -u)
[ -n "$ALERTS" ] && { slack "ABORT $node mid-soak — alerts: $ALERTS"; exit 1; }
sleep 60
done
slack "Worker $node upgrade complete ($i/4). Soaked clean."
done
```
Note: during the soak we add `RecentNodeReboot` to the ignore-list because we know we just restarted the kubelet on that node, and a kubelet restart counts as a reboot for that alert.
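The wait loops in Stages 3, 5 and 6 share one shape: poll a condition up to N times with a fixed sleep, then give up. A minimal sketch of that shape as a reusable function; the name `retry_until` is hypothetical and nothing in the prompt depends on it.

```bash
# Hypothetical helper: run a command until it succeeds, at most <attempts>
# times, sleeping <delay> seconds between failed attempts.
# Usage: retry_until <attempts> <delay> <command> [args...]
retry_until() {
  local attempts="$1" delay="$2" n
  shift 2
  for ((n = 1; n <= attempts; n++)); do
    "$@" && return 0
    # No sleep after the final failed attempt.
    [ "$n" -lt "$attempts" ] && sleep "$delay"
  done
  return 1
}
```

For example, `retry_until 60 15 node_on_target "$node"` would reproduce the 15-minute Ready + version wait, where `node_on_target` is an assumed wrapper around the two `kubectl get node` calls shown above.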
## Stage 7: Post-flight (`stages` includes `postflight`)
```bash
# All 5 nodes at target
VERSIONS=$(kubectl --kubeconfig $WORKSPACE_DIR/config get nodes \
-o jsonpath='{range .items[*]}{.metadata.name}:{.status.nodeInfo.kubeletVersion}{"\n"}{end}')
echo "$VERSIONS"
WRONG=$(echo "$VERSIONS" | grep -v ":v${target_version}$" | wc -l)
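# In a dry run nothing was upgraded, so a mismatch is expected and must not abort.
if [ "$WRONG" -ne 0 ] && [ "$dry_run" = "false" ]; then
slack "ABORT post-flight — $WRONG node(s) not on v$target_version: $(echo "$VERSIONS" | tr '\n' ' ')"
exit 1
fi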
# Upgrade Gates all inactive
FIRING=$(curl -sf 'http://prometheus-server.monitoring.svc.cluster.local:80/api/v1/alerts' \
| jq -r '.data.alerts[] | select(.state == "firing") | .labels.alertname' \
| grep -vE '^(Watchdog|RebootRequired|KuredNodeWasNotDrained|InfoInhibitor)$' \
| sort -u)
[ -n "$FIRING" ] && slack "Post-flight WARN — alerts still firing (cluster on target, but check): $(echo "$FIRING" | tr '\n' ' ')"
# pod-ready ratio >= 0.9
RATIO=$(curl -sf 'http://prometheus-server.monitoring.svc.cluster.local:80/api/v1/query' \
--data-urlencode 'query=sum(kube_pod_status_ready{condition="true"}) / sum(kube_pod_status_phase{phase="Running"})' \
| jq -r '.data.result[0].value[1] // "0"')
slack "Pod-ready ratio: $RATIO (target ≥ 0.9)"
# Clear the in-flight annotation + Pushgateway gauges
if [ "$dry_run" = "false" ]; then
kubectl --kubeconfig $WORKSPACE_DIR/config annotate ns k8s-upgrade \
viktorbarzin.me/k8s-upgrade-in-flight- \
viktorbarzin.me/k8s-upgrade-target- \
viktorbarzin.me/k8s-upgrade-snapshot-path- || true
push_metric k8s_upgrade_in_flight 0
push_metric k8s_upgrade_snapshot_taken 0
slack ":white_check_mark: K8s upgrade complete: cluster on v$target_version."
else
slack "Dry run complete for v$target_version: no mutations performed."
fi
```
## Rollback
This agent does NOT auto-rollback. If anything aborts mid-flight:
1. Slack the failure with the last known stage + node.
2. Leave the in-flight annotation in place (the operator clears it manually after triage).
3. Operator follows `infra/docs/runbooks/k8s-version-upgrade.md` → "Rollback paths" section.
The etcd snapshot path is annotated on the `k8s-upgrade` namespace for easy recovery.
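The abort behaviour described above can be sketched as a single exit path. The function name `abort_upgrade` and its argument order are hypothetical; it assumes the `slack` helper from the "Slack + Pushgateway helpers" section may or may not be defined, and falls back to stderr in either failure case.

```bash
# Hypothetical helper: report an abort and exit non-zero without touching the
# in-flight markers, so the half-done state stays visible to the operator.
# Usage: abort_upgrade <stage> <node-or-dash> <reason>
abort_upgrade() {
  local stage="$1" node="$2" reason="$3"
  local msg="ABORT at stage=$stage node=$node: $reason"
  # Use `slack` when it exists and the post succeeds; otherwise log to stderr.
  if ! { declare -F slack >/dev/null && slack "$msg"; }; then
    echo "[k8s-upgrade] $msg" >&2
  fi
  # Deliberately no annotation removal and no k8s_upgrade_in_flight=0 push.
  exit 1
}
```

A stage would call it as `abort_upgrade 6 "$node" "alerts firing after 30min wait"`.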
## Notes for tests
- **Test 1 (CronJob dry-run)**: The CronJob has its own dry-run env var that short-circuits before the POST. This agent is not invoked.
- **Test 2 (agent dry-run)**: Invoke with `{"dry_run": true}`. Every SSH + kubectl READ runs, every mutation skipped. The agent should print "WOULD: <cmd>" for each skipped mutation.
- **Test 3 (snapshot-only)**: Invoke with `{"stages": "preflight,snapshot"}`. Pre-flight + etcd snapshot only. Slack notification confirms the file exists. No node touched after that.
- **Test 4 (full run)**: `{"target_version": "1.34.7", "kind": "patch"}` once apt has it. Full sequence.
- **Test 5 (synthetic minor)**: `{"target_version": "1.35.0", "kind": "minor", "dry_run": true}`. Confirms the repo-rewrite plan path without mutation.
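Test 2 expects a `WOULD: <cmd>` line for every skipped mutation. A minimal sketch of a wrapper that produces it; the name `mutate` is hypothetical, and it assumes `$dry_run` was set explicitly in Stage 0. Treating an unset value as a dry run is a deliberately conservative choice of this sketch, not something the prompt specifies.

```bash
# Hypothetical wrapper: run a mutating command, or only print it in a dry run.
# Reads the global $dry_run; anything other than "false" counts as a dry run,
# so a malformed value never mutates the cluster.
mutate() {
  if [ "${dry_run:-true}" = "false" ]; then
    "$@"
  else
    # %q quotes each argument so the printed line can be pasted into a shell.
    printf 'WOULD:'
    printf ' %q' "$@"
    printf '\n'
  fi
}
```

Used as `mutate kubectl --kubeconfig "$WORKSPACE_DIR/config" uncordon "$node"`, it would replace the repeated `if [ "$dry_run" = "false" ]` guards around single commands.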
## Edge cases
- **Slack down**: Don't block the upgrade — continue, log to stderr.
- **SSH host key changes**: `accept-new` accepts only on first encounter — if a node was reimaged its host key changes; clear `/tmp/known_hosts` before retry.
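- **kubectl drain hangs on a PDB-violating pod**: `--grace-period=300` bounds each pod's termination, not the eviction retries, and without `--timeout` `kubectl drain` keeps retrying a PDB-blocked eviction indefinitely. If drain fails or hangs, `kubectl drain --disable-eviction --force` is NOT a valid escalation here — slack-abort and let the operator investigate.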
- **etcd snapshot dir missing/full**: stat the dir first. If <10 GiB free, abort.
- **Network blip during apt-get**: the script `set -e`s — apt-get will fail loud, the agent's bash will see non-zero exit, we slack-abort. The node is left mid-upgrade (kubeadm half-applied). Operator follows the runbook.
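The free-space rule for the etcd snapshot directory can be sketched with `df`. The helper name `has_free_gib` is hypothetical; it assumes the POSIX `df -Pk` output layout, where the fourth column of the second line is the available space in KiB.

```bash
# Hypothetical helper: succeed if <dir> exists and has at least <min_gib> GiB free.
# Usage: has_free_gib <dir> <min_gib>
has_free_gib() {
  local dir="$1" min_gib="$2" avail_kib
  [ -d "$dir" ] || return 1
  # -P: POSIX single-line output per filesystem; -k: sizes in 1024-byte blocks.
  avail_kib=$(df -Pk "$dir" | awk 'NR == 2 { print $4 }')
  [ -n "$avail_kib" ] && [ "$avail_kib" -ge $((min_gib * 1024 * 1024)) ]
}
```

Because `/mnt/main/etcd-backup/` is mounted on the master, the agent would have to run the `df` part over `$SSH wizard@k8s-master` rather than locally; the threshold of 10 comes from the edge case above.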
## Verification claims you must make
When you `slack` a SUCCESS message, you must have actually verified:
- All 5 nodes report the target kubelet version via `kubectl get nodes -o jsonpath`
- No alerts firing outside the ignore-list
- pod-ready ratio computed from Prometheus
Do not declare success without those three confirmations.
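The first confirmation can be checked mechanically. A minimal sketch, assuming input in the `name:vX.Y.Z` format that the Stage 7 jsonpath query prints; the helper name `nodes_off_target` is hypothetical. It also fails on empty input, so a failed `kubectl` call cannot pass as "all nodes on target".

```bash
# Hypothetical helper: read "name:vX.Y.Z" lines on stdin and print every node
# that is not on the target. Succeeds only if at least one node was seen and
# none of them is off-target.
# Usage: ... | nodes_off_target <target_version>
nodes_off_target() {
  local target="$1"
  awk -F: -v want="v$target" '
    NF >= 2 { seen++; if ($2 != want) { print $1 " is on " $2; bad++ } }
    END { exit (seen > 0 && bad == 0) ? 0 : 1 }
  '
}
```

For example, `echo "$VERSIONS" | nodes_off_target "$target_version"` prints the offenders and returns non-zero when any node lags behind.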


@ -1,9 +1,10 @@
# Automated Upgrades
This doc covers two independent automation paths:
This doc covers three independent automation paths:
1. **Service-level upgrades** — Container image bumps for OSS apps (DIUN → n8n → claude-agent → Terraform). Most of this doc.
2. **OS-level upgrades on K8s nodes**: `unattended-upgrades` + `kured` with sentinel-gate + Prometheus halt-on-alert. See "K8s Node OS Upgrades" section near the end and the runbook at `docs/runbooks/k8s-node-auto-upgrades.md`.
2. **OS-level upgrades on K8s nodes**: `unattended-upgrades` + `kured` with sentinel-gate + Prometheus halt-on-alert. See "K8s Node OS Upgrades" section and the runbook at `docs/runbooks/k8s-node-auto-upgrades.md`.
3. **K8s component version upgrades** (kubeadm/kubelet/kubectl) — weekly detection CronJob → claude-agent-service → `k8s-version-upgrade` agent. See "K8s Version Upgrades" section and the runbook at `docs/runbooks/k8s-version-upgrade.md`.
## Overview
@ -242,3 +243,77 @@ The 26h cluster outage on 2026-03-16 was triggered by an unattended-upgrades ker
### Operational reference
See `docs/runbooks/k8s-node-auto-upgrades.md` for: verifying health, halting rollout, restoring config to a re-imaged node, rolling back a bad upgrade, and the past-incident timeline.
## K8s Version Upgrades
Independent of the OS-upgrade and service-upgrade pipelines. Drives
kubeadm/kubelet/kubectl bumps (patch + minor) on all 5 K8s VMs.
### Architecture
```
k8s-version-check CronJob (Sun 12:00 UTC, k8s-upgrade ns)
│ probe apt-cache madison kubeadm (master) → latest available patch
│ probe HEAD https://pkgs.k8s.io/.../v<NEXT_MINOR>/deb/Release → next minor?
│ push k8s_upgrade_available metric to Pushgateway
▼ if running != latest
POST claude-agent-service /execute with target_version + kind
k8s-version-upgrade agent (in claude-agent-service pod)
├── pre-flight (5 nodes Ready, halt-on-alert, 24h-quiet, kubeadm plan match)
├── etcd snapshot save → /mnt/main/etcd-backup/k8s-upgrade-pre-X.Y.Z-EPOCH.db
├── master containerd bump (only if master version < workers')
├── apt repo URL rewrite to v<NEW_MINOR>/deb on all 5 nodes (kind=minor only)
├── drain master → ssh < update_k8s.sh --role master → uncordon → verify
├── for each worker (k8s-node4 → 3 → 2 → 1):
│ halt-on-alert wait → drain → ssh < update_k8s.sh --role worker → uncordon → 10-min soak
└── post-flight (all nodes match target, alerts clean, pod-ready ratio ≥ 0.9)
```
### Components
- **Detection CronJob**: `infra/stacks/k8s-version-upgrade/main.tf`. Image is the claude-agent-service image (alpine + kubectl + ssh-client + curl + jq). SA has cluster-read on nodes + ns-scoped get on `k8s-upgrade-creds` Secret.
- **Agent prompt**: `infra/.claude/agents/k8s-version-upgrade.md`. Inputs: `target_version`, `kind=patch|minor`, `dry_run`, `stages`. Tools: Bash, Read, Write, Edit, Grep, Glob.
- **Library node script**: `infra/scripts/update_k8s.sh`. Caller passes `--role master|worker --release X.Y.Z`. The agent pipes this via SSH onto each node.
- **Two new Upgrade Gates alerts** (added in this work):
- `K8sVersionSkew` — kubelet/apiserver gitVersion count >1 for 30m. Catches a half-done rollout.
- `EtcdPreUpgradeSnapshotMissing`: `k8s_upgrade_in_flight==1 && k8s_upgrade_snapshot_taken==0` for 10m. Catches Stage 2 failing silently.
- **Pushgateway metrics**:
- `k8s_upgrade_in_flight` / `k8s_upgrade_snapshot_taken` (pushed by agent)
- `k8s_upgrade_available{kind,running,target}` (pushed by detection CronJob)
- `k8s_version_check_last_run_timestamp` (staleness watchdog)
### Source of truth
| Concern | Location |
|---|---|
| Detection CronJob, RBAC, ExternalSecret, Vault role | `stacks/k8s-version-upgrade/main.tf` |
| Agent orchestration | `.claude/agents/k8s-version-upgrade.md` |
| Library node script | `scripts/update_k8s.sh` |
| Alerts | `stacks/monitoring/modules/monitoring/prometheus_chart_values.tpl` (group "Upgrade Gates") |
| Vault secrets | `secret/k8s-upgrade/{ssh_key, ssh_key_pub, slack_webhook}` |
### Why this design
The cluster has a single control plane (no HA). A failed `kubeadm upgrade apply` is an outage. Mitigations:
- **Mandatory etcd snapshot before every run** (even patch). Recovery point if master breaks.
- **Halt-on-alert before every drain**. Reuses the same Prometheus ignore-list regex kured uses — any unrelated cluster-health alert blocks. Two new gate alerts catch upgrade-specific half-states (version skew, missing snapshot).
- **Sequential workers with 10-min inter-node soak**. Same risk-bounding as the 24h OS-reboot soak, but tightened because kubelet failures surface within minutes — not hours.
- **Master upgrade goes first, workers last**. If master breaks, the cluster is already degraded so further worker upgrades would just delay recovery. By upgrading master first, we either succeed (workers can roll afterward) or fail loud (operator triages before any worker is touched).
- **No auto-rollback**. kubeadm doesn't support clean downgrade; the snapshot + manual apt rollback in the runbook is the recovery path.
### Secrets
| Secret | Vault Path | Purpose |
|--------|-----------|---------|
| SSH private key | `secret/k8s-upgrade.ssh_key` | Agent + detection CronJob SSH to all 5 nodes (user `wizard`) |
| SSH public key | `secret/k8s-upgrade.ssh_key_pub` | Deployed to nodes' `~/.ssh/authorized_keys` |
| Slack webhook | `secret/k8s-upgrade.slack_webhook` | Pipeline notifications (separate channel from kured) |
| Agent service bearer | `secret/claude-agent-service.api_bearer_token` (reused) | Detection CronJob POSTs to `/execute` |
### Operational reference
See `docs/runbooks/k8s-version-upgrade.md` for: verifying health, manually triggering detection or the agent, rollback paths (master / worker / mid-flight abort), and SSH key rotation.


@ -0,0 +1,238 @@
# K8s Version Upgrade Pipeline
## Overview
Kubernetes component versions (`kubeadm`/`kubelet`/`kubectl`) on the 5 K8s
VMs are upgraded automatically by a weekly detection CronJob that fires the
`k8s-version-upgrade` agent through `claude-agent-service`. The agent walks
the cluster through pre-flight → etcd snapshot → optional master containerd
skew fix → optional apt repo URL rewrite (minor only) → master kubeadm
upgrade → workers rolled sequentially → post-flight, with Slack notification
at every transition and Prometheus halt-on-alert gating before every drain.
This is **independent** of the OS-side `unattended-upgrades + kured`
pipeline (see `k8s-node-auto-upgrades.md`). They do not share rollouts and
their schedules don't overlap (kured runs Mon-Fri 02:00-06:00 London;
detection here runs Sun 12:00 UTC).
## Architecture
```
k8s-version-check CronJob (Sun 12:00 UTC)
│ kubectl get nodes → running version
│ ssh master 'apt-cache madison kubeadm' → latest patch (within current minor)
│ HEAD pkgs.k8s.io/.../v<NEXT_MINOR>/deb/Release → next minor available?
▼ if running != latest_patch OR next minor available
POST claude-agent-service /execute
{ prompt: "Run k8s-version-upgrade agent. Inputs: {target_version, kind, dry_run, stages}" }
k8s-version-upgrade agent (inside claude-agent-service pod)
├── Stage 0: parse inputs, mark in-flight annotation + Pushgateway gauge
├── Stage 1: pre-flight (5 nodes Ready + halt-on-alert + 24h-quiet + plan target match)
├── Stage 2: etcd snapshot save → /mnt/main/etcd-backup/k8s-upgrade-pre-X.Y.Z-EPOCH.db
│ push k8s_upgrade_snapshot_taken=1
├── Stage 3: master containerd bump (only if master < workers)
├── Stage 4: apt repo URL rewrite to v<NEW_MINOR>/deb (only if kind=minor)
├── Stage 5: drain master → ssh < update_k8s.sh --role master --release X.Y.Z → uncordon → verify
├── Stage 6: each worker k8s-node4 → k8s-node3 → k8s-node2 → k8s-node1:
│ halt-on-alert wait → drain → ssh script --role worker → uncordon → 10-min soak
└── Stage 7: post-flight (all nodes match target, alerts clean, pod-ready ratio ≥ 0.9)
clear in-flight annotation, push k8s_upgrade_in_flight=0
```
## Components
### Detection CronJob (`k8s-version-check`)
- **Stack**: `infra/stacks/k8s-version-upgrade/main.tf`
- **Image**: `forgejo.viktorbarzin.me/viktor/claude-agent-service` (ships kubectl, ssh-client, curl, jq)
- **Schedule**: `0 12 * * 0` (Sunday 12:00 UTC). Outside kured window.
- **SA**: `k8s-version-check` (cluster-read nodes, ns-scoped get on `k8s-upgrade-creds` Secret)
- **Pushgateway metrics**:
- `k8s_upgrade_available{kind, running, target}` — 1 when a target is detected
- `k8s_version_check_last_run_timestamp` — staleness watchdog
### Agent (`k8s-version-upgrade`)
- **Prompt**: `infra/.claude/agents/k8s-version-upgrade.md`
- **Runtime**: claude-agent-service pod (claude-agent ns)
- **Inputs** (JSON in prompt): `target_version`, `kind` (patch|minor), `dry_run`, `stages`
- **Library script**: `infra/scripts/update_k8s.sh` (run on each node via SSH pipe — `ssh ... 'bash -s' < update_k8s.sh -- --role master|worker --release X.Y.Z`)
### Upgrade Gates alerts (additions for this pipeline)
- **`K8sVersionSkew`** — distinct kubelet/apiserver `gitVersion` count >1 for 30m. Catches a half-done rollout where some nodes are upgraded and some aren't.
- **`EtcdPreUpgradeSnapshotMissing`** — `k8s_upgrade_in_flight==1 && k8s_upgrade_snapshot_taken==0` for 10m. Catches Stage 2 failing silently.
- Both join the existing 10 Upgrade Gates alerts (KubeAPIServerDown, RecentNodeReboot, etc.) — kured ALSO blocks rolling reboots whenever any of these are firing.
### Vault secrets
- `secret/k8s-upgrade/ssh_key` — ed25519 PRIVATE key, used by detection CronJob + agent to SSH into all 5 nodes (user `wizard`)
- `secret/k8s-upgrade/ssh_key_pub` — matching PUBLIC key, deployed to `/home/wizard/.ssh/authorized_keys` on every node
- `secret/k8s-upgrade/slack_webhook` — Slack incoming-webhook URL (separate channel from kured for clean alerting)
`ssh_key` and `slack_webhook` are exposed in K8s via the ExternalSecret `k8s-upgrade-creds` in the `k8s-upgrade` namespace.
## Common Operations
### Verify the pipeline is healthy
```bash
# CronJob present + not suspended
kubectl -n k8s-upgrade get cronjob k8s-version-check
# Latest run output
kubectl -n k8s-upgrade get jobs -l app=k8s-version-check
kubectl -n k8s-upgrade logs -l app=k8s-version-check --tail=200
# Pushgateway metric — fresh discovery?
curl -s http://prometheus-prometheus-pushgateway.monitoring:9091/metrics | \
grep -E '^(k8s_upgrade_available|k8s_version_check_last_run_timestamp)'
# Upgrade Gates rules loaded
kubectl -n monitoring exec deploy/prometheus-server -c prometheus-server -- \
wget -q -O- 'http://localhost:9090/api/v1/rules' | \
jq -r '.data.groups[] | select(.name == "Upgrade Gates") | .rules[] | " \(.name): \(.state)"'
```
### Manually trigger a detection run (no upgrade)
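Run the detection logic as a one-shot Job with `DRY_RUN=true` so it
short-circuits before the POST to claude-agent-service. A plain
`create job --from=cronjob/...` inherits `DRY_RUN` from the CronJob spec
(`detection_dry_run`, default `false`) and would dispatch a real upgrade, so
the override below is required:
```bash
# One-shot job from the cron, with DRY_RUN forced to "true" for this run only:
kubectl -n k8s-upgrade create job --from=cronjob/k8s-version-check version-check-test \
  --dry-run=client -o json \
  | jq '(.spec.template.spec.containers[0].env[] | select(.name == "DRY_RUN") | .value) = "true"' \
  | kubectl -n k8s-upgrade create -f -
kubectl -n k8s-upgrade logs -l job-name=version-check-test -f
```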
To make `detection_dry_run` permanent (e.g. while debugging),
toggle the var in `stacks/k8s-version-upgrade/main.tf` and `scripts/tg apply`.
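The patch-selection step of the detection job (highest `kubeadm` version within the running minor track) can be tried offline with a sketch like the one below. `latest_patch` is a hypothetical helper name, the input is assumed to be `apt-cache madison`-style lines, and `sort -V` is assumed to be available (GNU coreutils).

```shell
#!/usr/bin/env bash
# Pick the highest patch version within a minor track from madison-style
# lines ("kubeadm | 1.34.5-1.1 | <repo>") read on stdin.
latest_patch() {
  local minor="$1"
  awk '{print $3}' \
    | sed 's/-.*//' \
    | awk -v prefix="${minor}." 'index($0, prefix) == 1' \
    | sort -V \
    | tail -n 1
}
```

Version sort matters here: a plain lexical sort would rank `1.34.9` above `1.34.10`. An empty result means no package in that track was found.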
### Manually dispatch the agent (skip detection)
Useful when you want to force a run on a specific version without waiting for
Sunday, or when testing.
```bash
TOKEN=$(vault kv get -field=api_bearer_token secret/claude-agent-service)
# Dry-run (no mutations)
curl -X POST http://claude-agent-service.claude-agent.svc.cluster.local:8080/execute \
-H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
-d '{
"prompt": "Run the k8s-version-upgrade agent. Inputs: {\"target_version\":\"1.34.5\",\"kind\":\"patch\",\"dry_run\":true,\"stages\":\"all\"}",
"agent": ".claude/agents/k8s-version-upgrade",
"max_budget_usd": 5
}'
# Snapshot-only (Test 3 in the plan)
curl -X POST ... -d '{
"prompt": "Run the k8s-version-upgrade agent. Inputs: {\"target_version\":\"1.34.5\",\"kind\":\"patch\",\"dry_run\":false,\"stages\":\"preflight,snapshot\"}",
...
}'
# Real run
curl -X POST ... -d '{
"prompt": "... Inputs: {\"target_version\":\"1.34.5\",\"kind\":\"patch\",\"dry_run\":false,\"stages\":\"all\"}",
...
}'
```
Poll job status:
```bash
curl -s -H "Authorization: Bearer $TOKEN" \
http://claude-agent-service.claude-agent.svc.cluster.local:8080/jobs/$JOB_ID | jq .
```
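To wait for a dispatched job rather than polling by hand, a small retry loop is enough. The sketch below is generic; the `job_done` check assumes the `/jobs/<id>` response carries a `.status` field with a `completed` value, which this runbook does not confirm, so adjust it to the real response shape.

```shell
#!/usr/bin/env bash
# Generic polling helper: run a check command up to <attempts> times,
# sleeping <delay> seconds after each failed try. Returns 0 on first success.
poll_until() {
  local attempts="$1" delay="$2"
  shift 2
  local i
  for (( i = 1; i <= attempts; i++ )); do
    if "$@"; then
      return 0
    fi
    sleep "$delay"
  done
  return 1
}

# Hypothetical check: the ".status" field and the "completed" value are
# assumptions about the /jobs/<id> response, not confirmed by this runbook.
job_done() {
  curl -s -H "Authorization: Bearer $TOKEN" \
    "http://claude-agent-service.claude-agent.svc.cluster.local:8080/jobs/$JOB_ID" \
    | jq -e '.status == "completed"' > /dev/null
}
```

For example, `poll_until 60 30 job_done` checks every 30 seconds for up to 30 minutes and returns non-zero if the job never reports completion.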
### Halt the pipeline in an emergency
The pipeline is gated by Prometheus alerts — any firing Upgrade Gates alert
blocks the next drain. To explicitly halt:
```bash
# Option 1: suspend the detection CronJob (won't stop an in-flight agent run)
kubectl -n k8s-upgrade patch cronjob k8s-version-check \
-p '{"spec":{"suspend":true}}' --type=merge
# Re-enable: --type=merge -p '{"spec":{"suspend":false}}'
# Option 2: kill an in-flight agent job
TOKEN=$(vault kv get -field=api_bearer_token secret/claude-agent-service)
JOB_ID=$(curl -s -H "Authorization: Bearer $TOKEN" \
http://claude-agent-service.claude-agent.svc.cluster.local:8080/jobs | \
jq -r '.[] | select(.agent | test("k8s-version-upgrade")) | .id' | head -1)
curl -X DELETE -H "Authorization: Bearer $TOKEN" \
http://claude-agent-service.claude-agent.svc.cluster.local:8080/jobs/$JOB_ID
# Option 3: force a blocker alert (Upgrade Gates expression that always fires)
# — see infra/docs/runbooks/k8s-node-auto-upgrades.md "Force halt by adding a custom blocker alert"
```
### Rollback paths
`kubeadm` does **not** support in-place downgrade. If a run fails:
#### Master broke during/after kubeadm upgrade
1. Identify the etcd snapshot: `kubectl get ns k8s-upgrade -o jsonpath='{.metadata.annotations.viktorbarzin\.me/k8s-upgrade-snapshot-path}'`
2. Restore etcd per `infra/docs/runbooks/restore-etcd.md`.
3. Manually downgrade master `kubeadm`/`kubelet`/`kubectl` to the pre-upgrade version. Find versions in `/var/log/apt/history.log` on the node:
```bash
# Pre-upgrade versions are in the most recent "Commandline: apt-get install" entry
ssh wizard@k8s-master 'sudo tail -40 /var/log/apt/history.log'
# Then, in a shell on the node (replace <OLD> with the pre-upgrade version):
sudo apt-mark unhold kubeadm kubelet kubectl
sudo apt-get install --allow-downgrades -y \
kubeadm=<OLD>-1.1 kubelet=<OLD>-1.1 kubectl=<OLD>-1.1
sudo apt-mark hold kubeadm kubelet kubectl
sudo systemctl daemon-reload && sudo systemctl restart kubelet
```
#### Worker broke
1. `kubectl drain <node> --ignore-daemonsets --delete-emptydir-data --force --grace-period=300`
2. Downgrade apt packages on that node only (see above)
3. `kubectl uncordon <node>`
4. The cluster continues running on the master + remaining workers throughout
#### Pipeline aborts mid-flight (halt-on-alert blocks >30 min)
- The agent posts a Slack message with the blocking alert list and exits non-zero
- The in-flight annotation on `ns/k8s-upgrade` stays set → `EtcdPreUpgradeSnapshotMissing` may fire if Stage 2 didn't complete
- Operator: triage the blocker, clear the alert, re-dispatch the agent manually (see "Manually dispatch the agent")
- After successful retry: the agent's Stage 7 clears the annotation. If you decide NOT to retry, clear by hand:
```bash
kubectl annotate ns k8s-upgrade \
viktorbarzin.me/k8s-upgrade-in-flight- \
viktorbarzin.me/k8s-upgrade-target- \
viktorbarzin.me/k8s-upgrade-snapshot-path-
# Also reset the Pushgateway gauge so the alert clears:
printf '# TYPE k8s_upgrade_in_flight gauge\nk8s_upgrade_in_flight 0\n' | \
curl --data-binary @- http://prometheus-prometheus-pushgateway.monitoring:9091/metrics/job/k8s-version-upgrade
```
### One-shot SSH key rotation
1. Generate new keypair: `ssh-keygen -t ed25519 -f /tmp/k8s-upgrade -N ""`
2. Update Vault:
```bash
vault kv patch secret/k8s-upgrade \
ssh_key=@/tmp/k8s-upgrade \
ssh_key_pub=@/tmp/k8s-upgrade.pub
```
3. Push the new pubkey to every node:
```bash
for n in k8s-master k8s-node1 k8s-node2 k8s-node3 k8s-node4; do
  # Remove the old upgrade key (lines tagged "k8s-upgrade-key"), then append the new one
  ssh wizard@$n 'sed -i "/k8s-upgrade-key$/d" ~/.ssh/authorized_keys'
  # Double quotes so $(cat ...) expands locally, where the new pubkey file lives
  ssh wizard@$n "echo '$(cat /tmp/k8s-upgrade.pub) k8s-upgrade-key' >> ~/.ssh/authorized_keys"
done
```
4. ESO refreshes the K8s Secret within 15 min — or force: `kubectl -n k8s-upgrade annotate externalsecret k8s-upgrade-creds force-sync=$(date +%s) --overwrite`
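The remove-then-append step can also be expressed as one idempotent function, which makes it easy to rehearse against a scratch file before touching a node. This is a sketch: `replace_tagged_key` is a hypothetical helper, and it assumes the upgrade key line ends with the tag, as in step 3.

```shell
#!/usr/bin/env bash
# Replace any line ending in " <tag>" with a single new key line.
# Arguments: authorized_keys path, public key text, tag.
replace_tagged_key() {
  local file="$1" pubkey="$2" tag="$3"
  local tmp
  tmp="$(mktemp)"
  # Keep every line that does NOT end with the tag. grep exits 1 when nothing
  # is kept (e.g. the file held only the old key), which is fine here.
  grep -v -- " ${tag}\$" "$file" > "$tmp" || true
  printf '%s %s\n' "$pubkey" "$tag" >> "$tmp"
  cat "$tmp" > "$file"   # overwrite in place to preserve the file's permissions
  rm -f "$tmp"
}
```

Running it twice with the same key leaves exactly one tagged line, so a retried rotation does not accumulate duplicates.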
## Past Incidents
- (none yet — pipeline went live 2026-05-10)
- Pre-pipeline manual upgrades are documented in commit history; the script used for those manual runs is preserved (refactored) as `infra/scripts/update_k8s.sh`, which is what the agent pipes into each node over SSH.
## File Pointers
| What | Where |
|------|-------|
| Detection CronJob + RBAC + ExternalSecret | `infra/stacks/k8s-version-upgrade/main.tf` |
| Agent prompt | `infra/.claude/agents/k8s-version-upgrade.md` |
| Library node script | `infra/scripts/update_k8s.sh` |
| Upgrade Gates alerts (incl. K8sVersionSkew + EtcdPreUpgradeSnapshotMissing) | `infra/stacks/monitoring/modules/monitoring/prometheus_chart_values.tpl` |
| Vault secrets | `secret/k8s-upgrade/{ssh_key, ssh_key_pub, slack_webhook}` |
| Architecture doc | `infra/docs/architecture/automated-upgrades.md` — "K8s Version Upgrades" section |
| Related (OS reboots) | `infra/docs/runbooks/k8s-node-auto-upgrades.md` |


@ -1,36 +1,98 @@
#!/usr/bin/env bash
#
# K8s component upgrader. Run on a single node (master OR worker) at a time.
# The caller is responsible for:
# - draining + uncordoning the node (this script does not touch kubectl)
# - sequencing nodes (master first, then workers one at a time)
# - pre-flight checks (etcd snapshot, halt-on-alert, etc)
#
# Used by:
# - the k8s-version-upgrade agent (infra/.claude/agents/k8s-version-upgrade.md)
# - manual operators following the runbook (infra/docs/runbooks/k8s-version-upgrade.md)
#
# Old manual orchestration loop (kept for reference — the agent does the
# equivalent now):
# for n in $(kbn | grep 'k8s-node' | awk '{print $1}'); do
# kb drain $n --ignore-daemonsets --delete-emptydir-data
# s wizard@$n 'bash -s' < update_k8s.sh -- --role worker --release 1.34.5
# kb uncordon $n
# done
# run for all nodes using :
# for n in $(kbn | grep 'k8s-node' | awk '{print $1}'); do echo $n; kb drain $n --ignore-daemonsets --delete-emptydir-data; s wizard@$n 'bash -s' <update_k8s.sh; kb uncordon $n; done
set -euo pipefail
set -e
export stable_version='1.34' # change me
export release="$stable_version.2" # change me
ROLE=""
RELEASE=""
echo "Upgrading to $stable_version"
usage() {
cat <<EOF
Usage: $0 --role <master|worker> --release <X.Y.Z>
echo "deb [signed-by=/etc/apt/keyrings/kubernetes-apt-keyring.gpg] https://pkgs.k8s.io/core:/stable:/v$stable_version/deb/ /" | sudo tee /etc/apt/sources.list.d/kubernetes.list
sudo mkdir -p /etc/apt/keyrings
curl -fsSL "https://pkgs.k8s.io/core:/stable:/v$stable_version/deb/Release.key" | sudo gpg --dearmor -o /etc/apt/keyrings/kubernetes-apt-keyring.gpg --batch --yes
--role master|worker (required)
--release kubeadm/kubelet/kubectl target patch version, e.g. 1.34.5
sudo apt-mark unhold kubeadm kubelet kubectl
sudo apt-get update
sudo apt-get install -y kubeadm="$release-*"
Behavior:
- Rewrites /etc/apt/sources.list.d/kubernetes.list to the v\$MINOR/deb repo
derived from --release (so a 1.34.x release uses v1.34/deb, 1.35.x uses
v1.35/deb, etc).
- apt-get install kubeadm=<release>-* (apt-mark unhold first).
- master: kubeadm upgrade plan && kubeadm upgrade apply v<release> -y
- worker: kubeadm upgrade node
- apt-get install kubelet=<release>-* kubectl=<release>-* then re-hold.
- systemctl daemon-reload && systemctl restart kubelet
EOF
}
HOSTNAME=$(hostname)
SEARCH_STR="master"
while [[ $# -gt 0 ]]; do
case "$1" in
--role) ROLE="$2"; shift 2;;
--release) RELEASE="$2"; shift 2;;
-h|--help) usage; exit 0;;
*) echo "Unknown arg: $1" >&2; usage; exit 2;;
esac
done
if [[ "$HOSTNAME" == *"$SEARCH_STR"* ]]; then
echo "Upgrading master"
sudo kubeadm upgrade plan && sudo kubeadm upgrade apply v$release -y
else
echo "Upgrading worker"
sudo kubeadm upgrade node
if [[ -z "$ROLE" || -z "$RELEASE" ]]; then
echo "ERROR: --role and --release are required" >&2
usage
exit 2
fi
sudo apt-get install -y kubelet="$release-*" kubectl="$release-*"
sudo apt-mark hold kubeadm kubelet kubectl
if [[ "$ROLE" != "master" && "$ROLE" != "worker" ]]; then
echo "ERROR: --role must be 'master' or 'worker' (got: $ROLE)" >&2
exit 2
fi
# Derive minor track (e.g. 1.34.5 → 1.34)
STABLE_VERSION="$(echo "$RELEASE" | awk -F. '{print $1"."$2}')"
echo "==> Upgrading $(hostname) ($ROLE) to v$RELEASE (track v$STABLE_VERSION)"
# Apt repo URL is pinned per minor track. Rewrite + re-import the signing key
# every run — cheap, idempotent, and handles the minor-bump case where the
# old track's repo no longer carries the target version.
echo "deb [signed-by=/etc/apt/keyrings/kubernetes-apt-keyring.gpg] https://pkgs.k8s.io/core:/stable:/v$STABLE_VERSION/deb/ /" \
| sudo tee /etc/apt/sources.list.d/kubernetes.list
sudo mkdir -p /etc/apt/keyrings
curl -fsSL "https://pkgs.k8s.io/core:/stable:/v$STABLE_VERSION/deb/Release.key" \
| sudo gpg --dearmor -o /etc/apt/keyrings/kubernetes-apt-keyring.gpg --batch --yes
sudo apt-mark unhold kubeadm kubelet kubectl
sudo apt-get update
sudo apt-get install -y "kubeadm=$RELEASE-*"
if [[ "$ROLE" == "master" ]]; then
echo "==> Master path: kubeadm upgrade plan + apply"
sudo kubeadm upgrade plan
sudo kubeadm upgrade apply "v$RELEASE" -y
else
echo "==> Worker path: kubeadm upgrade node"
sudo kubeadm upgrade node
fi
sudo apt-get install -y "kubelet=$RELEASE-*" "kubectl=$RELEASE-*"
sudo apt-mark hold kubeadm kubelet kubectl
sudo systemctl daemon-reload
sudo systemctl restart kubelet
echo "==> Done: $(hostname) is on v$RELEASE"


@ -1,8 +1,14 @@
#!/usr/bin/env bash
#
# OS-major upgrade (Ubuntu do-release-upgrade). NOT in the auto-upgrade
# pipeline — minor apt patches are handled by unattended-upgrades + kured;
# K8s component bumps are handled by the k8s-version-upgrade agent. Run this
# script manually when bumping Ubuntu LTS major versions.
#
# See:
# - infra/docs/runbooks/k8s-node-auto-upgrades.md (apt + reboot)
# - infra/docs/runbooks/k8s-version-upgrade.md (kubeadm/kubelet/kubectl)
# sudo apt update && sudo apt autoremove -y && sudo apt upgrade -y
sudo do-release-upgrade
sudo apt update && sudo apt autoremove -y && sudo apt upgrade -y


@ -0,0 +1,456 @@
# k8s-version-upgrade: automated K8s component (kubeadm/kubelet/kubectl) upgrade
#
# Detects new patch/minor versions via a weekly CronJob, then dispatches the
# `k8s-version-upgrade` agent (infra/.claude/agents/k8s-version-upgrade.md)
# through claude-agent-service for the actual rolling upgrade.
#
# Reuse points:
# - claude-agent-service.claude-agent.svc:8080 (agent job runner)
# - Vault secret/k8s-upgrade/* (operator populates ssh_key + slack_webhook)
# - Prometheus + Pushgateway + Upgrade Gates alert group (in monitoring stack)
# - update_k8s.sh (library script the agent shells into nodes with)
#
# Notes:
# - Schedule is Sun 12:00 UTC, well outside the kured Mon-Fri 02:00-06:00
# London window, so OS reboots and K8s version rollouts can't overlap.
# - Patch detection uses `apt-cache madison kubeadm` on master via SSH.
# Minor detection probes the next-minor apt repo URL with HEAD.
variable "schedule" {
type = string
default = "0 12 * * 0" # Sunday 12:00 UTC
}
# Toggle to suspend the detection CronJob without dropping the stack.
variable "enabled" {
type = bool
default = true
}
# Mirrors `local.image_tag` in stacks/claude-agent-service/main.tf; keep in
# sync when the claude-agent-service image is rebuilt. Reused here because the
# detection CronJob only needs kubectl, ssh-client, curl, jq, all of which
# the claude-agent-service image already ships.
variable "claude_agent_service_image_tag" {
type = string
default = "2fd7670d"
}
# If true, the CronJob runs the detection sequence but does NOT POST to
# claude-agent-service. Used for Test 1 to confirm detection works without
# firing a real upgrade.
variable "detection_dry_run" {
type = bool
default = false
}
locals {
namespace = "k8s-upgrade"
ca_image = "forgejo.viktorbarzin.me/viktor/claude-agent-service:${var.claude_agent_service_image_tag}"
labels = {
app = "k8s-version-check"
}
}
# --- Namespace ---
resource "kubernetes_namespace" "k8s_upgrade" {
metadata {
name = local.namespace
labels = {
tier = local.tiers.cluster
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: goldilocks-vpa-auto-mode ClusterPolicy stamps this label on every namespace
ignore_changes = [metadata[0].labels["goldilocks.fairwinds.com/vpa-update-mode"]]
}
}
# --- ExternalSecret: ssh_key + slack_webhook + agent-service bearer ---
#
# Operator populates Vault `secret/k8s-upgrade/` with:
# - ssh_key (PEM-encoded ed25519 private key)
# - ssh_key_pub (the matching public key distributed to nodes' authorized_keys)
# - slack_webhook (Slack incoming-webhook URL, separate channel from kured for clean alerting)
#
# The claude-agent-service bearer token comes from secret/claude-agent-service
# (reused; no parallel token needed).
resource "kubernetes_manifest" "external_secret" {
manifest = {
apiVersion = "external-secrets.io/v1beta1"
kind = "ExternalSecret"
metadata = {
name = "k8s-upgrade-creds"
namespace = kubernetes_namespace.k8s_upgrade.metadata[0].name
}
spec = {
refreshInterval = "15m"
secretStoreRef = {
name = "vault-kv"
kind = "ClusterSecretStore"
}
target = {
name = "k8s-upgrade-creds"
}
data = [
{
secretKey = "ssh_key"
remoteRef = {
key = "k8s-upgrade"
property = "ssh_key"
}
},
{
secretKey = "slack_webhook"
remoteRef = {
key = "k8s-upgrade"
property = "slack_webhook"
}
},
{
secretKey = "api_bearer_token"
remoteRef = {
key = "claude-agent-service"
property = "api_bearer_token"
}
},
]
}
}
}
# --- ServiceAccount + RBAC for the detection CronJob ---
resource "kubernetes_service_account" "k8s_version_check" {
metadata {
name = "k8s-version-check"
namespace = kubernetes_namespace.k8s_upgrade.metadata[0].name
}
}
# Cluster-wide read on nodes (for kubeletVersion comparison)
resource "kubernetes_cluster_role" "k8s_version_check" {
metadata {
name = "k8s-version-check"
}
rule {
api_groups = [""]
resources = ["nodes"]
verbs = ["get", "list"]
}
}
resource "kubernetes_cluster_role_binding" "k8s_version_check" {
metadata {
name = "k8s-version-check"
}
role_ref {
api_group = "rbac.authorization.k8s.io"
kind = "ClusterRole"
name = kubernetes_cluster_role.k8s_version_check.metadata[0].name
}
subject {
kind = "ServiceAccount"
name = kubernetes_service_account.k8s_version_check.metadata[0].name
namespace = kubernetes_namespace.k8s_upgrade.metadata[0].name
}
}
# Namespace-scoped: detection CronJob reads its own creds Secret.
resource "kubernetes_role" "k8s_version_check_secrets" {
metadata {
name = "k8s-version-check-secrets"
namespace = kubernetes_namespace.k8s_upgrade.metadata[0].name
}
rule {
api_groups = [""]
resources = ["secrets"]
resource_names = ["k8s-upgrade-creds"]
verbs = ["get"]
}
}
resource "kubernetes_role_binding" "k8s_version_check_secrets" {
metadata {
name = "k8s-version-check-secrets"
namespace = kubernetes_namespace.k8s_upgrade.metadata[0].name
}
role_ref {
api_group = "rbac.authorization.k8s.io"
kind = "Role"
name = kubernetes_role.k8s_version_check_secrets.metadata[0].name
}
subject {
kind = "ServiceAccount"
name = kubernetes_service_account.k8s_version_check.metadata[0].name
namespace = kubernetes_namespace.k8s_upgrade.metadata[0].name
}
}
# --- Cross-namespace RBAC: claude-agent SA reads k8s-upgrade-creds + annotates ns ---
#
# The k8s-version-upgrade agent runs inside the claude-agent-service pod (SA
# `claude-agent` in `claude-agent` ns). It needs:
# - GET on this namespace's k8s-upgrade-creds Secret (to fetch ssh_key + slack)
# - PATCH on the k8s-upgrade Namespace annotations (in-flight marker)
resource "kubernetes_role" "claude_agent_reads_creds" {
metadata {
name = "claude-agent-reads-creds"
namespace = kubernetes_namespace.k8s_upgrade.metadata[0].name
}
rule {
api_groups = [""]
resources = ["secrets"]
resource_names = ["k8s-upgrade-creds"]
verbs = ["get"]
}
}
resource "kubernetes_role_binding" "claude_agent_reads_creds" {
metadata {
name = "claude-agent-reads-creds"
namespace = kubernetes_namespace.k8s_upgrade.metadata[0].name
}
role_ref {
api_group = "rbac.authorization.k8s.io"
kind = "Role"
name = kubernetes_role.claude_agent_reads_creds.metadata[0].name
}
subject {
kind = "ServiceAccount"
name = "claude-agent"
namespace = "claude-agent"
}
}
# The claude-agent ClusterRole already grants `get,list,watch` on namespaces
# but NOT patch, so we need to extend it here for the annotation write.
# Bound via a separate ClusterRoleBinding so we don't fork the upstream stack.
resource "kubernetes_cluster_role" "claude_agent_annotates_ns" {
metadata {
name = "claude-agent-annotates-k8s-upgrade-ns"
}
rule {
api_groups = [""]
resources = ["namespaces"]
resource_names = ["k8s-upgrade"]
verbs = ["patch", "update"]
}
}
resource "kubernetes_cluster_role_binding" "claude_agent_annotates_ns" {
metadata {
name = "claude-agent-annotates-k8s-upgrade-ns"
}
role_ref {
api_group = "rbac.authorization.k8s.io"
kind = "ClusterRole"
name = kubernetes_cluster_role.claude_agent_annotates_ns.metadata[0].name
}
subject {
kind = "ServiceAccount"
name = "claude-agent"
namespace = "claude-agent"
}
}
# --- Detection CronJob ---
#
# Weekly: compares running cluster version against latest available patch
# (apt-cache madison kubeadm on master) and latest available minor (HEAD on
# next-minor pkgs.k8s.io repo). When a target is detected, POSTs to
# claude-agent-service to kick the upgrade agent.
resource "kubernetes_cron_job_v1" "k8s_version_check" {
metadata {
name = "k8s-version-check"
namespace = kubernetes_namespace.k8s_upgrade.metadata[0].name
labels = local.labels
}
spec {
schedule = var.schedule
concurrency_policy = "Forbid"
successful_jobs_history_limit = 3
failed_jobs_history_limit = 3
starting_deadline_seconds = 600
suspend = !var.enabled
job_template {
metadata {
labels = local.labels
}
spec {
backoff_limit = 0
ttl_seconds_after_finished = 86400
template {
metadata {
labels = local.labels
}
spec {
service_account_name = kubernetes_service_account.k8s_version_check.metadata[0].name
restart_policy = "Never"
image_pull_secrets {
name = "registry-credentials"
}
container {
name = "version-check"
image = local.ca_image
command = ["/bin/bash", "-c", <<-EOT
set -euo pipefail
echo "==> k8s-version-check ($(date -u +%FT%TZ))"
# 1. Load SSH key from K8s Secret
mkdir -p /tmp
/usr/local/bin/kubectl get secret k8s-upgrade-creds \
-o jsonpath='{.data.ssh_key}' | base64 -d > /tmp/k8s-upgrade-ssh-key
chmod 400 /tmp/k8s-upgrade-ssh-key
SLACK=$(/usr/local/bin/kubectl get secret k8s-upgrade-creds \
-o jsonpath='{.data.slack_webhook}' | base64 -d)
AGENT_TOKEN=$(/usr/local/bin/kubectl get secret k8s-upgrade-creds \
-o jsonpath='{.data.api_bearer_token}' | base64 -d)
SSH="ssh -i /tmp/k8s-upgrade-ssh-key \
-o StrictHostKeyChecking=accept-new \
-o UserKnownHostsFile=/tmp/known_hosts"
slack() {
curl -sS -X POST -H 'Content-Type: application/json' \
--data "$(jq -nc --arg t "[k8s-version-check] $1" '{text: $t}')" \
"$SLACK" || true
}
# 2. Detect running version
RUNNING=$(/usr/local/bin/kubectl get nodes \
-o jsonpath='{.items[0].status.nodeInfo.kubeletVersion}' \
| tr -d v)
RUNNING_MINOR=$(echo "$RUNNING" | awk -F. '{print $1"."$2}')
echo "Running version: v$RUNNING (minor $RUNNING_MINOR)"
# 3. Detect highest available patch within the running minor track.
LATEST_PATCH=$($SSH wizard@k8s-master \
"apt-cache madison kubeadm 2>/dev/null \
| awk '{print \$3}' \
| sed 's/-.*//' \
| grep '^$RUNNING_MINOR\\.' \
| sort -V | tail -1" || echo "")
echo "Latest patch (apt): v$LATEST_PATCH"
# 4. Detect next available minor by probing the apt repo URL.
NEXT_MINOR_NUM=$(( $(echo "$RUNNING_MINOR" | cut -d. -f2) + 1 ))
NEXT_MINOR="1.$NEXT_MINOR_NUM"
NEXT_MINOR_AVAILABLE="no"
if curl -sIo /dev/null -w '%%{http_code}' \
"https://pkgs.k8s.io/core:/stable:/v$NEXT_MINOR/deb/Release" \
| grep -q '^200$'; then
NEXT_MINOR_AVAILABLE="yes"
fi
echo "Next minor v$NEXT_MINOR available: $NEXT_MINOR_AVAILABLE"
# 5. Decide what to do
TARGET=""
KIND=""
if [ -n "$LATEST_PATCH" ] && [ "$LATEST_PATCH" != "$RUNNING" ]; then
TARGET="$LATEST_PATCH"
KIND="patch"
elif [ "$NEXT_MINOR_AVAILABLE" = "yes" ]; then
# Probe the minor track to get its latest patch.
NEXT_MINOR_PATCH=$($SSH wizard@k8s-master \
"curl -sf 'https://pkgs.k8s.io/core:/stable:/v$NEXT_MINOR/deb/Packages' \
| grep -oE 'Version: [0-9.-]+' \
| awk '{print \$2}' | sed 's/-.*//' \
| sort -V | tail -1" || echo "")
if [ -n "$NEXT_MINOR_PATCH" ]; then
TARGET="$NEXT_MINOR_PATCH"
KIND="minor"
fi
fi
# 6. Push the discovery metric to Pushgateway
PG='http://prometheus-prometheus-pushgateway.monitoring:9091/metrics/job/k8s-version-check'
{
echo "# TYPE k8s_upgrade_available gauge"
if [ -n "$TARGET" ]; then
echo "k8s_upgrade_available{kind=\"$KIND\",running=\"$RUNNING\",target=\"$TARGET\"} 1"
else
echo "k8s_upgrade_available{kind=\"none\",running=\"$RUNNING\",target=\"$RUNNING\"} 0"
fi
echo "# TYPE k8s_version_check_last_run_timestamp gauge"
echo "k8s_version_check_last_run_timestamp $(date +%s)"
} | curl -sS --data-binary @- "$PG" || echo "warn: pushgateway push failed"
# 7. Decide whether to dispatch
if [ -z "$TARGET" ]; then
echo "No upgrade needed (running=$RUNNING, latest_patch=$LATEST_PATCH, next_minor_available=$NEXT_MINOR_AVAILABLE)"
exit 0
fi
slack "K8s upgrade available: v$RUNNING → v$TARGET ($KIND)"
if [ "$DRY_RUN" = "true" ]; then
echo "DRY_RUN=true — not POSTing to claude-agent-service"
slack "DRY_RUN — skipping agent dispatch"
exit 0
fi
# 8. POST to claude-agent-service
PAYLOAD=$(jq -nc \
--arg target "$TARGET" \
--arg kind "$KIND" \
'{
prompt: ("Run the k8s-version-upgrade agent. Inputs: " + ({target_version: $target, kind: $kind, dry_run: false, stages: "all"} | tostring)),
agent: ".claude/agents/k8s-version-upgrade",
max_budget_usd: 30
}')
echo "Dispatching agent: $PAYLOAD"
RESP=$(curl -sS -w '\n%%{http_code}' -X POST \
-H "Authorization: Bearer $AGENT_TOKEN" \
-H 'Content-Type: application/json' \
-d "$PAYLOAD" \
http://claude-agent-service.claude-agent.svc.cluster.local:8080/execute)
CODE=$(printf '%s' "$RESP" | tail -n1)
BODY=$(printf '%s' "$RESP" | sed '$d')
if [ "$CODE" = "200" ] || [ "$CODE" = "202" ]; then
JOB_ID=$(printf '%s' "$BODY" | jq -r '.job_id // .id // "unknown"')
slack "Agent dispatched: job=$JOB_ID (target=v$TARGET kind=$KIND)"
echo "OK — job=$JOB_ID"
else
slack "ERROR dispatching agent: HTTP $CODE — $BODY"
echo "dispatch failed: HTTP $CODE — $BODY" >&2
exit 1
fi
EOT
]
env {
name = "DRY_RUN"
value = tostring(var.detection_dry_run)
}
env {
name = "HOME"
value = "/tmp"
}
resources {
requests = {
cpu = "50m"
memory = "128Mi"
}
limits = {
memory = "256Mi"
}
}
}
}
}
}
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
ignore_changes = [spec[0].job_template[0].spec[0].template[0].spec[0].dns_config]
}
}


@ -0,0 +1,23 @@
include "root" {
path = find_in_parent_folders()
}
# ExternalSecret hits ESO which needs to be alive when the manifest applies.
dependency "external_secrets" {
config_path = "../external-secrets"
skip_outputs = true
}
# Upgrade Gates rules (incl. K8sVersionSkew + EtcdPreUpgradeSnapshotMissing)
# live in the monitoring stack; make the relationship visible so reapplies
# don't race the alerts being available.
dependency "monitoring" {
config_path = "../monitoring"
skip_outputs = true
}
# Note: stacks/claude-agent-service has no terragrunt.hcl yet (manual apply
# pattern); its ServiceAccount + Namespace are referenced by name from this
# stack's RoleBindings, which is fine because RoleBindings allow forward
# references. Apply order: claude-agent-service first (or already deployed),
# then this stack.


@ -1890,14 +1890,13 @@ serverFiles:
annotations:
summary: "Kubelet/apiserver gitVersion skew detected — possible half-done k8s upgrade. Inspect: kubectl get nodes -o jsonpath='{.items[*].status.nodeInfo.kubeletVersion}'"
# EtcdPreUpgradeSnapshotMissing: the k8s-version-upgrade agent pushes
-# k8s_upgrade_in_flight=1 when it starts, and k8s_upgrade_snapshot_taken=1
-# after the etcdctl snapshot is verified. If we see in_flight=1 with no
-# corresponding snapshot_taken=1 after 10 min, the agent has skipped or
-# failed the snapshot — that's a critical safety hole.
+# `k8s_upgrade_in_flight=1` + `k8s_upgrade_snapshot_taken=0` at Stage 0,
+# then sets snapshot_taken=1 in Stage 2 after etcdctl confirms the
+# snapshot file size. Anywhere in_flight=1 with snapshot_taken=0
+# lasting >10m means the agent skipped or failed Stage 2 — a critical
+# safety hole (no recovery point if master upgrade hangs).
 - alert: EtcdPreUpgradeSnapshotMissing
-expr: |
-k8s_upgrade_in_flight == 1
-unless on() k8s_upgrade_snapshot_taken == 1
+expr: k8s_upgrade_in_flight == 1 and k8s_upgrade_snapshot_taken == 0
for: 10m
labels:
severity: critical