From 448bc0c0f62f942eb151cf812952e7651fccbe6a Mon Sep 17 00:00:00 2001 From: Viktor Barzin Date: Mon, 11 May 2026 23:54:05 +0000 Subject: [PATCH] k8s-version-upgrade: decompose into Job chain to fix self-preemption MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit The agent-based v1 ran inside claude-agent-service (replicas=1, no nodeSelector) and self-evicted when it tried to drain its host (k8s-node4 on 2026-05-11). Cluster ended half-upgraded (master v1.34.7, workers v1.34.2) until manual recovery. Rewrite the pipeline as a chain of nodeSelector-pinned Jobs: preflight (k8s-node1) → master (k8s-node1) drains k8s-master → worker × 4 (k8s-node1) drains k8s-node{4,3,2} → worker (k8s-master + control-plane toleration) drains k8s-node1 → postflight (no pinning) Each Job runs scripts/upgrade-step.sh (case-on-$PHASE) and ends by envsubst-ing job-template.yaml into the next Job. Deterministic names (k8s-upgrade--[-]) make `kubectl apply` idempotent — a failed Job can be re-created without duplicating downstream. Also lands `predrain_unstick`: deletes pods on the target node whose PDB has 0 disruptionsAllowed. Without this, drain loops indefinitely on single-replica deployments (e.g. every Anubis instance — discovered the hard way during 2026-05-11 manual recovery of k8s-node3). Adds K8sUpgradeStalled alert (in_flight + started_timestamp > 90 min). Deprecates the agent prompt (renamed to *.deprecated.md with a header pointer to the new code). Apply order: k8s-version-upgrade first (consumes new SA + ConfigMaps), then monitoring (loads the new alert). Both applied 2026-05-11. 
Co-Authored-By: Claude Opus 4.7 --- ...e.md => k8s-version-upgrade.deprecated.md} | 51 +- docs/architecture/automated-upgrades.md | 90 ++-- docs/runbooks/k8s-version-upgrade.md | 328 ++++++++----- stacks/k8s-version-upgrade/job-template.yaml | 88 ++++ stacks/k8s-version-upgrade/main.tf | 447 ++++++++---------- .../scripts/upgrade-step.sh | 438 +++++++++++++++++ .../monitoring/prometheus_chart_values.tpl | 15 + 7 files changed, 1063 insertions(+), 394 deletions(-) rename .claude/agents/{k8s-version-upgrade.md => k8s-version-upgrade.deprecated.md} (89%) create mode 100644 stacks/k8s-version-upgrade/job-template.yaml create mode 100644 stacks/k8s-version-upgrade/scripts/upgrade-step.sh diff --git a/.claude/agents/k8s-version-upgrade.md b/.claude/agents/k8s-version-upgrade.deprecated.md similarity index 89% rename from .claude/agents/k8s-version-upgrade.md rename to .claude/agents/k8s-version-upgrade.deprecated.md index 39d5f306..fd0f774b 100644 --- a/.claude/agents/k8s-version-upgrade.md +++ b/.claude/agents/k8s-version-upgrade.deprecated.md @@ -1,10 +1,57 @@ --- -name: k8s-version-upgrade -description: "Automated K8s version upgrader. Verifies cluster health, takes an etcd snapshot, optionally fixes containerd skew on master, upgrades the control plane, then rolls workers sequentially with halt-on-alert gating and Slack notification at every transition." +name: k8s-version-upgrade-DEPRECATED +description: "DEPRECATED 2026-05-11 — replaced by the Job-chain in stacks/k8s-version-upgrade. See header below." tools: Read, Write, Edit, Bash, Grep, Glob model: opus --- +# DEPRECATED — Do NOT invoke this agent + +Retired **2026-05-11** after a self-preemption incident: this agent ran inside +the `claude-agent-service` Deployment (replicas=1, no nodeSelector) and was +scheduled onto k8s-node4. When the agent tried to `kubectl drain k8s-node4` +(Stage 6, first worker), it evicted itself. 
The bash process died mid-SSH, +leaving node4 cordoned and the cluster half-upgraded (master at v1.34.7, +workers at v1.34.2). + +## Replaced by + +A chain of small Kubernetes Jobs, each pinned (via `nodeSelector` + +`kubernetes.io/hostname`) to a node that is NOT its drain target. No pod can +preempt itself because each Job's pod and its target node are always +different. + +| Old | New | +|-----|-----| +| Single agent run in claude-agent-service pod | Chain of 7 phase Jobs (preflight → master → worker × 4 → postflight) | +| Whole pipeline in one prompt | Phase body in `stacks/k8s-version-upgrade/scripts/upgrade-step.sh`, dispatched per-phase via `case $PHASE` | +| Detection CronJob POSTs to `claude-agent-service` | Detection CronJob renders Job 0 from `job-template.yaml` via `envsubst` + `kubectl apply` | +| Drain blocks indefinitely on PDB=0 (e.g. single-replica Anubis) | New `predrain_unstick` deletes PDB-blocked pods so drain proceeds | +| `K8sVersionSkew` + `EtcdPreUpgradeSnapshotMissing` alerts | Above + `K8sUpgradeStalled` (in_flight=1 and time()-started_timestamp > 5400s) | + +## Where the logic lives now + +- **`infra/stacks/k8s-version-upgrade/scripts/upgrade-step.sh`** — universal + phase body. Dispatches on `$PHASE`. Each phase spawns the next Job. +- **`infra/stacks/k8s-version-upgrade/job-template.yaml`** — Job template + rendered by `envsubst` at runtime. ConfigMap-mounted at `/template` in + every Job pod. +- **`infra/stacks/k8s-version-upgrade/main.tf`** — Terraform stack: ConfigMaps, + unified `k8s-upgrade-job` ServiceAccount + RBAC, detection CronJob. +- **`infra/docs/runbooks/k8s-version-upgrade.md`** — operator runbook (kill a + stuck Job, skip a phase, manually re-trigger from a specific phase). + +## Why kept (not deleted) + +Documents the prompted-agent design and is useful as historical reference when +reading post-mortem discussions or comparing approaches. 
The `name` field has +been suffixed with `-DEPRECATED` so the agent cannot be invoked by name from +`claude-agent-service`. + +--- + +# Original prompt — DO NOT EXECUTE (reference only) + You are the K8s Version Upgrade Agent for a 5-node home-lab Kubernetes cluster (1 master, 4 workers, stacked etcd, no HA). ## Your Job diff --git a/docs/architecture/automated-upgrades.md b/docs/architecture/automated-upgrades.md index 8420d5bc..5451a0f3 100644 --- a/docs/architecture/automated-upgrades.md +++ b/docs/architecture/automated-upgrades.md @@ -4,7 +4,7 @@ This doc covers three independent automation paths: 1. **Service-level upgrades** — Container image bumps for OSS apps (DIUN → n8n → claude-agent → Terraform). Most of this doc. 2. **OS-level upgrades on K8s nodes** — `unattended-upgrades` + `kured` with sentinel-gate + Prometheus halt-on-alert. See "K8s Node OS Upgrades" section and the runbook at `docs/runbooks/k8s-node-auto-upgrades.md`. -3. **K8s component version upgrades** (kubeadm/kubelet/kubectl) — weekly detection CronJob → claude-agent-service → `k8s-version-upgrade` agent. See "K8s Version Upgrades" section and the runbook at `docs/runbooks/k8s-version-upgrade.md`. +3. **K8s component version upgrades** (kubeadm/kubelet/kubectl) — weekly detection CronJob → chain of phase Jobs (preflight → master → worker × 4 → postflight). See "K8s Version Upgrades" section and the runbook at `docs/runbooks/k8s-version-upgrade.md`. ## Overview @@ -257,31 +257,62 @@ k8s-version-check CronJob (Sun 12:00 UTC, k8s-upgrade ns) │ probe HEAD https://pkgs.k8s.io/.../v/deb/Release → next minor? 
│ push k8s_upgrade_available metric to Pushgateway │ - ▼ if running != latest -POST claude-agent-service /execute with target_version + kind - │ + ▼ if a target is detected +envsubst on /template/job-template.yaml | kubectl apply -f - + │ spawns Job 0 = k8s-upgrade-preflight- ▼ -k8s-version-upgrade agent (in claude-agent-service pod) - ├── pre-flight (5 nodes Ready, halt-on-alert, 24h-quiet, kubeadm plan match) - ├── etcd snapshot save → /mnt/main/etcd-backup/k8s-upgrade-pre-X.Y.Z-EPOCH.db - ├── master containerd bump (only if master version < workers') - ├── apt repo URL rewrite to v/deb on all 5 nodes (kind=minor only) - ├── drain master → ssh < update_k8s.sh --role master → uncordon → verify - ├── for each worker (k8s-node4 → 3 → 2 → 1): - │ halt-on-alert wait → drain → ssh < update_k8s.sh --role worker → uncordon → 10-min soak - └── post-flight (all nodes match target, alerts clean, pod-ready ratio ≥ 0.9) + +Job 0 — preflight (pinned: k8s-node1) +Job 1 — master upgrade (pinned: k8s-node1) drains k8s-master +Job 2 — worker (pinned: k8s-node1) drains k8s-node4 +Job 3 — worker (pinned: k8s-node1) drains k8s-node3 +Job 4 — worker (pinned: k8s-node1) drains k8s-node2 +Job 5 — worker (pinned: k8s-master) drains k8s-node1 ← control-plane toleration +Job 6 — postflight (no pinning) ``` +Each Job runs `scripts/upgrade-step.sh`, which dispatches on `$PHASE` and ends +by spawning the next Job (`envsubst < /template/job-template.yaml | kubectl +apply -f -`). Job names are deterministic (`k8s-upgrade--[-]`) +so `apply` reconciles to a single Job per run — re-running a failed Job +won't duplicate downstream Jobs. + +### Self-preemption history (the reason for the Job-chain rewrite) + +The v1 design ran the whole upgrade inside the `claude-agent-service` +Deployment (1 replica, no nodeSelector). On 2026-05-11 the agent's pod was +scheduled to k8s-node4. 
When the agent ran `kubectl drain k8s-node4` during +Stage 6, it evicted itself — the bash process died after the drain but +before the SSH-pipe to install kubeadm on node4. The cluster ended up +half-upgraded (master at v1.34.7, workers at v1.34.2). The rewrite to a +chain of `nodeSelector`-pinned Jobs eliminates this failure mode because +each Job's pod and its drain target are always different nodes. + ### Components -- **Detection CronJob**: `infra/stacks/k8s-version-upgrade/main.tf`. Image is the claude-agent-service image (alpine + kubectl + ssh-client + curl + jq). SA has cluster-read on nodes + ns-scoped get on `k8s-upgrade-creds` Secret. -- **Agent prompt**: `infra/.claude/agents/k8s-version-upgrade.md`. Inputs: `target_version`, `kind=patch|minor`, `dry_run`, `stages`. Tools: Bash, Read, Write, Edit, Grep, Glob. -- **Library node script**: `infra/scripts/update_k8s.sh`. Caller passes `--role master|worker --release X.Y.Z`. The agent pipes this via SSH onto each node. -- **Two new Upgrade Gates alerts** (added in this work): - - `K8sVersionSkew` — kubelet/apiserver gitVersion count >1 for 30m. Catches a half-done rollout. - - `EtcdPreUpgradeSnapshotMissing` — `k8s_upgrade_in_flight==1 && k8s_upgrade_snapshot_taken==0` for 10m. Catches Stage 2 failing silently. +- **Detection CronJob + ConfigMaps + RBAC**: `infra/stacks/k8s-version-upgrade/main.tf`. + - Image is the claude-agent-service image (kubectl + ssh-client + curl + jq + envsubst). + - One unified ServiceAccount `k8s-upgrade-job` serves both the detection CronJob and every chain Job. +- **Phase body**: `infra/stacks/k8s-version-upgrade/scripts/upgrade-step.sh`. + Dispatches on `$PHASE` (preflight | master | worker | postflight). Computes + `NEXT_PHASE` / `NEXT_TARGET_NODE` / `NEXT_RUN_ON` and spawns the next Job. 
+ Includes a `predrain_unstick` helper that pre-deletes pods on the target + node whose PDB has `disruptionsAllowed=0` (otherwise drain loops forever on + single-replica deployments like Anubis instances). +- **Job template**: `infra/stacks/k8s-version-upgrade/job-template.yaml`. + envsubst-rendered at runtime. Mounts a `creds` Secret, a `scripts` + ConfigMap, and a `template` ConfigMap into each Job pod. +- **Per-node script**: `infra/scripts/update_k8s.sh`. Caller passes + `--role master|worker --release X.Y.Z`. Piped via SSH into each node by + upgrade-step.sh. +- **Three Upgrade Gates alerts**: + - `K8sVersionSkew` — kubelet/apiserver `gitVersion` count >1 for 30m. Catches a half-done rollout. + - `EtcdPreUpgradeSnapshotMissing` — `k8s_upgrade_in_flight==1 && k8s_upgrade_snapshot_taken==0` for 10m. Catches preflight failing silently. + - `K8sUpgradeStalled` — `k8s_upgrade_in_flight==1 && time()-k8s_upgrade_started_timestamp > 5400` for 5m. Catches a chain Job dying without spawning its successor. 
- **Pushgateway metrics**: - - `k8s_upgrade_in_flight` / `k8s_upgrade_snapshot_taken` (pushed by agent) + - `k8s_upgrade_in_flight` (set in preflight, cleared in postflight) + - `k8s_upgrade_snapshot_taken` (set after etcd snapshot Job completes with ≥1 KiB) + - `k8s_upgrade_started_timestamp` (set in preflight; used by `K8sUpgradeStalled`) - `k8s_upgrade_available{kind,running,target}` (pushed by detection CronJob) - `k8s_version_check_last_run_timestamp` (staleness watchdog) @@ -289,31 +320,36 @@ k8s-version-upgrade agent (in claude-agent-service pod) | Concern | Location | |---|---| -| Detection CronJob, RBAC, ExternalSecret, Vault role | `stacks/k8s-version-upgrade/main.tf` | -| Agent orchestration | `.claude/agents/k8s-version-upgrade.md` | -| Library node script | `scripts/update_k8s.sh` | +| Stack (CronJob + ConfigMaps + SA/RBAC + ExternalSecret) | `stacks/k8s-version-upgrade/main.tf` | +| Phase orchestration | `stacks/k8s-version-upgrade/scripts/upgrade-step.sh` | +| Job template | `stacks/k8s-version-upgrade/job-template.yaml` | +| Per-node upgrade script | `scripts/update_k8s.sh` | | Alerts | `stacks/monitoring/modules/monitoring/prometheus_chart_values.tpl` (group "Upgrade Gates") | | Vault secrets | `secret/k8s-upgrade/{ssh_key, ssh_key_pub, slack_webhook}` | +| Deprecated agent prompt (reference) | `.claude/agents/k8s-version-upgrade.deprecated.md` | ### Why this design The cluster has a single control plane (no HA). A failed `kubeadm upgrade apply` is an outage. Mitigations: - **Mandatory etcd snapshot before every run** (even patch). Recovery point if master breaks. -- **Halt-on-alert before every drain**. Reuses the same Prometheus ignore-list regex kured uses — any unrelated cluster-health alert blocks. Two new gate alerts catch upgrade-specific half-states (version skew, missing snapshot). +- **Halt-on-alert before every drain**. Reuses the same Prometheus ignore-list regex kured uses — any unrelated cluster-health alert blocks. 
Three gate alerts catch upgrade-specific half-states (version skew, missing snapshot, stalled chain). +- **Job pinning eliminates self-preemption**. Each Job's pod runs on a node that is NOT its drain target. k8s-node1 hosts every Job except the one that drains it (which runs on k8s-master with a control-plane toleration). - **Sequential workers with 10-min inter-node soak**. Same risk-bounding as the 24h OS-reboot soak, but tightened because kubelet failures surface within minutes — not hours. - **Master upgrade goes first, workers last**. If master breaks, the cluster is already degraded so further worker upgrades would just delay recovery. By upgrading master first, we either succeed (workers can roll afterward) or fail loud (operator triages before any worker is touched). - **No auto-rollback**. kubeadm doesn't support clean downgrade; the snapshot + manual apt rollback in the runbook is the recovery path. +- **PDB-blocked pods don't stall the chain**. `predrain_unstick` deletes PDB=0 pods on the target node directly (bypassing the eviction API), so the parent Deployment recreates them elsewhere. This was the workaround applied manually during the 2026-05-11 recovery for Anubis single-replica instances. ### Secrets | Secret | Vault Path | Purpose | |--------|-----------|---------| -| SSH private key | `secret/k8s-upgrade.ssh_key` | Agent + detection CronJob SSH to all 5 nodes (user `wizard`) | +| SSH private key | `secret/k8s-upgrade.ssh_key` | Jobs SSH `wizard@` | | SSH public key | `secret/k8s-upgrade.ssh_key_pub` | Deployed to nodes' `~/.ssh/authorized_keys` | | Slack webhook | `secret/k8s-upgrade.slack_webhook` | Pipeline notifications (separate channel from kured) | -| Agent service bearer | `secret/claude-agent-service.api_bearer_token` (reused) | Detection CronJob POSTs to `/execute` | + +The previous `api_bearer_token` entry is gone — the chain does not POST to `claude-agent-service`. 
### Operational reference -See `docs/runbooks/k8s-version-upgrade.md` for: verifying health, manually triggering detection or the agent, rollback paths (master / worker / mid-flight abort), and SSH key rotation. +See `docs/runbooks/k8s-version-upgrade.md` for: verifying health, manually triggering detection, killing a stuck Job, skipping a phase, rollback paths (master / worker / mid-flight abort), and SSH key rotation. diff --git a/docs/runbooks/k8s-version-upgrade.md b/docs/runbooks/k8s-version-upgrade.md index 9519dcac..02deb88f 100644 --- a/docs/runbooks/k8s-version-upgrade.md +++ b/docs/runbooks/k8s-version-upgrade.md @@ -3,12 +3,15 @@ ## Overview Kubernetes component versions (`kubeadm`/`kubelet`/`kubectl`) on the 5 K8s -VMs are upgraded automatically by a weekly detection CronJob that fires the -`k8s-version-upgrade` agent through `claude-agent-service`. The agent walks -the cluster through pre-flight → etcd snapshot → optional master containerd -skew fix → optional apt repo URL rewrite (minor only) → master kubeadm -upgrade → workers rolled sequentially → post-flight, with Slack notification -at every transition and Prometheus halt-on-alert gating before every drain. +VMs are upgraded automatically by a weekly detection CronJob that seeds a +chain of small phase Jobs. Each Job is **pinned to a node that is NOT its +drain target** — so no pod in the chain can preempt itself. + +The chain (Sun 12:00 UTC weekly): + +``` +detection CronJob → preflight Job → master Job → worker × 4 Jobs → postflight Job +``` This is **independent** of the OS-side `unattended-upgrades + kured` pipeline (see `k8s-node-auto-upgrades.md`). They do not share rollouts and @@ -18,58 +21,106 @@ detection here runs Sun 12:00 UTC). 
## Architecture ``` -k8s-version-check CronJob (Sun 12:00 UTC) +k8s-version-check CronJob (Sun 12:00 UTC, k8s-upgrade ns, SA: k8s-upgrade-job) │ kubectl get nodes → running version │ ssh master 'apt-cache madison kubeadm' → latest patch (within current minor) │ HEAD pkgs.k8s.io/.../v/deb/Release → next minor available? + │ push k8s_upgrade_available{kind,running,target} → Pushgateway │ - ▼ if running != latest_patch OR next minor available -POST claude-agent-service /execute - { prompt: "Run k8s-version-upgrade agent. Inputs: {target_version, kind, dry_run, stages}" } - │ + ▼ if a target is detected +envsubst on /template/job-template.yaml | kubectl apply -f - + │ creates k8s-upgrade-preflight- ▼ -k8s-version-upgrade agent (inside claude-agent-service pod) - ├── Stage 0: parse inputs, mark in-flight annotation + Pushgateway gauge - ├── Stage 1: pre-flight (5 nodes Ready + halt-on-alert + 24h-quiet + plan target match) - ├── Stage 2: etcd snapshot save → /mnt/main/etcd-backup/k8s-upgrade-pre-X.Y.Z-EPOCH.db - │ push k8s_upgrade_snapshot_taken=1 - ├── Stage 3: master containerd bump (only if master < workers) - ├── Stage 4: apt repo URL rewrite to v/deb (only if kind=minor) - ├── Stage 5: drain master → ssh < update_k8s.sh --role master --release X.Y.Z → uncordon → verify - ├── Stage 6: each worker k8s-node4 → k8s-node3 → k8s-node2 → k8s-node1: - │ halt-on-alert wait → drain → ssh script --role worker → uncordon → 10-min soak - └── Stage 7: post-flight (all nodes match target, alerts clean, pod-ready ratio ≥ 0.9) - clear in-flight annotation, push k8s_upgrade_in_flight=0 + +Job 0 — preflight (pinned: k8s-node1) + ├── All nodes Ready + no Mem/Disk pressure + ├── halt-on-alert (kured-style ignore-list) + ├── 24h-quiet baseline (no Ready transitions <24h ago) + ├── kubeadm upgrade plan matches target + ├── Push k8s_upgrade_in_flight=1, k8s_upgrade_started_timestamp=$(date +%s) + ├── Trigger backup-etcd Job, wait, verify snapshot byte count + ├── SSH master: containerd 
skew fix (if master < workers) + ├── SSH all 5 nodes: apt repo URL rewrite (only kind=minor) + └── spawn_next → k8s-upgrade-master- + ▼ + +Job 1 — master upgrade (pinned: k8s-node1) + ├── halt-on-alert recheck (no firing alerts) + ├── drain k8s-master (predrain_unstick deletes PDB-blocked pods) + ├── ssh wizard@k8s-master 'bash -s' < /scripts/update_k8s.sh -- --role master --release X.Y.Z + ├── kubectl uncordon k8s-master; wait Ready + version match + ├── verify control-plane pods Running + ├── halt-on-alert recheck (allows RecentNodeReboot) + └── spawn_next → k8s-upgrade-worker--k8s-node4 + ▼ + +Job 2 — worker k8s-node4 (pinned: k8s-node1) +Job 3 — worker k8s-node3 (pinned: k8s-node1) +Job 4 — worker k8s-node2 (pinned: k8s-node1) + (identical pattern: halt-on-alert wait 30m → drain → ssh script → uncordon → 10-min soak → spawn_next) + ▼ + +Job 5 — worker k8s-node1 (pinned: k8s-master + control-plane toleration) + └── spawn_next → k8s-upgrade-postflight- + ▼ + +Job 6 — postflight (no pinning) + ├── Verify all 5 nodes at target version + ├── Verify no firing Upgrade Gates alerts + ├── Compute pod-ready ratio (should be ≥ 0.9) + ├── Clear k8s-upgrade-* annotations on namespace + ├── Push k8s_upgrade_in_flight=0, k8s_upgrade_snapshot_taken=0, k8s_upgrade_started_timestamp=0 + └── Slack: ✅ K8s upgrade complete ``` +**Pin choices summarised:** +- k8s-node1 hosts every Job that drains master or another worker. k8s-node1 + itself is upgraded **last**. +- k8s-master hosts Job 5 (which drains k8s-node1). Job 5's spec includes a + toleration for `node-role.kubernetes.io/control-plane:NoSchedule`. +- If anyone reorders the worker sequence, the pin for Job 5 needs to track + whatever worker is upgraded last. The mapping is in `scripts/upgrade-step.sh` + → the `case "${PHASE}:${TARGET_NODE:-}"` block. 
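The pin choices and deterministic Job names above can be sketched in a few lines of bash. This is a hypothetical illustration, not the contents of `scripts/upgrade-step.sh`: the function names `next_step` and `job_name`, the `-` placeholder for "no node", and the assumption that the master Job may carry either an empty or a `k8s-master` target are all invented here. Only the node order, the pins, and the `k8s-upgrade-worker-$TGT_LBL-k8s-node4` naming shape come from this runbook.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the chain's pin mapping; NOT the real upgrade-step.sh.
set -euo pipefail

# Prints "<next_phase> <next_target_node> <next_run_on>" for the current step.
# "-" stands for "no target node" or "no pinning" (an assumption of this sketch).
next_step() {
  local phase="$1" target_node="${2:-}"
  case "${phase}:${target_node}" in
    preflight:)                echo "master k8s-master k8s-node1" ;;
    master:|master:k8s-master) echo "worker k8s-node4 k8s-node1" ;;
    worker:k8s-node4)          echo "worker k8s-node3 k8s-node1" ;;
    worker:k8s-node3)          echo "worker k8s-node2 k8s-node1" ;;
    # The last worker Job drains k8s-node1, so it must run on k8s-master.
    worker:k8s-node2)          echo "worker k8s-node1 k8s-master" ;;
    worker:k8s-node1)          echo "postflight - -" ;;
    *) echo "unknown step: ${phase}:${target_node}" >&2; return 1 ;;
  esac
}

# Builds a deterministic Job name: dots in the version become dashes,
# and the node suffix is appended only when a node is given.
job_name() {
  local phase="$1" target_version="$2" node="${3:-}"
  local label="${target_version//./-}"
  echo "k8s-upgrade-${phase}-${label}${node:+-${node}}"
}
```

For example, `next_step worker k8s-node2` yields the one step whose runner is `k8s-master`. If the worker order changes, that is the line whose pin has to move with it.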
+ ## Components -### Detection CronJob (`k8s-version-check`) -- **Stack**: `infra/stacks/k8s-version-upgrade/main.tf` -- **Image**: `forgejo.viktorbarzin.me/viktor/claude-agent-service` (ships kubectl, ssh-client, curl, jq) -- **Schedule**: `0 12 * * 0` (Sunday 12:00 UTC). Outside kured window. -- **SA**: `k8s-version-check` (cluster-read nodes, ns-scoped get on `k8s-upgrade-creds` Secret) -- **Pushgateway metrics**: - - `k8s_upgrade_available{kind, running, target}` — 1 when a target is detected - - `k8s_version_check_last_run_timestamp` — staleness watchdog +### Shared resources (one-time, Terraform-managed) -### Agent (`k8s-version-upgrade`) -- **Prompt**: `infra/.claude/agents/k8s-version-upgrade.md` -- **Runtime**: claude-agent-service pod (claude-agent ns) -- **Inputs** (JSON in prompt): `target_version`, `kind` (patch|minor), `dry_run`, `stages` -- **Library script**: `infra/scripts/update_k8s.sh` (run on each node via SSH pipe — `ssh ... 'bash -s' < update_k8s.sh -- --role master|worker --release X.Y.Z`) +| Resource | Purpose | +|---|---| +| **ConfigMap `k8s-upgrade-scripts`** | Mounts `/scripts/upgrade-step.sh` (universal phase body, dispatches on `$PHASE`) and `/scripts/update_k8s.sh` (per-node kubeadm/kubelet/kubectl upgrade body — same script the old manual loop used) in every Job pod. | +| **ConfigMap `k8s-upgrade-job-template`** | Mounts `/template/job-template.yaml` — universal Job manifest with envsubst placeholders. Rendered by upgrade-step.sh and the detection CronJob via `envsubst | kubectl apply`. | +| **ServiceAccount `k8s-upgrade-job`** | Used by both the detection CronJob and every chain Job. ClusterRole binding grants: nodes get/list/patch, pods/eviction create, pods delete, batch/jobs CRUD, PDB list (for predrain_unstick), CronJob get (snapshot trigger), namespaces patch on `k8s-upgrade` only. Namespace-scoped Role binding grants secrets:get on `k8s-upgrade-creds`. 
| +| **ExternalSecret `k8s-upgrade-creds`** | Syncs `secret/k8s-upgrade/{ssh_key, slack_webhook}` from Vault. Mounted into every Job at `/secrets/k8s-upgrade`. | +| **CronJob `k8s-version-check`** | Sun 12:00 UTC. Probes apt + pkgs.k8s.io for target. If found, renders Job 0 from `job-template.yaml` and applies it. | -### Upgrade Gates alerts (additions for this pipeline) -- **`K8sVersionSkew`** — distinct kubelet/apiserver `gitVersion` count >1 for 30m. Catches a half-done rollout where some nodes are upgraded and some aren't. -- **`EtcdPreUpgradeSnapshotMissing`** — `k8s_upgrade_in_flight==1 && k8s_upgrade_snapshot_taken==0` for 10m. Catches Stage 2 failing silently. -- Both join the existing 10 Upgrade Gates alerts (KubeAPIServerDown, RecentNodeReboot, etc.) — kured ALSO blocks rolling reboots whenever any of these are firing. +### Pushgateway metrics + +Pushed by upgrade-step.sh during phase execution; observed by the +`Upgrade Gates` alert group in `stacks/monitoring/.../prometheus_chart_values.tpl`: + +| Metric | Pushed by | Cleared by | +|---|---|---| +| `k8s_upgrade_in_flight` (1/0) | preflight Job (set to 1) | postflight Job (set to 0) | +| `k8s_upgrade_started_timestamp` (epoch s) | preflight Job | postflight Job (set to 0) | +| `k8s_upgrade_snapshot_taken` (1/0) | preflight Job (set to 1 after Job=`pre-upgrade-etcd-*` completes with `Backup done:` log of ≥1 KiB) | postflight Job (0) | +| `k8s_upgrade_available{kind,running,target}` | detection CronJob | next detection run (overwrite) | +| `k8s_version_check_last_run_timestamp` | detection CronJob | (cumulative) | + +### Upgrade Gates alerts (`Upgrade Gates` group in prometheus_chart_values.tpl) + +- **`K8sVersionSkew`** — distinct kubelet/apiserver `gitVersion` count > 1 for 30m. Catches a half-done rollout. +- **`EtcdPreUpgradeSnapshotMissing`** — `k8s_upgrade_in_flight==1 && k8s_upgrade_snapshot_taken==0` for 10m. Catches preflight Stage 2 failing silently. 
+- **`K8sUpgradeStalled`** — `k8s_upgrade_in_flight==1 && time()-k8s_upgrade_started_timestamp > 5400` for 5m. Catches a Job in the chain dying without spawning its successor. +- All three alerts ALSO block kured (same `--prometheus-url` halt-on-alert mechanism) so the OS-reboot pipeline can't run on top of a half-done version upgrade. ### Vault secrets -- `secret/k8s-upgrade/ssh_key` — ed25519 PRIVATE key, used by detection CronJob + agent to SSH into all 5 nodes (user `wizard`) -- `secret/k8s-upgrade/ssh_key_pub` — matching PUBLIC key, deployed to `/home/wizard/.ssh/authorized_keys` on every node -- `secret/k8s-upgrade/slack_webhook` — Slack incoming-webhook URL (separate channel from kured for clean alerting) -Both keys exposed in K8s via ExternalSecret `k8s-upgrade-creds` in `k8s-upgrade` namespace. +- `secret/k8s-upgrade/ssh_key` — ed25519 PRIVATE key, used by Jobs to SSH `wizard@` +- `secret/k8s-upgrade/ssh_key_pub` — matching PUBLIC key, deployed to nodes' `~/.ssh/authorized_keys` +- `secret/k8s-upgrade/slack_webhook` — Slack incoming-webhook URL + +Exposed in K8s via ExternalSecret `k8s-upgrade-creds` in the `k8s-upgrade` namespace. The previous `api_bearer_token` entry is GONE — the chain does not POST to `claude-agent-service`. ## Common Operations @@ -78,13 +129,17 @@ Both keys exposed in K8s via ExternalSecret `k8s-upgrade-creds` in `k8s-upgrade` # CronJob present + not suspended kubectl -n k8s-upgrade get cronjob k8s-version-check -# Latest run output -kubectl -n k8s-upgrade get jobs -l app=k8s-version-check -kubectl -n k8s-upgrade logs -l app=k8s-version-check --tail=200 +# Latest detection run output +kubectl -n k8s-upgrade get jobs -l app=k8s-version-upgrade +kubectl -n k8s-upgrade logs -l app=k8s-version-upgrade --tail=200 -# Pushgateway metric — fresh discovery? 
-curl -s http://prometheus-prometheus-pushgateway.monitoring:9091/metrics | \ - grep -E '^(k8s_upgrade_available|k8s_version_check_last_run_timestamp)' +# Chain Jobs from the last run (retained 7 days via ttlSecondsAfterFinished) +kubectl -n k8s-upgrade get jobs -l app=k8s-upgrade-chain + +# Pushgateway — running detection metric +kubectl -n monitoring exec deploy/prometheus-server -c prometheus-server -- \ + wget -q -O- 'http://prometheus-prometheus-pushgateway.monitoring:9091/metrics' | \ + grep -E '^(k8s_upgrade_(available|in_flight|started_timestamp|snapshot_taken)|k8s_version_check_last_run_timestamp)' # Upgrade Gates rules loaded kubectl -n monitoring exec deploy/prometheus-server -c prometheus-server -- \ @@ -92,79 +147,116 @@ kubectl -n monitoring exec deploy/prometheus-server -c prometheus-server -- \ jq -r '.data.groups[] | select(.name == "Upgrade Gates") | .rules[] | " \(.name): \(.state)"' ``` -### Manually trigger a detection run (no upgrade) -Use `detection_dry_run=true` to short-circuit before the POST to -claude-agent-service: +### Manually trigger detection (no upgrade) +Use `detection_dry_run=true` to short-circuit before spawning Job 0: ```bash -# One-shot job from the cron, with DRY_RUN env override: +# Toggle var in TF, apply, and trigger +# (in stacks/k8s-version-upgrade/main.tf) +# variable "detection_dry_run" { default = true } +# scripts/tg apply kubectl -n k8s-upgrade create job --from=cronjob/k8s-version-check version-check-test kubectl -n k8s-upgrade logs -l job-name=version-check-test -f +# When done, flip back to false. ``` -To make `detection_dry_run` permanent (e.g. while debugging), -toggle the var in `stacks/k8s-version-upgrade/main.tf` and `scripts/tg apply`. - -### Manually dispatch the agent (skip detection) -Useful when you want to force a run on a specific version without waiting for -Sunday, or when testing. +### Manually trigger the chain (skip detection) +Useful for testing or to force a specific target. 
Render Job 0 directly: ```bash -TOKEN=$(vault kv get -field=api_bearer_token secret/claude-agent-service) +TARGET=1.34.7 +KIND=patch +IMAGE=$(kubectl -n k8s-upgrade get cronjob k8s-version-check \ + -o jsonpath='{.spec.jobTemplate.spec.template.spec.containers[0].image}') -# Dry-run (no mutations) -curl -X POST http://claude-agent-service.claude-agent.svc.cluster.local:8080/execute \ - -H "Authorization: Bearer $TOKEN" \ - -H "Content-Type: application/json" \ - -d '{ - "prompt": "Run the k8s-version-upgrade agent. Inputs: {\"target_version\":\"1.34.5\",\"kind\":\"patch\",\"dry_run\":true,\"stages\":\"all\"}", - "agent": ".claude/agents/k8s-version-upgrade", - "max_budget_usd": 5 - }' - -# Snapshot-only (Test 3 in the plan) -curl -X POST ... -d '{ - "prompt": "Run the k8s-version-upgrade agent. Inputs: {\"target_version\":\"1.34.5\",\"kind\":\"patch\",\"dry_run\":false,\"stages\":\"preflight,snapshot\"}", - ... -}' - -# Real run -curl -X POST ... -d '{ - "prompt": "... Inputs: {\"target_version\":\"1.34.5\",\"kind\":\"patch\",\"dry_run\":false,\"stages\":\"all\"}", - ... -}' +cat < | tail -50 +kubectl -n k8s-upgrade logs job/ + +# 2. Diagnose. Common causes: +# - drain stuck on PDB-violating pod (predrain_unstick should handle this; +# but a brand-new PDB pattern could escape it — manually delete the pod) +# - SSH from Job pod failing (node restarted? known_hosts mismatch?) +# - kubeadm upgrade failed on a node (check journalctl + apt history on that node) + +# 3. Fix the root cause first. + +# 4. Delete the failed Job + re-spawn it. Naming is deterministic so +# `kubectl apply` of the same name reconciles to a single Job. +kubectl -n k8s-upgrade delete job/ + +# 5. Manually render + apply the same Job. 
Pull the template + spec from the +# next-Job-creation block in upgrade-step.sh — easiest is to copy from a +# sibling Job's YAML: +kubectl -n k8s-upgrade get job/ -o yaml \ + | yq 'del(.metadata.resourceVersion, .metadata.uid, .metadata.creationTimestamp, .metadata.managedFields, .status)' \ + | yq '.metadata.name = ""' \ + | yq '.spec.template.spec.containers[0].env[] | select(.name=="PHASE") .value = ""' \ + | kubectl apply -f - + +# The chain will continue from there. The next-Job-creation step in upgrade-step.sh +# is idempotent (deterministic name) so re-running won't duplicate downstream. +``` + +### Skip a phase (advanced; use sparingly) +If you've already done the work for a phase manually and want the chain to +jump past it, manually create the NEXT phase's Job with the deterministic +name. The previous phase's spawn-next will see the Job already exists and +short-circuit. Example: master already on target; jump straight to worker: + +```bash +TARGET=1.34.7 +TGT_LBL=${TARGET//./-} +# (compose Job from upgrade-step.sh spawn_next code, name=k8s-upgrade-worker-$TGT_LBL-k8s-node4, run on k8s-node1) ``` ### Halt the pipeline in an emergency -The pipeline is gated by Prometheus alerts — any firing Upgrade Gates alert -blocks the next drain. 
To explicitly halt: ```bash -# Option 1: suspend the detection CronJob (won't stop an in-flight agent run) +# Option 1: suspend the detection CronJob (won't stop an in-flight chain) kubectl -n k8s-upgrade patch cronjob k8s-version-check \ -p '{"spec":{"suspend":true}}' --type=merge -# Re-enable: --type=merge -p '{"spec":{"suspend":false}}' +# Re-enable: -p '{"spec":{"suspend":false}}' -# Option 2: kill an in-flight agent job -TOKEN=$(vault kv get -field=api_bearer_token secret/claude-agent-service) -JOB_ID=$(curl -s -H "Authorization: Bearer $TOKEN" \ - http://claude-agent-service.claude-agent.svc.cluster.local:8080/jobs | \ - jq -r '.[] | select(.agent | test("k8s-version-upgrade")) | .id' | head -1) -curl -X DELETE -H "Authorization: Bearer $TOKEN" \ - http://claude-agent-service.claude-agent.svc.cluster.local:8080/jobs/$JOB_ID +# Option 2: delete all in-flight chain Jobs +kubectl -n k8s-upgrade delete jobs -l app=k8s-upgrade-chain +# This leaves the in-flight annotation + Pushgateway gauge intact — +# K8sUpgradeStalled will fire to surface the halt. 
-# Option 3: force a blocker alert (Upgrade Gates expression that always fires) -# — see infra/docs/runbooks/k8s-node-auto-upgrades.md "Force halt by adding a custom blocker alert" +# Option 3: force a blocker alert (same regex kured uses) +# — see k8s-node-auto-upgrades.md "Force halt by adding a custom blocker alert" +``` + +### Clear orphaned in-flight state +After deciding NOT to retry a halted chain: + +```bash +kubectl annotate ns k8s-upgrade \ + viktorbarzin.me/k8s-upgrade-in-flight- \ + viktorbarzin.me/k8s-upgrade-target- \ + viktorbarzin.me/k8s-upgrade-snapshot-path- + +# Reset Pushgateway gauges so K8sUpgradeStalled / EtcdPreUpgradeSnapshotMissing clear: +kubectl -n monitoring port-forward svc/prometheus-prometheus-pushgateway 9091:9091 & +printf '# TYPE k8s_upgrade_in_flight gauge\nk8s_upgrade_in_flight 0\n# TYPE k8s_upgrade_snapshot_taken gauge\nk8s_upgrade_snapshot_taken 0\n# TYPE k8s_upgrade_started_timestamp gauge\nk8s_upgrade_started_timestamp 0\n' \ + | curl --data-binary @- http://localhost:9091/metrics/job/k8s-version-upgrade +kill %1 ``` ### Rollback paths - `kubeadm` does **not** support in-place downgrade. If a run fails: #### Master broke during/after kubeadm upgrade @@ -187,21 +279,6 @@ curl -X DELETE -H "Authorization: Bearer $TOKEN" \ 3. `kubectl uncordon ` 4. The cluster continues running on the master + remaining workers throughout -#### Pipeline aborts mid-flight (halt-on-alert blocks >30 min) -- The agent posts a Slack message with the blocking alert list and exits non-zero -- The in-flight annotation on `ns/k8s-upgrade` stays set → `EtcdPreUpgradeSnapshotMissing` may fire if Stage 2 didn't complete -- Operator: triage the blocker, clear the alert, re-dispatch the agent manually (see "Manually dispatch the agent") -- After successful retry: the agent's Stage 7 clears the annotation. 
If you decide NOT to retry, clear by hand: - ```bash - kubectl annotate ns k8s-upgrade \ - viktorbarzin.me/k8s-upgrade-in-flight- \ - viktorbarzin.me/k8s-upgrade-target- \ - viktorbarzin.me/k8s-upgrade-snapshot-path- - # Also reset the Pushgateway gauge so the alert clears: - printf '# TYPE k8s_upgrade_in_flight gauge\nk8s_upgrade_in_flight 0\n' | \ - curl --data-binary @- http://prometheus-prometheus-pushgateway.monitoring:9091/metrics/job/k8s-version-upgrade - ``` - ### One-shot SSH key rotation 1. Generate new keypair: `ssh-keygen -t ed25519 -f /tmp/k8s-upgrade -N ""` 2. Update Vault: @@ -213,26 +290,31 @@ curl -X DELETE -H "Authorization: Bearer $TOKEN" \ 3. Push the new pubkey to every node: ```bash for n in k8s-master k8s-node1 k8s-node2 k8s-node3 k8s-node4; do - # Remove old upgrade key (tag with "k8s-upgrade") then append new ssh wizard@$n 'sed -i "/k8s-upgrade-key$/d" ~/.ssh/authorized_keys' ssh wizard@$n 'echo "$(cat /tmp/k8s-upgrade.pub) k8s-upgrade-key" >> ~/.ssh/authorized_keys' done ``` -4. ESO refreshes the K8s Secret within 15 min — or force: `kubectl -n k8s-upgrade annotate externalsecret k8s-upgrade-creds force-sync=$(date +%s) --overwrite` +4. ESO refreshes within 15 min — or force: `kubectl -n k8s-upgrade annotate externalsecret k8s-upgrade-creds force-sync=$(date +%s) --overwrite` ## Past Incidents -- (none yet — pipeline went live 2026-05-10) -- Pre-pipeline manual upgrades documented in commit history; the `update_k8s.sh` shell of those manual runs is preserved in `infra/scripts/update_k8s.sh` and is what the agent shells into nodes with. +### 2026-05-11 — Self-preemption (agent → Job-chain rewrite) +- The v1 agent ran inside the `claude-agent-service` Deployment (replicas=1, no nodeSelector) and was scheduled to k8s-node4. +- During Stage 6 (first worker drain) the agent ran `kubectl drain k8s-node4` — evicting itself. +- The bash process died after the drain but before the SSH-pipe to install kubeadm on node4. 
+- Node4 was left cordoned; cluster stuck at master v1.34.7, workers v1.34.2 until manual recovery. +- **Mitigation**: rewrote the pipeline as a chain of Jobs, each `nodeSelector`-pinned to a non-target node. New `predrain_unstick` step deletes PDB-blocked single-replica pods (Anubis pattern) before drain so they don't loop forever. Added `K8sUpgradeStalled` alert (in-flight + started_timestamp > 90 min). ## File Pointers | What | Where | |------|-------| -| Detection CronJob + RBAC + ExternalSecret | `infra/stacks/k8s-version-upgrade/main.tf` | -| Agent prompt | `infra/.claude/agents/k8s-version-upgrade.md` | -| Library node script | `infra/scripts/update_k8s.sh` | -| Upgrade Gates alerts (incl. K8sVersionSkew + EtcdPreUpgradeSnapshotMissing) | `infra/stacks/monitoring/modules/monitoring/prometheus_chart_values.tpl` | +| Stack (CronJob + ConfigMaps + SA/RBAC + ExternalSecret) | `infra/stacks/k8s-version-upgrade/main.tf` | +| Universal phase body | `infra/stacks/k8s-version-upgrade/scripts/upgrade-step.sh` | +| Job template | `infra/stacks/k8s-version-upgrade/job-template.yaml` | +| Per-node upgrade script | `infra/scripts/update_k8s.sh` | +| Upgrade Gates alerts | `infra/stacks/monitoring/modules/monitoring/prometheus_chart_values.tpl` (group "Upgrade Gates") | | Vault secrets | `secret/k8s-upgrade/{ssh_key, ssh_key_pub, slack_webhook}` | -| Architecture doc | `infra/docs/architecture/automated-upgrades.md` — "K8s Version Upgrades" section | +| Architecture doc | `infra/docs/architecture/automated-upgrades.md` (K8s Version Upgrades section) | | Related (OS reboots) | `infra/docs/runbooks/k8s-node-auto-upgrades.md` | +| Deprecated agent prompt (reference) | `infra/.claude/agents/k8s-version-upgrade.deprecated.md` | diff --git a/stacks/k8s-version-upgrade/job-template.yaml b/stacks/k8s-version-upgrade/job-template.yaml new file mode 100644 index 00000000..d287117c --- /dev/null +++ b/stacks/k8s-version-upgrade/job-template.yaml @@ -0,0 +1,88 @@ +# k8s-upgrade-chain 
Job template. +# +# Rendered by `envsubst` inside upgrade-step.sh (and the detection CronJob) +# before `kubectl apply`. All ${VAR} placeholders are envsubst-side; this file +# is NOT processed by Terraform. +# +# Required environment for envsubst: +# JOB_NAME unique-per-(phase, target_version[, target_node]) +# PHASE_NEXT phase the Job runs (preflight|master|worker|postflight) +# TARGET_NODE_NEXT worker node the Job drains (empty for preflight/master/postflight) +# TARGET_VERSION X.Y.Z +# TARGET_VERSION_LABEL X-Y-Z (label-safe) +# KIND patch | minor +# IMAGE container image to run upgrade-step.sh +# SCHEDULING_BLOCK YAML fragment with nodeSelector/tolerations (may be empty) +# +# Idempotency: name is deterministic per (phase, target_version[, target_node]) +# so `kubectl apply` reconciles to a single Job per run. +apiVersion: batch/v1 +kind: Job +metadata: + name: ${JOB_NAME} + namespace: k8s-upgrade + labels: + app: k8s-upgrade-chain + phase: ${PHASE_NEXT} + target-version: "${TARGET_VERSION_LABEL}" +spec: + ttlSecondsAfterFinished: 604800 # 7 days for postmortem review + backoffLimit: 1 + template: + metadata: + labels: + app: k8s-upgrade-chain + phase: ${PHASE_NEXT} + spec: + serviceAccountName: k8s-upgrade-job + restartPolicy: Never +${SCHEDULING_BLOCK} + imagePullSecrets: + - name: registry-credentials + containers: + - name: upgrade-step + image: ${IMAGE} + env: + - name: PHASE + value: "${PHASE_NEXT}" + - name: TARGET_NODE + value: "${TARGET_NODE_NEXT}" + - name: TARGET_VERSION + value: "${TARGET_VERSION}" + - name: KIND + value: "${KIND}" + - name: IMAGE + value: "${IMAGE}" + - name: HOME + value: "/tmp" + command: ["/bin/bash", "/scripts/upgrade-step.sh"] + volumeMounts: + - name: creds + mountPath: /secrets/k8s-upgrade + readOnly: true + - name: scripts + mountPath: /scripts + readOnly: true + - name: template + mountPath: /template + readOnly: true + resources: + requests: + cpu: "100m" + memory: "256Mi" + limits: + memory: "512Mi" + volumes: + - name: creds +
secret: + secretName: k8s-upgrade-creds + # 0444 so the non-root container can read; upgrade-step.sh copies + # the SSH key to /tmp/ssh_key with mode 0400 for openssh. + defaultMode: 0444 + - name: scripts + configMap: + name: k8s-upgrade-scripts + defaultMode: 0755 + - name: template + configMap: + name: k8s-upgrade-job-template diff --git a/stacks/k8s-version-upgrade/main.tf b/stacks/k8s-version-upgrade/main.tf index 29652ca3..af4cc6de 100644 --- a/stacks/k8s-version-upgrade/main.tf +++ b/stacks/k8s-version-upgrade/main.tf @@ -1,44 +1,48 @@ # k8s-version-upgrade — Automated K8s component (kubeadm/kubelet/kubectl) upgrade # -# Detects new patch/minor versions via a weekly CronJob, then dispatches the -# `k8s-version-upgrade` agent (infra/.claude/agents/k8s-version-upgrade.md) -# through claude-agent-service for the actual rolling upgrade. +# Architecture: detection CronJob → chain of small Jobs, one per phase. Each +# Job's pod runs on a node that is NOT its drain target — eliminates the +# self-preemption bug that killed the agent-based v1 (2026-05-11 incident). +# +# Chain (Job 0 → Job 6): +# preflight (pinned: k8s-node1) +# master (pinned: k8s-node1; drains k8s-master) +# worker (pinned: k8s-node1; drains k8s-node4 → 3 → 2) +# worker (pinned: k8s-master + control-plane toleration; drains k8s-node1 last) +# postflight (no pinning) +# +# Each phase Job's container runs scripts/upgrade-step.sh which: +# - dispatches on $PHASE +# - spawns the next Job via envsubst on job-template.yaml +# - uses deterministic naming (k8s-upgrade-${phase}-${target_version}[-${node}]) +# so re-running on failure reconciles to a single Job per run. 
# # Reuse points: -# - claude-agent-service.claude-agent.svc:8080 — agent job runner -# - Vault secret/k8s-upgrade/* — operator populates ssh_key + slack_webhook -# - Prometheus + Pushgateway + Upgrade Gates alert group (in monitoring stack) -# - update_k8s.sh — library script the agent shells into nodes with -# -# Notes: -# - Schedule is Sun 12:00 UTC — well outside the kured Mon-Fri 02:00-06:00 -# London window so OS reboots and K8s version rollouts can't overlap. -# - Patch detection uses `apt-cache madison kubeadm` on master via SSH. -# Minor detection probes the next-minor apt repo URL with HEAD. +# - claude-agent-service image (kubectl + ssh + jq + curl + envsubst) +# - Vault secret/k8s-upgrade/* (ssh_key, slack_webhook) +# - Prometheus + Pushgateway + Upgrade Gates alerts +# - default/backup-etcd CronJob (snapshot trigger) +# - infra/scripts/update_k8s.sh (per-node upgrade body) variable "schedule" { type = string - default = "0 12 * * 0" # Sunday 12:00 UTC + default = "0 12 * * 0" # Sunday 12:00 UTC — outside kured window } -# Toggle to suspend the detection CronJob without dropping the stack. variable "enabled" { type = bool default = true } -# Mirrors `local.image_tag` in stacks/claude-agent-service/main.tf — keep in -# sync when the claude-agent-service image is rebuilt. Reused here because the -# detection CronJob only needs kubectl, ssh-client, curl, jq — all of which -# the claude-agent-service image already ships. -variable "claude_agent_service_image_tag" { +# Mirrors `local.image_tag` in stacks/claude-agent-service/main.tf — bump +# in lockstep with claude-agent-service rebuilds. The image ships kubectl, +# ssh-client, curl, jq, envsubst — everything the upgrade Jobs need. +variable "image_tag" { type = string default = "2fd7670d" } -# If true, the CronJob runs the detection sequence but does NOT POST to -# claude-agent-service. Used for Test 1 to confirm detection works without -# firing a real upgrade. 
+# When true, detection runs but does NOT spawn the preflight Job. variable "detection_dry_run" { type = bool default = false @@ -46,9 +50,9 @@ variable "detection_dry_run" { locals { namespace = "k8s-upgrade" - ca_image = "forgejo.viktorbarzin.me/viktor/claude-agent-service:${var.claude_agent_service_image_tag}" + image = "forgejo.viktorbarzin.me/viktor/claude-agent-service:${var.image_tag}" labels = { - app = "k8s-version-check" + app = "k8s-version-upgrade" } } @@ -62,21 +66,19 @@ resource "kubernetes_namespace" "k8s_upgrade" { } } lifecycle { - # KYVERNO_LIFECYCLE_V1: goldilocks-vpa-auto-mode ClusterPolicy stamps this label on every namespace + # KYVERNO_LIFECYCLE_V1: goldilocks-vpa-auto-mode ClusterPolicy stamps this label ignore_changes = [metadata[0].labels["goldilocks.fairwinds.com/vpa-update-mode"]] } } -# --- ExternalSecret: ssh_key + slack_webhook + agent-service bearer --- +# --- ExternalSecret: SSH key + Slack webhook --- # # Operator populates Vault `secret/k8s-upgrade/` with: -# - ssh_key (PEM-encoded ed25519 private key) -# - ssh_key_pub (the matching public key — distributed to nodes' authorized_keys) -# - slack_webhook (Slack incoming-webhook URL, separate channel from kured for clean alerting) +# - ssh_key (ed25519 PRIVATE key, used to SSH wizard@ from Jobs) +# - ssh_key_pub (matching public key, deployed to nodes' authorized_keys) +# - slack_webhook (incoming-webhook URL) # -# The claude-agent-service bearer token comes from secret/claude-agent-service -# (reused — no parallel token needed). - +# No claude-agent bearer needed — the chain no longer POSTs to that service. 
resource "kubernetes_manifest" "external_secret" { manifest = { apiVersion = "external-secrets.io/v1beta1" @@ -109,191 +111,157 @@ resource "kubernetes_manifest" "external_secret" { property = "slack_webhook" } }, - { - secretKey = "api_bearer_token" - remoteRef = { - key = "claude-agent-service" - property = "api_bearer_token" - } - }, ] } } } -# --- ServiceAccount + RBAC for the detection CronJob --- - -resource "kubernetes_service_account" "k8s_version_check" { - metadata { - name = "k8s-version-check" - namespace = kubernetes_namespace.k8s_upgrade.metadata[0].name - } -} - -# Cluster-wide read on nodes (for kubeletVersion comparison) -resource "kubernetes_cluster_role" "k8s_version_check" { - metadata { - name = "k8s-version-check" - } - rule { - api_groups = [""] - resources = ["nodes"] - verbs = ["get", "list"] - } -} - -resource "kubernetes_cluster_role_binding" "k8s_version_check" { - metadata { - name = "k8s-version-check" - } - role_ref { - api_group = "rbac.authorization.k8s.io" - kind = "ClusterRole" - name = kubernetes_cluster_role.k8s_version_check.metadata[0].name - } - subject { - kind = "ServiceAccount" - name = kubernetes_service_account.k8s_version_check.metadata[0].name - namespace = kubernetes_namespace.k8s_upgrade.metadata[0].name - } -} - -# Namespace-scoped: detection CronJob reads its own creds Secret. 
-resource "kubernetes_role" "k8s_version_check_secrets" { - metadata { - name = "k8s-version-check-secrets" - namespace = kubernetes_namespace.k8s_upgrade.metadata[0].name - } - rule { - api_groups = [""] - resources = ["secrets"] - resource_names = ["k8s-upgrade-creds"] - verbs = ["get"] - } -} - -resource "kubernetes_role_binding" "k8s_version_check_secrets" { - metadata { - name = "k8s-version-check-secrets" - namespace = kubernetes_namespace.k8s_upgrade.metadata[0].name - } - role_ref { - api_group = "rbac.authorization.k8s.io" - kind = "Role" - name = kubernetes_role.k8s_version_check_secrets.metadata[0].name - } - subject { - kind = "ServiceAccount" - name = kubernetes_service_account.k8s_version_check.metadata[0].name - namespace = kubernetes_namespace.k8s_upgrade.metadata[0].name - } -} - -# --- Cross-namespace RBAC: claude-agent SA reads k8s-upgrade-creds + annotates ns --- +# --- Unified ServiceAccount + RBAC --- # -# The k8s-version-upgrade agent runs inside the claude-agent-service pod (SA -# `claude-agent` in `claude-agent` ns). It needs: -# - GET on this namespace's k8s-upgrade-creds Secret (to fetch ssh_key + slack) -# - PATCH on the k8s-upgrade Namespace annotations (in-flight marker) +# One SA serves BOTH the detection CronJob and every phase Job: +# - detection CronJob: needs nodes:get/list + secrets:get + jobs:create +# (to spawn Job 0 = preflight) +# - phase Jobs: same + pods/eviction:create + pods:delete + namespaces:patch +# +# Cluster-scoped because the chain spans the whole cluster (drain works on +# any node, and the preflight Job creates a Job in `default` ns from +# `cronjob/backup-etcd`). 
-resource "kubernetes_role" "claude_agent_reads_creds" { +resource "kubernetes_service_account" "k8s_upgrade_job" { metadata { - name = "claude-agent-reads-creds" + name = "k8s-upgrade-job" namespace = kubernetes_namespace.k8s_upgrade.metadata[0].name } - rule { - api_groups = [""] - resources = ["secrets"] - resource_names = ["k8s-upgrade-creds"] - verbs = ["get"] - } } -resource "kubernetes_role_binding" "claude_agent_reads_creds" { +resource "kubernetes_cluster_role" "k8s_upgrade_job" { metadata { - name = "claude-agent-reads-creds" - namespace = kubernetes_namespace.k8s_upgrade.metadata[0].name + name = "k8s-upgrade-job" } - role_ref { - api_group = "rbac.authorization.k8s.io" - kind = "Role" - name = kubernetes_role.claude_agent_reads_creds.metadata[0].name - } - subject { - kind = "ServiceAccount" - name = "claude-agent" - namespace = "claude-agent" - } -} - -# The base claude-agent ClusterRole grants get/list/watch on most resources -# but not the mutating verbs the upgrade agent needs. Rather than fork the -# upstream stack, we add a sibling ClusterRole here scoped to exactly the -# verbs+resources required: -# - patch on namespace k8s-upgrade (in-flight annotation) -# - create on batch/jobs (trigger etcd snapshot Job from cronjob/backup-etcd) -# - patch on nodes (cordon/uncordon — drain needs this) -# - create on pods/eviction (drain evicts pods) -resource "kubernetes_cluster_role" "claude_agent_upgrade_ops" { - metadata { - name = "claude-agent-upgrade-ops" - } - # Annotate the k8s-upgrade namespace - rule { - api_groups = [""] - resources = ["namespaces"] - resource_names = ["k8s-upgrade"] - verbs = ["patch", "update"] - } - # Trigger etcd snapshot Jobs (from cronjob/backup-etcd in default ns). - # Cluster-scoped because we may also create test Jobs in k8s-upgrade ns. 
- rule { - api_groups = ["batch"] - resources = ["jobs"] - verbs = ["create", "delete"] - } - # Cordon / uncordon nodes + # Nodes: read (version comparison + readiness check) and cordon/uncordon rule { api_groups = [""] resources = ["nodes"] - verbs = ["patch", "update"] + verbs = ["get", "list", "patch", "update"] } - # Drain (evict pods) + # Drain — evict pods rule { api_groups = [""] resources = ["pods/eviction"] verbs = ["create"] } - # Delete pods stuck during drain (sometimes evict isn't enough) + # Drain fallback — direct delete (predrain_unstick bypasses PDBs) rule { api_groups = [""] resources = ["pods"] - verbs = ["delete"] + verbs = ["get", "list", "delete"] + } + # Read PDBs to find drain-blocking pods + rule { + api_groups = ["policy"] + resources = ["poddisruptionbudgets"] + verbs = ["get", "list"] + } + # Chain dispatch — create the next Job; reconcile via apply on retry. + # Cluster-wide so preflight can also create the etcd-snapshot Job in `default` from cronjob/backup-etcd. + rule { + api_groups = ["batch"] + resources = ["jobs"] + verbs = ["create", "get", "list", "delete", "patch", "watch"] + } + # Pull CronJob spec for `kubectl create job --from=cronjob/backup-etcd` + rule { + api_groups = ["batch"] + resources = ["cronjobs"] + verbs = ["get", "list"] + } + # Annotate the k8s-upgrade namespace (in-flight marker + snapshot path) + rule { + api_groups = [""] + resources = ["namespaces"] + resource_names = [local.namespace] + verbs = ["get", "patch", "update"] } } -resource "kubernetes_cluster_role_binding" "claude_agent_upgrade_ops" { +resource "kubernetes_cluster_role_binding" "k8s_upgrade_job" { metadata { - name = "claude-agent-upgrade-ops" + name = "k8s-upgrade-job" } role_ref { api_group = "rbac.authorization.k8s.io" kind = "ClusterRole" - name = kubernetes_cluster_role.claude_agent_upgrade_ops.metadata[0].name + name = kubernetes_cluster_role.k8s_upgrade_job.metadata[0].name } subject { kind = "ServiceAccount" - name = "claude-agent" - namespace = "claude-agent" + name =
kubernetes_service_account.k8s_upgrade_job.metadata[0].name + namespace = kubernetes_namespace.k8s_upgrade.metadata[0].name + } +} + +# Namespaced: read the credentials Secret in k8s-upgrade (SSH key + Slack URL) +resource "kubernetes_role" "k8s_upgrade_job_ns" { + metadata { + name = "k8s-upgrade-job-ns" + namespace = kubernetes_namespace.k8s_upgrade.metadata[0].name + } + rule { + api_groups = [""] + resources = ["secrets"] + resource_names = ["k8s-upgrade-creds"] + verbs = ["get"] + } +} + +resource "kubernetes_role_binding" "k8s_upgrade_job_ns" { + metadata { + name = "k8s-upgrade-job-ns" + namespace = kubernetes_namespace.k8s_upgrade.metadata[0].name + } + role_ref { + api_group = "rbac.authorization.k8s.io" + kind = "Role" + name = kubernetes_role.k8s_upgrade_job_ns.metadata[0].name + } + subject { + kind = "ServiceAccount" + name = kubernetes_service_account.k8s_upgrade_job.metadata[0].name + namespace = kubernetes_namespace.k8s_upgrade.metadata[0].name + } +} + +# --- ConfigMaps: scripts + Job template --- + +resource "kubernetes_config_map" "k8s_upgrade_scripts" { + metadata { + name = "k8s-upgrade-scripts" + namespace = kubernetes_namespace.k8s_upgrade.metadata[0].name + labels = local.labels + } + data = { + "upgrade-step.sh" = file("${path.module}/scripts/upgrade-step.sh") + "update_k8s.sh" = file("${path.module}/../../scripts/update_k8s.sh") + } +} + +resource "kubernetes_config_map" "k8s_upgrade_job_template" { + metadata { + name = "k8s-upgrade-job-template" + namespace = kubernetes_namespace.k8s_upgrade.metadata[0].name + labels = local.labels + } + data = { + "job-template.yaml" = file("${path.module}/job-template.yaml") } } # --- Detection CronJob --- # -# Weekly: compares running cluster version against latest available patch -# (apt-cache madison kubeadm on master) and latest available minor (HEAD on -# next-minor pkgs.k8s.io repo). When a target is detected, POSTs to -# claude-agent-service to kick the upgrade agent. 
+# Probes for available patch/minor targets weekly. When one is found, renders +# Job 0 (preflight) from the same job-template the chain uses. The CronJob no +# longer POSTs to claude-agent-service; the whole pipeline now runs inside the +# cluster via Job-chaining. resource "kubernetes_cron_job_v1" "k8s_version_check" { metadata { @@ -320,33 +288,36 @@ resource "kubernetes_cron_job_v1" "k8s_version_check" { labels = local.labels } spec { - service_account_name = kubernetes_service_account.k8s_version_check.metadata[0].name + service_account_name = kubernetes_service_account.k8s_upgrade_job.metadata[0].name restart_policy = "Never" image_pull_secrets { name = "registry-credentials" } + volume { + name = "creds" + secret { + secret_name = "k8s-upgrade-creds" + # 0444 — non-root container needs read; SSH key gets re-installed + # with mode 0400 in the inline command before any ssh call. + default_mode = "0444" + } + } + volume { + name = "template" + config_map { + name = kubernetes_config_map.k8s_upgrade_job_template.metadata[0].name + } + } container { name = "version-check" - image = local.ca_image + image = local.image command = ["/bin/bash", "-c", <<-EOT set -euo pipefail echo "==> k8s-version-check ($(date -u +%FT%TZ))" - # 1. 
Load SSH key from K8s Secret - mkdir -p /tmp - /usr/local/bin/kubectl get secret k8s-upgrade-creds \ - -o jsonpath='{.data.ssh_key}' | base64 -d > /tmp/k8s-upgrade-ssh-key - chmod 400 /tmp/k8s-upgrade-ssh-key - - SLACK=$(/usr/local/bin/kubectl get secret k8s-upgrade-creds \ - -o jsonpath='{.data.slack_webhook}' | base64 -d) - - AGENT_TOKEN=$(/usr/local/bin/kubectl get secret k8s-upgrade-creds \ - -o jsonpath='{.data.api_bearer_token}' | base64 -d) - - SSH="ssh -i /tmp/k8s-upgrade-ssh-key \ - -o StrictHostKeyChecking=accept-new \ - -o UserKnownHostsFile=/tmp/known_hosts" + SLACK=$(cat /secrets/k8s-upgrade/slack_webhook) + install -m 0400 /secrets/k8s-upgrade/ssh_key /tmp/ssh_key + SSH="ssh -i /tmp/ssh_key -o StrictHostKeyChecking=accept-new -o UserKnownHostsFile=/tmp/known_hosts -o ConnectTimeout=10" slack() { curl -sS -X POST -H 'Content-Type: application/json' \ @@ -354,17 +325,13 @@ resource "kubernetes_cron_job_v1" "k8s_version_check" { "$SLACK" || true } - # 2. Detect running version + # 1. Detect running version RUNNING=$(/usr/local/bin/kubectl get nodes \ - -o jsonpath='{.items[0].status.nodeInfo.kubeletVersion}' \ - | tr -d v) + -o jsonpath='{.items[0].status.nodeInfo.kubeletVersion}' | tr -d v) RUNNING_MINOR=$(echo "$RUNNING" | awk -F. '{print $1"."$2}') echo "Running version: v$RUNNING (minor $RUNNING_MINOR)" - # 3. Detect highest available patch within the running minor track. - # Refresh the local apt cache first — without this, a newly-published - # patch won't show up via `apt-cache madison` until something else - # triggers an `apt-get update`. + # 2. 
Latest patch within current minor (refresh master's apt cache) LATEST_PATCH=$($SSH wizard@k8s-master \ "sudo apt-get update -qq -o Dir::Etc::sourcelist='sources.list.d/kubernetes.list' -o Dir::Etc::sourceparts='-' -o APT::Get::List-Cleanup='0' >/dev/null 2>&1 ; \ apt-cache madison kubeadm 2>/dev/null \ @@ -372,9 +339,9 @@ resource "kubernetes_cron_job_v1" "k8s_version_check" { | sed 's/-.*//' \ | grep '^$RUNNING_MINOR\\.' \ | sort -V | tail -1" || echo "") - echo "Latest patch (apt): v$LATEST_PATCH" + echo "Latest patch: v$LATEST_PATCH" - # 4. Detect next available minor by probing the apt repo URL. + # 3. Next-minor probe NEXT_MINOR_NUM=$(( $(echo "$RUNNING_MINOR" | cut -d. -f2) + 1 )) NEXT_MINOR="1.$NEXT_MINOR_NUM" NEXT_MINOR_AVAILABLE="no" @@ -385,14 +352,13 @@ resource "kubernetes_cron_job_v1" "k8s_version_check" { fi echo "Next minor v$NEXT_MINOR available: $NEXT_MINOR_AVAILABLE" - # 5. Decide what to do + # 4. Choose target TARGET="" KIND="" if [ -n "$LATEST_PATCH" ] && [ "$LATEST_PATCH" != "$RUNNING" ]; then TARGET="$LATEST_PATCH" KIND="patch" elif [ "$NEXT_MINOR_AVAILABLE" = "yes" ]; then - # Probe the minor track to get its latest patch. NEXT_MINOR_PATCH=$($SSH wizard@k8s-master \ "curl -sf 'https://pkgs.k8s.io/core:/stable:/v$NEXT_MINOR/deb/Packages' \ | grep -oE 'Version: [0-9.-]+' \ @@ -404,7 +370,7 @@ resource "kubernetes_cron_job_v1" "k8s_version_check" { fi fi - # 6. Push the discovery metric to Pushgateway + # 5. Pushgateway discovery metric PG='http://prometheus-prometheus-pushgateway.monitoring:9091/metrics/job/k8s-version-check' { echo "# TYPE k8s_upgrade_available gauge" @@ -417,64 +383,61 @@ resource "kubernetes_cron_job_v1" "k8s_version_check" { echo "k8s_version_check_last_run_timestamp $(date +%s)" } | curl -sS --data-binary @- "$PG" || echo "warn: pushgateway push failed" - # 7. Decide whether to dispatch + # 6. 
Decide whether to spawn Job 0 if [ -z "$TARGET" ]; then - echo "No upgrade needed (running=$RUNNING, latest_patch=$LATEST_PATCH, next_minor_available=$NEXT_MINOR_AVAILABLE)" + echo "No upgrade needed" exit 0 fi slack "K8s upgrade available: v$RUNNING → v$TARGET ($KIND)" - # DRY_RUN_OVERRIDE wins over DRY_RUN — but a Job copied from - # this CronJob can't add new env vars (spec is immutable). The - # operator path for "trigger detection without dispatch" is - # toggling the CronJob's `var.detection_dry_run` then applying. - # Documented in the runbook. - EFFECTIVE_DRY_RUN="$${DRY_RUN_OVERRIDE:-$DRY_RUN}" - if [ "$EFFECTIVE_DRY_RUN" = "true" ]; then - echo "dry_run=true — not POSTing to claude-agent-service" - slack "DRY_RUN — skipping agent dispatch" + if [ "$DRY_RUN" = "true" ]; then + slack "DRY_RUN — not spawning preflight Job" exit 0 fi - # 8. POST to claude-agent-service - PAYLOAD=$(jq -nc \ - --arg target "$TARGET" \ - --arg kind "$KIND" \ - '{ - prompt: ("Run the k8s-version-upgrade agent. Inputs: " + ({target_version: $target, kind: $kind, dry_run: false, stages: "all"} | tostring)), - agent: ".claude/agents/k8s-version-upgrade", - max_budget_usd: 30 - }') + # 7. Spawn Job 0 (preflight) via envsubst on the job-template + # Idempotency: deterministic name reconciles via `apply`. 
+ JOB_NAME="k8s-upgrade-preflight-$${TARGET//./-}" - echo "Dispatching agent: $PAYLOAD" - RESP=$(curl -sS -w '\n%%{http_code}' -X POST \ - -H "Authorization: Bearer $AGENT_TOKEN" \ - -H 'Content-Type: application/json' \ - -d "$PAYLOAD" \ - http://claude-agent-service.claude-agent.svc.cluster.local:8080/execute) - CODE=$(printf '%s' "$RESP" | tail -n1) - BODY=$(printf '%s' "$RESP" | sed '$d') - - if [ "$CODE" = "200" ] || [ "$CODE" = "202" ]; then - JOB_ID=$(printf '%s' "$BODY" | jq -r '.job_id // .id // "unknown"') - slack "Agent dispatched: job=$JOB_ID (target=v$TARGET kind=$KIND)" - echo "OK — job=$JOB_ID" - else - slack "ERROR dispatching agent: HTTP $CODE — $BODY" - echo "dispatch failed: HTTP $CODE — $BODY" >&2 - exit 1 + if /usr/local/bin/kubectl -n k8s-upgrade get job "$JOB_NAME" >/dev/null 2>&1; then + slack "Preflight Job $JOB_NAME already exists (rerunning detection mid-flight?)" + exit 0 fi + + export JOB_NAME PHASE_NEXT=preflight TARGET_NODE_NEXT="" \ + TARGET_VERSION="$TARGET" TARGET_VERSION_LABEL="$${TARGET//./-}" \ + KIND="$KIND" IMAGE="$${IMAGE}" \ + SCHEDULING_BLOCK=$' nodeSelector:\n kubernetes.io/hostname: k8s-node1' + + envsubst < /template/job-template.yaml \ + | /usr/local/bin/kubectl apply -f - + + slack "Spawned $JOB_NAME (target=v$TARGET kind=$KIND)" EOT ] env { name = "DRY_RUN" value = tostring(var.detection_dry_run) } + env { + name = "IMAGE" + value = local.image + } env { name = "HOME" value = "/tmp" } + volume_mount { + name = "creds" + mount_path = "/secrets/k8s-upgrade" + read_only = true + } + volume_mount { + name = "template" + mount_path = "/template" + read_only = true + } resources { requests = { cpu = "50m" diff --git a/stacks/k8s-version-upgrade/scripts/upgrade-step.sh b/stacks/k8s-version-upgrade/scripts/upgrade-step.sh new file mode 100644 index 00000000..fb32ae95 --- /dev/null +++ b/stacks/k8s-version-upgrade/scripts/upgrade-step.sh @@ -0,0 +1,438 @@ +#!/usr/bin/env bash +# +# Universal upgrade-step body. 
Each Job in the k8s-version-upgrade chain runs +# this once, dispatching on $PHASE. On success it computes the next phase and +# spawns the next Job. The chain is: +# +# preflight (run on k8s-node1) +# ↓ +# master (drains k8s-master; run on k8s-node1) +# ↓ +# worker k8s-node4 (run on k8s-node1) +# ↓ +# worker k8s-node3 (run on k8s-node1) +# ↓ +# worker k8s-node2 (run on k8s-node1) +# ↓ +# worker k8s-node1 (drains k8s-node1; run on k8s-master with control-plane toleration) +# ↓ +# postflight (no node pinning) +# +# k8s-node1 hosts every Job except the one that drains k8s-node1 itself. +# k8s-node1 is therefore upgraded LAST. +# +# Required env vars (set on the Job pod by job-template.yaml): +# PHASE preflight | master | worker | postflight +# TARGET_NODE k8s-nodeN for worker phases (empty for preflight/master/postflight) +# TARGET_VERSION X.Y.Z +# KIND patch | minor +# IMAGE container image to use for next Job in the chain + +set -euo pipefail + +NS=k8s-upgrade +SSH_KEY=/secrets/k8s-upgrade/ssh_key +SLACK_FILE=/secrets/k8s-upgrade/slack_webhook +PG='http://prometheus-prometheus-pushgateway.monitoring:9091/metrics/job/k8s-version-upgrade' +PROM='http://prometheus-server.monitoring.svc.cluster.local:80' +KUBECTL=kubectl +JOB_TEMPLATE=/template/job-template.yaml +UPDATE_K8S_SH=/scripts/update_k8s.sh + +# OpenSSH refuses private keys readable by group/other, and the Secret volume +# is mounted 0444 (see job-template.yaml), so install a private 0400 copy.
+install -m 0400 "$SSH_KEY" /tmp/ssh_key +SSH_KEY=/tmp/ssh_key + +SSH_OPTS=(-i "$SSH_KEY" + -o StrictHostKeyChecking=accept-new + -o UserKnownHostsFile=/tmp/known_hosts + -o ConnectTimeout=10) + +SLACK_URL="$(cat "$SLACK_FILE")" + +slack() { + local msg="$1" + curl -sS -X POST -H 'Content-Type: application/json' \ + --data "$(jq -nc --arg t "[k8s-upgrade-${PHASE}${TARGET_NODE:+:$TARGET_NODE}] $msg" \ + '{text: $t}')" \ + "$SLACK_URL" >/dev/null || echo "warn: slack post failed" +} + +push() { + printf '# TYPE %s gauge\n%s %s\n' "$1" "$1" "$2" \ + | curl -sS --data-binary @- "$PG" || echo "warn: pushgateway push failed" +} + +halt_on_alert_query() { + local extra_ignore="${1:-}" + local regex='^(Watchdog|RebootRequired|KuredNodeWasNotDrained|InfoInhibitor' + [ -n "$extra_ignore" ] && regex="$regex|$extra_ignore" + regex="$regex)$" + + curl -sf "$PROM/api/v1/alerts" \ + | jq -r '.data.alerts[] | select(.state == "firing") | .labels.alertname' \ + | grep -vE "$regex" | sort -u +} + +wait_for_node_ready() { + local node="$1" want_version="$2" deadline=$(( $(date +%s) + 900 )) # 15 min + while [ "$(date +%s)" -lt "$deadline" ]; do + local status kubelet + status=$($KUBECTL get node "$node" -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}' 2>/dev/null || true) + kubelet=$($KUBECTL get node "$node" -o jsonpath='{.status.nodeInfo.kubeletVersion}' 2>/dev/null | tr -d v || true) + if [ "$status" = "True" ] && [ "$kubelet" = "$want_version" ]; then + return 0 + fi + sleep 15 + done + return 1 +} + +# Pre-drain: find pods on $node whose PDB has zero disruptionsAllowed and +# delete them directly. Drain's eviction API respects PDBs and will loop +# forever on single-replica deployments with `minAvailable: 1` — common +# pattern on this cluster (e.g. Anubis instances default to replicas=1). A +# direct delete bypasses eviction; the parent Deployment recreates the pod +# elsewhere (the node is already cordoned by drain). 
+predrain_unstick() {
+  local node="$1"
+  $KUBECTL get pdb -A -o json | jq -r '
+    .items[]
+    | select(.status.disruptionsAllowed == 0)
+    | "\(.metadata.namespace) \((.spec.selector.matchLabels // {}) | to_entries | map("\(.key)=\(.value)") | join(","))"
+  ' | while read -r ns selector; do  # selector is empty for PDBs that define no matchLabels
+    [ -z "$selector" ] && continue
+    $KUBECTL -n "$ns" get pods --field-selector "spec.nodeName=$node,status.phase=Running" \
+      -l "$selector" -o name 2>/dev/null \
+      | while read -r pod; do
+          echo "predrain_unstick: deleting PDB-blocked $ns/$pod (drain would loop on it)"
+          $KUBECTL -n "$ns" delete "$pod" --wait=false || true
+        done
+  done
+}
+
+# Drain wrapper: kick predrain_unstick before drain, then again every 60s in
+# the background while drain runs (in case new pods land mid-drain). Drain
+# exits when the node has no non-daemonset workload.
+drain_node() {
+  local node="$1"
+  predrain_unstick "$node"
+  ( while kill -0 $$ 2>/dev/null; do sleep 60; predrain_unstick "$node"; done ) &
+  local watcher=$!
+  trap "kill $watcher 2>/dev/null || true" EXIT
+  $KUBECTL drain "$node" --ignore-daemonsets --delete-emptydir-data --force --grace-period=300
+  kill $watcher 2>/dev/null || true
+  trap - EXIT
+}
+
+# ---------------------------------------------------------------------------
+# Chain definition — what comes after the current phase
+# ---------------------------------------------------------------------------
+
+NEXT_PHASE=""
+NEXT_TARGET_NODE=""
+NEXT_RUN_ON=""
+
+case "${PHASE}:${TARGET_NODE:-}" in
+  preflight:)
+    NEXT_PHASE=master
+    NEXT_RUN_ON=k8s-node1 ;;
+  master:)
+    NEXT_PHASE=worker; NEXT_TARGET_NODE=k8s-node4
+    NEXT_RUN_ON=k8s-node1 ;;
+  worker:k8s-node4)
+    NEXT_PHASE=worker; NEXT_TARGET_NODE=k8s-node3
+    NEXT_RUN_ON=k8s-node1 ;;
+  worker:k8s-node3)
+    NEXT_PHASE=worker; NEXT_TARGET_NODE=k8s-node2
+    NEXT_RUN_ON=k8s-node1 ;;
+  worker:k8s-node2)
+    NEXT_PHASE=worker; NEXT_TARGET_NODE=k8s-node1
+    NEXT_RUN_ON=k8s-master ;;     # control-plane toleration required
+  worker:k8s-node1)
+    NEXT_PHASE=postflight
+    NEXT_RUN_ON="" ;;             # no node pinning for postflight
+  postflight:)
+    NEXT_PHASE="" ;;              # end of chain
+  *)
+    echo "ERROR: unknown phase/target combo: ${PHASE}/${TARGET_NODE:-}" >&2
+    exit 2 ;;
+esac
+
+spawn_next() {
+  [ -z "$NEXT_PHASE" ] && { echo "End of chain."; return 0; }
+
+  local job_name="k8s-upgrade-${NEXT_PHASE}-${TARGET_VERSION//./-}"
+  [ -n "${NEXT_TARGET_NODE:-}" ] && job_name="${job_name}-${NEXT_TARGET_NODE}"
+
+  if $KUBECTL -n "$NS" get job "$job_name" >/dev/null 2>&1; then
+    echo "Next Job $job_name already exists; idempotent skip."
+    return 0
+  fi
+
+  local scheduling_block=""
+  case "${NEXT_RUN_ON:-}" in
+    k8s-master)
+      scheduling_block=$' nodeSelector:\n kubernetes.io/hostname: k8s-master\n tolerations:\n - key: node-role.kubernetes.io/control-plane\n operator: Exists\n effect: NoSchedule' ;;
+    "")
+      scheduling_block="" ;;
+    *)
+      scheduling_block=$' nodeSelector:\n kubernetes.io/hostname: '"$NEXT_RUN_ON" ;;
+  esac
+
+  export JOB_NAME="$job_name"
+  export PHASE_NEXT="$NEXT_PHASE"
+  export TARGET_NODE_NEXT="${NEXT_TARGET_NODE:-}"
+  export TARGET_VERSION_LABEL="${TARGET_VERSION//./-}"
+  export SCHEDULING_BLOCK="$scheduling_block"
+  # TARGET_VERSION, KIND, IMAGE inherited from current env
+
+  echo "Spawning next Job: $job_name (phase=$NEXT_PHASE target=${NEXT_TARGET_NODE:-} run_on=${NEXT_RUN_ON:-anywhere})"
+  envsubst <"$JOB_TEMPLATE" | $KUBECTL apply -f -
+}
+
+# ---------------------------------------------------------------------------
+# Phase bodies
+# ---------------------------------------------------------------------------
+
+phase_preflight() {
+  slack "Starting preflight (target v$TARGET_VERSION, kind=$KIND)"
+
+  # 1. All nodes Ready + no pressure
+  local bad_nodes
+  bad_nodes=$($KUBECTL get nodes -o json | jq -r '
+    .items[]
+    | select(
+        (.status.conditions[] | select(.type=="Ready").status) != "True"
+        or (.status.conditions[] | select(.type=="MemoryPressure").status) == "True"
+        or (.status.conditions[] | select(.type=="DiskPressure").status) == "True")
+    | .metadata.name')
+  if [ -n "$bad_nodes" ]; then
+    slack "ABORT preflight — nodes unhealthy: $bad_nodes"
+    exit 1
+  fi
+
+  # 2. Halt-on-alert
+  local alerts
+  alerts=$(halt_on_alert_query)
+  if [ -n "$alerts" ]; then
+    slack "ABORT preflight — firing alerts:\n$alerts"
+    exit 1
+  fi
+
+  # 3. 24h-quiet baseline
+  local recent=0
+  while IFS= read -r ts; do
+    [ -z "$ts" ] && continue
+    local diff=$(( $(date +%s) - $(date -d "$ts" +%s) ))
+    if [ "$diff" -lt 86400 ]; then recent=1; break; fi
+  done < <($KUBECTL get nodes -o jsonpath='{range .items[*]}{range .status.conditions[?(@.type=="Ready")]}{.lastTransitionTime}{"\n"}{end}{end}')
+  if [ "$recent" -eq 1 ]; then
+    slack "ABORT preflight — node transitioned Ready <24h ago (soak window)"
+    exit 1
+  fi
+
+  # 4. kubeadm upgrade plan matches target
+  local plan_target
+  plan_target=$(ssh "${SSH_OPTS[@]}" wizard@k8s-master 'sudo kubeadm upgrade plan' \
+    | grep -oE 'kubeadm upgrade apply v[0-9]+\.[0-9]+\.[0-9]+' \
+    | grep -oE 'v[0-9]+\.[0-9]+\.[0-9]+' | head -1 | tr -d v || true)  # no match must reach the ABORT below, not trip errexit
+  if [ "$plan_target" != "$TARGET_VERSION" ]; then
+    slack "ABORT preflight — kubeadm plan target $plan_target ≠ requested $TARGET_VERSION"
+    exit 1
+  fi
+
+  # 5. Push in-flight + started_timestamp metrics + ns annotations
+  $KUBECTL annotate ns "$NS" \
+    "viktorbarzin.me/k8s-upgrade-in-flight=$(date -u +%FT%TZ)" \
+    "viktorbarzin.me/k8s-upgrade-target=$TARGET_VERSION" \
+    --overwrite
+  push k8s_upgrade_in_flight 1
+  push k8s_upgrade_started_timestamp "$(date +%s)"
+  push k8s_upgrade_snapshot_taken 0
+
+  # 6. Trigger backup-etcd Job, wait, verify size
+  local snap_job="pre-upgrade-etcd-${TARGET_VERSION//./-}-$(date +%s)"
+  $KUBECTL -n default create job --from=cronjob/backup-etcd "$snap_job"
+  if ! $KUBECTL -n default wait --for=condition=complete --timeout=600s "job/$snap_job"; then
+    $KUBECTL -n default describe "job/$snap_job" | tail -30
+    slack "ABORT preflight — etcd snapshot Job did not complete in 10 min"
+    exit 1
+  fi
+  local snap_log size snap_file
+  snap_log=$($KUBECTL -n default logs "job/$snap_job" -c backup-manage --tail=20 || \
+             $KUBECTL -n default logs "job/$snap_job" --tail=20)
+  size=$(echo "$snap_log" | grep -E '^Backup done:' | grep -oE '\([0-9]+ bytes\)' | grep -oE '[0-9]+' || true)
+  snap_file=$(echo "$snap_log" | grep -E '^Backup done:' | awk '{print $3}' || true)
+  if [ -z "$size" ] || [ "$size" -lt 1024 ]; then
+    slack "ABORT preflight — etcd snapshot empty (size='${size:-unknown}')"
+    exit 1
+  fi
+  $KUBECTL annotate ns "$NS" \
+    "viktorbarzin.me/k8s-upgrade-snapshot-path=nfs://192.168.1.127:/srv/nfs/etcd-backup/$snap_file" \
+    --overwrite
+  push k8s_upgrade_snapshot_taken 1
+
+  # 7. Containerd skew fix on master (if master < workers)
+  local master_ctr worker_max=0.0.0
+  master_ctr=$(ssh "${SSH_OPTS[@]}" wizard@k8s-master "containerd --version | awk '{print \$3}' | tr -d v")
+  for n in k8s-node1 k8s-node2 k8s-node3 k8s-node4; do
+    local v
+    v=$(ssh "${SSH_OPTS[@]}" "wizard@$n" "containerd --version | awk '{print \$3}' | tr -d v")
+    [ "$(printf '%s\n%s' "$v" "$worker_max" | sort -V | tail -1)" = "$v" ] && worker_max="$v"
+  done
+  if [ "$(printf '%s\n%s' "$master_ctr" "$worker_max" | sort -V | head -1)" = "$master_ctr" ] \
+     && [ "$master_ctr" != "$worker_max" ]; then
+    slack "Master containerd $master_ctr < workers $worker_max — bumping"
+    ssh "${SSH_OPTS[@]}" wizard@k8s-master \
+      "sudo apt-mark unhold containerd.io && sudo apt-get install -y containerd.io='$worker_max-1' \
+       && sudo apt-mark hold containerd.io && sudo systemctl restart containerd"
+    wait_for_node_ready k8s-master "$($KUBECTL get node k8s-master -o jsonpath='{.status.nodeInfo.kubeletVersion}' | tr -d v)" \
+      || { slack "ABORT — k8s-master not Ready after containerd bump"; exit 1; }
+    slack "Master containerd: $master_ctr → $worker_max. Master Ready."
+  fi
+
+  # 8. Apt repo URL rewrite (minor only)
+  if [ "$KIND" = "minor" ]; then
+    local target_minor="${TARGET_VERSION%.*}"
+    for n in k8s-master k8s-node1 k8s-node2 k8s-node3 k8s-node4; do
+      ssh "${SSH_OPTS[@]}" "wizard@$n" \
+        "echo 'deb [signed-by=/etc/apt/keyrings/kubernetes-apt-keyring.gpg] https://pkgs.k8s.io/core:/stable:/v$target_minor/deb/ /' | sudo tee /etc/apt/sources.list.d/kubernetes.list \
+         && curl -fsSL 'https://pkgs.k8s.io/core:/stable:/v$target_minor/deb/Release.key' \
+            | sudo gpg --dearmor -o /etc/apt/keyrings/kubernetes-apt-keyring.gpg --batch --yes \
+         && sudo apt-get update"
+    done
+    slack "Apt repo rewritten to v$target_minor/deb on all 5 nodes"
+  fi
+
+  slack "Preflight clean. Snapshot at nfs://...$snap_file ($size bytes). Dispatching master Job."
+}
+
+phase_master() {
+  slack "Draining k8s-master"
+
+  # Re-check halt-on-alert before drain
+  local alerts
+  alerts=$(halt_on_alert_query)
+  [ -n "$alerts" ] && { slack "ABORT master — alerts firing pre-drain: $alerts"; exit 1; }
+
+  drain_node k8s-master
+
+  slack "Running update_k8s.sh on k8s-master (--role master --release $TARGET_VERSION)"
+  ssh "${SSH_OPTS[@]}" wizard@k8s-master 'bash -s' \
+    < "$UPDATE_K8S_SH" -- --role master --release "$TARGET_VERSION"
+
+  $KUBECTL uncordon k8s-master
+
+  wait_for_node_ready k8s-master "$TARGET_VERSION" \
+    || { slack "ABORT — k8s-master not Ready or wrong version after upgrade"; exit 1; }
+
+  local not_ready
+  not_ready=$($KUBECTL -n kube-system get pods -l 'tier=control-plane' --no-headers 2>/dev/null \
+    | { grep -v Running || true; } | wc -l)  # grep exits 1 when every pod is Running; don't trip pipefail
+  if [ "$not_ready" -gt 0 ]; then
+    slack "ABORT — $not_ready control-plane pods not Running after master upgrade"
+    exit 1
+  fi
+
+  alerts=$(halt_on_alert_query RecentNodeReboot)
+  [ -n "$alerts" ] && { slack "ABORT master — alerts firing post-upgrade: $alerts"; exit 1; }
+
+  slack "Master on v$TARGET_VERSION, control-plane Running. Dispatching worker chain."
+}
+
+phase_worker() {
+  [ -z "$TARGET_NODE" ] && { echo "ERROR: worker phase requires TARGET_NODE"; exit 2; }
+  slack "Draining $TARGET_NODE"
+
+  # Halt-on-alert wait (up to 30 min)
+  local attempt alerts
+  for attempt in $(seq 1 30); do
+    alerts=$(halt_on_alert_query)
+    [ -z "$alerts" ] && break
+    echo "Waiting for alerts to clear (attempt $attempt/30): $alerts"
+    sleep 60
+  done
+  [ -n "$alerts" ] && { slack "ABORT $TARGET_NODE — alerts firing after 30min: $alerts"; exit 1; }
+
+  drain_node "$TARGET_NODE"
+
+  slack "Running update_k8s.sh on $TARGET_NODE (--role worker --release $TARGET_VERSION)"
+  ssh "${SSH_OPTS[@]}" "wizard@$TARGET_NODE" 'bash -s' \
+    < "$UPDATE_K8S_SH" -- --role worker --release "$TARGET_VERSION"
+
+  $KUBECTL uncordon "$TARGET_NODE"
+
+  wait_for_node_ready "$TARGET_NODE" "$TARGET_VERSION" \
+    || { slack "ABORT — $TARGET_NODE not Ready or wrong version"; exit 1; }
+
+  # Daemonsets back on the node
+  local missing=0
+  for ds in calico-node kube-proxy; do
+    local count
+    count=$($KUBECTL get pods -A -o wide --field-selector "spec.nodeName=$TARGET_NODE,status.phase=Running" --no-headers \
+      | awk -v d="$ds" '$2 ~ d {n++} END{print n+0}')
+    [ "$count" -lt 1 ] && missing=$((missing+1))
+  done
+  [ "$missing" -gt 0 ] && { slack "WARN $TARGET_NODE — $missing daemonset(s) missing"; }
+
+  # 10-min soak with halt-on-alert (RecentNodeReboot ignored — we know we restarted it)
+  echo "Soaking $TARGET_NODE for 10 min..."
+  for i in $(seq 1 10); do
+    alerts=$(halt_on_alert_query RecentNodeReboot)
+    [ -n "$alerts" ] && { slack "ABORT $TARGET_NODE mid-soak — alerts: $alerts"; exit 1; }
+    sleep 60
+  done
+
+  slack "$TARGET_NODE on v$TARGET_VERSION. Soaked clean (10 min)."
+}
+
+phase_postflight() {
+  slack "Running postflight"
+
+  # All 5 nodes at target
+  local versions wrong
+  versions=$($KUBECTL get nodes -o jsonpath='{range .items[*]}{.metadata.name}:{.status.nodeInfo.kubeletVersion}{"\n"}{end}')
+  wrong=$(echo "$versions" | { grep -v ":v${TARGET_VERSION}\$" || true; } | wc -l)  # grep exits 1 when every node is on target
+  if [ "$wrong" -ne 0 ]; then
+    slack "ABORT postflight — $wrong node(s) off target:\n$versions"
+    exit 1
+  fi
+
+  # No alerts firing
+  local alerts
+  alerts=$(halt_on_alert_query)
+  [ -n "$alerts" ] && slack "Postflight WARN — alerts still firing (cluster on target, please check):\n$alerts"
+
+  # Pod-ready ratio
+  local ratio
+  ratio=$(curl -sf "$PROM/api/v1/query" \
+    --data-urlencode 'query=sum(kube_pod_status_ready{condition="true"}) / sum(kube_pod_status_phase{phase="Running"})' \
+    | jq -r '.data.result[0].value[1] // "0"')
+
+  # Clear annotations + gauges
+  $KUBECTL annotate ns "$NS" \
+    'viktorbarzin.me/k8s-upgrade-in-flight-' \
+    'viktorbarzin.me/k8s-upgrade-target-' \
+    'viktorbarzin.me/k8s-upgrade-snapshot-path-' || true
+  push k8s_upgrade_in_flight 0
+  push k8s_upgrade_snapshot_taken 0
+  push k8s_upgrade_started_timestamp 0
+
+  slack ":white_check_mark: K8s upgrade complete: cluster on v$TARGET_VERSION (pod-ready ratio $ratio)"
+}
+
+# ---------------------------------------------------------------------------
+# Dispatch
+# ---------------------------------------------------------------------------
+
+case "$PHASE" in
+  preflight)  phase_preflight ;;
+  master)     phase_master ;;
+  worker)     phase_worker ;;
+  postflight) phase_postflight ;;
+  *) echo "ERROR: unknown PHASE: $PHASE" >&2; exit 2 ;;
+esac
+
+spawn_next
diff --git a/stacks/monitoring/modules/monitoring/prometheus_chart_values.tpl b/stacks/monitoring/modules/monitoring/prometheus_chart_values.tpl
index 4f8b4f8f..beb7c7f4 100755
--- a/stacks/monitoring/modules/monitoring/prometheus_chart_values.tpl
+++ b/stacks/monitoring/modules/monitoring/prometheus_chart_values.tpl
@@ -1917,6 +1917,21 @@ serverFiles:
               severity: critical
             annotations:
               summary: "K8s upgrade is in flight but no etcd snapshot was recorded — pipeline pre-flight failed silently"
+          # K8sUpgradeStalled: the v2 Job-chain pushes `k8s_upgrade_started_timestamp`
+          # in preflight and resets `k8s_upgrade_in_flight=0` in postflight. If
+          # in_flight=1 persists for >90 min, a Job in the chain failed
+          # (backoffLimit=1), got preempted/evicted, or is hung. Manual recovery:
+          # `kubectl -n k8s-upgrade get jobs` → identify failed/stuck Job → delete
+          # it → fix root cause → re-create the same Job. Next-Job creation in each
+          # phase is idempotent (deterministic name = `k8s-upgrade-<phase>-<version>`)
+          # so re-running won't duplicate downstream Jobs.
+          - alert: K8sUpgradeStalled
+            expr: k8s_upgrade_in_flight == 1 and (time() - k8s_upgrade_started_timestamp) > 5400
+            for: 5m
+            labels:
+              severity: critical
+            annotations:
+              summary: "K8s upgrade has been in flight for >90 min — chain is stuck. Check: kubectl -n k8s-upgrade get jobs"
       - name: "Traefik Ingress"
         rules:
           - alert: TraefikDown