infra/docs/runbooks/k8s-version-upgrade.md

# K8s Version Upgrade Pipeline

## Overview

Kubernetes component versions (`kubeadm`/`kubelet`/`kubectl`) on the 5 K8s
VMs are upgraded automatically by a weekly detection CronJob that seeds a
chain of small phase Jobs. Each Job is **pinned to a node that is NOT its
drain target** — so no pod in the chain can preempt itself.

The chain (Sun 12:00 UTC weekly):

```
detection CronJob → preflight Job → master Job → worker × 4 Jobs → postflight Job
```

This is **independent** of the OS-side `unattended-upgrades + kured`
pipeline (see `k8s-node-auto-upgrades.md`). They do not share rollouts.
Schedules can overlap (kured runs daily 02:00-06:00 London; detection
here runs Sun 12:00 UTC) — when a kured reboot lands within 24h of the
Sunday detection, the `RecentNodeReboot` alert in the Upgrade Gates
group blocks the version-upgrade preflight, so the chain self-defers
to the next Sunday rather than rolling on top of a half-fresh node.

## Architecture

```
k8s-version-check CronJob   (Sun 12:00 UTC, k8s-upgrade ns, SA: k8s-upgrade-job)
  │ kubectl get nodes  → running version
  │ ssh master 'apt-cache madison kubeadm'  → latest patch (within current minor)
  │ HEAD pkgs.k8s.io/.../v<NEXT_MINOR>/deb/Release  → next minor available?
  │ push k8s_upgrade_available{kind,running,target} → Pushgateway
  │
  ▼ if a target is detected
envsubst on /template/job-template.yaml  | kubectl apply -f -
  │ creates k8s-upgrade-preflight-<target_version>
  ▼

Job 0 — preflight       (pinned: k8s-node1)
  ├── All nodes Ready + no Mem/Disk pressure
  ├── halt-on-alert (kured-style ignore-list)
  ├── 24h-quiet baseline (no Ready transitions <24h ago)
  ├── kubeadm upgrade plan matches target
  ├── Push k8s_upgrade_in_flight=1, k8s_upgrade_started_timestamp=$(date +%s)
  ├── Trigger backup-etcd Job, wait, verify snapshot byte count
  ├── SSH master: containerd skew fix (if master < workers)
  ├── SSH all 5 nodes: apt repo URL rewrite (only kind=minor)
  └── spawn_next → k8s-upgrade-master-<target_version>
  ▼

Job 1 — master upgrade  (pinned: k8s-node1)
  ├── halt-on-alert recheck (no firing alerts)
  ├── drain k8s-master (predrain_unstick deletes PDB-blocked pods)
  ├── ssh wizard@k8s-master 'bash -s' < /scripts/update_k8s.sh -- --role master --release X.Y.Z
  ├── kubectl uncordon k8s-master; wait Ready + version match
  ├── verify control-plane pods Running
  ├── halt-on-alert recheck (allows RecentNodeReboot)
  └── spawn_next → k8s-upgrade-worker-<v>-k8s-node4
  ▼

Job 2 — worker k8s-node4 (pinned: k8s-node1)
Job 3 — worker k8s-node3 (pinned: k8s-node1)
Job 4 — worker k8s-node2 (pinned: k8s-node1)
  (identical pattern: halt-on-alert wait 30m → drain → ssh script → uncordon → 10-min soak → spawn_next)
  ▼

Job 5 — worker k8s-node1 (pinned: k8s-master + control-plane toleration)
  └── spawn_next → k8s-upgrade-postflight-<target_version>
  ▼

Job 6 — postflight       (no pinning)
  ├── Verify all 5 nodes at target version
  ├── Verify no firing Upgrade Gates alerts
  ├── Compute pod-ready ratio (should be ≥ 0.9)
  ├── Clear k8s-upgrade-* annotations on namespace
  ├── Push k8s_upgrade_in_flight=0, k8s_upgrade_snapshot_taken=0, k8s_upgrade_started_timestamp=0
  └── Slack: ✅ K8s upgrade complete
```

**Pin choices summarised:**
- k8s-node1 hosts every Job that drains master or another worker. k8s-node1
  itself is upgraded **last**.
- k8s-master hosts Job 5 (which drains k8s-node1). Job 5's spec includes a
  toleration for `node-role.kubernetes.io/control-plane:NoSchedule`.
- If anyone reorders the worker sequence, the pin for Job 5 needs to track
  whatever worker is upgraded last. The mapping is in `scripts/upgrade-step.sh`
  → the `case "${PHASE}:${TARGET_NODE:-}"` block.

## Components

### Shared resources (one-time, Terraform-managed)

| Resource | Purpose |
|---|---|
| **ConfigMap `k8s-upgrade-scripts`** | Mounts `/scripts/upgrade-step.sh` (universal phase body, dispatches on `$PHASE`) and `/scripts/update_k8s.sh` (per-node kubeadm/kubelet/kubectl upgrade body — same script the old manual loop used) in every Job pod. |
| **ConfigMap `k8s-upgrade-job-template`** | Mounts `/template/job-template.yaml` — universal Job manifest with envsubst placeholders. Rendered by upgrade-step.sh and the detection CronJob via `envsubst | kubectl apply`. |
| **ServiceAccount `k8s-upgrade-job`** | Used by both the detection CronJob and every chain Job. ClusterRole binding grants: nodes get/list/patch, pods/eviction create, pods delete, batch/jobs CRUD, PDB list (for predrain_unstick), CronJob get (snapshot trigger), namespaces patch on `k8s-upgrade` only. Namespace-scoped Role binding grants secrets:get on `k8s-upgrade-creds`. |
| **ExternalSecret `k8s-upgrade-creds`** | Syncs `secret/k8s-upgrade/{ssh_key, slack_webhook}` from Vault. Mounted into every Job at `/secrets/k8s-upgrade`. |
| **CronJob `k8s-version-check`** | Sun 12:00 UTC. Probes apt + pkgs.k8s.io for target. If found, renders Job 0 from `job-template.yaml` and applies it. |

### Pushgateway metrics

Pushed by upgrade-step.sh during phase execution; observed by the
`Upgrade Gates` alert group in `stacks/monitoring/.../prometheus_chart_values.tpl`:

| Metric | Pushed by | Cleared by |
|---|---|---|
| `k8s_upgrade_in_flight` (1/0) | preflight Job (set to 1) | postflight Job (set to 0) |
| `k8s_upgrade_started_timestamp` (epoch s) | preflight Job | postflight Job (set to 0) |
| `k8s_upgrade_snapshot_taken` (1/0) | preflight Job (set to 1 after Job=`pre-upgrade-etcd-*` completes with `Backup done:` log of ≥1 KiB) | postflight Job (0) |
| `k8s_upgrade_available{kind,running,target}` | detection CronJob | next detection run (overwrite) |
| `k8s_version_check_last_run_timestamp` | detection CronJob | (cumulative) |

### Upgrade Gates alerts (`Upgrade Gates` group in prometheus_chart_values.tpl)

- **`K8sVersionSkew`** — distinct kubelet/apiserver `gitVersion` count > 1 for 30m. Catches a half-done rollout.
- **`EtcdPreUpgradeSnapshotMissing`** — `k8s_upgrade_in_flight==1 && k8s_upgrade_snapshot_taken==0` for 10m. Catches preflight Stage 2 failing silently.
- **`K8sUpgradeStalled`** — `k8s_upgrade_in_flight==1 && time()-k8s_upgrade_started_timestamp > 5400` for 5m. Catches a Job in the chain dying without spawning its successor.
- All three alerts ALSO block kured (same `--prometheus-url` halt-on-alert mechanism) so the OS-reboot pipeline can't run on top of a half-done version upgrade.

### Vault secrets

- `secret/k8s-upgrade/ssh_key` — ed25519 PRIVATE key, used by Jobs to SSH `wizard@<node>`
- `secret/k8s-upgrade/ssh_key_pub` — matching PUBLIC key, deployed to nodes' `~/.ssh/authorized_keys`
- `secret/k8s-upgrade/slack_webhook` — Slack incoming-webhook URL

Exposed in K8s via ExternalSecret `k8s-upgrade-creds` in the `k8s-upgrade` namespace. The previous `api_bearer_token` entry is GONE — the chain does not POST to `claude-agent-service`.

## Common Operations

### Verify the pipeline is healthy
```bash
# CronJob present + not suspended
kubectl -n k8s-upgrade get cronjob k8s-version-check

# Latest detection run output
kubectl -n k8s-upgrade get jobs -l app=k8s-version-upgrade
kubectl -n k8s-upgrade logs -l app=k8s-version-upgrade --tail=200

# Chain Jobs from the last run (retained 7 days via ttlSecondsAfterFinished)
kubectl -n k8s-upgrade get jobs -l app=k8s-upgrade-chain

# Pushgateway — running detection metric
kubectl -n monitoring exec deploy/prometheus-server -c prometheus-server -- \
  wget -q -O- 'http://prometheus-prometheus-pushgateway.monitoring:9091/metrics' | \
  grep -E '^(k8s_upgrade_(available|in_flight|started_timestamp|snapshot_taken)|k8s_version_check_last_run_timestamp)'

# Upgrade Gates rules loaded
kubectl -n monitoring exec deploy/prometheus-server -c prometheus-server -- \
  wget -q -O- 'http://localhost:9090/api/v1/rules' | \
  jq -r '.data.groups[] | select(.name == "Upgrade Gates") | .rules[] | "  \(.name): \(.state)"'
```

### Manually trigger detection (no upgrade)
Use `detection_dry_run=true` to short-circuit before spawning Job 0:

```bash
# Toggle var in TF, apply, and trigger
# (in stacks/k8s-version-upgrade/main.tf)
#   variable "detection_dry_run" { default = true }
# scripts/tg apply
kubectl -n k8s-upgrade create job --from=cronjob/k8s-version-check version-check-test
kubectl -n k8s-upgrade logs -l job-name=version-check-test -f
# When done, flip back to false.
```

### Manually trigger the chain (skip detection)
Useful for testing or to force a specific target. Render Job 0 directly:

```bash
TARGET=1.34.7
KIND=patch
IMAGE=$(kubectl -n k8s-upgrade get cronjob k8s-version-check \
  -o jsonpath='{.spec.jobTemplate.spec.template.spec.containers[0].image}')

cat <<EOF | envsubst | kubectl apply -f -
$(kubectl -n k8s-upgrade get cm k8s-upgrade-job-template -o jsonpath='{.data.job-template\.yaml}')
EOF
# Note: export JOB_NAME, PHASE_NEXT, etc. first — see the CronJob's command for
# the full env block. Easier: just trigger detection with the right inputs.
```

### Kill a stuck Job (chain halted mid-flight)
The chain stalls if any Job dies without spawning its successor. `K8sUpgradeStalled`
fires after 90 min. Recovery:

```bash
# 1. Identify the failed Job
kubectl -n k8s-upgrade get jobs -l app=k8s-upgrade-chain
kubectl -n k8s-upgrade describe job/<failed-job-name> | tail -50
kubectl -n k8s-upgrade logs job/<failed-job-name>

# 2. Diagnose. Common causes:
#    - drain stuck on PDB-violating pod (predrain_unstick should handle this;
#      but a brand-new PDB pattern could escape it — manually delete the pod)
#    - SSH from Job pod failing (node restarted? known_hosts mismatch?)
#    - kubeadm upgrade failed on a node (check journalctl + apt history on that node)

# 3. Fix the root cause first.

# 4. Delete the failed Job + re-spawn it. Naming is deterministic so
#    `kubectl apply` of the same name reconciles to a single Job.
kubectl -n k8s-upgrade delete job/<failed-job-name>

# 5. Manually render + apply the same Job. Pull the template + spec from the
#    next-Job-creation block in upgrade-step.sh — easiest is to copy from a
#    sibling Job's YAML:
kubectl -n k8s-upgrade get job/<sibling-job-name> -o yaml \
  | yq 'del(.metadata.resourceVersion, .metadata.uid, .metadata.creationTimestamp, .metadata.managedFields, .status)' \
  | yq '.metadata.name = "<failed-job-name>"' \
  | yq '.spec.template.spec.containers[0].env[] | select(.name=="PHASE") .value = "<right-phase>"' \
  | kubectl apply -f -

# The chain will continue from there. The next-Job-creation step in upgrade-step.sh
# is idempotent (deterministic name) so re-running won't duplicate downstream.
```

### Skip a phase (advanced; use sparingly)
If you've already done the work for a phase manually and want the chain to
jump past it, manually create the NEXT phase's Job with the deterministic
name. The previous phase's spawn-next will see the Job already exists and
short-circuit. Example: master already on target; jump straight to worker:

```bash
TARGET=1.34.7
TGT_LBL=${TARGET//./-}
# (compose Job from upgrade-step.sh spawn_next code, name=k8s-upgrade-worker-$TGT_LBL-k8s-node4, run on k8s-node1)
```

### Halt the pipeline in an emergency

```bash
# Option 1: suspend the detection CronJob (won't stop an in-flight chain)
kubectl -n k8s-upgrade patch cronjob k8s-version-check \
  -p '{"spec":{"suspend":true}}' --type=merge
# Re-enable: -p '{"spec":{"suspend":false}}'

# Option 2: delete all in-flight chain Jobs
kubectl -n k8s-upgrade delete jobs -l app=k8s-upgrade-chain
# This leaves the in-flight annotation + Pushgateway gauge intact —
# K8sUpgradeStalled will fire to surface the halt.

# Option 3: force a blocker alert (same regex kured uses)
# — see k8s-node-auto-upgrades.md "Force halt by adding a custom blocker alert"
```

### Clear orphaned in-flight state
After deciding NOT to retry a halted chain:

```bash
kubectl annotate ns k8s-upgrade \
  viktorbarzin.me/k8s-upgrade-in-flight- \
  viktorbarzin.me/k8s-upgrade-target- \
  viktorbarzin.me/k8s-upgrade-snapshot-path-

# Reset Pushgateway gauges so K8sUpgradeStalled / EtcdPreUpgradeSnapshotMissing clear:
kubectl -n monitoring port-forward svc/prometheus-prometheus-pushgateway 9091:9091 &
printf '# TYPE k8s_upgrade_in_flight gauge\nk8s_upgrade_in_flight 0\n# TYPE k8s_upgrade_snapshot_taken gauge\nk8s_upgrade_snapshot_taken 0\n# TYPE k8s_upgrade_started_timestamp gauge\nk8s_upgrade_started_timestamp 0\n' \
  | curl --data-binary @- http://localhost:9091/metrics/job/k8s-version-upgrade
kill %1
```

### Rollback paths
`kubeadm` does **not** support in-place downgrade. If a run fails:

#### Master broke during/after kubeadm upgrade
1. Identify the etcd snapshot: `kubectl get ns k8s-upgrade -o jsonpath='{.metadata.annotations.viktorbarzin\.me/k8s-upgrade-snapshot-path}'`
2. Restore etcd per `infra/docs/runbooks/restore-etcd.md`.
3. Manually downgrade master `kubeadm`/`kubelet`/`kubectl` to the pre-upgrade version. Find versions in `/var/log/apt/history.log` on the node:
   ```bash
   ssh wizard@k8s-master 'sudo cat /var/log/apt/history.log | tail -40'
   # Pre-upgrade versions are in the most recent "Commandline: apt-get install"
   sudo apt-mark unhold kubeadm kubelet kubectl
   sudo apt-get install --allow-downgrades -y \
     kubeadm=<OLD>-1.1 kubelet=<OLD>-1.1 kubectl=<OLD>-1.1
   sudo apt-mark hold kubeadm kubelet kubectl
   sudo systemctl daemon-reload && sudo systemctl restart kubelet
   ```

#### Worker broke
1. `kubectl drain <node> --ignore-daemonsets --delete-emptydir-data --force --grace-period=300`
2. Downgrade apt packages on that node only (see above)
3. `kubectl uncordon <node>`
4. The cluster continues running on the master + remaining workers throughout

### One-shot SSH key rotation
1. Generate new keypair: `ssh-keygen -t ed25519 -f /tmp/k8s-upgrade -N ""`
2. Update Vault:
   ```bash
   vault kv patch secret/k8s-upgrade \
     ssh_key=@/tmp/k8s-upgrade \
     ssh_key_pub=@/tmp/k8s-upgrade.pub
   ```
3. Push the new pubkey to every node:
   ```bash
   for n in k8s-master k8s-node1 k8s-node2 k8s-node3 k8s-node4; do
     ssh wizard@$n 'sed -i "/k8s-upgrade-key$/d" ~/.ssh/authorized_keys'
     ssh wizard@$n 'echo "$(cat /tmp/k8s-upgrade.pub) k8s-upgrade-key" >> ~/.ssh/authorized_keys'
   done
   ```
4. ESO refreshes within 15 min — or force: `kubectl -n k8s-upgrade annotate externalsecret k8s-upgrade-creds force-sync=$(date +%s) --overwrite`

## Past Incidents

### 2026-05-11 — Self-preemption (agent → Job-chain rewrite)
- The v1 agent ran inside the `claude-agent-service` Deployment (replicas=1, no nodeSelector) and was scheduled to k8s-node4.
- During Stage 6 (first worker drain) the agent ran `kubectl drain k8s-node4` — evicting itself.
- The bash process died after the drain but before the SSH-pipe to install kubeadm on node4.
- Node4 was left cordoned; cluster stuck at master v1.34.7, workers v1.34.2 until manual recovery.
- **Mitigation**: rewrote the pipeline as a chain of Jobs, each `nodeSelector`-pinned to a non-target node. New `predrain_unstick` step deletes PDB-blocked single-replica pods (Anubis pattern) before drain so they don't loop forever. Added `K8sUpgradeStalled` alert (in-flight + started_timestamp > 90 min).

## File Pointers

| What | Where |
|------|-------|
| Stack (CronJob + ConfigMaps + SA/RBAC + ExternalSecret) | `infra/stacks/k8s-version-upgrade/main.tf` |
| Universal phase body | `infra/stacks/k8s-version-upgrade/scripts/upgrade-step.sh` |
| Job template | `infra/stacks/k8s-version-upgrade/job-template.yaml` |
| Per-node upgrade script | `infra/scripts/update_k8s.sh` |
| Upgrade Gates alerts | `infra/stacks/monitoring/modules/monitoring/prometheus_chart_values.tpl` (group "Upgrade Gates") |
| Vault secrets | `secret/k8s-upgrade/{ssh_key, ssh_key_pub, slack_webhook}` |
| Architecture doc | `infra/docs/architecture/automated-upgrades.md` (K8s Version Upgrades section) |
| Related (OS reboots) | `infra/docs/runbooks/k8s-node-auto-upgrades.md` |
| Deprecated agent prompt (reference) | `infra/.claude/agents/k8s-version-upgrade.deprecated.md` |
-												k8s-version-upgrade: automated kubeadm/kubelet/kubectl upgrade pipeline

Adds a weekly detection CronJob (Sun 12:00 UTC) that probes apt-cache madison
on master for new patches + HEAD pkgs.k8s.io for next-minor availability,
then POSTs to claude-agent-service to dispatch the k8s-version-upgrade agent.

The agent (.claude/agents/k8s-version-upgrade.md) orchestrates:
  pre-flight (5 nodes Ready + halt-on-alert + 24h-quiet + plan target match)
    -> etcd snapshot save
    -> optional master containerd skew fix
    -> apt repo URL rewrite (minor bumps only)
    -> drain/upgrade/uncordon master via ssh < update_k8s.sh
    -> sequential workers k8s-node4 -> 3 -> 2 -> 1 with 10-min soak each
    -> post-flight verification

Two new Upgrade Gates alerts catch failure modes:
  - K8sVersionSkew (kubelet/apiserver gitVersion mismatch >30m)
  - EtcdPreUpgradeSnapshotMissing (in_flight without snapshot_taken >10m)

update_k8s.sh refactored to take --role / --release args; the agent shells
it into each node via SSH pipe. update_node.sh annotated as OS-major path.

Operator-facing docs: docs/runbooks/k8s-version-upgrade.md and a new section
in docs/architecture/automated-upgrades.md.

Secrets: secret/k8s-upgrade/{ssh_key,ssh_key_pub,slack_webhook} (ed25519
keypair distributed to all 5 nodes via authorized_keys; slack_webhook
reuses kured webhook URL on initial deploy).

											
										
										
											2026-05-10 19:07:42 +00:00
+								# K8s Version Upgrade Pipeline
 								## Overview
 								Kubernetes component versions (`kubeadm`/`kubelet`/`kubectl`) on the 5 K8s
-												k8s-version-upgrade: decompose into Job chain to fix self-preemption

The agent-based v1 ran inside claude-agent-service (replicas=1, no
nodeSelector) and self-evicted when it tried to drain its host (k8s-node4
on 2026-05-11). Cluster ended half-upgraded (master v1.34.7, workers
v1.34.2) until manual recovery.

Rewrite the pipeline as a chain of nodeSelector-pinned Jobs:

  preflight (k8s-node1)
    → master   (k8s-node1)   drains k8s-master
    → worker × 4 (k8s-node1) drains k8s-node{4,3,2}
    → worker   (k8s-master + control-plane toleration) drains k8s-node1
    → postflight (no pinning)

Each Job runs scripts/upgrade-step.sh (case-on-$PHASE) and ends by
envsubst-ing job-template.yaml into the next Job. Deterministic names
(k8s-upgrade-<phase>-<target_version>[-<node>]) make `kubectl apply`
idempotent — a failed Job can be re-created without duplicating
downstream.

Also lands `predrain_unstick`: deletes pods on the target node whose PDB
has 0 disruptionsAllowed. Without this, drain loops indefinitely on
single-replica deployments (e.g. every Anubis instance — discovered the
hard way during 2026-05-11 manual recovery of k8s-node3).

Adds K8sUpgradeStalled alert (in_flight + started_timestamp > 90 min).
Deprecates the agent prompt (renamed to *.deprecated.md with a header
pointer to the new code).

Apply order: k8s-version-upgrade first (consumes new SA + ConfigMaps),
then monitoring (loads the new alert). Both applied 2026-05-11.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

											
										
										
											2026-05-11 23:54:05 +00:00
+								VMs are upgraded automatically by a weekly detection CronJob that seeds a
 								chain of small phase Jobs. Each Job is **pinned to a node that is NOT its
 								drain target** — so no pod in the chain can preempt itself.
 								The chain (Sun 12:00 UTC weekly):
 								```
 								detection CronJob → preflight Job → master Job → worker × 4 Jobs → postflight Job
 								```
-												k8s-version-upgrade: automated kubeadm/kubelet/kubectl upgrade pipeline

Adds a weekly detection CronJob (Sun 12:00 UTC) that probes apt-cache madison
on master for new patches + HEAD pkgs.k8s.io for next-minor availability,
then POSTs to claude-agent-service to dispatch the k8s-version-upgrade agent.

The agent (.claude/agents/k8s-version-upgrade.md) orchestrates:
  pre-flight (5 nodes Ready + halt-on-alert + 24h-quiet + plan target match)
    -> etcd snapshot save
    -> optional master containerd skew fix
    -> apt repo URL rewrite (minor bumps only)
    -> drain/upgrade/uncordon master via ssh < update_k8s.sh
    -> sequential workers k8s-node4 -> 3 -> 2 -> 1 with 10-min soak each
    -> post-flight verification

Two new Upgrade Gates alerts catch failure modes:
  - K8sVersionSkew (kubelet/apiserver gitVersion mismatch >30m)
  - EtcdPreUpgradeSnapshotMissing (in_flight without snapshot_taken >10m)

update_k8s.sh refactored to take --role / --release args; the agent shells
it into each node via SSH pipe. update_node.sh annotated as OS-major path.

Operator-facing docs: docs/runbooks/k8s-version-upgrade.md and a new section
in docs/architecture/automated-upgrades.md.

Secrets: secret/k8s-upgrade/{ssh_key,ssh_key_pub,slack_webhook} (ed25519
keypair distributed to all 5 nodes via authorized_keys; slack_webhook
reuses kured webhook URL on initial deploy).

											
										
										
											2026-05-10 19:07:42 +00:00
 								This is **independent** of the OS-side `unattended-upgrades + kured`
-												kured: drop Mon-Fri restriction, reboot any day

The weekday-only schedule was a 2026-03-16-incident-era guardrail when
the rest of the safety net was thin. Today's gates — halt-on-alert,
sentinel-gate Check 4 (24h soak via node Ready transitions), the
K8sUpgradeStalled alert, drainTimeout=30m, concurrency=1, and the
sentinel-path fix from earlier today — make weekend reboots safe and
just clear the backlog faster.

Effect: 5 pending node reboots clear in 5 calendar days instead of
queueing up over weekends. The K8s version-upgrade detection at Sun
12:00 UTC self-defers if a Sunday-morning kured reboot fires (the
RecentNodeReboot alert is in the Upgrade Gates ignore-less list for
the version-upgrade preflight — same mechanism kured uses).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

											
										
										
											2026-05-16 12:29:01 +00:00
+								pipeline (see `k8s-node-auto-upgrades.md`). They do not share rollouts.
 								Schedules can overlap (kured runs daily 02:00-06:00 London; detection
 								here runs Sun 12:00 UTC) — when a kured reboot lands within 24h of the
 								Sunday detection, the `RecentNodeReboot` alert in the Upgrade Gates
 								group blocks the version-upgrade preflight, so the chain self-defers
 								to the next Sunday rather than rolling on top of a half-fresh node.
-												k8s-version-upgrade: automated kubeadm/kubelet/kubectl upgrade pipeline

Adds a weekly detection CronJob (Sun 12:00 UTC) that probes apt-cache madison
on master for new patches + HEAD pkgs.k8s.io for next-minor availability,
then POSTs to claude-agent-service to dispatch the k8s-version-upgrade agent.

The agent (.claude/agents/k8s-version-upgrade.md) orchestrates:
  pre-flight (5 nodes Ready + halt-on-alert + 24h-quiet + plan target match)
    -> etcd snapshot save
    -> optional master containerd skew fix
    -> apt repo URL rewrite (minor bumps only)
    -> drain/upgrade/uncordon master via ssh < update_k8s.sh
    -> sequential workers k8s-node4 -> 3 -> 2 -> 1 with 10-min soak each
    -> post-flight verification

Two new Upgrade Gates alerts catch failure modes:
  - K8sVersionSkew (kubelet/apiserver gitVersion mismatch >30m)
  - EtcdPreUpgradeSnapshotMissing (in_flight without snapshot_taken >10m)

update_k8s.sh refactored to take --role / --release args; the agent shells
it into each node via SSH pipe. update_node.sh annotated as OS-major path.

Operator-facing docs: docs/runbooks/k8s-version-upgrade.md and a new section
in docs/architecture/automated-upgrades.md.

Secrets: secret/k8s-upgrade/{ssh_key,ssh_key_pub,slack_webhook} (ed25519
keypair distributed to all 5 nodes via authorized_keys; slack_webhook
reuses kured webhook URL on initial deploy).

											
										
										
											2026-05-10 19:07:42 +00:00
 								## Architecture
 								```
-												k8s-version-upgrade: decompose into Job chain to fix self-preemption

The agent-based v1 ran inside claude-agent-service (replicas=1, no
nodeSelector) and self-evicted when it tried to drain its host (k8s-node4
on 2026-05-11). Cluster ended half-upgraded (master v1.34.7, workers
v1.34.2) until manual recovery.

Rewrite the pipeline as a chain of nodeSelector-pinned Jobs:

  preflight (k8s-node1)
    → master   (k8s-node1)   drains k8s-master
    → worker × 4 (k8s-node1) drains k8s-node{4,3,2}
    → worker   (k8s-master + control-plane toleration) drains k8s-node1
    → postflight (no pinning)

Each Job runs scripts/upgrade-step.sh (case-on-$PHASE) and ends by
envsubst-ing job-template.yaml into the next Job. Deterministic names
(k8s-upgrade-<phase>-<target_version>[-<node>]) make `kubectl apply`
idempotent — a failed Job can be re-created without duplicating
downstream.

Also lands `predrain_unstick`: deletes pods on the target node whose PDB
has 0 disruptionsAllowed. Without this, drain loops indefinitely on
single-replica deployments (e.g. every Anubis instance — discovered the
hard way during 2026-05-11 manual recovery of k8s-node3).

Adds K8sUpgradeStalled alert (in_flight + started_timestamp > 90 min).
Deprecates the agent prompt (renamed to *.deprecated.md with a header
pointer to the new code).

Apply order: k8s-version-upgrade first (consumes new SA + ConfigMaps),
then monitoring (loads the new alert). Both applied 2026-05-11.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

											
										
										
											2026-05-11 23:54:05 +00:00
+								k8s-version-check CronJob   (Sun 12:00 UTC, k8s-upgrade ns, SA: k8s-upgrade-job)
-												k8s-version-upgrade: automated kubeadm/kubelet/kubectl upgrade pipeline

Adds a weekly detection CronJob (Sun 12:00 UTC) that probes apt-cache madison
on master for new patches + HEAD pkgs.k8s.io for next-minor availability,
then POSTs to claude-agent-service to dispatch the k8s-version-upgrade agent.

The agent (.claude/agents/k8s-version-upgrade.md) orchestrates:
  pre-flight (5 nodes Ready + halt-on-alert + 24h-quiet + plan target match)
    -> etcd snapshot save
    -> optional master containerd skew fix
    -> apt repo URL rewrite (minor bumps only)
    -> drain/upgrade/uncordon master via ssh < update_k8s.sh
    -> sequential workers k8s-node4 -> 3 -> 2 -> 1 with 10-min soak each
    -> post-flight verification

Two new Upgrade Gates alerts catch failure modes:
  - K8sVersionSkew (kubelet/apiserver gitVersion mismatch >30m)
  - EtcdPreUpgradeSnapshotMissing (in_flight without snapshot_taken >10m)

update_k8s.sh refactored to take --role / --release args; the agent shells
it into each node via SSH pipe. update_node.sh annotated as OS-major path.

Operator-facing docs: docs/runbooks/k8s-version-upgrade.md and a new section
in docs/architecture/automated-upgrades.md.

Secrets: secret/k8s-upgrade/{ssh_key,ssh_key_pub,slack_webhook} (ed25519
keypair distributed to all 5 nodes via authorized_keys; slack_webhook
reuses kured webhook URL on initial deploy).

											
										
										
											2026-05-10 19:07:42 +00:00
+								  │ kubectl get nodes  → running version
 								  │ ssh master 'apt-cache madison kubeadm'  → latest patch (within current minor)
 								  │ HEAD pkgs.k8s.io/.../v<NEXT_MINOR>/deb/Release  → next minor available?
-												k8s-version-upgrade: decompose into Job chain to fix self-preemption

The agent-based v1 ran inside claude-agent-service (replicas=1, no
nodeSelector) and self-evicted when it tried to drain its host (k8s-node4
on 2026-05-11). Cluster ended half-upgraded (master v1.34.7, workers
v1.34.2) until manual recovery.

Rewrite the pipeline as a chain of nodeSelector-pinned Jobs:

  preflight (k8s-node1)
    → master   (k8s-node1)   drains k8s-master
    → worker × 4 (k8s-node1) drains k8s-node{4,3,2}
    → worker   (k8s-master + control-plane toleration) drains k8s-node1
    → postflight (no pinning)

Each Job runs scripts/upgrade-step.sh (case-on-$PHASE) and ends by
envsubst-ing job-template.yaml into the next Job. Deterministic names
(k8s-upgrade-<phase>-<target_version>[-<node>]) make `kubectl apply`
idempotent — a failed Job can be re-created without duplicating
downstream.

Also lands `predrain_unstick`: deletes pods on the target node whose PDB
has 0 disruptionsAllowed. Without this, drain loops indefinitely on
single-replica deployments (e.g. every Anubis instance — discovered the
hard way during 2026-05-11 manual recovery of k8s-node3).

Adds K8sUpgradeStalled alert (in_flight + started_timestamp > 90 min).
Deprecates the agent prompt (renamed to *.deprecated.md with a header
pointer to the new code).

Apply order: k8s-version-upgrade first (consumes new SA + ConfigMaps),
then monitoring (loads the new alert). Both applied 2026-05-11.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

											
										
										
											2026-05-11 23:54:05 +00:00
+								  │ push k8s_upgrade_available{kind,running,target} → Pushgateway
-												k8s-version-upgrade: automated kubeadm/kubelet/kubectl upgrade pipeline

Adds a weekly detection CronJob (Sun 12:00 UTC) that probes apt-cache madison
on master for new patches + HEAD pkgs.k8s.io for next-minor availability,
then POSTs to claude-agent-service to dispatch the k8s-version-upgrade agent.

The agent (.claude/agents/k8s-version-upgrade.md) orchestrates:
  pre-flight (5 nodes Ready + halt-on-alert + 24h-quiet + plan target match)
    -> etcd snapshot save
    -> optional master containerd skew fix
    -> apt repo URL rewrite (minor bumps only)
    -> drain/upgrade/uncordon master via ssh < update_k8s.sh
    -> sequential workers k8s-node4 -> 3 -> 2 -> 1 with 10-min soak each
    -> post-flight verification

Two new Upgrade Gates alerts catch failure modes:
  - K8sVersionSkew (kubelet/apiserver gitVersion mismatch >30m)
  - EtcdPreUpgradeSnapshotMissing (in_flight without snapshot_taken >10m)

update_k8s.sh refactored to take --role / --release args; the agent shells
it into each node via SSH pipe. update_node.sh annotated as OS-major path.

Operator-facing docs: docs/runbooks/k8s-version-upgrade.md and a new section
in docs/architecture/automated-upgrades.md.

Secrets: secret/k8s-upgrade/{ssh_key,ssh_key_pub,slack_webhook} (ed25519
keypair distributed to all 5 nodes via authorized_keys; slack_webhook
reuses kured webhook URL on initial deploy).

											
										
										
											2026-05-10 19:07:42 +00:00
+								  │
-												k8s-version-upgrade: decompose into Job chain to fix self-preemption

The agent-based v1 ran inside claude-agent-service (replicas=1, no
nodeSelector) and self-evicted when it tried to drain its host (k8s-node4
on 2026-05-11). Cluster ended half-upgraded (master v1.34.7, workers
v1.34.2) until manual recovery.

Rewrite the pipeline as a chain of nodeSelector-pinned Jobs:

  preflight (k8s-node1)
    → master   (k8s-node1)   drains k8s-master
    → worker × 4 (k8s-node1) drains k8s-node{4,3,2}
    → worker   (k8s-master + control-plane toleration) drains k8s-node1
    → postflight (no pinning)

Each Job runs scripts/upgrade-step.sh (case-on-$PHASE) and ends by
envsubst-ing job-template.yaml into the next Job. Deterministic names
(k8s-upgrade-<phase>-<target_version>[-<node>]) make `kubectl apply`
idempotent — a failed Job can be re-created without duplicating
downstream.

Also lands `predrain_unstick`: deletes pods on the target node whose PDB
has 0 disruptionsAllowed. Without this, drain loops indefinitely on
single-replica deployments (e.g. every Anubis instance — discovered the
hard way during 2026-05-11 manual recovery of k8s-node3).

Adds K8sUpgradeStalled alert (in_flight + started_timestamp > 90 min).
Deprecates the agent prompt (renamed to *.deprecated.md with a header
pointer to the new code).

Apply order: k8s-version-upgrade first (consumes new SA + ConfigMaps),
then monitoring (loads the new alert). Both applied 2026-05-11.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

											
										
										
											2026-05-11 23:54:05 +00:00
+								  ▼ if a target is detected
 								envsubst on /template/job-template.yaml  | kubectl apply -f -
 								  │ creates k8s-upgrade-preflight-<target_version>
 								  ▼
 								Job 0 — preflight       (pinned: k8s-node1)
 								  ├── All nodes Ready + no Mem/Disk pressure
 								  ├── halt-on-alert (kured-style ignore-list)
 								  ├── 24h-quiet baseline (no Ready transitions <24h ago)
 								  ├── kubeadm upgrade plan matches target
 								  ├── Push k8s_upgrade_in_flight=1, k8s_upgrade_started_timestamp=$(date +%s)
 								  ├── Trigger backup-etcd Job, wait, verify snapshot byte count
 								  ├── SSH master: containerd skew fix (if master < workers)
 								  ├── SSH all 5 nodes: apt repo URL rewrite (only kind=minor)
 								  └── spawn_next → k8s-upgrade-master-<target_version>
 								  ▼
 								Job 1 — master upgrade  (pinned: k8s-node1)
 								  ├── halt-on-alert recheck (no firing alerts)
 								  ├── drain k8s-master (predrain_unstick deletes PDB-blocked pods)
 								  ├── ssh wizard@k8s-master 'bash -s' < /scripts/update_k8s.sh -- --role master --release X.Y.Z
 								  ├── kubectl uncordon k8s-master; wait Ready + version match
 								  ├── verify control-plane pods Running
 								  ├── halt-on-alert recheck (allows RecentNodeReboot)
 								  └── spawn_next → k8s-upgrade-worker-<v>-k8s-node4
-												k8s-version-upgrade: automated kubeadm/kubelet/kubectl upgrade pipeline

Adds a weekly detection CronJob (Sun 12:00 UTC) that probes apt-cache madison
on master for new patches + HEAD pkgs.k8s.io for next-minor availability,
then POSTs to claude-agent-service to dispatch the k8s-version-upgrade agent.

The agent (.claude/agents/k8s-version-upgrade.md) orchestrates:
  pre-flight (5 nodes Ready + halt-on-alert + 24h-quiet + plan target match)
    -> etcd snapshot save
    -> optional master containerd skew fix
    -> apt repo URL rewrite (minor bumps only)
    -> drain/upgrade/uncordon master via ssh < update_k8s.sh
    -> sequential workers k8s-node4 -> 3 -> 2 -> 1 with 10-min soak each
    -> post-flight verification

Two new Upgrade Gates alerts catch failure modes:
  - K8sVersionSkew (kubelet/apiserver gitVersion mismatch >30m)
  - EtcdPreUpgradeSnapshotMissing (in_flight without snapshot_taken >10m)

update_k8s.sh refactored to take --role / --release args; the agent shells
it into each node via SSH pipe. update_node.sh annotated as OS-major path.

Operator-facing docs: docs/runbooks/k8s-version-upgrade.md and a new section
in docs/architecture/automated-upgrades.md.

Secrets: secret/k8s-upgrade/{ssh_key,ssh_key_pub,slack_webhook} (ed25519
keypair distributed to all 5 nodes via authorized_keys; slack_webhook
reuses kured webhook URL on initial deploy).

											
										
										
											2026-05-10 19:07:42 +00:00
+								  ▼
-												k8s-version-upgrade: decompose into Job chain to fix self-preemption

The agent-based v1 ran inside claude-agent-service (replicas=1, no
nodeSelector) and self-evicted when it tried to drain its host (k8s-node4
on 2026-05-11). Cluster ended half-upgraded (master v1.34.7, workers
v1.34.2) until manual recovery.

Rewrite the pipeline as a chain of nodeSelector-pinned Jobs:

  preflight (k8s-node1)
    → master   (k8s-node1)   drains k8s-master
    → worker × 4 (k8s-node1) drains k8s-node{4,3,2}
    → worker   (k8s-master + control-plane toleration) drains k8s-node1
    → postflight (no pinning)

Each Job runs scripts/upgrade-step.sh (case-on-$PHASE) and ends by
envsubst-ing job-template.yaml into the next Job. Deterministic names
(k8s-upgrade-<phase>-<target_version>[-<node>]) make `kubectl apply`
idempotent — a failed Job can be re-created without duplicating
downstream.

Also lands `predrain_unstick`: deletes pods on the target node whose PDB
has 0 disruptionsAllowed. Without this, drain loops indefinitely on
single-replica deployments (e.g. every Anubis instance — discovered the
hard way during 2026-05-11 manual recovery of k8s-node3).

Adds K8sUpgradeStalled alert (in_flight + started_timestamp > 90 min).
Deprecates the agent prompt (renamed to *.deprecated.md with a header
pointer to the new code).

Apply order: k8s-version-upgrade first (consumes new SA + ConfigMaps),
then monitoring (loads the new alert). Both applied 2026-05-11.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

											
										
										
											2026-05-11 23:54:05 +00:00
 								Job 2 — worker k8s-node4 (pinned: k8s-node1)
 								Job 3 — worker k8s-node3 (pinned: k8s-node1)
 								Job 4 — worker k8s-node2 (pinned: k8s-node1)
 								  (identical pattern: halt-on-alert wait 30m → drain → ssh script → uncordon → 10-min soak → spawn_next)
 								  ▼
 								Job 5 — worker k8s-node1 (pinned: k8s-master + control-plane toleration)
 								  └── spawn_next → k8s-upgrade-postflight-<target_version>
 								  ▼
 								Job 6 — postflight       (no pinning)
 								  ├── Verify all 5 nodes at target version
 								  ├── Verify no firing Upgrade Gates alerts
 								  ├── Compute pod-ready ratio (should be ≥ 0.9)
 								  ├── Clear k8s-upgrade-* annotations on namespace
 								  ├── Push k8s_upgrade_in_flight=0, k8s_upgrade_snapshot_taken=0, k8s_upgrade_started_timestamp=0
 								  └── Slack: ✅ K8s upgrade complete
-												k8s-version-upgrade: automated kubeadm/kubelet/kubectl upgrade pipeline

Adds a weekly detection CronJob (Sun 12:00 UTC) that probes apt-cache madison
on master for new patches + HEAD pkgs.k8s.io for next-minor availability,
then POSTs to claude-agent-service to dispatch the k8s-version-upgrade agent.

The agent (.claude/agents/k8s-version-upgrade.md) orchestrates:
  pre-flight (5 nodes Ready + halt-on-alert + 24h-quiet + plan target match)
    -> etcd snapshot save
    -> optional master containerd skew fix
    -> apt repo URL rewrite (minor bumps only)
    -> drain/upgrade/uncordon master via ssh < update_k8s.sh
    -> sequential workers k8s-node4 -> 3 -> 2 -> 1 with 10-min soak each
    -> post-flight verification

Two new Upgrade Gates alerts catch failure modes:
  - K8sVersionSkew (kubelet/apiserver gitVersion mismatch >30m)
  - EtcdPreUpgradeSnapshotMissing (in_flight without snapshot_taken >10m)

update_k8s.sh refactored to take --role / --release args; the agent shells
it into each node via SSH pipe. update_node.sh annotated as OS-major path.

Operator-facing docs: docs/runbooks/k8s-version-upgrade.md and a new section
in docs/architecture/automated-upgrades.md.

Secrets: secret/k8s-upgrade/{ssh_key,ssh_key_pub,slack_webhook} (ed25519
keypair distributed to all 5 nodes via authorized_keys; slack_webhook
reuses kured webhook URL on initial deploy).

											
										
										
											2026-05-10 19:07:42 +00:00
+								```
-												k8s-version-upgrade: decompose into Job chain to fix self-preemption

The agent-based v1 ran inside claude-agent-service (replicas=1, no
nodeSelector) and self-evicted when it tried to drain its host (k8s-node4
on 2026-05-11). Cluster ended half-upgraded (master v1.34.7, workers
v1.34.2) until manual recovery.

Rewrite the pipeline as a chain of nodeSelector-pinned Jobs:

  preflight (k8s-node1)
    → master   (k8s-node1)   drains k8s-master
    → worker × 4 (k8s-node1) drains k8s-node{4,3,2}
    → worker   (k8s-master + control-plane toleration) drains k8s-node1
    → postflight (no pinning)

Each Job runs scripts/upgrade-step.sh (case-on-$PHASE) and ends by
envsubst-ing job-template.yaml into the next Job. Deterministic names
(k8s-upgrade-<phase>-<target_version>[-<node>]) make `kubectl apply`
idempotent — a failed Job can be re-created without duplicating
downstream.

Also lands `predrain_unstick`: deletes pods on the target node whose PDB
has 0 disruptionsAllowed. Without this, drain loops indefinitely on
single-replica deployments (e.g. every Anubis instance — discovered the
hard way during 2026-05-11 manual recovery of k8s-node3).

Adds K8sUpgradeStalled alert (in_flight + started_timestamp > 90 min).
Deprecates the agent prompt (renamed to *.deprecated.md with a header
pointer to the new code).

Apply order: k8s-version-upgrade first (consumes new SA + ConfigMaps),
then monitoring (loads the new alert). Both applied 2026-05-11.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

											
										
										
											2026-05-11 23:54:05 +00:00
+								**Pin choices summarised:**
 								- k8s-node1 hosts every Job that drains master or another worker. k8s-node1
 								  itself is upgraded **last**.
 								- k8s-master hosts Job 5 (which drains k8s-node1). Job 5's spec includes a
 								  toleration for `node-role.kubernetes.io/control-plane:NoSchedule`.
 								- If anyone reorders the worker sequence, the pin for Job 5 needs to track
 								  whatever worker is upgraded last. The mapping is in `scripts/upgrade-step.sh`
 								  → the `case "${PHASE}:${TARGET_NODE:-}"` block.
-												k8s-version-upgrade: automated kubeadm/kubelet/kubectl upgrade pipeline

Adds a weekly detection CronJob (Sun 12:00 UTC) that probes apt-cache madison
on master for new patches + HEAD pkgs.k8s.io for next-minor availability,
then POSTs to claude-agent-service to dispatch the k8s-version-upgrade agent.

The agent (.claude/agents/k8s-version-upgrade.md) orchestrates:
  pre-flight (5 nodes Ready + halt-on-alert + 24h-quiet + plan target match)
    -> etcd snapshot save
    -> optional master containerd skew fix
    -> apt repo URL rewrite (minor bumps only)
    -> drain/upgrade/uncordon master via ssh < update_k8s.sh
    -> sequential workers k8s-node4 -> 3 -> 2 -> 1 with 10-min soak each
    -> post-flight verification

Two new Upgrade Gates alerts catch failure modes:
  - K8sVersionSkew (kubelet/apiserver gitVersion mismatch >30m)
  - EtcdPreUpgradeSnapshotMissing (in_flight without snapshot_taken >10m)

update_k8s.sh refactored to take --role / --release args; the agent shells
it into each node via SSH pipe. update_node.sh annotated as OS-major path.

Operator-facing docs: docs/runbooks/k8s-version-upgrade.md and a new section
in docs/architecture/automated-upgrades.md.

Secrets: secret/k8s-upgrade/{ssh_key,ssh_key_pub,slack_webhook} (ed25519
keypair distributed to all 5 nodes via authorized_keys; slack_webhook
reuses kured webhook URL on initial deploy).

											
										
										
											2026-05-10 19:07:42 +00:00
+								## Components
-												k8s-version-upgrade: decompose into Job chain to fix self-preemption

The agent-based v1 ran inside claude-agent-service (replicas=1, no
nodeSelector) and self-evicted when it tried to drain its host (k8s-node4
on 2026-05-11). Cluster ended half-upgraded (master v1.34.7, workers
v1.34.2) until manual recovery.

Rewrite the pipeline as a chain of nodeSelector-pinned Jobs:

  preflight (k8s-node1)
    → master   (k8s-node1)   drains k8s-master
    → worker × 4 (k8s-node1) drains k8s-node{4,3,2}
    → worker   (k8s-master + control-plane toleration) drains k8s-node1
    → postflight (no pinning)

Each Job runs scripts/upgrade-step.sh (case-on-$PHASE) and ends by
envsubst-ing job-template.yaml into the next Job. Deterministic names
(k8s-upgrade-<phase>-<target_version>[-<node>]) make `kubectl apply`
idempotent — a failed Job can be re-created without duplicating
downstream.

Also lands `predrain_unstick`: deletes pods on the target node whose PDB
has 0 disruptionsAllowed. Without this, drain loops indefinitely on
single-replica deployments (e.g. every Anubis instance — discovered the
hard way during 2026-05-11 manual recovery of k8s-node3).

Adds K8sUpgradeStalled alert (in_flight + started_timestamp > 90 min).
Deprecates the agent prompt (renamed to *.deprecated.md with a header
pointer to the new code).

Apply order: k8s-version-upgrade first (consumes new SA + ConfigMaps),
then monitoring (loads the new alert). Both applied 2026-05-11.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

											
										
										
											2026-05-11 23:54:05 +00:00
+								### Shared resources (one-time, Terraform-managed)
 								| Resource | Purpose |
 								|---|---|
 								| **ConfigMap `k8s-upgrade-scripts`** | Mounts `/scripts/upgrade-step.sh` (universal phase body, dispatches on `$PHASE`) and `/scripts/update_k8s.sh` (per-node kubeadm/kubelet/kubectl upgrade body — same script the old manual loop used) in every Job pod. |
 								| **ConfigMap `k8s-upgrade-job-template`** | Mounts `/template/job-template.yaml` — universal Job manifest with envsubst placeholders. Rendered by upgrade-step.sh and the detection CronJob via `envsubst | kubectl apply`. |
 								| **ServiceAccount `k8s-upgrade-job`** | Used by both the detection CronJob and every chain Job. ClusterRole binding grants: nodes get/list/patch, pods/eviction create, pods delete, batch/jobs CRUD, PDB list (for predrain_unstick), CronJob get (snapshot trigger), namespaces patch on `k8s-upgrade` only. Namespace-scoped Role binding grants secrets:get on `k8s-upgrade-creds`. |
 								| **ExternalSecret `k8s-upgrade-creds`** | Syncs `secret/k8s-upgrade/{ssh_key, slack_webhook}` from Vault. Mounted into every Job at `/secrets/k8s-upgrade`. |
 								| **CronJob `k8s-version-check`** | Sun 12:00 UTC. Probes apt + pkgs.k8s.io for target. If found, renders Job 0 from `job-template.yaml` and applies it. |
 								### Pushgateway metrics
 								Pushed by upgrade-step.sh during phase execution; observed by the
 								`Upgrade Gates` alert group in `stacks/monitoring/.../prometheus_chart_values.tpl`:
 								| Metric | Pushed by | Cleared by |
 								|---|---|---|
 								| `k8s_upgrade_in_flight` (1/0) | preflight Job (set to 1) | postflight Job (set to 0) |
 								| `k8s_upgrade_started_timestamp` (epoch s) | preflight Job | postflight Job (set to 0) |
 								| `k8s_upgrade_snapshot_taken` (1/0) | preflight Job (set to 1 after Job=`pre-upgrade-etcd-*` completes with `Backup done:` log of ≥1 KiB) | postflight Job (0) |
 								| `k8s_upgrade_available{kind,running,target}` | detection CronJob | next detection run (overwrite) |
 								| `k8s_version_check_last_run_timestamp` | detection CronJob | (cumulative) |
 								### Upgrade Gates alerts (`Upgrade Gates` group in prometheus_chart_values.tpl)
 								- **`K8sVersionSkew`** — distinct kubelet/apiserver `gitVersion` count > 1 for 30m. Catches a half-done rollout.
 								- **`EtcdPreUpgradeSnapshotMissing`** — `k8s_upgrade_in_flight==1 && k8s_upgrade_snapshot_taken==0` for 10m. Catches preflight Stage 2 failing silently.
 								- **`K8sUpgradeStalled`** — `k8s_upgrade_in_flight==1 && time()-k8s_upgrade_started_timestamp > 5400` for 5m. Catches a Job in the chain dying without spawning its successor.
 								- All three alerts ALSO block kured (same `--prometheus-url` halt-on-alert mechanism) so the OS-reboot pipeline can't run on top of a half-done version upgrade.
-												k8s-version-upgrade: automated kubeadm/kubelet/kubectl upgrade pipeline

Adds a weekly detection CronJob (Sun 12:00 UTC) that probes apt-cache madison
on master for new patches + HEAD pkgs.k8s.io for next-minor availability,
then POSTs to claude-agent-service to dispatch the k8s-version-upgrade agent.

The agent (.claude/agents/k8s-version-upgrade.md) orchestrates:
  pre-flight (5 nodes Ready + halt-on-alert + 24h-quiet + plan target match)
    -> etcd snapshot save
    -> optional master containerd skew fix
    -> apt repo URL rewrite (minor bumps only)
    -> drain/upgrade/uncordon master via ssh < update_k8s.sh
    -> sequential workers k8s-node4 -> 3 -> 2 -> 1 with 10-min soak each
    -> post-flight verification

Two new Upgrade Gates alerts catch failure modes:
  - K8sVersionSkew (kubelet/apiserver gitVersion mismatch >30m)
  - EtcdPreUpgradeSnapshotMissing (in_flight without snapshot_taken >10m)

update_k8s.sh refactored to take --role / --release args; the agent shells
it into each node via SSH pipe. update_node.sh annotated as OS-major path.

Operator-facing docs: docs/runbooks/k8s-version-upgrade.md and a new section
in docs/architecture/automated-upgrades.md.

Secrets: secret/k8s-upgrade/{ssh_key,ssh_key_pub,slack_webhook} (ed25519
keypair distributed to all 5 nodes via authorized_keys; slack_webhook
reuses kured webhook URL on initial deploy).

											
										
										
											2026-05-10 19:07:42 +00:00
 								### Vault secrets
-												k8s-version-upgrade: decompose into Job chain to fix self-preemption

The agent-based v1 ran inside claude-agent-service (replicas=1, no
nodeSelector) and self-evicted when it tried to drain its host (k8s-node4
on 2026-05-11). Cluster ended half-upgraded (master v1.34.7, workers
v1.34.2) until manual recovery.

Rewrite the pipeline as a chain of nodeSelector-pinned Jobs:

  preflight (k8s-node1)
    → master   (k8s-node1)   drains k8s-master
    → worker × 4 (k8s-node1) drains k8s-node{4,3,2}
    → worker   (k8s-master + control-plane toleration) drains k8s-node1
    → postflight (no pinning)

Each Job runs scripts/upgrade-step.sh (case-on-$PHASE) and ends by
envsubst-ing job-template.yaml into the next Job. Deterministic names
(k8s-upgrade-<phase>-<target_version>[-<node>]) make `kubectl apply`
idempotent — a failed Job can be re-created without duplicating
downstream.

Also lands `predrain_unstick`: deletes pods on the target node whose PDB
has 0 disruptionsAllowed. Without this, drain loops indefinitely on
single-replica deployments (e.g. every Anubis instance — discovered the
hard way during 2026-05-11 manual recovery of k8s-node3).

Adds K8sUpgradeStalled alert (in_flight + started_timestamp > 90 min).
Deprecates the agent prompt (renamed to *.deprecated.md with a header
pointer to the new code).

Apply order: k8s-version-upgrade first (consumes new SA + ConfigMaps),
then monitoring (loads the new alert). Both applied 2026-05-11.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

											
										
										
											2026-05-11 23:54:05 +00:00
+								- `secret/k8s-upgrade/ssh_key` — ed25519 PRIVATE key, used by Jobs to SSH `wizard@<node>`
 								- `secret/k8s-upgrade/ssh_key_pub` — matching PUBLIC key, deployed to nodes' `~/.ssh/authorized_keys`
 								- `secret/k8s-upgrade/slack_webhook` — Slack incoming-webhook URL
 								Exposed in K8s via ExternalSecret `k8s-upgrade-creds` in the `k8s-upgrade` namespace. The previous `api_bearer_token` entry is GONE — the chain does not POST to `claude-agent-service`.
-												k8s-version-upgrade: automated kubeadm/kubelet/kubectl upgrade pipeline

Adds a weekly detection CronJob (Sun 12:00 UTC) that probes apt-cache madison
on master for new patches + HEAD pkgs.k8s.io for next-minor availability,
then POSTs to claude-agent-service to dispatch the k8s-version-upgrade agent.

The agent (.claude/agents/k8s-version-upgrade.md) orchestrates:
  pre-flight (5 nodes Ready + halt-on-alert + 24h-quiet + plan target match)
    -> etcd snapshot save
    -> optional master containerd skew fix
    -> apt repo URL rewrite (minor bumps only)
    -> drain/upgrade/uncordon master via ssh < update_k8s.sh
    -> sequential workers k8s-node4 -> 3 -> 2 -> 1 with 10-min soak each
    -> post-flight verification

Two new Upgrade Gates alerts catch failure modes:
  - K8sVersionSkew (kubelet/apiserver gitVersion mismatch >30m)
  - EtcdPreUpgradeSnapshotMissing (in_flight without snapshot_taken >10m)

update_k8s.sh refactored to take --role / --release args; the agent shells
it into each node via SSH pipe. update_node.sh annotated as OS-major path.

Operator-facing docs: docs/runbooks/k8s-version-upgrade.md and a new section
in docs/architecture/automated-upgrades.md.

Secrets: secret/k8s-upgrade/{ssh_key,ssh_key_pub,slack_webhook} (ed25519
keypair distributed to all 5 nodes via authorized_keys; slack_webhook
reuses kured webhook URL on initial deploy).

											
										
										
											2026-05-10 19:07:42 +00:00
 								## Common Operations
 								### Verify the pipeline is healthy
 								```bash
 								# CronJob present + not suspended
 								kubectl -n k8s-upgrade get cronjob k8s-version-check
-												k8s-version-upgrade: decompose into Job chain to fix self-preemption

The agent-based v1 ran inside claude-agent-service (replicas=1, no
nodeSelector) and self-evicted when it tried to drain its host (k8s-node4
on 2026-05-11). Cluster ended half-upgraded (master v1.34.7, workers
v1.34.2) until manual recovery.

Rewrite the pipeline as a chain of nodeSelector-pinned Jobs:

  preflight (k8s-node1)
    → master   (k8s-node1)   drains k8s-master
    → worker × 4 (k8s-node1) drains k8s-node{4,3,2}
    → worker   (k8s-master + control-plane toleration) drains k8s-node1
    → postflight (no pinning)

Each Job runs scripts/upgrade-step.sh (case-on-$PHASE) and ends by
envsubst-ing job-template.yaml into the next Job. Deterministic names
(k8s-upgrade-<phase>-<target_version>[-<node>]) make `kubectl apply`
idempotent — a failed Job can be re-created without duplicating
downstream.

Also lands `predrain_unstick`: deletes pods on the target node whose PDB
has 0 disruptionsAllowed. Without this, drain loops indefinitely on
single-replica deployments (e.g. every Anubis instance — discovered the
hard way during 2026-05-11 manual recovery of k8s-node3).

Adds K8sUpgradeStalled alert (in_flight + started_timestamp > 90 min).
Deprecates the agent prompt (renamed to *.deprecated.md with a header
pointer to the new code).

Apply order: k8s-version-upgrade first (consumes new SA + ConfigMaps),
then monitoring (loads the new alert). Both applied 2026-05-11.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

											
										
										
											2026-05-11 23:54:05 +00:00
+								# Latest detection run output
 								kubectl -n k8s-upgrade get jobs -l app=k8s-version-upgrade
 								kubectl -n k8s-upgrade logs -l app=k8s-version-upgrade --tail=200
-												k8s-version-upgrade: automated kubeadm/kubelet/kubectl upgrade pipeline

Adds a weekly detection CronJob (Sun 12:00 UTC) that probes apt-cache madison
on master for new patches + HEAD pkgs.k8s.io for next-minor availability,
then POSTs to claude-agent-service to dispatch the k8s-version-upgrade agent.

The agent (.claude/agents/k8s-version-upgrade.md) orchestrates:
  pre-flight (5 nodes Ready + halt-on-alert + 24h-quiet + plan target match)
    -> etcd snapshot save
    -> optional master containerd skew fix
    -> apt repo URL rewrite (minor bumps only)
    -> drain/upgrade/uncordon master via ssh < update_k8s.sh
    -> sequential workers k8s-node4 -> 3 -> 2 -> 1 with 10-min soak each
    -> post-flight verification

Two new Upgrade Gates alerts catch failure modes:
  - K8sVersionSkew (kubelet/apiserver gitVersion mismatch >30m)
  - EtcdPreUpgradeSnapshotMissing (in_flight without snapshot_taken >10m)

update_k8s.sh refactored to take --role / --release args; the agent shells
it into each node via SSH pipe. update_node.sh annotated as OS-major path.

Operator-facing docs: docs/runbooks/k8s-version-upgrade.md and a new section
in docs/architecture/automated-upgrades.md.

Secrets: secret/k8s-upgrade/{ssh_key,ssh_key_pub,slack_webhook} (ed25519
keypair distributed to all 5 nodes via authorized_keys; slack_webhook
reuses kured webhook URL on initial deploy).

											
										
										
											2026-05-10 19:07:42 +00:00
-												k8s-version-upgrade: decompose into Job chain to fix self-preemption

The agent-based v1 ran inside claude-agent-service (replicas=1, no
nodeSelector) and self-evicted when it tried to drain its host (k8s-node4
on 2026-05-11). Cluster ended half-upgraded (master v1.34.7, workers
v1.34.2) until manual recovery.

Rewrite the pipeline as a chain of nodeSelector-pinned Jobs:

  preflight (k8s-node1)
    → master   (k8s-node1)   drains k8s-master
    → worker × 4 (k8s-node1) drains k8s-node{4,3,2}
    → worker   (k8s-master + control-plane toleration) drains k8s-node1
    → postflight (no pinning)

Each Job runs scripts/upgrade-step.sh (case-on-$PHASE) and ends by
envsubst-ing job-template.yaml into the next Job. Deterministic names
(k8s-upgrade-<phase>-<target_version>[-<node>]) make `kubectl apply`
idempotent — a failed Job can be re-created without duplicating
downstream.

Also lands `predrain_unstick`: deletes pods on the target node whose PDB
has 0 disruptionsAllowed. Without this, drain loops indefinitely on
single-replica deployments (e.g. every Anubis instance — discovered the
hard way during 2026-05-11 manual recovery of k8s-node3).

Adds K8sUpgradeStalled alert (in_flight + started_timestamp > 90 min).
Deprecates the agent prompt (renamed to *.deprecated.md with a header
pointer to the new code).

Apply order: k8s-version-upgrade first (consumes new SA + ConfigMaps),
then monitoring (loads the new alert). Both applied 2026-05-11.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

											
										
										
											2026-05-11 23:54:05 +00:00
+								# Chain Jobs from the last run (retained 7 days via ttlSecondsAfterFinished)
 								kubectl -n k8s-upgrade get jobs -l app=k8s-upgrade-chain
 								# Pushgateway — running detection metric
 								kubectl -n monitoring exec deploy/prometheus-server -c prometheus-server -- \
 								  wget -q -O- 'http://prometheus-prometheus-pushgateway.monitoring:9091/metrics' | \
 								  grep -E '^(k8s_upgrade_(available|in_flight|started_timestamp|snapshot_taken)|k8s_version_check_last_run_timestamp)'
-												k8s-version-upgrade: automated kubeadm/kubelet/kubectl upgrade pipeline

Adds a weekly detection CronJob (Sun 12:00 UTC) that probes apt-cache madison
on master for new patches + HEAD pkgs.k8s.io for next-minor availability,
then POSTs to claude-agent-service to dispatch the k8s-version-upgrade agent.

The agent (.claude/agents/k8s-version-upgrade.md) orchestrates:
  pre-flight (5 nodes Ready + halt-on-alert + 24h-quiet + plan target match)
    -> etcd snapshot save
    -> optional master containerd skew fix
    -> apt repo URL rewrite (minor bumps only)
    -> drain/upgrade/uncordon master via ssh < update_k8s.sh
    -> sequential workers k8s-node4 -> 3 -> 2 -> 1 with 10-min soak each
    -> post-flight verification

Two new Upgrade Gates alerts catch failure modes:
  - K8sVersionSkew (kubelet/apiserver gitVersion mismatch >30m)
  - EtcdPreUpgradeSnapshotMissing (in_flight without snapshot_taken >10m)

update_k8s.sh refactored to take --role / --release args; the agent shells
it into each node via SSH pipe. update_node.sh annotated as OS-major path.

Operator-facing docs: docs/runbooks/k8s-version-upgrade.md and a new section
in docs/architecture/automated-upgrades.md.

Secrets: secret/k8s-upgrade/{ssh_key,ssh_key_pub,slack_webhook} (ed25519
keypair distributed to all 5 nodes via authorized_keys; slack_webhook
reuses kured webhook URL on initial deploy).

											
										
										
											2026-05-10 19:07:42 +00:00
 								# Upgrade Gates rules loaded
 								kubectl -n monitoring exec deploy/prometheus-server -c prometheus-server -- \
 								  wget -q -O- 'http://localhost:9090/api/v1/rules' | \
 								  jq -r '.data.groups[] | select(.name == "Upgrade Gates") | .rules[] | "  \(.name): \(.state)"'
 								```
-												k8s-version-upgrade: decompose into Job chain to fix self-preemption

The agent-based v1 ran inside claude-agent-service (replicas=1, no
nodeSelector) and self-evicted when it tried to drain its host (k8s-node4
on 2026-05-11). Cluster ended half-upgraded (master v1.34.7, workers
v1.34.2) until manual recovery.

Rewrite the pipeline as a chain of nodeSelector-pinned Jobs:

  preflight (k8s-node1)
    → master   (k8s-node1)   drains k8s-master
    → worker × 4 (k8s-node1) drains k8s-node{4,3,2}
    → worker   (k8s-master + control-plane toleration) drains k8s-node1
    → postflight (no pinning)

Each Job runs scripts/upgrade-step.sh (case-on-$PHASE) and ends by
envsubst-ing job-template.yaml into the next Job. Deterministic names
(k8s-upgrade-<phase>-<target_version>[-<node>]) make `kubectl apply`
idempotent — a failed Job can be re-created without duplicating
downstream.

Also lands `predrain_unstick`: deletes pods on the target node whose PDB
has 0 disruptionsAllowed. Without this, drain loops indefinitely on
single-replica deployments (e.g. every Anubis instance — discovered the
hard way during 2026-05-11 manual recovery of k8s-node3).

Adds K8sUpgradeStalled alert (in_flight + started_timestamp > 90 min).
Deprecates the agent prompt (renamed to *.deprecated.md with a header
pointer to the new code).

Apply order: k8s-version-upgrade first (consumes new SA + ConfigMaps),
then monitoring (loads the new alert). Both applied 2026-05-11.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

											
										
										
											2026-05-11 23:54:05 +00:00
+								### Manually trigger detection (no upgrade)
 								Use `detection_dry_run=true` to short-circuit before spawning Job 0:
-												k8s-version-upgrade: automated kubeadm/kubelet/kubectl upgrade pipeline

Adds a weekly detection CronJob (Sun 12:00 UTC) that probes apt-cache madison
on master for new patches + HEAD pkgs.k8s.io for next-minor availability,
then POSTs to claude-agent-service to dispatch the k8s-version-upgrade agent.

The agent (.claude/agents/k8s-version-upgrade.md) orchestrates:
  pre-flight (5 nodes Ready + halt-on-alert + 24h-quiet + plan target match)
    -> etcd snapshot save
    -> optional master containerd skew fix
    -> apt repo URL rewrite (minor bumps only)
    -> drain/upgrade/uncordon master via ssh < update_k8s.sh
    -> sequential workers k8s-node4 -> 3 -> 2 -> 1 with 10-min soak each
    -> post-flight verification

Two new Upgrade Gates alerts catch failure modes:
  - K8sVersionSkew (kubelet/apiserver gitVersion mismatch >30m)
  - EtcdPreUpgradeSnapshotMissing (in_flight without snapshot_taken >10m)

update_k8s.sh refactored to take --role / --release args; the agent shells
it into each node via SSH pipe. update_node.sh annotated as OS-major path.

Operator-facing docs: docs/runbooks/k8s-version-upgrade.md and a new section
in docs/architecture/automated-upgrades.md.

Secrets: secret/k8s-upgrade/{ssh_key,ssh_key_pub,slack_webhook} (ed25519
keypair distributed to all 5 nodes via authorized_keys; slack_webhook
reuses kured webhook URL on initial deploy).

											
										
										
											2026-05-10 19:07:42 +00:00
 								```bash
-												k8s-version-upgrade: decompose into Job chain to fix self-preemption

The agent-based v1 ran inside claude-agent-service (replicas=1, no
nodeSelector) and self-evicted when it tried to drain its host (k8s-node4
on 2026-05-11). Cluster ended half-upgraded (master v1.34.7, workers
v1.34.2) until manual recovery.

Rewrite the pipeline as a chain of nodeSelector-pinned Jobs:

  preflight (k8s-node1)
    → master   (k8s-node1)   drains k8s-master
    → worker × 4 (k8s-node1) drains k8s-node{4,3,2}
    → worker   (k8s-master + control-plane toleration) drains k8s-node1
    → postflight (no pinning)

Each Job runs scripts/upgrade-step.sh (case-on-$PHASE) and ends by
envsubst-ing job-template.yaml into the next Job. Deterministic names
(k8s-upgrade-<phase>-<target_version>[-<node>]) make `kubectl apply`
idempotent — a failed Job can be re-created without duplicating
downstream.

Also lands `predrain_unstick`: deletes pods on the target node whose PDB
has 0 disruptionsAllowed. Without this, drain loops indefinitely on
single-replica deployments (e.g. every Anubis instance — discovered the
hard way during 2026-05-11 manual recovery of k8s-node3).

Adds K8sUpgradeStalled alert (in_flight + started_timestamp > 90 min).
Deprecates the agent prompt (renamed to *.deprecated.md with a header
pointer to the new code).

Apply order: k8s-version-upgrade first (consumes new SA + ConfigMaps),
then monitoring (loads the new alert). Both applied 2026-05-11.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

											
										
										
											2026-05-11 23:54:05 +00:00
+								# Toggle var in TF, apply, and trigger
 								# (in stacks/k8s-version-upgrade/main.tf)
 								#   variable "detection_dry_run" { default = true }
 								# scripts/tg apply
-												k8s-version-upgrade: automated kubeadm/kubelet/kubectl upgrade pipeline

Adds a weekly detection CronJob (Sun 12:00 UTC) that probes apt-cache madison
on master for new patches + HEAD pkgs.k8s.io for next-minor availability,
then POSTs to claude-agent-service to dispatch the k8s-version-upgrade agent.

The agent (.claude/agents/k8s-version-upgrade.md) orchestrates:
  pre-flight (5 nodes Ready + halt-on-alert + 24h-quiet + plan target match)
    -> etcd snapshot save
    -> optional master containerd skew fix
    -> apt repo URL rewrite (minor bumps only)
    -> drain/upgrade/uncordon master via ssh < update_k8s.sh
    -> sequential workers k8s-node4 -> 3 -> 2 -> 1 with 10-min soak each
    -> post-flight verification

Two new Upgrade Gates alerts catch failure modes:
  - K8sVersionSkew (kubelet/apiserver gitVersion mismatch >30m)
  - EtcdPreUpgradeSnapshotMissing (in_flight without snapshot_taken >10m)

update_k8s.sh refactored to take --role / --release args; the agent shells
it into each node via SSH pipe. update_node.sh annotated as OS-major path.

Operator-facing docs: docs/runbooks/k8s-version-upgrade.md and a new section
in docs/architecture/automated-upgrades.md.

Secrets: secret/k8s-upgrade/{ssh_key,ssh_key_pub,slack_webhook} (ed25519
keypair distributed to all 5 nodes via authorized_keys; slack_webhook
reuses kured webhook URL on initial deploy).

											
										
										
											2026-05-10 19:07:42 +00:00
+								kubectl -n k8s-upgrade create job --from=cronjob/k8s-version-check version-check-test
 								kubectl -n k8s-upgrade logs -l job-name=version-check-test -f
-												k8s-version-upgrade: decompose into Job chain to fix self-preemption

The agent-based v1 ran inside claude-agent-service (replicas=1, no
nodeSelector) and self-evicted when it tried to drain its host (k8s-node4
on 2026-05-11). Cluster ended half-upgraded (master v1.34.7, workers
v1.34.2) until manual recovery.

Rewrite the pipeline as a chain of nodeSelector-pinned Jobs:

  preflight (k8s-node1)
    → master   (k8s-node1)   drains k8s-master
    → worker × 4 (k8s-node1) drains k8s-node{4,3,2}
    → worker   (k8s-master + control-plane toleration) drains k8s-node1
    → postflight (no pinning)

Each Job runs scripts/upgrade-step.sh (case-on-$PHASE) and ends by
envsubst-ing job-template.yaml into the next Job. Deterministic names
(k8s-upgrade-<phase>-<target_version>[-<node>]) make `kubectl apply`
idempotent — a failed Job can be re-created without duplicating
downstream.

Also lands `predrain_unstick`: deletes pods on the target node whose PDB
has 0 disruptionsAllowed. Without this, drain loops indefinitely on
single-replica deployments (e.g. every Anubis instance — discovered the
hard way during 2026-05-11 manual recovery of k8s-node3).

Adds K8sUpgradeStalled alert (in_flight + started_timestamp > 90 min).
Deprecates the agent prompt (renamed to *.deprecated.md with a header
pointer to the new code).

Apply order: k8s-version-upgrade first (consumes new SA + ConfigMaps),
then monitoring (loads the new alert). Both applied 2026-05-11.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

											
										
										
											2026-05-11 23:54:05 +00:00
+								# When done, flip back to false.
-												k8s-version-upgrade: automated kubeadm/kubelet/kubectl upgrade pipeline

Adds a weekly detection CronJob (Sun 12:00 UTC) that probes apt-cache madison
on master for new patches + HEAD pkgs.k8s.io for next-minor availability,
then POSTs to claude-agent-service to dispatch the k8s-version-upgrade agent.

The agent (.claude/agents/k8s-version-upgrade.md) orchestrates:
  pre-flight (5 nodes Ready + halt-on-alert + 24h-quiet + plan target match)
    -> etcd snapshot save
    -> optional master containerd skew fix
    -> apt repo URL rewrite (minor bumps only)
    -> drain/upgrade/uncordon master via ssh < update_k8s.sh
    -> sequential workers k8s-node4 -> 3 -> 2 -> 1 with 10-min soak each
    -> post-flight verification

Two new Upgrade Gates alerts catch failure modes:
  - K8sVersionSkew (kubelet/apiserver gitVersion mismatch >30m)
  - EtcdPreUpgradeSnapshotMissing (in_flight without snapshot_taken >10m)

update_k8s.sh refactored to take --role / --release args; the agent shells
it into each node via SSH pipe. update_node.sh annotated as OS-major path.

Operator-facing docs: docs/runbooks/k8s-version-upgrade.md and a new section
in docs/architecture/automated-upgrades.md.

Secrets: secret/k8s-upgrade/{ssh_key,ssh_key_pub,slack_webhook} (ed25519
keypair distributed to all 5 nodes via authorized_keys; slack_webhook
reuses kured webhook URL on initial deploy).

											
										
										
											2026-05-10 19:07:42 +00:00
+								```
-												k8s-version-upgrade: decompose into Job chain to fix self-preemption

The agent-based v1 ran inside claude-agent-service (replicas=1, no
nodeSelector) and self-evicted when it tried to drain its host (k8s-node4
on 2026-05-11). Cluster ended half-upgraded (master v1.34.7, workers
v1.34.2) until manual recovery.

Rewrite the pipeline as a chain of nodeSelector-pinned Jobs:

  preflight (k8s-node1)
    → master   (k8s-node1)   drains k8s-master
    → worker × 4 (k8s-node1) drains k8s-node{4,3,2}
    → worker   (k8s-master + control-plane toleration) drains k8s-node1
    → postflight (no pinning)

Each Job runs scripts/upgrade-step.sh (case-on-$PHASE) and ends by
envsubst-ing job-template.yaml into the next Job. Deterministic names
(k8s-upgrade-<phase>-<target_version>[-<node>]) make `kubectl apply`
idempotent — a failed Job can be re-created without duplicating
downstream.

Also lands `predrain_unstick`: deletes pods on the target node whose PDB
has 0 disruptionsAllowed. Without this, drain loops indefinitely on
single-replica deployments (e.g. every Anubis instance — discovered the
hard way during 2026-05-11 manual recovery of k8s-node3).

Adds K8sUpgradeStalled alert (in_flight + started_timestamp > 90 min).
Deprecates the agent prompt (renamed to *.deprecated.md with a header
pointer to the new code).

Apply order: k8s-version-upgrade first (consumes new SA + ConfigMaps),
then monitoring (loads the new alert). Both applied 2026-05-11.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

											
										
										
											2026-05-11 23:54:05 +00:00
+								### Manually trigger the chain (skip detection)
 								Useful for testing or to force a specific target. Render Job 0 directly:
 								```bash
 								TARGET=1.34.7
 								KIND=patch
 								IMAGE=$(kubectl -n k8s-upgrade get cronjob k8s-version-check \
 								  -o jsonpath='{.spec.jobTemplate.spec.template.spec.containers[0].image}')
 								cat <<EOF | envsubst | kubectl apply -f -
 								$(kubectl -n k8s-upgrade get cm k8s-upgrade-job-template -o jsonpath='{.data.job-template\.yaml}')
 								EOF
 								# Note: export JOB_NAME, PHASE_NEXT, etc. first — see the CronJob's command for
 								# the full env block. Easier: just trigger detection with the right inputs.
 								```
-												k8s-version-upgrade: automated kubeadm/kubelet/kubectl upgrade pipeline

Adds a weekly detection CronJob (Sun 12:00 UTC) that probes apt-cache madison
on master for new patches + HEAD pkgs.k8s.io for next-minor availability,
then POSTs to claude-agent-service to dispatch the k8s-version-upgrade agent.

The agent (.claude/agents/k8s-version-upgrade.md) orchestrates:
  pre-flight (5 nodes Ready + halt-on-alert + 24h-quiet + plan target match)
    -> etcd snapshot save
    -> optional master containerd skew fix
    -> apt repo URL rewrite (minor bumps only)
    -> drain/upgrade/uncordon master via ssh < update_k8s.sh
    -> sequential workers k8s-node4 -> 3 -> 2 -> 1 with 10-min soak each
    -> post-flight verification

Two new Upgrade Gates alerts catch failure modes:
  - K8sVersionSkew (kubelet/apiserver gitVersion mismatch >30m)
  - EtcdPreUpgradeSnapshotMissing (in_flight without snapshot_taken >10m)

update_k8s.sh refactored to take --role / --release args; the agent shells
it into each node via SSH pipe. update_node.sh annotated as OS-major path.

Operator-facing docs: docs/runbooks/k8s-version-upgrade.md and a new section
in docs/architecture/automated-upgrades.md.

Secrets: secret/k8s-upgrade/{ssh_key,ssh_key_pub,slack_webhook} (ed25519
keypair distributed to all 5 nodes via authorized_keys; slack_webhook
reuses kured webhook URL on initial deploy).

											
										
										
											2026-05-10 19:07:42 +00:00
-												k8s-version-upgrade: decompose into Job chain to fix self-preemption

The agent-based v1 ran inside claude-agent-service (replicas=1, no
nodeSelector) and self-evicted when it tried to drain its host (k8s-node4
on 2026-05-11). Cluster ended half-upgraded (master v1.34.7, workers
v1.34.2) until manual recovery.

Rewrite the pipeline as a chain of nodeSelector-pinned Jobs:

  preflight (k8s-node1)
    → master   (k8s-node1)   drains k8s-master
    → worker × 4 (k8s-node1) drains k8s-node{4,3,2}
    → worker   (k8s-master + control-plane toleration) drains k8s-node1
    → postflight (no pinning)

Each Job runs scripts/upgrade-step.sh (case-on-$PHASE) and ends by
envsubst-ing job-template.yaml into the next Job. Deterministic names
(k8s-upgrade-<phase>-<target_version>[-<node>]) make `kubectl apply`
idempotent — a failed Job can be re-created without duplicating
downstream.

Also lands `predrain_unstick`: deletes pods on the target node whose PDB
has 0 disruptionsAllowed. Without this, drain loops indefinitely on
single-replica deployments (e.g. every Anubis instance — discovered the
hard way during 2026-05-11 manual recovery of k8s-node3).

Adds K8sUpgradeStalled alert (in_flight + started_timestamp > 90 min).
Deprecates the agent prompt (renamed to *.deprecated.md with a header
pointer to the new code).

Apply order: k8s-version-upgrade first (consumes new SA + ConfigMaps),
then monitoring (loads the new alert). Both applied 2026-05-11.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

											
										
										
											2026-05-11 23:54:05 +00:00
+								### Kill a stuck Job (chain halted mid-flight)
 								The chain stalls if any Job dies without spawning its successor. `K8sUpgradeStalled`
 								fires after 90 min. Recovery:
-												k8s-version-upgrade: automated kubeadm/kubelet/kubectl upgrade pipeline

Adds a weekly detection CronJob (Sun 12:00 UTC) that probes apt-cache madison
on master for new patches + HEAD pkgs.k8s.io for next-minor availability,
then POSTs to claude-agent-service to dispatch the k8s-version-upgrade agent.

The agent (.claude/agents/k8s-version-upgrade.md) orchestrates:
  pre-flight (5 nodes Ready + halt-on-alert + 24h-quiet + plan target match)
    -> etcd snapshot save
    -> optional master containerd skew fix
    -> apt repo URL rewrite (minor bumps only)
    -> drain/upgrade/uncordon master via ssh < update_k8s.sh
    -> sequential workers k8s-node4 -> 3 -> 2 -> 1 with 10-min soak each
    -> post-flight verification

Two new Upgrade Gates alerts catch failure modes:
  - K8sVersionSkew (kubelet/apiserver gitVersion mismatch >30m)
  - EtcdPreUpgradeSnapshotMissing (in_flight without snapshot_taken >10m)

update_k8s.sh refactored to take --role / --release args; the agent shells
it into each node via SSH pipe. update_node.sh annotated as OS-major path.

Operator-facing docs: docs/runbooks/k8s-version-upgrade.md and a new section
in docs/architecture/automated-upgrades.md.

Secrets: secret/k8s-upgrade/{ssh_key,ssh_key_pub,slack_webhook} (ed25519
keypair distributed to all 5 nodes via authorized_keys; slack_webhook
reuses kured webhook URL on initial deploy).

											
										
										
											2026-05-10 19:07:42 +00:00
 								```bash
-												k8s-version-upgrade: decompose into Job chain to fix self-preemption

The agent-based v1 ran inside claude-agent-service (replicas=1, no
nodeSelector) and self-evicted when it tried to drain its host (k8s-node4
on 2026-05-11). Cluster ended half-upgraded (master v1.34.7, workers
v1.34.2) until manual recovery.

Rewrite the pipeline as a chain of nodeSelector-pinned Jobs:

  preflight (k8s-node1)
    → master   (k8s-node1)   drains k8s-master
    → worker × 4 (k8s-node1) drains k8s-node{4,3,2}
    → worker   (k8s-master + control-plane toleration) drains k8s-node1
    → postflight (no pinning)

Each Job runs scripts/upgrade-step.sh (case-on-$PHASE) and ends by
envsubst-ing job-template.yaml into the next Job. Deterministic names
(k8s-upgrade-<phase>-<target_version>[-<node>]) make `kubectl apply`
idempotent — a failed Job can be re-created without duplicating
downstream.

Also lands `predrain_unstick`: deletes pods on the target node whose PDB
has 0 disruptionsAllowed. Without this, drain loops indefinitely on
single-replica deployments (e.g. every Anubis instance — discovered the
hard way during 2026-05-11 manual recovery of k8s-node3).

Adds K8sUpgradeStalled alert (in_flight + started_timestamp > 90 min).
Deprecates the agent prompt (renamed to *.deprecated.md with a header
pointer to the new code).

Apply order: k8s-version-upgrade first (consumes new SA + ConfigMaps),
then monitoring (loads the new alert). Both applied 2026-05-11.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

											
										
										
											2026-05-11 23:54:05 +00:00
+								# 1. Identify the failed Job
 								kubectl -n k8s-upgrade get jobs -l app=k8s-upgrade-chain
 								kubectl -n k8s-upgrade describe job/<failed-job-name> | tail -50
 								kubectl -n k8s-upgrade logs job/<failed-job-name>
 								# 2. Diagnose. Common causes:
 								#    - drain stuck on PDB-violating pod (predrain_unstick should handle this;
 								#      but a brand-new PDB pattern could escape it — manually delete the pod)
 								#    - SSH from Job pod failing (node restarted? known_hosts mismatch?)
 								#    - kubeadm upgrade failed on a node (check journalctl + apt history on that node)
 								# 3. Fix the root cause first.
 								# 4. Delete the failed Job + re-spawn it. Naming is deterministic so
 								#    `kubectl apply` of the same name reconciles to a single Job.
 								kubectl -n k8s-upgrade delete job/<failed-job-name>
 								# 5. Manually render + apply the same Job. Pull the template + spec from the
 								#    next-Job-creation block in upgrade-step.sh — easiest is to copy from a
 								#    sibling Job's YAML:
 								kubectl -n k8s-upgrade get job/<sibling-job-name> -o yaml \
 								  | yq 'del(.metadata.resourceVersion, .metadata.uid, .metadata.creationTimestamp, .metadata.managedFields, .status)' \
 								  | yq '.metadata.name = "<failed-job-name>"' \
 								  | yq '.spec.template.spec.containers[0].env[] | select(.name=="PHASE") .value = "<right-phase>"' \
 								  | kubectl apply -f -
 								# The chain will continue from there. The next-Job-creation step in upgrade-step.sh
 								# is idempotent (deterministic name) so re-running won't duplicate downstream.
-												k8s-version-upgrade: automated kubeadm/kubelet/kubectl upgrade pipeline

Adds a weekly detection CronJob (Sun 12:00 UTC) that probes apt-cache madison
on master for new patches + HEAD pkgs.k8s.io for next-minor availability,
then POSTs to claude-agent-service to dispatch the k8s-version-upgrade agent.

The agent (.claude/agents/k8s-version-upgrade.md) orchestrates:
  pre-flight (5 nodes Ready + halt-on-alert + 24h-quiet + plan target match)
    -> etcd snapshot save
    -> optional master containerd skew fix
    -> apt repo URL rewrite (minor bumps only)
    -> drain/upgrade/uncordon master via ssh < update_k8s.sh
    -> sequential workers k8s-node4 -> 3 -> 2 -> 1 with 10-min soak each
    -> post-flight verification

Two new Upgrade Gates alerts catch failure modes:
  - K8sVersionSkew (kubelet/apiserver gitVersion mismatch >30m)
  - EtcdPreUpgradeSnapshotMissing (in_flight without snapshot_taken >10m)

update_k8s.sh refactored to take --role / --release args; the agent shells
it into each node via SSH pipe. update_node.sh annotated as OS-major path.

Operator-facing docs: docs/runbooks/k8s-version-upgrade.md and a new section
in docs/architecture/automated-upgrades.md.

Secrets: secret/k8s-upgrade/{ssh_key,ssh_key_pub,slack_webhook} (ed25519
keypair distributed to all 5 nodes via authorized_keys; slack_webhook
reuses kured webhook URL on initial deploy).

											
										
										
											2026-05-10 19:07:42 +00:00
+								```
-												k8s-version-upgrade: decompose into Job chain to fix self-preemption

The agent-based v1 ran inside claude-agent-service (replicas=1, no
nodeSelector) and self-evicted when it tried to drain its host (k8s-node4
on 2026-05-11). Cluster ended half-upgraded (master v1.34.7, workers
v1.34.2) until manual recovery.

Rewrite the pipeline as a chain of nodeSelector-pinned Jobs:

  preflight (k8s-node1)
    → master   (k8s-node1)   drains k8s-master
    → worker × 4 (k8s-node1) drains k8s-node{4,3,2}
    → worker   (k8s-master + control-plane toleration) drains k8s-node1
    → postflight (no pinning)

Each Job runs scripts/upgrade-step.sh (case-on-$PHASE) and ends by
envsubst-ing job-template.yaml into the next Job. Deterministic names
(k8s-upgrade-<phase>-<target_version>[-<node>]) make `kubectl apply`
idempotent — a failed Job can be re-created without duplicating
downstream.

Also lands `predrain_unstick`: deletes pods on the target node whose PDB
has 0 disruptionsAllowed. Without this, drain loops indefinitely on
single-replica deployments (e.g. every Anubis instance — discovered the
hard way during 2026-05-11 manual recovery of k8s-node3).

Adds K8sUpgradeStalled alert (in_flight + started_timestamp > 90 min).
Deprecates the agent prompt (renamed to *.deprecated.md with a header
pointer to the new code).

Apply order: k8s-version-upgrade first (consumes new SA + ConfigMaps),
then monitoring (loads the new alert). Both applied 2026-05-11.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

											
										
										
											2026-05-11 23:54:05 +00:00
+								### Skip a phase (advanced; use sparingly)
 								If you've already done the work for a phase manually and want the chain to
 								jump past it, manually create the NEXT phase's Job with the deterministic
 								name. The previous phase's spawn-next will see the Job already exists and
 								short-circuit. Example: master already on target; jump straight to worker:
-												k8s-version-upgrade: automated kubeadm/kubelet/kubectl upgrade pipeline

Adds a weekly detection CronJob (Sun 12:00 UTC) that probes apt-cache madison
on master for new patches + HEAD pkgs.k8s.io for next-minor availability,
then POSTs to claude-agent-service to dispatch the k8s-version-upgrade agent.

The agent (.claude/agents/k8s-version-upgrade.md) orchestrates:
  pre-flight (5 nodes Ready + halt-on-alert + 24h-quiet + plan target match)
    -> etcd snapshot save
    -> optional master containerd skew fix
    -> apt repo URL rewrite (minor bumps only)
    -> drain/upgrade/uncordon master via ssh < update_k8s.sh
    -> sequential workers k8s-node4 -> 3 -> 2 -> 1 with 10-min soak each
    -> post-flight verification

Two new Upgrade Gates alerts catch failure modes:
  - K8sVersionSkew (kubelet/apiserver gitVersion mismatch >30m)
  - EtcdPreUpgradeSnapshotMissing (in_flight without snapshot_taken >10m)

update_k8s.sh refactored to take --role / --release args; the agent shells
it into each node via SSH pipe. update_node.sh annotated as OS-major path.

Operator-facing docs: docs/runbooks/k8s-version-upgrade.md and a new section
in docs/architecture/automated-upgrades.md.

Secrets: secret/k8s-upgrade/{ssh_key,ssh_key_pub,slack_webhook} (ed25519
keypair distributed to all 5 nodes via authorized_keys; slack_webhook
reuses kured webhook URL on initial deploy).

											
										
										
											2026-05-10 19:07:42 +00:00
+								```bash
-												k8s-version-upgrade: decompose into Job chain to fix self-preemption

The agent-based v1 ran inside claude-agent-service (replicas=1, no
nodeSelector) and self-evicted when it tried to drain its host (k8s-node4
on 2026-05-11). Cluster ended half-upgraded (master v1.34.7, workers
v1.34.2) until manual recovery.

Rewrite the pipeline as a chain of nodeSelector-pinned Jobs:

  preflight (k8s-node1)
    → master   (k8s-node1)   drains k8s-master
    → worker × 4 (k8s-node1) drains k8s-node{4,3,2}
    → worker   (k8s-master + control-plane toleration) drains k8s-node1
    → postflight (no pinning)

Each Job runs scripts/upgrade-step.sh (case-on-$PHASE) and ends by
envsubst-ing job-template.yaml into the next Job. Deterministic names
(k8s-upgrade-<phase>-<target_version>[-<node>]) make `kubectl apply`
idempotent — a failed Job can be re-created without duplicating
downstream.

Also lands `predrain_unstick`: deletes pods on the target node whose PDB
has 0 disruptionsAllowed. Without this, drain loops indefinitely on
single-replica deployments (e.g. every Anubis instance — discovered the
hard way during 2026-05-11 manual recovery of k8s-node3).

Adds K8sUpgradeStalled alert (in_flight + started_timestamp > 90 min).
Deprecates the agent prompt (renamed to *.deprecated.md with a header
pointer to the new code).

Apply order: k8s-version-upgrade first (consumes new SA + ConfigMaps),
then monitoring (loads the new alert). Both applied 2026-05-11.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

											
										
										
											2026-05-11 23:54:05 +00:00
+								TARGET=1.34.7
 								TGT_LBL=${TARGET//./-}
 								# (compose Job from upgrade-step.sh spawn_next code, name=k8s-upgrade-worker-$TGT_LBL-k8s-node4, run on k8s-node1)
-												k8s-version-upgrade: automated kubeadm/kubelet/kubectl upgrade pipeline

Adds a weekly detection CronJob (Sun 12:00 UTC) that probes apt-cache madison
on master for new patches + HEAD pkgs.k8s.io for next-minor availability,
then POSTs to claude-agent-service to dispatch the k8s-version-upgrade agent.

The agent (.claude/agents/k8s-version-upgrade.md) orchestrates:
  pre-flight (5 nodes Ready + halt-on-alert + 24h-quiet + plan target match)
    -> etcd snapshot save
    -> optional master containerd skew fix
    -> apt repo URL rewrite (minor bumps only)
    -> drain/upgrade/uncordon master via ssh < update_k8s.sh
    -> sequential workers k8s-node4 -> 3 -> 2 -> 1 with 10-min soak each
    -> post-flight verification

Two new Upgrade Gates alerts catch failure modes:
  - K8sVersionSkew (kubelet/apiserver gitVersion mismatch >30m)
  - EtcdPreUpgradeSnapshotMissing (in_flight without snapshot_taken >10m)

update_k8s.sh refactored to take --role / --release args; the agent shells
it into each node via SSH pipe. update_node.sh annotated as OS-major path.

Operator-facing docs: docs/runbooks/k8s-version-upgrade.md and a new section
in docs/architecture/automated-upgrades.md.

Secrets: secret/k8s-upgrade/{ssh_key,ssh_key_pub,slack_webhook} (ed25519
keypair distributed to all 5 nodes via authorized_keys; slack_webhook
reuses kured webhook URL on initial deploy).

											
										
										
											2026-05-10 19:07:42 +00:00
+								```
 								### Halt the pipeline in an emergency
 								```bash
-												k8s-version-upgrade: decompose into Job chain to fix self-preemption

The agent-based v1 ran inside claude-agent-service (replicas=1, no
nodeSelector) and self-evicted when it tried to drain its host (k8s-node4
on 2026-05-11). Cluster ended half-upgraded (master v1.34.7, workers
v1.34.2) until manual recovery.

Rewrite the pipeline as a chain of nodeSelector-pinned Jobs:

  preflight (k8s-node1)
    → master   (k8s-node1)   drains k8s-master
    → worker × 4 (k8s-node1) drains k8s-node{4,3,2}
    → worker   (k8s-master + control-plane toleration) drains k8s-node1
    → postflight (no pinning)

Each Job runs scripts/upgrade-step.sh (case-on-$PHASE) and ends by
envsubst-ing job-template.yaml into the next Job. Deterministic names
(k8s-upgrade-<phase>-<target_version>[-<node>]) make `kubectl apply`
idempotent — a failed Job can be re-created without duplicating
downstream.

Also lands `predrain_unstick`: deletes pods on the target node whose PDB
has 0 disruptionsAllowed. Without this, drain loops indefinitely on
single-replica deployments (e.g. every Anubis instance — discovered the
hard way during 2026-05-11 manual recovery of k8s-node3).

Adds K8sUpgradeStalled alert (in_flight + started_timestamp > 90 min).
Deprecates the agent prompt (renamed to *.deprecated.md with a header
pointer to the new code).

Apply order: k8s-version-upgrade first (consumes new SA + ConfigMaps),
then monitoring (loads the new alert). Both applied 2026-05-11.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

											
										
										
											2026-05-11 23:54:05 +00:00
+								# Option 1: suspend the detection CronJob (won't stop an in-flight chain)
-												k8s-version-upgrade: automated kubeadm/kubelet/kubectl upgrade pipeline

Adds a weekly detection CronJob (Sun 12:00 UTC) that probes apt-cache madison
on master for new patches + HEAD pkgs.k8s.io for next-minor availability,
then POSTs to claude-agent-service to dispatch the k8s-version-upgrade agent.

The agent (.claude/agents/k8s-version-upgrade.md) orchestrates:
  pre-flight (5 nodes Ready + halt-on-alert + 24h-quiet + plan target match)
    -> etcd snapshot save
    -> optional master containerd skew fix
    -> apt repo URL rewrite (minor bumps only)
    -> drain/upgrade/uncordon master via ssh < update_k8s.sh
    -> sequential workers k8s-node4 -> 3 -> 2 -> 1 with 10-min soak each
    -> post-flight verification

Two new Upgrade Gates alerts catch failure modes:
  - K8sVersionSkew (kubelet/apiserver gitVersion mismatch >30m)
  - EtcdPreUpgradeSnapshotMissing (in_flight without snapshot_taken >10m)

update_k8s.sh refactored to take --role / --release args; the agent shells
it into each node via SSH pipe. update_node.sh annotated as OS-major path.

Operator-facing docs: docs/runbooks/k8s-version-upgrade.md and a new section
in docs/architecture/automated-upgrades.md.

Secrets: secret/k8s-upgrade/{ssh_key,ssh_key_pub,slack_webhook} (ed25519
keypair distributed to all 5 nodes via authorized_keys; slack_webhook
reuses kured webhook URL on initial deploy).

											
										
										
											2026-05-10 19:07:42 +00:00
+								kubectl -n k8s-upgrade patch cronjob k8s-version-check \
 								  -p '{"spec":{"suspend":true}}' --type=merge
-												k8s-version-upgrade: decompose into Job chain to fix self-preemption

The agent-based v1 ran inside claude-agent-service (replicas=1, no
nodeSelector) and self-evicted when it tried to drain its host (k8s-node4
on 2026-05-11). Cluster ended half-upgraded (master v1.34.7, workers
v1.34.2) until manual recovery.

Rewrite the pipeline as a chain of nodeSelector-pinned Jobs:

  preflight (k8s-node1)
    → master   (k8s-node1)   drains k8s-master
    → worker × 4 (k8s-node1) drains k8s-node{4,3,2}
    → worker   (k8s-master + control-plane toleration) drains k8s-node1
    → postflight (no pinning)

Each Job runs scripts/upgrade-step.sh (case-on-$PHASE) and ends by
envsubst-ing job-template.yaml into the next Job. Deterministic names
(k8s-upgrade-<phase>-<target_version>[-<node>]) make `kubectl apply`
idempotent — a failed Job can be re-created without duplicating
downstream.

Also lands `predrain_unstick`: deletes pods on the target node whose PDB
has 0 disruptionsAllowed. Without this, drain loops indefinitely on
single-replica deployments (e.g. every Anubis instance — discovered the
hard way during 2026-05-11 manual recovery of k8s-node3).

Adds K8sUpgradeStalled alert (in_flight + started_timestamp > 90 min).
Deprecates the agent prompt (renamed to *.deprecated.md with a header
pointer to the new code).

Apply order: k8s-version-upgrade first (consumes new SA + ConfigMaps),
then monitoring (loads the new alert). Both applied 2026-05-11.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

											
										
										
											2026-05-11 23:54:05 +00:00
+								# Re-enable: -p '{"spec":{"suspend":false}}'
 								# Option 2: delete all in-flight chain Jobs
 								kubectl -n k8s-upgrade delete jobs -l app=k8s-upgrade-chain
 								# This leaves the in-flight annotation + Pushgateway gauge intact —
 								# K8sUpgradeStalled will fire to surface the halt.
 								# Option 3: force a blocker alert (same regex kured uses)
 								# — see k8s-node-auto-upgrades.md "Force halt by adding a custom blocker alert"
-												k8s-version-upgrade: automated kubeadm/kubelet/kubectl upgrade pipeline

Adds a weekly detection CronJob (Sun 12:00 UTC) that probes apt-cache madison
on master for new patches + HEAD pkgs.k8s.io for next-minor availability,
then POSTs to claude-agent-service to dispatch the k8s-version-upgrade agent.

The agent (.claude/agents/k8s-version-upgrade.md) orchestrates:
  pre-flight (5 nodes Ready + halt-on-alert + 24h-quiet + plan target match)
    -> etcd snapshot save
    -> optional master containerd skew fix
    -> apt repo URL rewrite (minor bumps only)
    -> drain/upgrade/uncordon master via ssh < update_k8s.sh
    -> sequential workers k8s-node4 -> 3 -> 2 -> 1 with 10-min soak each
    -> post-flight verification

Two new Upgrade Gates alerts catch failure modes:
  - K8sVersionSkew (kubelet/apiserver gitVersion mismatch >30m)
  - EtcdPreUpgradeSnapshotMissing (in_flight without snapshot_taken >10m)

update_k8s.sh refactored to take --role / --release args; the agent shells
it into each node via SSH pipe. update_node.sh annotated as OS-major path.

Operator-facing docs: docs/runbooks/k8s-version-upgrade.md and a new section
in docs/architecture/automated-upgrades.md.

Secrets: secret/k8s-upgrade/{ssh_key,ssh_key_pub,slack_webhook} (ed25519
keypair distributed to all 5 nodes via authorized_keys; slack_webhook
reuses kured webhook URL on initial deploy).

											
										
										
											2026-05-10 19:07:42 +00:00
+								```
-												k8s-version-upgrade: decompose into Job chain to fix self-preemption

The agent-based v1 ran inside claude-agent-service (replicas=1, no
nodeSelector) and self-evicted when it tried to drain its host (k8s-node4
on 2026-05-11). Cluster ended half-upgraded (master v1.34.7, workers
v1.34.2) until manual recovery.

Rewrite the pipeline as a chain of nodeSelector-pinned Jobs:

  preflight (k8s-node1)
    → master   (k8s-node1)   drains k8s-master
    → worker × 4 (k8s-node1) drains k8s-node{4,3,2}
    → worker   (k8s-master + control-plane toleration) drains k8s-node1
    → postflight (no pinning)

Each Job runs scripts/upgrade-step.sh (case-on-$PHASE) and ends by
envsubst-ing job-template.yaml into the next Job. Deterministic names
(k8s-upgrade-<phase>-<target_version>[-<node>]) make `kubectl apply`
idempotent — a failed Job can be re-created without duplicating
downstream.

Also lands `predrain_unstick`: deletes pods on the target node whose PDB
has 0 disruptionsAllowed. Without this, drain loops indefinitely on
single-replica deployments (e.g. every Anubis instance — discovered the
hard way during 2026-05-11 manual recovery of k8s-node3).

Adds K8sUpgradeStalled alert (in_flight + started_timestamp > 90 min).
Deprecates the agent prompt (renamed to *.deprecated.md with a header
pointer to the new code).

Apply order: k8s-version-upgrade first (consumes new SA + ConfigMaps),
then monitoring (loads the new alert). Both applied 2026-05-11.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

											
										
										
											2026-05-11 23:54:05 +00:00
+								### Clear orphaned in-flight state
 								After deciding NOT to retry a halted chain:
-												k8s-version-upgrade: automated kubeadm/kubelet/kubectl upgrade pipeline

Adds a weekly detection CronJob (Sun 12:00 UTC) that probes apt-cache madison
on master for new patches + HEAD pkgs.k8s.io for next-minor availability,
then POSTs to claude-agent-service to dispatch the k8s-version-upgrade agent.

The agent (.claude/agents/k8s-version-upgrade.md) orchestrates:
  pre-flight (5 nodes Ready + halt-on-alert + 24h-quiet + plan target match)
    -> etcd snapshot save
    -> optional master containerd skew fix
    -> apt repo URL rewrite (minor bumps only)
    -> drain/upgrade/uncordon master via ssh < update_k8s.sh
    -> sequential workers k8s-node4 -> 3 -> 2 -> 1 with 10-min soak each
    -> post-flight verification

Two new Upgrade Gates alerts catch failure modes:
  - K8sVersionSkew (kubelet/apiserver gitVersion mismatch >30m)
  - EtcdPreUpgradeSnapshotMissing (in_flight without snapshot_taken >10m)

update_k8s.sh refactored to take --role / --release args; the agent shells
it into each node via SSH pipe. update_node.sh annotated as OS-major path.

Operator-facing docs: docs/runbooks/k8s-version-upgrade.md and a new section
in docs/architecture/automated-upgrades.md.

Secrets: secret/k8s-upgrade/{ssh_key,ssh_key_pub,slack_webhook} (ed25519
keypair distributed to all 5 nodes via authorized_keys; slack_webhook
reuses kured webhook URL on initial deploy).

											
										
										
											2026-05-10 19:07:42 +00:00
-												k8s-version-upgrade: decompose into Job chain to fix self-preemption

The agent-based v1 ran inside claude-agent-service (replicas=1, no
nodeSelector) and self-evicted when it tried to drain its host (k8s-node4
on 2026-05-11). Cluster ended half-upgraded (master v1.34.7, workers
v1.34.2) until manual recovery.

Rewrite the pipeline as a chain of nodeSelector-pinned Jobs:

  preflight (k8s-node1)
    → master   (k8s-node1)   drains k8s-master
    → worker × 4 (k8s-node1) drains k8s-node{4,3,2}
    → worker   (k8s-master + control-plane toleration) drains k8s-node1
    → postflight (no pinning)

Each Job runs scripts/upgrade-step.sh (case-on-$PHASE) and ends by
envsubst-ing job-template.yaml into the next Job. Deterministic names
(k8s-upgrade-<phase>-<target_version>[-<node>]) make `kubectl apply`
idempotent — a failed Job can be re-created without duplicating
downstream.

Also lands `predrain_unstick`: deletes pods on the target node whose PDB
has 0 disruptionsAllowed. Without this, drain loops indefinitely on
single-replica deployments (e.g. every Anubis instance — discovered the
hard way during 2026-05-11 manual recovery of k8s-node3).

Adds K8sUpgradeStalled alert (in_flight + started_timestamp > 90 min).
Deprecates the agent prompt (renamed to *.deprecated.md with a header
pointer to the new code).

Apply order: k8s-version-upgrade first (consumes new SA + ConfigMaps),
then monitoring (loads the new alert). Both applied 2026-05-11.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

											
										
										
											2026-05-11 23:54:05 +00:00
+								```bash
 								kubectl annotate ns k8s-upgrade \
 								  viktorbarzin.me/k8s-upgrade-in-flight- \
 								  viktorbarzin.me/k8s-upgrade-target- \
 								  viktorbarzin.me/k8s-upgrade-snapshot-path-
 								# Reset Pushgateway gauges so K8sUpgradeStalled / EtcdPreUpgradeSnapshotMissing clear:
 								kubectl -n monitoring port-forward svc/prometheus-prometheus-pushgateway 9091:9091 &
 								printf '# TYPE k8s_upgrade_in_flight gauge\nk8s_upgrade_in_flight 0\n# TYPE k8s_upgrade_snapshot_taken gauge\nk8s_upgrade_snapshot_taken 0\n# TYPE k8s_upgrade_started_timestamp gauge\nk8s_upgrade_started_timestamp 0\n' \
 								  | curl --data-binary @- http://localhost:9091/metrics/job/k8s-version-upgrade
 								kill %1
 								```
 								### Rollback paths
-												k8s-version-upgrade: automated kubeadm/kubelet/kubectl upgrade pipeline

Adds a weekly detection CronJob (Sun 12:00 UTC) that probes apt-cache madison
on master for new patches + HEAD pkgs.k8s.io for next-minor availability,
then POSTs to claude-agent-service to dispatch the k8s-version-upgrade agent.

The agent (.claude/agents/k8s-version-upgrade.md) orchestrates:
  pre-flight (5 nodes Ready + halt-on-alert + 24h-quiet + plan target match)
    -> etcd snapshot save
    -> optional master containerd skew fix
    -> apt repo URL rewrite (minor bumps only)
    -> drain/upgrade/uncordon master via ssh < update_k8s.sh
    -> sequential workers k8s-node4 -> 3 -> 2 -> 1 with 10-min soak each
    -> post-flight verification

Two new Upgrade Gates alerts catch failure modes:
  - K8sVersionSkew (kubelet/apiserver gitVersion mismatch >30m)
  - EtcdPreUpgradeSnapshotMissing (in_flight without snapshot_taken >10m)

update_k8s.sh refactored to take --role / --release args; the agent shells
it into each node via SSH pipe. update_node.sh annotated as OS-major path.

Operator-facing docs: docs/runbooks/k8s-version-upgrade.md and a new section
in docs/architecture/automated-upgrades.md.

Secrets: secret/k8s-upgrade/{ssh_key,ssh_key_pub,slack_webhook} (ed25519
keypair distributed to all 5 nodes via authorized_keys; slack_webhook
reuses kured webhook URL on initial deploy).

											
										
										
											2026-05-10 19:07:42 +00:00
+								`kubeadm` does **not** support in-place downgrade. If a run fails:
 								#### Master broke during/after kubeadm upgrade
 . Identify the etcd snapshot: `kubectl get ns k8s-upgrade -o jsonpath='{.metadata.annotations.viktorbarzin\.me/k8s-upgrade-snapshot-path}'`
 . Restore etcd per `infra/docs/runbooks/restore-etcd.md`.
 . Manually downgrade master `kubeadm`/`kubelet`/`kubectl` to the pre-upgrade version. Find versions in `/var/log/apt/history.log` on the node:
 								   ```bash
 								   ssh wizard@k8s-master 'sudo cat /var/log/apt/history.log | tail -40'
 								   # Pre-upgrade versions are in the most recent "Commandline: apt-get install"
 								   sudo apt-mark unhold kubeadm kubelet kubectl
 								   sudo apt-get install --allow-downgrades -y \
 								     kubeadm=<OLD>-1.1 kubelet=<OLD>-1.1 kubectl=<OLD>-1.1
 								   sudo apt-mark hold kubeadm kubelet kubectl
 								   sudo systemctl daemon-reload && sudo systemctl restart kubelet
 								   ```
 								#### Worker broke
 . `kubectl drain <node> --ignore-daemonsets --delete-emptydir-data --force --grace-period=300`
 . Downgrade apt packages on that node only (see above)
 . `kubectl uncordon <node>`
 . The cluster continues running on the master + remaining workers throughout
 								### One-shot SSH key rotation
 . Generate new keypair: `ssh-keygen -t ed25519 -f /tmp/k8s-upgrade -N ""`
 . Update Vault:
 								   ```bash
 								   vault kv patch secret/k8s-upgrade \
 								     ssh_key=@/tmp/k8s-upgrade \
 								     ssh_key_pub=@/tmp/k8s-upgrade.pub
 								   ```
 . Push the new pubkey to every node:
 								   ```bash
 								   for n in k8s-master k8s-node1 k8s-node2 k8s-node3 k8s-node4; do
 								     ssh wizard@$n 'sed -i "/k8s-upgrade-key$/d" ~/.ssh/authorized_keys'
 								     ssh wizard@$n 'echo "$(cat /tmp/k8s-upgrade.pub) k8s-upgrade-key" >> ~/.ssh/authorized_keys'
 								   done
 								   ```
-												k8s-version-upgrade: decompose into Job chain to fix self-preemption

The agent-based v1 ran inside claude-agent-service (replicas=1, no
nodeSelector) and self-evicted when it tried to drain its host (k8s-node4
on 2026-05-11). Cluster ended half-upgraded (master v1.34.7, workers
v1.34.2) until manual recovery.

Rewrite the pipeline as a chain of nodeSelector-pinned Jobs:

  preflight (k8s-node1)
    → master   (k8s-node1)   drains k8s-master
    → worker × 4 (k8s-node1) drains k8s-node{4,3,2}
    → worker   (k8s-master + control-plane toleration) drains k8s-node1
    → postflight (no pinning)

Each Job runs scripts/upgrade-step.sh (case-on-$PHASE) and ends by
envsubst-ing job-template.yaml into the next Job. Deterministic names
(k8s-upgrade-<phase>-<target_version>[-<node>]) make `kubectl apply`
idempotent — a failed Job can be re-created without duplicating
downstream.

Also lands `predrain_unstick`: deletes pods on the target node whose PDB
has 0 disruptionsAllowed. Without this, drain loops indefinitely on
single-replica deployments (e.g. every Anubis instance — discovered the
hard way during 2026-05-11 manual recovery of k8s-node3).

Adds K8sUpgradeStalled alert (in_flight + started_timestamp > 90 min).
Deprecates the agent prompt (renamed to *.deprecated.md with a header
pointer to the new code).

Apply order: k8s-version-upgrade first (consumes new SA + ConfigMaps),
then monitoring (loads the new alert). Both applied 2026-05-11.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

											
										
										
											2026-05-11 23:54:05 +00:00
+. ESO refreshes within 15 min — or force: `kubectl -n k8s-upgrade annotate externalsecret k8s-upgrade-creds force-sync=$(date +%s) --overwrite`
-												k8s-version-upgrade: automated kubeadm/kubelet/kubectl upgrade pipeline

Adds a weekly detection CronJob (Sun 12:00 UTC) that probes apt-cache madison
on master for new patches + HEAD pkgs.k8s.io for next-minor availability,
then POSTs to claude-agent-service to dispatch the k8s-version-upgrade agent.

The agent (.claude/agents/k8s-version-upgrade.md) orchestrates:
  pre-flight (5 nodes Ready + halt-on-alert + 24h-quiet + plan target match)
    -> etcd snapshot save
    -> optional master containerd skew fix
    -> apt repo URL rewrite (minor bumps only)
    -> drain/upgrade/uncordon master via ssh < update_k8s.sh
    -> sequential workers k8s-node4 -> 3 -> 2 -> 1 with 10-min soak each
    -> post-flight verification

Two new Upgrade Gates alerts catch failure modes:
  - K8sVersionSkew (kubelet/apiserver gitVersion mismatch >30m)
  - EtcdPreUpgradeSnapshotMissing (in_flight without snapshot_taken >10m)

update_k8s.sh refactored to take --role / --release args; the agent shells
it into each node via SSH pipe. update_node.sh annotated as OS-major path.

Operator-facing docs: docs/runbooks/k8s-version-upgrade.md and a new section
in docs/architecture/automated-upgrades.md.

Secrets: secret/k8s-upgrade/{ssh_key,ssh_key_pub,slack_webhook} (ed25519
keypair distributed to all 5 nodes via authorized_keys; slack_webhook
reuses kured webhook URL on initial deploy).

											
										
										
											2026-05-10 19:07:42 +00:00
 								## Past Incidents
-												k8s-version-upgrade: decompose into Job chain to fix self-preemption

The agent-based v1 ran inside claude-agent-service (replicas=1, no
nodeSelector) and self-evicted when it tried to drain its host (k8s-node4
on 2026-05-11). Cluster ended half-upgraded (master v1.34.7, workers
v1.34.2) until manual recovery.

Rewrite the pipeline as a chain of nodeSelector-pinned Jobs:

  preflight (k8s-node1)
    → master   (k8s-node1)   drains k8s-master
    → worker × 4 (k8s-node1) drains k8s-node{4,3,2}
    → worker   (k8s-master + control-plane toleration) drains k8s-node1
    → postflight (no pinning)

Each Job runs scripts/upgrade-step.sh (case-on-$PHASE) and ends by
envsubst-ing job-template.yaml into the next Job. Deterministic names
(k8s-upgrade-<phase>-<target_version>[-<node>]) make `kubectl apply`
idempotent — a failed Job can be re-created without duplicating
downstream.

Also lands `predrain_unstick`: deletes pods on the target node whose PDB
has 0 disruptionsAllowed. Without this, drain loops indefinitely on
single-replica deployments (e.g. every Anubis instance — discovered the
hard way during 2026-05-11 manual recovery of k8s-node3).

Adds K8sUpgradeStalled alert (in_flight + started_timestamp > 90 min).
Deprecates the agent prompt (renamed to *.deprecated.md with a header
pointer to the new code).

Apply order: k8s-version-upgrade first (consumes new SA + ConfigMaps),
then monitoring (loads the new alert). Both applied 2026-05-11.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

											
										
										
											2026-05-11 23:54:05 +00:00
+								### 2026-05-11 — Self-preemption (agent → Job-chain rewrite)
 								- The v1 agent ran inside the `claude-agent-service` Deployment (replicas=1, no nodeSelector) and was scheduled to k8s-node4.
 								- During Stage 6 (first worker drain) the agent ran `kubectl drain k8s-node4` — evicting itself.
 								- The bash process died after the drain but before the SSH-pipe to install kubeadm on node4.
 								- Node4 was left cordoned; cluster stuck at master v1.34.7, workers v1.34.2 until manual recovery.
 								- **Mitigation**: rewrote the pipeline as a chain of Jobs, each `nodeSelector`-pinned to a non-target node. New `predrain_unstick` step deletes PDB-blocked single-replica pods (Anubis pattern) before drain so they don't loop forever. Added `K8sUpgradeStalled` alert (in-flight + started_timestamp > 90 min).
-												k8s-version-upgrade: automated kubeadm/kubelet/kubectl upgrade pipeline

Adds a weekly detection CronJob (Sun 12:00 UTC) that probes apt-cache madison
on master for new patches + HEAD pkgs.k8s.io for next-minor availability,
then POSTs to claude-agent-service to dispatch the k8s-version-upgrade agent.

The agent (.claude/agents/k8s-version-upgrade.md) orchestrates:
  pre-flight (5 nodes Ready + halt-on-alert + 24h-quiet + plan target match)
    -> etcd snapshot save
    -> optional master containerd skew fix
    -> apt repo URL rewrite (minor bumps only)
    -> drain/upgrade/uncordon master via ssh < update_k8s.sh
    -> sequential workers k8s-node4 -> 3 -> 2 -> 1 with 10-min soak each
    -> post-flight verification

Two new Upgrade Gates alerts catch failure modes:
  - K8sVersionSkew (kubelet/apiserver gitVersion mismatch >30m)
  - EtcdPreUpgradeSnapshotMissing (in_flight without snapshot_taken >10m)

update_k8s.sh refactored to take --role / --release args; the agent shells
it into each node via SSH pipe. update_node.sh annotated as OS-major path.

Operator-facing docs: docs/runbooks/k8s-version-upgrade.md and a new section
in docs/architecture/automated-upgrades.md.

Secrets: secret/k8s-upgrade/{ssh_key,ssh_key_pub,slack_webhook} (ed25519
keypair distributed to all 5 nodes via authorized_keys; slack_webhook
reuses kured webhook URL on initial deploy).

											
										
										
											2026-05-10 19:07:42 +00:00
 								## File Pointers
 								| What | Where |
 								|------|-------|
-												k8s-version-upgrade: decompose into Job chain to fix self-preemption

The agent-based v1 ran inside claude-agent-service (replicas=1, no
nodeSelector) and self-evicted when it tried to drain its host (k8s-node4
on 2026-05-11). Cluster ended half-upgraded (master v1.34.7, workers
v1.34.2) until manual recovery.

Rewrite the pipeline as a chain of nodeSelector-pinned Jobs:

  preflight (k8s-node1)
    → master   (k8s-node1)   drains k8s-master
    → worker × 4 (k8s-node1) drains k8s-node{4,3,2}
    → worker   (k8s-master + control-plane toleration) drains k8s-node1
    → postflight (no pinning)

Each Job runs scripts/upgrade-step.sh (case-on-$PHASE) and ends by
envsubst-ing job-template.yaml into the next Job. Deterministic names
(k8s-upgrade-<phase>-<target_version>[-<node>]) make `kubectl apply`
idempotent — a failed Job can be re-created without duplicating
downstream.

Also lands `predrain_unstick`: deletes pods on the target node whose PDB
has 0 disruptionsAllowed. Without this, drain loops indefinitely on
single-replica deployments (e.g. every Anubis instance — discovered the
hard way during 2026-05-11 manual recovery of k8s-node3).

Adds K8sUpgradeStalled alert (in_flight + started_timestamp > 90 min).
Deprecates the agent prompt (renamed to *.deprecated.md with a header
pointer to the new code).

Apply order: k8s-version-upgrade first (consumes new SA + ConfigMaps),
then monitoring (loads the new alert). Both applied 2026-05-11.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

											
										
										
											2026-05-11 23:54:05 +00:00
+								| Stack (CronJob + ConfigMaps + SA/RBAC + ExternalSecret) | `infra/stacks/k8s-version-upgrade/main.tf` |
 								| Universal phase body | `infra/stacks/k8s-version-upgrade/scripts/upgrade-step.sh` |
 								| Job template | `infra/stacks/k8s-version-upgrade/job-template.yaml` |
 								| Per-node upgrade script | `infra/scripts/update_k8s.sh` |
 								| Upgrade Gates alerts | `infra/stacks/monitoring/modules/monitoring/prometheus_chart_values.tpl` (group "Upgrade Gates") |
-												k8s-version-upgrade: automated kubeadm/kubelet/kubectl upgrade pipeline

Adds a weekly detection CronJob (Sun 12:00 UTC) that probes apt-cache madison
on master for new patches + HEAD pkgs.k8s.io for next-minor availability,
then POSTs to claude-agent-service to dispatch the k8s-version-upgrade agent.

The agent (.claude/agents/k8s-version-upgrade.md) orchestrates:
  pre-flight (5 nodes Ready + halt-on-alert + 24h-quiet + plan target match)
    -> etcd snapshot save
    -> optional master containerd skew fix
    -> apt repo URL rewrite (minor bumps only)
    -> drain/upgrade/uncordon master via ssh < update_k8s.sh
    -> sequential workers k8s-node4 -> 3 -> 2 -> 1 with 10-min soak each
    -> post-flight verification

Two new Upgrade Gates alerts catch failure modes:
  - K8sVersionSkew (kubelet/apiserver gitVersion mismatch >30m)
  - EtcdPreUpgradeSnapshotMissing (in_flight without snapshot_taken >10m)

update_k8s.sh refactored to take --role / --release args; the agent shells
it into each node via SSH pipe. update_node.sh annotated as OS-major path.

Operator-facing docs: docs/runbooks/k8s-version-upgrade.md and a new section
in docs/architecture/automated-upgrades.md.

Secrets: secret/k8s-upgrade/{ssh_key,ssh_key_pub,slack_webhook} (ed25519
keypair distributed to all 5 nodes via authorized_keys; slack_webhook
reuses kured webhook URL on initial deploy).

											
										
										
											2026-05-10 19:07:42 +00:00
+								| Vault secrets | `secret/k8s-upgrade/{ssh_key, ssh_key_pub, slack_webhook}` |
-												k8s-version-upgrade: decompose into Job chain to fix self-preemption

The agent-based v1 ran inside claude-agent-service (replicas=1, no
nodeSelector) and self-evicted when it tried to drain its host (k8s-node4
on 2026-05-11). Cluster ended half-upgraded (master v1.34.7, workers
v1.34.2) until manual recovery.

Rewrite the pipeline as a chain of nodeSelector-pinned Jobs:

  preflight (k8s-node1)
    → master   (k8s-node1)   drains k8s-master
    → worker × 4 (k8s-node1) drains k8s-node{4,3,2}
    → worker   (k8s-master + control-plane toleration) drains k8s-node1
    → postflight (no pinning)

Each Job runs scripts/upgrade-step.sh (case-on-$PHASE) and ends by
envsubst-ing job-template.yaml into the next Job. Deterministic names
(k8s-upgrade-<phase>-<target_version>[-<node>]) make `kubectl apply`
idempotent — a failed Job can be re-created without duplicating
downstream.

Also lands `predrain_unstick`: deletes pods on the target node whose PDB
has 0 disruptionsAllowed. Without this, drain loops indefinitely on
single-replica deployments (e.g. every Anubis instance — discovered the
hard way during 2026-05-11 manual recovery of k8s-node3).

Adds K8sUpgradeStalled alert (in_flight + started_timestamp > 90 min).
Deprecates the agent prompt (renamed to *.deprecated.md with a header
pointer to the new code).

Apply order: k8s-version-upgrade first (consumes new SA + ConfigMaps),
then monitoring (loads the new alert). Both applied 2026-05-11.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

											
										
										
											2026-05-11 23:54:05 +00:00
+								| Architecture doc | `infra/docs/architecture/automated-upgrades.md` (K8s Version Upgrades section) |
-												k8s-version-upgrade: automated kubeadm/kubelet/kubectl upgrade pipeline

Adds a weekly detection CronJob (Sun 12:00 UTC) that probes apt-cache madison
on master for new patches + HEAD pkgs.k8s.io for next-minor availability,
then POSTs to claude-agent-service to dispatch the k8s-version-upgrade agent.

The agent (.claude/agents/k8s-version-upgrade.md) orchestrates:
  pre-flight (5 nodes Ready + halt-on-alert + 24h-quiet + plan target match)
    -> etcd snapshot save
    -> optional master containerd skew fix
    -> apt repo URL rewrite (minor bumps only)
    -> drain/upgrade/uncordon master via ssh < update_k8s.sh
    -> sequential workers k8s-node4 -> 3 -> 2 -> 1 with 10-min soak each
    -> post-flight verification

Two new Upgrade Gates alerts catch failure modes:
  - K8sVersionSkew (kubelet/apiserver gitVersion mismatch >30m)
  - EtcdPreUpgradeSnapshotMissing (in_flight without snapshot_taken >10m)

update_k8s.sh refactored to take --role / --release args; the agent shells
it into each node via SSH pipe. update_node.sh annotated as OS-major path.

Operator-facing docs: docs/runbooks/k8s-version-upgrade.md and a new section
in docs/architecture/automated-upgrades.md.

Secrets: secret/k8s-upgrade/{ssh_key,ssh_key_pub,slack_webhook} (ed25519
keypair distributed to all 5 nodes via authorized_keys; slack_webhook
reuses kured webhook URL on initial deploy).

											
										
										
											2026-05-10 19:07:42 +00:00
+								| Related (OS reboots) | `infra/docs/runbooks/k8s-node-auto-upgrades.md` |
-												k8s-version-upgrade: decompose into Job chain to fix self-preemption

The agent-based v1 ran inside claude-agent-service (replicas=1, no
nodeSelector) and self-evicted when it tried to drain its host (k8s-node4
on 2026-05-11). Cluster ended half-upgraded (master v1.34.7, workers
v1.34.2) until manual recovery.

Rewrite the pipeline as a chain of nodeSelector-pinned Jobs:

  preflight (k8s-node1)
    → master   (k8s-node1)   drains k8s-master
    → worker × 4 (k8s-node1) drains k8s-node{4,3,2}
    → worker   (k8s-master + control-plane toleration) drains k8s-node1
    → postflight (no pinning)

Each Job runs scripts/upgrade-step.sh (case-on-$PHASE) and ends by
envsubst-ing job-template.yaml into the next Job. Deterministic names
(k8s-upgrade-<phase>-<target_version>[-<node>]) make `kubectl apply`
idempotent — a failed Job can be re-created without duplicating
downstream.

Also lands `predrain_unstick`: deletes pods on the target node whose PDB
has 0 disruptionsAllowed. Without this, drain loops indefinitely on
single-replica deployments (e.g. every Anubis instance — discovered the
hard way during 2026-05-11 manual recovery of k8s-node3).

Adds K8sUpgradeStalled alert (in_flight + started_timestamp > 90 min).
Deprecates the agent prompt (renamed to *.deprecated.md with a header
pointer to the new code).

Apply order: k8s-version-upgrade first (consumes new SA + ConfigMaps),
then monitoring (loads the new alert). Both applied 2026-05-11.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

											
										
										
											2026-05-11 23:54:05 +00:00
+								| Deprecated agent prompt (reference) | `infra/.claude/agents/k8s-version-upgrade.deprecated.md` |