k8s-version-upgrade: dynamic worker enumeration + IP-based SSH (auto-cover all/new nodes)
All checks were successful
ci/woodpecker/push/default Pipeline was successful

The chain hardcoded master→node4→node3→node2→node1→postflight and SSHed by
FQDN. It silently SKIPPED node5/node6 (added 2026-05-26) — postflight would
have failed even if reachable — and node5/node6 had no .viktorbarzin.lan DNS
records, so the chain couldn't SSH to them at all.

Refactor (upgrade-step.sh):
  - Worker set + order derived live from `kubectl get nodes` (worker_nodes /
    next_pending_worker), so EVERY worker still off-target is upgraded and a
    newly-joined node is covered with zero script change.
  - SSH targets are node InternalIPs (ssh_target), removing the dependency on
    node DNS records entirely — a new node is reachable the moment it joins.
  - The two remaining hardcoded loops (containerd skew, apt-repo rewrite) now
    enumerate workers/all-nodes dynamically too.
  - Topology preserved: master-drain Job runs on the first worker; every
    worker-drain Job runs on the already-upgraded k8s-master (self-preemption
    invariant intact).
  - next_pending_worker returns 0 explicitly on the no-match path — the
    `while read … done < <(…)` loop exits 1 at EOF, which under set -e would
    abort the LAST worker's Job before it spawns postflight (cluster upgraded
    but no cleanup / in_flight reset). Caught in review.

Docs (runbook + architecture + headers) updated to the dynamic topology.

NOTE: nodes still need the k8s-upgrade SSH public key in authorized_keys; it was
deployed to node4/5/6 by hand this session. Baking it into node provisioning
(so new nodes get it automatically) is the remaining follow-up.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
This commit is contained in:
Viktor Barzin 2026-06-17 16:56:02 +00:00
parent 0c5a9b5f44
commit ed53b34bf4
4 changed files with 112 additions and 79 deletions

View file

@ -4,7 +4,7 @@ This doc covers three independent automation paths:
1. **Service-level upgrades** — Container image bumps for OSS apps (DIUN → n8n → claude-agent → Terraform). Most of this doc.
2. **OS-level upgrades on K8s nodes**`unattended-upgrades` + `kured` with sentinel-gate + Prometheus halt-on-alert. See "K8s Node OS Upgrades" section and the runbook at `docs/runbooks/k8s-node-auto-upgrades.md`.
3. **K8s component version upgrades** (kubeadm/kubelet/kubectl) — weekly detection CronJob → chain of phase Jobs (preflight → master → worker × 4 → postflight). See "K8s Version Upgrades" section and the runbook at `docs/runbooks/k8s-version-upgrade.md`.
3. **K8s component version upgrades** (kubeadm/kubelet/kubectl) — daily detection CronJob → chain of phase Jobs (preflight → master → one worker Job per worker, enumerated live → postflight). See "K8s Version Upgrades" section and the runbook at `docs/runbooks/k8s-version-upgrade.md`.
## Overview
@ -262,13 +262,14 @@ envsubst on /template/job-template.yaml | kubectl apply -f -
│ spawns Job 0 = k8s-upgrade-preflight-<target_version>
Job 0 — preflight (pinned: k8s-node1)
Job 1 — master upgrade (pinned: k8s-node1) drains k8s-master
Job 2 — worker (pinned: k8s-node1) drains k8s-node4
Job 3 — worker (pinned: k8s-node1) drains k8s-node3
Job 4 — worker (pinned: k8s-node1) drains k8s-node2
Job 5 — worker (pinned: k8s-master) drains k8s-node1 ← control-plane toleration
Job 6 — postflight (no pinning)
Job 0 — preflight (pinned: first worker)
Job 1 — master upgrade (pinned: first worker) drains k8s-master
Job 2..N — worker (pinned: k8s-master) drains each worker still off-target
← control-plane toleration; one Job
per worker, enumerated live from
`kubectl get nodes` (covers node5/6
+ any future node automatically)
Job N+1 — postflight (no pinning)
```
Each Job runs `scripts/upgrade-step.sh`, which dispatches on `$PHASE` and ends
@ -344,7 +345,7 @@ The cluster has a single control plane (no HA). A failed `kubeadm upgrade apply`
- **Mandatory etcd snapshot before every run** (even patch). Recovery point if master breaks.
- **Halt-on-alert before every drain**. Reuses the same Prometheus ignore-list regex kured uses — any unrelated cluster-health alert blocks. Three gate alerts catch upgrade-specific half-states (version skew, missing snapshot, stalled chain).
- **Job pinning eliminates self-preemption**. Each Job's pod runs on a node that is NOT its drain target. k8s-node1 hosts every Job except the one that drains it (which runs on k8s-master with a control-plane toleration).
- **Job pinning eliminates self-preemption**. Each Job's pod runs on a node that is NOT its drain target: the master-drain Job runs on the first worker; every worker-drain Job runs on k8s-master (already upgraded, control-plane toleration). The worker set is enumerated live from `kubectl get nodes`, so new nodes are covered with no script change; SSH targets are node InternalIPs (no DNS dependency).
- **Sequential workers with 10-min inter-node soak**. Same risk-bounding as the 24h OS-reboot soak, but tightened because kubelet failures surface within minutes — not hours.
- **Master upgrade goes first, workers last**. If master breaks, the cluster is already degraded so further worker upgrades would just delay recovery. By upgrading master first, we either succeed (workers can roll afterward) or fail loud (operator triages before any worker is touched).
- **No auto-rollback**. kubeadm doesn't support clean downgrade; the snapshot + manual apt rollback in the runbook is the recovery path.

View file

@ -10,7 +10,7 @@ drain target** — so no pod in the chain can preempt itself.
The chain (Sun 12:00 UTC weekly):
```
detection CronJob → preflight Job → master Job → worker × 4 Jobs → postflight Job
detection CronJob → preflight Job → master Job → one worker Job per worker (enumerated live) → postflight Job
```
This is **independent** of the OS-side `unattended-upgrades + kured`
@ -76,14 +76,16 @@ Job 6 — postflight (no pinning)
└── Slack: ✅ K8s upgrade complete
```
**Pin choices summarised:**
- k8s-node1 hosts every Job that drains master or another worker. k8s-node1
itself is upgraded **last**.
- k8s-master hosts Job 5 (which drains k8s-node1). Job 5's spec includes a
toleration for `node-role.kubernetes.io/control-plane:NoSchedule`.
- If anyone reorders the worker sequence, the pin for Job 5 needs to track
whatever worker is upgraded last. The mapping is in `scripts/upgrade-step.sh`
→ the `case "${PHASE}:${TARGET_NODE:-}"` block.
**Pin choices summarised** (dynamic since 2026-06-17 — no hardcoded node list):
- The **master-drain Job** runs on the first worker (the "runner"); since it
drains the control-plane node, it must not run there.
- **Every worker-drain Job** runs on **k8s-master** (already upgraded by then),
with a `node-role.kubernetes.io/control-plane:NoSchedule` toleration — so a
Job never runs on the node it drains (self-preemption invariant).
- The worker set + order come from `kubectl get nodes` at runtime
(`worker_nodes` / `next_pending_worker` in `scripts/upgrade-step.sh`), so
**adding a node needs no change** — the chain upgrades every worker still
off-target, then runs postflight. SSH uses node InternalIPs (no DNS needed).
## Components