k8s-version-upgrade: complete autonomy P0 — blocked alert + deeper postflight + runbook
Some checks failed
ci/woodpecker/push/default Pipeline was canceled
Builds on the compat gate (prev commit) to finish "auto-upgrade when safe, halt + alert when not":

- monitoring: K8sUpgradeBlocked alert (k8s_upgrade_blocked==1, for 10m, warning) in the Upgrade Gates group. This is the clean "a k8s auto-upgrade was refused, see Slack for why" signal. (Until monitoring is applied, a block still surfaces via the already-live K8sUpgradeChainJobFailed.)
- upgrade-step.sh phase_postflight: deeper post-upgrade smoke tests: apiserver /readyz + /livez, in-cluster DNS (resolve kubernetes.default), and core kube-system pods (apiserver/controller-manager/scheduler/etcd/coredns) Running. Any failure halts + alerts (exit 1; no rollback, because kubeadm can't downgrade). Catches a "pods look Running but cluster is broken" upgrade.
- runbook: documents the compat gate, the blocked alert, how to clear a block, matrix maintenance, and the detector minor-probe fix.

After deploy, the nightly chain detects 1.35 (minor detection now works) and correctly BLOCKS on Calico 3.26 / ESO 0.12 / kyverno 1.16 (all behind), alerting via K8sUpgradeBlocked. That is the autonomy working as designed until the catch-up clears those addons.
parent cecd9fe247
commit 6cb823e431
3 changed files with 119 additions and 1 deletions
@@ -36,6 +36,7 @@ envsubst on /template/job-template.yaml | kubectl apply -f -
 ▼
 
 Job 0 — preflight (pinned: k8s-node1)
+├── compat-gate: addon/API/containerd support for target (else BLOCK+alert)
 ├── All nodes Ready + no Mem/Disk pressure
 ├── halt-on-alert (kured-style ignore-list)
 ├── 24h-quiet baseline (no Ready transitions <24h ago)
@@ -87,6 +88,46 @@ Job 6 — postflight (no pinning)
 **adding a node needs no change** — the chain upgrades every worker still
 off-target, then runs postflight. SSH uses node InternalIPs (no DNS needed).
 
+### Auto-upgrade compat gate
+
+The chain now attempts **patch AND minor** upgrades autonomously — but before any
+mutation, `phase_preflight` runs `compat-gate.py` **FIRST** and **REFUSES (blocks)
+the upgrade** if any of these hold for the detected target:
+
+- a **critical addon's running version doesn't support the target k8s minor**
+  (target minor > the addon's highest-supported minor in the compat matrix),
+- an **in-use deprecated API is removed at/before the target** — measured live
+  from `apiserver_requested_deprecated_apis` (something is still calling a
+  group/version that the target k8s drops), or
+- a **node's containerd is below the target's floor** (the minimum containerd the
+  target k8s requires).
+
+This is the **"auto-upgrade when we can, halt + alert when we can't"** contract.
+
+**On a block**, the gate:
+- pushes `k8s_upgrade_blocked=1` to Pushgateway (→ the `K8sUpgradeBlocked`
+  Prometheus alert),
+- Slacks the **specific reasons** (which addon/API/node, current vs required), and
+- **halts the chain** — it exits **non-fatal** (the upgrade simply isn't safe yet,
+  this is not a failure). Because the block happens **before any mutation, no
+  rollback is involved**; nothing was changed.
+
+**To clear a block**: upgrade the named addon (or migrate the API caller off the
+deprecated group/version, or bump containerd on the named node) so the offending
+condition no longer holds. The **next nightly run then proceeds automatically** —
+no manual chain restart needed.
+
+The **compat matrix** lives in
+`stacks/k8s-version-upgrade/scripts/addon-compat.json` — a map of `addon → highest
+supported k8s minor`, populated from each addon's own compatibility docs. **Keep
+it current**; the gate reads it on every run. Gate logic:
+`stacks/k8s-version-upgrade/scripts/compat-gate.py`.
+
+> The detector's minor-probe was **fixed** (the `HEAD pkgs.k8s.io/.../v<NEXT_MINOR>`
+> curl now follows the 302 from `pkgs.k8s.io` via `-L`), so **minor versions are
+> finally detected** — and are gated behind the compat check above before the
+> chain will act on them.
+
 ## Components
 
 ### Shared resources (one-time, Terraform-managed)
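The three block conditions in the "Auto-upgrade compat gate" section can be sketched as one pure function. This is a minimal illustration under stated assumptions, not the contents of `compat-gate.py`, which this commit does not show: the function name `block_reasons`, its parameters, and the data shapes are all hypothetical. Only the three comparisons come from the runbook text.

```python
def block_reasons(target_minor, addon_max_minor, api_removed_in,
                  node_containerd, containerd_floor):
    """Return human-readable reasons to refuse the upgrade (empty list = safe).

    All argument shapes are assumptions made for this sketch:
      target_minor      int, e.g. 35 for k8s 1.35
      addon_max_minor   dict: addon name -> highest supported k8s minor
      api_removed_in    dict: in-use deprecated API -> k8s minor that removes it
      node_containerd   dict: node name -> (major, minor) containerd version
      containerd_floor  (major, minor) minimum containerd for the target
    """
    reasons = []

    # 1. A critical addon does not support the target minor.
    for addon, max_minor in sorted(addon_max_minor.items()):
        if target_minor > max_minor:
            reasons.append(
                f"addon {addon}: supports up to 1.{max_minor}, target is 1.{target_minor}"
            )

    # 2. An in-use deprecated API is removed at or before the target.
    for api, removed_in in sorted(api_removed_in.items()):
        if removed_in <= target_minor:
            reasons.append(
                f"api {api}: removed in 1.{removed_in}, target is 1.{target_minor}"
            )

    # 3. A node's containerd is below the target's floor (tuple comparison).
    for node, version in sorted(node_containerd.items()):
        if version < containerd_floor:
            have = ".".join(map(str, version))
            need = ".".join(map(str, containerd_floor))
            reasons.append(f"node {node}: containerd {have} < required {need}")

    return reasons
```

Returning a list of reasons rather than a boolean matches the runbook's requirement that Slack receive the specific addon, API, or node that caused the block. A caller would treat a non-empty list as "block and alert" and an empty list as "proceed".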
@@ -118,7 +159,8 @@ Pushed by upgrade-step.sh during phase execution; observed by the
 - **`EtcdPreUpgradeSnapshotMissing`** — `k8s_upgrade_in_flight==1 && k8s_upgrade_snapshot_taken==0` for 10m. Catches preflight Stage 2 failing silently.
 - **`K8sUpgradeStalled`** — `k8s_upgrade_in_flight==1 && time()-k8s_upgrade_started_timestamp > 5400` for 5m. Catches a Job in the chain dying without spawning its successor.
 - **`K8sUpgradeChainJobFailed`** — `kube_job_status_failed{namespace="k8s-upgrade",job_name=~"k8s-upgrade-.*",reason=~"BackoffLimitExceeded|DeadlineExceeded"} > 0` for 15m (warning). Catches a phase Job that **terminally failed before `k8s_upgrade_in_flight` was set** — the preflight gates exit pre-metric, so the two `in_flight`-based alerts above are blind to a failed preflight (this is what hid the 5-day 1.34.9 wedge on 2026-06-12). Reason-scoped to terminal job conditions so a retry-success doesn't false-positive (a bare failed-pod-count would otherwise also block kured for the Job's 7d TTL).
-- All four alerts ALSO block kured (same `--prometheus-url` halt-on-alert mechanism) so the OS-reboot pipeline can't run on top of a half-done version upgrade.
+- **`K8sUpgradeBlocked`** — `k8s_upgrade_blocked == 1` for 10m (warning). A k8s **auto-upgrade was refused** by the compat gate because a critical addon, an in-use deprecated API, or a node's containerd is too old for the detected target. The **specific reasons are in Slack**; clear it by upgrading the named addon / migrating the API caller / bumping containerd, after which the next nightly run proceeds (see "Auto-upgrade compat gate"). No upgrade was attempted, so this is not a half-done-rollout alert.
+- The first four alerts ALSO block kured (same `--prometheus-url` halt-on-alert mechanism) so the OS-reboot pipeline can't run on top of a half-done version upgrade.
 
 ### CoreDNS is NOT upgraded by kubeadm here
 
@@ -391,6 +433,8 @@ kill %1
 |------|-------|
 | Stack (CronJob + ConfigMaps + SA/RBAC + ExternalSecret) | `infra/stacks/k8s-version-upgrade/main.tf` |
 | Universal phase body | `infra/stacks/k8s-version-upgrade/scripts/upgrade-step.sh` |
+| Compat gate (addon/API/containerd block logic) | `infra/stacks/k8s-version-upgrade/scripts/compat-gate.py` |
+| Compat matrix (addon → highest supported k8s minor) | `infra/stacks/k8s-version-upgrade/scripts/addon-compat.json` |
 | Job template | `infra/stacks/k8s-version-upgrade/job-template.yaml` |
 | Per-node upgrade script | `infra/scripts/update_k8s.sh` |
 | Upgrade Gates alerts | `infra/stacks/monitoring/modules/monitoring/prometheus_chart_values.tpl` (group "Upgrade Gates") |
@@ -674,6 +674,60 @@ phase_postflight() {
     --data-urlencode 'query=sum(kube_pod_status_ready{condition="true"}) / sum(kube_pod_status_phase{phase="Running"})' \
     | jq -r '.data.result[0].value[1] // "0"')
 
+  # ---------------------------------------------------------------------------
+  # Deeper smoke tests — catch a cluster that's "all pods Running" but actually
+  # broken after the upgrade (dead apiserver health endpoints, broken
+  # CoreDNS/in-cluster DNS, or a control-plane component that's only superficially
+  # up). Uses ONLY the chain's existing permissions: read-only kubectl raw API
+  # reads + this pod's own resolver. No new pods/exec/images/RBAC. We do NOT
+  # rollback — kubeadm can't downgrade — we halt loudly for a human.
+  local smoke_failed=0
+
+  # 1. apiserver health endpoints. `kubectl get --raw` exits non-zero on a
+  # non-200, which under `set -e` would abort — capture rc explicitly.
+  local readyz_out readyz_rc=0 livez_out livez_rc=0
+  readyz_out=$($KUBECTL get --raw='/readyz' 2>&1) || readyz_rc=$?
+  if [ "$readyz_rc" -ne 0 ] || [ "$readyz_out" != "ok" ]; then
+    smoke_failed=1
+    slack "postflight smoke FAIL — apiserver /readyz not ok (rc=$readyz_rc, body='${readyz_out:0:200}')"
+  fi
+  livez_out=$($KUBECTL get --raw='/livez' 2>&1) || livez_rc=$?
+  if [ "$livez_rc" -ne 0 ] || [ "$livez_out" != "ok" ]; then
+    smoke_failed=1
+    slack "postflight smoke FAIL — apiserver /livez not ok (rc=$livez_rc, body='${livez_out:0:200}')"
+  fi
+
+  # 2. In-cluster DNS resolution from THIS pod's resolver. If CoreDNS / kube-dns
+  # is broken after the upgrade, resolving the apiserver's cluster service
+  # name fails here even though pods may still look Running.
+  local dns_rc=0
+  python3 -c 'import socket; socket.gethostbyname("kubernetes.default.svc.cluster.local")' >/dev/null 2>&1 || dns_rc=$?
+  if [ "$dns_rc" -ne 0 ]; then
+    smoke_failed=1
+    slack "postflight smoke FAIL — in-cluster DNS broken (could not resolve kubernetes.default.svc.cluster.local; CoreDNS down?)"
+  fi
+
+  # 3. Core kube-system pods Running: control-plane statics (apiserver,
+  # controller-manager, scheduler, etcd) AND CoreDNS. `grep -v Running`
+  # returns 1 when everything is Running (the happy path) → wrap in `|| true`
+  # so pipefail doesn't abort us at the moment of success.
+  local comp not_running
+  for comp in kube-apiserver kube-controller-manager kube-scheduler etcd coredns; do
+    not_running=$($KUBECTL -n kube-system get pods --no-headers 2>/dev/null \
+      | { grep -E "(^|[[:space:]])${comp}-" || true; } \
+      | { grep -v Running || true; } | wc -l)
+    if [ "$not_running" -gt 0 ]; then
+      smoke_failed=1
+      slack "postflight smoke FAIL — $not_running kube-system '$comp' pod(s) not Running after upgrade"
+    fi
+  done
+
+  if [ "$smoke_failed" -ne 0 ]; then
+    slack "postflight smoke tests FAILED — upgrade left the cluster unhealthy, halting for a human (no rollback; kubeadm can't downgrade)"
+    exit 1
+  fi
+  echo "postflight smoke tests passed (apiserver health + DNS + core kube-system pods)"
+
   # Clear annotations + gauges
   $KUBECTL annotate ns "$NS" \
     'viktorbarzin.me/k8s-upgrade-in-flight-' \
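The pod check in step 3 of the hunk above is a grep pipeline, which is compact but hard to unit-test. The sketch below restates the same counting logic in Python so its edge cases can be exercised against sample text. It is an illustration only: the function name `count_not_running` and the node name in the sample are hypothetical, and unlike `grep -v Running` it compares the STATUS column exactly rather than matching the word anywhere on the line.

```python
def count_not_running(pods_output: str, component: str) -> int:
    """Count kube-system pods of one component whose STATUS is not Running.

    pods_output is the text of `kubectl -n kube-system get pods --no-headers`,
    whose columns are NAME READY STATUS RESTARTS AGE.
    """
    count = 0
    for line in pods_output.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        name, status = fields[0], fields[2]
        if name.startswith(component + "-") and status != "Running":
            count += 1
    return count


# Sample input; "k8s-master" is a made-up node name for illustration.
SAMPLE = """\
kube-apiserver-k8s-master 1/1 Running 0 5m
etcd-k8s-master 1/1 Running 0 5m
coredns-abc-123 0/1 CrashLoopBackOff 4 5m
coredns-abc-456 1/1 Running 0 5m
"""

print(count_not_running(SAMPLE, "coredns"))         # one pod is crash-looping
print(count_not_running(SAMPLE, "kube-scheduler"))  # no matching pods at all
```

The second call returns 0 even though no `kube-scheduler` pod exists in the sample. The shell pipeline behaves the same way, since an empty grep result yields `wc -l` of 0, so a component with zero matching pods (or a failed `kubectl` call, whose stderr is discarded) passes this particular check. Whether that gap matters depends on the other smoke tests catching a missing component first.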
@@ -2252,6 +2252,26 @@ serverFiles:
               subsystem: k8s-upgrade
             annotations:
               summary: "K8s upgrade chain Job {{ $labels.job_name }} terminally failed ({{ $labels.reason }}) — pipeline wedged. kubectl -n k8s-upgrade get jobs ; kubectl -n k8s-upgrade describe job {{ $labels.job_name }}"
+          # K8sUpgradeBlocked: the k8s-version-upgrade chain pushes
+          # `k8s_upgrade_blocked=1` when the preflight compat gate REFUSES the
+          # target version — the cluster isn't ready (a critical addon lags the
+          # target's support window, an in-use API is deprecated/removed at the
+          # target, or a node's containerd predates the target's minimum). This
+          # is the designed "halt + alert" outcome, NOT a crash: the chain stops
+          # cleanly and the specific blocking reasons are posted to Slack by the
+          # upgrade chain. Same bare-metric pushgateway selector as
+          # K8sUpgradeStalled (job label "k8s-version-upgrade"). To clear: bump
+          # the named addon / migrate the deprecated API usage / upgrade the
+          # node's containerd, then the next nightly run proceeds automatically.
+          - alert: K8sUpgradeBlocked
+            expr: k8s_upgrade_blocked == 1
+            for: 10m
+            labels:
+              severity: warning
+              subsystem: k8s-upgrade
+            annotations:
+              summary: "K8s auto-upgrade refused by the preflight compat gate — cluster not ready for the target version. Blocking reasons were posted to Slack by the upgrade chain."
+              description: "An automated Kubernetes upgrade was REFUSED (not crashed) by the preflight compatibility gate because the cluster isn't ready for the target version — a critical addon lags the target's support window, an in-use deprecated API would be removed at the target, or a node's containerd is too old. The specific reasons were posted to Slack by the k8s-version-upgrade chain. This is the intended halt-and-alert. To clear it: bump the named addon / migrate the deprecated API usage / upgrade the node's containerd, then the next nightly run proceeds automatically."
       - name: "Traefik Ingress"
         rules:
           - alert: TraefikDown
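The `K8sUpgradeBlocked` rule fires on a gauge that the chain pushes to Pushgateway. As a rough sketch of what such a push looks like on the wire, the helper below builds the URL and text-format body for the Pushgateway HTTP API (`/metrics/job/<job>`), using the job label `k8s-version-upgrade` named in the rule comment. The function names and the example base URL are hypothetical, and the commit does not show how the chain itself performs the push or how it resets the gauge afterwards.

```python
import urllib.request


def build_blocked_push(pushgateway_url: str, blocked: bool,
                       job: str = "k8s-version-upgrade"):
    """Return (url, body) for pushing the k8s_upgrade_blocked gauge."""
    url = f"{pushgateway_url.rstrip('/')}/metrics/job/{job}"
    # Prometheus text exposition format; the trailing newline is required.
    body = (
        "# TYPE k8s_upgrade_blocked gauge\n"
        f"k8s_upgrade_blocked {1 if blocked else 0}\n"
    )
    return url, body


def to_request(url: str, body: str) -> urllib.request.Request:
    """Wrap the push in a Request object; nothing is sent until urlopen()."""
    return urllib.request.Request(url, data=body.encode("utf-8"), method="POST")


if __name__ == "__main__":
    # Hypothetical Pushgateway address, printed rather than contacted.
    url, body = build_blocked_push("http://pushgateway.example:9091", True)
    print(url)
    print(body, end="")
```

Because the rule has `for: 10m`, the gauge has to stay at 1 for ten minutes before the alert fires, so a value that is pushed and then cleared quickly would not page anyone.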