k8s-version-upgrade: complete autonomy P0 — blocked alert + deeper postflight + runbook

Builds on the compat gate (prev commit) to finish "auto-upgrade when safe, halt +
alert when not":
- monitoring: K8sUpgradeBlocked alert (k8s_upgrade_blocked==1, for 10m, warning)
  in the Upgrade Gates group — the clean "a k8s auto-upgrade was refused, see
  Slack for why" signal. (Until monitoring is applied, a block still surfaces via
  the already-live K8sUpgradeChainJobFailed.)
- upgrade-step.sh phase_postflight: deeper post-upgrade smoke tests —
  apiserver /readyz + /livez, in-cluster DNS (resolve kubernetes.default), and
  core kube-system pods (apiserver/controller-manager/scheduler/etcd/coredns)
  Running. Any failure halts + alerts (exit 1; no rollback — kubeadm can't
  downgrade). Catches a "pods look Running but cluster is broken" upgrade.
- runbook: documents the compat gate, the blocked alert, how to clear a block,
  matrix maintenance, and the detector minor-probe fix.
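
The postflight change depends on one shell detail: under `set -e`, a failing probe would abort the script before it could alert, so the exit code has to be captured explicitly. A minimal sketch of that pattern follows, assuming bash. `fake_probe` and `check_probe` are hypothetical names used only for illustration; they stand in for the real `$KUBECTL get --raw` calls and are not part of upgrade-step.sh.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical stand-in for `kubectl get --raw=/readyz`:
# prints the body given as $1 and returns the exit status given as $2.
fake_probe() {
  printf '%s' "$1"
  return "$2"
}

# check_probe NAME BODY RC
# Runs the probe and records its exit code without letting `set -e` abort.
# Passes only when the exit code is 0 AND the body is exactly "ok".
check_probe() {
  local name=$1 out rc=0
  # `|| rc=$?` puts the command in a tested context, so errexit is not
  # triggered and rc keeps the probe's real exit status.
  out=$(fake_probe "$2" "$3" 2>&1) || rc=$?
  if [ "$rc" -ne 0 ] || [ "$out" != "ok" ]; then
    echo "FAIL $name rc=$rc body='${out:0:200}'"
    return 1
  fi
  echo "PASS $name"
}
```

With these definitions, `check_probe readyz ok 0` prints `PASS readyz`, while `check_probe livez 'etcd failed' 1` prints `FAIL livez rc=1 body='etcd failed'` and returns 1, leaving it to the caller to decide whether to keep collecting failures or halt.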

After deploy, the nightly chain detects 1.35 (minor detection now works) and
correctly BLOCKS on Calico 3.26 / ESO 0.12 / Kyverno 1.16 (all lag the 1.35
support window), alerting via K8sUpgradeBlocked. This is the autonomy working
as designed until the catch-up clears those addons.
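
The K8sUpgradeBlocked alert keys off a gauge that the chain pushes to a Pushgateway. The sketch below shows roughly what such a push could look like, assuming bash and curl. The `PUSHGATEWAY_URL` default and the function names `render_blocked_gauge` and `push_blocked_gauge` are assumptions made for illustration; only the metric name `k8s_upgrade_blocked` and the job label `k8s-version-upgrade` come from this commit.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Assumption: the Pushgateway address. Not taken from the commit.
PUSHGATEWAY_URL="${PUSHGATEWAY_URL:-http://pushgateway.example:9091}"

# render_blocked_gauge VALUE
# Emits the gauge in Prometheus text exposition format. Only 0 or 1 is
# accepted, because the alert expression compares against exactly 1.
render_blocked_gauge() {
  local value=$1
  case "$value" in
    0|1) ;;
    *) echo "value must be 0 or 1, got '$value'" >&2; return 2 ;;
  esac
  printf '# TYPE k8s_upgrade_blocked gauge\nk8s_upgrade_blocked %s\n' "$value"
}

# push_blocked_gauge VALUE
# POSTs the gauge to the Pushgateway under job "k8s-version-upgrade".
push_blocked_gauge() {
  local payload
  payload=$(render_blocked_gauge "$1") || return $?
  # Command substitution strips the trailing newline; add it back because
  # the exposition format expects the last line to end with one.
  printf '%s\n' "$payload" | curl --fail --silent --show-error \
    --data-binary @- "${PUSHGATEWAY_URL}/metrics/job/k8s-version-upgrade"
}
```

A Pushgateway keeps the last pushed value until it is overwritten or deleted, so a single `push_blocked_gauge 1` is enough for the `for: 10m` condition to elapse; a later run that passes the gate would need to push 0 (or delete the series) for the alert to resolve.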
Viktor Barzin 2026-06-19 11:27:17 +00:00
parent cecd9fe247
commit 6cb823e431
3 changed files with 119 additions and 1 deletions


@@ -674,6 +674,60 @@ phase_postflight() {
    --data-urlencode 'query=sum(kube_pod_status_ready{condition="true"}) / sum(kube_pod_status_phase{phase="Running"})' \
    | jq -r '.data.result[0].value[1] // "0"')
  # ---------------------------------------------------------------------------
  # Deeper smoke tests: catch a cluster that's "all pods Running" but actually
  # broken after the upgrade (dead apiserver health endpoints, broken
  # CoreDNS/in-cluster DNS, or a control-plane component that's only superficially
  # up). Uses ONLY the chain's existing permissions: read-only kubectl raw API
  # reads + this pod's own resolver. No new pods/exec/images/RBAC. We do NOT
  # roll back (kubeadm can't downgrade); we halt loudly for a human.
  local smoke_failed=0
  # 1. apiserver health endpoints. `kubectl get --raw` exits non-zero on a
  #    non-200, which under `set -e` would abort, so capture rc explicitly.
  local readyz_out readyz_rc=0 livez_out livez_rc=0
  readyz_out=$($KUBECTL get --raw='/readyz' 2>&1) || readyz_rc=$?
  if [ "$readyz_rc" -ne 0 ] || [ "$readyz_out" != "ok" ]; then
    smoke_failed=1
    slack "postflight smoke FAIL — apiserver /readyz not ok (rc=$readyz_rc, body='${readyz_out:0:200}')"
  fi
  livez_out=$($KUBECTL get --raw='/livez' 2>&1) || livez_rc=$?
  if [ "$livez_rc" -ne 0 ] || [ "$livez_out" != "ok" ]; then
    smoke_failed=1
    slack "postflight smoke FAIL — apiserver /livez not ok (rc=$livez_rc, body='${livez_out:0:200}')"
  fi
  # 2. In-cluster DNS resolution from THIS pod's resolver. If CoreDNS / kube-dns
  #    is broken after the upgrade, resolving the apiserver's cluster service
  #    name fails here even though pods may still look Running.
  local dns_rc=0
  python3 -c 'import socket; socket.gethostbyname("kubernetes.default.svc.cluster.local")' >/dev/null 2>&1 || dns_rc=$?
  if [ "$dns_rc" -ne 0 ]; then
    smoke_failed=1
    slack "postflight smoke FAIL — in-cluster DNS broken (could not resolve kubernetes.default.svc.cluster.local; CoreDNS down?)"
  fi
  # 3. Core kube-system pods Running: control-plane statics (apiserver,
  #    controller-manager, scheduler, etcd) AND CoreDNS. `grep -v Running`
  #    returns 1 when everything is Running (the happy path) → wrap in `|| true`
  #    so pipefail doesn't abort us at the moment of success.
  local comp not_running
  for comp in kube-apiserver kube-controller-manager kube-scheduler etcd coredns; do
    not_running=$($KUBECTL -n kube-system get pods --no-headers 2>/dev/null \
      | { grep -E "(^|[[:space:]])${comp}-" || true; } \
      | { grep -v Running || true; } | wc -l)
    if [ "$not_running" -gt 0 ]; then
      smoke_failed=1
      slack "postflight smoke FAIL — $not_running kube-system '$comp' pod(s) not Running after upgrade"
    fi
  done
  if [ "$smoke_failed" -ne 0 ]; then
    slack "postflight smoke tests FAILED — upgrade left the cluster unhealthy, halting for a human (no rollback; kubeadm can't downgrade)"
    exit 1
  fi
  echo "postflight smoke tests passed (apiserver health + DNS + core kube-system pods)"
  # Clear annotations + gauges
  $KUBECTL annotate ns "$NS" \
    'viktorbarzin.me/k8s-upgrade-in-flight-' \
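
The core kube-system check in this hunk counts matching pods whose line lacks `Running`, so a component with no matching pods at all yields a count of 0, and a pod that is Running but not Ready also passes. One possible tightening is sketched below, assuming the default `kubectl get pods --no-headers` column order (NAME, READY, STATUS, ...). The function name `count_unhealthy` is hypothetical and this is not what the commit implements.

```shell
#!/usr/bin/env bash
set -euo pipefail

# count_unhealthy COMPONENT
# Reads `kubectl get pods --no-headers` output on stdin.
# Prints "missing" when no pod name starts with "COMPONENT-"; otherwise
# prints how many matching pods are not Running or not fully Ready.
count_unhealthy() {
  local comp=$1
  awk -v prefix="${comp}-" '
    index($1, prefix) == 1 {           # pod name starts with the prefix
      matched++
      split($2, ready, "/")            # READY column, e.g. "1/1"
      if ($3 != "Running" || ready[1] != ready[2]) bad++
    }
    END {
      if (matched == 0) { print "missing"; exit }
      print bad + 0                    # force numeric 0 when nothing is bad
    }
  '
}
```

In the loop above it could be fed with `$KUBECTL -n kube-system get pods --no-headers | count_unhealthy "$comp"`, treating both `missing` and any non-zero count as a smoke failure. Whether an absent component should halt the chain is a policy choice for the runbook, so this is offered as an option rather than a correction.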


@@ -2252,6 +2252,26 @@ serverFiles:
        subsystem: k8s-upgrade
      annotations:
        summary: "K8s upgrade chain Job {{ $labels.job_name }} terminally failed ({{ $labels.reason }}) — pipeline wedged. kubectl -n k8s-upgrade get jobs ; kubectl -n k8s-upgrade describe job {{ $labels.job_name }}"
    # K8sUpgradeBlocked: the k8s-version-upgrade chain pushes
    # `k8s_upgrade_blocked=1` when the preflight compat gate REFUSES the
    # target version — the cluster isn't ready (a critical addon lags the
    # target's support window, an in-use API is deprecated/removed at the
    # target, or a node's containerd predates the target's minimum). This
    # is the designed "halt + alert" outcome, NOT a crash: the chain stops
    # cleanly and the specific blocking reasons are posted to Slack by the
    # upgrade chain. Same bare-metric pushgateway selector as
    # K8sUpgradeStalled (job label "k8s-version-upgrade"). To clear: bump
    # the named addon / migrate the deprecated API usage / upgrade the
    # node's containerd, then the next nightly run proceeds automatically.
    - alert: K8sUpgradeBlocked
      expr: k8s_upgrade_blocked == 1
      for: 10m
      labels:
        severity: warning
        subsystem: k8s-upgrade
      annotations:
        summary: "K8s auto-upgrade refused by the preflight compat gate — cluster not ready for the target version. Blocking reasons were posted to Slack by the upgrade chain."
        description: "An automated Kubernetes upgrade was REFUSED (not crashed) by the preflight compatibility gate because the cluster isn't ready for the target version — a critical addon lags the target's support window, an in-use deprecated API would be removed at the target, or a node's containerd is too old. The specific reasons were posted to Slack by the k8s-version-upgrade chain. This is the intended halt-and-alert. To clear it: bump the named addon / migrate the deprecated API usage / upgrade the node's containerd, then the next nightly run proceeds automatically."
- name: "Traefik Ingress"
  rules:
    - alert: TraefikDown