k8s-version-upgrade: complete autonomy P0 — blocked alert + deeper postflight + runbook
Some checks failed
ci/woodpecker/push/default Pipeline was canceled
Builds on the compat gate (prev commit) to finish "auto-upgrade when safe, halt + alert when not":

- monitoring: K8sUpgradeBlocked alert (k8s_upgrade_blocked==1, for 10m, warning) in the Upgrade Gates group. This is the clean "a k8s auto-upgrade was refused, see Slack for why" signal. (Until monitoring is applied, a block still surfaces via the already-live K8sUpgradeChainJobFailed.)
- upgrade-step.sh phase_postflight: deeper post-upgrade smoke tests: apiserver /readyz + /livez, in-cluster DNS (resolve kubernetes.default), and core kube-system pods (apiserver/controller-manager/scheduler/etcd/coredns) Running. Any failure halts + alerts (exit 1; no rollback, because kubeadm can't downgrade). Catches a "pods look Running but cluster is broken" upgrade.
- runbook: documents the compat gate, the blocked alert, how to clear a block, matrix maintenance, and the detector minor-probe fix.

After deploy, the nightly chain detects 1.35 (minor detection now works) and correctly BLOCKS on Calico 3.26 / ESO 0.12 / kyverno 1.16 (all behind), alerting via K8sUpgradeBlocked. That is the autonomy working as designed until the catch-up clears those addons.
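The postflight checks described above could look roughly like the sketch below. It is not the real upgrade-step.sh: the function names, the `dns-smoke` pod name, and the `busybox:1.36` image are assumptions made for illustration. Only the checks themselves (apiserver /readyz and /livez, resolving kubernetes.default, core kube-system pods Running) come from the commit message.

```shell
# Hypothetical sketch of the postflight smoke tests; names are illustrative.

# Components listed in the commit message. kubeadm static pods are named
# <component>-<node>; coredns pods are named coredns-<hash>.
CORE_COMPONENTS="kube-apiserver kube-controller-manager kube-scheduler etcd coredns"

# Reads `kubectl -n kube-system get pods --no-headers` output on stdin.
# Succeeds only if every core component has at least one pod and every
# pod of that component reports STATUS (column 3) == Running.
core_pods_running() {
  listing="$(cat)"
  for comp in $CORE_COMPONENTS; do
    if ! printf '%s\n' "$listing" | awk -v c="$comp" '
        index($1, c "-") == 1 { seen = 1; if ($3 != "Running") bad = 1 }
        END { exit !(seen && !bad) }'; then
      echo "postflight: $comp is missing or not Running" >&2
      return 1
    fi
  done
  return 0
}

# Runs all smoke tests against the live cluster; returns non-zero on the
# first failure so the caller can halt and alert (there is no rollback).
postflight() {
  kubectl get --raw='/readyz' >/dev/null || return 1
  kubectl get --raw='/livez'  >/dev/null || return 1
  # Assumed approach: a throwaway pod resolves the in-cluster service name.
  kubectl run dns-smoke --rm -i --restart=Never --image=busybox:1.36 \
    -- nslookup kubernetes.default >/dev/null || return 1
  kubectl -n kube-system get pods --no-headers | core_pods_running
}
```

A caller would then do something like `postflight || exit 1`, matching the "exit 1; no rollback" behaviour described in the message. Keeping the pod-status check as a pure stdin filter lets it be exercised without a cluster.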
parent cecd9fe247
commit 6cb823e431
3 changed files with 119 additions and 1 deletions
@@ -2252,6 +2252,26 @@ serverFiles:
         subsystem: k8s-upgrade
       annotations:
         summary: "K8s upgrade chain Job {{ $labels.job_name }} terminally failed ({{ $labels.reason }}) — pipeline wedged. kubectl -n k8s-upgrade get jobs ; kubectl -n k8s-upgrade describe job {{ $labels.job_name }}"
+    # K8sUpgradeBlocked: the k8s-version-upgrade chain pushes
+    # `k8s_upgrade_blocked=1` when the preflight compat gate REFUSES the
+    # target version — the cluster isn't ready (a critical addon lags the
+    # target's support window, an in-use API is deprecated/removed at the
+    # target, or a node's containerd predates the target's minimum). This
+    # is the designed "halt + alert" outcome, NOT a crash: the chain stops
+    # cleanly and the specific blocking reasons are posted to Slack by the
+    # upgrade chain. Same bare-metric pushgateway selector as
+    # K8sUpgradeStalled (job label "k8s-version-upgrade"). To clear: bump
+    # the named addon / migrate the deprecated API usage / upgrade the
+    # node's containerd, then the next nightly run proceeds automatically.
+    - alert: K8sUpgradeBlocked
+      expr: k8s_upgrade_blocked == 1
+      for: 10m
+      labels:
+        severity: warning
+        subsystem: k8s-upgrade
+      annotations:
+        summary: "K8s auto-upgrade refused by the preflight compat gate — cluster not ready for the target version. Blocking reasons were posted to Slack by the upgrade chain."
+        description: "An automated Kubernetes upgrade was REFUSED (not crashed) by the preflight compatibility gate because the cluster isn't ready for the target version — a critical addon lags the target's support window, an in-use deprecated API would be removed at the target, or a node's containerd is too old. The specific reasons were posted to Slack by the k8s-version-upgrade chain. This is the intended halt-and-alert. To clear it: bump the named addon / migrate the deprecated API usage / upgrade the node's containerd, then the next nightly run proceeds automatically."
 - name: "Traefik Ingress"
   rules:
     - alert: TraefikDown
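The rule comment says the chain pushes `k8s_upgrade_blocked=1` to a Pushgateway under the job label "k8s-version-upgrade". A minimal sketch of such a push follows. The Pushgateway URL and both function names are assumptions, and pushing 0 on a later successful preflight is a guess at how the gauge gets cleared, not something the diff confirms.

```shell
# Hypothetical sketch; the URL default and function names are assumptions.
PUSHGATEWAY_URL="${PUSHGATEWAY_URL:-http://pushgateway.monitoring.svc:9091}"

# Emits the metric in Prometheus text exposition format.
# $1: 1 = the compat gate refused the upgrade, 0 = not blocked.
blocked_payload() {
  printf '# TYPE k8s_upgrade_blocked gauge\nk8s_upgrade_blocked %s\n' "$1"
}

# POSTs the gauge to the Pushgateway grouping key job="k8s-version-upgrade",
# the job label named in the rule comment.
push_blocked() {
  blocked_payload "$1" |
    curl -fsS --data-binary @- \
      "$PUSHGATEWAY_URL/metrics/job/k8s-version-upgrade"
}
```

With this shape, `push_blocked 1` after a refused preflight would make `k8s_upgrade_blocked == 1` true, and the alert would fire once that has held for the rule's `for: 10m`.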