Merge remote-tracking branch 'origin/master' into wizard/homelab-obs
commit 21dbd79ae4
6 changed files with 349 additions and 4 deletions
|
|
@ -36,6 +36,7 @@ envsubst on /template/job-template.yaml | kubectl apply -f -
|
||||||
▼
|
▼
|
||||||
|
|
||||||
Job 0 — preflight (pinned: k8s-node1)
|
Job 0 — preflight (pinned: k8s-node1)
|
||||||
|
├── compat-gate: addon/API/containerd support for target (else BLOCK+alert)
|
||||||
├── All nodes Ready + no Mem/Disk pressure
|
├── All nodes Ready + no Mem/Disk pressure
|
||||||
├── halt-on-alert (kured-style ignore-list)
|
├── halt-on-alert (kured-style ignore-list)
|
||||||
├── 24h-quiet baseline (no Ready transitions <24h ago)
|
├── 24h-quiet baseline (no Ready transitions <24h ago)
|
||||||
|
|
@ -87,6 +88,46 @@ Job 6 — postflight (no pinning)
|
||||||
**adding a node needs no change** — the chain upgrades every worker still
|
**adding a node needs no change** — the chain upgrades every worker still
|
||||||
off-target, then runs postflight. SSH uses node InternalIPs (no DNS needed).
|
off-target, then runs postflight. SSH uses node InternalIPs (no DNS needed).
|
||||||

### Auto-upgrade compat gate

The chain now attempts **patch AND minor** upgrades autonomously. Before any
mutation, `phase_preflight` runs `compat-gate.py` **FIRST** and **REFUSES (blocks)
the upgrade** if any of these hold for the detected target:

- a **critical addon's running version doesn't support the target k8s minor**
  (the target minor is higher than the highest k8s minor the compat matrix lists
  for that running addon version),
- an **in-use deprecated API is removed at/before the target**, measured live
  from `apiserver_requested_deprecated_apis` (something is still calling a
  group/version that the target k8s drops), or
- a **node's containerd is below the target's floor** (the minimum containerd the
  target k8s requires).

This is the **"auto-upgrade when we can, halt + alert when we can't"** contract.

**On a block**, the chain:
- pushes `k8s_upgrade_blocked=1` to Pushgateway (→ the `K8sUpgradeBlocked`
  Prometheus alert),
- Slacks the **specific reasons** (which addon/API/node, current vs required), and
- **halts the chain**: it exits **non-fatal** (the upgrade simply isn't safe yet,
  this is not a failure). Because the block happens **before any mutation, no
  rollback is involved**; nothing was changed.

**To clear a block**: upgrade the named addon (or migrate the API caller off the
deprecated group/version, or bump containerd on the named node) so the offending
condition no longer holds. The **next nightly run then proceeds automatically**,
with no manual chain restart needed.

The **compat matrix** lives in
`stacks/k8s-version-upgrade/scripts/addon-compat.json`. For each addon it maps an
addon-version floor → the highest k8s minor that floor supports, populated from
each addon's own compatibility docs. **Keep it current**; the gate reads it on
every run. Gate logic:
`stacks/k8s-version-upgrade/scripts/compat-gate.py`.

> The detector's minor-probe was **fixed** (the `HEAD pkgs.k8s.io/.../v<NEXT_MINOR>`
> curl now follows the 302 from `pkgs.k8s.io` via `-L`), so **minor versions are
> finally detected**, and are gated behind the compat check above before the
> chain will act on them.

||||||
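The addon rule above can be sketched as a small worked example. This is a minimal illustration, not the gate itself: the matrix values are copied from the calico entry in `addon-compat.json`, while the helper names `max_supported_k8s` and `addon_blocks` are hypothetical and exist only for this sketch.

```python
import re

# Values copied from the calico entry in addon-compat.json.
CALICO_MAX_K8S = {
    "3.26": "1.28", "3.27": "1.29", "3.28": "1.30", "3.29": "1.32",
    "3.30": "1.35", "3.31": "1.35", "3.32": "1.36",
}


def minor(v):
    """'v1.35.6' -> (1, 35); None if no major.minor can be found."""
    m = re.search(r"(\d+)\.(\d+)", v or "")
    return (int(m.group(1)), int(m.group(2))) if m else None


def max_supported_k8s(max_k8s, running):
    """Hypothetical helper: k8s minor for the highest floor <= running, else None."""
    for floor in sorted(max_k8s, key=minor, reverse=True):
        if minor(running) >= minor(floor):
            return max_k8s[floor]
    return None


def addon_blocks(max_k8s, running, target):
    """Hypothetical helper: True when the target exceeds what `running` supports."""
    supported = max_supported_k8s(max_k8s, running)
    # An addon older than every floor has unknown support, so it blocks too.
    return supported is None or minor(target) > minor(supported)
```

With these values, a cluster running calico 3.29 is blocked for a 1.35 target (3.29 tops out at 1.32), while calico 3.30 is not.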
## Components
|
## Components
|
||||||
|
|
||||||
### Shared resources (one-time, Terraform-managed)
|
### Shared resources (one-time, Terraform-managed)
|
||||||
|
|
@ -118,7 +159,8 @@ Pushed by upgrade-step.sh during phase execution; observed by the
|
||||||
- **`EtcdPreUpgradeSnapshotMissing`** — `k8s_upgrade_in_flight==1 && k8s_upgrade_snapshot_taken==0` for 10m. Catches preflight Stage 2 failing silently.
|
- **`EtcdPreUpgradeSnapshotMissing`** — `k8s_upgrade_in_flight==1 && k8s_upgrade_snapshot_taken==0` for 10m. Catches preflight Stage 2 failing silently.
|
||||||
- **`K8sUpgradeStalled`** — `k8s_upgrade_in_flight==1 && time()-k8s_upgrade_started_timestamp > 5400` for 5m. Catches a Job in the chain dying without spawning its successor.
|
- **`K8sUpgradeStalled`** — `k8s_upgrade_in_flight==1 && time()-k8s_upgrade_started_timestamp > 5400` for 5m. Catches a Job in the chain dying without spawning its successor.
|
||||||
- **`K8sUpgradeChainJobFailed`** — `kube_job_status_failed{namespace="k8s-upgrade",job_name=~"k8s-upgrade-.*",reason=~"BackoffLimitExceeded|DeadlineExceeded"} > 0` for 15m (warning). Catches a phase Job that **terminally failed before `k8s_upgrade_in_flight` was set** — the preflight gates exit pre-metric, so the two `in_flight`-based alerts above are blind to a failed preflight (this is what hid the 5-day 1.34.9 wedge on 2026-06-12). Reason-scoped to terminal job conditions so a retry-success doesn't false-positive (a bare failed-pod-count would otherwise also block kured for the Job's 7d TTL).
|
- **`K8sUpgradeChainJobFailed`** — `kube_job_status_failed{namespace="k8s-upgrade",job_name=~"k8s-upgrade-.*",reason=~"BackoffLimitExceeded|DeadlineExceeded"} > 0` for 15m (warning). Catches a phase Job that **terminally failed before `k8s_upgrade_in_flight` was set** — the preflight gates exit pre-metric, so the two `in_flight`-based alerts above are blind to a failed preflight (this is what hid the 5-day 1.34.9 wedge on 2026-06-12). Reason-scoped to terminal job conditions so a retry-success doesn't false-positive (a bare failed-pod-count would otherwise also block kured for the Job's 7d TTL).
|
||||||
- All four alerts ALSO block kured (same `--prometheus-url` halt-on-alert mechanism) so the OS-reboot pipeline can't run on top of a half-done version upgrade.
|
- **`K8sUpgradeBlocked`** — `k8s_upgrade_blocked == 1` (warning). A k8s **auto-upgrade was refused** by the compat gate because a critical addon, an in-use deprecated API, or a node's containerd is too old for the detected target. The **specific reasons are in Slack**; clear it by upgrading the named addon / migrating the API caller / bumping containerd, after which the next nightly run proceeds (see "Auto-upgrade compat gate"). No upgrade was attempted, so this is not a half-done-rollout alert.
|
||||||
|
- The first four alerts ALSO block kured (same `--prometheus-url` halt-on-alert mechanism) so the OS-reboot pipeline can't run on top of a half-done version upgrade.
|
||||||
|
|
||||||
### CoreDNS is NOT upgraded by kubeadm here
|
### CoreDNS is NOT upgraded by kubeadm here
|
||||||
|
|
||||||
|
|
@ -391,6 +433,8 @@ kill %1
|
||||||
|------|-------|
|
|------|-------|
|
||||||
| Stack (CronJob + ConfigMaps + SA/RBAC + ExternalSecret) | `infra/stacks/k8s-version-upgrade/main.tf` |
|
| Stack (CronJob + ConfigMaps + SA/RBAC + ExternalSecret) | `infra/stacks/k8s-version-upgrade/main.tf` |
|
||||||
| Universal phase body | `infra/stacks/k8s-version-upgrade/scripts/upgrade-step.sh` |
|
| Universal phase body | `infra/stacks/k8s-version-upgrade/scripts/upgrade-step.sh` |
|
||||||
|
| Compat gate (addon/API/containerd block logic) | `infra/stacks/k8s-version-upgrade/scripts/compat-gate.py` |
|
||||||
|
| Compat matrix (addon → highest supported k8s minor) | `infra/stacks/k8s-version-upgrade/scripts/addon-compat.json` |
|
||||||
| Job template | `infra/stacks/k8s-version-upgrade/job-template.yaml` |
|
| Job template | `infra/stacks/k8s-version-upgrade/job-template.yaml` |
|
||||||
| Per-node upgrade script | `infra/scripts/update_k8s.sh` |
|
| Per-node upgrade script | `infra/scripts/update_k8s.sh` |
|
||||||
| Upgrade Gates alerts | `infra/stacks/monitoring/modules/monitoring/prometheus_chart_values.tpl` (group "Upgrade Gates") |
|
| Upgrade Gates alerts | `infra/stacks/monitoring/modules/monitoring/prometheus_chart_values.tpl` (group "Upgrade Gates") |
|
||||||
|
|
|
||||||
|
|
@ -297,8 +297,10 @@ resource "kubernetes_config_map" "k8s_upgrade_scripts" {
|
||||||
labels = local.labels
|
labels = local.labels
|
||||||
}
|
}
|
||||||
data = {
|
data = {
|
||||||
"upgrade-step.sh" = file("${path.module}/scripts/upgrade-step.sh")
|
"upgrade-step.sh" = file("${path.module}/scripts/upgrade-step.sh")
|
||||||
"update_k8s.sh" = file("${path.module}/../../scripts/update_k8s.sh")
|
"update_k8s.sh" = file("${path.module}/../../scripts/update_k8s.sh")
|
||||||
|
"compat-gate.py" = file("${path.module}/scripts/compat-gate.py")
|
||||||
|
"addon-compat.json" = file("${path.module}/scripts/addon-compat.json")
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
|
|
||||||
|
|
@ -418,7 +420,7 @@ resource "kubernetes_cron_job_v1" "k8s_version_check" {
|
||||||
NEXT_MINOR_NUM=$(( $(echo "$RUNNING_MINOR" | cut -d. -f2) + 1 ))
|
NEXT_MINOR_NUM=$(( $(echo "$RUNNING_MINOR" | cut -d. -f2) + 1 ))
|
||||||
NEXT_MINOR="1.$NEXT_MINOR_NUM"
|
NEXT_MINOR="1.$NEXT_MINOR_NUM"
|
||||||
NEXT_MINOR_AVAILABLE="no"
|
NEXT_MINOR_AVAILABLE="no"
|
||||||
if curl -sIo /dev/null -w '%%{http_code}' \
|
if curl -sILo /dev/null -w '%%{http_code}' \
|
||||||
"https://pkgs.k8s.io/core:/stable:/v$NEXT_MINOR/deb/Release" \
|
"https://pkgs.k8s.io/core:/stable:/v$NEXT_MINOR/deb/Release" \
|
||||||
| grep -q '^200$'; then
|
| grep -q '^200$'; then
|
||||||
NEXT_MINOR_AVAILABLE="yes"
|
NEXT_MINOR_AVAILABLE="yes"
|
||||||
|
|
|
||||||
57
stacks/k8s-version-upgrade/scripts/addon-compat.json
Normal file
|
|
@ -0,0 +1,57 @@
{
  "_comment": "Addon -> highest k8s minor each addon version supports. The preflight compat-gate (compat-gate.py) reads the RUNNING version of each addon and blocks a k8s upgrade whose target minor exceeds what that running version supports — so the chain auto-halts + alerts instead of breaking on an unsupported addon. Keep current; sources are the addons' own k8s compat matrices (last refreshed 2026-06-19 for the 1.34->1.36 catch-up). max_k8s keys are addon-version floors (major.minor); value is the highest k8s minor that floor supports.",
  "addons": [
    {
      "name": "calico",
      "namespace": "calico-system",
      "kind": "daemonset",
      "resource": "calico-node",
      "image_re": "node:v?([0-9]+\\.[0-9]+)",
      "max_k8s": {
        "3.26": "1.28",
        "3.27": "1.29",
        "3.28": "1.30",
        "3.29": "1.32",
        "3.30": "1.35",
        "3.31": "1.35",
        "3.32": "1.36"
      }
    },
    {
      "name": "external-secrets",
      "namespace": "external-secrets",
      "kind": "deployment",
      "resource": "external-secrets",
      "image_re": "external-secrets:v?([0-9]+\\.[0-9]+)",
      "max_k8s": {
        "0.12": "1.31",
        "2.0": "1.35"
      }
    },
    {
      "name": "kyverno",
      "namespace": "kyverno",
      "kind": "deployment",
      "resource": "kyverno-admission-controller",
      "image_re": "kyverno:v?([0-9]+\\.[0-9]+)",
      "max_k8s": {
        "1.16": "1.34",
        "1.18": "1.35"
      }
    },
    {
      "name": "gpu-operator",
      "namespace": "nvidia",
      "kind": "deployment",
      "resource": "gpu-operator",
      "image_re": "gpu-operator:v?([0-9]+\\.[0-9]+)",
      "max_k8s": {
        "25.10": "1.35",
        "26.3": "1.36"
      }
    }
  ],
  "containerd_min": {
    "1.37": "2.0"
  }
}
|
||||||
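The `containerd_min` block in the matrix can be read with a sketch like the one below. It mirrors the comparison in `check_containerd`, but the function name `containerd_too_old` and the sample runtime strings are assumptions made for illustration. One thing it makes visible: the floor is looked up by the exact target minor, so a target with no entry of its own (for example 1.38 with only a `"1.37"` key) is not checked at all.

```python
import re

# Copied from the containerd_min block of addon-compat.json.
CONTAINERD_MIN = {"1.37": "2.0"}


def minor(v):
    """'v1.35.6' -> (1, 35); None if no major.minor can be found."""
    m = re.search(r"(\d+)\.(\d+)", v or "")
    return (int(m.group(1)), int(m.group(2))) if m else None


def containerd_too_old(runtime_version, target, floors=CONTAINERD_MIN):
    """Hypothetical helper: True if the node runtime is below the target's floor.

    runtime_version is the node's containerRuntimeVersion string,
    e.g. 'containerd://1.7.27' (sample value, not taken from the source).
    """
    t = minor(target)
    floor = floors.get(f"{t[0]}.{t[1]}")
    if not floor:
        return False  # no floor declared for this exact target minor
    cv = minor(runtime_version.replace("containerd://", ""))
    return cv is not None and cv < minor(floor)
```

If later minors are meant to inherit the 2.0 floor, the matrix would need an entry per minor, or the lookup would need to pick the nearest lower key; which of those the author intends is not stated in the diff.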
142
stacks/k8s-version-upgrade/scripts/compat-gate.py
Normal file
|
|
@ -0,0 +1,142 @@
#!/usr/bin/env python3
"""
Preflight compatibility gate for the k8s version-upgrade chain.

Decides whether it is SAFE to auto-upgrade Kubernetes to a target version, so the
chain can "upgrade whenever it can, and halt + alert when it can't" without a
human in the loop. Reads the addon-compat matrix (JSON on stdin) and checks three
classes of blocker:

  1. addon compat — every critical addon's RUNNING version must support the
                    target k8s minor (Calico is the usual blocker)
  2. removed APIs — no in-use API (Prometheus apiserver_requested_deprecated_apis)
                    is removed at/before the target minor
  3. containerd — every node's containerd >= the target's floor, if the matrix
                  declares one (e.g. the 1.7.x -> k8s 1.37 cliff)

Exit 0 = safe, proceed.
Exit 2 = BLOCKED — prints one human reason per line (caller pushes
         k8s_upgrade_blocked=1, Slacks the reasons, and halts the chain).
Exit 3 = the gate itself errored — caller treats as a block (fail safe).

Read-only: kubectl get + one Prometheus query. No mutations. PROM is overridable
via $PROM for local testing (cluster DNS isn't resolvable off-cluster).
"""
import json
import os
import re
import subprocess
import sys
import urllib.request

PROM = os.environ.get("PROM", "http://prometheus-server.monitoring.svc.cluster.local:80")


def minor(v):
    """'v1.35.6' | '1.35.6' | '1.35' -> (1, 35); None if unparseable."""
    m = re.search(r"(\d+)\.(\d+)", v or "")
    return (int(m.group(1)), int(m.group(2))) if m else None


def kget(args):
    try:
        r = subprocess.run(["kubectl", *args], capture_output=True, text=True, timeout=30)
        return r.stdout.strip()
    except Exception:
        return ""


def check_addons(matrix, tgt):
    reasons = []
    for a in matrix.get("addons", []):
        img = kget(["-n", a["namespace"], "get", a["kind"], a["resource"],
                    "-o", "jsonpath={.spec.template.spec.containers[*].image}"])
        m = re.search(a["image_re"], img or "")
        if not m:
            # Fail safe: if we can't read the running version, don't upgrade blind.
            reasons.append(f"addon {a['name']}: could not read running version "
                           f"(img='{img or 'not found'}') — refusing to upgrade blind")
            continue
        running = m.group(1)  # e.g. "3.26"
        # max_k8s maps an addon-version floor -> highest supported k8s minor.
        # Pick the highest floor that is <= the running version.
        max_k8s = None
        for floor, mk in sorted(a["max_k8s"].items(), key=lambda kv: minor(kv[0]), reverse=True):
            if minor(running) >= minor(floor):
                max_k8s = mk
                break
        if max_k8s is None:
            reasons.append(f"addon {a['name']} v{running}: below the lowest version "
                           f"in the compat matrix — unknown k8s support")
            continue
        if tgt > minor(max_k8s):
            reasons.append(f"addon {a['name']} v{running} supports k8s <= {max_k8s}; "
                           f"target {tgt[0]}.{tgt[1]} exceeds it — upgrade {a['name']} first")
    return reasons


def check_removed_apis(tgt):
    reasons = []
    try:
        url = PROM + "/api/v1/query?query=apiserver_requested_deprecated_apis"
        data = json.load(urllib.request.urlopen(url, timeout=20))
        for s in data.get("data", {}).get("result", []):
            lbl = s["metric"]
            rr = lbl.get("removed_release", "")
            if rr and minor(rr) and tgt >= minor(rr):
                g = lbl.get("group") or "core"
                reasons.append(f"deprecated API {g}/{lbl.get('version')} "
                               f"{lbl.get('resource')} is in use and is removed in "
                               f"k8s {rr} (target {tgt[0]}.{tgt[1]}) — migrate callers first")
    except Exception as e:
        reasons.append(f"removed-API check could not query Prometheus ({e}) — "
                       f"refusing to upgrade blind")
    return reasons


def check_containerd(matrix, tgt):
    reasons = []
    floor = matrix.get("containerd_min", {}).get(f"{tgt[0]}.{tgt[1]}")
    if not floor:
        return reasons
    out = kget(["get", "nodes", "-o",
                "jsonpath={range .items[*]}{.metadata.name}{\" \"}"
                "{.status.nodeInfo.containerRuntimeVersion}{\"\\n\"}{end}"])
    for line in out.splitlines():
        if not line.strip():
            continue
        name, _, ver = line.partition(" ")
        cv = ver.replace("containerd://", "")
        if minor(cv) and minor(cv) < minor(floor):
            reasons.append(f"node {name} containerd {cv} < required {floor} "
                           f"for k8s {tgt[0]}.{tgt[1]} — bump containerd first")
    return reasons


def main():
    if len(sys.argv) < 2:
        print("usage: compat-gate.py <target-k8s-version> (matrix JSON on stdin)")
        sys.exit(3)
    tgt = minor(sys.argv[1])
    if not tgt:
        print(f"bad target version '{sys.argv[1]}'")
        sys.exit(3)
    try:
        matrix = json.load(sys.stdin)
    except Exception as e:
        print(f"could not parse compat matrix JSON: {e}")
        sys.exit(3)

    reasons = (check_addons(matrix, tgt)
               + check_removed_apis(tgt)
               + check_containerd(matrix, tgt))
    if reasons:
        for r in reasons:
            print(r)
        sys.exit(2)
    print(f"compat-gate OK: cluster is safe to upgrade to {sys.argv[1]}")
    sys.exit(0)


if __name__ == "__main__":
    main()
|
||||||
|
|
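The removed-API check can be exercised offline by feeding it a canned Prometheus response instead of a live query. The sketch below assumes the response shape of the Prometheus instant-query API and the `group` / `version` / `resource` / `removed_release` labels that `check_removed_apis` reads. The function name `removed_api_reasons` and both sample series are illustrative, and the `example.io` series is invented purely for contrast.

```python
import re


def minor(v):
    """'v1.35.6' -> (1, 35); None if no major.minor can be found."""
    m = re.search(r"(\d+)\.(\d+)", v or "")
    return (int(m.group(1)), int(m.group(2))) if m else None


def removed_api_reasons(prom_response, target):
    """Hypothetical helper: list in-use APIs removed at or before `target`."""
    tgt = minor(target)
    reasons = []
    for series in prom_response.get("data", {}).get("result", []):
        labels = series["metric"]
        removed = minor(labels.get("removed_release", ""))
        if removed and tgt >= removed:
            group = labels.get("group") or "core"
            reasons.append(f"{group}/{labels.get('version')} "
                           f"{labels.get('resource')} removed in "
                           f"{labels['removed_release']}")
    return reasons


# Canned response; sample data only, not captured from a real cluster.
SAMPLE = {"data": {"result": [
    {"metric": {"group": "flowcontrol.apiserver.k8s.io", "version": "v1beta3",
                "resource": "flowschemas", "removed_release": "1.32"},
     "value": [0, "1"]},
    {"metric": {"group": "example.io", "version": "v1beta1",
                "resource": "widgets", "removed_release": "1.40"},
     "value": [0, "1"]},
]}}
```

Against this sample, a 1.35 target reports only the `flowschemas` series, and a 1.31 target reports nothing because neither removal has been reached yet.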
@ -88,6 +88,19 @@ push() {
|
||||||
| curl -sS --data-binary @- "$PG" || echo "warn: pushgateway push failed"
|
| curl -sS --data-binary @- "$PG" || echo "warn: pushgateway push failed"
|
||||||
}
|
}
|
||||||
|
|
||||||
|
# Auto-upgrade safety: a preflight compat-gate refusal is a BLOCK, not a crash —
|
||||||
|
# the cluster simply isn't ready for this target yet (an addon / in-use API /
|
||||||
|
# containerd is too old). Record it (k8s_upgrade_blocked=1 -> K8sUpgradeBlocked
|
||||||
|
# alert), Slack the reasons, and halt so a human clears the blocker (or a later
|
||||||
|
# run proceeds once it's cleared). This is the "upgrade when we can, alert when
|
||||||
|
# we can't" contract.
|
||||||
|
block() {
|
||||||
|
push k8s_upgrade_blocked 1
|
||||||
|
slack "BLOCKED preflight (target v$TARGET_VERSION) — auto-upgrade halted, needs attention:\n$1"
|
||||||
|
echo "BLOCKED: $1" >&2
|
||||||
|
exit 1
|
||||||
|
}
|
||||||
|
|
||||||
halt_on_alert_query() {
|
halt_on_alert_query() {
|
||||||
local extra_ignore="${1:-}"
|
local extra_ignore="${1:-}"
|
||||||
# ALLOWLIST design (refactored 2026-05-23 from a denylist): halt only on
|
# ALLOWLIST design (refactored 2026-05-23 from a denylist): halt only on
|
||||||
|
|
@ -299,6 +312,19 @@ spawn_next() {
|
||||||
phase_preflight() {
|
phase_preflight() {
|
||||||
slack "Starting preflight (target v$TARGET_VERSION, kind=$KIND)"
|
slack "Starting preflight (target v$TARGET_VERSION, kind=$KIND)"
|
||||||
|
|
||||||
|
# 0. Auto-upgrade compat gate (compat-gate.py): refuse the upgrade if a critical
|
||||||
|
# addon, an in-use deprecated API, or a node's containerd is too old for the
|
||||||
|
# target. Runs FIRST — before any mutation (etcd snapshot, drains) — so a
|
||||||
|
# block is cheap. Reset the blocked gauge for this run; block() sets it to 1
|
||||||
|
# only on a refusal. This is what makes unattended minor upgrades safe: the
|
||||||
|
# chain proceeds when the cluster supports the target and halts+alerts when
|
||||||
|
# it doesn't (e.g. Calico/ESO/kyverno behind, or a removed API still in use).
|
||||||
|
push k8s_upgrade_blocked 0
|
||||||
|
local gate_out gate_rc=0
|
||||||
|
gate_out=$(python3 /scripts/compat-gate.py "$TARGET_VERSION" < /scripts/addon-compat.json 2>&1) || gate_rc=$?
|
||||||
|
if [ "$gate_rc" -ne 0 ]; then block "$gate_out"; fi
|
||||||
|
echo "compat-gate passed for v$TARGET_VERSION"
|
||||||
|
|
||||||
# 1. All nodes Ready + no pressure
|
# 1. All nodes Ready + no pressure
|
||||||
local bad_nodes
|
local bad_nodes
|
||||||
bad_nodes=$($KUBECTL get nodes -o json | jq -r '
|
bad_nodes=$($KUBECTL get nodes -o json | jq -r '
|
||||||
|
|
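The gate's exit-code contract (0 = safe, 2 = blocked, 3 = gate error) can be illustrated with a small wrapper. This is a sketch only: `phase_preflight` in the hunk above treats every non-zero code as a block and does not distinguish 2 from 3, so the three-way split here, the names `run_gate` and `fake_gate`, and the stand-in gate commands are all assumptions for illustration.

```python
import subprocess
import sys


def run_gate(cmd, stdin_text=""):
    """Hypothetical wrapper: run a gate command, return (decision, output)."""
    proc = subprocess.run(cmd, input=stdin_text, capture_output=True, text=True)
    output = (proc.stdout + proc.stderr).strip()
    if proc.returncode == 0:
        return "proceed", output
    if proc.returncode == 2:
        return "blocked", output
    # Anything else means the gate itself failed; fail safe and do not upgrade.
    return "gate-error", output


def fake_gate(code, message):
    """Stand-in for compat-gate.py: prints a message and exits with `code`."""
    script = f"import sys; print({message!r}); sys.exit({code})"
    return [sys.executable, "-c", script]
```

Separating "blocked" from "gate-error" would let the Slack message say whether the cluster is not ready or the gate could not run. Both outcomes still stop the chain.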
@ -648,6 +674,60 @@ phase_postflight() {
|
||||||
--data-urlencode 'query=sum(kube_pod_status_ready{condition="true"}) / sum(kube_pod_status_phase{phase="Running"})' \
|
--data-urlencode 'query=sum(kube_pod_status_ready{condition="true"}) / sum(kube_pod_status_phase{phase="Running"})' \
|
||||||
| jq -r '.data.result[0].value[1] // "0"')
|
| jq -r '.data.result[0].value[1] // "0"')
|
||||||
|
|
||||||
|
# ---------------------------------------------------------------------------
|
||||||
|
# Deeper smoke tests — catch a cluster that's "all pods Running" but actually
|
||||||
|
# broken after the upgrade (dead apiserver health endpoints, broken
|
||||||
|
# CoreDNS/in-cluster DNS, or a control-plane component that's only superficially
|
||||||
|
# up). Uses ONLY the chain's existing permissions: read-only kubectl raw API
|
||||||
|
# reads + this pod's own resolver. No new pods/exec/images/RBAC. We do NOT
|
||||||
|
# rollback — kubeadm can't downgrade — we halt loudly for a human.
|
||||||
|
local smoke_failed=0
|
||||||
|
|
||||||
|
# 1. apiserver health endpoints. `kubectl get --raw` exits non-zero on a
|
||||||
|
# non-200, which under `set -e` would abort — capture rc explicitly.
|
||||||
|
local readyz_out readyz_rc=0 livez_out livez_rc=0
|
||||||
|
readyz_out=$($KUBECTL get --raw='/readyz' 2>&1) || readyz_rc=$?
|
||||||
|
if [ "$readyz_rc" -ne 0 ] || [ "$readyz_out" != "ok" ]; then
|
||||||
|
smoke_failed=1
|
||||||
|
slack "postflight smoke FAIL — apiserver /readyz not ok (rc=$readyz_rc, body='${readyz_out:0:200}')"
|
||||||
|
fi
|
||||||
|
livez_out=$($KUBECTL get --raw='/livez' 2>&1) || livez_rc=$?
|
||||||
|
if [ "$livez_rc" -ne 0 ] || [ "$livez_out" != "ok" ]; then
|
||||||
|
smoke_failed=1
|
||||||
|
slack "postflight smoke FAIL — apiserver /livez not ok (rc=$livez_rc, body='${livez_out:0:200}')"
|
||||||
|
fi
|
||||||
|
|
||||||
|
# 2. In-cluster DNS resolution from THIS pod's resolver. If CoreDNS / kube-dns
|
||||||
|
# is broken after the upgrade, resolving the apiserver's cluster service
|
||||||
|
# name fails here even though pods may still look Running.
|
||||||
|
local dns_rc=0
|
||||||
|
python3 -c 'import socket; socket.gethostbyname("kubernetes.default.svc.cluster.local")' >/dev/null 2>&1 || dns_rc=$?
|
||||||
|
if [ "$dns_rc" -ne 0 ]; then
|
||||||
|
smoke_failed=1
|
||||||
|
slack "postflight smoke FAIL — in-cluster DNS broken (could not resolve kubernetes.default.svc.cluster.local; CoreDNS down?)"
|
||||||
|
fi
|
||||||
|
|
||||||
|
# 3. Core kube-system pods Running: control-plane statics (apiserver,
|
||||||
|
# controller-manager, scheduler, etcd) AND CoreDNS. `grep -v Running`
|
||||||
|
# returns 1 when everything is Running (the happy path) → wrap in `|| true`
|
||||||
|
# so pipefail doesn't abort us at the moment of success.
|
||||||
|
local comp not_running
|
||||||
|
for comp in kube-apiserver kube-controller-manager kube-scheduler etcd coredns; do
|
||||||
|
not_running=$($KUBECTL -n kube-system get pods --no-headers 2>/dev/null \
|
||||||
|
| { grep -E "(^|[[:space:]])${comp}-" || true; } \
|
||||||
|
| { grep -v Running || true; } | wc -l)
|
||||||
|
if [ "$not_running" -gt 0 ]; then
|
||||||
|
smoke_failed=1
|
||||||
|
slack "postflight smoke FAIL — $not_running kube-system '$comp' pod(s) not Running after upgrade"
|
||||||
|
fi
|
||||||
|
done
|
||||||
|
|
||||||
|
if [ "$smoke_failed" -ne 0 ]; then
|
||||||
|
slack "postflight smoke tests FAILED — upgrade left the cluster unhealthy, halting for a human (no rollback; kubeadm can't downgrade)"
|
||||||
|
exit 1
|
||||||
|
fi
|
||||||
|
echo "postflight smoke tests passed (apiserver health + DNS + core kube-system pods)"
|
||||||
|
|
||||||
# Clear annotations + gauges
|
# Clear annotations + gauges
|
||||||
$KUBECTL annotate ns "$NS" \
|
$KUBECTL annotate ns "$NS" \
|
||||||
'viktorbarzin.me/k8s-upgrade-in-flight-' \
|
'viktorbarzin.me/k8s-upgrade-in-flight-' \
|
||||||
|
|
|
||||||
|
|
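The third smoke test in the postflight hunk counts kube-system pods that are not Running. Below is a sketch of that count in Python, run against sample `kubectl get pods --no-headers` text. It is a variant rather than a port: it inspects the STATUS column where the shell pipeline uses `grep -v Running` on the whole line. The function name `not_running` and the sample pod names are hypothetical.

```python
import re


def not_running(pod_lines, component):
    """Count pods of `component` whose STATUS column is not 'Running'.

    pod_lines is the text of `kubectl get pods --no-headers`
    (columns: NAME READY STATUS RESTARTS AGE).
    """
    pattern = re.compile(r"(^|\s)" + re.escape(component) + "-")
    count = 0
    for line in pod_lines.splitlines():
        if not pattern.search(line):
            continue
        fields = line.split()
        status = fields[2] if len(fields) > 2 else ""
        if status != "Running":
            count += 1
    return count


# Sample output; pod and node names are invented for this sketch.
SAMPLE = """\
kube-apiserver-k8s-master 1/1 Running 0 5m
etcd-k8s-master 1/1 Running 0 5m
coredns-abc-1 0/1 CrashLoopBackOff 4 5m
coredns-abc-2 1/1 Running 0 5m
"""
```

As in the shell pipeline, a component with no matching pods at all yields a count of zero, so an entirely missing component passes this particular check. The `/readyz`, `/livez` and DNS checks are what would catch that case.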
@ -2252,6 +2252,26 @@ serverFiles:
|
||||||
subsystem: k8s-upgrade
|
subsystem: k8s-upgrade
|
||||||
annotations:
|
annotations:
|
||||||
summary: "K8s upgrade chain Job {{ $labels.job_name }} terminally failed ({{ $labels.reason }}) — pipeline wedged. kubectl -n k8s-upgrade get jobs ; kubectl -n k8s-upgrade describe job {{ $labels.job_name }}"
|
summary: "K8s upgrade chain Job {{ $labels.job_name }} terminally failed ({{ $labels.reason }}) — pipeline wedged. kubectl -n k8s-upgrade get jobs ; kubectl -n k8s-upgrade describe job {{ $labels.job_name }}"
|
||||||
|
# K8sUpgradeBlocked: the k8s-version-upgrade chain pushes
|
||||||
|
# `k8s_upgrade_blocked=1` when the preflight compat gate REFUSES the
|
||||||
|
# target version — the cluster isn't ready (a critical addon lags the
|
||||||
|
# target's support window, an in-use API is deprecated/removed at the
|
||||||
|
# target, or a node's containerd predates the target's minimum). This
|
||||||
|
# is the designed "halt + alert" outcome, NOT a crash: the chain stops
|
||||||
|
# cleanly and the specific blocking reasons are posted to Slack by the
|
||||||
|
# upgrade chain. Same bare-metric pushgateway selector as
|
||||||
|
# K8sUpgradeStalled (job label "k8s-version-upgrade"). To clear: bump
|
||||||
|
# the named addon / migrate the deprecated API usage / upgrade the
|
||||||
|
# node's containerd, then the next nightly run proceeds automatically.
|
||||||
|
- alert: K8sUpgradeBlocked
|
||||||
|
expr: k8s_upgrade_blocked == 1
|
||||||
|
for: 10m
|
||||||
|
labels:
|
||||||
|
severity: warning
|
||||||
|
subsystem: k8s-upgrade
|
||||||
|
annotations:
|
||||||
|
summary: "K8s auto-upgrade refused by the preflight compat gate — cluster not ready for the target version. Blocking reasons were posted to Slack by the upgrade chain."
|
||||||
|
description: "An automated Kubernetes upgrade was REFUSED (not crashed) by the preflight compatibility gate because the cluster isn't ready for the target version — a critical addon lags the target's support window, an in-use deprecated API would be removed at the target, or a node's containerd is too old. The specific reasons were posted to Slack by the k8s-version-upgrade chain. This is the intended halt-and-alert. To clear it: bump the named addon / migrate the deprecated API usage / upgrade the node's containerd, then the next nightly run proceeds automatically."
|
||||||
- name: "Traefik Ingress"
|
- name: "Traefik Ingress"
|
||||||
rules:
|
rules:
|
||||||
- alert: TraefikDown
|
- alert: TraefikDown
|
||||||
|
|
|
||||||