diff --git a/docs/architecture/automated-upgrades.md b/docs/architecture/automated-upgrades.md index 6fddcac3..c0200d84 100644 --- a/docs/architecture/automated-upgrades.md +++ b/docs/architecture/automated-upgrades.md @@ -319,7 +319,7 @@ each Job's pod and its drain target are always different nodes. - `K8sVersionSkew` — kubelet/apiserver `gitVersion` count >1 for 30m. Catches a half-done rollout. - `EtcdPreUpgradeSnapshotMissing` — `k8s_upgrade_in_flight==1 && k8s_upgrade_snapshot_taken==0` for 10m. Catches preflight failing silently. - `K8sUpgradeStalled` — `k8s_upgrade_in_flight==1 && time()-k8s_upgrade_started_timestamp > 5400` for 5m. Catches a chain Job dying without spawning its successor. - - `K8sUpgradeChainJobFailed` — `(kube_job_status_failed{namespace="k8s-upgrade",job_name=~"k8s-upgrade-.*",reason=~"BackoffLimitExceeded|DeadlineExceeded"} > 0) unless on() (k8s_upgrade_blocked == 1)` for 15m (warning). Catches a phase Job that terminally failed **before `in_flight` was set** (the preflight gates exit pre-metric) — invisible to the two `in_flight`-based alerts above; this was the blind spot behind the 5-day 1.34.9 preflight wedge. Reason-scoped so a retry-success doesn't false-positive (and so it doesn't needlessly block kured). The `unless k8s_upgrade_blocked == 1` clause (2026-06-21) excludes a deliberate compat-gate refusal (owned by `K8sUpgradeBlocked`) so a block doesn't double-fire as a wedge. + - `K8sUpgradeChainJobFailed` — `(kube_job_status_failed{namespace="k8s-upgrade",job_name=~"k8s-upgrade-(preflight|master|worker|postflight)-.*",reason=~"BackoffLimitExceeded|DeadlineExceeded"} > 0) unless on() (k8s_upgrade_blocked == 1)` for 15m (warning). Catches a phase Job that terminally failed **before `in_flight` was set** (the preflight gates exit pre-metric) — invisible to the two `in_flight`-based alerts above; this was the blind spot behind the 5-day 1.34.9 preflight wedge. 
Reason-scoped so a retry-success doesn't false-positive (and so it doesn't needlessly block kured). The `unless k8s_upgrade_blocked == 1` clause (2026-06-21) excludes a deliberate compat-gate refusal (owned by `K8sUpgradeBlocked`) so a block doesn't double-fire as a wedge. - **Pushgateway metrics**: - `k8s_upgrade_in_flight` (set in preflight, cleared in postflight) - `k8s_upgrade_snapshot_taken` (set after etcd snapshot Job completes with ≥1 KiB) diff --git a/docs/runbooks/k8s-version-upgrade.md b/docs/runbooks/k8s-version-upgrade.md index 55cd6857..08d43926 100644 --- a/docs/runbooks/k8s-version-upgrade.md +++ b/docs/runbooks/k8s-version-upgrade.md @@ -171,10 +171,27 @@ Pushed by upgrade-step.sh during phase execution; observed by the - **`K8sVersionSkew`** — distinct kubelet/apiserver `gitVersion` count > 1 for 30m. Catches a half-done rollout. - **`EtcdPreUpgradeSnapshotMissing`** — `k8s_upgrade_in_flight==1 && k8s_upgrade_snapshot_taken==0` for 10m. Catches preflight Stage 2 failing silently. - **`K8sUpgradeStalled`** — `k8s_upgrade_in_flight==1 && time()-k8s_upgrade_started_timestamp > 5400` for 5m. Catches a Job in the chain dying without spawning its successor. -- **`K8sUpgradeChainJobFailed`** — `(kube_job_status_failed{namespace="k8s-upgrade",job_name=~"k8s-upgrade-.*",reason=~"BackoffLimitExceeded|DeadlineExceeded"} > 0) unless on() (k8s_upgrade_blocked == 1)` for 15m (warning). Catches a phase Job that **terminally failed before `k8s_upgrade_in_flight` was set** — the preflight gates exit pre-metric, so the two `in_flight`-based alerts above are blind to a failed preflight (this is what hid the 5-day 1.34.9 wedge on 2026-06-12). Reason-scoped to terminal job conditions so a retry-success doesn't false-positive (a bare failed-pod-count would otherwise also block kured for the Job's 7d TTL). 
The `unless k8s_upgrade_blocked == 1` clause (added 2026-06-21) excludes a preflight that failed because the **compat gate deliberately refused** the target — that's owned by `K8sUpgradeBlocked` and was double-firing here; a genuine wedge exits without setting the blocked gauge, so it still fires. +- **`K8sUpgradeChainJobFailed`** — `(kube_job_status_failed{namespace="k8s-upgrade",job_name=~"k8s-upgrade-(preflight|master|worker|postflight)-.*",reason=~"BackoffLimitExceeded|DeadlineExceeded"} > 0) unless on() (k8s_upgrade_blocked == 1)` for 15m (warning). Catches a phase Job that **terminally failed before `k8s_upgrade_in_flight` was set** — the preflight gates exit pre-metric, so the two `in_flight`-based alerts above are blind to a failed preflight (this is what hid the 5-day 1.34.9 wedge on 2026-06-12). Reason-scoped to terminal job conditions so a retry-success doesn't false-positive (a bare failed-pod-count would otherwise also block kured for the Job's 7d TTL). The `unless k8s_upgrade_blocked == 1` clause (added 2026-06-21) excludes a preflight that failed because the **compat gate deliberately refused** the target — that's owned by `K8sUpgradeBlocked` and was double-firing here; a genuine wedge exits without setting the blocked gauge, so it still fires. - **`K8sUpgradeBlocked`** — `k8s_upgrade_blocked == 1` (warning). A k8s **auto-upgrade was refused** by the compat gate because a critical addon, an in-use deprecated API, or a node's containerd is too old for the detected target. The **specific reasons are in Slack**; clear it by upgrading the named addon / migrating the API caller / bumping containerd, after which the next nightly run proceeds (see "Auto-upgrade compat gate"). No upgrade was attempted, so this is not a half-done-rollout alert. - The first four alerts ALSO block kured (same `--prometheus-url` halt-on-alert mechanism) so the OS-reboot pipeline can't run on top of a half-done version upgrade. 
+### Nightly upgrade report (Slack) + +CronJob `k8s-upgrade-nightly-report` (k8s-upgrade ns, `var.report_schedule`, +default `7 6 * * *` = 06:07 UTC — after the 23:00 chain, before the 08:00 London +alert-digest) posts ONE Slack summary each morning of the previous night's run: +running version, detector freshness, detected target + kind, the outcome +(⚪ no upgrade needed / 🔴 blocked + live blocker reasons / 🟢 upgraded / +🟡 in progress / ⚠️ detector stale), and recent chain jobs. Read-only — it reads +the Pushgateway gauges + live nodes/jobs and re-runs `compat-gate.py` for fresh +blocker reasons; reuses the chain's SA + `slack_webhook` + scripts ConfigMap. +Logic + unit tests: `scripts/nightly-report.py`, `scripts/test_nightly_report.py`. +This is the day-to-day visibility layer (it does NOT replace the alerts above — +those fire on problems; this reports the outcome every night). Manual run: +`kubectl -n k8s-upgrade create job --from=cronjob/k8s-upgrade-nightly-report nightly-report-test` +(name it WITHOUT a `k8s-upgrade-{phase}-` prefix so a failure can't trip +`K8sUpgradeChainJobFailed`). + ### CoreDNS is NOT upgraded by kubeadm here CoreDNS runs a **custom split-horizon Corefile** (owned by the technitium stack) diff --git a/stacks/k8s-version-upgrade/main.tf b/stacks/k8s-version-upgrade/main.tf index bbba532f..9c4f1626 100644 --- a/stacks/k8s-version-upgrade/main.tf +++ b/stacks/k8s-version-upgrade/main.tf @@ -41,6 +41,15 @@ variable "enabled" { default = true } +# Nightly upgrade-report CronJob schedule. 06:07 UTC (07:07 London) — safely +# after the 23:00 chain has finished (worst case ~02:00) and before the 08:00 +# London alert-digest, so the morning Slack skim shows last night's upgrade +# outcome + any live blocker. Posts once/day; read-only. +variable "report_schedule" { + type = string + default = "7 6 * * *" +} + # Mirrors `local.image_tag` in stacks/claude-agent-service/main.tf — bump # in lockstep with claude-agent-service rebuilds. 
The image ships kubectl, # ssh-client, curl, jq, envsubst — everything the upgrade Jobs need. @@ -301,6 +310,7 @@ resource "kubernetes_config_map" "k8s_upgrade_scripts" { "update_k8s.sh" = file("${path.module}/../../scripts/update_k8s.sh") "compat-gate.py" = file("${path.module}/scripts/compat-gate.py") "addon-compat.json" = file("${path.module}/scripts/addon-compat.json") + "nightly-report.py" = file("${path.module}/scripts/nightly-report.py") } } @@ -548,6 +558,98 @@ resource "kubernetes_cron_job_v1" "k8s_version_check" { } } +# --- Nightly upgrade report --- +# +# Each morning, after the 23:00 chain has finished, posts ONE concise Slack +# report of last night's upgrade outcome (no-op / blocked+reasons / upgraded / +# in-progress) so the autonomous upgrader's nightly result — and any live +# blocker — is visible at a glance. Read-only: reads the chain's Pushgateway +# gauges + live nodes/jobs and re-runs compat-gate.py for fresh blocker reasons. +# Reuses the same SA, creds secret (slack_webhook), and scripts ConfigMap as the +# chain. Logic + unit tests: scripts/nightly-report.py, scripts/test_nightly_report.py. 
+resource "kubernetes_cron_job_v1" "k8s_upgrade_nightly_report" { + metadata { + name = "k8s-upgrade-nightly-report" + namespace = kubernetes_namespace.k8s_upgrade.metadata[0].name + labels = local.labels + } + spec { + schedule = var.report_schedule + concurrency_policy = "Forbid" + successful_jobs_history_limit = 3 + failed_jobs_history_limit = 3 + starting_deadline_seconds = 600 + suspend = !var.enabled + job_template { + metadata { + labels = local.labels + } + spec { + backoff_limit = 1 + ttl_seconds_after_finished = 86400 + template { + metadata { + labels = local.labels + } + spec { + service_account_name = kubernetes_service_account.k8s_upgrade_job.metadata[0].name + restart_policy = "Never" + image_pull_secrets { + name = "registry-credentials" + } + volume { + name = "creds" + secret { + secret_name = "k8s-upgrade-creds" + default_mode = "0444" + } + } + volume { + name = "scripts" + config_map { + name = kubernetes_config_map.k8s_upgrade_scripts.metadata[0].name + default_mode = "0755" + } + } + container { + name = "report" + image = local.image + command = ["python3", "/scripts/nightly-report.py"] + env { + name = "HOME" + value = "/tmp" + } + volume_mount { + name = "creds" + mount_path = "/secrets/k8s-upgrade" + read_only = true + } + volume_mount { + name = "scripts" + mount_path = "/scripts" + read_only = true + } + resources { + requests = { + cpu = "50m" + memory = "128Mi" + } + limits = { + memory = "256Mi" + } + } + } + } + } + } + } + } + lifecycle { + # KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2 + ignore_changes = [spec[0].job_template[0].spec[0].template[0].spec[0].dns_config] + } +} + # CI retrigger 2026-05-16T13:42:57+00:00 — bulk enrollment apply (pipeline #689 killed) # CI retrigger v2 2026-05-16T13:46:35+00:00 diff --git a/stacks/k8s-version-upgrade/scripts/nightly-report.py b/stacks/k8s-version-upgrade/scripts/nightly-report.py new file mode 100644 index 00000000..63b862ea --- /dev/null +++ 
b/stacks/k8s-version-upgrade/scripts/nightly-report.py @@ -0,0 +1,224 @@ +#!/usr/bin/env python3 +"""Nightly k8s-upgrade report -> Slack. + +Runs each morning (CronJob k8s-upgrade-nightly-report) after the 23:00 UTC +version-check chain has finished. Reads the chain's Pushgateway gauges + live +cluster state and posts ONE concise, actionable report to Slack so the +autonomous upgrader's nightly outcome — and any blocker holding it back — is +visible at a glance during the upgrade-cleanup window. + +Outcomes it distinguishes: + ⚪ no upgrade needed — cluster already at the latest supported patch + 🔴 BLOCKED — compat gate refused the target; lists live reasons + 🟢 UPGRADED — all nodes now on the detected target + 🟡 in progress / passed — gate passed, chain mid-flight (or partial) + ⚠️ detector STALE — the 23:00 detector did not run last night + +Read-only. The pure helpers (parse_metrics / select / fmt_age / compose_report) +are unit-tested in test_nightly_report.py; all I/O (kubectl, Pushgateway, the +compat-gate subprocess, Slack) lives in thin wrappers below them. 
+""" +import json +import os +import re +import subprocess +import sys +import urllib.request + +PUSHGW = os.environ.get("PUSHGW", "http://prometheus-prometheus-pushgateway.monitoring:9091/metrics") +SLACK_FILE = os.environ.get("SLACK_FILE", "/secrets/k8s-upgrade/slack_webhook") +SCRIPTS_DIR = os.environ.get("SCRIPTS_DIR", "/scripts") +STALE_SECONDS = 90000 # ~25h: a nightly detector older than this didn't run last night + +_METRIC_RE = re.compile(r"^(?P<name>\w+)(?:\{(?P<labels>[^}]*)\})?\s+(?P<value>[-+0-9.eE]+)\s*$") +_LABEL_RE = re.compile(r'(\w+)="([^"]*)"') + + +# ---------------------------------------------------------------- pure helpers +def parse_metrics(text): + """Prometheus text exposition -> list of (name, {labels}, float value).""" + out = [] + for line in (text or "").splitlines(): + line = line.strip() + if not line or line.startswith("#"): + continue + m = _METRIC_RE.match(line) + if not m: + continue + labels = dict(_LABEL_RE.findall(m.group("labels") or "")) + try: + val = float(m.group("value")) + except ValueError: + continue + out.append((m.group("name"), labels, val)) + return out + + +def select(metrics, name): + """All (labels, value) tuples for a given metric name.""" + return [(lbl, val) for (n, lbl, val) in metrics if n == name] + + +def fmt_age(seconds): + if seconds < 0: + return "in the future?!" + if seconds < 3600: + return f"{int(seconds // 60)}m ago" + if seconds < 86400: + return f"{seconds / 3600:.1f}h ago" + return f"{seconds / 86400:.1f}d ago" + + +def compose_report(now_ts, nodes, metrics, blocker_reasons, jobs): + """Build the Slack message text from gathered facts. PURE. + + nodes: list of (name, kubeletVersion). metrics: parse_metrics() output. + blocker_reasons: multi-line str (compat-gate output) or None. + jobs: list of {name, status, age_s}. + """ + # kubelet reports "v1.35.6" but the gauges carry "1.35.6" — normalise so the + # UPGRADED comparison against the target actually matches. 
+ versions = sorted({v.lstrip("v") for _, v in nodes}) + if len(versions) == 1: + node_line = f"Running: *{versions[0]}* (all {len(nodes)} nodes uniform)" + elif versions: + node_line = f"Running: *MIXED* {', '.join(versions)} across {len(nodes)} nodes" + else: + node_line = "Running: *unknown* (could not read nodes)" + + lr = select(metrics, "k8s_version_check_last_run_timestamp") + stale = False + if lr: + age = now_ts - lr[0][1] + stale = age > STALE_SECONDS + run_line = f"Last detector run: {fmt_age(age)} ({'STALE ⚠️' if stale else 'fresh ✓'})" + else: + run_line = "Last detector run: *unknown* (no metric)" + stale = True + + avail = [(lbl, val) for lbl, val in select(metrics, "k8s_upgrade_available") if val == 1] + blocked = any(val == 1 for _, val in select(metrics, "k8s_upgrade_blocked")) + + if avail: + lbl = avail[0][0] + target = lbl.get("target", "?") + kind = lbl.get("kind", "?") + tgt_line = f"Detected target: *{target}* ({kind})" + if blocked: + headline = f"🔴 BLOCKED — compat gate refused {target}" + elif len(versions) == 1 and target == versions[0]: + headline = f"🟢 UPGRADED — all nodes now on {target}" + else: + headline = f"🟡 IN PROGRESS / gate passed for {target}" + else: + target = None + tgt_line = "Detected target: none" + headline = "⚪ No upgrade needed (cluster at latest supported patch)" + + if stale: + headline = "⚠️ Detector did not run last night — " + headline + + msg = [f"*[k8s-upgrade nightly]* {headline}", node_line, run_line, tgt_line] + + if blocked and blocker_reasons: + msg.append("Blockers (live):") + for r in blocker_reasons.splitlines(): + r = r.strip() + if r: + msg.append(f" • {r}") + + if jobs: + msg.append("Chain jobs (recent):") + for j in jobs: + msg.append(f" • {j['name']}: {j['status']} ({fmt_age(j['age_s'])})") + + return "\n".join(msg) + + +# ----------------------------------------------------------------------- I/O +def _kubectl_json(args): + try: + r = subprocess.run(["kubectl", *args], capture_output=True, 
text=True, timeout=30) + return json.loads(r.stdout) if r.stdout.strip() else {} + except Exception: + return {} + + +def get_nodes(): + d = _kubectl_json(["get", "nodes", "-o", "json"]) + return [(it["metadata"]["name"], + it.get("status", {}).get("nodeInfo", {}).get("kubeletVersion", "?")) + for it in d.get("items", [])] + + +def _job_status(it): + st = it.get("status", {}) + for c in st.get("conditions", []): + if c.get("type") == "Failed" and c.get("status") == "True": + return "Failed" + if c.get("type") == "Complete" and c.get("status") == "True": + return "Complete" + if st.get("active"): + return "Active" + return "Pending" + + +def get_jobs(now_ts): + import datetime + d = _kubectl_json(["-n", "k8s-upgrade", "get", "jobs", "-o", "json"]) + out = [] + for it in d.get("items", []): + ct = it["metadata"].get("creationTimestamp") + try: + age = now_ts - datetime.datetime.strptime( + ct, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=datetime.timezone.utc).timestamp() + except Exception: + age = 0 + if age <= 93600: # last 26h only + out.append({"name": it["metadata"]["name"], "status": _job_status(it), "age_s": age}) + return sorted(out, key=lambda j: j["age_s"]) + + +def get_blocker_reasons(target): + try: + with open(f"{SCRIPTS_DIR}/addon-compat.json") as f: + matrix = f.read() + r = subprocess.run(["python3", f"{SCRIPTS_DIR}/compat-gate.py", target], + input=matrix, capture_output=True, text=True, timeout=60) + return r.stdout.strip() + except Exception as e: + return f"(could not run compat-gate: {e})" + + +def post_slack(text): + if os.environ.get("DRY_RUN"): + return # main() always prints the message; DRY_RUN just skips the POST + with open(SLACK_FILE) as f: + url = f.read().strip() + data = json.dumps({"text": text}).encode() + req = urllib.request.Request(url, data=data, headers={"Content-Type": "application/json"}) + urllib.request.urlopen(req, timeout=20) + + +def main(): + import time + now_ts = float(os.environ.get("NOW_TS", "")) if os.environ.get("NOW_TS") 
else time.time() + try: + metrics_txt = urllib.request.urlopen(PUSHGW, timeout=20).read().decode() + except Exception: + metrics_txt = "" + metrics = parse_metrics(metrics_txt) + nodes = get_nodes() + jobs = get_jobs(now_ts) + + avail = [(lbl, val) for lbl, val in select(metrics, "k8s_upgrade_available") if val == 1] + blocked = any(val == 1 for _, val in select(metrics, "k8s_upgrade_blocked")) + reasons = get_blocker_reasons(avail[0][0].get("target", "")) if (avail and blocked) else None + + msg = compose_report(now_ts, nodes, metrics, reasons, jobs) + post_slack(msg) + print(msg) + + +if __name__ == "__main__": + main() diff --git a/stacks/k8s-version-upgrade/scripts/test_nightly_report.py b/stacks/k8s-version-upgrade/scripts/test_nightly_report.py new file mode 100644 index 00000000..11c7e4d1 --- /dev/null +++ b/stacks/k8s-version-upgrade/scripts/test_nightly_report.py @@ -0,0 +1,81 @@ +"""Unit tests for nightly-report.py (pure helpers only). + +Run: pytest stacks/k8s-version-upgrade/scripts/test_nightly_report.py +Loaded via importlib because the filename has a hyphen. 
+""" +import importlib.util +import pathlib + +HERE = pathlib.Path(__file__).parent +_spec = importlib.util.spec_from_file_location("nightly_report", HERE / "nightly-report.py") +nr = importlib.util.module_from_spec(_spec) +_spec.loader.exec_module(nr) + +LAST_RUN = 1781996424.0 # 2026-06-20T23:00:24Z — matches last night's gauge +METRICS_BLOCKED = f"""# TYPE k8s_upgrade_available gauge +k8s_upgrade_available{{instance="",job="k8s-version-check",kind="minor",running="1.34.9",target="1.35.6"}} 1 +k8s_upgrade_blocked{{instance="",job="k8s-version-upgrade"}} 1 +k8s_version_check_last_run_timestamp{{instance="",job="k8s-version-check"}} {LAST_RUN} +""" +NODES_UNIFORM = [(f"k8s-node{i}", "v1.34.9") for i in range(7)] + + +def test_parse_metrics_basic(): + m = nr.parse_metrics(METRICS_BLOCKED) + names = {n for n, _, _ in m} + assert names == {"k8s_upgrade_available", "k8s_upgrade_blocked", "k8s_version_check_last_run_timestamp"} + avail = nr.select(m, "k8s_upgrade_available") + assert avail[0][0]["target"] == "1.35.6" + assert avail[0][0]["kind"] == "minor" + assert avail[0][1] == 1.0 + + +def test_parse_metrics_ignores_comments_and_junk(): + assert nr.parse_metrics("# HELP foo\n\ngarbage line\n") == [] + + +def test_fmt_age(): + assert nr.fmt_age(120) == "2m ago" + assert nr.fmt_age(7200) == "2.0h ago" + assert nr.fmt_age(172800) == "2.0d ago" + + +def test_compose_blocked_lists_reasons(): + m = nr.parse_metrics(METRICS_BLOCKED) + reasons = ("addon external-secrets v0.12 supports k8s <= 1.31; target 1.35 exceeds it\n" + "addon kyverno v1.16 supports k8s <= 1.34; target 1.35 exceeds it") + out = nr.compose_report(LAST_RUN + 30000, NODES_UNIFORM, m, reasons, []) + assert "🔴 BLOCKED" in out and "1.35.6" in out + assert "external-secrets" in out and "kyverno" in out + assert "all 7 nodes uniform" in out + assert "fresh ✓" in out + + +def test_compose_noop_when_no_target(): + m = nr.parse_metrics(f'k8s_version_check_last_run_timestamp{{}} {LAST_RUN}\n') + out = 
nr.compose_report(LAST_RUN + 30000, NODES_UNIFORM, m, None, []) + assert "⚪ No upgrade needed" in out + + +def test_compose_upgraded_when_nodes_match_target(): + m = nr.parse_metrics(f"""k8s_upgrade_available{{kind="minor",target="1.35.6"}} 1 +k8s_upgrade_blocked{{}} 0 +k8s_version_check_last_run_timestamp{{}} {LAST_RUN} +""") + nodes = [(f"k8s-node{i}", "v1.35.6") for i in range(7)] + out = nr.compose_report(LAST_RUN + 30000, nodes, m, None, []) + assert "🟢 UPGRADED" in out and "1.35.6" in out + + +def test_compose_stale_detector_flagged(): + m = nr.parse_metrics(METRICS_BLOCKED) + out = nr.compose_report(LAST_RUN + 200000, NODES_UNIFORM, m, "x", []) # ~55h later + assert "Detector did not run last night" in out + assert "STALE" in out + + +def test_compose_includes_recent_jobs(): + m = nr.parse_metrics(METRICS_BLOCKED) + jobs = [{"name": "k8s-upgrade-preflight-1-35-6", "status": "Failed", "age_s": 3600}] + out = nr.compose_report(LAST_RUN + 30000, NODES_UNIFORM, m, "x", jobs) + assert "k8s-upgrade-preflight-1-35-6: Failed" in out diff --git a/stacks/monitoring/modules/monitoring/prometheus_chart_values.tpl b/stacks/monitoring/modules/monitoring/prometheus_chart_values.tpl index d8f3a0a5..a504cedc 100755 --- a/stacks/monitoring/modules/monitoring/prometheus_chart_values.tpl +++ b/stacks/monitoring/modules/monitoring/prometheus_chart_values.tpl @@ -2244,6 +2244,10 @@ serverFiles: # idempotency guard the next detection cycle deletes + re-spawns the # Failed Job (clearing this within ~24h); a sustained firing means it # re-failed — investigate the root cause. + # Scoped to the four chain PHASE jobs (preflight|master|worker| + # postflight) — NOT a bare `k8s-upgrade-.*`, which would also match + # helper jobs in the namespace like k8s-upgrade-nightly-report-* and + # false-fire if one of those failed. 
# `unless on() k8s_upgrade_blocked == 1` excludes the case where the # preflight terminally failed because the compat gate deliberately # REFUSED the target: block() exits 1 (so the Failed Job re-spawns @@ -2255,7 +2259,7 @@ serverFiles: # block until the next run's preflight resets it to 0, so the exclusion # holds for the whole blocked period. - alert: K8sUpgradeChainJobFailed - expr: (kube_job_status_failed{namespace="k8s-upgrade", job_name=~"k8s-upgrade-.*", reason=~"BackoffLimitExceeded|DeadlineExceeded"} > 0) unless on() (k8s_upgrade_blocked == 1) + expr: (kube_job_status_failed{namespace="k8s-upgrade", job_name=~"k8s-upgrade-(preflight|master|worker|postflight)-.*", reason=~"BackoffLimitExceeded|DeadlineExceeded"} > 0) unless on() (k8s_upgrade_blocked == 1) for: 15m labels: severity: warning
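
The effect of narrowing the `job_name` matcher can be checked offline with a minimal sketch like the one below. It assumes only that Prometheus anchors `=~` matchers at both ends, which `re.fullmatch` mirrors for these simple patterns. The numeric suffix on the CronJob-spawned name is a hypothetical example; the other two names come from the unit test and the runbook's manual-run command.

```python
import re

# Prometheus fully anchors =~ label matchers, so fullmatch is the closest
# Python equivalent for these simple patterns.
OLD_PATTERN = re.compile(r"k8s-upgrade-.*")
NEW_PATTERN = re.compile(r"k8s-upgrade-(preflight|master|worker|postflight)-.*")


def alert_would_match(pattern, job_name):
    """Return True if a failed Job with this name would be selected."""
    return pattern.fullmatch(job_name) is not None


JOB_NAMES = [
    "k8s-upgrade-preflight-1-35-6",         # chain phase Job (name from the unit test)
    "k8s-upgrade-nightly-report-29123456",  # hypothetical CronJob-spawned helper Job
    "nightly-report-test",                  # manual run name suggested in the runbook
]

if __name__ == "__main__":
    for name in JOB_NAMES:
        old = alert_would_match(OLD_PATTERN, name)
        new = alert_would_match(NEW_PATTERN, name)
        print(f"{name}: old={old} new={new}")
```

Under the old pattern a failed helper Job such as the nightly report would be selected by the alert; under the new pattern only the four phase Jobs are, which is also why the runbook asks for manual runs to avoid a `k8s-upgrade-{phase}-` prefix.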