infra/stacks/k8s-version-upgrade/scripts/nightly-report.py

251 lines
9.5 KiB
Python

k8s-upgrade: nightly Slack report monitor + scope chain-failed alert to phases Adds a daily visibility layer so every night's autonomous-upgrade outcome is reviewable at a glance during the upgrade-cleanup window (Viktor: "track every night's upgrade for the next 7 days; clean up all bugs and blockers"). Last night (2026-06-20) confirmed BOTH prior fixes work in production: the detector resolved target 1.35.6 (k8s_upgrade_available) and the compat gate correctly REFUSED it (k8s_upgrade_blocked=1 -> K8sUpgradeBlocked) because ESO v0.12 (<=1.31) and kyverno v1.16 (<=1.34) don't support 1.35. What's here: - CronJob k8s-upgrade-nightly-report (06:07 UTC) -> one Slack summary/morning: running version, detector freshness, detected target, outcome (no-op / blocked+live reasons / upgraded / in-progress / detector-stale), recent jobs. Read-only: reads Pushgateway gauges + live nodes/jobs, re-runs compat-gate.py for fresh blockers; reuses the chain SA + slack_webhook + scripts ConfigMap. Pure helpers unit-tested (test_nightly_report.py, 8 cases incl. a real v-prefix bug TDD caught). Verified end-to-end in-cluster (posted to Slack). - K8sUpgradeChainJobFailed regex scoped from `k8s-upgrade-.*` to `k8s-upgrade-(preflight|master|worker|postflight)-.*` so the new report job (or any future helper) can't false-trip the chain-wedged alarm. Manual state repair (no git artifact): imported the orphaned `alert-digest` CronJob into the monitoring stack state (`tg import module.monitoring.kubernetes_cron_job_v1.alert_digest monitoring/alert-digest`). Root cause: when alert_digest was added (2026-06-12) the apply recorded its ConfigMap + Secret but not the CronJob, so every full monitoring apply since has failed with `cronjobs.batch "alert-digest" already exists` (Woodpecker pipeline 298 today) — surviving only via targeted prometheus applies. Now in state, so monitoring CI applies cleanly again. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-21 16:57:44 +00:00
#!/usr/bin/env python3
"""Weekly k8s-upgrade report -> Slack.
Runs Monday morning (as CronJob k8s-upgrade-nightly-report; the historical name
is kept) after the Sunday-night version-check chain has finished. Reads the
chain's Pushgateway gauges + live cluster state and posts ONE concise,
actionable report to Slack so the autonomous upgrader's weekly outcome, and any
blocker holding it back, is visible at a glance during the upgrade-cleanup window.
Outcomes it distinguishes:
  ⚪ no upgrade needed          cluster already at the latest supported patch
  🔴 BLOCKED                    compat gate refused the target; lists live reasons
  ⏸️ HELD                       waiting on upstream and/or a pinned addon; no alert
  🟢 UPGRADED                   all nodes now on the detected target
  🟡 in progress / passed       gate passed, chain mid-flight (or partial)
  ⚠️ detector STALE             the weekly (Sun 23:00) detector did not run
Read-only. The pure helpers (parse_metrics / select / fmt_age / compose_report)
are unit-tested in test_nightly_report.py; all I/O (kubectl, Pushgateway, the
compat-gate subprocess, Slack) lives in thin wrappers below them.
"""
import json
import os
import re
import subprocess
import sys
import urllib.request
PUSHGW = os.environ.get("PUSHGW", "http://prometheus-prometheus-pushgateway.monitoring:9091/metrics")
SLACK_FILE = os.environ.get("SLACK_FILE", "/secrets/k8s-upgrade/slack_webhook")
SCRIPTS_DIR = os.environ.get("SCRIPTS_DIR", "/scripts")
STALE_SECONDS = 90000 # ~25h: report runs Mon ~7h after the Sun-night check, so >25h ⇒ the weekly check didn't fire
_METRIC_RE = re.compile(r"^(?P<name>\w+)(?:\{(?P<labels>[^}]*)\})?\s+(?P<value>[-+0-9.eE]+)\s*$")
_LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


# ---------------------------------------------------------------- pure helpers
def parse_metrics(text):
    """Prometheus text exposition -> list of (name, {labels}, float value)."""
    out = []
    for line in (text or "").splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        m = _METRIC_RE.match(line)
        if not m:
            continue
        labels = dict(_LABEL_RE.findall(m.group("labels") or ""))
        try:
            val = float(m.group("value"))
        except ValueError:
            continue
        out.append((m.group("name"), labels, val))
    return out


def select(metrics, name):
    """All (labels, value) tuples for a given metric name."""
    return [(lbl, val) for (n, lbl, val) in metrics if n == name]


def fmt_age(seconds):
    if seconds < 0:
        return "in the future?!"
    if seconds < 3600:
        return f"{int(seconds // 60)}m ago"
    if seconds < 86400:
        return f"{seconds / 3600:.1f}h ago"
    return f"{seconds / 86400:.1f}d ago"
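A minimal sketch of how the parsing helpers above behave on a Pushgateway payload. The two regexes are copied from this module so the snippet runs on its own; the sample payload and the `kind="patch"` label value are illustrative assumptions, not captured output.

```python
import re

# Same patterns as _METRIC_RE / _LABEL_RE in the module, repeated for a standalone run.
METRIC_RE = re.compile(r"^(?P<name>\w+)(?:\{(?P<labels>[^}]*)\})?\s+(?P<value>[-+0-9.eE]+)\s*$")
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')

# Hypothetical payload: comment lines and non-metric lines must be skipped.
SAMPLE = """\
# TYPE k8s_upgrade_available gauge
k8s_upgrade_available{kind="patch",target="1.35.6"} 1
k8s_upgrade_blocked 0
k8s_version_check_last_run_timestamp 1.7503e+09
not a metric line
"""


def parse_sample(text):
    """Condensed stand-in for parse_metrics(): (name, labels, value) tuples."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        m = METRIC_RE.match(line)
        if not m:
            continue  # e.g. "not a metric line"
        try:
            value = float(m.group("value"))
        except ValueError:
            continue
        labels = dict(LABEL_RE.findall(m.group("labels") or ""))
        rows.append((m.group("name"), labels, value))
    return rows


rows = parse_sample(SAMPLE)
# Equivalent of select(rows, "k8s_upgrade_available")
available = [(lbl, val) for name, lbl, val in rows if name == "k8s_upgrade_available"]
print(available)
```

Unlabelled gauges come back with an empty label dict, which is why the report code can treat every metric uniformly as a `(labels, value)` pair.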
k8s-upgrade: classify compat-gate blocks as actionable vs held; quiet the held case The nightly upgrade chain detected 1.36, the preflight compat-gate refused it, and that produced a Failed preflight Job + a K8sUpgradeBlocked alert EVERY night — even though the block is unactionable (no kyverno/ESO release supports 1.36 yet, and gpu-operator is pinned to its current version because bumping it needs a newer NVIDIA driver image + Ubuntu/kernel we're not ready for). Viktor asked to teach the checker to tell 'we can fix this' apart from 'nothing to do but wait', and stop the nightly Failed-Job + alert noise for the latter. compat-gate.py now classifies each blocker: - ACTIONABLE: a newer addon version in addon-compat.json supports the target -> exit 2, k8s_upgrade_blocked=1 -> K8sUpgradeBlocked alert (reasons in the nightly report). - WAITING-on-upstream: no released version supports the target yet -> held. - PINNED: addon marked pinned in the matrix (gpu-operator) -> held. Held wins on a mix -> exit 4, k8s_upgrade_held=1 (NEW gauge), NO alert. Tidy the block path (Viktor's scope choice): deliberate gate decisions now make the preflight Job Complete cleanly (HALT_CHAIN stops chain progression without a non-zero exit), so they no longer create Failed Jobs. Dropped the now-obsolete 'unless k8s_upgrade_blocked==1' suppression from K8sUpgradeChainJobFailed. Gauge is pushed definitively once per run (no 1->0->1 flap that re-notifies). The detector re-spawns a refused-but-Complete preflight nightly (silently) so a standing hold still re-evaluates, and only announces genuine new/Failed spawns. nightly-report gains a quiet '⏸️ HELD' headline with reasons grouped by class. gpu-operator pinned in addon-compat.json (unpin = delete pinned + pin_reason). Net effect on 1.36: HELD + quiet (waiting on kyverno/ESO, gpu-operator pinned; Calico the lone actionable piece) — no nightly Failed Job, no alert, just the morning report's HELD line. 
Design: docs/plans/2026-06-28-k8s-upgrade-gate-held-classification.md Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-28 10:08:20 +00:00


def _render_reasons(blocker_reasons):
    """Group compat-gate reason lines by their [ACTIONABLE]/[WAITING]/[PINNED]
    tag into labelled sections, stripping the tag from each bullet. Untagged
    lines (older reason format) fall back to a generic 'Blockers' list. PURE.
    Returns a list of message lines."""
    lines = [r.strip() for r in (blocker_reasons or "").splitlines() if r.strip()]
    out, shown = [], set()
    for title, tag in (("Action needed", "[ACTIONABLE]"),
                       ("Waiting on upstream", "[WAITING]"),
                       ("Pinned (held by us)", "[PINNED]")):
        sub = [l for l in lines if l.startswith(tag)]
        if sub:
            out.append(f"{title}:")
            for l in sub:
                shown.add(l)
                out.append(f"{l[len(tag):].strip()}")
    rest = [l for l in lines if l not in shown]
    if rest:
        out.append("Blockers:")
        out.extend(f"{l}" for l in rest)
    return out
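The grouping above can be illustrated with a small standalone sketch. `group_reasons` is a hypothetical stand-in that mirrors `_render_reasons`, and the reason wording in `sample` is invented for illustration; only the three tags come from this module.

```python
# Section titles and tags as used by _render_reasons.
TAGS = (("Action needed", "[ACTIONABLE]"),
        ("Waiting on upstream", "[WAITING]"),
        ("Pinned (held by us)", "[PINNED]"))


def group_reasons(text):
    """Group tagged reason lines into titled sections; untagged lines go last."""
    lines = [r.strip() for r in (text or "").splitlines() if r.strip()]
    out, shown = [], set()
    for title, tag in TAGS:
        tagged = [ln for ln in lines if ln.startswith(tag)]
        if tagged:
            out.append(f"{title}:")
            for ln in tagged:
                shown.add(ln)
                out.append(ln[len(tag):].strip())  # drop the tag itself
    rest = [ln for ln in lines if ln not in shown]
    if rest:
        out.append("Blockers:")
        out.extend(rest)
    return out


# Hypothetical compat-gate output; the input order is deliberately shuffled.
sample = """\
[WAITING] kyverno: no released version supports 1.36
[PINNED] gpu-operator: pinned at current version
[ACTIONABLE] calico: a newer version supports 1.36
legacy untagged reason
"""

grouped = group_reasons(sample)
print("\n".join(grouped))
```

Sections are emitted in a fixed order (actionable first), regardless of the order in which the gate printed its reasons.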


def compose_report(now_ts, nodes, metrics, blocker_reasons, jobs):
    """Build the Slack message text from gathered facts. PURE.
    nodes: list of (name, kubeletVersion). metrics: parse_metrics() output.
    blocker_reasons: multi-line str (compat-gate output) or None.
    jobs: list of {name, status, age_s}.
    """
    # kubelet reports "v1.35.6" but the gauges carry "1.35.6"; normalise so the
    # UPGRADED comparison against the target actually matches.
    versions = sorted({v.lstrip("v") for _, v in nodes})
    if len(versions) == 1:
        node_line = f"Running: *{versions[0]}* (all {len(nodes)} nodes uniform)"
    elif versions:
        node_line = f"Running: *MIXED* {', '.join(versions)} across {len(nodes)} nodes"
    else:
        node_line = "Running: *unknown* (could not read nodes)"

    lr = select(metrics, "k8s_version_check_last_run_timestamp")
    stale = False
    if lr:
        age = now_ts - lr[0][1]
        stale = age > STALE_SECONDS
        run_line = f"Last detector run: {fmt_age(age)} ({'STALE ⚠️' if stale else 'fresh ✓'})"
    else:
        run_line = "Last detector run: *unknown* (no metric)"
        stale = True

    avail = [(lbl, val) for lbl, val in select(metrics, "k8s_upgrade_available") if val == 1]
    blocked = any(val == 1 for _, val in select(metrics, "k8s_upgrade_blocked"))
    held = any(val == 1 for _, val in select(metrics, "k8s_upgrade_held"))
    if avail:
        lbl = avail[0][0]
        target = lbl.get("target", "?")
        kind = lbl.get("kind", "?")
        tgt_line = f"Detected target: *{target}* ({kind})"
        if blocked:
            # actionable block: an addon upgrade would clear it (K8sUpgradeBlocked fired)
            headline = f"🔴 BLOCKED (action needed) — {target}"
        elif held:
            # waiting on upstream and/or a pinned addon: nothing to do but wait;
            # intentionally NO alert, this nightly line is the only signal
            headline = f"⏸️ HELD — {target} not yet upgradable"
        elif len(versions) == 1 and target == versions[0]:
            headline = f"🟢 UPGRADED — all nodes now on {target}"
        else:
            headline = f"🟡 IN PROGRESS / gate passed for {target}"
    else:
        target = None
        tgt_line = "Detected target: none"
        headline = "⚪ No upgrade needed (cluster at latest supported patch)"

    if stale:
        headline = "⚠️ Detector did not run last night — " + headline

    msg = [f"*[k8s-upgrade nightly]* {headline}", node_line, run_line, tgt_line]
    if (blocked or held) and blocker_reasons:
        msg.extend(_render_reasons(blocker_reasons))
    if jobs:
        msg.append("Chain jobs (recent):")
        for j in jobs:
            msg.append(f"{j['name']}: {j['status']} ({fmt_age(j['age_s'])})")
    return "\n".join(msg)


# ----------------------------------------------------------------------- I/O
def _kubectl_json(args):
    # Best-effort: any kubectl or JSON failure yields {} so the report still goes out.
    try:
        r = subprocess.run(["kubectl", *args], capture_output=True, text=True, timeout=30)
        return json.loads(r.stdout) if r.stdout.strip() else {}
    except Exception:
        return {}


def get_nodes():
    d = _kubectl_json(["get", "nodes", "-o", "json"])
    return [(it["metadata"]["name"],
             it.get("status", {}).get("nodeInfo", {}).get("kubeletVersion", "?"))
            for it in d.get("items", [])]


def _job_status(it):
    st = it.get("status", {})
    for c in st.get("conditions", []):
        if c.get("type") == "Failed" and c.get("status") == "True":
            return "Failed"
        if c.get("type") == "Complete" and c.get("status") == "True":
            return "Complete"
    if st.get("active"):
        return "Active"
    return "Pending"


def get_jobs(now_ts):
    import datetime
    d = _kubectl_json(["-n", "k8s-upgrade", "get", "jobs", "-o", "json"])
    out = []
    for it in d.get("items", []):
        ct = it["metadata"].get("creationTimestamp")
        try:
            age = now_ts - datetime.datetime.strptime(
                ct, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=datetime.timezone.utc).timestamp()
        except Exception:
            age = 0
        if age <= 93600:  # last 26h only
            out.append({"name": it["metadata"]["name"], "status": _job_status(it), "age_s": age})
    return sorted(out, key=lambda j: j["age_s"])


def get_blocker_reasons(target):
    try:
        with open(f"{SCRIPTS_DIR}/addon-compat.json") as f:
            matrix = f.read()
        r = subprocess.run(["python3", f"{SCRIPTS_DIR}/compat-gate.py", target],
                           input=matrix, capture_output=True, text=True, timeout=60)
        return r.stdout.strip()
    except Exception as e:
        return f"(could not run compat-gate: {e})"


def post_slack(text):
    if os.environ.get("DRY_RUN"):
        return  # main() always prints the message; DRY_RUN just skips the POST
    with open(SLACK_FILE) as f:
        url = f.read().strip()
    data = json.dumps({"text": text}).encode()
    req = urllib.request.Request(url, data=data, headers={"Content-Type": "application/json"})
    urllib.request.urlopen(req, timeout=20)


def main():
    import time
    now_ts = float(os.environ.get("NOW_TS", "")) if os.environ.get("NOW_TS") else time.time()
    try:
        metrics_txt = urllib.request.urlopen(PUSHGW, timeout=20).read().decode()
    except Exception:
        metrics_txt = ""
    metrics = parse_metrics(metrics_txt)
    nodes = get_nodes()
    jobs = get_jobs(now_ts)
    avail = [(lbl, val) for lbl, val in select(metrics, "k8s_upgrade_available") if val == 1]
    blocked = any(val == 1 for _, val in select(metrics, "k8s_upgrade_blocked"))
    held = any(val == 1 for _, val in select(metrics, "k8s_upgrade_held"))
    reasons = get_blocker_reasons(avail[0][0].get("target", "")) if (avail and (blocked or held)) else None
    msg = compose_report(now_ts, nodes, metrics, reasons, jobs)
    print(msg)  # print before posting so the report reaches the Job log even if the Slack POST fails
    post_slack(msg)


if __name__ == "__main__":
    main()