infra/stacks/k8s-version-upgrade/scripts/nightly-report.py

#!/usr/bin/env python3
"""Weekly k8s-upgrade report -> Slack.

Runs Monday morning (CronJob k8s-upgrade-nightly-report — historical name kept)
after the Sunday-night version-check chain has finished. Reads the chain's
Pushgateway gauges + live cluster state and posts ONE concise, actionable report
to Slack so the autonomous upgrader's weekly outcome — and any blocker holding it
back — is visible at a glance during the upgrade-cleanup window.

Outcomes it distinguishes:
  ⚪ no upgrade needed     - cluster already at the latest supported patch
  🔴 BLOCKED              - compat gate refused the target (actionable); lists live reasons
  ⏸️ HELD                 - target waiting on upstream or a pinned addon; no alert
  🟢 UPGRADED             - all nodes now on the detected target
  🟡 in progress / passed - gate passed, chain mid-flight (or partial)
  ⚠️  detector STALE       - the weekly (Sun 23:00) detector did not run

Read-only. The pure helpers (parse_metrics / select / fmt_age / compose_report)
are unit-tested in test_nightly_report.py; all I/O (kubectl, Pushgateway, the
compat-gate subprocess, Slack) lives in thin wrappers below them.
"""
import json
import os
import re
import subprocess
import urllib.request

PUSHGW = os.environ.get("PUSHGW", "http://prometheus-prometheus-pushgateway.monitoring:9091/metrics")
SLACK_FILE = os.environ.get("SLACK_FILE", "/secrets/k8s-upgrade/slack_webhook")
SCRIPTS_DIR = os.environ.get("SCRIPTS_DIR", "/scripts")
STALE_SECONDS = 90000  # ~25h: report runs Mon ~7h after the Sun-night check, so >25h ⇒ the weekly check didn't fire

_METRIC_RE = re.compile(r"^(?P<name>\w+)(?:\{(?P<labels>[^}]*)\})?\s+(?P<value>[-+0-9.eE]+)\s*$")
_LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


# ---------------------------------------------------------------- pure helpers
def parse_metrics(text):
    """Prometheus text exposition -> list of (name, {labels}, float value)."""
    out = []
    for line in (text or "").splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        m = _METRIC_RE.match(line)
        if not m:
            continue
        labels = dict(_LABEL_RE.findall(m.group("labels") or ""))
        try:
            val = float(m.group("value"))
        except ValueError:
            continue
        out.append((m.group("name"), labels, val))
    return out


def select(metrics, name):
    """All (labels, value) tuples for a given metric name."""
    return [(lbl, val) for (n, lbl, val) in metrics if n == name]


def fmt_age(seconds):
    """Human-readable age for a duration in seconds, e.g. '12m ago' or '7.1h ago'."""
    if seconds < 0:
        return "in the future?!"
    if seconds < 3600:
        return f"{int(seconds // 60)}m ago"
    if seconds < 86400:
        return f"{seconds / 3600:.1f}h ago"
    return f"{seconds / 86400:.1f}d ago"


def _render_reasons(blocker_reasons):
    """Group compat-gate reason lines by their [ACTIONABLE]/[WAITING]/[PINNED]
    tag into labelled sections, stripping the tag from each bullet. Untagged
    lines (older reason format) fall back to a generic 'Blockers' list. PURE.
    Returns a list of message lines."""
    lines = [r.strip() for r in (blocker_reasons or "").splitlines() if r.strip()]
    out, shown = [], set()
    for title, tag in (("Action needed", "[ACTIONABLE]"),
                       ("Waiting on upstream", "[WAITING]"),
                       ("Pinned (held by us)", "[PINNED]")):
        sub = [l for l in lines if l.startswith(tag)]
        if sub:
            out.append(f"{title}:")
            for l in sub:
                shown.add(l)
                out.append(f"  • {l[len(tag):].strip()}")
    rest = [l for l in lines if l not in shown]
    if rest:
        out.append("Blockers:")
        out.extend(f"  • {l}" for l in rest)
    return out


def compose_report(now_ts, nodes, metrics, blocker_reasons, jobs):
    """Build the Slack message text from gathered facts. PURE.

    nodes: list of (name, kubeletVersion). metrics: parse_metrics() output.
    blocker_reasons: multi-line str (compat-gate output) or None.
    jobs: list of {name, status, age_s}.
    """
    # kubelet reports "v1.35.6" but the gauges carry "1.35.6" — normalise so the
    # UPGRADED comparison against the target actually matches.
    versions = sorted({v.lstrip("v") for _, v in nodes})
    if len(versions) == 1:
        node_line = f"Running: *{versions[0]}* (all {len(nodes)} nodes uniform)"
    elif versions:
        node_line = f"Running: *MIXED* {', '.join(versions)} across {len(nodes)} nodes"
    else:
        node_line = "Running: *unknown* (could not read nodes)"

    lr = select(metrics, "k8s_version_check_last_run_timestamp")
    stale = False
    if lr:
        age = now_ts - lr[0][1]
        stale = age > STALE_SECONDS
        run_line = f"Last detector run: {fmt_age(age)} ({'STALE ⚠️' if stale else 'fresh ✓'})"
    else:
        run_line = "Last detector run: *unknown* (no metric)"
        stale = True

    avail = [(lbl, val) for lbl, val in select(metrics, "k8s_upgrade_available") if val == 1]
    blocked = any(val == 1 for _, val in select(metrics, "k8s_upgrade_blocked"))
    held = any(val == 1 for _, val in select(metrics, "k8s_upgrade_held"))

    if avail:
        lbl = avail[0][0]
        target = lbl.get("target", "?")
        kind = lbl.get("kind", "?")
        tgt_line = f"Detected target: *{target}* ({kind})"
        if blocked:
            # actionable block — an addon upgrade would clear it (K8sUpgradeBlocked fired)
            headline = f"🔴 BLOCKED (action needed) — {target}"
        elif held:
            # waiting on upstream and/or a pinned addon: nothing to do but wait.
            # Intentionally NO alert; this report line is the only signal.
            headline = f"⏸️ HELD — {target} not yet upgradable"
        elif len(versions) == 1 and target == versions[0]:
            headline = f"🟢 UPGRADED — all nodes now on {target}"
        else:
            headline = f"🟡 IN PROGRESS / gate passed for {target}"
    else:
        target = None
        tgt_line = "Detected target: none"
        headline = "⚪ No upgrade needed (cluster at latest supported patch)"

    if stale:
        headline = "⚠️ Detector did not run last night — " + headline

    msg = [f"*[k8s-upgrade nightly]* {headline}", node_line, run_line, tgt_line]

    if (blocked or held) and blocker_reasons:
        msg.extend(_render_reasons(blocker_reasons))

    if jobs:
        msg.append("Chain jobs (recent):")
        for j in jobs:
            msg.append(f"  • {j['name']}: {j['status']} ({fmt_age(j['age_s'])})")

    return "\n".join(msg)


# ----------------------------------------------------------------------- I/O
def _kubectl_json(args):
    try:
        r = subprocess.run(["kubectl", *args], capture_output=True, text=True, timeout=30)
        return json.loads(r.stdout) if r.stdout.strip() else {}
    except Exception:
        return {}


def get_nodes():
    d = _kubectl_json(["get", "nodes", "-o", "json"])
    return [(it["metadata"]["name"],
             it.get("status", {}).get("nodeInfo", {}).get("kubeletVersion", "?"))
            for it in d.get("items", [])]


def _job_status(it):
    st = it.get("status", {})
    for c in st.get("conditions", []):
        if c.get("type") == "Failed" and c.get("status") == "True":
            return "Failed"
        if c.get("type") == "Complete" and c.get("status") == "True":
            return "Complete"
    if st.get("active"):
        return "Active"
    return "Pending"


def get_jobs(now_ts):
    import datetime
    d = _kubectl_json(["-n", "k8s-upgrade", "get", "jobs", "-o", "json"])
    out = []
    for it in d.get("items", []):
        ct = it["metadata"].get("creationTimestamp")
        try:
            age = now_ts - datetime.datetime.strptime(
                ct, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=datetime.timezone.utc).timestamp()
        except Exception:
            age = 0
        if age <= 93600:  # last 26h only
            out.append({"name": it["metadata"]["name"], "status": _job_status(it), "age_s": age})
    return sorted(out, key=lambda j: j["age_s"])


def get_blocker_reasons(target):
    try:
        with open(f"{SCRIPTS_DIR}/addon-compat.json") as f:
            matrix = f.read()
        r = subprocess.run(["python3", f"{SCRIPTS_DIR}/compat-gate.py", target],
                           input=matrix, capture_output=True, text=True, timeout=60)
        return r.stdout.strip()
    except Exception as e:
        return f"(could not run compat-gate: {e})"


def post_slack(text):
    if os.environ.get("DRY_RUN"):
        return  # main() always prints the message; DRY_RUN just skips the POST
    with open(SLACK_FILE) as f:
        url = f.read().strip()
    data = json.dumps({"text": text}).encode()
    req = urllib.request.Request(url, data=data, headers={"Content-Type": "application/json"})
    urllib.request.urlopen(req, timeout=20)


def main():
    import time
    now_ts = float(os.environ.get("NOW_TS") or time.time())  # NOW_TS overrides the clock when set
    try:
        metrics_txt = urllib.request.urlopen(PUSHGW, timeout=20).read().decode()
    except Exception:
        metrics_txt = ""
    metrics = parse_metrics(metrics_txt)
    nodes = get_nodes()
    jobs = get_jobs(now_ts)

    avail = [(lbl, val) for lbl, val in select(metrics, "k8s_upgrade_available") if val == 1]
    blocked = any(val == 1 for _, val in select(metrics, "k8s_upgrade_blocked"))
    held = any(val == 1 for _, val in select(metrics, "k8s_upgrade_held"))
    reasons = get_blocker_reasons(avail[0][0].get("target", "")) if (avail and (blocked or held)) else None

    msg = compose_report(now_ts, nodes, metrics, reasons, jobs)
    print(msg)  # print first so the report reaches the Job log even if the Slack POST raises
    post_slack(msg)


if __name__ == "__main__":
    main()
k8s-upgrade: nightly Slack report monitor + scope chain-failed alert to phases

Adds a daily visibility layer so every night's autonomous-upgrade outcome is
reviewable at a glance during the upgrade-cleanup window (Viktor: "track every
night's upgrade for the next 7 days; clean up all bugs and blockers").

Last night (2026-06-20) confirmed BOTH prior fixes work in production: the
detector resolved target 1.35.6 (k8s_upgrade_available) and the compat gate
correctly REFUSED it (k8s_upgrade_blocked=1 -> K8sUpgradeBlocked) because ESO
v0.12 (<=1.31) and kyverno v1.16 (<=1.34) don't support 1.35.

What's here:
- CronJob k8s-upgrade-nightly-report (06:07 UTC) -> one Slack summary/morning:
  running version, detector freshness, detected target, outcome (no-op /
  blocked+live reasons / upgraded / in-progress / detector-stale), recent jobs.
  Read-only: reads Pushgateway gauges + live nodes/jobs, re-runs compat-gate.py
  for fresh blockers; reuses the chain SA + slack_webhook + scripts ConfigMap.
  Pure helpers unit-tested (test_nightly_report.py, 8 cases incl. a real
  v-prefix bug TDD caught). Verified end-to-end in-cluster (posted to Slack).
- K8sUpgradeChainJobFailed regex scoped from `k8s-upgrade-.*` to
  `k8s-upgrade-(preflight|master|worker|postflight)-.*` so the new report job
  (or any future helper) can't false-trip the chain-wedged alarm.

Manual state repair (no git artifact): imported the orphaned `alert-digest`
CronJob into the monitoring stack state
(`tg import module.monitoring.kubernetes_cron_job_v1.alert_digest monitoring/alert-digest`).
Root cause: when alert_digest was added (2026-06-12) the apply recorded its
ConfigMap + Secret but not the CronJob, so every full monitoring apply since has
failed with `cronjobs.batch "alert-digest" already exists` (Woodpecker pipeline
298 today) — surviving only via targeted prometheus applies. Now in state, so
monitoring CI applies cleanly again.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

2026-06-21 16:57:44 +00:00
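
The effect of the regex scoping described in this commit can be checked with a minimal sketch. The two patterns are the ones quoted in the message; the sample Job names are hypothetical, and the use of `fullmatch()` assumes the alert rule applies the pattern with a Prometheus regex matcher, which is fully anchored.

```python
import re

# Old and new job-name patterns quoted in the commit message.
OLD_PATTERN = re.compile(r"k8s-upgrade-.*")
NEW_PATTERN = re.compile(r"k8s-upgrade-(preflight|master|worker|postflight)-.*")


def alert_would_match(pattern, job_name):
    """Prometheus regex matchers are fully anchored, so fullmatch() mirrors them."""
    return pattern.fullmatch(job_name) is not None


# Hypothetical Job names; only the prefixes come from the commit message.
SAMPLE_JOBS = [
    "k8s-upgrade-preflight-1-35-6",
    "k8s-upgrade-worker-node2",
    "k8s-upgrade-nightly-report-29180527",
]

for name in SAMPLE_JOBS:
    print(f"{name}: old={alert_would_match(OLD_PATTERN, name)} "
          f"new={alert_would_match(NEW_PATTERN, name)}")
```

Under these assumptions the report Job matches only the old pattern, so a failed report run would no longer look like a wedged chain phase.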
k8s-upgrade: move version-check cadence from daily to weekly (Sun check, Mon report)

Viktor asked to move the upgrade checks to weekly. With the actionable-vs-held
gate now quieting the routine 'held' churn (e.g. 1.36), a daily check + attempt
buys little; weekly is enough. Accepted trade-off: k8s patch (incl. security)
uptake now lags up to 7 days instead of <=1.

- var.schedule:        0 23 * * *  ->  0 23 * * 0   (detector: weekly Sunday 23:00 UTC)
- var.report_schedule: 7 6 * * *   ->  7 6 * * 1    (report: Monday 06:07 UTC, ~7h
  after the Sunday check, so nightly-report.py's ~25h staleness threshold stays
  valid AND still flags a missed weekly run; no STALE_SECONDS change needed)

The report CronJob keeps its historical name k8s-upgrade-nightly-report (rename
= churn). Cadence wording updated across main.tf comments, nightly-report.py
docstring, and the runbook.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

2026-06-29 06:22:20 +00:00
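
The claim that the ~25h threshold survives the move to a weekly cadence can be verified with a little arithmetic. The sketch below reuses `STALE_SECONDS = 90000` from nightly-report.py and the two cron schedules from this commit; the concrete dates and the `is_stale` helper are illustrative only.

```python
from datetime import datetime, timedelta, timezone

STALE_SECONDS = 90000  # ~25h, the value used in nightly-report.py


def is_stale(report_time, last_detector_run):
    """True when the detector's last run is older than the staleness threshold."""
    return (report_time - last_detector_run).total_seconds() > STALE_SECONDS


# Illustrative dates only: 2026-07-05 is a Sunday, 2026-07-06 the following Monday.
sunday_check = datetime(2026, 7, 5, 23, 0, tzinfo=timezone.utc)   # 0 23 * * 0
monday_report = datetime(2026, 7, 6, 6, 7, tzinfo=timezone.utc)   # 7 6 * * 1

# Normal week: the gap is 7h07m (25620 s), well under 90000 s.
normal_gap = (monday_report - sunday_check).total_seconds()

# Missed week: the newest timestamp is from the previous Sunday, 7 days earlier.
previous_sunday = sunday_check - timedelta(days=7)
missed_gap = (monday_report - previous_sunday).total_seconds()

print(normal_gap, is_stale(monday_report, sunday_check))
print(missed_gap, is_stale(monday_report, previous_sunday))
```

A normal week leaves a gap of 25620 s, which is fresh. A missed Sunday run leaves 630420 s, which is flagged as stale, so the threshold needs no change.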
k8s-upgrade: classify compat-gate blocks as actionable vs held; quiet the held case

The nightly upgrade chain detected 1.36, the preflight compat-gate refused it,
and that produced a Failed preflight Job + a K8sUpgradeBlocked alert EVERY
night — even though the block is unactionable (no kyverno/ESO release supports
1.36 yet, and gpu-operator is pinned to its current version because bumping it
needs a newer NVIDIA driver image + Ubuntu/kernel we're not ready for). Viktor
asked to teach the checker to tell 'we can fix this' apart from 'nothing to do
but wait', and stop the nightly Failed-Job + alert noise for the latter.

compat-gate.py now classifies each blocker:
  - ACTIONABLE: a newer addon version in addon-compat.json supports the target
    -> exit 2, k8s_upgrade_blocked=1 -> K8sUpgradeBlocked alert (reasons in the
    nightly report).
  - WAITING-on-upstream: no released version supports the target yet -> held.
  - PINNED: addon marked pinned in the matrix (gpu-operator) -> held.
Held wins on a mix -> exit 4, k8s_upgrade_held=1 (NEW gauge), NO alert.

Tidy the block path (Viktor's scope choice): deliberate gate decisions now make
the preflight Job Complete cleanly (HALT_CHAIN stops chain progression without a
non-zero exit), so they no longer create Failed Jobs. Dropped the now-obsolete
'unless k8s_upgrade_blocked==1' suppression from K8sUpgradeChainJobFailed. Gauge
is pushed definitively once per run (no 1->0->1 flap that re-notifies). The
detector re-spawns a refused-but-Complete preflight nightly (silently) so a
standing hold still re-evaluates, and only announces genuine new/Failed spawns.

nightly-report gains a quiet '⏸️ HELD' headline with reasons grouped by class.
gpu-operator pinned in addon-compat.json (unpin = delete pinned + pin_reason).

Net effect on 1.36: HELD + quiet (waiting on kyverno/ESO, gpu-operator pinned;
Calico the lone actionable piece) — no nightly Failed Job, no alert, just the
morning report's HELD line. Design: docs/plans/2026-06-28-k8s-upgrade-gate-held-classification.md

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

2026-06-28 10:08:20 +00:00
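
The "held wins on a mix" rule from this commit can be sketched as follows. compat-gate.py itself is not shown here, so this is not its implementation: `classify_gate` is a hypothetical helper, exit code 0 for a clean pass is an assumption, and so is resetting `k8s_upgrade_blocked` to 0 in the held case (inferred from "NO alert"). Exit codes 2 and 4 and the gauge names come from the commit message.

```python
HELD_CLASSES = {"WAITING", "PINNED"}
KNOWN_CLASSES = HELD_CLASSES | {"ACTIONABLE"}


def classify_gate(blocker_classes):
    """Return (exit_code, gauges) for an iterable of per-addon blocker classes."""
    classes = set(blocker_classes)
    unknown = classes - KNOWN_CLASSES
    if unknown:
        raise ValueError(f"unknown blocker class(es): {sorted(unknown)}")
    if not classes:
        # Assumption: a clean pass exits 0 with both gauges cleared.
        return 0, {"k8s_upgrade_blocked": 0, "k8s_upgrade_held": 0}
    if classes & HELD_CLASSES:
        # Held wins on a mix: fixing the actionable addon alone would not
        # unblock the upgrade, so stay quiet (exit 4, no alert).
        return 4, {"k8s_upgrade_blocked": 0, "k8s_upgrade_held": 1}
    # Every blocker is actionable: exit 2 and raise K8sUpgradeBlocked.
    return 2, {"k8s_upgrade_blocked": 1, "k8s_upgrade_held": 0}


# The 1.36 situation described in the commit message.
blockers_1_36 = {
    "kyverno": "WAITING",
    "ESO": "WAITING",
    "gpu-operator": "PINNED",
    "Calico": "ACTIONABLE",
}
print(classify_gate(blockers_1_36.values()))
print(classify_gate(["ACTIONABLE"]))
```

With the 1.36 inputs the sketch lands on the held path, matching the "HELD + quiet" net effect described in the message.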
+								def _render_reasons(blocker_reasons):
 								    """Group compat-gate reason lines by their [ACTIONABLE]/[WAITING]/[PINNED]
 								    tag into labelled sections, stripping the tag from each bullet. Untagged
 								    lines (older reason format) fall back to a generic 'Blockers' list. PURE.
 								    Returns a list of message lines."""
 								    lines = [r.strip() for r in (blocker_reasons or "").splitlines() if r.strip()]
 								    out, shown = [], set()
 								    for title, tag in (("Action needed", "[ACTIONABLE]"),
 								                       ("Waiting on upstream", "[WAITING]"),
 								                       ("Pinned (held by us)", "[PINNED]")):
 								        sub = [l for l in lines if l.startswith(tag)]
 								        if sub:
 								            out.append(f"{title}:")
 								            for l in sub:
 								                shown.add(l)
 								                out.append(f"  • {l[len(tag):].strip()}")
 								    rest = [l for l in lines if l not in shown]
 								    if rest:
 								        out.append("Blockers:")
 								        out.extend(f"  • {l}" for l in rest)
 								    return out
-												k8s-upgrade: nightly Slack report monitor + scope chain-failed alert to phases

Adds a daily visibility layer so every night's autonomous-upgrade outcome is
reviewable at a glance during the upgrade-cleanup window (Viktor: "track every
night's upgrade for the next 7 days; clean up all bugs and blockers").

Last night (2026-06-20) confirmed BOTH prior fixes work in production: the
detector resolved target 1.35.6 (k8s_upgrade_available) and the compat gate
correctly REFUSED it (k8s_upgrade_blocked=1 -> K8sUpgradeBlocked) because ESO
v0.12 (<=1.31) and kyverno v1.16 (<=1.34) don't support 1.35.

What's here:
- CronJob k8s-upgrade-nightly-report (06:07 UTC) -> one Slack summary/morning:
  running version, detector freshness, detected target, outcome (no-op /
  blocked+live reasons / upgraded / in-progress / detector-stale), recent jobs.
  Read-only: reads Pushgateway gauges + live nodes/jobs, re-runs compat-gate.py
  for fresh blockers; reuses the chain SA + slack_webhook + scripts ConfigMap.
  Pure helpers unit-tested (test_nightly_report.py, 8 cases incl. a real
  v-prefix bug TDD caught). Verified end-to-end in-cluster (posted to Slack).
- K8sUpgradeChainJobFailed regex scoped from `k8s-upgrade-.*` to
  `k8s-upgrade-(preflight|master|worker|postflight)-.*` so the new report job
  (or any future helper) can't false-trip the chain-wedged alarm.
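
The effect of the narrower alert regex can be checked with a minimal sketch. It assumes Prometheus-style full anchoring of regex matchers (approximated here with `re.fullmatch`), and the job names are hypothetical, chosen only for illustration.

```python
import re

# Prometheus anchors regex label matchers at both ends, so fullmatch is used
# to approximate that behaviour.
OLD = re.compile(r"k8s-upgrade-.*")
NEW = re.compile(r"k8s-upgrade-(preflight|master|worker|postflight)-.*")

# Hypothetical job names, for illustration only.
JOBS = [
    "k8s-upgrade-preflight-1-35-6",
    "k8s-upgrade-worker-node3",
    "k8s-upgrade-nightly-report-29000000",
]


def matched(pattern, names):
    """Return the job names the pattern would select."""
    return [n for n in names if pattern.fullmatch(n)]


old_hits = matched(OLD, JOBS)  # includes the report job
new_hits = matched(NEW, JOBS)  # chain phases only
```

With the old pattern a failed report job would be selected alongside the chain phases; with the scoped pattern it is not.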

Manual state repair (no git artifact): imported the orphaned `alert-digest`
CronJob into the monitoring stack state
(`tg import module.monitoring.kubernetes_cron_job_v1.alert_digest monitoring/alert-digest`).
Root cause: when alert_digest was added (2026-06-12) the apply recorded its
ConfigMap + Secret but not the CronJob, so every full monitoring apply since has
failed with `cronjobs.batch "alert-digest" already exists` (Woodpecker pipeline
298 today) — surviving only via targeted prometheus applies. Now in state, so
monitoring CI applies cleanly again.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

											
										
										
											2026-06-21 16:57:44 +00:00
+								def compose_report(now_ts, nodes, metrics, blocker_reasons, jobs):
 								    """Build the Slack message text from gathered facts. PURE.
 								    nodes: list of (name, kubeletVersion). metrics: parse_metrics() output.
 								    blocker_reasons: multi-line str (compat-gate output) or None.
 								    jobs: list of {name, status, age_s}.
 								    """
 								    # kubelet reports "v1.35.6" but the gauges carry "1.35.6" — normalise so the
 								    # UPGRADED comparison against the target actually matches.
 								    versions = sorted({v.lstrip("v") for _, v in nodes})
 								    if len(versions) == 1:
 								        node_line = f"Running: *{versions[0]}* (all {len(nodes)} nodes uniform)"
 								    elif versions:
 								        node_line = f"Running: *MIXED* {', '.join(versions)} across {len(nodes)} nodes"
 								    else:
 								        node_line = "Running: *unknown* (could not read nodes)"
 								    lr = select(metrics, "k8s_version_check_last_run_timestamp")
 								    stale = False
 								    if lr:
 								        age = now_ts - lr[0][1]
 								        stale = age > STALE_SECONDS
 								        run_line = f"Last detector run: {fmt_age(age)} ({'STALE ⚠️' if stale else 'fresh ✓'})"
 								    else:
 								        run_line = "Last detector run: *unknown* (no metric)"
 								        stale = True
 								    avail = [(lbl, val) for lbl, val in select(metrics, "k8s_upgrade_available") if val == 1]
 								    blocked = any(val == 1 for _, val in select(metrics, "k8s_upgrade_blocked"))
-												k8s-upgrade: classify compat-gate blocks as actionable vs held; quiet the held case

The nightly upgrade chain detected 1.36, the preflight compat-gate refused it,
and that produced a Failed preflight Job + a K8sUpgradeBlocked alert EVERY
night — even though the block is unactionable (no kyverno/ESO release supports
1.36 yet, and gpu-operator is pinned to its current version because bumping it
needs a newer NVIDIA driver image + Ubuntu/kernel we're not ready for). Viktor
asked to teach the checker to tell 'we can fix this' apart from 'nothing to do
but wait', and stop the nightly Failed-Job + alert noise for the latter.

compat-gate.py now classifies each blocker:
  - ACTIONABLE: a newer addon version in addon-compat.json supports the target
    -> exit 2, k8s_upgrade_blocked=1 -> K8sUpgradeBlocked alert (reasons in the
    nightly report).
  - WAITING-on-upstream: no released version supports the target yet -> held.
  - PINNED: addon marked pinned in the matrix (gpu-operator) -> held.
Held wins on a mix -> exit 4, k8s_upgrade_held=1 (NEW gauge), NO alert.
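
The precedence rule above can be sketched as follows. This is not the actual compat-gate.py: the function name and the class strings are hypothetical, exit codes 2 and 4 come from the text above, and exit 0 for "no blockers" is an assumption.

```python
# Exit codes 2 and 4 are taken from the commit message; 0 is an assumption.
EXIT_OK, EXIT_ACTIONABLE, EXIT_HELD = 0, 2, 4
HELD_CLASSES = {"WAITING", "PINNED"}


def gate_exit_code(classes):
    """classes: iterable of 'ACTIONABLE' | 'WAITING' | 'PINNED', one per blocker."""
    classes = set(classes)
    if classes & HELD_CLASSES:
        return EXIT_HELD  # held wins on a mix: no alert, quiet HELD line
    if "ACTIONABLE" in classes:
        return EXIT_ACTIONABLE  # every blocker is fixable by us: alert fires
    return EXIT_OK  # no blockers at all
```

Under this reading, a single pinned or waiting addon is enough to keep the whole run quiet, even if other blockers are actionable.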

Tidy the block path (Viktor's scope choice): deliberate gate decisions now make
the preflight Job Complete cleanly (HALT_CHAIN stops chain progression without a
non-zero exit), so they no longer create Failed Jobs. Dropped the now-obsolete
'unless k8s_upgrade_blocked==1' suppression from K8sUpgradeChainJobFailed. Gauge
is pushed definitively once per run (no 1->0->1 flap that re-notifies). The
detector re-spawns a refused-but-Complete preflight nightly (silently) so a
standing hold still re-evaluates, and only announces genuine new/Failed spawns.

nightly-report gains a quiet '⏸️ HELD' headline with reasons grouped by class.
gpu-operator pinned in addon-compat.json (unpin = delete pinned + pin_reason).

Net effect on 1.36: HELD + quiet (waiting on kyverno/ESO, gpu-operator pinned;
Calico the lone actionable piece) — no nightly Failed Job, no alert, just the
morning report's HELD line. Design: docs/plans/2026-06-28-k8s-upgrade-gate-held-classification.md

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

											
										
										
											2026-06-28 10:08:20 +00:00
+								    held = any(val == 1 for _, val in select(metrics, "k8s_upgrade_held"))
 								    if avail:
 								        lbl = avail[0][0]
 								        target = lbl.get("target", "?")
 								        kind = lbl.get("kind", "?")
 								        tgt_line = f"Detected target: *{target}* ({kind})"
 								        if blocked:
+								            # actionable block — an addon upgrade would clear it (K8sUpgradeBlocked fired)
 								            headline = f"🔴 BLOCKED (action needed) — {target}"
 								        elif held:
 								            # waiting on upstream and/or a pinned addon — nothing to do but wait;
 								            # intentionally NO alert, this nightly line is the only signal
 								            headline = f"⏸️ HELD — {target} not yet upgradable"
+								        elif len(versions) == 1 and target == versions[0]:
 								            headline = f"🟢 UPGRADED — all nodes now on {target}"
 								        else:
 								            headline = f"🟡 IN PROGRESS / gate passed for {target}"
 								    else:
 								        target = None
 								        tgt_line = "Detected target: none"
 								        headline = "⚪ No upgrade needed (cluster at latest supported patch)"
 								    if stale:
 								        headline = "⚠️ Detector did not run last night — " + headline
 								    msg = [f"*[k8s-upgrade nightly]* {headline}", node_line, run_line, tgt_line]
+								    if (blocked or held) and blocker_reasons:
 								        msg.extend(_render_reasons(blocker_reasons))
 								    if jobs:
 								        msg.append("Chain jobs (recent):")
 								        for j in jobs:
 								            msg.append(f"  • {j['name']}: {j['status']} ({fmt_age(j['age_s'])})")
 								    return "\n".join(msg)
 								# ----------------------------------------------------------------------- I/O
 								def _kubectl_json(args):
 								    try:
 								        r = subprocess.run(["kubectl", *args], capture_output=True, text=True, timeout=30)
 								        return json.loads(r.stdout) if r.stdout.strip() else {}
 								    except Exception:
 								        return {}
 								def get_nodes():
 								    d = _kubectl_json(["get", "nodes", "-o", "json"])
 								    return [(it["metadata"]["name"],
 								             it.get("status", {}).get("nodeInfo", {}).get("kubeletVersion", "?"))
 								            for it in d.get("items", [])]
 								def _job_status(it):
 								    st = it.get("status", {})
 								    for c in st.get("conditions", []):
 								        if c.get("type") == "Failed" and c.get("status") == "True":
 								            return "Failed"
 								        if c.get("type") == "Complete" and c.get("status") == "True":
 								            return "Complete"
 								    if st.get("active"):
 								        return "Active"
 								    return "Pending"
 								def get_jobs(now_ts):
 								    import datetime
 								    d = _kubectl_json(["-n", "k8s-upgrade", "get", "jobs", "-o", "json"])
 								    out = []
 								    for it in d.get("items", []):
 								        ct = it["metadata"].get("creationTimestamp")
 								        try:
 								            age = now_ts - datetime.datetime.strptime(
 								                ct, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=datetime.timezone.utc).timestamp()
 								        except Exception:
 								            age = 0
 								        if age <= 93600:  # last 26h only
 								            out.append({"name": it["metadata"]["name"], "status": _job_status(it), "age_s": age})
 								    return sorted(out, key=lambda j: j["age_s"])
 								def get_blocker_reasons(target):
 								    try:
 								        with open(f"{SCRIPTS_DIR}/addon-compat.json") as f:
 								            matrix = f.read()
 								        r = subprocess.run(["python3", f"{SCRIPTS_DIR}/compat-gate.py", target],
 								                           input=matrix, capture_output=True, text=True, timeout=60)
 								        return r.stdout.strip()
 								    except Exception as e:
 								        return f"(could not run compat-gate: {e})"
 								def post_slack(text):
 								    if os.environ.get("DRY_RUN"):
 								        return  # main() always prints the message; DRY_RUN just skips the POST
 								    with open(SLACK_FILE) as f:
 								        url = f.read().strip()
 								    data = json.dumps({"text": text}).encode()
 								    req = urllib.request.Request(url, data=data, headers={"Content-Type": "application/json"})
 								    urllib.request.urlopen(req, timeout=20)
 								def main():
 								    import time
 								    now_ts = float(os.environ.get("NOW_TS", "")) if os.environ.get("NOW_TS") else time.time()
 								    try:
 								        metrics_txt = urllib.request.urlopen(PUSHGW, timeout=20).read().decode()
 								    except Exception:
 								        metrics_txt = ""
 								    metrics = parse_metrics(metrics_txt)
 								    nodes = get_nodes()
 								    jobs = get_jobs(now_ts)
 								    avail = [(lbl, val) for lbl, val in select(metrics, "k8s_upgrade_available") if val == 1]
 								    blocked = any(val == 1 for _, val in select(metrics, "k8s_upgrade_blocked"))
+								    held = any(val == 1 for _, val in select(metrics, "k8s_upgrade_held"))
 								    reasons = get_blocker_reasons(avail[0][0].get("target", "")) if (avail and (blocked or held)) else None
 								    msg = compose_report(now_ts, nodes, metrics, reasons, jobs)
 								    print(msg)  # log first so the report reaches pod logs even if the Slack POST raises
 								    post_slack(msg)
 								if __name__ == "__main__":
 								    main()
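
For reference, the grouping done by the reason renderer near the top of this section can be exercised on its own. The sketch below is a self-contained copy of that loop under one assumption: that `lines` is the non-empty, stripped lines of the compat-gate output. The name `group_reasons` and the sample blocker text are hypothetical.

```python
GROUPS = (("Action needed", "[ACTIONABLE]"),
          ("Waiting on upstream", "[WAITING]"),
          ("Pinned (held by us)", "[PINNED]"))


def group_reasons(text):
    """Group tagged blocker lines under per-class titles; untagged lines go last."""
    # Assumption: input is split into non-empty, stripped lines.
    lines = [l.strip() for l in text.splitlines() if l.strip()]
    out, shown = [], set()
    for title, tag in GROUPS:
        sub = [l for l in lines if l.startswith(tag)]
        if sub:
            out.append(f"{title}:")
            for l in sub:
                shown.add(l)
                out.append(f"  • {l[len(tag):].strip()}")
    rest = [l for l in lines if l not in shown]
    if rest:
        out.append("Blockers:")
        out.extend(f"  • {l}" for l in rest)
    return out


# Hypothetical compat-gate output, for illustration only.
sample = (
    "[WAITING] kyverno: no release supports 1.36\n"
    "[ACTIONABLE] calico: newer version supports 1.36\n"
    "untagged line\n"
)
grouped = group_reasons(sample)
```

Groups are emitted in the fixed order of `GROUPS`, not input order, so the actionable items lead the Slack message regardless of how the gate printed them.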