infra/stacks/k8s-version-upgrade/scripts/test_nightly_report.py
Viktor Barzin ead876ec65
k8s-upgrade: nightly Slack report monitor + scope chain-failed alert to phases
Adds a daily visibility layer so every night's autonomous-upgrade outcome is
reviewable at a glance during the upgrade-cleanup window (Viktor: "track every
night's upgrade for the next 7 days; clean up all bugs and blockers").

Last night (2026-06-20) confirmed BOTH prior fixes work in production: the
detector resolved target 1.35.6 (k8s_upgrade_available) and the compat gate
correctly REFUSED it (k8s_upgrade_blocked=1 -> K8sUpgradeBlocked) because ESO
v0.12 (<=1.31) and kyverno v1.16 (<=1.34) don't support 1.35.
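
A minimal sketch of the kind of check that produces those two refusals, not the actual compat-gate.py. The `ADDON_MAX_K8S` table, `parse_minor` and `blockers` are hypothetical names for illustration; only the addon versions, their k8s ceilings, and the reason wording come from this commit and its tests.

```python
def parse_minor(version: str) -> tuple[int, int]:
    """Return (major, minor) from strings such as 'v1.35.6', '1.35.6' or '1.35'."""
    parts = version.lstrip("v").split(".")
    return int(parts[0]), int(parts[1])


# Hypothetical table: addon -> (installed version, highest supported k8s minor).
ADDON_MAX_K8S = {
    "external-secrets": ("v0.12", "1.31"),
    "kyverno": ("v1.16", "1.34"),
}


def blockers(target: str, table=ADDON_MAX_K8S) -> list[str]:
    """List one human-readable reason per addon whose ceiling is below the target."""
    tgt = parse_minor(target)
    reasons = []
    for addon, (installed, max_k8s) in sorted(table.items()):
        # Tuple comparison orders by major first, then minor.
        if tgt > parse_minor(max_k8s):
            reasons.append(
                f"addon {addon} {installed} supports k8s <= {max_k8s}; "
                f"target {tgt[0]}.{tgt[1]} exceeds it"
            )
    return reasons


if __name__ == "__main__":
    for line in blockers("1.35.6"):
        print(line)
```

Comparing on (major, minor) tuples rather than raw strings avoids both lexical ordering mistakes and trouble with a leading `v`.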

What's here:
- CronJob k8s-upgrade-nightly-report (06:07 UTC) -> one Slack summary/morning:
  running version, detector freshness, detected target, outcome (no-op /
  blocked+live reasons / upgraded / in-progress / detector-stale), recent jobs.
  Read-only: reads Pushgateway gauges + live nodes/jobs, re-runs compat-gate.py
  for fresh blockers; reuses the chain SA + slack_webhook + scripts ConfigMap.
  Pure helpers unit-tested (test_nightly_report.py, 8 cases incl. a real
  v-prefix bug TDD caught). Verified end-to-end in-cluster (posted to Slack).
- K8sUpgradeChainJobFailed regex scoped from `k8s-upgrade-.*` to
  `k8s-upgrade-(preflight|master|worker|postflight)-.*` so the new report job
  (or any future helper) can't false-trip the chain-wedged alarm.
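
The effect of the narrower regex can be checked with a quick sketch. Prometheus regex matchers are fully anchored, so `re.fullmatch` is used here to mimic that; the numeric suffix on the report job name is a made-up example of what a CronJob-spawned Job might be called.

```python
import re

OLD = re.compile(r"k8s-upgrade-.*")
NEW = re.compile(r"k8s-upgrade-(preflight|master|worker|postflight)-.*")


def matches(pattern: re.Pattern, job_name: str) -> bool:
    """Anchored match, the way a Prometheus =~ label matcher behaves."""
    return pattern.fullmatch(job_name) is not None


if __name__ == "__main__":
    names = [
        "k8s-upgrade-preflight-1-35-6",          # chain phase job (name from the tests)
        "k8s-upgrade-nightly-report-29700367",   # hypothetical report job name
    ]
    for name in names:
        print(name, "old:", matches(OLD, name), "new:", matches(NEW, name))
```

With the old pattern a failed report job would be indistinguishable from a failed chain phase; with the new one only the four phase prefixes qualify.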

Manual state repair (no git artifact): imported the orphaned `alert-digest`
CronJob into the monitoring stack state
(`tg import module.monitoring.kubernetes_cron_job_v1.alert_digest monitoring/alert-digest`).
Root cause: when alert_digest was added (2026-06-12) the apply recorded its
ConfigMap + Secret but not the CronJob, so every full monitoring apply since has
failed with `cronjobs.batch "alert-digest" already exists` (Woodpecker pipeline
298 today) — surviving only via targeted prometheus applies. Now in state, so
monitoring CI applies cleanly again.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-21 16:57:44 +00:00


"""Unit tests for nightly-report.py (pure helpers only).
Run: pytest stacks/k8s-version-upgrade/scripts/test_nightly_report.py
Loaded via importlib because the filename has a hyphen.
"""
import importlib.util
import pathlib

HERE = pathlib.Path(__file__).parent
_spec = importlib.util.spec_from_file_location("nightly_report", HERE / "nightly-report.py")
nr = importlib.util.module_from_spec(_spec)
_spec.loader.exec_module(nr)

LAST_RUN = 1781996424.0 # 2026-06-20T23:00:24Z — matches last night's gauge
METRICS_BLOCKED = f"""# TYPE k8s_upgrade_available gauge
k8s_upgrade_available{{instance="",job="k8s-version-check",kind="minor",running="1.34.9",target="1.35.6"}} 1
k8s_upgrade_blocked{{instance="",job="k8s-version-upgrade"}} 1
k8s_version_check_last_run_timestamp{{instance="",job="k8s-version-check"}} {LAST_RUN}
"""
NODES_UNIFORM = [(f"k8s-node{i}", "v1.34.9") for i in range(7)]


def test_parse_metrics_basic():
    m = nr.parse_metrics(METRICS_BLOCKED)
    names = {n for n, _, _ in m}
    assert names == {"k8s_upgrade_available", "k8s_upgrade_blocked", "k8s_version_check_last_run_timestamp"}
    avail = nr.select(m, "k8s_upgrade_available")
    assert avail[0][0]["target"] == "1.35.6"
    assert avail[0][0]["kind"] == "minor"
    assert avail[0][1] == 1.0


def test_parse_metrics_ignores_comments_and_junk():
    assert nr.parse_metrics("# HELP foo\n\ngarbage line\n") == []


def test_fmt_age():
    assert nr.fmt_age(120) == "2m ago"
    assert nr.fmt_age(7200) == "2.0h ago"
    assert nr.fmt_age(172800) == "2.0d ago"


def test_compose_blocked_lists_reasons():
    m = nr.parse_metrics(METRICS_BLOCKED)
    reasons = ("addon external-secrets v0.12 supports k8s <= 1.31; target 1.35 exceeds it\n"
               "addon kyverno v1.16 supports k8s <= 1.34; target 1.35 exceeds it")
    out = nr.compose_report(LAST_RUN + 30000, NODES_UNIFORM, m, reasons, [])
    assert "🔴 BLOCKED" in out and "1.35.6" in out
    assert "external-secrets" in out and "kyverno" in out
    assert "all 7 nodes uniform" in out
    assert "fresh ✓" in out


def test_compose_noop_when_no_target():
    m = nr.parse_metrics(f'k8s_version_check_last_run_timestamp{{}} {LAST_RUN}\n')
    out = nr.compose_report(LAST_RUN + 30000, NODES_UNIFORM, m, None, [])
    assert "⚪ No upgrade needed" in out


def test_compose_upgraded_when_nodes_match_target():
    m = nr.parse_metrics(f"""k8s_upgrade_available{{kind="minor",target="1.35.6"}} 1
k8s_upgrade_blocked{{}} 0
k8s_version_check_last_run_timestamp{{}} {LAST_RUN}
""")
    nodes = [(f"k8s-node{i}", "v1.35.6") for i in range(7)]
    out = nr.compose_report(LAST_RUN + 30000, nodes, m, None, [])
    assert "🟢 UPGRADED" in out and "1.35.6" in out


def test_compose_stale_detector_flagged():
    m = nr.parse_metrics(METRICS_BLOCKED)
    out = nr.compose_report(LAST_RUN + 200000, NODES_UNIFORM, m, "x", [])  # ~55h later
    assert "Detector did not run last night" in out
    assert "STALE" in out


def test_compose_includes_recent_jobs():
    m = nr.parse_metrics(METRICS_BLOCKED)
    jobs = [{"name": "k8s-upgrade-preflight-1-35-6", "status": "Failed", "age_s": 3600}]
    out = nr.compose_report(LAST_RUN + 30000, NODES_UNIFORM, m, "x", jobs)
    assert "k8s-upgrade-preflight-1-35-6: Failed" in out