k8s-upgrade: nightly Slack report monitor + scope chain-failed alert to phases
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Adds a daily visibility layer so every night's autonomous-upgrade outcome is
reviewable at a glance during the upgrade-cleanup window (Viktor: "track every
night's upgrade for the next 7 days; clean up all bugs and blockers").

Last night (2026-06-20) confirmed BOTH prior fixes work in production: the
detector resolved target 1.35.6 (k8s_upgrade_available) and the compat gate
correctly REFUSED it (k8s_upgrade_blocked=1 -> K8sUpgradeBlocked) because ESO
v0.12 (<=1.31) and kyverno v1.16 (<=1.34) don't support 1.35.

What's here:
- CronJob k8s-upgrade-nightly-report (06:07 UTC) -> one Slack summary/morning:
  running version, detector freshness, detected target, outcome (no-op /
  blocked+live reasons / upgraded / in-progress / detector-stale), recent jobs.
  Read-only: reads Pushgateway gauges + live nodes/jobs, re-runs compat-gate.py
  for fresh blockers; reuses the chain SA + slack_webhook + scripts ConfigMap.
  Pure helpers unit-tested (test_nightly_report.py, 8 cases incl. a real
  v-prefix bug TDD caught). Verified end-to-end in-cluster (posted to Slack).
- K8sUpgradeChainJobFailed regex scoped from `k8s-upgrade-.*` to
  `k8s-upgrade-(preflight|master|worker|postflight)-.*` so the new report job
  (or any future helper) can't false-trip the chain-wedged alarm.

Manual state repair (no git artifact): imported the orphaned `alert-digest`
CronJob into the monitoring stack state (`tg import
module.monitoring.kubernetes_cron_job_v1.alert_digest monitoring/alert-digest`).
Root cause: when alert_digest was added (2026-06-12) the apply recorded its
ConfigMap + Secret but not the CronJob, so every full monitoring apply since has
failed with `cronjobs.batch "alert-digest" already exists` (Woodpecker pipeline
298 today) — surviving only via targeted prometheus applies. Now in state, so
monitoring CI applies cleanly again.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
parent 7270e2be3b
commit ead876ec65
6 changed files with 431 additions and 3 deletions
@@ -319,7 +319,7 @@ each Job's pod and its drain target are always different nodes.
 - `K8sVersionSkew` — kubelet/apiserver `gitVersion` count >1 for 30m. Catches a half-done rollout.
 - `EtcdPreUpgradeSnapshotMissing` — `k8s_upgrade_in_flight==1 && k8s_upgrade_snapshot_taken==0` for 10m. Catches preflight failing silently.
 - `K8sUpgradeStalled` — `k8s_upgrade_in_flight==1 && time()-k8s_upgrade_started_timestamp > 5400` for 5m. Catches a chain Job dying without spawning its successor.
-- `K8sUpgradeChainJobFailed` — `(kube_job_status_failed{namespace="k8s-upgrade",job_name=~"k8s-upgrade-.*",reason=~"BackoffLimitExceeded|DeadlineExceeded"} > 0) unless on() (k8s_upgrade_blocked == 1)` for 15m (warning). Catches a phase Job that terminally failed **before `in_flight` was set** (the preflight gates exit pre-metric) — invisible to the two `in_flight`-based alerts above; this was the blind spot behind the 5-day 1.34.9 preflight wedge. Reason-scoped so a retry-success doesn't false-positive (and so it doesn't needlessly block kured). The `unless k8s_upgrade_blocked == 1` clause (2026-06-21) excludes a deliberate compat-gate refusal (owned by `K8sUpgradeBlocked`) so a block doesn't double-fire as a wedge.
+- `K8sUpgradeChainJobFailed` — `(kube_job_status_failed{namespace="k8s-upgrade",job_name=~"k8s-upgrade-(preflight|master|worker|postflight)-.*",reason=~"BackoffLimitExceeded|DeadlineExceeded"} > 0) unless on() (k8s_upgrade_blocked == 1)` for 15m (warning). Catches a phase Job that terminally failed **before `in_flight` was set** (the preflight gates exit pre-metric) — invisible to the two `in_flight`-based alerts above; this was the blind spot behind the 5-day 1.34.9 preflight wedge. Reason-scoped so a retry-success doesn't false-positive (and so it doesn't needlessly block kured). The `unless k8s_upgrade_blocked == 1` clause (2026-06-21) excludes a deliberate compat-gate refusal (owned by `K8sUpgradeBlocked`) so a block doesn't double-fire as a wedge.
 - **Pushgateway metrics**:
 - `k8s_upgrade_in_flight` (set in preflight, cleared in postflight)
 - `k8s_upgrade_snapshot_taken` (set after etcd snapshot Job completes with ≥1 KiB)
@@ -171,10 +171,27 @@ Pushed by upgrade-step.sh during phase execution; observed by the
 - **`K8sVersionSkew`** — distinct kubelet/apiserver `gitVersion` count > 1 for 30m. Catches a half-done rollout.
 - **`EtcdPreUpgradeSnapshotMissing`** — `k8s_upgrade_in_flight==1 && k8s_upgrade_snapshot_taken==0` for 10m. Catches preflight Stage 2 failing silently.
 - **`K8sUpgradeStalled`** — `k8s_upgrade_in_flight==1 && time()-k8s_upgrade_started_timestamp > 5400` for 5m. Catches a Job in the chain dying without spawning its successor.
-- **`K8sUpgradeChainJobFailed`** — `(kube_job_status_failed{namespace="k8s-upgrade",job_name=~"k8s-upgrade-.*",reason=~"BackoffLimitExceeded|DeadlineExceeded"} > 0) unless on() (k8s_upgrade_blocked == 1)` for 15m (warning). Catches a phase Job that **terminally failed before `k8s_upgrade_in_flight` was set** — the preflight gates exit pre-metric, so the two `in_flight`-based alerts above are blind to a failed preflight (this is what hid the 5-day 1.34.9 wedge on 2026-06-12). Reason-scoped to terminal job conditions so a retry-success doesn't false-positive (a bare failed-pod-count would otherwise also block kured for the Job's 7d TTL). The `unless k8s_upgrade_blocked == 1` clause (added 2026-06-21) excludes a preflight that failed because the **compat gate deliberately refused** the target — that's owned by `K8sUpgradeBlocked` and was double-firing here; a genuine wedge exits without setting the blocked gauge, so it still fires.
+- **`K8sUpgradeChainJobFailed`** — `(kube_job_status_failed{namespace="k8s-upgrade",job_name=~"k8s-upgrade-(preflight|master|worker|postflight)-.*",reason=~"BackoffLimitExceeded|DeadlineExceeded"} > 0) unless on() (k8s_upgrade_blocked == 1)` for 15m (warning). Catches a phase Job that **terminally failed before `k8s_upgrade_in_flight` was set** — the preflight gates exit pre-metric, so the two `in_flight`-based alerts above are blind to a failed preflight (this is what hid the 5-day 1.34.9 wedge on 2026-06-12). Reason-scoped to terminal job conditions so a retry-success doesn't false-positive (a bare failed-pod-count would otherwise also block kured for the Job's 7d TTL). The `unless k8s_upgrade_blocked == 1` clause (added 2026-06-21) excludes a preflight that failed because the **compat gate deliberately refused** the target — that's owned by `K8sUpgradeBlocked` and was double-firing here; a genuine wedge exits without setting the blocked gauge, so it still fires.
 - **`K8sUpgradeBlocked`** — `k8s_upgrade_blocked == 1` (warning). A k8s **auto-upgrade was refused** by the compat gate because a critical addon, an in-use deprecated API, or a node's containerd is too old for the detected target. The **specific reasons are in Slack**; clear it by upgrading the named addon / migrating the API caller / bumping containerd, after which the next nightly run proceeds (see "Auto-upgrade compat gate"). No upgrade was attempted, so this is not a half-done-rollout alert.
 - The first four alerts ALSO block kured (same `--prometheus-url` halt-on-alert mechanism) so the OS-reboot pipeline can't run on top of a half-done version upgrade.
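The way the `unless on() (k8s_upgrade_blocked == 1)` clause suppresses `K8sUpgradeChainJobFailed` can be modelled in a few lines of Python. This is only an illustrative sketch, not how Prometheus evaluates the rule: `chain_job_failed_fires`, the sample tuple shape, and the `SomeOtherReason` value are hypothetical, and both the `job_name` matcher and the `for: 15m` hold are left out.

```python
TERMINAL_REASONS = {"BackoffLimitExceeded", "DeadlineExceeded"}


def chain_job_failed_fires(failed_series, blocked_gauge):
    """Model `(failed{reason=~...} > 0) unless on() (k8s_upgrade_blocked == 1)`.

    failed_series: list of (job_name, reason, value) samples (hypothetical shape).
    blocked_gauge: current value of k8s_upgrade_blocked, or None if the gauge is absent.
    """
    # Left-hand side: only terminal failure reasons with a positive value.
    left = [s for s in failed_series if s[1] in TERMINAL_REASONS and s[2] > 0]
    # `unless on()` matches on an empty label set, so a single right-hand
    # sample (blocked == 1) suppresses every left-hand sample.
    if blocked_gauge == 1:
        return []
    return left


failed = [
    ("k8s-upgrade-preflight-1-35-6", "BackoffLimitExceeded", 1.0),
    ("k8s-upgrade-master-1-35-6", "SomeOtherReason", 1.0),  # made-up reason value
]
print(chain_job_failed_fires(failed, 0))  # genuine wedge: the preflight sample remains
print(chain_job_failed_fires(failed, 1))  # deliberate compat-gate refusal: nothing fires
```

With the gauge at 0 or absent, the terminally failed preflight sample survives; with the gauge at 1, the whole left-hand side is dropped and `K8sUpgradeBlocked` owns the signal.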
 
+### Nightly upgrade report (Slack)
+
+CronJob `k8s-upgrade-nightly-report` (k8s-upgrade ns, `var.report_schedule`,
+default `7 6 * * *` = 06:07 UTC — after the 23:00 chain, before the 08:00 London
+alert-digest) posts ONE Slack summary each morning of the previous night's run:
+running version, detector freshness, detected target + kind, the outcome
+(⚪ no upgrade needed / 🔴 blocked + live blocker reasons / 🟢 upgraded /
+🟡 in progress / ⚠️ detector stale), and recent chain jobs. Read-only — it reads
+the Pushgateway gauges + live nodes/jobs and re-runs `compat-gate.py` for fresh
+blocker reasons; reuses the chain's SA + `slack_webhook` + scripts ConfigMap.
+Logic + unit tests: `scripts/nightly-report.py`, `scripts/test_nightly_report.py`.
+This is the day-to-day visibility layer (it does NOT replace the alerts above —
+those fire on problems; this reports the outcome every night). Manual run:
+`kubectl -n k8s-upgrade create job --from=cronjob/k8s-upgrade-nightly-report nightly-report-test`
+(name it WITHOUT a `k8s-upgrade-{phase}-` prefix so a failure can't trip
+`K8sUpgradeChainJobFailed`).
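The naming rule above follows from the alert's `job_name` matcher. A small sketch comparing the old and new patterns against a few job names; the names and the numeric suffix are illustrative, and Python's `re.fullmatch` stands in for Prometheus's fully anchored regex matching:

```python
import re

# New, phase-scoped pattern from the alert rule.
PHASE_RE = r"k8s-upgrade-(preflight|master|worker|postflight)-.*"
# Previous, broad pattern.
BROAD_RE = r"k8s-upgrade-.*"


def matches(pattern, job_name):
    # Prometheus label regex matchers are fully anchored, so fullmatch is the
    # closest Python equivalent for these simple patterns.
    return re.fullmatch(pattern, job_name) is not None


# Illustrative job names; the CronJob suffix below is made up.
names = [
    "k8s-upgrade-preflight-1-35-6",
    "k8s-upgrade-nightly-report-29170927",
    "nightly-report-test",
]
for name in names:
    print(f"{name}: broad={matches(BROAD_RE, name)} phase={matches(PHASE_RE, name)}")
```

The report CronJob's own Jobs match only the old broad pattern, so they no longer feed the alert. A manual Job named with a phase prefix (for example `k8s-upgrade-preflight-test`) would still match the new pattern, which is why the manual-run name avoids it.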
+
 ### CoreDNS is NOT upgraded by kubeadm here
 
 CoreDNS runs a **custom split-horizon Corefile** (owned by the technitium stack)
@@ -41,6 +41,15 @@ variable "enabled" {
   default = true
 }
 
+# Nightly upgrade-report CronJob schedule. 06:07 UTC (07:07 London) — safely
+# after the 23:00 chain has finished (worst case ~02:00) and before the 08:00
+# London alert-digest, so the morning Slack skim shows last night's upgrade
+# outcome + any live blocker. Posts once/day; read-only.
+variable "report_schedule" {
+  type = string
+  default = "7 6 * * *"
+}
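The timing claim in the comment above can be checked with a short sketch. It assumes the chain's "worst case ~02:00" is UTC and that London is on BST (UTC+1) on the sample date; `fire_time_utc` is a hypothetical helper that only understands daily `M H * * *` schedules.

```python
from datetime import datetime, timedelta, timezone


def fire_time_utc(schedule, day):
    """Return the UTC fire time on `day` for a simple 'M H * * *' cron string."""
    minute, hour, dom, month, dow = schedule.split()
    if (dom, month, dow) != ("*", "*", "*"):
        raise ValueError("sketch only handles daily 'M H * * *' schedules")
    return datetime(day.year, day.month, day.day, int(hour), int(minute),
                    tzinfo=timezone.utc)


day = datetime(2026, 6, 21, tzinfo=timezone.utc)
fire = fire_time_utc("7 6 * * *", day)

chain_worst_case = day.replace(hour=2)             # ~02:00, assumed UTC
bst = timezone(timedelta(hours=1))                 # assumption: London on BST
digest = datetime(2026, 6, 21, 8, 0, tzinfo=bst)   # 08:00 London alert-digest

print(fire.isoformat(), "=", fire.astimezone(bst).strftime("%H:%M"), "London")
print("after chain, before digest:", chain_worst_case < fire < digest)
```

Under GMT (UTC+0) the digest moves to 08:00 UTC, so the same ordering still holds in winter.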
+
 # Mirrors `local.image_tag` in stacks/claude-agent-service/main.tf — bump
 # in lockstep with claude-agent-service rebuilds. The image ships kubectl,
 # ssh-client, curl, jq, envsubst — everything the upgrade Jobs need.
@@ -301,6 +310,7 @@ resource "kubernetes_config_map" "k8s_upgrade_scripts" {
     "update_k8s.sh" = file("${path.module}/../../scripts/update_k8s.sh")
     "compat-gate.py" = file("${path.module}/scripts/compat-gate.py")
     "addon-compat.json" = file("${path.module}/scripts/addon-compat.json")
+    "nightly-report.py" = file("${path.module}/scripts/nightly-report.py")
   }
 }
 
@@ -548,6 +558,98 @@ resource "kubernetes_cron_job_v1" "k8s_version_check" {
   }
 }
 
+# --- Nightly upgrade report ---
+#
+# Each morning, after the 23:00 chain has finished, posts ONE concise Slack
+# report of last night's upgrade outcome (no-op / blocked+reasons / upgraded /
+# in-progress) so the autonomous upgrader's nightly result — and any live
+# blocker — is visible at a glance. Read-only: reads the chain's Pushgateway
+# gauges + live nodes/jobs and re-runs compat-gate.py for fresh blocker reasons.
+# Reuses the same SA, creds secret (slack_webhook), and scripts ConfigMap as the
+# chain. Logic + unit tests: scripts/nightly-report.py, scripts/test_nightly_report.py.
+resource "kubernetes_cron_job_v1" "k8s_upgrade_nightly_report" {
+  metadata {
+    name = "k8s-upgrade-nightly-report"
+    namespace = kubernetes_namespace.k8s_upgrade.metadata[0].name
+    labels = local.labels
+  }
+  spec {
+    schedule = var.report_schedule
+    concurrency_policy = "Forbid"
+    successful_jobs_history_limit = 3
+    failed_jobs_history_limit = 3
+    starting_deadline_seconds = 600
+    suspend = !var.enabled
+    job_template {
+      metadata {
+        labels = local.labels
+      }
+      spec {
+        backoff_limit = 1
+        ttl_seconds_after_finished = 86400
+        template {
+          metadata {
+            labels = local.labels
+          }
+          spec {
+            service_account_name = kubernetes_service_account.k8s_upgrade_job.metadata[0].name
+            restart_policy = "Never"
+            image_pull_secrets {
+              name = "registry-credentials"
+            }
+            volume {
+              name = "creds"
+              secret {
+                secret_name = "k8s-upgrade-creds"
+                default_mode = "0444"
+              }
+            }
+            volume {
+              name = "scripts"
+              config_map {
+                name = kubernetes_config_map.k8s_upgrade_scripts.metadata[0].name
+                default_mode = "0755"
+              }
+            }
+            container {
+              name = "report"
+              image = local.image
+              command = ["python3", "/scripts/nightly-report.py"]
+              env {
+                name = "HOME"
+                value = "/tmp"
+              }
+              volume_mount {
+                name = "creds"
+                mount_path = "/secrets/k8s-upgrade"
+                read_only = true
+              }
+              volume_mount {
+                name = "scripts"
+                mount_path = "/scripts"
+                read_only = true
+              }
+              resources {
+                requests = {
+                  cpu = "50m"
+                  memory = "128Mi"
+                }
+                limits = {
+                  memory = "256Mi"
+                }
+              }
+            }
+          }
+        }
+      }
+    }
+  }
+  lifecycle {
+    # KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
+    ignore_changes = [spec[0].job_template[0].spec[0].template[0].spec[0].dns_config]
+  }
+}
+
 # CI retrigger 2026-05-16T13:42:57+00:00 — bulk enrollment apply (pipeline #689 killed)
 # CI retrigger v2 2026-05-16T13:46:35+00:00
 
stacks/k8s-version-upgrade/scripts/nightly-report.py (Normal file, 224 changed lines)
@@ -0,0 +1,224 @@
+#!/usr/bin/env python3
+"""Nightly k8s-upgrade report -> Slack.
+
+Runs each morning (CronJob k8s-upgrade-nightly-report) after the 23:00 UTC
+version-check chain has finished. Reads the chain's Pushgateway gauges + live
+cluster state and posts ONE concise, actionable report to Slack so the
+autonomous upgrader's nightly outcome — and any blocker holding it back — is
+visible at a glance during the upgrade-cleanup window.
+
+Outcomes it distinguishes:
+  ⚪ no upgrade needed — cluster already at the latest supported patch
+  🔴 BLOCKED — compat gate refused the target; lists live reasons
+  🟢 UPGRADED — all nodes now on the detected target
+  🟡 in progress / passed — gate passed, chain mid-flight (or partial)
+  ⚠️ detector STALE — the 23:00 detector did not run last night
+
+Read-only. The pure helpers (parse_metrics / select / fmt_age / compose_report)
+are unit-tested in test_nightly_report.py; all I/O (kubectl, Pushgateway, the
+compat-gate subprocess, Slack) lives in thin wrappers below them.
+"""
+import json
+import os
+import re
+import subprocess
+import sys
+import urllib.request
+
+PUSHGW = os.environ.get("PUSHGW", "http://prometheus-prometheus-pushgateway.monitoring:9091/metrics")
+SLACK_FILE = os.environ.get("SLACK_FILE", "/secrets/k8s-upgrade/slack_webhook")
+SCRIPTS_DIR = os.environ.get("SCRIPTS_DIR", "/scripts")
+STALE_SECONDS = 90000  # ~25h: a nightly detector older than this didn't run last night
+
+_METRIC_RE = re.compile(r"^(?P<name>\w+)(?:\{(?P<labels>[^}]*)\})?\s+(?P<value>[-+0-9.eE]+)\s*$")
+_LABEL_RE = re.compile(r'(\w+)="([^"]*)"')
+
+
+# ---------------------------------------------------------------- pure helpers
+def parse_metrics(text):
+    """Prometheus text exposition -> list of (name, {labels}, float value)."""
+    out = []
+    for line in (text or "").splitlines():
+        line = line.strip()
+        if not line or line.startswith("#"):
+            continue
+        m = _METRIC_RE.match(line)
+        if not m:
+            continue
+        labels = dict(_LABEL_RE.findall(m.group("labels") or ""))
+        try:
+            val = float(m.group("value"))
+        except ValueError:
+            continue
+        out.append((m.group("name"), labels, val))
+    return out
+
+
+def select(metrics, name):
+    """All (labels, value) tuples for a given metric name."""
+    return [(lbl, val) for (n, lbl, val) in metrics if n == name]
+
+
+def fmt_age(seconds):
+    if seconds < 0:
+        return "in the future?!"
+    if seconds < 3600:
+        return f"{int(seconds // 60)}m ago"
+    if seconds < 86400:
+        return f"{seconds / 3600:.1f}h ago"
+    return f"{seconds / 86400:.1f}d ago"
+
+
+def compose_report(now_ts, nodes, metrics, blocker_reasons, jobs):
+    """Build the Slack message text from gathered facts. PURE.
+
+    nodes: list of (name, kubeletVersion). metrics: parse_metrics() output.
+    blocker_reasons: multi-line str (compat-gate output) or None.
+    jobs: list of {name, status, age_s}.
+    """
+    # kubelet reports "v1.35.6" but the gauges carry "1.35.6" — normalise so the
+    # UPGRADED comparison against the target actually matches.
+    versions = sorted({v.lstrip("v") for _, v in nodes})
+    if len(versions) == 1:
+        node_line = f"Running: *{versions[0]}* (all {len(nodes)} nodes uniform)"
+    elif versions:
+        node_line = f"Running: *MIXED* {', '.join(versions)} across {len(nodes)} nodes"
+    else:
+        node_line = "Running: *unknown* (could not read nodes)"
+
+    lr = select(metrics, "k8s_version_check_last_run_timestamp")
+    stale = False
+    if lr:
+        age = now_ts - lr[0][1]
+        stale = age > STALE_SECONDS
+        run_line = f"Last detector run: {fmt_age(age)} ({'STALE ⚠️' if stale else 'fresh ✓'})"
+    else:
+        run_line = "Last detector run: *unknown* (no metric)"
+        stale = True
+
+    avail = [(lbl, val) for lbl, val in select(metrics, "k8s_upgrade_available") if val == 1]
+    blocked = any(val == 1 for _, val in select(metrics, "k8s_upgrade_blocked"))
+
+    if avail:
+        lbl = avail[0][0]
+        target = lbl.get("target", "?")
+        kind = lbl.get("kind", "?")
+        tgt_line = f"Detected target: *{target}* ({kind})"
+        if blocked:
+            headline = f"🔴 BLOCKED — compat gate refused {target}"
+        elif len(versions) == 1 and target == versions[0]:
+            headline = f"🟢 UPGRADED — all nodes now on {target}"
+        else:
+            headline = f"🟡 IN PROGRESS / gate passed for {target}"
+    else:
+        target = None
+        tgt_line = "Detected target: none"
+        headline = "⚪ No upgrade needed (cluster at latest supported patch)"
+
+    if stale:
+        headline = "⚠️ Detector did not run last night — " + headline
+
+    msg = [f"*[k8s-upgrade nightly]* {headline}", node_line, run_line, tgt_line]
+
+    if blocked and blocker_reasons:
+        msg.append("Blockers (live):")
+        for r in blocker_reasons.splitlines():
+            r = r.strip()
+            if r:
+                msg.append(f" • {r}")
+
+    if jobs:
+        msg.append("Chain jobs (recent):")
+        for j in jobs:
+            msg.append(f" • {j['name']}: {j['status']} ({fmt_age(j['age_s'])})")
+
+    return "\n".join(msg)
+
+
+# ----------------------------------------------------------------------- I/O
+def _kubectl_json(args):
+    try:
+        r = subprocess.run(["kubectl", *args], capture_output=True, text=True, timeout=30)
+        return json.loads(r.stdout) if r.stdout.strip() else {}
+    except Exception:
+        return {}
+
+
+def get_nodes():
+    d = _kubectl_json(["get", "nodes", "-o", "json"])
+    return [(it["metadata"]["name"],
+             it.get("status", {}).get("nodeInfo", {}).get("kubeletVersion", "?"))
+            for it in d.get("items", [])]
+
+
+def _job_status(it):
+    st = it.get("status", {})
+    for c in st.get("conditions", []):
+        if c.get("type") == "Failed" and c.get("status") == "True":
+            return "Failed"
+        if c.get("type") == "Complete" and c.get("status") == "True":
+            return "Complete"
+    if st.get("active"):
+        return "Active"
+    return "Pending"
+
+
+def get_jobs(now_ts):
+    import datetime
+    d = _kubectl_json(["-n", "k8s-upgrade", "get", "jobs", "-o", "json"])
+    out = []
+    for it in d.get("items", []):
+        ct = it["metadata"].get("creationTimestamp")
+        try:
+            age = now_ts - datetime.datetime.strptime(
+                ct, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=datetime.timezone.utc).timestamp()
+        except Exception:
+            age = 0
+        if age <= 93600:  # last 26h only
+            out.append({"name": it["metadata"]["name"], "status": _job_status(it), "age_s": age})
+    return sorted(out, key=lambda j: j["age_s"])
+
+
+def get_blocker_reasons(target):
+    try:
+        with open(f"{SCRIPTS_DIR}/addon-compat.json") as f:
+            matrix = f.read()
+        r = subprocess.run(["python3", f"{SCRIPTS_DIR}/compat-gate.py", target],
+                           input=matrix, capture_output=True, text=True, timeout=60)
+        return r.stdout.strip()
+    except Exception as e:
+        return f"(could not run compat-gate: {e})"
+
+
+def post_slack(text):
+    if os.environ.get("DRY_RUN"):
+        return  # main() always prints the message; DRY_RUN just skips the POST
+    with open(SLACK_FILE) as f:
+        url = f.read().strip()
+    data = json.dumps({"text": text}).encode()
+    req = urllib.request.Request(url, data=data, headers={"Content-Type": "application/json"})
+    urllib.request.urlopen(req, timeout=20)
+
+
+def main():
+    import time
+    now_ts = float(os.environ.get("NOW_TS", "")) if os.environ.get("NOW_TS") else time.time()
+    try:
+        metrics_txt = urllib.request.urlopen(PUSHGW, timeout=20).read().decode()
+    except Exception:
+        metrics_txt = ""
+    metrics = parse_metrics(metrics_txt)
+    nodes = get_nodes()
+    jobs = get_jobs(now_ts)
+
+    avail = [(lbl, val) for lbl, val in select(metrics, "k8s_upgrade_available") if val == 1]
+    blocked = any(val == 1 for _, val in select(metrics, "k8s_upgrade_blocked"))
+    reasons = get_blocker_reasons(avail[0][0].get("target", "")) if (avail and blocked) else None
+
+    msg = compose_report(now_ts, nodes, metrics, reasons, jobs)
+    post_slack(msg)
+    print(msg)
+
+
+if __name__ == "__main__":
+    main()
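The order in which `compose_report()` picks its headline can be condensed into a small standalone function. This is a restatement for review purposes, not code from the commit: `headline` and its plain-text return values are hypothetical stand-ins for the emoji headlines above.

```python
def headline(target, blocked, node_versions, stale):
    """Condensed restatement of the outcome precedence in compose_report().

    target: detected target version, or None when no upgrade is available.
    node_versions: kubelet versions as reported (may carry a leading "v").
    """
    versions = sorted({v.lstrip("v") for v in node_versions})
    if target is None:
        text = "no upgrade needed"
    elif blocked:
        # A block wins even if the nodes already match the target.
        text = "blocked"
    elif versions == [target]:
        text = "upgraded"
    else:
        text = "in progress"
    # Staleness is a prefix on whatever outcome was chosen, not a separate outcome.
    return f"stale: {text}" if stale else text


print(headline(None, False, ["v1.34.9"], False))
print(headline("1.35.6", True, ["v1.34.9"], False))
print(headline("1.35.6", False, ["v1.35.6", "v1.35.6"], False))
print(headline("1.35.6", False, ["v1.34.9", "v1.35.6"], False))
print(headline("1.35.6", True, ["v1.34.9"], True))
```

One consequence worth noting when reading the Slack output: a stale detector never hides the underlying outcome, it only prefixes it, so a stale report can still say "blocked" based on the last gauges pushed.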
stacks/k8s-version-upgrade/scripts/test_nightly_report.py (Normal file, 81 changed lines)
@@ -0,0 +1,81 @@
+"""Unit tests for nightly-report.py (pure helpers only).
+
+Run: pytest stacks/k8s-version-upgrade/scripts/test_nightly_report.py
+Loaded via importlib because the filename has a hyphen.
+"""
+import importlib.util
+import pathlib
+
+HERE = pathlib.Path(__file__).parent
+_spec = importlib.util.spec_from_file_location("nightly_report", HERE / "nightly-report.py")
+nr = importlib.util.module_from_spec(_spec)
+_spec.loader.exec_module(nr)
+
+LAST_RUN = 1781996424.0  # 2026-06-20T23:00:24Z — matches last night's gauge
+METRICS_BLOCKED = f"""# TYPE k8s_upgrade_available gauge
+k8s_upgrade_available{{instance="",job="k8s-version-check",kind="minor",running="1.34.9",target="1.35.6"}} 1
+k8s_upgrade_blocked{{instance="",job="k8s-version-upgrade"}} 1
+k8s_version_check_last_run_timestamp{{instance="",job="k8s-version-check"}} {LAST_RUN}
+"""
+NODES_UNIFORM = [(f"k8s-node{i}", "v1.34.9") for i in range(7)]
+
+
+def test_parse_metrics_basic():
+    m = nr.parse_metrics(METRICS_BLOCKED)
+    names = {n for n, _, _ in m}
+    assert names == {"k8s_upgrade_available", "k8s_upgrade_blocked", "k8s_version_check_last_run_timestamp"}
+    avail = nr.select(m, "k8s_upgrade_available")
+    assert avail[0][0]["target"] == "1.35.6"
+    assert avail[0][0]["kind"] == "minor"
+    assert avail[0][1] == 1.0
+
+
+def test_parse_metrics_ignores_comments_and_junk():
+    assert nr.parse_metrics("# HELP foo\n\ngarbage line\n") == []
+
+
+def test_fmt_age():
+    assert nr.fmt_age(120) == "2m ago"
+    assert nr.fmt_age(7200) == "2.0h ago"
+    assert nr.fmt_age(172800) == "2.0d ago"
+
+
+def test_compose_blocked_lists_reasons():
+    m = nr.parse_metrics(METRICS_BLOCKED)
+    reasons = ("addon external-secrets v0.12 supports k8s <= 1.31; target 1.35 exceeds it\n"
+               "addon kyverno v1.16 supports k8s <= 1.34; target 1.35 exceeds it")
+    out = nr.compose_report(LAST_RUN + 30000, NODES_UNIFORM, m, reasons, [])
+    assert "🔴 BLOCKED" in out and "1.35.6" in out
+    assert "external-secrets" in out and "kyverno" in out
+    assert "all 7 nodes uniform" in out
+    assert "fresh ✓" in out
+
+
+def test_compose_noop_when_no_target():
+    m = nr.parse_metrics(f'k8s_version_check_last_run_timestamp{{}} {LAST_RUN}\n')
+    out = nr.compose_report(LAST_RUN + 30000, NODES_UNIFORM, m, None, [])
+    assert "⚪ No upgrade needed" in out
+
+
+def test_compose_upgraded_when_nodes_match_target():
+    m = nr.parse_metrics(f"""k8s_upgrade_available{{kind="minor",target="1.35.6"}} 1
+k8s_upgrade_blocked{{}} 0
+k8s_version_check_last_run_timestamp{{}} {LAST_RUN}
+""")
+    nodes = [(f"k8s-node{i}", "v1.35.6") for i in range(7)]
+    out = nr.compose_report(LAST_RUN + 30000, nodes, m, None, [])
+    assert "🟢 UPGRADED" in out and "1.35.6" in out
+
+
+def test_compose_stale_detector_flagged():
+    m = nr.parse_metrics(METRICS_BLOCKED)
+    out = nr.compose_report(LAST_RUN + 200000, NODES_UNIFORM, m, "x", [])  # ~55h later
+    assert "Detector did not run last night" in out
+    assert "STALE" in out
+
+
+def test_compose_includes_recent_jobs():
+    m = nr.parse_metrics(METRICS_BLOCKED)
+    jobs = [{"name": "k8s-upgrade-preflight-1-35-6", "status": "Failed", "age_s": 3600}]
+    out = nr.compose_report(LAST_RUN + 30000, NODES_UNIFORM, m, "x", jobs)
+    assert "k8s-upgrade-preflight-1-35-6: Failed" in out
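The v-prefix mismatch that `test_compose_upgraded_when_nodes_match_target` guards against can be reproduced in isolation. A minimal sketch, assuming kubelet reports `v1.35.6` while the gauge label carries `1.35.6` as the comment in `compose_report()` states; `normalise` is a hypothetical helper that strips one leading `v` (the script itself uses `lstrip("v")`, which behaves the same on these inputs).

```python
def normalise(version):
    """Drop a single leading 'v' so 'v1.35.6' compares equal to '1.35.6'."""
    return version[1:] if version.startswith("v") else version


kubelet_versions = ["v1.35.6", "v1.35.6"]  # shape reported by kubelet
target = "1.35.6"                           # shape carried by the gauge label

# Without normalisation the sets differ, so UPGRADED would never be reported.
raw_match = set(kubelet_versions) == {target}
normalised_match = {normalise(v) for v in kubelet_versions} == {target}
print(raw_match, normalised_match)
```

Without the normalisation step a fully upgraded cluster would fall through to the "in progress" branch indefinitely.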
@@ -2244,6 +2244,10 @@ serverFiles:
 # idempotency guard the next detection cycle deletes + re-spawns the
 # Failed Job (clearing this within ~24h); a sustained firing means it
 # re-failed — investigate the root cause.
+# Scoped to the four chain PHASE jobs (preflight|master|worker|
+# postflight) — NOT a bare `k8s-upgrade-.*`, which would also match
+# helper jobs in the namespace like k8s-upgrade-nightly-report-* and
+# false-fire if one of those failed.
 # `unless on() k8s_upgrade_blocked == 1` excludes the case where the
 # preflight terminally failed because the compat gate deliberately
 # REFUSED the target: block() exits 1 (so the Failed Job re-spawns
@@ -2255,7 +2259,7 @@ serverFiles:
 # block until the next run's preflight resets it to 0, so the exclusion
 # holds for the whole blocked period.
 - alert: K8sUpgradeChainJobFailed
-  expr: (kube_job_status_failed{namespace="k8s-upgrade", job_name=~"k8s-upgrade-.*", reason=~"BackoffLimitExceeded|DeadlineExceeded"} > 0) unless on() (k8s_upgrade_blocked == 1)
+  expr: (kube_job_status_failed{namespace="k8s-upgrade", job_name=~"k8s-upgrade-(preflight|master|worker|postflight)-.*", reason=~"BackoffLimitExceeded|DeadlineExceeded"} > 0) unless on() (k8s_upgrade_blocked == 1)
   for: 15m
   labels:
     severity: warning