k8s-upgrade: move version-check cadence from daily to weekly (Sun check, Mon report)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Viktor asked to move the upgrade checks to weekly. With the actionable-vs-held gate now quieting the routine 'held' churn (e.g. 1.36), a daily check + attempt buys little; weekly is enough. Accepted trade-off: k8s patch (incl. security) uptake now lags up to 7 days instead of <=1.

- var.schedule: 0 23 * * * -> 0 23 * * 0 (detector: weekly Sunday 23:00 UTC)
- var.report_schedule: 7 6 * * * -> 7 6 * * 1 (report: Monday 06:07 UTC, ~7h after the Sunday check, so nightly-report.py's ~25h staleness threshold stays valid AND still flags a missed weekly run; no STALE_SECONDS change needed)

The report CronJob keeps its historical name k8s-upgrade-nightly-report (rename = churn). Cadence wording updated across main.tf comments, nightly-report.py docstring, and the runbook.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
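The timing argument in the commit message (report ~7h after the check, 25h threshold still valid, missed weekly run still flagged) can be checked with a few lines of arithmetic. This is a minimal sketch, not code from the repository: the example dates are arbitrary, and the comparison `age > STALE_SECONDS` is an assumption about how nightly-report.py applies the threshold.

```python
from datetime import datetime, timedelta, timezone

STALE_SECONDS = 90000  # ~25h, the value the commit leaves unchanged

# Arbitrary example week (assumption): Sunday 2024-01-07 and Monday 2024-01-08, UTC.
check = datetime(2024, 1, 7, 23, 0, tzinfo=timezone.utc)  # cron "0 23 * * 0"
report = datetime(2024, 1, 8, 6, 7, tzinfo=timezone.utc)  # cron "7 6 * * 1"

# Python's weekday(): Monday == 0, Sunday == 6.
assert check.weekday() == 6 and report.weekday() == 0

# Normal week: the detector ran on Sunday night, so its result is ~7h old.
fresh_age = (report - check).total_seconds()

# Missed week: the newest detector result is from the previous Sunday.
missed_age = (report - (check - timedelta(days=7))).total_seconds()

# Assumed staleness rule: a result older than STALE_SECONDS counts as stale.
fresh_is_stale = fresh_age > STALE_SECONDS
missed_is_stale = missed_age > STALE_SECONDS

print(fresh_age, fresh_is_stale)
print(missed_age, missed_is_stale)
```

Under these assumptions the normal-week age is 25620 s (7h07m), well under the threshold, while a missed run leaves a result more than 7 days old, which trips it.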
parent e43e64c666
commit 3398873a16
3 changed files with 44 additions and 39 deletions
@@ -1,18 +1,18 @@
 #!/usr/bin/env python3
-"""Nightly k8s-upgrade report -> Slack.
+"""Weekly k8s-upgrade report -> Slack.

-Runs each morning (CronJob k8s-upgrade-nightly-report) after the 23:00 UTC
-version-check chain has finished. Reads the chain's Pushgateway gauges + live
-cluster state and posts ONE concise, actionable report to Slack so the
-autonomous upgrader's nightly outcome — and any blocker holding it back — is
-visible at a glance during the upgrade-cleanup window.
+Runs Monday morning (CronJob k8s-upgrade-nightly-report — historical name kept)
+after the Sunday-night version-check chain has finished. Reads the chain's
+Pushgateway gauges + live cluster state and posts ONE concise, actionable report
+to Slack so the autonomous upgrader's weekly outcome — and any blocker holding it
+back — is visible at a glance during the upgrade-cleanup window.

 Outcomes it distinguishes:
 ⚪ no upgrade needed — cluster already at the latest supported patch
 🔴 BLOCKED — compat gate refused the target; lists live reasons
 🟢 UPGRADED — all nodes now on the detected target
 🟡 in progress / passed — gate passed, chain mid-flight (or partial)
-⚠️ detector STALE — the 23:00 detector did not run last night
+⚠️ detector STALE — the weekly (Sun 23:00) detector did not run

 Read-only. The pure helpers (parse_metrics / select / fmt_age / compose_report)
 are unit-tested in test_nightly_report.py; all I/O (kubectl, Pushgateway, the
@@ -28,7 +28,7 @@ import urllib.request
 PUSHGW = os.environ.get("PUSHGW", "http://prometheus-prometheus-pushgateway.monitoring:9091/metrics")
 SLACK_FILE = os.environ.get("SLACK_FILE", "/secrets/k8s-upgrade/slack_webhook")
 SCRIPTS_DIR = os.environ.get("SCRIPTS_DIR", "/scripts")
-STALE_SECONDS = 90000 # ~25h: a nightly detector older than this didn't run last night
+STALE_SECONDS = 90000 # ~25h: report runs Mon ~7h after the Sun-night check, so >25h ⇒ the weekly check didn't fire

 _METRIC_RE = re.compile(r"^(?P<name>\w+)(?:\{(?P<labels>[^}]*)\})?\s+(?P<value>[-+0-9.eE]+)\s*$")
 _LABEL_RE = re.compile(r'(\w+)="([^"]*)"')
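For readers unfamiliar with the Pushgateway text format, here is a rough sketch of how the two regexes and STALE_SECONDS from this hunk could fit together. The regexes and the constant are copied from the diff; `parse_line`, `is_stale`, the metric name `k8s_upgrade_last_check_timestamp`, and its `job` label are hypothetical, chosen only for illustration. The script's actual helpers (parse_metrics, select, etc.) are not shown in this diff and may work differently.

```python
import re

# Copied verbatim from the diff.
_METRIC_RE = re.compile(r"^(?P<name>\w+)(?:\{(?P<labels>[^}]*)\})?\s+(?P<value>[-+0-9.eE]+)\s*$")
_LABEL_RE = re.compile(r'(\w+)="([^"]*)"')
STALE_SECONDS = 90000


def parse_line(line):
    """Hypothetical helper: parse one exposition line into (name, labels, value).

    Returns None for lines that are not samples, such as '# HELP' / '# TYPE'.
    """
    m = _METRIC_RE.match(line)
    if m is None:
        return None
    labels = dict(_LABEL_RE.findall(m.group("labels") or ""))
    return m.group("name"), labels, float(m.group("value"))


def is_stale(last_run_ts, now_ts):
    """Hypothetical helper: True if the detector result is older than the threshold."""
    return (now_ts - last_run_ts) > STALE_SECONDS


# Made-up sample; the value is 2024-01-07 23:00:00 UTC as a Unix timestamp.
sample = 'k8s_upgrade_last_check_timestamp{job="k8s-upgrade"} 1.7046684e+09'
name, labels, last_run = parse_line(sample)

# Report at Monday 06:07 UTC: 25620 s later in a normal week,
# 7 days + 25620 s later if the Sunday run was missed.
on_time = is_stale(last_run, last_run + 25620)
missed = is_stale(last_run, last_run + 7 * 86400 + 25620)
print(name, labels, last_run, on_time, missed)
```

Comment lines in the exposition format start with `#`, which `_METRIC_RE` cannot match, so the sketch skips them without a separate check.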