k8s-upgrade: move version-check cadence from daily to weekly (Sun check, Mon report)

Viktor asked to move the upgrade checks to weekly. With the actionable-vs-held
gate now quieting the routine 'held' churn (e.g. 1.36), a daily check + attempt
buys little; weekly is enough. Accepted trade-off: k8s patch (incl. security)
uptake now lags by up to 7 days instead of <=1 day.

- var.schedule:        0 23 * * *  ->  0 23 * * 0   (detector: weekly Sunday 23:00 UTC)
- var.report_schedule: 7 6 * * *   ->  7 6 * * 1    (report: Monday 06:07 UTC, ~7h
  after the Sunday check, so nightly-report.py's ~25h staleness threshold stays
  valid AND still flags a missed weekly run; no STALE_SECONDS change needed)
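
The schedule arithmetic in the two bullets above can be checked with a short
sketch. It assumes only the values quoted in this message (Sunday 23:00 UTC
check, Monday 06:07 UTC report, a 90000 s threshold); the function names are
hypothetical and are not taken from nightly-report.py.

```python
from datetime import datetime, timedelta, timezone

STALE_SECONDS = 90000  # ~25h threshold quoted for nightly-report.py


def detector_age_at_report(missed_runs: int = 0) -> timedelta:
    """Age of the newest detector run when the Monday report fires.

    missed_runs=0: the Sunday check ran as scheduled.
    missed_runs=1: it was skipped, so the newest run is a week older.
    """
    # Example dates only: 2026-06-28 is a Sunday, 2026-06-29 the Monday after.
    sunday_check = datetime(2026, 6, 28, 23, 0, tzinfo=timezone.utc)
    monday_report = datetime(2026, 6, 29, 6, 7, tzinfo=timezone.utc)
    newest_run = sunday_check - timedelta(weeks=missed_runs)
    return monday_report - newest_run


def is_stale(age: timedelta) -> bool:
    """True when the detector is older than the staleness threshold."""
    return age.total_seconds() > STALE_SECONDS


on_time = detector_age_at_report(0)  # 7h07m
missed = detector_age_at_report(1)   # 7d 7h07m
```

With the Sunday run present the detector is about 7h old at report time, well
under the threshold; with one missed run it is over 7 days old, so the same
threshold still flags it.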

The report CronJob keeps its historical name k8s-upgrade-nightly-report (rename
= churn). Cadence wording updated across main.tf comments, nightly-report.py
docstring, and the runbook.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Viktor Barzin 2026-06-29 06:22:20 +00:00
parent e43e64c666
commit 3398873a16
3 changed files with 44 additions and 39 deletions

@ -3,11 +3,11 @@
## Overview ## Overview
Kubernetes component versions (`kubeadm`/`kubelet`/`kubectl`) on the 7 K8s Kubernetes component versions (`kubeadm`/`kubelet`/`kubectl`) on the 7 K8s
nodes (k8s-master + k8s-node1..6) are upgraded automatically by a nightly nodes (k8s-master + k8s-node1..6) are upgraded automatically by a weekly
detection CronJob that seeds a chain of small phase Jobs. Each Job is **pinned to a node that is NOT its detection CronJob that seeds a chain of small phase Jobs. Each Job is **pinned to a node that is NOT its
drain target** — so no pod in the chain can preempt itself. drain target** — so no pod in the chain can preempt itself.
The chain (23:00 UTC nightly): The chain (weekly Sunday 23:00 UTC):
``` ```
detection CronJob → preflight Job → master Job → one worker Job per worker (enumerated live) → postflight Job detection CronJob → preflight Job → master Job → one worker Job per worker (enumerated live) → postflight Job
@ -16,7 +16,7 @@ detection CronJob → preflight Job → master Job → one worker Job per worker
This is **independent** of the OS-side `unattended-upgrades + kured` This is **independent** of the OS-side `unattended-upgrades + kured`
pipeline (see `k8s-node-auto-upgrades.md`). They do not share rollouts. pipeline (see `k8s-node-auto-upgrades.md`). They do not share rollouts.
Schedules can overlap (kured runs daily 02:00-06:00 London; detection Schedules can overlap (kured runs daily 02:00-06:00 London; detection
here runs 23:00 UTC nightly) — when a kured reboot lands within 24h of the here runs weekly Sunday 23:00 UTC) — when a kured reboot lands within 24h of the
Sunday detection, the `RecentNodeReboot` alert in the Upgrade Gates Sunday detection, the `RecentNodeReboot` alert in the Upgrade Gates
group blocks the version-upgrade preflight, so the chain self-defers group blocks the version-upgrade preflight, so the chain self-defers
to the next Sunday rather than rolling on top of a half-fresh node. to the next Sunday rather than rolling on top of a half-fresh node.
@ -24,7 +24,7 @@ to the next Sunday rather than rolling on top of a half-fresh node.
## Architecture ## Architecture
``` ```
k8s-version-check CronJob (23:00 UTC nightly, k8s-upgrade ns, SA: k8s-upgrade-job) k8s-version-check CronJob (weekly Sunday 23:00 UTC, k8s-upgrade ns, SA: k8s-upgrade-job)
│ kubectl get nodes → running version │ kubectl get nodes → running version
│ ssh master 'apt-cache madison kubeadm' → latest patch (within current minor) │ ssh master 'apt-cache madison kubeadm' → latest patch (within current minor)
│ HEAD pkgs.k8s.io/.../v<NEXT_MINOR>/deb/Release → next minor available? │ HEAD pkgs.k8s.io/.../v<NEXT_MINOR>/deb/Release → next minor available?
@ -134,11 +134,11 @@ for actionable, `k8s_upgrade_held=1` for held), sets `HALT_CHAIN` so the chain
doesn't advance, and **exits 0 — the Job Completes cleanly** (a refusal is a doesn't advance, and **exits 0 — the Job Completes cleanly** (a refusal is a
decision, not a failure: no Failed Job, no `K8sUpgradeChainJobFailed`). It's decision, not a failure: no Failed Job, no `K8sUpgradeChainJobFailed`). It's
before any mutation, so no rollback. Reasons (grouped by class) appear in the before any mutation, so no rollback. Reasons (grouped by class) appear in the
**morning nightly report**, not a per-run Slack. **morning weekly report**, not a per-run Slack.
- **Actionable**`K8sUpgradeBlocked` fires (once, via alert-on-change). Clear - **Actionable**`K8sUpgradeBlocked` fires (once, via alert-on-change). Clear
it by doing the named upgrade/migration; the next nightly run proceeds. it by doing the named upgrade/migration; the next weekly run proceeds.
- **Held****deliberately NO alert** — only the nightly report's `⏸️ HELD` - **Held****deliberately NO alert** — only the weekly report's `⏸️ HELD`
line, because it can't be actioned now (a nightly alert would cry wolf). It line, because it can't be actioned now (a nightly alert would cry wolf). It
clears itself once upstream ships support (refresh `addon-compat.json`) or the clears itself once upstream ships support (refresh `addon-compat.json`) or the
pin is lifted (delete `pinned`+`pin_reason`). The detector re-evaluates every pin is lifted (delete `pinned`+`pin_reason`). The detector re-evaluates every
@ -171,7 +171,7 @@ it current**; the gate reads it on every run. Gate logic:
| **ConfigMap `k8s-upgrade-job-template`** | Mounts `/template/job-template.yaml` — universal Job manifest with envsubst placeholders. Rendered by upgrade-step.sh and the detection CronJob via `envsubst | kubectl apply`. | | **ConfigMap `k8s-upgrade-job-template`** | Mounts `/template/job-template.yaml` — universal Job manifest with envsubst placeholders. Rendered by upgrade-step.sh and the detection CronJob via `envsubst | kubectl apply`. |
| **ServiceAccount `k8s-upgrade-job`** | Used by both the detection CronJob and every chain Job. ClusterRole binding grants: nodes get/list/patch, pods/eviction create, pods delete, batch/jobs CRUD, PDB list (for predrain_unstick), CronJob get (snapshot trigger), namespaces patch on `k8s-upgrade` only. Namespace-scoped Role binding grants secrets:get on `k8s-upgrade-creds`. | | **ServiceAccount `k8s-upgrade-job`** | Used by both the detection CronJob and every chain Job. ClusterRole binding grants: nodes get/list/patch, pods/eviction create, pods delete, batch/jobs CRUD, PDB list (for predrain_unstick), CronJob get (snapshot trigger), namespaces patch on `k8s-upgrade` only. Namespace-scoped Role binding grants secrets:get on `k8s-upgrade-creds`. |
| **ExternalSecret `k8s-upgrade-creds`** | Syncs `secret/k8s-upgrade/{ssh_key, slack_webhook}` from Vault. Mounted into every Job at `/secrets/k8s-upgrade`. | | **ExternalSecret `k8s-upgrade-creds`** | Syncs `secret/k8s-upgrade/{ssh_key, slack_webhook}` from Vault. Mounted into every Job at `/secrets/k8s-upgrade`. |
| **CronJob `k8s-version-check`** | 23:00 UTC nightly. Probes apt + pkgs.k8s.io for target. If found, renders Job 0 from `job-template.yaml` and applies it. | | **CronJob `k8s-version-check`** | weekly Sunday 23:00 UTC. Probes apt + pkgs.k8s.io for target. If found, renders Job 0 from `job-template.yaml` and applies it. |
### Pushgateway metrics ### Pushgateway metrics
@ -194,14 +194,15 @@ Pushed by upgrade-step.sh during phase execution; observed by the
- **`EtcdPreUpgradeSnapshotMissing`** — `k8s_upgrade_in_flight==1 && k8s_upgrade_snapshot_taken==0` for 10m. Catches preflight Stage 2 failing silently. - **`EtcdPreUpgradeSnapshotMissing`** — `k8s_upgrade_in_flight==1 && k8s_upgrade_snapshot_taken==0` for 10m. Catches preflight Stage 2 failing silently.
- **`K8sUpgradeStalled`** — `k8s_upgrade_in_flight==1 && time()-k8s_upgrade_started_timestamp > 5400` for 5m. Catches a Job in the chain dying without spawning its successor. - **`K8sUpgradeStalled`** — `k8s_upgrade_in_flight==1 && time()-k8s_upgrade_started_timestamp > 5400` for 5m. Catches a Job in the chain dying without spawning its successor.
- **`K8sUpgradeChainJobFailed`** — `kube_job_status_failed{namespace="k8s-upgrade",job_name=~"k8s-upgrade-(preflight|master|worker|postflight)-.*",reason=~"BackoffLimitExceeded|DeadlineExceeded"} > 0` for 15m (warning). Catches a phase Job that **terminally failed before `k8s_upgrade_in_flight` was set** — the preflight gates exit pre-metric, so the two `in_flight`-based alerts above are blind to a failed preflight (this is what hid the 5-day 1.34.9 wedge on 2026-06-12). Reason-scoped to terminal job conditions so a retry-success doesn't false-positive (a bare failed-pod-count would otherwise also block kured for the Job's 7d TTL). The old `unless on() (k8s_upgrade_blocked == 1)` clause was **dropped 2026-06-28**: compat-gate refusals now Complete cleanly (exit 0) instead of Failing, so a terminally-Failed chain Job again means a genuine wedge with nothing to exclude. - **`K8sUpgradeChainJobFailed`** — `kube_job_status_failed{namespace="k8s-upgrade",job_name=~"k8s-upgrade-(preflight|master|worker|postflight)-.*",reason=~"BackoffLimitExceeded|DeadlineExceeded"} > 0` for 15m (warning). Catches a phase Job that **terminally failed before `k8s_upgrade_in_flight` was set** — the preflight gates exit pre-metric, so the two `in_flight`-based alerts above are blind to a failed preflight (this is what hid the 5-day 1.34.9 wedge on 2026-06-12). Reason-scoped to terminal job conditions so a retry-success doesn't false-positive (a bare failed-pod-count would otherwise also block kured for the Job's 7d TTL). The old `unless on() (k8s_upgrade_blocked == 1)` clause was **dropped 2026-06-28**: compat-gate refusals now Complete cleanly (exit 0) instead of Failing, so a terminally-Failed chain Job again means a genuine wedge with nothing to exclude.
- **`K8sUpgradeBlocked`** — `k8s_upgrade_blocked == 1` (warning). An **ACTIONABLE** compat-gate refusal — a newer version of the lagging addon exists and upgrading it would clear the block (or an in-use deprecated API must be migrated / a node's containerd bumped). Reasons (grouped by class) are in the **morning nightly report**; clear it by doing the named upgrade/migration, after which the next nightly run proceeds (see "Auto-upgrade compat gate"). No upgrade was attempted, so this is not a half-done-rollout alert. **There is deliberately NO companion alert for the held verdict** (`k8s_upgrade_held=1` — waiting-on-upstream / pinned): nothing can be actioned now, so it is surfaced only by the nightly report's `⏸️ HELD` line. - **`K8sUpgradeBlocked`** — `k8s_upgrade_blocked == 1` (warning). An **ACTIONABLE** compat-gate refusal — a newer version of the lagging addon exists and upgrading it would clear the block (or an in-use deprecated API must be migrated / a node's containerd bumped). Reasons (grouped by class) are in the **morning weekly report**; clear it by doing the named upgrade/migration, after which the next weekly run proceeds (see "Auto-upgrade compat gate"). No upgrade was attempted, so this is not a half-done-rollout alert. **There is deliberately NO companion alert for the held verdict** (`k8s_upgrade_held=1` — waiting-on-upstream / pinned): nothing can be actioned now, so it is surfaced only by the weekly report's `⏸️ HELD` line.
- The first four alerts ALSO block kured (same `--prometheus-url` halt-on-alert mechanism) so the OS-reboot pipeline can't run on top of a half-done version upgrade. - The first four alerts ALSO block kured (same `--prometheus-url` halt-on-alert mechanism) so the OS-reboot pipeline can't run on top of a half-done version upgrade.
### Nightly upgrade report (Slack) ### Weekly upgrade report (Slack)
CronJob `k8s-upgrade-nightly-report` (k8s-upgrade ns, `var.report_schedule`, CronJob `k8s-upgrade-nightly-report` (k8s-upgrade ns, `var.report_schedule`,
default `7 6 * * *` = 06:07 UTC — after the 23:00 chain, before the 08:00 London default `7 6 * * 1` = Monday 06:07 UTC — after the Sunday-night chain, before the
alert-digest) posts ONE Slack summary each morning of the previous night's run: 08:00 London alert-digest; historical CronJob name kept) posts ONE Slack summary
each Monday of the past week's run:
running version, detector freshness, detected target + kind, the outcome running version, detector freshness, detected target + kind, the outcome
(⚪ no upgrade needed / 🔴 blocked-actionable + reasons / ⏸️ held = waiting-upstream/pinned / (⚪ no upgrade needed / 🔴 blocked-actionable + reasons / ⏸️ held = waiting-upstream/pinned /
🟢 upgraded / 🟡 in progress / ⚠️ detector stale), and recent chain jobs. Read-only — it reads 🟢 upgraded / 🟡 in progress / ⚠️ detector stale), and recent chain jobs. Read-only — it reads
@ -209,7 +210,7 @@ the Pushgateway gauges + live nodes/jobs and re-runs `compat-gate.py` for fresh
blocker reasons; reuses the chain's SA + `slack_webhook` + scripts ConfigMap. blocker reasons; reuses the chain's SA + `slack_webhook` + scripts ConfigMap.
Logic + unit tests: `scripts/nightly-report.py`, `scripts/test_nightly_report.py`. Logic + unit tests: `scripts/nightly-report.py`, `scripts/test_nightly_report.py`.
This is the day-to-day visibility layer (it does NOT replace the alerts above — This is the day-to-day visibility layer (it does NOT replace the alerts above —
those fire on problems; this reports the outcome every night). Manual run: those fire on problems; this reports the outcome every week). Manual run:
`kubectl -n k8s-upgrade create job --from=cronjob/k8s-upgrade-nightly-report nightly-report-test` `kubectl -n k8s-upgrade create job --from=cronjob/k8s-upgrade-nightly-report nightly-report-test`
(name it WITHOUT a `k8s-upgrade-{phase}-` prefix so a failure can't trip (name it WITHOUT a `k8s-upgrade-{phase}-` prefix so a failure can't trip
`K8sUpgradeChainJobFailed`). `K8sUpgradeChainJobFailed`).
@ -356,7 +357,7 @@ which `K8sUpgradeStalled` cannot see).
`spawn_next` delete + re-spawn a terminally-Failed Job instead of skipping it on `spawn_next` delete + re-spawn a terminally-Failed Job instead of skipping it on
name-existence (retry-on-failure), so a transient preflight gate — e.g. a spurious name-existence (retry-on-failure), so a transient preflight gate — e.g. a spurious
critical alert like the ttyd web-terminal probe that wedged 1.34.9 for 5 days — critical alert like the ttyd web-terminal probe that wedged 1.34.9 for 5 days —
clears on the next daily cycle. A mid-chain phase that keeps failing still needs clears on the next weekly cycle. A mid-chain phase that keeps failing still needs
manual recovery: fix the root cause, then: manual recovery: fix the root cause, then:
```bash ```bash

@ -27,13 +27,15 @@
variable "schedule" { variable "schedule" {
type = string type = string
# Nightly 23:00 UTC (00:00 London) overnight / low cluster usage, and clear # Weekly Sunday 23:00 UTC (00:00 London) overnight / low cluster usage, clear
# of the kured OS-reboot window (01:00-05:00 UTC = 02:00-06:00 London) so the # of the kured OS-reboot window (01:00-05:00 UTC = 02:00-06:00 London) so the
# two drain-pipelines never overlap. Moved from 12:00 UTC noon on 2026-06-17 # two drain-pipelines never overlap. Cadence history: weekly Sunday until
# (Viktor: disruptive node drains should run overnight). Was weekly Sunday # 2026-05-18 daily noon daily 23:00 (2026-06-17) back to weekly Sunday
# until 2026-05-18. Concurrency bounded by the CronJob's Forbid policy + # (2026-06-28). Rationale for weekly: the actionable-vs-held gate now quiets the
# retry-on-failure Job-name idempotency. # routine "held" churn (e.g. 1.36), so a daily check/attempt buys little; weekly
default = "0 23 * * *" # is enough and patch uptake lags 7d (an accepted trade-off). Concurrency
# bounded by the CronJob's Forbid policy + retry-on-failure Job-name idempotency.
default = "0 23 * * 0"
} }
variable "enabled" { variable "enabled" {
@ -41,13 +43,15 @@ variable "enabled" {
default = true default = true
} }
# Nightly upgrade-report CronJob schedule. 06:07 UTC (07:07 London) safely # Weekly upgrade-report CronJob schedule. Monday 06:07 UTC (07:07 London) the
# after the 23:00 chain has finished (worst case ~02:00) and before the 08:00 # morning AFTER the Sunday-night check (~7h later, so nightly-report.py's ~25h
# London alert-digest, so the morning Slack skim shows last night's upgrade # staleness threshold stays valid AND still flags a missed weekly run), and before
# outcome + any live blocker. Posts once/day; read-only. # the 08:00 London alert-digest so the Monday skim shows the weekly upgrade
# outcome + any live blocker. Posts once/week; read-only. (The CronJob keeps the
# historical name k8s-upgrade-nightly-report not renamed to avoid churn.)
variable "report_schedule" { variable "report_schedule" {
type = string type = string
default = "7 6 * * *" default = "7 6 * * 1"
} }
# Mirrors `local.image_tag` in stacks/claude-agent-service/main.tf bump # Mirrors `local.image_tag` in stacks/claude-agent-service/main.tf bump
@ -494,16 +498,16 @@ resource "kubernetes_cron_job_v1" "k8s_version_check" {
# Idempotency: deterministic name reconciles via `apply`. # Idempotency: deterministic name reconciles via `apply`.
JOB_NAME="k8s-upgrade-preflight-$${TARGET//./-}" JOB_NAME="k8s-upgrade-preflight-$${TARGET//./-}"
MASTER_JOB="k8s-upgrade-master-$${TARGET//./-}" MASTER_JOB="k8s-upgrade-master-$${TARGET//./-}"
ANNOUNCE=yes # Slack the spawn? Suppressed for silent nightly re-evaluations of a standing gate refusal. ANNOUNCE=yes # Slack the spawn? Suppressed for silent per-run re-evaluations of a standing gate refusal.
# Idempotency + nightly re-evaluation: # Idempotency + per-run re-evaluation:
# - FAILED preflight (transient gate abort, e.g. a spurious # - FAILED preflight (transient gate abort, e.g. a spurious
# critical alert / unhealthy node) -> delete + re-spawn, announced. # critical alert / unhealthy node) -> delete + re-spawn, announced.
# - COMPLETE preflight but NO master Job spawned -> the compat # - COMPLETE preflight but NO master Job spawned -> the compat
# gate REFUSED the target (blocked/held now Complete cleanly # gate REFUSED the target (blocked/held now Complete cleanly
# rather than Failing). Re-spawn SILENTLY so the gate re-checks # rather than Failing). Re-spawn SILENTLY so the gate re-checks
# nightly (the refusal may have cleared: addon upgraded / matrix # each scheduled run (the refusal may have cleared: addon
# updated / upstream shipped) WITHOUT nightly Slack noise for a # upgraded / matrix updated / upstream shipped) WITHOUT per-run Slack noise for a
# standing refusal the morning report (+ K8sUpgradeBlocked for # standing refusal the morning report (+ K8sUpgradeBlocked for
# actionable) is the signal. # actionable) is the signal.
# - Otherwise (Active, or Complete with the chain advanced) -> skip. # - Otherwise (Active, or Complete with the chain advanced) -> skip.
@ -521,7 +525,7 @@ resource "kubernetes_cron_job_v1" "k8s_version_check" {
slack "Preflight Job $JOB_NAME exists but FAILED — deleting and re-spawning" slack "Preflight Job $JOB_NAME exists but FAILED — deleting and re-spawning"
/usr/local/bin/kubectl -n k8s-upgrade delete job "$JOB_NAME" --wait=true >/dev/null 2>&1 || true /usr/local/bin/kubectl -n k8s-upgrade delete job "$JOB_NAME" --wait=true >/dev/null 2>&1 || true
elif [ "$JOB_COMPLETE" = "True" ] && ! /usr/local/bin/kubectl -n k8s-upgrade get job "$MASTER_JOB" >/dev/null 2>&1; then elif [ "$JOB_COMPLETE" = "True" ] && ! /usr/local/bin/kubectl -n k8s-upgrade get job "$MASTER_JOB" >/dev/null 2>&1; then
echo "Preflight $JOB_NAME Complete + no master Job (gate refused) — silent nightly re-evaluate" echo "Preflight $JOB_NAME Complete + no master Job (gate refused) — silent per-run re-evaluate"
/usr/local/bin/kubectl -n k8s-upgrade delete job "$JOB_NAME" --wait=true >/dev/null 2>&1 || true /usr/local/bin/kubectl -n k8s-upgrade delete job "$JOB_NAME" --wait=true >/dev/null 2>&1 || true
ANNOUNCE=no ANNOUNCE=no
else else
@ -591,7 +595,7 @@ resource "kubernetes_cron_job_v1" "k8s_version_check" {
# #
# Each morning, after the 23:00 chain has finished, posts ONE concise Slack # Each morning, after the 23:00 chain has finished, posts ONE concise Slack
# report of last night's upgrade outcome (no-op / blocked+reasons / upgraded / # report of last night's upgrade outcome (no-op / blocked+reasons / upgraded /
# in-progress) so the autonomous upgrader's nightly result and any live # in-progress) so the autonomous upgrader's weekly result and any live
# blocker is visible at a glance. Read-only: reads the chain's Pushgateway # blocker is visible at a glance. Read-only: reads the chain's Pushgateway
# gauges + live nodes/jobs and re-runs compat-gate.py for fresh blocker reasons. # gauges + live nodes/jobs and re-runs compat-gate.py for fresh blocker reasons.
# Reuses the same SA, creds secret (slack_webhook), and scripts ConfigMap as the # Reuses the same SA, creds secret (slack_webhook), and scripts ConfigMap as the

@ -1,18 +1,18 @@
#!/usr/bin/env python3 #!/usr/bin/env python3
"""Nightly k8s-upgrade report -> Slack. """Weekly k8s-upgrade report -> Slack.
Runs each morning (CronJob k8s-upgrade-nightly-report) after the 23:00 UTC Runs Monday morning (CronJob k8s-upgrade-nightly-report historical name kept)
version-check chain has finished. Reads the chain's Pushgateway gauges + live after the Sunday-night version-check chain has finished. Reads the chain's
cluster state and posts ONE concise, actionable report to Slack so the Pushgateway gauges + live cluster state and posts ONE concise, actionable report
autonomous upgrader's nightly outcome — and any blocker holding it back — is to Slack so the autonomous upgrader's weekly outcome — and any blocker holding it
visible at a glance during the upgrade-cleanup window. back is visible at a glance during the upgrade-cleanup window.
Outcomes it distinguishes: Outcomes it distinguishes:
no upgrade needed cluster already at the latest supported patch no upgrade needed cluster already at the latest supported patch
🔴 BLOCKED compat gate refused the target; lists live reasons 🔴 BLOCKED compat gate refused the target; lists live reasons
🟢 UPGRADED all nodes now on the detected target 🟢 UPGRADED all nodes now on the detected target
🟡 in progress / passed gate passed, chain mid-flight (or partial) 🟡 in progress / passed gate passed, chain mid-flight (or partial)
detector STALE the 23:00 detector did not run last night detector STALE the weekly (Sun 23:00) detector did not run
Read-only. The pure helpers (parse_metrics / select / fmt_age / compose_report) Read-only. The pure helpers (parse_metrics / select / fmt_age / compose_report)
are unit-tested in test_nightly_report.py; all I/O (kubectl, Pushgateway, the are unit-tested in test_nightly_report.py; all I/O (kubectl, Pushgateway, the
@ -28,7 +28,7 @@ import urllib.request
PUSHGW = os.environ.get("PUSHGW", "http://prometheus-prometheus-pushgateway.monitoring:9091/metrics") PUSHGW = os.environ.get("PUSHGW", "http://prometheus-prometheus-pushgateway.monitoring:9091/metrics")
SLACK_FILE = os.environ.get("SLACK_FILE", "/secrets/k8s-upgrade/slack_webhook") SLACK_FILE = os.environ.get("SLACK_FILE", "/secrets/k8s-upgrade/slack_webhook")
SCRIPTS_DIR = os.environ.get("SCRIPTS_DIR", "/scripts") SCRIPTS_DIR = os.environ.get("SCRIPTS_DIR", "/scripts")
STALE_SECONDS = 90000 # ~25h: a nightly detector older than this didn't run last night STALE_SECONDS = 90000 # ~25h: report runs Mon ~7h after the Sun-night check, so >25h ⇒ the weekly check didn't fire
_METRIC_RE = re.compile(r"^(?P<name>\w+)(?:\{(?P<labels>[^}]*)\})?\s+(?P<value>[-+0-9.eE]+)\s*$") _METRIC_RE = re.compile(r"^(?P<name>\w+)(?:\{(?P<labels>[^}]*)\})?\s+(?P<value>[-+0-9.eE]+)\s*$")
_LABEL_RE = re.compile(r'(\w+)="([^"]*)"') _LABEL_RE = re.compile(r'(\w+)="([^"]*)"')
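
As an illustration of how the two regexes above could be applied to a single line of Pushgateway text output, here is a minimal sketch. `parse_line` is a hypothetical helper written for this example, not the script's actual `parse_metrics`, and the `job` label in the usage lines is an assumed label name.

```python
import re

# Same patterns as in the script above.
_METRIC_RE = re.compile(r"^(?P<name>\w+)(?:\{(?P<labels>[^}]*)\})?\s+(?P<value>[-+0-9.eE]+)\s*$")
_LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_line(line):
    """Hypothetical helper: return (name, labels, value), or None when the
    line is blank, a comment (# HELP / # TYPE), or does not match."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    match = _METRIC_RE.match(line)
    if match is None:
        return None
    # The label group is absent for bare metrics, so fall back to "".
    labels = dict(_LABEL_RE.findall(match.group("labels") or ""))
    try:
        value = float(match.group("value"))
    except ValueError:
        # The value character class is loose (e.g. "1e" matches), so guard float().
        return None
    return match.group("name"), labels, value


blocked = parse_line('k8s_upgrade_blocked{job="k8s-upgrade"} 1')
started = parse_line("k8s_upgrade_started_timestamp 1.7825e+09")
comment = parse_line("# TYPE k8s_upgrade_blocked gauge")
```

Values such as `NaN` or `+Inf` do not match the value pattern, so this sketch returns `None` for them rather than raising.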