k8s-upgrade: move version-check cadence from daily to weekly (Sun check, Mon report)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Viktor asked to move the upgrade checks to weekly. With the actionable-vs-held gate now quieting the routine 'held' churn (e.g. 1.36), a daily check + attempt buys little; weekly is enough. Accepted trade-off: k8s patch (incl. security) uptake now lags up to 7 days instead of <=1.

- var.schedule: 0 23 * * * -> 0 23 * * 0 (detector: weekly Sunday 23:00 UTC)
- var.report_schedule: 7 6 * * * -> 7 6 * * 1 (report: Monday 06:07 UTC, ~7h after the Sunday check, so nightly-report.py's ~25h staleness threshold stays valid AND still flags a missed weekly run; no STALE_SECONDS change needed)

The report CronJob keeps its historical name k8s-upgrade-nightly-report (rename = churn). Cadence wording updated across main.tf comments, nightly-report.py docstring, and the runbook.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
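The staleness argument in the commit message can be checked with a little arithmetic. The sketch below is illustrative only: it assumes the report measures staleness as "report time minus last detector run", which the message implies but this page does not show, and the helper names `detector_age_seconds` and `is_stale` are hypothetical. The `STALE_SECONDS` value is the one quoted from `nightly-report.py`.

```python
from datetime import datetime, timedelta, timezone

# Threshold quoted from nightly-report.py (~25h).
STALE_SECONDS = 90000


def detector_age_seconds(last_check: datetime, report_time: datetime) -> float:
    """Age of the last detector run at report time (assumed staleness definition)."""
    return (report_time - last_check).total_seconds()


def is_stale(last_check: datetime, report_time: datetime) -> bool:
    """Hypothetical helper: True when the detector is older than the threshold."""
    return detector_age_seconds(last_check, report_time) > STALE_SECONDS


# Example week: the Sunday 23:00 UTC check and the Monday 06:07 UTC report.
sunday_check = datetime(2026, 6, 28, 23, 0, tzinfo=timezone.utc)
monday_report = datetime(2026, 6, 29, 6, 7, tzinfo=timezone.utc)

# Normal week: the check ran about 7h before the report, so it is fresh.
normal_age = detector_age_seconds(sunday_check, monday_report)

# Missed week: the newest check is the one from seven days earlier.
missed_check = sunday_check - timedelta(days=7)
missed_age = detector_age_seconds(missed_check, monday_report)

print(normal_age, is_stale(sunday_check, monday_report))
print(missed_age, is_stale(missed_check, monday_report))
```

Under that assumption a normal week gives an age of 7h07m, well under the threshold, while a missed run leaves the newest check more than 7 days old, so the unchanged ~25h threshold still catches it.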
parent e43e64c666
commit 3398873a16
3 changed files with 44 additions and 39 deletions
@@ -3,11 +3,11 @@
 ## Overview
 
 Kubernetes component versions (`kubeadm`/`kubelet`/`kubectl`) on the 7 K8s
-nodes (k8s-master + k8s-node1..6) are upgraded automatically by a nightly
+nodes (k8s-master + k8s-node1..6) are upgraded automatically by a weekly
 detection CronJob that seeds a chain of small phase Jobs. Each Job is **pinned to a node that is NOT its
 drain target** — so no pod in the chain can preempt itself.
 
-The chain (23:00 UTC nightly):
+The chain (weekly Sunday 23:00 UTC):
 
 ```
 detection CronJob → preflight Job → master Job → one worker Job per worker (enumerated live) → postflight Job
@@ -16,7 +16,7 @@ detection CronJob → preflight Job → master Job → one worker Job per worker
 This is **independent** of the OS-side `unattended-upgrades + kured`
 pipeline (see `k8s-node-auto-upgrades.md`). They do not share rollouts.
 Schedules can overlap (kured runs daily 02:00-06:00 London; detection
-here runs 23:00 UTC nightly) — when a kured reboot lands within 24h of the
+here runs weekly Sunday 23:00 UTC) — when a kured reboot lands within 24h of the
 Sunday detection, the `RecentNodeReboot` alert in the Upgrade Gates
 group blocks the version-upgrade preflight, so the chain self-defers
 to the next Sunday rather than rolling on top of a half-fresh node.
@@ -24,7 +24,7 @@ to the next Sunday rather than rolling on top of a half-fresh node.
 ## Architecture
 
 ```
-k8s-version-check CronJob (23:00 UTC nightly, k8s-upgrade ns, SA: k8s-upgrade-job)
+k8s-version-check CronJob (weekly Sunday 23:00 UTC, k8s-upgrade ns, SA: k8s-upgrade-job)
 │ kubectl get nodes → running version
 │ ssh master 'apt-cache madison kubeadm' → latest patch (within current minor)
 │ HEAD pkgs.k8s.io/.../v<NEXT_MINOR>/deb/Release → next minor available?
@@ -134,11 +134,11 @@ for actionable, `k8s_upgrade_held=1` for held), sets `HALT_CHAIN` so the chain
 doesn't advance, and **exits 0 — the Job Completes cleanly** (a refusal is a
 decision, not a failure: no Failed Job, no `K8sUpgradeChainJobFailed`). It's
 before any mutation, so no rollback. Reasons (grouped by class) appear in the
-**morning nightly report**, not a per-run Slack.
+**morning weekly report**, not a per-run Slack.
 
 - **Actionable** → `K8sUpgradeBlocked` fires (once, via alert-on-change). Clear
-it by doing the named upgrade/migration; the next nightly run proceeds.
-- **Held** → **deliberately NO alert** — only the nightly report's `⏸️ HELD`
+it by doing the named upgrade/migration; the next weekly run proceeds.
+- **Held** → **deliberately NO alert** — only the weekly report's `⏸️ HELD`
 line, because it can't be actioned now (a nightly alert would cry wolf). It
 clears itself once upstream ships support (refresh `addon-compat.json`) or the
 pin is lifted (delete `pinned`+`pin_reason`). The detector re-evaluates every
@@ -171,7 +171,7 @@ it current**; the gate reads it on every run. Gate logic:
 | **ConfigMap `k8s-upgrade-job-template`** | Mounts `/template/job-template.yaml` — universal Job manifest with envsubst placeholders. Rendered by upgrade-step.sh and the detection CronJob via `envsubst | kubectl apply`. |
 | **ServiceAccount `k8s-upgrade-job`** | Used by both the detection CronJob and every chain Job. ClusterRole binding grants: nodes get/list/patch, pods/eviction create, pods delete, batch/jobs CRUD, PDB list (for predrain_unstick), CronJob get (snapshot trigger), namespaces patch on `k8s-upgrade` only. Namespace-scoped Role binding grants secrets:get on `k8s-upgrade-creds`. |
 | **ExternalSecret `k8s-upgrade-creds`** | Syncs `secret/k8s-upgrade/{ssh_key, slack_webhook}` from Vault. Mounted into every Job at `/secrets/k8s-upgrade`. |
-| **CronJob `k8s-version-check`** | 23:00 UTC nightly. Probes apt + pkgs.k8s.io for target. If found, renders Job 0 from `job-template.yaml` and applies it. |
+| **CronJob `k8s-version-check`** | weekly Sunday 23:00 UTC. Probes apt + pkgs.k8s.io for target. If found, renders Job 0 from `job-template.yaml` and applies it. |
 
 ### Pushgateway metrics
 
@@ -194,14 +194,15 @@ Pushed by upgrade-step.sh during phase execution; observed by the
 - **`EtcdPreUpgradeSnapshotMissing`** — `k8s_upgrade_in_flight==1 && k8s_upgrade_snapshot_taken==0` for 10m. Catches preflight Stage 2 failing silently.
 - **`K8sUpgradeStalled`** — `k8s_upgrade_in_flight==1 && time()-k8s_upgrade_started_timestamp > 5400` for 5m. Catches a Job in the chain dying without spawning its successor.
 - **`K8sUpgradeChainJobFailed`** — `kube_job_status_failed{namespace="k8s-upgrade",job_name=~"k8s-upgrade-(preflight|master|worker|postflight)-.*",reason=~"BackoffLimitExceeded|DeadlineExceeded"} > 0` for 15m (warning). Catches a phase Job that **terminally failed before `k8s_upgrade_in_flight` was set** — the preflight gates exit pre-metric, so the two `in_flight`-based alerts above are blind to a failed preflight (this is what hid the 5-day 1.34.9 wedge on 2026-06-12). Reason-scoped to terminal job conditions so a retry-success doesn't false-positive (a bare failed-pod-count would otherwise also block kured for the Job's 7d TTL). The old `unless on() (k8s_upgrade_blocked == 1)` clause was **dropped 2026-06-28**: compat-gate refusals now Complete cleanly (exit 0) instead of Failing, so a terminally-Failed chain Job again means a genuine wedge with nothing to exclude.
-- **`K8sUpgradeBlocked`** — `k8s_upgrade_blocked == 1` (warning). An **ACTIONABLE** compat-gate refusal — a newer version of the lagging addon exists and upgrading it would clear the block (or an in-use deprecated API must be migrated / a node's containerd bumped). Reasons (grouped by class) are in the **morning nightly report**; clear it by doing the named upgrade/migration, after which the next nightly run proceeds (see "Auto-upgrade compat gate"). No upgrade was attempted, so this is not a half-done-rollout alert. **There is deliberately NO companion alert for the held verdict** (`k8s_upgrade_held=1` — waiting-on-upstream / pinned): nothing can be actioned now, so it is surfaced only by the nightly report's `⏸️ HELD` line.
+- **`K8sUpgradeBlocked`** — `k8s_upgrade_blocked == 1` (warning). An **ACTIONABLE** compat-gate refusal — a newer version of the lagging addon exists and upgrading it would clear the block (or an in-use deprecated API must be migrated / a node's containerd bumped). Reasons (grouped by class) are in the **morning weekly report**; clear it by doing the named upgrade/migration, after which the next weekly run proceeds (see "Auto-upgrade compat gate"). No upgrade was attempted, so this is not a half-done-rollout alert. **There is deliberately NO companion alert for the held verdict** (`k8s_upgrade_held=1` — waiting-on-upstream / pinned): nothing can be actioned now, so it is surfaced only by the weekly report's `⏸️ HELD` line.
 - The first four alerts ALSO block kured (same `--prometheus-url` halt-on-alert mechanism) so the OS-reboot pipeline can't run on top of a half-done version upgrade.
 
-### Nightly upgrade report (Slack)
+### Weekly upgrade report (Slack)
 
 CronJob `k8s-upgrade-nightly-report` (k8s-upgrade ns, `var.report_schedule`,
-default `7 6 * * *` = 06:07 UTC — after the 23:00 chain, before the 08:00 London
-alert-digest) posts ONE Slack summary each morning of the previous night's run:
+default `7 6 * * 1` = Monday 06:07 UTC — after the Sunday-night chain, before the
+08:00 London alert-digest; historical CronJob name kept) posts ONE Slack summary
+each Monday of the past week's run:
 running version, detector freshness, detected target + kind, the outcome
 (⚪ no upgrade needed / 🔴 blocked-actionable + reasons / ⏸️ held = waiting-upstream/pinned /
 🟢 upgraded / 🟡 in progress / ⚠️ detector stale), and recent chain jobs. Read-only — it reads
@@ -209,7 +210,7 @@ the Pushgateway gauges + live nodes/jobs and re-runs `compat-gate.py` for fresh
 blocker reasons; reuses the chain's SA + `slack_webhook` + scripts ConfigMap.
 Logic + unit tests: `scripts/nightly-report.py`, `scripts/test_nightly_report.py`.
 This is the day-to-day visibility layer (it does NOT replace the alerts above —
-those fire on problems; this reports the outcome every night). Manual run:
+those fire on problems; this reports the outcome every week). Manual run:
 `kubectl -n k8s-upgrade create job --from=cronjob/k8s-upgrade-nightly-report nightly-report-test`
 (name it WITHOUT a `k8s-upgrade-{phase}-` prefix so a failure can't trip
 `K8sUpgradeChainJobFailed`).
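The naming advice for the manual run follows from the `job_name=~` matcher in `K8sUpgradeChainJobFailed`. As a rough illustration, the sketch below applies the same pattern with Python's `re.fullmatch`, since Prometheus regex matchers are fully anchored. The example job names other than `nightly-report-test` are assumptions made for illustration, and `trips_chain_alert` is a hypothetical helper, not part of the repository.

```python
import re

# Pattern copied from the K8sUpgradeChainJobFailed alert expression.
CHAIN_JOB_RE = re.compile(r"k8s-upgrade-(preflight|master|worker|postflight)-.*")


def trips_chain_alert(job_name: str) -> bool:
    """True if a failed Job with this name would be selected by the alert.

    fullmatch mirrors Prometheus's anchored =~ semantics.
    """
    return CHAIN_JOB_RE.fullmatch(job_name) is not None


examples = [
    "k8s-upgrade-preflight-1-34-9",   # assumed chain Job name
    "k8s-upgrade-worker-1-34-9",      # assumed chain Job name
    "nightly-report-test",            # manual run name from the runbook
    "k8s-upgrade-nightly-report-1",   # assumed CronJob-spawned report Job name
]
for name in examples:
    print(f"{name}: {trips_chain_alert(name)}")
```

Only names with one of the four phase words directly after `k8s-upgrade-` are selected, so the manual name `nightly-report-test` stays outside the alert's scope.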
@@ -356,7 +357,7 @@ which `K8sUpgradeStalled` cannot see).
 `spawn_next` delete + re-spawn a terminally-Failed Job instead of skipping it on
 name-existence (retry-on-failure), so a transient preflight gate — e.g. a spurious
 critical alert like the ttyd web-terminal probe that wedged 1.34.9 for 5 days —
-clears on the next daily cycle. A mid-chain phase that keeps failing still needs
+clears on the next weekly cycle. A mid-chain phase that keeps failing still needs
 manual recovery: fix the root cause, then:
 
 ```bash
@@ -27,13 +27,15 @@
 
 variable "schedule" {
 type = string
-# Nightly 23:00 UTC (00:00 London) — overnight / low cluster usage, and clear
+# Weekly Sunday 23:00 UTC (00:00 London) — overnight / low cluster usage, clear
 # of the kured OS-reboot window (01:00-05:00 UTC = 02:00-06:00 London) so the
-# two drain-pipelines never overlap. Moved from 12:00 UTC noon on 2026-06-17
-# (Viktor: disruptive node drains should run overnight). Was weekly Sunday
-# until 2026-05-18. Concurrency bounded by the CronJob's Forbid policy +
-# retry-on-failure Job-name idempotency.
-default = "0 23 * * *"
+# two drain-pipelines never overlap. Cadence history: weekly Sunday until
+# 2026-05-18 → daily noon → daily 23:00 (2026-06-17) → back to weekly Sunday
+# (2026-06-28). Rationale for weekly: the actionable-vs-held gate now quiets the
+# routine "held" churn (e.g. 1.36), so a daily check/attempt buys little; weekly
+# is enough and patch uptake lags ≤7d (an accepted trade-off). Concurrency
+# bounded by the CronJob's Forbid policy + retry-on-failure Job-name idempotency.
+default = "0 23 * * 0"
 }
 
 variable "enabled" {
@@ -41,13 +43,15 @@ variable "enabled" {
 default = true
 }
 
-# Nightly upgrade-report CronJob schedule. 06:07 UTC (07:07 London) — safely
-# after the 23:00 chain has finished (worst case ~02:00) and before the 08:00
-# London alert-digest, so the morning Slack skim shows last night's upgrade
-# outcome + any live blocker. Posts once/day; read-only.
+# Weekly upgrade-report CronJob schedule. Monday 06:07 UTC (07:07 London) — the
+# morning AFTER the Sunday-night check (~7h later, so nightly-report.py's ~25h
+# staleness threshold stays valid AND still flags a missed weekly run), and before
+# the 08:00 London alert-digest so the Monday skim shows the weekly upgrade
+# outcome + any live blocker. Posts once/week; read-only. (The CronJob keeps the
+# historical name k8s-upgrade-nightly-report — not renamed to avoid churn.)
 variable "report_schedule" {
 type = string
-default = "7 6 * * *"
+default = "7 6 * * 1"
 }
 
 # Mirrors `local.image_tag` in stacks/claude-agent-service/main.tf — bump
@@ -494,16 +498,16 @@ resource "kubernetes_cron_job_v1" "k8s_version_check" {
 # Idempotency: deterministic name reconciles via `apply`.
 JOB_NAME="k8s-upgrade-preflight-$${TARGET//./-}"
 MASTER_JOB="k8s-upgrade-master-$${TARGET//./-}"
-ANNOUNCE=yes # Slack the spawn? Suppressed for silent nightly re-evaluations of a standing gate refusal.
+ANNOUNCE=yes # Slack the spawn? Suppressed for silent per-run re-evaluations of a standing gate refusal.
 
-# Idempotency + nightly re-evaluation:
+# Idempotency + per-run re-evaluation:
 # - FAILED preflight (transient gate abort, e.g. a spurious
 # critical alert / unhealthy node) -> delete + re-spawn, announced.
 # - COMPLETE preflight but NO master Job spawned -> the compat
 # gate REFUSED the target (blocked/held now Complete cleanly
 # rather than Failing). Re-spawn SILENTLY so the gate re-checks
-# nightly (the refusal may have cleared: addon upgraded / matrix
-# updated / upstream shipped) WITHOUT nightly Slack noise for a
+# each scheduled run (the refusal may have cleared: addon
+# upgraded / matrix updated / upstream shipped) WITHOUT per-run Slack noise for a
 # standing refusal — the morning report (+ K8sUpgradeBlocked for
 # actionable) is the signal.
 # - Otherwise (Active, or Complete with the chain advanced) -> skip.
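The idempotency rules in the comment block above can be summarised as a small decision function. The following is a minimal sketch in Python, not the shell logic from `main.tf`: the function names and return values are hypothetical. The "no existing Job" branch assumes a fresh spawn is announced, because `ANNOUNCE` defaults to `yes`. `job_name` mirrors the bash expansion `${TARGET//./-}`, which replaces every dot with a dash; the `$$` in the Terraform source is heredoc escaping.

```python
from typing import Tuple


def job_name(phase: str, target: str) -> str:
    """Mirror JOB_NAME="k8s-upgrade-<phase>-${TARGET//./-}" (all dots -> dashes)."""
    return f"k8s-upgrade-{phase}-{target.replace('.', '-')}"


def preflight_action(exists: bool, failed: bool, complete: bool,
                     master_exists: bool) -> Tuple[str, bool]:
    """Return (action, announce) for the detector's preflight handling.

    action is one of "spawn", "respawn", "skip"; announce says whether the
    spawn is posted to Slack (it has no meaning when the action is "skip").
    """
    if not exists:
        return ("spawn", True)       # assumed: first spawn is announced
    if failed:
        return ("respawn", True)     # transient gate abort: delete + re-spawn
    if complete and not master_exists:
        return ("respawn", False)    # gate refused: silent re-evaluation
    return ("skip", False)           # Active, or Complete with chain advanced


print(job_name("preflight", "1.34.9"))
print(preflight_action(exists=True, failed=False, complete=True, master_exists=False))
```

Writing it this way makes the one silent path explicit: only a Complete preflight with no master Job is re-spawned without a Slack message.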
@@ -521,7 +525,7 @@ resource "kubernetes_cron_job_v1" "k8s_version_check" {
 slack "Preflight Job $JOB_NAME exists but FAILED — deleting and re-spawning"
 /usr/local/bin/kubectl -n k8s-upgrade delete job "$JOB_NAME" --wait=true >/dev/null 2>&1 || true
 elif [ "$JOB_COMPLETE" = "True" ] && ! /usr/local/bin/kubectl -n k8s-upgrade get job "$MASTER_JOB" >/dev/null 2>&1; then
-echo "Preflight $JOB_NAME Complete + no master Job (gate refused) — silent nightly re-evaluate"
+echo "Preflight $JOB_NAME Complete + no master Job (gate refused) — silent per-run re-evaluate"
 /usr/local/bin/kubectl -n k8s-upgrade delete job "$JOB_NAME" --wait=true >/dev/null 2>&1 || true
 ANNOUNCE=no
 else
@@ -591,7 +595,7 @@ resource "kubernetes_cron_job_v1" "k8s_version_check" {
 #
 # Each morning, after the 23:00 chain has finished, posts ONE concise Slack
 # report of last night's upgrade outcome (no-op / blocked+reasons / upgraded /
-# in-progress) so the autonomous upgrader's nightly result — and any live
+# in-progress) so the autonomous upgrader's weekly result — and any live
 # blocker — is visible at a glance. Read-only: reads the chain's Pushgateway
 # gauges + live nodes/jobs and re-runs compat-gate.py for fresh blocker reasons.
 # Reuses the same SA, creds secret (slack_webhook), and scripts ConfigMap as the
@@ -1,18 +1,18 @@
 #!/usr/bin/env python3
-"""Nightly k8s-upgrade report -> Slack.
+"""Weekly k8s-upgrade report -> Slack.
 
-Runs each morning (CronJob k8s-upgrade-nightly-report) after the 23:00 UTC
-version-check chain has finished. Reads the chain's Pushgateway gauges + live
-cluster state and posts ONE concise, actionable report to Slack so the
-autonomous upgrader's nightly outcome — and any blocker holding it back — is
-visible at a glance during the upgrade-cleanup window.
+Runs Monday morning (CronJob k8s-upgrade-nightly-report — historical name kept)
+after the Sunday-night version-check chain has finished. Reads the chain's
+Pushgateway gauges + live cluster state and posts ONE concise, actionable report
+to Slack so the autonomous upgrader's weekly outcome — and any blocker holding it
+back — is visible at a glance during the upgrade-cleanup window.
 
 Outcomes it distinguishes:
 ⚪ no upgrade needed — cluster already at the latest supported patch
 🔴 BLOCKED — compat gate refused the target; lists live reasons
 🟢 UPGRADED — all nodes now on the detected target
 🟡 in progress / passed — gate passed, chain mid-flight (or partial)
-⚠️ detector STALE — the 23:00 detector did not run last night
+⚠️ detector STALE — the weekly (Sun 23:00) detector did not run
 
 Read-only. The pure helpers (parse_metrics / select / fmt_age / compose_report)
 are unit-tested in test_nightly_report.py; all I/O (kubectl, Pushgateway, the
@@ -28,7 +28,7 @@ import urllib.request
 PUSHGW = os.environ.get("PUSHGW", "http://prometheus-prometheus-pushgateway.monitoring:9091/metrics")
 SLACK_FILE = os.environ.get("SLACK_FILE", "/secrets/k8s-upgrade/slack_webhook")
 SCRIPTS_DIR = os.environ.get("SCRIPTS_DIR", "/scripts")
-STALE_SECONDS = 90000 # ~25h: a nightly detector older than this didn't run last night
+STALE_SECONDS = 90000 # ~25h: report runs Mon ~7h after the Sun-night check, so >25h ⇒ the weekly check didn't fire
 
 _METRIC_RE = re.compile(r"^(?P<name>\w+)(?:\{(?P<labels>[^}]*)\})?\s+(?P<value>[-+0-9.eE]+)\s*$")
 _LABEL_RE = re.compile(r'(\w+)="([^"]*)"')