k8s-upgrade: nightly Slack report monitor + scope chain-failed alert to phases
All checks were successful
ci/woodpecker/push/default Pipeline was successful
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Adds a daily visibility layer so every night's autonomous-upgrade outcome is reviewable at a glance during the upgrade-cleanup window (Viktor: "track every night's upgrade for the next 7 days; clean up all bugs and blockers"). Last night (2026-06-20) confirmed BOTH prior fixes work in production: the detector resolved target 1.35.6 (k8s_upgrade_available) and the compat gate correctly REFUSED it (k8s_upgrade_blocked=1 -> K8sUpgradeBlocked) because ESO v0.12 (<=1.31) and kyverno v1.16 (<=1.34) don't support 1.35. What's here: - CronJob k8s-upgrade-nightly-report (06:07 UTC) -> one Slack summary/morning: running version, detector freshness, detected target, outcome (no-op / blocked+live reasons / upgraded / in-progress / detector-stale), recent jobs. Read-only: reads Pushgateway gauges + live nodes/jobs, re-runs compat-gate.py for fresh blockers; reuses the chain SA + slack_webhook + scripts ConfigMap. Pure helpers unit-tested (test_nightly_report.py, 8 cases incl. a real v-prefix bug TDD caught). Verified end-to-end in-cluster (posted to Slack). - K8sUpgradeChainJobFailed regex scoped from `k8s-upgrade-.*` to `k8s-upgrade-(preflight|master|worker|postflight)-.*` so the new report job (or any future helper) can't false-trip the chain-wedged alarm. Manual state repair (no git artifact): imported the orphaned `alert-digest` CronJob into the monitoring stack state (`tg import module.monitoring.kubernetes_cron_job_v1.alert_digest monitoring/alert-digest`). Root cause: when alert_digest was added (2026-06-12) the apply recorded its ConfigMap + Secret but not the CronJob, so every full monitoring apply since has failed with `cronjobs.batch "alert-digest" already exists` (Woodpecker pipeline 298 today) — surviving only via targeted prometheus applies. Now in state, so monitoring CI applies cleanly again. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
This commit is contained in:
parent
7270e2be3b
commit
ead876ec65
6 changed files with 431 additions and 3 deletions
|
|
@ -2244,6 +2244,10 @@ serverFiles:
|
|||
# idempotency guard the next detection cycle deletes + re-spawns the
|
||||
# Failed Job (clearing this within ~24h); a sustained firing means it
|
||||
# re-failed — investigate the root cause.
|
||||
# Scoped to the four chain PHASE jobs (preflight|master|worker|
|
||||
# postflight) — NOT a bare `k8s-upgrade-.*`, which would also match
|
||||
# helper jobs in the namespace like k8s-upgrade-nightly-report-* and
|
||||
# false-fire if one of those failed.
|
||||
# `unless on() k8s_upgrade_blocked == 1` excludes the case where the
|
||||
# preflight terminally failed because the compat gate deliberately
|
||||
# REFUSED the target: block() exits 1 (so the Failed Job re-spawns
|
||||
|
|
@ -2255,7 +2259,7 @@ serverFiles:
|
|||
# block until the next run's preflight resets it to 0, so the exclusion
|
||||
# holds for the whole blocked period.
|
||||
- alert: K8sUpgradeChainJobFailed
|
||||
expr: (kube_job_status_failed{namespace="k8s-upgrade", job_name=~"k8s-upgrade-.*", reason=~"BackoffLimitExceeded|DeadlineExceeded"} > 0) unless on() (k8s_upgrade_blocked == 1)
|
||||
expr: (kube_job_status_failed{namespace="k8s-upgrade", job_name=~"k8s-upgrade-(preflight|master|worker|postflight)-.*", reason=~"BackoffLimitExceeded|DeadlineExceeded"} > 0) unless on() (k8s_upgrade_blocked == 1)
|
||||
for: 15m
|
||||
labels:
|
||||
severity: warning
|
||||
|
|
|
|||
Loading…
Add table
Add a link
Reference in a new issue