diff --git a/.claude/skills/upgrade-state/SKILL.md b/.claude/skills/upgrade-state/SKILL.md
index f88c228e..34fe4731 100644
--- a/.claude/skills/upgrade-state/SKILL.md
+++ b/.claude/skills/upgrade-state/SKILL.md
@@ -51,7 +51,7 @@ Exit codes: `0` healthy, `1` attention warranted, `2` stalled / broken.
 |---|---|---|---|
 | **Apps** | Keel polls every watched Deployment's container registry; rolls on new digest | hourly | Prom (`pending_approvals`, `registries_scanned_total`), Keel pod logs |
 | **OS** | `unattended-upgrades` in-release patching; `kured` reboots when `/var/run/reboot-required` is set | daily 02:00-06:00 London | SSH fan-out to all 5 nodes |
-| **K8s** | `k8s-version-check` CronJob detects new kubeadm patch/minor; spawns the Job-chain that drains+upgrades node-by-node | daily 12:00 UTC | Pushgateway (`k8s_upgrade_*`), `kubectl get nodes` |
+| **K8s** | `k8s-version-check` CronJob detects new kubeadm patch/minor; spawns the Job-chain that drains+upgrades node-by-node | nightly 23:00 UTC | Pushgateway (`k8s_upgrade_*`), `kubectl get nodes` |
 
 The K8s pipeline pushes a small set of gauges to the Prometheus Pushgateway
 (`prometheus-prometheus-pushgateway.monitoring:9091`):
diff --git a/docs/architecture/automated-upgrades.md b/docs/architecture/automated-upgrades.md
index 8829ec46..0b8837cb 100644
--- a/docs/architecture/automated-upgrades.md
+++ b/docs/architecture/automated-upgrades.md
@@ -252,7 +252,7 @@ kubeadm/kubelet/kubectl bumps (patch + minor) on all 5 K8s VMs.
 ### Architecture
 
 ```
-k8s-version-check CronJob (Sun 12:00 UTC, k8s-upgrade ns)
+k8s-version-check CronJob (23:00 UTC nightly, k8s-upgrade ns)
   │ probe apt-cache madison kubeadm (master) → latest available patch
   │ probe HEAD https://pkgs.k8s.io/.../v/deb/Release → next minor?
   │ push k8s_upgrade_available metric to Pushgateway
diff --git a/docs/runbooks/k8s-version-upgrade.md b/docs/runbooks/k8s-version-upgrade.md
index 6d07030b..5439a498 100644
--- a/docs/runbooks/k8s-version-upgrade.md
+++ b/docs/runbooks/k8s-version-upgrade.md
@@ -7,7 +7,7 @@ VMs are upgraded automatically by a weekly detection CronJob that seeds a
 chain of small phase Jobs. Each Job is **pinned to a node that is NOT its
 drain target** — so no pod in the chain can preempt itself.
 
-The chain (Sun 12:00 UTC weekly):
+The chain (23:00 UTC nightly):
 
 ```
 detection CronJob → preflight Job → master Job → one worker Job per worker (enumerated live) → postflight Job
@@ -16,7 +16,7 @@ detection CronJob → preflight Job → master Job → one worker Job per worker
 This is **independent** of the OS-side `unattended-upgrades + kured`
 pipeline (see `k8s-node-auto-upgrades.md`). They do not share rollouts.
 Schedules can overlap (kured runs daily 02:00-06:00 London; detection
-here runs Sun 12:00 UTC) — when a kured reboot lands within 24h of the
+here runs 23:00 UTC nightly) — when a kured reboot lands within 24h of the
 Sunday detection, the `RecentNodeReboot` alert in the Upgrade Gates
 group blocks the version-upgrade preflight, so the chain self-defers
 to the next Sunday rather than rolling on top of a half-fresh node.
@@ -24,7 +24,7 @@ to the next Sunday rather than rolling on top of a half-fresh node.
 ## Architecture
 
 ```
-k8s-version-check CronJob (Sun 12:00 UTC, k8s-upgrade ns, SA: k8s-upgrade-job)
+k8s-version-check CronJob (23:00 UTC nightly, k8s-upgrade ns, SA: k8s-upgrade-job)
   │ kubectl get nodes → running version
   │ ssh master 'apt-cache madison kubeadm' → latest patch (within current minor)
   │ HEAD pkgs.k8s.io/.../v/deb/Release → next minor available?
@@ -97,7 +97,7 @@ Job 6 — postflight (no pinning)
 | **ConfigMap `k8s-upgrade-job-template`** | Mounts `/template/job-template.yaml` — universal Job manifest with envsubst placeholders. Rendered by upgrade-step.sh and the detection CronJob via `envsubst | kubectl apply`. |
 | **ServiceAccount `k8s-upgrade-job`** | Used by both the detection CronJob and every chain Job. ClusterRole binding grants: nodes get/list/patch, pods/eviction create, pods delete, batch/jobs CRUD, PDB list (for predrain_unstick), CronJob get (snapshot trigger), namespaces patch on `k8s-upgrade` only. Namespace-scoped Role binding grants secrets:get on `k8s-upgrade-creds`. |
 | **ExternalSecret `k8s-upgrade-creds`** | Syncs `secret/k8s-upgrade/{ssh_key, slack_webhook}` from Vault. Mounted into every Job at `/secrets/k8s-upgrade`. |
-| **CronJob `k8s-version-check`** | Sun 12:00 UTC. Probes apt + pkgs.k8s.io for target. If found, renders Job 0 from `job-template.yaml` and applies it. |
+| **CronJob `k8s-version-check`** | 23:00 UTC nightly. Probes apt + pkgs.k8s.io for target. If found, renders Job 0 from `job-template.yaml` and applies it. |
 
 ### Pushgateway metrics
 
diff --git a/scripts/upgrade_state.sh b/scripts/upgrade_state.sh
index 51fbf5d8..e5722f8b 100755
--- a/scripts/upgrade_state.sh
+++ b/scripts/upgrade_state.sh
@@ -10,7 +10,7 @@
 #      keel.sh/policy. Metrics on container :9300/metrics.
 #   2. OS — unattended-upgrades patches in-release per node; kured
 #      reboots within a daily 02:00-06:00 London window.
-#   3. K8s — k8s-version-check CronJob (Sun 12:00 UTC) detects new
+#   3. K8s — k8s-version-check CronJob (23:00 UTC nightly) detects new
 #      kubeadm patch/minor releases; Job-chain drains+upgrades
 #      node-by-node. Pushgateway holds k8s_upgrade_* gauges.
 #
@@ -443,7 +443,7 @@ collect_k8s() {
     fi
   fi
 
-  K8S_NEXT="$(next_daily_noon_utc)"
+  K8S_NEXT="$(next_scheduled_run_utc)"
 
   # Failed chain-Job detection. A preflight/phase Job can abort BEFORE pushing
   # k8s_upgrade_in_flight=1 (the preflight gates exit pre-metric), so in-flight
@@ -496,15 +496,15 @@ collect_k8s() {
   fi
 }
 
-# Next daily 12:00 UTC — pure bash date math, no croniter. Schedule was
-# weekly Sunday until 2026-05-18; now `0 12 * * *` in the
-# k8s-version-upgrade stack. If we're still before today's 12:00 UTC,
-# the next run is today; otherwise it's tomorrow.
-next_daily_noon_utc() {
+# Next daily 23:00 UTC — pure bash date math, no croniter. Schedule is
+# `0 23 * * *` in the k8s-version-upgrade stack (overnight; moved from 12:00 UTC
+# on 2026-06-17). If we're still before today's 23:00 UTC the next run is today;
+# otherwise tomorrow. `10#` forces base 10 so hours 08 and 09 are not read as octal.
+next_scheduled_run_utc() {
   local hr days_ahead
   hr=$(date -u +%H)
-  if [[ "$hr" -lt 12 ]]; then days_ahead=0; else days_ahead=1; fi
-  date -u -d "+$days_ahead days" +"%a %Y-%m-%d 12:00 UTC"
+  if [[ "10#$hr" -lt 23 ]]; then days_ahead=0; else days_ahead=1; fi
+  date -u -d "+$days_ahead days" +"%a %Y-%m-%d 23:00 UTC"
 }
 
 # --- Renderers ---
diff --git a/stacks/k8s-version-upgrade/main.tf b/stacks/k8s-version-upgrade/main.tf
index e172955b..1a77510e 100644
--- a/stacks/k8s-version-upgrade/main.tf
+++ b/stacks/k8s-version-upgrade/main.tf
@@ -27,12 +27,13 @@
 
 variable "schedule" {
   type = string
-  # Daily 12:00 UTC — outside kured window (kured runs 02:00-06:00
-  # London). Was weekly Sunday until 2026-05-18; daily picks up upstream
-  # patch releases the same day they land. Concurrency is bounded by the
-  # CronJob's Forbid policy + Job-name idempotency (the detection job
-  # skips spawning a preflight Job if one already exists).
-  default = "0 12 * * *"
+  # Nightly 23:00 UTC (00:00 London) — overnight / low cluster usage, and clear
+  # of the kured OS-reboot window (01:00-05:00 UTC = 02:00-06:00 London) so the
+  # two drain-pipelines never overlap. Moved from 12:00 UTC noon on 2026-06-17
+  # (Viktor: disruptive node drains should run overnight). Was weekly Sunday
+  # until 2026-05-18. Concurrency bounded by the CronJob's Forbid policy +
+  # retry-on-failure Job-name idempotency.
+  default = "0 23 * * *"
 }
 
 variable "enabled" {