k8s-version-upgrade: move detection to nightly 23:00 UTC (overnight upgrades)
Some checks failed
ci/woodpecker/push/default Pipeline failed
Disruptive node drains should run when the cluster is idle. Move the
k8s-version-check detection CronJob from 12:00 UTC (noon) to 23:00 UTC
(00:00 London during BST): overnight, low usage, and clear of the kured
OS-reboot window (02:00-06:00 London, i.e. 01:00-05:00 UTC during BST), so
the two drain pipelines never overlap. (Viktor, 2026-06-17.)
- stacks/k8s-version-upgrade/main.tf: var.schedule default 0 12 * * * → 0 23 * * *.
- scripts/upgrade_state.sh: next_scheduled_run_utc now computes the 23:00 slot
  (was next_daily_noon_utc).
- docs (runbook, architecture) + upgrade-state SKILL: schedule references
  updated to 23:00 UTC nightly.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
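
The London/UTC equivalences quoted in the message depend on daylight saving time. The following is a minimal sketch, not part of the commit, for checking them on any date with GNU `date`. The helper names and the `LONDON_TZ` rule string are assumptions introduced here for illustration.

```shell
#!/usr/bin/env bash
# POSIX TZ rule for UK time (GMT in winter, BST in summer). Spelled out as a
# rule string (an assumption of this sketch) so it does not depend on tzdata.
LONDON_TZ='GMT0BST,M3.5.0/1,M10.5.0'

# London wall-clock time (HH:MM) of the 23:00 UTC detection slot on date $1.
detection_london_time() {
  TZ="$LONDON_TZ" date -d "$1 23:00 UTC" +%H:%M
}

# UTC hour (HH) of London wall-clock time $2 (HH:MM) on date $1.
london_to_utc_hour() {
  local epoch
  epoch=$(TZ="$LONDON_TZ" date -d "$1 $2" +%s)
  date -u -d "@$epoch" +%H
}

detection_london_time 2026-06-17      # summer date (BST)
detection_london_time 2026-12-17      # winter date (GMT)
london_to_utc_hour 2026-06-17 02:00   # kured window start, summer
london_to_utc_hour 2026-12-17 02:00   # kured window start, winter
```

In winter the kured window shifts to 02:00-06:00 UTC and the detection slot reads 23:00 London, so 23:00 UTC stays outside the reboot window in both seasons.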
parent ed53b34bf4
commit c04efa3d3a
5 changed files with 22 additions and 21 deletions
@@ -51,7 +51,7 @@ Exit codes: `0` healthy, `1` attention warranted, `2` stalled / broken.
 |---|---|---|---|
 | **Apps** | Keel polls every watched Deployment's container registry; rolls on new digest | hourly | Prom (`pending_approvals`, `registries_scanned_total`), Keel pod logs |
 | **OS** | `unattended-upgrades` in-release patching; `kured` reboots when `/var/run/reboot-required` is set | daily 02:00-06:00 London | SSH fan-out to all 5 nodes |
-| **K8s** | `k8s-version-check` CronJob detects new kubeadm patch/minor; spawns the Job-chain that drains+upgrades node-by-node | daily 12:00 UTC | Pushgateway (`k8s_upgrade_*`), `kubectl get nodes` |
+| **K8s** | `k8s-version-check` CronJob detects new kubeadm patch/minor; spawns the Job-chain that drains+upgrades node-by-node | nightly 23:00 UTC | Pushgateway (`k8s_upgrade_*`), `kubectl get nodes` |
 
 The K8s pipeline pushes a small set of gauges to the Prometheus
 Pushgateway (`prometheus-prometheus-pushgateway.monitoring:9091`):
@@ -252,7 +252,7 @@ kubeadm/kubelet/kubectl bumps (patch + minor) on all 5 K8s VMs.
 ### Architecture
 
 ```
-k8s-version-check CronJob (Sun 12:00 UTC, k8s-upgrade ns)
+k8s-version-check CronJob (23:00 UTC nightly, k8s-upgrade ns)
 │ probe apt-cache madison kubeadm (master) → latest available patch
 │ probe HEAD https://pkgs.k8s.io/.../v<NEXT_MINOR>/deb/Release → next minor?
 │ push k8s_upgrade_available metric to Pushgateway
@@ -7,7 +7,7 @@ VMs are upgraded automatically by a weekly detection CronJob that seeds a
 chain of small phase Jobs. Each Job is **pinned to a node that is NOT its
 drain target** — so no pod in the chain can preempt itself.
 
-The chain (Sun 12:00 UTC weekly):
+The chain (23:00 UTC nightly):
 
 ```
 detection CronJob → preflight Job → master Job → one worker Job per worker (enumerated live) → postflight Job
@@ -16,7 +16,7 @@ detection CronJob → preflight Job → master Job → one worker Job per worker
 This is **independent** of the OS-side `unattended-upgrades + kured`
 pipeline (see `k8s-node-auto-upgrades.md`). They do not share rollouts.
 Schedules can overlap (kured runs daily 02:00-06:00 London; detection
-here runs Sun 12:00 UTC) — when a kured reboot lands within 24h of the
+here runs 23:00 UTC nightly) — when a kured reboot lands within 24h of the
 Sunday detection, the `RecentNodeReboot` alert in the Upgrade Gates
 group blocks the version-upgrade preflight, so the chain self-defers
 to the next Sunday rather than rolling on top of a half-fresh node.
@@ -24,7 +24,7 @@ to the next Sunday rather than rolling on top of a half-fresh node.
 ## Architecture
 
 ```
-k8s-version-check CronJob (Sun 12:00 UTC, k8s-upgrade ns, SA: k8s-upgrade-job)
+k8s-version-check CronJob (23:00 UTC nightly, k8s-upgrade ns, SA: k8s-upgrade-job)
 │ kubectl get nodes → running version
 │ ssh master 'apt-cache madison kubeadm' → latest patch (within current minor)
 │ HEAD pkgs.k8s.io/.../v<NEXT_MINOR>/deb/Release → next minor available?
@@ -97,7 +97,7 @@ Job 6 — postflight (no pinning)
 | **ConfigMap `k8s-upgrade-job-template`** | Mounts `/template/job-template.yaml` — universal Job manifest with envsubst placeholders. Rendered by upgrade-step.sh and the detection CronJob via `envsubst | kubectl apply`. |
 | **ServiceAccount `k8s-upgrade-job`** | Used by both the detection CronJob and every chain Job. ClusterRole binding grants: nodes get/list/patch, pods/eviction create, pods delete, batch/jobs CRUD, PDB list (for predrain_unstick), CronJob get (snapshot trigger), namespaces patch on `k8s-upgrade` only. Namespace-scoped Role binding grants secrets:get on `k8s-upgrade-creds`. |
 | **ExternalSecret `k8s-upgrade-creds`** | Syncs `secret/k8s-upgrade/{ssh_key, slack_webhook}` from Vault. Mounted into every Job at `/secrets/k8s-upgrade`. |
-| **CronJob `k8s-version-check`** | Sun 12:00 UTC. Probes apt + pkgs.k8s.io for target. If found, renders Job 0 from `job-template.yaml` and applies it. |
+| **CronJob `k8s-version-check`** | 23:00 UTC nightly. Probes apt + pkgs.k8s.io for target. If found, renders Job 0 from `job-template.yaml` and applies it. |
 
 ### Pushgateway metrics
 
@@ -10,7 +10,7 @@
 # keel.sh/policy. Metrics on container :9300/metrics.
 # 2. OS — unattended-upgrades patches in-release per node; kured
 # reboots within a daily 02:00-06:00 London window.
-# 3. K8s — k8s-version-check CronJob (Sun 12:00 UTC) detects new
+# 3. K8s — k8s-version-check CronJob (23:00 UTC nightly) detects new
 # kubeadm patch/minor releases; Job-chain drains+upgrades
 # node-by-node. Pushgateway holds k8s_upgrade_* gauges.
 #
@@ -443,7 +443,7 @@ collect_k8s() {
 fi
 fi
 
-K8S_NEXT="$(next_daily_noon_utc)"
+K8S_NEXT="$(next_scheduled_run_utc)"
 
 # Failed chain-Job detection. A preflight/phase Job can abort BEFORE pushing
 # k8s_upgrade_in_flight=1 (the preflight gates exit pre-metric), so in-flight
@@ -496,15 +496,15 @@ collect_k8s() {
 fi
 }
 
-# Next daily 12:00 UTC — pure bash date math, no croniter. Schedule was
-# weekly Sunday until 2026-05-18; now `0 12 * * *` in the
-# k8s-version-upgrade stack. If we're still before today's 12:00 UTC,
-# the next run is today; otherwise it's tomorrow.
-next_daily_noon_utc() {
+# Next daily 23:00 UTC — pure bash date math, no croniter. Schedule is
+# `0 23 * * *` in the k8s-version-upgrade stack (overnight; moved from 12:00 UTC
+# on 2026-06-17). If we're still before today's 23:00 UTC the next run is today;
+# otherwise tomorrow.
+next_scheduled_run_utc() {
 local hr days_ahead
 hr=$(date -u +%H)
-if [[ "$hr" -lt 12 ]]; then days_ahead=0; else days_ahead=1; fi
-date -u -d "+$days_ahead days" +"%a %Y-%m-%d 12:00 UTC"
+if [[ "$hr" -lt 23 ]]; then days_ahead=0; else days_ahead=1; fi
+date -u -d "+$days_ahead days" +"%a %Y-%m-%d 23:00 UTC"
 }
 
 # --- Renderers ---
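
One hazard in the hour comparison of `next_scheduled_run_utc` is worth noting. Bash evaluates the operands of `[[ … -lt … ]]` as arithmetic, so a zero-padded `08` or `09` from `date +%H` is read as an invalid octal number and the test errors out, falling through to the "tomorrow" branch. The sketch below is a hedged variant, not the committed code, that forces base 10. The optional epoch argument is an assumption added here so the function can be exercised at a fixed time.

```shell
#!/usr/bin/env bash
# Sketch: same "today or tomorrow" rule as next_scheduled_run_utc, with the
# hour forced to base 10. The optional epoch argument ($1) is hypothetical,
# added only so the function can be tested against a fixed "now".
next_scheduled_run_utc() {
  local now hr days_ahead
  now="${1:-$(date -u +%s)}"
  hr=$(date -u -d "@$now" +%H)
  hr=$((10#$hr))   # "08" and "09" are invalid octal without the 10# prefix
  if (( hr < 23 )); then days_ahead=0; else days_ahead=1; fi
  date -u -d "@$((now + days_ahead * 86400))" +"%a %Y-%m-%d 23:00 UTC"
}

# Example: evaluate the rule as of 08:30 UTC on the day of the schedule change.
next_scheduled_run_utc "$(date -u -d '2026-06-17 08:30' +%s)"
```

Called without an argument it behaves like the original and uses the current time. Requesting the hour with GNU `date -u +%-H` (no zero padding) would be an alternative to the `10#` prefix.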
@@ -27,12 +27,13 @@
 
 variable "schedule" {
 type = string
-# Daily 12:00 UTC — outside kured window (kured runs 02:00-06:00
-# London). Was weekly Sunday until 2026-05-18; daily picks up upstream
-# patch releases the same day they land. Concurrency is bounded by the
-# CronJob's Forbid policy + Job-name idempotency (the detection job
-# skips spawning a preflight Job if one already exists).
-default = "0 12 * * *"
+# Nightly 23:00 UTC (00:00 London) — overnight / low cluster usage, and clear
+# of the kured OS-reboot window (01:00-05:00 UTC = 02:00-06:00 London) so the
+# two drain-pipelines never overlap. Moved from 12:00 UTC noon on 2026-06-17
+# (Viktor: disruptive node drains should run overnight). Was weekly Sunday
+# until 2026-05-18. Concurrency bounded by the CronJob's Forbid policy +
+# retry-on-failure Job-name idempotency.
+default = "0 23 * * *"
 }
 
 variable "enabled" {