kured: drop Mon-Fri restriction, reboot any day

The weekday-only schedule was a 2026-03-16-incident-era guardrail when the rest of the safety net was thin. Today's gates (halt-on-alert, sentinel-gate Check 4 (24h soak via node Ready transitions), the K8sUpgradeStalled alert, drainTimeout=30m, concurrency=1, and the sentinel-path fix from earlier today) make weekend reboots safe and just clear the backlog faster.

Effect: 5 pending node reboots clear in 5 calendar days instead of queueing up over weekends.

The K8s version-upgrade detection at Sun 12:00 UTC self-defers if a Sunday-morning kured reboot fires (the RecentNodeReboot alert is in the Upgrade Gates ignore-less list for the version-upgrade preflight; this is the same mechanism kured uses).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
parent 8cbfa6856c
commit 48abb7c520
4 changed files with 17 additions and 8 deletions
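
The backlog arithmetic in the commit message can be sketched as follows. This is a rough model, not kured behaviour: it assumes the 24h soak plus concurrency=1 allow at most one node reboot per calendar day, and it assumes a hypothetical backlog of 5 nodes that starts on Saturday 2026-05-16. `days_to_clear` is an illustrative helper name.

```python
from datetime import date, timedelta

# date.weekday() numbering: Monday=0 ... Sunday=6
WEEKDAYS_ONLY = {0, 1, 2, 3, 4}
EVERY_DAY = set(range(7))


def days_to_clear(pending: int, start: date, allowed: set[int]) -> int:
    """Calendar days (counting the start day) until `pending` reboots are done.

    Assumption: at most one reboot happens per allowed day.
    """
    day, elapsed = start, 0
    while pending > 0:
        elapsed += 1
        if day.weekday() in allowed:
            pending -= 1
        day += timedelta(days=1)
    return elapsed


start = date(2026, 5, 16)  # a Saturday
print(days_to_clear(5, start, WEEKDAYS_ONLY))  # 7: the weekend is skipped
print(days_to_clear(5, start, EVERY_DAY))      # 5
```

Under these assumptions the weekday-only schedule costs two extra calendar days whenever a weekend falls inside the backlog.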
@@ -218,7 +218,7 @@ Independent of the service-upgrade pipeline above. Drives apt package updates +
 ### Stack
 - **In-guest**: `unattended-upgrades` runs apt upgrades within Allowed-Origins (`-security`, `-updates`, ESM). Package-Blacklist excludes runtime components (`containerd`, `containerd.io`, `runc`, `cri-tools`, `kubernetes-cni`, `calico-*`, `cni-plugins-*`, `docker-ce`). `apt-mark hold` on `kubelet`, `kubeadm`, `kubectl` (and runtime pkgs as belt-and-braces). `Automatic-Reboot=false` — kured handles reboots.
-- **Reboot driver**: `kured` (chart `kured-5.11.0`, app `1.21.0`). Window Mon-Fri 02:00-06:00 Europe/London, period=1h, concurrency=1, reboot-delay=30s.
+- **Reboot driver**: `kured` (chart `kured-5.11.0`, app `1.21.0`). Window 02:00-06:00 Europe/London every day of the week (Mon-Fri-only restriction dropped 2026-05-16 — see PM), period=1h, concurrency=1, reboot-delay=30s, drainTimeout=30m.
 - **Reboot gate (sentinel)**: `kured-sentinel-gate` DaemonSet creates `/var/run/gated-reboot-required` only when (a) host needs reboot, (b) all nodes Ready, (c) all calico-node pods Running, (d) **no node has transitioned Ready in the last 24h** (24h soak window).
 - **Reboot gate (Prometheus)**: kured `--prometheus-url` polls `prometheus-server.monitoring.svc:80` before each drain. ANY firing alert blocks unless it matches the ignore-regex `^(Watchdog|RebootRequired|KuredNodeWasNotDrained|InfoInhibitor)$`.
 - **Health alert library**: 10 alerts in the `Upgrade Gates` group (`prometheus_chart_values.tpl`): `KubeAPIServerDown`, `KubeStateMetricsDown`, `PrometheusRuleEvaluationFailing`, `PVCStuckPending`, `RecentNodeReboot` (the explicit 24h soak signal), `MysqlStandaloneDown`, `ClusterPodReadyRatioDropped`, `NodeMemoryPressure`, `NodeDiskPressure`, `KubeQuotaAlmostFull`. Plus the existing 200+ alerts in the cluster-wide library (anything firing blocks kured).

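
The four sentinel conditions (a) to (d) in the Stack list above can be read as a single predicate. The sketch below is illustrative only: the `Node` type, its field names, and `should_create_sentinel` are hypothetical, and the real `kured-sentinel-gate` DaemonSet may evaluate these checks differently.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

SOAK = timedelta(hours=24)  # the 24h soak window from check (d)


@dataclass
class Node:
    name: str
    ready: bool                       # check (b)
    calico_running: bool              # check (c)
    last_ready_transition: datetime   # input to check (d)


def should_create_sentinel(host_needs_reboot: bool, nodes: list[Node], now: datetime) -> bool:
    """True only when all four gate conditions hold."""
    if not host_needs_reboot:                          # (a)
        return False
    if not all(n.ready for n in nodes):                # (b)
        return False
    if not all(n.calico_running for n in nodes):       # (c)
        return False
    # (d) no node has transitioned Ready within the soak window
    return all(now - n.last_ready_transition >= SOAK for n in nodes)


now = datetime(2026, 5, 17, 3, 0, tzinfo=timezone.utc)
settled = Node("node-a", True, True, now - timedelta(days=3))
fresh = Node("node-b", True, True, now - timedelta(hours=5))
print(should_create_sentinel(True, [settled], now))         # True
print(should_create_sentinel(True, [settled, fresh], now))  # False: node-b is still soaking
```

Because check (d) looks at every node, one recent reboot holds back the whole cluster for the soak period, which is what paces the backlog.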
@@ -31,7 +31,7 @@ kured-sentinel-gate checks:
 ▼ all pass
 touch /var/run/gated-reboot-required
 │
-▼ (kured polls every 1h within 02:00-06:00 London Mon-Fri window)
+▼ (kured polls every 1h within 02:00-06:00 London, any day of the week)
 kured checks Prometheus before draining:
 │ http://prometheus-server.monitoring.svc.cluster.local:80/api/v1/alerts
 │ ANY firing alert (except ignore-list) blocks the drain
@@ -57,7 +57,7 @@ kured uncordons + posts Slack notification (configuration.notifyUrl)
 ### kured (Helm release)
 - **Stack**: `infra/stacks/kured/main.tf`
 - **Helm chart**: `kured-5.11.0` (image `ghcr.io/kubereboot/kured:1.21.0`)
-- **Window**: Mon-Fri 02:00-06:00 Europe/London, period=1h, concurrency=1
+- **Window**: 02:00-06:00 Europe/London, every day of the week (was Mon-Fri until 2026-05-16), period=1h, concurrency=1
 - **Sentinel**: `/sentinel/gated-reboot-required` (created by sentinel-gate DaemonSet)
 - **Slack hook**: Vault `secret/kured` → `slack_kured_webhook`

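
The "ANY firing alert (except ignore-list) blocks the drain" rule can be sketched with the ignore-regex quoted in the docs. This is a minimal illustration, not kured's code: kured is written in Go, the function names here are hypothetical, and the sketch assumes the list of firing alert names has already been fetched from the Prometheus alerts endpoint.

```python
import re

# Ignore-regex as documented for this cluster's kured deployment.
IGNORE = re.compile(r"^(Watchdog|RebootRequired|KuredNodeWasNotDrained|InfoInhibitor)$")


def blocking_alerts(firing: list[str]) -> list[str]:
    """Names of firing alerts that are NOT on the ignore list."""
    return [name for name in firing if not IGNORE.match(name)]


def drain_allowed(firing: list[str]) -> bool:
    """A drain may proceed only if nothing outside the ignore list is firing."""
    return not blocking_alerts(firing)


print(blocking_alerts(["Watchdog", "RecentNodeReboot", "InfoInhibitor"]))  # ['RecentNodeReboot']
print(drain_allowed(["Watchdog", "RebootRequired"]))                       # True
```

The `^...$` anchors matter: an alert such as a hypothetical `WatchdogExtra` would not match the ignore list and would therefore block.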
@@ -14,9 +14,12 @@ detection CronJob → preflight Job → master Job → worker × 4 Jobs → post
 ```

 This is **independent** of the OS-side `unattended-upgrades + kured`
-pipeline (see `k8s-node-auto-upgrades.md`). They do not share rollouts and
-their schedules don't overlap (kured runs Mon-Fri 02:00-06:00 London;
-detection here runs Sun 12:00 UTC).
+pipeline (see `k8s-node-auto-upgrades.md`). They do not share rollouts.
+Schedules can overlap (kured runs daily 02:00-06:00 London; detection
+here runs Sun 12:00 UTC) — when a kured reboot lands within 24h of the
+Sunday detection, the `RecentNodeReboot` alert in the Upgrade Gates
+group blocks the version-upgrade preflight, so the chain self-defers
+to the next Sunday rather than rolling on top of a half-fresh node.

 ## Architecture

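
The self-deferral described in the hunk above comes down to a time comparison. The sketch below assumes `RecentNodeReboot` fires for exactly 24h after a node reboot and that all timestamps are in UTC; `preflight_blocked` is a hypothetical helper, not part of the pipeline.

```python
from datetime import datetime, timedelta, timezone

REBOOT_SOAK = timedelta(hours=24)  # assumed firing duration of RecentNodeReboot


def preflight_blocked(last_reboot_utc: datetime, detection_utc: datetime) -> bool:
    """True if the most recent reboot is less than 24h before detection."""
    age = detection_utc - last_reboot_utc
    return timedelta(0) <= age < REBOOT_SOAK


detection = datetime(2026, 5, 17, 12, 0, tzinfo=timezone.utc)       # Sunday 12:00 UTC
sunday_reboot = datetime(2026, 5, 17, 2, 0, tzinfo=timezone.utc)    # 10h before detection
saturday_reboot = datetime(2026, 5, 16, 2, 0, tzinfo=timezone.utc)  # 34h before detection

print(preflight_blocked(sunday_reboot, detection))    # True: upgrade defers to next Sunday
print(preflight_blocked(saturday_reboot, detection))  # False
```

In this model only a Sunday-morning reboot, or one late enough on Saturday to fall inside the 24h window, defers the version upgrade.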
@@ -57,7 +57,13 @@ resource "helm_release" "kured" {
 timeZone = "Europe/London"
 startTime = "02:00"
 endTime = "06:00"
-rebootDays = ["mo", "tu", "we", "th", "fr"]
+# All 7 days — operator decision 2026-05-16. The Mon–Fri restriction
+# was a 2026-03-16-era guardrail (overlapping with weekend on-call
+# response). Today the rest of the safety net (halt-on-alert,
+# sentinel-gate Check 4 = 24h soak, single-concurrency, the
+# K8sUpgradeStalled alert) is strong enough to operate any day; the
+# weekday-only schedule was just slowing the backlog down.
+rebootDays = ["mo", "tu", "we", "th", "fr", "sa", "su"]
 # IMPORTANT: must match where kured-sentinel-gate writes (below):
 # `touch /host/var-run/gated-reboot-required` → host
 # `/var/run/gated-reboot-required`. The kured chart derives the host
@@ -90,7 +96,7 @@ resource "helm_release" "kured" {
 alertFiringOnly = true
 alertFilterMatchOnly = false
 }
-reboot_days = "mon,tue,wed,thu,fri"
+reboot_days = "mon,tue,wed,thu,fri,sat,sun"
 window_end = "06:00"
 window_start = "22:00"
 service = {
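
To see what the `rebootDays` change does to the schedule, here is a small sketch of a window check. It is an approximation under stated assumptions: the timestamp is already expressed in Europe/London local time, the window bounds are treated as inclusive, and `in_reboot_window` is a hypothetical helper rather than kured's own logic.

```python
from datetime import datetime, time

# Two-letter day tokens as used in the rebootDays value; index = datetime.weekday()
DAY_TOKENS = ["mo", "tu", "we", "th", "fr", "sa", "su"]

OLD_DAYS = ["mo", "tu", "we", "th", "fr"]
NEW_DAYS = ["mo", "tu", "we", "th", "fr", "sa", "su"]


def in_reboot_window(local_now: datetime, reboot_days: list[str],
                     start: time = time(2, 0), end: time = time(6, 0)) -> bool:
    """True if `local_now` falls on an allowed day and inside the start/end window."""
    token = DAY_TOKENS[local_now.weekday()]
    return token in reboot_days and start <= local_now.time() <= end


saturday_3am = datetime(2026, 5, 16, 3, 0)  # a Saturday, London local time assumed
print(in_reboot_window(saturday_3am, OLD_DAYS))  # False under the Mon-Fri list
print(in_reboot_window(saturday_3am, NEW_DAYS))  # True with all seven days
```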