infra

Author SHA1 Message Date


Viktor Barzin

48abb7c520

kured: drop Mon-Fri restriction, reboot any day

The weekday-only schedule was a 2026-03-16-incident-era guardrail when
the rest of the safety net was thin. Today's gates — halt-on-alert,
sentinel-gate Check 4 (24h soak via node Ready transitions), the
K8sUpgradeStalled alert, drainTimeout=30m, concurrency=1, and the
sentinel-path fix from earlier today — make weekend reboots safe and
just clear the backlog faster.
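
As an illustration of the halt-on-alert idea named above, here is a minimal sketch and not the repository's or kured's actual code. It assumes a gate that blocks a reboot whenever any firing alert falls outside an ignore pattern. The function name `blocking_alerts` and the `Watchdog` alert are hypothetical; `RecentNodeReboot` is taken from this commit message.

```python
import re
from typing import Iterable, List, Optional


def blocking_alerts(firing: Iterable[str], ignore_pattern: Optional[str] = None) -> List[str]:
    """Return the firing alert names that are NOT covered by the ignore pattern.

    Simplified model: a reboot may proceed only when this list is empty.
    `ignore_pattern` is a hypothetical regex of alert names to disregard.
    """
    ignore = re.compile(ignore_pattern) if ignore_pattern else None
    return [
        name
        for name in firing
        if not (ignore is not None and ignore.fullmatch(name))
    ]


if __name__ == "__main__":
    # Only an ignored alert is firing: nothing blocks the reboot.
    print(blocking_alerts(["Watchdog"], r"Watchdog"))
    # A non-ignored alert is firing: the reboot is held back.
    print(blocking_alerts(["Watchdog", "RecentNodeReboot"], r"Watchdog"))
```

In this model an empty result means the gate is open; any remaining name identifies the alert that is holding the rollout.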

Effect: 5 pending node reboots clear in 5 calendar days instead of
queueing up over weekends. The K8s version-upgrade detection at Sun
12:00 UTC self-defers if a Sunday-morning kured reboot fires (the
RecentNodeReboot alert is in the Upgrade Gates ignore-less list for
the version-upgrade preflight — same mechanism kured uses).
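
The "5 reboots in 5 calendar days" figure can be checked with a small sketch. It assumes at most one node reboot per allowed calendar day (a simplification of concurrency=1 plus the 24h soak) and uses this commit's date, 2026-05-16 (a Saturday), as the starting day. The helper name `days_to_clear` is hypothetical.

```python
from datetime import date, timedelta
from typing import Set


def days_to_clear(pending: int, start: date, allowed_weekdays: Set[int]) -> int:
    """Count calendar days, starting at `start` inclusive, until `pending`
    reboots are done, assuming one reboot per allowed day.

    Weekdays follow date.weekday(): Monday is 0, Sunday is 6.
    """
    if pending <= 0:
        return 0
    if not allowed_weekdays:
        raise ValueError("no reboot days allowed; the backlog would never clear")
    day = start
    elapsed = 0
    while pending > 0:
        elapsed += 1
        if day.weekday() in allowed_weekdays:
            pending -= 1  # one node rebooted on this day
        day += timedelta(days=1)
    return elapsed


if __name__ == "__main__":
    start = date(2026, 5, 16)  # Saturday
    any_day = set(range(7))
    weekdays_only = {0, 1, 2, 3, 4}
    print(days_to_clear(5, start, any_day))        # 5
    print(days_to_clear(5, start, weekdays_only))  # 7
```

Under these assumptions the any-day schedule finishes on the fifth day, while the weekday-only schedule idles through the weekend and finishes on the seventh.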

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

2026-05-16 12:29:01 +00:00

Viktor Barzin

a245e6e569

docs: add k8s node auto-upgrade runbook + architecture section

The OS-side counterpart to the service-upgrade pipeline. Covers
the unattended-upgrades + kured + sentinel-gate + Prometheus
halt-on-alert design landed in c0991f7f8.

Runbook: ops procedures (verify health, halt rollout, restore
config to a re-imaged node, roll back a bad upgrade, investigate
which alert is blocking).

Architecture doc: extends the existing service-upgrade flow with
a "K8s Node OS Upgrades" section (stack, sources of truth, day-2
mechanism, why-this-design rationale tied to the March 2026
post-mortem).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

2026-05-10 17:26:15 +00:00

2 commits