infra/stacks/k8s-version-upgrade
Viktor Barzin 8a6ec72039 RecentNodeReboot: 24h → 1h threshold, matching upgrade-chain preflight
The 24h kubelet-uptime threshold (process_start_time_seconds < 86400)
was a defense-in-depth duplicate of the 24h-since-Ready-transition
check in kured-sentinel-gate Check 4, but the two checks used
different signals (kubelet process start vs. node Ready transition).
Whenever the cluster cycled through reboots, the alert kept firing for
a full day even after sentinel-gate's check had passed, and it blocked
everything that queries halt-on-alert (kured, K8s version-upgrade
preflight).
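
A minimal sketch of how two 24h checks on different signals can
disagree. The function names, timestamps, and the scenario (a kubelet
restarted 2h ago on a node whose Ready condition last changed 26h ago)
are hypothetical illustrations, not taken from the alert rule or from
kured-sentinel-gate:

```python
from datetime import datetime, timezone

DAY_SECONDS = 86400  # the old 24h threshold used by both checks


def kubelet_uptime(process_start_epoch, now_epoch):
    """Seconds since the kubelet process started (alert-side signal)."""
    return now_epoch - process_start_epoch


def seconds_since_ready(last_transition_time, now_epoch):
    """Seconds since the node's Ready condition last changed (gate-side signal)."""
    ts = datetime.fromisoformat(last_transition_time.replace("Z", "+00:00"))
    return now_epoch - ts.timestamp()


# Hypothetical values chosen only to show the two signals diverging.
now = datetime(2026, 5, 17, 22, 0, 0, tzinfo=timezone.utc).timestamp()
last_transition = "2026-05-16T20:00:00Z"  # Ready transition 26h ago
kubelet_start = now - 2 * 3600            # kubelet process started 2h ago

gate_passes = seconds_since_ready(last_transition, now) >= DAY_SECONDS
alert_fires = kubelet_uptime(kubelet_start, now) < DAY_SECONDS

print(gate_passes, alert_fires)
```

In this made-up case the gate is satisfied while the alert still fires,
which is the kind of mismatch described above.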

Tightened to 1h (3600s), so the alert now only means "node just
rebooted, give it a settle window". The cluster-wide
24h-between-reboots invariant lives exclusively in
kured-sentinel-gate Check 4 from now on (independent, uses
lastTransitionTime).
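
As a rough illustration of the threshold change, the sketch below
compares the old and new windows for a kubelet that has been up for
10h. The helper name and the fixed timestamps are assumptions made for
the example; the real rule is evaluated by the alerting stack, not by
this code:

```python
SETTLE_WINDOW_SECONDS = 3600   # new threshold (1h)
OLD_THRESHOLD_SECONDS = 86400  # previous threshold (24h)


def recently_rebooted(process_start_epoch, now_epoch,
                      threshold=SETTLE_WINDOW_SECONDS):
    """Return True while kubelet uptime is still below the threshold."""
    return (now_epoch - process_start_epoch) < threshold


# Hypothetical clock values: a kubelet that started 10 hours ago.
now = 1_000_000.0
started = now - 10 * 3600

fires_with_new = recently_rebooted(started, now)
fires_with_old = recently_rebooted(started, now, OLD_THRESHOLD_SECONDS)

print(fires_with_new)  # False: 10h uptime is past the 1h settle window
print(fires_with_old)  # True: 10h uptime is still inside the old 24h window
```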

Matched the preflight's own 24h-quiet check in upgrade-step.sh
(86400 → 3600) so it doesn't act as a second blocker.

Empirically verified: all 5 kubelets are >10h up, alert cleared on
next eval after the rule reload.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-17 22:22:01 +00:00
scripts            RecentNodeReboot: 24h → 1h threshold, matching upgrade-chain preflight            2026-05-17 22:22:01 +00:00
job-template.yaml  k8s-version-upgrade: decompose into Job chain to fix self-preemption              2026-05-11 23:54:22 +00:00
main.tf            k8s-version-upgrade: FQDN SSH targets + python3 in place of envsubst              2026-05-17 21:10:58 +00:00
terragrunt.hcl     k8s-version-upgrade: automated kubeadm/kubelet/kubectl upgrade pipeline           2026-05-10 19:07:42 +00:00