RecentNodeReboot: 24h → 1h threshold, matching upgrade-chain preflight

The 24h kubelet-uptime threshold
(time() - process_start_time_seconds < 86400) was a defense-in-depth
duplicate of the 24h-since-Ready-transition check in
kured-sentinel-gate Check 4, but the two used different signals
(kubelet process start vs node Ready transition). Whenever the
cluster cycled through reboots, the alert kept firing for a full day
even after sentinel-gate's check passed, and blocked anything
querying halt-on-alert (kured, K8s version-upgrade preflight).

Tightened to 1h (3600s), so the alert now means only "node just
rebooted, give it a settle window". The cluster-wide
24h-between-reboots invariant now lives exclusively in
kured-sentinel-gate Check 4, which is independent of this alert and
reads the node Ready lastTransitionTime.

Matched the preflight's own 24h-quiet check in upgrade-step.sh
(86400 → 3600) so it doesn't act as a second blocker.
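
The tightened preflight check reduces to comparing each node's Ready
lastTransitionTime against a window. A minimal sketch, assuming GNU
date and timestamps fed on stdin rather than read from $KUBECTL; the
function name any_recent_transition and its arguments are hypothetical
and not part of upgrade-step.sh:

```shell
#!/usr/bin/env bash
# Hypothetical helper (not in upgrade-step.sh): reads one RFC 3339 timestamp
# per line on stdin and returns 0 if any of them is newer than the window.
#   $1 = window in seconds (default 3600)
#   $2 = "now" as epoch seconds (default: current time; injectable for tests)
any_recent_transition() {
  local window=${1:-3600}
  local now=${2:-$(date +%s)}
  local ts epoch
  while IFS= read -r ts; do
    [ -z "$ts" ] && continue
    # GNU date is assumed here; BSD date does not parse timestamps with -d.
    epoch=$(date -u -d "$ts" +%s) || return 2
    if [ $(( now - epoch )) -lt "$window" ]; then
      return 0
    fi
  done
  return 1
}
```

Piping the output of the preflight's existing jsonpath query into
any_recent_transition 3600 would give the same yes/no answer as the
loop in the diff, with "now" sampled once instead of per node.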

Empirically verified: all 5 kubelets are >10h up, alert cleared on
next eval after the rule reload.
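
The verification follows from the threshold arithmetic: a kubelet
that has been up for 10h (36000s) is inside the old 24h window but
outside the new 1h one. A small sketch of that comparison; alert_fires
is a hypothetical stand-in for the PromQL expression, not something
that exists in the repo:

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for: (time() - process_start_time_seconds) < threshold
#   $1 = kubelet uptime in seconds, $2 = threshold in seconds
alert_fires() {
  [ "$1" -lt "$2" ]
}

uptime_s=$(( 10 * 3600 ))   # 10h, the lower bound quoted in the commit message

if alert_fires "$uptime_s" 86400; then echo "24h threshold: firing"; else echo "24h threshold: clear"; fi
if alert_fires "$uptime_s" 3600;  then echo "1h threshold: firing";  else echo "1h threshold: clear";  fi
```

With a 10h uptime this prints "24h threshold: firing" followed by
"1h threshold: clear".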

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Viktor Barzin 2026-05-17 22:22:01 +00:00
parent d1dcc5d12d
commit 5482f46125
2 changed files with 19 additions and 5 deletions

@@ -233,15 +233,20 @@ phase_preflight() {
exit 1
fi
-# 3. 24h-quiet baseline
+# 3. Quiet-baseline check — fail if any node had a Ready transition in the
+# last hour. Threshold matches the RecentNodeReboot alert (3600s) — the
+# 24h-between-cluster-reboots protection lives in kured-sentinel-gate
+# Check 4, not here. Tightened from 86400 → 3600 on 2026-05-17; with the
+# alert clearing in 1h, this duplicate gate was the actual blocker for
+# the chain after a session of manual reboots.
local recent=0
while IFS= read -r ts; do
[ -z "$ts" ] && continue
local diff=$(( $(date +%s) - $(date -d "$ts" +%s) ))
-if [ "$diff" -lt 86400 ]; then recent=1; break; fi
+if [ "$diff" -lt 3600 ]; then recent=1; break; fi
done < <($KUBECTL get nodes -o jsonpath='{range .items[*]}{range .status.conditions[?(@.type=="Ready")]}{.lastTransitionTime}{"\n"}{end}{end}')
if [ "$recent" -eq 1 ]; then
-slack "ABORT preflight — node transitioned Ready <24h ago (soak window)"
+slack "ABORT preflight — node transitioned Ready <1h ago (settle window)"
exit 1
fi

@@ -1887,13 +1887,22 @@ serverFiles:
severity: warning
annotations:
summary: "PVC {{ $labels.namespace }}/{{ $labels.persistentvolumeclaim }} stuck Pending for 10m+"
+# RecentNodeReboot — kubelet just restarted, give the node some time
+# to settle before any other reboot-driving thing (kured, K8s
+# version-upgrade chain) acts. Threshold tightened from 86400 →
+# 3600 on 2026-05-17 — post-reboot workloads (calico-node,
+# kube-proxy, CSI sidecars, GPU drivers) typically reconverge
+# within minutes; 1h is comfortable margin. The 24h-between-
+# cluster-reboots protection lives separately in
+# `kured-sentinel-gate` Check 4 (reads node Ready
+# lastTransitionTime, independent of this alert).
 - alert: RecentNodeReboot
-expr: (time() - process_start_time_seconds{job="kubernetes-nodes"}) < 86400
+expr: (time() - process_start_time_seconds{job="kubernetes-nodes"}) < 3600
for: 0m
labels:
severity: info
annotations:
-summary: "Node {{ $labels.node }} kubelet started {{ $value | humanizeDuration }} ago — 24h soak window halts further reboots"
+summary: "Node {{ $labels.node }} kubelet started {{ $value | humanizeDuration }} ago — 1h settle window halts further reboots"
 - alert: MysqlStandaloneDown
expr: kube_statefulset_status_replicas_ready{statefulset="mysql-standalone"} < 1
for: 2m