RecentNodeReboot: 24h → 1h threshold, matching upgrade-chain preflight
The 24h kubelet-uptime threshold (process_start_time_seconds < 86400) was a defense-in-depth duplicate of the 24h-since-Ready-transition check in kured-sentinel-gate Check 4 — but they used different signals (kubelet process start vs node Ready transition). Whenever the cluster cycled through reboots, the alert kept firing for a full day even after sentinel-gate's check passed, and blocked anything querying halt-on-alert (kured, K8s version-upgrade preflight). Tightened to 1h (3600s) for "node just rebooted, give it a settle window". The cluster-wide 24h-between-reboots invariant lives exclusively in kured-sentinel-gate Check 4 from now on (independent, uses lastTransitionTime). Matched the preflight's own 24h-quiet check in upgrade-step.sh (86400 → 3600) so it doesn't act as a second blocker. Empirically verified: all 5 kubelets are >10h up, alert cleared on next eval after the rule reload. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
This commit is contained in:
parent
62fb46353c
commit
8a6ec72039
2 changed files with 19 additions and 5 deletions
|
|
@ -233,15 +233,20 @@ phase_preflight() {
|
|||
exit 1
|
||||
fi
|
||||
|
||||
# 3. 24h-quiet baseline
|
||||
# 3. Quiet-baseline check — fail if any node had a Ready transition in the
|
||||
# last hour. Threshold matches the RecentNodeReboot alert (3600s) — the
|
||||
# 24h-between-cluster-reboots protection lives in kured-sentinel-gate
|
||||
# Check 4, not here. Tightened from 86400 → 3600 on 2026-05-17; with the
|
||||
# alert clearing in 1h, this duplicate gate was the actual blocker for
|
||||
# the chain after a session of manual reboots.
|
||||
local recent=0
|
||||
while IFS= read -r ts; do
|
||||
[ -z "$ts" ] && continue
|
||||
local diff=$(( $(date +%s) - $(date -d "$ts" +%s) ))
|
||||
if [ "$diff" -lt 86400 ]; then recent=1; break; fi
|
||||
if [ "$diff" -lt 3600 ]; then recent=1; break; fi
|
||||
done < <($KUBECTL get nodes -o jsonpath='{range .items[*]}{range .status.conditions[?(@.type=="Ready")]}{.lastTransitionTime}{"\n"}{end}{end}')
|
||||
if [ "$recent" -eq 1 ]; then
|
||||
slack "ABORT preflight — node transitioned Ready <24h ago (soak window)"
|
||||
slack "ABORT preflight — node transitioned Ready <1h ago (settle window)"
|
||||
exit 1
|
||||
fi
|
||||
|
||||
|
|
|
|||
Loading…
Add table
Add a link
Reference in a new issue