RecentNodeReboot: 24h → 1h threshold, matching upgrade-chain preflight
The 24h kubelet-uptime threshold (time() - process_start_time_seconds < 86400) was a defense-in-depth duplicate of the 24h-since-Ready-transition check in kured-sentinel-gate Check 4, but the two used different signals (kubelet process start vs. node Ready transition). Whenever the cluster cycled through reboots, the alert kept firing for a full day even after sentinel-gate's check had passed, and it blocked everything that queries halt-on-alert (kured, the K8s version-upgrade preflight).

Tightened to 1h (3600s), meaning "node just rebooted, give it a settle window". From now on the cluster-wide 24h-between-reboots invariant lives exclusively in kured-sentinel-gate Check 4, which is independent of this alert and uses lastTransitionTime.

Matched the preflight's own 24h-quiet check in upgrade-step.sh (86400 → 3600) so it doesn't act as a second blocker.

Empirically verified: all 5 kubelets have been up for >10h, and the alert cleared on the next evaluation after the rule reload.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
parent 62fb46353c
commit 8a6ec72039
2 changed files with 19 additions and 5 deletions
@@ -1887,13 +1887,22 @@ serverFiles:
     severity: warning
   annotations:
     summary: "PVC {{ $labels.namespace }}/{{ $labels.persistentvolumeclaim }} stuck Pending for 10m+"
+# RecentNodeReboot — kubelet just restarted, give the node some time
+# to settle before any other reboot-driving thing (kured, K8s
+# version-upgrade chain) acts. Threshold tightened from 86400 →
+# 3600 on 2026-05-17 — post-reboot workloads (calico-node,
+# kube-proxy, CSI sidecars, GPU drivers) typically reconverge
+# within minutes; 1h is comfortable margin. The 24h-between-
+# cluster-reboots protection lives separately in
+# `kured-sentinel-gate` Check 4 (reads node Ready
+# lastTransitionTime, independent of this alert).
 - alert: RecentNodeReboot
-  expr: (time() - process_start_time_seconds{job="kubernetes-nodes"}) < 86400
+  expr: (time() - process_start_time_seconds{job="kubernetes-nodes"}) < 3600
   for: 0m
   labels:
     severity: info
   annotations:
-    summary: "Node {{ $labels.node }} kubelet started {{ $value | humanizeDuration }} ago — 24h soak window halts further reboots"
+    summary: "Node {{ $labels.node }} kubelet started {{ $value | humanizeDuration }} ago — 1h settle window halts further reboots"
 - alert: MysqlStandaloneDown
   expr: kube_statefulset_status_replicas_ready{statefulset="mysql-standalone"} < 1
   for: 2m
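
The effect of the threshold change on the alert expression can be sketched in a few lines of Python. This is only an illustration of the comparison `(time() - process_start_time_seconds) < threshold`, not code from this repository: the function name `recent_reboot_nodes`, the node names, and the sample start times are all hypothetical, chosen to resemble the "all 5 kubelets are >10h up" situation described in the commit message.

```python
import time

OLD_THRESHOLD_S = 86400  # previous 24h soak window
NEW_THRESHOLD_S = 3600   # new 1h settle window


def recent_reboot_nodes(start_times, threshold_s, now=None):
    """Return the nodes whose kubelet uptime (now - start) is below threshold_s.

    Mirrors the rule expression (time() - process_start_time_seconds) < threshold.
    start_times maps a node name to the kubelet start time in Unix seconds.
    """
    now = time.time() if now is None else now
    return sorted(
        node for node, started in start_times.items()
        if (now - started) < threshold_s
    )


# Hypothetical sample: five kubelets, each up for a little over 10 hours.
now = 1_000_000.0
start_times = {f"node-{i}": now - (10 * 3600 + i * 60) for i in range(1, 6)}

firing_old = recent_reboot_nodes(start_times, OLD_THRESHOLD_S, now)
firing_new = recent_reboot_nodes(start_times, NEW_THRESHOLD_S, now)

# Under the old threshold every node still counts as "recently rebooted";
# under the new one none does.
print(len(firing_old), len(firing_new))
```

With these assumed inputs the old threshold flags all five nodes and the new one flags none, which is consistent with the alert clearing after the rule reload. A node whose kubelet started less than an hour ago would still be flagged under either threshold.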