From 5482f46125df9a0de30c69c8ebacd7aa5801121d Mon Sep 17 00:00:00 2001 From: Viktor Barzin Date: Sun, 17 May 2026 22:22:01 +0000 Subject: [PATCH] =?UTF-8?q?RecentNodeReboot:=2024h=20=E2=86=92=201h=20thre?= =?UTF-8?q?shold,=20matching=20upgrade-chain=20preflight?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit The 24h kubelet-uptime threshold (process_start_time_seconds < 86400) was a defense-in-depth duplicate of the 24h-since-Ready-transition check in kured-sentinel-gate Check 4 — but they used different signals (kubelet process start vs node Ready transition). Whenever the cluster cycled through reboots, the alert kept firing for a full day even after sentinel-gate's check passed, and blocked anything querying halt-on-alert (kured, K8s version-upgrade preflight). Tightened to 1h (3600s) for "node just rebooted, give it a settle window". The cluster-wide 24h-between-reboots invariant lives exclusively in kured-sentinel-gate Check 4 from now on (independent, uses lastTransitionTime). Matched the preflight's own 24h-quiet check in upgrade-step.sh (86400 → 3600) so it doesn't act as a second blocker. Empirically verified: all 5 kubelets are >10h up, alert cleared on next eval after the rule reload. Co-Authored-By: Claude Opus 4.7 --- stacks/k8s-version-upgrade/scripts/upgrade-step.sh | 11 ++++++++--- .../modules/monitoring/prometheus_chart_values.tpl | 13 +++++++++++-- 2 files changed, 19 insertions(+), 5 deletions(-) diff --git a/stacks/k8s-version-upgrade/scripts/upgrade-step.sh b/stacks/k8s-version-upgrade/scripts/upgrade-step.sh index f716aa06..e0b71bba 100644 --- a/stacks/k8s-version-upgrade/scripts/upgrade-step.sh +++ b/stacks/k8s-version-upgrade/scripts/upgrade-step.sh @@ -233,15 +233,20 @@ phase_preflight() { exit 1 fi - # 3. 24h-quiet baseline + # 3. Quiet-baseline check — fail if any node had a Ready transition in the + # last hour. Threshold matches the RecentNodeReboot alert (3600s) — the + # 24h-between-cluster-reboots protection lives in kured-sentinel-gate + # Check 4, not here. Tightened from 86400 → 3600 on 2026-05-17; with the + # alert clearing in 1h, this duplicate gate was the actual blocker for + # the chain after a session of manual reboots. local recent=0 while IFS= read -r ts; do [ -z "$ts" ] && continue local diff=$(( $(date +%s) - $(date -d "$ts" +%s) )) - if [ "$diff" -lt 86400 ]; then recent=1; break; fi + if [ "$diff" -lt 3600 ]; then recent=1; break; fi done < <($KUBECTL get nodes -o jsonpath='{range .items[*]}{range .status.conditions[?(@.type=="Ready")]}{.lastTransitionTime}{"\n"}{end}{end}') if [ "$recent" -eq 1 ]; then - slack "ABORT preflight — node transitioned Ready <24h ago (soak window)" + slack "ABORT preflight — node transitioned Ready <1h ago (settle window)" exit 1 fi diff --git a/stacks/monitoring/modules/monitoring/prometheus_chart_values.tpl b/stacks/monitoring/modules/monitoring/prometheus_chart_values.tpl index 187408ff..06d36266 100755 --- a/stacks/monitoring/modules/monitoring/prometheus_chart_values.tpl +++ b/stacks/monitoring/modules/monitoring/prometheus_chart_values.tpl @@ -1887,13 +1887,22 @@ serverFiles: severity: warning annotations: summary: "PVC {{ $labels.namespace }}/{{ $labels.persistentvolumeclaim }} stuck Pending for 10m+" + # RecentNodeReboot — kubelet just restarted, give the node some time + # to settle before any other reboot-driving thing (kured, K8s + # version-upgrade chain) acts. Threshold tightened from 86400 → + # 3600 on 2026-05-17 — post-reboot workloads (calico-node, + # kube-proxy, CSI sidecars, GPU drivers) typically reconverge + # within minutes; 1h is comfortable margin. The 24h-between- + # cluster-reboots protection lives separately in + # `kured-sentinel-gate` Check 4 (reads node Ready + # lastTransitionTime, independent of this alert). - alert: RecentNodeReboot - expr: (time() - process_start_time_seconds{job="kubernetes-nodes"}) < 86400 + expr: (time() - process_start_time_seconds{job="kubernetes-nodes"}) < 3600 for: 0m labels: severity: info annotations: - summary: "Node {{ $labels.node }} kubelet started {{ $value | humanizeDuration }} ago — 24h soak window halts further reboots" + summary: "Node {{ $labels.node }} kubelet started {{ $value | humanizeDuration }} ago — 1h settle window halts further reboots" - alert: MysqlStandaloneDown expr: kube_statefulset_status_replicas_ready{statefulset="mysql-standalone"} < 1 for: 2m