kured: fix sentinel path mismatch that stalled rolling reboots

The kured Helm chart derives the sentinel hostPath from
`dirname(configuration.rebootSentinel)`. Previously
rebootSentinel=/sentinel/gated-reboot-required pointed hostPath at
`/sentinel/` (an empty auto-created directory on every host) while the
kured-sentinel-gate DaemonSet writes to /var/run/gated-reboot-required.
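
The mismatch can be reproduced with a small Python model of the
derivation. This is an illustrative sketch, not the chart's template: it
assumes the chart's `dirname` behaves like POSIX dirname, and the names
`sentinel_host_dir`, `gate_visible` and `GATE_FILE` are hypothetical.

```python
import posixpath

# Host path the kured-sentinel-gate DaemonSet writes to (from this commit).
GATE_FILE = "/var/run/gated-reboot-required"


def sentinel_host_dir(reboot_sentinel: str) -> str:
    """Model of dirname(configuration.rebootSentinel), i.e. the hostPath."""
    return posixpath.dirname(reboot_sentinel)


def gate_visible(reboot_sentinel: str) -> bool:
    """True when the mounted host directory is the one the gate writes to."""
    return sentinel_host_dir(reboot_sentinel) == posixpath.dirname(GATE_FILE)


# Old value: hostPath is /sentinel, so the gate file is never seen.
print(sentinel_host_dir("/sentinel/gated-reboot-required"))
print(gate_visible("/sentinel/gated-reboot-required"))
# New value: hostPath is /var/run, the directory the gate writes to.
print(gate_visible("/var/run/gated-reboot-required"))
```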

These are two different host directories, so kured never saw the open
gate, even though the gate's checks were all green every 5 min on every
node. Result: unattended-upgrades has had packages waiting on every node
since 2026-05-10 (when uu was re-enabled), and kured's hourly log has said
"Reboot not required" for the entire period.

Set rebootSentinel=/var/run/gated-reboot-required so the chart mounts
hostPath /var/run, the same directory the gate writes to. The in-pod
mountPath (/sentinel) is hardcoded by the chart and doesn't matter,
because the mount maps it through: /sentinel/<file> inside the pod
resolves to /var/run/<file> on the host.
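
A rough Python sketch of that in-pod to host translation, assuming the
mountPath is fixed at /sentinel as described above. The helper
`host_path_for` is hypothetical and only models the path arithmetic, not
the actual mount.

```python
import posixpath

MOUNT_PATH = "/sentinel"  # in-pod mountPath, fixed by the chart per this commit


def host_path_for(pod_path: str, reboot_sentinel: str) -> str:
    """Translate a path under MOUNT_PATH in the pod to its host path."""
    host_dir = posixpath.dirname(reboot_sentinel)  # hostPath the chart mounts
    rel = posixpath.relpath(pod_path, MOUNT_PATH)
    if rel == ".." or rel.startswith("../"):
        raise ValueError(f"{pod_path} is not under {MOUNT_PATH}")
    return posixpath.join(host_dir, rel)


setting = "/var/run/gated-reboot-required"
print(host_path_for("/sentinel/gated-reboot-required", setting))
print(host_path_for("/sentinel/reboot-required", setting))
```

With the new setting, both files named in the verification step map to
/var/run on the host.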

Verified: kured pod can now list /sentinel/gated-reboot-required
(0 B) AND /sentinel/reboot-required (32 B, set by uu on 2026-05-15).
First gated reboot will land Mon 2026-05-18 02:00 London.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Viktor Barzin 2026-05-16 11:19:13 +00:00
parent 80e6314bf0
commit 065982d978

@@ -58,7 +58,17 @@ resource "helm_release" "kured" {
  startTime = "02:00"
  endTime = "06:00"
  rebootDays = ["mo", "tu", "we", "th", "fr"]
- rebootSentinel = "/sentinel/gated-reboot-required"
+ # IMPORTANT: must match where kured-sentinel-gate writes (below):
+ # `touch /host/var-run/gated-reboot-required` maps to host
+ # `/var/run/gated-reboot-required`. The kured chart derives the host
+ # path from `dirname(rebootSentinel)`, so this single setting decides
+ # which host directory is mounted (the in-pod mountPath stays
+ # `/sentinel`). Previously this was `/sentinel/gated-reboot-required`,
+ # which pointed the chart's hostPath at `/sentinel/` (empty, auto-created
+ # by hostPath:Directory) while the gate kept writing to `/var/run/`.
+ # kured never saw the open gate, so nodes stopped rebooting on
+ # 2026-05-10 when unattended-upgrades was re-enabled. Fixed 2026-05-16.
+ rebootSentinel = "/var/run/gated-reboot-required"
  notifyUrl = data.vault_kv_secret_v2.secrets.data["slack_kured_webhook"]
  concurrency = 1
  rebootDelay = "30s"