kured(sentinel-gate): fix auth + write-perm so safety checks actually run
Test 3 validation surfaced two latent bugs in the sentinel-gate DaemonSet
that have been masked since 2026-04-18 (when uu was off, nothing wrote
/var/run/reboot-required, so the gate never had to fire):

1. automount_service_account_token=false on both the SA and the pod spec
   → kubectl in the script falls back to localhost:8080 on every call.
   Each check (`kubectl get nodes`, `kubectl get pods -n calico-system`,
   transition-time read) errors to stderr and emits empty stdout.
   `wc -l` reports 0 → checks "pass" with no real data.

2. bitnami/kubectl:latest runs as uid=1001 by default. The hostPath
   /var/run is root:root 0755 → final
   `touch /host/var-run/gated-reboot-required` failed with EACCES.
   Fail-safe by accident — but if anything had ever loosened those perms,
   the broken checks above would have green-lit the gate with no real
   validation.

Fix: enable token mount on the SA + pod, set
securityContext.run_as_user=0 on the container.

Verified post-fix: kubectl returns all 5 nodes, touch succeeds,
sentinel-gate now reports the correct
`BLOCKED: A node transitioned Ready within the last 24 hours (soak window)`
when triggered with k8s-node1's recent reboot within the cool-down period.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
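The fail-open pattern in item 1 (a failed kubectl call yields empty stdout, which `wc -l` then counts as 0) can be avoided by checking the command's exit status before counting anything. A minimal sketch, assuming a POSIX shell; `count_lines_or_fail` is a hypothetical helper and not part of the actual sentinel-gate script, whose contents are not shown in this commit:

```shell
#!/bin/sh
# Hypothetical helper: run the given command and print the number of
# non-blank stdout lines. If the command itself fails, print a BLOCKED
# message to stderr and return non-zero instead of reporting a count of 0.
count_lines_or_fail() {
    if ! _out=$("$@"); then
        echo "BLOCKED: command failed: $*" >&2
        return 1
    fi
    # awk counts only lines with at least one field, so an empty
    # result prints 0 rather than 1.
    printf '%s\n' "$_out" | awk 'NF { n++ } END { print n + 0 }'
}

# Example call (assumed usage, not taken from the real script):
# node_count=$(count_lines_or_fail kubectl get nodes --no-headers) || exit 1
```

With a wrapper of this shape, an unauthenticated kubectl call would block the gate rather than being read as "zero problems found".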
parent 64c71615e8
commit d1777d6119
1 changed file with 16 additions and 2 deletions
@@ -103,7 +103,12 @@ resource "kubernetes_service_account" "kured_sentinel_gate" {
     name = "kured-sentinel-gate"
     namespace = kubernetes_namespace.kured.metadata[0].name
   }
-  automount_service_account_token = false
+  # Token IS mounted — the script uses kubectl to read nodes + pods state for
+  # the safety checks. Without an authenticated token, kubectl falls back to
+  # localhost:8080 (no proxy in distroless-ish image), every check silently
+  # no-ops on parse-empty stdout, and the gate appears to PASS when it
+  # shouldn't. Mount the token. (Found 2026-05-10 during Test 3 validation.)
+  automount_service_account_token = true
 }
 
 resource "kubernetes_cluster_role" "kured_sentinel_gate" {
@@ -161,8 +166,17 @@ resource "kubernetes_daemon_set_v1" "kured_sentinel_gate" {
       }
       spec {
         service_account_name = kubernetes_service_account.kured_sentinel_gate.metadata[0].name
-        automount_service_account_token = false
+        automount_service_account_token = true
         enable_service_links = false
+        # bitnami/kubectl:latest runs as uid=1001 by default. The hostPath
+        # /var/run is root:root 0755 → final `touch
+        # /host/var-run/gated-reboot-required` fails with EACCES, so the gate
+        # never opens. Run as root inside the container (the hostPath mount
+        # already gives privileged-equivalent access; this just lets us write
+        # to /var/run). Found 2026-05-10 during Test 3 validation.
+        security_context {
+          run_as_user = 0
+        }
         toleration {
           effect = "NoSchedule"
           key = "node-role.kubernetes.io/control-plane"