kured: fix sentinel-gate OOM — 256Mi limit + self-restart leak guard

The k8s-master gate pod OOM-killed child kubectls 149x/7d (accelerating: 0/day → 15 → 134) while master sat in pending-reboot.

Root cause: only the pending-reboot node's gate pod runs the kubectl-heavy hot path each cycle, and the immortal bash loop slowly leaks (kubectl forks + Check-4 process substitution) past the 64Mi cgroup limit. PID 1 bash survives each kill, so the pod never restarts; it just accumulates silent oom_events.

Fix: raise limit 64Mi→256Mi (headroom for ~30-50Mi kubectl forks) + add a MAX_ITER=72 self-exit (~6h) so kubelet restarts the pod fresh and the leak can never accumulate, regardless of how long a node stays pending-reboot.

Docs: post-mortem + automated-upgrades.md gate note.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-31 14:49:04 +00:00 · 2026-05-31 14:49:04 +00:00 · 51313ee088
commit 51313ee088
parent 0c64fc2948
3 changed files with 120 additions and 10 deletions
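The leak guard described in the commit message can be sketched in isolation as below. This is a minimal illustration, not the code in `stacks/kured/main.tf`: the function names `guard_lifetime_seconds` and `run_gate_loop` and their arguments are hypothetical, and the 300s cycle length is taken from the "~6h at 300s" note in the diff.

```shell
#!/usr/bin/env bash
# Sketch of an iteration-cap leak guard. All function names and
# arguments here are hypothetical, chosen only for illustration.

# Seconds of sleep the loop accumulates before the cap trips
# (ignores the time spent inside each cycle's checks).
guard_lifetime_seconds() {
  local max_iter=$1 sleep_seconds=$2
  echo $((max_iter * sleep_seconds))
}

# Run the given command once per cycle, then return 0 once the
# iteration cap is exceeded. When the loop is PID 1 of a container
# with restartPolicy Always, exiting 0 makes kubelet start a fresh
# container, so memory held by the old shell is released.
run_gate_loop() {
  local max_iter=$1 sleep_seconds=$2
  shift 2
  local iter=0
  while true; do
    iter=$((iter + 1))
    if [ "$iter" -gt "$max_iter" ]; then
      return 0
    fi
    "$@"
    sleep "$sleep_seconds"
  done
}
```

With the values from this commit, `guard_lifetime_seconds 72 300` gives 21600 seconds, which matches the stated ~6h restart interval. The cap bounds how long any single shell lives; it does not remove the underlying leak.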
--- a/docs/architecture/automated-upgrades.md
+++ b/docs/architecture/automated-upgrades.md
@@ -219,7 +219,7 @@ Independent of the service-upgrade pipeline above. Drives apt package updates +
 ### Stack
 - **In-guest**: `unattended-upgrades` runs apt upgrades within Allowed-Origins (`-security`, `-updates`, ESM). Package-Blacklist excludes runtime components (`containerd`, `containerd.io`, `runc`, `cri-tools`, `kubernetes-cni`, `calico-*`, `cni-plugins-*`, `docker-ce`). `apt-mark hold` on `kubelet`, `kubeadm`, `kubectl` (and runtime pkgs as belt-and-braces). `Automatic-Reboot=false` — kured handles reboots.
 - **Reboot driver**: `kured` (chart `kured-5.11.0`, app `1.21.0`). Window 02:00-06:00 Europe/London every day of the week (Mon-Fri-only restriction dropped 2026-05-16 — see PM), period=1h, concurrency=1, reboot-delay=30s, drainTimeout=30m.
- **Reboot gate (sentinel)**: `kured-sentinel-gate` DaemonSet creates `/var/run/gated-reboot-required` only when (a) host needs reboot, (b) all nodes Ready, (c) all calico-node pods Running, (d) **no node has transitioned Ready in the last 24h** (24h soak window).
+- **Reboot gate (sentinel)**: `kured-sentinel-gate` DaemonSet creates `/var/run/gated-reboot-required` only when (a) host needs reboot, (b) all nodes Ready, (c) all calico-node pods Running, (d) **no node has transitioned Ready in the last 24h** (24h soak window). The gate runs as an immortal `bash` loop that forks `kubectl` each cycle; the pod whose host has a pending reboot runs the full kubectl-heavy path indefinitely and slowly leaks. Mitigated 2026-05-31 (limit 64Mi→256Mi + `MAX_ITER=72` self-exit ≈6h so kubelet restarts it fresh) — see PM `2026-05-31-kured-sentinel-gate-oom.md`.
 - **Reboot gate (Prometheus)**: kured `--prometheus-url` polls `prometheus-server.monitoring.svc:80` before each drain. ANY firing alert blocks unless it matches the ignore-regex `^(Watchdog|RebootRequired|KuredNodeWasNotDrained|InfoInhibitor)$`.
 - **Health alert library**: 10 alerts in the `Upgrade Gates` group (`prometheus_chart_values.tpl`): `KubeAPIServerDown`, `KubeStateMetricsDown`, `PrometheusRuleEvaluationFailing`, `PVCStuckPending`, `RecentNodeReboot` (the explicit 24h soak signal), `MysqlStandaloneDown`, `ClusterPodReadyRatioDropped`, `NodeMemoryPressure`, `NodeDiskPressure`, `KubeQuotaAlmostFull`. Plus the existing 200+ alerts in the cluster-wide library (anything firing blocks kured).
 - **Notifications**: kured `notifyUrl` posts drain-start/drain-finish to Slack via Vault `secret/kured.slack_kured_webhook`. Alertmanager separately routes critical alerts to `#alerts`.
--- a/docs/post-mortems/2026-05-31-kured-sentinel-gate-oom.md
+++ b/docs/post-mortems/2026-05-31-kured-sentinel-gate-oom.md
@@ -0,0 +1,89 @@
+# Post-Mortem: kured-sentinel-gate OOM while k8s-master stuck pending-reboot
+
+| Field | Value |
+|-------|-------|
+| **Date** | 2026-05-31 |
+| **Duration** | OOMs began 2026-05-30 ~03:33, escalating until fixed 2026-05-31 14:40 UTC |
+| **Severity** | SEV4 — no user-facing impact; noisy + latent risk (a wedged gate pod could eventually mis-gate reboots) |
+| **Affected** | `kured-sentinel-gate` pod on k8s-master only |
+| **Status** | Fixed (gate hardened). Two contributing items still open, tracked separately. |
+
+## Summary
+
+Noticed by the operator during a routine cluster health check ("an app OOMing
+periodically"). The `kured-sentinel-gate` pod on k8s-master was the *only*
+container in the cluster with OOM events: `container_oom_events_total` showed
+0/day through May 29, **15 on May 30, 134 on May 31** (by 08:21). The kernel
+OOM-killer was killing child `kubectl` processes inside the pod's cgroup; PID 1
+(bash) survived, so the pod never restarted — restartCount stayed at 1 despite
+149 oom_events in 7d. Symptom: the gate's check cycle stretched from 5 min to
+~25 min.
+
+## Root cause (chain)
+
+```
+hermes-agent deploy = 0/0 (parked 2026-04-22, PVC-perms bug) → PVC WaitForFirstConsumer
+  never binds → PVCStuckPending fires; its dead external monitor → ExternalAccessDivergence
+/mnt/synology-backup (192.168.1.13 offsite NAS) at 96% → NodeFilesystemFull fires
+        │  none of these 3 are in kured's alert ignore-list
+        ▼
+kured halts ALL reboots (correct fail-safe)
+        ▼
+k8s-master got /var/run/reboot-required on 2026-05-30 03:33 (kernel update) but can't reboot
+        ▼
+master's gate pod is now the ONLY one running the kubectl-heavy hot path every cycle
+(the other 6 hit the early "no reboot required → continue" at ~3 MiB)
+        ▼
+the immortal `while true` bash loop slowly leaks (repeated kubectl forks + the
+Check-4 `< <(kubectl ...)` process substitution), crosses the 64Mi cgroup limit
+~5 days in, and the OOM-killer culls child kubectls — accelerating as it wedges
+```
+
+The 64Mi limit was the proximate misconfiguration: each `kubectl` fork is a
+~30-50Mi Go binary, and the hot path runs up to 3 per cycle.
+
+## Why hard to spot
+
+- The pod showed `Running` / `1 restart` the whole time — the OOMs hit child
+  processes, not PID 1. Only `container_oom_events_total` (cAdvisor) revealed it;
+  `kube_pod_container_status_*` restart metrics did not.
+- Logs looked clean ("ALL CHECKS PASSED") — the gate kept producing correct
+  decisions, just slowly.
+- Same blind spot as the 2026-05-16 PM: there is still no Prometheus signal for
+  "a node has been pending-reboot too long" (the deferred `KuredRebootBacklog`
+  alert). That alert would have surfaced the stuck-master state on May 30.
+
+## Fix (`stacks/kured/main.tf`, applied + committed 2026-05-31)
+
+1. **Immediate**: deleted the leaking pod (DaemonSet recreated it at ~3 MiB).
+2. **Durable**: memory limit `64Mi → 256Mi` (headroom for kubectl forks) **plus**
+   a self-restart guard — the loop counts iterations and `exit 0`s every
+   `MAX_ITER=72` cycles (~6h at 300s), so kubelet restarts the pod fresh and the
+   slow leak can never accumulate, regardless of how long a node stays
+   pending-reboot. Verified: all 7 pods at 256Mi, `iter N/72` loop live, OOMs
+   stopped.
+
+## Contributing items (open — being addressed separately)
+
+- **hermes-agent** parked at `replicas=0` since 2026-04-22 (PVC `/opt/data` perms
+  mismatch). Its orphaned `WaitForFirstConsumer` PVC drives PVCStuckPending +
+  ExternalAccessDivergence. Resolve = fix perms + scale up, OR remove the PVC and
+  external monitor while parked, OR scope PVCStuckPending to ignore 0-replica
+  consumers.
+- **Synology offsite backup at 96%** (5.0T/5.3T, 265G free; `#recycle` holds 17G).
+  Resolve = prune retention / empty recycle / expand volume. NodeFilesystemFull
+  cannot be blanket-ignored in kured (a full *node* disk SHOULD block reboots) —
+  if scoped, scope to the offsite mount only.
+
+Until both items clear, kured will keep (correctly) refusing to reboot master,
+but the gate pod's memory growth is now bounded either way.
+
+## Lessons
+
+1. **`container_oom_events_total` is the canonical "is anything OOMing" signal** —
+   not restart counts. A cgroup can OOM-kill children while PID 1 lives.
+2. **Immortal in-pod loops that fork heavy binaries need either a generous limit
+   or a periodic self-restart.** A periodic task is really a CronJob; the
+   self-exit guard is the minimal fix within the DaemonSet model.
+3. **The `KuredRebootBacklog` alert (deferred from 2026-05-16) is now twice-implicated.**
+   Worth promoting from the backlog: `kured_reboot_required == 1 for > 24h`.
--- a/stacks/kured/main.tf
+++ b/stacks/kured/main.tf
@@ -29,8 +29,8 @@ resource "kubernetes_namespace" "kured" {
  metadata {
    name = "kured"
    labels = {
-      "istio-injection" = "disabled"
-      tier              = local.tiers.cluster
+      "istio-injection"  = "disabled"
+      tier               = local.tiers.cluster
      "keel.sh/enrolled" = "true"
    }
  }
@@ -54,17 +54,17 @@ resource "helm_release" "kured" {

  values = [yamlencode({
    configuration = {
-      period         = "1h0m0s"
-      timeZone       = "Europe/London"
-      startTime      = "02:00"
-      endTime        = "06:00"
+      period    = "1h0m0s"
+      timeZone  = "Europe/London"
+      startTime = "02:00"
+      endTime   = "06:00"
      # All 7 days — operator decision 2026-05-16. The Mon–Fri restriction
      # was a 2026-03-16-era guardrail (overlapping with weekend on-call
      # response). Today the rest of the safety net (halt-on-alert,
      # sentinel-gate Check 4 = 24h soak, single-concurrency, the
      # K8sUpgradeStalled alert) is strong enough to operate any day; the
      # weekday-only schedule was just slowing the backlog down.
-      rebootDays     = ["mo", "tu", "we", "th", "fr", "sa", "su"]
+      rebootDays = ["mo", "tu", "we", "th", "fr", "sa", "su"]
      # IMPORTANT: must match where kured-sentinel-gate writes (below):
      # `touch /host/var-run/gated-reboot-required` → host
      # `/var/run/gated-reboot-required`. The kured chart derives the host
@@ -220,8 +220,24 @@ resource "kubernetes_daemon_set_v1" "kured_sentinel_gate" {
            "/bin/bash",
            "-c",
            <<-EOT
+              # Self-restart guard (added 2026-05-31): a node stuck in
+              # pending-reboot keeps THIS pod on the kubectl-heavy hot path
+              # every cycle. The long-lived bash loop slowly leaks (repeated
+              # kubectl forks + the Check-4 process substitution) until the
+              # cgroup OOM-kills child processes — PID 1 bash survives, so the
+              # pod never restarts, it just racks up silent oom_events
+              # (149 in 7d / accelerating on k8s-master, 2026-05-30..31).
+              # Exit 0 every MAX_ITER cycles (~6h at 300s) so kubelet restarts
+              # the pod fresh and memory can never accumulate.
+              ITER=0
+              MAX_ITER=72
              while true; do
-                echo "[$(date)] Checking reboot gate conditions..."
+                ITER=$((ITER + 1))
+                if [ "$ITER" -gt "$MAX_ITER" ]; then
+                  echo "  Iteration cap ($MAX_ITER) reached — exit 0 for a clean restart (leak guard)"
+                  exit 0
+                fi
+                echo "[$(date)] Checking reboot gate conditions... (iter $ITER/$MAX_ITER)"

                # Check 1: Does the host need a reboot?
                if [ ! -f /host/var-run/reboot-required ]; then
@@ -288,8 +304,13 @@ resource "kubernetes_daemon_set_v1" "kured_sentinel_gate" {
              cpu    = "10m"
              memory = "32Mi"
            }
+            # 64Mi was too tight for the kubectl-heavy hot path (each kubectl
+            # fork is a ~30-50Mi Go binary). Raised to 256Mi 2026-05-31 after
+            # the k8s-master gate pod OOM-killed child kubectls 149x/7d while
+            # master sat in pending-reboot. The self-restart guard (loop above)
+            # is the primary leak fix; this just gives comfortable headroom.
            limits = {
-              memory = "64Mi"
+              memory = "256Mi"
            }
          }
          volume_mount {