diff --git a/docs/architecture/automated-upgrades.md b/docs/architecture/automated-upgrades.md
index c6d423d6..8ff35609 100644
--- a/docs/architecture/automated-upgrades.md
+++ b/docs/architecture/automated-upgrades.md
@@ -219,7 +219,7 @@ Independent of the service-upgrade pipeline above. Drives apt package updates +
 ### Stack
 - **In-guest**: `unattended-upgrades` runs apt upgrades within Allowed-Origins (`-security`, `-updates`, ESM). Package-Blacklist excludes runtime components (`containerd`, `containerd.io`, `runc`, `cri-tools`, `kubernetes-cni`, `calico-*`, `cni-plugins-*`, `docker-ce`). `apt-mark hold` on `kubelet`, `kubeadm`, `kubectl` (and runtime pkgs as belt-and-braces). `Automatic-Reboot=false` — kured handles reboots.
 - **Reboot driver**: `kured` (chart `kured-5.11.0`, app `1.21.0`). Window 02:00-06:00 Europe/London every day of the week (Mon-Fri-only restriction dropped 2026-05-16 — see PM), period=1h, concurrency=1, reboot-delay=30s, drainTimeout=30m.
-- **Reboot gate (sentinel)**: `kured-sentinel-gate` DaemonSet creates `/var/run/gated-reboot-required` only when (a) host needs reboot, (b) all nodes Ready, (c) all calico-node pods Running, (d) **no node has transitioned Ready in the last 24h** (24h soak window).
+- **Reboot gate (sentinel)**: `kured-sentinel-gate` DaemonSet creates `/var/run/gated-reboot-required` only when (a) host needs reboot, (b) all nodes Ready, (c) all calico-node pods Running, (d) **no node has transitioned Ready in the last 24h** (24h soak window). The gate runs as an immortal `bash` loop that forks `kubectl` each cycle; the pod whose host has a pending reboot runs the full kubectl-heavy path indefinitely and slowly leaks. Mitigated 2026-05-31 (limit 64Mi→256Mi + `MAX_ITER=72` self-exit ≈6h so kubelet restarts it fresh) — see PM `2026-05-31-kured-sentinel-gate-oom.md`.
 - **Reboot gate (Prometheus)**: kured `--prometheus-url` polls `prometheus-server.monitoring.svc:80` before each drain. ANY firing alert blocks unless it matches the ignore-regex `^(Watchdog|RebootRequired|KuredNodeWasNotDrained|InfoInhibitor)$`.
 - **Health alert library**: 10 alerts in the `Upgrade Gates` group (`prometheus_chart_values.tpl`): `KubeAPIServerDown`, `KubeStateMetricsDown`, `PrometheusRuleEvaluationFailing`, `PVCStuckPending`, `RecentNodeReboot` (the explicit 24h soak signal), `MysqlStandaloneDown`, `ClusterPodReadyRatioDropped`, `NodeMemoryPressure`, `NodeDiskPressure`, `KubeQuotaAlmostFull`. Plus the existing 200+ alerts in the cluster-wide library (anything firing blocks kured).
 - **Notifications**: kured `notifyUrl` posts drain-start/drain-finish to Slack via Vault `secret/kured.slack_kured_webhook`. Alertmanager separately routes critical alerts to `#alerts`.
diff --git a/docs/post-mortems/2026-05-31-kured-sentinel-gate-oom.md b/docs/post-mortems/2026-05-31-kured-sentinel-gate-oom.md
new file mode 100644
index 00000000..ee0a7053
--- /dev/null
+++ b/docs/post-mortems/2026-05-31-kured-sentinel-gate-oom.md
@@ -0,0 +1,89 @@
+# Post-Mortem: kured-sentinel-gate OOM while k8s-master stuck pending-reboot
+
+| Field | Value |
+|-------|-------|
+| **Date** | 2026-05-31 |
+| **Duration** | OOMs began 2026-05-30 ~03:33, escalating until fixed 2026-05-31 14:40 UTC |
+| **Severity** | SEV4 — no user-facing impact; noisy + latent risk (a wedged gate pod could eventually mis-gate reboots) |
+| **Affected** | `kured-sentinel-gate` pod on k8s-master only |
+| **Status** | Fixed (gate hardened). Two contributing alerts still open, tracked separately. |
+
+## Summary
+
+Noticed by the operator during a routine cluster health check ("an app OOMing
+periodically"). The `kured-sentinel-gate` pod on k8s-master was the *only*
+container in the cluster with OOM events: `container_oom_events_total` showed
+0/day through May 29, **15 on May 30, 134 on May 31** (by 08:21). The kernel
+OOM-killer was killing child `kubectl` processes inside the pod's cgroup; PID 1
+(bash) survived, so the pod never restarted — restartCount stayed at 1 despite
+149 oom_events in 7d. Symptom: the gate's check cycle stretched from 5 min to
+~25 min.
+
+## Root cause (chain)
+
+```
+hermes-agent deploy = 0/0 (parked 2026-04-22, PVC-perms bug) → PVC WaitForFirstConsumer
+  never binds → PVCStuckPending fires; its dead external monitor → ExternalAccessDivergence
+/mnt/synology-backup (192.168.1.13 offsite NAS) at 96% → NodeFilesystemFull fires
+    │  none of these 3 are in kured's alert ignore-list
+    ▼
+kured halts ALL reboots (correct fail-safe)
+    ▼
+k8s-master got /var/run/reboot-required on 2026-05-30 03:33 (kernel update) but can't reboot
+    ▼
+master's gate pod is now the ONLY one running the kubectl-heavy hot path every cycle
+(the other 6 hit the early "no reboot required → continue" at ~3 MiB)
+    ▼
+the immortal `while true` bash loop slowly leaks (repeated kubectl forks + the
+Check-4 `< <(kubectl ...)` process substitution), crosses the 64Mi cgroup limit
+~5 days in, and the OOM-killer culls child kubectls — accelerating as it wedges
+```
+
+The 64Mi limit was the proximate misconfiguration: each `kubectl` fork is a
+~30-50Mi Go binary, and the hot path runs up to 3 per cycle.
+
+## Why hard to spot
+
+- The pod showed `Running` / `1 restart` the whole time — the OOMs hit child
+  processes, not PID 1. Only `container_oom_events_total` (cAdvisor) revealed it;
+  `kube_pod_container_status_*` restart metrics did not.
+- Logs looked clean ("ALL CHECKS PASSED") — the gate kept producing correct
+  decisions, just slowly.
+- Same blind spot as the 2026-05-16 PM: there is still no Prometheus signal for
+  "a node has been pending-reboot too long" (the deferred `KuredRebootBacklog`
+  alert). That alert would have surfaced the stuck-master state on May 30.
+
+## Fix (`stacks/kured/main.tf`, applied + committed 2026-05-31)
+
+1. **Immediate**: deleted the leaking pod (DaemonSet recreated it at ~3 MiB).
+2. **Durable**: memory limit `64Mi → 256Mi` (headroom for kubectl forks) **plus**
+   a self-restart guard — the loop counts iterations and `exit 0`s every
+   `MAX_ITER=72` cycles (~6h at 300s), so kubelet restarts the pod fresh and the
+   slow leak can never accumulate, regardless of how long a node stays
+   pending-reboot. Verified: all 7 pods at 256Mi, `iter N/72` loop live, OOMs
+   stopped.
+
+## Contributing items (open — being addressed separately)
+
+- **hermes-agent** parked at `replicas=0` since 2026-04-22 (PVC `/opt/data` perms
+  mismatch). Its orphaned `WaitForFirstConsumer` PVC drives PVCStuckPending +
+  ExternalAccessDivergence. Resolve = fix perms + scale up, OR remove the PVC and
+  external monitor while parked, OR scope PVCStuckPending to ignore 0-replica
+  consumers.
+- **Synology offsite backup at 96%** (5.0T/5.3T, 265G free; `#recycle` holds 17G).
+  Resolve = prune retention / empty recycle / expand volume. NodeFilesystemFull
+  cannot be blanket-ignored in kured (a full *node* disk SHOULD block reboots) —
+  if scoped, scope to the offsite mount only.
+
+Until at least the first two clear, kured will keep (correctly) refusing to
+reboot master — but the gate pod is now leak-proof either way.
+
+## Lessons
+
+1. **`container_oom_events_total` is the canonical "is anything OOMing" signal** —
+   not restart counts. A cgroup can OOM-kill children while PID 1 lives.
+2. **Immortal in-pod loops that fork heavy binaries need either a generous limit
+   or a periodic self-restart.** A periodic task is really a CronJob; the
+   self-exit guard is the minimal fix within the DaemonSet model.
+3. **The `KuredRebootBacklog` alert (deferred from 2026-05-16) is now twice-implicated.**
+   Worth promoting from the backlog: `kured_reboot_required == 1 for > 24h`.
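The periodic self-restart from Lesson 2 can be exercised outside the cluster. What follows is a minimal sketch under stated assumptions, not the script from `stacks/kured/main.tf`: `run_capped_loop` and its arguments are hypothetical names, and the demo runs 3 cycles with no delay rather than the gate's 72 cycles at 300s.

```shell
#!/usr/bin/env bash
# Hypothetical helper illustrating the iteration-cap pattern; it is not the
# code shipped in the DaemonSet.
# Usage: run_capped_loop MAX_ITER INTERVAL_SECONDS COMMAND [ARGS...]
run_capped_loop() {
  local max_iter=$1 interval=$2
  shift 2
  local iter=0
  while true; do
    iter=$((iter + 1))
    if [ "$iter" -gt "$max_iter" ]; then
      echo "Iteration cap ($max_iter) reached; returning 0 for a clean restart"
      return 0
    fi
    # A failing check must not end the loop; only the cap does.
    "$@" || true
    sleep "$interval"
  done
}

# Demo: run a stand-in check 3 times with no delay between cycles.
run_capped_loop 3 0 echo "check cycle"
```

In a pod entrypoint, following the call with `exit 0` would hand the restart to kubelet, which is the behaviour the fix relies on. With the gate's values, 72 cycles × 300 s = 21,600 s, so roughly 6 h of sleep time between restarts, plus however long the checks themselves take.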
diff --git a/stacks/kured/main.tf b/stacks/kured/main.tf
index 74eb91c1..1c2fe9c0 100644
--- a/stacks/kured/main.tf
+++ b/stacks/kured/main.tf
@@ -29,8 +29,8 @@ resource "kubernetes_namespace" "kured" {
   metadata {
     name = "kured"
     labels = {
-      "istio-injection" = "disabled"
-      tier = local.tiers.cluster
+      "istio-injection" = "disabled"
+      tier = local.tiers.cluster
       "keel.sh/enrolled" = "true"
     }
   }
@@ -54,17 +54,17 @@ resource "helm_release" "kured" {
 
   values = [yamlencode({
     configuration = {
-      period = "1h0m0s"
-      timeZone = "Europe/London"
-      startTime = "02:00"
-      endTime = "06:00"
+      period = "1h0m0s"
+      timeZone = "Europe/London"
+      startTime = "02:00"
+      endTime = "06:00"
       # All 7 days — operator decision 2026-05-16. The Mon–Fri restriction
       # was a 2026-03-16-era guardrail (overlapping with weekend on-call
       # response). Today the rest of the safety net (halt-on-alert,
       # sentinel-gate Check 4 = 24h soak, single-concurrency, the
       # K8sUpgradeStalled alert) is strong enough to operate any day; the
       # weekday-only schedule was just slowing the backlog down.
-      rebootDays = ["mo", "tu", "we", "th", "fr", "sa", "su"]
+      rebootDays = ["mo", "tu", "we", "th", "fr", "sa", "su"]
       # IMPORTANT: must match where kured-sentinel-gate writes (below):
       # `touch /host/var-run/gated-reboot-required` → host
       # `/var/run/gated-reboot-required`. The kured chart derives the host
@@ -220,8 +220,24 @@ resource "kubernetes_daemon_set_v1" "kured_sentinel_gate" {
             "/bin/bash",
             "-c",
             <<-EOT
+              # Self-restart guard (added 2026-05-31): a node stuck in
+              # pending-reboot keeps THIS pod on the kubectl-heavy hot path
+              # every cycle. The long-lived bash loop slowly leaks (repeated
+              # kubectl forks + the Check-4 process substitution) until the
+              # cgroup OOM-kills child processes — PID 1 bash survives, so the
+              # pod never restarts, it just racks up silent oom_events
+              # (149 in 7d / accelerating on k8s-master, 2026-05-30..31).
+              # Exit 0 every MAX_ITER cycles (~6h at 300s) so kubelet restarts
+              # the pod fresh and memory can never accumulate.
+              ITER=0
+              MAX_ITER=72
               while true; do
-                echo "[$(date)] Checking reboot gate conditions..."
+                ITER=$((ITER + 1))
+                if [ "$ITER" -gt "$MAX_ITER" ]; then
+                  echo " Iteration cap ($MAX_ITER) reached — exit 0 for a clean restart (leak guard)"
+                  exit 0
+                fi
+                echo "[$(date)] Checking reboot gate conditions... (iter $ITER/$MAX_ITER)"
 
                 # Check 1: Does the host need a reboot?
                 if [ ! -f /host/var-run/reboot-required ]; then
@@ -288,8 +304,13 @@ resource "kubernetes_daemon_set_v1" "kured_sentinel_gate" {
                 cpu = "10m"
                 memory = "32Mi"
               }
+              # 64Mi was too tight for the kubectl-heavy hot path (each kubectl
+              # fork is a ~30-50Mi Go binary). Raised to 256Mi 2026-05-31 after
+              # the k8s-master gate pod OOM-killed child kubectls 149x/7d while
+              # master sat in pending-reboot. The self-restart guard (loop above)
+              # is the primary leak fix; this just gives comfortable headroom.
               limits = {
-                memory = "64Mi"
+                memory = "256Mi"
               }
             }
             volume_mount {