docs: add k8s node auto-upgrade runbook + architecture section

The OS-side counterpart to the service-upgrade pipeline. Covers the unattended-upgrades + kured + sentinel-gate + Prometheus halt-on-alert design landed in c0991f7f8.

Runbook: ops procedures (verify health, halt rollout, restore config to a re-imaged node, roll back a bad upgrade, investigate which alert is blocking).

Architecture doc: extends the existing service-upgrade flow with a "K8s Node OS Upgrades" section (stack, sources of truth, day-2 mechanism, why-this-design rationale tied to the March 2026 post-mortem).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-10 17:26:15 +00:00 · 2026-05-10 17:26:15 +00:00 · f5b1fb179a
commit f5b1fb179a
parent 278ef5f19b
2 changed files with 245 additions and 1 deletion
--- a/docs/architecture/automated-upgrades.md
+++ b/docs/architecture/automated-upgrades.md
@@ -1,4 +1,9 @@
-# Automated Service Upgrades
+# Automated Upgrades
+
+This doc covers two independent automation paths:
+
+1. **Service-level upgrades** — Container image bumps for OSS apps (DIUN → n8n → claude-agent → Terraform). Most of this doc.
+2. **OS-level upgrades on K8s nodes** — `unattended-upgrades` + `kured` with sentinel-gate + Prometheus halt-on-alert. See "K8s Node OS Upgrades" section near the end and the runbook at `docs/runbooks/k8s-node-auto-upgrades.md`.

 ## Overview

@@ -205,3 +210,35 @@ The `DIUN Upgrade Agent` workflow is imported once into n8n's PG DB — it is **
 - **`N8N_BLOCK_ENV_ACCESS_IN_NODE=false`** must be set on the n8n deployment for expressions to read `$env.*` at all.
 - **Troubleshooting 401**: the workflow will show `success` status on the webhook node but error on `Run Upgrade Agent`. Inspect in n8n UI → Executions, or query `execution_entity` + `execution_data` directly. Claude-agent-service logs will also show `POST /execute HTTP/1.1 401 Unauthorized`.
 - **Patching the live workflow** (one-off, since it's not in TF): `UPDATE workflow_entity SET nodes = REPLACE(nodes::text, OLD, NEW)::json WHERE name = 'DIUN Upgrade Agent';`
+
+## K8s Node OS Upgrades
+
+Independent of the service-upgrade pipeline above. Drives apt package updates + reboots on the 5 K8s VMs (master + 4 workers).
+
+### Stack
+- **In-guest**: `unattended-upgrades` runs apt upgrades within Allowed-Origins (`-security`, `-updates`, ESM). Package-Blacklist excludes runtime components (`containerd`, `containerd.io`, `runc`, `cri-tools`, `kubernetes-cni`, `calico-*`, `cni-plugins-*`, `docker-ce`). `apt-mark hold` on `kubelet`, `kubeadm`, `kubectl` (and runtime pkgs as belt-and-braces). `Automatic-Reboot=false` — kured handles reboots.
+- **Reboot driver**: `kured` (chart `kured-5.11.0`, app `1.21.0`). Window Mon-Fri 02:00-06:00 Europe/London, period=1h, concurrency=1, reboot-delay=30s.
+- **Reboot gate (sentinel)**: `kured-sentinel-gate` DaemonSet creates `/var/run/gated-reboot-required` only when (a) host needs reboot, (b) all nodes Ready, (c) all calico-node pods Running, (d) **no node has transitioned Ready in the last 24h** (24h soak window).
+- **Reboot gate (Prometheus)**: kured `--prometheus-url` polls `prometheus-server.monitoring.svc:80` before each drain. ANY firing alert blocks unless it matches the ignore-regex `^(Watchdog|RebootRequired|KuredNodeWasNotDrained|InfoInhibitor)$`.
+- **Health alert library**: 10 alerts in the `Upgrade Gates` group (`prometheus_chart_values.tpl`): `KubeAPIServerDown`, `KubeStateMetricsDown`, `PrometheusRuleEvaluationFailing`, `PVCStuckPending`, `RecentNodeReboot` (the explicit 24h soak signal), `MysqlStandaloneDown`, `ClusterPodReadyRatioDropped`, `NodeMemoryPressure`, `NodeDiskPressure`, `KubeQuotaAlmostFull`. Plus the existing 200+ alerts in the cluster-wide library (anything firing blocks kured).
+- **Notifications**: kured `notifyUrl` posts drain-start/drain-finish to Slack via Vault `secret/kured.slack_kured_webhook`. Alertmanager separately routes critical alerts to `#alerts`.
+
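As a rough illustration of the Prometheus gate described above, the sketch below filters a list of firing alert names against the ignore-regex: anything that does not match stays a blocker. The helper `blocking_alerts` and the sample alert list are hypothetical, and kured's own matching may differ in detail.

```bash
# Regex copied from the kured ignore-list in the Stack section.
IGNORE_REGEX='^(Watchdog|RebootRequired|KuredNodeWasNotDrained|InfoInhibitor)$'

# blocking_alerts (hypothetical helper): read firing alert names on stdin,
# one per line, and print the de-duplicated names that would still block a drain.
blocking_alerts() {
  grep -Ev "$IGNORE_REGEX" | sort -u
}

# Sample input (made up for illustration): two ignored alerts, two blockers.
printf '%s\n' Watchdog RecentNodeReboot InfoInhibitor NodeDiskPressure | blocking_alerts
```

Because the regex is anchored with `^` and `$`, only exact names are ignored; an alert called `WatchdogFoo`, for instance, would still count as a blocker in this sketch.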
+### Source of truth
+| Concern | Location |
+|---|---|
+| Package config (uu, holds, blacklist) | `modules/create-template-vm/cloud_init.yaml` (within `is_k8s_template`) |
+| kured Helm release + sentinel-gate DS | `stacks/kured/main.tf` |
+| Upgrade Gates alerts | `stacks/monitoring/modules/monitoring/prometheus_chart_values.tpl` |
+
+### Day-2 changes
+Cloud-init only runs on first boot. Existing nodes are brought into compliance with a one-shot SSH push — see the runbook section "Restore / re-apply unattended-upgrades config to existing nodes" in `docs/runbooks/k8s-node-auto-upgrades.md`.
+
+### Why this design
+The 26h cluster outage on 2026-03-16 was triggered by an unattended-upgrades kernel push that corrupted containerd's overlayfs snapshotter cluster-wide. The remediations:
+- 24h soak (sentinel-gate Check 4) gives a full day of observation between consecutive node reboots — broken updates show up as Prometheus alerts before any other node restarts.
+- Prometheus halt-on-alert turns ANY firing alert into a hard block — including the 6 Node Runtime Health alerts and the 10 Upgrade Gates alerts that explicitly model "the cluster is in a bad state."
+- Package-Blacklist on runtime components prevents the exact failure mode (containerd/runc auto-bumps).
+- `Automatic-Reboot=false` keeps reboot policy in kured (window, ordering, gating), not in apt.
+
+### Operational reference
+See `docs/runbooks/k8s-node-auto-upgrades.md` for: verifying health, halting rollout, restoring config to a re-imaged node, rolling back a bad upgrade, and the past-incident timeline.
--- a/docs/runbooks/k8s-node-auto-upgrades.md
+++ b/docs/runbooks/k8s-node-auto-upgrades.md
@@ -0,0 +1,207 @@
+# K8s Node Auto-Upgrades
+
+## Overview
+
+OS-level package upgrades on the 5 K8s VMs (master + 4 workers) are driven by `unattended-upgrades` and rebooted by `kured`, with multiple safety gates layered on top to prevent the failure mode that caused the March 2026 26h cluster outage.
+
+## Architecture
+
+```
+apt-daily.timer (random within window)
+  │ apt-get update
+  │
+  ▼
+apt-daily-upgrade.timer (random within window)
+  │ unattended-upgrades runs
+  │   - Allowed-Origins: -security, -updates, ESM
+  │   - Package-Blacklist: containerd*, runc, calico-*, cni-plugins-*, docker-ce
+  │   - apt-mark hold on kubelet, kubeadm, kubectl, containerd*, runc
+  │   - Automatic-Reboot=false (kured handles reboots)
+  │
+  ▼ if kernel/glibc/systemd updated
+/var/run/reboot-required appears on the host
+  │
+  ▼ (sentinel-gate DaemonSet polls every 5min)
+kured-sentinel-gate checks:
+  ├── 1. Host has /var/run/reboot-required
+  ├── 2. ALL nodes Ready
+  ├── 3. ALL calico-node pods Running
+  └── 4. NO node Ready-transition in last 24h (soak window)
+  │
+  ▼ all pass
+touch /var/run/gated-reboot-required
+  │
+  ▼ (kured polls every 1h within 02:00-06:00 London Mon-Fri window)
+kured checks Prometheus before draining:
+  │ http://prometheus-server.monitoring.svc.cluster.local:80/api/v1/alerts
+  │ ANY firing alert (except ignore-list) blocks the drain
+  │ Ignore-list: ^(Watchdog|RebootRequired|KuredNodeWasNotDrained|InfoInhibitor)$
+  │
+  ▼ no blockers
+kured drains the node (priority-ordered, 310s budget)
+kured runs /bin/systemctl reboot
+  │
+  ▼ node returns
+kured uncordons + posts Slack notification (configuration.notifyUrl)
+  │
+  ▼ 24h cool-down begins (sentinel-gate Check 4)
+```
+
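The four sentinel-gate checks in the diagram reduce to a single all-must-pass decision. The following is a minimal sketch of that decision only, not the script shipped in `stacks/kured/main.tf`: the function name, its arguments, and the choice to treat exactly 86400 seconds as "soak complete" are assumptions made for illustration.

```bash
SOAK_SECONDS=86400   # 24h soak window (Check 4)

# gate_open REBOOT_REQUIRED NOT_READY_NODES CALICO_NOT_RUNNING SECS_SINCE_LAST_READY_TRANSITION
# Hypothetical helper: returns 0 (gate open) only when all four checks pass.
gate_open() {
  reboot_required=$1; not_ready=$2; calico_down=$3; since_transition=$4
  [ "$reboot_required" = "yes" ] || return 1                 # Check 1: host needs a reboot
  [ "$not_ready" -eq 0 ] || return 1                         # Check 2: all nodes Ready
  [ "$calico_down" -eq 0 ] || return 1                       # Check 3: all calico-node pods Running
  [ "$since_transition" -ge "$SOAK_SECONDS" ] || return 1    # Check 4: soak window elapsed
  return 0
}

# 25h since the last Ready transition: the gate opens.
if gate_open yes 0 0 90000; then echo "gate open: sentinel file would be created"; fi
# 1h since the last Ready transition: the gate stays closed.
gate_open yes 0 0 3600 || echo "gate closed: still inside the 24h soak"
```

Ordering the checks from cheapest to most informative keeps the common case (no reboot pending) fast, which is one plausible reason to evaluate Check 1 first.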
+## Components
+
+### unattended-upgrades (in-guest)
+- **Config**: `/etc/apt/apt.conf.d/52unattended-upgrades-k8s` + `/etc/apt/apt.conf.d/20auto-upgrades`
+- **Source of truth**: `infra/modules/create-template-vm/cloud_init.yaml` (within `is_k8s_template`)
+- **Day-2 push**: SSH-based — see "Restore / re-apply config" below
+
+### kured (Helm release)
+- **Stack**: `infra/stacks/kured/main.tf`
+- **Helm chart**: `kured-5.11.0` (image `ghcr.io/kubereboot/kured:1.21.0`)
+- **Window**: Mon-Fri 02:00-06:00 Europe/London, period=1h, concurrency=1
+- **Sentinel**: `/sentinel/gated-reboot-required` (created by sentinel-gate DaemonSet)
+- **Slack hook**: Vault `secret/kured` → `slack_kured_webhook`
+
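To reason about whether a given moment falls inside the kured window listed above (Mon-Fri 02:00-06:00 Europe/London), a small predicate can help. This is an illustrative sketch, not kured's implementation: `in_reboot_window` is a hypothetical helper, and treating the window as half-open (06:00 itself excluded) is an assumption.

```bash
# in_reboot_window DOW HHMM  (hypothetical helper)
#   DOW:  ISO weekday, 1=Mon .. 7=Sun, as printed by `date +%u`
#   HHMM: Europe/London wall-clock time, as printed by `date +%H%M`
# Assumption: the window is half-open, [02:00, 06:00).
in_reboot_window() {
  dow=$1
  hhmm=${2#0}; hhmm=${hhmm#0}; hhmm=${hhmm#0}   # strip leading zeros: 0230 becomes 230
  [ "$dow" -ge 1 ] && [ "$dow" -le 5 ] && [ "$hhmm" -ge 200 ] && [ "$hhmm" -lt 600 ]
}

# Evaluate the current moment in London time (requires tzdata on the machine running this).
if in_reboot_window "$(TZ=Europe/London date +%u)" "$(TZ=Europe/London date +%H%M)"; then
  echo "inside the kured reboot window"
else
  echo "outside the kured reboot window"
fi
```

Combined with period=1h and the 24h soak, this implies at most one node reboot per weekday night.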
+### kured-sentinel-gate (DaemonSet)
+- **Source**: `kubernetes_daemon_set_v1.kured_sentinel_gate` in `infra/stacks/kured/main.tf` (lines ~120-260)
+- **Image**: `bitnami/kubectl:latest`
+- **Loop period**: every 300s
+- **Gate logic**: 4 checks — see Architecture diagram
+
+### Upgrade Gates Prometheus alerts
+- **Source**: `infra/stacks/monitoring/modules/monitoring/prometheus_chart_values.tpl` group `Upgrade Gates`
+- **10 alerts**: KubeAPIServerDown, KubeStateMetricsDown, PrometheusRuleEvaluationFailing, PVCStuckPending, RecentNodeReboot, MysqlStandaloneDown, ClusterPodReadyRatioDropped, NodeMemoryPressure, NodeDiskPressure, KubeQuotaAlmostFull
+- **Effect**: kured `--prometheus-url` polls Prometheus before each drain — any non-ignored firing alert halts the rollout
+
+## Common Operations
+
+### Verify the system is healthy
+```bash
+# kured pods + sentinel-gate Running on all 5 nodes
+kubectl -n kured get pods
+
+# kured binary exposes the Prometheus flags (lists them only; does not test reachability)
+kubectl -n kured exec ds/kured -- /usr/bin/kured --help | grep prometheus
+
+# Upgrade Gates rules loaded + state
+kubectl -n monitoring exec deploy/prometheus-server -c prometheus-server -- \
+  wget -q -O- 'http://localhost:9090/api/v1/rules' | \
+  jq -r '.data.groups[] | select(.name == "Upgrade Gates") | .rules[] | "  \(.name): \(.state)"'
+
+# Per-node unattended-upgrades status
+for n in k8s-master k8s-node1 k8s-node2 k8s-node3 k8s-node4; do
+  echo "=== $n ==="
+  ssh "$n" "systemctl is-active unattended-upgrades; apt list --upgradable 2>/dev/null | tail -n +2 | wc -l"  # tail skips the 'Listing...' header
+done
+```
+
+### Halt rollout in an emergency
+```bash
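+# Option 1: take the kured reboot lock manually (`kubectl scale` does not support DaemonSets)
+kubectl -n kured annotate ds kured weave.works/kured-node-lock='{"nodeID":"manual"}'  # assumes the default --lock-annotation
+# When ready: kubectl -n kured annotate ds kured weave.works/kured-node-lock-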
+
+# Option 2: silence the gate via Alertmanager (allows kured to retry once silence expires)
+# Use Alertmanager UI at https://prometheus.viktorbarzin.me/alertmanager/
+```
+
+### Force halt by adding a custom blocker alert
+- Temporarily add an alert rule whose expression is always 1 (e.g. `vector(1)`) to the `Upgrade Gates` group.
+- Apply and wait for sync (~120s); kured will block on the next poll.
+- Remove when ready.
+
+### Pause apt upgrades on a single node
+```bash
+ssh <node> sudo systemctl stop unattended-upgrades
+ssh <node> sudo systemctl disable unattended-upgrades
+# Re-enable when ready:
+ssh <node> sudo systemctl enable --now unattended-upgrades
+```
+
+### Restore / re-apply unattended-upgrades config to existing nodes
+Cloud-init only runs on first boot. To bring existing nodes into compliance with the IaC:
+
+```bash
+# Per node — installs uu, drops apt config, holds k8s/runtime packages, enables service
+for n in k8s-master k8s-node1 k8s-node2 k8s-node3 k8s-node4; do
+  ssh $n sudo bash -s <<'EOF'
+set -e
+systemctl unmask unattended-upgrades 2>/dev/null || true
+DEBIAN_FRONTEND=noninteractive apt-get install -y unattended-upgrades update-notifier-common
+cat > /etc/apt/apt.conf.d/52unattended-upgrades-k8s <<'CONF'
+Unattended-Upgrade::Allowed-Origins {
+    "${distro_id}:${distro_codename}";
+    "${distro_id}:${distro_codename}-security";
+    "${distro_id}:${distro_codename}-updates";
+    "${distro_id}ESMApps:${distro_codename}-apps-security";
+    "${distro_id}ESM:${distro_codename}-infra-security";
+};
+Unattended-Upgrade::Package-Blacklist {
+    "^containerd(\.io)?$";
+    "^runc$";
+    "^cri-tools$";
+    "^kubernetes-cni$";
+    "^calico-.*";
+    "^cni-plugins-.*";
+    "^docker-ce$";
+};
+Unattended-Upgrade::DevRelease "false";
+Unattended-Upgrade::Automatic-Reboot "false";
+CONF
+cat > /etc/apt/apt.conf.d/20auto-upgrades <<'CONF'
+APT::Periodic::Update-Package-Lists "1";
+APT::Periodic::Unattended-Upgrade "1";
+CONF
+apt-mark hold kubelet kubeadm kubectl
+apt-mark hold containerd containerd.io runc 2>/dev/null || true
+systemctl enable --now unattended-upgrades
+EOF
+done
+```
+
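Before pushing the config, it can be useful to sanity-check which package names the `Package-Blacklist` patterns actually cover. The sketch below approximates the matching with `grep -E`; `is_blacklisted` is a hypothetical helper and the package names in the loop are examples only, so treat the result as a rough check rather than a statement of how unattended-upgrades evaluates the list.

```bash
# Patterns copied from the Package-Blacklist stanza above, one per line.
BLACKLIST='^containerd(\.io)?$
^runc$
^cri-tools$
^kubernetes-cni$
^calico-.*
^cni-plugins-.*
^docker-ce$'

# is_blacklisted PKG (hypothetical helper): succeed if PKG matches any pattern.
# grep treats a newline-separated pattern argument as a list of alternatives.
is_blacklisted() {
  printf '%s\n' "$1" | grep -Eq "$BLACKLIST"
}

# Example package names, chosen for illustration.
for pkg in containerd containerd.io runc linux-image-generic docker-ce-cli; do
  if is_blacklisted "$pkg"; then
    echo "$pkg: skipped by blacklist"
  else
    echo "$pkg: eligible for unattended upgrade"
  fi
done
```

Note that the anchored `^docker-ce$` pattern does not cover sibling packages such as `docker-ce-cli`; whether that matters depends on what is installed on the nodes.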
+### Roll back a bad apt upgrade
+1. Identify the package(s) that broke things from `/var/log/apt/history.log` on the affected node.
+2. Hold them: `sudo apt-mark hold <pkg>`.
+3. Downgrade: `sudo apt-get install -y --allow-downgrades <pkg>=<previous-version>` (find versions via `apt-cache madison <pkg>`).
+4. Reboot the node manually if the package needs it.
+5. Add the package to the `Unattended-Upgrade::Package-Blacklist` in `cloud_init.yaml` AND in the SSH push above, then re-run the push so existing nodes pick up the new blacklist entry and future apt runs skip the package.
+
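Steps 1 and 3 above both need the previous version of each upgraded package, which `/var/log/apt/history.log` records on its `Upgrade:` lines. A minimal sketch of extracting those pairs follows; `upgraded_packages` is a hypothetical helper and the sample log entry is invented for illustration.

```bash
# upgraded_packages (hypothetical helper): read apt history log text on stdin and
# print "name:arch old-version -> new-version" for every entry on an Upgrade: line.
upgraded_packages() {
  awk '/^Upgrade: / {
    sub(/^Upgrade: /, "")
    n = split($0, items, /\), /)          # entries are separated by "), "
    for (i = 1; i <= n; i++) {
      item = items[i]
      sub(/\)$/, "", item)                # the last entry keeps its closing paren
      split(item, parts, / \(/)           # parts[1] = name:arch, parts[2] = "old, new"
      split(parts[2], vers, /, /)
      printf "%s %s -> %s\n", parts[1], vers[1], vers[2]
    }
  }'
}

# Invented sample entry; on a node, pipe in /var/log/apt/history.log instead.
printf '%s\n' \
  'Start-Date: 2026-05-10  06:12:41' \
  'Commandline: /usr/bin/unattended-upgrade' \
  'Upgrade: libssl3:amd64 (3.0.2-0ubuntu1.15, 3.0.2-0ubuntu1.16), openssl:amd64 (3.0.2-0ubuntu1.15, 3.0.2-0ubuntu1.16)' \
  'End-Date: 2026-05-10  06:12:55' | upgraded_packages
```

The old version printed on the left of each arrow is the value to pass as `<previous-version>` in the downgrade command, provided `apt-cache madison <pkg>` still lists it.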
+### kured halted — investigate which alert is blocking
+```bash
+# Show kured logs — it logs "blocking alerts" when halting
+kubectl -n kured logs ds/kured --tail=100 | grep -i alert
+
+# List currently firing alerts (any of these blocks kured):
+kubectl -n monitoring exec deploy/prometheus-server -c prometheus-server -- \
+  wget -q -O- 'http://localhost:9090/api/v1/alerts' | \
+  jq -r '.data.alerts[] | select(.state == "firing") | "  \(.labels.alertname) (\(.labels.severity // "info"))"' | sort -u
+```
+
+The alert is either:
+- One of the 10 `Upgrade Gates` other than `RecentNodeReboot` (genuine cluster-health issue; fix it),
+- A pre-existing alert (any of the ~211 in the library — investigate),
+- Or `RecentNodeReboot` — expected for 24h after each node reboot. This is the soak window.
+
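The triage above can be sketched as a simple lookup on the alert name. `classify_alert` is a hypothetical helper; the alert names are taken from the `Upgrade Gates` list and the kured ignore-regex in this runbook, and the category labels are illustrative wording.

```bash
# classify_alert NAME (hypothetical helper): print which triage bucket NAME falls into.
classify_alert() {
  case "$1" in
    RecentNodeReboot)
      echo "soak window: expected for 24h after a node reboot" ;;
    KubeAPIServerDown|KubeStateMetricsDown|PrometheusRuleEvaluationFailing|PVCStuckPending|MysqlStandaloneDown|ClusterPodReadyRatioDropped|NodeMemoryPressure|NodeDiskPressure|KubeQuotaAlmostFull)
      echo "Upgrade Gates: cluster-health issue, fix it" ;;
    Watchdog|RebootRequired|KuredNodeWasNotDrained|InfoInhibitor)
      echo "ignored by kured: not a blocker" ;;
    *)
      echo "cluster-wide library: investigate" ;;
  esac
}

# Example: classify a few alert names.
for a in RecentNodeReboot NodeDiskPressure Watchdog SomeOtherAlert; do
  printf '%s -> %s\n' "$a" "$(classify_alert "$a")"
done
```

Feeding it the `sort -u` output of the firing-alerts query above gives a quick first pass on which blockers need action and which simply need time.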
+### Verify the 24h soak is enforcing
+```bash
+# Sentinel-gate logs Check 4 outcome
+kubectl -n kured logs ds/kured-sentinel-gate --tail=20 | grep -E "soak|cool-down|24"
+
+# kured won't drain another node until the most recent Ready-transition is >24h ago.
+# If you need to override (e.g. emergency security patch), shorten the cool-down by
+# editing infra/stacks/kured/main.tf (sentinel script: 86400 → smaller) and applying.
+```
+
+## Past Incidents
+
+- **2026-03-16 SEV-1**: Kured + Containerd Cascade Outage (26h). See `docs/post-mortems/2026-03-16-kured-containerd-cascade-outage.html`. Root cause: unattended-upgrades pushed a kernel update → kured rebooted nodes → containerd's overlayfs snapshotter corrupted → image pulls failed → calico broke → cascading outage. Remediations now baked into this system: 24h soak, Prometheus halt-on-alert, Package-Blacklist for runtime components, sentinel-gate health checks.
+
+## File Pointers
+
+| What | Where |
+|------|-------|
+| kured Helm + sentinel-gate | `infra/stacks/kured/main.tf` |
+| Upgrade Gates alerts | `infra/stacks/monitoring/modules/monitoring/prometheus_chart_values.tpl` (group "Upgrade Gates") |
+| Cloud-init for new nodes | `infra/modules/create-template-vm/cloud_init.yaml` |
+| Slack webhook | Vault `secret/kured` → `slack_kured_webhook` |
+| Post-mortem | `infra/docs/post-mortems/2026-03-16-kured-containerd-cascade-outage.html` |
+| Architecture doc | `infra/docs/architecture/automated-upgrades.md` (OS section) |