From fb66676d7bc9fcbdd23ec8add9d77cc8d1dffb78 Mon Sep 17 00:00:00 2001
From: Viktor Barzin
Date: Mon, 16 Mar 2026 22:06:10 +0000
Subject: [PATCH] =?UTF-8?q?post-mortem:=20kured=20+=20containerd=20cascade?=
 =?UTF-8?q?=20outage=20=E2=80=94=20alerts=20+=20report?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

26h outage caused by unattended-upgrades kernel update → kured reboot →
containerd overlayfs snapshotter corruption → image pull failures →
calico down → cascading cluster outage.

Remediation:
- Add "Node Runtime Health" Prometheus alert group (6 alerts):
  KubeletImagePullErrors, KubeletPLEGUnhealthy, PodsStuckContainerCreating,
  KubeletRuntimeOperationsLatency, KubeletRunningContainersDrop,
  CalicoNodeNotReady
- Add containerd cascade inhibition rule
- Save post-mortem report as HTML in post-mortems/

Also applied via kubectl (needs Terraform codification):
- Sentinel gate DaemonSet gating kured reboots on cluster health
- Fixed kured Helm values: reboot window + gated sentinel path
---
 ...03-16-kured-containerd-cascade-outage.html | 1223 +++++++++++++++++
 .../monitoring/prometheus_chart_values.tpl    |   49 +
 2 files changed, 1272 insertions(+)
 create mode 100644 post-mortems/2026-03-16-kured-containerd-cascade-outage.html

diff --git a/post-mortems/2026-03-16-kured-containerd-cascade-outage.html b/post-mortems/2026-03-16-kured-containerd-cascade-outage.html
new file mode 100644
index 00000000..7cc1c872
--- /dev/null
+++ b/post-mortems/2026-03-16-kured-containerd-cascade-outage.html
@@ -0,0 +1,1223 @@
+
+
+
+
+
+
+Post-Incident Review: Kured + Containerd Cascade Outage
+
+
+
+
+
+
+
+
+
+
+ + +
+
+ SEV 1 + Resolved +
+

Kured + Containerd Cascade Outage

+
+ Owner: Viktor Barzin  •  + Duration: ~26 hours  •  + Cluster: viktorbarzin.me k8s  •  + Date: March 2026 +
+
+ + +
+
+

What Broke

+

Containerd's overlayfs snapshotter became corrupted after the kernel-update reboots. Image pulls failed, calico networking broke, and the outage cascaded node by node.

+
+
+

Why It Took So Long

+

Kured had no health gating, so it kept rebooting nodes even as the cluster degraded. No alert existed for image pull errors (stage 3 in the cascade), and the reboot window config used the wrong Helm keys.

+
+
+

How It Was Fixed

+

Manually cleaned containerd state on each node. Deployed a sentinel gate DaemonSet to block reboots while the cluster is unhealthy. Added 6 new Prometheus alerts covering the detection gap.

+
+
+ + +
+
+
26h
+
Total Outage
+
+
+
~2h
+
Time to Detect
+
+
+
~26h
+
Time to Mitigate
+
+
+
5
+
Nodes Affected
+
+
+ + +

Failure Cascade

+
+
+
+
Stage 1
+
Kernel Update
+
+
+
+
Stage 2
+
Kured Reboot
+
+
+
+
Stage 3
+
Snapshotter Corrupt
+
+
+
+
Stage 4
+
Calico Down
+
+
+
+
Stage 5
+
Node NotReady
+
+
+
+
Stage 6
+
Pods Cascade Fail
+
+
+
+ + +

Incident Timeline

+
+
+
+
T+0
+
unattended-upgrades installs kernel update
+
Automatic kernel update applied to all 5 nodes. /var/run/reboot-required created on each host.
+
+
+
+
T+0 to T+2h
+
Kured begins rebooting nodes
+
Kured detects the sentinel file on each node and starts rebooting them one by one. With no health gating, it proceeds regardless of cluster state. The reboot window config was not applied (wrong Helm keys).
+
+
+
+
T+1h
+
First node: containerd snapshotter corrupts
+
After the reboot, containerd's overlayfs snapshotter is corrupted: the new kernel is incompatible with the existing overlayfs state. Image pulls start failing on this node.
+
+
+
+
T+2h
+
Detection: services failing
+
Operator notices services going down. No Prometheus alert for image pull errors existed, so detection was purely manual observation.
+
+
+
+
T+2h to T+10h
+
Cascade accelerates
+
Kured continues rebooting the remaining nodes. Each rebooted node suffers the same containerd corruption. Calico-node pods fail to pull images and networking breaks node by node.
+
+
+
+
T+10h to T+24h
+
Manual remediation begins
+
SSH to each node, clean containerd state, restart containerd + kubelet, drain and uncordon. Process repeated for all 5 nodes.
+
+
+
+
T+26h
+
Cluster fully recovered
+
All nodes Ready, all calico-node pods running, all services restored. Post-mortem remediation work begins.
+
+
+ + +

Root Cause

+
+

Primary Root Cause

+

Containerd's overlayfs snapshotter became corrupted after a kernel update reboot. The new kernel was incompatible with existing overlayfs state, causing all subsequent image pulls to fail. This made calico-node (and all other pods) unable to start, breaking cluster networking.

+
+
+

Contributing Factors

+
    +
  • No health gating on kured: Kured kept rebooting nodes even as the cluster degraded. It had no mechanism to check if previous reboots were successful before proceeding to the next node.
  • +
  • Wrong Helm configuration keys: Kured's reboot window used legacy keys (reboot_days) instead of the correct configuration.rebootDays, so the window was never enforced (see the corrected values sketch after this list).
  • +
  • No monitoring for image pull errors: Stage 3 in the cascade (snapshotter corruption) had zero alerting. Detection relied on manual observation of service failures.
  • +
  • No cool-down between reboots: Kured would reboot the next node immediately after the previous one came back, regardless of whether the cluster had stabilized.
  • +
  • unattended-upgrades on k8s nodes: Kernel updates should not be automatically installed on production Kubernetes nodes. This was the initial trigger.
  • +
+
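To make the wrong-keys failure concrete, here is a minimal sketch of the corrected kured Helm values, assuming the kubereboot/kured chart's configuration block. The reboot window and the idea of a gated sentinel come from this report; the exact sentinel path shown is a hypothetical example.

# Sketch only: correct configuration.* keys replacing the legacy reboot_days.
configuration:
  rebootDays: ["mo", "tu", "we", "th", "fr"]   # was reboot_days (ignored by the chart)
  startTime: "02:00"
  endTime: "06:00"
  timeZone: "Europe/London"
  # Watch the file written by the sentinel gate DaemonSet instead of the
  # kernel's /var/run/reboot-required (path below is illustrative).
  rebootSentinel: /var/run/kured-gated-reboot-required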
+ + +

DERP Analysis

+
+
+

Detection

+
    +
  • No alert for image pull errors — key gap filled by new KubeletImagePullErrors alert
  • +
  • Manual detection after ~2h when services started failing
  • +
  • The kured Slack notification was the only signal, and it did not indicate a problem
  • +
+
+
+

Escalation

+
    +
  • Single operator incident — no formal escalation needed (homelab)
  • +
  • Root cause identified by SSH-ing to nodes and checking containerd logs
  • +
  • Kured was not stopped quickly enough; it continued rebooting during diagnosis
  • +
+
+
+

Remediation

+
    +
  • Cleaned containerd overlayfs state on each node manually
  • +
  • Restarted containerd + kubelet on all affected nodes
  • +
  • Drained and uncordoned nodes one by one
  • +
  • Disabled unattended-upgrades on all nodes
  • +
+
+
+

Prevention

+
    +
  • Sentinel gate DaemonSet: blocks kured unless all nodes Ready + calico healthy + 30m cool-down (sketched after this list)
  • +
  • Fixed kured Helm values: reboot window Mon-Fri 02:00-06:00 London
  • +
  • 6 new Prometheus alerts covering node runtime health
  • +
  • Containerd cascade inhibition rule to suppress noise
  • +
+
+
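For reference, a minimal sketch of how the sentinel gate DaemonSet can work (the deployed one is kubectl-applied and pending Terraform codification). The namespace, image, and script are illustrative assumptions, and it presumes a ServiceAccount allowed to read nodes and the calico-node DaemonSet.

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: kured-sentinel-gate
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: kured-sentinel-gate
  template:
    metadata:
      labels:
        app: kured-sentinel-gate
    spec:
      serviceAccountName: kured-sentinel-gate   # needs get/list on nodes and daemonsets
      containers:
        - name: gate
          image: bitnami/kubectl:1.29           # any image with kubectl + bash works
          command: ["/bin/bash", "-c"]
          args:
            - |
              while true; do
                # Only act if the host actually wants a reboot.
                if [ -f /host/var/run/reboot-required ]; then
                  # Gate 1: every node must be Ready.
                  not_ready=$(kubectl get nodes --no-headers | grep -vc ' Ready' || true)
                  # Gate 2: calico-node must be fully rolled out.
                  calico=$(kubectl -n calico-system get ds calico-node \
                    -o jsonpath='{.status.numberReady}/{.status.desiredNumberScheduled}')
                  if [ "$not_ready" -eq 0 ] && [ "${calico%/*}" = "${calico#*/}" ]; then
                    # Gate 3 (cool-down) omitted here for brevity: the real gate also
                    # waits 30m after the last node transition before passing the file.
                    touch /host/var/run/kured-gated-reboot-required
                  fi
                fi
                sleep 60
              done
          volumeMounts:
            - name: var-run
              mountPath: /host/var/run
      volumes:
        - name: var-run
          hostPath:
            path: /var/run

Kured is then pointed at the gated sentinel path via its Helm values (see the values sketch in the contributing-factors section), so a reboot can only happen when the gate has judged the cluster healthy.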
+ + +

Detection Chain Coverage

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
Stage | What Happens | Alert | Latency
1. Kernel update | reboot-required created | none (future) | -
2. Kured reboots | Slack notification | Kured built-in | Immediate
3. Snapshotter corrupts | Image pull errors | KubeletImagePullErrors (new) | ~10m
4. Calico breaks | DaemonSet mismatch | CalicoNodeNotReady (new) | ~5m
5. Node networking fails | Node NotReady | NodeNotReady (existing) | ~5m
6. Pods cascade fail | Replica mismatch | DeploymentReplicasMismatch (existing) | ~30m
+ + +

Follow-Up Tasks

+
+
+
+
Codify sentinel gate DaemonSet in Terraform (currently kubectl-applied)
+ P0 + 7d +
+
+
+
Investigate KubeletRuntimeOperationsLatency alert currently firing
+ P0 + 7d +
+
+
+
Disable unattended-upgrades on all k8s nodes
+ P0 + done +
+
+
+
Fix kured Helm values — reboot window + gated sentinel
+ P0 + done +
+
+
+
Deploy sentinel gate DaemonSet with cluster health checks
+ P0 + done +
+
+
+
Add 6 Prometheus alerts for node runtime health
+ P0 + done +
+
+
+
Add containerd health check to node provisioning (Terraform/cloud-init)
+ P1 + 30d +
+
+
+
Add Prometheus alert for /var/run/reboot-required existence (early warning; sketched after this task list)
+ P1 + 30d +
+
+
+
Evaluate switching from overlayfs to native snapshotter
+ P2 + 90d +
+
+
+
Add runbook links to all new alerts
+ P2 + 90d +
+
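As an illustration of the early-warning follow-up above, a hedged sketch of what the /var/run/reboot-required alert could look like. It assumes a node_reboot_required gauge (0 or 1) is published per node, for example via the node-exporter textfile collector; the metric name, duration, and severity are placeholders.

# Sketch only: alert on a pending, not-yet-applied reboot.
- alert: NodeRebootRequired
  expr: max by (instance) (node_reboot_required) == 1
  for: 30m
  labels:
    severity: info
  annotations:
    summary: "{{ $labels.instance }} has /var/run/reboot-required set; kured will reboot it in the next window"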
+
+
+
+
+
\ No newline at end of file
diff --git a/stacks/platform/modules/monitoring/prometheus_chart_values.tpl b/stacks/platform/modules/monitoring/prometheus_chart_values.tpl
index 3f902bcd..c4d5070b 100755
--- a/stacks/platform/modules/monitoring/prometheus_chart_values.tpl
+++ b/stacks/platform/modules/monitoring/prometheus_chart_values.tpl
@@ -98,6 +98,11 @@ alertmanager:
         - alertname = PowerOutage
       target_matchers:
         - alertname =~ "NodeDown|NFSServerUnresponsive|NodeExporterDown|CloudflaredDown|MetalLBSpeakerDown|MetalLBControllerDown"
+    # Containerd broken suppresses downstream pod alerts
+    - source_matchers:
+        - alertname = KubeletImagePullErrors
+      target_matchers:
+        - alertname =~ "PodsStuckContainerCreating|DeploymentReplicasMismatch|StatefulSetReplicasMismatch|DaemonSetMissingPods"
   receivers:
     - name: slack-critical
       slack_configs:
@@ -702,6 +707,50 @@ serverFiles:
             severity: info
           annotations:
             summary: "No node load data for 10m - check Prometheus scraping"
+      - name: "Node Runtime Health"
+        rules:
+          - alert: KubeletImagePullErrors
+            expr: sum by (node) (rate(kubelet_runtime_operations_errors_total{operation_type=~"pull_image|PullImage"}[10m])) > 0.1
+            for: 10m
+            labels:
+              severity: critical
+            annotations:
+              summary: "Image pull errors on {{ $labels.node }}: {{ $value | printf \"%.2f\" }}/s — containerd may be broken"
+          - alert: KubeletPLEGUnhealthy
+            expr: (time() - kubelet_pleg_last_seen_seconds) > 180
+            for: 5m
+            labels:
+              severity: critical
+            annotations:
+              summary: "PLEG on {{ $labels.instance }} not seen for {{ $value | printf \"%.0f\" }}s — kubelet lifecycle management broken"
+          - alert: PodsStuckContainerCreating
+            expr: count by (node) (kube_pod_container_status_waiting_reason{reason="ContainerCreating"} == 1) > 3
+            for: 15m
+            labels:
+              severity: warning
+            annotations:
+              summary: "{{ $value | printf \"%.0f\" }} pods stuck in ContainerCreating on {{ $labels.node }}"
+          - alert: KubeletRuntimeOperationsLatency
+            expr: histogram_quantile(0.99, sum by (instance, operation_type, le) (rate(kubelet_runtime_operations_duration_seconds_bucket[10m]))) > 30
+            for: 10m
+            labels:
+              severity: warning
+            annotations:
+              summary: "Kubelet {{ $labels.operation_type }} p99: {{ $value | printf \"%.0f\" }}s on {{ $labels.instance }} (threshold: 30s)"
+          - alert: KubeletRunningContainersDrop
+            expr: (kubelet_running_containers{container_state="running"} - kubelet_running_containers{container_state="running"} offset 10m) < -10
+            for: 5m
+            labels:
+              severity: critical
+            annotations:
+              summary: "Running containers on {{ $labels.instance }} dropped by {{ $value | printf \"%.0f\" }} in 10m"
+          - alert: CalicoNodeNotReady
+            expr: kube_daemonset_status_number_ready{namespace="calico-system", daemonset="calico-node"} < kube_daemonset_status_desired_number_scheduled{namespace="calico-system", daemonset="calico-node"}
+            for: 5m
+            labels:
+              severity: critical
+            annotations:
+              summary: "Calico: only {{ $value | printf \"%.0f\" }} of desired calico-node pods ready — networking degraded"
       - name: "Traefik Ingress"
         rules:
           - alert: TraefikDown
+ + + + + \ No newline at end of file diff --git a/stacks/platform/modules/monitoring/prometheus_chart_values.tpl b/stacks/platform/modules/monitoring/prometheus_chart_values.tpl index 3f902bcd..c4d5070b 100755 --- a/stacks/platform/modules/monitoring/prometheus_chart_values.tpl +++ b/stacks/platform/modules/monitoring/prometheus_chart_values.tpl @@ -98,6 +98,11 @@ alertmanager: - alertname = PowerOutage target_matchers: - alertname =~ "NodeDown|NFSServerUnresponsive|NodeExporterDown|CloudflaredDown|MetalLBSpeakerDown|MetalLBControllerDown" + # Containerd broken suppresses downstream pod alerts + - source_matchers: + - alertname = KubeletImagePullErrors + target_matchers: + - alertname =~ "PodsStuckContainerCreating|DeploymentReplicasMismatch|StatefulSetReplicasMismatch|DaemonSetMissingPods" receivers: - name: slack-critical slack_configs: @@ -702,6 +707,50 @@ serverFiles: severity: info annotations: summary: "No node load data for 10m - check Prometheus scraping" + - name: "Node Runtime Health" + rules: + - alert: KubeletImagePullErrors + expr: sum by (node) (rate(kubelet_runtime_operations_errors_total{operation_type=~"pull_image|PullImage"}[10m])) > 0.1 + for: 10m + labels: + severity: critical + annotations: + summary: "Image pull errors on {{ $labels.node }}: {{ $value | printf \"%.2f\" }}/s — containerd may be broken" + - alert: KubeletPLEGUnhealthy + expr: (time() - kubelet_pleg_last_seen_seconds) > 180 + for: 5m + labels: + severity: critical + annotations: + summary: "PLEG on {{ $labels.instance }} not seen for {{ $value | printf \"%.0f\" }}s — kubelet lifecycle management broken" + - alert: PodsStuckContainerCreating + expr: count by (node) (kube_pod_container_status_waiting_reason{reason="ContainerCreating"} == 1) > 3 + for: 15m + labels: + severity: warning + annotations: + summary: "{{ $value | printf \"%.0f\" }} pods stuck in ContainerCreating on {{ $labels.node }}" + - alert: KubeletRuntimeOperationsLatency + expr: histogram_quantile(0.99, sum by (instance, operation_type, le) (rate(kubelet_runtime_operations_duration_seconds_bucket[10m]))) > 30 + for: 10m + labels: + severity: warning + annotations: + summary: "Kubelet {{ $labels.operation_type }} p99: {{ $value | printf \"%.0f\" }}s on {{ $labels.instance }} (threshold: 30s)" + - alert: KubeletRunningContainersDrop + expr: (kubelet_running_containers{container_state="running"} - kubelet_running_containers{container_state="running"} offset 10m) < -10 + for: 5m + labels: + severity: critical + annotations: + summary: "Running containers on {{ $labels.instance }} dropped by {{ $value | printf \"%.0f\" }} in 10m" + - alert: CalicoNodeNotReady + expr: kube_daemonset_status_number_ready{namespace="calico-system", daemonset="calico-node"} < kube_daemonset_status_desired_number_scheduled{namespace="calico-system", daemonset="calico-node"} + for: 5m + labels: + severity: critical + annotations: + summary: "Calico: only {{ $value | printf \"%.0f\" }} of desired calico-node pods ready — networking degraded" - name: "Traefik Ingress" rules: - alert: TraefikDown