diff --git a/post-mortems/2026-03-16-kured-containerd-cascade-outage.html b/post-mortems/2026-03-16-kured-containerd-cascade-outage.html
new file mode 100644
index 00000000..7cc1c872
--- /dev/null
+++ b/post-mortems/2026-03-16-kured-containerd-cascade-outage.html
@@ -0,0 +1,1223 @@
+Containerd's overlayfs snapshotter became corrupted after kernel-update reboots. Image pulls failed and Calico networking broke, causing a cascading node-by-node outage.
+Kured had no health gating, so it kept rebooting nodes even as the cluster degraded. No alert existed for image pull errors (stage 3 in the cascade). The reboot window config used the wrong Helm keys.
+Manually cleaned containerd state on each node. Deployed sentinel gate DaemonSet to block reboots when cluster is unhealthy. Added 6 new Prometheus alerts covering the detection gap.
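The sentinel gate's implementation is not shown in this summary. As a rough illustration of the decision such a gate might make, here is a minimal Python sketch. Everything in it is an assumption for illustration: the `ClusterHealth` type, the `reboot_allowed` function, and the choice to treat the alert names from this post-mortem as reboot blockers are hypothetical, not the deployed DaemonSet's logic.

```python
from dataclasses import dataclass, field


@dataclass
class ClusterHealth:
    """Snapshot of cluster state; all fields are hypothetical inputs."""
    nodes_total: int
    nodes_ready: int
    firing_alerts: set[str] = field(default_factory=set)


# Alert names taken from this post-mortem. Treating them as reboot blockers
# is an assumption, not the deployed gate's confirmed behaviour.
BLOCKING_ALERTS = frozenset({
    "KubeletImagePullErrors",
    "CalicoNodeNotReady",
    "NodeNotReady",
})


def reboot_allowed(health: ClusterHealth,
                   blocking: frozenset[str] = BLOCKING_ALERTS) -> tuple[bool, str]:
    """Return (allowed, reason).

    Blocks when no nodes are reported, when any node is NotReady,
    or when any blocking alert is firing.
    """
    if health.nodes_total == 0:
        return False, "no nodes reported"
    not_ready = health.nodes_total - health.nodes_ready
    if not_ready > 0:
        return False, f"{not_ready} node(s) NotReady"
    hits = sorted(health.firing_alerts & blocking)
    if hits:
        return False, "blocking alerts firing: " + ", ".join(hits)
    return True, "cluster healthy"
```

The sketch fails closed: with no node data it refuses the reboot, so a gate that cannot see the cluster never lets one through.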
+/var/run/reboot-required created on each host.
+Containerd's overlayfs snapshotter became corrupted after a kernel update reboot. The new kernel was incompatible with existing overlayfs state, causing all subsequent image pulls to fail. This made calico-node (and all other pods) unable to start, breaking cluster networking.
+reboot_days) instead of the correct configuration.rebootDays. The window was never enforced.
+KubeletImagePullErrors alert

| Stage | What Happens | Alert | Latency |
|---|---|---|---|
| 1. Kernel update | reboot-required created | none (future) | n/a |
| 2. Kured reboots | Slack notification | Kured built-in | Immediate |
| 3. Snapshotter corrupts | Image pull errors | KubeletImagePullErrors (new) | ~10m |
| 4. Calico breaks | DaemonSet mismatch | CalicoNodeNotReady (new) | ~5m |
| 5. Node networking fails | Node NotReady | NodeNotReady (existing) | ~5m |
| 6. Pods cascade fail | Replica mismatch | DeploymentReplicasMismatch (existing) | ~30m |
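To make the detection gap in the table concrete, the following Python sketch transcribes the six stages and lists which ones had no alert before and after the new rules were added. The `CASCADE` structure, the `uncovered_stages` helper, and the `"builtin"` label for the Kured Slack notification are naming choices made here for illustration and do not come from the incident tooling.

```python
# (stage, alert name or None, status), transcribed from the table above.
# "builtin" is a label chosen here for the Kured Slack notification.
CASCADE = [
    ("Kernel update", None, "future"),
    ("Kured reboots", "Kured built-in", "builtin"),
    ("Snapshotter corrupts", "KubeletImagePullErrors", "new"),
    ("Calico breaks", "CalicoNodeNotReady", "new"),
    ("Node networking fails", "NodeNotReady", "existing"),
    ("Pods cascade fail", "DeploymentReplicasMismatch", "existing"),
]


def uncovered_stages(cascade, include_new):
    """Return 1-based stage numbers that have no alert in the chosen setup."""
    active = {"builtin", "existing"} | ({"new"} if include_new else set())
    return [
        number
        for number, (_stage, alert, status) in enumerate(cascade, start=1)
        if alert is None or status not in active
    ]


print(uncovered_stages(CASCADE, include_new=False))  # [1, 3, 4]
print(uncovered_stages(CASCADE, include_new=True))   # [1]
```

Before the new alerts, stages 3 and 4 sat between the Kured notification and the first existing alert (`NodeNotReady`). With them enabled, only stage 1 remains uncovered, matching its "future" marking in the table.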