Containerd's overlayfs snapshotter corrupted after kernel update reboots. Image pulls failed, calico networking broke, cascading node-by-node outage.
Kured had no health gating — kept rebooting nodes even as the cluster degraded. No alert existed for image pull errors (stage 3 in the cascade). Reboot window config used wrong Helm keys.
Manually cleaned containerd state on each node. Deployed sentinel gate DaemonSet to block reboots when cluster is unhealthy. Added 6 new Prometheus alerts covering the detection gap.
/var/run/reboot-required created on each host.Containerd's overlayfs snapshotter became corrupted after a kernel update reboot. The new kernel was incompatible with existing overlayfs state, causing all subsequent image pulls to fail. This made calico-node (and all other pods) unable to start, breaking cluster networking.
reboot_days) instead of the correct configuration.rebootDays. The window was never enforced.KubeletImagePullErrors alert| Stage | What Happens | Alert | Latency |
|---|---|---|---|
| 1. Kernel update | reboot-required created | none future | — |
| 2. Kured reboots | Slack notification | Kured built-in | Immediate |
| 3. Snapshotter corrupts | Image pull errors | KubeletImagePullErrors new | ~10m |
| 4. Calico breaks | DaemonSet mismatch | CalicoNodeNotReady new | ~5m |
| 5. Node networking fails | Node NotReady | NodeNotReady (existing) | ~5m |
| 6. Pods cascade fail | Replica mismatch | DeploymentReplicasMismatch (existing) | ~30m |