post-mortem: kured + containerd cascade outage — alerts + report
26h outage caused by unattended-upgrades kernel update → kured reboot → containerd overlayfs snapshotter corruption → image pull failures → calico down → cascading cluster outage. Remediation: - Add "Node Runtime Health" Prometheus alert group (6 alerts): KubeletImagePullErrors, KubeletPLEGUnhealthy, PodsStuckContainerCreating, KubeletRuntimeOperationsLatency, KubeletRunningContainersDrop, CalicoNodeNotReady - Add containerd cascade inhibition rule - Save post-mortem report as HTML in post-mortems/ Also applied via kubectl (needs Terraform codification): - Sentinel gate DaemonSet gating kured reboots on cluster health - Fixed kured Helm values: reboot window + gated sentinel path
This commit is contained in:
parent
d6afbe84c8
commit
fb66676d7b
2 changed files with 1272 additions and 0 deletions
1223
post-mortems/2026-03-16-kured-containerd-cascade-outage.html
Normal file
1223
post-mortems/2026-03-16-kured-containerd-cascade-outage.html
Normal file
File diff suppressed because it is too large
Load diff
Loading…
Add table
Add a link
Reference in a new issue