Viktor Barzin
|
fb66676d7b
|
post-mortem: kured + containerd cascade outage — alerts + report
26h outage caused by unattended-upgrades kernel update → kured reboot →
containerd overlayfs snapshotter corruption → image pull failures →
calico down → cascading cluster outage.
Remediation:
- Add "Node Runtime Health" Prometheus alert group (6 alerts):
KubeletImagePullErrors, KubeletPLEGUnhealthy, PodsStuckContainerCreating,
KubeletRuntimeOperationsLatency, KubeletRunningContainersDrop, CalicoNodeNotReady
- Add containerd cascade inhibition rule
- Save post-mortem report as HTML in post-mortems/
Also applied via kubectl (needs Terraform codification):
- Sentinel gate DaemonSet gating kured reboots on cluster health
- Fixed kured Helm values: reboot window + gated sentinel path
|
2026-03-16 22:06:10 +00:00 |
|