infra/post-mortems
Viktor Barzin 12918dd491 post-mortem: kured + containerd cascade outage — alerts + report
26h outage caused by unattended-upgrades kernel update → kured reboot →
containerd overlayfs snapshotter corruption → image pull failures →
calico down → cascading cluster outage.

Remediation:
- Add "Node Runtime Health" Prometheus alert group (6 alerts):
  KubeletImagePullErrors, KubeletPLEGUnhealthy, PodsStuckContainerCreating,
  KubeletRuntimeOperationsLatency, KubeletRunningContainersDrop, CalicoNodeNotReady
- Add containerd cascade inhibition rule
- Save post-mortem report as HTML in post-mortems/

Also applied via kubectl (needs Terraform codification):
- Sentinel gate DaemonSet gating kured reboots on cluster health
- Fixed kured Helm values: reboot window + gated sentinel path
2026-03-18 08:04:04 +00:00
..
2026-03-16-kured-containerd-cascade-outage.html post-mortem: kured + containerd cascade outage — alerts + report 2026-03-18 08:04:04 +00:00