Commit graph

2 commits

Author SHA1 Message Date
Viktor Barzin
ebc1aca791 add GitHub Pages for post-mortems
- Index page listing all incident reports
- GHA workflow deploys post-mortems/ on push
- Available at viktorbarzin.github.io/infra/
2026-03-18 08:04:04 +00:00
Viktor Barzin
12918dd491 post-mortem: kured + containerd cascade outage — alerts + report
26h outage caused by unattended-upgrades kernel update → kured reboot →
containerd overlayfs snapshotter corruption → image pull failures →
calico down → cascading cluster outage.

Remediation:
- Add "Node Runtime Health" Prometheus alert group (6 alerts):
  KubeletImagePullErrors, KubeletPLEGUnhealthy, PodsStuckContainerCreating,
  KubeletRuntimeOperationsLatency, KubeletRunningContainersDrop, CalicoNodeNotReady
- Add containerd cascade inhibition rule
- Save post-mortem report as HTML in post-mortems/

Also applied via kubectl (needs Terraform codification):
- Sentinel gate DaemonSet gating kured reboots on cluster health
- Fixed kured Helm values: reboot window + gated sentinel path
2026-03-18 08:04:04 +00:00
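
The sentinel gate mentioned in commit 12918dd491 can be illustrated with a minimal sketch. This is not the author's DaemonSet. It assumes the gate works by exposing a second, "gated" sentinel file to kured only when the OS has requested a reboot and every node looks healthy. The names NodeStatus, cluster_is_healthy, sync_gate, and the runtime_healthy signal are hypothetical, as are the file paths used in any real deployment.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Iterable


@dataclass(frozen=True)
class NodeStatus:
    """Hypothetical per-node health summary fed to the gate."""
    name: str
    ready: bool            # node reports the Ready condition
    runtime_healthy: bool  # assumed signal, e.g. no image pull errors


def cluster_is_healthy(nodes: Iterable[NodeStatus]) -> bool:
    """Healthy only if at least one node is known and all nodes pass both checks."""
    nodes = list(nodes)
    return bool(nodes) and all(n.ready and n.runtime_healthy for n in nodes)


def sync_gate(os_sentinel: Path, gated_sentinel: Path,
              nodes: Iterable[NodeStatus]) -> bool:
    """Create the gated sentinel when a reboot is both requested and safe.

    os_sentinel is the file the OS writes when a reboot is needed.
    gated_sentinel is the file kured would be configured to watch instead
    (an assumption about the setup, not taken from the commit).
    Returns True if the gate is open after this call.
    """
    allow = os_sentinel.exists() and cluster_is_healthy(nodes)
    if allow:
        gated_sentinel.touch()
    else:
        # Close the gate again if the cluster degraded since the last check.
        gated_sentinel.unlink(missing_ok=True)
    return allow
```

Run periodically on each node, a check of this shape keeps a single unhealthy node from being followed by further reboots, which is the failure mode the post-mortem describes. An empty node list is treated as unhealthy so that a failed health query never opens the gate.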