Commit graph

2 commits

Author SHA1 Message Date
Viktor Barzin
6cc4d526f1 add GitHub Pages for post-mortems
- Index page listing all incident reports
- GHA workflow deploys post-mortems/ on push
- Available at viktorbarzin.github.io/infra/
2026-03-16 22:16:05 +00:00
Viktor Barzin
fb66676d7b post-mortem: kured + containerd cascade outage — alerts + report
26h outage caused by unattended-upgrades kernel update → kured reboot →
containerd overlayfs snapshotter corruption → image pull failures →
calico down → cascading cluster outage.

Remediation:
- Add "Node Runtime Health" Prometheus alert group (6 alerts):
  KubeletImagePullErrors, KubeletPLEGUnhealthy, PodsStuckContainerCreating,
  KubeletRuntimeOperationsLatency, KubeletRunningContainersDrop, CalicoNodeNotReady
- Add containerd cascade inhibition rule
- Save post-mortem report as HTML in post-mortems/

Also applied via kubectl (needs Terraform codification):
- Sentinel gate DaemonSet gating kured reboots on cluster health
- Fixed kured Helm values: reboot window + gated sentinel path
2026-03-16 22:06:10 +00:00