post-mortem: kured + containerd cascade outage — alerts + report · fb66676d7b - viktor/infra

26h outage caused by unattended-upgrades kernel update → kured reboot →
containerd overlayfs snapshotter corruption → image pull failures →
calico down → cascading cluster outage.

Remediation:
- Add "Node Runtime Health" Prometheus alert group (6 alerts):
  KubeletImagePullErrors, KubeletPLEGUnhealthy, PodsStuckContainerCreating,
  KubeletRuntimeOperationsLatency, KubeletRunningContainersDrop, CalicoNodeNotReady
- Add containerd cascade inhibition rule
- Save post-mortem report as HTML in post-mortems/
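
To illustrate what the cascade inhibition rule is for: while a root-cause alert fires on a node, the downstream symptom alerts for that same node are muted. The following is a minimal Python sketch of Alertmanager-style inhibition semantics, not the rule added in this commit. The source alert name `ContainerdUnhealthy`, the `node` label, and the choice of target alerts are assumptions made for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InhibitRule:
    source_alertname: str         # root-cause alert (hypothetical name below)
    target_alertnames: frozenset  # symptom alerts to mute
    equal: tuple                  # labels that must match between source and target


def apply_inhibition(firing, rule):
    """Return the alerts that would still notify after applying one rule.

    `firing` is a list of label dicts, one per firing alert.
    """
    sources = [a for a in firing if a.get("alertname") == rule.source_alertname]
    kept = []
    for alert in firing:
        is_target = alert.get("alertname") in rule.target_alertnames
        # A target is muted only if some source shares all `equal` label values.
        inhibited = is_target and any(
            all(alert.get(label) == src.get(label) for label in rule.equal)
            for src in sources
        )
        if not inhibited:
            kept.append(alert)
    return kept


# Hypothetical rule: a containerd failure mutes its per-node symptoms.
rule = InhibitRule(
    source_alertname="ContainerdUnhealthy",  # assumed name, not from the commit
    target_alertnames=frozenset({"PodsStuckContainerCreating", "CalicoNodeNotReady"}),
    equal=("node",),
)
```

Matching on `node` keeps the rule from hiding an unrelated failure on a different node, which is the usual reason to set an `equal` label list on an inhibition rule.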

Also applied directly via kubectl (still needs to be codified in Terraform):
- Sentinel gate DaemonSet gating kured reboots on cluster health
- Fixed kured Helm values: reboot window + gated sentinel path
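
The gating idea can be sketched as follows: kured watches a gated sentinel file instead of the OS-level one, and the gate creates that file only when a reboot is pending and the cluster looks healthy. This is a hedged Python sketch, not the DaemonSet applied here. The function name, the gated path, and how `cluster_healthy` is computed are all assumptions.

```python
from pathlib import Path


def update_gated_sentinel(reboot_required: Path, gated: Path, cluster_healthy: bool) -> bool:
    """Create or remove the gated sentinel; return True if a reboot is allowed.

    reboot_required: file written by the OS when a reboot is pending
                     (commonly /var/run/reboot-required on Debian/Ubuntu).
    gated:           file kured is assumed to watch instead (hypothetical path).
    cluster_healthy: result of a health check performed elsewhere (assumption).
    """
    allow = reboot_required.exists() and cluster_healthy
    if allow:
        gated.touch()                  # signal kured that rebooting is safe
    else:
        gated.unlink(missing_ok=True)  # withdraw the signal if health degrades
    return allow
```

Removing the gated file when health degrades matters as much as creating it: otherwise a stale sentinel from an earlier healthy check could still trigger a reboot in the middle of an incident.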

Viktor Barzin

2026-03-16 22:06:10 +00:00

parent d6afbe84c8

commit fb66676d7b

2 changed files with 1272 additions and 0 deletions

post-mortems/2026-03-16-kured-containerd-cascade-outage.html (normal file, 1223 lines added; diff suppressed because it is too large)
