infra/stacks/platform
Viktor Barzin fb66676d7b post-mortem: kured + containerd cascade outage — alerts + report
26h outage caused by unattended-upgrades kernel update → kured reboot →
containerd overlayfs snapshotter corruption → image pull failures →
calico down → cascading cluster outage.

Remediation:
- Add "Node Runtime Health" Prometheus alert group (6 alerts):
  KubeletImagePullErrors, KubeletPLEGUnhealthy, PodsStuckContainerCreating,
  KubeletRuntimeOperationsLatency, KubeletRunningContainersDrop, CalicoNodeNotReady
- Add containerd cascade inhibition rule
- Save post-mortem report as HTML in post-mortems/

Also applied via kubectl (needs Terraform codification):
- Sentinel gate DaemonSet gating kured reboots on cluster health
- Fixed kured Helm values: reboot window + gated sentinel path
2026-03-16 22:06:10 +00:00
..
modules post-mortem: kured + containerd cascade outage — alerts + report 2026-03-16 22:06:10 +00:00
.gitkeep [ci skip] Add Terragrunt directory skeleton and root config 2026-02-22 13:01:37 +00:00
.terraform.lock.hcl Woodpecker CI deploy commit [CI SKIP] 2026-03-15 02:38:30 +00:00
backend.tf Woodpecker CI deploy commit [CI SKIP] 2026-03-15 21:43:40 +00:00
main.tf fix platform stack: k8s_users.domains and sensitive for_each errors [ci skip] 2026-03-15 23:36:46 +00:00
providers.tf migrate consuming stacks to ESO + remove k8s-dashboard static token 2026-03-15 19:05:04 +00:00
redis-25.3.2.tgz [ci skip] add auto-generated tiers.tf, planning docs, and helm chart cache 2026-03-06 23:55:57 +00:00
secrets [ci skip] Migrate 22 platform service states to stacks/platform 2026-02-22 13:35:10 +00:00
terragrunt.hcl fix: resolve HCL semicolons and vault-platform dependency cycle 2026-03-14 17:37:25 +00:00
tiers.tf [ci skip] Phase 1: PostgreSQL migrated to CNPG on local disk 2026-02-28 19:08:06 +00:00