infra/stacks/platform
Viktor Barzin 12918dd491 post-mortem: kured + containerd cascade outage — alerts + report
26h outage caused by unattended-upgrades kernel update → kured reboot →
containerd overlayfs snapshotter corruption → image pull failures →
calico down → cascading cluster outage.

Remediation:
- Add "Node Runtime Health" Prometheus alert group (6 alerts):
  KubeletImagePullErrors, KubeletPLEGUnhealthy, PodsStuckContainerCreating,
  KubeletRuntimeOperationsLatency, KubeletRunningContainersDrop, CalicoNodeNotReady
- Add containerd cascade inhibition rule
- Save post-mortem report as HTML in post-mortems/

Also applied via kubectl (needs Terraform codification):
- Sentinel gate DaemonSet gating kured reboots on cluster health
- Fixed kured Helm values: reboot window + gated sentinel path
2026-03-18 08:04:04 +00:00
..
modules post-mortem: kured + containerd cascade outage — alerts + report 2026-03-18 08:04:04 +00:00
.gitkeep [ci skip] Add Terragrunt directory skeleton and root config 2026-02-22 13:01:37 +00:00
.terraform.lock.hcl Woodpecker CI deploy commit [CI SKIP] 2026-03-18 08:04:00 +00:00
backend.tf Woodpecker CI deploy commit [CI SKIP] 2026-03-18 08:04:03 +00:00
main.tf fix platform stack: k8s_users.domains and sensitive for_each errors [ci skip] 2026-03-18 08:04:03 +00:00
providers.tf migrate consuming stacks to ESO + remove k8s-dashboard static token 2026-03-18 08:04:02 +00:00
redis-25.3.2.tgz [ci skip] add auto-generated tiers.tf, planning docs, and helm chart cache 2026-03-06 23:55:57 +00:00
secrets [ci skip] Migrate 22 platform service states to stacks/platform 2026-02-22 13:35:10 +00:00
terragrunt.hcl fix: resolve HCL semicolons and vault-platform dependency cycle 2026-03-18 08:03:59 +00:00
tiers.tf [ci skip] Phase 1: PostgreSQL migrated to CNPG on local disk 2026-02-28 19:08:06 +00:00