infra/stacks
Viktor Barzin fb66676d7b post-mortem: kured + containerd cascade outage — alerts + report
26h outage caused by unattended-upgrades kernel update → kured reboot →
containerd overlayfs snapshotter corruption → image pull failures →
calico down → cascading cluster outage.

Remediation:
- Add "Node Runtime Health" Prometheus alert group (6 alerts):
  KubeletImagePullErrors, KubeletPLEGUnhealthy, PodsStuckContainerCreating,
  KubeletRuntimeOperationsLatency, KubeletRunningContainersDrop, CalicoNodeNotReady
- Add containerd cascade inhibition rule
- Save post-mortem report as HTML in post-mortems/

Also applied via kubectl (needs Terraform codification):
- Sentinel gate DaemonSet gating kured reboots on cluster health
- Fixed kured Helm values: reboot window + gated sentinel path
2026-03-16 22:06:10 +00:00
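The sentinel gate described in the commit message can be illustrated with a small sketch. kured reboots a node when its sentinel file exists (by default `/var/run/reboot-required`); the gate idea is to point kured at a second, "gated" sentinel that is only created while the cluster is healthy. The commit does not show the actual DaemonSet, so everything below is an assumption made for illustration: the `ClusterHealth` fields, the health criteria, and the function names are hypothetical, and the file paths are passed in by the caller rather than taken from the real Helm values.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass
class ClusterHealth:
    """Hypothetical health snapshot; the real gate's checks are not shown in the commit."""
    nodes_ready: int
    nodes_total: int
    calico_ready: bool


def reboot_allowed(health: ClusterHealth) -> bool:
    """Allow a reboot only when every node is Ready and calico is healthy (assumed criteria)."""
    return (
        health.nodes_total > 0
        and health.nodes_ready == health.nodes_total
        and health.calico_ready
    )


def sync_gated_sentinel(reboot_required: Path, gated: Path, health: ClusterHealth) -> bool:
    """Create the gated sentinel only if a reboot is pending and the cluster is healthy.

    Returns True when the gated sentinel exists after the call.
    """
    if reboot_required.exists() and reboot_allowed(health):
        gated.touch()
        return True
    # Remove a stale gate so a reboot is not triggered while the cluster is degraded.
    gated.unlink(missing_ok=True)
    return False
```

In this sketch kured would be configured to watch the `gated` path instead of the OS-provided file, so a pending kernel update alone is never enough to trigger a reboot.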
_template add generic multi-user cluster onboarding system 2026-03-15 22:23:36 +00:00
actualbudget migrate 16 plan-time stacks: vault data source → ESO + kubernetes_secret 2026-03-15 22:06:39 +00:00
affine migrate 16 plan-time stacks: vault data source → ESO + kubernetes_secret 2026-03-15 22:06:39 +00:00
audiobookshelf migrate 16 plan-time stacks: vault data source → ESO + kubernetes_secret 2026-03-15 22:06:39 +00:00
blog add vaultwarden daily backup CronJob to NFS 2026-03-15 00:03:59 +00:00
calibre migrate 16 plan-time stacks: vault data source → ESO + kubernetes_secret 2026-03-15 22:06:39 +00:00
changedetection scale up f1-stream and changedetection [ci skip] 2026-03-16 07:06:09 +00:00
city-guesser add vaultwarden daily backup CronJob to NFS 2026-03-15 00:03:59 +00:00
claude-memory right-size 14 services and scale down GPU-heavy workloads [ci skip] 2026-03-15 23:00:49 +00:00
coturn migrate 16 plan-time stacks: vault data source → ESO + kubernetes_secret 2026-03-15 22:06:39 +00:00
cyberchef add vaultwarden daily backup CronJob to NFS 2026-03-15 00:03:59 +00:00
dashy add vaultwarden daily backup CronJob to NFS 2026-03-15 00:03:59 +00:00
dawarich right-size 14 services and scale down GPU-heavy workloads [ci skip] 2026-03-15 23:00:49 +00:00
descheduler migrate all secrets from SOPS to Vault KV 2026-03-14 17:15:48 +00:00
diun regenerate providers.tf: remove vault_root_token variable [ci skip] 2026-03-15 21:21:01 +00:00
ebook2audiobook right-size 14 services and scale down GPU-heavy workloads [ci skip] 2026-03-15 23:00:49 +00:00
echo add vaultwarden daily backup CronJob to NFS 2026-03-15 00:03:59 +00:00
excalidraw add vaultwarden daily backup CronJob to NFS 2026-03-15 00:03:59 +00:00
external-secrets regenerate providers.tf: remove vault_root_token variable [ci skip] 2026-03-15 21:21:01 +00:00
f1-stream scale up f1-stream and changedetection [ci skip] 2026-03-16 07:06:09 +00:00
forgejo migrate consuming stacks to ESO + remove k8s-dashboard static token 2026-03-15 19:05:04 +00:00
freedify migrate 16 plan-time stacks: vault data source → ESO + kubernetes_secret 2026-03-15 22:06:39 +00:00
freshrss migrate 16 plan-time stacks: vault data source → ESO + kubernetes_secret 2026-03-15 22:06:39 +00:00
frigate right-size cluster memory: reduce overprovisioned, fix under-provisioned services 2026-03-15 15:30:18 +00:00
grampsweb migrate 16 plan-time stacks: vault data source → ESO + kubernetes_secret 2026-03-15 22:06:39 +00:00
hackmd fix DB password desync + migrate remaining tfvars to Vault 2026-03-15 21:39:45 +00:00
health right-size 14 services and scale down GPU-heavy workloads [ci skip] 2026-03-15 23:00:49 +00:00
homepage add vaultwarden daily backup CronJob to NFS 2026-03-15 00:03:59 +00:00
immich regenerate providers.tf: remove vault_root_token variable [ci skip] 2026-03-15 21:21:01 +00:00
infra fix DB password desync + migrate remaining tfvars to Vault 2026-03-15 21:39:45 +00:00
isponsorblocktv add vaultwarden daily backup CronJob to NFS 2026-03-15 00:03:59 +00:00
jsoncrack migrate all secrets from SOPS to Vault KV 2026-03-14 17:15:48 +00:00
k8s-dashboard migrate consuming stacks to ESO + remove k8s-dashboard static token 2026-03-15 19:05:04 +00:00
kms add vaultwarden daily backup CronJob to NFS 2026-03-15 00:03:59 +00:00
linkwarden right-size 14 services and scale down GPU-heavy workloads [ci skip] 2026-03-15 23:00:49 +00:00
matrix regenerate providers.tf: remove vault_root_token variable [ci skip] 2026-03-15 21:21:01 +00:00
meshcentral add vaultwarden daily backup CronJob to NFS 2026-03-15 00:03:59 +00:00
n8n regenerate providers.tf: remove vault_root_token variable [ci skip] 2026-03-15 21:21:01 +00:00
navidrome right-size 14 services and scale down GPU-heavy workloads [ci skip] 2026-03-15 23:00:49 +00:00
netbox regenerate providers.tf: remove vault_root_token variable [ci skip] 2026-03-15 21:21:01 +00:00
networking-toolbox add vaultwarden daily backup CronJob to NFS 2026-03-15 00:03:59 +00:00
nextcloud fix DB password desync + migrate remaining tfvars to Vault 2026-03-15 21:39:45 +00:00
novelapp migrate 16 plan-time stacks: vault data source → ESO + kubernetes_secret 2026-03-15 22:06:39 +00:00
ntfy right-size 14 services and scale down GPU-heavy workloads [ci skip] 2026-03-15 23:00:49 +00:00
ollama fix ollama: remove conditional count on basicAuth (incompatible with ESO data source) 2026-03-15 22:24:36 +00:00
onlyoffice add pod dependency management via Kyverno init container injection 2026-03-15 19:17:57 +00:00
openclaw migrate consuming stacks to ESO + remove k8s-dashboard static token 2026-03-15 19:05:04 +00:00
osm_routing add vaultwarden daily backup CronJob to NFS 2026-03-15 00:03:59 +00:00
owntracks migrate 16 plan-time stacks: vault data source → ESO + kubernetes_secret 2026-03-15 22:06:39 +00:00
paperless-ngx regenerate providers.tf: remove vault_root_token variable [ci skip] 2026-03-15 21:21:01 +00:00
platform post-mortem: kured + containerd cascade outage — alerts + report 2026-03-16 22:06:10 +00:00
plotting-book fix DB password desync + migrate remaining tfvars to Vault 2026-03-15 21:39:45 +00:00
poison-fountain regenerate providers.tf: remove vault_root_token variable [ci skip] 2026-03-15 21:21:01 +00:00
privatebin right-size 14 services and scale down GPU-heavy workloads [ci skip] 2026-03-15 23:00:49 +00:00
real-estate-crawler migrate 16 plan-time stacks: vault data source → ESO + kubernetes_secret 2026-03-15 22:06:39 +00:00
reloader [ci skip] phase 5+6: update CI pipelines for SOPS, add sensitive=true to secret vars 2026-03-07 14:30:36 +00:00
resume regenerate providers.tf: remove vault_root_token variable [ci skip] 2026-03-15 21:21:01 +00:00
rybbit right-size 14 services and scale down GPU-heavy workloads [ci skip] 2026-03-15 23:00:49 +00:00
send add vaultwarden daily backup CronJob to NFS 2026-03-15 00:03:59 +00:00
servarr right-size 14 services and scale down GPU-heavy workloads [ci skip] 2026-03-15 23:00:49 +00:00
shadowsocks regenerate providers.tf: remove vault_root_token variable [ci skip] 2026-03-15 21:21:01 +00:00
speedtest fix DB password desync + migrate remaining tfvars to Vault 2026-03-15 21:39:45 +00:00
stirling-pdf right-size 14 services and scale down GPU-heavy workloads [ci skip] 2026-03-15 23:00:49 +00:00
tandoor fix DB password desync + migrate remaining tfvars to Vault 2026-03-15 21:39:45 +00:00
terminal regenerate providers.tf: remove vault_root_token variable [ci skip] 2026-03-15 21:21:01 +00:00
tor-proxy add vaultwarden daily backup CronJob to NFS 2026-03-15 00:03:59 +00:00
trading-bot fix DB password desync + migrate remaining tfvars to Vault 2026-03-15 21:39:45 +00:00
travel_blog add vaultwarden daily backup CronJob to NFS 2026-03-15 00:03:59 +00:00
tuya-bridge regenerate providers.tf: remove vault_root_token variable [ci skip] 2026-03-15 21:21:01 +00:00
url fix DB password desync + migrate remaining tfvars to Vault 2026-03-15 21:39:45 +00:00
vault add generic multi-user cluster onboarding system 2026-03-15 22:23:36 +00:00
wealthfolio regenerate providers.tf: remove vault_root_token variable [ci skip] 2026-03-15 21:21:01 +00:00
webhook_handler regenerate providers.tf: remove vault_root_token variable [ci skip] 2026-03-15 21:21:01 +00:00
whisper right-size 14 services and scale down GPU-heavy workloads [ci skip] 2026-03-15 23:00:49 +00:00
woodpecker fix: migrate woodpecker database credentials to runtime-refreshed ExternalSecret 2026-03-16 19:12:01 +00:00
ytdlp migrate 16 plan-time stacks: vault data source → ESO + kubernetes_secret 2026-03-15 22:06:39 +00:00