SEV 1 Resolved

Kured + Containerd Cascade Outage

Owner: Viktor Barzin  •  Duration: ~26 hours  •  Cluster: viktorbarzin.me k8s  •  Date: March 2026

What Broke

Containerd's overlayfs snapshotter became corrupted after kernel-update reboots. Image pulls failed, Calico networking broke, and the outage cascaded node by node.

Why It Took So Long

Kured had no health gating, so it kept rebooting nodes even as the cluster degraded. No alert existed for image pull errors (stage 3 in the cascade), and the reboot-window config used the wrong Helm keys, so it was never applied.

How It Was Fixed

Manually cleaned containerd state on each node. Deployed a sentinel-gate DaemonSet to block reboots while the cluster is unhealthy. Added 6 new Prometheus alerts covering the detection gap.
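The sentinel-gate approach can be sketched as a DaemonSet that runs a pod on every node and toggles a `reboot-blocked=true` label on itself based on cluster health; kured is then pointed at that label via its blocking-pod selector. Everything below (image, label and account names, thresholds, health checks) is illustrative, not the deployed manifest:

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: reboot-sentinel-gate
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: reboot-sentinel-gate
  template:
    metadata:
      labels:
        app: reboot-sentinel-gate
    spec:
      # ServiceAccount (not shown) needs: get/list nodes and pods, patch pods
      serviceAccountName: reboot-sentinel-gate
      containers:
        - name: gate
          image: bitnami/kubectl:1.29   # any image with kubectl + sh works
          env:
            - name: POD_NAME
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
          command: ["/bin/sh", "-c"]
          args:
            - |
              healthy_since=0
              while true; do
                # Count nodes whose Ready condition is not True
                bad_nodes=$(kubectl get nodes -o jsonpath='{range .items[*]}{.status.conditions[?(@.type=="Ready")].status}{"\n"}{end}' | grep -cv '^True$' || true)
                # Count calico-node pods that are not Running
                bad_calico=$(kubectl -n kube-system get pods -l k8s-app=calico-node --field-selector=status.phase!=Running --no-headers 2>/dev/null | wc -l)
                now=$(date +%s)
                if [ "$bad_nodes" -eq 0 ] && [ "$bad_calico" -eq 0 ]; then
                  [ "$healthy_since" -eq 0 ] && healthy_since=$now
                else
                  healthy_since=0
                fi
                # Allow reboots only after 30 minutes of sustained health
                if [ "$healthy_since" -ne 0 ] && [ $((now - healthy_since)) -ge 1800 ]; then
                  kubectl -n kube-system label pod "$POD_NAME" reboot-blocked- >/dev/null 2>&1 || true
                else
                  kubectl -n kube-system label pod "$POD_NAME" reboot-blocked=true --overwrite >/dev/null
                fi
                sleep 60
              done
```

With this in place, kured's `--blocking-pod-selector=reboot-blocked=true` holds reboots while any gate pod carries the label, which gives the "all nodes Ready + calico healthy + 30m cool-down" behaviour described above.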

  • Total outage: 26h
  • Time to detect: ~2h
  • Time to mitigate: ~26h
  • Nodes affected: 5

Failure Cascade

Stage 1: Kernel update → Stage 2: Kured reboot → Stage 3: Snapshotter corrupts → Stage 4: Calico down → Stage 5: Node NotReady → Stage 6: Pods cascade fail

Incident Timeline

T+0
unattended-upgrades installs kernel update
Automatic kernel update applied to all 5 nodes. /var/run/reboot-required created on each host.
T+0 to T+2h
Kured begins rebooting nodes
Kured detects sentinel files and starts rebooting nodes one by one. No health gating — proceeds regardless of cluster state. Reboot window config was not applied (wrong Helm keys).
T+1h
First node: containerd snapshotter corrupts
After reboot, containerd's overlayfs snapshotter is corrupted by the new kernel. Image pulls start failing on this node.
T+2h
Detection: services failing
Noticed services going down. No Prometheus alert for image pull errors existed — detection was manual observation.
T+2h to T+10h
Cascade accelerates
Kured continues rebooting remaining nodes. Each rebooted node suffers the same containerd corruption. Calico-node pods fail to pull images, networking breaks node by node.
T+10h to T+24h
Manual remediation begins
SSH to each node, clean containerd state, restart containerd + kubelet, drain and uncordon. Process repeated for all 5 nodes.
T+26h
Cluster fully recovered
All nodes Ready, all calico-node pods running, all services restored. Post-mortem remediation work begins.

Root Cause

Primary Root Cause

Containerd's overlayfs snapshotter became corrupted after a kernel update reboot. The new kernel was incompatible with existing overlayfs state, causing all subsequent image pulls to fail. This made calico-node (and all other pods) unable to start, breaking cluster networking.

Contributing Factors

  • unattended-upgrades applied the kernel update to all 5 nodes automatically, with no staging
  • Kured's reboot window was never applied (wrong Helm keys), so reboots were not confined to a maintenance window
  • Kured had no cluster-health gating and continued rebooting nodes mid-outage
  • No alert covered image pull errors, so the corruption went unnoticed until services failed

DERP Analysis

Detection

  • No alert for image pull errors — key gap filled by new KubeletImagePullErrors alert
  • Manual detection after ~2h when services started failing
  • Kured's Slack notification was the only signal, but it didn't indicate a problem

Escalation

  • Single operator incident — no formal escalation needed (homelab)
  • Root cause identified by SSH-ing to nodes and checking containerd logs
  • Kured was not stopped quickly enough — continued rebooting during diagnosis

Remediation

  • Cleaned containerd overlayfs state on each node manually
  • Restarted containerd + kubelet on all affected nodes
  • Drained and uncordoned nodes one by one
  • Disabled unattended-upgrades on all nodes

Prevention

  • Sentinel gate DaemonSet: blocks kured unless all nodes Ready + calico healthy + 30m cool-down
  • Fixed kured Helm values: reboot window Mon-Fri 02:00-06:00 London
  • 6 new Prometheus alerts covering node runtime health
  • Containerd cascade inhibition rule to suppress noise
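The fixed reboot window and gating can be sketched as Helm values for the kubereboot/kured chart. Key names follow that chart's values schema (verify against the chart version in use); the blocking label and Prometheus URL are placeholders:

```yaml
# values.yaml sketch for the kubereboot/kured chart (illustrative)
configuration:
  # Only reboot Mon-Fri between 02:00 and 06:00 London time
  rebootDays: ["mon", "tue", "wed", "thu", "fri"]
  startTime: "02:00"
  endTime: "06:00"
  timeZone: "Europe/London"
  # Hold reboots while any pod matching this selector exists on the node;
  # the sentinel-gate DaemonSet toggles this label based on cluster health
  blockingPodSelector:
    - "reboot-blocked=true"
  # Optionally also hold reboots while Prometheus alerts are firing
  prometheusUrl: "http://prometheus.monitoring.svc:9090"
```

The earlier misconfiguration put these settings under the wrong keys, so kured silently fell back to its defaults (reboot any day, any time) — which is why validating rendered Helm output matters as much as the values themselves.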

Detection Chain Coverage

Stage                    | What Happens            | Alert                                 | Latency
1. Kernel update         | reboot-required created | none (future)                         | —
2. Kured reboots         | Slack notification      | Kured built-in                        | immediate
3. Snapshotter corrupts  | Image pull errors       | KubeletImagePullErrors (new)          | ~10m
4. Calico breaks         | DaemonSet mismatch      | CalicoNodeNotReady (new)              | ~5m
5. Node networking fails | Node NotReady           | NodeNotReady (existing)               | ~5m
6. Pods cascade fail     | Replica mismatch        | DeploymentReplicasMismatch (existing) | ~30m
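A sketch of how the two new stage-3 and stage-4 alerts could be expressed as a PrometheusRule. The metric names (`kubelet_runtime_operations_errors_total` from the kubelet, `kube_daemonset_status_number_unavailable` from kube-state-metrics) and thresholds are assumptions matching the latencies above, not a dump of the deployed rules:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: node-runtime-health
  namespace: monitoring
spec:
  groups:
    - name: node-runtime
      rules:
        # Stage 3: image pulls erroring on a node (~10m latency)
        - alert: KubeletImagePullErrors
          expr: |
            rate(kubelet_runtime_operations_errors_total{operation_type="pull_image"}[5m]) > 0
          for: 10m
          labels:
            severity: critical
          annotations:
            summary: "Image pulls failing on {{ $labels.node }}"
        # Stage 4: calico-node pods missing from some nodes (~5m latency)
        - alert: CalicoNodeNotReady
          expr: |
            kube_daemonset_status_number_unavailable{daemonset="calico-node"} > 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "calico-node unavailable on {{ $value }} node(s)"
```

The `for:` durations correspond directly to the latency column in the table, so alert tuning and detection-chain documentation stay in sync.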

Follow-Up Tasks

  • [P0, 7d]   Codify sentinel gate DaemonSet in Terraform (currently kubectl-applied)
  • [P0, 7d]   Investigate KubeletRuntimeOperationsLatency alert currently firing
  • [P0, done] Disable unattended-upgrades on all k8s nodes
  • [P0, done] Fix kured Helm values (reboot window + gated sentinel)
  • [P0, done] Deploy sentinel gate DaemonSet with cluster health checks
  • [P0, done] Add 6 Prometheus alerts for node runtime health
  • [P1, 30d]  Add containerd health check to node provisioning (Terraform/cloud-init)
  • [P1, 30d]  Add Prometheus alert for /var/run/reboot-required existence (early warning)
  • [P2, 90d]  Evaluate switching from overlayfs to native snapshotter
  • [P2, 90d]  Add runbook links to all new alerts