IMMEDIATE CHANGES (Active Now):
- Lower disk warning threshold: 70% → WARN, 85% → FAIL (was 80%/90%)
- More aggressive alerting to prevent containerd corruption
- Enhanced cluster health check disk monitoring

INFRASTRUCTURE CHANGES (Requires Terraform Apply):
- Add containerd garbage collection configuration (30min intervals)
- More aggressive kubelet eviction policies (15%/20% vs 10%/15%)
- Enhanced disk space protection to prevent node2-type failures

Root Cause: node2 disk exhaustion corrupted containerd image store
Prevention: Proactive monitoring + aggressive cleanup policies

[ci skip] - Infrastructure changes require SOPS access for apply

| | | |
|---|---|---|
| .terraform.lock.hcl | ||
| backend.tf | ||
| main.tf | ||
| providers.tf | ||
| terragrunt.hcl | ||
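
The disk thresholds named in the commit message (70% → WARN, 85% → FAIL) can be illustrated with a minimal sketch. This is not the repository's actual health check: the function names `classify_disk_usage` and `disk_used_percent` are hypothetical, and treating each threshold as inclusive (`>=`) is an assumption, since the commit message does not say how boundary values are handled.

```python
import shutil

# Thresholds taken from the commit message (previously 80% / 90%).
WARN_PERCENT = 70.0
FAIL_PERCENT = 85.0


def classify_disk_usage(used_percent: float) -> str:
    """Map a disk usage percentage to OK, WARN, or FAIL.

    Assumption: a value exactly at a threshold counts as crossing it.
    """
    if used_percent >= FAIL_PERCENT:
        return "FAIL"
    if used_percent >= WARN_PERCENT:
        return "WARN"
    return "OK"


def disk_used_percent(path: str = "/") -> float:
    """Return the percentage of used space on the filesystem holding `path`."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


if __name__ == "__main__":
    pct = disk_used_percent("/")
    print(f"{pct:.1f}% used: {classify_disk_usage(pct)}")
```

Under the old 80%/90% thresholds a node at 75% usage would have reported as healthy; with the values above it is flagged as WARN, which is the earlier signal the commit aims for.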