feat(monitoring): Enhance disk monitoring and containerd GC after node2 incident

IMMEDIATE CHANGES (Active Now):
- Lower disk warning threshold: 70% → WARN, 85% → FAIL (was 80%/90%)
- More aggressive alerting to prevent containerd corruption
- Enhanced cluster health check disk monitoring

INFRASTRUCTURE CHANGES (Requires Terraform Apply):
- Add containerd garbage collection configuration (30min intervals)
- More aggressive kubelet eviction policies (15%/20% vs 10%/15%)
- Enhanced disk space protection to prevent node2-type failures

Root Cause: node2 disk exhaustion corrupted containerd image store
Prevention: Proactive monitoring + aggressive cleanup policies

[ci skip] - Infrastructure changes require SOPS access for apply
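The threshold change described above can be sketched in isolation as follows. This is a minimal illustration, not the code in `cluster-health.sh`: the names `WARN_PCT`, `FAIL_PCT`, `used_percent`, and `classify` are hypothetical, and the usage formula is assumed to match the `(capacity - allocatable) / capacity` calculation shown in the diff.

```python
from typing import Optional

# Assumed thresholds, taken from the commit message (previously 80 / 90).
WARN_PCT = 70.0
FAIL_PCT = 85.0


def used_percent(capacity: float, allocatable: float) -> float:
    """Return ephemeral-storage usage as a percentage of capacity."""
    if capacity <= 0:
        raise ValueError("capacity must be positive")
    return (capacity - allocatable) / capacity * 100


def classify(used_pct: float) -> Optional[str]:
    """Map a usage percentage to "FAIL", "WARN", or None (healthy)."""
    if used_pct > FAIL_PCT:
        return "FAIL"  # critical: risk of containerd corruption
    if used_pct > WARN_PCT:
        return "WARN"  # early warning: monitor closely
    return None


if __name__ == "__main__":
    for capacity, allocatable in [(200, 100), (200, 50), (200, 20)]:
        pct = used_percent(capacity, allocatable)
        print(f"{pct:.0f}% -> {classify(pct)}")
```

Both comparisons are strict, so a node at exactly 70% is still reported as healthy and a node at exactly 85% is a WARN rather than a FAIL.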
2026-03-13 07:16:56 +00:00 · 2d5f44d1b3
commit 2d5f44d1b3
parent 761dcb3a72
2 changed files with 38 additions and 8 deletions
--- a/.claude/cluster-health.sh
+++ b/.claude/cluster-health.sh
@@ -864,16 +864,21 @@ for node in data["items"]:
        a = parse_storage(es_alloc)
        if c > 0:
            used_pct = ((c - a) / c) * 100
-            if used_pct > 80:
-                level = "FAIL" if used_pct > 90 else "WARN"
+            if used_pct > 70:  # Lower threshold after node2 containerd corruption incident
+                if used_pct > 85:
+                    level = "FAIL"  # Critical: Risk of containerd corruption
+                elif used_pct > 75:
+                    level = "WARN"  # Warning: Monitor closely
+                else:
+                    level = "WARN"  # Early warning
                print(f"{level}:{name}:{used_pct:.0f}")
    except (ValueError, ZeroDivisionError):
        pass
 ' 2>/dev/null) || true

    if [[ -z "$disk_info" ]]; then
-        pass "All nodes below 80% ephemeral-storage usage"
-        json_add "node_disk" "PASS" "All below 80%"
+        pass "All nodes below 70% ephemeral-storage usage"
+        json_add "node_disk" "PASS" "All below 70%"
    else
        [[ "$QUIET" == true ]] && section_always 17 "Node Disk Usage"
        while IFS= read -r line; do