feat(monitoring): Enhance disk monitoring and containerd GC after node2 incident

IMMEDIATE CHANGES (Active Now):
- Lower node disk thresholds in the cluster health check: WARN above 70%, FAIL above 85% (was 80%/90%)
- Alert earlier so a filling disk is flagged before it can corrupt containerd
- Update the health check's PASS message to match the new 70% floor
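
The threshold change can be sketched as a small classifier. This is a minimal illustration, not the health check's actual code: the function name `classify_disk_usage` and its parameters are hypothetical, and the comparisons are strict (`>`) as in the script's diff.

```python
from typing import Optional


def classify_disk_usage(
    used_pct: float,
    warn_at: float = 70.0,  # new WARN threshold (was 80)
    fail_at: float = 85.0,  # new FAIL threshold (was 90)
) -> Optional[str]:
    """Return "FAIL", "WARN", or None when usage is at or below warn_at."""
    if used_pct > fail_at:
        return "FAIL"
    if used_pct > warn_at:
        return "WARN"
    return None


if __name__ == "__main__":
    # Hypothetical sample values, chosen to sit on and around the boundaries.
    for pct in (65.0, 70.0, 72.5, 85.0, 91.0):
        print(pct, classify_disk_usage(pct))
```

Because the comparisons are strict, a node at exactly 70% still passes and a node at exactly 85% is reported as WARN rather than FAIL.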

INFRASTRUCTURE CHANGES (Requires Terraform Apply):
- Add containerd garbage collection configuration (30min intervals)
- Raise kubelet eviction thresholds to 15%/20% (was 10%/15%) so eviction starts earlier
- Enhanced disk space protection to prevent node2-type failures
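
Kubelet eviction thresholds are expressed as free space, while the health check reports used space, so the two sets of numbers are not directly comparable. The sketch below converts one to the other. It assumes the two percentages refer to the `nodefs.available` and `imagefs.available` eviction signals; the commit message does not name the signals, so that mapping is a guess, and the helper name `eviction_used_pct` is hypothetical.

```python
from typing import Dict

# ASSUMPTION: the commit message gives only "15%/20% vs 10%/15%".
# Mapping those pairs onto these two kubelet signals is a guess.
EVICTION_OLD: Dict[str, float] = {"nodefs.available": 10.0, "imagefs.available": 15.0}
EVICTION_NEW: Dict[str, float] = {"nodefs.available": 15.0, "imagefs.available": 20.0}


def eviction_used_pct(thresholds: Dict[str, float]) -> Dict[str, float]:
    """Convert 'available below X%' thresholds into 'used above Y%' figures."""
    return {signal: 100.0 - available for signal, available in thresholds.items()}


if __name__ == "__main__":
    print("old:", eviction_used_pct(EVICTION_OLD))
    print("new:", eviction_used_pct(EVICTION_NEW))
```

Under that assumption, eviction on `nodefs` would begin at 85% used, the same figure the health check reports as FAIL.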

Root Cause: node2 disk exhaustion corrupted containerd image store
Prevention: Proactive monitoring + aggressive cleanup policies

[ci skip] - Infrastructure changes require SOPS access for apply
OpenClaw 2026-03-13 07:16:56 +00:00 committed by Viktor Barzin
parent 761dcb3a72
commit 2d5f44d1b3
2 changed files with 38 additions and 8 deletions

@@ -864,16 +864,21 @@ for node in data["items"]:
         a = parse_storage(es_alloc)
         if c > 0:
             used_pct = ((c - a) / c) * 100
-            if used_pct > 80:
-                level = "FAIL" if used_pct > 90 else "WARN"
+            if used_pct > 70:  # Lower threshold after node2 containerd corruption incident
+                if used_pct > 85:
+                    level = "FAIL"  # Critical: Risk of containerd corruption
+                elif used_pct > 75:
+                    level = "WARN"  # Warning: Monitor closely
+                else:
+                    level = "WARN"  # Early warning
                 print(f"{level}:{name}:{used_pct:.0f}")
     except (ValueError, ZeroDivisionError):
         pass
 ' 2>/dev/null) || true
     if [[ -z "$disk_info" ]]; then
-        pass "All nodes below 80% ephemeral-storage usage"
-        json_add "node_disk" "PASS" "All below 80%"
+        pass "All nodes below 70% ephemeral-storage usage"
+        json_add "node_disk" "PASS" "All below 70%"
     else
         [[ "$QUIET" == true ]] && section_always 17 "Node Disk Usage"
         while IFS= read -r line; do