feat(monitoring): Enhance disk monitoring and containerd GC after node2 incident

IMMEDIATE CHANGES (Active Now):
- Lower disk warning threshold: 70% → WARN, 85% → FAIL (was 80%/90%)
- More aggressive alerting to prevent containerd corruption
- Enhanced cluster health check disk monitoring

INFRASTRUCTURE CHANGES (Requires Terraform Apply):
- Add containerd garbage collection configuration (30min intervals)
- More aggressive kubelet eviction policies (15%/20% vs 10%/15%)
- Enhanced disk space protection to prevent node2-type failures

Root Cause: node2 disk exhaustion corrupted containerd image store
Prevention: Proactive monitoring + aggressive cleanup policies

[ci skip] - Infrastructure changes require SOPS access for apply
parent 761dcb3a72
commit 2d5f44d1b3
2 changed files with 38 additions and 8 deletions
@@ -864,16 +864,21 @@ for node in data["items"]:
         a = parse_storage(es_alloc)
         if c > 0:
             used_pct = ((c - a) / c) * 100
-            if used_pct > 80:
-                level = "FAIL" if used_pct > 90 else "WARN"
+            if used_pct > 70: # Lower threshold after node2 containerd corruption incident
+                if used_pct > 85:
+                    level = "FAIL" # Critical: Risk of containerd corruption
+                elif used_pct > 75:
+                    level = "WARN" # Warning: Monitor closely
+                else:
+                    level = "WARN" # Early warning
                 print(f"{level}:{name}:{used_pct:.0f}")
     except (ValueError, ZeroDivisionError):
         pass
 ' 2>/dev/null) || true
 
 if [[ -z "$disk_info" ]]; then
-    pass "All nodes below 80% ephemeral-storage usage"
-    json_add "node_disk" "PASS" "All below 80%"
+    pass "All nodes below 70% ephemeral-storage usage"
+    json_add "node_disk" "PASS" "All below 70%"
 else
     [[ "$QUIET" == true ]] && section_always 17 "Node Disk Usage"
     while IFS= read -r line; do
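
Reviewer note: the new threshold logic in this hunk can be exercised in isolation. The following is a minimal sketch, not the script's actual code. The helper names `classify` and `check_node`, the body of `parse_storage`, and its suffix table are all assumptions made for illustration; only the `(c - a) / c` formula, the 70%/85% cut-offs, and the `level:name:pct` output format come from the diff. Because the 75% branch in the diff assigns the same level as the `else` branch, the sketch folds both into a single WARN band.

```python
# Hypothetical stand-alone version of the disk check in this commit.
# parse_storage here is an assumed, simplified quantity parser; the real
# helper in the script is not shown in the diff.

BINARY_SUFFIXES = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40}
DECIMAL_SUFFIXES = {"k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12}


def parse_storage(value: str) -> float:
    """Convert a quantity such as '100Gi' or '47093746742' to bytes."""
    value = value.strip()
    for suffix, factor in {**BINARY_SUFFIXES, **DECIMAL_SUFFIXES}.items():
        if value.endswith(suffix):
            return float(value[: -len(suffix)]) * factor
    return float(value)  # plain byte count, no suffix


def classify(used_pct: float):
    """Map a usage percentage to the levels introduced by this commit."""
    if used_pct > 85:
        return "FAIL"  # critical band
    if used_pct > 70:
        return "WARN"  # covers both the 70-75 and 75-85 branches of the diff
    return None  # below the warning threshold: nothing is reported


def check_node(name: str, capacity: str, allocatable: str):
    """Return 'LEVEL:name:pct' as the script prints it, or None if healthy."""
    c = parse_storage(capacity)
    a = parse_storage(allocatable)
    if c <= 0:
        return None
    used_pct = ((c - a) / c) * 100  # same formula as the diff
    level = classify(used_pct)
    return f"{level}:{name}:{used_pct:.0f}" if level else None


if __name__ == "__main__":
    for node, cap, alloc in [("node1", "100Gi", "50Gi"), ("node2", "100Gi", "10Gi")]:
        print(node, check_node(node, cap, alloc))
```

Both comparisons are strict, so a node at exactly 70% is not reported and a node at exactly 85% is still WARN, which matches the `>` operators in the diff.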