infra/stacks/infra
OpenClaw 2d5f44d1b3 feat(monitoring): Enhance disk monitoring and containerd GC after node2 incident
IMMEDIATE CHANGES (Active Now):
- Lower disk usage thresholds: WARN at 70%, FAIL at 85% (was 80%/90%)
- More aggressive alerting to prevent containerd corruption
- Enhanced cluster health check disk monitoring
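
A minimal sketch of how the new thresholds could be applied in a health check, assuming the check classifies each filesystem by its used percentage. The function names, the inclusive comparison (exactly 70% counts as WARN), and the use of `shutil.disk_usage` are assumptions for illustration; the commit does not show the actual check script.

```python
import shutil

# Thresholds from the commit message: 70% -> WARN, 85% -> FAIL (was 80%/90%).
WARN_AT = 70.0
FAIL_AT = 85.0


def classify_disk_usage(used_percent, warn_at=WARN_AT, fail_at=FAIL_AT):
    """Return "OK", "WARN", or "FAIL" for a used-space percentage.

    Assumption: thresholds are inclusive, so exactly 70.0 is already WARN.
    """
    if not 0.0 <= used_percent <= 100.0:
        raise ValueError(f"used_percent out of range: {used_percent}")
    if used_percent >= fail_at:
        return "FAIL"
    if used_percent >= warn_at:
        return "WARN"
    return "OK"


def used_percent_of(path="/"):
    """Used space of the filesystem containing `path`, as a percentage."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    return usage.used / usage.total * 100.0


if __name__ == "__main__":
    pct = used_percent_of("/")
    print(f"/ is {pct:.1f}% used: {classify_disk_usage(pct)}")
```

With the previous 80%/90% thresholds, a node at 75% usage would still have reported OK; under the new values it reports WARN.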

INFRASTRUCTURE CHANGES (Requires Terraform Apply):
- Add containerd garbage collection configuration (30min intervals)
- Tighten kubelet eviction thresholds (15%/20%, was 10%/15%)
- Enhanced disk space protection to prevent node2-type failures
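
A hedged illustration of what the tighter eviction thresholds mean in practice. The commit does not say which eviction signals the two percentages belong to. The old pair (10%/15%) matches the kubelet's default hard eviction thresholds for `nodefs.available` and `imagefs.available`, so this sketch assumes the new pair maps to the same two signals. That mapping, and the helper name `breached_signals`, are assumptions rather than something read from `main.tf`.

```python
from typing import Dict, List

# Assumed mapping: percentages are minimum *available* space per signal.
OLD_EVICTION_HARD = {"nodefs.available": 10.0, "imagefs.available": 15.0}
NEW_EVICTION_HARD = {"nodefs.available": 15.0, "imagefs.available": 20.0}


def breached_signals(available: Dict[str, float],
                     thresholds: Dict[str, float]) -> List[str]:
    """Return the signals whose available percentage is below its threshold.

    Signals missing from `available` are skipped rather than treated as breached.
    """
    return sorted(
        signal
        for signal, minimum in thresholds.items()
        if signal in available and available[signal] < minimum
    )


if __name__ == "__main__":
    # Hypothetical node with 12% free on nodefs and 18% free on imagefs.
    node = {"nodefs.available": 12.0, "imagefs.available": 18.0}
    print("old:", breached_signals(node, OLD_EVICTION_HARD))
    print("new:", breached_signals(node, NEW_EVICTION_HARD))
```

Under this reading, the hypothetical node would not trigger eviction with the old values but would breach both signals with the new ones, so pods are evicted earlier and the disk has more headroom before it fills.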

Root Cause: node2 disk exhaustion corrupted containerd image store
Prevention: Proactive monitoring + aggressive cleanup policies

[ci skip] - Infrastructure changes require SOPS access for apply
2026-03-17 16:51:02 +00:00
File | Last commit | Date
.terraform.lock.hcl | [ci skip] Add infra stack (Proxmox VMs) | 2026-02-22 13:04:49 +00:00
backend.tf | [ci skip] Move Terraform modules into stack directories | 2026-02-22 14:38:14 +00:00
main.tf | feat(monitoring): Enhance disk monitoring and containerd GC after node2 incident | 2026-03-17 16:51:02 +00:00
providers.tf | [ci skip] phase 5+6: update CI pipelines for SOPS, add sensitive=true to secret vars | 2026-03-07 14:30:36 +00:00
terragrunt.hcl | Use --queue-ignore-errors for CI (infra stack needs Proxmox SSH) | 2026-02-22 18:29:27 +00:00