LimitRange defaults had a 4-8x limit/request ratio causing the scheduler to overpack nodes. When pods burst, nodes OOM-thrashed and became unresponsive (k8s-node3 and k8s-node4 both went down today).

Changes:
- Increase default memory requests across all tiers (ratio now 2x):
  - core/cluster: 64Mi → 256Mi request (512Mi limit)
  - gpu: 256Mi → 1Gi request (2Gi limit)
  - edge/aux/fallback: 64Mi → 128Mi request (256Mi limit)
- Add kubelet memory reservation and eviction thresholds:
  - systemReserved: 512Mi, kubeReserved: 512Mi
  - evictionHard: 500Mi (was 100Mi), evictionSoft: 1Gi (was unset)
  - Applied to all nodes and future node template
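
For illustration, the new per-tier memory defaults could be expressed with the Terraform Kubernetes provider roughly as follows. This is a minimal sketch, not the contents of `tiers.tf`: the local name `tier_memory_defaults`, the resource name `tier_defaults`, the LimitRange name, and the one-namespace-per-tier mapping are all assumptions. Only the request and limit values come from the commit message above.

```hcl
locals {
  # Default memory request/limit per tier, as listed in the commit message.
  # The local name and the split into six separate keys are hypothetical.
  tier_memory_defaults = {
    core     = { request = "256Mi", limit = "512Mi" }
    cluster  = { request = "256Mi", limit = "512Mi" }
    gpu      = { request = "1Gi", limit = "2Gi" }
    edge     = { request = "128Mi", limit = "256Mi" }
    aux      = { request = "128Mi", limit = "256Mi" }
    fallback = { request = "128Mi", limit = "256Mi" }
  }
}

resource "kubernetes_limit_range" "tier_defaults" {
  for_each = local.tier_memory_defaults

  metadata {
    name = "tier-defaults"
    # Assumption: each tier has its own namespace named after the tier.
    namespace = each.key
  }

  spec {
    limit {
      type = "Container"

      # Applied when a container declares no memory request.
      default_request = {
        memory = each.value.request
      }

      # Applied when a container declares no memory limit (2x the request).
      default = {
        memory = each.value.limit
      }
    }
  }
}
```

Keeping the limit at twice the request means the scheduler reserves at least half of what a container may actually use, which bounds how far a node can be overcommitted when pods burst.
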
||
|---|---|---|
| modules | ||
| .gitkeep | ||
| .terraform.lock.hcl | ||
| backend.tf | ||
| main.tf | ||
| providers.tf | ||
| redis-25.3.2.tgz | ||
| secrets | ||
| terragrunt.hcl | ||
| tiers.tf | ||