|
All checks were successful
ci/woodpecker/push/default Pipeline was successful
The N-1 capacity alert was hardcoded to k8s-node[234]/[1234], predating node5/node6 (added 2026-05-26) and the 2026-06-29 removal of node6 — so it no longer reflected the real cluster and gave no trustworthy N-1 signal. Generalize node selection via metrics: GPU node by nvidia_com_gpu capacity, drained/cordoned by kube_node_spec_unschedulable, down by the Ready condition. Control-plane excluded by name (node!~"k8s-master.*") because this cluster's kube-state-metrics exposes neither kube_node_role nor node taints/labels (verified live). Also fixes a latent bug (multiplying by kube_node_spec_unschedulable==0 zeroed the result) and refreshes the remediation text (krr, not the removed Goldilocks). With node6 gone the rule now correctly evaluates LHS 31.0Gi > RHS 27.9Gi (fires) — the honest signal that removing node6 tightened requests-based N-1; trimming the inflated requests clears it. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> |
||
|---|---|---|
| .. | ||
| modules/monitoring | ||
| imports.tf | ||
| main.tf | ||
| secrets | ||
| terragrunt.hcl | ||