infra

Viktor Barzin 256122ff5b	monitoring: make ClusterCannotTolerateNonGpuNodeLoss topology-agnostic	2026-06-29 12:34:01 +00:00
monitoring	monitoring: make ClusterCannotTolerateNonGpuNodeLoss topology-agnostic	2026-06-29 12:34:01 +00:00

ci/woodpecker/push/default Pipeline was successful

monitoring: make ClusterCannotTolerateNonGpuNodeLoss topology-agnostic

The N-1 capacity alert was hardcoded to k8s-node[234]/[1234], predating node5/node6 (added 2026-05-26) and the 2026-06-29 removal of node6 — so it no longer reflected the real cluster and gave no trustworthy N-1 signal. Generalize node selection via metrics: GPU node by nvidia_com_gpu capacity, drained/cordoned by kube_node_spec_unschedulable, down by the Ready condition. Control-plane excluded by name (node!~"k8s-master.*") because this cluster's kube-state-metrics exposes neither kube_node_role nor node taints/labels (verified live).
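
The selection logic described above can be sketched in plain Python. This is only an illustration of the filtering, not the alert rule itself: the `Node` fields stand in for the metrics named in the message, and the node inventory is hypothetical.

```python
import re
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    gpu_capacity: int    # stands in for the nvidia_com_gpu capacity metric
    unschedulable: bool  # stands in for kube_node_spec_unschedulable
    ready: bool          # stands in for the Ready condition


# PromQL label regexes are fully anchored, so node!~"k8s-master.*"
# corresponds to a full match here.
CONTROL_PLANE = re.compile(r"k8s-master.*")


def eligible_nodes(nodes):
    """Names of non-GPU, schedulable, Ready, non-control-plane nodes."""
    return [
        n.name
        for n in nodes
        if n.gpu_capacity == 0
        and not n.unschedulable
        and n.ready
        and not CONTROL_PLANE.fullmatch(n.name)
    ]


# Hypothetical inventory, not the real cluster.
nodes = [
    Node("k8s-master", 0, False, True),   # excluded by name
    Node("gpu-worker", 1, False, True),   # excluded: has GPU capacity
    Node("worker-a", 0, False, True),
    Node("worker-b", 0, True, True),      # excluded: cordoned
    Node("worker-c", 0, False, False),    # excluded: not Ready
    Node("worker-d", 0, False, True),
]

selected = eligible_nodes(nodes)
```

Because nothing in the filter names a worker node, adding or removing a worker changes `selected` without touching the rule.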

Also fixes a latent bug (multiplying by kube_node_spec_unschedulable==0 zeroed the result) and refreshes the remediation text (krr, not the removed Goldilocks). With node6 gone the rule now correctly evaluates LHS 31.0Gi > RHS 27.9Gi (fires) — the honest signal that removing node6 tightened requests-based N-1; trimming the inflated requests clears it.
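
Why the multiplication zeroed the result: in PromQL a comparison without `bool` filters the vector but keeps the sample values, so `kube_node_spec_unschedulable == 0` yields samples whose value is 0. The sketch below simulates that with dictionaries. The helper names and the per-node figures are hypothetical, and the two alternatives shown are common ways to avoid the problem, not necessarily the form used in this commit.

```python
def filter_eq(vector, value):
    """`v == value` without `bool`: keep matching samples, values unchanged."""
    return {k: v for k, v in vector.items() if v == value}


def bool_eq(vector, value):
    """`v == bool value`: keep every sample, value becomes 1.0 or 0.0."""
    return {k: 1.0 if v == value else 0.0 for k, v in vector.items()}


def multiply(lhs, rhs):
    """One-to-one `lhs * on(node) rhs`: only nodes present on both sides."""
    return {k: lhs[k] * rhs[k] for k in lhs if k in rhs}


def and_on(lhs, rhs):
    """`lhs and on(node) rhs`: keep lhs samples whose node appears in rhs."""
    return {k: v for k, v in lhs.items() if k in rhs}


# Hypothetical per-node allocatable memory in GiB and cordon state.
allocatable = {"worker-a": 16.0, "worker-b": 16.0, "worker-c": 8.0}
unschedulable = {"worker-a": 0.0, "worker-b": 0.0, "worker-c": 1.0}

# Buggy form: every surviving sample is multiplied by 0.
buggy = multiply(allocatable, filter_eq(unschedulable, 0.0))

# Alternative 1: use the filtered vector as a set, not as a factor.
fixed_and = and_on(allocatable, filter_eq(unschedulable, 0.0))

# Alternative 2: multiply by a 0/1 mask produced with `bool`.
fixed_bool = multiply(allocatable, bool_eq(unschedulable, 0.0))

buggy_total = sum(buggy.values())
fixed_and_total = sum(fixed_and.values())
fixed_bool_total = sum(fixed_bool.values())
```

Both alternatives give the same total here. They differ only in whether the cordoned node drops out of the vector or stays in it with a value of 0.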

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>