infra

Viktor Barzin 256122ff5b All checks were successful ci/woodpecker/push/default Pipeline was successful Details monitoring: make ClusterCannotTolerateNonGpuNodeLoss topology-agnostic The N-1 capacity alert was hardcoded to k8s-node[234]/[1234], predating node5/node6 (added 2026-05-26) and the 2026-06-29 removal of node6 — so it no longer reflected the real cluster and gave no trustworthy N-1 signal. Generalize node selection via metrics: GPU node by nvidia_com_gpu capacity, drained/cordoned by kube_node_spec_unschedulable, down by the Ready condition. Control-plane excluded by name (node!~"k8s-master.*") because this cluster's kube-state-metrics exposes neither kube_node_role nor node taints/labels (verified live). Also fixes a latent bug (multiplying by kube_node_spec_unschedulable==0 zeroed the result) and refreshes the remediation text (krr, not the removed Goldilocks). With node6 gone the rule now correctly evaluates LHS 31.0Gi > RHS 27.9Gi (fires) — the honest signal that removing node6 tightened requests-based N-1; trimming the inflated requests clears it. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>		2026-06-29 12:34:01 +00:00
..
modules/monitoring	monitoring: make ClusterCannotTolerateNonGpuNodeLoss topology-agnostic	2026-06-29 12:34:01 +00:00
imports.tf	monitoring: adopt orphaned alert-digest resources into TF state (unblocks apply)	2026-06-16 11:31:17 +00:00
main.tf	fix: restore tree dropped by `6d224861`; land stem95su gdrive-sync (10m) [ci skip]	2026-06-09 08:45:33 +00:00
secrets	fix: restore tree dropped by `6d224861`; land stem95su gdrive-sync (10m) [ci skip]	2026-06-09 08:45:33 +00:00
terragrunt.hcl	fix: restore tree dropped by `6d224861`; land stem95su gdrive-sync (10m) [ci skip]	2026-06-09 08:45:33 +00:00