From 256122ff5b3d15ad89da0f678febdd636ad3456a Mon Sep 17 00:00:00 2001
From: Viktor Barzin <vbarzin@gmail.com>
Date: Mon, 29 Jun 2026 12:34:01 +0000
Subject: [PATCH] monitoring: make ClusterCannotTolerateNonGpuNodeLoss
 topology-agnostic
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

The N-1 capacity alert was hardcoded to k8s-node[234]/[1234], predating node5/node6 (added 2026-05-26) and the 2026-06-29 removal of node6 — so it no longer reflected the real cluster and gave no trustworthy N-1 signal. Generalize node selection via metrics: GPU node by nvidia_com_gpu capacity, drained/cordoned by kube_node_spec_unschedulable, down by the Ready condition. Control-plane excluded by name (node!~"k8s-master.*") because this cluster's kube-state-metrics exposes neither kube_node_role nor node taints/labels (verified live).

Also fixes a latent bug (multiplying by kube_node_spec_unschedulable==0 zeroed the result) and refreshes the remediation text (krr, not the removed Goldilocks). With node6 gone the rule now correctly evaluates LHS 31.0Gi > RHS 27.9Gi (fires) — the honest signal that removing node6 tightened requests-based N-1; trimming the inflated requests clears it.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
---
 .../monitoring/prometheus_chart_values.tpl    | 55 ++++++++++++-------
 1 file changed, 34 insertions(+), 21 deletions(-)

diff --git a/stacks/monitoring/modules/monitoring/prometheus_chart_values.tpl b/stacks/monitoring/modules/monitoring/prometheus_chart_values.tpl
index 8f4de538..bf59bb97 100755
--- a/stacks/monitoring/modules/monitoring/prometheus_chart_values.tpl
+++ b/stacks/monitoring/modules/monitoring/prometheus_chart_values.tpl
@@ -1441,28 +1441,41 @@ serverFiles:
             annotations:
               summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} ({{ $labels.container }}) cannot pull image"
               description: "Check the deployment's image reference — often a stale tag, a removed registry, or a credentials mismatch. `kubectl -n {{ $labels.namespace }} describe pod {{ $labels.pod }}` shows the pull error."
-          # N-1 capacity check: if any non-GPU worker (node2/3/4) died, would
-          # its memory requests fit on the remaining Ready workers (incl. node1
-          # GPU node — its taint is PreferNoSchedule, soft)? Fires when the
-          # most-loaded non-GPU worker holds more memory requests than the rest
-          # of the cluster has free.
+          # N-1 capacity check (topology-agnostic — auto-tracks node add/remove/drain).
+          # If the most-loaded non-GPU worker died, would its memory REQUESTS
+          # reschedule onto the remaining Ready + schedulable workers (incl. the GPU
+          # node, whose taint is soft/PreferNoSchedule)? Fires when that worker holds
+          # more memory requests than the rest of the eligible pool has free.
+          # Node selection is dynamic via metrics: GPU node by nvidia_com_gpu capacity,
+          # drained/cordoned by kube_node_spec_unschedulable, down by the Ready
+          # condition. The control-plane is excluded by name (node!~"k8s-master.*")
+          # because this cluster's kube-state-metrics exposes neither kube_node_role
+          # nor node taints/labels — revisit if an HA control-plane is added.
           - alert: ClusterCannotTolerateNonGpuNodeLoss
             expr: |
               max(
-                sum by (node) (
-                  kube_pod_container_resource_requests{resource="memory",unit="byte",node=~"k8s-node[234]"}
+                (
+                  sum by (node) (
+                    kube_pod_container_resource_requests{resource="memory",unit="byte",node!~"k8s-master.*"}
+                  )
+                  * on(node) (kube_node_status_condition{condition="Ready",status="true"} == 1)
                 )
+                unless on(node) (kube_node_spec_unschedulable == 1)
+                unless on(node) (kube_node_status_capacity{resource="nvidia_com_gpu"} > 0)
               )
               >
               sum(
-                clamp_min(
-                  kube_node_status_allocatable{resource="memory",unit="byte",node=~"k8s-node[1234]"}
-                  - on(node) group_left() sum by (node) (
-                      kube_pod_container_resource_requests{resource="memory",unit="byte",node=~"k8s-node[1234]"}
-                    ),
-                  0
+                (
+                  clamp_min(
+                    kube_node_status_allocatable{resource="memory",unit="byte",node!~"k8s-master.*"}
+                    - on(node) group_left() sum by (node) (
+                        kube_pod_container_resource_requests{resource="memory",unit="byte",node!~"k8s-master.*"}
+                      ),
+                    0
+                  )
+                  * on(node) (kube_node_status_condition{condition="Ready",status="true"} == 1)
                 )
-                and on(node) (kube_node_status_condition{condition="Ready",status="true"} == 1)
+                unless on(node) (kube_node_spec_unschedulable == 1)
               )
             for: 15m
             labels:
@@ -1470,13 +1483,13 @@ serverFiles:
             annotations:
               summary: "Cluster cannot tolerate losing any non-GPU worker — memory requests won't fit on the rest"
               description: |
-                The most-loaded non-GPU worker (k8s-node2/3/4) has more memory
-                requests pinned to it than the rest of the workers (incl. node1
-                GPU node) currently have free. If that node went down, its
-                pods would not reschedule and stay Pending.
-                Remediation: right-size top reservers via Goldilocks (immich-server,
-                frigate, prometheus, pg-cluster, paperless) or bump VM RAM on
-                k8s-node2/k8s-node3 from 32GB → 48GB to match node1.
+                The most-loaded non-GPU worker has more memory requests pinned to it
+                than the rest of the eligible worker pool (incl. the GPU node)
+                currently has free. If that node went down, some of its pods would
+                not reschedule and would stay Pending.
+                Remediation: right-size the top memory reservers with `krr` (trim
+                over-provisioned requests — e.g. claude-agent, stirling-pdf, traefik,
+                authentik-worker), or add/return a worker node.
       # Goldmane edge-aggregator (ADR-0014 / infra #58, #61): the durable
       # who-talks-to-whom trail. The aggregator pod has NO /metrics endpoint,
       # so its health is inferred from kube-state-metrics signals — the trail