monitoring: make ClusterCannotTolerateNonGpuNodeLoss topology-agnostic

The N-1 capacity alert was hardcoded to k8s-node[234]/[1234]. That selector predates node5/node6 (added 2026-05-26) and the removal of node6 on 2026-06-29, so it no longer reflected the real cluster and gave no trustworthy N-1 signal. Generalize node selection via metrics: the GPU node is identified by nvidia_com_gpu capacity, drained/cordoned nodes by kube_node_spec_unschedulable, and down nodes by the Ready condition. The control plane is excluded by name (node!~"k8s-master.*") because this cluster's kube-state-metrics exposes neither kube_node_role nor node taints/labels (verified live).

Also fixes a latent bug (multiplying by kube_node_spec_unschedulable==0 zeroed the result) and refreshes the remediation text (krr, not the removed Goldilocks). With node6 gone the rule now correctly evaluates LHS 31.0Gi > RHS 27.9Gi (fires) — the honest signal that removing node6 tightened requests-based N-1; trimming the inflated requests clears it.
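
As a rough illustration of the comparison the rule makes, the sketch below re-implements the node selection in Python over made-up figures. The `Node` class, `is_eligible`, `n_minus_1`, `example_nodes` and every GiB value are hypothetical and not taken from the cluster. Like the expression, it sums free memory over every eligible node, including the one assumed lost.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    allocatable_gib: float
    requested_gib: float
    ready: bool = True
    unschedulable: bool = False
    has_gpu: bool = False


def is_eligible(node: Node) -> bool:
    """Ready, schedulable, and not a control-plane node (matched by name)."""
    # Prometheus regexes are fully anchored, so node!~"k8s-master.*"
    # excludes names that start with "k8s-master".
    return (
        not node.name.startswith("k8s-master")
        and node.ready
        and not node.unschedulable
    )


def n_minus_1(nodes: list[Node]) -> tuple[float, float, bool]:
    """Return (lhs, rhs, fires) for the requests-based N-1 check."""
    eligible = [n for n in nodes if is_eligible(n)]
    # LHS: memory requests on the most-loaded eligible non-GPU worker.
    lhs = max((n.requested_gib for n in eligible if not n.has_gpu), default=0.0)
    # RHS: free memory (allocatable - requests, floored at 0) summed over
    # all eligible workers, GPU node included.
    # Simplification: a node with no requests counts as fully free here.
    rhs = sum(max(n.allocatable_gib - n.requested_gib, 0.0) for n in eligible)
    return lhs, rhs, lhs > rhs


def example_nodes(node4_cordoned: bool) -> list[Node]:
    """Hypothetical four-worker cluster; all numbers are invented."""
    return [
        Node("k8s-master", 8.0, 4.0),
        Node("k8s-node1", 48.0, 30.0, has_gpu=True),
        Node("k8s-node2", 32.0, 28.0),
        Node("k8s-node3", 32.0, 27.0),
        Node("k8s-node4", 32.0, 20.0, unschedulable=node4_cordoned),
    ]


if __name__ == "__main__":
    for cordoned in (True, False):
        lhs, rhs, fires = n_minus_1(example_nodes(cordoned))
        print(f"node4 cordoned={cordoned}: LHS={lhs} RHS={rhs} fires={fires}")
```

In this invented cluster, cordoning k8s-node4 removes its free memory from the right-hand side and the check fires; uncordoning it clears the check. That is the same effect the commit describes for the removal of node6.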

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Viktor Barzin 2026-06-29 12:34:01 +00:00
parent 6c3619c9c6
commit 256122ff5b

@@ -1441,28 +1441,41 @@ serverFiles:
           annotations:
             summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} ({{ $labels.container }}) cannot pull image"
             description: "Check the deployment's image reference — often a stale tag, a removed registry, or a credentials mismatch. `kubectl -n {{ $labels.namespace }} describe pod {{ $labels.pod }}` shows the pull error."
-        # N-1 capacity check: if any non-GPU worker (node2/3/4) died, would
-        # its memory requests fit on the remaining Ready workers (incl. node1
-        # GPU node — its taint is PreferNoSchedule, soft)? Fires when the
-        # most-loaded non-GPU worker holds more memory requests than the rest
-        # of the cluster has free.
+        # N-1 capacity check (topology-agnostic — auto-tracks node add/remove/drain).
+        # If the most-loaded non-GPU worker died, would its memory REQUESTS
+        # reschedule onto the remaining Ready + schedulable workers (incl. the GPU
+        # node, whose taint is soft/PreferNoSchedule)? Fires when that worker holds
+        # more memory requests than the rest of the eligible pool has free.
+        # Node selection is dynamic via metrics: GPU node by nvidia_com_gpu capacity,
+        # drained/cordoned by kube_node_spec_unschedulable, down by the Ready
+        # condition. The control-plane is excluded by name (node!~"k8s-master.*")
+        # because this cluster's kube-state-metrics exposes neither kube_node_role
+        # nor node taints/labels — revisit if an HA control-plane is added.
         - alert: ClusterCannotTolerateNonGpuNodeLoss
           expr: |
             max(
-              sum by (node) (
-                kube_pod_container_resource_requests{resource="memory",unit="byte",node=~"k8s-node[234]"}
+              (
+                sum by (node) (
+                  kube_pod_container_resource_requests{resource="memory",unit="byte",node!~"k8s-master.*"}
+                )
+                * on(node) (kube_node_status_condition{condition="Ready",status="true"} == 1)
               )
+              unless on(node) (kube_node_spec_unschedulable == 1)
+              unless on(node) (kube_node_status_capacity{resource="nvidia_com_gpu"} > 0)
             )
             >
             sum(
-              clamp_min(
-                kube_node_status_allocatable{resource="memory",unit="byte",node=~"k8s-node[1234]"}
-                - on(node) group_left() sum by (node) (
-                  kube_pod_container_resource_requests{resource="memory",unit="byte",node=~"k8s-node[1234]"}
-                ),
-                0
+              (
+                clamp_min(
+                  kube_node_status_allocatable{resource="memory",unit="byte",node!~"k8s-master.*"}
+                  - on(node) group_left() sum by (node) (
+                    kube_pod_container_resource_requests{resource="memory",unit="byte",node!~"k8s-master.*"}
+                  ),
+                  0
+                )
+                * on(node) (kube_node_status_condition{condition="Ready",status="true"} == 1)
               )
-              and on(node) (kube_node_status_condition{condition="Ready",status="true"} == 1)
+              unless on(node) (kube_node_spec_unschedulable == 1)
             )
           for: 15m
           labels:
@@ -1470,13 +1483,13 @@ serverFiles:
           annotations:
             summary: "Cluster cannot tolerate losing any non-GPU worker — memory requests won't fit on the rest"
             description: |
-              The most-loaded non-GPU worker (k8s-node2/3/4) has more memory
-              requests pinned to it than the rest of the workers (incl. node1
-              GPU node) currently have free. If that node went down, its
-              pods would not reschedule and stay Pending.
-              Remediation: right-size top reservers via Goldilocks (immich-server,
-              frigate, prometheus, pg-cluster, paperless) or bump VM RAM on
-              k8s-node2/k8s-node3 from 32GB → 48GB to match node1.
+              The most-loaded non-GPU worker has more memory requests pinned to it
+              than the rest of the eligible worker pool (incl. the GPU node)
+              currently has free. If that node went down, some of its pods would
+              not reschedule and would stay Pending.
+              Remediation: right-size the top memory reservers with `krr` (trim
+              over-provisioned requests — e.g. claude-agent, stirling-pdf, traefik,
+              authentik-worker), or add/return a worker node.
         # Goldmane edge-aggregator (ADR-0014 / infra #58, #61): the durable
         # who-talks-to-whom trail. The aggregator pod has NO /metrics endpoint,
         # so its health is inferred from kube-state-metrics signals — the trail