monitoring: make ClusterCannotTolerateNonGpuNodeLoss topology-agnostic
All checks were successful
ci/woodpecker/push/default Pipeline was successful
The N-1 capacity alert was hardcoded to k8s-node[234]/[1234], predating node5/node6 (added 2026-05-26) and the 2026-06-29 removal of node6, so it no longer reflected the real cluster and gave no trustworthy N-1 signal.

Generalize node selection via metrics: GPU node by nvidia_com_gpu capacity, drained/cordoned by kube_node_spec_unschedulable, down by the Ready condition. The control plane is excluded by name (node!~"k8s-master.*") because this cluster's kube-state-metrics exposes neither kube_node_role nor node taints/labels (verified live).

Also fixes a latent bug (multiplying by kube_node_spec_unschedulable==0 zeroed the result) and refreshes the remediation text (krr, not the removed Goldilocks).

With node6 gone the rule now correctly evaluates LHS 31.0Gi > RHS 27.9Gi (fires): the honest signal that removing node6 tightened requests-based N-1. Trimming the inflated requests clears it.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
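The latent bug named in the commit message follows from how PromQL comparison operators behave without the `bool` modifier: they filter the vector and keep each sample's original value. A sample that passes `kube_node_spec_unschedulable == 0` therefore still carries the value 0, and multiplying by it zeroes the product. The Python sketch below only models that behaviour; the helper names and the node inventory are hypothetical, and it is not the actual pre-fix expression.

```python
# Toy model of PromQL instant vectors as {node_label: value} dicts.
# Helper names and sample data are hypothetical, for illustration only.

def filter_eq(samples, target):
    """Model `vector == scalar` without `bool`: keep matching samples, values unchanged."""
    return {node: value for node, value in samples.items() if value == target}

def mul_on_node(left, right):
    """Model `left * on(node) right`: multiply samples that share a node label."""
    return {node: left[node] * right[node] for node in left.keys() & right.keys()}

def unless_on_node(left, right):
    """Model `left unless on(node) right`: drop left samples whose node appears in right."""
    return {node: value for node, value in left.items() if node not in right}

requests_gib = {"node-a": 12.0, "node-b": 9.5, "node-c": 4.0}
unschedulable = {"node-a": 0, "node-b": 0, "node-c": 1}  # node-c is cordoned
ready = {"node-a": 1, "node-b": 1, "node-c": 1}

# Buggy pattern: the surviving samples have value 0, so every product is 0.
buggy = mul_on_node(requests_gib, filter_eq(unschedulable, 0))

# Pattern used by the new rule: multiply by Ready == 1 (value 1, harmless),
# then remove cordoned nodes with `unless`.
fixed = unless_on_node(
    mul_on_node(requests_gib, filter_eq(ready, 1)),
    filter_eq(unschedulable, 1),
)

print(buggy)
print(fixed)
```

In this model both patterns select the same nodes; only the `unless` form preserves the request values, which is why the rewritten rule multiplies by the Ready condition and subtracts cordoned nodes separately.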
parent 6c3619c9c6
commit 256122ff5b
1 changed file with 34 additions and 21 deletions
@@ -1441,28 +1441,41 @@ serverFiles:
     annotations:
       summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} ({{ $labels.container }}) cannot pull image"
       description: "Check the deployment's image reference — often a stale tag, a removed registry, or a credentials mismatch. `kubectl -n {{ $labels.namespace }} describe pod {{ $labels.pod }}` shows the pull error."
-  # N-1 capacity check: if any non-GPU worker (node2/3/4) died, would
-  # its memory requests fit on the remaining Ready workers (incl. node1
-  # GPU node — its taint is PreferNoSchedule, soft)? Fires when the
-  # most-loaded non-GPU worker holds more memory requests than the rest
-  # of the cluster has free.
+  # N-1 capacity check (topology-agnostic — auto-tracks node add/remove/drain).
+  # If the most-loaded non-GPU worker died, would its memory REQUESTS
+  # reschedule onto the remaining Ready + schedulable workers (incl. the GPU
+  # node, whose taint is soft/PreferNoSchedule)? Fires when that worker holds
+  # more memory requests than the rest of the eligible pool has free.
+  # Node selection is dynamic via metrics: GPU node by nvidia_com_gpu capacity,
+  # drained/cordoned by kube_node_spec_unschedulable, down by the Ready
+  # condition. The control-plane is excluded by name (node!~"k8s-master.*")
+  # because this cluster's kube-state-metrics exposes neither kube_node_role
+  # nor node taints/labels — revisit if an HA control-plane is added.
   - alert: ClusterCannotTolerateNonGpuNodeLoss
     expr: |
       max(
-        sum by (node) (
-          kube_pod_container_resource_requests{resource="memory",unit="byte",node=~"k8s-node[234]"}
+        (
+          sum by (node) (
+            kube_pod_container_resource_requests{resource="memory",unit="byte",node!~"k8s-master.*"}
+          )
+          * on(node) (kube_node_status_condition{condition="Ready",status="true"} == 1)
         )
+        unless on(node) (kube_node_spec_unschedulable == 1)
+        unless on(node) (kube_node_status_capacity{resource="nvidia_com_gpu"} > 0)
       )
       >
       sum(
-        clamp_min(
-          kube_node_status_allocatable{resource="memory",unit="byte",node=~"k8s-node[1234]"}
-          - on(node) group_left() sum by (node) (
-            kube_pod_container_resource_requests{resource="memory",unit="byte",node=~"k8s-node[1234]"}
-          ),
-          0
+        (
+          clamp_min(
+            kube_node_status_allocatable{resource="memory",unit="byte",node!~"k8s-master.*"}
+            - on(node) group_left() sum by (node) (
+              kube_pod_container_resource_requests{resource="memory",unit="byte",node!~"k8s-master.*"}
+            ),
+            0
+          )
+          * on(node) (kube_node_status_condition{condition="Ready",status="true"} == 1)
         )
-        and on(node) (kube_node_status_condition{condition="Ready",status="true"} == 1)
+        unless on(node) (kube_node_spec_unschedulable == 1)
       )
     for: 15m
     labels:
@@ -1470,13 +1483,13 @@ serverFiles:
     annotations:
       summary: "Cluster cannot tolerate losing any non-GPU worker — memory requests won't fit on the rest"
       description: |
-        The most-loaded non-GPU worker (k8s-node2/3/4) has more memory
-        requests pinned to it than the rest of the workers (incl. node1
-        GPU node) currently have free. If that node went down, its
-        pods would not reschedule and stay Pending.
-        Remediation: right-size top reservers via Goldilocks (immich-server,
-        frigate, prometheus, pg-cluster, paperless) or bump VM RAM on
-        k8s-node2/k8s-node3 from 32GB → 48GB to match node1.
+        The most-loaded non-GPU worker has more memory requests pinned to it
+        than the rest of the eligible worker pool (incl. the GPU node)
+        currently has free. If that node went down, some of its pods would
+        not reschedule and would stay Pending.
+        Remediation: right-size the top memory reservers with `krr` (trim
+        over-provisioned requests — e.g. claude-agent, stirling-pdf, traefik,
+        authentik-worker), or add/return a worker node.
   # Goldmane edge-aggregator (ADR-0014 / infra #58, #61): the durable
   # who-talks-to-whom trail. The aggregator pod has NO /metrics endpoint,
   # so its health is inferred from kube-state-metrics signals — the trail
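To make the new rule's node selection easier to reason about, here is a minimal Python sketch of the comparison it performs, assuming a simplified in-memory inventory. The `Node` class, the `n_minus_one` function, and every node name and number below are hypothetical; they are not taken from the cluster and do not reproduce the 31.0Gi / 27.9Gi evaluation quoted in the commit message.

```python
from dataclasses import dataclass

@dataclass
class Node:
    # Hypothetical stand-in for the per-node metrics the rule reads.
    name: str
    requests_gib: float      # summed pod memory requests on the node
    allocatable_gib: float   # allocatable memory on the node
    ready: bool = True       # Ready condition is "true"
    unschedulable: bool = False  # cordoned or drained
    gpus: int = 0            # nvidia_com_gpu capacity

def is_control_plane(node: Node) -> bool:
    # Mirrors node!~"k8s-master.*": PromQL regexes are fully anchored,
    # so this excludes names that start with "k8s-master".
    return node.name.startswith("k8s-master")

def n_minus_one(nodes: list[Node]) -> tuple[float, float, bool]:
    """Return (lhs, rhs, fires) for the requests-based N-1 check."""
    eligible = [
        n for n in nodes
        if not is_control_plane(n) and n.ready and not n.unschedulable
    ]
    # LHS: largest request total on any eligible non-GPU worker.
    # (In PromQL an empty selection yields no result; 0.0 is a simplification.)
    lhs = max((n.requests_gib for n in eligible if n.gpus == 0), default=0.0)
    # RHS: free memory summed over the eligible pool, GPU node included,
    # with negative headroom clamped to zero like clamp_min(..., 0).
    rhs = sum(max(n.allocatable_gib - n.requests_gib, 0.0) for n in eligible)
    return lhs, rhs, lhs > rhs

healthy = [
    Node("k8s-master", 2.0, 8.0),
    Node("gpu-node", 10.0, 30.0, gpus=1),
    Node("worker-a", 20.0, 24.0),
    Node("worker-b", 18.0, 24.0),
    Node("worker-c", 5.0, 24.0, unschedulable=True),
]
# Same inventory, but with the GPU node cordoned as well.
gpu_cordoned = [
    Node("k8s-master", 2.0, 8.0),
    Node("gpu-node", 10.0, 30.0, gpus=1, unschedulable=True),
    Node("worker-a", 20.0, 24.0),
    Node("worker-b", 18.0, 24.0),
    Node("worker-c", 5.0, 24.0, unschedulable=True),
]

print(n_minus_one(healthy))
print(n_minus_one(gpu_cordoned))
```

As the expression is written, the right-hand side sums free memory across the whole eligible pool, including the most-loaded worker itself; the sketch keeps that behaviour rather than subtracting the failed node's headroom.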