monitoring: alert when cluster can't tolerate losing a non-GPU worker
ClusterCannotTolerateNonGpuNodeLoss fires when the most heavily reserved non-GPU worker (k8s-node2/3/4) has more memory requests pinned to it than the rest of the workers (incl. the node1 GPU node) currently have free. If that node went down, its pods would not fit elsewhere and would stay Pending. That is exactly what happened today (2026-05-26) when node4 went NotReady: 4 kyverno pods, woodpecker PVCs, and several deployments were stuck Pending because node2/node3 were at 99% memory-request saturation.

Math, where R(n) is the sum of memory requests on node n and A(n) is its allocatable memory:

max(R(X) for X in non-GPU workers) > sum(clamp_min(A(n) - R(n), 0)) over Ready workers

node1 is included on the right because its taint is PreferNoSchedule (soft), so it does absorb non-GPU pods under pressure.

Currently fires with a 33.96 GiB shortage.

Remediation: right-size the top reservers via Goldilocks (immich-server 8Gi, frigate 5Gi, prometheus 4.4Gi, pg-cluster 3Gi each, paperless 2Gi) or bump VM RAM on k8s-node2/k8s-node3 from 32GB to 48GB to match node1.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
parent 467fa1631d
commit cdbb418f45
1 changed file with 36 additions and 0 deletions
@@ -1290,6 +1290,42 @@ serverFiles:
annotations:
summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} ({{ $labels.container }}) cannot pull image"
description: "Check the deployment's image reference — often a stale tag, a removed registry, or a credentials mismatch. `kubectl -n {{ $labels.namespace }} describe pod {{ $labels.pod }}` shows the pull error."
# N-1 capacity check: if any non-GPU worker (node2/3/4) died, would
# its memory requests fit on the remaining Ready workers (incl. node1
# GPU node — its taint is PreferNoSchedule, soft)? Fires when the
# most-loaded non-GPU worker holds more memory requests than the rest
# of the cluster has free.
- alert: ClusterCannotTolerateNonGpuNodeLoss
expr: |
max(
sum by (node) (
kube_pod_container_resource_requests{resource="memory",unit="byte",node=~"k8s-node[234]"}
)
)
>
sum(
clamp_min(
kube_node_status_allocatable{resource="memory",unit="byte",node=~"k8s-node[1234]"}
- on(node) group_left() sum by (node) (
kube_pod_container_resource_requests{resource="memory",unit="byte",node=~"k8s-node[1234]"}
),
0
)
and on(node) (kube_node_status_condition{condition="Ready",status="true"} == 1)
)
for: 15m
labels:
severity: warning
annotations:
summary: "Cluster cannot tolerate losing any non-GPU worker — memory requests won't fit on the rest"
description: |
The most-loaded non-GPU worker (k8s-node2/3/4) has more memory
requests pinned to it than the rest of the workers (incl. node1
GPU node) currently have free. If that node went down, its
pods would not reschedule and would stay Pending.
Remediation: right-size top reservers via Goldilocks (immich-server,
frigate, prometheus, pg-cluster, paperless) or bump VM RAM on
k8s-node2/k8s-node3 from 32GB → 48GB to match node1.
- name: Infrastructure Health
rules:
- alert: HomeAssistantDown
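
Review note: the arithmetic behind ClusterCannotTolerateNonGpuNodeLoss can be checked offline with a small Python sketch. It is a minimal model under stated assumptions, not part of the commit: the `Node` type, both function names, and every memory figure are hypothetical, chosen only to make the sums easy to follow. `shortage_as_committed` mirrors the expression in the diff (largest non-GPU request total versus free memory summed over all Ready workers). `shortage_strict` is a hypothetical variant that leaves the failing node's own headroom out of the right-hand side.

```python
from dataclasses import dataclass

GIB = 1024 ** 3


@dataclass
class Node:
    allocatable: int      # allocatable memory, bytes
    requested: int        # sum of pod memory requests, bytes
    ready: bool = True
    gpu: bool = False     # True for the soft-tainted GPU node


def shortage_as_committed(nodes: dict[str, Node]) -> int:
    """Largest non-GPU request total minus free memory summed over all
    Ready workers, mirroring the alert expression. Positive means firing."""
    worst = max(n.requested for n in nodes.values() if not n.gpu)
    free = sum(max(n.allocatable - n.requested, 0)
               for n in nodes.values() if n.ready)
    return worst - free


def shortage_strict(nodes: dict[str, Node]) -> int:
    """Hypothetical stricter variant: for each candidate node, count only
    the free memory on the other Ready workers, then take the worst case."""
    return max(
        x.requested - sum(max(n.allocatable - n.requested, 0)
                          for other, n in nodes.items()
                          if other != name and n.ready)
        for name, x in nodes.items() if not x.gpu
    )


# Hypothetical figures, not the cluster's real values.
nodes = {
    "k8s-node1": Node(48 * GIB, 40 * GIB, gpu=True),  # 8 GiB free
    "k8s-node2": Node(32 * GIB, 31 * GIB),            # 1 GiB free
    "k8s-node3": Node(32 * GIB, 30 * GIB),            # 2 GiB free
    "k8s-node4": Node(32 * GIB, 28 * GIB),            # 4 GiB free
}

print(shortage_as_committed(nodes) / GIB)  # 16.0
print(shortage_strict(nodes) / GIB)        # 17.0
```

With these figures both versions report a shortage, but the committed form is 1 GiB more optimistic because node2's own free gibibyte is counted as available after node2 is lost. If that reading of the expression is right, the alert may stay silent in borderline cases; whether the gap matters in practice depends on how much headroom the most-loaded node usually has.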