[monitoring] Add GPUNodeUnschedulable alert — fires when GPU node is cordoned
After k8s-node1 was silently cordoned and broke Frigate camera streams, the existing alerts (NvidiaExporterDown, PodUnschedulable) did not catch the root cause proactively. This alert fires once the GPU node has been cordoned for 5m, before any pod restart tries to schedule and fails.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
parent e2146e6916
commit a4eafafe49
1 changed file with 8 additions and 0 deletions
@@ -750,6 +750,14 @@ serverFiles:
               severity: critical
             annotations:
               summary: "NVIDIA GPU exporter is down - no GPU metrics available"
+          - alert: GPUNodeUnschedulable
+            expr: kube_node_spec_unschedulable{node="k8s-node1"} == 1
+            for: 5m
+            labels:
+              severity: critical
+              subsystem: gpu
+            annotations:
+              summary: "GPU node {{ $labels.node }} is cordoned — Frigate and GPU workloads cannot schedule"
       - name: Power
         rules:
           - alert: OnBattery
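
One way to check the new rule's timing is a promtool unit test. The following is a minimal sketch, not part of this commit. It assumes the rule groups under serverFiles have been extracted into a standalone file named alerting_rules.yml; that file name and the test file name are hypothetical. With `for: 5m` and a 1m evaluation interval, a node cordoned at 2m should leave the alert pending at 4m and firing by 8m.

```yaml
# gpu_node_unschedulable_test.yml (hypothetical file name)
rule_files:
  - alerting_rules.yml   # assumption: rule groups extracted from the Helm values

evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      # Node is schedulable at 0m and 1m, then cordoned from 2m onward.
      - series: 'kube_node_spec_unschedulable{node="k8s-node1"}'
        values: '0 0 1+0x10'
    alert_rule_test:
      # 2m after the cordon: still pending, so no firing alert is expected.
      - eval_time: 4m
        alertname: GPUNodeUnschedulable
        exp_alerts: []
      # 6m after the cordon: the 5m hold has elapsed, so the alert fires.
      - eval_time: 8m
        alertname: GPUNodeUnschedulable
        exp_alerts:
          - exp_labels:
              node: k8s-node1
              severity: critical
              subsystem: gpu
            exp_annotations:
              summary: "GPU node k8s-node1 is cordoned — Frigate and GPU workloads cannot schedule"
```

Under those assumptions, the test would be run with `promtool test rules gpu_node_unschedulable_test.yml`.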