[monitoring] Add GPUNodeUnschedulable alert — fires when GPU node is cordoned

After k8s-node1 was silently cordoned and broke Frigate camera streams,
the existing alerts (NvidiaExporterDown, PodUnschedulable) did not catch
the root cause proactively. This alert fires once the GPU node has been
cordoned for 5m, without waiting for a restarted pod to fail scheduling.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Viktor Barzin 2026-04-22 14:05:12 +00:00
parent e2146e6916
commit a4eafafe49

@@ -750,6 +750,14 @@ serverFiles:
               severity: critical
             annotations:
               summary: "NVIDIA GPU exporter is down - no GPU metrics available"
+          - alert: GPUNodeUnschedulable
+            expr: kube_node_spec_unschedulable{node="k8s-node1"} == 1
+            for: 5m
+            labels:
+              severity: critical
+              subsystem: gpu
+            annotations:
+              summary: "GPU node {{ $labels.node }} is cordoned — Frigate and GPU workloads cannot schedule"
       - name: Power
         rules:
           - alert: OnBattery
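
The new rule could be checked with a promtool unit test along the following lines. This is a sketch under assumptions, not part of this commit: it assumes the alerting rules under serverFiles have been extracted into a standalone rules file, and the file name gpu-alerts.yml is hypothetical. The input series carries only the node label, whereas the real kube-state-metrics series will usually have extra labels (job, instance, and so on) that would also appear on the alert.

```yaml
# Hypothetical test file, e.g. gpu-alerts_test.yml.
# Assumes gpu-alerts.yml (hypothetical name) holds the rule group that
# defines GPUNodeUnschedulable exactly as added in this commit.
rule_files:
  - gpu-alerts.yml

evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      # Node is reported as unschedulable (cordoned) from t=0 through t=10m.
      - series: 'kube_node_spec_unschedulable{node="k8s-node1"}'
        values: '1+0x10'
    alert_rule_test:
      # At 4m the condition holds but "for: 5m" has not elapsed,
      # so no firing alert is expected (exp_alerts left out).
      - eval_time: 4m
        alertname: GPUNodeUnschedulable
      # At 6m the condition has held for more than 5m, so the alert fires.
      - eval_time: 6m
        alertname: GPUNodeUnschedulable
        exp_alerts:
          - exp_labels:
              node: k8s-node1
              severity: critical
              subsystem: gpu
            exp_annotations:
              summary: "GPU node k8s-node1 is cordoned — Frigate and GPU workloads cannot schedule"
```

Under these assumptions the test would be run with `promtool test rules gpu-alerts_test.yml`. The 4m case is there to confirm that a brief cordon during routine maintenance does not page immediately.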