infra/stacks/monitoring
Viktor Barzin a4eafafe49 [monitoring] Add GPUNodeUnschedulable alert — fires when GPU node is cordoned
After k8s-node1 was silently cordoned and broke Frigate camera streams,
existing alerts (NvidiaExporterDown, PodUnschedulable) didn't catch the
root cause proactively. This alert fires within 5m of the GPU node being
cordoned, before any pod restart attempts to schedule and fails.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-22 14:05:12 +00:00
..
modules/monitoring [monitoring] Add GPUNodeUnschedulable alert — fires when GPU node is cordoned 2026-04-22 14:05:12 +00:00
main.tf [registry] Stop recurring orphan OCI-index incidents — detection + prevention + recovery 2026-04-19 17:08:28 +00:00
secrets extract monitoring, nvidia, mailserver, cloudflared, kyverno from platform [ci skip] 2026-03-17 21:34:11 +00:00
terragrunt.hcl extract monitoring, nvidia, mailserver, cloudflared, kyverno from platform [ci skip] 2026-03-17 21:34:11 +00:00