fix: increase tier-2-gpu quota to 12Gi, add NvidiaExporterDown alert

- Increase tier-2-gpu requests.memory from 8Gi to 12Gi to give immich
  ML pods scheduling headroom (was at 96% utilization)
- Add critical NvidiaExporterDown Prometheus alert that fires when GPU
  metrics are absent for >10 minutes (faster than generic ScrapeTargetDown)
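
For reference, a minimal sketch of what such an alert rule could look like,
written as a Terraform kubernetes_manifest to match the quota change in this
commit. This is not the rule added here: the resource name, namespace, group
name, job label, expression, and summary text are all assumptions chosen for
illustration, and it presumes the prometheus-operator PrometheusRule CRD is
installed.

```hcl
# Hypothetical sketch only: names, namespace, job label, and expression are
# assumptions, not the contents of this commit.
resource "kubernetes_manifest" "nvidia_exporter_down_rule" {
  manifest = {
    apiVersion = "monitoring.coreos.com/v1"
    kind       = "PrometheusRule"
    metadata = {
      name      = "nvidia-exporter-down" # assumed name
      namespace = "monitoring"           # assumed namespace
    }
    spec = {
      groups = [{
        name = "gpu" # assumed group name
        rules = [{
          alert = "NvidiaExporterDown"
          # Fires when no target of the (assumed) nvidia-exporter job is up,
          # whether the target is down or missing from discovery entirely.
          expr = "absent(up{job=\"nvidia-exporter\"} == 1)"
          # Quoted because `for` is also an HCL keyword.
          "for" = "10m"
          labels = {
            severity = "critical"
          }
          annotations = {
            summary = "No GPU metrics from the NVIDIA exporter for over 10 minutes"
          }
        }]
      }]
    }
  }
}
```

Wrapping the `up == 1` comparison in `absent()` covers both a failing scrape and
a target that has disappeared, which a plain `up == 0` check would miss.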
Viktor Barzin 2026-03-23 03:04:33 +02:00
parent 20d0404a42
commit 877cd15b45
2 changed files with 65 additions and 1 deletions


@@ -615,7 +615,7 @@ resource "kubernetes_manifest" "generate_resourcequota_by_tier" {
 spec = {
 hard = {
 "requests.cpu" = "8"
-"requests.memory" = "8Gi"
+"requests.memory" = "12Gi"
 "limits.memory" = "32Gi"
 pods = "40"
 }