fix: increase tier-2-gpu quota to 12Gi, add NvidiaExporterDown alert
- Increase tier-2-gpu requests.memory from 8Gi to 12Gi to give immich ML pods scheduling headroom (was at 96% utilization)
- Add critical NvidiaExporterDown Prometheus alert that fires when GPU metrics are absent for >10 minutes (faster than generic ScrapeTargetDown)
parent 20d0404a42
commit 877cd15b45
2 changed files with 65 additions and 1 deletions
@@ -615,7 +615,7 @@ resource "kubernetes_manifest" "generate_resourcequota_by_tier" {
     spec = {
       hard = {
         "requests.cpu" = "8"
-        "requests.memory" = "8Gi"
+        "requests.memory" = "12Gi"
         "limits.memory" = "32Gi"
         pods = "40"
       }
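The hunk for the NvidiaExporterDown alert is not shown here. As a rough illustration of what the commit message describes, the alert could be declared as a PrometheusRule through the same `kubernetes_manifest` resource type. This is a minimal sketch, not the actual change: the resource name, namespace, group name, annotation text, and the metric `DCGM_FI_DEV_GPU_TEMP` are all assumptions (the metric assumes a dcgm-exporter style setup, so substitute whichever GPU metric the exporter in use publishes), and it assumes the Prometheus Operator CRDs are installed.

```hcl
# Hypothetical sketch: names, namespace, and metric are illustrative assumptions,
# not taken from the commit.
resource "kubernetes_manifest" "nvidia_exporter_down_rule" {
  manifest = {
    apiVersion = "monitoring.coreos.com/v1"
    kind       = "PrometheusRule"
    metadata = {
      name      = "nvidia-exporter-down" # assumed name
      namespace = "monitoring"           # assumed namespace
    }
    spec = {
      groups = [
        {
          name = "gpu" # assumed group name
          rules = [
            {
              alert = "NvidiaExporterDown"
              # absent() yields a value only when no series matches the selector,
              # so this expression is true while GPU metrics are missing.
              expr = "absent(DCGM_FI_DEV_GPU_TEMP)"
              # "for" is quoted because it is also an HCL keyword.
              "for" = "10m"
              labels = {
                severity = "critical"
              }
              annotations = {
                summary = "No GPU metrics received for more than 10 minutes"
              }
            }
          ]
        }
      ]
    }
  }
}
```

An `absent()` expression is used because the commit message keys the alert on metrics being missing. A rule on `up == 0` would miss the case where the scrape succeeds but the exporter returns no GPU series.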