cluster recovery: fix resource limits and node1 memory

- nvidia quota: requests.memory 8Gi → 12Gi (unblock cuda-validator)
- calibre: startup probe initial_delay 60→120s, timeout 1→5s,
  wait_for_rollout=false (DOCKER_MODS install takes 10+ min)
- immich ML: memory 2Gi → 4Gi (OOMKilled loading CLIP models)

Also done outside TF (not in this commit):
- node1 VM: 16 GiB → 24 GiB RAM (Proxmox)
- tigera-operator: kubectl patch 128→256Mi
- nvidia-driver-daemonset: kubectl patch 1→4Gi memory
- kyverno reports-controller: kubectl patch 128→256Mi
- CNPG operator: kubectl rollout restart
This commit is contained in:
Viktor Barzin 2026-03-15 01:44:28 +00:00
parent a3c198e10e
commit 43b49f7f6c
3 changed files with 6 additions and 4 deletions

View file

@ -513,10 +513,10 @@ resource "kubernetes_deployment" "immich-machine-learning" {
resources {
requests = {
cpu = "100m"
memory = "2Gi"
memory = "4Gi"
}
limits = {
memory = "2Gi"
memory = "4Gi"
"nvidia.com/gpu" = "1"
}
}