immich(server,ml): bump server to 4Gi + Recreate strategy on tight quota
Root cause of the 502/503/decode errors clustered at 19:20 BST on 2026-04-26: immich-server hit its 3500Mi memory limit during a face-detection burst and was OOMKilled (exit code 137). The VPA upperBound is 3050Mi, but real-world bursts crossed it. Because the single pod runs both the API and the microservices workers, the OOM took the API down for ~30s while it restarted, which surfaced in the iOS app as PlatformException image-decode errors, 502 on uploads, and 503 on ActivityService.

Bump immich-server requests=limits to 4096Mi (per the CLAUDE.md "upperBound x 1.3 for volatile workloads" rule, with headroom over the OOM mark). Quota math: 9680Mi used - 2000Mi old request + 4096Mi new request = 11776Mi, which fits the tier-2-gpu 12Gi cap.

Switch both immich-server and immich-machine-learning to the Recreate strategy: the namespace's tier-2-gpu quota is too tight for RollingUpdate to keep an old and a new pod up during apply (transient 13776Mi > 12Gi cap, see "ResourceQuota blocks rolling updates" in CLAUDE.md). With single replicas and Recreate, future memory tweaks no longer require a manual scale-to-0 dance.

Verified: the new pod has limits.memory=4Gi, quota usage is stable at 11776Mi/12Gi, and the immich API is serving normally.

Note: a pending node_selector drift on immich-machine-learning (gpu=true -> nvidia.com/gpu.present=true) was also reconciled in this apply. The canonical NVIDIA operator label is already on the GPU node, so there is no scheduling impact.
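The quota arithmetic in the message above can be written out as a small Terraform sketch. This is illustrative only and not part of the commit: every local and output name here is hypothetical, and the numbers are copied from the commit message (12Gi is taken as 12 * 1024 Mi).

```hcl
# Hypothetical helper values; none of these names exist in the repository.
locals {
  quota_cap_mi       = 12 * 1024 # tier-2-gpu cap of 12Gi, expressed in Mi
  used_before_mi     = 9680      # namespace usage before the change
  old_request_mi     = 2000      # previous immich-server memory request
  new_request_mi     = 4096      # new immich-server memory request
  vpa_upper_bound_mi = 3050      # VPA upperBound quoted in the message

  # "upperBound x 1.3" rule: the minimum the new request should cover.
  volatile_floor_mi = ceil(local.vpa_upper_bound_mi * 1.3)

  # Recreate: the old pod is gone before the new one is admitted.
  steady_state_mi = local.used_before_mi - local.old_request_mi + local.new_request_mi

  # RollingUpdate: old and new pods briefly count against the quota together.
  rolling_peak_mi = local.used_before_mi + local.new_request_mi
}

output "meets_volatile_rule" {
  value = local.new_request_mi >= local.volatile_floor_mi
}

output "fits_with_recreate" {
  value = local.steady_state_mi <= local.quota_cap_mi
}

output "fits_with_rolling_update" {
  value = local.rolling_peak_mi <= local.quota_cap_mi
}
```

With the figures from the message, the steady state is 11776Mi and the rolling peak is 13776Mi against a 12288Mi cap, which is why only the Recreate strategy fits.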
parent 07bc0098e3
commit d093aed7f6
1 changed file with 4 additions and 4 deletions
@@ -188,7 +188,7 @@ resource "kubernetes_deployment" "immich_server" {
     }

     strategy {
-      type = "RollingUpdate"
+      type = "Recreate"
     }

     template {
@@ -311,10 +311,10 @@ resource "kubernetes_deployment" "immich_server" {
           resources {
             requests = {
               cpu    = "100m"
-              memory = "2000Mi"
+              memory = "4096Mi"
             }
             limits = {
-              memory = "3500Mi"
+              memory = "4096Mi"
             }
           }
         }
@@ -579,7 +579,7 @@ resource "kubernetes_deployment" "immich-machine-learning" {
       }
     }
     strategy {
-      type = "RollingUpdate"
+      type = "Recreate"
     }
     template {
       metadata {