infra/stacks/immich
Viktor Barzin d093aed7f6 immich(server,ml): bump server to 4Gi + Recreate strategy on tight quota
Root cause of 502/503/decode errors clustered at 19:20 BST 2026-04-26: immich-server
hit its 3500Mi memory limit during a face-detection burst and was OOMKilled (Exit Code
137). VPA upperBound is 3050Mi, but real-world bursts crossed it. Because the single pod
runs both the API and the microservices workers, the OOM took the API down for the ~30s
restart, which the iOS app saw as PlatformException image decode errors, 502 on uploads,
and 503 on ActivityService.

Bump immich-server requests=limits to 4096Mi (per CLAUDE.md "upperBound x 1.3 for
volatile workloads" rule, with headroom over the OOM mark). Quota math: 9680Mi used -
2000Mi old req + 4096Mi new req = 11776Mi, fits the tier-2-gpu 12Gi cap.
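
The sizing and quota arithmetic above can be re-checked with a small sketch. The figures come from this commit message; the function and constant names are hypothetical and not part of the stack, and the 1.3 factor is applied with integer math to avoid float rounding.

```python
# Hypothetical helper names; all values are in Mi and taken from the commit message.
VPA_UPPER_BOUND = 3050
QUOTA_CAP = 12 * 1024        # tier-2-gpu cap of 12Gi
USED_BEFORE = 9680           # namespace memory requests before the change
OLD_REQUEST = 2000           # old immich-server request
NEW_REQUEST = 4096           # new immich-server request (= limit)


def sized_request(upper_bound_mi: int) -> int:
    """Apply the 'upperBound x 1.3' rule, rounding up to a whole Mi."""
    return (upper_bound_mi * 13 + 9) // 10


def usage_after_change(used: int, old_req: int, new_req: int) -> int:
    """Steady-state quota usage once the old request is swapped for the new one."""
    return used - old_req + new_req


rule_floor = sized_request(VPA_UPPER_BOUND)
steady_state = usage_after_change(USED_BEFORE, OLD_REQUEST, NEW_REQUEST)

print(f"rule floor: {rule_floor}Mi, chosen: {NEW_REQUEST}Mi")
print(f"steady state: {steady_state}Mi of {QUOTA_CAP}Mi")
```

Under these numbers the rule gives a floor of 3965Mi, so 4096Mi satisfies it, and the steady state lands 512Mi under the cap.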

Switch both immich-server and immich-machine-learning to Recreate strategy: the
namespace tier-2-gpu quota is too tight for RollingUpdate to keep an old + new pod up
during apply (transient 13776Mi > 12Gi cap, see "ResourceQuota blocks rolling updates"
in CLAUDE.md). With single replicas and Recreate, future memory tweaks no longer
require a manual scale-to-0 dance.
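
A minimal sketch of why the strategy matters for the quota, assuming a single replica and a surge of one extra pod during RollingUpdate. The function name is hypothetical; the figures are the ones quoted in this message.

```python
# All values in Mi, taken from the commit message; names are illustrative only.
QUOTA_CAP_MI = 12 * 1024     # tier-2-gpu cap of 12Gi
USED_BEFORE_MI = 9680        # includes the old immich-server request
OLD_REQUEST_MI = 2000
NEW_REQUEST_MI = 4096


def peak_usage(strategy: str) -> int:
    """Peak namespace memory requests while the new pod rolls out."""
    others = USED_BEFORE_MI - OLD_REQUEST_MI
    if strategy == "RollingUpdate":
        # Old and new pods are counted against the quota at the same time.
        return others + OLD_REQUEST_MI + NEW_REQUEST_MI
    if strategy == "Recreate":
        # The old pod is removed before the new one is created.
        return others + max(OLD_REQUEST_MI, NEW_REQUEST_MI)
    raise ValueError(f"unknown strategy: {strategy}")


for strategy in ("RollingUpdate", "Recreate"):
    peak = peak_usage(strategy)
    verdict = "fits" if peak <= QUOTA_CAP_MI else "exceeds"
    print(f"{strategy}: peak {peak}Mi {verdict} the {QUOTA_CAP_MI}Mi cap")
```

With these inputs RollingUpdate peaks at 13776Mi, over the cap, while Recreate peaks at 11776Mi, which is the same as the steady state.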

Verified: new pod has limits.memory=4Gi, quota usage stable at 11776Mi/12Gi, immich
API serving normally.

Note: a pending node_selector drift on immich-machine-learning (gpu=true ->
nvidia.com/gpu.present=true) was also reconciled in this apply; the canonical NVIDIA
operator label is already on the GPU node, so there is no scheduling impact.
2026-04-26 19:11:50 +00:00
.terraform.lock.hcl [infra] Migrate Terraform state from local SOPS to PostgreSQL backend 2026-04-16 19:33:12 +00:00
backend.tf [infra] Migrate Terraform state from local SOPS to PostgreSQL backend 2026-04-16 19:33:12 +00:00
chart_values.tpl [redis] Migrate live RW consumers off bare redis.redis hostname 2026-04-19 12:42:36 +00:00
frame.tf [infra] Sweep dns_config ignore_changes across all pod-owning resources [ci skip] 2026-04-18 21:19:48 +00:00
main.tf immich(server,ml): bump server to 4Gi + Recreate strategy on tight quota 2026-04-26 19:11:50 +00:00
providers.tf [infra] Add Cloudflare provider to all stack lock files and generated providers 2026-04-16 16:31:36 +00:00
secrets [ci skip] Move Terraform modules into stack directories 2026-02-22 14:38:14 +00:00
terragrunt.hcl migrate all secrets from SOPS to Vault KV 2026-03-14 17:15:48 +00:00