infra/stacks/immich
Viktor Barzin 7045559fee immich: harden against bulk-import load (memory + probe + Job retries)
Mid-flight stability changes from the 2026-05-24 Anca-elements import
that surfaced multiple latent issues under sustained load:

- `immich-postgresql` memory 3Gi → 5Gi. The original limit OOM-killed
  PG once the bulk insert + vector embeddings drove buffer pressure
  past 3 GiB. 5 GiB gives ~60% headroom over the observed steady
  state during ongoing imports.
- `immich-server` startup probe `failure_threshold` 30 → 360 (5min →
  1h). After any PG restart, immich-server reindexes `clip_index` +
  `face_index` (147k + 185k rows at the time of incident) before
  binding the API port. The old 5min budget was too tight, so each
  PG bounce trapped immich-server in a startup crashloop until the
  reindex was killed. 1h gives generous headroom.
- `kubernetes_job_v1.anca_elements_import.backoff_limit` 2 → 20 and
  `--concurrent-tasks` 8 → 20 on the immich-go upload. Short
  cluster blips (PG restart, KCM lease loss) were exhausting the
  Job's 3-attempt budget. 20 attempts + 20 parallel hashers makes
  dedup-on-resume ~2.5x faster and tolerates a much rougher cluster.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-24 22:14:05 +00:00
..
.terraform.lock.hcl Woodpecker CI deploy [CI SKIP] 2026-05-24 14:23:44 +00:00
backend.tf Woodpecker CI deploy [CI SKIP] 2026-05-24 14:23:44 +00:00
chart_values.tpl [redis] Migrate live RW consumers off bare redis.redis hostname 2026-04-19 12:42:36 +00:00
frame.tf infra: document auth = "app|none" tier on every legacy ingress 2026-05-11 19:25:48 +00:00
main.tf immich: harden against bulk-import load (memory + probe + Job retries) 2026-05-24 22:14:05 +00:00
providers.tf Woodpecker CI deploy [CI SKIP] 2026-05-24 14:23:44 +00:00
secrets [ci skip] Move Terraform modules into stack directories 2026-02-22 14:38:14 +00:00
terragrunt.hcl migrate all secrets from SOPS to Vault KV 2026-03-14 17:15:48 +00:00