infra

Author	SHA1	Message	Date
Viktor Barzin	63e714782c	immich: remove one-shot anca-elements-import Job + its PVC All of Anca's photos are imported. The Job was declared as kubernetes_job_v1.anca_elements_import — meaning every `terragrunt apply` of the immich stack re-created it, despite the 2026-05-25 in-code comment saying "After successful completion: REMOVE this resource block + apply again." Nobody noticed for 22 days; the re-trigger today (2026-06-16) was the 6th IO-pressure incident — it scanned all 21,643 assets in pure read-scan mode for 51 min, saturated sdc, starved etcd, crash-looped kube-apiserver. Recovery actions taken before this commit: - Throttled nfsd 64→8 on PVE host to give apiserver headroom - `kubectl delete job -n immich anca-elements-import` + force-delete pod - Restored nfsd to 64; cluster healthy Code change here: - Remove `kubernetes_job_v1.anca_elements_import` block - Remove `module.nfs_anca_elements_host` (PVC `immich-anca-elements-host` — no live consumer; videos batch deferred per user, source dump remains on PVE at /srv/nfs/anca-elements, browseable via Nextcloud admin) - Update 2026-05-25 post-mortem: 6th-incident section + new lesson that one-shot Jobs do NOT belong in kubernetes_job_v1 (use a suspended CronJob or a runbook-captured `kubectl create job` ad-hoc invocation instead).	2026-06-16 22:11:27 +00:00
Viktor Barzin	fd0f4a0365	fix: restore tree dropped by `6d224861`; land stem95su gdrive-sync (10m) [ci skip] `6d224861` came from a --no-checkout worktree whose empty index made the commit drop every file except two. This restores 05b50d2b's full tree and correctly adds stacks/stem95su/gdrive-sync.tf + the service-catalog stem95su entry. Forward-only (parent=6d224861, no force-push); [ci skip] since the live infra was never applied from the broken commit. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-09 08:45:33 +00:00
Viktor Barzin	6d224861c4	stem95su: scheduled Drive->site sync CronJob (every 10m) CronJob stem95su-gdrive-sync (*/10) mounts the content PVC RW and rclone-syncs the read-only Drive folder "claude" (stem claude/files) onto it (rclone/rclone:1.74.3, scope=drive.readonly, empty-source guard + --max-delete 25). ESO ExternalSecret stem95su-rclone <- Vault secret/stem95su. Requires the GCP OAuth app published to Production or the refresh token expires ~weekly. Lands the gdrive-sync stack on master (it had landed on a feature branch by accident on the shared devvm checkout). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-09 08:42:26 +00:00
Viktor Barzin	0dd4a31eff	docs(immich): cap server-side job concurrency to protect sdc + log recurrence A library-wide Duplicate Detection run on 2026-06-01 fanned the ML/thumbnail backfill out at thumbnailGeneration concurrency 8, saturating the shared sdc HDD and starving etcd -> kube-apiserver down ~30 min (5th IO-pressure incident on sdc). Capped server-side thumbnailGeneration/metadataExtraction/library to 2 in the Immich DB system-config; documented in the Immich row and recorded the recurrence + still-TODO IO-isolation fixes in the 2026-05-25 post-mortem (this also commits that previously-untracked post-mortem). [ci skip] Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-06-01 15:15:26 +00:00

4 commits
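
The lesson recorded in 63e714782c (one-shot Jobs do not belong in kubernetes_job_v1; use a suspended CronJob or an ad-hoc `kubectl create job` instead) could look roughly like the fragment below. This is a minimal sketch, not the repo's actual code: the resource name `one_shot_import`, the object name, the image, the schedule, and the command are all hypothetical placeholders. Only the `immich` namespace and the suspended-CronJob idea come from the commit message.

```hcl
# Hypothetical sketch. Names, image, schedule and command are placeholders,
# not taken from the infra repo.
resource "kubernetes_cron_job_v1" "one_shot_import" {
  metadata {
    name      = "one-shot-import"
    namespace = "immich"
  }

  spec {
    # With suspend = true the CronJob controller never starts a run on its own,
    # so `terragrunt apply` only keeps the template in place. The schedule is
    # required by the API but has no effect while the CronJob stays suspended.
    suspend                       = true
    schedule                      = "0 0 1 1 *"
    concurrency_policy            = "Forbid"
    successful_jobs_history_limit = 1
    failed_jobs_history_limit     = 1

    job_template {
      metadata {}
      spec {
        # Do not retry automatically; a failed import should be inspected first.
        backoff_limit = 0
        template {
          metadata {}
          spec {
            restart_policy = "Never"
            container {
              name    = "import"
              image   = "example.invalid/importer:placeholder"
              command = ["/bin/sh", "-c", "echo 'replace with the real import command'"]
            }
          }
        }
      }
    }
  }
}
```

Under these assumptions, a run would be started deliberately with `kubectl create job -n immich --from=cronjob/one-shot-import one-shot-import-manual` (the Job name is again a placeholder). Because Terraform manages only the suspended CronJob, a later apply does not re-create or re-run the Job.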