All of Anca's photos are imported. The import Job was declared as
kubernetes_job_v1.anca_elements_import, so every `terragrunt apply` of
the immich stack re-created it, despite the 2026-05-25 in-code comment saying
"After successful completion: REMOVE this resource block + apply again."
Nobody noticed for 22 days. Today's re-trigger (2026-06-16) caused the
6th IO-pressure incident: the Job scanned all 21,643 assets in pure read-scan
mode for 51 min, saturated sdc, starved etcd, and crash-looped kube-apiserver.
Recovery actions taken before this commit:
- Throttled nfsd 64→8 on PVE host to give apiserver headroom
- `kubectl delete job -n immich anca-elements-import` + force-delete pod
- Restored nfsd to 64; cluster healthy
Code change here:
- Remove `kubernetes_job_v1.anca_elements_import` block
- Remove `module.nfs_anca_elements_host` (PVC `immich-anca-elements-host`
  has no live consumer; videos batch deferred per user; source dump remains
  on PVE at /srv/nfs/anca-elements, browseable via Nextcloud admin)
- Update 2026-05-25 post-mortem: 6th-incident section + new lesson that
  one-shot Jobs do NOT belong in kubernetes_job_v1 (use a suspended CronJob
  or a runbook-captured ad-hoc `kubectl create job` invocation instead).
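
The ad-hoc alternative named in the lesson above could look roughly like the
runbook snippet below. It is a sketch, not what this repo ships: it assumes a
CronJob called anca-elements-import exists in the immich namespace with
`suspend: true` (that CronJob is hypothetical), and the `RUN` switch and
function names are made up for illustration. By default it only prints the
command.

```shell
#!/bin/sh
set -eu

NAMESPACE="immich"
# Assumption: a suspended CronJob with this name holds the import pod template.
CRONJOB="anca-elements-import"

# Build a unique Job name so repeated manual runs do not collide.
job_name() {
  printf '%s-manual-%s' "$1" "$(date +%Y%m%d%H%M%S)"
}

# Dry-run by default: print the kubectl command. Set RUN=1 to execute it.
run_import() {
  name="$(job_name "$CRONJOB")"
  if [ "${RUN:-0}" = "1" ]; then
    kubectl create job "$name" --from="cronjob/$CRONJOB" -n "$NAMESPACE"
  else
    echo "kubectl create job $name --from=cronjob/$CRONJOB -n $NAMESPACE"
  fi
}

run_import
```

Because the Job is created by hand and never declared in Terraform state, a
later `terragrunt apply` has nothing to re-create.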
6d224861 came from a --no-checkout worktree whose empty index made the
commit drop every file except two. This restores 05b50d2b's full tree and
correctly adds stacks/stem95su/gdrive-sync.tf + the service-catalog stem95su
entry. Forward-only (parent=6d224861, no force-push); [ci skip] since the
live infra was never applied from the broken commit.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
CronJob stem95su-gdrive-sync (*/10) mounts the content PVC RW and
rclone-syncs the read-only Drive folder "claude" (stem claude/files) onto
it (rclone/rclone:1.74.3, scope=drive.readonly, empty-source guard +
--max-delete 25). ESO ExternalSecret stem95su-rclone <- Vault
secret/stem95su. Requires the GCP OAuth app to be published to Production;
otherwise the refresh token expires roughly weekly.
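
The empty-source guard plus `--max-delete 25` described above might be shaped
like the sketch below. The actual CronJob script is not reproduced here: the
remote name `gdrive:claude`, the `/content` mount path, and the function names
are assumptions, and the drive.readonly scope is assumed to live in the rclone
config rather than on the command line.

```shell
#!/bin/sh
set -eu

SRC="${SRC:-gdrive:claude}"   # hypothetical rclone remote:path for the Drive folder
DEST="${DEST:-/content}"      # hypothetical mount point of the content PVC
RCLONE="${RCLONE:-rclone}"    # overridable so the logic can be exercised with a stub

# Count files visible in the source. If the listing fails, the count is 0,
# which the guard below treats the same as an empty source.
count_source_files() {
  "$RCLONE" lsf --files-only -R "$SRC" | wc -l | tr -d ' '
}

# Refuse to sync from an empty (or unreadable) source, so a broken token or
# an emptied Drive folder cannot wipe the destination.
sync_guarded() {
  n="$(count_source_files)"
  if [ "$n" -eq 0 ]; then
    echo "source is empty or unreadable; skipping sync" >&2
    return 1
  fi
  # --max-delete aborts the run if more than 25 deletions would be needed.
  "$RCLONE" sync "$SRC" "$DEST" --max-delete 25
}
```

A CronJob container would call `sync_guarded` as its last step. The two checks
are complementary: the guard covers the all-files-gone case, while
`--max-delete` bounds the damage from a partial loss on the Drive side.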
Lands the gdrive-sync stack on master (it had landed on a feature branch
by accident on the shared devvm checkout).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
A library-wide Duplicate Detection run on 2026-06-01 fanned the ML/thumbnail
backfill out at thumbnailGeneration concurrency 8, saturating the shared sdc
HDD and starving etcd -> kube-apiserver down ~30 min (5th IO-pressure incident
on sdc). Capped server-side thumbnailGeneration/metadataExtraction/library to 2
in the Immich DB system-config; documented in the Immich row and recorded the
recurrence + still-TODO IO-isolation fixes in the 2026-05-25 post-mortem (this
also commits that previously-untracked post-mortem).
[ci skip]
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>