Viktor Barzin 0dd4a31eff docs(immich): cap server-side job concurrency to protect sdc + log recurrence

A library-wide Duplicate Detection run on 2026-06-01 fanned the ML/thumbnail
backfill out at thumbnailGeneration concurrency 8, saturating the shared sdc
HDD and starving etcd -> kube-apiserver down ~30 min (5th IO-pressure incident
on sdc). Capped server-side thumbnailGeneration/metadataExtraction/library to 2
in the Immich DB system-config; documented in the Immich row and recorded the
recurrence + still-TODO IO-isolation fixes in the 2026-05-25 post-mortem (this
also commits that previously-untracked post-mortem).

[ci skip]

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

2026-06-01 15:15:26 +00:00

Post-Mortem: Immich anca-elements Bulk Ingest IO Storm → k8s-node1 Reboot

Field	Value
Date	2026-05-25
Duration	~9h from job start (2026-05-24 23:55Z) to node1 reboot (2026-05-25 09:33Z); service restoration ongoing (~1h elapsed at time of writing).
Severity	SEV2 — k8s-node1 went down, ~33 deployments lost their only replica, DNS partially degraded (Technitium primary on node1), Loki down, GPU stack down, backup pipeline timed out for the day. No data loss.
Affected Services	33 deployments, 3 StatefulSets, 9 DaemonSets across ~30 namespaces. Most concentrated on k8s-node1 (the only GPU node and home of several pinned services).
Issue	TBD (no GitHub issue filed yet)
Status	Draft — recovery in progress
Recurrence count	3rd IO-pressure-induced incident in 17 days at time of writing; recurred 2026-05-26 (Alloy log-read storm, mem id=2726) and 2026-06-01 (Immich Duplicate Detection ML/thumbnail backfill — see Update 2026-06-01)

Summary

The anca-elements-import Kubernetes Job — a one-shot bulk import of ~34k photos (770 GB) from /srv/nfs/anca-elements into Immich — ran with immich-go --concurrent-tasks 20 and no CPU/IO limits. The 20 parallel NFS readers combined with the Immich ML pipeline saturated sdc (the 10.7 TB HDD thin pool holding all VM disks) for hours. Sustained disk contention starved the k8s-node1 VM's IO until the VM rebooted at 09:33 UTC. ~60 pods on node1 went zombie; the proxmox CSI driver lost its registration; the NVIDIA driver DaemonSet entered CrashLoopBackOff; the daily-backup pipeline was killed by its 4h systemd timeout while waiting on post-reboot ext4 orphan-inode cleanup.

This is the third time in 17 days that an IO event has taken a meaningful slice of the cluster offline. We cannot keep treating each one as a one-off.

Impact

User-facing: DNS resolution degraded (1 of 3 Technitium replicas down on node1). 20 self-hosted apps (changedetection, freshrss, frigate, navidrome, wealthfolio, etc.) returned 502 or hung. GPU-dependent services (Frigate ML, Immich ML, nvidia-exporter) had no GPU available.
Blast radius: ~60 zombie pods on k8s-node1; 33 deployment replicas missing cluster-wide; 1 StatefulSet (Loki) unavailable. Multi-Attach errors on ~8 proxmox-lvm PVCs prevented reschedule onto healthy nodes for ~30 min.
Duration: Initial IO degradation ~01:30Z (job ran ~85 min then ended). VM stayed alive but degraded for ~8 hours after the job ended (likely due to filesystem journal recovery / page cache pressure tail). Hard reboot at 09:33Z. Service restoration began at 10:00Z.
Data loss: None. All PVCs intact; no failed writes detected.
Monitoring gap: We had no alert for "VM is about to crash from sustained IO pressure." NodeHighIOWait fired but didn't escalate, and PVE-host-level IO PSI metrics aren't scraped into Prometheus.

Pattern — this is the third time

Incident	Date	Root IO source	Outcome
1	2026-05-09	Stale NFS kthread to decommissioned TrueNAS (`/usr/local/bin/weekly-backup` artifact) wedged in `rpc_wait_bit_killable`	PVE loadavg ~15 sustained, IO PSI stall on node3, no user-visible outage
2	2026-05-16/17	kured stuck + GPU driver Ubuntu 26.04 mismatch + NFS-CSI Keel upgrade race	Multi-issue cluster degradation; required manual recovery
3	2026-05-25 (this incident)	Immich `anca-elements-import` Job with 20 parallel uncapped readers	k8s-node1 VM reboot, ~33 deployments down, backup pipeline broken
4	2026-05-26	Grafana Alloy DaemonSet read 12.18 TB of logs in ~24h (silently lost its `controller.resources` limit)	sdc 97% util, all VMs + NFS starved (mem id=2726)
5	2026-06-01	Immich library-wide Duplicate Detection → ML/thumbnail backfill read originals at server-side `thumbnailGeneration` concurrency 8	sdc ~100% util, 64 `nfsd` threads D-state, etcd starved → kube-apiserver down ~30 min

Common pattern: a single uncontrolled IO-heavy workload (or a stale connection) saturates the shared sdc thin pool, which hosts all VM disks for the entire cluster. There is currently no IO budget enforcement between workloads, no PVE-level IO QoS between VMs, and no alerting that fires before a node crashes.

We have a single point of contention (sdc). Every storm finds it.

Timeline (UTC)

Time	Event
2026-05-24 23:55	`anca-elements-import` Job starts. immich-go v0.31.0, `--concurrent-tasks 20`, no resource limits. Anca's 770 GB photo archive begins streaming from NFS to immich-server.
2026-05-25 01:21	Job marks `Complete` (85 min runtime). 34k photos uploaded. Immich-server ML pipeline (face detection / thumbnail generation) keeps the IO load going for hours after.
05:02	`daily-backup.service` (systemd timer) starts. It runs LVM thin-snapshot → LUKS-decrypt → mount → rsync per PVC. The competing IO from the still-saturated thin pool stretches every per-PVC step.
08:24	`daily-backup` `matrix-data-proxmox` rsync hits its per-PVC 30-min timeout — first warning.
08:31–09:02	30 LUKS-encrypted PVC mounts log `Failed to mount snapshot` because ext4 orphan-inode cleanup exceeds the 30s `timeout 30 mount` guard (one volume took 109s).
09:02:28	systemd kills `daily-backup` with `TimeoutStartSec=14400` (4h). Script was nowhere near complete — alphabetically still on letter 'm'. Snapshots from today's run are left, but that's the designed 7-day retention pattern.
09:07	`nfs-mirror.service` also times out.
09:33:24	k8s-node1 VM rebooted (cause: best-guess IO starvation triggered Proxmox watchdog or qemu IO timeout; not directly observable). All pods on node1 enter `Unknown`.
09:33:36	node1 kubelet posts Ready. Pods on node1 begin churn: calico-node, csi-node-driver, kured, alloy, loki-canary, nvidia stack all restart. proxmox-csi-plugin-node fails to re-register the `csi.proxmox.sinextra.dev` driver in CSINode.
09:40	User session impacted: `~/.zsh_history` left with 154-byte NUL padding from interrupted write.
09:41	Incident detected by user; `/cluster-health` invoked. Healthcheck reports 33 PASS / 3 WARN / 64 FAIL.
09:50	Force-deleted 47 Failed pods + 22 stuck Terminating + 6 zombie DS pods on node1.
09:55	4 recovery sub-agents dispatched in parallel: csi-recovery, gpu-recovery, dns-monitoring-recovery, backup-recovery.
~10:15	proxmox CSI re-registered on node1 (csi-recovery). Multi-Attach errors clearing. Loki StatefulSet recovers to 1/1. Calico fully back to 5/5.
~10:30	daily-backup re-started manually (currently still running, ETA ~2h).
(ongoing)	nvidia stack recovery ongoing; 20 deployments still recovering.

Root Cause

Direct (the trigger)

anca-elements-import Job in stacks/immich/main.tf runs immich-go upload with --concurrent-tasks 20, no CPU limit, and no IO throttling. Twenty parallel NFS readers against /srv/nfs/anca-elements (mounted from PVE host) plus immich-server's ML pipeline (CUDA-accelerated face detection + thumbnail generation) saturated the read queue on sdc. The job itself only ran for 85 min, but the after-effects (ML processing, filesystem cache eviction, dirty-page writeback) persisted for hours.

Recovery-side cascade (why the cluster stayed broken after node1 booted)

Once node1 rebooted, the kubelet posted Ready within 12s — but csi.proxmox.sinextra.dev failed to re-register, blocking ~30 pods. The actual cascade (discovered by the csi-recovery agent during today's investigation):

Calico CNI on node1 entered a crash loop. The calico-node pod's BIRD BGP daemon takes a few seconds to create /var/run/bird/bird.ctl on startup. The container's liveness probe was killing the process before the socket appeared, restarting it before it could stabilize.
Without functional Calico, no new pod on node1 could reach the kube-apiserver service IP (10.96.0.1:443).
The proxmox-csi-plugin-node pod therefore crash-looped with "no route to host" trying to talk to the apiserver, and never created its CSI socket for kubelet to discover.
node-driver-registrar (sidecar) therefore never registered the driver with kubelet, so CSINode for k8s-node1 lacked csi.proxmox.sinextra.dev.
Every PVC mount on node1 failed with the driver not found error we observed; meanwhile, VolumeAttachments for those PVCs still pointed at node1 from before the reboot, so reschedule onto healthy nodes hit Multi-Attach error.

Fix order matters: Calico first, then CSI, then stale VolumeAttachments. Doing them out of order leaves the cascade broken. This is now a P3 runbook (below).
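
The ordering constraint above can be written down as a small dependency graph. This is a minimal sketch rather than the runbook itself: the step names are illustrative labels (assumptions, not real commands), and the only thing it encodes is "Calico before CSI before stale VolumeAttachments".

```python
from graphlib import TopologicalSorter

# Each step maps to the set of steps that must finish before it may start.
# Step names are hypothetical labels for the actions described above.
RECOVERY_DEPS = {
    "restart_calico_node": set(),
    "restart_proxmox_csi_node": {"restart_calico_node"},
    "delete_stale_volume_attachments": {"restart_proxmox_csi_node"},
}


def recovery_order(deps=RECOVERY_DEPS):
    """Return the recovery steps in an order that respects every dependency."""
    return list(TopologicalSorter(deps).static_order())


for position, step in enumerate(recovery_order(), start=1):
    print(position, step)
```

Keeping the dependencies as data makes it straightforward to add further steps (for example the nvidia driver recovery) without re-deriving the order by hand.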

Second cascade (3.5h into recovery) — Proxmox CSI 30-LUN-per-VM hard cap

During the recovery, a second cascade was discovered that compounded the outage:

k8s-driver-manager init container cordons + drains node1 as part of GPU driver re-install. This evicted GPU-tagged pods (and incidentally triggered descheduler rebalancing) onto other nodes.
Simultaneously the dns-monitoring-recovery agent killed an orphaned containerd holding a boltdb lock on k8s-node4, evicting all node4 pods.
The combined eviction wave pushed ~60 PVC-using pods through the scheduler in a short window. Many landed on node1 (the largest node, and the one with the fewest scheduling restrictions at that moment), raising its SCSI LUN slot usage from ~17 (pre-Immich-import baseline) to 29 of 30.
proxmox-csi-plugin's hard-coded MaxVolumesPerNode = 30 (per the upstream csi-driver-proxmox source: it scans scsi0…scsi30 and errors no free lun found when none are free) blocked further attaches.
vault-0, mysql-standalone-0, claude-memory, grafana, nextcloud etc. all could not start because their PVCs couldn't attach to node1 (their scheduled target). Multi-Attach errors compounded when their previous-node attachments hadn't been released cleanly.
Daily-backup was running concurrently — adding ~120 MB/s read load on sdc, slowing every CSI attach/detach operation by 3-5×, prolonging the queue.

Resolution (manual, 2026-05-25): systemctl stop daily-backup, kubectl cordon k8s-node1, force-delete stuck Pending pods. They rescheduled to nodes with LUN headroom (node2/3/4 had ~12-15 free slots each).
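
A sizing check before a drain or rebalance could look like the sketch below. It assumes the 30-volume cap cited above and takes per-node attachment counts as plain input; the example counts are illustrative, not measured values, and how the counts are collected is left out.

```python
MAX_VOLUMES_PER_NODE = 30  # cap cited in this post-mortem


def lun_headroom(attached_by_node, cap=MAX_VOLUMES_PER_NODE):
    """Map each node to the number of volume slots it still has free."""
    return {node: max(cap - used, 0) for node, used in attached_by_node.items()}


def nodes_that_fit(attached_by_node, needed, cap=MAX_VOLUMES_PER_NODE):
    """Return the nodes (sorted by name) that can take `needed` more volumes."""
    headroom = lun_headroom(attached_by_node, cap)
    return sorted(node for node, free in headroom.items() if free >= needed)


# Illustrative counts only: node1 near the cap, the others with room to spare.
example = {"k8s-node1": 29, "k8s-node2": 18, "k8s-node3": 15, "k8s-node4": 16}
print(lun_headroom(example))
print(nodes_that_fit(example, needed=5))
```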

Structural (why it took down a node)

Single shared IO domain: sdc is one LVM thin pool serving all 9 VMs. No Proxmox-level bwlimit or iothrottle between VMs. Any VM can starve the others.
No IO budget at workload level: the K8s job had resources: {}. There is no cluster-wide cgroup-IO budget enforced.
NFS reads bypass per-VM accounting: anca-elements is read via the PVE host's NFS export. The reads happen on the PVE host, charged to the host's IO scheduler, not to the k8s-node1 VM. So even if we capped node1's VM IO, the storm would still happen.
node1 is also the only GPU node — Immich-ML pods are pinned there. The reader (immich-server) and consumer (immich-ml) are both fighting for the same node's resources during ingestion.
ext4 orphan-inode cleanup is unaware of noload: daily-backup.sh uses mount -o ro,noload to skip journal replay, but noload doesn't skip orphan-inode cleanup. When a node reboots with dirty filesystem state on the source PVC, snapshot mounts can take 100+s — exceeding the script's 30s timeout. Confirmed by dmesg: 20+ volumes logged INFO: recovery required on readonly filesystem during today's backup window.
Calico BIRD liveness probe is racy on cold start. The probe doesn't tolerate the 3-5s BIRD initialization window, so any cold-start of Calico tends to crash-loop briefly. Usually it self-recovers on the 2nd or 3rd restart — today it didn't, because the apiserver was unreachable from the brand-new pod (chicken-and-egg).

Contributing Factors

The job's IO profile was never measured before running. immich-go --concurrent-tasks 20 is the upstream default; nobody validated it against our hardware.
No staging window. anca-elements-import is the second of two intentional one-shot ingestion runs (1st was Viktor's library months ago). The first run also caused load — but didn't crash a node, so it was treated as "loud but fine."
Daily-backup overlap. The 05:00 backup timer fired while the IO tail of the Immich job was still in flight. The two competing workloads triggered the LUKS mount timeouts.
No PVE-level IO QoS between VMs (Proxmox supports iops_rd/wr throttle groups on disk specs; we've never set them).
No alert for "node1 about to crash". NodeHighIOWait fires at a fixed threshold but doesn't trigger any automated mitigation or paging.

Detection Gaps

Gap	Impact	Fix
No PVE host IO PSI scraped into Prometheus	We can see node1 IO PSI but not the PVE-host-level pressure that's the actual leading indicator	Add node_exporter PSI scrape on PVE (already running) to Prometheus targets, expose `pressure_io_*`
No alert on sustained sdc utilization > 80%	The IO storm built up for hours without any signal escalating	Add `PVEThinPoolIOSaturated` rule: `irate(node_disk_io_time_seconds_total{device="sdc",instance="pve"}[5m]) > 0.85` for 30m
No alert on Proxmox-host loadavg > 20	Sustained loadavg 13–15 was visible only through cluster healthcheck #44	Add `PVEHostLoadHigh` rule (1m loadavg > 25 for 10m)
No alert on K8s Job IO throughput	An uncapped K8s job can do unlimited IO without alerting	Add `JobHighIOThroughput`: alert if container_fs_reads_bytes_total rate over 5m > 100 MB/s for >10m
Backup timeout fires silently	systemd kill of daily-backup at 4h didn't alert anyone — we'd have noticed after 48h via the `backup_per_db=FAIL mysql=33h pg=33h` healthcheck	Add Alertmanager rule on daily-backup unit failure (probe systemd unit state via node-exporter textfile collector)
LUKS mount step inflation post-reboot is silent	30 mount failures logged as WARN, no aggregate alert	Add count-of-WARN alert from the daily-backup log

Prevention Plan

P0 — Prevent this exact failure

Priority	Action	Type	Details	Status
P0	Cap concurrency on future `*-elements-import` jobs	Config	In `stacks/immich/main.tf` (`kubernetes_job_v1.anca_elements_import` and future siblings): set `--concurrent-tasks 4` (down from 20). Also set `resources.limits.cpu = "2000m"` and `activeDeadlineSeconds = 21600` (6h cap). Add a `nodeSelector` to keep the job off node1 (move read-side onto a non-GPU node so the GPU node only does ML).	TODO
P0	Bump LUKS mount timeout in `daily-backup`	Config	`infra/scripts/daily-backup.sh` line 243: change `timeout 30 mount …` → `timeout 180 mount …` (covers observed 109s worst case). Add a comment explaining the ext4 orphan-cleanup exception.	TODO
P0	Schedule big-data ingests outside backup window	Config	Forbid Job/CronJob scheduling between 04:30–08:30 EEST (when daily-backup runs). Either via a Kyverno policy on `*-import` named jobs, or a documented convention enforced at PR review.	TODO
P0	Raise Proxmox CSI LUN limit on each k8s-node VM	Architecture	The default `virtio-scsi-pci` controller exposes 30 LUN slots; proxmox-csi hard-caps at this. Resolution path: add a 2nd `virtio-scsi-pci` controller (`scsihw1`) to each k8s-node VM via Proxmox, OR migrate VMs to `virtio-scsi-single` which allows 256+ LUNs per disk. Either requires a brief per-node reboot. Without this, every future cluster-churn event can re-hit "no free lun found" on whichever node ends up overloaded. Permanent fix — must land before next ingest run.	TODO
P0	Document the `MaxVolumesPerNode=30` limit in storage architecture	Runbook	Add to `docs/architecture/storage.md` — currently the 30-LUN cap is invisible to operators until they hit it. Include `kubectl get pods -A --field-selector spec.nodeName=NODE` and the 30-cap as a sizing check before any cluster-wide rebalance / drain operation.	TODO
P0	Add startupProbe to mysql-standalone	Config	`stacks/dbaas/modules/dbaas/main.tf` (the `kubernetes_stateful_set_v1` for `mysql-standalone`): add `startupProbe` with `failureThreshold=120, periodSeconds=15, timeoutSeconds=10` (≈30 min budget for InnoDB recovery). Also bump liveness `initialDelaySeconds=120, failureThreshold=10`. Today MySQL spun in a CrashLoopBackOff for ~30 min — each restart's InnoDB recovery aborted when the existing 30s liveness probe fired, never finishing. Resolved manually via `kubectl patch sts mysql-standalone` — must Terraform-codify.	Done (kubectl, needs TF)
P0	Add startupProbe to goauthentik-server	Config	Similar issue: Authentik Django migrations + clip/face index rebuilds take 5-10 min after PG restart, but the startup probe budget is too short → restart loop. Add `startupProbe: failureThreshold=180, periodSeconds=10` (30 min) on `goauthentik-server`. Source: `stacks/authentik/modules/authentik/main.tf` (or equivalent Helm values).	TODO
P0	Disable daily-backup.timer when manually stopping daily-backup.service	Runbook	During this incident, `systemctl stop daily-backup.service` alone wasn't enough — the timer kept it queued for re-fire. The recovery sequence is: `systemctl stop daily-backup.timer; systemctl stop daily-backup.service`. Document in `docs/runbooks/cluster-recovery.md` (to-create) as the canonical sequence.	TODO
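
The "≈30 min" budgets quoted for the two startupProbe rows follow from Kubernetes allowing a container up to failureThreshold × periodSeconds to pass its startup probe. A minimal sketch of that arithmetic (the function name is illustrative):

```python
def startup_budget_seconds(failure_threshold, period_seconds):
    """Upper bound on startup time before the kubelet restarts the container."""
    return failure_threshold * period_seconds


mysql_budget = startup_budget_seconds(failure_threshold=120, period_seconds=15)
authentik_budget = startup_budget_seconds(failure_threshold=180, period_seconds=10)

print(mysql_budget / 60, "minutes for mysql-standalone")
print(authentik_budget / 60, "minutes for goauthentik-server")
```

Both settings come out to the same 1800-second budget; the difference is only how often the probe runs within it.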

P1 — Reduce blast radius

Priority	Action	Type	Details	Status
P1	Proxmox per-VM IO throttle for the Immich workload host	Architecture	Set `iops_rd=2000,iops_wr=1000,mbps_rd=200,mbps_wr=100` on the k8s-node1 VM disk via Proxmox API. Pick numbers based on baseline `iostat` measurement. Same for non-prod VMs (devvm, registry).	TODO
P1	Move NFS reads off the PVE host hot path	Architecture	Currently, when an NFS client reads `/srv/nfs/anca-elements`, the disk reads happen on the PVE host itself because it is the NFS server. Consider mounting anca-elements via a dedicated NFS export with a `wsize/rsize` cap, OR put bulk-ingest source data on a separate physical disk (sdb SSD has headroom).	Investigation
P1	Add cgroup IOLimit to Kyverno mutating webhook for namespaces	Config	Auto-attach a `cgroupv2 io.max` annotation to pods in known-high-IO namespaces (immich, frigate, ollama). Requires kernel ≥5.13 + cgroupv2 (we have both).	TODO
P1	Separate Immich `library` + `upload` NFS exports onto different LVs	Architecture	Currently `/srv/nfs/immich/{library,upload}` share the `pve/nfs-data` LV. Splitting upload onto its own thinly-provisioned LV would let us throttle upload-side independently. Cost: ~30 min PV churn.	Architecture
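
For reference, the cgroup v2 `io.max` file takes one line per block device in the form `MAJ:MIN rbps=… wbps=… riops=… wiops=…`. The sketch below renders the throttle numbers from the first P1 row in that syntax. Two things are assumptions here: the `8:32` device number is a placeholder, and "MB" is treated as MiB. Whether the Kyverno route can actually deliver such a value to a pod's cgroup is not covered.

```python
MIB = 1024 * 1024


def io_max_line(device, mbps_rd, mbps_wr, iops_rd, iops_wr):
    """Render one cgroup v2 io.max entry for a block device (MAJ:MIN)."""
    return (
        f"{device} rbps={mbps_rd * MIB} wbps={mbps_wr * MIB} "
        f"riops={iops_rd} wiops={iops_wr}"
    )


# "8:32" is a placeholder device number, not a verified value for sdc here.
print(io_max_line("8:32", mbps_rd=200, mbps_wr=100, iops_rd=2000, iops_wr=1000))
```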

P2 — Detect faster

Priority	Action	Type	Details	Status
P2	Alert on sustained PVE sdc utilization	Alert	New PrometheusRule `PVEThinPoolIOSaturated`: `irate(node_disk_io_time_seconds_total{device="sdc",instance=~"pve.*"}[5m]) > 0.85` for 30m, severity=warning.	TODO
P2	Alert on PVE loadavg high	Alert	New PrometheusRule `PVEHostLoadHigh`: `node_load1{instance=~"pve.*"} > 25` for 10m. severity=warning.	TODO
P2	Alert on Kubernetes Job high IO rate	Alert	`JobHighIOThroughput`: `sum by (namespace, pod) (irate(container_fs_reads_bytes_total{container!=""}[5m])) > 100 * 1024 * 1024` for 10m → warning.	TODO
P2	Alert on daily-backup systemd unit failure	Alert	Add node-exporter textfile collector entry that runs `systemctl is-failed daily-backup.service` every 1m and writes 0/1 to `/var/lib/node_exporter/textfile/backup_unit_state.prom`. PrometheusRule fires on value=1 for 5m.	TODO
P2	Alert on Multi-Attach VolumeAttachment hung > 5m	Alert	These are the smoking gun whenever a node reboots. New rule on `kube_volumeattachment_status_attached == 0 and time() - kube_volumeattachment_metadata_created > 300`.	TODO
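
To make the thresholds in these rules concrete: the rate of `node_disk_io_time_seconds_total` is the fraction of wall-clock time the disk was busy, and a `for` clause only fires when the condition holds at every evaluation across the window. The sketch below mimics both in plain Python under simplifying assumptions (evenly spaced samples, hypothetical function names); it is not how Prometheus evaluates rules internally.

```python
MIB = 1024 * 1024
JOB_READ_THRESHOLD = 100 * MIB  # bytes per second, the 100 MB/s job threshold


def utilization(busy_start, busy_end, interval_seconds):
    """Fraction of the interval the disk spent doing IO (0.0 to 1.0)."""
    return (busy_end - busy_start) / interval_seconds


def sustained_above(samples, threshold, min_consecutive):
    """True if `min_consecutive` samples in a row exceed `threshold`."""
    run = 0
    for value in samples:
        run = run + 1 if value > threshold else 0
        if run >= min_consecutive:
            return True
    return False


# One sample per 5 minutes, so "for 30m" corresponds to 6 consecutive samples.
storm = [0.97] * 6
blip = [0.97, 0.97, 0.50, 0.97, 0.97, 0.97]
print(sustained_above(storm, threshold=0.85, min_consecutive=6))
print(sustained_above(blip, threshold=0.85, min_consecutive=6))
```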

P3 — Improve resilience

Priority	Action	Type	Details	Status
P3	Move k8s-node1 OS disk off sdc	Architecture	If node1 OS disk lived on the SSD (sdb VG `ssd`, 475 GB free), an sdc IO storm wouldn't starve the VM's own root filesystem and we'd avoid the reboot trigger. Cost: VM migration, ~1h downtime for node1.	Architecture
P3	Spread GPU + pinned services off node1	Architecture	Today node1 carries the GPU + Loki + Technitium primary + claude-agent + many app deployments. When it goes down, the blast radius is huge. Re-evaluate pin constraints — only Immich-ML and Frigate genuinely need node1.	Investigation
P3	Document recovery runbook for "node1 hard reboot"	Runbook	New `docs/runbooks/node1-reboot-recovery.md` capturing the strict order discovered today: (1) force-cleanup Failed/Unknown/stuck-Terminating zombies, (2) force-delete the calico-node pod on the rebooted node so BIRD restarts cleanly, (3) wait for calico-node Ready, (4) force-delete the proxmox-csi-plugin-node pod, (5) verify `csi.proxmox.sinextra.dev` appears in `kubectl get csinode <node> -o yaml`, (6) delete stale VolumeAttachments referencing the rebooted node where consumers have already rescheduled, (7) verify nvidia driver recovery (separate cascade).	TODO
P3	Make Calico BIRD liveness probe cold-start tolerant	Config	Bump `initialDelaySeconds` on `calico-node`'s liveness probe by 15s, or switch to an `exec` probe that checks BIRD socket existence rather than HTTP. Prevents the cold-start crash loop after node reboots.	Investigation
P3	Pre-flight script for bulk ingest jobs	Runbook + Config	Wrapper around `immich-go` that (a) checks PVE loadavg < 10, (b) checks sdc IO util < 50%, (c) checks daily-backup not running, before allowing the job to start. Refuses otherwise.	TODO
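
The pre-flight gate in the last row could be structured roughly as follows. This is a sketch under stated assumptions: `HostState` and `preflight` are hypothetical names, the thresholds are the ones from the table, and the measurements are passed in rather than collected, so the decision logic can be exercised without touching the PVE host.

```python
from dataclasses import dataclass


@dataclass
class HostState:
    load1: float          # PVE 1-minute load average
    sdc_util: float       # sdc utilization as a fraction, 0.0 to 1.0
    backup_running: bool  # whether daily-backup is currently active


def preflight(state, max_load=10.0, max_util=0.50):
    """Return the reasons to refuse the job; an empty list means go ahead."""
    reasons = []
    if state.load1 >= max_load:
        reasons.append(f"PVE loadavg {state.load1} is not below {max_load}")
    if state.sdc_util >= max_util:
        reasons.append(f"sdc utilization {state.sdc_util:.0%} is not below {max_util:.0%}")
    if state.backup_running:
        reasons.append("daily-backup is running")
    return reasons


quiet = HostState(load1=3.2, sdc_util=0.12, backup_running=False)
busy = HostState(load1=14.0, sdc_util=0.97, backup_running=True)
print(preflight(quiet))
print(preflight(busy))
```

Returning the reasons instead of a bare boolean lets the wrapper log why a run was refused.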

Lessons Learned

One shared physical disk is one shared failure domain. sdc serves all VMs; any uncapped workload can take down the cluster. We've now hit this three times in 17 days. Continuing to treat each as a one-off is no longer credible — we need IO budget enforcement (P1), not just better alerting (P2).
NFS reads bypass per-VM accounting. We assumed throttling the workload's VM would protect us. It doesn't — the reads physically happen on the PVE host's IO scheduler.
The "complete" state of a Job doesn't mean its IO is gone. anca-elements-import finished in 85 min, but the IO tail (ML pipeline + filesystem cache eviction) ran for hours. Future ingest jobs need to either run during off-hours OR be sized so that even their tail is benign.
Backup pipeline depends on a clean cluster state. When node1 was unhealthy, daily-backup couldn't complete LUKS mounts in time. Backups should be more resilient to upstream IO degradation OR we should treat backup failure as a SEV signal in real time.
The 30s timeout in daily-backup.sh was set without considering post-reboot recovery time. Defaults like this need to be reviewed in light of actual observed worst case.
Recovery requires a known runbook. Today's recovery worked once we knew which order to do things in: force-delete zombies → restore Calico → re-register CSI → clear stale VolumeAttachments → wait for DaemonSets → restart deployments. Codifying that as a runbook should make the next recovery several times faster.

Update 2026-06-01 — recurrence (Immich Duplicate Detection)

5th IO-pressure incident. A user-triggered library-wide Duplicate Detection run on the 163,989-asset Immich library cascaded into ML/thumbnail backfill for the ~5,150 assets missing CLIP embeddings (largely a fresh anca-elements import that had completed ~90 min earlier). The Immich server-side job thumbnailGeneration was set to concurrency 8 (plus metadataExtraction=4, library=4), so the backfill read originals off sdc 8-wide → ~92 MB/s, queue depth ~99, sdc ~100% util. 64 nfsd threads went D-state on folio_wait_bit_common; etcd on k8s-master was starved → kube-apiserver down ~30 min (different blast radius from 2026-05-25's node1 reboot — same root cause: the shared sdc spindle).

New finding: the 2026-05-25 P0 capped only the import-side concurrency (immich-go --concurrent-tasks). The Immich server-side job concurrency (job.*.concurrency in DB system-config) was never capped and had been tuned for speed (8/4/4). So any library-wide operation (dedup, smart-search backfill, thumbnail regen) re-triggers the storm independent of the import job.

Mitigation applied (2026-06-01): capped the HDD-original-reading server jobs to thumbnailGeneration=2, metadataExtraction=2, library=2 in system_metadata system-config JSONB + immich-server recreate. Verified: dedup resumed with sdc at 2–3% util, queue depth ~0.05, apiserver healthy. Documented in infra/.claude/CLAUDE.md Immich row.
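
The cap can be pictured as a clamp over the job section of the system-config document. The sketch below works on a plain dictionary; the nested shape (`job` → job name → `concurrency`) is assumed from the `job.*.concurrency` notation above, and reading or writing the actual JSONB row is deliberately left out.

```python
import copy

# Caps applied on 2026-06-01 to the jobs that read originals off the HDD.
CAPS = {"thumbnailGeneration": 2, "metadataExtraction": 2, "library": 2}


def cap_job_concurrency(config, caps=CAPS):
    """Return a copy of `config` with each listed job clamped to its cap."""
    capped = copy.deepcopy(config)
    jobs = capped.setdefault("job", {})
    for name, limit in caps.items():
        entry = jobs.setdefault(name, {})
        # Never raise a value that is already at or below the cap.
        entry["concurrency"] = min(entry.get("concurrency", limit), limit)
    return capped


before = {
    "job": {
        "thumbnailGeneration": {"concurrency": 8},
        "metadataExtraction": {"concurrency": 4},
        "library": {"concurrency": 4},
    }
}
after = cap_job_concurrency(before)
print(after["job"])
```

Clamping with `min` rather than overwriting means a job that is already set lower than the cap is left alone.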

Still the real fix (from this PM, still TODO): the P0 import-side cap, and especially the IO-isolation items — move k8s-master etcd + node OS disks off sdc onto SSD (generalize P3), and/or give the Immich library its own spindle (P1). Concurrency caps are a band-aid; sdc remains a single shared failure domain that every storm finds. Tracked in beads (see Follow-up Implementation).

2026-05-09 IO post-mortem: docs/post-mortems/2026-05-09-io-pressure-stale-nfs.md
2026-05-16 kured/anubis post-mortem: docs/post-mortems/2026-05-16-kured-stalled-and-anubis-ha.md
2026-05-17 GPU driver post-mortem: docs/post-mortems/2026-05-17-gpu-driver-ubuntu2604-mismatch.md
Storage architecture: docs/architecture/storage.md
Backup pipeline: docs/architecture/backup-dr.md
Storage hardware mapping: memory id=464 (sdc thin pool, sda backup, sdb SSD)
3-2-1 backup strategy: memory id=609
Immich storage layout: memory id=674
Memory entries for this incident: 2682-2686 (Immich storm), 2687-2692 (LUKS mount timing)

Follow-up Implementation

This section is auto-populated by the postmortem-todo-resolver agent.

Date	Action	Priority	Type	Commit	Implemented By
