immich: fix slow context search — prewarm clip_index + latency alert/healthcheck

Context (smart) search latency was caused by the 665MB vchord clip_index decaying out of PG shared_buffers (~33% resident -> ~1.8s cold ANN reads vs ~4ms warm), NOT by yesterday's ML MODEL_TTL/clip-keepalive change (CLIP textual is warm ~15ms on GPU). The postStart prewarm runs once at pod start and pg_prewarm.autoprewarm only re-warms at startup, so the index decays under job buffer-pressure over days.

- clip-index-prewarm CronJob (immich, */5): pg_prewarm('clip_index') keeps the whole index resident -> searches stay ~4ms.
- immich-search-probe CronJob (immich, */5): times a random-vector ANN query + reads clip_index residency, pushes gauges to the Pushgateway.
- Prometheus alerts ImmichSmartSearchSlow / ImmichClipIndexColdCache / ImmichSearchProbeStale (+ inhibition when the probe is stale).
- cluster_healthcheck.sh check #46 check_immich_search (TOTAL_CHECKS 45->46).
- Docs: infra CLAUDE.md immich note, monitoring.md, cluster-health skill.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-03 21:00:43 +00:00 · 2026-06-03 21:00:43 +00:00 · f201e4573e
commit f201e4573e
parent 38c77048fd
6 changed files with 308 additions and 5 deletions
--- a/.claude/CLAUDE.md
+++ b/.claude/CLAUDE.md
@@ -153,7 +153,7 @@ Repo IDs: infra=1, Website=2, finance=3, health=4, travel_blog=5, webhook-handle
 | Service | Key Operational Knowledge |
 |---------|--------------------------|
 | Nextcloud | MaxRequestWorkers=150, needs 8Gi limit (Apache transient memory spikes, see commit eb94144), very generous startup probe |
-| Immich | ML on SSD (CUDA), disable ModSecurity (breaks streaming), frequent upgrades. **`immich-machine-learning` MUST run with `MACHINE_LEARNING_MODEL_TTL > 0`** (set to `600` in `stacks/immich/main.tf`, env on the `immich-machine-learning` deployment). At `0`, no model ever unloads and onnxruntime's CUDA arena (OCR's dynamic input shapes inflate it to ~10 GB) is held forever on the **time-sliced T4 it shares with llama-swap/frigate/immich-server** — which has no VRAM isolation, so immich-ml starved llama-swap (qwen3-8b) and silently broke recruiter-responder triage for ~5 h on 2026-06-02 (post-mortem `docs/post-mortems/2026-06-02-immich-ml-ttl-gpu-oom-recruiter.md`). TTL>0 lets idle models (OCR, face — AND CLIP) free VRAM. The TTL is a single GLOBAL knob (no per-model pin), so CLIP would also unload after 600s idle; the `clip-keepalive` CronJob (`*/5 * * * *`, same stack) pings the CLIP textual encoder so smart-search stays warm without pinning the ad-hoc models. **Video transcoding is GPU-accelerated**: `immich-server` is pinned to GPU node1 (nodeSelector `nvidia.com/gpu.present` + NoSchedule toleration + `gpu-workload` priority) with a time-sliced `nvidia.com/gpu=1` slice — the stock immich-server image's ffmpeg already ships h264/hevc_nvenc + NVDEC. Activated via `ffmpeg.accel=nvenc` + `accelDecode=true` in the **DB** system-config (`system_metadata` table, key `system-config`, JSONB — NOT Terraform; app config is DB-managed here like oauth/smtp). Direct DB edits need a pod **recreate** to reload (config is cached at boot; only API-driven changes broadcast a reload). 
**Streaming bitrate is capped** to keep 4K playback smooth on the contended HDD and over remote uplinks: `ffmpeg.maxBitrate=20000k` + `preset=medium` + `transcode=bitrate` (set 2026-06-01 — was uncapped `maxBitrate=0` + `ultrafast` + `targetResolution=original`, which produced 77–264 Mbps 4K transcodes that stuttered for every client, local and remote, since even a single stream needs ~10–13.5 MB/s off the shared `sdc` spindle). 4K resolution is preserved (`targetResolution=original`); originals are NEVER modified — only the `encoded-video/` streaming copy. To re-apply transcode settings to EXISTING videos (config changes only affect new/missing ones): delete the offenders' `asset_file` rows `WHERE type='encoded_video'` (derived/regenerable — never touches originals) then run videoConversion `force=false` (admin Jobs API → "Missing"); it regenerates them to the deterministic `<assetId>.mp4` path at concurrency 1 (gentle on sdc). See `docs/runbooks/immich-transcode-bitrate.md`. If Immich is ever reinstalled fresh (not restored), re-set these keys (accel, accelDecode, **maxBitrate=20000k, preset=medium, transcode=bitrate**). Thumbnails/previews live on SSD NFS (sdb) — do NOT move to block storage (HDD sdc = slower + the contended IO domain). **Background-job concurrency is capped to protect sdc** (DB-managed system-config, `system_metadata` key `system-config`, JSONB `job.*.concurrency`; re-set on fresh install): `thumbnailGeneration=2`, `metadataExtraction=2`, `library=2` — these jobs read ORIGINALS off the HDD library. Left uncapped (were 8/4/4) a library-wide job (e.g. Duplicate Detection on 2026-06-01) fans the ML/thumbnail backfill out into a read storm that saturates sdc and starves etcd → apiserver down. `sidecar`/`smartSearch`/`faceDetection` stay at Immich defaults (small `.xmp` / SSD previews). Apply via Job Settings UI or the `system-config` API; **direct DB edits need an `immich-server` pod recreate to reload** (config cached at boot). 
See `docs/post-mortems/2026-05-25-immich-anca-elements-io-storm.md`. |
+| Immich | ML on SSD (CUDA), disable ModSecurity (breaks streaming), frequent upgrades. **`immich-machine-learning` MUST run with `MACHINE_LEARNING_MODEL_TTL > 0`** (set to `600` in `stacks/immich/main.tf`, env on the `immich-machine-learning` deployment). At `0`, no model ever unloads and onnxruntime's CUDA arena (OCR's dynamic input shapes inflate it to ~10 GB) is held forever on the **time-sliced T4 it shares with llama-swap/frigate/immich-server** — which has no VRAM isolation, so immich-ml starved llama-swap (qwen3-8b) and silently broke recruiter-responder triage for ~5 h on 2026-06-02 (post-mortem `docs/post-mortems/2026-06-02-immich-ml-ttl-gpu-oom-recruiter.md`). TTL>0 lets idle models (OCR, face — AND CLIP) free VRAM. The TTL is a single GLOBAL knob (no per-model pin), so CLIP would also unload after 600s idle; the `clip-keepalive` CronJob (`*/5 * * * *`, same stack) pings the CLIP textual encoder so smart-search stays warm without pinning the ad-hoc models. **Smart search has a SECOND warmth layer in Postgres** (don't conflate it with the ML model): the ~665MB vchord `clip_index` must stay resident in PG `shared_buffers`, else an ANN probe that lands on an evicted list pays a ~1.8s cold storage read vs ~4ms warm. The `postStart` hook prewarms it ONCE at pod start and `pg_prewarm.autoprewarm` only re-warms at *startup*, so the index decays out of cache over days under job buffer-pressure (observed ~33% resident after 9d uptime → slow context search, easily misattributed to the ML model). The `clip-index-prewarm` CronJob (`*/5`, same stack) re-runs `pg_prewarm('clip_index')` to pin it hot; `immich-search-probe` (`*/5`) measures live latency + residency → Pushgateway gauges (`immich_smart_search_db_seconds`, `immich_clip_index_cached_pct`) → alerts `ImmichSmartSearchSlow`/`ImmichClipIndexColdCache`/`ImmichSearchProbeStale` + cluster-health check #46 (`check_immich_search`). immich PG role is a superuser so the CronJobs can run `pg_prewarm`/`pg_buffercache`. 
**Video transcoding is GPU-accelerated**: `immich-server` is pinned to GPU node1 (nodeSelector `nvidia.com/gpu.present` + NoSchedule toleration + `gpu-workload` priority) with a time-sliced `nvidia.com/gpu=1` slice — the stock immich-server image's ffmpeg already ships h264/hevc_nvenc + NVDEC. Activated via `ffmpeg.accel=nvenc` + `accelDecode=true` in the **DB** system-config (`system_metadata` table, key `system-config`, JSONB — NOT Terraform; app config is DB-managed here like oauth/smtp). Direct DB edits need a pod **recreate** to reload (config is cached at boot; only API-driven changes broadcast a reload). **Streaming bitrate is capped** to keep 4K playback smooth on the contended HDD and over remote uplinks: `ffmpeg.maxBitrate=20000k` + `preset=medium` + `transcode=bitrate` (set 2026-06-01 — was uncapped `maxBitrate=0` + `ultrafast` + `targetResolution=original`, which produced 77–264 Mbps 4K transcodes that stuttered for every client, local and remote, since even a single stream needs ~10–13.5 MB/s off the shared `sdc` spindle). 4K resolution is preserved (`targetResolution=original`); originals are NEVER modified — only the `encoded-video/` streaming copy. To re-apply transcode settings to EXISTING videos (config changes only affect new/missing ones): delete the offenders' `asset_file` rows `WHERE type='encoded_video'` (derived/regenerable — never touches originals) then run videoConversion `force=false` (admin Jobs API → "Missing"); it regenerates them to the deterministic `<assetId>.mp4` path at concurrency 1 (gentle on sdc). See `docs/runbooks/immich-transcode-bitrate.md`. If Immich is ever reinstalled fresh (not restored), re-set these keys (accel, accelDecode, **maxBitrate=20000k, preset=medium, transcode=bitrate**). Thumbnails/previews live on SSD NFS (sdb) — do NOT move to block storage (HDD sdc = slower + the contended IO domain). 
**Background-job concurrency is capped to protect sdc** (DB-managed system-config, `system_metadata` key `system-config`, JSONB `job.*.concurrency`; re-set on fresh install): `thumbnailGeneration=2`, `metadataExtraction=2`, `library=2` — these jobs read ORIGINALS off the HDD library. Left uncapped (were 8/4/4) a library-wide job (e.g. Duplicate Detection on 2026-06-01) fans the ML/thumbnail backfill out into a read storm that saturates sdc and starves etcd → apiserver down. `sidecar`/`smartSearch`/`faceDetection` stay at Immich defaults (small `.xmp` / SSD previews). Apply via Job Settings UI or the `system-config` API; **direct DB edits need an `immich-server` pod recreate to reload** (config cached at boot). See `docs/post-mortems/2026-05-25-immich-anca-elements-io-storm.md`. |
 | CrowdSec | Pin version, disable Metabase when not needed (CPU hog), LAPI scaled to 3, **DB on PostgreSQL** (migrated from MySQL), flush config: max_items=10000/max_age=7d/agents_autodelete=30d, DECISION_DURATION=168h in blocklist CronJob |
 | Frigate | GPU stall detection in liveness probe (inference speed check), high CPU |
 | Authentik | 3 replicas, PgBouncer in front of PostgreSQL, strip auth headers before forwarding |
@@ -166,7 +166,7 @@ Repo IDs: infra=1, Website=2, finance=3, health=4, travel_blog=5, webhook-handle
 - Exclude completed CronJob pods from "pod not ready" alerts.
 - Every new service gets Prometheus scrape config + Uptime Kuma monitor. External monitors auto-created for Cloudflare-proxied services by `external-monitor-sync` CronJob (10min, uptime-kuma ns). Mechanism: `ingress_factory` auto-adds `uptime.viktorbarzin.me/external-monitor=true` whenever `dns_type != "none"` (see `modules/kubernetes/ingress_factory/main.tf`) — no manual action needed on new services. The `cloudflare_proxied_names` list in `config.tfvars` is a legacy fallback for the 17 hostnames not yet migrated to `ingress_factory` `dns_type`; don't check that list when debugging "is this monitored?" questions.
 - **External monitoring**: `[External] <service>` monitors in Uptime Kuma test full external path (DNS → Cloudflare → Tunnel → Traefik). Divergence metric `external_internal_divergence_count` → alert `ExternalAccessDivergence` (15min). Config: `stacks/uptime-kuma/`, targets from `cloudflare_proxied_names` in `config.tfvars` (17 remaining centrally-managed hostnames; most DNS records now auto-created by `ingress_factory` `dns_type` param).
- Key alerts: OOMKill, pod replica mismatch, 4xx/5xx error rates, UPS battery, CPU temp, SSD writes, NFS responsiveness, ClusterMemoryRequestsHigh (>85%), ContainerNearOOM (>85% limit), PodUnschedulable, ExternalAccessDivergence.
+- Key alerts: OOMKill, pod replica mismatch, 4xx/5xx error rates, UPS battery, CPU temp, SSD writes, NFS responsiveness, ClusterMemoryRequestsHigh (>85%), ContainerNearOOM (>85% limit), PodUnschedulable, ExternalAccessDivergence, ImmichSmartSearchSlow (context-search latency / clip_index cache eviction).
 - **E2E email monitoring**: CronJob `email-roundtrip-monitor` (every 20 min) sends test email via Brevo HTTP API to `smoke-test@viktorbarzin.me` (catch-all → `spam@`), verifies IMAP delivery, deletes test email, pushes metrics to Pushgateway + Uptime Kuma. Alerts: `EmailRoundtripFailing` (60m), `EmailRoundtripStale` (60m), `EmailRoundtripNeverRun` (60m). Outbound relay: Brevo EU (`smtp-relay.brevo.com:587`, 300/day free — migrated from Mailgun). Inbound external traffic enters via pfSense HAProxy on `10.0.20.1:{25,465,587,993}`, which forwards to k8s `mailserver-proxy` NodePort (30125-30128) with `send-proxy-v2`. Mailserver pod runs alt PROXY-speaking listeners (2525/4465/5587/10993) alongside stock PROXY-free ones (25/465/587/993) for intra-cluster clients. Real client IPs recovered from PROXY v2 header despite kube-proxy SNAT (replaces pre-2026-04-19 MetalLB `10.0.20.202` ETP:Local scheme; see bd code-yiu + `docs/runbooks/mailserver-pfsense-haproxy.md`). Vault: `brevo_api_key` in `secret/viktor` (probe + relay).
 - **Authentik walling-off guard**: `blackbox-exporter` (monitoring ns, `stacks/monitoring/modules/monitoring/authentik_walloff_probe.tf`) probes each must-stay-public `auth = "none"` carve-out URL with `no_follow_redirects` and FAILS (`fail_if_header_matches` on `Location`) iff it 302s to Authentik. Catches a carve-out regressing (TF revert / deploy / `ingress_factory` `auth` default flipping back to `"required"`). Scrape job `blackbox-authentik-walloff` (1m) → alert `AuthentikWallingOffPublicPath` (`probe_failed_due_to_regex == 1`, for 10m, `lane=security` → `#security` Slack). **To guard a new carve-out: add one line to `local.authentik_walloff_targets`** (a `service → URL` map; `valid_status_codes` includes 301/302 so legit redirects/404s stay green — only the Authentik `Location` fails the probe). `curl -sI '<url>'` must NOT show a Location to `authentik.viktorbarzin.me` before adding.

--- a/.claude/skills/cluster-health/SKILL.md
+++ b/.claude/skills/cluster-health/SKILL.md
@@ -7,7 +7,7 @@ description: |
  (3) User asks to fix stuck pods, evicted pods, or CrashLoopBackOff,
  (4) User mentions "health check", "cluster status", "cluster health",
  (5) User asks "is everything running" or "any problems".
-  Runs 45 cluster-wide checks (nodes, workloads, monitoring, certs,
+  Runs 46 cluster-wide checks (nodes, workloads, monitoring, certs,
  backups, external reachability, PVE host thermals + load, HA Sofia
  status dashboard) with safe auto-fix for evicted pods.
 author: Claude Code
@@ -67,7 +67,7 @@ bash infra/scripts/cluster_healthcheck.sh --no-fix --quiet --json
 bash infra/scripts/cluster_healthcheck.sh --kubeconfig /path/to/config
 ```

-## What It Checks (45 checks)
+## What It Checks (46 checks)

 | # | Check | Notes |
 |---|-------|-------|
@@ -116,6 +116,7 @@ bash infra/scripts/cluster_healthcheck.sh --kubeconfig /path/to/config
 | 43 | PVE Host Thermals | package + per-core temps via `/sys/class/hwmon` (SSH). Baseline 55-65 °C. PASS <65 °C, WARN 65-82 °C (a VM is burning too much CPU), FAIL ≥83 °C (TjMax) |
 | 44 | PVE Host Load | `/proc/loadavg` via SSH. PASS 5m <30, WARN 30-37, FAIL ≥38 of 44 threads |
 | 45 | HA Sofia — Status Dashboard | emo's curated Барзини → Статус view (`dashboard-barzini` / path `status`). Pulls the lovelace config via WS, batch-renders every `custom:mushroom-template-card` secondary template against `/api/template`, classifies each rendered line: FAIL on `Offline` / `Disconnected` / `Разкачен` / `— No data`; WARN on `⚠️` / `Abnormal` / `Trouble (` / `(ниска)` / `Пълен резервоар` / `Грешка` / `attention` / `Внимание`. Verdict rolls up across the 8 sections (Сигурност, Мрежа & IT, Енергия, Климат, Уреди, Мултимедия, Осветление, Поливна) |
+| 46 | Immich Smart Search | Live context-search health. Measures a representative random-vector pgvector ANN query latency (in-pod, excludes exec overhead) + the `clip_index` residency in PG shared_buffers via `pg_buffercache`. PASS <0.5s & ≥90% resident; WARN 0.5-1.5s or 50-90% resident; FAIL >1.5s or <50% resident (index evicted from cache → cold reads; check the `clip-index-prewarm` CronJob) |

 ## Safe Auto-Fix Rules

--- a/docs/architecture/monitoring.md
+++ b/docs/architecture/monitoring.md
@@ -167,6 +167,13 @@ spec:
 - **RegistryIntegrityProbeStale**: Probe hasn't reported in >1h (CronJob broken)
 - **RegistryCatalogInaccessible**: Probe cannot fetch `/v2/_catalog` (auth failure or registry down)

+#### Immich Smart Search Alerts
+- **ImmichSmartSearchSlow**: Representative context-search ANN query >1s for 15m. Root cause is almost always the `clip_index` (vchord, ~665MB) decaying out of PG `shared_buffers` — a cold list read is ~1.8s vs ~4ms warm. Remediation: confirm the `clip-index-prewarm` CronJob (immich ns, `*/5`) is succeeding; manual fix `kubectl exec -n immich -c immich-postgresql <pg-pod> -- psql -U postgres -d immich -c "SELECT pg_prewarm('clip_index')"`.
+- **ImmichClipIndexColdCache**: `clip_index` <50% resident in shared_buffers for 15m (leading indicator; same remediation).
+- **ImmichSearchProbeStale**: `immich-search-probe` last reported >30m ago, sustained for 10m (CronJob broken). Inhibits the two above so frozen Pushgateway gauges don't false-fire.
+
+The Immich smart-search monitoring uses two CronJobs in the `immich` namespace (both `*/5`). `clip-index-prewarm` re-runs `pg_prewarm('clip_index')` (and `pg_prewarm('smart_search')`) to keep the vector index hot during runtime; the `postStart` prewarm only fires at pod start and `pg_prewarm.autoprewarm` only reloads at startup, so the index otherwise decays under job buffer-pressure. `immich-search-probe` runs a postgres init container that measures a random-vector ANN latency plus `pg_buffercache` residency, then a curl main container pushes `immich_smart_search_db_seconds` / `immich_clip_index_cached_pct` / `immich_smart_search_probe_success` / `immich_smart_search_probe_last_run_timestamp` to the Pushgateway. Also surfaced by cluster-health check #46 (`check_immich_search`). Note this is the **Postgres** half of smart-search warmth; the **ML model** half is kept warm by the separate `clip-keepalive` CronJob.
+
 The email monitoring system uses a CronJob (`email-roundtrip-monitor`, every 10 min) in the `mailserver` namespace that:
 1. Sends a test email via Mailgun HTTP API to `smoke-test@viktorbarzin.me`
 2. Email lands in the `spam@` catch-all mailbox via MX delivery
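
Reviewer note on the residency figure used by `ImmichClipIndexColdCache`: the probe derives it as `100 * cached_buffers * 8192 / pg_relation_size`. A minimal sketch of that arithmetic, runnable offline, follows. The function name `residency_pct` is hypothetical (it is not in the commit), and the sketch assumes the default 8192-byte PostgreSQL block size, as the SQL in this commit does.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of the commit: reproduces the residency
# arithmetic from the pg_buffercache query without needing a database.
# Assumes 8192-byte PostgreSQL blocks (the same assumption the SQL makes).
residency_pct() {
    local buffers=$1 rel_bytes=$2
    awk -v b="$buffers" -v s="$rel_bytes" 'BEGIN {
        if (s < 1) s = 1    # mirrors greatest(pg_relation_size(...), 1)
        printf "%.1f\n", 100.0 * b * 8192 / s
    }'
}

# A 665 MiB index is 697303040 bytes, i.e. 85120 pages of 8 KiB.
residency_pct 85120 697303040   # prints 100.0 (fully resident)
residency_pct 28090 697303040   # prints 33.0 (roughly the decayed state described above)
```

Keeping the division guard means an empty or missing relation yields 0.0 instead of a division error, which matches how the probe treats that case.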
--- a/scripts/cluster_healthcheck.sh
+++ b/scripts/cluster_healthcheck.sh
@@ -27,7 +27,7 @@ KUBECONFIG_PATH="${KUBECONFIG:-${HOME}/.kube/config}"
 [[ -f "$KUBECONFIG_PATH" ]] || KUBECONFIG_PATH="$(pwd)/config"
 KUBECTL=""
 JSON_RESULTS=()
-TOTAL_CHECKS=45
+TOTAL_CHECKS=46

 # Parallel execution settings. Each check function is self-contained — it
 # only reads cluster state and mutates the in-memory counters / JSON_RESULTS
@@ -2961,6 +2961,57 @@ PYEOF
    fi
 }

+# --- 46. Immich Smart (Context) Search ---
+# Smart search = ML embedding (kept warm by clip-keepalive) + a pgvector ANN
+# query over the vchord clip_index. The index must stay resident in PG
+# shared_buffers (kept warm by clip-index-prewarm); if it decays out of cache a
+# query pays a ~1.8s cold storage read instead of ~4ms warm. We measure both
+# the live ANN latency and the clip_index residency to catch the regression.
+check_immich_search() {
+    section 46 "Immich Smart Search"
+    local pg pct dur_ms dur detail=""
+
+    pg=$($KUBECTL get pods -n immich --no-headers 2>/dev/null | awk '/^immich-postgresql-/ && $3=="Running"{print $1; exit}')
+    if [[ -z "$pg" ]]; then
+        warn "immich-postgresql pod not running — cannot probe smart search"
+        json_add "immich_search" "WARN" "immich-postgresql pod not running"
+        return 0
+    fi
+
+    # clip_index residency in shared_buffers (the SQL contains single quotes, so it is passed as one double-quoted arg)
+    pct=$($KUBECTL exec -n immich -c immich-postgresql "$pg" -- psql -U postgres -d immich -tAc \
+        "SELECT COALESCE(round(100.0*count(*)*8192/greatest(pg_relation_size('clip_index'::regclass),1),1),0) FROM pg_buffercache b JOIN pg_class c ON b.relfilenode=pg_relation_filenode(c.oid) WHERE c.relname='clip_index'" 2>/dev/null | tr -d ' ')
+
+    # Representative random-vector ANN latency, measured in-pod (excludes exec overhead)
+    dur_ms=$($KUBECTL exec -n immich -c immich-postgresql "$pg" -- bash -c \
+        's=$(date +%s%3N); psql -U postgres -d immich -tAc "SELECT count(*) FROM (SELECT \"assetId\" FROM smart_search ORDER BY embedding <=> (SELECT embedding FROM smart_search ORDER BY random() LIMIT 1) LIMIT 100) x" >/dev/null 2>&1; e=$(date +%s%3N); echo $((e-s))' 2>/dev/null | tr -d ' ')
+
+    if ! [[ "$dur_ms" =~ ^[0-9]+$ ]]; then
+        warn "Smart-search probe query failed (clip_index residency: ${pct:-?}%)"
+        json_add "immich_search" "WARN" "probe query failed; residency=${pct:-?}%"
+        return 0
+    fi
+    dur=$(awk "BEGIN{printf \"%.2f\", $dur_ms/1000}")
+    detail="latency=${dur}s clip_index_resident=${pct:-?}%"
+
+    if (( dur_ms > 1500 )); then
+        [[ "$QUIET" == true ]] && section_always 46 "Immich Smart Search"
+        fail "Smart search SLOW: $detail — clip_index likely evicted; check clip-index-prewarm CronJob"
+        json_add "immich_search" "FAIL" "$detail"
+    elif [[ "$pct" =~ ^[0-9.]+$ ]] && awk "BEGIN{exit !($pct < 50)}"; then
+        [[ "$QUIET" == true ]] && section_always 46 "Immich Smart Search"
+        fail "clip_index only ${pct}% resident in PG cache — searches cold ($detail)"
+        json_add "immich_search" "FAIL" "$detail"
+    elif (( dur_ms > 500 )) || { [[ "$pct" =~ ^[0-9.]+$ ]] && awk "BEGIN{exit !($pct < 90)}"; }; then
+        [[ "$QUIET" == true ]] && section_always 46 "Immich Smart Search"
+        warn "Smart search degraded: $detail"
+        json_add "immich_search" "WARN" "$detail"
+    else
+        pass "Smart search healthy: $detail"
+        json_add "immich_search" "PASS" "$detail"
+    fi
+}
+
 # --- Summary ---
 print_summary() {
    if [[ "$JSON" == true ]]; then
@@ -3029,6 +3080,7 @@ main() {
        check_monitoring_prom_am check_monitoring_vault check_monitoring_css
        check_external_replicas check_external_divergence check_pve_thermals
        check_pve_load check_external_traefik_5xx check_ha_status_dashboard
+        check_immich_search
    )

    # Auto-fix mutates cluster state inside individual checks — keep that
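
Reviewer note on the verdict logic in `check_immich_search`: the PASS/WARN/FAIL thresholds can be exercised without a cluster by pulling them into a pure function. The sketch below is an illustration under that assumption. `classify_search` is a hypothetical name, and the residency regex is slightly stricter than the one in the script (it rejects a bare `.`).

```shell
#!/usr/bin/env bash
# Hypothetical pure-function mirror of the thresholds in check_immich_search.
# Args: latency in integer milliseconds, residency percent (may be empty).
classify_search() {
    local dur_ms=$1 pct=${2:-}
    local re='^[0-9]+([.][0-9]+)?$'
    local pct_ok=false
    [[ $pct =~ $re ]] && pct_ok=true

    if (( dur_ms > 1500 )); then
        echo FAIL                       # slow regardless of residency
    elif $pct_ok && awk -v p="$pct" 'BEGIN { exit !(p < 50) }'; then
        echo FAIL                       # index mostly evicted
    elif (( dur_ms > 500 )) || { $pct_ok && awk -v p="$pct" 'BEGIN { exit !(p < 90) }'; }; then
        echo WARN                       # degraded latency or partial eviction
    else
        echo PASS                       # also reached when residency is unknown
    fi
}

classify_search 4 100.0     # prints PASS
classify_search 200 33.0    # prints FAIL
classify_search 700 95      # prints WARN
```

As in the script, an unreadable residency value does not fail the check by itself; only the latency decides in that case.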
--- a/stacks/immich/main.tf
+++ b/stacks/immich/main.tf
@@ -853,6 +853,215 @@ resource "kubernetes_cron_job_v1" "clip-keepalive" {
  }
 }

+# Keeps the ~665MB vchord `clip_index` resident in PG shared_buffers.
+# The immich-postgresql postStart hook prewarms it ONCE at pod start, but
+# nothing re-warms it during runtime — pg_prewarm.autoprewarm only reloads at
+# *startup*. Under buffer pressure from thumbnail/OCR/library jobs the index
+# slowly decays out of cache (observed ~33% resident after 9 days uptime). A
+# smart-search ANN probe that lands on an evicted vchord list then pays a
+# ~1.8s cold storage read instead of the ~4ms warm path. This job reloads
+# clip_index and the smart_search table every 5 min (pg_prewarm loads pages,
+# it does not pin them). clip-keepalive keeps the ML *model* warm; this keeps
+# the *index* warm. BOTH are needed. immich PG role is a superuser, so it can prewarm.
+resource "kubernetes_cron_job_v1" "clip-index-prewarm" {
+  metadata {
+    name      = "clip-index-prewarm"
+    namespace = kubernetes_namespace.immich.metadata[0].name
+  }
+  spec {
+    concurrency_policy            = "Forbid"
+    failed_jobs_history_limit     = 3
+    successful_jobs_history_limit = 1
+    schedule                      = "*/5 * * * *"
+    starting_deadline_seconds     = 60
+    job_template {
+      metadata {}
+      spec {
+        backoff_limit              = 1
+        active_deadline_seconds    = 120
+        ttl_seconds_after_finished = 120
+        template {
+          metadata {}
+          spec {
+            restart_policy = "Never"
+            container {
+              name  = "prewarm"
+              image = "ghcr.io/immich-app/postgres:15-vectorchord0.3.0-pgvectors0.2.0"
+              # command overrides the postgres entrypoint → runs psql directly.
+              command = [
+                "psql", "-v", "ON_ERROR_STOP=1", "-c",
+                "SELECT pg_prewarm('clip_index'); SELECT pg_prewarm('smart_search');",
+              ]
+              env {
+                name  = "PGHOST"
+                value = "immich-postgresql.immich.svc.cluster.local"
+              }
+              env {
+                name  = "PGUSER"
+                value = "immich"
+              }
+              env {
+                name  = "PGDATABASE"
+                value = "immich"
+              }
+              env {
+                name  = "PGCONNECT_TIMEOUT"
+                value = "10"
+              }
+              env {
+                name = "PGPASSWORD"
+                value_from {
+                  secret_key_ref {
+                    name = "immich-secrets"
+                    key  = "db_password"
+                  }
+                }
+              }
+              resources {
+                requests = { cpu = "10m", memory = "32Mi" }
+                limits   = { memory = "64Mi" }
+              }
+            }
+          }
+        }
+      }
+    }
+  }
+  lifecycle {
+    # KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
+    ignore_changes = [spec[0].job_template[0].spec[0].template[0].spec[0].dns_config]
+  }
+}
+
+# Measures real context-search (smart-search) latency for alerting + the
+# cluster-health script. Two stages in one pod: an init container (postgres
+# image, has psql) times a representative random-vector ANN query and reads
+# clip_index residency from pg_buffercache, writing Prometheus exposition text
+# to a shared emptyDir; the main container (curl image) pushes it to the
+# Pushgateway. Stock images only — no apt/pip install at runtime (see the
+# clip-keepalive note). A random probe vector each run samples different vchord
+# lists, so the metric reflects true cache warmth rather than one hot list.
+resource "kubernetes_cron_job_v1" "immich-search-probe" {
+  metadata {
+    name      = "immich-search-probe"
+    namespace = kubernetes_namespace.immich.metadata[0].name
+  }
+  spec {
+    concurrency_policy            = "Forbid"
+    failed_jobs_history_limit     = 3
+    successful_jobs_history_limit = 1
+    schedule                      = "*/5 * * * *"
+    starting_deadline_seconds     = 60
+    job_template {
+      metadata {}
+      spec {
+        backoff_limit              = 1
+        active_deadline_seconds    = 120
+        ttl_seconds_after_finished = 120
+        template {
+          metadata {}
+          spec {
+            restart_policy = "Never"
+            volume {
+              name = "shared"
+              empty_dir {}
+            }
+            init_container {
+              name  = "measure"
+              image = "ghcr.io/immich-app/postgres:15-vectorchord0.3.0-pgvectors0.2.0"
+              command = ["/bin/bash", "-c", <<-EOT
+                set -uo pipefail
+                OUT=/shared/metrics.prom
+                success=1
+                start=$(date +%s%3N)
+                if ! psql -v ON_ERROR_STOP=1 -tA -c "SELECT count(*) FROM (SELECT \"assetId\" FROM smart_search ORDER BY embedding <=> (SELECT embedding FROM smart_search ORDER BY random() LIMIT 1) LIMIT 100) s" >/dev/null 2>/tmp/err; then
+                  success=0
+                  cat /tmp/err >&2
+                fi
+                end=$(date +%s%3N)
+                dur_ms=$((end - start))
+                dur=$(printf '%d.%03d' $((dur_ms/1000)) $((dur_ms%1000)))
+                pct=$(psql -tA -c "SELECT COALESCE(round(100.0*count(*)*8192/greatest(pg_relation_size('clip_index'::regclass),1),1),0) FROM pg_buffercache b JOIN pg_class c ON b.relfilenode=pg_relation_filenode(c.oid) WHERE c.relname='clip_index'" 2>/dev/null)
+                if [ -z "$pct" ]; then pct=-1; fi
+                {
+                  echo "# HELP immich_smart_search_db_seconds Wall-clock latency of a representative smart-search ANN query."
+                  echo "# TYPE immich_smart_search_db_seconds gauge"
+                  echo "immich_smart_search_db_seconds $dur"
+                  echo "# HELP immich_clip_index_cached_pct Percent of clip_index vchord index resident in PG shared_buffers."
+                  echo "# TYPE immich_clip_index_cached_pct gauge"
+                  echo "immich_clip_index_cached_pct $pct"
+                  echo "# HELP immich_smart_search_probe_success 1 if the probe ANN query succeeded."
+                  echo "# TYPE immich_smart_search_probe_success gauge"
+                  echo "immich_smart_search_probe_success $success"
+                  echo "# HELP immich_smart_search_probe_last_run_timestamp Unix time of last probe run."
+                  echo "# TYPE immich_smart_search_probe_last_run_timestamp gauge"
+                  echo "immich_smart_search_probe_last_run_timestamp $(date +%s)"
+                } > "$OUT"
+                echo "probe dur=$dur pct=$pct success=$success"
+                exit 0
+              EOT
+              ]
+              env {
+                name  = "PGHOST"
+                value = "immich-postgresql.immich.svc.cluster.local"
+              }
+              env {
+                name  = "PGUSER"
+                value = "immich"
+              }
+              env {
+                name  = "PGDATABASE"
+                value = "immich"
+              }
+              env {
+                name  = "PGCONNECT_TIMEOUT"
+                value = "10"
+              }
+              env {
+                name = "PGPASSWORD"
+                value_from {
+                  secret_key_ref {
+                    name = "immich-secrets"
+                    key  = "db_password"
+                  }
+                }
+              }
+              volume_mount {
+                name       = "shared"
+                mount_path = "/shared"
+              }
+              resources {
+                requests = { cpu = "10m", memory = "32Mi" }
+                limits   = { memory = "64Mi" }
+              }
+            }
+            container {
+              name  = "push"
+              image = "docker.io/curlimages/curl:8.11.1"
+              command = [
+                "curl", "-sf", "-m", "20", "--data-binary", "@/shared/metrics.prom",
+                "http://prometheus-prometheus-pushgateway.monitoring:9091/metrics/job/immich-search-probe",
+              ]
+              volume_mount {
+                name       = "shared"
+                mount_path = "/shared"
+              }
+              resources {
+                requests = { cpu = "10m", memory = "16Mi" }
+                limits   = { memory = "32Mi" }
+              }
+            }
+          }
+        }
+      }
+    }
+  }
+  lifecycle {
+    # KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
+    ignore_changes = [spec[0].job_template[0].spec[0].template[0].spec[0].dns_config]
+  }
+}
+
 module "ingress-immich" {
  source = "../../modules/kubernetes/ingress_factory"
  # auth = "app": Immich has its own user auth + bearer-token API. Authentik
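
Reviewer note on the `measure` init container: it converts integer milliseconds to seconds with shell arithmetic only and writes Prometheus text exposition lines by hand. A small sketch of those two steps, with hypothetical helper names (`ms_to_seconds`, `emit_gauge`) that do not appear in the commit:

```shell
#!/usr/bin/env bash
# Hypothetical helpers; names are illustrative and not from the commit.

# Integer milliseconds -> seconds with three decimals, without awk or bc,
# using the same printf pattern as the probe.
ms_to_seconds() {
    local ms=$1
    printf '%d.%03d\n' $((ms / 1000)) $((ms % 1000))
}

# One gauge in Prometheus text exposition format: HELP, TYPE, then the sample.
emit_gauge() {
    local name=$1 help=$2 value=$3
    printf '# HELP %s %s\n# TYPE %s gauge\n%s %s\n' \
        "$name" "$help" "$name" "$name" "$value"
}

emit_gauge immich_smart_search_db_seconds \
    "Wall-clock latency of a representative smart-search ANN query." \
    "$(ms_to_seconds 1800)"
```

The zero-padded `%03d` matters: without it, 4 ms would render as `0.4` and read as 400 ms.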
--- a/stacks/monitoring/modules/monitoring/prometheus_chart_values.tpl
+++ b/stacks/monitoring/modules/monitoring/prometheus_chart_values.tpl
@@ -135,6 +135,12 @@ alertmanager:
          - alertname = EmailRoundtripFailing
        target_matchers:
          - alertname = EmailRoundtripStale
+      # A stale search probe means its Pushgateway gauges are frozen — don't
+      # let the (now meaningless) latency/cache alerts fire off stale data.
+      - source_matchers:
+          - alertname = ImmichSearchProbeStale
+        target_matchers:
+          - alertname =~ "ImmichSmartSearchSlow|ImmichClipIndexColdCache"
      # Power outage makes on-battery alert redundant
      - source_matchers:
          - alertname = PowerOutage
@@ -854,6 +860,34 @@ serverFiles:
              subsystem: gpu
            annotations:
              summary: "GPU node {{ $labels.node }} is cordoned — Frigate and GPU workloads cannot schedule"
+      - name: Immich Smart Search
+        rules:
+          # Context (smart) search latency. The vchord clip_index must stay
+          # resident in PG shared_buffers; if it decays out of cache an ANN
+          # probe pays a ~1.8s cold storage read vs ~4ms warm. clip-index-prewarm
+          # (immich ns, */5) pins it; immich-search-probe (*/5) measures it and
+          # pushes these gauges to the Pushgateway.
+          - alert: ImmichSmartSearchSlow
+            expr: immich_smart_search_db_seconds{job="immich-search-probe"} > 1
+            for: 15m
+            labels:
+              severity: warning
+            annotations:
+              summary: "Immich context search slow: {{ $value | printf \"%.2f\" }}s (>1s) — clip_index likely evicted; check clip-index-prewarm CronJob"
+          - alert: ImmichClipIndexColdCache
+            expr: immich_clip_index_cached_pct{job="immich-search-probe"} >= 0 and immich_clip_index_cached_pct{job="immich-search-probe"} < 50
+            for: 15m
+            labels:
+              severity: warning
+            annotations:
+              summary: "Immich clip_index only {{ $value | printf \"%.0f\" }}% resident in PG shared_buffers — smart search will be slow (clip-index-prewarm may be failing)"
+          - alert: ImmichSearchProbeStale
+            expr: time() - immich_smart_search_probe_last_run_timestamp{job="immich-search-probe"} > 1800
+            for: 10m
+            labels:
+              severity: warning
+            annotations:
+              summary: "Immich search probe has not reported in {{ $value | printf \"%.0f\" }}s — immich-search-probe CronJob may be broken"
      - name: Power
        rules:
          - alert: OnBattery
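
Reviewer note on the three alert expressions: their instantaneous conditions can be mirrored as shell predicates for a quick offline sanity check. This is a sketch only. The function names are hypothetical, and the `for:` hold durations (15m / 15m / 10m) are not modelled, so a true result here means "pending", not "firing".

```shell
#!/usr/bin/env bash
# Hypothetical offline mirrors of the alert expressions; each returns 0 when
# the expression would match at a single evaluation instant.

# ImmichSmartSearchSlow: immich_smart_search_db_seconds > 1
search_slow() { awk -v s="$1" 'BEGIN { exit !(s > 1) }'; }

# ImmichClipIndexColdCache: 0 <= pct < 50, so the probe's -1 sentinel
# (residency unreadable) never matches.
cold_cache() { awk -v p="$1" 'BEGIN { exit !(p >= 0 && p < 50) }'; }

# ImmichSearchProbeStale: time() - last_run_timestamp > 1800
probe_stale() { local now=$1 last=$2; (( now - last > 1800 )); }

search_slow 1.800 && echo "slow matches"
cold_cache -1     || echo "sentinel ignored"
probe_stale 10000 8000 && echo "stale matches"
```

The `>= 0` guard is what keeps a failed residency read from being reported as a cold cache.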