From f201e4573ee95a0e3b19dc402ad404f07fb77110 Mon Sep 17 00:00:00 2001
From: Viktor Barzin
Date: Wed, 3 Jun 2026 21:00:43 +0000
Subject: [PATCH] =?UTF-8?q?immich:=20fix=20slow=20context=20search=20?=
 =?UTF-8?q?=E2=80=94=20prewarm=20clip=5Findex=20+=20latency=20alert/health?=
 =?UTF-8?q?check?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Context (smart) search latency was caused by the 665MB vchord clip_index
decaying out of PG shared_buffers (~33% resident -> ~1.8s cold ANN reads vs
~4ms warm), NOT by yesterday's ML MODEL_TTL/clip-keepalive change (CLIP
textual is warm ~15ms on GPU). The postStart prewarm runs once at pod start
and pg_prewarm.autoprewarm only re-warms at startup, so the index decays
under job buffer-pressure over days.

- clip-index-prewarm CronJob (immich, */5): pg_prewarm('clip_index') keeps
  the whole index resident -> searches stay ~4ms.
- immich-search-probe CronJob (immich, */5): times a random-vector ANN query
  + reads clip_index residency, pushes gauges to the Pushgateway.
- Prometheus alerts ImmichSmartSearchSlow / ImmichClipIndexColdCache /
  ImmichSearchProbeStale (+ inhibition when the probe is stale).
- cluster_healthcheck.sh check #46 check_immich_search (TOTAL_CHECKS 45->46).
- Docs: infra CLAUDE.md immich note, monitoring.md, cluster-health skill.

Co-Authored-By: Claude Opus 4.8
---
 .claude/CLAUDE.md                             |   4 +-
 .claude/skills/cluster-health/SKILL.md        |   5 +-
 docs/architecture/monitoring.md               |   7 +
 scripts/cluster_healthcheck.sh                |  54 ++++-
 stacks/immich/main.tf                         | 209 ++++++++++++++++++
 .../monitoring/prometheus_chart_values.tpl    |  34 +++
 6 files changed, 308 insertions(+), 5 deletions(-)

diff --git a/.claude/CLAUDE.md b/.claude/CLAUDE.md
index c87259c7..54b51441 100755
--- a/.claude/CLAUDE.md
+++ b/.claude/CLAUDE.md
@@ -153,7 +153,7 @@ Repo IDs: infra=1, Website=2, finance=3, health=4, travel_blog=5, webhook-handle
 | Service | Key Operational Knowledge |
 |---------|--------------------------|
 | Nextcloud | MaxRequestWorkers=150, needs 8Gi limit (Apache transient memory spikes, see commit eb94144), very generous startup probe |
-| Immich | ML on SSD (CUDA), disable ModSecurity (breaks streaming), frequent upgrades. **`immich-machine-learning` MUST run with `MACHINE_LEARNING_MODEL_TTL > 0`** (set to `600` in `stacks/immich/main.tf`, env on the `immich-machine-learning` deployment). At `0`, no model ever unloads and onnxruntime's CUDA arena (OCR's dynamic input shapes inflate it to ~10 GB) is held forever on the **time-sliced T4 it shares with llama-swap/frigate/immich-server** — which has no VRAM isolation, so immich-ml starved llama-swap (qwen3-8b) and silently broke recruiter-responder triage for ~5 h on 2026-06-02 (post-mortem `docs/post-mortems/2026-06-02-immich-ml-ttl-gpu-oom-recruiter.md`). TTL>0 lets idle models (OCR, face — AND CLIP) free VRAM. The TTL is a single GLOBAL knob (no per-model pin), so CLIP would also unload after 600s idle; the `clip-keepalive` CronJob (`*/5 * * * *`, same stack) pings the CLIP textual encoder so smart-search stays warm without pinning the ad-hoc models. **Video transcoding is GPU-accelerated**: `immich-server` is pinned to GPU node1 (nodeSelector `nvidia.com/gpu.present` + NoSchedule toleration + `gpu-workload` priority) with a time-sliced `nvidia.com/gpu=1` slice — the stock immich-server image's ffmpeg already ships h264/hevc_nvenc + NVDEC. Activated via `ffmpeg.accel=nvenc` + `accelDecode=true` in the **DB** system-config (`system_metadata` table, key `system-config`, JSONB — NOT Terraform; app config is DB-managed here like oauth/smtp). Direct DB edits need a pod **recreate** to reload (config is cached at boot; only API-driven changes broadcast a reload). **Streaming bitrate is capped** to keep 4K playback smooth on the contended HDD and over remote uplinks: `ffmpeg.maxBitrate=20000k` + `preset=medium` + `transcode=bitrate` (set 2026-06-01 — was uncapped `maxBitrate=0` + `ultrafast` + `targetResolution=original`, which produced 77–264 Mbps 4K transcodes that stuttered for every client, local and remote, since even a single stream needs ~10–13.5 MB/s off the shared `sdc` spindle). 4K resolution is preserved (`targetResolution=original`); originals are NEVER modified — only the `encoded-video/` streaming copy. To re-apply transcode settings to EXISTING videos (config changes only affect new/missing ones): delete the offenders' `asset_file` rows `WHERE type='encoded_video'` (derived/regenerable — never touches originals) then run videoConversion `force=false` (admin Jobs API → "Missing"); it regenerates them to the deterministic `.mp4` path at concurrency 1 (gentle on sdc). See `docs/runbooks/immich-transcode-bitrate.md`. If Immich is ever reinstalled fresh (not restored), re-set these keys (accel, accelDecode, **maxBitrate=20000k, preset=medium, transcode=bitrate**). Thumbnails/previews live on SSD NFS (sdb) — do NOT move to block storage (HDD sdc = slower + the contended IO domain). **Background-job concurrency is capped to protect sdc** (DB-managed system-config, `system_metadata` key `system-config`, JSONB `job.*.concurrency`; re-set on fresh install): `thumbnailGeneration=2`, `metadataExtraction=2`, `library=2` — these jobs read ORIGINALS off the HDD library. Left uncapped (were 8/4/4) a library-wide job (e.g. Duplicate Detection on 2026-06-01) fans the ML/thumbnail backfill out into a read storm that saturates sdc and starves etcd → apiserver down. `sidecar`/`smartSearch`/`faceDetection` stay at Immich defaults (small `.xmp` / SSD previews). Apply via Job Settings UI or the `system-config` API; **direct DB edits need an `immich-server` pod recreate to reload** (config cached at boot). See `docs/post-mortems/2026-05-25-immich-anca-elements-io-storm.md`. |
+| Immich | ML on SSD (CUDA), disable ModSecurity (breaks streaming), frequent upgrades. **`immich-machine-learning` MUST run with `MACHINE_LEARNING_MODEL_TTL > 0`** (set to `600` in `stacks/immich/main.tf`, env on the `immich-machine-learning` deployment). At `0`, no model ever unloads and onnxruntime's CUDA arena (OCR's dynamic input shapes inflate it to ~10 GB) is held forever on the **time-sliced T4 it shares with llama-swap/frigate/immich-server** — which has no VRAM isolation, so immich-ml starved llama-swap (qwen3-8b) and silently broke recruiter-responder triage for ~5 h on 2026-06-02 (post-mortem `docs/post-mortems/2026-06-02-immich-ml-ttl-gpu-oom-recruiter.md`). TTL>0 lets idle models (OCR, face — AND CLIP) free VRAM. The TTL is a single GLOBAL knob (no per-model pin), so CLIP would also unload after 600s idle; the `clip-keepalive` CronJob (`*/5 * * * *`, same stack) pings the CLIP textual encoder so smart-search stays warm without pinning the ad-hoc models. **Smart search has a SECOND warmth layer in Postgres** (don't conflate it with the ML model): the ~665MB vchord `clip_index` must stay resident in PG `shared_buffers`, else an ANN probe that lands on an evicted list pays a ~1.8s cold storage read vs ~4ms warm. The `postStart` hook prewarms it ONCE at pod start and `pg_prewarm.autoprewarm` only re-warms at *startup*, so the index decays out of cache over days under job buffer-pressure (observed ~33% resident after 9d uptime → slow context search, easily misattributed to the ML model). The `clip-index-prewarm` CronJob (`*/5`, same stack) re-runs `pg_prewarm('clip_index')` to pin it hot; `immich-search-probe` (`*/5`) measures live latency + residency → Pushgateway gauges (`immich_smart_search_db_seconds`, `immich_clip_index_cached_pct`) → alerts `ImmichSmartSearchSlow`/`ImmichClipIndexColdCache`/`ImmichSearchProbeStale` + cluster-health check #46 (`check_immich_search`). immich PG role is a superuser so the CronJobs can run `pg_prewarm`/`pg_buffercache`. **Video transcoding is GPU-accelerated**: `immich-server` is pinned to GPU node1 (nodeSelector `nvidia.com/gpu.present` + NoSchedule toleration + `gpu-workload` priority) with a time-sliced `nvidia.com/gpu=1` slice — the stock immich-server image's ffmpeg already ships h264/hevc_nvenc + NVDEC. Activated via `ffmpeg.accel=nvenc` + `accelDecode=true` in the **DB** system-config (`system_metadata` table, key `system-config`, JSONB — NOT Terraform; app config is DB-managed here like oauth/smtp). Direct DB edits need a pod **recreate** to reload (config is cached at boot; only API-driven changes broadcast a reload). **Streaming bitrate is capped** to keep 4K playback smooth on the contended HDD and over remote uplinks: `ffmpeg.maxBitrate=20000k` + `preset=medium` + `transcode=bitrate` (set 2026-06-01 — was uncapped `maxBitrate=0` + `ultrafast` + `targetResolution=original`, which produced 77–264 Mbps 4K transcodes that stuttered for every client, local and remote, since even a single stream needs ~10–13.5 MB/s off the shared `sdc` spindle). 4K resolution is preserved (`targetResolution=original`); originals are NEVER modified — only the `encoded-video/` streaming copy. To re-apply transcode settings to EXISTING videos (config changes only affect new/missing ones): delete the offenders' `asset_file` rows `WHERE type='encoded_video'` (derived/regenerable — never touches originals) then run videoConversion `force=false` (admin Jobs API → "Missing"); it regenerates them to the deterministic `.mp4` path at concurrency 1 (gentle on sdc). See `docs/runbooks/immich-transcode-bitrate.md`. If Immich is ever reinstalled fresh (not restored), re-set these keys (accel, accelDecode, **maxBitrate=20000k, preset=medium, transcode=bitrate**). Thumbnails/previews live on SSD NFS (sdb) — do NOT move to block storage (HDD sdc = slower + the contended IO domain). **Background-job concurrency is capped to protect sdc** (DB-managed system-config, `system_metadata` key `system-config`, JSONB `job.*.concurrency`; re-set on fresh install): `thumbnailGeneration=2`, `metadataExtraction=2`, `library=2` — these jobs read ORIGINALS off the HDD library. Left uncapped (were 8/4/4) a library-wide job (e.g. Duplicate Detection on 2026-06-01) fans the ML/thumbnail backfill out into a read storm that saturates sdc and starves etcd → apiserver down. `sidecar`/`smartSearch`/`faceDetection` stay at Immich defaults (small `.xmp` / SSD previews). Apply via Job Settings UI or the `system-config` API; **direct DB edits need an `immich-server` pod recreate to reload** (config cached at boot). See `docs/post-mortems/2026-05-25-immich-anca-elements-io-storm.md`. |
 | CrowdSec | Pin version, disable Metabase when not needed (CPU hog), LAPI scaled to 3, **DB on PostgreSQL** (migrated from MySQL), flush config: max_items=10000/max_age=7d/agents_autodelete=30d, DECISION_DURATION=168h in blocklist CronJob |
 | Frigate | GPU stall detection in liveness probe (inference speed check), high CPU |
 | Authentik | 3 replicas, PgBouncer in front of PostgreSQL, strip auth headers before forwarding |
@@ -166,7 +166,7 @@ Repo IDs: infra=1, Website=2, finance=3, health=4, travel_blog=5, webhook-handle
 - Exclude completed CronJob pods from "pod not ready" alerts.
 - Every new service gets Prometheus scrape config + Uptime Kuma monitor. External monitors auto-created for Cloudflare-proxied services by `external-monitor-sync` CronJob (10min, uptime-kuma ns). Mechanism: `ingress_factory` auto-adds `uptime.viktorbarzin.me/external-monitor=true` whenever `dns_type != "none"` (see `modules/kubernetes/ingress_factory/main.tf`) — no manual action needed on new services. The `cloudflare_proxied_names` list in `config.tfvars` is a legacy fallback for the 17 hostnames not yet migrated to `ingress_factory` `dns_type`; don't check that list when debugging "is this monitored?" questions.
 - **External monitoring**: `[External] ` monitors in Uptime Kuma test full external path (DNS → Cloudflare → Tunnel → Traefik). Divergence metric `external_internal_divergence_count` → alert `ExternalAccessDivergence` (15min). Config: `stacks/uptime-kuma/`, targets from `cloudflare_proxied_names` in `config.tfvars` (17 remaining centrally-managed hostnames; most DNS records now auto-created by `ingress_factory` `dns_type` param).
-- Key alerts: OOMKill, pod replica mismatch, 4xx/5xx error rates, UPS battery, CPU temp, SSD writes, NFS responsiveness, ClusterMemoryRequestsHigh (>85%), ContainerNearOOM (>85% limit), PodUnschedulable, ExternalAccessDivergence.
+- Key alerts: OOMKill, pod replica mismatch, 4xx/5xx error rates, UPS battery, CPU temp, SSD writes, NFS responsiveness, ClusterMemoryRequestsHigh (>85%), ContainerNearOOM (>85% limit), PodUnschedulable, ExternalAccessDivergence, ImmichSmartSearchSlow (context-search latency / clip_index cache eviction).
 - **E2E email monitoring**: CronJob `email-roundtrip-monitor` (every 20 min) sends test email via Brevo HTTP API to `smoke-test@viktorbarzin.me` (catch-all → `spam@`), verifies IMAP delivery, deletes test email, pushes metrics to Pushgateway + Uptime Kuma. Alerts: `EmailRoundtripFailing` (60m), `EmailRoundtripStale` (60m), `EmailRoundtripNeverRun` (60m). Outbound relay: Brevo EU (`smtp-relay.brevo.com:587`, 300/day free — migrated from Mailgun). Inbound external traffic enters via pfSense HAProxy on `10.0.20.1:{25,465,587,993}`, which forwards to k8s `mailserver-proxy` NodePort (30125-30128) with `send-proxy-v2`. Mailserver pod runs alt PROXY-speaking listeners (2525/4465/5587/10993) alongside stock PROXY-free ones (25/465/587/993) for intra-cluster clients. Real client IPs recovered from PROXY v2 header despite kube-proxy SNAT (replaces pre-2026-04-19 MetalLB `10.0.20.202` ETP:Local scheme; see bd code-yiu + `docs/runbooks/mailserver-pfsense-haproxy.md`). Vault: `brevo_api_key` in `secret/viktor` (probe + relay).
 - **Authentik walling-off guard**: `blackbox-exporter` (monitoring ns, `stacks/monitoring/modules/monitoring/authentik_walloff_probe.tf`) probes each must-stay-public `auth = "none"` carve-out URL with `no_follow_redirects` and FAILS (`fail_if_header_matches` on `Location`) iff it 302s to Authentik. Catches a carve-out regressing (TF revert / deploy / `ingress_factory` `auth` default flipping back to `"required"`). Scrape job `blackbox-authentik-walloff` (1m) → alert `AuthentikWallingOffPublicPath` (`probe_failed_due_to_regex == 1`, for 10m, `lane=security` → `#security` Slack). **To guard a new carve-out: add one line to `local.authentik_walloff_targets`** (a `service → URL` map; `valid_status_codes` includes 301/302 so legit redirects/404s stay green — only the Authentik `Location` fails the probe). `curl -sI ''` must NOT show a Location to `authentik.viktorbarzin.me` before adding.

diff --git a/.claude/skills/cluster-health/SKILL.md b/.claude/skills/cluster-health/SKILL.md
index 79281480..3e387f24 100644
--- a/.claude/skills/cluster-health/SKILL.md
+++ b/.claude/skills/cluster-health/SKILL.md
@@ -7,7 +7,7 @@ description: |
   (3) User asks to fix stuck pods, evicted pods, or CrashLoopBackOff,
   (4) User mentions "health check", "cluster status", "cluster health",
   (5) User asks "is everything running" or "any problems".
-  Runs 45 cluster-wide checks (nodes, workloads, monitoring, certs,
+  Runs 46 cluster-wide checks (nodes, workloads, monitoring, certs,
   backups, external reachability, PVE host thermals + load, HA Sofia
   status dashboard) with safe auto-fix for evicted pods.
 author: Claude Code
@@ -67,7 +67,7 @@ bash infra/scripts/cluster_healthcheck.sh --no-fix --quiet --json
 bash infra/scripts/cluster_healthcheck.sh --kubeconfig /path/to/config
 ```

-## What It Checks (45 checks)
+## What It Checks (46 checks)

 | # | Check | Notes |
 |---|-------|-------|
@@ -116,6 +116,7 @@ bash infra/scripts/cluster_healthcheck.sh --kubeconfig /path/to/config
 | 43 | PVE Host Thermals | package + per-core temps via `/sys/class/hwmon` (SSH). Baseline 55-65 °C. PASS <65 °C, WARN 65-82 °C (a VM is burning too much CPU), FAIL ≥83 °C (TjMax) |
 | 44 | PVE Host Load | `/proc/loadavg` via SSH. PASS 5m <30, WARN 30-37, FAIL ≥38 of 44 threads |
 | 45 | HA Sofia — Status Dashboard | emo's curated Барзини → Статус view (`dashboard-barzini` / path `status`). Pulls the lovelace config via WS, batch-renders every `custom:mushroom-template-card` secondary template against `/api/template`, classifies each rendered line: FAIL on `Offline` / `Disconnected` / `Разкачен` / `— No data`; WARN on `⚠️` / `Abnormal` / `Trouble (` / `(ниска)` / `Пълен резервоар` / `Грешка` / `attention` / `Внимание`. Verdict rolls up across the 8 sections (Сигурност, Мрежа & IT, Енергия, Климат, Уреди, Мултимедия, Осветление, Поливна) |
+| 46 | Immich Smart Search | Live context-search health. Measures a representative random-vector pgvector ANN query latency (in-pod, excludes exec overhead) + the `clip_index` residency in PG shared_buffers via `pg_buffercache`. PASS <0.5s & ≥90% resident; WARN 0.5-1.5s or 50-90% resident; FAIL >1.5s or <50% resident (index evicted from cache → cold reads; check the `clip-index-prewarm` CronJob) |

 ## Safe Auto-Fix Rules

diff --git a/docs/architecture/monitoring.md b/docs/architecture/monitoring.md
index 521540c9..39b437c0 100644
--- a/docs/architecture/monitoring.md
+++ b/docs/architecture/monitoring.md
@@ -167,6 +167,13 @@ spec:
 - **RegistryIntegrityProbeStale**: Probe hasn't reported in >1h (CronJob broken)
 - **RegistryCatalogInaccessible**: Probe cannot fetch `/v2/_catalog` (auth failure or registry down)

+#### Immich Smart Search Alerts
+- **ImmichSmartSearchSlow**: Representative context-search ANN query >1s for 15m. Root cause is almost always the `clip_index` (vchord, ~665MB) decaying out of PG `shared_buffers` — a cold list read is ~1.8s vs ~4ms warm. Remediation: confirm the `clip-index-prewarm` CronJob (immich ns, `*/5`) is succeeding; manual fix `kubectl exec -n immich -c immich-postgresql -- psql -U postgres -d immich -c "SELECT pg_prewarm('clip_index')"`.
+- **ImmichClipIndexColdCache**: `clip_index` <50% resident in shared_buffers for 15m (leading indicator; same remediation).
+- **ImmichSearchProbeStale**: `immich-search-probe` hasn't reported in >30m (CronJob broken). Inhibits the two above so frozen Pushgateway gauges don't false-fire.
+
+The Immich smart-search monitoring uses two CronJobs in the `immich` namespace (both `*/5`): `clip-index-prewarm` re-runs `pg_prewarm('clip_index')` to keep the vector index hot during runtime (the `postStart` prewarm only fires at pod start; `pg_prewarm.autoprewarm` only reloads at startup, so the index otherwise decays under job buffer-pressure), and `immich-search-probe` (postgres init-container measures a random-vector ANN latency + `pg_buffercache` residency → the main curl container pushes `immich_smart_search_db_seconds` / `immich_clip_index_cached_pct` / `immich_smart_search_probe_success` / `immich_smart_search_probe_last_run_timestamp` to the Pushgateway). Also surfaced by cluster-health check #46 (`check_immich_search`). Note this is the **Postgres** half of smart-search warmth; the **ML model** half is kept warm by the separate `clip-keepalive` CronJob.
+
 The email monitoring system uses a CronJob (`email-roundtrip-monitor`, every 10 min) in the `mailserver` namespace that:
 1. Sends a test email via Mailgun HTTP API to `smoke-test@viktorbarzin.me`
 2. Email lands in the `spam@` catch-all mailbox via MX delivery
diff --git a/scripts/cluster_healthcheck.sh b/scripts/cluster_healthcheck.sh
index ab8a6125..c3656bae 100755
--- a/scripts/cluster_healthcheck.sh
+++ b/scripts/cluster_healthcheck.sh
@@ -27,7 +27,7 @@ KUBECONFIG_PATH="${KUBECONFIG:-${HOME}/.kube/config}"
 [[ -f "$KUBECONFIG_PATH" ]] || KUBECONFIG_PATH="$(pwd)/config"
 KUBECTL=""
 JSON_RESULTS=()
-TOTAL_CHECKS=45
+TOTAL_CHECKS=46

 # Parallel execution settings. Each check function is self-contained — it
 # only reads cluster state and mutates the in-memory counters / JSON_RESULTS
@@ -2961,6 +2961,57 @@ PYEOF
     fi
 }

+# --- 46. Immich Smart (Context) Search ---
+# Smart search = ML embedding (kept warm by clip-keepalive) + a pgvector ANN
+# query over the vchord clip_index. The index must stay resident in PG
+# shared_buffers (kept warm by clip-index-prewarm); if it decays out of cache a
+# query pays a ~1.8s cold storage read instead of ~4ms warm. We measure both
+# the live ANN latency and the clip_index residency to catch the regression.
+check_immich_search() {
+    section 46 "Immich Smart Search"
+    local pg pct dur_ms dur detail=""
+
+    pg=$($KUBECTL get pods -n immich --no-headers 2>/dev/null | awk '/^immich-postgresql-/ && $3=="Running"{print $1; exit}')
+    if [[ -z "$pg" ]]; then
+        warn "immich-postgresql pod not running — cannot probe smart search"
+        json_add "immich_search" "WARN" "immich-postgresql pod not running"
+        return 0
+    fi
+
+    # clip_index residency in shared_buffers (single-quoted SQL → pass as one arg)
+    pct=$($KUBECTL exec -n immich -c immich-postgresql "$pg" -- psql -U postgres -d immich -tAc \
+        "SELECT COALESCE(round(100.0*count(*)*8192/greatest(pg_relation_size('clip_index'::regclass),1),1),0) FROM pg_buffercache b JOIN pg_class c ON b.relfilenode=pg_relation_filenode(c.oid) WHERE c.relname='clip_index'" 2>/dev/null | tr -d ' ')
+
+    # Representative random-vector ANN latency, measured in-pod (excludes exec overhead)
+    dur_ms=$($KUBECTL exec -n immich -c immich-postgresql "$pg" -- bash -c \
+        's=$(date +%s%3N); psql -U postgres -d immich -tAc "SELECT count(*) FROM (SELECT \"assetId\" FROM smart_search ORDER BY embedding <=> (SELECT embedding FROM smart_search ORDER BY random() LIMIT 1) LIMIT 100) x" >/dev/null 2>&1; e=$(date +%s%3N); echo $((e-s))' 2>/dev/null | tr -d ' ')
+
+    if ! [[ "$dur_ms" =~ ^[0-9]+$ ]]; then
+        warn "Smart-search probe query failed (clip_index residency: ${pct:-?}%)"
+        json_add "immich_search" "WARN" "probe query failed; residency=${pct:-?}%"
+        return 0
+    fi
+    dur=$(awk "BEGIN{printf \"%.2f\", $dur_ms/1000}")
+    detail="latency=${dur}s clip_index_resident=${pct:-?}%"
+
+    if (( dur_ms > 1500 )); then
+        [[ "$QUIET" == true ]] && section_always 46 "Immich Smart Search"
+        fail "Smart search SLOW: $detail — clip_index likely evicted; check clip-index-prewarm CronJob"
+        json_add "immich_search" "FAIL" "$detail"
+    elif [[ "$pct" =~ ^[0-9.]+$ ]] && awk "BEGIN{exit !($pct < 50)}"; then
+        [[ "$QUIET" == true ]] && section_always 46 "Immich Smart Search"
+        fail "clip_index only ${pct}% resident in PG cache — searches cold ($detail)"
+        json_add "immich_search" "FAIL" "$detail"
+    elif (( dur_ms > 500 )) || { [[ "$pct" =~ ^[0-9.]+$ ]] && awk "BEGIN{exit !($pct < 90)}"; }; then
+        [[ "$QUIET" == true ]] && section_always 46 "Immich Smart Search"
+        warn "Smart search degraded: $detail"
+        json_add "immich_search" "WARN" "$detail"
+    else
+        pass "Smart search healthy: $detail"
+        json_add "immich_search" "PASS" "$detail"
+    fi
+}
+
 # --- Summary ---
 print_summary() {
     if [[ "$JSON" == true ]]; then
@@ -3029,6 +3080,7 @@ main() {
         check_monitoring_prom_am check_monitoring_vault check_monitoring_css
         check_external_replicas check_external_divergence check_pve_thermals check_pve_load
         check_external_traefik_5xx check_ha_status_dashboard
+        check_immich_search
     )

     # Auto-fix mutates cluster state inside individual checks — keep that
diff --git a/stacks/immich/main.tf b/stacks/immich/main.tf
index 7bdcf769..8ee67b11 100644
--- a/stacks/immich/main.tf
+++ b/stacks/immich/main.tf
@@ -853,6 +853,215 @@ resource "kubernetes_cron_job_v1" "clip-keepalive" {
   }
 }

+# Keeps the ~665MB vchord `clip_index` resident in PG shared_buffers.
+# The immich-postgresql postStart hook prewarms it ONCE at pod start, but
+# nothing re-warms it during runtime — pg_prewarm.autoprewarm only reloads at
+# *startup*. Under buffer pressure from thumbnail/OCR/library jobs the index
+# slowly decays out of cache (observed ~33% resident after 9 days uptime). A
+# smart-search ANN probe that lands on an evicted vchord list then pays a
+# ~1.8s cold storage read instead of the ~4ms warm path. This job re-prewarms
+# every 5 min, pinning the whole index hot. Parallel to clip-keepalive (which
+# keeps the ML *model* warm); this keeps the *index* warm — BOTH are needed for
+# fast smart search. immich PG role is a superuser, so it can run pg_prewarm.
+resource "kubernetes_cron_job_v1" "clip-index-prewarm" {
+  metadata {
+    name      = "clip-index-prewarm"
+    namespace = kubernetes_namespace.immich.metadata[0].name
+  }
+  spec {
+    concurrency_policy            = "Forbid"
+    failed_jobs_history_limit     = 3
+    successful_jobs_history_limit = 1
+    schedule                      = "*/5 * * * *"
+    starting_deadline_seconds     = 60
+    job_template {
+      metadata {}
+      spec {
+        backoff_limit              = 1
+        active_deadline_seconds    = 120
+        ttl_seconds_after_finished = 120
+        template {
+          metadata {}
+          spec {
+            restart_policy = "Never"
+            container {
+              name  = "prewarm"
+              image = "ghcr.io/immich-app/postgres:15-vectorchord0.3.0-pgvectors0.2.0"
+              # command overrides the postgres entrypoint → runs psql directly.
+              command = [
+                "psql", "-v", "ON_ERROR_STOP=1", "-c",
+                "SELECT pg_prewarm('clip_index'); SELECT pg_prewarm('smart_search');",
+              ]
+              env {
+                name  = "PGHOST"
+                value = "immich-postgresql.immich.svc.cluster.local"
+              }
+              env {
+                name  = "PGUSER"
+                value = "immich"
+              }
+              env {
+                name  = "PGDATABASE"
+                value = "immich"
+              }
+              env {
+                name  = "PGCONNECT_TIMEOUT"
+                value = "10"
+              }
+              env {
+                name = "PGPASSWORD"
+                value_from {
+                  secret_key_ref {
+                    name = "immich-secrets"
+                    key  = "db_password"
+                  }
+                }
+              }
+              resources {
+                requests = { cpu = "10m", memory = "32Mi" }
+                limits   = { memory = "64Mi" }
+              }
+            }
+          }
+        }
+      }
+    }
+  }
+  lifecycle {
+    # KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
+    ignore_changes = [spec[0].job_template[0].spec[0].template[0].spec[0].dns_config]
+  }
+}
+
+# Measures real context-search (smart-search) latency for alerting + the
+# cluster-health script. Two stages in one pod: an init container (postgres
+# image, has psql) times a representative random-vector ANN query and reads
+# clip_index residency from pg_buffercache, writing Prometheus exposition text
+# to a shared emptyDir; the main container (curl image) pushes it to the
+# Pushgateway. Stock images only — no apt/pip install at runtime (see the
+# clip-keepalive note). A random probe vector each run samples different vchord
+# lists, so the metric reflects true cache warmth rather than one hot list.
+resource "kubernetes_cron_job_v1" "immich-search-probe" {
+  metadata {
+    name      = "immich-search-probe"
+    namespace = kubernetes_namespace.immich.metadata[0].name
+  }
+  spec {
+    concurrency_policy            = "Forbid"
+    failed_jobs_history_limit     = 3
+    successful_jobs_history_limit = 1
+    schedule                      = "*/5 * * * *"
+    starting_deadline_seconds     = 60
+    job_template {
+      metadata {}
+      spec {
+        backoff_limit              = 1
+        active_deadline_seconds    = 120
+        ttl_seconds_after_finished = 120
+        template {
+          metadata {}
+          spec {
+            restart_policy = "Never"
+            volume {
+              name = "shared"
+              empty_dir {}
+            }
+            init_container {
+              name  = "measure"
+              image = "ghcr.io/immich-app/postgres:15-vectorchord0.3.0-pgvectors0.2.0"
+              command = ["/bin/bash", "-c", <<-EOT
+                set -uo pipefail
+                OUT=/shared/metrics.prom
+                success=1
+                start=$(date +%s%3N)
+                if ! psql -v ON_ERROR_STOP=1 -tA -c "SELECT count(*) FROM (SELECT \"assetId\" FROM smart_search ORDER BY embedding <=> (SELECT embedding FROM smart_search ORDER BY random() LIMIT 1) LIMIT 100) s" >/dev/null 2>/tmp/err; then
+                  success=0
+                  cat /tmp/err >&2
+                fi
+                end=$(date +%s%3N)
+                dur_ms=$((end - start))
+                dur=$(printf '%d.%03d' $((dur_ms/1000)) $((dur_ms%1000)))
+                pct=$(psql -tA -c "SELECT COALESCE(round(100.0*count(*)*8192/greatest(pg_relation_size('clip_index'::regclass),1),1),0) FROM pg_buffercache b JOIN pg_class c ON b.relfilenode=pg_relation_filenode(c.oid) WHERE c.relname='clip_index'" 2>/dev/null)
+                if [ -z "$pct" ]; then pct=-1; fi
+                {
+                  echo "# HELP immich_smart_search_db_seconds Wall-clock latency of a representative smart-search ANN query."
+                  echo "# TYPE immich_smart_search_db_seconds gauge"
+                  echo "immich_smart_search_db_seconds $dur"
+                  echo "# HELP immich_clip_index_cached_pct Percent of clip_index vchord index resident in PG shared_buffers."
+                  echo "# TYPE immich_clip_index_cached_pct gauge"
+                  echo "immich_clip_index_cached_pct $pct"
+                  echo "# HELP immich_smart_search_probe_success 1 if the probe ANN query succeeded."
+                  echo "# TYPE immich_smart_search_probe_success gauge"
+                  echo "immich_smart_search_probe_success $success"
+                  echo "# HELP immich_smart_search_probe_last_run_timestamp Unix time of last probe run."
+                  echo "# TYPE immich_smart_search_probe_last_run_timestamp gauge"
+                  echo "immich_smart_search_probe_last_run_timestamp $(date +%s)"
+                } > "$OUT"
+                echo "probe dur=$dur pct=$pct success=$success"
+                exit 0
+                EOT
+              ]
+              env {
+                name  = "PGHOST"
+                value = "immich-postgresql.immich.svc.cluster.local"
+              }
+              env {
+                name  = "PGUSER"
+                value = "immich"
+              }
+              env {
+                name  = "PGDATABASE"
+                value = "immich"
+              }
+              env {
+                name  = "PGCONNECT_TIMEOUT"
+                value = "10"
+              }
+              env {
+                name = "PGPASSWORD"
+                value_from {
+                  secret_key_ref {
+                    name = "immich-secrets"
+                    key  = "db_password"
+                  }
+                }
+              }
+              volume_mount {
+                name       = "shared"
+                mount_path = "/shared"
+              }
+              resources {
+                requests = { cpu = "10m", memory = "32Mi" }
+                limits   = { memory = "64Mi" }
+              }
+            }
+            container {
+              name  = "push"
+              image = "docker.io/curlimages/curl:8.11.1"
+              command = [
+                "curl", "-sf", "-m", "20", "--data-binary", "@/shared/metrics.prom",
+                "http://prometheus-prometheus-pushgateway.monitoring:9091/metrics/job/immich-search-probe",
+              ]
+              volume_mount {
+                name       = "shared"
+                mount_path = "/shared"
+              }
+              resources {
+                requests = { cpu = "10m", memory = "16Mi" }
+                limits   = { memory = "32Mi" }
+              }
+            }
+          }
+        }
+      }
+    }
+  }
+  lifecycle {
+    # KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
+    ignore_changes = [spec[0].job_template[0].spec[0].template[0].spec[0].dns_config]
+  }
+}
+
 module "ingress-immich" {
   source = "../../modules/kubernetes/ingress_factory"
   # auth = "app": Immich has its own user auth + bearer-token API. Authentik
diff --git a/stacks/monitoring/modules/monitoring/prometheus_chart_values.tpl b/stacks/monitoring/modules/monitoring/prometheus_chart_values.tpl
index 444aab66..91145a33 100755
--- a/stacks/monitoring/modules/monitoring/prometheus_chart_values.tpl
+++ b/stacks/monitoring/modules/monitoring/prometheus_chart_values.tpl
@@ -135,6 +135,12 @@ alertmanager:
           - alertname = EmailRoundtripFailing
         target_matchers:
           - alertname = EmailRoundtripStale
+      # A stale search probe means its Pushgateway gauges are frozen — don't
+      # let the (now meaningless) latency/cache alerts fire off stale data.
+      - source_matchers:
+          - alertname = ImmichSearchProbeStale
+        target_matchers:
+          - alertname =~ "ImmichSmartSearchSlow|ImmichClipIndexColdCache"
       # Power outage makes on-battery alert redundant
       - source_matchers:
           - alertname = PowerOutage
@@ -854,6 +860,34 @@ serverFiles:
               subsystem: gpu
             annotations:
               summary: "GPU node {{ $labels.node }} is cordoned — Frigate and GPU workloads cannot schedule"
+      - name: Immich Smart Search
+        rules:
+          # Context (smart) search latency. The vchord clip_index must stay
+          # resident in PG shared_buffers; if it decays out of cache an ANN
+          # probe pays a ~1.8s cold storage read vs ~4ms warm. clip-index-prewarm
+          # (immich ns, */5) pins it; immich-search-probe (*/5) measures it and
+          # pushes these gauges to the Pushgateway.
+          - alert: ImmichSmartSearchSlow
+            expr: immich_smart_search_db_seconds{job="immich-search-probe"} > 1
+            for: 15m
+            labels:
+              severity: warning
+            annotations:
+              summary: "Immich context search slow: {{ $value | printf \"%.2f\" }}s (>1s) — clip_index likely evicted; check clip-index-prewarm CronJob"
+          - alert: ImmichClipIndexColdCache
+            expr: immich_clip_index_cached_pct{job="immich-search-probe"} >= 0 and immich_clip_index_cached_pct{job="immich-search-probe"} < 50
+            for: 15m
+            labels:
+              severity: warning
+            annotations:
+              summary: "Immich clip_index only {{ $value | printf \"%.0f\" }}% resident in PG shared_buffers — smart search will be slow (clip-index-prewarm may be failing)"
+          - alert: ImmichSearchProbeStale
+            expr: time() - immich_smart_search_probe_last_run_timestamp{job="immich-search-probe"} > 1800
+            for: 10m
+            labels:
+              severity: warning
+            annotations:
+              summary: "Immich search probe has not reported in {{ $value | printf \"%.0f\" }}s — immich-search-probe CronJob may be broken"
       - name: Power
         rules:
           - alert: OnBattery