immich: fix slow context search — prewarm clip_index + latency alert/healthcheck
Context (smart) search latency was caused by the 665MB vchord clip_index
decaying out of PG shared_buffers (~33% resident -> ~1.8s cold ANN reads vs
~4ms warm), NOT by yesterday's ML MODEL_TTL/clip-keepalive change (CLIP textual
is warm ~15ms on GPU). The postStart prewarm runs once at pod start and
pg_prewarm.autoprewarm only re-warms at startup, so the index decays under job
buffer-pressure over days.
- clip-index-prewarm CronJob (immich, */5): pg_prewarm('clip_index') keeps the
whole index resident -> searches stay ~4ms.
- immich-search-probe CronJob (immich, */5): times a random-vector ANN query +
reads clip_index residency, pushes gauges to the Pushgateway.
- Prometheus alerts ImmichSmartSearchSlow / ImmichClipIndexColdCache /
ImmichSearchProbeStale (+ inhibition when the probe is stale).
- cluster_healthcheck.sh check #46 check_immich_search (TOTAL_CHECKS 45->46).
- Docs: infra CLAUDE.md immich note, monitoring.md, cluster-health skill.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
parent
38c77048fd
commit
f201e4573e
6 changed files with 308 additions and 5 deletions
@ -153,7 +153,7 @@ Repo IDs: infra=1, Website=2, finance=3, health=4, travel_blog=5, webhook-handle
|
|||
| Service | Key Operational Knowledge |
|
||||
|---------|--------------------------|
|
||||
| Nextcloud | MaxRequestWorkers=150, needs 8Gi limit (Apache transient memory spikes, see commit eb94144), very generous startup probe |
|
||||
| Immich | ML on SSD (CUDA), disable ModSecurity (breaks streaming), frequent upgrades. **`immich-machine-learning` MUST run with `MACHINE_LEARNING_MODEL_TTL > 0`** (set to `600` in `stacks/immich/main.tf`, env on the `immich-machine-learning` deployment). At `0`, no model ever unloads and onnxruntime's CUDA arena (OCR's dynamic input shapes inflate it to ~10 GB) is held forever on the **time-sliced T4 it shares with llama-swap/frigate/immich-server** — which has no VRAM isolation, so immich-ml starved llama-swap (qwen3-8b) and silently broke recruiter-responder triage for ~5 h on 2026-06-02 (post-mortem `docs/post-mortems/2026-06-02-immich-ml-ttl-gpu-oom-recruiter.md`). TTL>0 lets idle models (OCR, face — AND CLIP) free VRAM. The TTL is a single GLOBAL knob (no per-model pin), so CLIP would also unload after 600s idle; the `clip-keepalive` CronJob (`*/5 * * * *`, same stack) pings the CLIP textual encoder so smart-search stays warm without pinning the ad-hoc models. **Video transcoding is GPU-accelerated**: `immich-server` is pinned to GPU node1 (nodeSelector `nvidia.com/gpu.present` + NoSchedule toleration + `gpu-workload` priority) with a time-sliced `nvidia.com/gpu=1` slice — the stock immich-server image's ffmpeg already ships h264/hevc_nvenc + NVDEC. Activated via `ffmpeg.accel=nvenc` + `accelDecode=true` in the **DB** system-config (`system_metadata` table, key `system-config`, JSONB — NOT Terraform; app config is DB-managed here like oauth/smtp). Direct DB edits need a pod **recreate** to reload (config is cached at boot; only API-driven changes broadcast a reload). 
**Streaming bitrate is capped** to keep 4K playback smooth on the contended HDD and over remote uplinks: `ffmpeg.maxBitrate=20000k` + `preset=medium` + `transcode=bitrate` (set 2026-06-01 — was uncapped `maxBitrate=0` + `ultrafast` + `targetResolution=original`, which produced 77–264 Mbps 4K transcodes that stuttered for every client, local and remote, since even a single stream needs ~10–13.5 MB/s off the shared `sdc` spindle). 4K resolution is preserved (`targetResolution=original`); originals are NEVER modified — only the `encoded-video/` streaming copy. To re-apply transcode settings to EXISTING videos (config changes only affect new/missing ones): delete the offenders' `asset_file` rows `WHERE type='encoded_video'` (derived/regenerable — never touches originals) then run videoConversion `force=false` (admin Jobs API → "Missing"); it regenerates them to the deterministic `<assetId>.mp4` path at concurrency 1 (gentle on sdc). See `docs/runbooks/immich-transcode-bitrate.md`. If Immich is ever reinstalled fresh (not restored), re-set these keys (accel, accelDecode, **maxBitrate=20000k, preset=medium, transcode=bitrate**). Thumbnails/previews live on SSD NFS (sdb) — do NOT move to block storage (HDD sdc = slower + the contended IO domain). **Background-job concurrency is capped to protect sdc** (DB-managed system-config, `system_metadata` key `system-config`, JSONB `job.*.concurrency`; re-set on fresh install): `thumbnailGeneration=2`, `metadataExtraction=2`, `library=2` — these jobs read ORIGINALS off the HDD library. Left uncapped (were 8/4/4) a library-wide job (e.g. Duplicate Detection on 2026-06-01) fans the ML/thumbnail backfill out into a read storm that saturates sdc and starves etcd → apiserver down. `sidecar`/`smartSearch`/`faceDetection` stay at Immich defaults (small `.xmp` / SSD previews). Apply via Job Settings UI or the `system-config` API; **direct DB edits need an `immich-server` pod recreate to reload** (config cached at boot). 
See `docs/post-mortems/2026-05-25-immich-anca-elements-io-storm.md`. |
|
||||
| Immich | ML on SSD (CUDA), disable ModSecurity (breaks streaming), frequent upgrades. **`immich-machine-learning` MUST run with `MACHINE_LEARNING_MODEL_TTL > 0`** (set to `600` in `stacks/immich/main.tf`, env on the `immich-machine-learning` deployment). At `0`, no model ever unloads and onnxruntime's CUDA arena (OCR's dynamic input shapes inflate it to ~10 GB) is held forever on the **time-sliced T4 it shares with llama-swap/frigate/immich-server** — which has no VRAM isolation, so immich-ml starved llama-swap (qwen3-8b) and silently broke recruiter-responder triage for ~5 h on 2026-06-02 (post-mortem `docs/post-mortems/2026-06-02-immich-ml-ttl-gpu-oom-recruiter.md`). TTL>0 lets idle models (OCR, face — AND CLIP) free VRAM. The TTL is a single GLOBAL knob (no per-model pin), so CLIP would also unload after 600s idle; the `clip-keepalive` CronJob (`*/5 * * * *`, same stack) pings the CLIP textual encoder so smart-search stays warm without pinning the ad-hoc models. **Smart search has a SECOND warmth layer in Postgres** (don't conflate it with the ML model): the ~665MB vchord `clip_index` must stay resident in PG `shared_buffers`, else an ANN probe that lands on an evicted list pays a ~1.8s cold storage read vs ~4ms warm. The `postStart` hook prewarms it ONCE at pod start and `pg_prewarm.autoprewarm` only re-warms at *startup*, so the index decays out of cache over days under job buffer-pressure (observed ~33% resident after 9d uptime → slow context search, easily misattributed to the ML model). The `clip-index-prewarm` CronJob (`*/5`, same stack) re-runs `pg_prewarm('clip_index')` to pin it hot; `immich-search-probe` (`*/5`) measures live latency + residency → Pushgateway gauges (`immich_smart_search_db_seconds`, `immich_clip_index_cached_pct`) → alerts `ImmichSmartSearchSlow`/`ImmichClipIndexColdCache`/`ImmichSearchProbeStale` + cluster-health check #46 (`check_immich_search`). immich PG role is a superuser so the CronJobs can run `pg_prewarm`/`pg_buffercache`. 
**Video transcoding is GPU-accelerated**: `immich-server` is pinned to GPU node1 (nodeSelector `nvidia.com/gpu.present` + NoSchedule toleration + `gpu-workload` priority) with a time-sliced `nvidia.com/gpu=1` slice — the stock immich-server image's ffmpeg already ships h264/hevc_nvenc + NVDEC. Activated via `ffmpeg.accel=nvenc` + `accelDecode=true` in the **DB** system-config (`system_metadata` table, key `system-config`, JSONB — NOT Terraform; app config is DB-managed here like oauth/smtp). Direct DB edits need a pod **recreate** to reload (config is cached at boot; only API-driven changes broadcast a reload). **Streaming bitrate is capped** to keep 4K playback smooth on the contended HDD and over remote uplinks: `ffmpeg.maxBitrate=20000k` + `preset=medium` + `transcode=bitrate` (set 2026-06-01 — was uncapped `maxBitrate=0` + `ultrafast` + `targetResolution=original`, which produced 77–264 Mbps 4K transcodes that stuttered for every client, local and remote, since even a single stream needs ~10–13.5 MB/s off the shared `sdc` spindle). 4K resolution is preserved (`targetResolution=original`); originals are NEVER modified — only the `encoded-video/` streaming copy. To re-apply transcode settings to EXISTING videos (config changes only affect new/missing ones): delete the offenders' `asset_file` rows `WHERE type='encoded_video'` (derived/regenerable — never touches originals) then run videoConversion `force=false` (admin Jobs API → "Missing"); it regenerates them to the deterministic `<assetId>.mp4` path at concurrency 1 (gentle on sdc). See `docs/runbooks/immich-transcode-bitrate.md`. If Immich is ever reinstalled fresh (not restored), re-set these keys (accel, accelDecode, **maxBitrate=20000k, preset=medium, transcode=bitrate**). Thumbnails/previews live on SSD NFS (sdb) — do NOT move to block storage (HDD sdc = slower + the contended IO domain). 
**Background-job concurrency is capped to protect sdc** (DB-managed system-config, `system_metadata` key `system-config`, JSONB `job.*.concurrency`; re-set on fresh install): `thumbnailGeneration=2`, `metadataExtraction=2`, `library=2` — these jobs read ORIGINALS off the HDD library. Left uncapped (were 8/4/4) a library-wide job (e.g. Duplicate Detection on 2026-06-01) fans the ML/thumbnail backfill out into a read storm that saturates sdc and starves etcd → apiserver down. `sidecar`/`smartSearch`/`faceDetection` stay at Immich defaults (small `.xmp` / SSD previews). Apply via Job Settings UI or the `system-config` API; **direct DB edits need an `immich-server` pod recreate to reload** (config cached at boot). See `docs/post-mortems/2026-05-25-immich-anca-elements-io-storm.md`. |
|
||||
| CrowdSec | Pin version, disable Metabase when not needed (CPU hog), LAPI scaled to 3, **DB on PostgreSQL** (migrated from MySQL), flush config: max_items=10000/max_age=7d/agents_autodelete=30d, DECISION_DURATION=168h in blocklist CronJob |
|
||||
| Frigate | GPU stall detection in liveness probe (inference speed check), high CPU |
|
||||
| Authentik | 3 replicas, PgBouncer in front of PostgreSQL, strip auth headers before forwarding |
@ -166,7 +166,7 @@ Repo IDs: infra=1, Website=2, finance=3, health=4, travel_blog=5, webhook-handle
|
|||
- Exclude completed CronJob pods from "pod not ready" alerts.
|
||||
- Every new service gets Prometheus scrape config + Uptime Kuma monitor. External monitors auto-created for Cloudflare-proxied services by `external-monitor-sync` CronJob (10min, uptime-kuma ns). Mechanism: `ingress_factory` auto-adds `uptime.viktorbarzin.me/external-monitor=true` whenever `dns_type != "none"` (see `modules/kubernetes/ingress_factory/main.tf`) — no manual action needed on new services. The `cloudflare_proxied_names` list in `config.tfvars` is a legacy fallback for the 17 hostnames not yet migrated to `ingress_factory` `dns_type`; don't check that list when debugging "is this monitored?" questions.
|
||||
- **External monitoring**: `[External] <service>` monitors in Uptime Kuma test full external path (DNS → Cloudflare → Tunnel → Traefik). Divergence metric `external_internal_divergence_count` → alert `ExternalAccessDivergence` (15min). Config: `stacks/uptime-kuma/`, targets from `cloudflare_proxied_names` in `config.tfvars` (17 remaining centrally-managed hostnames; most DNS records now auto-created by `ingress_factory` `dns_type` param).
|
||||
- Key alerts: OOMKill, pod replica mismatch, 4xx/5xx error rates, UPS battery, CPU temp, SSD writes, NFS responsiveness, ClusterMemoryRequestsHigh (>85%), ContainerNearOOM (>85% limit), PodUnschedulable, ExternalAccessDivergence.
|
||||
- Key alerts: OOMKill, pod replica mismatch, 4xx/5xx error rates, UPS battery, CPU temp, SSD writes, NFS responsiveness, ClusterMemoryRequestsHigh (>85%), ContainerNearOOM (>85% limit), PodUnschedulable, ExternalAccessDivergence, ImmichSmartSearchSlow (context-search latency / clip_index cache eviction).
|
||||
- **E2E email monitoring**: CronJob `email-roundtrip-monitor` (every 20 min) sends test email via Brevo HTTP API to `smoke-test@viktorbarzin.me` (catch-all → `spam@`), verifies IMAP delivery, deletes test email, pushes metrics to Pushgateway + Uptime Kuma. Alerts: `EmailRoundtripFailing` (60m), `EmailRoundtripStale` (60m), `EmailRoundtripNeverRun` (60m). Outbound relay: Brevo EU (`smtp-relay.brevo.com:587`, 300/day free — migrated from Mailgun). Inbound external traffic enters via pfSense HAProxy on `10.0.20.1:{25,465,587,993}`, which forwards to k8s `mailserver-proxy` NodePort (30125-30128) with `send-proxy-v2`. Mailserver pod runs alt PROXY-speaking listeners (2525/4465/5587/10993) alongside stock PROXY-free ones (25/465/587/993) for intra-cluster clients. Real client IPs recovered from PROXY v2 header despite kube-proxy SNAT (replaces pre-2026-04-19 MetalLB `10.0.20.202` ETP:Local scheme; see bd code-yiu + `docs/runbooks/mailserver-pfsense-haproxy.md`). Vault: `brevo_api_key` in `secret/viktor` (probe + relay).
|
||||
- **Authentik walling-off guard**: `blackbox-exporter` (monitoring ns, `stacks/monitoring/modules/monitoring/authentik_walloff_probe.tf`) probes each must-stay-public `auth = "none"` carve-out URL with `no_follow_redirects` and FAILS (`fail_if_header_matches` on `Location`) iff it 302s to Authentik. Catches a carve-out regressing (TF revert / deploy / `ingress_factory` `auth` default flipping back to `"required"`). Scrape job `blackbox-authentik-walloff` (1m) → alert `AuthentikWallingOffPublicPath` (`probe_failed_due_to_regex == 1`, for 10m, `lane=security` → `#security` Slack). **To guard a new carve-out: add one line to `local.authentik_walloff_targets`** (a `service → URL` map; `valid_status_codes` includes 301/302 so legit redirects/404s stay green — only the Authentik `Location` fails the probe). `curl -sI '<url>'` must NOT show a Location to `authentik.viktorbarzin.me` before adding.
@ -7,7 +7,7 @@ description: |
|
|||
(3) User asks to fix stuck pods, evicted pods, or CrashLoopBackOff,
|
||||
(4) User mentions "health check", "cluster status", "cluster health",
|
||||
(5) User asks "is everything running" or "any problems".
|
||||
Runs 45 cluster-wide checks (nodes, workloads, monitoring, certs,
|
||||
Runs 46 cluster-wide checks (nodes, workloads, monitoring, certs,
|
||||
backups, external reachability, PVE host thermals + load, HA Sofia
|
||||
status dashboard) with safe auto-fix for evicted pods.
|
||||
author: Claude Code
@ -67,7 +67,7 @@ bash infra/scripts/cluster_healthcheck.sh --no-fix --quiet --json
|
|||
bash infra/scripts/cluster_healthcheck.sh --kubeconfig /path/to/config
|
||||
```
|
||||
|
||||
## What It Checks (45 checks)
|
||||
## What It Checks (46 checks)
|
||||
|
||||
| # | Check | Notes |
|
||||
|---|-------|-------|
@ -116,6 +116,7 @@ bash infra/scripts/cluster_healthcheck.sh --kubeconfig /path/to/config
|
|||
| 43 | PVE Host Thermals | package + per-core temps via `/sys/class/hwmon` (SSH). Baseline 55-65 °C. PASS <65 °C, WARN 65-82 °C (a VM is burning too much CPU), FAIL ≥83 °C (TjMax) |
|
||||
| 44 | PVE Host Load | `/proc/loadavg` via SSH. PASS 5m <30, WARN 30-37, FAIL ≥38 of 44 threads |
|
||||
| 45 | HA Sofia — Status Dashboard | emo's curated Барзини → Статус view (`dashboard-barzini` / path `status`). Pulls the lovelace config via WS, batch-renders every `custom:mushroom-template-card` secondary template against `/api/template`, classifies each rendered line: FAIL on `Offline` / `Disconnected` / `Разкачен` / `— No data`; WARN on `⚠️` / `Abnormal` / `Trouble (` / `(ниска)` / `Пълен резервоар` / `Грешка` / `attention` / `Внимание`. Verdict rolls up across the 8 sections (Сигурност, Мрежа & IT, Енергия, Климат, Уреди, Мултимедия, Осветление, Поливна) |
|
||||
| 46 | Immich Smart Search | Live context-search health. Measures a representative random-vector pgvector ANN query latency (in-pod, excludes exec overhead) + the `clip_index` residency in PG shared_buffers via `pg_buffercache`. PASS <0.5s & ≥90% resident; WARN 0.5-1.5s or 50-90% resident; FAIL >1.5s or <50% resident (index evicted from cache → cold reads; check the `clip-index-prewarm` CronJob) |
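
The PASS/WARN/FAIL thresholds in row 46 can be read as a small pure function. The following is a minimal sketch, not the script's actual code: `classify_search` is a hypothetical helper name, and it assumes latency arrives in whole milliseconds and residency as a decimal percentage.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of cluster_healthcheck.sh): maps a measured
# ANN latency and clip_index residency onto the check #46 verdicts.
# Usage: classify_search <latency_ms> <resident_pct>
classify_search() {
    local dur_ms=$1 pct=$2
    # FAIL: slower than 1.5s, or less than half the index resident.
    if (( dur_ms > 1500 )) || awk -v p="$pct" 'BEGIN { exit !(p < 50) }'; then
        echo FAIL
    # WARN: slower than 0.5s, or less than 90% resident.
    elif (( dur_ms > 500 )) || awk -v p="$pct" 'BEGIN { exit !(p < 90) }'; then
        echo WARN
    else
        echo PASS
    fi
}
```

For example, `classify_search 4 100` corresponds to the warm case, while `classify_search 4 33.0` fails on residency alone even though the sampled query happened to be fast. `awk` handles the comparison because bash arithmetic cannot compare decimal percentages.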
|
||||
|
||||
## Safe Auto-Fix Rules
@ -167,6 +167,13 @@ spec:
|
|||
- **RegistryIntegrityProbeStale**: Probe hasn't reported in >1h (CronJob broken)
|
||||
- **RegistryCatalogInaccessible**: Probe cannot fetch `/v2/_catalog` (auth failure or registry down)
|
||||
|
||||
#### Immich Smart Search Alerts
|
||||
- **ImmichSmartSearchSlow**: Representative context-search ANN query >1s for 15m. Root cause is almost always the `clip_index` (vchord, ~665MB) decaying out of PG `shared_buffers` — a cold list read is ~1.8s vs ~4ms warm. Remediation: confirm the `clip-index-prewarm` CronJob (immich ns, `*/5`) is succeeding; manual fix `kubectl exec -n immich -c immich-postgresql <pg-pod> -- psql -U postgres -d immich -c "SELECT pg_prewarm('clip_index')"`.
|
||||
- **ImmichClipIndexColdCache**: `clip_index` <50% resident in shared_buffers for 15m (leading indicator; same remediation).
|
||||
- **ImmichSearchProbeStale**: `immich-search-probe` hasn't reported in >30m (CronJob broken). Inhibits the two above so frozen Pushgateway gauges don't false-fire.
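
When one of these alerts fires, it can help to read the pushed gauges directly and re-do the staleness arithmetic by hand. The sketch below is illustrative only: `metric_value` and `probe_is_stale` are hypothetical helper names, and the 1800-second default mirrors the 30-minute threshold described above.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for inspecting the probe's gauges by hand.

# Print the value of the first sample of metric <name> from Prometheus
# exposition text on stdin. Labels such as {job="..."} are ignored.
metric_value() {
    awk -v name="$1" '
        /^#/ { next }                        # skip HELP/TYPE comments
        { key = $1; sub(/[{].*/, "", key) }  # strip the label set
        key == name { print $NF; exit }
    '
}

# Succeed (exit 0) when the last run is older than max_age seconds.
# Usage: probe_is_stale <now_epoch> <last_run_epoch> [max_age_seconds]
probe_is_stale() {
    local now=$1 last=$2 max_age=${3:-1800}
    # awk is used so values in scientific notation (e.g. 1.7e+09) still compare.
    awk -v n="$now" -v l="$last" -v m="$max_age" 'BEGIN { exit !((n - l) > m) }'
}
```

One possible use, assuming the Pushgateway's `/metrics` endpoint is reachable from where you run it: pipe its output through `metric_value immich_smart_search_probe_last_run_timestamp` and pass the result to `probe_is_stale "$(date +%s)" <value>`.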
|
||||
|
||||
The Immich smart-search monitoring uses two CronJobs in the `immich` namespace (both `*/5`). `clip-index-prewarm` re-runs `pg_prewarm('clip_index')` (and `pg_prewarm('smart_search')`) to keep the vector index hot during runtime: the `postStart` prewarm only fires at pod start and `pg_prewarm.autoprewarm` only reloads at startup, so the index otherwise decays under job buffer-pressure. `immich-search-probe` runs in two stages: a postgres init container measures a random-vector ANN latency plus `pg_buffercache` residency, then the main curl container pushes `immich_smart_search_db_seconds` / `immich_clip_index_cached_pct` / `immich_smart_search_probe_success` / `immich_smart_search_probe_last_run_timestamp` to the Pushgateway. Also surfaced by cluster-health check #46 (`check_immich_search`). Note this is the **Postgres** half of smart-search warmth; the **ML model** half is kept warm by the separate `clip-keepalive` CronJob.
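
The hand-off between the two probe stages is plain Prometheus exposition text. As a rough sketch of that format (the `render_metrics` name is hypothetical, and the `# HELP` / `# TYPE` lines the real probe writes are omitted for brevity), the measured millisecond latency is converted to seconds with integer arithmetic and written alongside the other gauges:

```shell
#!/usr/bin/env bash
# Hypothetical helper: render the four probe gauges as exposition text.
# Usage: render_metrics <latency_ms> <resident_pct> <success 0|1> <epoch>
render_metrics() {
    local dur_ms=$1 pct=$2 success=$3 now=$4
    local dur
    # Integer-only ms -> seconds conversion, e.g. 1834 -> 1.834, 4 -> 0.004.
    dur=$(printf '%d.%03d' $((dur_ms / 1000)) $((dur_ms % 1000)))
    cat <<EOF
immich_smart_search_db_seconds $dur
immich_clip_index_cached_pct $pct
immich_smart_search_probe_success $success
immich_smart_search_probe_last_run_timestamp $now
EOF
}
```

Writing the file in one stage and pushing it in another lets each stage use a stock image: one ships `psql`, the other ships `curl`.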
|
||||
|
||||
The email monitoring system uses a CronJob (`email-roundtrip-monitor`, every 10 min) in the `mailserver` namespace that:
|
||||
1. Sends a test email via Mailgun HTTP API to `smoke-test@viktorbarzin.me`
|
||||
2. Email lands in the `spam@` catch-all mailbox via MX delivery
@ -27,7 +27,7 @@ KUBECONFIG_PATH="${KUBECONFIG:-${HOME}/.kube/config}"
|
|||
[[ -f "$KUBECONFIG_PATH" ]] || KUBECONFIG_PATH="$(pwd)/config"
|
||||
KUBECTL=""
|
||||
JSON_RESULTS=()
|
||||
TOTAL_CHECKS=45
|
||||
TOTAL_CHECKS=46
|
||||
|
||||
# Parallel execution settings. Each check function is self-contained — it
|
||||
# only reads cluster state and mutates the in-memory counters / JSON_RESULTS
@ -2961,6 +2961,57 @@ PYEOF
|
|||
fi
|
||||
}
|
||||
|
||||
# --- 46. Immich Smart (Context) Search ---
# Smart search = ML embedding (kept warm by clip-keepalive) + a pgvector ANN
# query over the vchord clip_index. The index must stay resident in PG
# shared_buffers (kept warm by clip-index-prewarm); if it decays out of cache a
# query pays a ~1.8s cold storage read instead of ~4ms warm. We measure both
# the live ANN latency and the clip_index residency to catch the regression.
check_immich_search() {
    section 46 "Immich Smart Search"
    local pg pct dur_ms dur detail=""

    pg=$($KUBECTL get pods -n immich --no-headers 2>/dev/null | awk '/^immich-postgresql-/ && $3=="Running"{print $1; exit}')
    if [[ -z "$pg" ]]; then
        warn "immich-postgresql pod not running — cannot probe smart search"
        json_add "immich_search" "WARN" "immich-postgresql pod not running"
        return 0
    fi

    # clip_index residency in shared_buffers (single-quoted SQL → pass as one arg)
    pct=$($KUBECTL exec -n immich -c immich-postgresql "$pg" -- psql -U postgres -d immich -tAc \
        "SELECT COALESCE(round(100.0*count(*)*8192/greatest(pg_relation_size('clip_index'::regclass),1),1),0) FROM pg_buffercache b JOIN pg_class c ON b.relfilenode=pg_relation_filenode(c.oid) WHERE c.relname='clip_index'" 2>/dev/null | tr -d ' ')

    # Representative random-vector ANN latency, measured in-pod (excludes exec overhead)
    dur_ms=$($KUBECTL exec -n immich -c immich-postgresql "$pg" -- bash -c \
        's=$(date +%s%3N); psql -U postgres -d immich -tAc "SELECT count(*) FROM (SELECT \"assetId\" FROM smart_search ORDER BY embedding <=> (SELECT embedding FROM smart_search ORDER BY random() LIMIT 1) LIMIT 100) x" >/dev/null 2>&1; e=$(date +%s%3N); echo $((e-s))' 2>/dev/null | tr -d ' ')

    if ! [[ "$dur_ms" =~ ^[0-9]+$ ]]; then
        warn "Smart-search probe query failed (clip_index residency: ${pct:-?}%)"
        json_add "immich_search" "WARN" "probe query failed; residency=${pct:-?}%"
        return 0
    fi
    dur=$(awk "BEGIN{printf \"%.2f\", $dur_ms/1000}")
    detail="latency=${dur}s clip_index_resident=${pct:-?}%"

    if (( dur_ms > 1500 )); then
        [[ "$QUIET" == true ]] && section_always 46 "Immich Smart Search"
        fail "Smart search SLOW: $detail — clip_index likely evicted; check clip-index-prewarm CronJob"
        json_add "immich_search" "FAIL" "$detail"
    elif [[ "$pct" =~ ^[0-9.]+$ ]] && awk "BEGIN{exit !($pct < 50)}"; then
        [[ "$QUIET" == true ]] && section_always 46 "Immich Smart Search"
        fail "clip_index only ${pct}% resident in PG cache — searches cold ($detail)"
        json_add "immich_search" "FAIL" "$detail"
    elif (( dur_ms > 500 )) || { [[ "$pct" =~ ^[0-9.]+$ ]] && awk "BEGIN{exit !($pct < 90)}"; }; then
        [[ "$QUIET" == true ]] && section_always 46 "Immich Smart Search"
        warn "Smart search degraded: $detail"
        json_add "immich_search" "WARN" "$detail"
    else
        pass "Smart search healthy: $detail"
        json_add "immich_search" "PASS" "$detail"
    fi
}
|
||||
|
||||
# --- Summary ---
|
||||
print_summary() {
|
||||
if [[ "$JSON" == true ]]; then
@ -3029,6 +3080,7 @@ main() {
|
|||
check_monitoring_prom_am check_monitoring_vault check_monitoring_css
|
||||
check_external_replicas check_external_divergence check_pve_thermals
|
||||
check_pve_load check_external_traefik_5xx check_ha_status_dashboard
|
||||
check_immich_search
|
||||
)
|
||||
|
||||
# Auto-fix mutates cluster state inside individual checks — keep that
@ -853,6 +853,215 @@ resource "kubernetes_cron_job_v1" "clip-keepalive" {
|
|||
}
|
||||
}
|
||||
|
||||
# Keeps the ~665MB vchord `clip_index` resident in PG shared_buffers.
# The immich-postgresql postStart hook prewarms it ONCE at pod start, but
# nothing re-warms it during runtime — pg_prewarm.autoprewarm only reloads at
# *startup*. Under buffer pressure from thumbnail/OCR/library jobs the index
# slowly decays out of cache (observed ~33% resident after 9 days uptime). A
# smart-search ANN probe that lands on an evicted vchord list then pays a
# ~1.8s cold storage read instead of the ~4ms warm path. This job re-prewarms
# every 5 min, pinning the whole index hot. Parallel to clip-keepalive (which
# keeps the ML *model* warm); this keeps the *index* warm — BOTH are needed for
# fast smart search. immich PG role is a superuser, so it can run pg_prewarm.
resource "kubernetes_cron_job_v1" "clip-index-prewarm" {
  metadata {
    name      = "clip-index-prewarm"
    namespace = kubernetes_namespace.immich.metadata[0].name
  }
  spec {
    concurrency_policy            = "Forbid"
    failed_jobs_history_limit     = 3
    successful_jobs_history_limit = 1
    schedule                      = "*/5 * * * *"
    starting_deadline_seconds     = 60
    job_template {
      metadata {}
      spec {
        backoff_limit              = 1
        active_deadline_seconds    = 120
        ttl_seconds_after_finished = 120
        template {
          metadata {}
          spec {
            restart_policy = "Never"
            container {
              name  = "prewarm"
              image = "ghcr.io/immich-app/postgres:15-vectorchord0.3.0-pgvectors0.2.0"
              # command overrides the postgres entrypoint → runs psql directly.
              command = [
                "psql", "-v", "ON_ERROR_STOP=1", "-c",
                "SELECT pg_prewarm('clip_index'); SELECT pg_prewarm('smart_search');",
              ]
              env {
                name  = "PGHOST"
                value = "immich-postgresql.immich.svc.cluster.local"
              }
              env {
                name  = "PGUSER"
                value = "immich"
              }
              env {
                name  = "PGDATABASE"
                value = "immich"
              }
              env {
                name  = "PGCONNECT_TIMEOUT"
                value = "10"
              }
              env {
                name = "PGPASSWORD"
                value_from {
                  secret_key_ref {
                    name = "immich-secrets"
                    key  = "db_password"
                  }
                }
              }
              resources {
                requests = { cpu = "10m", memory = "32Mi" }
                limits   = { memory = "64Mi" }
              }
            }
          }
        }
      }
    }
  }
  lifecycle {
    # KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
    ignore_changes = [spec[0].job_template[0].spec[0].template[0].spec[0].dns_config]
  }
}
|
||||
|
||||
# Measures real context-search (smart-search) latency for alerting + the
# cluster-health script. Two stages in one pod: an init container (postgres
# image, has psql) times a representative random-vector ANN query and reads
# clip_index residency from pg_buffercache, writing Prometheus exposition text
# to a shared emptyDir; the main container (curl image) pushes it to the
# Pushgateway. Stock images only — no apt/pip install at runtime (see the
# clip-keepalive note). A random probe vector each run samples different vchord
# lists, so the metric reflects true cache warmth rather than one hot list.
resource "kubernetes_cron_job_v1" "immich-search-probe" {
  metadata {
    name      = "immich-search-probe"
    namespace = kubernetes_namespace.immich.metadata[0].name
  }
  spec {
    concurrency_policy            = "Forbid"
    failed_jobs_history_limit     = 3
    successful_jobs_history_limit = 1
    schedule                      = "*/5 * * * *"
    starting_deadline_seconds     = 60
    job_template {
      metadata {}
      spec {
        backoff_limit              = 1
        active_deadline_seconds    = 120
        ttl_seconds_after_finished = 120
        template {
          metadata {}
          spec {
            restart_policy = "Never"
            volume {
              name = "shared"
              empty_dir {}
            }
            init_container {
              name  = "measure"
              image = "ghcr.io/immich-app/postgres:15-vectorchord0.3.0-pgvectors0.2.0"
              command = ["/bin/bash", "-c", <<-EOT
                set -uo pipefail
                OUT=/shared/metrics.prom
                success=1
                start=$(date +%s%3N)
                if ! psql -v ON_ERROR_STOP=1 -tA -c "SELECT count(*) FROM (SELECT \"assetId\" FROM smart_search ORDER BY embedding <=> (SELECT embedding FROM smart_search ORDER BY random() LIMIT 1) LIMIT 100) s" >/dev/null 2>/tmp/err; then
                  success=0
                  cat /tmp/err >&2
                fi
                end=$(date +%s%3N)
                dur_ms=$((end - start))
                dur=$(printf '%d.%03d' $((dur_ms/1000)) $((dur_ms%1000)))
                pct=$(psql -tA -c "SELECT COALESCE(round(100.0*count(*)*8192/greatest(pg_relation_size('clip_index'::regclass),1),1),0) FROM pg_buffercache b JOIN pg_class c ON b.relfilenode=pg_relation_filenode(c.oid) WHERE c.relname='clip_index'" 2>/dev/null)
                if [ -z "$pct" ]; then pct=-1; fi
                {
                  echo "# HELP immich_smart_search_db_seconds Wall-clock latency of a representative smart-search ANN query."
                  echo "# TYPE immich_smart_search_db_seconds gauge"
                  echo "immich_smart_search_db_seconds $dur"
                  echo "# HELP immich_clip_index_cached_pct Percent of clip_index vchord index resident in PG shared_buffers."
                  echo "# TYPE immich_clip_index_cached_pct gauge"
                  echo "immich_clip_index_cached_pct $pct"
                  echo "# HELP immich_smart_search_probe_success 1 if the probe ANN query succeeded."
                  echo "# TYPE immich_smart_search_probe_success gauge"
                  echo "immich_smart_search_probe_success $success"
                  echo "# HELP immich_smart_search_probe_last_run_timestamp Unix time of last probe run."
                  echo "# TYPE immich_smart_search_probe_last_run_timestamp gauge"
                  echo "immich_smart_search_probe_last_run_timestamp $(date +%s)"
                } > "$OUT"
                echo "probe dur=$dur pct=$pct success=$success"
                exit 0
              EOT
              ]
              env {
                name  = "PGHOST"
                value = "immich-postgresql.immich.svc.cluster.local"
              }
              env {
                name  = "PGUSER"
                value = "immich"
              }
              env {
                name  = "PGDATABASE"
                value = "immich"
              }
              env {
                name  = "PGCONNECT_TIMEOUT"
                value = "10"
              }
              env {
                name = "PGPASSWORD"
                value_from {
                  secret_key_ref {
                    name = "immich-secrets"
                    key  = "db_password"
                  }
                }
              }
              volume_mount {
                name       = "shared"
                mount_path = "/shared"
              }
              resources {
                requests = { cpu = "10m", memory = "32Mi" }
                limits   = { memory = "64Mi" }
              }
            }
            container {
              name  = "push"
              image = "docker.io/curlimages/curl:8.11.1"
              command = [
                "curl", "-sf", "-m", "20", "--data-binary", "@/shared/metrics.prom",
                "http://prometheus-prometheus-pushgateway.monitoring:9091/metrics/job/immich-search-probe",
              ]
              volume_mount {
                name       = "shared"
                mount_path = "/shared"
              }
              resources {
                requests = { cpu = "10m", memory = "16Mi" }
                limits   = { memory = "32Mi" }
              }
            }
          }
        }
      }
    }
  }
  lifecycle {
    # KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
    ignore_changes = [spec[0].job_template[0].spec[0].template[0].spec[0].dns_config]
  }
}
|
||||
|
||||
module "ingress-immich" {
|
||||
source = "../../modules/kubernetes/ingress_factory"
|
||||
# auth = "app": Immich has its own user auth + bearer-token API. Authentik
@ -135,6 +135,12 @@ alertmanager:
|
|||
- alertname = EmailRoundtripFailing
|
||||
target_matchers:
|
||||
- alertname = EmailRoundtripStale
|
||||
# A stale search probe means its Pushgateway gauges are frozen — don't
# let the (now meaningless) latency/cache alerts fire off stale data.
- source_matchers:
    - alertname = ImmichSearchProbeStale
  target_matchers:
    - alertname =~ "ImmichSmartSearchSlow|ImmichClipIndexColdCache"
|
||||
# Power outage makes on-battery alert redundant
|
||||
- source_matchers:
|
||||
- alertname = PowerOutage
@ -854,6 +860,34 @@ serverFiles:
|
|||
subsystem: gpu
|
||||
annotations:
|
||||
summary: "GPU node {{ $labels.node }} is cordoned — Frigate and GPU workloads cannot schedule"
|
||||
- name: Immich Smart Search
  rules:
    # Context (smart) search latency. The vchord clip_index must stay
    # resident in PG shared_buffers; if it decays out of cache an ANN
    # probe pays a ~1.8s cold storage read vs ~4ms warm. clip-index-prewarm
    # (immich ns, */5) pins it; immich-search-probe (*/5) measures it and
    # pushes these gauges to the Pushgateway.
    - alert: ImmichSmartSearchSlow
      expr: immich_smart_search_db_seconds{job="immich-search-probe"} > 1
      for: 15m
      labels:
        severity: warning
      annotations:
        summary: "Immich context search slow: {{ $value | printf \"%.2f\" }}s (>1s) — clip_index likely evicted; check clip-index-prewarm CronJob"
    - alert: ImmichClipIndexColdCache
      expr: immich_clip_index_cached_pct{job="immich-search-probe"} >= 0 and immich_clip_index_cached_pct{job="immich-search-probe"} < 50
      for: 15m
      labels:
        severity: warning
      annotations:
        summary: "Immich clip_index only {{ $value | printf \"%.0f\" }}% resident in PG shared_buffers — smart search will be slow (clip-index-prewarm may be failing)"
    - alert: ImmichSearchProbeStale
      expr: time() - immich_smart_search_probe_last_run_timestamp{job="immich-search-probe"} > 1800
      for: 10m
      labels:
        severity: warning
      annotations:
        summary: "Immich search probe has not reported in {{ $value | printf \"%.0f\" }}s — immich-search-probe CronJob may be broken"
|
||||
- name: Power
|
||||
rules:
|
||||
- alert: OnBattery