immich: fix slow context search — prewarm clip_index + latency alert/healthcheck

Context (smart) search latency was caused by the 665MB vchord clip_index
decaying out of PG shared_buffers (~33% resident -> ~1.8s cold ANN reads vs
~4ms warm), NOT by yesterday's ML MODEL_TTL/clip-keepalive change (CLIP textual
is warm ~15ms on GPU). The postStart prewarm runs once at pod start and
pg_prewarm.autoprewarm only re-warms at startup, so the index decays under job
buffer-pressure over days.

- clip-index-prewarm CronJob (immich, */5): pg_prewarm('clip_index') keeps the
  whole index resident -> searches stay ~4ms.
- immich-search-probe CronJob (immich, */5): times a random-vector ANN query +
  reads clip_index residency, pushes gauges to the Pushgateway.
- Prometheus alerts ImmichSmartSearchSlow / ImmichClipIndexColdCache /
  ImmichSearchProbeStale (+ inhibition when the probe is stale).
- cluster_healthcheck.sh check #46 check_immich_search (TOTAL_CHECKS 45->46).
- Docs: infra CLAUDE.md immich note, monitoring.md, cluster-health skill.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
This commit is contained in:
Viktor Barzin 2026-06-03 21:00:43 +00:00
parent 38c77048fd
commit f201e4573e
6 changed files with 308 additions and 5 deletions

View file

@ -153,7 +153,7 @@ Repo IDs: infra=1, Website=2, finance=3, health=4, travel_blog=5, webhook-handle
| Service | Key Operational Knowledge |
|---------|--------------------------|
| Nextcloud | MaxRequestWorkers=150, needs 8Gi limit (Apache transient memory spikes, see commit eb94144), very generous startup probe |
| Immich | ML on SSD (CUDA), disable ModSecurity (breaks streaming), frequent upgrades. **`immich-machine-learning` MUST run with `MACHINE_LEARNING_MODEL_TTL > 0`** (set to `600` in `stacks/immich/main.tf`, env on the `immich-machine-learning` deployment). At `0`, no model ever unloads and onnxruntime's CUDA arena (OCR's dynamic input shapes inflate it to ~10 GB) is held forever on the **time-sliced T4 it shares with llama-swap/frigate/immich-server** — which has no VRAM isolation, so immich-ml starved llama-swap (qwen3-8b) and silently broke recruiter-responder triage for ~5 h on 2026-06-02 (post-mortem `docs/post-mortems/2026-06-02-immich-ml-ttl-gpu-oom-recruiter.md`). TTL>0 lets idle models (OCR, face — AND CLIP) free VRAM. The TTL is a single GLOBAL knob (no per-model pin), so CLIP would also unload after 600s idle; the `clip-keepalive` CronJob (`*/5 * * * *`, same stack) pings the CLIP textual encoder so smart-search stays warm without pinning the ad-hoc models. **Video transcoding is GPU-accelerated**: `immich-server` is pinned to GPU node1 (nodeSelector `nvidia.com/gpu.present` + NoSchedule toleration + `gpu-workload` priority) with a time-sliced `nvidia.com/gpu=1` slice — the stock immich-server image's ffmpeg already ships h264/hevc_nvenc + NVDEC. Activated via `ffmpeg.accel=nvenc` + `accelDecode=true` in the **DB** system-config (`system_metadata` table, key `system-config`, JSONB — NOT Terraform; app config is DB-managed here like oauth/smtp). Direct DB edits need a pod **recreate** to reload (config is cached at boot; only API-driven changes broadcast a reload). **Streaming bitrate is capped** to keep 4K playback smooth on the contended HDD and over remote uplinks: `ffmpeg.maxBitrate=20000k` + `preset=medium` + `transcode=bitrate` (set 2026-06-01 — was uncapped `maxBitrate=0` + `ultrafast` + `targetResolution=original`, which produced 77264 Mbps 4K transcodes that stuttered for every client, local and remote, since even a single stream needs ~1013.5 MB/s off the shared `sdc` spindle). 4K resolution is preserved (`targetResolution=original`); originals are NEVER modified — only the `encoded-video/` streaming copy. To re-apply transcode settings to EXISTING videos (config changes only affect new/missing ones): delete the offenders' `asset_file` rows `WHERE type='encoded_video'` (derived/regenerable — never touches originals) then run videoConversion `force=false` (admin Jobs API → "Missing"); it regenerates them to the deterministic `<assetId>.mp4` path at concurrency 1 (gentle on sdc). See `docs/runbooks/immich-transcode-bitrate.md`. If Immich is ever reinstalled fresh (not restored), re-set these keys (accel, accelDecode, **maxBitrate=20000k, preset=medium, transcode=bitrate**). Thumbnails/previews live on SSD NFS (sdb) — do NOT move to block storage (HDD sdc = slower + the contended IO domain). **Background-job concurrency is capped to protect sdc** (DB-managed system-config, `system_metadata` key `system-config`, JSONB `job.*.concurrency`; re-set on fresh install): `thumbnailGeneration=2`, `metadataExtraction=2`, `library=2` — these jobs read ORIGINALS off the HDD library. Left uncapped (were 8/4/4) a library-wide job (e.g. Duplicate Detection on 2026-06-01) fans the ML/thumbnail backfill out into a read storm that saturates sdc and starves etcd → apiserver down. `sidecar`/`smartSearch`/`faceDetection` stay at Immich defaults (small `.xmp` / SSD previews). Apply via Job Settings UI or the `system-config` API; **direct DB edits need an `immich-server` pod recreate to reload** (config cached at boot). See `docs/post-mortems/2026-05-25-immich-anca-elements-io-storm.md`. |
| Immich | ML on SSD (CUDA), disable ModSecurity (breaks streaming), frequent upgrades. **`immich-machine-learning` MUST run with `MACHINE_LEARNING_MODEL_TTL > 0`** (set to `600` in `stacks/immich/main.tf`, env on the `immich-machine-learning` deployment). At `0`, no model ever unloads and onnxruntime's CUDA arena (OCR's dynamic input shapes inflate it to ~10 GB) is held forever on the **time-sliced T4 it shares with llama-swap/frigate/immich-server** — which has no VRAM isolation, so immich-ml starved llama-swap (qwen3-8b) and silently broke recruiter-responder triage for ~5 h on 2026-06-02 (post-mortem `docs/post-mortems/2026-06-02-immich-ml-ttl-gpu-oom-recruiter.md`). TTL>0 lets idle models (OCR, face — AND CLIP) free VRAM. The TTL is a single GLOBAL knob (no per-model pin), so CLIP would also unload after 600s idle; the `clip-keepalive` CronJob (`*/5 * * * *`, same stack) pings the CLIP textual encoder so smart-search stays warm without pinning the ad-hoc models. **Smart search has a SECOND warmth layer in Postgres** (don't conflate it with the ML model): the ~665MB vchord `clip_index` must stay resident in PG `shared_buffers`, else an ANN probe that lands on an evicted list pays a ~1.8s cold storage read vs ~4ms warm. The `postStart` hook prewarms it ONCE at pod start and `pg_prewarm.autoprewarm` only re-warms at *startup*, so the index decays out of cache over days under job buffer-pressure (observed ~33% resident after 9d uptime → slow context search, easily misattributed to the ML model). The `clip-index-prewarm` CronJob (`*/5`, same stack) re-runs `pg_prewarm('clip_index')` to pin it hot; `immich-search-probe` (`*/5`) measures live latency + residency → Pushgateway gauges (`immich_smart_search_db_seconds`, `immich_clip_index_cached_pct`) → alerts `ImmichSmartSearchSlow`/`ImmichClipIndexColdCache`/`ImmichSearchProbeStale` + cluster-health check #46 (`check_immich_search`). immich PG role is a superuser so the CronJobs can run `pg_prewarm`/`pg_buffercache`. **Video transcoding is GPU-accelerated**: `immich-server` is pinned to GPU node1 (nodeSelector `nvidia.com/gpu.present` + NoSchedule toleration + `gpu-workload` priority) with a time-sliced `nvidia.com/gpu=1` slice — the stock immich-server image's ffmpeg already ships h264/hevc_nvenc + NVDEC. Activated via `ffmpeg.accel=nvenc` + `accelDecode=true` in the **DB** system-config (`system_metadata` table, key `system-config`, JSONB — NOT Terraform; app config is DB-managed here like oauth/smtp). Direct DB edits need a pod **recreate** to reload (config is cached at boot; only API-driven changes broadcast a reload). **Streaming bitrate is capped** to keep 4K playback smooth on the contended HDD and over remote uplinks: `ffmpeg.maxBitrate=20000k` + `preset=medium` + `transcode=bitrate` (set 2026-06-01 — was uncapped `maxBitrate=0` + `ultrafast` + `targetResolution=original`, which produced 77264 Mbps 4K transcodes that stuttered for every client, local and remote, since even a single stream needs ~1013.5 MB/s off the shared `sdc` spindle). 4K resolution is preserved (`targetResolution=original`); originals are NEVER modified — only the `encoded-video/` streaming copy. To re-apply transcode settings to EXISTING videos (config changes only affect new/missing ones): delete the offenders' `asset_file` rows `WHERE type='encoded_video'` (derived/regenerable — never touches originals) then run videoConversion `force=false` (admin Jobs API → "Missing"); it regenerates them to the deterministic `<assetId>.mp4` path at concurrency 1 (gentle on sdc). See `docs/runbooks/immich-transcode-bitrate.md`. If Immich is ever reinstalled fresh (not restored), re-set these keys (accel, accelDecode, **maxBitrate=20000k, preset=medium, transcode=bitrate**). Thumbnails/previews live on SSD NFS (sdb) — do NOT move to block storage (HDD sdc = slower + the contended IO domain). **Background-job concurrency is capped to protect sdc** (DB-managed system-config, `system_metadata` key `system-config`, JSONB `job.*.concurrency`; re-set on fresh install): `thumbnailGeneration=2`, `metadataExtraction=2`, `library=2` — these jobs read ORIGINALS off the HDD library. Left uncapped (were 8/4/4) a library-wide job (e.g. Duplicate Detection on 2026-06-01) fans the ML/thumbnail backfill out into a read storm that saturates sdc and starves etcd → apiserver down. `sidecar`/`smartSearch`/`faceDetection` stay at Immich defaults (small `.xmp` / SSD previews). Apply via Job Settings UI or the `system-config` API; **direct DB edits need an `immich-server` pod recreate to reload** (config cached at boot). See `docs/post-mortems/2026-05-25-immich-anca-elements-io-storm.md`. |
| CrowdSec | Pin version, disable Metabase when not needed (CPU hog), LAPI scaled to 3, **DB on PostgreSQL** (migrated from MySQL), flush config: max_items=10000/max_age=7d/agents_autodelete=30d, DECISION_DURATION=168h in blocklist CronJob |
| Frigate | GPU stall detection in liveness probe (inference speed check), high CPU |
| Authentik | 3 replicas, PgBouncer in front of PostgreSQL, strip auth headers before forwarding |
@ -166,7 +166,7 @@ Repo IDs: infra=1, Website=2, finance=3, health=4, travel_blog=5, webhook-handle
- Exclude completed CronJob pods from "pod not ready" alerts.
- Every new service gets Prometheus scrape config + Uptime Kuma monitor. External monitors auto-created for Cloudflare-proxied services by `external-monitor-sync` CronJob (10min, uptime-kuma ns). Mechanism: `ingress_factory` auto-adds `uptime.viktorbarzin.me/external-monitor=true` whenever `dns_type != "none"` (see `modules/kubernetes/ingress_factory/main.tf`) — no manual action needed on new services. The `cloudflare_proxied_names` list in `config.tfvars` is a legacy fallback for the 17 hostnames not yet migrated to `ingress_factory` `dns_type`; don't check that list when debugging "is this monitored?" questions.
- **External monitoring**: `[External] <service>` monitors in Uptime Kuma test full external path (DNS → Cloudflare → Tunnel → Traefik). Divergence metric `external_internal_divergence_count` → alert `ExternalAccessDivergence` (15min). Config: `stacks/uptime-kuma/`, targets from `cloudflare_proxied_names` in `config.tfvars` (17 remaining centrally-managed hostnames; most DNS records now auto-created by `ingress_factory` `dns_type` param).
- Key alerts: OOMKill, pod replica mismatch, 4xx/5xx error rates, UPS battery, CPU temp, SSD writes, NFS responsiveness, ClusterMemoryRequestsHigh (>85%), ContainerNearOOM (>85% limit), PodUnschedulable, ExternalAccessDivergence.
- Key alerts: OOMKill, pod replica mismatch, 4xx/5xx error rates, UPS battery, CPU temp, SSD writes, NFS responsiveness, ClusterMemoryRequestsHigh (>85%), ContainerNearOOM (>85% limit), PodUnschedulable, ExternalAccessDivergence, ImmichSmartSearchSlow (context-search latency / clip_index cache eviction).
- **E2E email monitoring**: CronJob `email-roundtrip-monitor` (every 20 min) sends test email via Brevo HTTP API to `smoke-test@viktorbarzin.me` (catch-all → `spam@`), verifies IMAP delivery, deletes test email, pushes metrics to Pushgateway + Uptime Kuma. Alerts: `EmailRoundtripFailing` (60m), `EmailRoundtripStale` (60m), `EmailRoundtripNeverRun` (60m). Outbound relay: Brevo EU (`smtp-relay.brevo.com:587`, 300/day free — migrated from Mailgun). Inbound external traffic enters via pfSense HAProxy on `10.0.20.1:{25,465,587,993}`, which forwards to k8s `mailserver-proxy` NodePort (30125-30128) with `send-proxy-v2`. Mailserver pod runs alt PROXY-speaking listeners (2525/4465/5587/10993) alongside stock PROXY-free ones (25/465/587/993) for intra-cluster clients. Real client IPs recovered from PROXY v2 header despite kube-proxy SNAT (replaces pre-2026-04-19 MetalLB `10.0.20.202` ETP:Local scheme; see bd code-yiu + `docs/runbooks/mailserver-pfsense-haproxy.md`). Vault: `brevo_api_key` in `secret/viktor` (probe + relay).
- **Authentik walling-off guard**: `blackbox-exporter` (monitoring ns, `stacks/monitoring/modules/monitoring/authentik_walloff_probe.tf`) probes each must-stay-public `auth = "none"` carve-out URL with `no_follow_redirects` and FAILS (`fail_if_header_matches` on `Location`) iff it 302s to Authentik. Catches a carve-out regressing (TF revert / deploy / `ingress_factory` `auth` default flipping back to `"required"`). Scrape job `blackbox-authentik-walloff` (1m) → alert `AuthentikWallingOffPublicPath` (`probe_failed_due_to_regex == 1`, for 10m, `lane=security``#security` Slack). **To guard a new carve-out: add one line to `local.authentik_walloff_targets`** (a `service → URL` map; `valid_status_codes` includes 301/302 so legit redirects/404s stay green — only the Authentik `Location` fails the probe). `curl -sI '<url>'` must NOT show a Location to `authentik.viktorbarzin.me` before adding.

View file

@ -7,7 +7,7 @@ description: |
(3) User asks to fix stuck pods, evicted pods, or CrashLoopBackOff,
(4) User mentions "health check", "cluster status", "cluster health",
(5) User asks "is everything running" or "any problems".
Runs 45 cluster-wide checks (nodes, workloads, monitoring, certs,
Runs 46 cluster-wide checks (nodes, workloads, monitoring, certs,
backups, external reachability, PVE host thermals + load, HA Sofia
status dashboard) with safe auto-fix for evicted pods.
author: Claude Code
@ -67,7 +67,7 @@ bash infra/scripts/cluster_healthcheck.sh --no-fix --quiet --json
bash infra/scripts/cluster_healthcheck.sh --kubeconfig /path/to/config
```
## What It Checks (45 checks)
## What It Checks (46 checks)
| # | Check | Notes |
|---|-------|-------|
@ -116,6 +116,7 @@ bash infra/scripts/cluster_healthcheck.sh --kubeconfig /path/to/config
| 43 | PVE Host Thermals | package + per-core temps via `/sys/class/hwmon` (SSH). Baseline 55-65 °C. PASS <65 °C, WARN 65-82 °C (a VM is burning too much CPU), FAIL 83 °C (TjMax) |
| 44 | PVE Host Load | `/proc/loadavg` via SSH. PASS 5m <30, WARN 30-37, FAIL 38 of 44 threads |
| 45 | HA Sofia — Status Dashboard | emo's curated Барзини → Статус view (`dashboard-barzini` / path `status`). Pulls the lovelace config via WS, batch-renders every `custom:mushroom-template-card` secondary template against `/api/template`, classifies each rendered line: FAIL on `Offline` / `Disconnected` / `Разкачен` / `— No data`; WARN on `⚠️` / `Abnormal` / `Trouble (` / `(ниска)` / `Пълен резервоар` / `Грешка` / `attention` / `Внимание`. Verdict rolls up across the 8 sections (Сигурност, Мрежа & IT, Енергия, Климат, Уреди, Мултимедия, Осветление, Поливна) |
| 46 | Immich Smart Search | Live context-search health. Measures a representative random-vector pgvector ANN query latency (in-pod, excludes exec overhead) + the `clip_index` residency in PG shared_buffers via `pg_buffercache`. PASS <0.5s & 90% resident; WARN 0.5-1.5s or 50-90% resident; FAIL >1.5s or <50% resident (index evicted from cache cold reads; check the `clip-index-prewarm` CronJob) |
## Safe Auto-Fix Rules

View file

@ -167,6 +167,13 @@ spec:
- **RegistryIntegrityProbeStale**: Probe hasn't reported in >1h (CronJob broken)
- **RegistryCatalogInaccessible**: Probe cannot fetch `/v2/_catalog` (auth failure or registry down)
#### Immich Smart Search Alerts
- **ImmichSmartSearchSlow**: Representative context-search ANN query >1s for 15m. Root cause is almost always the `clip_index` (vchord, ~665MB) decaying out of PG `shared_buffers` — a cold list read is ~1.8s vs ~4ms warm. Remediation: confirm the `clip-index-prewarm` CronJob (immich ns, `*/5`) is succeeding; manual fix `kubectl exec -n immich -c immich-postgresql <pg-pod> -- psql -U postgres -d immich -c "SELECT pg_prewarm('clip_index')"`.
- **ImmichClipIndexColdCache**: `clip_index` <50% resident in shared_buffers for 15m (leading indicator; same remediation).
- **ImmichSearchProbeStale**: `immich-search-probe` hasn't reported in >30m (CronJob broken). Inhibits the two above so frozen Pushgateway gauges don't false-fire.
The Immich smart-search monitoring uses two CronJobs in the `immich` namespace (both `*/5`): `clip-index-prewarm` re-runs `pg_prewarm('clip_index')` to keep the vector index hot during runtime (the `postStart` prewarm only fires at pod start; `pg_prewarm.autoprewarm` only reloads at startup, so the index otherwise decays under job buffer-pressure), and `immich-search-probe` (postgres init-container measures a random-vector ANN latency + `pg_buffercache` residency → curl sidecar pushes `immich_smart_search_db_seconds` / `immich_clip_index_cached_pct` / `immich_smart_search_probe_success` / `immich_smart_search_probe_last_run_timestamp` to the Pushgateway). Also surfaced by cluster-health check #46 (`check_immich_search`). Note this is the **Postgres** half of smart-search warmth; the **ML model** half is kept warm by the separate `clip-keepalive` CronJob.
The email monitoring system uses a CronJob (`email-roundtrip-monitor`, every 10 min) in the `mailserver` namespace that:
1. Sends a test email via Mailgun HTTP API to `smoke-test@viktorbarzin.me`
2. Email lands in the `spam@` catch-all mailbox via MX delivery

View file

@ -27,7 +27,7 @@ KUBECONFIG_PATH="${KUBECONFIG:-${HOME}/.kube/config}"
[[ -f "$KUBECONFIG_PATH" ]] || KUBECONFIG_PATH="$(pwd)/config"
KUBECTL=""
JSON_RESULTS=()
TOTAL_CHECKS=45
TOTAL_CHECKS=46
# Parallel execution settings. Each check function is self-contained — it
# only reads cluster state and mutates the in-memory counters / JSON_RESULTS
@ -2961,6 +2961,57 @@ PYEOF
fi
}
# --- 46. Immich Smart (Context) Search ---
# Smart search = ML embedding (kept warm by clip-keepalive) + a pgvector ANN
# query over the vchord clip_index. The index must stay resident in PG
# shared_buffers (kept warm by clip-index-prewarm); if it decays out of cache a
# query pays a ~1.8s cold storage read instead of ~4ms warm. We measure both
# the live ANN latency and the clip_index residency to catch the regression.
check_immich_search() {
section 46 "Immich Smart Search"
local pg pct dur_ms dur detail=""
pg=$($KUBECTL get pods -n immich --no-headers 2>/dev/null | awk '/^immich-postgresql-/ && $3=="Running"{print $1; exit}')
if [[ -z "$pg" ]]; then
warn "immich-postgresql pod not running — cannot probe smart search"
json_add "immich_search" "WARN" "immich-postgresql pod not running"
return 0
fi
# clip_index residency in shared_buffers (single-quoted SQL → pass as one arg)
pct=$($KUBECTL exec -n immich -c immich-postgresql "$pg" -- psql -U postgres -d immich -tAc \
"SELECT COALESCE(round(100.0*count(*)*8192/greatest(pg_relation_size('clip_index'::regclass),1),1),0) FROM pg_buffercache b JOIN pg_class c ON b.relfilenode=pg_relation_filenode(c.oid) WHERE c.relname='clip_index'" 2>/dev/null | tr -d ' ')
# Representative random-vector ANN latency, measured in-pod (excludes exec overhead)
dur_ms=$($KUBECTL exec -n immich -c immich-postgresql "$pg" -- bash -c \
's=$(date +%s%3N); psql -U postgres -d immich -tAc "SELECT count(*) FROM (SELECT \"assetId\" FROM smart_search ORDER BY embedding <=> (SELECT embedding FROM smart_search ORDER BY random() LIMIT 1) LIMIT 100) x" >/dev/null 2>&1; e=$(date +%s%3N); echo $((e-s))' 2>/dev/null | tr -d ' ')
if ! [[ "$dur_ms" =~ ^[0-9]+$ ]]; then
warn "Smart-search probe query failed (clip_index residency: ${pct:-?}%)"
json_add "immich_search" "WARN" "probe query failed; residency=${pct:-?}%"
return 0
fi
dur=$(awk "BEGIN{printf \"%.2f\", $dur_ms/1000}")
detail="latency=${dur}s clip_index_resident=${pct:-?}%"
if (( dur_ms > 1500 )); then
[[ "$QUIET" == true ]] && section_always 46 "Immich Smart Search"
fail "Smart search SLOW: $detail — clip_index likely evicted; check clip-index-prewarm CronJob"
json_add "immich_search" "FAIL" "$detail"
elif [[ "$pct" =~ ^[0-9.]+$ ]] && awk "BEGIN{exit !($pct < 50)}"; then
[[ "$QUIET" == true ]] && section_always 46 "Immich Smart Search"
fail "clip_index only ${pct}% resident in PG cache — searches cold ($detail)"
json_add "immich_search" "FAIL" "$detail"
elif (( dur_ms > 500 )) || { [[ "$pct" =~ ^[0-9.]+$ ]] && awk "BEGIN{exit !($pct < 90)}"; }; then
[[ "$QUIET" == true ]] && section_always 46 "Immich Smart Search"
warn "Smart search degraded: $detail"
json_add "immich_search" "WARN" "$detail"
else
pass "Smart search healthy: $detail"
json_add "immich_search" "PASS" "$detail"
fi
}
# --- Summary ---
print_summary() {
if [[ "$JSON" == true ]]; then
@ -3029,6 +3080,7 @@ main() {
check_monitoring_prom_am check_monitoring_vault check_monitoring_css
check_external_replicas check_external_divergence check_pve_thermals
check_pve_load check_external_traefik_5xx check_ha_status_dashboard
check_immich_search
)
# Auto-fix mutates cluster state inside individual checks — keep that

View file

@ -853,6 +853,215 @@ resource "kubernetes_cron_job_v1" "clip-keepalive" {
}
}
# Keeps the ~665MB vchord `clip_index` resident in PG shared_buffers.
# The immich-postgresql postStart hook prewarms it ONCE at pod start, but
# nothing re-warms it during runtime pg_prewarm.autoprewarm only reloads at
# *startup*. Under buffer pressure from thumbnail/OCR/library jobs the index
# slowly decays out of cache (observed ~33% resident after 9 days uptime). A
# smart-search ANN probe that lands on an evicted vchord list then pays a
# ~1.8s cold storage read instead of the ~4ms warm path. This job re-prewarms
# every 5 min, pinning the whole index hot. Parallel to clip-keepalive (which
# keeps the ML *model* warm); this keeps the *index* warm BOTH are needed for
# fast smart search. immich PG role is a superuser, so it can run pg_prewarm.
resource "kubernetes_cron_job_v1" "clip-index-prewarm" {
metadata {
name = "clip-index-prewarm"
namespace = kubernetes_namespace.immich.metadata[0].name
}
spec {
concurrency_policy = "Forbid"
failed_jobs_history_limit = 3
successful_jobs_history_limit = 1
schedule = "*/5 * * * *"
starting_deadline_seconds = 60
job_template {
metadata {}
spec {
backoff_limit = 1
active_deadline_seconds = 120
ttl_seconds_after_finished = 120
template {
metadata {}
spec {
restart_policy = "Never"
container {
name = "prewarm"
image = "ghcr.io/immich-app/postgres:15-vectorchord0.3.0-pgvectors0.2.0"
# command overrides the postgres entrypoint runs psql directly.
command = [
"psql", "-v", "ON_ERROR_STOP=1", "-c",
"SELECT pg_prewarm('clip_index'); SELECT pg_prewarm('smart_search');",
]
env {
name = "PGHOST"
value = "immich-postgresql.immich.svc.cluster.local"
}
env {
name = "PGUSER"
value = "immich"
}
env {
name = "PGDATABASE"
value = "immich"
}
env {
name = "PGCONNECT_TIMEOUT"
value = "10"
}
env {
name = "PGPASSWORD"
value_from {
secret_key_ref {
name = "immich-secrets"
key = "db_password"
}
}
}
resources {
requests = { cpu = "10m", memory = "32Mi" }
limits = { memory = "64Mi" }
}
}
}
}
}
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
ignore_changes = [spec[0].job_template[0].spec[0].template[0].spec[0].dns_config]
}
}
# Measures real context-search (smart-search) latency for alerting + the
# cluster-health script. Two stages in one pod: an init container (postgres
# image, has psql) times a representative random-vector ANN query and reads
# clip_index residency from pg_buffercache, writing Prometheus exposition text
# to a shared emptyDir; the main container (curl image) pushes it to the
# Pushgateway. Stock images only no apt/pip install at runtime (see the
# clip-keepalive note). A random probe vector each run samples different vchord
# lists, so the metric reflects true cache warmth rather than one hot list.
resource "kubernetes_cron_job_v1" "immich-search-probe" {
metadata {
name = "immich-search-probe"
namespace = kubernetes_namespace.immich.metadata[0].name
}
spec {
concurrency_policy = "Forbid"
failed_jobs_history_limit = 3
successful_jobs_history_limit = 1
schedule = "*/5 * * * *"
starting_deadline_seconds = 60
job_template {
metadata {}
spec {
backoff_limit = 1
active_deadline_seconds = 120
ttl_seconds_after_finished = 120
template {
metadata {}
spec {
restart_policy = "Never"
volume {
name = "shared"
empty_dir {}
}
init_container {
name = "measure"
image = "ghcr.io/immich-app/postgres:15-vectorchord0.3.0-pgvectors0.2.0"
command = ["/bin/bash", "-c", <<-EOT
set -uo pipefail
OUT=/shared/metrics.prom
success=1
start=$(date +%s%3N)
if ! psql -v ON_ERROR_STOP=1 -tA -c "SELECT count(*) FROM (SELECT \"assetId\" FROM smart_search ORDER BY embedding <=> (SELECT embedding FROM smart_search ORDER BY random() LIMIT 1) LIMIT 100) s" >/dev/null 2>/tmp/err; then
success=0
cat /tmp/err >&2
fi
end=$(date +%s%3N)
dur_ms=$((end - start))
dur=$(printf '%d.%03d' $((dur_ms/1000)) $((dur_ms%1000)))
pct=$(psql -tA -c "SELECT COALESCE(round(100.0*count(*)*8192/greatest(pg_relation_size('clip_index'::regclass),1),1),0) FROM pg_buffercache b JOIN pg_class c ON b.relfilenode=pg_relation_filenode(c.oid) WHERE c.relname='clip_index'" 2>/dev/null)
if [ -z "$pct" ]; then pct=-1; fi
{
echo "# HELP immich_smart_search_db_seconds Wall-clock latency of a representative smart-search ANN query."
echo "# TYPE immich_smart_search_db_seconds gauge"
echo "immich_smart_search_db_seconds $dur"
echo "# HELP immich_clip_index_cached_pct Percent of clip_index vchord index resident in PG shared_buffers."
echo "# TYPE immich_clip_index_cached_pct gauge"
echo "immich_clip_index_cached_pct $pct"
echo "# HELP immich_smart_search_probe_success 1 if the probe ANN query succeeded."
echo "# TYPE immich_smart_search_probe_success gauge"
echo "immich_smart_search_probe_success $success"
echo "# HELP immich_smart_search_probe_last_run_timestamp Unix time of last probe run."
echo "# TYPE immich_smart_search_probe_last_run_timestamp gauge"
echo "immich_smart_search_probe_last_run_timestamp $(date +%s)"
} > "$OUT"
echo "probe dur=$dur pct=$pct success=$success"
exit 0
EOT
]
env {
name = "PGHOST"
value = "immich-postgresql.immich.svc.cluster.local"
}
env {
name = "PGUSER"
value = "immich"
}
env {
name = "PGDATABASE"
value = "immich"
}
env {
name = "PGCONNECT_TIMEOUT"
value = "10"
}
env {
name = "PGPASSWORD"
value_from {
secret_key_ref {
name = "immich-secrets"
key = "db_password"
}
}
}
volume_mount {
name = "shared"
mount_path = "/shared"
}
resources {
requests = { cpu = "10m", memory = "32Mi" }
limits = { memory = "64Mi" }
}
}
container {
name = "push"
image = "docker.io/curlimages/curl:8.11.1"
command = [
"curl", "-sf", "-m", "20", "--data-binary", "@/shared/metrics.prom",
"http://prometheus-prometheus-pushgateway.monitoring:9091/metrics/job/immich-search-probe",
]
volume_mount {
name = "shared"
mount_path = "/shared"
}
resources {
requests = { cpu = "10m", memory = "16Mi" }
limits = { memory = "32Mi" }
}
}
}
}
}
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
ignore_changes = [spec[0].job_template[0].spec[0].template[0].spec[0].dns_config]
}
}
module "ingress-immich" {
source = "../../modules/kubernetes/ingress_factory"
# auth = "app": Immich has its own user auth + bearer-token API. Authentik

View file

@ -135,6 +135,12 @@ alertmanager:
- alertname = EmailRoundtripFailing
target_matchers:
- alertname = EmailRoundtripStale
# A stale search probe means its Pushgateway gauges are frozen — don't
# let the (now meaningless) latency/cache alerts fire off stale data.
- source_matchers:
- alertname = ImmichSearchProbeStale
target_matchers:
- alertname =~ "ImmichSmartSearchSlow|ImmichClipIndexColdCache"
# Power outage makes on-battery alert redundant
- source_matchers:
- alertname = PowerOutage
@ -854,6 +860,34 @@ serverFiles:
subsystem: gpu
annotations:
summary: "GPU node {{ $labels.node }} is cordoned — Frigate and GPU workloads cannot schedule"
- name: Immich Smart Search
rules:
# Context (smart) search latency. The vchord clip_index must stay
# resident in PG shared_buffers; if it decays out of cache an ANN
# probe pays a ~1.8s cold storage read vs ~4ms warm. clip-index-prewarm
# (immich ns, */5) pins it; immich-search-probe (*/5) measures it and
# pushes these gauges to the Pushgateway.
- alert: ImmichSmartSearchSlow
expr: immich_smart_search_db_seconds{job="immich-search-probe"} > 1
for: 15m
labels:
severity: warning
annotations:
summary: "Immich context search slow: {{ $value | printf \"%.2f\" }}s (>1s) — clip_index likely evicted; check clip-index-prewarm CronJob"
- alert: ImmichClipIndexColdCache
expr: immich_clip_index_cached_pct{job="immich-search-probe"} >= 0 and immich_clip_index_cached_pct{job="immich-search-probe"} < 50
for: 15m
labels:
severity: warning
annotations:
summary: "Immich clip_index only {{ $value | printf \"%.0f\" }}% resident in PG shared_buffers — smart search will be slow (clip-index-prewarm may be failing)"
- alert: ImmichSearchProbeStale
expr: time() - immich_smart_search_probe_last_run_timestamp{job="immich-search-probe"} > 1800
for: 10m
labels:
severity: warning
annotations:
summary: "Immich search probe has not reported in {{ $value | printf \"%.0f\" }}s — immich-search-probe CronJob may be broken"
- name: Power
rules:
- alert: OnBattery