infra/stacks/monitoring/modules/monitoring
Viktor Barzin f201e4573e immich: fix slow context search — prewarm clip_index + latency alert/healthcheck
Context (smart) search latency was caused by the 665MB vchord clip_index
being evicted from PG shared_buffers (~33% resident -> ~1.8s cold ANN reads vs
~4ms warm), NOT by yesterday's ML MODEL_TTL/clip-keepalive change (the CLIP
textual model is warm, ~15ms on GPU). The postStart prewarm runs only once, at
pod start, and pg_prewarm.autoprewarm only re-warms at startup, so the index
is gradually evicted over days under buffer pressure from jobs.
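
To make the residency figure concrete, here is a minimal sketch of the arithmetic, not the probe's actual code. It assumes the default 8 KiB PostgreSQL block size, treats "665MB" as MiB, and assumes the count of cached buffers comes from something like the pg_buffercache view; the function name and the example buffer count are hypothetical.

```python
import math

BLOCK_SIZE = 8192  # assumption: PostgreSQL default block size (8 KiB)


def resident_fraction(cached_buffers, relation_bytes, block_size=BLOCK_SIZE):
    """Fraction of a relation's blocks currently held in shared_buffers.

    cached_buffers is assumed to be a count of buffers belonging to the
    relation (for example, rows counted in pg_buffercache for the index).
    """
    total_blocks = math.ceil(relation_bytes / block_size)
    if total_blocks == 0:
        return 0.0
    return min(1.0, cached_buffers / total_blocks)


# 665 MiB index -> 665 * 128 = 85120 blocks of 8 KiB.
index_bytes = 665 * 1024 * 1024
total_blocks = math.ceil(index_bytes / BLOCK_SIZE)

# Hypothetical buffer count chosen to match the ~33% residency seen above.
fraction = resident_fraction(28090, index_bytes)
```

At ~33% residency roughly two thirds of the index pages have to come from disk on a query, which is consistent with the cold vs warm gap described above.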

- clip-index-prewarm CronJob (immich, */5): pg_prewarm('clip_index') keeps the
  whole index resident -> searches stay ~4ms.
- immich-search-probe CronJob (immich, */5): times a random-vector ANN query +
  reads clip_index residency, pushes gauges to the Pushgateway.
- Prometheus alerts ImmichSmartSearchSlow / ImmichClipIndexColdCache /
  ImmichSearchProbeStale (+ inhibition when the probe is stale).
- cluster_healthcheck.sh check #46 check_immich_search (TOTAL_CHECKS 45->46).
- Docs: infra CLAUDE.md immich note, monitoring.md, cluster-health skill.
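
The probe's output side can be sketched as follows. This is an illustration under stated assumptions, not the CronJob's real script: the metric names, the vector dimension, and both helper names are hypothetical. It only builds the random query vector literal and the text-format payload that a probe could send to the Pushgateway; the database query and the HTTP push are left out.

```python
import random
import time


def vector_literal(dim, rng=random):
    """Format a random query vector as a '[x,y,...]' text literal."""
    return "[" + ",".join(f"{rng.uniform(-1.0, 1.0):.6f}" for _ in range(dim)) + "]"


def render_gauges(latency_seconds, resident_fraction, now=None):
    """Render probe results in the Prometheus text exposition format.

    All metric names here are hypothetical. The timestamp gauge is what a
    staleness alert could compare against time().
    """
    if now is None:
        now = time.time()
    lines = [
        "# TYPE immich_search_probe_latency_seconds gauge",
        f"immich_search_probe_latency_seconds {latency_seconds:.6f}",
        "# TYPE immich_clip_index_resident_ratio gauge",
        f"immich_clip_index_resident_ratio {resident_fraction:.4f}",
        "# TYPE immich_search_probe_last_run_timestamp_seconds gauge",
        f"immich_search_probe_last_run_timestamp_seconds {int(now)}",
    ]
    # The exposition format expects a trailing newline.
    return "\n".join(lines) + "\n"


payload = render_gauges(0.004, 0.33, now=1700000000)
```

A payload like this would be sent to the Pushgateway under /metrics/job/<job>; pushing a last-run timestamp alongside the gauges lets a stale probe be told apart from a healthy one that simply reports low latency.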

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-05 09:19:07 +00:00
Name | Last commit | Date
dashboards | job-hunter dashboard: role panels now respect the $location filter | 2026-06-02 23:35:25 +00:00
server-power-cycle | Add broker-sync Terraform stack (#7) | 2026-04-17 21:17:45 +01:00
alloy.yaml | alloy: move resources to alloy.* (chart key bug); 1Gi limit fixes IO storm | 2026-05-26 02:08:35 +00:00
authentik_walloff_probe.tf | Reapply "tripit: Gmail ingest (12-month) + vbarzin owner + plans@ forward-to-parse" | 2026-06-03 10:24:25 +00:00
Dockerfile | extract monitoring, nvidia, mailserver, cloudflared, kyverno from platform [ci skip] | 2026-03-17 21:34:11 +00:00
goflow2.tf | monitoring: KEEL/tier ignore_changes on 5 exporters [ci skip] | 2026-05-31 15:33:30 +00:00
grafana.tf | fire-planner: COL refresh CronJob + Grafana Cost-of-Living dashboard | 2026-05-22 14:15:38 +00:00
grafana_chart_values.yaml | monitoring: protect grafana ingress with authentik + disable anonymous | 2026-05-10 17:01:50 +00:00
idrac.tf | monitoring: KEEL/tier ignore_changes on 5 exporters [ci skip] | 2026-05-31 15:33:30 +00:00
k8s-monitoring-values.yaml | cleanup: remove calibre and audiobookshelf stacks after ebooks migration [ci skip] | 2026-03-25 23:56:07 +02:00
loki.tf | monitoring: KEEL/tier ignore_changes on 5 exporters [ci skip] | 2026-05-31 15:33:30 +00:00
loki.yaml | monitoring/loki: bump memory request 2Gi → 3Gi (close gap to 4Gi limit) | 2026-05-24 01:10:55 +00:00
main.tf | cluster-health: emergency-stop Keel + roll back image downgrades + quota raises | 2026-05-26 18:48:50 +00:00
prometheus.tf | Reapply "tripit: Gmail ingest (12-month) + vbarzin owner + plans@ forward-to-parse" | 2026-06-03 10:24:25 +00:00
prometheus_chart_values.tpl | immich: fix slow context search — prewarm clip_index + latency alert/healthcheck | 2026-06-05 09:19:07 +00:00
prometheus_snmp_chart_values.yaml | extract monitoring, nvidia, mailserver, cloudflared, kyverno from platform [ci skip] | 2026-03-17 21:34:11 +00:00
pve_exporter.tf | monitoring: KEEL/tier ignore_changes on 5 exporters [ci skip] | 2026-05-31 15:33:30 +00:00
snmp_exporter.tf | monitoring: KEEL/tier ignore_changes on 5 exporters [ci skip] | 2026-05-31 15:33:30 +00:00
ups_snmp_values.yaml | extract monitoring, nvidia, mailserver, cloudflared, kyverno from platform [ci skip] | 2026-03-17 21:34:11 +00:00