cluster-health: add #47 ghost-disk drift check; fix immich_search set -e crash

Check #47 "Proxmox CSI — Ghost-Disk Drift": per node, compares the real
virtio-scsi CSI disk count in `qm config <vmid>` (SSH PVE) against the
attached proxmox-CSI VolumeAttachments k8s tracks. Catches orphaned "ghost"
disks left by failed detaches (query-pci QMP timeouts) that the scheduler's
28-LUN guard can't see — exactly the drift that wedged the MAM grabber on
node3 (13 tracked vs 23 real). PASS reconciled; WARN drift>0 or real 20-24;
FAIL real ≥25 (near the LUN cap). Already flagging node6 at 21 disks.
Single `qm list` + one `qm config` per VM keeps it ~3s (the naive
once-per-VM version timed out the parallel runner).

Also fixes a PRE-EXISTING set -e crash in #46 immich_search (introduced by
138894cd): `pct=$(kubectl exec … | tr -d ' ')` and the dur_ms probe were
unguarded, so with `set -o pipefail` a non-zero psql/exec propagated and
tripped `set -e`, killing the check before json_add. It silently dropped
from every parallel report and broke --serial entirely (whole run aborted).
Guarded both substitutions with `|| true`; the existing `=~` numeric checks
already handle the empty case. immich_search now reports PASS/WARN instead
of vanishing.
This commit is contained in:
Viktor Barzin 2026-06-04 04:06:12 +00:00
parent ea1e4f793b
commit b64d8d6168
2 changed files with 122 additions and 9 deletions

View file

@ -7,9 +7,10 @@ description: |
(3) User asks to fix stuck pods, evicted pods, or CrashLoopBackOff,
(4) User mentions "health check", "cluster status", "cluster health",
(5) User asks "is everything running" or "any problems".
Runs 46 cluster-wide checks (nodes, workloads, monitoring, certs,
Runs 47 cluster-wide checks (nodes, workloads, monitoring, certs,
backups, external reachability, PVE host thermals + load, HA Sofia
status dashboard) with safe auto-fix for evicted pods.
status dashboard, Immich smart-search, Proxmox CSI ghost-disk drift)
with safe auto-fix for evicted pods.
author: Claude Code
version: 2.0.0
date: 2026-04-19
@ -67,7 +68,7 @@ bash infra/scripts/cluster_healthcheck.sh --no-fix --quiet --json
bash infra/scripts/cluster_healthcheck.sh --kubeconfig /path/to/config
```
## What It Checks (46 checks)
## What It Checks (47 checks)
| # | Check | Notes |
|---|-------|-------|
@ -116,7 +117,8 @@ bash infra/scripts/cluster_healthcheck.sh --kubeconfig /path/to/config
| 43 | PVE Host Thermals | package + per-core temps via `/sys/class/hwmon` (SSH). Baseline 55-65 °C. PASS <65 °C, WARN 65-82 °C (a VM is burning too much CPU), FAIL 83 °C (TjMax) |
| 44 | PVE Host Load | `/proc/loadavg` via SSH. PASS 5m <30, WARN 30-37, FAIL 38 of 44 threads |
| 45 | HA Sofia — Status Dashboard | emo's curated Барзини → Статус view (`dashboard-barzini` / path `status`). Pulls the lovelace config via WS, batch-renders every `custom:mushroom-template-card` secondary template against `/api/template`, classifies each rendered line: FAIL on `Offline` / `Disconnected` / `Разкачен` / `— No data`; WARN on `⚠️` / `Abnormal` / `Trouble (` / `(ниска)` / `Пълен резервоар` / `Грешка` / `attention` / `Внимание`. Verdict rolls up across the 8 sections (Сигурност, Мрежа & IT, Енергия, Климат, Уреди, Мултимедия, Осветление, Поливна) |
| 46 | Immich Smart Search | Live context-search health. Measures a representative random-vector pgvector ANN query latency (in-pod, excludes exec overhead) + the `clip_index` residency in PG shared_buffers via `pg_buffercache`. PASS <0.5s & 90% resident; WARN 0.5-1.5s or 50-90% resident; FAIL >1.5s or <50% resident (index evicted from cache cold reads; check the `clip-index-prewarm` CronJob) |
| 46 | Immich Smart Search | `clip_index` residency in PG `shared_buffers` + representative ANN probe latency (in immich-postgresql). FAIL >1.5s or <50% resident; WARN >0.5s or <90% resident. Cold cache check `clip-index-prewarm` CronJob |
| 47 | Proxmox CSI — Ghost-Disk Drift | Per node, compares real virtio-scsi CSI disks in `qm config <vmid>` (SSH PVE) vs attached proxmox-CSI VolumeAttachments k8s tracks. Catches orphaned "ghost" disks left by failed detaches (`query-pci` QMP timeouts) that the scheduler's 28-LUN guard can't see. PASS reconciled; WARN drift>0 or real 20-24; FAIL real ≥25 (near LUN cap → imminent wedge). Cleanup: detach ghosts via `qm set <vmid> --delete scsiN` (frees slot, retains LV) |
## Safe Auto-Fix Rules