check_ha_integrations counted any config entry with state=not_loaded as a
problem, but HA marks intentionally-off entries that way too: disabled_by
set (user/integration disabled it) and source=="ignore" (a discovered
integration the user chose to ignore — never meant to load). On ha-sofia
2026-06-04 this false-WARNed on 6 entries that are all intentional —
wyoming faster-whisper/piper + ollama (disabled_by=user) and
mass_queue/dlna_dms(EMO-LAPTOP2)/yalexs_ble (source=ignore).
Skip disabled/ignored entries; only genuine setup_error/setup_retry/
not_loaded (without disabled/ignore) now flag. Verified: check #27 -> PASS
"All 96 integrations loaded".
check_uptime_kuma flagged a monitor as down whenever its last heartbeat
status != 1, and treated "no beats" as down too. But uptime-kuma status 2 =
PENDING (mid-retry) and 3 = MAINTENANCE are not outages, and no-beats = no
data. So a monitor caught in a momentary pending/retry state at check time
produced a false "internal/external down(N)" WARN — observed twice on
2026-06-04 (Novelapp, then ha-sofia) for monitors uptime-kuma itself logged
ZERO downs against over 24h (0/2880 and 0/288 beats).
Count a monitor as down ONLY on an explicit DOWN beat (status==0); pending,
maintenance, and no-data are not-down. Real outages still flag (uptime-kuma
persists status==0 beats for genuine downs).
Check #47 "Proxmox CSI — Ghost-Disk Drift": per node, compares the real
virtio-scsi CSI disk count in `qm config <vmid>` (SSH PVE) against the
attached proxmox-CSI VolumeAttachments k8s tracks. Catches orphaned "ghost"
disks left by failed detaches (query-pci QMP timeouts) that the scheduler's
28-LUN guard can't see — exactly the drift that wedged the MAM grabber on
node3 (13 tracked vs 23 real). PASS reconciled; WARN drift>0 or real 20-24;
FAIL real ≥25 (near the LUN cap). Already flagging node6 at 21 disks.
Single `qm list` + one `qm config` per VM keeps it ~3s (the naive
once-per-VM version timed out the parallel runner).
Also fixes a PRE-EXISTING set -e crash in #46 immich_search (introduced by
138894cd): `pct=$(kubectl exec … | tr -d ' ')` and the dur_ms probe were
unguarded, so with `set -o pipefail` a non-zero psql/exec propagated and
tripped `set -e`, killing the check before json_add. It silently dropped
from every parallel report and broke --serial entirely (whole run aborted).
Guarded both substitutions with `|| true`; the existing `=~` numeric checks
already handle the empty case. immich_search now reports PASS/WARN instead
of vanishing.
The NFS connectivity check fell through to `nc -z -G 3 192.168.1.127 2049`
when `showmount` is absent (the DevVM ships no nfs-common). But `-G` is a
macOS/Darwin-only connect-timeout flag — OpenBSD/GNU nc on Linux rejects it
with "invalid option -- 'G'", so the elif failed and the check reported
"NFS unreachable" on every Linux run even though port 2049 was wide open
(confirmed via /dev/tcp). All deployment/PVC/statefulset checks were green
throughout — a real PVE NFS outage would have taken down 30+ services.
Fix: use the portable `-w` timeout flag, and add a final bash /dev/tcp
fallback so the probe is correct even on hosts with neither showmount nor a
usable nc.
Context (smart) search latency was caused by the 665MB vchord clip_index
decaying out of PG shared_buffers (~33% resident -> ~1.8s cold ANN reads vs
~4ms warm), NOT by yesterday's ML MODEL_TTL/clip-keepalive change (CLIP textual
is warm ~15ms on GPU). The postStart prewarm runs once at pod start and
pg_prewarm.autoprewarm only re-warms at startup, so the index decays under job
buffer-pressure over days.
- clip-index-prewarm CronJob (immich, */5): pg_prewarm('clip_index') keeps the
whole index resident -> searches stay ~4ms.
- immich-search-probe CronJob (immich, */5): times a random-vector ANN query +
reads clip_index residency, pushes gauges to the Pushgateway.
- Prometheus alerts ImmichSmartSearchSlow / ImmichClipIndexColdCache /
ImmichSearchProbeStale (+ inhibition when the probe is stale).
- cluster_healthcheck.sh check #46 check_immich_search (TOTAL_CHECKS 45->46).
- Docs: infra CLAUDE.md immich note, monitoring.md, cluster-health skill.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Mirrors the verdict of emo's curated Барзини → Статус Lovelace view
(dashboard-barzini / path 'status', 8 sections, ~43 mushroom-template
cards). Pulls the dashboard config via the HA WebSocket API (one-shot,
shared cache), batch-renders every card's secondary Jinja template
against /api/template in a single POST, and classifies the rendered
text per card:
FAIL — contains "Offline" / "Disconnected" / "Разкачен" / "— No data"
WARN — contains "⚠️" / "Abnormal" / "Trouble (" / "(ниска)" /
"Пълен резервоар" / "Грешка" / "attention" / "Внимание"
Roll-up is a single check with a per-section breakdown
(Сигурност 0F/0W/4P; Мрежа 0F/1W/10P; …). On WARN/FAIL the non-quiet
non-JSON path lists each offending card with its rendered status line.
Verified live against ha-sofia: 2 offline devices (Пералня, Гардероб
спалня) and 1 degraded (NAS_Barzini volume attention, 7% free) surfaced
correctly in both human and JSON output.
Each check function only reads cluster state and mutates in-memory
counters; that makes it safe to isolate each one in a subshell, write
stdout to a per-check temp file, and replay outputs in original order
after all jobs finish. Counters/JSON_RESULTS replicated through marker
lines (###HCK###PASS:N etc.) so the aggregate state matches the serial
run exactly.
Pre-fetch the HA Sofia cache once in the parent so the four HA checks
share a single API round-trip instead of each subshell re-fetching.
Auto-fix mode forces --serial so mutation order stays deterministic.
New flags: --parallel N (default 12, env HEALTHCHECK_PARALLEL_JOBS),
--serial. Diminishing returns past ~12 workers.
Benchmark (--quiet, 44 checks): 53s serial -> 18s parallel-12.
Per Viktor: healthy baseline range is 55-65 C; anything above 65 C is a
signal a VM/workload is using too much CPU and warrants investigation.
Previous thresholds were calibrated to the hardware's TjMax (75/83 C) —
that was too lax, since cluster-load-driven elevation arrives a long
time before throttling. The 65 C cutoff matches the live Prometheus
baseline (Apr 20-May 8 2026: peak 61-69 C, avg 51-55 C) and the
session-observed correlation: above 65 C means the cluster is doing
sustained work that should be looked at, even if hardware is still
nowhere near its limit.
Updated:
PASS < 65 C (within 55-65 baseline)
WARN 65-82 C (elevated; check top kvm processes for the culprit)
FAIL >= 83 C (at/above TjMax — throttling imminent)
Verified live: 67 C now WARN (was PASS under the 75 C threshold).
Both new checks SSH read-only to the PVE host and emit PASS/WARN/FAIL
via the standard healthcheck output + JSON. They run alongside the
existing 42 checks and surface the same alerts the 2026-05-20/21
optimization session had to gather by hand.
#43 PVE Host Thermals — Xeon E5-2699v4 package + per-core temps
Reads every /sys/class/hwmon/hwmon0/temp*_input in one SSH round-trip.
Thresholds tuned to the live TjMax=83 / Tcrit=93:
PASS < 75 °C package
WARN 75-82 °C (approaching max, action time)
FAIL >= 83 °C (at/above TjMax, throttling imminent)
Reports hottest core label too so a single hot core doesn't hide in
the package average.
#44 PVE Host Load — load avg vs 44-thread capacity
Reads /proc/loadavg, compares 5-min to thread count (44):
PASS load_5 < 30 (< 70% threads busy)
WARN 30-37 (oversubscribed but not saturating)
FAIL >= 38 (~85%+ threads busy — scheduler saturation)
Uses 5-min so brief work spikes don't false-fail.
Both gracefully WARN-degrade if SSH BatchMode fails, matching the
existing check 36 (LVM PVC snapshots) pattern. TOTAL_CHECKS bumped
42 -> 44 and the dispatcher updated.
The Uptime Kuma check was hitting https://uptime.viktorbarzin.me, which
sits behind Authentik forward-auth. Authentik 302-redirects the Socket.IO
handshake the uptime-kuma-api library uses, and the library can't
complete the OAuth flow, so every healthcheck reported "Connection
failed" even though the pod was healthy and serving 225 monitors.
Fix: open a transient `kubectl port-forward` to svc/uptime-kuma in the
uptime-kuma namespace for the duration of the check, connect the
library to http://127.0.0.1:<port> (no auth gate), then SIGKILL the
port-forward on the way out. The disown is to suppress bash's "Killed"
job notification on stderr, which corrupted stdout when stderr was
merged for JSON parsing.
Verified end-to-end: healthcheck now reports the real signal —
"external down(3): www, xray-vless, hermes-agent" — the same 3
Cloudflare-facing endpoints flagging in the uptime-kuma logs.
Six tuning changes to cluster_healthcheck.sh so PASS sections actually
reflect "nothing to act on":
1. prometheus_alerts: only count severity=warning|critical. Info-level
alerts (RecentNodeReboot soak, PVAutoExpanding) are by design — the
alert rule itself sets severity; the script should respect it.
2. tls_certs: lower WARN threshold 30d → 14d. cnpg-webhook-cert
auto-rotates at 7d before expiry, kyverno tls pairs at 15d, the
Lets Encrypt wildcard renews weekly; <14d is the only window where
human attention is genuinely useful.
3. ha_entities: skip mobile_app/device_tracker/notify/button/scene/
event/image/update domains (transient by design), skip friendly
names containing iphone/ipad/macbook/tv/bravia/laptop/etc., and
only count entities whose last_changed > 24h. Was 431/1470,
most of which were "phone in standby" noise.
4. ha_automations: only flag DISABLED automations as abandoned if
they've also been untouched (last_changed) for >180 days; raise
stale threshold 30d → 180d. Was flagging seasonal/holiday-only
automations as broken.
5. problematic_pods + evicted_pods: exclude pods owned by Jobs.
CronJob retry leftovers (Error/Failed phase pods that K8s keeps
around for log inspection) aren't problematic at the cluster level.
6. uptime_kuma: retry the WebSocket login 3x with backoff. Single-
shot failures were a recurring false-positive even though the
service was healthy.
Also: nvidia-exporter ingress auth=required → auth=none. HA Sofia's
nvidia REST sensors (Tesla_T4_GPU_Temperature, Power_Usage, etc.) poll
/metrics and got 302'd to Authentik like the idrac/snmp ones did.
Same fix.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The previous default of $(pwd)/config required running the script from
the infra/ directory or always passing --kubeconfig. From a parent
shell or any other working directory, the lookup hit a non-existent
file and kubectl returned a stale-token error, masking real check
results.
Now: use $KUBECONFIG if set, then ~/.kube/config, then fall back to
$(pwd)/config for backwards compatibility.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- HA Sofia token: auto-bootstrap from Vault secret/viktor/haos_api_token when
HOME_ASSISTANT_SOFIA_{URL,TOKEN} env vars are unset. Default URL =
https://ha-sofia.viktorbarzin.me.
- cert-manager: add cert_manager_installed() probe (kubectl get crd
certificates.cert-manager.io). When not installed — which is our current
state — report PASS "N/A" instead of noisy WARN "CRDs unavailable".
- LVM snapshot freshness: grep pattern was `-- -snap` but actual LV names use
underscore (`foo_snap_YYYY...`), so the grep matched nothing and the check
always WARN'd. Fixed to `grep _snap`.
After fix: PASS 36→40, WARN 9→6, FAIL 1→1 (new ha_entities FAIL is a real
HA issue, not a script bug — 400/1401 sensors stale on ha-sofia).
- scripts/cluster_healthcheck.sh: add 12 new checks (cert-manager
readiness/expiry/requests, backup freshness per-DB/offsite/LVM,
monitoring prom+AM/vault-sealed/CSS, external reachability cloudflared
+authentik/ExternalAccessDivergence/traefik-5xx). Bump TOTAL_CHECKS
to 42, add --no-fix flag.
- Remove the duplicate pod-version .claude/cluster-health.sh (1728
lines) and the openclaw cluster_healthcheck CronJob (local CLI is
now the single authoritative runner). Keep the healthcheck SA +
Role + RoleBinding — still reused by task_processor CronJob.
- Remove SLACK_WEBHOOK_URL env from openclaw deployment and delete
the unused setup-monitoring.sh.
- Rewrite .claude/skills/cluster-health/SKILL.md: mandates running
the script first, refreshes the 42-check table, drops stale
CronJob/Slack/post-mortem sections, documents the monorepo-canonical
+ hardlink layout. File is hardlinked to
/home/wizard/code/.claude/skills/cluster-health/SKILL.md for
dual discovery.
- AGENTS.md + k8s-portal agent page: 25-check → 42-check.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Increase Uptime Kuma API timeout to 120s with wait_events=0.2
- Remove hardcoded password, use Vault or UPTIME_KUMA_PASSWORD env var
- Report internal and external monitor status separately
- Install uptime-kuma-api in local venv
[ci skip]
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Verifies snmp-exporter, idrac-redfish-exporter, proxmox-exporter, and
tuya-bridge pods are running, plus checks Prometheus scrape targets
(snmp-idrac, snmp-ups, redfish-idrac, proxmox-host) are UP.
- Healthcheck: add entity availability, integration health, automation
status, and system resources checks for Home Assistant Sofia
- Docs: add backup-dr architecture documentation
Check real CPU/memory usage via kubectl top nodes instead of
limits-vs-allocatable ratios. Thresholds: >80% WARN, >90% FAIL.
Limits overcommit is expected with 70+ services on 3 worker nodes.
- Reduce Kyverno LimitRange default limits ~4x across all tiers to fix
800-900% memory overcommitment on worker nodes
- Add cluster health check #25: per-node resource overcommitment
showing requests and limits vs allocatable capacity
- Add Kyverno policy for Goldilocks VPA mode by tier: tier-0 namespaces
get VPA Off mode (recommend only, no evictions) to prevent downtime
on critical infra (traefik, cloudflared, authentik, technitium, etc.)
- Non-tier-0 namespaces get VPA Auto mode for active right-sizing
- Add explicit resource limits to dashy (2Gi memory) to prevent OOMKilled
during webpack build on startup
- Rewrite DNS healthcheck to test from inside the Technitium pod via
kubectl exec, since MetalLB virtual IPs aren't reachable from outside
the L2 network
- Deleted orphaned kured/tls-secret (expired Oct 2025, module disabled,
not mounted by kured DaemonSet)
Adds check #14 that queries Uptime Kuma API for application-level
monitor status, complementing the kubectl-level checks with HTTP/ping
health data. Reports down monitors by name with PASS/WARN/FAIL thresholds.