cluster-health: add check #45 — HA Sofia Status Dashboard

Mirrors the verdict of emo's curated Барзини → Статус Lovelace view (dashboard-barzini / path 'status', 8 sections, ~43 mushroom-template cards). Pulls the dashboard config via the HA WebSocket API (one-shot, shared cache), batch-renders every card's secondary Jinja template against /api/template in a single POST, and classifies the rendered text per card: FAIL — contains "Offline" / "Disconnected" / "Разкачен" / "— No data" WARN — contains "⚠️" / "Abnormal" / "Trouble (" / "(ниска)" / "Пълен резервоар" / "Грешка" / "attention" / "Внимание" Roll-up is a single check with a per-section breakdown (Сигурност 0F/0W/4P; Мрежа 0F/1W/10P; …). On WARN/FAIL the non-quiet non-JSON path lists each offending card with its rendered status line. Verified live against ha-sofia: 2 offline devices (Пералня, Гардероб спалня) and 1 degraded (NAS_Barzini volume attention, 7% free) surfaced correctly in both human and JSON output.
2026-06-03 11:56:41 +00:00 · 2026-06-03 11:56:41 +00:00 · a2fa912b44
commit a2fa912b44
parent 98f29edf34
2 changed files with 233 additions and 7 deletions
--- a/.claude/skills/cluster-health/SKILL.md
+++ b/.claude/skills/cluster-health/SKILL.md
@ -7,9 +7,9 @@ description: |
  (3) User asks to fix stuck pods, evicted pods, or CrashLoopBackOff,
  (4) User mentions "health check", "cluster status", "cluster health",
  (5) User asks "is everything running" or "any problems".
-  Runs 44 cluster-wide checks (nodes, workloads, monitoring, certs,
-  backups, external reachability, PVE host thermals + load) with safe
-  auto-fix for evicted pods.
+  Runs 45 cluster-wide checks (nodes, workloads, monitoring, certs,
+  backups, external reachability, PVE host thermals + load, HA Sofia
+  status dashboard) with safe auto-fix for evicted pods.
 author: Claude Code
 version: 2.0.0
 date: 2026-04-19
@ -67,7 +67,7 @@ bash infra/scripts/cluster_healthcheck.sh --no-fix --quiet --json
 bash infra/scripts/cluster_healthcheck.sh --kubeconfig /path/to/config
 ```

-## What It Checks (44 checks)
+## What It Checks (45 checks)

 | # | Check | Notes |
 |---|-------|-------|
@ -115,6 +115,7 @@ bash infra/scripts/cluster_healthcheck.sh --kubeconfig /path/to/config
 | 42 | External — Traefik 5xx Rate (15m) | top-10 services emitting 5xx |
 | 43 | PVE Host Thermals | package + per-core temps via `/sys/class/hwmon` (SSH). Baseline 55-65 °C. PASS <65 °C, WARN 65-82 °C (a VM is burning too much CPU), FAIL ≥83 °C (TjMax) |
 | 44 | PVE Host Load | `/proc/loadavg` via SSH. PASS 5m <30, WARN 30-37, FAIL ≥38 of 44 threads |
+| 45 | HA Sofia — Status Dashboard | emo's curated Барзини → Статус view (`dashboard-barzini` / path `status`). Pulls the lovelace config via WS, batch-renders every `custom:mushroom-template-card` secondary template against `/api/template`, classifies each rendered line: FAIL on `Offline` / `Disconnected` / `Разкачен` / `— No data`; WARN on `⚠️` / `Abnormal` / `Trouble (` / `(ниска)` / `Пълен резервоар` / `Грешка` / `attention` / `Внимание`. Verdict rolls up across the 8 sections (Сигурност, Мрежа & IT, Енергия, Климат, Уреди, Мултимедия, Осветление, Поливна) |

 ## Safe Auto-Fix Rules