cluster-health: add check #45 — HA Sofia Status Dashboard

Mirrors the verdict of emo's curated Барзини → Статус Lovelace view
(dashboard-barzini / path 'status', 8 sections, ~43 mushroom-template
cards). Pulls the dashboard config via the HA WebSocket API (one-shot,
shared cache), batch-renders every card's secondary Jinja template
against /api/template in a single POST, and classifies the rendered
text per card:

  FAIL — contains "Offline" / "Disconnected" / "Разкачен" / "— No data"
  WARN — contains "⚠️" / "Abnormal" / "Trouble (" / "(ниска)" /
         "Пълен резервоар" / "Грешка" / "attention" / "Внимание"

Roll-up is a single check with a per-section breakdown
(Сигурност 0F/0W/4P; Мрежа 0F/1W/10P; …). On WARN/FAIL the non-quiet
non-JSON path lists each offending card with its rendered status line.

Verified live against ha-sofia: 2 offline devices (Пералня, Гардероб
спалня) and 1 degraded (NAS_Barzini volume attention, 7% free) surfaced
correctly in both human and JSON output.
This commit is contained in:
Viktor Barzin 2026-06-03 11:56:41 +00:00
parent 98f29edf34
commit a2fa912b44
2 changed files with 233 additions and 7 deletions

View file

@ -7,9 +7,9 @@ description: |
(3) User asks to fix stuck pods, evicted pods, or CrashLoopBackOff,
(4) User mentions "health check", "cluster status", "cluster health",
(5) User asks "is everything running" or "any problems".
Runs 44 cluster-wide checks (nodes, workloads, monitoring, certs,
backups, external reachability, PVE host thermals + load) with safe
auto-fix for evicted pods.
Runs 45 cluster-wide checks (nodes, workloads, monitoring, certs,
backups, external reachability, PVE host thermals + load, HA Sofia
status dashboard) with safe auto-fix for evicted pods.
author: Claude Code
version: 2.0.0
date: 2026-04-19
@ -67,7 +67,7 @@ bash infra/scripts/cluster_healthcheck.sh --no-fix --quiet --json
bash infra/scripts/cluster_healthcheck.sh --kubeconfig /path/to/config
```
## What It Checks (44 checks)
## What It Checks (45 checks)
| # | Check | Notes |
|---|-------|-------|
@ -115,6 +115,7 @@ bash infra/scripts/cluster_healthcheck.sh --kubeconfig /path/to/config
| 42 | External — Traefik 5xx Rate (15m) | top-10 services emitting 5xx |
| 43 | PVE Host Thermals | package + per-core temps via `/sys/class/hwmon` (SSH). Baseline 55-65 °C. PASS <65 °C, WARN 65-82 °C (a VM is burning too much CPU), FAIL 83 °C (TjMax) |
| 44 | PVE Host Load | `/proc/loadavg` via SSH. PASS 5m <30, WARN 30-37, FAIL 38 of 44 threads |
| 45 | HA Sofia — Status Dashboard | emo's curated Барзини → Статус view (`dashboard-barzini` / path `status`). Pulls the lovelace config via WS, batch-renders every `custom:mushroom-template-card` secondary template against `/api/template`, classifies each rendered line: FAIL on `Offline` / `Disconnected` / `Разкачен` / `— No data`; WARN on `⚠️` / `Abnormal` / `Trouble (` / `(ниска)` / `Пълен резервоар` / `Грешка` / `attention` / `Внимание`. Verdict rolls up across the 8 sections (Сигурност, Мрежа & IT, Енергия, Климат, Уреди, Мултимедия, Осветление, Поливна) |
## Safe Auto-Fix Rules