cluster-health: add check #45 — HA Sofia Status Dashboard
Mirrors the verdict of emo's curated Барзини → Статус Lovelace view
(dashboard-barzini / path 'status', 8 sections, ~43 mushroom-template
cards). Pulls the dashboard config via the HA WebSocket API (one-shot,
shared cache), batch-renders every card's secondary Jinja template
against /api/template in a single POST, and classifies the rendered
text per card:
FAIL — contains "Offline" / "Disconnected" / "Разкачен" / "— No data"
WARN — contains "⚠️" / "Abnormal" / "Trouble (" / "(ниска)" /
"Пълен резервоар" / "Грешка" / "attention" / "Внимание"
Roll-up is a single check with a per-section breakdown
(Сигурност 0F/0W/4P; Мрежа 0F/1W/10P; …). On WARN/FAIL the non-quiet
non-JSON path lists each offending card with its rendered status line.
Verified live against ha-sofia: 2 offline devices (Пералня, Гардероб
спалня) and 1 degraded (NAS_Barzini volume attention, 7% free) surfaced
correctly in both human and JSON output.
This commit is contained in:
parent
98f29edf34
commit
a2fa912b44
2 changed files with 233 additions and 7 deletions
|
|
@ -7,9 +7,9 @@ description: |
|
|||
(3) User asks to fix stuck pods, evicted pods, or CrashLoopBackOff,
|
||||
(4) User mentions "health check", "cluster status", "cluster health",
|
||||
(5) User asks "is everything running" or "any problems".
|
||||
Runs 44 cluster-wide checks (nodes, workloads, monitoring, certs,
|
||||
backups, external reachability, PVE host thermals + load) with safe
|
||||
auto-fix for evicted pods.
|
||||
Runs 45 cluster-wide checks (nodes, workloads, monitoring, certs,
|
||||
backups, external reachability, PVE host thermals + load, HA Sofia
|
||||
status dashboard) with safe auto-fix for evicted pods.
|
||||
author: Claude Code
|
||||
version: 2.0.0
|
||||
date: 2026-04-19
|
||||
|
|
@ -67,7 +67,7 @@ bash infra/scripts/cluster_healthcheck.sh --no-fix --quiet --json
|
|||
bash infra/scripts/cluster_healthcheck.sh --kubeconfig /path/to/config
|
||||
```
|
||||
|
||||
## What It Checks (44 checks)
|
||||
## What It Checks (45 checks)
|
||||
|
||||
| # | Check | Notes |
|
||||
|---|-------|-------|
|
||||
|
|
@ -115,6 +115,7 @@ bash infra/scripts/cluster_healthcheck.sh --kubeconfig /path/to/config
|
|||
| 42 | External — Traefik 5xx Rate (15m) | top-10 services emitting 5xx |
|
||||
| 43 | PVE Host Thermals | package + per-core temps via `/sys/class/hwmon` (SSH). Baseline 55-65 °C. PASS <65 °C, WARN 65-82 °C (a VM is burning too much CPU), FAIL ≥83 °C (TjMax) |
|
||||
| 44 | PVE Host Load | `/proc/loadavg` via SSH. PASS 5m <30, WARN 30-37, FAIL ≥38 of 44 threads |
|
||||
| 45 | HA Sofia — Status Dashboard | emo's curated Барзини → Статус view (`dashboard-barzini` / path `status`). Pulls the lovelace config via WS, batch-renders every `custom:mushroom-template-card` secondary template against `/api/template`, classifies each rendered line: FAIL on `Offline` / `Disconnected` / `Разкачен` / `— No data`; WARN on `⚠️` / `Abnormal` / `Trouble (` / `(ниска)` / `Пълен резервоар` / `Грешка` / `attention` / `Внимание`. Verdict rolls up across the 8 sections (Сигурност, Мрежа & IT, Енергия, Климат, Уреди, Мултимедия, Осветление, Поливна) |
|
||||
|
||||
## Safe Auto-Fix Rules
|
||||
|
||||
|
|
|
|||
Loading…
Add table
Add a link
Reference in a new issue