emo cares about ha-sofia + his Sofia smart-home devices (Tuya, the MPPT ATS, the Барзини → Статус dashboard), and only about the cluster when it's breaking those.

Rewrite his vendored cluster-health into an ha-sofia-focused, read-only variant:
- leads with ha-sofia's in-cluster dependency chain (tuya-bridge + the cloudflared/Traefik/DNS/TLS reachability path), all checkable read-only;
- fixes the script path to emo's own clone (/home/emo/code), since he can't read wizard's tree, and runs it --no-fix (he's cluster read-only);
- loads emo's own HA token (see below) so the ha-sofia checks (26-29, 45) actually run for him; documents the host-SSH/Vault checks that skip;
- triages: cluster FAIL/WARN matters only if on his chain; everything else is a one-line "admin's area"; escalate via /file-issue since he can't fix.

This snapshot copy is now an emo-specific variant, intentionally diverged from the canonical 47-check admin skill. README updated to say "do not re-sync from canonical".

Token: a dedicated long-lived HA token (client_name emo-cluster-health) was minted on ha-sofia via the admin account and stored emo-readable at /home/emo/.config/cluster-health/haos_token (600). It carries admin HA scope (HA only mints tokens for the authenticating account); independently revocable.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
| name | description | author | version | date |
|---|---|---|---|---|
| cluster-health | Personalized for emo. Check whether the homelab Kubernetes cluster is affecting ha-sofia or the Sofia smart-home devices it runs (Tuya devices, the MPPT ATS, lights, climate, security, irrigation). Use when: (1) "is ha-sofia ok", "are my devices / the ATS / the lights down", (2) "is the cluster affecting Sofia / my devices", (3) "check the cluster", "cluster health", "is everything running", (4) a device on the Барзини → Статус dashboard looks offline. Runs the cluster-wide healthcheck read-only and triages it by what ha-sofia actually depends on; the rest of the cluster is the admin's area. | Claude Code | 3.0.0-emo | 2026-06-26 |
Cluster Health — personalized for emo (ha-sofia focus)
What you actually care about
You care about ha-sofia and the Sofia smart-home devices it runs — the Tuya devices, the MPPT ATS, and the lights / climate / security / irrigation on your Барзини → Статус dashboard. The wider Kubernetes cluster matters to you only when it's breaking something ha-sofia or your devices depend on. Anything else is the admin's (wizard's) area — note it in one line and move on; don't chase it.
You have read-only cluster access. You can SEE everything but change nothing — so when something on your chain is broken, the job is to confirm it and hand it off, not to repair it.
How ha-sofia depends on the cluster
ha-sofia itself runs at the house (HAOS at https://ha-sofia.viktorbarzin.me), not in the cluster. It depends on the cluster through exactly two things:
- tuya-bridge (namespace tuya-bridge): the REST API ha-sofia calls for every Tuya device and the MPPT ATS. If it's unhealthy, your Tuya devices + ATS stop responding. This is the #1 thing to check.
- The path that carries ha-sofia ⇄ tuya-bridge and keeps ha-sofia reachable: cloudflared (tunnel) → Traefik (LB) → the ingress + TLS cert for tuya-bridge.viktorbarzin.me and ha-sofia.viktorbarzin.me, plus Technitium DNS. If any of these break, ha-sofia can't reach tuya-bridge and you can't reach ha-sofia remotely.
Everything else in the cluster is unrelated to you unless it's hosting one of those pods.
Step 1 — run the healthcheck (read-only, with your HA token)
Your account can't read Vault, so load your own ha-sofia token first (it was
minted for you and lives at ~/.config/cluster-health/haos_token). Then run
the script from YOUR clone, read-only:
cd /home/emo/code
export HOME_ASSISTANT_SOFIA_TOKEN="$(cat ~/.config/cluster-health/haos_token)"
bash scripts/cluster_healthcheck.sh --no-fix --quiet
# machine-readable instead:
# bash scripts/cluster_healthcheck.sh --no-fix --quiet --json | tee /tmp/cluster-health.json
- Never pass --fix: it deletes pods (a write); you're read-only and it will fail.
- Exit codes: 0 healthy, 1 warnings, 2 failures.
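The exit codes listed above can be turned into a readable label with a small wrapper. This is a minimal sketch: `describe_exit` is a hypothetical helper name and is not part of cluster_healthcheck.sh.

```shell
# Map the healthcheck's documented exit codes to a short label.
# describe_exit is a hypothetical helper, not part of the script itself.
describe_exit() {
  case "$1" in
    0) echo "healthy" ;;
    1) echo "warnings" ;;
    2) echo "failures" ;;
    *) echo "unexpected exit code $1" ;;
  esac
}

# Usage (from /home/emo/code, with the token exported):
#   bash scripts/cluster_healthcheck.sh --no-fix --quiet; describe_exit "$?"
```

Anything other than 0, 1 or 2 is reported as unexpected rather than guessed at, since only those three codes are documented here.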
With the token exported, the ha-sofia checks run for you: 26 Entity Availability · 27 Integration Health · 28 Automation Status · 29 System Resources · 45 Status Dashboard (your Барзини → Статус view), classifying every device tile as OK / ⚠️ / Offline across Сигурност (Security), Мрежа & IT (Network & IT), Енергия (Energy), Климат (Climate), Уреди (Appliances), Мултимедия (Multimedia), Осветление (Lighting), Поливна (Irrigation). Check 30 also covers the tuya exporter.
Step 2 — triage the output by relevance to YOU
Read the PASS/WARN/FAIL summary, then split the WARN/FAIL items in two:
- On your chain → this is what matters. Anything touching: tuya-bridge, cloudflared, traefik, DNS (check 21), the TLS cert / ingress for your two hosts (checks 12, 22, 31, 32), or a node hosting those pods, plus all the ha-sofia checks (26–29, 45) and the tuya exporter (30).
- Not on your chain → one line, then drop it. Summarise as "N unrelated cluster issues (admin's area)" and don't investigate.
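The split above can be sketched as a lookup on check number. `on_chain` is a hypothetical helper and only knows the check numbers named in this section. Items identified by name instead (tuya-bridge, cloudflared, traefik, or a node hosting those pods) still need a manual look.

```shell
# Classify a healthcheck check number by relevance to the ha-sofia chain.
# on_chain is a hypothetical helper; the numbers come from the triage list above.
on_chain() {
  case "$1" in
    12|21|22|31|32) echo "on chain: reachability path" ;;  # DNS, TLS cert, ingress
    26|27|28|29|45) echo "on chain: ha-sofia" ;;           # the ha-sofia checks
    30)             echo "on chain: tuya exporter" ;;
    *)              echo "admin's area" ;;                 # one line, then drop it
  esac
}
```

For example, `on_chain 21` reports the reachability path, while `on_chain 14` (Uptime Kuma) falls through to the admin's area.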
Step 3 — read-only checks for your chain
All of these work with your read-only access:
# tuya-bridge — your devices + the ATS
kubectl get pods -n tuya-bridge
kubectl rollout status deploy/tuya-bridge -n tuya-bridge
kubectl logs -n tuya-bridge deploy/tuya-bridge --tail=50
# the reachability path ha-sofia uses
kubectl get pods -n cloudflared
kubectl get pods -n traefik
kubectl get ingress -A | grep -Ei 'tuya-bridge|ha-sofia'
# whole external path in one shot (DNS + tunnel + Traefik + cert):
curl -sI --max-time 10 https://tuya-bridge.viktorbarzin.me | head -1
# reachable -> HTTP/2 200 / 401 / 403 (any HTTP response = path is up)
# broken -> curl: timeout / could not resolve host
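The rule in the comments above (any HTTP response means the path is up) can be expressed as a small function over the first line of the curl output. This is a sketch, and `path_status` is a hypothetical helper name.

```shell
# Interpret the first line of `curl -sI` output:
# any HTTP status line means DNS + tunnel + Traefik + cert all answered.
# path_status is a hypothetical helper name.
path_status() {
  case "$1" in
    HTTP/*) echo "reachable" ;;
    *)      echo "broken" ;;   # empty output: timeout or could not resolve host
  esac
}

# Usage:
#   path_status "$(curl -sI --max-time 10 https://tuya-bridge.viktorbarzin.me | head -1)"
```

Because `-s` silences curl's own error messages, a timeout or DNS failure yields empty output, which the function reports as broken.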
The fastest device-level signal is your own dashboard: open https://ha-sofia.viktorbarzin.me → Барзини → Статус. If devices show Offline / Разкачен / ⚠️ but tuya-bridge is healthy, the problem is at the house (device power / Wi-Fi / the Sofia TP-Link network) — not the cluster.
Step 4 — if something on your chain is broken
You can't fix the cluster (read-only), so capture + hand off:
kubectl describe pod -n tuya-bridge <pod>
kubectl logs -n tuya-bridge <pod> --previous --tail=200
Then file it for the admin with the /file-issue skill — e.g. "ha-sofia
Tuya devices + ATS unresponsive; tuya-bridge pod CrashLooping" with the output
above. cloudflared / Traefik / DNS outages are cluster-wide — the admin's
alerting is already firing, but file it so it's tracked from your side too.
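To make the hand-off easier, the two capture commands can write into one file that you attach to the issue. This is a minimal sketch: `capture_pod`, the `KUBECTL` override, and the output path are all hypothetical and not part of the skill or of /file-issue.

```shell
# Collect describe + previous logs for one pod into a single file.
# capture_pod and the KUBECTL override are hypothetical; KUBECTL defaults to kubectl.
capture_pod() {
  local ns="$1" pod="$2" out="$3"
  local kubectl="${KUBECTL:-kubectl}"
  {
    echo "== describe pod $ns/$pod =="
    "$kubectl" describe pod -n "$ns" "$pod" 2>&1
    echo "== previous logs (last 200 lines) =="
    # If the container has not restarted, kubectl reports an error here;
    # it is captured in the file rather than lost.
    "$kubectl" logs -n "$ns" "$pod" --previous --tail=200 2>&1
  } > "$out"
}

# Usage:
#   capture_pod tuya-bridge <pod> /tmp/tuya-bridge-capture.txt
```

Both commands are reads, so this stays within your read-only access.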
What will skip for you (expected — not failures)
A few checks need access your account doesn't have. They warn/skip — that's normal, and none of them are on your ha-sofia chain:
- Uptime Kuma (14) — needs an admin password from Vault.
- PVE host checks — 36 (LVM snapshots), 43 (host thermals), 44 (host load), and the Proxmox CSI ghost-disk check — all need root SSH to the Proxmox host.
- --fix: pod deletion (a write); not available to you.
(The ha-sofia checks are not in this list — your token makes them work.)
Your ha-sofia token
- Stored at ~/.config/cluster-health/haos_token (yours, mode 600).
- It's a dedicated long-lived token, named emo-cluster-health under ha-sofia → your profile → Long-Lived Access Tokens. Revoking it there affects only you.
- It currently carries admin-level HA scope (Home Assistant only lets a token be minted for the account that created it, and it was minted via the admin account). If it ever stops working, tell wizard and a fresh one can be minted.
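Before running the healthcheck, it can help to confirm the token file is present and still mode 600. This is a sketch: `check_token_file` is a hypothetical helper, and `stat -c` assumes GNU coreutils (Linux).

```shell
# Check that a token file exists, is non-empty, and has mode 600.
# check_token_file is a hypothetical helper; `stat -c` assumes GNU coreutils.
check_token_file() {
  local f="$1"
  [ -s "$f" ] || { echo "missing or empty"; return 1; }
  [ "$(stat -c '%a' "$f")" = "600" ] || { echo "wrong mode"; return 1; }
  echo "ok"
}

# Usage:
#   check_token_file ~/.config/cluster-health/haos_token
```

The check only inspects the file on disk. It does not tell you whether Home Assistant still accepts the token.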