ci/woodpecker/push/default Pipeline was successful

Details

devvm: personalize emo's cluster-health skill for ha-sofia

emo cares about ha-sofia + his Sofia smart-home devices (Tuya, the MPPT
ATS, the Барзини → Статус dashboard), and only about the cluster when it's
breaking those. Rewrite his vendored cluster-health into an ha-sofia-focused,
read-only variant:
- leads with ha-sofia's in-cluster dependency chain (tuya-bridge + the
  cloudflared/Traefik/DNS/TLS reachability path), all checkable read-only;
- fixes the script path to emo's own clone (/home/emo/code) — he can't read
  wizard's tree — and runs it --no-fix (he's cluster read-only);
- loads emo's own HA token (see below) so the ha-sofia checks (26-29, 45)
  actually run for him; documents the host-SSH/Vault checks that skip;
- triages: cluster FAIL/WARN matters only if on his chain; everything else is
  a one-line "admin's area"; escalate via /file-issue since he can't fix.

This snapshot copy is now an emo-specific variant, intentionally diverged
from the canonical 47-check admin skill — README updated to say "do not
re-sync from canonical".

Token: a dedicated long-lived HA token (client_name emo-cluster-health) was
minted on ha-sofia via the admin account and stored emo-readable at
/home/emo/.config/cluster-health/haos_token (600). It carries admin HA scope
(HA only mints tokens for the authenticating account); independently revocable.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

2026-06-26 16:03:14 +00:00

6.5 KiB

Raw Blame History

name	description	author	version	date
cluster-health	Personalized for emo. Check whether the homelab Kubernetes cluster is affecting ha-sofia or the Sofia smart-home devices it runs (Tuya devices, the MPPT ATS, lights, climate, security, irrigation). Use when: (1) "is ha-sofia ok", "are my devices / the ATS / the lights down", (2) "is the cluster affecting Sofia / my devices", (3) "check the cluster", "cluster health", "is everything running", (4) a device on the Барзини → Статус dashboard looks offline. Runs the cluster-wide healthcheck read-only and triages it by what ha-sofia actually depends on; the rest of the cluster is the admin's area.	Claude Code	3.0.0-emo	2026-06-26

Cluster Health — personalized for emo (ha-sofia focus)

What you actually care about

You care about ha-sofia and the Sofia smart-home devices it runs — the Tuya devices, the MPPT ATS, and the lights / climate / security / irrigation on your Барзини → Статус dashboard. The wider Kubernetes cluster matters to you only when it's breaking something ha-sofia or your devices depend on. Anything else is the admin's (wizard's) area — note it in one line and move on; don't chase it.

You have read-only cluster access. You can SEE everything but change nothing — so when something on your chain is broken, the job is to confirm it and hand it off, not to repair it.

How ha-sofia depends on the cluster

ha-sofia itself runs at the house (HAOS at https://ha-sofia.viktorbarzin.me) — not in the cluster. The cluster reaches it through exactly two things:

tuya-bridge (namespace tuya-bridge) — the REST API ha-sofia calls for every Tuya device and the MPPT ATS. If it's unhealthy, your Tuya devices
- ATS stop responding. This is the #1 thing to check.
The path that carries ha-sofia ⇄ tuya-bridge and keeps ha-sofia reachable: cloudflared (tunnel) → Traefik (LB) → the ingress + TLS cert for tuya-bridge.viktorbarzin.me and ha-sofia.viktorbarzin.me, plus Technitium DNS. If any of these break, ha-sofia can't reach tuya-bridge and you can't reach ha-sofia remotely.

Everything else in the cluster is unrelated to you unless it's hosting one of those pods.

Step 1 — run the healthcheck (read-only, with your HA token)

Your account can't read Vault, so load your own ha-sofia token first (it was minted for you and lives at ~/.config/cluster-health/haos_token). Then run the script from YOUR clone, read-only:

cd /home/emo/code
export HOME_ASSISTANT_SOFIA_TOKEN="$(cat ~/.config/cluster-health/haos_token)"
bash scripts/cluster_healthcheck.sh --no-fix --quiet
# machine-readable instead:
# bash scripts/cluster_healthcheck.sh --no-fix --quiet --json | tee /tmp/cluster-health.json

Never pass --fix — it deletes pods (a write); you're read-only and it will fail.
Exit codes: 0 healthy, 1 warnings, 2 failures.

With the token exported, the ha-sofia checks run for you: 26 Entity Availability · 27 Integration Health · 28 Automation Status · 29 System Resources · 45 Status Dashboard — your Барзини → Статус view, classifying every device tile as OK / ⚠️ / Offline across Сигурност, Мрежа & IT, Енергия, Климат, Уреди, Мултимедия, Осветление, Поливна. Check 30 also covers the tuya exporter.

Step 2 — triage the output by relevance to YOU

Read the PASS/WARN/FAIL summary, then split the WARN/FAIL items in two:

On your chain → this is what matters. Anything touching: tuya-bridge, cloudflared, traefik, DNS (check 21), the TLS cert / ingress for your two hosts (checks 12, 22, 31, 32), or a node hosting those pods — plus all the ha-sofia checks (26–29, 45) and the tuya exporter (30).
Not on your chain → one line, then drop it. Summarise as "N unrelated cluster issues (admin's area)" and don't investigate.

Step 3 — read-only checks for your chain

All of these work with your read-only access:

# tuya-bridge — your devices + the ATS
kubectl get pods -n tuya-bridge
kubectl rollout status deploy/tuya-bridge -n tuya-bridge
kubectl logs -n tuya-bridge deploy/tuya-bridge --tail=50

# the reachability path ha-sofia uses
kubectl get pods -n cloudflared
kubectl get pods -n traefik
kubectl get ingress -A | grep -Ei 'tuya-bridge|ha-sofia'

# whole external path in one shot (DNS + tunnel + Traefik + cert):
curl -sI --max-time 10 https://tuya-bridge.viktorbarzin.me | head -1
#   reachable  -> HTTP/2 200 / 401 / 403  (any HTTP response = path is up)
#   broken     -> curl: timeout / could not resolve host

The fastest device-level signal is your own dashboard: open https://ha-sofia.viktorbarzin.me → Барзини → Статус. If devices show Offline / Разкачен / ⚠️ but tuya-bridge is healthy, the problem is at the house (device power / Wi-Fi / the Sofia TP-Link network) — not the cluster.

Step 4 — if something on your chain is broken

You can't fix the cluster (read-only), so capture + hand off:

kubectl describe pod -n tuya-bridge <pod>
kubectl logs -n tuya-bridge <pod> --previous --tail=200

Then file it for the admin with the /file-issue skill — e.g. "ha-sofia Tuya devices + ATS unresponsive; tuya-bridge pod CrashLooping" with the output above. cloudflared / Traefik / DNS outages are cluster-wide — the admin's alerting is already firing, but file it so it's tracked from your side too.

What will skip for you (expected — not failures)

A few checks need access your account doesn't have. They warn/skip — that's normal, and none of them are on your ha-sofia chain:

Uptime Kuma (14) — needs an admin password from Vault.
PVE host checks — 36 (LVM snapshots), 43 (host thermals), 44 (host load), and the Proxmox CSI ghost-disk check — all need root SSH to the Proxmox host.
--fix — pod deletion (a write); not available to you.

(The ha-sofia checks are not in this list — your token makes them work.)

Your ha-sofia token

Stored at ~/.config/cluster-health/haos_token (yours, mode 600).
It's a dedicated long-lived token, named emo-cluster-health under ha-sofia → your profile → Long-Lived Access Tokens. Revoking it there affects only you.
It currently carries admin-level HA scope (Home Assistant only lets a token be minted for the account that created it, and it was minted via the admin account). If it ever stops working, tell wizard and a fresh one can be minted.

6.5 KiB Raw Blame History Unescape Escape