--- name: cluster-health description: | Personalized for emo. Check whether the homelab Kubernetes cluster is affecting ha-sofia or the Sofia smart-home devices it runs (Tuya devices, the MPPT ATS, lights, climate, security, irrigation). Use when: (1) "is ha-sofia ok", "are my devices / the ATS / the lights down", (2) "is the cluster affecting Sofia / my devices", (3) "check the cluster", "cluster health", "is everything running", (4) a device on the Барзини → Статус dashboard looks offline. Runs the cluster-wide healthcheck read-only and triages it by what ha-sofia actually depends on; the rest of the cluster is the admin's area. author: Claude Code version: 3.0.0-emo date: 2026-06-26 --- # Cluster Health — personalized for emo (ha-sofia focus) ## What you actually care about You care about **ha-sofia** and the **Sofia smart-home devices** it runs — the Tuya devices, the **MPPT ATS**, and the lights / climate / security / irrigation on your **Барзини → Статус** dashboard. The wider Kubernetes cluster matters to you **only when it's breaking something ha-sofia or your devices depend on.** Anything else is the admin's (wizard's) area — note it in one line and move on; don't chase it. You have **read-only** cluster access. You can SEE everything but change nothing — so when something on your chain is broken, the job is to confirm it and hand it off, not to repair it. ## How ha-sofia depends on the cluster ha-sofia itself runs at the house (HAOS at https://ha-sofia.viktorbarzin.me) — **not** in the cluster. The cluster reaches it through exactly two things: 1. **tuya-bridge** (namespace `tuya-bridge`) — the REST API ha-sofia calls for every Tuya device **and the MPPT ATS**. If it's unhealthy, your Tuya devices + ATS stop responding. **This is the #1 thing to check.** 2. **The path that carries ha-sofia ⇄ tuya-bridge and keeps ha-sofia reachable**: cloudflared (tunnel) → Traefik (LB) → the ingress + TLS cert for `tuya-bridge.viktorbarzin.me` and `ha-sofia.viktorbarzin.me`, plus Technitium DNS. If any of these break, ha-sofia can't reach tuya-bridge and you can't reach ha-sofia remotely. Everything else in the cluster is unrelated to you unless it's hosting one of those pods. ## Step 1 — run the healthcheck (read-only, with your HA token) Your account can't read Vault, so load your own ha-sofia token first (it was minted for you and lives at `~/.config/cluster-health/haos_token`). Then run the script from YOUR clone, read-only: ```bash cd /home/emo/code export HOME_ASSISTANT_SOFIA_TOKEN="$(cat ~/.config/cluster-health/haos_token)" bash scripts/cluster_healthcheck.sh --no-fix --quiet # machine-readable instead: # bash scripts/cluster_healthcheck.sh --no-fix --quiet --json | tee /tmp/cluster-health.json ``` - **Never pass `--fix`** — it deletes pods (a write); you're read-only and it will fail. - Exit codes: `0` healthy, `1` warnings, `2` failures. With the token exported, the **ha-sofia checks run for you**: 26 Entity Availability · 27 Integration Health · 28 Automation Status · 29 System Resources · **45 Status Dashboard** — your Барзини → Статус view, classifying every device tile as OK / ⚠️ / Offline across Сигурност, Мрежа & IT, Енергия, Климат, Уреди, Мултимедия, Осветление, Поливна. Check 30 also covers the **tuya** exporter. ## Step 2 — triage the output by relevance to YOU Read the PASS/WARN/FAIL summary, then split the WARN/FAIL items in two: - **On your chain → this is what matters.** Anything touching: `tuya-bridge`, `cloudflared`, `traefik`, DNS (check 21), the TLS cert / ingress for your two hosts (checks 12, 22, 31, 32), or a **node** hosting those pods — plus all the **ha-sofia** checks (26–29, 45) and the **tuya** exporter (30). - **Not on your chain → one line, then drop it.** Summarise as "N unrelated cluster issues (admin's area)" and don't investigate. ## Step 3 — read-only checks for your chain All of these work with your read-only access: ```bash # tuya-bridge — your devices + the ATS kubectl get pods -n tuya-bridge kubectl rollout status deploy/tuya-bridge -n tuya-bridge kubectl logs -n tuya-bridge deploy/tuya-bridge --tail=50 # the reachability path ha-sofia uses kubectl get pods -n cloudflared kubectl get pods -n traefik kubectl get ingress -A | grep -Ei 'tuya-bridge|ha-sofia' # whole external path in one shot (DNS + tunnel + Traefik + cert): curl -sI --max-time 10 https://tuya-bridge.viktorbarzin.me | head -1 # reachable -> HTTP/2 200 / 401 / 403 (any HTTP response = path is up) # broken -> curl: timeout / could not resolve host ``` The fastest **device-level** signal is your own dashboard: open **https://ha-sofia.viktorbarzin.me → Барзини → Статус**. If devices show Offline / Разкачен / ⚠️ **but tuya-bridge is healthy**, the problem is at the house (device power / Wi-Fi / the Sofia TP-Link network) — **not** the cluster. ## Step 4 — if something on your chain is broken You can't fix the cluster (read-only), so **capture + hand off**: ```bash kubectl describe pod -n tuya-bridge kubectl logs -n tuya-bridge --previous --tail=200 ``` Then file it for the admin with the **`/file-issue`** skill — e.g. *"ha-sofia Tuya devices + ATS unresponsive; tuya-bridge pod CrashLooping"* with the output above. cloudflared / Traefik / DNS outages are cluster-wide — the admin's alerting is already firing, but file it so it's tracked from your side too. ## What will skip for you (expected — not failures) A few checks need access your account doesn't have. They warn/skip — that's normal, and **none of them are on your ha-sofia chain**: - **Uptime Kuma (14)** — needs an admin password from Vault. - **PVE host checks** — 36 (LVM snapshots), 43 (host thermals), 44 (host load), and the Proxmox CSI ghost-disk check — all need root SSH to the Proxmox host. - **`--fix`** — pod deletion (a write); not available to you. (The ha-sofia checks are **not** in this list — your token makes them work.) ## Your ha-sofia token - Stored at `~/.config/cluster-health/haos_token` (yours, mode 600). - It's a **dedicated** long-lived token, named `emo-cluster-health` under ha-sofia → your profile → **Long-Lived Access Tokens**. Revoking it there affects only you. - It currently carries admin-level HA scope (Home Assistant only lets a token be minted for the account that created it, and it was minted via the admin account). If it ever stops working, tell wizard and a fresh one can be minted.