All checks were successful
ci/woodpecker/push/default Pipeline was successful
emo cares about ha-sofia + his Sofia smart-home devices (Tuya, the MPPT ATS, the Барзини → Статус dashboard), and only about the cluster when it's breaking those. Rewrite his vendored cluster-health into an ha-sofia-focused, read-only variant: - leads with ha-sofia's in-cluster dependency chain (tuya-bridge + the cloudflared/Traefik/DNS/TLS reachability path), all checkable read-only; - fixes the script path to emo's own clone (/home/emo/code) — he can't read wizard's tree — and runs it --no-fix (he's cluster read-only); - loads emo's own HA token (see below) so the ha-sofia checks (26-29, 45) actually run for him; documents the host-SSH/Vault checks that skip; - triages: cluster FAIL/WARN matters only if on his chain; everything else is a one-line "admin's area"; escalate via /file-issue since he can't fix. This snapshot copy is now an emo-specific variant, intentionally diverged from the canonical 47-check admin skill — README updated to say "do not re-sync from canonical". Token: a dedicated long-lived HA token (client_name emo-cluster-health) was minted on ha-sofia via the admin account and stored emo-readable at /home/emo/.config/cluster-health/haos_token (600). It carries admin HA scope (HA only mints tokens for the authenticating account); independently revocable. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
146 lines
6.5 KiB
Markdown
146 lines
6.5 KiB
Markdown
---
|
||
name: cluster-health
|
||
description: |
|
||
Personalized for emo. Check whether the homelab Kubernetes cluster is
|
||
affecting ha-sofia or the Sofia smart-home devices it runs (Tuya devices,
|
||
the MPPT ATS, lights, climate, security, irrigation). Use when:
|
||
(1) "is ha-sofia ok", "are my devices / the ATS / the lights down",
|
||
(2) "is the cluster affecting Sofia / my devices",
|
||
(3) "check the cluster", "cluster health", "is everything running",
|
||
(4) a device on the Барзини → Статус dashboard looks offline.
|
||
Runs the cluster-wide healthcheck read-only and triages it by what
|
||
ha-sofia actually depends on; the rest of the cluster is the admin's area.
|
||
author: Claude Code
|
||
version: 3.0.0-emo
|
||
date: 2026-06-26
|
||
---
|
||
|
||
# Cluster Health — personalized for emo (ha-sofia focus)
|
||
|
||
## What you actually care about
|
||
|
||
You care about **ha-sofia** and the **Sofia smart-home devices** it runs —
|
||
the Tuya devices, the **MPPT ATS**, and the lights / climate / security /
|
||
irrigation on your **Барзини → Статус** dashboard. The wider Kubernetes
|
||
cluster matters to you **only when it's breaking something ha-sofia or your
|
||
devices depend on.** Anything else is the admin's (wizard's) area — note it in
|
||
one line and move on; don't chase it.
|
||
|
||
You have **read-only** cluster access. You can SEE everything but change
|
||
nothing — so when something on your chain is broken, the job is to confirm it
|
||
and hand it off, not to repair it.
|
||
|
||
## How ha-sofia depends on the cluster
|
||
|
||
ha-sofia itself runs at the house (HAOS at https://ha-sofia.viktorbarzin.me) —
|
||
**not** in the cluster. The cluster reaches it through exactly two things:
|
||
|
||
1. **tuya-bridge** (namespace `tuya-bridge`) — the REST API ha-sofia calls for
|
||
every Tuya device **and the MPPT ATS**. If it's unhealthy, your Tuya devices
|
||
+ ATS stop responding. **This is the #1 thing to check.**
|
||
2. **The path that carries ha-sofia ⇄ tuya-bridge and keeps ha-sofia
|
||
reachable**: cloudflared (tunnel) → Traefik (LB) → the ingress + TLS cert
|
||
for `tuya-bridge.viktorbarzin.me` and `ha-sofia.viktorbarzin.me`, plus
|
||
Technitium DNS. If any of these break, ha-sofia can't reach tuya-bridge and
|
||
you can't reach ha-sofia remotely.
|
||
|
||
Everything else in the cluster is unrelated to you unless it's hosting one of
|
||
those pods.
|
||
|
||
## Step 1 — run the healthcheck (read-only, with your HA token)
|
||
|
||
Your account can't read Vault, so load your own ha-sofia token first (it was
|
||
minted for you and lives at `~/.config/cluster-health/haos_token`). Then run
|
||
the script from YOUR clone, read-only:
|
||
|
||
```bash
|
||
cd /home/emo/code
|
||
export HOME_ASSISTANT_SOFIA_TOKEN="$(cat ~/.config/cluster-health/haos_token)"
|
||
bash scripts/cluster_healthcheck.sh --no-fix --quiet
|
||
# machine-readable instead:
|
||
# bash scripts/cluster_healthcheck.sh --no-fix --quiet --json | tee /tmp/cluster-health.json
|
||
```
|
||
|
||
- **Never pass `--fix`** — it deletes pods (a write); you're read-only and it
|
||
will fail.
|
||
- Exit codes: `0` healthy, `1` warnings, `2` failures.
|
||
|
||
With the token exported, the **ha-sofia checks run for you**:
|
||
26 Entity Availability · 27 Integration Health · 28 Automation Status ·
|
||
29 System Resources · **45 Status Dashboard** — your Барзини → Статус view,
|
||
classifying every device tile as OK / ⚠️ / Offline across Сигурност, Мрежа &
|
||
IT, Енергия, Климат, Уреди, Мултимедия, Осветление, Поливна. Check 30 also
|
||
covers the **tuya** exporter.
|
||
|
||
## Step 2 — triage the output by relevance to YOU
|
||
|
||
Read the PASS/WARN/FAIL summary, then split the WARN/FAIL items in two:
|
||
|
||
- **On your chain → this is what matters.** Anything touching: `tuya-bridge`,
|
||
`cloudflared`, `traefik`, DNS (check 21), the TLS cert / ingress for your two
|
||
hosts (checks 12, 22, 31, 32), or a **node** hosting those pods — plus all the
|
||
**ha-sofia** checks (26–29, 45) and the **tuya** exporter (30).
|
||
- **Not on your chain → one line, then drop it.** Summarise as "N unrelated
|
||
cluster issues (admin's area)" and don't investigate.
|
||
|
||
## Step 3 — read-only checks for your chain
|
||
|
||
All of these work with your read-only access:
|
||
|
||
```bash
|
||
# tuya-bridge — your devices + the ATS
|
||
kubectl get pods -n tuya-bridge
|
||
kubectl rollout status deploy/tuya-bridge -n tuya-bridge
|
||
kubectl logs -n tuya-bridge deploy/tuya-bridge --tail=50
|
||
|
||
# the reachability path ha-sofia uses
|
||
kubectl get pods -n cloudflared
|
||
kubectl get pods -n traefik
|
||
kubectl get ingress -A | grep -Ei 'tuya-bridge|ha-sofia'
|
||
|
||
# whole external path in one shot (DNS + tunnel + Traefik + cert):
|
||
curl -sI --max-time 10 https://tuya-bridge.viktorbarzin.me | head -1
|
||
# reachable -> HTTP/2 200 / 401 / 403 (any HTTP response = path is up)
|
||
# broken -> curl: timeout / could not resolve host
|
||
```
|
||
|
||
The fastest **device-level** signal is your own dashboard: open
|
||
**https://ha-sofia.viktorbarzin.me → Барзини → Статус**. If devices show
|
||
Offline / Разкачен / ⚠️ **but tuya-bridge is healthy**, the problem is at the
|
||
house (device power / Wi-Fi / the Sofia TP-Link network) — **not** the cluster.
|
||
|
||
## Step 4 — if something on your chain is broken
|
||
|
||
You can't fix the cluster (read-only), so **capture + hand off**:
|
||
|
||
```bash
|
||
kubectl describe pod -n tuya-bridge <pod>
|
||
kubectl logs -n tuya-bridge <pod> --previous --tail=200
|
||
```
|
||
|
||
Then file it for the admin with the **`/file-issue`** skill — e.g. *"ha-sofia
|
||
Tuya devices + ATS unresponsive; tuya-bridge pod CrashLooping"* with the output
|
||
above. cloudflared / Traefik / DNS outages are cluster-wide — the admin's
|
||
alerting is already firing, but file it so it's tracked from your side too.
|
||
|
||
## What will skip for you (expected — not failures)
|
||
|
||
A few checks need access your account doesn't have. They warn/skip — that's
|
||
normal, and **none of them are on your ha-sofia chain**:
|
||
|
||
- **Uptime Kuma (14)** — needs an admin password from Vault.
|
||
- **PVE host checks** — 36 (LVM snapshots), 43 (host thermals), 44 (host load),
|
||
and the Proxmox CSI ghost-disk check — all need root SSH to the Proxmox host.
|
||
- **`--fix`** — pod deletion (a write); not available to you.
|
||
|
||
(The ha-sofia checks are **not** in this list — your token makes them work.)
|
||
|
||
## Your ha-sofia token
|
||
|
||
- Stored at `~/.config/cluster-health/haos_token` (yours, mode 600).
|
||
- It's a **dedicated** long-lived token, named `emo-cluster-health` under
|
||
ha-sofia → your profile → **Long-Lived Access Tokens**. Revoking it there
|
||
affects only you.
|
||
- It currently carries admin-level HA scope (Home Assistant only lets a token
|
||
be minted for the account that created it, and it was minted via the admin
|
||
account). If it ever stops working, tell wizard and a fresh one can be minted.
|