diff --git a/scripts/workstation/claude-skills/README.md b/scripts/workstation/claude-skills/README.md
index 816cbcb7..07df2eb9 100644
--- a/scripts/workstation/claude-skills/README.md
+++ b/scripts/workstation/claude-skills/README.md
@@ -19,13 +19,24 @@ unpinned-CLI dependencies out of the hourly **root** reconcile.
 
 - `mattpocock/skills` (https://github.com/mattpocock/skills) — all except `find-skills`
 - `vercel-labs/skills` (https://github.com/vercel-labs/skills) — `find-skills`
+- **homelab-local** — `cluster-health` is vendored from this repo's own
+  `.claude/skills/cluster-health/` (the canonical copy, a project skill in the
+  infra clone). It is NOT in `~/.agents/skills/`, so the `cp -a` refresh below
+  does NOT update it — re-copy it explicitly when the canonical skill changes
+  (see Refreshing).
 
 ## Refreshing
 
-Re-snapshot from a current install and commit the diff:
+Re-snapshot the upstream skills from a current install and commit the diff:
 
 ```sh
 cp -a ~/.agents/skills/. scripts/workstation/claude-skills/
 ```
 
-Snapshot taken 2026-06-23.
+Re-sync the homelab-local skill(s) from their canonical in-repo copy:
+
+```sh
+cp -a .claude/skills/cluster-health scripts/workstation/claude-skills/
+```
+
+Snapshot taken 2026-06-23 (upstream); `cluster-health` vendored 2026-06-26.
diff --git a/scripts/workstation/claude-skills/cluster-health/SKILL.md b/scripts/workstation/claude-skills/cluster-health/SKILL.md
new file mode 100644
index 00000000..6772bf99
--- /dev/null
+++ b/scripts/workstation/claude-skills/cluster-health/SKILL.md
@@ -0,0 +1,454 @@
+---
+name: cluster-health
+description: |
+  Check Kubernetes cluster health and fix common issues. Use when:
+  (1) User asks to check the cluster, check health, or "what's wrong",
+  (2) User asks about pod status, node health, or deployment issues,
+  (3) User asks to fix stuck pods, evicted pods, or CrashLoopBackOff,
+  (4) User mentions "health check", "cluster status", "cluster health",
+  (5) User asks "is everything running" or "any problems".
+  Runs 47 cluster-wide checks (nodes, workloads, monitoring, certs,
+  backups, external reachability, PVE host thermals + load, HA Sofia
+  status dashboard, Immich smart-search, Proxmox CSI ghost-disk drift)
+  with safe auto-fix for evicted pods.
+author: Claude Code
+version: 2.0.0
+date: 2026-04-19
+---
+
+# Cluster Health Check
+
+## MANDATORY: Run the script first
+
+When this skill is invoked, your **first action** must be to run the
+cluster health check script and reason over its output before doing
+anything else. Do not improvise individual `kubectl` calls — the
+script is the authoritative surface.
+
+```bash
+cd /home/wizard/code
+bash infra/scripts/cluster_healthcheck.sh --json | tee /tmp/cluster-health.json
+```
+
+If the session is rooted elsewhere, fall back to the absolute path:
+
+```bash
+bash /home/wizard/code/infra/scripts/cluster_healthcheck.sh --json
+```
+
+Then:
+
+1. Parse the JSON. Report the PASS/WARN/FAIL counts + overall verdict.
+2. Iterate every FAIL and WARN check, describe what tripped, and propose
+   the remediation path (use the recipes below).
+3. Only reach for ad-hoc `kubectl` commands when investigating a
+   specific failure beyond what the script reported.
+
+Exit codes: `0` = healthy, `1` = warnings only, `2` = failures.
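The exit-code convention above can be turned into a verdict label with a small helper. This is a minimal sketch, not part of `cluster_healthcheck.sh`: the function name `verdict_from_exit_code` and the label strings are assumptions made for illustration. One caveat worth knowing: when the script is piped into `tee` as shown earlier, `$?` reports the status of `tee`, so in bash the script's own status has to be read from `${PIPESTATUS[0]}`.

```bash
# Hypothetical helper (not part of the health-check script): map the
# documented exit codes 0/1/2 to a short verdict label.
verdict_from_exit_code() {
  case "$1" in
    0) echo "healthy" ;;
    1) echo "warnings only" ;;
    2) echo "failures" ;;
    *) echo "unknown exit code: $1" ;;
  esac
}

# Usage sketch (bash only, because of PIPESTATUS):
#   bash infra/scripts/cluster_healthcheck.sh --json | tee /tmp/cluster-health.json
#   verdict_from_exit_code "${PIPESTATUS[0]}"
```

Any code outside the documented three falls through to the `unknown` branch, which keeps an unexpected script crash from being reported as a health verdict.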
+
+## Quick flags
+
+```bash
+# Human-readable report (default), no auto-fix
+bash infra/scripts/cluster_healthcheck.sh
+
+# Machine-readable JSON summary
+bash infra/scripts/cluster_healthcheck.sh --json
+
+# Only show WARN + FAIL (suppress PASS noise)
+bash infra/scripts/cluster_healthcheck.sh --quiet
+
+# Enable auto-fix (delete evicted pods, kick stuck CrashLoop pods)
+bash infra/scripts/cluster_healthcheck.sh --fix
+
+# Combined: quiet JSON without auto-fix
+bash infra/scripts/cluster_healthcheck.sh --no-fix --quiet --json
+
+# Custom kubeconfig
+bash infra/scripts/cluster_healthcheck.sh --kubeconfig /path/to/config
+```
+
+## What It Checks (47 checks)
+
+| # | Check | Notes |
+|---|-------|-------|
+| 1 | Node Status | NotReady nodes, version drift |
+| 2 | Node Resources | CPU/mem >80% (warn) / >90% (fail) |
+| 3 | Node Conditions | MemoryPressure / DiskPressure / PIDPressure |
+| 4 | Problematic Pods | CrashLoopBackOff / Error / ImagePullBackOff |
+| 5 | Evicted/Failed Pods | `status.phase=Failed` |
+| 6 | DaemonSets | desired == ready |
+| 7 | Deployments | ready == desired replicas |
+| 8 | PVC Status | all Bound |
+| 9 | HPA Health | targets not ``, utilization <100% |
+| 10 | CronJob Failures | job conditions `Failed=True` in last 24h |
+| 11 | CrowdSec Agents | all pods Running |
+| 12 | Ingress Routes | every ingress has an LB IP + Traefik LB |
+| 13 | Prometheus Alerts | count of firing alerts |
+| 14 | Uptime Kuma Monitors | internal + external monitors up |
+| 15 | ResourceQuota Pressure | any quota >80% used |
+| 16 | StatefulSets | ready == desired |
+| 17 | Node Disk Usage | ephemeral-storage <80% |
+| 18 | Helm Release Health | all `deployed` (no `pending-*`) |
+| 19 | Kyverno Policy Engine | all pods Running |
+| 20 | NFS Connectivity | 192.168.1.127 showmount / port 2049 |
+| 21 | DNS Resolution | Technitium resolves internal + external |
+| 22 | TLS Certificate Expiry | TLS `Secret` certs >30d valid |
+| 23 | GPU Health | nvidia namespace + device-plugin Running |
+| 24 | Cloudflare Tunnel | pods Running |
+| 25 | Resource Usage | node CPU/mem headroom |
+| 26 | HA Sofia — Entity Availability | Home Assistant unavailable/unknown count |
+| 27 | HA Sofia — Integration Health | config entries setup_error / not_loaded |
+| 28 | HA Sofia — Automation Status | disabled / stale (>30d) automations |
+| 29 | HA Sofia — System Resources | HA CPU / mem / disk |
+| 30 | Hardware Exporters | snmp / idrac-redfish / proxmox / tuya pods + scrapes |
+| 31 | cert-manager — Certificate Readiness | Certificate CRs with `Ready!=True` |
+| 32 | cert-manager — Certificate Expiry (<14d) | notAfter within 14d |
+| 33 | cert-manager — Failed CertificateRequests | `Ready=False, reason=Failed` |
+| 34 | Backup Freshness — Per-DB Dumps | MySQL + PG dumps within 25h |
+| 35 | Backup Freshness — Offsite Sync | Pushgateway `backup_last_success_timestamp` <27h |
+| 36 | Backup Freshness — LVM PVC Snapshots | newest thin snapshot <25h (SSH PVE) |
+| 37 | Monitoring — Prometheus + Alertmanager | `/-/ready` + AM pods Running |
+| 38 | Monitoring — Vault Sealed Status | `vault status` reports `Sealed: false` |
+| 39 | Monitoring — ClusterSecretStore Ready | `vault-kv` + `vault-database` Ready |
+| 40 | External — Cloudflared + Authentik Replicas | deployments fully ready |
+| 41 | External — ExternalAccessDivergence Alert | alert not firing |
+| 42 | External — Traefik 5xx Rate (15m) | top-10 services emitting 5xx |
+| 43 | PVE Host Thermals | package + per-core temps via `/sys/class/hwmon` (SSH). Baseline 55-65 °C. PASS <65 °C, WARN 65-82 °C (a VM is burning too much CPU), FAIL ≥83 °C (TjMax) |
+| 44 | PVE Host Load | `/proc/loadavg` via SSH. PASS 5m <30, WARN 30-37, FAIL ≥38 of 44 threads |
+| 45 | HA Sofia — Status Dashboard | emo's curated Барзини → Статус view (`dashboard-barzini` / path `status`). Pulls the lovelace config via WS, batch-renders every `custom:mushroom-template-card` secondary template against `/api/template`, classifies each rendered line: FAIL on `Offline` / `Disconnected` / `Разкачен` / `— No data`; WARN on `⚠️` / `Abnormal` / `Trouble (` / `(ниска)` / `Пълен резервоар` / `Грешка` / `attention` / `Внимание`. Verdict rolls up across the 8 sections (Сигурност, Мрежа & IT, Енергия, Климат, Уреди, Мултимедия, Осветление, Поливна) |
+| 46 | Immich Smart Search | `clip_index` residency in PG `shared_buffers` + representative ANN probe latency (in immich-postgresql). FAIL >1.5s or <50% resident; WARN >0.5s or <90% resident. Cold cache → check `clip-index-prewarm` CronJob |
+| 47 | Proxmox CSI — Ghost-Disk Drift | Per node, compares real virtio-scsi CSI disks in `qm config ` (SSH PVE) vs attached proxmox-CSI VolumeAttachments k8s tracks. Catches orphaned "ghost" disks left by failed detaches (`query-pci` QMP timeouts) that the scheduler's 28-LUN guard can't see. PASS reconciled; WARN drift>0 or real 20-24; FAIL real ≥25 (near LUN cap → imminent wedge). Cleanup: detach ghosts via `qm set --delete scsiN` (frees slot, retains LV) |
+
+## Safe Auto-Fix Rules
+
+`--fix` only performs operations that are genuinely reversible and
+observable. Nothing here rewrites Terraform state or mutates the cluster
+beyond "delete pod".
+
+### Done automatically by `--fix`
+
+- **Evicted / Failed pods** — delete them; the controller recreates.
+  ```bash
+  kubectl delete pods -A --field-selector=status.phase=Failed
+  ```
+- **CrashLoopBackOff pods with >10 restarts** — delete once to reset
+  backoff timer.
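The ">10 restarts" rule for CrashLoopBackOff pods can be expressed as a plain text filter. The sketch below is illustrative only and is not taken from `cluster_healthcheck.sh`: the function name `select_crashloop_candidates` and the four-column input format (`namespace pod restarts reason`) are assumptions, so real `kubectl` output would need to be reshaped into that form first.

```bash
# Hypothetical filter: read "<namespace> <pod> <restarts> <reason>" lines
# on stdin and print "namespace/pod" for pods that are in CrashLoopBackOff
# with more than the given number of restarts (default 10).
select_crashloop_candidates() {
  awk -v min="${1:-10}" \
    '$4 == "CrashLoopBackOff" && ($3 + 0) > (min + 0) { print $1 "/" $2 }'
}

# Usage sketch with made-up rows:
#   printf 'media frigate-0 14 CrashLoopBackOff\n' | select_crashloop_candidates
```

The comparison is strict, so a pod with exactly 10 restarts is left alone, matching the ">10" wording of the rule.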
+
+### NEVER auto-fix (requires human investigation)
+
+- NotReady nodes
+- MemoryPressure / DiskPressure / PIDPressure
+- ImagePullBackOff (usually a bad tag / registry credential)
+- Deployment ready-replica mismatch
+- Pending PVCs
+- Node CPU/memory >90%
+- CronJob failures
+- DaemonSet desired != ready
+- Vault sealed
+- ClusterSecretStore not Ready
+- cert-manager Certificate failures
+- Backup freshness regressions
+- Any external-reachability failure
+
+## Deep-investigation recipes per failure mode
+
+### Node Issues (checks 1, 3, 17, 25)
+
+```bash
+kubectl describe node
+kubectl top nodes
+kubectl get events --field-selector involvedObject.name= --sort-by='.lastTimestamp'
+# SSH to the node
+ssh root@10.0.20.10X
+systemctl status kubelet
+journalctl -u kubelet --since "30 minutes ago" | tail -100
+df -h ; free -h
+```
+
+Node IPs: `10.0.20.100` master, `.101` node1 (GPU), `.102` node2,
+`.103` node3, `.104` node4.
+
+### Pod Issues (checks 4, 5, 11, 19)
+
+```bash
+kubectl describe pod -n
+kubectl logs -n --tail=200
+kubectl logs -n --previous --tail=200
+kubectl get events -n --sort-by='.lastTimestamp' | tail -20
+```
+
+Common failure causes: OOMKilled (raise mem limit in Terraform), bad
+config / missing env var, DB connection failure (check `dbaas` pods),
+NFS mount failure (`showmount -e 192.168.1.127`), stale
+imagePullSecret.
+
+### Deployment / StatefulSet / DaemonSet (checks 6, 7, 16)
+
+```bash
+kubectl describe deployment -n
+kubectl rollout status deployment -n
+kubectl rollout history deployment -n
+kubectl get rs -n -l app=
+```
+
+### PVC (check 8)
+
+```bash
+kubectl describe pvc -n
+kubectl get events -n --field-selector reason=FailedMount --sort-by='.lastTimestamp'
+kubectl get pv | grep
+showmount -e 192.168.1.127
+```
+
+### cert-manager (checks 31, 32, 33)
+
+```bash
+kubectl get certificate -A
+kubectl describe certificate -n
+kubectl get certificaterequest -A
+kubectl describe certificaterequest -n
+kubectl logs -n cert-manager deploy/cert-manager | tail -50
+```
+
+Common causes: ACME HTTP-01 challenge blocked, ClusterIssuer missing
+DNS provider secret, rate-limit from Let's Encrypt.
+
+### Backups (checks 34, 35, 36)
+
+```bash
+# Per-DB dumps (inside the DB pod)
+kubectl exec -n dbaas mysql-standalone-0 -- ls -lah /backup/per-db/
+kubectl exec -n dbaas pg-cluster-0 -- ls -lah /backup/per-db/
+
+# Pushgateway metrics
+kubectl exec -n monitoring deploy/prometheus-server -- \
+  wget -qO- http://prometheus-prometheus-pushgateway:9091/metrics | \
+  grep backup_last_success_timestamp
+
+# LVM snapshots on PVE host
+ssh -o BatchMode=yes root@192.168.1.127 \
+  'lvs -o lv_name,lv_time,lv_size --noheadings | grep snap'
+```
+
+If offsite sync is stale, the common cause is the
+`offsite-sync-backup.service` systemd unit on the PVE host failing.
+`ssh root@192.168.1.127 'systemctl status offsite-sync-backup'`.
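The freshness windows quoted for checks 34 to 36 (25h and 27h) come down to comparing a last-success timestamp against the current time. The following is a minimal sketch of that comparison under stated assumptions: `is_fresh` is a hypothetical name, the timestamp is assumed to be whole epoch seconds, and the health-check script may compute this differently. Pushgateway can print metric values in floating-point or exponent notation, which would need converting to an integer before being passed in.

```bash
# Hypothetical helper: succeed (exit 0) when a last-success timestamp is
# younger than the allowed age.
#   $1 = last-success time, whole epoch seconds
#   $2 = maximum age in hours
#   $3 = "now" in epoch seconds (optional; defaults to the current time)
is_fresh() {
  now=${3:-$(date +%s)}
  [ $(( now - $1 )) -lt $(( $2 * 3600 )) ]
}

# Usage sketch, assuming $ts holds an integer timestamp:
#   if is_fresh "$ts" 27; then echo "offsite sync OK"; else echo "offsite sync STALE"; fi
```

The optional third argument exists only so the comparison can be exercised with a fixed clock.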
+
+### Monitoring stack (checks 37, 38, 39)
+
+```bash
+# Prometheus
+kubectl exec -n monitoring deploy/prometheus-server -- wget -qO- http://localhost:9090/-/ready
+kubectl logs -n monitoring deploy/prometheus-server --tail=100
+
+# Alertmanager
+kubectl get pods -n monitoring | grep alertmanager
+kubectl logs -n monitoring -l app=prometheus-alertmanager --tail=100
+
+# Vault
+kubectl exec -n vault vault-0 -- sh -c 'VAULT_ADDR=http://127.0.0.1:8200 vault status'
+# If sealed: check raft peers with `vault operator raft list-peers` and unseal.
+
+# ClusterSecretStore
+kubectl get clustersecretstore
+kubectl describe clustersecretstore vault-kv vault-database
+kubectl logs -n external-secrets deploy/external-secrets --tail=100
+```
+
+### External reachability (checks 40, 41, 42)
+
+```bash
+# Cloudflared
+kubectl get pods -n cloudflared
+kubectl logs -n cloudflared -l app=cloudflared --tail=100
+
+# Authentik (Helm chart names the deployment goauthentik-server)
+kubectl get deployment -n authentik goauthentik-server
+kubectl logs -n authentik deploy/goauthentik-server --tail=100
+
+# ExternalAccessDivergence alert
+kubectl exec -n monitoring deploy/prometheus-server -- \
+  wget -qO- 'http://localhost:9090/api/v1/alerts' | \
+  python3 -m json.tool | grep -A 5 ExternalAccessDivergence
+
+# Traefik 5xx — find the hot service
+kubectl exec -n monitoring deploy/prometheus-server -- \
+  wget -qO- 'http://localhost:9090/api/v1/query?query=topk(10,rate(traefik_service_requests_total{code=~%225..%22}%5B15m%5D))' \
+  | python3 -m json.tool
+```
+
+### OOMKilled remediation
+
+1. `kubectl describe pod -n | grep -A 5 Limits`
+2. Edit `infra/modules/kubernetes//main.tf` and raise
+   `resources.limits.memory`.
+3. `cd /home/wizard/code/infra && scripts/tg apply` (Tier 1) or
+   `terraform apply -target=module.` as appropriate.
+
+### ImagePullBackOff remediation
+
+1. `kubectl describe pod -n | grep -A 5 Events`
+2. Verify tag exists on the source registry.
+3. Check pull-through cache at `10.0.20.10:{5000,5010,5020,5030}`.
+4. Update the image tag in Terraform + re-apply.
+
+### Persistent CrashLoopBackOff after auto-fix
+
+1. `kubectl logs -n --previous --tail=200`
+2. `kubectl describe pod -n ` and check Last State:
+   - `OOMKilled` → raise memory limit
+   - Exit code 137 → OOM or probe killed
+   - Exit code 143 → SIGTERM / graceful shutdown failed
+3. Cross-check dbaas + NFS + secrets are healthy.
+
+## Performance forensics — top consumers + optimization hints
+
+When the cluster is healthy (script returns 0) but the host is hot or load
+is elevated, switch from "what broke?" to "what's expensive?". Run these
+in order; stop as soon as the root cause is obvious.
+
+### Step 1 — Snapshot top consumers cluster-wide
+
+```bash
+# Top 15 pods by current CPU
+kubectl top pods --all-namespaces --sort-by=cpu --no-headers | head -15
+
+# Top 5 nodes by CPU + memory pressure
+kubectl top nodes
+
+# Top 15 by 5-min rolling rate (smoothed — kills noise from one-off spikes)
+kubectl -n monitoring exec deploy/prometheus-server -- wget -qO- \
+  "http://localhost:9090/api/v1/query?query=topk(15,sum%20by%20(namespace,pod)%20(rate(container_cpu_usage_seconds_total%7Bcontainer!%3D''%7D%5B5m%5D)))" \
+  | python3 -m json.tool | head -80
+```
+
+### Step 2 — For each suspect pod, get the WHY
+
+For every pod in the top-N, gather these BEFORE proposing a fix:
+
+```bash
+NS=; POD=; CONT=$(kubectl -n $NS get pod $POD -o jsonpath='{.spec.containers[0].name}')
+
+# What it does (image + command)
+kubectl -n $NS get pod $POD -o jsonpath='{.spec.containers[0].image}{"\n"}{.spec.containers[0].args}{"\n"}'
+
+# Resource limits + current usage
+kubectl -n $NS top pod $POD --containers
+kubectl -n $NS get pod $POD -o jsonpath='{.spec.containers[0].resources}'
+
+# Recent logs filtered for reconcile loops, watch storms, slow queries
+kubectl -n $NS logs $POD -c $CONT --tail=200 --since=5m 2>&1 \
+  | grep -iE 'reconcil|watch|scrape|index|loop|retry|slow|timeout' | tail -20
+
+# Restart count + recent OOM
+kubectl -n $NS describe pod $POD | grep -E 'Restart Count|Last State|Reason'
+
+# Self-exported metrics (for apps that publish on /metrics)
+kubectl -n $NS exec $POD -c $CONT -- wget -qO- localhost:/metrics 2>/dev/null | head -50
+```
+
+### Step 3 — apiserver / etcd specific deep-dive (when control-plane is hot)
+
+```bash
+# Top request producers by verb+resource (last 30 min)
+kubectl -n monitoring exec deploy/prometheus-server -- wget -qO- \
+  "http://localhost:9090/api/v1/query?query=topk(15,sum%20by%20(resource,verb)%20(rate(apiserver_request_total%5B30m%5D)))" \
+  | python3 -m json.tool
+
+# Top user agents (which clients are hammering)
+kubectl -n monitoring exec deploy/prometheus-server -- wget -qO- \
+  "http://localhost:9090/api/v1/query?query=topk(15,sum%20by%20(user_agent)%20(rate(apiserver_request_total%5B30m%5D)))" \
+  | python3 -m json.tool
+
+# Long-running requests (WATCH / CONNECT — log streams, pod-watchers)
+kubectl -n monitoring exec deploy/prometheus-server -- wget -qO- \
+  "http://localhost:9090/api/v1/query?query=apiserver_longrunning_requests" \
+  | python3 -m json.tool
+
+# etcd WAL fsync rate (a proxy for write rate; this query does not return DB size)
+kubectl -n monitoring exec deploy/prometheus-server -- wget -qO- \
+  "http://localhost:9090/api/v1/query?query=rate(etcd_disk_wal_fsync_duration_seconds_count%5B5m%5D)" \
+  | python3 -m json.tool
+```
+
+### Step 4 — PVE host specific deep-dive (when temp / load is high)
+
+Checks 43 + 44 capture package temp + 5-min load avg with PASS/WARN/FAIL
+thresholds — that's the first stop. When those WARN or FAIL, the
+follow-up commands below trace which VM / process is the source:
+
+```bash
+# Per-core temps (broader than the package summary in check 43)
+ssh root@192.168.1.127 'for f in /sys/class/hwmon/hwmon0/temp*_input; do
+  base=${f%_input}; label=$(cat ${base}_label 2>/dev/null || echo "${base##*/}")
+  val=$(cat "$f"); echo " $label: $((val/1000))°C"
+done'
+
+# Per-VM CPU (each VM = one kvm process)
+ssh root@192.168.1.127 'top -bn1 -o %CPU | grep kvm | head -10'
+
+# pvestatd anomaly check — bursts > 50% usually mean LV count > 1000
+ssh root@192.168.1.127 'lvs --noheadings 2>/dev/null | wc -l'
+
+# Stale snapshots (any '_pre-*' that survived past their rollback window)
+ssh root@192.168.1.127 'lvs --noheadings -o lv_name 2>/dev/null | awk "/_pre-/" | head -20'
+```
+
+### Step 5 — Optimization decision
+
+For each consumer in the top-N, fill in a row:
+
+| Pod / Process | CPU (m) | Why busy | Tunable | Est saving | Trade-off | Effort |
+|---|---|---|---|---|---|---|
+
+Then rank by ROI (saving / effort) and surface the top 3-5. **Hold back the ones where saving < 50m unless effort is also < 5 min.**
+
+### Common causes + tunables (catalogue)
+
+| Symptom | Likely cause | Tunable |
+|---|---|---|
+| **`kube-apiserver` > 1 core sustained** | `CONNECT pods/log` streams from `alloy`/`promtail` using apiserver-tail; OR Kyverno PolicyReport churn (background+enforce mode); OR VPA fanout (309 VPAs cause ~7 req/s) | Switch alloy/promtail to `loki.source.file`; raise Kyverno `backgroundScanInterval`; reduce VPA count |
+| **`pvestatd` 70-100% bursts** | LV metadata scan over > 1000 LVs (typically stale `_pre-*` snapshots from ad-hoc node ops) | Delete stale snapshots; `/usr/local/bin/lvm-pvc-snapshot prune` |
+| **Frigate > 2 cores** | Birdseye `mode: continuous` (16% on frigate.output); LPR debug; debug logging; too many active cameras × detect.fps | `birdseye.mode: motion`; `lpr.debug_save_plates: false`; remove debug loggers |
+| **`vault-0` looping ERRORs every ~10s** | DB static-role not in connection's `allowed_roles` list (drift between role and connection) | Add role to `vault_database_secret_backend_connection.*.allowed_roles` in TF |
+| **Alloy DS > 100m/pod** | `loki.source.kubernetes` (apiserver-tail) instead of `loki.source.file` | Switch to file-tail (~5× drop per pod) |
+| **Prometheus default 1m scrape** | Chart default; new sample every minute | Raise `server.global.scrape_interval` to 2m; pin critical jobs (snmp-ups) to 30s; bump `for: 1m` alerts to `for: 3m` |
+| **`kube-controller-manager` periodic ERROR loop** | Aggregated APIService discovery fails (calico/metrics-server unreachable, OR stuck Terminating pod still in endpoints) | Force-delete stuck pod; verify APIService Available; check pod runc bug on k8s-master |
+| **etcd write > 1 MB/s** | PolicyReport thrash, too-frequent secret rotation, or audit log mode = RequestResponse | Trim Kyverno reports config; raise rotation_period; downgrade audit policy to Metadata for noisy resources |
+
+### What NOT to touch
+
+- **calico-node, etcd write rate, kube-controller-manager core work, pg-cluster replication** — structural cost, touching them risks correctness.
+- **Pods doing legitimate request-serving work** (web servers, databases under load) — optimize the workload, not the runtime.
+- **Anything where Goldilocks VPA upperBound is already close to current request** — no headroom to cut.
+
+### Source-of-truth notes
+
+- **All infra mutations go via Terraform** (`scripts/tg plan/apply`). The recipes above are diagnostic; the FIX lives in `infra/stacks//main.tf` or chart values.
+- **Pod-internal config files** (e.g., Frigate's `/config/config.yml` on a PVC) are not TF-managed — edit in-pod and document in `infra/docs/runbooks/`.
+- **PVE host-level state** (LVM snapshots, pvestatd) — SSH + manual ops; record in memory if the pattern recurs.
+
+## Notes on the canonical / hardlink setup
+
+The authoritative copy of this SKILL.md lives at
+`/home/wizard/code/.claude/skills/cluster-health/SKILL.md`. A hardlink
+at `/home/wizard/code/infra/.claude/skills/cluster-health/SKILL.md`
+points to the same inode so infra-rooted sessions also discover the
+skill.
+
+To verify the hardlink is intact:
+
+```bash
+stat -c '%i %n' \
+  /home/wizard/code/.claude/skills/cluster-health/SKILL.md \
+  /home/wizard/code/infra/.claude/skills/cluster-health/SKILL.md
+```
+
+Both should print the same inode number. If they diverge (e.g. `git
+checkout` replaced the file rather than updating it), re-link:
+
+```bash
+ln -f /home/wizard/code/.claude/skills/cluster-health/SKILL.md \
+  /home/wizard/code/infra/.claude/skills/cluster-health/SKILL.md
+```
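The verify-then-relink steps above can also be scripted. This is a sketch rather than part of the skill: `same_inode` is a hypothetical helper, and it relies on the shell's `-ef` file test (true when two paths refer to the same device and inode), which bash and dash provide but which is not required by POSIX. It is an alternative to comparing the `stat -c '%i'` output by eye.

```bash
# Hypothetical helper: succeed when both paths are the same file on disk
# (same device and inode), i.e. when the hardlink is intact.
same_inode() {
  [ "$1" -ef "$2" ]
}

# Usage sketch with the two paths from this section:
#   canonical=/home/wizard/code/.claude/skills/cluster-health/SKILL.md
#   linked=/home/wizard/code/infra/.claude/skills/cluster-health/SKILL.md
#   same_inode "$canonical" "$linked" || ln -f "$canonical" "$linked"
```

Re-linking only when the check fails keeps the operation idempotent, so it is safe to run repeatedly.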