Per Viktor: healthy baseline range is 55-65 C; anything above 65 C is a
signal a VM/workload is using too much CPU and warrants investigation.
Previous thresholds were calibrated to the hardware's TjMax (75/83 C) —
that was too lax, since cluster-load-driven elevation arrives a long
time before throttling. The 65 C cutoff matches the live Prometheus
baseline (Apr 20-May 8 2026: peak 61-69 C, avg 51-55 C) and the
session-observed correlation: above 65 C means the cluster is doing
sustained work that should be looked at, even if hardware is still
nowhere near its limit.
Updated:
PASS < 65 C (within 55-65 baseline)
WARN 65-82 C (elevated; check top kvm processes for the culprit)
FAIL >= 83 C (at/above TjMax — throttling imminent)
Verified live: 67 C now WARN (was PASS under the 75 C threshold).
Both new checks SSH read-only to the PVE host and emit PASS/WARN/FAIL
via the standard healthcheck output + JSON. They run alongside the
existing 42 checks and surface the same alerts the 2026-05-20/21
optimization session had to gather by hand.
#43 PVE Host Thermals — Xeon E5-2699v4 package + per-core temps
Reads every /sys/class/hwmon/hwmon0/temp*_input in one SSH round-trip.
Thresholds tuned to the live TjMax=83 / Tcrit=93:
PASS < 75 °C package
WARN 75-82 °C (approaching max, action time)
FAIL >= 83 °C (at/above TjMax, throttling imminent)
Reports hottest core label too so a single hot core doesn't hide in
the package average.
#44 PVE Host Load — load avg vs 44-thread capacity
Reads /proc/loadavg, compares 5-min to thread count (44):
PASS load_5 < 30 (< 70% threads busy)
WARN 30-37 (oversubscribed but not saturating)
FAIL >= 38 (~85%+ threads busy — scheduler saturation)
Uses 5-min so brief work spikes don't false-fail.
Both gracefully WARN-degrade if SSH BatchMode fails, matching the
existing check 36 (LVM PVC snapshots) pattern. TOTAL_CHECKS bumped
42 -> 44 and the dispatcher updated.
The Uptime Kuma check was hitting https://uptime.viktorbarzin.me, which
sits behind Authentik forward-auth. Authentik 302-redirects the Socket.IO
handshake the uptime-kuma-api library uses, and the library can't
complete the OAuth flow, so every healthcheck reported "Connection
failed" even though the pod was healthy and serving 225 monitors.
Fix: open a transient `kubectl port-forward` to svc/uptime-kuma in the
uptime-kuma namespace for the duration of the check, connect the
library to http://127.0.0.1:<port> (no auth gate), then SIGKILL the
port-forward on the way out. The disown is to suppress bash's "Killed"
job notification on stderr, which corrupted stdout when stderr was
merged for JSON parsing.
Verified end-to-end: healthcheck now reports the real signal —
"external down(3): www, xray-vless, hermes-agent" — the same 3
Cloudflare-facing endpoints flagging in the uptime-kuma logs.
Six tuning changes to cluster_healthcheck.sh so PASS sections actually
reflect "nothing to act on":
1. prometheus_alerts: only count severity=warning|critical. Info-level
alerts (RecentNodeReboot soak, PVAutoExpanding) are by design — the
alert rule itself sets severity; the script should respect it.
2. tls_certs: lower WARN threshold 30d → 14d. cnpg-webhook-cert
auto-rotates at 7d before expiry, kyverno tls pairs at 15d, the
Lets Encrypt wildcard renews weekly; <14d is the only window where
human attention is genuinely useful.
3. ha_entities: skip mobile_app/device_tracker/notify/button/scene/
event/image/update domains (transient by design), skip friendly
names containing iphone/ipad/macbook/tv/bravia/laptop/etc., and
only count entities whose last_changed > 24h. Was 431/1470,
most of which were "phone in standby" noise.
4. ha_automations: only flag DISABLED automations as abandoned if
they've also been untouched (last_changed) for >180 days; raise
stale threshold 30d → 180d. Was flagging seasonal/holiday-only
automations as broken.
5. problematic_pods + evicted_pods: exclude pods owned by Jobs.
CronJob retry leftovers (Error/Failed phase pods that K8s keeps
around for log inspection) aren't problematic at the cluster level.
6. uptime_kuma: retry the WebSocket login 3x with backoff. Single-
shot failures were a recurring false-positive even though the
service was healthy.
Also: nvidia-exporter ingress auth=required → auth=none. HA Sofia's
nvidia REST sensors (Tesla_T4_GPU_Temperature, Power_Usage, etc.) poll
/metrics and got 302'd to Authentik like the idrac/snmp ones did.
Same fix.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The previous default of $(pwd)/config required running the script from
the infra/ directory or always passing --kubeconfig. From a parent
shell or any other working directory, the lookup hit a non-existent
file and kubectl returned a stale-token error, masking real check
results.
Now: use $KUBECONFIG if set, then ~/.kube/config, then fall back to
$(pwd)/config for backwards compatibility.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- HA Sofia token: auto-bootstrap from Vault secret/viktor/haos_api_token when
HOME_ASSISTANT_SOFIA_{URL,TOKEN} env vars are unset. Default URL =
https://ha-sofia.viktorbarzin.me.
- cert-manager: add cert_manager_installed() probe (kubectl get crd
certificates.cert-manager.io). When not installed — which is our current
state — report PASS "N/A" instead of noisy WARN "CRDs unavailable".
- LVM snapshot freshness: grep pattern was `-- -snap` but actual LV names use
underscore (`foo_snap_YYYY...`), so the grep matched nothing and the check
always WARN'd. Fixed to `grep _snap`.
After fix: PASS 36→40, WARN 9→6, FAIL 1→1 (new ha_entities FAIL is a real
HA issue, not a script bug — 400/1401 sensors stale on ha-sofia).
- scripts/cluster_healthcheck.sh: add 12 new checks (cert-manager
readiness/expiry/requests, backup freshness per-DB/offsite/LVM,
monitoring prom+AM/vault-sealed/CSS, external reachability cloudflared
+authentik/ExternalAccessDivergence/traefik-5xx). Bump TOTAL_CHECKS
to 42, add --no-fix flag.
- Remove the duplicate pod-version .claude/cluster-health.sh (1728
lines) and the openclaw cluster_healthcheck CronJob (local CLI is
now the single authoritative runner). Keep the healthcheck SA +
Role + RoleBinding — still reused by task_processor CronJob.
- Remove SLACK_WEBHOOK_URL env from openclaw deployment and delete
the unused setup-monitoring.sh.
- Rewrite .claude/skills/cluster-health/SKILL.md: mandates running
the script first, refreshes the 42-check table, drops stale
CronJob/Slack/post-mortem sections, documents the monorepo-canonical
+ hardlink layout. File is hardlinked to
/home/wizard/code/.claude/skills/cluster-health/SKILL.md for
dual discovery.
- AGENTS.md + k8s-portal agent page: 25-check → 42-check.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Increase Uptime Kuma API timeout to 120s with wait_events=0.2
- Remove hardcoded password, use Vault or UPTIME_KUMA_PASSWORD env var
- Report internal and external monitor status separately
- Install uptime-kuma-api in local venv
[ci skip]
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Verifies snmp-exporter, idrac-redfish-exporter, proxmox-exporter, and
tuya-bridge pods are running, plus checks Prometheus scrape targets
(snmp-idrac, snmp-ups, redfish-idrac, proxmox-host) are UP.
- Healthcheck: add entity availability, integration health, automation
status, and system resources checks for Home Assistant Sofia
- Docs: add backup-dr architecture documentation
Check real CPU/memory usage via kubectl top nodes instead of
limits-vs-allocatable ratios. Thresholds: >80% WARN, >90% FAIL.
Limits overcommit is expected with 70+ services on 3 worker nodes.
- Reduce Kyverno LimitRange default limits ~4x across all tiers to fix
800-900% memory overcommitment on worker nodes
- Add cluster health check #25: per-node resource overcommitment
showing requests and limits vs allocatable capacity
- Add Kyverno policy for Goldilocks VPA mode by tier: tier-0 namespaces
get VPA Off mode (recommend only, no evictions) to prevent downtime
on critical infra (traefik, cloudflared, authentik, technitium, etc.)
- Non-tier-0 namespaces get VPA Auto mode for active right-sizing
- Add explicit resource limits to dashy (2Gi memory) to prevent OOMKilled
during webpack build on startup
- Rewrite DNS healthcheck to test from inside the Technitium pod via
kubectl exec, since MetalLB virtual IPs aren't reachable from outside
the L2 network
- Deleted orphaned kured/tls-secret (expired Oct 2025, module disabled,
not mounted by kured DaemonSet)
Adds check #14 that queries Uptime Kuma API for application-level
monitor status, complementing the kubectl-level checks with HTTP/ping
health data. Reports down monitors by name with PASS/WARN/FAIL thresholds.