Six tuning changes to cluster_healthcheck.sh so PASS sections actually
reflect "nothing to act on":
1. prometheus_alerts: only count severity=warning|critical. Info-level
alerts (RecentNodeReboot soak, PVAutoExpanding) are by design — the
alert rule itself sets severity; the script should respect it.
2. tls_certs: lower WARN threshold 30d → 14d. cnpg-webhook-cert
auto-rotates at 7d before expiry, kyverno tls pairs at 15d, the
Lets Encrypt wildcard renews weekly; <14d is the only window where
human attention is genuinely useful.
3. ha_entities: skip mobile_app/device_tracker/notify/button/scene/
event/image/update domains (transient by design), skip friendly
names containing iphone/ipad/macbook/tv/bravia/laptop/etc., and
only count entities whose last_changed > 24h. Was 431/1470,
most of which were "phone in standby" noise.
4. ha_automations: only flag DISABLED automations as abandoned if
they've also been untouched (last_changed) for >180 days; raise
stale threshold 30d → 180d. Was flagging seasonal/holiday-only
automations as broken.
5. problematic_pods + evicted_pods: exclude pods owned by Jobs.
CronJob retry leftovers (Error/Failed phase pods that K8s keeps
around for log inspection) aren't problematic at the cluster level.
6. uptime_kuma: retry the WebSocket login 3x with backoff. Single-
shot failures were a recurring false-positive even though the
service was healthy.
Also: nvidia-exporter ingress auth=required → auth=none. HA Sofia's
nvidia REST sensors (Tesla_T4_GPU_Temperature, Power_Usage, etc.) poll
/metrics and got 302'd to Authentik like the idrac/snmp ones did.
Same fix.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>