infra/stacks/nvidia/modules
Viktor Barzin a168277213 healthcheck: tune noise filters + nvidia-exporter auth=none
Six tuning changes to cluster_healthcheck.sh so PASS sections actually
reflect "nothing to act on":

1. prometheus_alerts: only count severity=warning|critical. Info-level
   alerts (RecentNodeReboot soak, PVAutoExpanding) are by design — the
   alert rule itself sets severity; the script should respect it.

2. tls_certs: lower WARN threshold 30d → 14d. cnpg-webhook-cert
   auto-rotates at 7d before expiry, kyverno tls pairs at 15d, the
   Lets Encrypt wildcard renews weekly; <14d is the only window where
   human attention is genuinely useful.

3. ha_entities: skip mobile_app/device_tracker/notify/button/scene/
   event/image/update domains (transient by design), skip friendly
   names containing iphone/ipad/macbook/tv/bravia/laptop/etc., and
   only count entities whose last_changed > 24h. Was 431/1470,
   most of which were "phone in standby" noise.

4. ha_automations: only flag DISABLED automations as abandoned if
   they've also been untouched (last_changed) for >180 days; raise
   stale threshold 30d → 180d. Was flagging seasonal/holiday-only
   automations as broken.

5. problematic_pods + evicted_pods: exclude pods owned by Jobs.
   CronJob retry leftovers (Error/Failed phase pods that K8s keeps
   around for log inspection) aren't problematic at the cluster level.

6. uptime_kuma: retry the WebSocket login 3x with backoff. Single-
   shot failures were a recurring false-positive even though the
   service was healthy.

Also: nvidia-exporter ingress auth=required → auth=none. HA Sofia's
nvidia REST sensors (Tesla_T4_GPU_Temperature, Power_Usage, etc.) poll
/metrics and got 302'd to Authentik like the idrac/snmp ones did.
Same fix.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:16:43 +00:00
..
nvidia healthcheck: tune noise filters + nvidia-exporter auth=none 2026-05-22 14:16:43 +00:00