healthcheck: tune noise filters + nvidia-exporter auth=none

Six tuning changes to cluster_healthcheck.sh so PASS sections actually
reflect "nothing to act on":

1. prometheus_alerts: only count severity=warning|critical. Info-level
   alerts (RecentNodeReboot soak, PVAutoExpanding) are by design — the
   alert rule itself sets severity; the script should respect it.

2. tls_certs: lower WARN threshold 30d → 14d. cnpg-webhook-cert
   auto-rotates at 7d before expiry, kyverno tls pairs at 15d, the
   Let's Encrypt wildcard renews weekly; <14d is the only window where
   human attention is genuinely useful.

3. ha_entities: skip mobile_app/device_tracker/notify/button/scene/
   event/image/update domains (transient by design), skip friendly
   names containing iphone/ipad/macbook/tv/bravia/laptop/etc., and
   only count entities whose last_changed is older than 24h. The count
   was 431/1470, most of it "phone in standby" noise.

4. ha_automations: only flag DISABLED automations as abandoned if
   they've also been untouched (last_changed) for >180 days; raise
   stale threshold 30d → 180d. Was flagging seasonal/holiday-only
   automations as broken.

5. problematic_pods + evicted_pods: exclude pods owned by Jobs.
   CronJob retry leftovers (Error/Failed phase pods that K8s keeps
   around for log inspection) aren't problematic at the cluster level.

6. uptime_kuma: retry the WebSocket login 3x with backoff. Single-
   shot failures were a recurring false-positive even though the
   service was healthy.
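
The retry in item 6 could look roughly like the sketch below. This is an
illustration, not the code in cluster_healthcheck.sh: the helper name
retry_with_backoff and the doubling delay are assumptions, and
uptime_kuma_login in the usage comment is a hypothetical stand-in for
whatever performs the WebSocket login.

```shell
#!/usr/bin/env bash

# retry_with_backoff MAX_ATTEMPTS BASE_DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or MAX_ATTEMPTS is reached. The delay
# doubles after each failed attempt. Returns 0 on success, otherwise
# the exit status of the last failed attempt.
retry_with_backoff() {
  local max_attempts="$1" delay="$2"
  shift 2
  local attempt=1 status=0
  while true; do
    "$@" && return 0
    status=$?
    if [ "$attempt" -ge "$max_attempts" ]; then
      return "$status"
    fi
    sleep "$delay"
    delay=$((delay * 2))
    attempt=$((attempt + 1))
  done
}

# Usage (uptime_kuma_login is hypothetical):
#   retry_with_backoff 3 2 uptime_kuma_login || echo "WARN: login failed 3x"
```

With three attempts and a 2s base delay, the check waits 2s and then 4s
between attempts, so a transient failure costs at most 6s of sleep
before the section is reported as failing.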

Also: nvidia-exporter ingress auth=required → auth=none. HA Sofia's
nvidia REST sensors (Tesla_T4_GPU_Temperature, Power_Usage, etc.) poll
/metrics and got 302'd to Authentik like the idrac/snmp ones did.
Same fix.
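
One way to confirm that /metrics no longer redirects is to look at the
HTTP status a plain client receives. The sketch below is an assumption,
not part of this change: classify_probe is a hypothetical helper, and
the URL in the usage comment is a placeholder rather than the real
ingress host.

```shell
#!/usr/bin/env bash

# classify_probe STATUS_CODE
# Maps an HTTP status code to a coarse verdict:
#   2xx                   -> ok        (endpoint reachable without auth)
#   301/302/303/307/308   -> redirect  (e.g. bounced to a login page)
#   anything else         -> error
classify_probe() {
  case "$1" in
    2??) echo "ok" ;;
    30[12378]) echo "redirect" ;;
    *) echo "error" ;;
  esac
}

# Usage (placeholder URL; curl prints only the status code):
#   code=$(curl -s -o /dev/null -w '%{http_code}' "https://nvidia-exporter.example/metrics")
#   classify_probe "$code"
```

A REST sensor that does not follow the login flow would see the
"redirect" case as a failed poll, which is the symptom described above.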

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Viktor Barzin 2026-05-10 22:26:22 +00:00
parent 7d7c7e4b7f
commit 2f0e8c88a9
2 changed files with 158 additions and 24 deletions


@@ -217,8 +217,11 @@ resource "kubernetes_service" "nvidia-exporter" {
 module "ingress" {
-  source = "../../../../modules/kubernetes/ingress_factory"
-  auth = "required"
+  source = "../../../../modules/kubernetes/ingress_factory"
+  # Auth disabled: HA Sofia REST sensors poll /metrics; the OIDC flow
+  # would 302 every request. Same pattern as idrac-redfish-exporter +
+  # snmp-exporter (commit 5c594291).
+  auth = "none"
   namespace = kubernetes_namespace.nvidia.metadata[0].name
   name = "nvidia-exporter"
   root_domain = "viktorbarzin.lan"