healthcheck: tune noise filters + nvidia-exporter auth=none
Six tuning changes to cluster_healthcheck.sh so PASS sections actually reflect "nothing to act on": 1. prometheus_alerts: only count severity=warning|critical. Info-level alerts (RecentNodeReboot soak, PVAutoExpanding) are by design — the alert rule itself sets severity; the script should respect it. 2. tls_certs: lower WARN threshold 30d → 14d. cnpg-webhook-cert auto-rotates at 7d before expiry, kyverno tls pairs at 15d, the Lets Encrypt wildcard renews weekly; <14d is the only window where human attention is genuinely useful. 3. ha_entities: skip mobile_app/device_tracker/notify/button/scene/ event/image/update domains (transient by design), skip friendly names containing iphone/ipad/macbook/tv/bravia/laptop/etc., and only count entities whose last_changed > 24h. Was 431/1470, most of which were "phone in standby" noise. 4. ha_automations: only flag DISABLED automations as abandoned if they've also been untouched (last_changed) for >180 days; raise stale threshold 30d → 180d. Was flagging seasonal/holiday-only automations as broken. 5. problematic_pods + evicted_pods: exclude pods owned by Jobs. CronJob retry leftovers (Error/Failed phase pods that K8s keeps around for log inspection) aren't problematic at the cluster level. 6. uptime_kuma: retry the WebSocket login 3x with backoff. Single- shot failures were a recurring false-positive even though the service was healthy. Also: nvidia-exporter ingress auth=required → auth=none. HA Sofia's nvidia REST sensors (Tesla_T4_GPU_Temperature, Power_Usage, etc.) poll /metrics and got 302'd to Authentik like the idrac/snmp ones did. Same fix. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
This commit is contained in:
parent
7d7c7e4b7f
commit
2f0e8c88a9
2 changed files with 158 additions and 24 deletions
|
|
@ -217,8 +217,11 @@ resource "kubernetes_service" "nvidia-exporter" {
|
|||
|
||||
|
||||
module "ingress" {
|
||||
source = "../../../../modules/kubernetes/ingress_factory"
|
||||
auth = "required"
|
||||
source = "../../../../modules/kubernetes/ingress_factory"
|
||||
# Auth disabled — HA Sofia REST sensors poll /metrics; the OIDC flow
|
||||
# would 302 every request. Same pattern as idrac-redfish-exporter +
|
||||
# snmp-exporter (commit 5c594291).
|
||||
auth = "none"
|
||||
namespace = kubernetes_namespace.nvidia.metadata[0].name
|
||||
name = "nvidia-exporter"
|
||||
root_domain = "viktorbarzin.lan"
|
||||
|
|
|
|||
Loading…
Add table
Add a link
Reference in a new issue