healthcheck: probe uptime-kuma via internal Service (port-forward), not public URL

The Uptime Kuma check was hitting https://uptime.viktorbarzin.me, which sits behind Authentik forward-auth. Authentik 302-redirects the Socket.IO handshake that the uptime-kuma-api library uses, and the library can't complete the OAuth flow, so every healthcheck reported "Connection failed" even though the pod was healthy and serving 225 monitors.

Fix: open a transient `kubectl port-forward` to svc/uptime-kuma in the uptime-kuma namespace for the duration of the check, connect the library to http://127.0.0.1:<port> (no auth gate), then SIGKILL the port-forward on the way out. The `disown` suppresses bash's "Killed" job notification on stderr, which corrupted stdout when stderr was merged for JSON parsing.

Verified end-to-end: the healthcheck now reports the real signal, "external down(3): www, xray-vless, hermes-agent", which matches the 3 Cloudflare-facing endpoints flagged in the uptime-kuma logs.
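
The port-forward lifecycle described above can be sketched as two small bash helpers. This is a minimal illustration and not the code in the commit: `wait_for_port` and `with_port_forward` are hypothetical names, and unlike the committed loop the sketch returns a failure when the listener never comes up instead of proceeding regardless. The namespace, Service name and target port mirror the message above.

```shell
#!/usr/bin/env bash

# Hypothetical helper: poll until something accepts TCP connections on
# host:port, giving up after a fixed number of attempts.
wait_for_port() {
    local host=$1 port=$2 tries=${3:-5} delay=${4:-1} i
    for ((i = 0; i < tries; i++)); do
        # The subshell confines the /dev/tcp descriptor, so the probe
        # connection is closed as soon as the test finishes.
        if (echo >"/dev/tcp/$host/$port") 2>/dev/null; then
            return 0
        fi
        sleep "$delay"
    done
    return 1
}

# Hypothetical wrapper: run "$@" while a port-forward to the in-cluster
# Service is up, then tear the forwarder down unconditionally.
with_port_forward() {
    local port=$1; shift
    local pf_pid rc
    # Left unquoted on purpose so KUBECTL may carry extra arguments.
    ${KUBECTL:-kubectl} port-forward -n uptime-kuma svc/uptime-kuma \
        "$port:80" >/dev/null 2>&1 &
    pf_pid=$!
    # Drop the job from bash's job table so no "Killed" notice is
    # printed when the forwarder is SIGKILLed below.
    disown "$pf_pid" 2>/dev/null || true
    if wait_for_port 127.0.0.1 "$port" 5 1; then
        "$@"
        rc=$?
    else
        echo "port-forward did not come up on 127.0.0.1:$port" >&2
        rc=1
    fi
    kill -9 "$pf_pid" 2>/dev/null || true
    return "$rc"
}
```

A caller would wrap the probe command, for example `with_port_forward 18444 some_probe_command`, where `some_probe_command` is a placeholder for whatever connects to http://127.0.0.1:18444. Reporting the readiness failure separately keeps a dead port-forward distinguishable from a failed login.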
2026-05-11 20:02:57 +00:00
commit 30eff178e9
parent a699d5bedf
1 changed file with 32 additions and 6 deletions
--- a/scripts/cluster_healthcheck.sh
+++ b/scripts/cluster_healthcheck.sh
@@ -641,7 +641,28 @@ check_uptime_kuma() {
        return 0
    fi

-    result=$(UPTIME_KUMA_PASSWORD="$uk_pass" ~/.venvs/claude/bin/python3 -c '
+    # Connect via kubectl port-forward to the internal Service. The public
+    # URL (uptime.viktorbarzin.me) is behind Authentik forward-auth, which
+    # 302-redirects the Socket.IO handshake the library uses — there's no
+    # way for an unauthenticated script to complete the OAuth dance.
+    # Port-forward gives us a direct path to the in-cluster ClusterIP
+    # service and works from any host with kubectl access.
+    local pf_port=18444 pf_pid
+    $KUBECTL port-forward -n uptime-kuma svc/uptime-kuma "$pf_port:80" >/dev/null 2>&1 &
+    pf_pid=$!
+    # Detach from job control so bash doesn't print "Killed" to stderr
+    # when we SIGKILL the port-forward at the end of this check — that
+    # message corrupts stdout when stderr is merged for JSON parsing.
+    disown "$pf_pid" 2>/dev/null || true
+    # Wait up to 5s for the local listener to come up.
+    local i
+    for i in 1 2 3 4 5; do
+        if (echo >"/dev/tcp/127.0.0.1/$pf_port") 2>/dev/null; then break; fi
+        sleep 1
+    done
+
+    result=$(UPTIME_KUMA_PASSWORD="$uk_pass" UK_URL="http://127.0.0.1:$pf_port" \
+        ~/.venvs/claude/bin/python3 -c '
 import sys, os, time
 try:
    from uptime_kuma_api import UptimeKumaApi
@@ -649,15 +670,13 @@ except ImportError:
    print("ERROR:uptime-kuma-api not installed")
    sys.exit(0)

-# The uptime-kuma WebSocket login is intermittently flaky — single-shot
-# failures showed up repeatedly in healthchecks even though the service
-# was healthy. Retry up to 3 times with backoff before declaring connect
-# failure.
+# Retry up to 3 times — the Socket.IO handshake is occasionally flaky
+# even against the internal service during cluster churn.
 last_exc = None
 api = None
 for attempt in range(3):
    try:
-        api = UptimeKumaApi("https://uptime.viktorbarzin.me", timeout=120, wait_events=0.2)
+        api = UptimeKumaApi(os.environ["UK_URL"], timeout=120, wait_events=0.2)
        api.login("admin", os.environ["UPTIME_KUMA_PASSWORD"])
        break
    except Exception as e:
@@ -725,6 +744,13 @@ except Exception as e:
    print(f"CONN_ERROR:{e}")
 ' 2>/dev/null) || result="CONN_ERROR:python execution failed"

+    # Always tear down the port-forward. Use SIGKILL directly — kubectl
+    # port-forward sometimes ignores SIGTERM during teardown and we don't
+    # need a graceful exit for a localhost listener. Skip `wait` because
+    # in `set -m` mode the backgrounded child may not be reapable here,
+    # causing the script to hang indefinitely; the shell reaps it on exit.
+    kill -9 "$pf_pid" 2>/dev/null || true
+
    if [[ "$result" == "ERROR:"* ]]; then
        [[ "$QUIET" == true ]] && section_always 14 "Uptime Kuma Monitors"
        warn "Uptime Kuma: ${result#ERROR:}"