From f70796809111a7e86e97320cff133864193a7abf Mon Sep 17 00:00:00 2001
From: Viktor Barzin
Date: Sun, 19 Apr 2026 00:03:54 +0000
Subject: [PATCH] [mailserver] Retry probe Pushgateway + Uptime Kuma pushes
 with backoff
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

## Context

The e2e email-roundtrip probe (CronJob `email-roundtrip-monitor`) currently wraps
`requests.put(PUSHGATEWAY, ...)` and `requests.get(UPTIME_KUMA, ...)` in bare
`try/except` blocks that only print "Failed to push ..." on error. If Pushgateway
is transiently unreachable (e.g., during a Prometheus Helm upgrade, an HPA
scale-down, or a brief network blip), metrics silently drop and downstream
detection relies entirely on `EmailRoundtripStale` firing after 60 min of
staleness. A single transient failure masquerades as data-plane breakage for up
to an hour.

Target task: `code-n5l` — Add retry to probe Pushgateway + Uptime Kuma pushes.

## This change

- Extracts a `push_with_retry(label, func, url)` helper that performs 3 attempts
  with exponential backoff (1s after the first failure, 2s after the second).
  Treats HTTP 2xx as success, everything else as failure. On final failure, logs
  an explicit `ERROR:` line to stderr with the URL and either the last HTTP
  status or the exception repr — matches the existing `print(...)` logging style
  used throughout the heredoc (no stdlib `logging` dependency added).
- Replaces the two inline `try/requests.put/except print` blocks with calls to
  the helper. Pushgateway runs unconditionally; Uptime Kuma still only runs on
  round-trip success (same as before).
- Makes the exit code responsive to push outcome: the probe exits non-zero when
  the round-trip itself failed (unchanged), OR when BOTH pushes exhausted all
  retries on the success path. A single-endpoint push failure with the other
  succeeding keeps exit 0 — partial observability is preferred over noisy pod
  restarts from Kubernetes' Job controller.
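The retry shape described above can be sketched in isolation. This is an illustrative standalone version, not the patch's exact code: `retry_with_backoff`, `attempt_fn`, and the injectable `sleep` parameter are names invented here to make the schedule testable without real network calls.

```python
import time

def retry_with_backoff(attempt_fn, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Run attempt_fn up to `attempts` times, sleeping base_delay * 2**k between tries.

    attempt_fn returns an HTTP-like status code; 2xx counts as success.
    Returns (succeeded, last_status_or_exception).
    """
    last = None
    for attempt in range(attempts):
        try:
            status = attempt_fn()
            last = status
            if 200 <= status < 300:
                return True, status
        except Exception as exc:  # broad catch, mirroring the probe's style
            last = exc
        if attempt < attempts - 1:  # no sleep after the final attempt
            sleep(base_delay * 2 ** attempt)  # 1s, then 2s for attempts=3
    return False, last
```

Note that with 3 attempts there are only two backoff sleeps; injecting `sleep` makes that schedule directly assertable in a unit test.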
## Behavior matrix

```
roundtrip | pushgw | kuma | exit | rationale
----------+--------+------+------+-------------------------------
success   | ok     | ok   |  0   | happy path (unchanged)
success   | fail   | ok   |  0   | one endpoint still has telemetry
success   | ok     | fail |  0   | one endpoint still has telemetry
success   | fail   | fail |  1   | NEW — total observability loss
fail      | ok     | -    |  1   | roundtrip failed (unchanged, Kuma skipped)
fail      | fail   | -    |  1   | roundtrip failed (unchanged, Kuma skipped)
```

## What is NOT in this change

- Alert thresholds (`EmailRoundtripStale` still 60m) — explicitly out of scope per the task description.
- `logging` stdlib adoption — the rest of the heredoc uses `print`; staying consistent.
- Moving the heredoc out of `main.tf` into a sidecar Python file — separate refactor.

## Reproduce locally

1. Point PUSHGATEWAY at a black hole:
   ```
   kubectl -n mailserver set env cronjob/email-roundtrip-monitor \
     PUSHGATEWAY=http://nope.invalid:9091/metrics/job/test
   ```
2. Trigger a one-shot job:
   ```
   kubectl -n mailserver create job --from=cronjob/email-roundtrip-monitor probe-test
   ```
3. Expected in logs:
   - 3 attempts, spaced ~1s then ~2s apart (two backoff sleeps for three attempts)
   - `ERROR: Failed to push to Pushgateway after 3 attempts: url=... exception=...`
   - Uptime Kuma push still succeeds (round-trip ok) → exit 0
4. Flip UPTIME_KUMA_URL to also fail (edit heredoc or DNS-poison): expect exit 1 + two ERROR lines.

## Automated

- `python3 -c "import ast; ast.parse(open('/tmp/probe.py').read())"` → OK (heredoc extracts cleanly).
- `terraform fmt -check -recursive modules/mailserver/` → no diff.
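The behavior matrix above reduces to a small pure predicate. A sketch, with illustrative names (`probe_exit_code` is not in the patch; the patch inlines this logic):

```python
def probe_exit_code(roundtrip_ok, pushgw_ok, kuma_ok):
    """Exit-code policy from the behavior matrix (illustrative helper)."""
    if not roundtrip_ok:
        # Round-trip failure dominates; Uptime Kuma was never attempted.
        return 1
    # On the success path, exit 1 only on total observability loss.
    return 1 if (not pushgw_ok and not kuma_ok) else 0
```

Each row of the matrix is then a one-line assertion, which makes the "partial observability keeps exit 0" decision easy to pin down in review.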
Closes: code-n5l
Co-Authored-By: Claude Opus 4.7 (1M context)
---
 stacks/mailserver/modules/mailserver/main.tf | 51 +++++++++++++++-----
 1 file changed, 40 insertions(+), 11 deletions(-)

diff --git a/stacks/mailserver/modules/mailserver/main.tf b/stacks/mailserver/modules/mailserver/main.tf
index f10e03e6..f4ee2c8a 100644
--- a/stacks/mailserver/modules/mailserver/main.tf
+++ b/stacks/mailserver/modules/mailserver/main.tf
@@ -740,21 +740,50 @@ email_roundtrip_duration_seconds {duration:.2f}
 # TYPE email_roundtrip_last_success_timestamp gauge
 email_roundtrip_last_success_timestamp {int(time.time()) if success else 0}
 """
-try:
-    requests.put(PUSHGATEWAY, data=metrics, timeout=10)
-    print("Pushed metrics to Pushgateway")
-except Exception as e:
-    print(f"Failed to push metrics: {e}")
+UPTIME_KUMA_URL = "http://uptime-kuma.uptime-kuma.svc.cluster.local/api/push/hLtyRKgeZO?status=up&msg=OK&ping=" + str(int(duration))
+
+def push_with_retry(label, func, url):
+    # 3 attempts with exponential backoff (1s, then 2s). Returns True on success, False otherwise.
+    # Final failure logs ERROR with URL + status code (or exception) so the pod log surfaces the drop.
+    last_status = None
+    last_exc = None
+    for attempt in range(3):
+        try:
+            resp = func()
+            last_status = resp.status_code
+            if 200 <= resp.status_code < 300:
+                print(f"Pushed to {label} (attempt {attempt+1}, status {resp.status_code})")
+                return True
+            last_exc = None
+        except Exception as e:
+            last_exc = e
+            last_status = None
+        if attempt < 2:
+            time.sleep(2 ** attempt)
+    detail = f"status={last_status}" if last_exc is None else f"exception={last_exc!r}"
+    print(f"ERROR: Failed to push to {label} after 3 attempts: url={url} {detail}", file=sys.stderr)
+    return False
+
+pushgateway_ok = push_with_retry(
+    "Pushgateway",
+    lambda: requests.put(PUSHGATEWAY, data=metrics, timeout=10),
+    PUSHGATEWAY,
+)
 
 # Push to Uptime Kuma on success
+uptime_kuma_ok = True
 if success:
-    try:
-        requests.get("http://uptime-kuma.uptime-kuma.svc.cluster.local/api/push/hLtyRKgeZO?status=up&msg=OK&ping=" + str(int(duration)), timeout=10)
-        print("Pushed to Uptime Kuma")
-    except Exception as e:
-        print(f"Failed to push to Uptime Kuma: {e}")
+    uptime_kuma_ok = push_with_retry(
+        "Uptime Kuma",
+        lambda: requests.get(UPTIME_KUMA_URL, timeout=10),
+        UPTIME_KUMA_URL,
+    )
 
-sys.exit(0 if success else 1)
+# Exit non-zero when the round-trip itself failed, OR when BOTH push endpoints
+# failed after all retries (only possible on the success path — on failure we
+# only attempt Pushgateway, and the round-trip failure already dominates exit).
+both_pushes_failed = success and (not pushgateway_ok) and (not uptime_kuma_ok)
+sys.exit(0 if (success and not both_pushes_failed) else 1)
 '
 EOT
 ]
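One scheduling consideration worth checking: the retries add a bounded amount of latency to each probe run. With 3 attempts there are only two backoff sleeps per endpoint, and the 10s request timeouts accumulate separately. A quick sanity check of that arithmetic:

```python
# Per endpoint: sleeps happen only between attempts, so 3 attempts -> 2 sleeps.
attempts = 3
sleeps_per_endpoint = [2 ** k for k in range(attempts - 1)]  # [1, 2] seconds
# Worst case on the success path: both Pushgateway and Uptime Kuma exhaust retries.
worst_case_sleep = 2 * sum(sleeps_per_endpoint)  # 6 seconds of backoff total
```

So even in the total-failure row of the matrix, backoff alone adds ~6s; any Job-level deadline only needs headroom for that plus up to six 10s timeouts.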