[mailserver] Retry probe Pushgateway + Uptime Kuma pushes with backoff
## Context

The e2e email-roundtrip probe (CronJob `email-roundtrip-monitor`) currently wraps `requests.put(PUSHGATEWAY, ...)` and `requests.get(UPTIME_KUMA, ...)` in bare `try/except` blocks that only print "Failed to push ..." on error. If Pushgateway is transiently unreachable (e.g., during a Prometheus Helm upgrade, an HPA scale-down, or a brief network blip), metrics silently drop, and downstream detection relies entirely on `EmailRoundtripStale` firing after 60 min of staleness. A single transient failure masquerades as data-plane breakage for up to an hour.

Target task: `code-n5l` — Add retry to probe Pushgateway + Uptime Kuma pushes.

## This change

- Extracts a `push_with_retry(label, func, url)` helper that performs 3 attempts with exponential backoff (sleeping 1s, then 2s, between attempts). It treats HTTP 2xx as success and everything else as failure. On final failure it logs an explicit `ERROR:` line to stderr with the URL and either the last HTTP status or the exception repr — this matches the existing `print(...)` logging style used throughout the heredoc (no stdlib `logging` dependency added).
- Replaces the two inline `try/requests.put/except print` blocks with calls to the helper. The Pushgateway push runs unconditionally; the Uptime Kuma push still runs only on round-trip success (same as before).
- Makes the exit code responsive to push outcome: the probe exits non-zero when the round-trip itself failed (unchanged), OR when BOTH pushes failed all retries on the success path. A single-endpoint push failure with the other succeeding keeps exit 0 — partial observability is preferred over noisy pod restarts from Kubernetes' Job controller.
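The retry semantics described above can be exercised without the network; here is a minimal sketch that mirrors the helper (the injectable `sleep` hook and the `FlakyEndpoint` stub are additions for this example only, not part of the change — the real helper calls `time.sleep` directly):

```python
import time

def push_with_retry(label, func, url, attempts=3, sleep=time.sleep):
    # Mirrors the helper described above. `sleep` is injectable here only so
    # this sketch can run instantly and record the backoff schedule.
    last_status = None
    last_exc = None
    for attempt in range(attempts):
        try:
            resp = func()
            last_status = resp.status_code
            if 200 <= resp.status_code < 300:
                return True
            last_exc = None
        except Exception as e:
            last_exc = e
            last_status = None
        if attempt < attempts - 1:
            sleep(2 ** attempt)  # 1s after attempt 1, 2s after attempt 2
    return False

class FlakyEndpoint:
    # Hypothetical stub: raises `failures` times, then returns HTTP 200.
    def __init__(self, failures):
        self.failures, self.calls = failures, 0
    def __call__(self):
        self.calls += 1
        if self.calls <= self.failures:
            raise ConnectionError("transient blip")
        return type("Resp", (), {"status_code": 200})()

delays = []
endpoint = FlakyEndpoint(failures=2)
ok = push_with_retry("Pushgateway", endpoint, "http://example.invalid", sleep=delays.append)
print(ok, endpoint.calls, delays)  # True 3 [1, 2]
```

Passing `delays.append` as the sleep hook both keeps the example instant and makes the 1s/2s backoff schedule directly observable.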
## Behavior matrix

```
roundtrip | pushgw | kuma | exit | rationale
----------+--------+------+------+-------------------------------
success   | ok     | ok   |  0   | happy path (unchanged)
success   | fail   | ok   |  0   | one endpoint still has telemetry
success   | ok     | fail |  0   | one endpoint still has telemetry
success   | fail   | fail |  1   | NEW — total observability loss
fail      | ok     | -    |  1   | roundtrip failed (unchanged, Kuma skipped)
fail      | fail   | -    |  1   | roundtrip failed (unchanged, Kuma skipped)
```

## What is NOT in this change

- Alert thresholds (`EmailRoundtripStale` still 60m) — explicitly out of scope per the task description.
- `logging` stdlib adoption — the rest of the heredoc uses `print`; staying consistent.
- Moving the heredoc out of `main.tf` into a sidecar Python file — separate refactor.

## Reproduce locally

1. Point PUSHGATEWAY at a black hole:
   `kubectl -n mailserver set env cronjob/email-roundtrip-monitor \`
   `PUSHGATEWAY=http://nope.invalid:9091/metrics/job/test`
2. Trigger a one-shot job:
   `kubectl -n mailserver create job --from=cronjob/email-roundtrip-monitor probe-test`
3. Expected in logs:
   - 3 attempts, with retries spaced ~1s and ~2s apart
   - `ERROR: Failed to push to Pushgateway after 3 attempts: url=... exception=...`
   - Uptime Kuma push still succeeds (round-trip ok) → exit 0
4. Flip UPTIME_KUMA_URL to also fail (edit heredoc or DNS-poison): expect exit 1 + two ERROR lines.

## Automated

- `python3 -c "import ast; ast.parse(open('/tmp/probe.py').read())"` → OK (heredoc extracts cleanly).
- `terraform fmt -check -recursive modules/mailserver/` → no diff.

Closes: code-n5l

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
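The exit-code policy in the behavior matrix reduces to the single boolean used on the success path; a sketch with one assertion per matrix row (the function name `probe_exit_code` is hypothetical — rows where the Kuma push is skipped pass `uptime_kuma_ok=True`, matching the probe's initialization):

```python
def probe_exit_code(success, pushgateway_ok, uptime_kuma_ok):
    # Non-zero when the round trip failed, or when BOTH pushes failed all
    # retries on the success path. Mirrors the exit logic in the diff.
    both_pushes_failed = success and (not pushgateway_ok) and (not uptime_kuma_ok)
    return 0 if (success and not both_pushes_failed) else 1

# One assertion per row of the behavior matrix:
assert probe_exit_code(True,  True,  True)  == 0  # happy path (unchanged)
assert probe_exit_code(True,  False, True)  == 0  # one endpoint still has telemetry
assert probe_exit_code(True,  True,  False) == 0  # one endpoint still has telemetry
assert probe_exit_code(True,  False, False) == 1  # NEW: total observability loss
assert probe_exit_code(False, True,  True)  == 1  # roundtrip failed, Kuma skipped
assert probe_exit_code(False, False, True)  == 1  # roundtrip failed, Kuma skipped
print("matrix holds")
```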
parent f568e7d2bf
commit f707968091
1 changed file with 40 additions and 11 deletions
@@ -740,21 +740,50 @@ email_roundtrip_duration_seconds {duration:.2f}
 # TYPE email_roundtrip_last_success_timestamp gauge
 email_roundtrip_last_success_timestamp {int(time.time()) if success else 0}
 """
-try:
-    requests.put(PUSHGATEWAY, data=metrics, timeout=10)
-    print("Pushed metrics to Pushgateway")
-except Exception as e:
-    print(f"Failed to push metrics: {e}")
+UPTIME_KUMA_URL = "http://uptime-kuma.uptime-kuma.svc.cluster.local/api/push/hLtyRKgeZO?status=up&msg=OK&ping=" + str(int(duration))
+
+def push_with_retry(label, func, url):
+    # 3 attempts with exponential backoff (1s, then 2s, between attempts). Returns True on success, False otherwise.
+    # Final failure logs ERROR with URL + status code (or exception) so the pod log surfaces the drop.
+    last_status = None
+    last_exc = None
+    for attempt in range(3):
+        try:
+            resp = func()
+            last_status = resp.status_code
+            if 200 <= resp.status_code < 300:
+                print(f"Pushed to {label} (attempt {attempt+1}, status {resp.status_code})")
+                return True
+            last_exc = None
+        except Exception as e:
+            last_exc = e
+            last_status = None
+        if attempt < 2:
+            time.sleep(2 ** attempt)
+    detail = f"status={last_status}" if last_exc is None else f"exception={last_exc!r}"
+    print(f"ERROR: Failed to push to {label} after 3 attempts: url={url} {detail}", file=sys.stderr)
+    return False
+
+pushgateway_ok = push_with_retry(
+    "Pushgateway",
+    lambda: requests.put(PUSHGATEWAY, data=metrics, timeout=10),
+    PUSHGATEWAY,
+)
 
 # Push to Uptime Kuma on success
+uptime_kuma_ok = True
 if success:
-    try:
-        requests.get("http://uptime-kuma.uptime-kuma.svc.cluster.local/api/push/hLtyRKgeZO?status=up&msg=OK&ping=" + str(int(duration)), timeout=10)
-        print("Pushed to Uptime Kuma")
-    except Exception as e:
-        print(f"Failed to push to Uptime Kuma: {e}")
+    uptime_kuma_ok = push_with_retry(
+        "Uptime Kuma",
+        lambda: requests.get(UPTIME_KUMA_URL, timeout=10),
+        UPTIME_KUMA_URL,
+    )
 
-sys.exit(0 if success else 1)
+# Exit non-zero when the round-trip itself failed, OR when BOTH push endpoints
+# failed after all retries (only possible on the success path — on failure we
+# only attempt Pushgateway, and the round-trip failure already dominates exit).
+both_pushes_failed = success and (not pushgateway_ok) and (not uptime_kuma_ok)
+sys.exit(0 if (success and not both_pushes_failed) else 1)
 '
 EOT
 ]
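The `ast.parse` smoke check listed under "Automated" can be wrapped for reuse against any extracted heredoc body; a minimal sketch (the function name and sample strings here are illustrative, not the real probe source):

```python
import ast
import sys

def heredoc_parses(source: str) -> bool:
    # Returns True when the extracted heredoc body is valid Python 3 syntax,
    # i.e. the embedding in main.tf did not mangle the script.
    try:
        ast.parse(source)
        return True
    except SyntaxError as e:
        print(f"ERROR: heredoc does not parse: {e}", file=sys.stderr)
        return False

sample = "import time\nfor attempt in range(3):\n    time.sleep(0)\n"
print(heredoc_parses(sample))          # True
print(heredoc_parses("def broken(:"))  # False, plus an ERROR line on stderr
```

Running this against the extracted `/tmp/probe.py` contents gives the same pass/fail signal as the one-liner in the commit message, with the syntax error surfaced on stderr.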