## Context
The e2e email-roundtrip probe (CronJob `email-roundtrip-monitor`) currently
wraps `requests.put(PUSHGATEWAY, ...)` and `requests.get(UPTIME_KUMA, ...)`
in a bare `try/except` that only prints "Failed to push ..." on error. If
Pushgateway is transiently unreachable (e.g., during a Prometheus Helm
upgrade, an HPA scale-down, or a brief network blip), metrics silently drop
and downstream detection relies entirely on `EmailRoundtripStale` firing
after 60 min of staleness. A single transient push failure can therefore
masquerade as data-plane breakage for up to an hour.
Target task: `code-n5l` — Add retry to probe Pushgateway + Uptime Kuma pushes.
## This change
- Extracts a `push_with_retry(label, func, url)` helper that performs 3
attempts with exponential backoff (1s, 2s, 4s). Treats HTTP 2xx as
success, everything else as failure. On final failure, logs an explicit
`ERROR:` line to stderr with the URL and either the last HTTP status or
the exception repr — matches the existing `print(...)` logging style
used throughout the heredoc (no stdlib `logging` dependency added).
- Replaces the two inline `try/requests.put/except print` blocks with
calls to the helper. Pushgateway runs unconditionally; Uptime Kuma
still only runs on round-trip success (same as before).
- Makes exit code responsive to push outcome: probe exits non-zero when
the round-trip itself failed (unchanged), OR when BOTH pushes failed
all retries on the success path. Single-endpoint push failure with the
other succeeding keeps exit 0 — partial observability is preferred
over noisy pod restarts from Kubernetes' Job controller.
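
A minimal sketch of what such a helper could look like, assuming `func` is a callable that takes the URL and returns a `requests`-style response exposing `status_code`. The `attempts`, `base_delay`, and `sleep` keyword arguments are hypothetical additions for illustration and testability; the helper in this change takes only `(label, func, url)`. This sketch sleeps only between attempts, so three attempts incur the 1s and 2s delays.

```python
import sys
import time


def push_with_retry(label, func, url, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call func(url) up to `attempts` times; return True on the first HTTP 2xx."""
    last_error = "no attempt made"
    for attempt in range(1, attempts + 1):
        try:
            response = func(url)
            if 200 <= response.status_code < 300:
                return True
            last_error = f"status={response.status_code}"
        except Exception as exc:  # any exception counts as a failed attempt
            last_error = f"exception={exc!r}"
        if attempt < attempts:
            # Exponential backoff between attempts: base_delay, 2x, 4x, ...
            sleep(base_delay * 2 ** (attempt - 1))
    # Final failure: one explicit ERROR line on stderr, print-style logging.
    print(
        f"ERROR: Failed to push to {label} after {attempts} attempts: "
        f"url={url} {last_error}",
        file=sys.stderr,
    )
    return False
```

A call site might then read `pushgw_ok = push_with_retry("Pushgateway", lambda u: requests.put(u, data=payload, timeout=10), PUSHGATEWAY)`, where `payload` and the timeout value are placeholders rather than names taken from the heredoc.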
## Behavior matrix
```
roundtrip | pushgw | kuma | exit | rationale
----------+--------+------+------+-------------------------------
success   | ok     | ok   | 0    | happy path (unchanged)
success   | fail   | ok   | 0    | one endpoint still has telemetry
success   | ok     | fail | 0    | one endpoint still has telemetry
success   | fail   | fail | 1    | NEW: total observability loss
fail      | ok     | -    | 1    | roundtrip failed (unchanged, Kuma skipped)
fail      | fail   | -    | 1    | roundtrip failed (unchanged, Kuma skipped)
```
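
The matrix can be expressed as a small pure function, which makes each row easy to check. This is an illustrative sketch only: `probe_exit_code` and its parameter names are hypothetical and not taken from the heredoc, and `kuma_ok=None` stands for the skipped Uptime Kuma push.

```python
def probe_exit_code(roundtrip_ok, pushgw_ok, kuma_ok=None):
    """Map probe outcomes to a process exit code.

    kuma_ok is None when the Uptime Kuma push was skipped
    because the round-trip itself failed.
    """
    if not roundtrip_ok:
        return 1  # round-trip failure always fails the Job
    if not pushgw_ok and not kuma_ok:
        return 1  # both pushes exhausted their retries
    return 0  # at least one endpoint received telemetry


# One entry per matrix row: (roundtrip, pushgw, kuma) -> expected exit code
MATRIX = {
    (True, True, True): 0,
    (True, False, True): 0,
    (True, True, False): 0,
    (True, False, False): 1,
    (False, True, None): 1,
    (False, False, None): 1,
}

for outcome, expected in MATRIX.items():
    assert probe_exit_code(*outcome) == expected
```

Keeping the decision separate from the network calls means the exit policy can be changed or tested without touching the retry logic.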
## What is NOT in this change
- Alert thresholds (`EmailRoundtripStale` still 60m) — explicitly out of
scope per the task description.
- `logging` stdlib adoption — rest of heredoc uses `print`, staying
consistent.
- Moving the heredoc out of `main.tf` into a sidecar Python file —
separate refactor.
## Reproduce locally
1. Point `PUSHGATEWAY` at a black hole:
`kubectl -n mailserver set env cronjob/email-roundtrip-monitor PUSHGATEWAY=http://nope.invalid:9091/metrics/job/test`
2. Trigger a one-shot job:
`kubectl -n mailserver create job --from=cronjob/email-roundtrip-monitor probe-test`
3. Expected in logs:
- 3 attempts, each ~1s/2s/4s apart
- `ERROR: Failed to push to Pushgateway after 3 attempts: url=... exception=...`
- Uptime Kuma push still succeeds (round-trip ok) → exit 0
4. Make `UPTIME_KUMA_URL` fail as well (edit the heredoc or poison DNS):
expect exit 1 and two `ERROR:` lines.
## Automated
- `python3 -c "import ast; ast.parse(open('/tmp/probe.py').read())"` → OK
(heredoc extracts cleanly).
- `terraform fmt -check -recursive modules/mailserver/` → no diff.
Closes: code-n5l
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>