[mailserver] Widen email-roundtrip probe IMAP window 180s → 300s + per-attempt timeout
## Context
After fixing the two mail-server-side root causes of probe false-failures
(Dovecot userdb duplicates, postscreen btree lock contention), the probe
is expected to succeed well under 120s. This commit is defence in depth
against residual SMTP relay variance and against a future scenario where
Dovecot is transiently unresponsive during IMAP login.
The probe currently polls IMAP with `range(9) × 20s = 180s`. Brevo's
queueing, DNS variance, and general SMTP retry backoff can easily
exceed that on a bad day. Widening to 5 minutes gives plenty of headroom
while still remaining well within the CronJob's 20-minute schedule
interval.
Additionally, `imaplib.IMAP4_SSL(...)` previously had no timeout. If
Dovecot is unresponsive (e.g., mid-rollout, transient TLS handshake
hang), the connect call can block indefinitely and the probe hangs
without ever looping to the next attempt. Adding `timeout=10` caps each
connect at 10s so the retry loop keeps making forward progress.
## This change
Two edits to the embedded probe script inside the cronjob resource:
```
- # Step 2: Wait for delivery, retry IMAP up to 3 min
+ # Step 2: Wait for delivery, retry IMAP up to 5 min (15 x 20s)
...
- for attempt in range(9):
+ for attempt in range(15):
...
- imap = imaplib.IMAP4_SSL(IMAP_HOST, 993, ssl_context=ctx)
+ imap = imaplib.IMAP4_SSL(IMAP_HOST, 993, ssl_context=ctx, timeout=10)
```
Flow (before):
```
send via Brevo ─► for 9 loops: sleep 20s, IMAP connect (blocks on hang) ─► 180s total
```
Flow (after):
```
send via Brevo ─► for 15 loops: sleep 20s, IMAP connect (≤10s) ─► 300s total
│
└─ timeout ─► log, continue to next loop
```
## What is NOT in this change
- Probe frequency stays at `*/20 * * * *`.
- The `EmailRoundtripStale` alert thresholds are intentionally left at
3600s + for: 10m. Those fire only on sustained multi-hour issues and
should not be loosened — they would mask future regressions. Probe
success rate is now expected to recover to ≥95% from the two upstream
fixes; if it doesn't, alert tuning gets revisited separately.
- No change to the Brevo send step, the success-metrics push, or the
cleanup of stale e2e-probe-* messages.
## Test Plan
### Automated
`scripts/tg plan -target=module.mailserver.kubernetes_cron_job_v1.email_roundtrip_monitor`:
```
# module.mailserver.kubernetes_cron_job_v1.email_roundtrip_monitor will be updated in-place
- for attempt in range(9):
+ for attempt in range(15):
- imap = imaplib.IMAP4_SSL(IMAP_HOST, 993, ssl_context=ctx)
+ imap = imaplib.IMAP4_SSL(IMAP_HOST, 993, ssl_context=ctx, timeout=10)
Plan: 0 to add, 1 to change, 0 to destroy.
```
`scripts/tg apply`:
```
Apply complete! Resources: 0 added, 1 changed, 0 destroyed.
```
### Manual Verification
1. Trigger the probe manually:
`kubectl -n mailserver create job --from=cronjob/email-roundtrip-monitor probe-verify-$(date +%s)`
2. Tail its logs:
`kubectl -n mailserver logs job/probe-verify-<ts> -f`
3. Expect: `Round-trip SUCCESS` within the 5-min window. Typical
successful run should still complete in < 60s now that postscreen
is no longer stalling.
4. Watch the 48-hour window on the `email_roundtrip_success` gauge in
Prometheus — expect ≥95% (was ~65% before all three fixes).
## Reproduce locally
1. `kubectl -n mailserver get cronjob email-roundtrip-monitor -o yaml | grep -E "range\(|timeout"`
2. Expect: `range(15)` and `timeout=10`
3. `kubectl -n mailserver create job --from=cronjob/email-roundtrip-monitor probe-verify-$(date +%s)`
4. `kubectl -n mailserver logs -f job/probe-verify-<ts>`
5. Expect: eventual `Round-trip SUCCESS in <N>s` message and exit 0.
Closes: code-18e
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
parent
e2516b07a3
commit
345ba2182f
1 changed files with 3 additions and 3 deletions
|
|
@ -622,15 +622,15 @@ try:
|
|||
resp.raise_for_status()
|
||||
print(f"Sent test email via Brevo: {resp.status_code} marker={marker}")
|
||||
|
||||
# Step 2: Wait for delivery, retry IMAP up to 3 min
|
||||
# Step 2: Wait for delivery, retry IMAP up to 5 min (15 x 20s)
|
||||
ctx = ssl.create_default_context()
|
||||
ctx.check_hostname = False
|
||||
ctx.verify_mode = ssl.CERT_NONE
|
||||
found = False
|
||||
for attempt in range(9):
|
||||
for attempt in range(15):
|
||||
time.sleep(20)
|
||||
try:
|
||||
imap = imaplib.IMAP4_SSL(IMAP_HOST, 993, ssl_context=ctx)
|
||||
imap = imaplib.IMAP4_SSL(IMAP_HOST, 993, ssl_context=ctx, timeout=10)
|
||||
imap.login(IMAP_USER, IMAP_PASS)
|
||||
imap.select("INBOX")
|
||||
_, msg_ids = imap.search(None, "SUBJECT", marker)
|
||||
|
|
|
|||
Loading…
Add table
Add a link
Reference in a new issue