From 345ba2182fc6f639fd9fa88a0dc0e40362d23679 Mon Sep 17 00:00:00 2001
From: Viktor Barzin
Date: Sat, 18 Apr 2026 21:33:56 +0000
Subject: [PATCH] =?UTF-8?q?[mailserver]=20Widen=20email-roundtrip=20probe?=
 =?UTF-8?q?=20IMAP=20window=20180s=20=E2=86=92=20300s=20+=20per-attempt=20?=
 =?UTF-8?q?timeout?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

## Context

After fixing the two mail-server-side root causes of probe false-failures
(Dovecot userdb duplicates, postscreen btree lock contention), the probe is
expected to succeed well under 120s. This commit is defence in depth against
residual SMTP relay variance and against a future scenario where Dovecot is
transiently unresponsive during IMAP login.

The probe currently polls IMAP with `range(9) × 20s = 180s`. Brevo's
queueing, DNS variance, and general SMTP retry backoff can easily exceed
that on a bad day. Widening to 5 minutes gives plenty of headroom while
still remaining well within the CronJob's 20-minute schedule interval.

Additionally, `imaplib.IMAP4_SSL(...)` previously had no timeout. If Dovecot
is unresponsive (e.g., mid-rollout, transient TLS handshake hang), the
connect call can block indefinitely and the probe hangs without ever looping
to the next attempt. Adding `timeout=10` caps each connect at 10s so the
retry loop keeps making forward progress.

## This change

Two edits to the embedded probe script inside the cronjob resource:

```
-  # Step 2: Wait for delivery, retry IMAP up to 3 min
+  # Step 2: Wait for delivery, retry IMAP up to 5 min (15 x 20s)
...
-  for attempt in range(9):
+  for attempt in range(15):
...
-      imap = imaplib.IMAP4_SSL(IMAP_HOST, 993, ssl_context=ctx)
+      imap = imaplib.IMAP4_SSL(IMAP_HOST, 993, ssl_context=ctx, timeout=10)
```

Flow (before):

```
send via Brevo ─► for 9 loops: sleep 20s, IMAP connect (blocks on hang) ─► 180s total
```

Flow (after):

```
send via Brevo ─► for 15 loops: sleep 20s, IMAP connect (≤10s) ─► 300s total
                                                  │
                                                  └─ timeout ─► log, continue to next loop
```

## What is NOT in this change

- Probe frequency stays at `*/20 * * * *`.
- The `EmailRoundtripStale` alert thresholds are intentionally left at
  3600s + for: 10m. Those fire only on sustained multi-hour issues and
  should not be loosened — they would mask future regressions. Probe
  success rate is now expected to recover to ≥95% from the two upstream
  fixes; if it doesn't, alert tuning gets revisited separately.
- No change to the Brevo send step, the success-metrics push, or the
  cleanup of stale e2e-probe-* messages.

## Test Plan

### Automated

`scripts/tg plan -target=module.mailserver.kubernetes_cron_job_v1.email_roundtrip_monitor`:

```
# module.mailserver.kubernetes_cron_job_v1.email_roundtrip_monitor will be updated in-place
  - for attempt in range(9):
  + for attempt in range(15):
  - imap = imaplib.IMAP4_SSL(IMAP_HOST, 993, ssl_context=ctx)
  + imap = imaplib.IMAP4_SSL(IMAP_HOST, 993, ssl_context=ctx, timeout=10)

Plan: 0 to add, 1 to change, 0 to destroy.
```

`scripts/tg apply`:

```
Apply complete! Resources: 0 added, 1 changed, 0 destroyed.
```

### Manual Verification

1. Trigger the probe manually:
   `kubectl -n mailserver create job --from=cronjob/email-roundtrip-monitor probe-verify-$(date +%s)`
2. Tail its logs:
   `kubectl -n mailserver logs job/probe-verify- -f`
3. Expect: `Round-trip SUCCESS` within the 5-min window. Typical successful
   run should still complete in < 60s now that postscreen is no longer
   stalling.
4. Watch the 48-hour window on the `email_roundtrip_success` gauge in
   Prometheus — expect ≥95% (was ~65% before all three fixes).

## Reproduce locally

1. `kubectl -n mailserver get cronjob email-roundtrip-monitor -o yaml | grep -E "range\(|timeout"`
2. Expect: `range(15)` and `timeout=10`
3. `kubectl -n mailserver create job --from=cronjob/email-roundtrip-monitor probe-verify-$(date +%s)`
4. `kubectl -n mailserver logs -f job/probe-verify-`
5. Expect: eventual `Round-trip SUCCESS in s` message and exit 0.

Closes: code-18e

Co-Authored-By: Claude Opus 4.7 (1M context)
---
 stacks/mailserver/modules/mailserver/main.tf | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/stacks/mailserver/modules/mailserver/main.tf b/stacks/mailserver/modules/mailserver/main.tf
index 16cbcc8a..68eb8b1d 100644
--- a/stacks/mailserver/modules/mailserver/main.tf
+++ b/stacks/mailserver/modules/mailserver/main.tf
@@ -622,15 +622,15 @@ try:
     resp.raise_for_status()
     print(f"Sent test email via Brevo: {resp.status_code} marker={marker}")
 
-    # Step 2: Wait for delivery, retry IMAP up to 3 min
+    # Step 2: Wait for delivery, retry IMAP up to 5 min (15 x 20s)
     ctx = ssl.create_default_context()
     ctx.check_hostname = False
     ctx.verify_mode = ssl.CERT_NONE
     found = False
-    for attempt in range(9):
+    for attempt in range(15):
         time.sleep(20)
         try:
-            imap = imaplib.IMAP4_SSL(IMAP_HOST, 993, ssl_context=ctx)
+            imap = imaplib.IMAP4_SSL(IMAP_HOST, 993, ssl_context=ctx, timeout=10)
             imap.login(IMAP_USER, IMAP_PASS)
             imap.select("INBOX")
             _, msg_ids = imap.search(None, "SUBJECT", marker)
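
Reviewer note: the bounded retry loop with a per-attempt timeout can be exercised without a mail server. The following is a minimal sketch, not the probe script itself. The names `wait_for_marker` and `make_connect` are hypothetical, the connection factory is injected so the loop can be tested with a fake client, and the TLS context is left at library defaults here rather than mirroring the probe's settings. It assumes Python 3.9+, where `imaplib.IMAP4_SSL` accepts `timeout`.

```python
import imaplib
import ssl
import time


def wait_for_marker(marker, connect, attempts=15, delay=20, sleep=time.sleep):
    """Poll IMAP until a message whose subject contains `marker` appears.

    `connect` is a zero-argument callable returning a logged-in IMAP client
    (hypothetical helper, injected so the loop is testable offline).
    Returns the 1-based attempt number on success, or None if all attempts fail.
    """
    for attempt in range(1, attempts + 1):
        sleep(delay)  # give the relay time to deliver before each check
        try:
            imap = connect()
            try:
                imap.select("INBOX")
                _, msg_ids = imap.search(None, "SUBJECT", marker)
                if msg_ids and msg_ids[0].split():
                    return attempt
            finally:
                imap.logout()
        except (OSError, imaplib.IMAP4.error) as exc:
            # A socket timeout is an OSError subclass, so a hung connect
            # is logged here and the loop moves on to the next attempt.
            print(f"attempt {attempt}/{attempts} failed: {exc!r}")
    return None


def make_connect(host, user, password, timeout=10):
    """Build a `connect` callable for a real server (assumes Python 3.9+)."""
    def connect():
        ctx = ssl.create_default_context()
        imap = imaplib.IMAP4_SSL(host, 993, ssl_context=ctx, timeout=timeout)
        imap.login(user, password)
        return imap
    return connect
```

Under these assumptions the loop is bounded rather than fixed at 300s: if every connect hit the 10s cap, 15 attempts would take roughly 15 × (20 + 10) = 450s, which still fits inside the 20-minute schedule interval.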