infra/stacks/mailserver
Viktor Barzin 345ba2182f [mailserver] Widen email-roundtrip probe IMAP window 180s → 300s + per-attempt timeout
## Context

After fixing the two mail-server-side root causes of probe false-failures
(Dovecot userdb duplicates, postscreen btree lock contention), the probe
is expected to succeed well under 120s. This commit is defence in depth
against residual SMTP relay variance and against a future scenario where
Dovecot is transiently unresponsive during IMAP login.

The probe currently polls IMAP with `range(9) × 20s = 180s`. Brevo's
queueing, DNS variance, and general SMTP retry backoff can easily
exceed that on a bad day. Widening to 5 minutes gives plenty of headroom
while still remaining well within the CronJob's 20-minute schedule
interval.

Additionally, `imaplib.IMAP4_SSL(...)` previously had no timeout. If
Dovecot is unresponsive (e.g., mid-rollout, transient TLS handshake
hang), the connect call can block indefinitely and the probe hangs
without ever looping to the next attempt. Adding `timeout=10` caps each
connect at 10s so the retry loop keeps making forward progress.

## This change

Two edits to the embedded probe script inside the cronjob resource:

```
-    # Step 2: Wait for delivery, retry IMAP up to 3 min
+    # Step 2: Wait for delivery, retry IMAP up to 5 min (15 x 20s)
  ...
-    for attempt in range(9):
+    for attempt in range(15):
  ...
-            imap = imaplib.IMAP4_SSL(IMAP_HOST, 993, ssl_context=ctx)
+            imap = imaplib.IMAP4_SSL(IMAP_HOST, 993, ssl_context=ctx, timeout=10)
```

Flow (before):

```
send via Brevo ─► for 9 loops: sleep 20s, IMAP connect (blocks on hang) ─► 180s total
```

Flow (after):

```
send via Brevo ─► for 15 loops: sleep 20s, IMAP connect (≤10s) ─► 300s total
                                           │
                                           └─ timeout ─► log, continue to next loop
```

## What is NOT in this change

- Probe frequency stays at `*/20 * * * *`.
- The `EmailRoundtripStale` alert thresholds are intentionally left at
  3600s + for: 10m. Those fire only on sustained multi-hour issues and
  should not be loosened — they would mask future regressions. Probe
  success rate is now expected to recover to ≥95% from the two upstream
  fixes; if it doesn't, alert tuning gets revisited separately.
- No change to the Brevo send step, the success-metrics push, or the
  cleanup of stale e2e-probe-* messages.

## Test Plan

### Automated

`scripts/tg plan -target=module.mailserver.kubernetes_cron_job_v1.email_roundtrip_monitor`:

```
  # module.mailserver.kubernetes_cron_job_v1.email_roundtrip_monitor will be updated in-place
  -     for attempt in range(9):
  +     for attempt in range(15):
  -             imap = imaplib.IMAP4_SSL(IMAP_HOST, 993, ssl_context=ctx)
  +             imap = imaplib.IMAP4_SSL(IMAP_HOST, 993, ssl_context=ctx, timeout=10)
Plan: 0 to add, 1 to change, 0 to destroy.
```

`scripts/tg apply`:

```
Apply complete! Resources: 0 added, 1 changed, 0 destroyed.
```

### Manual Verification

1. Trigger the probe manually:
   `kubectl -n mailserver create job --from=cronjob/email-roundtrip-monitor probe-verify-$(date +%s)`
2. Tail its logs:
   `kubectl -n mailserver logs job/probe-verify-<ts> -f`
3. Expect: `Round-trip SUCCESS` within the 5-min window. Typical
   successful run should still complete in < 60s now that postscreen
   is no longer stalling.
4. Watch the 48-hour window on the `email_roundtrip_success` gauge in
   Prometheus — expect ≥95% (was ~65% before all three fixes).

## Reproduce locally

1. `kubectl -n mailserver get cronjob email-roundtrip-monitor -o yaml | grep -E "range\(|timeout"`
2. Expect: `range(15)` and `timeout=10`
3. `kubectl -n mailserver create job --from=cronjob/email-roundtrip-monitor probe-verify-$(date +%s)`
4. `kubectl -n mailserver logs -f job/probe-verify-<ts>`
5. Expect: eventual `Round-trip SUCCESS in <N>s` message and exit 0.

Closes: code-18e

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 21:33:56 +00:00
..
modules/mailserver [mailserver] Widen email-roundtrip probe IMAP window 180s → 300s + per-attempt timeout 2026-04-18 21:33:56 +00:00
main.tf fix: cluster healthcheck fixes + Authentik upgrade to 2026.2.2 2026-04-15 06:41:56 +00:00
secrets extract monitoring, nvidia, mailserver, cloudflared, kyverno from platform [ci skip] 2026-03-17 21:34:11 +00:00
terragrunt.hcl extract monitoring, nvidia, mailserver, cloudflared, kyverno from platform [ci skip] 2026-03-17 21:34:11 +00:00