mailserver: overhaul inbound delivery, monitoring, CrowdSec, and migrate to Brevo relay

Inbound:
- Direct MX to mail.viktorbarzin.me (ForwardEmail relay attempted and abandoned)
- Dedicated MetalLB IP 10.0.20.202 with ETP: Local for CrowdSec real-IP detection
- Removed Cloudflare Email Routing (can't store-and-forward)
- Fixed dual SPF violation, hardened to -all
- Added MTA-STS, TLSRPT, imported Rspamd DKIM into Terraform
- Removed dead BIND zones from config.tfvars (199 lines)

Outbound:
- Migrated from Mailgun (100/day) to Brevo (300/day free)
- Added Brevo DKIM CNAMEs and verification TXT

Monitoring:
- Probe frequency: 30m → 20m, alert thresholds adjusted to 60m
- Enabled Dovecot exporter scraping (port 9166)
- Added external SMTP monitor on public IP

Documentation:
- New docs/architecture/mailserver.md with full architecture
- New docs/architecture/mailserver-visual.html visualization
- Updated monitoring.md, CLAUDE.md, historical plan docs
parent 8bc02d1401
commit 1c300a14cf
11 changed files with 993 additions and 53 deletions

@@ -15,7 +15,7 @@ graph TB
 GPU[NVIDIA GPU via dcgm-exporter]
 UPS[UPS Exporter]
 NFS[NFS Exporter]
-EMAIL[Email Roundtrip Probe<br/>CronJob every 30m]
+EMAIL[Email Roundtrip Probe<br/>CronJob every 10m]
 end
 
 subgraph "Monitoring Stack (platform stack)"
@@ -148,11 +148,11 @@ spec:
 - **4xx/5xx Error Rates**: HTTP error rate threshold exceeded
 
 #### Email Monitoring Alerts
-- **EmailRoundtripFailing**: E2E email probe returning failure for >90m
-- **EmailRoundtripStale**: No successful email round-trip in >90m
-- **EmailRoundtripNeverRun**: Email probe has never reported (CronJob not running)
+- **EmailRoundtripFailing**: E2E email probe returning failure for >30m
+- **EmailRoundtripStale**: No successful email round-trip in >40m
+- **EmailRoundtripNeverRun**: Email probe has never reported (40m)
 
-The email monitoring system uses a CronJob (`email-roundtrip-monitor`, every 30 min) in the `mailserver` namespace that:
+The email monitoring system uses a CronJob (`email-roundtrip-monitor`, every 10 min) in the `mailserver` namespace that:
 1. Sends a test email via Mailgun HTTP API to `smoke-test@viktorbarzin.me`
 2. Email lands in the `spam@` catch-all mailbox via MX delivery
 3. Verifies delivery via IMAP (searches by UUID marker in subject)
@@ -160,7 +160,7 @@ The email monitoring system uses a CronJob (`email-roundtrip-monitor`, every 30
 5. Pushes metrics (`email_roundtrip_success`, `email_roundtrip_duration_seconds`, `email_roundtrip_last_success_timestamp`) to Prometheus Pushgateway
 6. Pushes status to Uptime Kuma E2E Push monitor
 
-Uptime Kuma also has TCP monitors for SMTP (port 25) and IMAP (port 993) on `10.0.20.200`.
+Uptime Kuma monitors: TCP SMTP (port 25) on `176.12.22.76` (external), IMAP (port 993) on `10.0.20.202`, and Dovecot exporter metrics on port 9166.
 
 #### Backup Alerts
 - **PostgreSQLBackupStale**: >36h since last backup