mailserver: overhaul inbound delivery, monitoring, CrowdSec, and migrate to Brevo relay

Inbound:
- Direct MX to mail.viktorbarzin.me (ForwardEmail relay attempted and abandoned)
- Dedicated MetalLB IP 10.0.20.202 with ETP: Local for CrowdSec real-IP detection
- Removed Cloudflare Email Routing (can't store-and-forward)
- Fixed dual SPF violation, hardened to -all
- Added MTA-STS, TLSRPT, imported Rspamd DKIM into Terraform
- Removed dead BIND zones from config.tfvars (199 lines)

Outbound:
- Migrated from Mailgun (100/day) to Brevo (300/day free)
- Added Brevo DKIM CNAMEs and verification TXT

Monitoring:
- Probe frequency: 30m → 20m, alert thresholds adjusted to 60m
- Enabled Dovecot exporter scraping (port 9166)
- Added external SMTP monitor on public IP

Documentation:
- New docs/architecture/mailserver.md with full architecture
- New docs/architecture/mailserver-visual.html visualization
- Updated monitoring.md, CLAUDE.md, historical plan docs
This commit is contained in:
Viktor Barzin 2026-04-12 22:24:38 +01:00
parent 8bc02d1401
commit 1c300a14cf
11 changed files with 993 additions and 53 deletions

View file

@ -80,7 +80,7 @@ Violations cause state drift, which causes future applies to break or silently r
**Flow**: `git push → GHA build+push DockerHub (8-char SHA) → POST Woodpecker API → kubectl set image`
**Migrated to GHA** (9): Website, k8s-portal, f1-stream, claude-memory-mcp, apple-health-data, audiblez-web, plotting-book, insta2spotify, audiobook-search
**Migrated to GHA** (10): Website, k8s-portal, f1-stream, claude-memory-mcp, apple-health-data, audiblez-web, plotting-book, insta2spotify, audiobook-search, council-complaints
**Woodpecker-only**: travel_blog (1.4GB content too large for GHA), infra pipelines (terragrunt apply, certbot, build-cli — need cluster access)
**Per-project files**:
@ -89,7 +89,7 @@ Violations cause state drift, which causes future applies to break or silently r
- `.woodpecker/build-fallback.yml` — Old full build pipeline preserved (event: `deployment` — never auto-fires)
**Woodpecker API**: Uses **numeric repo IDs** (`/api/repos/2/pipelines`), NOT owner/name paths (those return HTML).
Repo IDs: infra=1, Website=2, finance=3, health=4, travel_blog=5, webhook-handler=6, audiblez-web=9, f1-stream=10, plotting-book=43, claude-memory-mcp=78, infra-onboarding=79
Repo IDs: infra=1, Website=2, finance=3, health=4, travel_blog=5, webhook-handler=6, audiblez-web=9, f1-stream=10, plotting-book=43, claude-memory-mcp=78, infra-onboarding=79, council-complaints=TBD
**Woodpecker YAML gotchas**:
- Commands with `${VAR}:${VAR}` must be **quoted** — unquoted `:` triggers YAML map parsing when vars are empty
@ -131,7 +131,7 @@ Repo IDs: infra=1, Website=2, finance=3, health=4, travel_blog=5, webhook-handle
- Exclude completed CronJob pods from "pod not ready" alerts.
- Every new service gets Prometheus scrape config + Uptime Kuma monitor.
- Key alerts: OOMKill, pod replica mismatch, 4xx/5xx error rates, UPS battery, CPU temp, SSD writes, NFS responsiveness, ClusterMemoryRequestsHigh (>85%), ContainerNearOOM (>85% limit), PodUnschedulable.
- **E2E email monitoring**: CronJob `email-roundtrip-monitor` (every 30 min) sends test email via Mailgun API to `smoke-test@viktorbarzin.me` (catch-all → `spam@`), verifies IMAP delivery, deletes test email, pushes metrics to Pushgateway + Uptime Kuma. Alerts: `EmailRoundtripFailing` (90m), `EmailRoundtripStale` (90m), `EmailRoundtripNeverRun` (2h). Vault: `mailgun_api_key` in `secret/viktor`.
- **E2E email monitoring**: CronJob `email-roundtrip-monitor` (every 20 min) sends test email via Mailgun API to `smoke-test@viktorbarzin.me` (catch-all → `spam@`), verifies IMAP delivery, deletes test email, pushes metrics to Pushgateway + Uptime Kuma. Alerts: `EmailRoundtripFailing` (60m), `EmailRoundtripStale` (60m), `EmailRoundtripNeverRun` (60m). Outbound relay: Brevo EU (`smtp-relay.brevo.com:587`, 300/day free — migrated from Mailgun). Mailserver on dedicated MetalLB IP `10.0.20.202` with `externalTrafficPolicy: Local` for CrowdSec real-IP detection. Vault: `mailgun_api_key` in `secret/viktor` (probe), `brevo_api_key` in `secret/viktor` (relay).
## Storage & Backup Architecture