infra/docs/plans/2026-07-04-backup-mx-design.md
Viktor Barzin 114a7743ac
All checks were successful
ci/woodpecker/push/default Pipeline was successful
backup-mx: pivot to self-hosted Oracle relay; challenge-hardened design v3
Rollernet's free tier failed the validation gates before any DNS change
(200 msgs / 10 MB per rolling week, then 48h of SMTP 5xx bounces —
worse than no backup MX; free accounts being discontinued). Viktor
chose to stay free, so the backup MX becomes a Postfix store-and-forward
relay on an Oracle Always-Free VM (mx2.viktorbarzin.me, MX pref 20),
draining via port 2526 through the existing pfSense HAProxy frontend
since Oracle blocks egress 25.

Two independent adversarial reviews then fixed the design: primary-side
drain enablement moved to the layers that actually reject (unknown-
client-hostname, spoof protection, anvil limits, rspamd reject tier ->
external_relay + action cap, never backscatter), monitoring moved off
the nonexistent cluster->tailnet path to allowlisted public-IP scrapes,
bounce lifetime cut to 1d (the VM can never deliver DSNs), OCI OS-level
iptables + reserved-IP + mandatory PAYG requirements added, and 4xx-only
postscreen hygiene replaces the blanket no-filtering stance.

ADR-0019 and the design doc renamed accordingly (rollernet -> oracle).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-07-04 13:38:39 +00:00

20 KiB
Raw Blame History

Backup MX — self-hosted store-and-forward relay on Oracle Always-Free — design

Date: 2026-07-04 (v3 — post-challenge; v2 Oracle pivot same day) · Status: design, pre-implementation · ADR: 0019

v3 incorporates two independent adversarial-challenge reviews (same day). Their material corrections are marked [CH] throughout — the largest: the v2 drain path would never have drained (primary-side smtpd rejects), monitoring-over- tailnet was fiction (no cluster→tailnet route exists), and the VM's bounce model was wrong (it can never deliver a DSN).

Goal

Inbound mail for viktorbarzin.me must survive homelab outages without loss. Requirement level (Viktor, 2026-07-04): never lose mail; delayed delivery is acceptable; budget is $0 (hard constraint — reaffirmed after the Rollernet gates failed). A store-and-forward backup MX queues mail while the homelab is down and re-delivers when it returns.

Out of scope, explicitly:

  • Reading new mail during an outage.
  • Outbound mail during outages.
  • The "primary up but hard-bouncing 5xx" misconfig class — a backup MX is never consulted when the primary answers. Separate hardening/alerting track.

Known residual limit (state it plainly): an outage longer than 30 days loses the queued mail silently — the VM cannot emit a bounce to anyone (egress 25 blocked), so no sender ever learns. Accepted; 30 days is already 6× the sender-retry status quo.

v1 → v2: why Rollernet was dropped (gate evidence, 2026-07-04)

v1 selected Roller Network's free Secondary MX. The validation gates killed it before any DNS change:

  • G2 FAILED: the free-accounts policy caps free mail service at 200 relayed messages or 10 MB per rolling 7 days; overage → domain suspended 48 h answering SMTP 5xx (permanent bounces), repeatable. Spammers deliberately target backup MXes even while the primary is up, so background spam alone can hold the domain suspended — worse than no backup MX.
  • G1 SHAKY: same policy page says free accounts are being discontinued.
  • G3 PASSED (for posterity): mail{,2}.rollernet.us present valid LE certs over STARTTLS.
  • Signup is Cloudflare-Turnstile-gated — moot given G1/G2.

Viktor's decision: stay free → self-host on Oracle Always-Free. [CH] The external challenger re-searched the free landscape (DNSExit, KisoLabs, DuoCircle, AWS/Azure/GCP/Hetzner/Fly/Vultr/Linode free tiers) and confirmed: no credible free managed backup-MX or free VM with a usable port-25 story exists in 2026 other than OCI. GCP's free e2-micro also blocks egress 25 and is US-regions-only (wrong continent).

Decision

A minimal Postfix store-and-forward relay (mx2.viktorbarzin.me) on an Oracle Cloud Always-Free compute instance, published as a lower-preference MX. It accepts mail for viktorbarzin.me when the primary is unreachable, queues up to 30 days, and drains to the primary when it returns. No mailboxes, no third-party terms — the queue-lifetime and reject-behavior knobs are ours.

Architecture

                         ┌── pri 1  mail.viktorbarzin.me ──► pfSense HAProxy ──► mailserver pod
sender MTA ──► MX lookup ┤                                        ▲
                         └── pri 20 mx2.viktorbarzin.me           │ drain: smtp to
                             (Oracle VM, Postfix relay,           │ mail.viktorbarzin.me:2526
                              queue ≤ 30 days) ───────────────────┘ (pfSense WAN NAT rdr
                                                                     2526 → 10.0.20.1:25,
                                                                     existing HAProxy frontend)
  • Normal operation: senders use pri 1; the VM idles (spammers targeting the backup + transient-blip retries get relayed onward immediately).
  • Outage: senders fall back to pri 20 → VM accepts + queues → Postfix retries the primary on its native schedule → queue drains after recovery through the standard external ingress path (PROXY v2 → :2525 → rspamd → Dovecot).
  • Custom drain port: Oracle blocks egress TCP 25 tenancy-wide (post-2021; exemptions unreliable) — the VM cannot reach mail.viktorbarzin.me:25. One pfSense WAN NAT rule TCP 2526 → 10.0.20.1:25 reuses the existing HAProxy frontend unchanged. [CH] Verified against the runbook: the frontend binds *:25 on pfSense (not strictly 10.0.20.1), rdr dst-port rewrite is the existing production pattern (WAN:25 already rewrites to 10.0.20.1:25), and port 2526 collides with nothing (the HAProxy test frontend uses :2525). Inbound TCP 25 to the VM is unaffected by Oracle's egress-only block per practitioner evidence (iRedMail/mailcow on OCI: receive works, send doesn't) — to be proven at gate O2 before any DNS change (Oracle publishes no positive commitment).

Oracle account & instance

  • Account: Viktor creates it (human signup; card for identity, $0 charged). Home region is fixed at signup and Always-Free compute exists only there — choose eu-frankfurt-1 deliberately; there is no try-another-region fallback without a new account. [CH]
  • [CH] PAYG conversion is a REQUIRED prerequisite, not a recommendation: Oracle stops idle Always-Free instances (95th-pct CPU < 20% over 7 days — an idle Postfix box qualifies) and demonstrably changes free-tier terms without notice, enforcing by termination (June 2026: A1 allowance silently halved, over-limit instances shut down). PAYG keeps Always-Free resources free and exempts them from idle reclamation.
  • Shape: VM.Standard.E2.1.Micro (x86, 1/8 OCPU burst, 1 GB RAM; 2 always-free instances allowed; ample for queue-only Postfix — and untouched by the 2026 A1 cuts). ARM A1 fallback is unreliable (halved quota, chronic Frankfurt capacity) — treat E2.1.Micro availability as the gate.
  • [CH] Reserved public IP is mandatory (oci_core_public_ip, reserved): an ephemeral IP rotates on stop/start and would silently break all four IP-keyed controls at once (pfSense NAT source-restriction, the primary's smtpd/rspamd exemptions, the Oracle security list, Prometheus scrape allowlist) — discovered only at the next outage's drain.
  • OS: Ubuntu 24.04. [CH] OCI Ubuntu images ship an OS-level iptables ruleset (/etc/iptables/rules.v4) that ACCEPTs 22 and REJECTs everything else, independent of security lists — cloud-init must insert ACCEPT rules for 25/80 (+ scrape ports) ahead of the REJECT and persist them, or gate O2 fails on day 1 with a correct security list.
  • Credentials: OCI API key for Terraform → Vault secret/viktor (oci_*); web login → Vaultwarden item Oracle Cloud (backup MX).

Networking & security posture

  • Ingress on the VM: TCP 25 world-open (the service). [CH] TCP 80 world-open permanently — Let's Encrypt validation is multi-perspective with no published source IPs, so it cannot be source-scoped, and a "open-only-during-renewal" toggle is unspecified automation whose realistic failure mode is an expired cert at day ~90. Nothing listens on 80 outside certbot's seconds-long renewal windows; connection-refused surface is negligible. TCP 9100/9154 (exporters) restricted to the homelab WAN /32 (176.12.22.76) in both the Oracle security list and the VM firewall.
  • No public SSH: management rides the headscale tailnet — cloud-init enrolls via a preauth key for a dedicated non-OIDC headscale user with node tag tag:backup-mx (headscale 0.28.0 file-mode ACL, content in Vault secret/headscaleheadscale_acl); SSH bound to the tailnet interface. ACL grant: group:admin → tag:backup-mx:22 (cluster pods are NOT tailnet members — see monitoring). [CH] Outage caveat: headscale's control plane + DERP live in the cluster, so mid-outage tailnet reachability is cached-netmap best-effort — the runbook documents the OCI instance console connection as break-glass management. (Also fix vpn.md's stale "0.23.x / OIDC-only" claims while in there.)
  • VM compromise blast radius: plaintext of outage-queued mail + a relay surface contained by relay_domains = viktorbarzin.me only, no submission ports, no SASL, no local delivery. The VM is deliberately NOT added to the primary's mynetworks (that would let a compromised VM relay arbitrary mail through the primary) — per-stage exemptions instead, below.

Postfix configuration (relay-only, accept-and-queue with 4xx-only hygiene)

  • relay_domains = viktorbarzin.me; mydestination = (empty).
  • [CH] smtpd_relay_restrictions = permit_mynetworks, reject_unauth_destination — explicit 5xx for foreign-domain RCPTs (the default tail is defer_unauth_destination, whose 4xx invites every relay probe to retry forever).
  • [CH] relay_recipient_maps explicitly set to the wildcard form (@viktorbarzin.me OK) — documents accept-all-recipients as a decision (the domain is catch-all; every RCPT is valid by definition).
  • transport_maps: viktorbarzin.me smtp:[mail.viktorbarzin.me]:2526.
  • maximal_queue_lifetime = 30d. [CH] bounce_queue_lifetime = 1d and delay_warning_time = 0 — this host can never deliver a DSN to anyone (egress 25 blocked; its only egress is 2526 to the primary), so undeliverable bounces must be discarded quickly or they rot in the queue for a month and permanently poison the queue-depth alert.
  • [CH] message_size_limit = 209715200 — exactly the primary's 200 MB (POSTFIX_MESSAGE_SIZE_LIMIT, mailserver main.tf:88). The stock 10 MB default would 552-reject large legitimate mail during outages — the exact loss mode this project exists to prevent. Equal, never higher (higher recreates drain-time rejects).
  • [CH] postscreen on the VM in 4xx-only posture: pregreet test ON (fire-and-forget bots don't retry; real MTAs do — the whole design already rests on sender retry, so 4xx filtering is loss-free by construction), optionally postscreen_dnsbl_action = defer with a conservative threshold. v2's blanket "no DNSBL" conflated 5xx reputation rejects (rightly banned) with 4xx tempfail (harmless); without any hygiene the backup is a 24/7 spam backdoor since spammers deliberately deliver to the highest-numbered MX. Zero 5xx from reputation, ever.
  • inet_protocols = ipv4 [CH] — the primary publishes an AAAA (HE tunnel) but the IPv6 HAProxy bridge has no :2526 listener; skip the wasted v6 attempt per delivery.
  • smtpd_tls_cert_file = LE cert for mx2.viktorbarzin.me (opportunistic STARTTLS inbound; smtp_tls_security_level = may on the drain leg).
  • Queue disk: the ~45 GB free boot volume dwarfs any realistic 30-day accumulation for a personal domain.

TLS

certbot standalone HTTP-01 for mx2.viktorbarzin.me (no Cloudflare API token on an internet-facing VM). Port 80 permanently open (see above); certbot renew timer. The MTA-STS follow-up (separate task; policy host currently dangling — below) must list mx2.viktorbarzin.me when implemented.

Primary-side drain enablement [CH — this section replaces v2's "SPF/DMARC exemption + postscreen permit", which exempted the wrong layers]

The v2 exemptions targeted postscreen DNSBL (which is off on the primary — ENABLE_DNSBL unset) and rspamd SPF/DMARC scoring — but missed the three mechanisms that would actually break the drain. All are keyed on the VM's reserved /32 (the PROXY-v2-recovered client IP):

  1. reject_unknown_client_hostname bypass — the primary sets POSTFIX_REJECT_UNKNOWN_CLIENT_HOSTNAME=1 (main.tf:89); an Oracle IP without full FCrDNS (PTR needs an Oracle SR; limited on free accounts) would be 450-deferred on every drain attempt → the queue never drains → mass-bounces at day 30. Fix: check_client_access permit for the VM /32 early in smtpd_client_restrictions, and a matching permit at the sender stage (SPOOF_PROTECTION=1 rejects unauthenticated own-domain envelope senders — drained self-addressed/bounced mail would 5xx). Attempt the Oracle PTR anyway (belt and braces).
  2. Anvil rate-limit exceptionsmtpd_client_message_rate_limit = 30/min keys on the VM's IP at drain; a >3,600-message backlog would throttle for hours and false-fire the queue alert. Add the VM /32 to smtpd_client_event_limit_exceptions.
  3. rspamd: evaluate the original sender, never 5xx the drain stream — via the existing override.d ConfigMap pattern (same mount as dkim_signing.conf): (a) configure rspamd's external_relay module (ip_map = VM /32) so SPF/DMARC/IP reputation evaluate against the original client IP parsed from the VM's Received header — this keeps DMARC protection for the entire drain stream instead of v2's blanket disable; (b) cap rspamd's action at the VM /32 to tag/fold — never milter-reject: the primary's default reject tier (DMS default, active since only dkim_signing is overridden today) would 5xx high-score spam at DATA, forcing the VM to generate DSNs to forged senders = classic backup-MX backscatter → mx2's IP blacklisted. Drained spam lands tagged in the catch-all's Junk instead. Validate the external_relay ↔ settings-rule interplay at gate O5 with a high-spam-score message.
  4. postscreen permit for the /32 (harmless; pregreet never trips a real Postfix client and DNSBL is off — kept for future-proofing only).

Our-side changes (Terraform unless noted)

  1. New stack stacks/backup-mx/ (Tier 1): OCI provider (creds from Vault), VCN + subnet + security list + reserved public IP + VM.Standard.E2.1.Micro + cloud-init (templatefile): OS iptables ACCEPTs for 25/80/9100/9154 ahead of the OCI image's REJECT rule (persisted), postfix + config above, certbot, tailscale→headscale enrollment (preauth key from Vault), node_exporter, postfix_exporter, unattended-upgrades.
  2. DNSstacks/cloudflared/modules/cloudflared/cloudflare.tf: A mx2.viktorbarzin.me → reserved IP (non-proxied), MX pref 20 → mx2. [CH] Live zone count verified: 195/200 → 197/200 after this change; only 3 slots remain and the MTA-STS follow-up needs 12 → plan the next record-purge now, not at collision time.
  3. pfSense (live network device — approved as part of this plan): WAN NAT rdr TCP 2526 → 10.0.20.1:25 + firewall rule, source-restricted to the reserved IP. [CH] Scripted (extend the existing scripts/pfsense-*-haproxy*.php bootstrap-script family), not hand-clicked — keeps the git-rebuildable parity the rest of the pfSense mail config has. Config.xml rides the nightly backup.
  4. Mailserver stack: the four-layer drain enablement above (client+sender check_client_access permits, anvil exception, rspamd external_relay + action cap, postscreen permit) — all keyed to one /32, via the existing postfix_cf / user-patches.sh / rspamd-override hook points (verified present: main.tf:129-144, 222-281, 467-474).
  5. Monitoring [CH — replaces v2's tailnet scraping, which had no transport: no cluster→tailnet route exists and no existing target is scraped that way]: Prometheus scrapes node_exporter/postfix_exporter on the VM's public reserved IP, allowed only from the homelab WAN /32 (Oracle SL + VM firewall); blackbox TCP:25 from the cluster (BackupMxDown, warning); MX-set drift assertion (both MX records present). Alerts: BackupMxQueueStuck = non-bounce queue depth > 0 for 2 h while the primary is healthy (gate on the existing MailServerDown/roundtrip series, machine-readable — not prose); bounce residue is excluded by the 1-day bounce lifetime. Note: during a full homelab outage Prometheus itself is down — queue growth is unobservable live under ANY transport; what we actually watch is the post-recovery drain. A WAN-IP change stales the Oracle allowlist → visible as ScrapeTargetDown (self-signaling). Probe semantics note: once mx2 exists, the Brevo roundtrip probe's mail fails over to mx2 on transient primary blips and arrives minutes late via the drain — EmailRoundtripFailing may then mean "delayed via mx2", not "lost"; note in the alert description and runbook.
  6. Docs (same commit as implementation): rewrite mailserver.md §"No Backup MX", new runbook docs/runbooks/backup-mx.md (postqueue -p, forced drain postqueue -f, cert renewal, OCI console break-glass, VM rebuild from stack, Oracle account facts incl. PAYG + home-region lock), vpn.md headscale-version/OIDC staleness fix, monitoring rows.

MTA-STS finding (unchanged; no action in this change)

_mta-sts TXT is published but mta-sts.viktorbarzin.me has no record and nothing serves the policy — MTA-STS is inert today. When fixed, the policy MUST include mx: mx2.viktorbarzin.me (and budget its DNS records against the 3 remaining zone slots).

Validation gates (in order; any failure → stop and report)

# Gate Method Failure handling
O1 Oracle account (home region eu-frankfurt-1, fixed forever at signup), PAYG conversion done, E2.1.Micro capacity Viktor signs up + converts; TF apply A1-in-home-region is a best-effort fallback only (halved quota, contended); else decision returns to Viktor
O2 Inbound TCP 25 reachable from the internet (after the OS-iptables fix) nc -zv <reserved-ip> 25 from outside + recurring Uptime-Kuma TCP monitor (keeps proving it — Oracle publishes no commitment) Stop; decision returns to Viktor
O3 Drain works: VM → mail.viktorbarzin.me:2526 delivers end-to-end Test message injected on the VM Debug pfSense NAT / HAProxy path
O4 LE cert issued certbot standalone STARTTLS is opportunistic — non-blocking for go-live; fix before MTA-STS
O5 Live failover test — hardened [CH] presence-claim → scale mailserver to 0 (~30 min) → send from Gmail + Brevo plus a high-spam-score message and a >10 MB message → confirm queued (postqueue -p) → scale up → verify full drain within the anvil-exception expectations, spam folded to Junk (not bounced), headers show original-IP SPF/DMARC evaluation, no DSN generated on the VM, roundtrip probe recovers Debug or roll back (remove MX record)

Failure modes

Covered: cluster/pod outages, pfSense/power/ISP outages ≤ 30 days, WAN IP changes, short-retry senders. If pfSense is down the drain waits — Postfix retries until it heals.

Not covered: primary-up-but-5xx misconfigs; outbound; mid-outage mailbox access; outages > 30 days lose queued mail silently (no DSN possible). Simultaneous Oracle+homelab outage = status quo ante (sender retries).

Newly introduced, accepted:

  • A pet outside the cluster — deliberately cattle: rebuilt from TF + cloud-init, patched by unattended-upgrades, scraped by Prometheus. Never a backup target.
  • Oracle free-tier caprice [CH — upgraded from v2's framing]: Oracle has silently cut Always-Free allowances and terminated over-limit instances (June 2026, A1). Mitigations: PAYG (required), recurring inbound-25 probe, BackupMxDown, and the fact that outside an active outage the queue is empty — a surprise reclamation loses nothing, only coverage until rebuilt. Rollernet Basic ($30/yr) stays the documented fallback if OCI sours.
  • Spam hygiene: 4xx-only postscreen on the VM (pregreet + conservative DNSBL-defer) instead of v2's nothing; drained spam is tagged/folded by rspamd, never bounced.
  • Outage mail sits plaintext on Oracle disk ≤ 30 days (single-tenant; accepted).

Rollback

Remove the MX + A records; wait for postqueue -p empty; terraform destroy on backup-mx; delete the pfSense NAT rule (scripted); drop the mailserver /32 exemptions. Order matters: MX record first.

Viktor's manual steps (everything else is mine)

  1. Create the Oracle Cloud account — home region eu-frankfurt-1 (fixed forever), card for identity, $0 charged.
  2. Convert the tenancy to Pay-As-You-Go (required — idle-reclamation exemption; Always-Free stays $0).
  3. Hand me the tenancy OCID + a console user → I mint the API key, store creds (Vault + Vaultwarden), and build the stack.
  4. Approve the (scripted) pfSense NAT rule when I reach that step.