infra/docs/adr/0019-backup-mx-self-hosted-oracle-relay.md
Viktor Barzin 114a7743ac
All checks were successful
ci/woodpecker/push/default Pipeline was successful
backup-mx: pivot to self-hosted Oracle relay; challenge-hardened design v3
Rollernet's free tier failed the validation gates before any DNS change
(200 msgs / 10 MB per rolling week, then 48h of SMTP 5xx bounces —
worse than no backup MX; free accounts being discontinued). Viktor
chose to stay free, so the backup MX becomes a Postfix store-and-forward
relay on an Oracle Always-Free VM (mx2.viktorbarzin.me, MX pref 20),
draining via port 2526 through the existing pfSense HAProxy frontend
since Oracle blocks egress 25.

Two independent adversarial reviews then fixed the design: primary-side
drain enablement moved to the layers that actually reject (unknown-
client-hostname, spoof protection, anvil limits, rspamd reject tier ->
external_relay + action cap, never backscatter), monitoring moved off
the nonexistent cluster->tailnet path to allowlisted public-IP scrapes,
bounce lifetime cut to 1d (the VM can never deliver DSNs), OCI OS-level
iptables + reserved-IP + mandatory PAYG requirements added, and 4xx-only
postscreen hygiene replaces the blanket no-filtering stance.

ADR-0019 and the design doc renamed accordingly (rollernet -> oracle).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-07-04 13:38:39 +00:00

6.2 KiB
Raw Blame History

Inbound mail gets a self-hosted store-and-forward backup MX on Oracle Always-Free

viktorbarzin.me has run a single direct MX to the home IP since the 2026-04-12 inbound overhaul, with sender-MTA retry (15 days, sender-dependent) as the only outage protection — a documented "No Backup MX" decision made after ForwardEmail's forced anti-spoofing rejected legitimate forwarded mail and Cloudflare Email Routing proved pass-through-only. Viktor now wants inbound mail to survive homelab outages without loss (2026-07-04): delayed delivery is fine, mid-outage reading is not required, and the budget is $0 — a hard constraint that eliminated every managed option (see below).

We run a minimal Postfix store-and-forward relay on an Oracle Cloud Always-Free VM.Standard.E2.1.Micro (mx2.viktorbarzin.me, reserved public IP, MX preference 20; primary untouched at 1). It accepts everything for the domain (catch-all — every RCPT is valid; reputation may only ever 4xx-defer, via postscreen pregreet + conservative DNSBL-defer on the VM — never 5xx: a backup MX that hard-rejects manufactures the loss it exists to prevent), queues up to 30 days (bounce lifetime 1 day — the VM can never deliver a DSN, its only egress is the drain), and drains to the primary over port 2526 — one scripted pfSense WAN NAT rule onto the existing HAProxy frontend — because Oracle blocks egress TCP 25 tenancy-wide. Management is tailnet-only (headscale preauth key, tag:backup-mx; OCI console as mid-outage break-glass since headscale itself lives in the cluster); TLS via certbot HTTP-01 (port 80 permanently open — LE validation is multi-perspective and unscopeable); the VM is a cattle-rebuild from a new stacks/backup-mx/ Terraform stack (OCI provider + cloud-init, which must also punch 25/80 through the OCI Ubuntu image's OS-level iptables REJECT). On the primary, the drain stream (one /32) is enabled at the layers that actually bite — check_client_access permits past reject_unknown_client_hostname and spoof-protection, an anvil rate-limit exception, and rspamd external_relay (score against the original sender IP) with the reject action capped to tag/fold so drained spam can never force the VM to emit backscatter. Go-live is gated on empirical checks: inbound-25 reachability (recurring probe — Oracle publishes no commitment), drain end-to-end, and a live failover test that includes a high-spam-score and a

10 MB message. Two independent adversarial reviews (2026-07-04) shaped this final form. Design: plans/2026-07-04-backup-mx-design.md.

Considered options

  • Roller Network free Secondary MX — v1 of this decision, killed at the validation gates the same day: free tier caps at 200 relayed messages or 10 MB per rolling 7 days, and overage suspends the domain for 48 h answering SMTP 5xx (permanent bounces) — since spammers target backup MXes even while the primary is up, background spam alone can hold it suspended, making it worse than no backup MX. Free accounts are also being discontinued. (Their TLS checked out; their paid Basic at $30/yr is the documented fallback if the OCI route sours.)
  • Dynu Email Backup ($9.99/yr) — queue lifetime undocumented (FAQ hints 1224 h, barely beating sender retry); filtering black-box; not free.
  • Cloudflare Email Routing / mailflare — no store-and-forward / terminal inbox on Cloudflare; rejected earlier (2026-04-12; 2026-07-04 memory #7148).
  • Other free tiers (challenged and re-verified 2026-07-04): GCP e2-micro blocks egress 25 too and its free regions are US-only; AWS's 2025+ "free" plan is a 6-month credit; Azure has no always-free VM and blocks 25; Hetzner has no free tier; Fly.io ended free allowances; Vultr/Linode are trial credits; DNSExit/KisoLabs/DuoCircle backup-MX are paid or dead. OCI is the only standing free option.
  • Harden-only (5xx-misconfig guards + paging) — does not address multi-day outages or short-retry senders; deferred as a complementary track.

Consequences

  • A pet outside the cluster — deliberately cattle: rebuilt entirely from Terraform + cloud-init, patched by unattended-upgrades, scraped by the cluster's Prometheus (exporters on the reserved public IP, allowlisted to the homelab WAN /32 — there is no cluster→tailnet route, so tailnet scraping was rejected as fictional; blackbox TCP:25 + MX-set drift alerts besides). Never a backup target itself.
  • Oracle free-tier caprice is the top risk: Oracle silently halved the A1 free allowance in June 2026 and terminated over-limit instances, and publishes no commitment that inbound 25 stays open. Mitigations: Pay-As-You-Go conversion is a required prerequisite (exempts idle reclamation, stays $0), a recurring inbound-25 probe, BackupMxDown, and the queue being empty outside outages (a surprise reclamation loses coverage, never mail). Home region is fixed at signup — Frankfurt, chosen once.
  • The drain stream bypasses reject_unknown_client_hostname, anvil limits, and rspamd's reject tier for one /32; DKIM verification, SPF/DMARC (against the original IP via external_relay), and content scoring stay on — spam arriving via the backup is tagged and folded to Junk, never bounced. The VM is deliberately NOT in the primary's mynetworks (a compromised VM must not relay through us).
  • Outages > 30 days lose queued mail silently — no DSN can ever leave the VM. Stated and accepted (6× better than the status quo).
  • Outage mail sits in plaintext on Oracle disk ≤ 30 days — single-tenant but off-premises; accepted (same class as Brevo holding outbound today).
  • Cloudflare zone lands at 197/200 records; the MTA-STS follow-up (policy host found dangling during design — inert today; must list mx2 when fixed) needs 12 more → schedule the next record purge proactively.
  • architecture/mailserver.md §"No Backup MX" superseded at implementation; new runbook docs/runbooks/backup-mx.md (incl. OCI console break-glass); vpn.md's stale headscale claims fixed in passing; the roundtrip probe's failure semantics change (a "failing" probe may now mean "delayed via mx2, drains shortly" — noted in alert description).