backup-mx: pivot to self-hosted Oracle relay; challenge-hardened design v3

Rollernet's free tier failed the validation gates before any DNS change (200 msgs / 10 MB per rolling week, then 48h of SMTP 5xx bounces — worse than no backup MX; free accounts being discontinued). Viktor chose to stay free, so the backup MX becomes a Postfix store-and-forward relay on an Oracle Always-Free VM (mx2.viktorbarzin.me, MX pref 20), draining via port 2526 through the existing pfSense HAProxy frontend since Oracle blocks egress 25.

Two independent adversarial reviews then fixed the design: primary-side drain enablement moved to the layers that actually reject (unknown-client-hostname, spoof protection, anvil limits, rspamd reject tier -> external_relay + action cap, never backscatter), monitoring moved off the nonexistent cluster->tailnet path to allowlisted public-IP scrapes, bounce lifetime cut to 1d (the VM can never deliver DSNs), OCI OS-level iptables + reserved-IP + mandatory PAYG requirements added, and 4xx-only postscreen hygiene replaces the blanket no-filtering stance.

ADR-0019 and the design doc renamed accordingly (rollernet -> oracle).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-07-04 13:38:39 +00:00 · 2026-07-04 13:38:39 +00:00 · 114a7743ac
commit 114a7743ac
parent c1ffed17a9
4 changed files with 432 additions and 246 deletions
--- a/docs/adr/0019-backup-mx-roller-network-free-tier.md
+++ b/docs/adr/0019-backup-mx-roller-network-free-tier.md
@@ -1,65 +0,0 @@
-# Inbound mail gets a free store-and-forward backup MX (Roller Network)
-
-`viktorbarzin.me` has run a single direct MX to the home IP since the 2026-04-12
-inbound overhaul, with sender-MTA retry (1–5 days, sender-dependent) as the only
-outage protection — a documented "No Backup MX" decision made after ForwardEmail's
-forced anti-spoofing rejected legitimate forwarded mail and Cloudflare Email
-Routing proved pass-through-only (no queue). Viktor now wants inbound mail to
-survive homelab outages **without loss** (2026-07-04): delayed delivery is fine,
-mid-outage reading is not required, and the budget is **$0** — which rules out
-the doc-flagged Dynu fallback ($9.99/yr).
-
-We adopt **Roller Network's free-tier Secondary MX** (`mail.rollernet.us` +
-`mail2.rollernet.us` at equal MX preference 20, primary untouched): a
-purpose-built store-and-forward relay with a **3-week queue** (sliding retries,
-15 min doubling to 1-week max), **no forced spam filtering** on the secondary
-path, and a valid-user table with a default-*allow-any* mode that preserves our
-catch-all's infinite ad-hoc aliases. Our side whitelists their relay CIDRs in
-postscreen (skip DNSBL/pregreet for queue drains) and exempts them from
-SPF/DMARC *scoring* in rspamd — the ForwardEmail lesson applied at the right
-layer; DKIM verification and content/AV scanning stay fully active. Go-live is
-gated: confirm the free tier still includes Secondary MX, confirm the 10 MB/day
-overage lock answers 4xx (defer) rather than 5xx (bounce), capture their
-authoritative relay CIDRs, apply the whitelist **before** the MX records, and
-finish with a live failover test (mailserver scaled to 0, probes from Gmail +
-Brevo, verified queue-and-drain). Design:
-[`plans/2026-07-04-backup-mx-rollernet-design.md`](../plans/2026-07-04-backup-mx-rollernet-design.md).
-
-## Considered options
-
-- **Dynu Email Backup ($9.99/yr)** — the previously doc-flagged option; simple,
-  but queue lifetime is undocumented (FAQ hints at 12–24 h retry ceilings),
-  filtering behaviour is a black box, and it costs money the free requirement
-  excludes.
-- **Self-hosted VPS relay** (Hetzner ~€50/yr, or Oracle Always-Free at $0) —
-  full control (30-day queue, own TLS/MTA-STS story), but a second
-  internet-facing pet to patch and monitor; Oracle hard-blocks egress port 25,
-  forcing delivery to the primary on a custom port, and idles risk free-tier
-  reclamation.
-- **Cloudflare Email Routing / mailflare** — no store-and-forward (pass-through
-  only) / a terminal inbox on Cloudflare respectively; both previously
-  evaluated and rejected (2026-04-12; 2026-07-04, memory #7148).
-- **Harden-only** (guard hard-5xx misconfig modes, add paging) — cheaper but
-  does not address multi-day outages or short-retry senders; deferred as a
-  complementary track, not an alternative.
-
-## Consequences
-
-- Outage mail queues **in plaintext at a third party** for up to 3 weeks —
-  accepted; same trust class as Brevo holding our outbound relay traffic.
-- The backup path bypasses postscreen DNSBL and SPF/DMARC scoring for
-  Rollernet's CIDRs; content/AV/Bayes and DKIM verification still apply. A
-  slight spam uptick during outages is possible (catch-all absorbs to `spam@`).
-- The free tier's **10 MB/day cap** locks the domain until midnight Pacific
-  when exceeded; the G2 gate decides whether that lock defers (harmless) or
-  bounces (revisit: paid tier or accept). Overage never affects the primary
-  path — only mail arriving via the backup while locked.
-- Two more records in a Cloudflare zone already near the Free-plan 200-record
-  cap (headroom must be verified at apply time).
-- **MTA-STS was found dangling** during design: the `_mta-sts` TXT is published
-  but no policy host exists, so MTA-STS is inert today. Any future fix must
-  list the Rollernet MX hosts in the policy or enforcing senders will skip the
-  backup path.
-- `architecture/mailserver.md` §"No Backup MX" is superseded at implementation
-  time; a new runbook covers ACC queue inspection, post-outage drain checks,
-  and Accept-and-Hold for planned maintenance.
--- a/docs/adr/0019-backup-mx-self-hosted-oracle-relay.md
+++ b/docs/adr/0019-backup-mx-self-hosted-oracle-relay.md
@@ -0,0 +1,97 @@
+# Inbound mail gets a self-hosted store-and-forward backup MX on Oracle Always-Free
+
+`viktorbarzin.me` has run a single direct MX to the home IP since the 2026-04-12
+inbound overhaul, with sender-MTA retry (1–5 days, sender-dependent) as the only
+outage protection — a documented "No Backup MX" decision made after ForwardEmail's
+forced anti-spoofing rejected legitimate forwarded mail and Cloudflare Email
+Routing proved pass-through-only. Viktor now wants inbound mail to survive
+homelab outages **without loss** (2026-07-04): delayed delivery is fine,
+mid-outage reading is not required, and the budget is **$0** — a hard
+constraint that eliminated every managed option (see below).
+
+We run a minimal **Postfix store-and-forward relay on an Oracle Cloud
+Always-Free `VM.Standard.E2.1.Micro`** (`mx2.viktorbarzin.me`, **reserved**
+public IP, MX preference 20; primary untouched at 1). It accepts everything
+for the domain (catch-all — every RCPT is valid; reputation may only ever
+4xx-defer, via postscreen pregreet + conservative DNSBL-defer on the VM —
+never 5xx: a backup MX that hard-rejects manufactures the loss it exists to
+prevent), queues up to **30 days** (bounce lifetime 1 day — the VM can never
+deliver a DSN, its only egress is the drain), and drains to the primary over
+**port 2526** — one scripted pfSense WAN NAT rule onto the existing HAProxy
+frontend — because Oracle blocks egress TCP 25 tenancy-wide. Management is
+tailnet-only (headscale preauth key, `tag:backup-mx`; OCI console as
+mid-outage break-glass since headscale itself lives in the cluster); TLS via
+certbot HTTP-01 (port 80 permanently open — LE validation is
+multi-perspective and unscopeable); the VM is a cattle-rebuild from a new
+`stacks/backup-mx/` Terraform stack (OCI provider + cloud-init, which must
+also punch 25/80 through the OCI Ubuntu image's OS-level iptables REJECT).
+On the primary, the drain stream (one /32) is enabled at the layers that
+actually bite — `check_client_access` permits past
+`reject_unknown_client_hostname` and spoof-protection, an anvil rate-limit
+exception, and rspamd `external_relay` (score against the *original* sender
+IP) with the reject action capped to tag/fold so drained spam can never force
+the VM to emit backscatter. Go-live is gated on empirical checks: inbound-25
+reachability (recurring probe — Oracle publishes no commitment), drain
+end-to-end, and a live failover test that includes a high-spam-score and a
+>10 MB message. Two independent adversarial reviews (2026-07-04) shaped this
+final form. Design:
+[`plans/2026-07-04-backup-mx-design.md`](../plans/2026-07-04-backup-mx-design.md).
+
+## Considered options
+
+- **Roller Network free Secondary MX** — v1 of this decision, killed at the
+  validation gates the same day: free tier caps at 200 relayed messages or
+  10 MB per rolling 7 days, and overage suspends the domain for 48 h
+  answering **SMTP 5xx** (permanent bounces) — since spammers target backup
+  MXes even while the primary is up, background spam alone can hold it
+  suspended, making it *worse than no backup MX*. Free accounts are also
+  being discontinued. (Their TLS checked out; their paid Basic at $30/yr is
+  the documented fallback if the OCI route sours.)
+- **Dynu Email Backup ($9.99/yr)** — queue lifetime undocumented (FAQ hints
+  12–24 h, barely beating sender retry); filtering black-box; not free.
+- **Cloudflare Email Routing / mailflare** — no store-and-forward / terminal
+  inbox on Cloudflare; rejected earlier (2026-04-12; 2026-07-04 memory #7148).
+- **Other free tiers** (challenged and re-verified 2026-07-04): GCP e2-micro
+  blocks egress 25 too and its free regions are US-only; AWS's 2025+ "free"
+  plan is a 6-month credit; Azure has no always-free VM and blocks 25;
+  Hetzner has no free tier; Fly.io ended free allowances; Vultr/Linode are
+  trial credits; DNSExit/KisoLabs/DuoCircle backup-MX are paid or dead. OCI
+  is the only standing free option.
+- **Harden-only** (5xx-misconfig guards + paging) — does not address
+  multi-day outages or short-retry senders; deferred as a complementary
+  track.
+
+## Consequences
+
+- **A pet outside the cluster** — deliberately cattle: rebuilt entirely from
+  Terraform + cloud-init, patched by unattended-upgrades, scraped by the
+  cluster's Prometheus (exporters on the reserved public IP, allowlisted to
+  the homelab WAN /32 — there is **no cluster→tailnet route**, so tailnet
+  scraping was rejected as fictional; blackbox TCP:25 + MX-set drift alerts
+  besides). Never a backup target itself.
+- **Oracle free-tier caprice is the top risk**: Oracle silently halved the A1
+  free allowance in June 2026 and terminated over-limit instances, and
+  publishes no commitment that inbound 25 stays open. Mitigations:
+  **Pay-As-You-Go conversion is a required prerequisite** (exempts idle
+  reclamation, stays $0), a recurring inbound-25 probe, `BackupMxDown`, and
+  the queue being empty outside outages (a surprise reclamation loses
+  coverage, never mail). Home region is fixed at signup — Frankfurt, chosen
+  once.
+- The drain stream bypasses `reject_unknown_client_hostname`, anvil limits,
+  and rspamd's reject tier for one /32; DKIM verification, SPF/DMARC (against
+  the original IP via `external_relay`), and content scoring stay on — spam
+  arriving via the backup is tagged and folded to Junk, never bounced. The VM
+  is deliberately NOT in the primary's `mynetworks` (a compromised VM must
+  not relay through us).
+- **Outages > 30 days lose queued mail silently** — no DSN can ever leave the
+  VM. Stated and accepted (6× better than the status quo).
+- Outage mail sits in plaintext on Oracle disk ≤ 30 days — single-tenant but
+  off-premises; accepted (same class as Brevo holding outbound today).
+- Cloudflare zone lands at 197/200 records; the MTA-STS follow-up (policy
+  host found dangling during design — inert today; must list `mx2` when
+  fixed) needs 1–2 more → schedule the next record purge proactively.
+- `architecture/mailserver.md` §"No Backup MX" superseded at implementation;
+  new runbook `docs/runbooks/backup-mx.md` (incl. OCI console break-glass);
+  `vpn.md`'s stale headscale claims fixed in passing; the roundtrip probe's
+  failure semantics change (a "failing" probe may now mean "delayed via mx2,
+  drains shortly" — noted in alert description).
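
Review note: the ADR gates go-live on a recurring inbound-25 reachability probe but does not show one. The sketch below is one possible shape for that check, not the probe the repo actually ships. The function name `smtp_banner_ok` and the 10-second default timeout are assumptions made for illustration; the generous timeout is chosen because postscreen's pregreet test may delay the final greeting by several seconds.

```python
import socket


def smtp_banner_ok(host: str, port: int = 25, timeout: float = 10.0) -> bool:
    """Return True if a TCP connect succeeds and the server greets with 220.

    Hypothetical helper: it only reads the first chunk of the greeting and
    never sends SMTP commands, so it cannot queue or reject any mail.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout) as conn:
            conn.settimeout(timeout)
            banner = conn.recv(512)
    except OSError:
        # Covers refused connections, timeouts, and DNS failures alike.
        return False
    # A multi-line greeting starts with "220-", a single-line one with "220 ".
    return banner.startswith(b"220")


# Example call (performs a real network connection when uncommented):
# print(smtp_banner_ok("mx2.viktorbarzin.me"))
```

A False result only says that port 25 did not greet from wherever the probe ran, so it is most meaningful when run from outside the homelab network, which is the perspective a sending MTA has.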