infra/docs/adr/0019-backup-mx-self-hosted-oracle-relay.md
Viktor Barzin 114a7743ac
All checks were successful
ci/woodpecker/push/default Pipeline was successful
backup-mx: pivot to self-hosted Oracle relay; challenge-hardened design v3
Rollernet's free tier failed the validation gates before any DNS change
(200 msgs / 10 MB per rolling week, then 48h of SMTP 5xx bounces —
worse than no backup MX; free accounts being discontinued). Viktor
chose to stay free, so the backup MX becomes a Postfix store-and-forward
relay on an Oracle Always-Free VM (mx2.viktorbarzin.me, MX pref 20),
draining via port 2526 through the existing pfSense HAProxy frontend
since Oracle blocks egress 25.

Two independent adversarial reviews then fixed the design: primary-side
drain enablement moved to the layers that actually reject (unknown-
client-hostname, spoof protection, anvil limits, rspamd reject tier ->
external_relay + action cap, never backscatter), monitoring moved off
the nonexistent cluster->tailnet path to allowlisted public-IP scrapes,
bounce lifetime cut to 1d (the VM can never deliver DSNs), OCI OS-level
iptables + reserved-IP + mandatory PAYG requirements added, and 4xx-only
postscreen hygiene replaces the blanket no-filtering stance.

ADR-0019 and the design doc renamed accordingly (rollernet -> oracle).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-07-04 13:38:39 +00:00

97 lines
6.2 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# Inbound mail gets a self-hosted store-and-forward backup MX on Oracle Always-Free
`viktorbarzin.me` has run a single direct MX to the home IP since the 2026-04-12
inbound overhaul, with sender-MTA retry (15 days, sender-dependent) as the only
outage protection — a documented "No Backup MX" decision made after ForwardEmail's
forced anti-spoofing rejected legitimate forwarded mail and Cloudflare Email
Routing proved pass-through-only. Viktor now wants inbound mail to survive
homelab outages **without loss** (2026-07-04): delayed delivery is fine,
mid-outage reading is not required, and the budget is **$0** — a hard
constraint that eliminated every managed option (see below).
We run a minimal **Postfix store-and-forward relay on an Oracle Cloud
Always-Free `VM.Standard.E2.1.Micro`** (`mx2.viktorbarzin.me`, **reserved**
public IP, MX preference 20; primary untouched at 1). It accepts everything
for the domain (catch-all — every RCPT is valid; reputation may only ever
4xx-defer, via postscreen pregreet + conservative DNSBL-defer on the VM —
never 5xx: a backup MX that hard-rejects manufactures the loss it exists to
prevent), queues up to **30 days** (bounce lifetime 1 day — the VM can never
deliver a DSN, its only egress is the drain), and drains to the primary over
**port 2526** — one scripted pfSense WAN NAT rule onto the existing HAProxy
frontend — because Oracle blocks egress TCP 25 tenancy-wide. Management is
tailnet-only (headscale preauth key, `tag:backup-mx`; OCI console as
mid-outage break-glass since headscale itself lives in the cluster); TLS via
certbot HTTP-01 (port 80 permanently open — LE validation is
multi-perspective and unscopeable); the VM is a cattle-rebuild from a new
`stacks/backup-mx/` Terraform stack (OCI provider + cloud-init, which must
also punch 25/80 through the OCI Ubuntu image's OS-level iptables REJECT).
On the primary, the drain stream (one /32) is enabled at the layers that
actually bite — `check_client_access` permits past
`reject_unknown_client_hostname` and spoof-protection, an anvil rate-limit
exception, and rspamd `external_relay` (score against the *original* sender
IP) with the reject action capped to tag/fold so drained spam can never force
the VM to emit backscatter. Go-live is gated on empirical checks: inbound-25
reachability (recurring probe — Oracle publishes no commitment), drain
end-to-end, and a live failover test that includes a high-spam-score and a
>10 MB message. Two independent adversarial reviews (2026-07-04) shaped this
final form. Design:
[`plans/2026-07-04-backup-mx-design.md`](../plans/2026-07-04-backup-mx-design.md).
## Considered options
- **Roller Network free Secondary MX** — v1 of this decision, killed at the
validation gates the same day: free tier caps at 200 relayed messages or
10 MB per rolling 7 days, and overage suspends the domain for 48 h
answering **SMTP 5xx** (permanent bounces) — since spammers target backup
MXes even while the primary is up, background spam alone can hold it
suspended, making it *worse than no backup MX*. Free accounts are also
being discontinued. (Their TLS checked out; their paid Basic at $30/yr is
the documented fallback if the OCI route sours.)
- **Dynu Email Backup ($9.99/yr)** — queue lifetime undocumented (FAQ hints
1224 h, barely beating sender retry); filtering black-box; not free.
- **Cloudflare Email Routing / mailflare** — no store-and-forward / terminal
inbox on Cloudflare; rejected earlier (2026-04-12; 2026-07-04 memory #7148).
- **Other free tiers** (challenged and re-verified 2026-07-04): GCP e2-micro
blocks egress 25 too and its free regions are US-only; AWS's 2025+ "free"
plan is a 6-month credit; Azure has no always-free VM and blocks 25;
Hetzner has no free tier; Fly.io ended free allowances; Vultr/Linode are
trial credits; DNSExit/KisoLabs/DuoCircle backup-MX are paid or dead. OCI
is the only standing free option.
- **Harden-only** (5xx-misconfig guards + paging) — does not address
multi-day outages or short-retry senders; deferred as a complementary
track.
## Consequences
- **A pet outside the cluster** — deliberately cattle: rebuilt entirely from
Terraform + cloud-init, patched by unattended-upgrades, scraped by the
cluster's Prometheus (exporters on the reserved public IP, allowlisted to
the homelab WAN /32 — there is **no cluster→tailnet route**, so tailnet
scraping was rejected as fictional; blackbox TCP:25 + MX-set drift alerts
besides). Never a backup target itself.
- **Oracle free-tier caprice is the top risk**: Oracle silently halved the A1
free allowance in June 2026 and terminated over-limit instances, and
publishes no commitment that inbound 25 stays open. Mitigations:
**Pay-As-You-Go conversion is a required prerequisite** (exempts idle
reclamation, stays $0), a recurring inbound-25 probe, `BackupMxDown`, and
the queue being empty outside outages (a surprise reclamation loses
coverage, never mail). Home region is fixed at signup — Frankfurt, chosen
once.
- The drain stream bypasses `reject_unknown_client_hostname`, anvil limits,
and rspamd's reject tier for one /32; DKIM verification, SPF/DMARC (against
the original IP via `external_relay`), and content scoring stay on — spam
arriving via the backup is tagged and folded to Junk, never bounced. The VM
is deliberately NOT in the primary's `mynetworks` (a compromised VM must
not relay through us).
- **Outages > 30 days lose queued mail silently** — no DSN can ever leave the
VM. Stated and accepted (6× better than the status quo).
- Outage mail sits in plaintext on Oracle disk ≤ 30 days — single-tenant but
off-premises; accepted (same class as Brevo holding outbound today).
- Cloudflare zone lands at 197/200 records; the MTA-STS follow-up (policy
host found dangling during design — inert today; must list `mx2` when
fixed) needs 12 more → schedule the next record purge proactively.
- `architecture/mailserver.md` §"No Backup MX" superseded at implementation;
new runbook `docs/runbooks/backup-mx.md` (incl. OCI console break-glass);
`vpn.md`'s stale headscale claims fixed in passing; the roundtrip probe's
failure semantics change (a "failing" probe may now mean "delayed via mx2,
drains shortly" — noted in alert description).