Rollernet's free tier failed the validation gates before any DNS change (200 msgs / 10 MB per rolling week, then 48h of SMTP 5xx bounces, which is worse than no backup MX; free accounts being discontinued). Viktor chose to stay free, so the backup MX becomes a Postfix store-and-forward relay on an Oracle Always-Free VM (mx2.viktorbarzin.me, MX pref 20), draining via port 2526 through the existing pfSense HAProxy frontend since Oracle blocks egress 25. Two independent adversarial reviews then fixed the design: primary-side drain enablement moved to the layers that actually reject (unknown-client-hostname, spoof protection, anvil limits, rspamd reject tier -> external_relay + action cap, never backscatter), monitoring moved off the nonexistent cluster->tailnet path to allowlisted public-IP scrapes, bounce lifetime cut to 1d (the VM can never deliver DSNs), OCI OS-level iptables + reserved-IP + mandatory PAYG requirements added, and 4xx-only postscreen hygiene replaces the blanket no-filtering stance. ADR-0019 and the design doc renamed accordingly (rollernet -> oracle).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Backup MX — self-hosted store-and-forward relay on Oracle Always-Free — design
Date: 2026-07-04 (v3 — post-challenge; v2 Oracle pivot same day) · Status: design, pre-implementation · ADR: 0019
v3 incorporates two independent adversarial-challenge reviews (same day). Their material corrections are marked [CH] throughout. The largest: the v2 drain path would never have drained (primary-side smtpd rejects), monitoring-over-tailnet was fiction (no cluster→tailnet route exists), and the VM's bounce model was wrong (it can never deliver a DSN).
Goal
Inbound mail for viktorbarzin.me must survive homelab outages without loss.
Requirement level (Viktor, 2026-07-04): never lose mail; delayed delivery is
acceptable; budget is $0 (hard constraint — reaffirmed after the Rollernet
gates failed). A store-and-forward backup MX queues mail while the homelab is
down and re-delivers when it returns.
Out of scope, explicitly:
- Reading new mail during an outage.
- Outbound mail during outages.
- The "primary up but hard-bouncing 5xx" misconfig class — a backup MX is never consulted when the primary answers. Separate hardening/alerting track.
Known residual limit (state it plainly): an outage longer than 30 days loses the queued mail silently — the VM cannot emit a bounce to anyone (egress 25 blocked), so no sender ever learns. Accepted; 30 days is already 6× the sender-retry status quo.
v1 → v2: why Rollernet was dropped (gate evidence, 2026-07-04)
v1 selected Roller Network's free Secondary MX. The validation gates killed it before any DNS change:
- G2 FAILED: the free-accounts policy caps free mail service at 200 relayed messages or 10 MB per rolling 7 days; overage → domain suspended 48 h answering SMTP 5xx (permanent bounces), repeatable. Spammers deliberately target backup MXes even while the primary is up, so background spam alone can hold the domain suspended — worse than no backup MX.
- G1 SHAKY: same policy page says free accounts are being discontinued.
- G3 PASSED (for posterity): mail{,2}.rollernet.us present valid LE certs over STARTTLS.
- Signup is Cloudflare-Turnstile-gated; moot given G1/G2.
Viktor's decision: stay free → self-host on Oracle Always-Free. [CH] The external challenger re-searched the free landscape (DNSExit, KisoLabs, DuoCircle, AWS/Azure/GCP/Hetzner/Fly/Vultr/Linode free tiers) and confirmed: no credible free managed backup-MX or free VM with a usable port-25 story exists in 2026 other than OCI. GCP's free e2-micro also blocks egress 25 and is US-regions-only (wrong continent).
Decision
A minimal Postfix store-and-forward relay (mx2.viktorbarzin.me) on an
Oracle Cloud Always-Free compute instance, published as a lower-preference
MX. It accepts mail for viktorbarzin.me when the primary is unreachable,
queues up to 30 days, and drains to the primary when it returns. No mailboxes,
no third-party terms — the queue-lifetime and reject-behavior knobs are ours.
Architecture
                         ┌── pri 1  mail.viktorbarzin.me ──► pfSense HAProxy ──► mailserver pod
sender MTA ──► MX lookup ┤                                          ▲
                         └── pri 20 mx2.viktorbarzin.me             │ drain: smtp to
                             (Oracle VM, Postfix relay,             │ mail.viktorbarzin.me:2526
                              queue ≤ 30 days) ─────────────────────┘ (pfSense WAN NAT rdr
                                                                       2526 → 10.0.20.1:25,
                                                                       existing HAProxy frontend)
- Normal operation: senders use pri 1; the VM idles (spammers targeting the backup + transient-blip retries get relayed onward immediately).
- Outage: senders fall back to pri 20 → VM accepts + queues → Postfix retries the primary on its native schedule → queue drains after recovery through the standard external ingress path (PROXY v2 → :2525 → rspamd → Dovecot).
- Custom drain port: Oracle blocks egress TCP 25 tenancy-wide (post-2021; exemptions unreliable), so the VM cannot reach mail.viktorbarzin.me:25. One pfSense WAN NAT rule, TCP 2526 → 10.0.20.1:25, reuses the existing HAProxy frontend unchanged. [CH] Verified against the runbook: the frontend binds *:25 on pfSense (not strictly 10.0.20.1), rdr dst-port rewrite is the existing production pattern (WAN:25 already rewrites to 10.0.20.1:25), and port 2526 collides with nothing (the HAProxy test frontend uses :2525). Inbound TCP 25 to the VM is unaffected by Oracle's egress-only block per practitioner evidence (iRedMail/mailcow on OCI: receive works, send doesn't); this is to be proven at gate O2 before any DNS change, because Oracle publishes no positive commitment.
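The failover behaviour described above can be sketched as a small Python model of a well-behaved sending MTA: try the MX hosts in ascending preference and use the first that answers. This is an illustration only. The function name `pick_mx` and the reachability callback are hypothetical, and real senders (spammers in particular) do not always follow preference order.

```python
from typing import Callable, Iterable, Optional, Tuple

MxRecord = Tuple[int, str]  # (preference, hostname)

# The MX set this design publishes.
MX_SET = [(20, "mx2.viktorbarzin.me"), (1, "mail.viktorbarzin.me")]


def pick_mx(records: Iterable[MxRecord],
            is_reachable: Callable[[str], bool]) -> Optional[str]:
    """Return the first reachable host, trying the lowest preference first."""
    for _pref, host in sorted(records):
        if is_reachable(host):
            return host
    return None  # nothing answers: the sender queues and retries on its own schedule


# Normal operation: the primary answers, so the backup is never consulted.
print(pick_mx(MX_SET, lambda host: True))                          # mail.viktorbarzin.me
# Homelab outage: only the Oracle VM answers, so mail lands in its queue.
print(pick_mx(MX_SET, lambda host: host == "mx2.viktorbarzin.me"))  # mx2.viktorbarzin.me
```

The `None` case corresponds to a simultaneous Oracle and homelab outage, where delivery depends on sender retries alone.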
Oracle account & instance
- Account: Viktor creates it (human signup; card for identity, $0 charged). Home region is fixed at signup and Always-Free compute exists only there, so choose eu-frankfurt-1 deliberately; there is no try-another-region fallback without a new account. [CH]
- [CH] PAYG conversion is a REQUIRED prerequisite, not a recommendation: Oracle stops idle Always-Free instances (95th-pct CPU < 20% over 7 days; an idle Postfix box qualifies) and demonstrably changes free-tier terms without notice, enforcing by termination (June 2026: A1 allowance silently halved, over-limit instances shut down). PAYG keeps Always-Free resources free and exempts them from idle reclamation.
- Shape: VM.Standard.E2.1.Micro (x86, 1/8 OCPU burst, 1 GB RAM; 2 always-free instances allowed; ample for queue-only Postfix, and untouched by the 2026 A1 cuts). ARM A1 fallback is unreliable (halved quota, chronic Frankfurt capacity); treat E2.1.Micro availability as the gate.
- [CH] Reserved public IP is mandatory (oci_core_public_ip, reserved): an ephemeral IP rotates on stop/start and would silently break all four IP-keyed controls at once (pfSense NAT source-restriction, the primary's smtpd/rspamd exemptions, the Oracle security list, Prometheus scrape allowlist), discovered only at the next outage's drain.
- OS: Ubuntu 24.04. [CH] OCI Ubuntu images ship an OS-level iptables ruleset (/etc/iptables/rules.v4) that ACCEPTs 22 and REJECTs everything else, independent of security lists. cloud-init must insert ACCEPT rules for 25/80 (+ scrape ports) ahead of the REJECT and persist them, or gate O2 fails on day 1 with a correct security list.
- Credentials: OCI API key for Terraform → Vault secret/viktor (oci_*); web login → Vaultwarden item Oracle Cloud (backup MX).
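The OS-level iptables point is about ordering, so a short sketch may help: the ACCEPT rules must land ahead of the image's REJECT rule, and re-running the step must not duplicate them. The Python below is a sketch under stated assumptions. `SAMPLE_RULES` is a simplified stand-in for the stock ruleset, not a copy of it; the helper names are hypothetical; and scoping the exporter ports to the homelab /32 follows the networking posture in the next section.

```python
from typing import Dict, Optional

# Assumed, simplified stand-in for the OCI Ubuntu /etc/iptables/rules.v4.
SAMPLE_RULES = """*filter
:INPUT ACCEPT [0:0]
-A INPUT -m state --state RELATED,ESTABLISHED -j ACCEPT
-A INPUT -p tcp -m state --state NEW -m tcp --dport 22 -j ACCEPT
-A INPUT -j REJECT --reject-with icmp-host-prohibited
COMMIT
"""

HOMELAB_WAN = "176.12.22.76/32"
# port -> allowed source CIDR (None means world-open)
PORTS: Dict[int, Optional[str]] = {25: None, 80: None,
                                   9100: HOMELAB_WAN, 9154: HOMELAB_WAN}


def accept_rule(port: int, source: Optional[str]) -> str:
    """Build one iptables-save style ACCEPT line for a TCP port."""
    src = f"-s {source} " if source else ""
    return f"-A INPUT {src}-p tcp -m state --state NEW -m tcp --dport {port} -j ACCEPT"


def allow_before_reject(rules: str, ports: Dict[int, Optional[str]]) -> str:
    """Insert any missing ACCEPT rules directly ahead of the first INPUT REJECT."""
    lines = rules.splitlines()
    wanted = [accept_rule(port, src) for port, src in ports.items()]
    missing = [rule for rule in wanted if rule not in lines]
    for i, line in enumerate(lines):
        if line.startswith("-A INPUT") and "-j REJECT" in line:
            lines[i:i] = missing  # no-op when everything is already present
            return "\n".join(lines) + "\n"
    raise ValueError("no INPUT REJECT rule found; refusing to guess an insert position")


if __name__ == "__main__":
    print(allow_before_reject(SAMPLE_RULES, PORTS), end="")
```

Failing loudly when no REJECT line is found is deliberate: an image whose ruleset has changed shape should stop the build rather than produce a firewall nobody has looked at.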
Networking & security posture
- Ingress on the VM: TCP 25 world-open (the service). [CH] TCP 80 world-open permanently: Let's Encrypt validation is multi-perspective with no published source IPs, so it cannot be source-scoped, and an "open-only-during-renewal" toggle is unspecified automation whose realistic failure mode is an expired cert at day ~90. Nothing listens on 80 outside certbot's seconds-long renewal windows; connection-refused surface is negligible. TCP 9100/9154 (exporters) restricted to the homelab WAN /32 (176.12.22.76) in both the Oracle security list and the VM firewall.
- No public SSH: management rides the headscale tailnet. cloud-init enrolls via a preauth key for a dedicated non-OIDC headscale user with node tag tag:backup-mx (headscale 0.28.0 file-mode ACL, content in Vault secret/headscale → headscale_acl); SSH bound to the tailnet interface. ACL grant: group:admin → tag:backup-mx:22 (cluster pods are NOT tailnet members; see monitoring). [CH] Outage caveat: headscale's control plane + DERP live in the cluster, so mid-outage tailnet reachability is cached-netmap best-effort; the runbook documents the OCI instance console connection as break-glass management. (Also fix vpn.md's stale "0.23.x / OIDC-only" claims while in there.)
- VM compromise blast radius: plaintext of outage-queued mail + a relay surface contained by relay_domains = viktorbarzin.me only, no submission ports, no SASL, no local delivery. The VM is deliberately NOT added to the primary's mynetworks (that would let a compromised VM relay arbitrary mail through the primary); per-stage exemptions instead, below.
Postfix configuration (relay-only, accept-and-queue with 4xx-only hygiene)
- relay_domains = viktorbarzin.me; mydestination = (empty).
- [CH] smtpd_relay_restrictions = permit_mynetworks, reject_unauth_destination: explicit 5xx for foreign-domain RCPTs (the default tail is defer_unauth_destination, whose 4xx invites every relay probe to retry forever).
- [CH] relay_recipient_maps explicitly set to the wildcard form (@viktorbarzin.me OK), which documents accept-all-recipients as a decision (the domain is catch-all; every RCPT is valid by definition).
- transport_maps: viktorbarzin.me smtp:[mail.viktorbarzin.me]:2526.
- maximal_queue_lifetime = 30d. [CH] bounce_queue_lifetime = 1d and delay_warning_time = 0: this host can never deliver a DSN to anyone (egress 25 blocked; its only egress is 2526 to the primary), so undeliverable bounces must be discarded quickly or they rot in the queue for a month and permanently poison the queue-depth alert.
- [CH] message_size_limit = 209715200, exactly the primary's 200 MB (POSTFIX_MESSAGE_SIZE_LIMIT, mailserver main.tf:88). The stock 10 MB default would 552-reject large legitimate mail during outages, the exact loss mode this project exists to prevent. Equal, never higher (higher recreates drain-time rejects).
- [CH] postscreen on the VM in 4xx-only posture: pregreet test ON (fire-and-forget bots don't retry; real MTAs do; the whole design already rests on sender retry, so 4xx filtering is loss-free by construction), optionally postscreen_dnsbl_action = defer with a conservative threshold. v2's blanket "no DNSBL" conflated 5xx reputation rejects (rightly banned) with 4xx tempfail (harmless); without any hygiene the backup is a 24/7 spam backdoor since spammers deliberately deliver to the highest-numbered MX. Zero 5xx from reputation, ever.
- inet_protocols = ipv4 [CH]: the primary publishes an AAAA (HE tunnel) but the IPv6 HAProxy bridge has no :2526 listener; skip the wasted v6 attempt per delivery.
- smtpd_tls_cert_file = LE cert for mx2.viktorbarzin.me (opportunistic STARTTLS inbound; smtp_tls_security_level = may on the drain leg).
- Queue disk: the ~45 GB free boot volume dwarfs any realistic 30-day accumulation for a personal domain.
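Several of the settings above are coupled to each other or to the primary, so a small pre-flight check in the cloud-init pipeline could catch drift before it reaches the VM. The Python below is a minimal sketch of that idea. The dictionary only collects values stated in the list above; `check` and `render` are hypothetical helpers; the map files and cert path are omitted because their locations are not specified here. postscreen is also left out: as far as I know the stock postscreen_*_action parameters accept ignore, enforce and drop, so how the 4xx-only posture maps onto them still has to be pinned down at implementation.

```python
from typing import Dict, List

PRIMARY_MESSAGE_SIZE_LIMIT = 200 * 1024 * 1024  # 209715200, the primary's limit

RELAY_MAIN_CF: Dict[str, str] = {
    "relay_domains": "viktorbarzin.me",
    "mydestination": "",
    "smtpd_relay_restrictions": "permit_mynetworks, reject_unauth_destination",
    "maximal_queue_lifetime": "30d",
    "bounce_queue_lifetime": "1d",
    "delay_warning_time": "0",
    "message_size_limit": str(PRIMARY_MESSAGE_SIZE_LIMIT),
    "inet_protocols": "ipv4",
    "smtp_tls_security_level": "may",
}


def days(value: str) -> int:
    """Parse a day-suffixed time value such as '30d'."""
    if not value.endswith("d"):
        raise ValueError(f"expected a value in days, got {value!r}")
    return int(value[:-1])


def check(cf: Dict[str, str]) -> List[str]:
    """Return the violated design invariants (an empty list means OK)."""
    problems = []
    if int(cf["message_size_limit"]) != PRIMARY_MESSAGE_SIZE_LIMIT:
        problems.append("message_size_limit must equal the primary's limit")
    if days(cf["bounce_queue_lifetime"]) >= days(cf["maximal_queue_lifetime"]):
        problems.append("bounces must expire well before regular mail")
    if "reject_unauth_destination" not in cf["smtpd_relay_restrictions"]:
        problems.append("foreign-domain RCPTs must get an explicit 5xx")
    if cf["mydestination"] != "":
        problems.append("relay-only host: mydestination must stay empty")
    return problems


def render(cf: Dict[str, str]) -> str:
    """Render the settings as main.cf lines."""
    return "".join(f"{key} = {value}".rstrip() + "\n" for key, value in cf.items())


if __name__ == "__main__":
    assert check(RELAY_MAIN_CF) == []
    print(render(RELAY_MAIN_CF), end="")
```

Encoding "equal, never higher" as an equality check means a later change to the primary's limit fails the build until the relay is updated to match.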
TLS
certbot standalone HTTP-01 for mx2.viktorbarzin.me (no Cloudflare API token
on an internet-facing VM). Port 80 permanently open (see above); certbot renew
timer. The MTA-STS follow-up (separate task; policy host currently dangling —
below) must list mx2.viktorbarzin.me when implemented.
Primary-side drain enablement [CH — this section replaces v2's "SPF/DMARC exemption + postscreen permit", which exempted the wrong layers]
The v2 exemptions targeted postscreen DNSBL (which is off on the primary —
ENABLE_DNSBL unset) and rspamd SPF/DMARC scoring — but missed the three
mechanisms that would actually break the drain. All are keyed on the VM's
reserved /32 (the PROXY-v2-recovered client IP):
- reject_unknown_client_hostname bypass: the primary sets POSTFIX_REJECT_UNKNOWN_CLIENT_HOSTNAME=1 (main.tf:89); an Oracle IP without full FCrDNS (PTR needs an Oracle SR; limited on free accounts) would be 450-deferred on every drain attempt → the queue never drains → mass-bounces at day 30. Fix: check_client_access permit for the VM /32 early in smtpd_client_restrictions, and a matching permit at the sender stage (SPOOF_PROTECTION=1 rejects unauthenticated own-domain envelope senders, so drained self-addressed/bounced mail would 5xx). Attempt the Oracle PTR anyway (belt and braces).
- Anvil rate-limit exception: smtpd_client_message_rate_limit = 30/min keys on the VM's IP at drain; a >3,600-message backlog would throttle for hours and false-fire the queue alert. Add the VM /32 to smtpd_client_event_limit_exceptions.
- rspamd: evaluate the original sender, never 5xx the drain stream. Via the existing override.d ConfigMap pattern (same mount as dkim_signing.conf): (a) configure rspamd's external_relay module (ip_map = VM /32) so SPF/DMARC/IP reputation evaluate against the original client IP parsed from the VM's Received header, which keeps DMARC protection for the entire drain stream instead of v2's blanket disable; (b) cap rspamd's action at the VM /32 to tag/fold, never milter-reject: the primary's default reject tier (DMS default, active since only dkim_signing is overridden today) would 5xx high-score spam at DATA, forcing the VM to generate DSNs to forged senders = classic backup-MX backscatter → mx2's IP blacklisted. Drained spam lands tagged in the catch-all's Junk instead. Validate the external_relay ↔ settings-rule interplay at gate O5 with a high-spam-score message.
- postscreen permit for the /32 (harmless; pregreet never trips a real Postfix client and DNSBL is off; kept for future-proofing only).
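The anvil figure is easy to check by hand: at 30 messages per minute, a 3,600-message backlog needs at least 120 minutes, which is exactly the 2 h window the queue alert uses. The Python sketch below works that arithmetic for a few backlog sizes. It is a best-case bound under the assumption that the VM delivers at exactly the permitted rate; in practice, over-limit attempts are deferred and retried on Postfix's backoff schedule, so the real drain would be slower. The function name is hypothetical.

```python
import math

ANVIL_RATE_LIMIT_PER_MIN = 30  # smtpd_client_message_rate_limit on the primary


def throttled_drain_minutes(backlog: int,
                            rate_per_min: int = ANVIL_RATE_LIMIT_PER_MIN) -> int:
    """Best-case minutes to drain `backlog` messages under the rate limit."""
    if backlog < 0 or rate_per_min <= 0:
        raise ValueError("backlog must be >= 0 and rate_per_min > 0")
    return math.ceil(backlog / rate_per_min)


for backlog in (300, 3600, 10000):
    print(backlog, throttled_drain_minutes(backlog))
# 300 10
# 3600 120
# 10000 334
```

Without the exception, any backlog above 3,600 messages therefore guarantees a false BackupMxQueueStuck during an otherwise healthy drain.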
Our-side changes (Terraform unless noted)
- New stack stacks/backup-mx/ (Tier 1): OCI provider (creds from Vault), VCN + subnet + security list + reserved public IP + VM.Standard.E2.1.Micro + cloud-init (templatefile): OS iptables ACCEPTs for 25/80/9100/9154 ahead of the OCI image's REJECT rule (persisted), postfix + config above, certbot, tailscale→headscale enrollment (preauth key from Vault), node_exporter, postfix_exporter, unattended-upgrades.
- DNS (stacks/cloudflared/modules/cloudflared/cloudflare.tf): A mx2.viktorbarzin.me → reserved IP (non-proxied), MX pref 20 → mx2. [CH] Live zone count verified: 195/200 → 197/200 after this change; only 3 slots remain and the MTA-STS follow-up needs 1–2 → plan the next record-purge now, not at collision time.
- pfSense (live network device; approved as part of this plan): WAN NAT rdr TCP 2526 → 10.0.20.1:25 + firewall rule, source-restricted to the reserved IP. [CH] Scripted (extend the existing scripts/pfsense-*-haproxy*.php bootstrap-script family), not hand-clicked, which keeps the git-rebuildable parity the rest of the pfSense mail config has. Config.xml rides the nightly backup.
- Mailserver stack: the four-layer drain enablement above (client+sender check_client_access permits, anvil exception, rspamd external_relay + action cap, postscreen permit), all keyed to one /32, via the existing postfix_cf / user-patches.sh / rspamd-override hook points (verified present: main.tf:129-144, 222-281, 467-474).
- Monitoring [CH: replaces v2's tailnet scraping, which had no transport: no cluster→tailnet route exists and no existing target is scraped that way]: Prometheus scrapes node_exporter/postfix_exporter on the VM's public reserved IP, allowed only from the homelab WAN /32 (Oracle SL + VM firewall); blackbox TCP:25 from the cluster (BackupMxDown, warning); MX-set drift assertion (both MX records present). Alerts: BackupMxQueueStuck = non-bounce queue depth > 0 for 2 h while the primary is healthy (gate on the existing MailServerDown/roundtrip series, machine-readable, not prose); bounce residue is excluded by the 1-day bounce lifetime. Note: during a full homelab outage Prometheus itself is down, so queue growth is unobservable live under ANY transport; what we actually watch is the post-recovery drain. A WAN-IP change stales the Oracle allowlist → visible as ScrapeTargetDown (self-signaling). Probe semantics note: once mx2 exists, the Brevo roundtrip probe's mail fails over to mx2 on transient primary blips and arrives minutes late via the drain, so EmailRoundtripFailing may then mean "delayed via mx2", not "lost"; note in the alert description and runbook.
- Docs (same commit as implementation): rewrite mailserver.md §"No Backup MX", new runbook docs/runbooks/backup-mx.md (postqueue -p, forced drain postqueue -f, cert renewal, OCI console break-glass, VM rebuild from stack, Oracle account facts incl. PAYG + home-region lock), vpn.md headscale-version/OIDC staleness fix, monitoring rows.
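The BackupMxQueueStuck condition has two moving parts (a 2 h hold and a primary-health gate), so it is worth spelling out as logic before writing the alert rule. The Python below is a sketch of the intended semantics over scraped samples, not the Prometheus rule itself. The sample tuple layout and the function name are assumptions made for illustration.

```python
from typing import Iterable, Optional, Tuple

# (unix seconds, non-bounce queue depth on mx2, primary healthy?)
Sample = Tuple[int, int, bool]

STUCK_AFTER_SECONDS = 2 * 60 * 60  # the "for 2 h" hold


def queue_stuck(samples: Iterable[Sample],
                hold: int = STUCK_AFTER_SECONDS) -> bool:
    """True if the queue stayed non-empty for `hold` seconds while the primary was healthy."""
    pending_since: Optional[int] = None
    for ts, depth, primary_healthy in sorted(samples):
        if depth > 0 and primary_healthy:
            if pending_since is None:
                pending_since = ts
            if ts - pending_since >= hold:
                return True
        else:
            # Either the queue drained or the primary is down; an outage
            # queue is expected and must not count toward the hold.
            pending_since = None
    return False


ten_min = range(0, 7201, 600)
print(queue_stuck([(t, 5, True) for t in ten_min]))                       # True
print(queue_stuck([(t, 5, False) for t in ten_min]))                      # False
print(queue_stuck([(t, 5 if t < 3600 else 0, True) for t in ten_min]))    # False
```

The reset in the `else` branch is the health gate: a queue that fills while the primary is down is working as designed, and the clock only starts once the primary is healthy again.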
MTA-STS finding (unchanged; no action in this change)
_mta-sts TXT is published but mta-sts.viktorbarzin.me has no record and
nothing serves the policy — MTA-STS is inert today. When fixed, the policy
MUST include mx: mx2.viktorbarzin.me (and budget its DNS records against the
3 remaining zone slots).
Validation gates (in order; any failure → stop and report)
| # | Gate | Method | Failure handling |
|---|---|---|---|
| O1 | Oracle account (home region eu-frankfurt-1, fixed forever at signup), PAYG conversion done, E2.1.Micro capacity | Viktor signs up + converts; TF apply | A1-in-home-region is a best-effort fallback only (halved quota, contended); else decision returns to Viktor |
| O2 | Inbound TCP 25 reachable from the internet (after the OS-iptables fix) | nc -zv <reserved-ip> 25 from outside + recurring Uptime-Kuma TCP monitor (keeps proving it; Oracle publishes no commitment) | Stop; decision returns to Viktor |
| O3 | Drain works: VM → mail.viktorbarzin.me:2526 delivers end-to-end | Test message injected on the VM | Debug pfSense NAT / HAProxy path |
| O4 | LE cert issued | certbot standalone | STARTTLS is opportunistic, so non-blocking for go-live; fix before MTA-STS |
| O5 | Live failover test, hardened [CH] | presence-claim → scale mailserver to 0 (~30 min) → send from Gmail + Brevo plus a high-spam-score message and a >10 MB message → confirm queued (postqueue -p) → scale up → verify full drain within the anvil-exception expectations, spam folded to Junk (not bounced), headers show original-IP SPF/DMARC evaluation, no DSN generated on the VM, roundtrip probe recovers | Debug or roll back (remove MX record) |
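Gate O2 boils down to a plain TCP connect from outside the homelab. Where nc is not at hand, the same check can be sketched in a few lines of Python using only the standard library. The function name is hypothetical, and the check proves only that the port accepts connections, not that Postfix answers with a banner.

```python
import socket


def tcp_port_open(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, unreachable, timed out, or DNS failure
        return False


# Usage from a machine outside the homelab (placeholder left as in the table):
#   tcp_port_open("<reserved-ip>", 25)
```

A timeout and a refusal both return False here; when debugging O2 the distinction matters, because a refusal suggests the OS firewall's REJECT rule while a timeout points more toward the security list.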
Failure modes
Covered: cluster/pod outages, pfSense/power/ISP outages ≤ 30 days, WAN IP changes, short-retry senders. If pfSense is down the drain waits — Postfix retries until it heals.
Not covered: primary-up-but-5xx misconfigs; outbound; mid-outage mailbox access; outages > 30 days lose queued mail silently (no DSN possible). Simultaneous Oracle+homelab outage = status quo ante (sender retries).
Newly introduced, accepted:
- A pet outside the cluster — deliberately cattle: rebuilt from TF + cloud-init, patched by unattended-upgrades, scraped by Prometheus. Never a backup target.
- Oracle free-tier caprice [CH: upgraded from v2's framing]: Oracle has silently cut Always-Free allowances and terminated over-limit instances (June 2026, A1). Mitigations: PAYG (required), recurring inbound-25 probe, BackupMxDown, and the fact that outside an active outage the queue is empty, so a surprise reclamation loses nothing, only coverage until rebuilt. Rollernet Basic ($30/yr) stays the documented fallback if OCI sours.
- Spam hygiene: 4xx-only postscreen on the VM (pregreet + conservative DNSBL-defer) instead of v2's nothing; drained spam is tagged/folded by rspamd, never bounced.
- Outage mail sits plaintext on Oracle disk ≤ 30 days (single-tenant; accepted).
Rollback
Remove the MX + A records; wait for postqueue -p empty; terraform destroy
on backup-mx; delete the pfSense NAT rule (scripted); drop the mailserver
/32 exemptions. Order matters: MX record first.
Viktor's manual steps (everything else is mine)
- Create the Oracle Cloud account: home region eu-frankfurt-1 (fixed forever), card for identity, $0 charged.
- Convert the tenancy to Pay-As-You-Go (required: idle-reclamation exemption; Always-Free stays $0).
- Hand me the tenancy OCID + a console user → I mint the API key, store creds (Vault + Vaultwarden), and build the stack.
- Approve the (scripted) pfSense NAT rule when I reach that step.