infra/docs/plans/2026-07-04-backup-mx-design.md
Viktor Barzin 114a7743ac
All checks were successful
ci/woodpecker/push/default Pipeline was successful
backup-mx: pivot to self-hosted Oracle relay; challenge-hardened design v3
Rollernet's free tier failed the validation gates before any DNS change
(200 msgs / 10 MB per rolling week, then 48h of SMTP 5xx bounces —
worse than no backup MX; free accounts being discontinued). Viktor
chose to stay free, so the backup MX becomes a Postfix store-and-forward
relay on an Oracle Always-Free VM (mx2.viktorbarzin.me, MX pref 20),
draining via port 2526 through the existing pfSense HAProxy frontend
since Oracle blocks egress 25.

Two independent adversarial reviews then fixed the design: primary-side
drain enablement moved to the layers that actually reject (unknown-
client-hostname, spoof protection, anvil limits, rspamd reject tier ->
external_relay + action cap, never backscatter), monitoring moved off
the nonexistent cluster->tailnet path to allowlisted public-IP scrapes,
bounce lifetime cut to 1d (the VM can never deliver DSNs), OCI OS-level
iptables + reserved-IP + mandatory PAYG requirements added, and 4xx-only
postscreen hygiene replaces the blanket no-filtering stance.

ADR-0019 and the design doc renamed accordingly (rollernet -> oracle).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-07-04 13:38:39 +00:00

335 lines
20 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# Backup MX — self-hosted store-and-forward relay on Oracle Always-Free — design
Date: 2026-07-04 (v3 — post-challenge; v2 Oracle pivot same day) · Status: design,
pre-implementation · ADR: [0019](../adr/0019-backup-mx-self-hosted-oracle-relay.md)
v3 incorporates two independent adversarial-challenge reviews (same day). Their
material corrections are marked **[CH]** throughout — the largest: the v2 drain
path would never have drained (primary-side smtpd rejects), monitoring-over-
tailnet was fiction (no cluster→tailnet route exists), and the VM's bounce
model was wrong (it can never deliver a DSN).
## Goal
Inbound mail for `viktorbarzin.me` must survive homelab outages without loss.
Requirement level (Viktor, 2026-07-04): **never lose mail; delayed delivery is
acceptable; budget is $0** (hard constraint — reaffirmed after the Rollernet
gates failed). A store-and-forward backup MX queues mail while the homelab is
down and re-delivers when it returns.
Out of scope, explicitly:
- Reading new mail *during* an outage.
- Outbound mail during outages.
- The "primary up but hard-bouncing 5xx" misconfig class — a backup MX is
never consulted when the primary answers. Separate hardening/alerting track.
Known residual limit (state it plainly): an outage **longer than 30 days**
loses the queued mail *silently* — the VM cannot emit a bounce to anyone
(egress 25 blocked), so no sender ever learns. Accepted; 30 days is already
6× the sender-retry status quo.
## v1 → v2: why Rollernet was dropped (gate evidence, 2026-07-04)
v1 selected Roller Network's free Secondary MX. The validation gates killed it
before any DNS change:
- **G2 FAILED**: the [free-accounts policy](https://rollernet.us/policy/free-accounts.html)
caps free mail service at **200 relayed messages or 10 MB per rolling 7
days**; overage → domain suspended **48 h answering SMTP 5xx** (permanent
bounces), repeatable. Spammers deliberately target backup MXes even while
the primary is up, so background spam alone can hold the domain suspended —
worse than no backup MX.
- **G1 SHAKY**: same policy page says free accounts are being discontinued.
- **G3 PASSED** (for posterity): `mail{,2}.rollernet.us` present valid LE
certs over STARTTLS.
- Signup is Cloudflare-Turnstile-gated — moot given G1/G2.
Viktor's decision: stay free → self-host on Oracle Always-Free. **[CH]** The
external challenger re-searched the free landscape (DNSExit, KisoLabs,
DuoCircle, AWS/Azure/GCP/Hetzner/Fly/Vultr/Linode free tiers) and confirmed:
no credible free managed backup-MX or free VM with a usable port-25 story
exists in 2026 other than OCI. GCP's free e2-micro also blocks egress 25 and
is US-regions-only (wrong continent).
## Decision
A minimal **Postfix store-and-forward relay** (`mx2.viktorbarzin.me`) on an
Oracle Cloud **Always-Free** compute instance, published as a lower-preference
MX. It accepts mail for `viktorbarzin.me` when the primary is unreachable,
queues up to 30 days, and drains to the primary when it returns. No mailboxes,
no third-party terms — the queue-lifetime and reject-behavior knobs are ours.
## Architecture
```
┌── pri 1 mail.viktorbarzin.me ──► pfSense HAProxy ──► mailserver pod
sender MTA ──► MX lookup ┤ ▲
└── pri 20 mx2.viktorbarzin.me │ drain: smtp to
(Oracle VM, Postfix relay, │ mail.viktorbarzin.me:2526
queue ≤ 30 days) ───────────────────┘ (pfSense WAN NAT rdr
2526 → 10.0.20.1:25,
existing HAProxy frontend)
```
- **Normal operation**: senders use pri 1; the VM idles (spammers targeting
the backup + transient-blip retries get relayed onward immediately).
- **Outage**: senders fall back to pri 20 → VM accepts + queues → Postfix
retries the primary on its native schedule → queue drains after recovery
through the standard external ingress path (PROXY v2 → :2525 → rspamd →
Dovecot).
- **Custom drain port**: Oracle blocks **egress TCP 25** tenancy-wide
(post-2021; exemptions unreliable) — the VM cannot reach
`mail.viktorbarzin.me:25`. One pfSense WAN NAT rule `TCP 2526 →
10.0.20.1:25` reuses the existing HAProxy frontend unchanged. **[CH]
Verified against the runbook**: the frontend binds `*:25` on pfSense (not
strictly 10.0.20.1), rdr dst-port rewrite is the existing production
pattern (WAN:25 already rewrites to 10.0.20.1:25), and port 2526 collides
with nothing (the HAProxy test frontend uses :2525). Inbound TCP 25 **to**
the VM is unaffected by Oracle's egress-only block per practitioner
evidence (iRedMail/mailcow on OCI: receive works, send doesn't) — **to be
proven at gate O2 before any DNS change** (Oracle publishes no positive
commitment).
## Oracle account & instance
- **Account**: Viktor creates it (human signup; card for identity, $0
charged). **Home region is fixed at signup and Always-Free compute exists
only there — choose `eu-frankfurt-1` deliberately; there is no
try-another-region fallback without a new account. [CH]**
- **[CH] PAYG conversion is a REQUIRED prerequisite, not a recommendation**:
Oracle stops idle Always-Free instances (95th-pct CPU < 20% over 7 days an
idle Postfix box qualifies) and demonstrably changes free-tier terms without
notice, enforcing by termination (June 2026: A1 allowance silently halved,
over-limit instances shut down). PAYG keeps Always-Free resources free and
exempts them from idle reclamation.
- **Shape**: `VM.Standard.E2.1.Micro` (x86, 1/8 OCPU burst, 1 GB RAM; 2
always-free instances allowed; ample for queue-only Postfix and untouched
by the 2026 A1 cuts). ARM A1 fallback is **unreliable** (halved quota,
chronic Frankfurt capacity) treat E2.1.Micro availability as the gate.
- **[CH] Reserved public IP is mandatory** (`oci_core_public_ip`, reserved):
an ephemeral IP rotates on stop/start and would silently break all four
IP-keyed controls at once (pfSense NAT source-restriction, the primary's
smtpd/rspamd exemptions, the Oracle security list, Prometheus scrape
allowlist) discovered only at the next outage's drain.
- **OS**: Ubuntu 24.04. **[CH] OCI Ubuntu images ship an OS-level iptables
ruleset (`/etc/iptables/rules.v4`) that ACCEPTs 22 and REJECTs everything
else, independent of security lists** cloud-init must insert ACCEPT rules
for 25/80 (+ scrape ports) ahead of the REJECT and persist them, or gate O2
fails on day 1 with a correct security list.
- **Credentials**: OCI API key for Terraform Vault `secret/viktor`
(`oci_*`); web login Vaultwarden item `Oracle Cloud (backup MX)`.
## Networking & security posture
- **Ingress on the VM**: TCP 25 world-open (the service). **[CH] TCP 80
world-open permanently** Let's Encrypt validation is multi-perspective
with no published source IPs, so it cannot be source-scoped, and a
"open-only-during-renewal" toggle is unspecified automation whose realistic
failure mode is an expired cert at day ~90. Nothing listens on 80 outside
certbot's seconds-long renewal windows; connection-refused surface is
negligible. TCP 9100/9154 (exporters) restricted to the homelab WAN /32
(176.12.22.76) in both the Oracle security list and the VM firewall.
- **No public SSH**: management rides the headscale tailnet cloud-init
enrolls via a **preauth key for a dedicated non-OIDC headscale user** with
node tag `tag:backup-mx` (headscale 0.28.0 file-mode ACL, content in Vault
`secret/headscale` `headscale_acl`); SSH bound to the tailnet interface.
ACL grant: `group:admin → tag:backup-mx:22` (cluster pods are NOT tailnet
members see monitoring). **[CH] Outage caveat**: headscale's control
plane + DERP live in the cluster, so mid-outage tailnet reachability is
cached-netmap best-effort the runbook documents the **OCI instance
console connection as break-glass** management. (Also fix `vpn.md`'s stale
"0.23.x / OIDC-only" claims while in there.)
- **VM compromise blast radius**: plaintext of outage-queued mail + a relay
surface contained by `relay_domains = viktorbarzin.me` only, no submission
ports, no SASL, no local delivery. The VM is deliberately NOT added to the
primary's `mynetworks` (that would let a compromised VM relay arbitrary
mail *through* the primary) per-stage exemptions instead, below.
## Postfix configuration (relay-only, accept-and-queue with 4xx-only hygiene)
- `relay_domains = viktorbarzin.me`; `mydestination =` (empty).
- **[CH]** `smtpd_relay_restrictions = permit_mynetworks,
reject_unauth_destination` — explicit 5xx for foreign-domain RCPTs (the
default tail is `defer_unauth_destination`, whose 4xx invites every relay
probe to retry forever).
- **[CH]** `relay_recipient_maps` explicitly set to the wildcard form
(`@viktorbarzin.me OK`) — documents accept-all-recipients as a decision
(the domain is catch-all; every RCPT is valid by definition).
- `transport_maps`: `viktorbarzin.me smtp:[mail.viktorbarzin.me]:2526`.
- `maximal_queue_lifetime = 30d`. **[CH]** `bounce_queue_lifetime = 1d` and
`delay_warning_time = 0` — this host can never deliver a DSN to anyone
(egress 25 blocked; its only egress is 2526 to the primary), so undeliverable
bounces must be discarded quickly or they rot in the queue for a month and
permanently poison the queue-depth alert.
- **[CH]** `message_size_limit = 209715200` — exactly the primary's 200 MB
(`POSTFIX_MESSAGE_SIZE_LIMIT`, mailserver main.tf:88). The stock 10 MB
default would 552-reject large legitimate mail during outages — the exact
loss mode this project exists to prevent. Equal, never higher (higher
recreates drain-time rejects).
- **[CH] postscreen on the VM in 4xx-only posture**: pregreet test ON
(fire-and-forget bots don't retry; real MTAs do — the whole design already
rests on sender retry, so 4xx filtering is loss-free by construction),
optionally `postscreen_dnsbl_action = defer` with a conservative threshold.
v2's blanket "no DNSBL" conflated 5xx reputation rejects (rightly banned)
with 4xx tempfail (harmless); without any hygiene the backup is a 24/7
spam backdoor since spammers deliberately deliver to the highest-numbered
MX. Zero 5xx from reputation, ever.
- `inet_protocols = ipv4` **[CH]** — the primary publishes an AAAA (HE
tunnel) but the IPv6 HAProxy bridge has no :2526 listener; skip the wasted
v6 attempt per delivery.
- `smtpd_tls_cert_file` = LE cert for `mx2.viktorbarzin.me` (opportunistic
STARTTLS inbound; `smtp_tls_security_level = may` on the drain leg).
- Queue disk: the ~45 GB free boot volume dwarfs any realistic 30-day
accumulation for a personal domain.
## TLS
certbot standalone HTTP-01 for `mx2.viktorbarzin.me` (no Cloudflare API token
on an internet-facing VM). Port 80 permanently open (see above); certbot renew
timer. The MTA-STS follow-up (separate task; policy host currently dangling —
below) must list `mx2.viktorbarzin.me` when implemented.
## Primary-side drain enablement **[CH — this section replaces v2's "SPF/DMARC exemption + postscreen permit", which exempted the wrong layers]**
The v2 exemptions targeted postscreen DNSBL (which is **off** on the primary —
`ENABLE_DNSBL` unset) and rspamd SPF/DMARC scoring — but missed the three
mechanisms that would actually break the drain. All are keyed on the VM's
reserved /32 (the PROXY-v2-recovered client IP):
1. **`reject_unknown_client_hostname` bypass** — the primary sets
`POSTFIX_REJECT_UNKNOWN_CLIENT_HOSTNAME=1` (main.tf:89); an Oracle IP
without full FCrDNS (PTR needs an Oracle SR; limited on free accounts)
would be **450-deferred on every drain attempt → the queue never drains →
mass-bounces at day 30**. Fix: `check_client_access` permit for the VM /32
early in `smtpd_client_restrictions`, and a matching permit at the sender
stage (SPOOF_PROTECTION=1 rejects unauthenticated own-domain envelope
senders — drained self-addressed/bounced mail would 5xx). Attempt the
Oracle PTR anyway (belt and braces).
2. **Anvil rate-limit exception** — `smtpd_client_message_rate_limit = 30`/min
keys on the VM's IP at drain; a >3,600-message backlog would throttle for
hours and false-fire the queue alert. Add the VM /32 to
`smtpd_client_event_limit_exceptions`.
3. **rspamd: evaluate the original sender, never 5xx the drain stream** — via
the existing override.d ConfigMap pattern (same mount as
`dkim_signing.conf`): (a) configure rspamd's **`external_relay`** module
(ip_map = VM /32) so SPF/DMARC/IP reputation evaluate against the
*original* client IP parsed from the VM's Received header — this keeps
DMARC protection for the entire drain stream instead of v2's blanket
disable; (b) cap rspamd's **action at the VM /32 to tag/fold — never
milter-reject**: the primary's default reject tier (DMS default, active
since only dkim_signing is overridden today) would 5xx high-score spam at
DATA, forcing the VM to generate DSNs to forged senders = classic
backup-MX backscatter → mx2's IP blacklisted. Drained spam lands tagged in
the catch-all's Junk instead. Validate the external_relay ↔ settings-rule
interplay at gate O5 with a high-spam-score message.
4. postscreen permit for the /32 (harmless; pregreet never trips a real
Postfix client and DNSBL is off — kept for future-proofing only).
## Our-side changes (Terraform unless noted)
1. **New stack `stacks/backup-mx/`** (Tier 1): OCI provider (creds from
Vault), VCN + subnet + security list + **reserved public IP** +
`VM.Standard.E2.1.Micro` + cloud-init (`templatefile`): **OS iptables
ACCEPTs for 25/80/9100/9154 ahead of the OCI image's REJECT rule
(persisted)**, postfix + config above, certbot, tailscale→headscale
enrollment (preauth key from Vault), node_exporter, postfix_exporter,
unattended-upgrades.
2. **DNS** — `stacks/cloudflared/modules/cloudflared/cloudflare.tf`: A
`mx2.viktorbarzin.me` → reserved IP (non-proxied), MX pref 20 → `mx2`.
**[CH] Live zone count verified: 195/200 → 197/200 after this change; only
3 slots remain and the MTA-STS follow-up needs 12 → plan the next
record-purge now, not at collision time.**
3. **pfSense (live network device — approved as part of this plan)**: WAN NAT
rdr `TCP 2526 10.0.20.1:25` + firewall rule, source-restricted to the
reserved IP. **[CH] Scripted** (extend the existing
`scripts/pfsense-*-haproxy*.php` bootstrap-script family), not
hand-clicked — keeps the git-rebuildable parity the rest of the pfSense
mail config has. Config.xml rides the nightly backup.
4. **Mailserver stack**: the four-layer drain enablement above (client+sender
`check_client_access` permits, anvil exception, rspamd external_relay +
action cap, postscreen permit) — all keyed to one /32, via the existing
`postfix_cf` / `user-patches.sh` / rspamd-override hook points (verified
present: main.tf:129-144, 222-281, 467-474).
5. **Monitoring [CH — replaces v2's tailnet scraping, which had no transport:
no cluster→tailnet route exists and no existing target is scraped that
way]**: Prometheus scrapes `node_exporter`/`postfix_exporter` on the VM's
**public reserved IP**, allowed only from the homelab WAN /32 (Oracle SL +
VM firewall); blackbox TCP:25 from the cluster (`BackupMxDown`, warning);
MX-set drift assertion (both MX records present). Alerts:
`BackupMxQueueStuck` = **non-bounce** queue depth > 0 for 2 h while the
primary is healthy (gate on the existing `MailServerDown`/roundtrip
series, machine-readable — not prose); bounce residue is excluded by the
1-day bounce lifetime. Note: during a full homelab outage Prometheus
itself is down — queue growth is unobservable live under ANY transport;
what we actually watch is the post-recovery drain. A WAN-IP change stales
the Oracle allowlist → visible as ScrapeTargetDown (self-signaling).
**Probe semantics note**: once mx2 exists, the Brevo roundtrip probe's
mail fails over to mx2 on transient primary blips and arrives minutes late
via the drain — `EmailRoundtripFailing` may then mean "delayed via mx2",
not "lost"; note in the alert description and runbook.
6. **Docs (same commit as implementation)**: rewrite `mailserver.md` §"No
Backup MX", new runbook `docs/runbooks/backup-mx.md` (`postqueue -p`,
forced drain `postqueue -f`, cert renewal, **OCI console break-glass**, VM
rebuild from stack, Oracle account facts incl. PAYG + home-region lock),
`vpn.md` headscale-version/OIDC staleness fix, monitoring rows.
### MTA-STS finding (unchanged; no action in this change)
`_mta-sts` TXT is published but `mta-sts.viktorbarzin.me` has no record and
nothing serves the policy — MTA-STS is inert today. When fixed, the policy
MUST include `mx: mx2.viktorbarzin.me` (and budget its DNS records against the
3 remaining zone slots).
## Validation gates (in order; any failure → stop and report)
| # | Gate | Method | Failure handling |
|---|------|--------|------------------|
| O1 | Oracle account (home region `eu-frankfurt-1`, **fixed forever at signup**), **PAYG conversion done**, E2.1.Micro capacity | Viktor signs up + converts; TF apply | A1-in-home-region is a best-effort fallback only (halved quota, contended); else decision returns to Viktor |
| O2 | Inbound TCP 25 reachable from the internet (after the OS-iptables fix) | `nc -zv <reserved-ip> 25` from outside + recurring Uptime-Kuma TCP monitor (keeps proving it — Oracle publishes no commitment) | Stop; decision returns to Viktor |
| O3 | Drain works: VM → `mail.viktorbarzin.me:2526` delivers end-to-end | Test message injected on the VM | Debug pfSense NAT / HAProxy path |
| O4 | LE cert issued | certbot standalone | STARTTLS is opportunistic — non-blocking for go-live; fix before MTA-STS |
| O5 | Live failover test — **hardened [CH]** | presence-claim → scale mailserver to 0 (~30 min) → send from Gmail + Brevo **plus a high-spam-score message and a >10 MB message** → confirm queued (`postqueue -p`) → scale up → verify full drain within the anvil-exception expectations, spam folded to Junk (not bounced), headers show original-IP SPF/DMARC evaluation, no DSN generated on the VM, roundtrip probe recovers | Debug or roll back (remove MX record) |
## Failure modes
Covered: cluster/pod outages, pfSense/power/ISP outages ≤ 30 days, WAN IP
changes, short-retry senders. If pfSense is down the drain waits — Postfix
retries until it heals.
Not covered: primary-up-but-5xx misconfigs; outbound; mid-outage mailbox
access; **outages > 30 days lose queued mail silently (no DSN possible)**.
Simultaneous Oracle+homelab outage = status quo ante (sender retries).
Newly introduced, accepted:
- **A pet outside the cluster** — deliberately cattle: rebuilt from TF +
cloud-init, patched by unattended-upgrades, scraped by Prometheus. Never a
backup target.
- **Oracle free-tier caprice [CH — upgraded from v2's framing]**: Oracle has
silently cut Always-Free allowances and terminated over-limit instances
(June 2026, A1). Mitigations: PAYG (required), recurring inbound-25 probe,
`BackupMxDown`, and the fact that outside an active outage the queue is
empty — a surprise reclamation loses nothing, only coverage until rebuilt.
Rollernet Basic ($30/yr) stays the documented fallback if OCI sours.
- **Spam hygiene**: 4xx-only postscreen on the VM (pregreet + conservative
DNSBL-defer) instead of v2's nothing; drained spam is tagged/folded by
rspamd, never bounced.
- Outage mail sits plaintext on Oracle disk ≤ 30 days (single-tenant;
accepted).
## Rollback
Remove the MX + A records; wait for `postqueue -p` empty; `terraform destroy`
on `backup-mx`; delete the pfSense NAT rule (scripted); drop the mailserver
/32 exemptions. Order matters: MX record first.
## Viktor's manual steps (everything else is mine)
1. Create the Oracle Cloud account — **home region `eu-frankfurt-1`** (fixed
forever), card for identity, $0 charged.
2. **Convert the tenancy to Pay-As-You-Go** (required — idle-reclamation
exemption; Always-Free stays $0).
3. Hand me the tenancy OCID + a console user → I mint the API key, store
creds (Vault + Vaultwarden), and build the stack.
4. Approve the (scripted) pfSense NAT rule when I reach that step.