infra/docs/plans/2026-07-04-backup-mx-rollernet-design.md
Viktor Barzin c91fa881e6
All checks were successful
ci/woodpecker/push/default Pipeline was successful
docs: design + ADR-0019 — free backup MX via Roller Network secondary MX
Viktor wants inbound email to survive homelab outages without loss;
delayed delivery is acceptable and the budget is zero, which rules out
the previously doc-flagged Dynu option. Design adopts Roller Network's
free Secondary MX (3-week store-and-forward queue, no forced filtering,
catch-all-compatible) with our-side postscreen/rspamd whitelisting,
five validation gates before any DNS change, and a live failover test.
Also records the dangling-MTA-STS finding (TXT published, policy host
absent) as a follow-up. Implementation starts only after Viktor reviews
these docs; account will use rollernet@viktorbarzin.me.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-07-04 09:59:16 +00:00

179 lines
10 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# Backup MX via Roller Network free Secondary MX — design
Date: 2026-07-04 · Status: design approved pending user review, pre-implementation · ADR: [0019](../adr/0019-backup-mx-roller-network-free-tier.md)
## Goal
Inbound mail for `viktorbarzin.me` must survive homelab outages without loss.
Requirement level (Viktor, 2026-07-04): **never lose mail; delayed delivery is
acceptable; budget is $0**. A store-and-forward backup MX queues mail while the
homelab is down and re-delivers when it returns.
Out of scope, explicitly:
- Reading new mail *during* an outage (would need a deliver-to-mailbox backup —
rejected in favour of queue-only).
- Outbound mail during outages.
- The "primary up but hard-bouncing 5xx" misconfig class (e.g. broken alias map
`550 user unknown`): a backup MX is never consulted when the primary
answers. That is a separate hardening/alerting track.
## Current state and gap
- Single MX: `mail.viktorbarzin.me` (pri 1) → `176.12.22.76` → pfSense HAProxy
(PROXY v2) → mailserver pod. No backup MX — documented decision in
[`architecture/mailserver.md`](../architecture/mailserver.md) §"No Backup MX"
(2026-04-12), which this design supersedes (ADR-0019).
- Only protection today: sender MTAs queue and retry, typically 15 days.
Loss vectors: outages longer than a sender's retry window, and senders with
unusually short retry policies.
- Prior art: **ForwardEmail** relay abandoned 2026-04-12 (its forced
anti-spoofing rejected legitimate forwarded mail); **Cloudflare Email
Routing** rejected (pass-through only, no queue); **Dynu** ($9.99/yr) was the
doc-flagged fallback; **mailflare** (hieunc229) evaluated and rejected
2026-07-04 (memory #7148).
## Decision
Adopt **Roller Network free-tier Secondary MX** (`mail.rollernet.us` +
`mail2.rollernet.us`) as a store-and-forward backup MX. Rationale (full
alternatives in ADR-0019):
- Purpose-built queue relay: **3-week queue**, sliding retry (15 min doubling
to a max 1-week interval), queue storage not counted against the account.
- Spam filtering on secondary MX is **optional and off by default** ("little to
no spam filtering" per their FAQ) — avoids the ForwardEmail failure class.
- **Catch-all compatible**: their valid-user table supports a default *allow
any* ("catch-all/dropbox") action, preserving the `@viktorbarzin.me → spam@`
infinite-alias pattern. New domains default to *deny* — must be flipped
explicitly at setup.
- Free; unlimited domains; config API; "Accept and Hold" mode usable for
planned maintenance windows.
## Architecture
Normal operation (unchanged): senders resolve MX, prefer pri 1
`mail.viktorbarzin.me`, deliver directly. Rollernet sits idle. (Spammers
deliberately targeting the backup MX get relayed to the primary immediately —
see failure modes.)
Outage: senders fail to connect to pri 1 → fall back to pri 20
`mail{,2}.rollernet.us` → Rollernet accepts (allow-any user table), queues up
to 3 weeks, retries the primary on a sliding schedule → queue drains
automatically after recovery, entering via the standard external path (pfSense
HAProxy → `:2525` postscreen, PROXY v2), then rspamd → Dovecot as usual.
```
┌── pri 1 mail.viktorbarzin.me ──► pfSense HAProxy ──► mailserver pod
sender MTA ──► MX lookup ┤ ▲
└── pri 20 mail.rollernet.us ─┐ │ retry ≤ 3 weeks
pri 20 mail2.rollernet.us ┴─► Rollernet queue ───────┘
(only used when pri 1 unreachable)
```
## Rollernet account & configuration (out-of-band SaaS, like Brevo)
- Account email: **`rollernet@viktorbarzin.me`** (Viktor, 2026-07-04; resolves
via catch-all → `spam@`). Known circularity: during an outage their
notifications to this address are themselves queued (at their side) until
recovery. Accepted — credentials and config live in Vault and the runbook
documents ACC access; nothing operational depends on receiving their mail
mid-outage.
- Credentials → Vault `secret/viktor` (`rollernet_password`, plus API key if
minted).
- Domain `viktorbarzin.me` in **Secondary MX** mode; valid-user table default
action = **allow any** (catch-all).
- `abuse@` / `postmaster@` must be deliverable (their RFC requirement) — the
catch-all already satisfies this.
- Record their **relay source CIDRs** from the post-signup Resource Access page
(feeds the whitelist below). Their published mail ranges as of 2026 include
`162.216.242.0/24` and `72.51.58.0/24` — confirm the authoritative list in
the ACC.
## Our-side changes (all Terraform; worktree → master → CI apply)
1. **DNS**`stacks/cloudflared/modules/cloudflared/cloudflare.tf`: add two MX
records for the zone apex, `mail.rollernet.us` and `mail2.rollernet.us` at
equal preference **20** (primary record untouched at pri 1). Implementation
checks: (a) their MX-setup help page has a loop-avoidance rule about
priority layout — confirm 1/20/20 matches their prescription post-signup;
(b) **the zone sits near Cloudflare's Free-plan 200-record cap** (commit
`1a63fee4` dropped 6 names for headroom) — verify ≥2 free slots before
apply.
2. **Postscreen whitelist**`stacks/mailserver/modules/mailserver/main.tf`:
mount a `postscreen_access.cidr` (permit Rollernet CIDRs) via the existing
config ConfigMap and set `postscreen_access_list =
permit_mynetworks, cidr:/tmp/docker-mailserver/postscreen_access.cidr` on
the `:2525` alt listener (in `user-patches.sh`, where the listener is
defined). Rationale: their relays must not be DNSBL-scored or
pregreet-tested — queue drains would tempfail/deferred otherwise.
3. **rspamd SPF/DMARC exemption** — same stack, via the established
`override.d`/`local.d` ConfigMap-mount pattern (as `dkim_signing.conf`
today): exempt the Rollernet CIDRs from **SPF and DMARC scoring only**
(relayed mail arrives from their IPs, so envelope SPF legitimately fails —
the exact ForwardEmail lesson applied on our side). Content, AV, Bayes and
DKIM verification stay fully active; DKIM-signed senders still validate
end-to-end through the relay.
4. **Monitoring** — blackbox DNS assertion that the MX set contains all three
hosts (drift guard, same pattern as `viktorbarzin-apex-probe`); alert on
drift. Optional informational probe: TCP:25 reachability of
`mail.rollernet.us` (their uptime, weekly cadence, no paging).
5. **Docs (same commit as implementation)** — rewrite `mailserver.md` §"No
Backup MX" (decision superseded by ADR-0019, new inbound flow + DNS table +
monitoring rows), add `docs/runbooks/backup-mx-rollernet.md` (ACC queue
inspection, post-outage drain verification, Accept-and-Hold for planned
maintenance, overage semantics, whitelist upkeep if their CIDRs change).
### MTA-STS finding (no action in this change)
`_mta-sts.viktorbarzin.me TXT "v=STSv1; id=20260412"` is published, but
`mta-sts.viktorbarzin.me` has **no public DNS record and nothing serves the
policy file** → per RFC 8461 senders that see the TXT fail the HTTPS policy
fetch and proceed as if no policy exists. MTA-STS is inert today (docs-vs-live
mismatch vs the mailserver.md DNS table). Whenever it is fixed properly, the
policy's `mx:` list MUST include `mail.rollernet.us` and `mail2.rollernet.us`,
or MTA-STS-enforcing senders will refuse the backup path. Tracked as a
follow-up, not part of this design.
## Validation gates (in order; any failure → stop and report)
| # | Gate | Method | Failure handling |
|---|------|--------|------------------|
| G1 | Free tier still includes Secondary MX (2026) | Signup + ACC | Decision returns to Viktor: Dynu $9.99/yr vs Rollernet Basic $30/yr vs Oracle-VM self-host |
| G2 | 10 MB/day overage semantics: locked domain answers **4xx (defer)** not 5xx (bounce) | Their docs/support ticket before DNS golive | If 5xx: decision returns to Viktor (paid tier lifts cap, or accept the risk window) |
| G3 | STARTTLS on their MX hosts (cert quality) | `openssl s_client -starttls smtp -connect mail.rollernet.us:25` | Informational now (blocks only the future MTA-STS fix) |
| G4 | Authoritative relay CIDRs published | ACC Resource Access page | Whitelist (changes 23) MUST be applied **before** the MX records go live — ordering guard |
| G5 | Live failover test | See below | Debug or roll back (remove MX records) |
**G5 live failover test**: `presence claim service:mailserver` → scale
mailserver deployment to 0 (≈30 min window) → send probes from Gmail and via
Brevo API → confirm both queue in Rollernet ACC → scale back to 1 → verify
clean drain: delivered to `spam@`/target mailbox, headers show no SPF/DMARC
penalty and no postscreen interference, DKIM still validates. Also verify the
E2E roundtrip probe recovers on its own.
## Failure modes
Covered: cluster/pod outages, pfSense/power/ISP outages, WAN IP changes (queue
holds while DNS is fixed), multi-day outages ≤ 3 weeks, short-retry senders.
Not covered (out of scope, above): primary-up-but-5xx misconfigs; outbound;
mid-outage mailbox access.
Newly introduced, accepted:
- **Plaintext queue at a third party** during outages (same trust class as
Brevo holding outbound today).
- **Spam-path bypass**: mail via the backup skips postscreen DNSBL (their IPs
are whitelisted) and SPF/DMARC scoring; rspamd content/AV/Bayes still apply.
Slight spam uptick possible during outages; catch-all absorbs to `spam@`.
- **10 MB/day cap** mid-outage → domain locks until midnight PT (severity
depends on G2: defer = harmless delay at sender; bounce = loss → gate).
- Rollernet outage while primary is also down = status quo ante (sender
retries), never worse than today.
## Rollback
Remove the two MX records (TTL is automatic/low) and disable the domain in the
ACC. Whitelist + rspamd exemption are inert without the MX records and may be
reverted in the same commit or left for a retry.