docs: design + ADR-0019 — free backup MX via Roller Network secondary MX
All checks were successful
ci/woodpecker/push/default Pipeline was successful
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Viktor wants inbound email to survive homelab outages without loss; delayed delivery is acceptable and the budget is zero, which rules out the previously doc-flagged Dynu option. Design adopts Roller Network's free Secondary MX (3-week store-and-forward queue, no forced filtering, catch-all-compatible) with our-side postscreen/rspamd whitelisting, five validation gates before any DNS change, and a live failover test. Also records the dangling-MTA-STS finding (TXT published, policy host absent) as a follow-up. Implementation starts only after Viktor reviews these docs; account will use rollernet@viktorbarzin.me. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
This commit is contained in:
parent
1a63fee4e4
commit
c91fa881e6
2 changed files with 244 additions and 0 deletions
65
docs/adr/0019-backup-mx-roller-network-free-tier.md
Normal file
65
docs/adr/0019-backup-mx-roller-network-free-tier.md
Normal file
|
|
@ -0,0 +1,65 @@
|
|||
# Inbound mail gets a free store-and-forward backup MX (Roller Network)
|
||||
|
||||
`viktorbarzin.me` has run a single direct MX to the home IP since the 2026-04-12
|
||||
inbound overhaul, with sender-MTA retry (1–5 days, sender-dependent) as the only
|
||||
outage protection — a documented "No Backup MX" decision made after ForwardEmail's
|
||||
forced anti-spoofing rejected legitimate forwarded mail and Cloudflare Email
|
||||
Routing proved pass-through-only (no queue). Viktor now wants inbound mail to
|
||||
survive homelab outages **without loss** (2026-07-04): delayed delivery is fine,
|
||||
mid-outage reading is not required, and the budget is **$0** — which rules out
|
||||
the doc-flagged Dynu fallback ($9.99/yr).
|
||||
|
||||
We adopt **Roller Network's free-tier Secondary MX** (`mail.rollernet.us` +
|
||||
`mail2.rollernet.us` at equal MX preference 20, primary untouched): a
|
||||
purpose-built store-and-forward relay with a **3-week queue** (sliding retries,
|
||||
15 min doubling to 1-week max), **no forced spam filtering** on the secondary
|
||||
path, and a valid-user table with a default-*allow-any* mode that preserves our
|
||||
catch-all's infinite ad-hoc aliases. Our side whitelists their relay CIDRs in
|
||||
postscreen (skip DNSBL/pregreet for queue drains) and exempts them from
|
||||
SPF/DMARC *scoring* in rspamd — the ForwardEmail lesson applied at the right
|
||||
layer; DKIM verification and content/AV scanning stay fully active. Go-live is
|
||||
gated: confirm the free tier still includes Secondary MX, confirm the 10 MB/day
|
||||
overage lock answers 4xx (defer) rather than 5xx (bounce), capture their
|
||||
authoritative relay CIDRs, apply the whitelist **before** the MX records, and
|
||||
finish with a live failover test (mailserver scaled to 0, probes from Gmail +
|
||||
Brevo, verified queue-and-drain). Design:
|
||||
[`plans/2026-07-04-backup-mx-rollernet-design.md`](../plans/2026-07-04-backup-mx-rollernet-design.md).
|
||||
|
||||
## Considered options
|
||||
|
||||
- **Dynu Email Backup ($9.99/yr)** — the previously doc-flagged option; simple,
|
||||
but queue lifetime is undocumented (FAQ hints at 12–24 h retry ceilings),
|
||||
filtering behaviour is a black box, and it costs money the free requirement
|
||||
excludes.
|
||||
- **Self-hosted VPS relay** (Hetzner ~€50/yr, or Oracle Always-Free at $0) —
|
||||
full control (30-day queue, own TLS/MTA-STS story), but a second
|
||||
internet-facing pet to patch and monitor; Oracle hard-blocks egress port 25,
|
||||
forcing delivery to the primary on a custom port, and idles risk free-tier
|
||||
reclamation.
|
||||
- **Cloudflare Email Routing / mailflare** — no store-and-forward (pass-through
|
||||
only) / a terminal inbox on Cloudflare respectively; both previously
|
||||
evaluated and rejected (2026-04-12; 2026-07-04, memory #7148).
|
||||
- **Harden-only** (guard hard-5xx misconfig modes, add paging) — cheaper but
|
||||
does not address multi-day outages or short-retry senders; deferred as a
|
||||
complementary track, not an alternative.
|
||||
|
||||
## Consequences
|
||||
|
||||
- Outage mail queues **in plaintext at a third party** for up to 3 weeks —
|
||||
accepted; same trust class as Brevo holding our outbound relay traffic.
|
||||
- The backup path bypasses postscreen DNSBL and SPF/DMARC scoring for
|
||||
Rollernet's CIDRs; content/AV/Bayes and DKIM verification still apply. A
|
||||
slight spam uptick during outages is possible (catch-all absorbs to `spam@`).
|
||||
- The free tier's **10 MB/day cap** locks the domain until midnight Pacific
|
||||
when exceeded; the G2 gate decides whether that lock defers (harmless) or
|
||||
bounces (revisit: paid tier or accept). Overage never affects the primary
|
||||
path — only mail arriving via the backup while locked.
|
||||
- Two more records in a Cloudflare zone already near the Free-plan 200-record
|
||||
cap (headroom must be verified at apply time).
|
||||
- **MTA-STS was found dangling** during design: the `_mta-sts` TXT is published
|
||||
but no policy host exists, so MTA-STS is inert today. Any future fix must
|
||||
list the Rollernet MX hosts in the policy or enforcing senders will skip the
|
||||
backup path.
|
||||
- `architecture/mailserver.md` §"No Backup MX" is superseded at implementation
|
||||
time; a new runbook covers ACC queue inspection, post-outage drain checks,
|
||||
and Accept-and-Hold for planned maintenance.
|
||||
179
docs/plans/2026-07-04-backup-mx-rollernet-design.md
Normal file
179
docs/plans/2026-07-04-backup-mx-rollernet-design.md
Normal file
|
|
@ -0,0 +1,179 @@
|
|||
# Backup MX via Roller Network free Secondary MX — design
|
||||
|
||||
Date: 2026-07-04 · Status: design approved pending user review, pre-implementation · ADR: [0019](../adr/0019-backup-mx-roller-network-free-tier.md)
|
||||
|
||||
## Goal
|
||||
|
||||
Inbound mail for `viktorbarzin.me` must survive homelab outages without loss.
|
||||
Requirement level (Viktor, 2026-07-04): **never lose mail; delayed delivery is
|
||||
acceptable; budget is $0**. A store-and-forward backup MX queues mail while the
|
||||
homelab is down and re-delivers when it returns.
|
||||
|
||||
Out of scope, explicitly:
|
||||
|
||||
- Reading new mail *during* an outage (would need a deliver-to-mailbox backup —
|
||||
rejected in favour of queue-only).
|
||||
- Outbound mail during outages.
|
||||
- The "primary up but hard-bouncing 5xx" misconfig class (e.g. broken alias map
|
||||
→ `550 user unknown`): a backup MX is never consulted when the primary
|
||||
answers. That is a separate hardening/alerting track.
|
||||
|
||||
## Current state and gap
|
||||
|
||||
- Single MX: `mail.viktorbarzin.me` (pri 1) → `176.12.22.76` → pfSense HAProxy
|
||||
(PROXY v2) → mailserver pod. No backup MX — documented decision in
|
||||
[`architecture/mailserver.md`](../architecture/mailserver.md) §"No Backup MX"
|
||||
(2026-04-12), which this design supersedes (ADR-0019).
|
||||
- Only protection today: sender MTAs queue and retry, typically 1–5 days.
|
||||
Loss vectors: outages longer than a sender's retry window, and senders with
|
||||
unusually short retry policies.
|
||||
- Prior art: **ForwardEmail** relay abandoned 2026-04-12 (its forced
|
||||
anti-spoofing rejected legitimate forwarded mail); **Cloudflare Email
|
||||
Routing** rejected (pass-through only, no queue); **Dynu** ($9.99/yr) was the
|
||||
doc-flagged fallback; **mailflare** (hieunc229) evaluated and rejected
|
||||
2026-07-04 (memory #7148).
|
||||
|
||||
## Decision
|
||||
|
||||
Adopt **Roller Network free-tier Secondary MX** (`mail.rollernet.us` +
|
||||
`mail2.rollernet.us`) as a store-and-forward backup MX. Rationale (full
|
||||
alternatives in ADR-0019):
|
||||
|
||||
- Purpose-built queue relay: **3-week queue**, sliding retry (15 min doubling
|
||||
to a max 1-week interval), queue storage not counted against the account.
|
||||
- Spam filtering on secondary MX is **optional and off by default** ("little to
|
||||
no spam filtering" per their FAQ) — avoids the ForwardEmail failure class.
|
||||
- **Catch-all compatible**: their valid-user table supports a default *allow
|
||||
any* ("catch-all/dropbox") action, preserving the `@viktorbarzin.me → spam@`
|
||||
infinite-alias pattern. New domains default to *deny* — must be flipped
|
||||
explicitly at setup.
|
||||
- Free; unlimited domains; config API; "Accept and Hold" mode usable for
|
||||
planned maintenance windows.
|
||||
|
||||
## Architecture
|
||||
|
||||
Normal operation (unchanged): senders resolve MX, prefer pri 1
|
||||
`mail.viktorbarzin.me`, deliver directly. Rollernet sits idle. (Spammers
|
||||
deliberately targeting the backup MX get relayed to the primary immediately —
|
||||
see failure modes.)
|
||||
|
||||
Outage: senders fail to connect to pri 1 → fall back to pri 20
|
||||
`mail{,2}.rollernet.us` → Rollernet accepts (allow-any user table), queues up
|
||||
to 3 weeks, retries the primary on a sliding schedule → queue drains
|
||||
automatically after recovery, entering via the standard external path (pfSense
|
||||
HAProxy → `:2525` postscreen, PROXY v2), then rspamd → Dovecot as usual.
|
||||
|
||||
```
|
||||
┌── pri 1 mail.viktorbarzin.me ──► pfSense HAProxy ──► mailserver pod
|
||||
sender MTA ──► MX lookup ┤ ▲
|
||||
└── pri 20 mail.rollernet.us ─┐ │ retry ≤ 3 weeks
|
||||
pri 20 mail2.rollernet.us ┴─► Rollernet queue ───────┘
|
||||
(only used when pri 1 unreachable)
|
||||
```
|
||||
|
||||
## Rollernet account & configuration (out-of-band SaaS, like Brevo)
|
||||
|
||||
- Account email: **`rollernet@viktorbarzin.me`** (Viktor, 2026-07-04; resolves
|
||||
via catch-all → `spam@`). Known circularity: during an outage their
|
||||
notifications to this address are themselves queued (at their side) until
|
||||
recovery. Accepted — credentials and config live in Vault and the runbook
|
||||
documents ACC access; nothing operational depends on receiving their mail
|
||||
mid-outage.
|
||||
- Credentials → Vault `secret/viktor` (`rollernet_password`, plus API key if
|
||||
minted).
|
||||
- Domain `viktorbarzin.me` in **Secondary MX** mode; valid-user table default
|
||||
action = **allow any** (catch-all).
|
||||
- `abuse@` / `postmaster@` must be deliverable (their RFC requirement) — the
|
||||
catch-all already satisfies this.
|
||||
- Record their **relay source CIDRs** from the post-signup Resource Access page
|
||||
(feeds the whitelist below). Their published mail ranges as of 2026 include
|
||||
`162.216.242.0/24` and `72.51.58.0/24` — confirm the authoritative list in
|
||||
the ACC.
|
||||
|
||||
## Our-side changes (all Terraform; worktree → master → CI apply)
|
||||
|
||||
1. **DNS** — `stacks/cloudflared/modules/cloudflared/cloudflare.tf`: add two MX
|
||||
records for the zone apex, `mail.rollernet.us` and `mail2.rollernet.us` at
|
||||
equal preference **20** (primary record untouched at pri 1). Implementation
|
||||
checks: (a) their MX-setup help page has a loop-avoidance rule about
|
||||
priority layout — confirm 1/20/20 matches their prescription post-signup;
|
||||
(b) **the zone sits near Cloudflare's Free-plan 200-record cap** (commit
|
||||
`1a63fee4` dropped 6 names for headroom) — verify ≥2 free slots before
|
||||
apply.
|
||||
2. **Postscreen whitelist** — `stacks/mailserver/modules/mailserver/main.tf`:
|
||||
mount a `postscreen_access.cidr` (permit Rollernet CIDRs) via the existing
|
||||
config ConfigMap and set `postscreen_access_list =
|
||||
permit_mynetworks, cidr:/tmp/docker-mailserver/postscreen_access.cidr` on
|
||||
the `:2525` alt listener (in `user-patches.sh`, where the listener is
|
||||
defined). Rationale: their relays must not be DNSBL-scored or
|
||||
pregreet-tested — queue drains would tempfail/deferred otherwise.
|
||||
3. **rspamd SPF/DMARC exemption** — same stack, via the established
|
||||
`override.d`/`local.d` ConfigMap-mount pattern (as `dkim_signing.conf`
|
||||
today): exempt the Rollernet CIDRs from **SPF and DMARC scoring only**
|
||||
(relayed mail arrives from their IPs, so envelope SPF legitimately fails —
|
||||
the exact ForwardEmail lesson applied on our side). Content, AV, Bayes and
|
||||
DKIM verification stay fully active; DKIM-signed senders still validate
|
||||
end-to-end through the relay.
|
||||
4. **Monitoring** — blackbox DNS assertion that the MX set contains all three
|
||||
hosts (drift guard, same pattern as `viktorbarzin-apex-probe`); alert on
|
||||
drift. Optional informational probe: TCP:25 reachability of
|
||||
`mail.rollernet.us` (their uptime, weekly cadence, no paging).
|
||||
5. **Docs (same commit as implementation)** — rewrite `mailserver.md` §"No
|
||||
Backup MX" (decision superseded by ADR-0019, new inbound flow + DNS table +
|
||||
monitoring rows), add `docs/runbooks/backup-mx-rollernet.md` (ACC queue
|
||||
inspection, post-outage drain verification, Accept-and-Hold for planned
|
||||
maintenance, overage semantics, whitelist upkeep if their CIDRs change).
|
||||
|
||||
### MTA-STS finding (no action in this change)
|
||||
|
||||
`_mta-sts.viktorbarzin.me TXT "v=STSv1; id=20260412"` is published, but
|
||||
`mta-sts.viktorbarzin.me` has **no public DNS record and nothing serves the
|
||||
policy file** → per RFC 8461 senders that see the TXT fail the HTTPS policy
|
||||
fetch and proceed as if no policy exists. MTA-STS is inert today (docs-vs-live
|
||||
mismatch vs the mailserver.md DNS table). Whenever it is fixed properly, the
|
||||
policy's `mx:` list MUST include `mail.rollernet.us` and `mail2.rollernet.us`,
|
||||
or MTA-STS-enforcing senders will refuse the backup path. Tracked as a
|
||||
follow-up, not part of this design.
|
||||
|
||||
## Validation gates (in order; any failure → stop and report)
|
||||
|
||||
| # | Gate | Method | Failure handling |
|
||||
|---|------|--------|------------------|
|
||||
| G1 | Free tier still includes Secondary MX (2026) | Signup + ACC | Decision returns to Viktor: Dynu $9.99/yr vs Rollernet Basic $30/yr vs Oracle-VM self-host |
|
||||
| G2 | 10 MB/day overage semantics: locked domain answers **4xx (defer)** not 5xx (bounce) | Their docs/support ticket before DNS golive | If 5xx: decision returns to Viktor (paid tier lifts cap, or accept the risk window) |
|
||||
| G3 | STARTTLS on their MX hosts (cert quality) | `openssl s_client -starttls smtp -connect mail.rollernet.us:25` | Informational now (blocks only the future MTA-STS fix) |
|
||||
| G4 | Authoritative relay CIDRs published | ACC Resource Access page | Whitelist (changes 2–3) MUST be applied **before** the MX records go live — ordering guard |
|
||||
| G5 | Live failover test | See below | Debug or roll back (remove MX records) |
|
||||
|
||||
**G5 live failover test**: `presence claim service:mailserver` → scale
|
||||
mailserver deployment to 0 (≈30 min window) → send probes from Gmail and via
|
||||
Brevo API → confirm both queue in Rollernet ACC → scale back to 1 → verify
|
||||
clean drain: delivered to `spam@`/target mailbox, headers show no SPF/DMARC
|
||||
penalty and no postscreen interference, DKIM still validates. Also verify the
|
||||
E2E roundtrip probe recovers on its own.
|
||||
|
||||
## Failure modes
|
||||
|
||||
Covered: cluster/pod outages, pfSense/power/ISP outages, WAN IP changes (queue
|
||||
holds while DNS is fixed), multi-day outages ≤ 3 weeks, short-retry senders.
|
||||
|
||||
Not covered (out of scope, above): primary-up-but-5xx misconfigs; outbound;
|
||||
mid-outage mailbox access.
|
||||
|
||||
Newly introduced, accepted:
|
||||
|
||||
- **Plaintext queue at a third party** during outages (same trust class as
|
||||
Brevo holding outbound today).
|
||||
- **Spam-path bypass**: mail via the backup skips postscreen DNSBL (their IPs
|
||||
are whitelisted) and SPF/DMARC scoring; rspamd content/AV/Bayes still apply.
|
||||
Slight spam uptick possible during outages; catch-all absorbs to `spam@`.
|
||||
- **10 MB/day cap** mid-outage → domain locks until midnight PT (severity
|
||||
depends on G2: defer = harmless delay at sender; bounce = loss → gate).
|
||||
- Rollernet outage while primary is also down = status quo ante (sender
|
||||
retries), never worse than today.
|
||||
|
||||
## Rollback
|
||||
|
||||
Remove the two MX records (TTL is automatic/low) and disable the domain in the
|
||||
ACC. Whitelist + rspamd exemption are inert without the MX records and may be
|
||||
reverted in the same commit or left for a retry.
|
||||
Loading…
Add table
Add a link
Reference in a new issue