backup-mx: pivot to self-hosted Oracle relay; challenge-hardened design v3
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Rollernet's free tier failed the validation gates before any DNS change: it allows 200 msgs / 10 MB per rolling week, then answers with 48h of SMTP 5xx bounces, which is worse than no backup MX, and free accounts are being discontinued. Viktor chose to stay free, so the backup MX becomes a Postfix store-and-forward relay on an Oracle Always-Free VM (mx2.viktorbarzin.me, MX pref 20), draining via port 2526 through the existing pfSense HAProxy frontend since Oracle blocks egress 25.

Two independent adversarial reviews then fixed the design: primary-side drain enablement moved to the layers that actually reject (unknown-client-hostname, spoof protection, anvil limits, rspamd reject tier -> external_relay + action cap, never backscatter), monitoring moved off the nonexistent cluster->tailnet path to allowlisted public-IP scrapes, bounce lifetime cut to 1d (the VM can never deliver DSNs), OCI OS-level iptables + reserved-IP + mandatory PAYG requirements added, and 4xx-only postscreen hygiene replaces the blanket no-filtering stance.

ADR-0019 and the design doc renamed accordingly (rollernet -> oracle).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
parent c1ffed17a9
commit 114a7743ac
4 changed files with 432 additions and 246 deletions

@@ -1,65 +0,0 @@
# Inbound mail gets a free store-and-forward backup MX (Roller Network)

`viktorbarzin.me` has run a single direct MX to the home IP since the 2026-04-12
inbound overhaul, with sender-MTA retry (1–5 days, sender-dependent) as the only
outage protection. That was a documented "No Backup MX" decision, made after
ForwardEmail's forced anti-spoofing rejected legitimate forwarded mail and
Cloudflare Email Routing proved pass-through-only (no queue). Viktor now wants
inbound mail to survive homelab outages **without loss** (2026-07-04): delayed
delivery is fine, mid-outage reading is not required, and the budget is **$0**,
which rules out the doc-flagged Dynu fallback ($9.99/yr).

We adopt **Roller Network's free-tier Secondary MX** (`mail.rollernet.us` +
`mail2.rollernet.us` at equal MX preference 20, primary untouched): a
purpose-built store-and-forward relay with a **3-week queue** (sliding retries,
15 min doubling to 1-week max), **no forced spam filtering** on the secondary
path, and a valid-user table with a default-*allow-any* mode that preserves our
catch-all's infinite ad-hoc aliases. Our side whitelists their relay CIDRs in
postscreen (skip DNSBL/pregreet for queue drains) and exempts them from
SPF/DMARC *scoring* in rspamd, which applies the ForwardEmail lesson at the
right layer; DKIM verification and content/AV scanning stay fully active.
Go-live is gated: confirm the free tier still includes Secondary MX, confirm
the 10 MB/day overage lock answers 4xx (defer) rather than 5xx (bounce), capture
their authoritative relay CIDRs, apply the whitelist **before** the MX records,
and finish with a live failover test (mailserver scaled to 0, probes from Gmail +
Brevo, verified queue-and-drain). Design:
[`plans/2026-07-04-backup-mx-rollernet-design.md`](../plans/2026-07-04-backup-mx-rollernet-design.md).

## Considered options

- **Dynu Email Backup ($9.99/yr)**: the previously doc-flagged option; simple,
  but queue lifetime is undocumented (FAQ hints at 12–24 h retry ceilings),
  filtering behaviour is a black box, and it costs money the free requirement
  excludes.
- **Self-hosted VPS relay** (Hetzner ~€50/yr, or Oracle Always-Free at $0):
  full control (30-day queue, own TLS/MTA-STS story), but a second
  internet-facing pet to patch and monitor; Oracle hard-blocks egress port 25,
  forcing delivery to the primary on a custom port, and idle instances risk
  free-tier reclamation.
- **Cloudflare Email Routing / mailflare**: no store-and-forward (pass-through
  only) / a terminal inbox on Cloudflare respectively; both previously
  evaluated and rejected (2026-04-12; 2026-07-04, memory #7148).
- **Harden-only** (guard hard-5xx misconfig modes, add paging): cheaper but
  does not address multi-day outages or short-retry senders; deferred as a
  complementary track, not an alternative.

## Consequences

- Outage mail queues **in plaintext at a third party** for up to 3 weeks.
  Accepted; same trust class as Brevo holding our outbound relay traffic.
- The backup path bypasses postscreen DNSBL and SPF/DMARC scoring for
  Rollernet's CIDRs; content/AV/Bayes and DKIM verification still apply. A
  slight spam uptick during outages is possible (catch-all absorbs to `spam@`).
- The free tier's **10 MB/day cap** locks the domain until midnight Pacific
  when exceeded; the G2 gate decides whether that lock defers (harmless) or
  bounces (revisit: paid tier or accept). Overage never affects the primary
  path, only mail arriving via the backup while locked.
- Two more records in a Cloudflare zone already near the Free-plan 200-record
  cap (headroom must be verified at apply time).
- **MTA-STS was found dangling** during design: the `_mta-sts` TXT is published
  but no policy host exists, so MTA-STS is inert today. Any future fix must
  list the Rollernet MX hosts in the policy or enforcing senders will skip the
  backup path.
- `architecture/mailserver.md` §"No Backup MX" is superseded at implementation
  time; a new runbook covers ACC queue inspection, post-outage drain checks,
  and Accept-and-Hold for planned maintenance.

97
docs/adr/0019-backup-mx-self-hosted-oracle-relay.md
Normal file
@@ -0,0 +1,97 @@
# Inbound mail gets a self-hosted store-and-forward backup MX on Oracle Always-Free

`viktorbarzin.me` has run a single direct MX to the home IP since the 2026-04-12
inbound overhaul, with sender-MTA retry (1–5 days, sender-dependent) as the only
outage protection. That was a documented "No Backup MX" decision, made after
ForwardEmail's forced anti-spoofing rejected legitimate forwarded mail and
Cloudflare Email Routing proved pass-through-only. Viktor now wants inbound mail
to survive homelab outages **without loss** (2026-07-04): delayed delivery is
fine, mid-outage reading is not required, and the budget is **$0**, a hard
constraint that eliminated every managed option (see below).

We run a minimal **Postfix store-and-forward relay on an Oracle Cloud
Always-Free `VM.Standard.E2.1.Micro`** (`mx2.viktorbarzin.me`, **reserved**
public IP, MX preference 20; primary untouched at 1). It accepts everything
for the domain: the domain is catch-all, so every RCPT is valid. Reputation
checks may only ever 4xx-defer (postscreen pregreet + conservative DNSBL-defer
on the VM), never 5xx, because a backup MX that hard-rejects manufactures the
loss it exists to prevent. It queues up to **30 days** (bounce lifetime 1 day,
since the VM can never deliver a DSN: its only egress is the drain) and drains
to the primary over **port 2526**, via one scripted pfSense WAN NAT rule onto
the existing HAProxy frontend, because Oracle blocks egress TCP 25 tenancy-wide.

Management is tailnet-only (headscale preauth key, `tag:backup-mx`; OCI console
as mid-outage break-glass since headscale itself lives in the cluster). TLS
comes from certbot HTTP-01 (port 80 permanently open, because LE validation is
multi-perspective and unscopeable). The VM is a cattle-rebuild from a new
`stacks/backup-mx/` Terraform stack (OCI provider + cloud-init, which must
also punch 25/80 through the OCI Ubuntu image's OS-level iptables REJECT).

On the primary, the drain stream (one /32) is enabled at the layers that
actually bite: `check_client_access` permits past
`reject_unknown_client_hostname` and spoof-protection, an anvil rate-limit
exception, and rspamd `external_relay` (score against the *original* sender
IP) with the reject action capped to tag/fold so drained spam can never force
the VM to emit backscatter.

Go-live is gated on empirical checks: inbound-25 reachability (a recurring
probe, since Oracle publishes no commitment), drain end-to-end, and a live
failover test that includes a high-spam-score message and a message larger
than 10 MB. Two independent adversarial reviews (2026-07-04) shaped this final
form. Design:
[`plans/2026-07-04-backup-mx-design.md`](../plans/2026-07-04-backup-mx-design.md).

## Considered options

- **Roller Network free Secondary MX**: v1 of this decision, killed at the
  validation gates the same day. The free tier caps at 200 relayed messages or
  10 MB per rolling 7 days, and overage suspends the domain for 48 h
  answering **SMTP 5xx** (permanent bounces). Since spammers target backup
  MXes even while the primary is up, background spam alone can hold it
  suspended, making it *worse than no backup MX*. Free accounts are also
  being discontinued. (Their TLS checked out; their paid Basic at $30/yr is
  the documented fallback if the OCI route sours.)
- **Dynu Email Backup ($9.99/yr)**: queue lifetime undocumented (FAQ hints
  12–24 h, barely beating sender retry); filtering black-box; not free.
- **Cloudflare Email Routing / mailflare**: no store-and-forward / terminal
  inbox on Cloudflare; rejected earlier (2026-04-12; 2026-07-04 memory #7148).
- **Other free tiers** (challenged and re-verified 2026-07-04): GCP e2-micro
  blocks egress 25 too and its free regions are US-only; AWS's 2025+ "free"
  plan is a 6-month credit; Azure has no always-free VM and blocks 25;
  Hetzner has no free tier; Fly.io ended free allowances; Vultr/Linode are
  trial credits; DNSExit/KisoLabs/DuoCircle backup-MX are paid or dead. OCI
  is the only standing free option.
- **Harden-only** (5xx-misconfig guards + paging): does not address
  multi-day outages or short-retry senders; deferred as a complementary
  track.

## Consequences

- **A pet outside the cluster**, deliberately treated as cattle: rebuilt
  entirely from Terraform + cloud-init, patched by unattended-upgrades, scraped
  by the cluster's Prometheus (exporters on the reserved public IP, allowlisted
  to the homelab WAN /32; there is **no cluster→tailnet route**, so tailnet
  scraping was rejected as fictional; blackbox TCP:25 + MX-set drift alerts
  besides). Never a backup target itself.
- **Oracle free-tier caprice is the top risk**: Oracle silently halved the A1
  free allowance in June 2026 and terminated over-limit instances, and
  publishes no commitment that inbound 25 stays open. Mitigations:
  **Pay-As-You-Go conversion is a required prerequisite** (exempts idle
  reclamation, stays $0), a recurring inbound-25 probe, `BackupMxDown`, and
  the queue being empty outside outages (a surprise reclamation loses
  coverage, never mail). Home region is fixed at signup: Frankfurt, chosen
  once.
- The drain stream bypasses `reject_unknown_client_hostname`, anvil limits,
  and rspamd's reject tier for one /32; DKIM verification, SPF/DMARC (against
  the original IP via `external_relay`), and content scoring stay on. Spam
  arriving via the backup is tagged and folded to Junk, never bounced. The VM
  is deliberately NOT in the primary's `mynetworks` (a compromised VM must
  not relay through us).
- **Outages > 30 days lose queued mail silently**: no DSN can ever leave the
  VM. Stated and accepted (6× better than the status quo).
- Outage mail sits in plaintext on Oracle disk ≤ 30 days; single-tenant but
  off-premises; accepted (same class as Brevo holding outbound today).
- Cloudflare zone lands at 197/200 records; the MTA-STS follow-up (policy
  host found dangling during design; inert today; must list `mx2` when
  fixed) needs 1–2 more → schedule the next record purge proactively.
- `architecture/mailserver.md` §"No Backup MX" superseded at implementation;
  new runbook `docs/runbooks/backup-mx.md` (incl. OCI console break-glass);
  `vpn.md`'s stale headscale claims fixed in passing; the roundtrip probe's
  failure semantics change (a "failing" probe may now mean "delayed via mx2,
  drains shortly"; noted in alert description).

335
docs/plans/2026-07-04-backup-mx-design.md
Normal file
@@ -0,0 +1,335 @@
# Backup MX: self-hosted store-and-forward relay on Oracle Always-Free (design)

Date: 2026-07-04 (v3, post-challenge; v2 Oracle pivot same day) · Status: design,
pre-implementation · ADR: [0019](../adr/0019-backup-mx-self-hosted-oracle-relay.md)

v3 incorporates two independent adversarial-challenge reviews (same day). Their
material corrections are marked **[CH]** throughout. The largest: the v2 drain
path would never have drained (primary-side smtpd rejects),
monitoring-over-tailnet was fiction (no cluster→tailnet route exists), and the
VM's bounce model was wrong (it can never deliver a DSN).

## Goal

Inbound mail for `viktorbarzin.me` must survive homelab outages without loss.
Requirement level (Viktor, 2026-07-04): **never lose mail; delayed delivery is
acceptable; budget is $0** (a hard constraint, reaffirmed after the Rollernet
gates failed). A store-and-forward backup MX queues mail while the homelab is
down and re-delivers when it returns.

Out of scope, explicitly:

- Reading new mail *during* an outage.
- Outbound mail during outages.
- The "primary up but hard-bouncing 5xx" misconfig class: a backup MX is
  never consulted when the primary answers. Separate hardening/alerting track.

Known residual limit (state it plainly): an outage **longer than 30 days**
loses the queued mail *silently*. The VM cannot emit a bounce to anyone
(egress 25 blocked), so no sender ever learns. Accepted; 30 days is already
6× the sender-retry status quo (at most 5 days).

## v1 → v2: why Rollernet was dropped (gate evidence, 2026-07-04)

v1 selected Roller Network's free Secondary MX. The validation gates killed it
before any DNS change:

- **G2 FAILED**: the [free-accounts policy](https://rollernet.us/policy/free-accounts.html)
  caps free mail service at **200 relayed messages or 10 MB per rolling 7
  days**; overage → domain suspended **48 h answering SMTP 5xx** (permanent
  bounces), repeatable. Spammers deliberately target backup MXes even while
  the primary is up, so background spam alone can hold the domain suspended,
  which is worse than no backup MX.
- **G1 SHAKY**: same policy page says free accounts are being discontinued.
- **G3 PASSED** (for posterity): `mail{,2}.rollernet.us` present valid LE
  certs over STARTTLS.
- Signup is Cloudflare-Turnstile-gated; moot given G1/G2.

Viktor's decision: stay free → self-host on Oracle Always-Free. **[CH]** The
external challenger re-searched the free landscape (DNSExit, KisoLabs,
DuoCircle, AWS/Azure/GCP/Hetzner/Fly/Vultr/Linode free tiers) and confirmed:
no credible free managed backup-MX or free VM with a usable port-25 story
exists in 2026 other than OCI. GCP's free e2-micro also blocks egress 25 and
is US-regions-only (wrong continent).

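
To make the G2 failure concrete, the following sketch estimates how quickly a steady stream of background spam would exhaust the free-tier cap quoted above (200 relayed messages or 10 MB per rolling 7 days). The spam rate and average message size are hypothetical inputs, not measurements from this domain, and `days_until_cap` is an illustrative helper name.

```python
def days_until_cap(msgs_per_day: float, avg_kb: float,
                   msg_cap: int = 200, mb_cap: float = 10.0) -> float:
    """Days of steady traffic until either rolling-week cap is reached.

    Returns float('inf') when a full 7-day window stays under both caps.
    """
    if msgs_per_day <= 0:
        return float("inf")
    by_count = msg_cap / msgs_per_day
    if avg_kb > 0:
        by_size = (mb_cap * 1024) / (msgs_per_day * avg_kb)
    else:
        by_size = float("inf")
    days = min(by_count, by_size)
    # With a steady rate, a cap not reached within 7 days is never reached,
    # because older traffic ages out of the rolling window.
    return days if days <= 7 else float("inf")


if __name__ == "__main__":
    # Assumed: 40 spam messages/day at 20 KB each hitting the backup MX.
    print(days_until_cap(40, 20))
```

Under those assumed inputs the message-count cap is the binding one; a quieter stream of 10 messages/day never trips either cap.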
## Decision

A minimal **Postfix store-and-forward relay** (`mx2.viktorbarzin.me`) on an
Oracle Cloud **Always-Free** compute instance, published as a lower-preference
MX. It accepts mail for `viktorbarzin.me` when the primary is unreachable,
queues up to 30 days, and drains to the primary when it returns. No mailboxes,
no third-party terms: the queue-lifetime and reject-behavior knobs are ours.

## Architecture

```
                         ┌── pri 1  mail.viktorbarzin.me ──► pfSense HAProxy ──► mailserver pod
sender MTA ──► MX lookup ┤                                       ▲
                         └── pri 20 mx2.viktorbarzin.me          │ drain: smtp to
                             (Oracle VM, Postfix relay,          │ mail.viktorbarzin.me:2526
                              queue ≤ 30 days) ──────────────────┘ (pfSense WAN NAT rdr
                                                                    2526 → 10.0.20.1:25,
                                                                    existing HAProxy frontend)
```

- **Normal operation**: senders use pri 1; the VM idles (spammers targeting
  the backup + transient-blip retries get relayed onward immediately).
- **Outage**: senders fall back to pri 20 → VM accepts + queues → Postfix
  retries the primary on its native schedule → queue drains after recovery
  through the standard external ingress path (PROXY v2 → :2525 → rspamd →
  Dovecot).
- **Custom drain port**: Oracle blocks **egress TCP 25** tenancy-wide
  (post-2021; exemptions unreliable), so the VM cannot reach
  `mail.viktorbarzin.me:25`. One pfSense WAN NAT rule
  `TCP 2526 → 10.0.20.1:25` reuses the existing HAProxy frontend unchanged.
  **[CH] Verified against the runbook**: the frontend binds `*:25` on pfSense
  (not strictly 10.0.20.1), rdr dst-port rewrite is the existing production
  pattern (WAN:25 already rewrites to 10.0.20.1:25), and port 2526 collides
  with nothing (the HAProxy test frontend uses :2525). Inbound TCP 25 **to**
  the VM is unaffected by Oracle's egress-only block per practitioner
  evidence (iRedMail/mailcow on OCI: receive works, send doesn't); this is
  **to be proven at gate O2 before any DNS change** (Oracle publishes no
  positive commitment).

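
The fallback behaviour in the diagram can be sketched as a small simulation of a sending MTA walking the MX set in preference order. This is a simplified model (real senders also handle equal-preference randomisation and per-host retry timers), and `pick_mx` is a hypothetical helper, not part of any stack in this repository.

```python
from typing import Optional

# MX set as published after this change: lower preference value is tried first.
MX_RECORDS = {
    "mail.viktorbarzin.me": 1,
    "mx2.viktorbarzin.me": 20,
}


def pick_mx(records: dict[str, int], reachable: set[str]) -> Optional[str]:
    """Return the reachable MX host with the lowest preference value.

    Returns None when no host is reachable; a real sender would then queue
    the message and retry on its own schedule.
    """
    for host in sorted(records, key=records.get):
        if host in reachable:
            return host
    return None


if __name__ == "__main__":
    print(pick_mx(MX_RECORDS, {"mail.viktorbarzin.me", "mx2.viktorbarzin.me"}))
    print(pick_mx(MX_RECORDS, {"mx2.viktorbarzin.me"}))  # homelab outage
```

The second call models the outage case: the primary is unreachable, so delivery lands on the pref-20 relay, which queues and later drains.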
## Oracle account & instance

- **Account**: Viktor creates it (human signup; card for identity, $0
  charged). **Home region is fixed at signup and Always-Free compute exists
  only there, so choose `eu-frankfurt-1` deliberately; there is no
  try-another-region fallback without a new account. [CH]**
- **[CH] PAYG conversion is a REQUIRED prerequisite, not a recommendation**:
  Oracle stops idle Always-Free instances (95th-pct CPU < 20% over 7 days,
  which an idle Postfix box meets) and demonstrably changes free-tier terms
  without notice, enforcing by termination (June 2026: A1 allowance silently
  halved, over-limit instances shut down). PAYG keeps Always-Free resources
  free and exempts them from idle reclamation.
- **Shape**: `VM.Standard.E2.1.Micro` (x86, 1/8 OCPU burst, 1 GB RAM; 2
  always-free instances allowed; ample for queue-only Postfix, and untouched
  by the 2026 A1 cuts). ARM A1 fallback is **unreliable** (halved quota,
  chronic Frankfurt capacity), so treat E2.1.Micro availability as the gate.
- **[CH] Reserved public IP is mandatory** (`oci_core_public_ip`, reserved):
  an ephemeral IP rotates on stop/start and would silently break all four
  IP-keyed controls at once (pfSense NAT source-restriction, the primary's
  smtpd/rspamd exemptions, the Oracle security list, Prometheus scrape
  allowlist), a failure discovered only at the next outage's drain.
- **OS**: Ubuntu 24.04. **[CH] OCI Ubuntu images ship an OS-level iptables
  ruleset (`/etc/iptables/rules.v4`) that ACCEPTs 22 and REJECTs everything
  else, independent of security lists**. cloud-init must insert ACCEPT rules
  for 25/80 (+ scrape ports) ahead of the REJECT and persist them, or gate O2
  fails on day 1 with a correct security list.
- **Credentials**: OCI API key for Terraform → Vault `secret/viktor`
  (`oci_*`); web login → Vaultwarden item `Oracle Cloud (backup MX)`.

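
The rule-ordering requirement in the OS bullet (ACCEPT must precede the image's REJECT) can be illustrated with a small, idempotent transformation over the lines of a rules file. This is a sketch only: the exact rule text shipped in the OCI Ubuntu image is assumed here, and `allow_ports_before_reject` is a hypothetical helper; the real cloud-init step would need to be checked against the actual `/etc/iptables/rules.v4`.

```python
def allow_ports_before_reject(rules: list[str], ports: list[int]) -> list[str]:
    """Insert INPUT ACCEPT rules for TCP ports ahead of the first INPUT REJECT.

    Rules that are already present are not added again, so re-running the
    function on its own output returns the same list.
    """
    accepts = [
        f"-A INPUT -p tcp -m state --state NEW -m tcp --dport {port} -j ACCEPT"
        for port in ports
    ]
    missing = [rule for rule in accepts if rule not in rules]
    for index, line in enumerate(rules):
        if line.startswith("-A INPUT") and "-j REJECT" in line:
            return rules[:index] + missing + rules[index:]
    raise ValueError("no INPUT REJECT rule found; refusing to guess placement")


# Assumed shape of the stock ruleset (illustrative, not copied from an image).
SAMPLE_RULES = [
    "*filter",
    ":INPUT ACCEPT [0:0]",
    "-A INPUT -p tcp -m state --state NEW -m tcp --dport 22 -j ACCEPT",
    "-A INPUT -j REJECT --reject-with icmp-host-prohibited",
    "COMMIT",
]

if __name__ == "__main__":
    for line in allow_ports_before_reject(SAMPLE_RULES, [25, 80]):
        print(line)
```

Appending the ACCEPT rules after the REJECT would leave the ports closed, which is the day-1 failure mode the bullet warns about; raising on a missing REJECT avoids silently writing rules in the wrong place.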
## Networking & security posture

- **Ingress on the VM**: TCP 25 world-open (the service). **[CH] TCP 80
  world-open permanently**: Let's Encrypt validation is multi-perspective
  with no published source IPs, so it cannot be source-scoped, and an
  "open-only-during-renewal" toggle is unspecified automation whose realistic
  failure mode is an expired cert at day ~90. Nothing listens on 80 outside
  certbot's seconds-long renewal windows; connection-refused surface is
  negligible. TCP 9100/9154 (exporters) restricted to the homelab WAN /32
  (176.12.22.76) in both the Oracle security list and the VM firewall.
- **No public SSH**: management rides the headscale tailnet. cloud-init
  enrolls via a **preauth key for a dedicated non-OIDC headscale user** with
  node tag `tag:backup-mx` (headscale 0.28.0 file-mode ACL, content in Vault
  `secret/headscale` → `headscale_acl`); SSH bound to the tailnet interface.
  ACL grant: `group:admin → tag:backup-mx:22` (cluster pods are NOT tailnet
  members; see monitoring). **[CH] Outage caveat**: headscale's control
  plane + DERP live in the cluster, so mid-outage tailnet reachability is
  cached-netmap best-effort; the runbook documents the **OCI instance
  console connection as break-glass** management. (Also fix `vpn.md`'s stale
  "0.23.x / OIDC-only" claims while in there.)
- **VM compromise blast radius**: plaintext of outage-queued mail + a relay
  surface contained by `relay_domains = viktorbarzin.me` only, no submission
  ports, no SASL, no local delivery. The VM is deliberately NOT added to the
  primary's `mynetworks` (that would let a compromised VM relay arbitrary
  mail *through* the primary); per-stage exemptions are used instead, below.

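
A reachability check for the world-open ports above can be sketched with the standard library alone. It performs the same TCP connect test as `nc -zv` and nothing more (no SMTP banner or TLS validation); `tcp_reachable` is an illustrative helper name, and the host shown in the usage line is the planned MX name rather than a value that resolves today.

```python
import socket


def tcp_reachable(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port completes within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False


if __name__ == "__main__":
    # Must be run from outside the Oracle tenancy to prove inbound 25.
    print(tcp_reachable("mx2.viktorbarzin.me", 25))
```

A `False` result does not distinguish between a security-list block, the OS-level iptables REJECT, and a stopped Postfix, so a failing probe still needs manual triage.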
## Postfix configuration (relay-only, accept-and-queue with 4xx-only hygiene)

- `relay_domains = viktorbarzin.me`; `mydestination =` (empty).
- **[CH]** `smtpd_relay_restrictions = permit_mynetworks,
  reject_unauth_destination`: explicit 5xx for foreign-domain RCPTs (the
  default tail is `defer_unauth_destination`, whose 4xx invites every relay
  probe to retry forever).
- **[CH]** `relay_recipient_maps` explicitly set to the wildcard form
  (`@viktorbarzin.me OK`), which documents accept-all-recipients as a decision
  (the domain is catch-all; every RCPT is valid by definition).
- `transport_maps`: `viktorbarzin.me smtp:[mail.viktorbarzin.me]:2526`.
- `maximal_queue_lifetime = 30d`. **[CH]** `bounce_queue_lifetime = 1d` and
  `delay_warning_time = 0`: this host can never deliver a DSN to anyone
  (egress 25 blocked; its only egress is 2526 to the primary), so undeliverable
  bounces must be discarded quickly or they rot in the queue for a month and
  permanently poison the queue-depth alert.
- **[CH]** `message_size_limit = 209715200`, exactly the primary's 200 MB
  (`POSTFIX_MESSAGE_SIZE_LIMIT`, mailserver main.tf:88). The stock 10 MB
  default would 552-reject large legitimate mail during outages, which is the
  exact loss mode this project exists to prevent. Equal, never higher (higher
  recreates drain-time rejects).
- **[CH] postscreen on the VM in 4xx-only posture**: pregreet test ON
  (fire-and-forget bots don't retry; real MTAs do; the whole design already
  rests on sender retry, so 4xx filtering is loss-free by construction),
  optionally `postscreen_dnsbl_action = defer` with a conservative threshold
  (to confirm at implementation: postconf(5) documents only `ignore`,
  `enforce`, and `drop` for this parameter, so the 4xx mechanism needs
  verifying). v2's blanket "no DNSBL" conflated 5xx reputation rejects (rightly
  banned) with 4xx tempfail (harmless); without any hygiene the backup is a
  24/7 spam backdoor since spammers deliberately deliver to the
  highest-numbered MX. Zero 5xx from reputation, ever.
- `inet_protocols = ipv4` **[CH]**: the primary publishes an AAAA (HE
  tunnel) but the IPv6 HAProxy bridge has no :2526 listener; skip the wasted
  v6 attempt per delivery.
- `smtpd_tls_cert_file` = LE cert for `mx2.viktorbarzin.me` (opportunistic
  STARTTLS inbound; `smtp_tls_security_level = may` on the drain leg).
- Queue disk: the ~45 GB free boot volume dwarfs any realistic 30-day
  accumulation for a personal domain.

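
The scalar settings listed above can be collected and sanity-checked before they are templated into cloud-init. The sketch below renders them as `main.cf` lines and asserts the invariants the list calls out (size limit equal to the primary's, explicit `reject_unauth_destination`, 1-day bounce lifetime, empty `mydestination`). The map-backed parameters (`transport_maps`, `relay_recipient_maps`) are left out because they point at lookup tables; `render_main_cf` and `check_invariants` are hypothetical helper names.

```python
PRIMARY_SIZE_LIMIT = 200 * 1024 * 1024  # 209715200 bytes, the primary's 200 MB

RELAY_SETTINGS = {
    "relay_domains": "viktorbarzin.me",
    "mydestination": "",
    "smtpd_relay_restrictions": "permit_mynetworks, reject_unauth_destination",
    "maximal_queue_lifetime": "30d",
    "bounce_queue_lifetime": "1d",
    "delay_warning_time": "0",
    "message_size_limit": str(PRIMARY_SIZE_LIMIT),
    "inet_protocols": "ipv4",
    "smtp_tls_security_level": "may",
}


def render_main_cf(settings: dict[str, str]) -> str:
    """Render settings as 'name = value' lines (an empty value stays empty)."""
    return "\n".join(f"{name} = {value}".rstrip()
                     for name, value in settings.items())


def check_invariants(settings: dict[str, str], primary_limit: int) -> list[str]:
    """Return a list of violated design invariants (empty list means OK)."""
    problems = []
    if int(settings["message_size_limit"]) != primary_limit:
        problems.append("message_size_limit must equal the primary's limit")
    if "reject_unauth_destination" not in settings["smtpd_relay_restrictions"]:
        problems.append("foreign-domain RCPTs must get an explicit 5xx")
    if settings["bounce_queue_lifetime"] != "1d":
        problems.append("bounce_queue_lifetime must be 1d")
    if settings["mydestination"] != "":
        problems.append("mydestination must be empty (no local delivery)")
    return problems


if __name__ == "__main__":
    assert not check_invariants(RELAY_SETTINGS, PRIMARY_SIZE_LIMIT)
    print(render_main_cf(RELAY_SETTINGS))
```

Checking equality rather than "at least" for the size limit mirrors the "equal, never higher" rule: a larger limit on the relay would accept mail the primary later rejects at drain time.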
## TLS

certbot standalone HTTP-01 for `mx2.viktorbarzin.me` (no Cloudflare API token
on an internet-facing VM). Port 80 permanently open (see above); certbot renew
timer. The MTA-STS follow-up (separate task; policy host currently dangling,
see below) must list `mx2.viktorbarzin.me` when implemented.

## Primary-side drain enablement **[CH: this section replaces v2's "SPF/DMARC exemption + postscreen permit", which exempted the wrong layers]**

The v2 exemptions targeted postscreen DNSBL (which is **off** on the primary,
since `ENABLE_DNSBL` is unset) and rspamd SPF/DMARC scoring, but missed the
three mechanisms that would actually break the drain. All are keyed on the VM's
reserved /32 (the PROXY-v2-recovered client IP):

1. **`reject_unknown_client_hostname` bypass**: the primary sets
   `POSTFIX_REJECT_UNKNOWN_CLIENT_HOSTNAME=1` (main.tf:89); an Oracle IP
   without full FCrDNS (PTR needs an Oracle SR; limited on free accounts)
   would be **450-deferred on every drain attempt → the queue never drains →
   mass-bounces at day 30**. Fix: `check_client_access` permit for the VM /32
   early in `smtpd_client_restrictions`, and a matching permit at the sender
   stage (SPOOF_PROTECTION=1 rejects unauthenticated own-domain envelope
   senders, so drained self-addressed/bounced mail would 5xx). Attempt the
   Oracle PTR anyway (belt and braces).
2. **Anvil rate-limit exception**: `smtpd_client_message_rate_limit = 30`/min
   keys on the VM's IP at drain; a >3,600-message backlog would take more than
   2 h to drain at that rate and false-fire the queue alert. Add the VM /32 to
   `smtpd_client_event_limit_exceptions`.
3. **rspamd: evaluate the original sender, never 5xx the drain stream**, via
   the existing override.d ConfigMap pattern (same mount as
   `dkim_signing.conf`): (a) configure rspamd's **`external_relay`** module
   (ip_map = VM /32) so SPF/DMARC/IP reputation evaluate against the
   *original* client IP parsed from the VM's Received header, which keeps
   DMARC protection for the entire drain stream instead of v2's blanket
   disable; (b) cap rspamd's **action at the VM /32 to tag/fold, never
   milter-reject**: the primary's default reject tier (DMS default, active
   since only dkim_signing is overridden today) would 5xx high-score spam at
   DATA, forcing the VM to generate DSNs to forged senders = classic
   backup-MX backscatter → mx2's IP blacklisted. Drained spam lands tagged in
   the catch-all's Junk instead. Validate the external_relay ↔ settings-rule
   interplay at gate O5 with a high-spam-score message.
4. postscreen permit for the /32 (harmless; pregreet never trips a real
   Postfix client and DNSBL is off; kept for future-proofing only).

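
The anvil arithmetic in item 2 is worth spelling out, since it is what ties the rate limit to the 2 h queue alert. A minimal sketch, assuming the limit applies uniformly for the whole drain (real drains are burstier, and Postfix's own retry backoff adds delay on top); `drain_minutes` is an illustrative helper name.

```python
def drain_minutes(backlog: int, rate_per_min: int = 30) -> float:
    """Minimum minutes to drain a backlog under a per-minute message cap."""
    if rate_per_min <= 0:
        raise ValueError("rate_per_min must be positive")
    return backlog / rate_per_min


if __name__ == "__main__":
    for backlog in (600, 3600, 10000):
        minutes = drain_minutes(backlog)
        print(f"{backlog:>6} msgs -> {minutes:7.1f} min ({minutes / 60:.1f} h)")
```

At 30 messages per minute, 3,600 queued messages need 120 minutes, which is exactly the 2 h window of the queue-stuck alert; anything larger would trip it without the exception.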
|
## Our-side changes (Terraform unless noted)
|
||||||
|
|
||||||
|
1. **New stack `stacks/backup-mx/`** (Tier 1): OCI provider (creds from
|
||||||
|
Vault), VCN + subnet + security list + **reserved public IP** +
|
||||||
|
`VM.Standard.E2.1.Micro` + cloud-init (`templatefile`): **OS iptables
|
||||||
|
ACCEPTs for 25/80/9100/9154 ahead of the OCI image's REJECT rule
|
||||||
|
(persisted)**, postfix + config above, certbot, tailscale→headscale
|
||||||
|
enrollment (preauth key from Vault), node_exporter, postfix_exporter,
|
||||||
|
unattended-upgrades.
|
||||||
|
2. **DNS** — `stacks/cloudflared/modules/cloudflared/cloudflare.tf`: A
|
||||||
|
`mx2.viktorbarzin.me` → reserved IP (non-proxied), MX pref 20 → `mx2`.
|
||||||
|
**[CH] Live zone count verified: 195/200 → 197/200 after this change; only
|
||||||
|
3 slots remain and the MTA-STS follow-up needs 1–2 → plan the next
|
||||||
|
record-purge now, not at collision time.**
|
||||||
|
3. **pfSense (live network device — approved as part of this plan)**: WAN NAT
|
||||||
|
rdr `TCP 2526 → 10.0.20.1:25` + firewall rule, source-restricted to the
|
||||||
|
reserved IP. **[CH] Scripted** (extend the existing
|
||||||
|
`scripts/pfsense-*-haproxy*.php` bootstrap-script family), not
|
||||||
|
hand-clicked — keeps the git-rebuildable parity the rest of the pfSense
|
||||||
|
mail config has. Config.xml rides the nightly backup.
|
||||||
|
4. **Mailserver stack**: the four-layer drain enablement above (client+sender
|
||||||
|
`check_client_access` permits, anvil exception, rspamd external_relay +
|
||||||
|
action cap, postscreen permit) — all keyed to one /32, via the existing
|
||||||
|
`postfix_cf` / `user-patches.sh` / rspamd-override hook points (verified
|
||||||
|
present: main.tf:129-144, 222-281, 467-474).
|
||||||
|
5. **Monitoring [CH — replaces v2's tailnet scraping, which had no transport:
|
||||||
|
no cluster→tailnet route exists and no existing target is scraped that
|
||||||
|
way]**: Prometheus scrapes `node_exporter`/`postfix_exporter` on the VM's
|
||||||
|
**public reserved IP**, allowed only from the homelab WAN /32 (Oracle SL +
|
||||||
|
VM firewall); blackbox TCP:25 from the cluster (`BackupMxDown`, warning);
|
||||||
|
MX-set drift assertion (both MX records present). Alerts:
|
||||||
|
`BackupMxQueueStuck` = **non-bounce** queue depth > 0 for 2 h while the
|
||||||
|
primary is healthy (gate on the existing `MailServerDown`/roundtrip
|
||||||
|
series, machine-readable — not prose); bounce residue is excluded by the
|
||||||
|
1-day bounce lifetime. Note: during a full homelab outage Prometheus
|
||||||
|
itself is down — queue growth is unobservable live under ANY transport;
|
||||||
|
what we actually watch is the post-recovery drain. A WAN-IP change stales
|
||||||
|
the Oracle allowlist → visible as ScrapeTargetDown (self-signaling).
|
||||||
|
**Probe semantics note**: once mx2 exists, the Brevo roundtrip probe's
|
||||||
|
mail fails over to mx2 on transient primary blips and arrives minutes late
|
||||||
|
via the drain — `EmailRoundtripFailing` may then mean "delayed via mx2",
|
||||||
|
not "lost"; note in the alert description and runbook.
|
||||||
|
6. **Docs (same commit as implementation)**: rewrite `mailserver.md` §"No
|
||||||
|
Backup MX", new runbook `docs/runbooks/backup-mx.md` (`postqueue -p`,
|
||||||
|
forced drain `postqueue -f`, cert renewal, **OCI console break-glass**, VM
|
||||||
|
rebuild from stack, Oracle account facts incl. PAYG + home-region lock),
|
||||||
|
`vpn.md` headscale-version/OIDC staleness fix, monitoring rows.
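
The `BackupMxQueueStuck` condition from step 5 (non-bounce queue depth > 0 for 2 h while the primary is healthy) can be sketched as a pure function. This is an illustrative Python sketch, not the actual Prometheus rule: the `Sample` type and its field names are hypothetical, and it assumes each scrape yields a non-bounce queue depth plus a primary-health flag derived from the existing `MailServerDown`/roundtrip series.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

STUCK_FOR = timedelta(hours=2)  # the "for 2 h" part of the alert definition


@dataclass(frozen=True)
class Sample:
    """One hypothetical scrape result (field names are illustrative)."""
    at: datetime
    non_bounce_queue_depth: int  # bounce residue is assumed to be excluded already
    primary_healthy: bool        # assumed to come from MailServerDown / roundtrip state


def queue_stuck(samples: list[Sample], now: datetime) -> bool:
    """Return True if the queue has been non-empty, with the primary healthy,
    continuously for at least STUCK_FOR up to `now`."""
    held_since = None
    for s in sorted(samples, key=lambda s: s.at):
        if s.at > now:
            break
        if s.non_bounce_queue_depth > 0 and s.primary_healthy:
            if held_since is None:
                held_since = s.at  # condition starts holding here
        else:
            held_since = None      # any gap resets the timer
    return held_since is not None and now - held_since >= STUCK_FOR
```

Gating on `primary_healthy` is what keeps the alert quiet during a real outage, when a growing queue is the expected behaviour rather than a fault.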
|
||||||
|
|
||||||
|
### MTA-STS finding (unchanged; no action in this change)
|
||||||
|
|
||||||
|
`_mta-sts` TXT is published but `mta-sts.viktorbarzin.me` has no record and
|
||||||
|
nothing serves the policy — MTA-STS is inert today. When fixed, the policy
|
||||||
|
MUST include `mx: mx2.viktorbarzin.me` (and budget its DNS records against the
|
||||||
|
3 remaining zone slots).
|
||||||
|
|
||||||
|
## Validation gates (in order; any failure → stop and report)
|
||||||
|
|
||||||
|
| # | Gate | Method | Failure handling |
|---|------|--------|------------------|
| O1 | Oracle account (home region `eu-frankfurt-1`, **fixed forever at signup**), **PAYG conversion done**, E2.1.Micro capacity | Viktor signs up + converts; TF apply | A1-in-home-region is a best-effort fallback only (halved quota, contended); else decision returns to Viktor |
| O2 | Inbound TCP 25 reachable from the internet (after the OS-iptables fix) | `nc -zv <reserved-ip> 25` from outside + recurring Uptime-Kuma TCP monitor (keeps proving it — Oracle publishes no commitment) | Stop; decision returns to Viktor |
| O3 | Drain works: VM → `mail.viktorbarzin.me:2526` delivers end-to-end | Test message injected on the VM | Debug pfSense NAT / HAProxy path |
| O4 | LE cert issued | certbot standalone | STARTTLS is opportunistic — non-blocking for go-live; fix before MTA-STS |
| O5 | Live failover test — **hardened [CH]** | presence-claim → scale mailserver to 0 (~30 min) → send from Gmail + Brevo **plus a high-spam-score message and a >10 MB message** → confirm queued (`postqueue -p`) → scale up → verify full drain within the anvil-exception expectations, spam folded to Junk (not bounced), headers show original-IP SPF/DMARC evaluation, no DSN generated on the VM, roundtrip probe recovers | Debug or roll back (remove MX record) |
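
Gate O2 is a plain TCP reachability check. As a rough Python equivalent of the `nc -zv` method, the sketch below assumes it runs from a host outside the homelab; the function name `tcp_port_open` is hypothetical, and the reserved IP is left as a placeholder exactly as in the table.

```python
import socket


def tcp_port_open(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port completes within `timeout`."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts alike.
        return False


# Usage (placeholder address, run from outside the homelab):
#   tcp_port_open("<reserved-ip>", 25)
```

A successful connect only proves the port is open end-to-end (Oracle security list plus OS firewall); it does not prove Postfix accepts mail, which is what gates O3 and O5 cover.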
|
||||||
|
|
||||||
|
## Failure modes
|
||||||
|
|
||||||
|
Covered: cluster/pod outages, pfSense/power/ISP outages ≤ 30 days, WAN IP
|
||||||
|
changes, short-retry senders. If pfSense is down the drain waits — Postfix
|
||||||
|
retries until it heals.
|
||||||
|
|
||||||
|
Not covered: primary-up-but-5xx misconfigs; outbound; mid-outage mailbox
|
||||||
|
access; **outages > 30 days lose queued mail silently (no DSN possible)**.
|
||||||
|
Simultaneous Oracle+homelab outage = status quo ante (sender retries).
|
||||||
|
|
||||||
|
Newly introduced, accepted:
|
||||||
|
|
||||||
|
- **A pet outside the cluster** — deliberately cattle: rebuilt from TF +
|
||||||
|
cloud-init, patched by unattended-upgrades, scraped by Prometheus. Never a
|
||||||
|
backup target.
|
||||||
|
- **Oracle free-tier caprice [CH — upgraded from v2's framing]**: Oracle has
|
||||||
|
silently cut Always-Free allowances and terminated over-limit instances
|
||||||
|
(June 2026, A1). Mitigations: PAYG (required), recurring inbound-25 probe,
|
||||||
|
`BackupMxDown`, and the fact that outside an active outage the queue is
|
||||||
|
empty — a surprise reclamation loses nothing, only coverage until rebuilt.
|
||||||
|
Rollernet Basic ($30/yr) stays the documented fallback if OCI sours.
|
||||||
|
- **Spam hygiene**: 4xx-only postscreen on the VM (pregreet + conservative
|
||||||
|
DNSBL-defer) instead of v2's nothing; drained spam is tagged/folded by
|
||||||
|
rspamd, never bounced.
|
||||||
|
- Outage mail sits plaintext on Oracle disk ≤ 30 days (single-tenant;
|
||||||
|
accepted).
|
||||||
|
|
||||||
|
## Rollback
|
||||||
|
|
||||||
|
Remove the MX + A records; wait for `postqueue -p` empty; `terraform destroy`
|
||||||
|
on `backup-mx`; delete the pfSense NAT rule (scripted); drop the mailserver
|
||||||
|
/32 exemptions. Order matters: MX record first.
|
||||||
|
|
||||||
|
## Viktor's manual steps (everything else is mine)
|
||||||
|
|
||||||
|
1. Create the Oracle Cloud account — **home region `eu-frankfurt-1`** (fixed
|
||||||
|
forever), card for identity, $0 charged.
|
||||||
|
2. **Convert the tenancy to Pay-As-You-Go** (required — idle-reclamation
|
||||||
|
exemption; Always-Free stays $0).
|
||||||
|
3. Hand me the tenancy OCID + a console user → I mint the API key, store
|
||||||
|
creds (Vault + Vaultwarden), and build the stack.
|
||||||
|
4. Approve the (scripted) pfSense NAT rule when I reach that step.
|
||||||
|
|
@ -1,181 +0,0 @@
|
||||||
# Backup MX via Roller Network free Secondary MX — design
|
|
||||||
|
|
||||||
Date: 2026-07-04 · Status: design approved pending user review, pre-implementation · ADR: [0019](../adr/0019-backup-mx-roller-network-free-tier.md)
|
|
||||||
|
|
||||||
## Goal
|
|
||||||
|
|
||||||
Inbound mail for `viktorbarzin.me` must survive homelab outages without loss.
|
|
||||||
Requirement level (Viktor, 2026-07-04): **never lose mail; delayed delivery is
|
|
||||||
acceptable; budget is $0**. A store-and-forward backup MX queues mail while the
|
|
||||||
homelab is down and re-delivers when it returns.
|
|
||||||
|
|
||||||
Out of scope, explicitly:
|
|
||||||
|
|
||||||
- Reading new mail *during* an outage (would need a deliver-to-mailbox backup —
|
|
||||||
rejected in favour of queue-only).
|
|
||||||
- Outbound mail during outages.
|
|
||||||
- The "primary up but hard-bouncing 5xx" misconfig class (e.g. broken alias map
|
|
||||||
→ `550 user unknown`): a backup MX is never consulted when the primary
|
|
||||||
answers. That is a separate hardening/alerting track.
|
|
||||||
|
|
||||||
## Current state and gap
|
|
||||||
|
|
||||||
- Single MX: `mail.viktorbarzin.me` (pri 1) → `176.12.22.76` → pfSense HAProxy
|
|
||||||
(PROXY v2) → mailserver pod. No backup MX — documented decision in
|
|
||||||
[`architecture/mailserver.md`](../architecture/mailserver.md) §"No Backup MX"
|
|
||||||
(2026-04-12), which this design supersedes (ADR-0019).
|
|
||||||
- Only protection today: sender MTAs queue and retry, typically 1–5 days.
|
|
||||||
Loss vectors: outages longer than a sender's retry window, and senders with
|
|
||||||
unusually short retry policies.
|
|
||||||
- Prior art: **ForwardEmail** relay abandoned 2026-04-12 (its forced
|
|
||||||
anti-spoofing rejected legitimate forwarded mail); **Cloudflare Email
|
|
||||||
Routing** rejected (pass-through only, no queue); **Dynu** ($9.99/yr) was the
|
|
||||||
doc-flagged fallback; **mailflare** (hieunc229) evaluated and rejected
|
|
||||||
2026-07-04 (memory #7148).
|
|
||||||
|
|
||||||
## Decision
|
|
||||||
|
|
||||||
Adopt **Roller Network free-tier Secondary MX** (`mail.rollernet.us` +
|
|
||||||
`mail2.rollernet.us`) as a store-and-forward backup MX. Rationale (full
|
|
||||||
alternatives in ADR-0019):
|
|
||||||
|
|
||||||
- Purpose-built queue relay: **3-week queue**, sliding retry (15 min doubling
|
|
||||||
to a max 1-week interval), queue storage not counted against the account.
|
|
||||||
- Spam filtering on secondary MX is **optional and off by default** ("little to
|
|
||||||
no spam filtering" per their FAQ) — avoids the ForwardEmail failure class.
|
|
||||||
- **Catch-all compatible**: their valid-user table supports a default *allow
|
|
||||||
any* ("catch-all/dropbox") action, preserving the `@viktorbarzin.me → spam@`
|
|
||||||
infinite-alias pattern. New domains default to *deny* — must be flipped
|
|
||||||
explicitly at setup.
|
|
||||||
- Free; unlimited domains; config API; "Accept and Hold" mode usable for
|
|
||||||
planned maintenance windows.
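
To make the queue numbers above concrete, the following Python sketch enumerates retry times under one reading of the stated policy: first retry after 15 minutes, interval doubling after each attempt, capped at one week, inside a 3-week queue lifetime. The function name and the exact semantics (when doubling starts, how the cap is applied) are assumptions for illustration, not Rollernet's documented algorithm.

```python
from datetime import timedelta


def retry_offsets(
    first: timedelta = timedelta(minutes=15),
    cap: timedelta = timedelta(weeks=1),
    lifetime: timedelta = timedelta(weeks=3),
) -> list[timedelta]:
    """Offsets (since queueing) of each retry that still fits in the queue lifetime."""
    offsets = []
    interval = first
    elapsed = timedelta(0)
    while elapsed + interval <= lifetime:
        elapsed += interval
        offsets.append(elapsed)
        interval = min(interval * 2, cap)  # doubling, capped at the max interval
    return offsets


if __name__ == "__main__":
    for n, offset in enumerate(retry_offsets(), start=1):
        print(f"retry {n:2d} at +{offset}")
```

Under these assumptions the gaps between late retries grow to days, so drain latency after a long outage depends heavily on where in the schedule the primary comes back.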
|
|
||||||
|
|
||||||
## Architecture
|
|
||||||
|
|
||||||
Normal operation (unchanged): senders resolve MX, prefer pri 1
|
|
||||||
`mail.viktorbarzin.me`, deliver directly. Rollernet sits idle. (Spammers
|
|
||||||
deliberately targeting the backup MX get relayed to the primary immediately —
|
|
||||||
see failure modes.)
|
|
||||||
|
|
||||||
Outage: senders fail to connect to pri 1 → fall back to pri 20
|
|
||||||
`mail{,2}.rollernet.us` → Rollernet accepts (allow-any user table), queues up
|
|
||||||
to 3 weeks, retries the primary on a sliding schedule → queue drains
|
|
||||||
automatically after recovery, entering via the standard external path (pfSense
|
|
||||||
HAProxy → `:2525` postscreen, PROXY v2), then rspamd → Dovecot as usual.
|
|
||||||
|
|
||||||
```
                         ┌── pri 1  mail.viktorbarzin.me ──► pfSense HAProxy ──► mailserver pod
sender MTA ──► MX lookup ┤                                                            ▲
                         └── pri 20 mail.rollernet.us  ─┐                              │ retry ≤ 3 weeks
                             pri 20 mail2.rollernet.us ─┴─► Rollernet queue ───────────┘
                             (only used when pri 1 unreachable)
```
|
|
||||||
|
|
||||||
## Rollernet account & configuration (out-of-band SaaS, like Brevo)
|
|
||||||
|
|
||||||
- Account email: **`rollernet@viktorbarzin.me`** (Viktor, 2026-07-04; resolves
|
|
||||||
via catch-all → `spam@`). Known circularity: during an outage their
|
|
||||||
notifications to this address are themselves queued (at their side) until
|
|
||||||
recovery. Accepted — credentials live in Vaultwarden and the runbook
|
|
||||||
documents ACC access; nothing operational depends on receiving their mail
|
|
||||||
mid-outage.
|
|
||||||
- Credentials → **Vaultwarden** item `Rollernet (backup MX)` (Viktor,
|
|
||||||
2026-07-04 — personal web login, so the password manager, not Vault KV;
|
|
||||||
retrieve via `homelab vault get "Rollernet (backup MX)"`). Any API key
|
|
||||||
minted later joins the same item as a custom field.
|
|
||||||
- Domain `viktorbarzin.me` in **Secondary MX** mode; valid-user table default
|
|
||||||
action = **allow any** (catch-all).
|
|
||||||
- `abuse@` / `postmaster@` must be deliverable (their RFC requirement) — the
|
|
||||||
catch-all already satisfies this.
|
|
||||||
- Record their **relay source CIDRs** from the post-signup Resource Access page
|
|
||||||
(feeds the whitelist below). Their published mail ranges as of 2026 include
|
|
||||||
`162.216.242.0/24` and `72.51.58.0/24` — confirm the authoritative list in
|
|
||||||
the ACC.
|
|
||||||
|
|
||||||
## Our-side changes (all Terraform; worktree → master → CI apply)
|
|
||||||
|
|
||||||
1. **DNS** — `stacks/cloudflared/modules/cloudflared/cloudflare.tf`: add two MX
|
|
||||||
records for the zone apex, `mail.rollernet.us` and `mail2.rollernet.us` at
|
|
||||||
equal preference **20** (primary record untouched at pri 1). Implementation
|
|
||||||
checks: (a) their MX-setup help page has a loop-avoidance rule about
|
|
||||||
priority layout — confirm 1/20/20 matches their prescription post-signup;
|
|
||||||
(b) **the zone sits near Cloudflare's Free-plan 200-record cap** (commit
|
|
||||||
`1a63fee4` dropped 6 names for headroom) — verify ≥2 free slots before
|
|
||||||
apply.
|
|
||||||
2. **Postscreen whitelist** — `stacks/mailserver/modules/mailserver/main.tf`:
|
|
||||||
mount a `postscreen_access.cidr` (permit Rollernet CIDRs) via the existing
|
|
||||||
config ConfigMap and set `postscreen_access_list =
|
|
||||||
permit_mynetworks, cidr:/tmp/docker-mailserver/postscreen_access.cidr` on
|
|
||||||
the `:2525` alt listener (in `user-patches.sh`, where the listener is
|
|
||||||
defined). Rationale: their relays must not be DNSBL-scored or
|
|
||||||
pregreet-tested; otherwise queue drains would be tempfailed and deferred.
|
|
||||||
3. **rspamd SPF/DMARC exemption** — same stack, via the established
|
|
||||||
`override.d`/`local.d` ConfigMap-mount pattern (as `dkim_signing.conf`
|
|
||||||
today): exempt the Rollernet CIDRs from **SPF and DMARC scoring only**
|
|
||||||
(relayed mail arrives from their IPs, so envelope SPF legitimately fails —
|
|
||||||
the exact ForwardEmail lesson applied on our side). Content, AV, Bayes and
|
|
||||||
DKIM verification stay fully active; DKIM-signed senders still validate
|
|
||||||
end-to-end through the relay.
|
|
||||||
4. **Monitoring** — blackbox DNS assertion that the MX set contains all three
|
|
||||||
hosts (drift guard, same pattern as `viktorbarzin-apex-probe`); alert on
|
|
||||||
drift. Optional informational probe: TCP:25 reachability of
|
|
||||||
`mail.rollernet.us` (their uptime, weekly cadence, no paging).
|
|
||||||
5. **Docs (same commit as implementation)** — rewrite `mailserver.md` §"No
|
|
||||||
Backup MX" (decision superseded by ADR-0019, new inbound flow + DNS table +
|
|
||||||
monitoring rows), add `docs/runbooks/backup-mx-rollernet.md` (ACC queue
|
|
||||||
inspection, post-outage drain verification, Accept-and-Hold for planned
|
|
||||||
maintenance, overage semantics, whitelist upkeep if their CIDRs change).
|
|
||||||
|
|
||||||
### MTA-STS finding (no action in this change)
|
|
||||||
|
|
||||||
`_mta-sts.viktorbarzin.me TXT "v=STSv1; id=20260412"` is published, but
|
|
||||||
`mta-sts.viktorbarzin.me` has **no public DNS record and nothing serves the
|
|
||||||
policy file** → per RFC 8461 senders that see the TXT fail the HTTPS policy
|
|
||||||
fetch and proceed as if no policy exists. MTA-STS is inert today (docs-vs-live
|
|
||||||
mismatch vs the mailserver.md DNS table). Whenever it is fixed properly, the
|
|
||||||
policy's `mx:` list MUST include `mail.rollernet.us` and `mail2.rollernet.us`,
|
|
||||||
or MTA-STS-enforcing senders will refuse the backup path. Tracked as a
|
|
||||||
follow-up, not part of this design.
|
|
||||||
|
|
||||||
## Validation gates (in order; any failure → stop and report)
|
|
||||||
|
|
||||||
| # | Gate | Method | Failure handling |
|---|------|--------|------------------|
| G1 | Free tier still includes Secondary MX (2026) | Signup + ACC | Decision returns to Viktor: Dynu $9.99/yr vs Rollernet Basic $30/yr vs Oracle-VM self-host |
| G2 | 10 MB/day overage semantics: locked domain answers **4xx (defer)** not 5xx (bounce) | Their docs/support ticket before DNS go-live | If 5xx: decision returns to Viktor (paid tier lifts cap, or accept the risk window) |
| G3 | STARTTLS on their MX hosts (cert quality) | `openssl s_client -starttls smtp -connect mail.rollernet.us:25` | Informational now (blocks only the future MTA-STS fix) |
| G4 | Authoritative relay CIDRs published | ACC Resource Access page | Whitelist (changes 2–3) MUST be applied **before** the MX records go live — ordering guard |
| G5 | Live failover test | See below | Debug or roll back (remove MX records) |
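
Gate G2 hinges on the SMTP reply class: a 4xx reply makes the sender defer and retry, while a 5xx reply is a permanent failure that the sender bounces. A minimal Python sketch of that distinction follows; the function name and the returned labels are illustrative only.

```python
def smtp_reply_class(code: int) -> str:
    """Map a three-digit SMTP reply code to its delivery outcome class."""
    if not 200 <= code <= 599:
        raise ValueError(f"not a valid SMTP reply code: {code}")
    return {
        2: "success",       # 2yz: accepted
        3: "intermediate",  # 3yz: more input expected (e.g. after DATA)
        4: "defer",         # 4yz: transient failure, sender retries later
        5: "bounce",        # 5yz: permanent failure, sender gives up
    }[code // 100]
```

For a backup MX, "defer" merely delays mail at the sender, whereas "bounce" loses it, which is why G2 treats a 5xx answer as a stop condition.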
|
|
||||||
|
|
||||||
**G5 live failover test**: `presence claim service:mailserver` → scale
|
|
||||||
mailserver deployment to 0 (≈30 min window) → send probes from Gmail and via
|
|
||||||
Brevo API → confirm both queue in Rollernet ACC → scale back to 1 → verify
|
|
||||||
clean drain: delivered to `spam@`/target mailbox, headers show no SPF/DMARC
|
|
||||||
penalty and no postscreen interference, DKIM still validates. Also verify the
|
|
||||||
E2E roundtrip probe recovers on its own.
|
|
||||||
|
|
||||||
## Failure modes
|
|
||||||
|
|
||||||
Covered: cluster/pod outages, pfSense/power/ISP outages, WAN IP changes (queue
|
|
||||||
holds while DNS is fixed), multi-day outages ≤ 3 weeks, short-retry senders.
|
|
||||||
|
|
||||||
Not covered (out of scope, above): primary-up-but-5xx misconfigs; outbound;
|
|
||||||
mid-outage mailbox access.
|
|
||||||
|
|
||||||
Newly introduced, accepted:
|
|
||||||
|
|
||||||
- **Plaintext queue at a third party** during outages (same trust class as
|
|
||||||
Brevo holding outbound today).
|
|
||||||
- **Spam-path bypass**: mail via the backup skips postscreen DNSBL (their IPs
|
|
||||||
are whitelisted) and SPF/DMARC scoring; rspamd content/AV/Bayes still apply.
|
|
||||||
Slight spam uptick possible during outages; catch-all absorbs to `spam@`.
|
|
||||||
- **10 MB/day cap** mid-outage → domain locks until midnight PT (severity
|
|
||||||
depends on G2: defer = harmless delay at sender; bounce = loss → gate).
|
|
||||||
- Rollernet outage while primary is also down = status quo ante (sender
|
|
||||||
retries), never worse than today.
|
|
||||||
|
|
||||||
## Rollback
|
|
||||||
|
|
||||||
Remove the two MX records (TTL is automatic/low) and disable the domain in the
|
|
||||||
ACC. Whitelist + rspamd exemption are inert without the MX records and may be
|
|
||||||
reverted in the same commit or left for a retry.
|
|
||||||