infra

Author	SHA1	Message	Date
Viktor Barzin	5258f09230	mailserver: decommission SendGrid Remove leftover SendGrid references after the Brevo migration was completed: - Delete TF `cloudflare_record.mail_domainkey` (TXT at `s1._domainkey`, SendGrid-era DKIM, hidden behind the SendGrid CNAME but would re-emerge once the CNAME is removed). - Clean up commented-out `smtp.sendgrid.net` relayhost references and the `# For sendgrid` comment on `sasl_passwd` in the mailserver module. DNS records deleted out-of-band (not TF-managed): - CF: `s1._domainkey CNAME` + `s2._domainkey CNAME` → sendgrid.net (manual entries) - Technitium internal `viktorbarzin.me`: `em7107`, `s1._domainkey`, `s2._domainkey` CNAMEs → sendgrid.net Verified end-to-end mail flow unaffected (Brevo outbound + IMAP receive, roundtrip 20.4s — identical to baseline). Active DKIM (`mail._domainkey` local + `brevo1/brevo2._domainkey` Brevo) untouched.	2026-05-22 20:08:38 +00:00
Viktor Barzin	a32bfbf07e	[mailserver] Require STARTTLS before AUTH on submission [ci skip] ## Context docker-mailserver 15.0.0's default Postfix config does NOT set `smtpd_tls_auth_only = yes`. Clients that skip STARTTLS on port 587 (or 25 with AUTH) can send PLAIN/LOGIN creds in cleartext. CrowdSec and rate limiting don't catch this — it's an auth-path leak, not a bruteforce. Addresses bd code-vnw. ## This change Adds `smtpd_tls_auth_only = yes` to `postfix_cf` (applied via the `postfix-main.cf` ConfigMap key consumed by docker-mailserver). Rolled the pod to pick up the new ConfigMap. ### Deviation from task spec code-vnw's fix field cited `smtpd_sasl_auth_only = yes`. That is NOT a real Postfix parameter — attempting it gets `postconf: warning: smtpd_sasl_auth_only: unknown parameter`. The acceptance test (reject PLAIN auth before STARTTLS) is satisfied by `smtpd_tls_auth_only`, which is the correct knob. Added an inline comment noting the common confusion. ## What is NOT in this change - Per-service override in master.cf (smtpd_tls_auth_only applied globally, which is safe because port 25 doesn't accept AUTH here) - Other Postfix hardening (sender_restrictions, etc.) ## Test Plan ### Automated ``` $ kubectl exec -n mailserver -c docker-mailserver deployment/mailserver -- \ postconf smtpd_tls_auth_only smtpd_tls_auth_only = yes $ kubectl rollout status deployment/mailserver -n mailserver deployment "mailserver" successfully rolled out ``` ### Manual Verification 1. Open a plaintext session: `nc mail.viktorbarzin.me 587` (`openssl s_client -starttls smtp` negotiates TLS itself, so it cannot exercise the pre-STARTTLS case) 2. At the prompt, send `AUTH PLAIN <base64>` BEFORE `STARTTLS` 3. Expected: Postfix rejects with `503 5.5.1 Error: authentication not enabled` 4. Follow-up: `openssl s_client -connect mail.viktorbarzin.me:587 -starttls smtp`, then `AUTH PLAIN <base64>` — succeeds for valid creds ## Reproduce locally 1. From a shell with `kubectl` access to the cluster: 2. `kubectl exec -n mailserver -c docker-mailserver deployment/mailserver -- postconf smtpd_tls_auth_only` 3. Expected: `smtpd_tls_auth_only = yes` Closes: code-vnw Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-19 10:31:15 +00:00
Viktor Barzin	f568e7d2bf	[mailserver] Delete unused postfix_cf_reference_DO_NOT_USE variable [ci skip] ## Context `infra/stacks/mailserver/modules/mailserver/variables.tf` carried a 130-line historical scaffolding variable `postfix_cf_reference_DO_NOT_USE` containing a reference copy of an older Postfix main.cf layout. The variable name itself signalled dead-code intent ("DO_NOT_USE"), and a repo-wide `grep -rn postfix_cf_reference infra/` confirmed zero consumers — no module, no stack, no script, no doc ever referenced it. Carrying dead Terraform variables costs nothing at runtime but wastes reviewer attention on every `git blame` and drives up `variables.tf` read time. Note on history: the prior commit `09c11056` landed with an identical title ("Delete postfix_cf_reference_DO_NOT_USE dead code") but actually committed `docs/runbooks/mailserver-proxy-protocol.md` — fallout from a race between two concurrent mailserver sessions that staged files in parallel. That commit accidentally closed this beads task via the `Closes:` trailer without performing the deletion. This commit does the actual deletion that was originally intended for code-o3q. The runbook from `09c11056` is legitimate work for code-rtb and is left in place. ## This change Drops the entire `variable "postfix_cf_reference_DO_NOT_USE" { ... }` block (136 lines incl. trailing blank). No other variable touched, no resource touched, no comment elsewhere touched. `variables.tf` now contains only the live `postfix_cf` variable that is actually consumed by the module. ## What is NOT in this change - No Terraform state modification — variable was never read, so state has no record of it. - No Postfix runtime behaviour change — `postfix_cf` (the live one) is untouched. - No fix for the pre-existing `kubernetes_deployment.mailserver` / `kubernetes_service.mailserver` drift that `terragrunt plan` surfaces independently. Those 2 in-place updates are known and tracked separately. - No apply needed — pure source hygiene. ## Test Plan ### Automated Reference check before edit: ``` $ grep -rn postfix_cf_reference /home/wizard/code/infra/ infra/stacks/mailserver/modules/mailserver/variables.tf:41:variable "postfix_cf_reference_DO_NOT_USE" { ``` (single match — the declaration itself) Reference check after edit: ``` $ grep -rn postfix_cf_reference /home/wizard/code/infra/ (no matches) ``` `terragrunt validate` (from `infra/stacks/mailserver/`): ``` Success! The configuration is valid, but there were some validation warnings as shown above. ``` (warnings are pre-existing `kubernetes_namespace` -> `_v1` deprecation notices, unrelated) `terragrunt plan` (from `infra/stacks/mailserver/`): ``` # module.mailserver.kubernetes_deployment.mailserver will be updated in-place # module.mailserver.kubernetes_service.mailserver will be updated in-place Plan: 0 to add, 2 to change, 0 to destroy. ``` Both in-place updates are the known pre-existing drift. No change is attributable to this commit — the dead variable was never referenced. ### Manual Verification 1. `cd infra/stacks/mailserver/modules/mailserver/` 2. `grep -c postfix_cf_reference variables.tf` -> expected `0` 3. `wc -l variables.tf` -> expected `39` (was `175`; 136 lines removed) 4. `cd ../..` -> `terragrunt validate` -> expected `Success!` 5. `terragrunt plan` -> expected `Plan: 0 to add, 2 to change, 0 to destroy.` (pre-existing drift only). Closes: code-o3q Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-19 00:07:43 +00:00
Viktor Barzin	e2516b07a3	[mailserver] Disable postscreen btree cache to stop SMTP lock-contention stalls ## Context Postfix inside docker-mailserver was spamming fatal errors at roughly 4 per minute — 5,464 of them in a 24h window — all of the same shape: ``` postfix/postscreen[NNN]: fatal: btree:/var/lib/postfix/postscreen_cache: unable to get exclusive lock: Resource temporarily unavailable ``` Every time one of these fires, the postscreen process dies mid-connection and the inbound SMTP session is dropped. Legitimate mail (including Brevo deliveries for our e2e email-roundtrip probe) gets re-queued by the sender and arrives late — frequently past the probe's 180s IMAP polling window, producing a 35%/7d probe success rate and the EmailRoundtripStale alert noise that was originally flagged as "probably nothing." ## Root cause `master.cf` declares postscreen with `maxproc=1`, but postscreen still re-spawns per incoming connection (or for short-lived reopens), and each instance opens the shared btree cache with an exclusive file lock. Under any concurrency (two TCP SYNs arriving close together, or a retry during teardown), the second process hits EWOULDBLOCK on fcntl and Postfix treats that as fatal. Four options were considered: \| Option \| Verdict \| \|--------\|---------\| \| (a) Disable cache (postscreen_cache_map = ) \| ✓ chosen \| \| (b) Switch btree → lmdb \| ✗ lmdb not compiled into docker-mailserver 15.0.0's postfix (`postconf -m` has no lmdb) \| \| (c) proxy:btree via proxymap \| ✗ unsafe — Postfix docs: "postscreen does its own locking, not safe via proxymap" \| \| (d) Memcached sidecar \| ✗ new moving part; deferred \| Option (a) is a small trade-off: legitimate clients re-run the greet-action / bare-newline-action checks on every fresh TCP session instead of hitting the 7-day whitelist cache. At our volume (~100 deliveries/day, ~72 of which are the probe itself) that's negligible CPU. Without the cache, DNSBL lookups are also repeated on every session, but this mailserver already has `postscreen_dnsbl_action = ignore`, so the cache's DNSBL role was doing nothing anyway. ## This change Appends a stanza to the user-merged postfix main.cf stored in `variable.postfix_cf` that sets `postscreen_cache_map =` (empty value). Postfix treats an empty cache_map as "no persistent cache" — per-session decisions are still enforced, they just aren't cached across sessions. Before: ``` smtpd ──► postscreen (maxproc=1, btree cache with exclusive lock) ├─ concurrent access → fcntl EWOULDBLOCK → fatal └─ connection dropped, sender retries, mail arrives late ``` After: ``` smtpd ──► postscreen (no cache, per-session checks only) └─ no shared file, no lock → no fatal, no dropped session ``` No change to master.cf (postscreen still the front-end), no change to DNSBL / greet / bare-newline policy. ## What is NOT in this change - Dovecot userdb dedup (shipped in the previous commit). - Email-roundtrip probe widening (next commit). - Rebuilding docker-mailserver image with lmdb support (deferred — disabling the cache is simpler and sufficient at our volume). ## Test Plan ### Automated `postconf -m` in the running container to confirm lmdb is genuinely absent (ruling out option (b) before we commit to (a)): ``` btree cidr environ fail hash inline internal ldap memcache nis pcre pipemap proxy randmap regexp socketmap static tcp texthash unionmap unix ``` No lmdb entry — confirmed. `scripts/tg plan -target=module.mailserver.kubernetes_config_map.mailserver_config`: ``` ~ "postfix-main.cf" = <<-EOT + postscreen_cache_map = ``` `scripts/tg apply`: ``` Apply complete! Resources: 0 added, 1 changed, 0 destroyed. ``` Reloader triggers pod rollout — baseline error count before apply was 34 `unable to get exclusive lock` lines per `--tail=500` log window. ### Manual Verification Post-rollout, when the new pod is Ready: 1. `kubectl -n mailserver exec <pod> -c docker-mailserver -- postconf postscreen_cache_map` Expect: empty (no value) 2. Watch for 15 min: `kubectl -n mailserver logs -l app=mailserver -c docker-mailserver --tail=1000 \| grep -c "unable to get exclusive lock"` Expect: 0 new occurrences (any hits are from before the rollout). 3. Trigger a probe run manually: `kubectl -n mailserver create job --from=cronjob/email-roundtrip-monitor probe-verify-$(date +%s)` then `kubectl -n mailserver logs job/probe-verify-...` Expect: `Round-trip SUCCESS` with duration < 120s. ## Reproduce locally 1. `kubectl -n mailserver exec <pod> -c docker-mailserver -- postconf postscreen_cache_map` 2. Expect: `postscreen_cache_map =` (empty value) 3. `kubectl -n mailserver logs -l app=mailserver -c docker-mailserver --since=15m \| grep -c "unable to get exclusive lock"` 4. Expect: 0 Closes: code-1dc Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-18 21:32:48 +00:00
Viktor Barzin	1c300a14cf	mailserver: overhaul inbound delivery, monitoring, CrowdSec, and migrate to Brevo relay Inbound: - Direct MX to mail.viktorbarzin.me (ForwardEmail relay attempted and abandoned) - Dedicated MetalLB IP 10.0.20.202 with ETP: Local for CrowdSec real-IP detection - Removed Cloudflare Email Routing (can't store-and-forward) - Fixed dual SPF violation, hardened to -all - Added MTA-STS, TLSRPT, imported Rspamd DKIM into Terraform - Removed dead BIND zones from config.tfvars (199 lines) Outbound: - Migrated from Mailgun (100/day) to Brevo (300/day free) - Added Brevo DKIM CNAMEs and verification TXT Monitoring: - Probe frequency: 30m → 20m, alert thresholds adjusted to 60m - Enabled Dovecot exporter scraping (port 9166) - Added external SMTP monitor on public IP Documentation: - New docs/architecture/mailserver.md with full architecture - New docs/architecture/mailserver-visual.html visualization - Updated monitoring.md, CLAUDE.md, historical plan docs	2026-04-12 22:24:38 +01:00
Viktor Barzin	ae36dc253b	extract monitoring, nvidia, mailserver, cloudflared, kyverno from platform [ci skip] Phase 2 of platform stack split. 5 more modules extracted into independent stacks. All applied successfully with zero destroys. Cloudflared now reads k8s_users from Vault directly to compute user_domains. Woodpecker pipeline runs all 8 extracted stacks in parallel. Memory bumped to 6Gi for 9 concurrent TF processes. Platform reduced from 27 to 19 modules.	2026-03-17 21:34:11 +00:00
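
Commit a32bfbf07e relies on `smtpd_tls_auth_only = yes` to stop Postfix from announcing or accepting AUTH on an unencrypted session. One way to check this without sending credentials is to inspect the EHLO reply for an AUTH capability line. The sketch below is a minimal illustration, not part of the repo: `auth_advertised` is a hypothetical helper name, and capturing the EHLO reply (for example from a plaintext session on port 587) is assumed to happen elsewhere.

```shell
#!/bin/sh
# Hypothetical helper (not part of the repo): reads an SMTP EHLO reply on stdin
# and prints "yes" if the server advertises AUTH, "no" otherwise.
auth_advertised() {
    # Strip the CRs from the CRLF line endings, then look for a
    # "250-AUTH ..." or "250 AUTH ..." capability line.
    if tr -d '\r' | grep -Eiq '^250[- ]AUTH( |=|$)'; then
        echo yes
    else
        echo no
    fi
}
```

Fed the EHLO reply from a plaintext session, the result should be `no` once the setting is live; fed the reply from a session that has already completed STARTTLS, it should be `yes`.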

6 commits
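
Commit e2516b07a3 verifies its fix by counting `unable to get exclusive lock` lines in the mailserver log. The same count can be turned into a per-minute rate, which makes windows of different lengths comparable. This is a minimal sketch under assumptions: `lock_error_rate` is a hypothetical helper name, and the window length in minutes is supplied by the caller rather than derived from log timestamps.

```shell
#!/bin/sh
# Hypothetical helper (not part of the repo): reads log text on stdin and
# reports how many postscreen lock failures it contains, plus the rate over a
# window whose length in minutes is passed as $1 (must be greater than zero).
lock_error_rate() {
    awk -v minutes="$1" '
        /unable to get exclusive lock/ { n++ }
        END { printf "%d errors, %.2f per minute\n", n, n / minutes }
    '
}
```

For the 15-minute check in that commit, it could be used as `kubectl -n mailserver logs -l app=mailserver -c docker-mailserver --since=15m | lock_error_rate 15`, where a healthy rollout would report `0 errors, 0.00 per minute`.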