Workstreams A, B, G, H, I of the DNS reliability plan (code-q2e).
Follow-ups for C, D, E, F filed as code-2k6, code-k0d, code-o6j, code-dw8.
**Technitium (WS A)**
- Primary deployment: add Kyverno lifecycle ignore_changes on dns_config
(secondary/tertiary already had it) — eliminates per-apply ndots drift.
- All 3 instances: raise memory request+limit from 512Mi to 1Gi (primary
was restarting near the ceiling; CPU limits stay off per cluster policy).
- zone-sync CronJob: parse API responses, push status/failures/last-run and
per-instance zone_count gauges to Pushgateway, fail the job on any
create error (was silently passing).
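The zone-sync reporting described above can be sketched as follows. This is a minimal illustration, not the CronJob's actual script: the metric names, the `replica` label and both function names are hypothetical, and the HTTP push to Pushgateway is left out so the sketch stays self-contained.

```python
def render_zone_sync_metrics(zone_counts, create_errors, last_run_ts):
    """Render Prometheus text-exposition lines suitable for a Pushgateway push.

    zone_counts:   mapping of instance name -> number of zones it serves
    create_errors: list of error strings collected while creating zones
    last_run_ts:   Unix timestamp of this run
    All metric names below are illustrative assumptions.
    """
    lines = [
        "# TYPE technitium_zone_sync_failures gauge",
        f"technitium_zone_sync_failures {len(create_errors)}",
        "# TYPE technitium_zone_sync_last_run_timestamp gauge",
        f"technitium_zone_sync_last_run_timestamp {int(last_run_ts)}",
        "# TYPE technitium_zone_count gauge",
    ]
    for name, count in sorted(zone_counts.items()):
        lines.append(f'technitium_zone_count{{replica="{name}"}} {count}')
    return "\n".join(lines) + "\n"


def job_exit_code(create_errors):
    """Non-zero exit when any zone create failed, so the Job is marked failed."""
    return 1 if create_errors else 0
```

Returning a non-zero code on any create error is what turns a previously silent failure into a failed Job that an alert can key on.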
**CoreDNS (WS B)**
- Corefile: add policy sequential + health_check 5s + max_fails 2 on root
forward, health_check on viktorbarzin.lan forward, serve_stale
3600s/86400s on both cache blocks — pfSense flap no longer takes the
cluster down; upstream outage keeps cached names resolving for 24h.
- Scale deploy/coredns to 3 replicas with required pod anti-affinity on
hostname via null_resource (hashicorp/kubernetes v3 dropped the _patch
resources); readiness gate asserts state post-apply.
- PDB coredns with minAvailable=2.
**Observability (WS G)**
- Fix DNSQuerySpike — rewrite to compare against
avg_over_time(dns_anomaly_total_queries[1h] offset 15m); previous
dns_anomaly_avg_queries was computed from a per-pod /tmp file so always
equalled the current value (alert could never fire).
- New: DNSQueryRateDropped, TechnitiumZoneSyncFailed,
TechnitiumZoneSyncStale, TechnitiumZoneCountMismatch,
CoreDNSForwardFailureRate.
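Why the old DNSQuerySpike could never fire can be shown with a small model. This is a sketch under assumptions: the "broken" baseline is modelled as an average over only the newest sample (standing in for the per-pod /tmp file), and the window and offset are expressed in samples rather than PromQL durations.

```python
def broken_baseline(samples):
    """Old behaviour (modelled): the average only ever sees the newest
    sample, so it always equals the current value."""
    return samples[-1]


def windowed_baseline(samples, window, offset):
    """Mean of `window` samples ending `offset` samples before the newest,
    analogous to avg_over_time(...[1h] offset 15m)."""
    end = len(samples) - offset
    start = max(0, end - window)
    history = samples[start:end]
    return sum(history) / len(history) if history else 0.0


def spike_ratio(current, baseline):
    """Ratio an alert threshold would be compared against."""
    return current / baseline if baseline else 0.0


# Steady 100 queries per sample, then a jump to 500 on the newest sample.
samples = [100] * 15 + [500]
old = spike_ratio(samples[-1], broken_baseline(samples))
new = spike_ratio(samples[-1], windowed_baseline(samples, window=12, offset=3))
print(f"old ratio={old}, new ratio={new}")
```

With the old baseline the ratio is pinned at 1 no matter how large the jump is; with an offset window the jump shows up as a multiple of the recent average.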
**Post-apply readiness gate (WS H)**
- null_resource.technitium_readiness_gate runs at end of apply:
kubectl rollout status on all 3 deployments (180s), per-pod
/api/stats/get probe, zone-count parity across the 3 instances.
Fails the apply if any check fails. Override: -var skip_readiness=true.
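The zone-count parity part of the gate reduces to a comparison like the one below. This is a hedged sketch: the real gate is a `null_resource` running shell, and the function name and return shape here are assumptions made for illustration.

```python
def check_zone_parity(zone_counts):
    """Return (ok, message) for a mapping of instance name -> zone count.

    Passes only when every instance reports the same number of zones.
    """
    if not zone_counts:
        return False, "no instances reported a zone count"
    distinct = set(zone_counts.values())
    if len(distinct) == 1:
        return True, f"all {len(zone_counts)} instances report {distinct.pop()} zones"
    detail = ", ".join(f"{name}={count}" for name, count in sorted(zone_counts.items()))
    return False, f"zone-count mismatch: {detail}"
```

A caller would exit non-zero when `ok` is false, which is what fails the apply.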
**Docs (WS I)**
- docs/architecture/dns.md: CoreDNS Corefile hardening, new alerts table,
zone-sync metrics reference, why DNSQuerySpike was broken.
- docs/runbooks/technitium-apply.md (new): what the gate checks, failure
modes, emergency override.
Out of scope for this commit (see beads follow-ups):
- WS C: NodeLocal DNSCache (code-2k6)
- WS D: pfSense Unbound replaces dnsmasq (code-k0d)
- WS E: Kea multi-IP DHCP + TSIG (code-o6j)
- WS F: static-client DNS fixes (code-dw8)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
## Context
After code-yiu Phases 1a–6 landed, `docs/architecture/mailserver.md` still
carried the pre-HAProxy Mermaid diagram, a retired Dovecot-exporter
component row, stale PVC names (`-proxmox` suffixes that were renamed
`-encrypted` during the LUKS migration), a wrong probe schedule
(claimed 10 min, actually 20 min), and a Mailgun-API claim for the
probe (it's been on Brevo since code-n5l). The two-path architecture
(external-via-HAProxy + intra-cluster-via-ClusterIP) that defines the
current design wasn't visualised at all.
## This change
Rewrote the Architecture Diagram section to show **both ingress paths
in one Mermaid flowchart**, colour-coded:
- External (orange): Sender → pfSense NAT → HAProxy → NodePort →
**alt PROXY listeners** (2525/4465/5587/10993).
- Intra-cluster (blue): Roundcube / probe → ClusterIP Service →
**stock listeners** (25/465/587/993), no PROXY.
- The pod subgraph shows both listener sets feeding the same Postfix /
Rspamd / Dovecot / Maildir pipeline.
- Security dotted edges: Postfix log stream → CrowdSec agent →
LAPI → pfSense bouncer decisions.
- Monitoring dotted edges: probe → Brevo HTTP → MX → pod → IMAP →
Pushgateway/Uptime Kuma.
Added a **sequenceDiagram** for the external SMTP roundtrip — walks
through the wire-level handshake from external MTA → pfSense NAT →
HAProxy TCP connect → PROXY v2 header write → kube-proxy SNAT → pod
postscreen parse → smtpd banner. Makes the "how does the pod see the
real IP despite SNAT?" question self-answering.
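The mechanism behind that answer is that the client address travels inside the TCP payload, ahead of the SMTP bytes, so SNAT on the packet headers cannot touch it. A minimal sketch of a PROXY protocol v2 header for TCP over IPv4 follows; it is not HAProxy's or Postfix's code, the function names are made up for illustration, and TLVs, IPv6 and the LOCAL command are deliberately not handled.

```python
import socket
import struct

# Fixed 12-byte signature that opens every PROXY v2 header.
SIGNATURE = b"\r\n\r\n\x00\r\nQUIT\n"
VER_CMD_PROXY = 0x21   # version 2, command PROXY
FAM_TCP_IPV4 = 0x11    # AF_INET, SOCK_STREAM


def build_proxy_v2(src_ip, src_port, dst_ip, dst_port):
    """Build the header a proxy writes before forwarding client bytes."""
    addresses = (
        socket.inet_aton(src_ip)
        + socket.inet_aton(dst_ip)
        + struct.pack("!HH", src_port, dst_port)
    )
    return SIGNATURE + struct.pack("!BBH", VER_CMD_PROXY, FAM_TCP_IPV4, len(addresses)) + addresses


def parse_proxy_v2(data):
    """Return ((src_ip, src_port), (dst_ip, dst_port), remaining_payload)."""
    if data[:12] != SIGNATURE:
        raise ValueError("missing PROXY v2 signature")
    ver_cmd, fam_proto, length = struct.unpack("!BBH", data[12:16])
    if ver_cmd != VER_CMD_PROXY or fam_proto != FAM_TCP_IPV4:
        raise ValueError("this sketch only handles PROXY over TCP/IPv4")
    body = data[16:16 + length]
    src_ip = socket.inet_ntoa(body[0:4])
    dst_ip = socket.inet_ntoa(body[4:8])
    src_port, dst_port = struct.unpack("!HH", body[8:12])
    return (src_ip, src_port), (dst_ip, dst_port), data[16 + length:]
```

Whatever source IP the TCP connection carries after SNAT, the listener reads the original client address from this header before it speaks SMTP.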
Added a **Port mapping table** listing all 8 container listeners (4
stock + 4 alt) with their Service, NodePort, PROXY-required flag, and
who uses each path. Replaces the ambiguous prose about "alt ports".
Fixed stale bits:
- Removed Dovecot Exporter row from Components (retired in code-1ik).
- Added pfSense HAProxy row.
- Probe schedule: every 10 min → **every 20 min** (`*/20 * * * *`).
- Probe API: Mailgun → **Brevo HTTP**.
- PVC names: `-proxmox` → **`-encrypted`** (all three); storage class
`proxmox-lvm` → **`proxmox-lvm-encrypted`**.
- Added `mailserver-backup-host` + `roundcube-backup-host` RWX NFS
PVCs to the Storage table with backup flow pointer.
- Expanded Troubleshooting → Inbound to include HAProxy health check
+ container-listener verification steps.
- Secrets table: `brevo_api_key` now marked as used by both relay +
probe; `mailgun_api_key` marked historical.
Added a prominent **UPDATE 2026-04-19** header to
`docs/runbooks/mailserver-proxy-protocol.md` pointing future readers
at the implemented state in `mailserver-pfsense-haproxy.md`. Research
doc preserved as a decision record — it's the canonical "why not just
pin the pod?" reference.
## What is NOT in this change
- No Terraform changes; this is docs-only.
- No changes to the runbook (`mailserver-pfsense-haproxy.md`) — it was
already rewritten during Phase 6.
## Test Plan
### Automated
```
$ awk '/^```mermaid/ {c++} END{print c}' docs/architecture/mailserver.md
2
$ grep -c '\-encrypted' docs/architecture/mailserver.md
5 # PVC references normalised
$ grep -c '\-proxmox' docs/architecture/mailserver.md
0 # no stale names left
```
### Manual Verification
Render `docs/architecture/mailserver.md` on GitHub or any Mermaid-
capable viewer:
1. Top Architecture Diagram should show two labelled paths into the
pod, colour-coded (orange = external, blue = intra-cluster).
2. Sequence diagram should show 10 numbered steps ending at Rspamd +
Dovecot delivery.
3. Port Mapping table should make it obvious that the 4 alt container
ports are only reachable via `mailserver-proxy` NodePort and require
PROXY v2.
## Context (bd code-yiu)
With Phase 4+5 proven (external mail flows through pfSense HAProxy +
PROXY v2 to the alt PROXY-speaking container listeners), the MetalLB
LoadBalancer Service + `10.0.20.202` external IP + ETP:Local policy are
obsolete. Phase 6 decommissions them and documents the steady-state
architecture.
## This change
### Terraform (stacks/mailserver/modules/mailserver/main.tf)
- `kubernetes_service.mailserver` downgraded: `LoadBalancer` → `ClusterIP`.
- Removed `metallb.io/loadBalancerIPs = "10.0.20.202"` annotation.
- Removed `external_traffic_policy = "Local"` (irrelevant for ClusterIP).
- Port set unchanged — the Service still exposes 25/465/587/993 for
intra-cluster clients (Roundcube pod, `email-roundtrip-monitor`
CronJob) that hit the stock PROXY-free container listeners.
- Inline comment documents the downgrade rationale + companion
`mailserver-proxy` NodePort Service that now carries external traffic.
### pfSense (ops, not in git)
- `mailserver` host alias (pointing at `10.0.20.202`) deleted. No NAT
rule references it post-Phase-4; keeping it would be misleading dead
metadata. Deleted with the ad-hoc `php /tmp/delete-mailserver-alias.php`
script (not checked in); reversible via the WebUI, since the alias is just a
Firewall → Aliases → Hosts entry.
### Uptime Kuma (ops)
- Monitors `282` and `283` (PORT checks) retargeted from `10.0.20.202`
→ `10.0.20.1`. Renamed to `Mailserver HAProxy SMTP (pfSense :25)` /
`... IMAPS (pfSense :993)` to reflect their new purpose (HAProxy
layer liveness). History retained (edit, not delete-recreate).
### Docs
- `docs/runbooks/mailserver-pfsense-haproxy.md` — fully rewritten
"Current state" section; now reflects steady-state architecture with
two-path diagram (external via HAProxy / intra-cluster via ClusterIP).
Phase history table marks Phase 6 ✅. Rollback section updated (no
one-liner post-Phase-6; need Service-type re-upgrade + alias re-add).
- `docs/architecture/mailserver.md` — Overview, Mermaid diagram, Inbound
flow, CrowdSec section, Uptime Kuma monitors list, Decisions section
(dedicated MetalLB IP → "Client-IP Preservation via HAProxy + PROXY
v2"), Troubleshooting all updated.
- `.claude/CLAUDE.md` — mailserver monitoring + architecture paragraph
updated with new external path description; references the new runbook.
## What is NOT in this change
- Removal of `10.0.20.202` from `cloudflare_proxied_names` or any
reserved-IP tracking — wasn't there to begin with. The
`metallb-system default` IPAddressPool (10.0.20.200-220) reports 2 IPs
assigned and 19 available after this, confirming `.202` went back to the pool.
- Phase 4 NAT-flip rollback scripts — kept on-disk, still valid if
someone re-introduces the MetalLB LB (see runbook "Rollback").
## Test Plan
### Automated (verified pre-commit 2026-04-19)
```
# Service is ClusterIP with no EXTERNAL-IP
$ kubectl get svc -n mailserver mailserver
mailserver ClusterIP 10.103.108.217 <none> 25/TCP,465/TCP,587/TCP,993/TCP
# 10.0.20.202 no longer answers ARP (ping from pfSense)
$ ssh admin@10.0.20.1 'ping -c 2 -t 2 10.0.20.202'
2 packets transmitted, 0 packets received, 100.0% packet loss
# MetalLB pool released the IP
$ kubectl get ipaddresspool default -n metallb-system \
-o jsonpath='{.status.assignedIPv4} of {.status.availableIPv4}'
2 of 19 available
# E2E probe — external Brevo → WAN:25 → pfSense HAProxy → pod — STILL SUCCEEDS
$ kubectl create job --from=cronjob/email-roundtrip-monitor probe-phase6 -n mailserver
... Round-trip SUCCESS in 20.3s ...
$ kubectl delete job probe-phase6 -n mailserver
# pfSense mailserver alias removed
$ ssh admin@10.0.20.1 'php -r "..." | grep mailserver'
(no output)
```
### Manual Verification
1. Visit `https://uptime.viktorbarzin.me` — monitors 282/283 green on new
hostname `10.0.20.1`.
2. Roundcube login works (`https://mail.viktorbarzin.me/`).
3. Send test email to `smoke-test@viktorbarzin.me` from Gmail — observe
`postfix/smtpd-proxy25/postscreen: CONNECT from [<Gmail-IP>]` in
mailserver logs within ~10s.
4. CrowdSec should still see real client IPs in postfix/dovecot parsers
(verify with `cscli alerts list` on next auth-fail event).
## Phase history (bd code-yiu)
| Phase | Status | Description |
|---|---|---|
| 1a | ✅ `ef75c02f` | k8s alt :2525 listener + NodePort Service |
| 2 | ✅ 2026-04-19 | pfSense HAProxy pkg installed |
| 3 | ✅ `ba697b02` | HAProxy config persisted in pfSense XML |
| 4+5 | ✅ `9806d515` | 4-port alt listeners + HAProxy frontends + NAT flip |
| 6 | ✅ **this commit** | MetalLB LB retired; 10.0.20.202 released; docs updated |
Closes: code-yiu
## Context (bd code-yiu)
Phase 2 (HAProxy on pfSense) and Phase 3 (persist config in pfSense XML so
it lives in the nightly backup) of the PROXY-v2 migration. Test path only —
listens on pfSense 10.0.20.1:2525 → k8s node NodePort :30125 → pod :2525
postscreen. Real client IP verified in maillog
(`postfix/smtpd-proxy/postscreen: CONNECT from [10.0.10.10]:...`). Phase 1a
container plumbing is already live (commit ef75c02f).
pfSense HAProxy config lives in `/cf/conf/config.xml` under
`<installedpackages><haproxy>`. That file is captured daily by
`scripts/daily-backup.sh` (scp → `/mnt/backup/pfsense/config-YYYYMMDD.xml`)
and synced offsite to Synology. No new backup wiring needed — this commit
documents the fact + adds the reproducer script.
## This change
Two files, both additive:
1. `scripts/pfsense-haproxy-bootstrap.php` — idempotent PHP script that
edits pfSense config.xml to add:
- Backend pool `mailserver_nodes` with 4 k8s workers on NodePort 30125,
`send-proxy-v2`, TCP health-check every 120000 ms (2 min).
- Frontend `mailserver_proxy_test` listening on pfSense 10.0.20.1:2525
in TCP mode, forwarding to the pool.
Uses `haproxy_check_and_run()` to regenerate `/var/etc/haproxy/haproxy.cfg`
and reload HAProxy. Removes existing items with the same name before
adding, so repeat runs converge on declared state.
2. `docs/runbooks/mailserver-pfsense-haproxy.md` — ops runbook covering
current state, validation, bootstrap/restore, health checks, phase
roadmap, and known warts (health-check noise + bind-address templating).
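The "remove items with the same name, then add" convergence used by the bootstrap script can be sketched as below. This is a Python analogue rather than the PHP script itself; the dictionary keys other than `name` are hypothetical stand-ins for the pfSense XML fields.

```python
def upsert_by_name(items, desired):
    """Return a new list where `desired` replaces any item with the same name.

    Running it repeatedly with the same `desired` yields the same list,
    which is what makes repeat runs converge on the declared state.
    """
    kept = [item for item in items if item["name"] != desired["name"]]
    kept.append(desired)
    return kept


# Hypothetical backend entry; field names are illustrative only.
pool = {"name": "mailserver_nodes", "port": 30125, "send_proxy_v2": True}
config = [{"name": "other_pool", "port": 8080}]
once = upsert_by_name(config, pool)
twice = upsert_by_name(once, pool)
```

Because the old entry is dropped before the new one is appended, a changed port or flag in `desired` overwrites the stored value instead of creating a duplicate.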
## What is NOT in this change
- Phase 4 (NAT rdr flip for :25 from `<mailserver>` → HAProxy) — deferred.
- Phase 5 (extend to 465/587/993 with alt listeners + Dovecot dual-
inet_listener) — deferred.
- Terraform for pfSense HAProxy pkg install — not possible (no Terraform
provider for pfSense pkg management). Runbook documents the manual
`pkg install` command.
## Test Plan
### Automated
```
$ ssh admin@10.0.20.1 'pgrep -lf haproxy; sockstat -l | grep :2525'
64009 /usr/local/sbin/haproxy -f /var/etc/haproxy/haproxy.cfg -p /var/run/haproxy.pid -D
www haproxy 64009 5 tcp4 *:2525 *:*
$ ssh admin@10.0.20.1 "echo 'show servers state' | socat /tmp/haproxy.socket stdio" \
| awk 'NR>1 {print $4, $6}'
node1 2
node2 2
node3 2
node4 2 # all UP
$ python3 -c "
import socket; s=socket.socket(); s.settimeout(10)
s.connect(('10.0.20.1', 2525))
print(s.recv(200).decode())
s.send(b'EHLO persist-test.example.com\r\n')
print(s.recv(500).decode())
s.send(b'QUIT\r\n'); s.close()"
220-mail.viktorbarzin.me ESMTP
...
250-mail.viktorbarzin.me
250-SIZE 209715200
...
221 2.0.0 Bye
$ kubectl logs -c docker-mailserver deployment/mailserver -n mailserver --tail=50 \
| grep smtpd-proxy.*CONNECT
postfix/smtpd-proxy/postscreen: CONNECT from [10.0.10.10]:33010 to [10.0.20.1]:2525
```
Real client IP `[10.0.10.10]` visible (not the k8s-node IP after kube-proxy
SNAT) → PROXY-v2 roundtrip confirmed.
### Manual Verification
Trigger a pfSense reboot; after boot, HAProxy should auto-restart from the
now-persisted config (`<enable>yes</enable>` in XML). Connection test above
should still work.
## Reproduce locally
1. `scp infra/scripts/pfsense-haproxy-bootstrap.php admin@10.0.20.1:/tmp/`
2. `ssh admin@10.0.20.1 'php /tmp/pfsense-haproxy-bootstrap.php'` → rc=OK
3. Re-run the `python3 -c '...'` SMTP roundtrip test above.
## Context
`modules/kubernetes/nfs_volume` creates the K8s PV but NOT the underlying
directory on the Proxmox NFS host (`192.168.1.127:/srv/nfs/<subdir>`).
The first time a new consumer is added, the mount fails with
`mount.nfs: … No such file or directory` and the pod hangs in
ContainerCreating.
This bit us twice during the Wave 1/2 rollout — once for the mailserver
backup (code-z26) and again for the Roundcube backup (code-1f6). Both
times the fix was `ssh root@192.168.1.127 'mkdir -p /srv/nfs/<subdir>'`.
Rather than automate the SSH dependency into the module (which would
break hermeticity and fail for operators without host SSH), this runbook
documents the manual bootstrap step and the rationale.
Addresses bd code-yo4.
## This change
New file: `docs/runbooks/nfs-prerequisites.md`. Lists known consumers,
gives the copy-paste SSH command, and explains why auto-creation was
rejected (two options, neither worth the churn).
## What is NOT in this change
- Any automation of the bootstrap — runbook only
- Migration to `nfs-subdir-external-provisioner` — explicitly out of scope
## Test Plan
### Automated
```
$ cat docs/runbooks/nfs-prerequisites.md | head -5
# NFS Prerequisites for `modules/kubernetes/nfs_volume`
The `nfs_volume` Terraform module creates a `PersistentVolume` pointing at a
path on the Proxmox NFS server (`192.168.1.127`). It does **not** create the
underlying directory on the server.
```
### Manual Verification
Before the next stack adds a new `nfs_volume` consumer, read the runbook
and run the `ssh root@192.168.1.127 'mkdir -p ...'` step. The first pod
should then reach Ready within a minute of PV creation.
Closes: code-yo4
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
## Context
`infra/stacks/mailserver/modules/mailserver/variables.tf` carried a
130-line historical scaffolding variable
`postfix_cf_reference_DO_NOT_USE` containing a reference copy of an
older Postfix `main.cf` layout. The variable name itself signalled
dead-code intent ("DO_NOT_USE"), and a repo-wide
`grep -rn postfix_cf_reference infra/` confirmed zero consumers — no
module, no stack, no script, no doc ever referenced it. Carrying dead
Terraform variables costs nothing at runtime but actively wastes
reviewer attention on every `git blame`, drives up `variables.tf` read
time, and lets drift calcify.
Trade-offs considered:
- Keep it "just in case" → rejected; the file it mirrored
(`/usr/share/postfix/main.cf.dist`) is already canonical upstream and
reproducible inside any docker-mailserver container.
- Move it to a comment block → rejected; same noise cost, no value
over deletion (authoritative source is in the image).
## This change
Drops the entire `variable "postfix_cf_reference_DO_NOT_USE" { ... }`
block (136 lines incl. trailing blank). No other variable touched, no
resource touched, no comment elsewhere touched. `variables.tf` now
contains only the single live variable `postfix_cf` that is actually
consumed by the module.
## What is NOT in this change
- No Terraform state modification — variable was never read, so state
has no record of it.
- No Postfix runtime behaviour change — `postfix_cf` (the live one) is
untouched.
- No fix for the pre-existing `kubernetes_deployment.mailserver` /
`kubernetes_service.mailserver` drift that `terragrunt plan` surfaces
independently. Those 2 in-place updates are known and tracked
separately; this commit explicitly avoids conflating cleanup with
drift resolution.
- No apply needed — pure source hygiene.
## Test Plan
### Automated
Reference check before edit:
```
$ grep -rn postfix_cf_reference /home/wizard/code/infra/
infra/stacks/mailserver/modules/mailserver/variables.tf:41:variable "postfix_cf_reference_DO_NOT_USE" {
```
(single match — the declaration itself)
Reference check after edit:
```
$ grep -rn postfix_cf_reference /home/wizard/code/infra/
(no matches)
```
`terragrunt validate` (from `infra/stacks/mailserver/`):
```
Success! The configuration is valid, but there were some
validation warnings as shown above.
```
(warnings are pre-existing `kubernetes_namespace` → `_v1` deprecation
notices, unrelated)
`terragrunt plan` (from `infra/stacks/mailserver/`):
```
# module.mailserver.kubernetes_deployment.mailserver will be updated in-place
# module.mailserver.kubernetes_service.mailserver will be updated in-place
Plan: 0 to add, 2 to change, 0 to destroy.
```
Both in-place updates are the known pre-existing drift
(volume_mount ordering + stale `metallb.io/ip-allocated-from-pool`
annotation). No change is attributable to this commit — the dead
variable was never referenced, so removing it leaves state untouched.
### Manual Verification
1. `cd infra/stacks/mailserver/modules/mailserver/`
2. `grep -c postfix_cf_reference variables.tf` → expected `0`
3. `wc -l variables.tf` → expected `39` (was `175`; 136 lines removed
including the trailing blank after the EOT)
4. Open `variables.tf` → expected: only `variable "postfix_cf"` remains
5. `cd ../..` (stack root) → `terragrunt validate` → expected:
`Success! The configuration is valid`
6. `terragrunt plan` → expected: `Plan: 0 to add, 2 to change, 0 to
destroy.` (the 2 are the pre-existing drift, not from this commit).
Closes: code-o3q
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
## Context
An audit of the mailserver stack raised the question: why is Fail2ban
disabled in the docker-mailserver deployment? The setting
`ENABLE_FAIL2BAN = "0"` lives in the env ConfigMap at
`stacks/mailserver/modules/mailserver/main.tf:68` with no documented
rationale, which made the decision look accidental rather than
deliberate.
The decision is deliberate: CrowdSec is the cluster-wide bouncer for
SSH, HTTP, and SMTP/IMAP brute-force defence. It already tails
`postfix` + `dovecot` logs via the installed collections and enforces
decisions at the LB/firewall tier with real client IPs preserved by
`externalTrafficPolicy: Local` on the dedicated MetalLB IP. Enabling
Fail2ban in-pod would duplicate that response path — two systems
racing to ban the same offender from different enforcement points,
iptables churn inside the container, and a split audit trail across
two decision stores. User decision 2026-04-18: keep disabled, document
the decision so the next auditor doesn't have to re-derive it.
## This change
Adds a new subsection "Fail2ban Disabled (CrowdSec is the Policy)" to
the Security section of `docs/architecture/mailserver.md`, placed
immediately after the existing CrowdSec Integration block. The
paragraph cites `stacks/mailserver/modules/mailserver/main.tf:68`
(where `ENABLE_FAIL2BAN = "0"` lives) and explains why duplicating the
layer would make things worse, not better. Pure docs — no Terraform
touched.
## Test Plan
### Automated
None — docs-only change. No tests, lint, or type checks apply to
markdown prose.
### Manual Verification
1. `less infra/docs/architecture/mailserver.md` — locate the Security
section; confirm the new "Fail2ban Disabled (CrowdSec is the
Policy)" subsection appears between "CrowdSec Integration" and
"Rspamd".
2. Render on GitHub or via a markdown previewer; confirm the inline
link to `main.tf` resolves and the paragraph reads cleanly.
3. `grep -n 'ENABLE_FAIL2BAN' infra/stacks/mailserver/modules/mailserver/main.tf`
— confirm it still reports the value on line 68, matching the
citation in the doc.
Closes: code-zhn
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
## Context
Outbound mail relay migrated from Mailgun EU to Brevo EU on 2026-04-12 when
variables.tf:6 of the mailserver stack was switched to `smtp-relay.brevo.com:587`.
Postfix immediately began using Brevo for user mail — but the SPF TXT record
at viktorbarzin.me was left pointing at `include:mailgun.org -all`, so every
Brevo-relayed message failed SPF alignment and was spam-foldered or
DMARC-quarantined by Gmail/Outlook.
Observed on 2026-04-18 via `dig TXT viktorbarzin.me @1.1.1.1`:
    "v=spf1 include:mailgun.org -all"    <-- wrong sender network
User decision (2026-04-18): switch to `v=spf1 include:spf.brevo.com ~all`.
Soft-fail (`~all`) is intentional during cutover: mail from any sender the
record does not cover (including a Brevo IP we have not validated yet) is
soft-failed, and typically quarantined, rather than rejected outright while
we validate Brevo's sending IPs + rate limits for real user mail. Tighten to
`-all` once the relay is proven stable.
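The difference between the two records comes down to the qualifier on the trailing `all` mechanism. A minimal sketch of reading it is shown here; it only inspects the record text, does not resolve `include:` targets or perform a real SPF evaluation, and the function name is an assumption.

```python
QUALIFIERS = {"+": "pass", "-": "fail", "~": "softfail", "?": "neutral"}


def spf_all_policy(record):
    """Return the result name attached to the `all` mechanism, or None.

    Only looks at the literal record; no DNS lookups are made.
    """
    terms = record.split()
    if not terms or terms[0].lower() != "v=spf1":
        raise ValueError("not an SPF version 1 record")
    for term in terms[1:]:
        lowered = term.lower()
        if lowered == "all":
            return QUALIFIERS["+"]  # a bare mechanism defaults to "+"
        if len(lowered) == 4 and lowered[1:] == "all" and lowered[0] in QUALIFIERS:
            return QUALIFIERS[lowered[0]]
    return None


print(spf_all_policy("v=spf1 include:spf.brevo.com ~all"))
print(spf_all_policy("v=spf1 include:mailgun.org -all"))
```

Tightening later is then a one-character change from `~all` to `-all` in the same record.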
The docs in `docs/architecture/mailserver.md` still described the old
Mailgun-based configuration (Overview paragraph, DNS table, Vault secrets
table). Per `infra/.claude/CLAUDE.md` rule "Update docs with every change",
those are updated in the same commit.
## This change
Coupled commit covering beads tasks code-q8p (SPF) + code-9pe (docs):
1. `stacks/cloudflared/modules/cloudflared/cloudflare.tf` — SPF TXT content
flipped from `include:mailgun.org -all` to `include:spf.brevo.com ~all`,
with an inline comment pointing at the mailserver docs for rationale.
2. `docs/architecture/mailserver.md` —
- Last-updated stamp moved to 2026-04-18 with the cutover note.
- Overview paragraph now says "relays through Brevo EU" (was Mailgun).
- DNS table SPF row reflects the new value plus an annotated history
note ("was include:mailgun.org -all until 2026-04-18").
- DMARC row now calls out the intended `dmarc@viktorbarzin.me` rua
target and flags that the current live record still points at
e21c0ff8@dmarc.mailgun.org, tracked under follow-up code-569.
- Vault secrets table: `mailserver_sasl_passwd` relabelled as Brevo
relay credentials; `mailgun_api_key` annotated as retained for the
E2E roundtrip probe only (inbound delivery testing, not user mail).
Apply was scoped with `-target=module.cloudflared.cloudflare_record.mail_spf`
to avoid sweeping up two unrelated pre-existing drifts that the Terraform
state shows on this stack: the DMARC + mail._domainkey_rspamd records are
stored on Cloudflare as RFC-compliant split TXT strings (>255 bytes), and
a naive refresh+apply would normalize them in the state back to single
strings. Those drifts are semantically equivalent (SPF/DKIM/DMARC verifiers
concatenate the strings of a TXT record before evaluating it) and are out of
scope for this commit; they'll be handled under their own ticket.
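Why the split and single-string forms are equivalent can be illustrated with a short round trip. This is a sketch of the representation only: each character-string in a TXT record holds at most 255 bytes, so longer values are stored as several strings that consumers join back together. The key material below is a dummy placeholder, not a real DKIM key.

```python
def split_txt(value, limit=255):
    """Split a TXT value into character-strings of at most `limit` bytes."""
    raw = value.encode("ascii")
    return [raw[i:i + limit] for i in range(0, len(raw), limit)]


def join_txt(chunks):
    """Concatenate character-strings with no separator, as verifiers do."""
    return b"".join(chunks).decode("ascii")


# Dummy payload standing in for a long DKIM public key.
value = "v=DKIM1; k=rsa; p=" + "A" * 400
chunks = split_txt(value)
restored = join_txt(chunks)
```

Terraform state holding one long string while the provider returns several chunks is therefore a representation difference, not a content difference.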
## What is NOT in this change
- DMARC `rua=mailto:dmarc@viktorbarzin.me` cutover — that's code-569 (M1),
still using the legacy `e21c0ff8@dmarc.mailgun.org` + ondmarc addresses
in the live record.
- DMARC/DKIM TXT multi-string state reconciliation on `mail_dmarc` and
`mail_domainkey_rspamd` — pre-existing Cloudflare representation drift,
untouched here.
- Removal of Mailgun references in history/decision sections of the docs,
or the Mailgun-backed E2E roundtrip probe — probe still uses Mailgun API
on purpose for inbound delivery testing (code-569 scope).
- Mailgun DKIM record `s1._domainkey` — left in place; still consumed by
the roundtrip probe.
- Other pending items from the 2026-04-18 mail audit plan.
## Test Plan
### Automated
Targeted plan showed exactly one change, no other drift sneaking in:
    module.cloudflared.cloudflare_record.mail_spf will be updated in-place
      ~ content = "\"v=spf1 include:mailgun.org -all\""
               -> "\"v=spf1 include:spf.brevo.com ~all\""
    Plan: 0 to add, 1 to change, 0 to destroy.
Apply result:
    Apply complete! Resources: 0 added, 1 changed, 0 destroyed.
DNS propagation verified on three independent resolvers immediately after
apply:
    $ dig TXT viktorbarzin.me @1.1.1.1 +short | grep spf
    "v=spf1 include:spf.brevo.com ~all"
    $ dig TXT viktorbarzin.me @8.8.8.8 +short | grep spf
    "v=spf1 include:spf.brevo.com ~all"
    $ dig TXT viktorbarzin.me @10.0.20.201 +short | grep spf   # Technitium primary
    "v=spf1 include:spf.brevo.com ~all"
### Manual Verification
Setup: nothing extra — change is already live (TF applied before commit
per home-lab convention; `[ci skip]` in title).
1. Confirm SPF is the Brevo-only record from an external resolver:
dig TXT viktorbarzin.me @1.1.1.1 +short
Expected: `"v=spf1 include:spf.brevo.com ~all"` — no Mailgun reference.
2. Send a test email via the mailserver (through Brevo relay) to a Gmail
account and view the original headers:
Authentication-Results: ... spf=pass smtp.mailfrom=viktorbarzin.me
...
Received-SPF: Pass (google.com: domain of ... designates ... as
permitted sender)
Expected: `spf=pass` (it was `spf=fail` or `spf=softfail` before this
change because the connecting relay IP was a Brevo IP not covered by
`include:mailgun.org`).
3. Confirm no live Mailgun references in the mailserver doc:
grep -n mailgun.org infra/docs/architecture/mailserver.md
Expected: only annotated-history mentions — SPF "was ... until
2026-04-18" and DMARC "current live record still points at
e21c0ff8@dmarc.mailgun.org pending cutover". No claims of active
Mailgun relay.
## Reproduce locally
    cd infra
    git pull
    dig TXT viktorbarzin.me @1.1.1.1 +short | grep spf
    # expected: "v=spf1 include:spf.brevo.com ~all"
    # inspect the TF change:
    git show HEAD -- stacks/cloudflared/modules/cloudflared/cloudflare.tf
    # inspect the doc change:
    git show HEAD -- docs/architecture/mailserver.md
Closes: code-q8p
Closes: code-9pe
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
## Context
Until now, handing work to the in-cluster `beads-task-runner` agent required
opening BeadBoard and clicking the manual Dispatch button on each bead. We
want users to be able to describe work as a bead, set `assignee=agent`, and
have the agent pick it up within a couple of minutes — no clicks.
The existing pieces already provide everything we need:
- `claude-agent-service` exposes `/execute` with a single-slot `asyncio.Lock`
- BeadBoard's `/api/agent-dispatch` builds the prompt and forwards the bearer
- BeadBoard's `/api/agent-status` reports `busy` via a cached `/health` poll
- Dolt stores beads and is already in-cluster at `dolt.beads-server:3306`
So the only missing component is a poller that ties them together. This
commit adds that poller as two Kubernetes CronJobs — matching the existing
infra pattern (OpenClaw task-processor, certbot-renewal, backups) rather than
introducing n8n or in-service polling.
## Flow
```
user: bd assign <id> agent
        │
        ▼
Dolt @ dolt.beads-server.svc:3306 ◄──── every 2 min ────┐
        │                                               │
        ▼                                               │
CronJob: beads-dispatcher                               │
  1. GET beadboard/api/agent-status (busy? skip)        │
  2. bd query 'assignee=agent AND status=open'          │
  3. bd update -s in_progress (claim)                   │
  4. POST beadboard/api/agent-dispatch                  │
  5. bd note "dispatched: job=…"                        │
        │                                               │
        ▼                                               │
claude-agent-service /execute                           │
  beads-task-runner agent runs; notes/closes bead       │
        │                                               │
        ▼                                               │
done ──► next tick picks up the next bead ──────────────┘

CronJob: beads-reaper (every 10 min)
  for bead (assignee=agent, status=in_progress, updated_at older than 30 min):
    bd note "reaper: no progress for Nm — blocking"
    bd update -s blocked
```
## Decisions
- **Sentinel assignee `agent`** — free-form, no Beads schema change. Any bd
client can set it (`bd assign <id> agent`).
- **Sequential dispatch** — matches the service's `asyncio.Lock`. With a
2-min poll cadence and ~5-min average run, throughput tops out at ~12
beads/hour (closer to 10 once each finished run waits for the next tick).
Parallelism is a separate plan.
- **Fixed agent `beads-task-runner`** — read-only rails, matches the manual
Dispatch button. Broader-privilege agents stay manual via BeadBoard UI.
- **Image reuse** — the claude-agent-service image already ships `bd`, `jq`,
`curl`; a new CronJob-specific image would duplicate 400MB of infra tooling.
Mirror `claude_agent_service_image_tag` locally; bump on rebuild.
- **ConfigMap-mounted `metadata.json`** — declarative TF rather than reusing
the image-seeded file. The script copies it into `/tmp/.beads/` because bd
may touch the parent dir and ConfigMap mounts are read-only.
- **Kill switch (`beads_dispatcher_enabled`)** — single bool, default true.
When false, `suspend: true` on both CronJobs; manual Dispatch keeps working.
- **Reaper threshold 30 min** — `bd note` bumps `updated_at`, so a well-behaved
`beads-task-runner` never trips the reaper. Failures trip it; pod crashes
(in-memory job state lost) also trip it.
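The throughput figure for sequential dispatch can be checked with a little arithmetic. This is a simplified model under stated assumptions: every run starts on a poll tick, dispatch latency is ignored, and a run that ends exactly on a tick is followed immediately by the next bead. The function name is hypothetical.

```python
import math


def beads_per_hour(run_minutes, poll_minutes):
    """Sequential throughput when the next bead can only start on a tick.

    A finished run waits for the next poll, so the effective cycle is the
    run time rounded up to a whole number of poll intervals.
    """
    cycle = math.ceil(run_minutes / poll_minutes) * poll_minutes
    return 60 / cycle


print(beads_per_hour(run_minutes=5, poll_minutes=2))
print(beads_per_hour(run_minutes=4, poll_minutes=2))
```

Under this model the poll interval only matters through rounding: runs of 4 minutes and runs of 3.5 minutes give the same throughput with a 2-minute tick.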
## What is NOT in this change
- No Terraform apply — requires Vault OIDC + cluster access. Apply manually:
`cd infra/stacks/beads-server && scripts/tg apply`
- No change to `claude-agent-service/` (already ships bd/jq/curl)
- No change to `beadboard/` (`/api/agent-dispatch` + `/api/agent-status` reused)
- No change to the `beads-task-runner` agent definition (rails unchanged)
- Parallelism: single-slot is MVP; multi-slot dispatch is a separate plan.
## Deviations from plan
Minor, documented in code comments:
- Reaper uses `.updated_at` instead of the plan's `.notes[].created_at`. bd
serializes `notes` as a string (not an array), and every `bd note` bumps
`updated_at` — equivalent for the reaper's purpose.
- ISO-8601 parsed via `python3`, not `date -d` — Alpine's busybox lacks GNU
`-d` and the image has python3.
- `HOME=/tmp` set as a safety net — bd may try to write state/lock files.
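The reaper's staleness test on `updated_at` can be sketched in Python, which is the parser the deviation above settles on. This is an illustration under assumptions: the timestamp is taken to be ISO-8601 with either a `Z` suffix or an explicit offset, naive timestamps are treated as UTC, and the function names are not from the actual script.

```python
from datetime import datetime, timezone


def minutes_since(updated_at, now=None):
    """Minutes elapsed since an ISO-8601 timestamp such as 2026-04-19T10:00:00Z."""
    # Older Python versions do not accept a trailing "Z" in fromisoformat.
    ts = datetime.fromisoformat(updated_at.replace("Z", "+00:00"))
    if ts.tzinfo is None:
        ts = ts.replace(tzinfo=timezone.utc)  # assumption: naive means UTC
    if now is None:
        now = datetime.now(timezone.utc)
    return (now - ts).total_seconds() / 60


def is_stale(updated_at, threshold_minutes=30, now=None):
    """True when the bead has shown no progress for longer than the threshold."""
    return minutes_since(updated_at, now) > threshold_minutes
```

Passing `now` explicitly keeps the check deterministic in tests; the CronJob itself would use the current time.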
## Test plan
### Automated
```
$ cd infra/stacks/beads-server && terraform init -backend=false
Terraform has been successfully initialized!
$ terraform validate
Warning: Deprecated Resource (kubernetes_namespace → v1) # pre-existing, unrelated
Success! The configuration is valid, but there were some validation warnings as shown above.
$ terraform fmt stacks/beads-server/main.tf
# (no output — already formatted)
```
### Manual verification
1. **Apply**
```
vault login -method=oidc
cd infra/stacks/beads-server
scripts/tg apply
```
Expect: `kubernetes_config_map.beads_metadata`,
`kubernetes_cron_job_v1.beads_dispatcher`, `kubernetes_cron_job_v1.beads_reaper`
created. No changes to existing resources.
2. **CronJobs exist with right schedule**
```
kubectl -n beads-server get cronjob
```
Expect `beads-dispatcher */2 * * * *` and `beads-reaper */10 * * * *`,
both with `SUSPEND=False`.
3. **End-to-end smoke**
```
bd create "auto-dispatch smoke test" \
-d "Read /etc/hostname inside the agent sandbox and close." \
--acceptance "bd note includes 'hostname=' line and bead is closed."
bd assign <new-id> agent
# within 2 min:
bd show <new-id> --json | jq '{status, notes}'
```
Expect notes to contain `auto-dispatcher claimed at …` and
`dispatched: job=<uuid>`, status `in_progress`.
4. **Reaper smoke**
Assign + dispatch a long bead, then
`kubectl -n claude-agent delete pod -l app=claude-agent-service`. Within
30 min + one reaper tick, `bd show <id>` shows `blocked` with a
`reaper: no progress for Nm — blocking` note.
5. **Kill switch**
```
cd infra/stacks/beads-server
scripts/tg apply -var=beads_dispatcher_enabled=false
kubectl -n beads-server get cronjob
```
Expect `SUSPEND=True` on both CronJobs. Assign a bead to `agent`; verify
nothing happens within 5 min. Re-apply with `=true` to re-enable.
The runbook at `infra/docs/runbooks/beads-auto-dispatch.md` covers all of
the above, plus reaper semantics and design choices.
Closes: code-8sm
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Initial 2Gi sizeLimit didn't take effect because Kyverno's tier-defaults
LimitRange in authentik ns applies a default container memory limit of
256Mi to pods with resources: {}. Writes to a memory-backed emptyDir count
against the container's cgroup memory, so the container was OOM-killed
(exit 137) at ~256 MiB even though the tmpfs sizeLimit said 2Gi. Confirmed
with `dd if=/dev/zero of=/dev/shm/test bs=1M count=500`.
Fix: also set `containers[0].resources.limits.memory: 2560Mi` via the same
kubernetes_json_patches. Verified end-to-end — 1.5 GB file write succeeds,
df -h /dev/shm reports 2.0G.
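
The sizing rule behind this fix can be expressed as a small check: because tmpfs writes are charged to the container's memory cgroup, the memory limit has to cover the emptyDir `sizeLimit` plus whatever the process itself needs. The sketch below is illustrative only. The helper names are hypothetical, only the `Ki`/`Mi`/`Gi` suffixes are handled, and reading the 512Mi gap between 2Gi and 2560Mi as process headroom is an assumption, not something the commit states.

```python
# Binary suffixes only; real Kubernetes quantities support more forms.
_UNITS = {"Ki": 1024, "Mi": 1024 ** 2, "Gi": 1024 ** 3}


def parse_quantity(quantity: str) -> int:
    """Convert a quantity such as '2560Mi' to bytes."""
    for suffix, factor in _UNITS.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    return int(quantity)  # plain byte count


def tmpfs_fits(size_limit: str, memory_limit: str, headroom: str = "0") -> bool:
    """True when the container memory limit covers a full tmpfs plus headroom."""
    needed = parse_quantity(size_limit) + parse_quantity(headroom)
    return parse_quantity(memory_limit) >= needed


# The LimitRange default could never hold a full 2Gi tmpfs.
print(tmpfs_fits("2Gi", "256Mi"))             # False
# The patched limit covers 2Gi of tmpfs plus 512Mi of assumed headroom.
print(tmpfs_fits("2Gi", "2560Mi", "512Mi"))   # True
```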
Updates the post-mortem P1 row to capture this for future readers.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Generated by `terragrunt render-json` for debugging. Not meant to be
tracked — a stale one was sitting untracked in stacks/dbaas/.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
## Context
On 2026-04-18 all Authentik-protected *.viktorbarzin.me sites returned HTTP
400 for all users. Reported first as a per-user issue affecting Emil since
2026-04-16 ~17:00 UTC, escalated to cluster-wide when Viktor's cached
session stopped being enough. Duration: ~44h for the first-affected user,
~30 min from cluster-wide report to unblocked.
## Root cause
The `ak-outpost-authentik-embedded-outpost` pod's /dev/shm (default 64 MB
tmpfs) filled to 100% with ~44k `session_*` files from gorilla/sessions
FileStore. Every forward-auth request with no valid cookie creates one
session-state file; with `access_token_validity=7d` and measured ~18
files/min, steady-state accumulation (~180k files) vastly exceeds the
default tmpfs. Once full, every new `store.Save()` returned ENOSPC and
the outpost replied HTTP 400 instead of the usual 302 to login.
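
The steady-state figure follows directly from the measured rate and the retention window. The sketch below reproduces that arithmetic and adds a rough size estimate. The size estimate rests on two assumptions that the source does not state: that the 64 MB tmpfs is 64 MiB, and that session files are roughly uniform in size, so the average can be taken from the ~44k files that filled it.

```python
FILES_PER_MINUTE = 18               # measured creation rate
RETENTION_DAYS = 7                  # access_token_validity=7d
FILES_AT_FULL = 44_000              # approximate file count when /dev/shm hit 100%
TMPFS_BYTES = 64 * 1024 * 1024      # default /dev/shm size, assumed to be 64 MiB


def steady_state_files(files_per_minute: float, retention_days: float) -> int:
    """Files alive at once when creation is constant and each lives retention_days."""
    return round(files_per_minute * 60 * 24 * retention_days)


def estimated_bytes(files: int, tmpfs_bytes: int, files_at_full: int) -> float:
    """Scale the observed bytes-per-file average up to the given file count."""
    return files * tmpfs_bytes / files_at_full


files = steady_state_files(FILES_PER_MINUTE, RETENTION_DAYS)
needed_mib = estimated_bytes(files, TMPFS_BYTES, FILES_AT_FULL) / 1024 ** 2

print(files)                 # 181440
print(round(needed_mib))     # 264
```

Under those assumptions the steady state needs roughly four times the default tmpfs, which is consistent with the outpost filling up well before sessions started expiring.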
## What's captured
- Full timeline, impact, affected services
- Root-cause chain diagram (request rate → retention → ENOSPC → 400)
- Why diagnosis took 2 days (misattribution of a Viktor event to Emil,
red-herring suspicion of the new Rybbit Worker, cached sessions masking
the outage)
- Contributing factors + detection gaps
- Prevention plan with P0 (done — 512Mi emptyDir via kubernetes_json_patches
on the outpost config), P1 alerts, P2 Terraform codification, P3 upstream
- Lessons learned (check outpost logs first; cookie-less `curl` disproves
per-user symptoms fast; UI-managed Authentik config is invisible to git)
## Follow-ups not in this commit
- Prometheus alert for outpost /dev/shm usage > 80%
- Meta-alert for correlated Uptime Kuma external-monitor failures
- Decision on tmpfs sizing vs restart cadence vs probe-frequency reduction
(see discussion in beads code-zru)
Closes: code-zru
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds the `claude_oauth_token` Vault entries to the secrets table, a
new "OAuth token lifecycle" section explaining the two CLI auth modes
(`claude login` vs `claude setup-token`) and why we picked the latter
for headless use, the Ink 300-col PTY gotcha from today's harvest,
and the monitoring/rotation playbook for the new expiry alerts.
Follow-up to 8a054752 and 50dea8f0.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replaces the stale "Dev VM SSH key" secret entry with the current
`claude-agent-service` bearer token path (synced to both consumer +
caller namespaces). Adds an "n8n workflow gotchas" section documenting:
1. The workflow is DB-state, not Terraform-managed — the JSON in the
repo is a backup, not authoritative.
2. Header-expression syntax: `=Bearer {{ $env.X }}` works, JS concat
`='Bearer ' + $env.X` does NOT, and the failure shows up only as silent 401s.
3. `N8N_BLOCK_ENV_ACCESS_IN_NODE=false` requirement.
4. 401-troubleshooting steps and the UPDATE pattern for in-place
workflow patches.
Follow-up to 99180bec which fixed the actual pipeline break.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
## Context
The claude-agent-service K8s pod (deployed 2026-04-15) provides an HTTP API
for running Claude headless agents. Three workflows still SSH'd to the DevVM
(10.0.10.10) to invoke `claude -p`. This eliminates that dependency.
## This change
Pipeline migrations (SSH → HTTP POST to claude-agent-service):
- `.woodpecker/issue-automation.yml` — Vault auth fetches API token instead
of SSH key; curl POST /execute + poll /jobs/{id} replaces SSH invocation
- `scripts/postmortem-pipeline.sh` — same pattern; uses jq for safe JSON
construction of TODO payloads
- `.woodpecker/postmortem-todos.yml` — drop openssh-client from apk install
- `stacks/n8n/workflows/diun-upgrade.json` — SSH node replaced with HTTP
Request node; API token via $env.CLAUDE_AGENT_API_TOKEN (added to Vault
secret/n8n)
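
The POST-then-poll pattern used by these pipelines looks roughly like the sketch below. Only the `/execute` and `/jobs/{id}` paths and the bearer token come from this change. The `job_id` and `status` field names, the terminal status values, and the function names are hypothetical placeholders; the real service may use different ones.

```python
import json
import time
import urllib.request
from typing import Callable, Optional

# (method, url, token, body) -> decoded JSON response
HttpFn = Callable[[str, str, str, Optional[dict]], dict]


def http_json(method: str, url: str, token: str, body: Optional[dict] = None) -> dict:
    """Send a JSON request with a bearer token and decode the JSON reply."""
    data = json.dumps(body).encode() if body is not None else None
    req = urllib.request.Request(url, data=data, method=method, headers={
        "Authorization": f"Bearer {token}",
        "Content-Type": "application/json",
    })
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.load(resp)


def run_agent_job(base_url: str, token: str, payload: dict,
                  http: HttpFn = http_json, interval_s: float = 5.0,
                  max_polls: int = 120,
                  sleep: Callable[[float], None] = time.sleep) -> dict:
    """Submit a job, then poll until it reaches a terminal state."""
    job = http("POST", f"{base_url}/execute", token, payload)
    job_id = job["job_id"]  # hypothetical field name
    for _ in range(max_polls):
        state = http("GET", f"{base_url}/jobs/{job_id}", token, None)
        if state.get("status") in ("completed", "failed"):  # hypothetical values
            return state
        sleep(interval_s)
    raise TimeoutError(f"job {job_id} did not finish after {max_polls} polls")
```

Bounding the loop with `max_polls` matters in CI: a pipeline step that polls forever hides a stuck agent, whereas a timeout surfaces it as a failed step.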
Documentation updates:
- `docs/architecture/incident-response.md` — Mermaid diagram: DevVM → K8s
- `docs/architecture/automated-upgrades.md` — pipeline diagram + n8n action
- `AGENTS.md` — pipeline description updated
## What is NOT in this change
- DevVM decommissioning (still hosts terminal/foolery services)
- Removal of SSH key secrets from Vault (kept for rollback)
- n8n workflow import (must be done manually in n8n UI)
[ci skip]
Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>
## Context
Deploying new services required manually adding hostnames to
cloudflare_proxied_names/cloudflare_non_proxied_names in config.tfvars —
a separate file from the service stack. This was frequently forgotten,
leaving services unreachable externally.
## This change
- Add `dns_type` parameter to `ingress_factory` and `reverse_proxy/factory`
modules. Setting `dns_type = "proxied"` or `"non-proxied"` auto-creates
the Cloudflare DNS record (CNAME to tunnel or A/AAAA to public IP).
- Simplify cloudflared tunnel from 100 per-hostname rules to wildcard
`*.viktorbarzin.me → Traefik`. Traefik still handles host-based routing.
- Add global Cloudflare provider via terragrunt.hcl (separate
cloudflare_provider.tf with Vault-sourced API key).
- Migrate 118 hostnames from centralized config.tfvars to per-service
dns_type. 17 hostnames remain centrally managed (Helm ingresses,
special cases).
- Update docs, AGENTS.md, CLAUDE.md, dns.md runbook.
```
BEFORE AFTER
config.tfvars (manual list) stacks/<svc>/main.tf
| module "ingress" {
v dns_type = "proxied"
stacks/cloudflared/ }
for_each = list |
cloudflare_record auto-creates
tunnel per-hostname cloudflare_record + annotation
```
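
The `dns_type` decision can be modelled as a small mapping from the parameter to the records that get created. The real logic lives in the Terraform modules; this Python sketch only mirrors the behaviour described above. `DnsRecord`, `records_for`, the tunnel target, and the documentation-range IP addresses are all illustrative placeholders.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class DnsRecord:
    name: str
    type: str
    content: str
    proxied: bool


def records_for(hostname: str, dns_type: Optional[str], tunnel_target: str,
                public_ipv4: str, public_ipv6: Optional[str] = None) -> List[DnsRecord]:
    """Return the Cloudflare records implied by a module's dns_type setting."""
    if dns_type is None:
        return []  # hostname stays centrally managed or has no public record
    if dns_type == "proxied":
        # Proxied hostnames point at the tunnel via CNAME.
        return [DnsRecord(hostname, "CNAME", tunnel_target, True)]
    if dns_type == "non-proxied":
        # Non-proxied hostnames resolve straight to the public IP(s).
        records = [DnsRecord(hostname, "A", public_ipv4, False)]
        if public_ipv6:
            records.append(DnsRecord(hostname, "AAAA", public_ipv6, False))
        return records
    raise ValueError(f"unsupported dns_type: {dns_type!r}")


# Placeholder values for illustration only.
for record in records_for("app.viktorbarzin.me", "non-proxied",
                          "example-tunnel.cfargotunnel.com",
                          "203.0.113.10", "2001:db8::1"):
    print(record)
```

Rejecting unknown values instead of silently creating nothing keeps the "forgot to add DNS" failure mode from returning as a typo in `dns_type`.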
## What is NOT in this change
- Uptime Kuma monitor migration (still reads from config.tfvars)
- 17 remaining centrally-managed hostnames (Helm, special cases)
- Removal of allow_overwrite (keep until migration confirmed stable)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Complete rewrite of the user-facing documentation:
- How to report outages and request features
- Mermaid flow diagrams for both incident and feature request paths
- SLA expectations (automated vs human response times)
- Self-service checks before reporting
- Severity level definitions
- Status page explanation
- Full technical architecture section with component inventory
- Safety guardrails, labels, and commit conventions
[ci skip]
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Documents the centralized Beads/Dolt task tracking system used by all
Claude Code sessions. Covers architecture, session lifecycle, settings
hierarchy, known issues, and E2E test verification.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add separate CronJobs that dump each database individually:
- postgresql-backup-per-db: pg_dump -Fc per DB (daily 00:15)
- mysql-backup-per-db: mysqldump per DB (daily 00:45)
Dumps go to /backup/per-db/<dbname>/ on the same NFS PVC.
Enables single-database restore without affecting other databases.
Also fixed CNPG superuser password sync and added --single-transaction
--set-gtid-purged=OFF to MySQL per-db dumps.
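
A sketch of how the per-database dump commands could be assembled is shown below. The flags (`pg_dump -Fc`, `mysqldump --single-transaction --set-gtid-purged=OFF`) and the `/backup/per-db/<dbname>/` layout come from this commit. The output file names and the helper functions are hypothetical, and connection options (host, user, credentials) are omitted.

```python
from pathlib import PurePosixPath
from typing import List

BACKUP_ROOT = PurePosixPath("/backup/per-db")


def pg_dump_cmd(db: str) -> List[str]:
    """Custom-format dump of one PostgreSQL database (restorable with pg_restore)."""
    out = BACKUP_ROOT / db / f"{db}.dump"   # hypothetical file name
    return ["pg_dump", "-Fc", "-d", db, "-f", str(out)]


def mysqldump_cmd(db: str) -> List[str]:
    """Dump of one MySQL database taken inside a single transaction."""
    out = BACKUP_ROOT / db / f"{db}.sql"    # hypothetical file name
    return [
        "mysqldump",
        "--single-transaction",    # consistent snapshot for transactional tables
        "--set-gtid-purged=OFF",   # omit GTID statements from the dump
        f"--result-file={out}",
        db,
    ]


print(" ".join(pg_dump_cmd("authentik")))
print(" ".join(mysqldump_cmd("beads")))
```

Building the commands as argument lists rather than shell strings avoids quoting problems with unusual database names.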
Updated restore runbooks with per-database restore procedures.
[ci skip]
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Adds "Reporting an Issue" section with:
- Where to report (Slack, GitHub, DM)
- What to include (examples of good vs bad reports)
- What happens after reporting (flow diagram)
- Self-service status checks (Uptime Kuma, Grafana, K8s Dashboard)
[ci skip]
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Mark all 8 safe TODOs as Done. Add Follow-up Implementation table with commit
SHAs. Flag 3 Migration TODOs as needing human review.
Co-Authored-By: postmortem-todo-resolver <noreply@anthropic.com>
Added Uptime Kuma TCP monitor for PVE NFS (192.168.1.127:2049), ID 328,
Tier 1 (30s/3 retries). Investigation TODO flagged for human review.
Co-Authored-By: postmortem-todo-resolver <noreply@anthropic.com>
Key additions:
- NFSv3 broke after NFS restart (kernel lockd bug on PVE 6.14)
- All 52 PVs migrated to NFSv4, NFSv3 disabled on PVE
- DNS zone sync gap: secondary/tertiary had no custom zones
- Converted one-time setup Job to recurring zone-sync CronJob
- MySQL, Redis, Vault collateral damage and fixes
- 3 new lessons learned (zone replication, NFS client state, operator rollout)
[ci skip]
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Move HTML post-mortems from repo root post-mortems/ to docs/post-mortems/.
Update index.html with all 3 incidents (newest first).
[ci skip]
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Consolidate all outage reports under docs/ for better discoverability.
Moved from .claude/post-mortems/ (agent-internal) to docs/post-mortems/
(repo documentation).
[ci skip]
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>