Merge origin/master (pfsense SNI-routed internal 443) into forgejo/master

Reconciles the two live infra remotes after the pve-host logging change landed on forgejo (which was a commit behind origin). Non-destructive merge — keeps both eae35c51 (pfsense webmail SNI routing) and aac807fb (pve-host Loki shipping).
2026-06-10 19:35:55 +00:00 · 8304ef0f70
commit 8304ef0f70
parent aac807fb3a eae35c511a
4 changed files with 113 additions and 4 deletions
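
As a reading aid for the runbook hunk in this commit, the routing table documented for `internal_https_443` can be modeled as a small decision function. This is a minimal Python sketch of the decision only, not the HAProxy configuration itself (the runbook says that is declared in `pfsense-haproxy-bootstrap.php`). The function name `route_by_sni` and the constant names are hypothetical; the hostnames and backend addresses are copied from the table, and the case-insensitive comparison is an assumption.

```python
from typing import Optional

# Backend addresses as listed in the runbook's routing table.
WEBGUI_BACKEND = "127.0.0.1:8443"    # pfSense webGUI, moved off :443
TRAEFIK_BACKEND = "10.0.20.203:443"  # Traefik, reached with send-proxy-v2

# SNI values that the table sends to the webGUI.
WEBGUI_SNI = {"pfsense.viktorbarzin.lan", "pfsense.viktorbarzin.me"}


def route_by_sni(sni: Optional[str]) -> str:
    """Return the backend the table assigns to a TLS ClientHello.

    `sni` is the server name the client sent, or None/empty when the
    client connected by bare IP and sent no SNI.
    """
    if not sni:
        # No SNI (e.g. https://10.0.20.1): admin access by IP.
        return WEBGUI_BACKEND
    # Assumption: hostnames are compared case-insensitively.
    if sni.lower() in WEBGUI_SNI:
        return WEBGUI_BACKEND
    # Any other hostname, e.g. mail.viktorbarzin.me, goes to Traefik.
    return TRAEFIK_BACKEND
```

Calling `route_by_sni("mail.viktorbarzin.me")` yields the Traefik backend, while `route_by_sni(None)` and `route_by_sni("pfsense.viktorbarzin.me")` both yield the webGUI backend. Ordering matters in the model as in the table: the no-SNI case and the two pfSense names are checked first, and Traefik is the default for every other hostname.
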
--- a/docs/architecture/dns.md
+++ b/docs/architecture/dns.md
@@ -269,7 +269,7 @@ Technitium's **Split Horizon AddressTranslation** app post-processes DNS respons

 - **Affected**: Non-proxied domains (ha-sofia, immich, headscale, calibre, vaultwarden, etc.) for 192.168.1.x clients
 - **Not affected**: Cloudflare-proxied domains (resolve to Cloudflare edge IPs, no translation needed)
- **10.0.x.x clients (k8s nodes, devvm, other VMs)** — handled at the resolver since 2026-06-10: **pfSense Unbound carries a domain override forwarding the whole `viktorbarzin.me` zone to Technitium** (`10.0.20.201`). Technitium's split-horizon zone answers with the zone apex A record, which auto-tracks the live Traefik LB IP (`technitium-ingress-dns-sync` CNAMEs every ingress host hourly; `viktorbarzin-apex-probe` is the drift canary). Every client of pfSense Unbound — all VLANs, k8s nodes included — therefore gets internal answers with **zero per-host configuration** (no `/etc/hosts` pins, no resolved drop-ins; both earlier same-day approaches were removed, nodes are stock). Names not behind Traefik keep distinct records in the zone (e.g. `mail.viktorbarzin.me → 10.0.20.1`, verified working on :993/:25). See `docs/runbooks/pfsense-unbound.md` for the override config + rollback, and `docs/post-mortems/2026-06-10-tuya-bridge-forgejo-pull-hairpin.md` for the incident that motivated this (kubelet forgejo pulls riding the broken hairpin; the containerd hosts.toml mirror cannot fix it — Traefik 404s bare-IP requests and the registry auth realm is an absolute public URL).
+- **10.0.x.x clients (k8s nodes, devvm, other VMs)** — handled at the resolver since 2026-06-10: **pfSense Unbound carries a domain override forwarding the whole `viktorbarzin.me` zone to Technitium** (`10.0.20.201`). Technitium's split-horizon zone answers with the zone apex A record, which auto-tracks the live Traefik LB IP (`technitium-ingress-dns-sync` CNAMEs every ingress host hourly; `viktorbarzin-apex-probe` is the drift canary). Every client of pfSense Unbound — all VLANs, k8s nodes included — therefore gets internal answers with **zero per-host configuration** (no `/etc/hosts` pins, no resolved drop-ins; both earlier same-day approaches were removed, nodes are stock). Names not behind Traefik keep distinct records in the zone (e.g. `mail.viktorbarzin.me → 10.0.20.1`, verified working on :993/:25; since 2026-06-10 its :443 also works internally — pfSense carries an SNI-routed HAProxy frontend on 443 that sends hostname traffic to Traefik and bare-IP/no-SNI traffic to the webGUI, which moved to :8443; see `docs/runbooks/mailserver-pfsense-haproxy.md`). See `docs/runbooks/pfsense-unbound.md` for the override config + rollback, and `docs/post-mortems/2026-06-10-tuya-bridge-forgejo-pull-hairpin.md` for the incident that motivated this (kubelet forgejo pulls riding the broken hairpin; the containerd hosts.toml mirror cannot fix it — Traefik 404s bare-IP requests and the registry auth realm is an absolute public URL).
  - **devvm**: also covered by a `~viktorbarzin.me → 10.0.20.201` resolved routing domain (predates the pfSense override, provisioned by `setup-devvm.sh`) — redundant-but-harmless belt-and-suspenders.
  - **in-cluster PODS are ordinary internal clients too** (since 2026-06-10 evening): CoreDNS's dedicated `viktorbarzin.me:53` block (in `stacks/technitium`, TF-managed) forwards to the Technitium ClusterIP (`10.96.0.53`, same as the `.lan` block), so pods get the same split-horizon answers as everyone else. This works because on k8s 1.34 **pods CAN reach the ETP=Local Traefik LB IP** — kube-proxy short-circuits in-cluster traffic to LB IPs via the cluster path (verified from pods on three non-Traefik nodes; re-verify after major k8s upgrades — the canary is the uptime-kuma `[External]` fleet going red). forgejo stays pinned to Traefik's **ClusterIP** in the same block so CI pushes survive a Technitium outage. History: the block briefly forwarded to `8.8.8.8/1.1.1.1` (morning of 2026-06-10), which kept pods on public IPs and the broken TP-Link NAT loopback — 27 non-proxied `[External]` uptime-kuma monitors dark (beads code-yh33). Note: in-cluster `[External]` monitors now test DNS+Traefik+service via the internal path for ALL names, including Cloudflare-proxied ones — genuine edge-path fidelity is the job of a true external vantage (ha-london), not in-cluster probes.
  - **Trade-off**: `viktorbarzin.me` resolution via pfSense now depends on in-cluster Technitium (3 replicas). During a full cluster outage the zone SERVFAILs LAN-wide — acceptable, the services behind it are down anyway; node bootstrap images pull via the IP-addressed `10.0.20.10` mirrors, so cold-start self-unwinds.
--- a/docs/runbooks/mailserver-pfsense-haproxy.md
+++ b/docs/runbooks/mailserver-pfsense-haproxy.md
@@ -55,7 +55,7 @@ External mail (WAN) path — PROXY v2
 │  pfSense WAN:{25,465,587,993}                                       │
 │      │  NAT rdr → 10.0.20.1:{same}                                  │
 │      ▼                                                              │
-│  pfSense HAProxy  (mode tcp, 4 frontends, 4 backend pools)          │
+│  pfSense HAProxy  (mode tcp, 5 frontends, 6 backend pools)          │
 │      │  data: send-proxy-v2 → :{30125..30128}  (PROXY-aware pod)   │
 │      │  health: TCP-check    → :{30145..30147}  (no-PROXY pod)     │
 │      │  inter 5000                                                  │
@@ -113,6 +113,28 @@ kubectl logs -c docker-mailserver deployment/mailserver -n mailserver \
 # Expect external source IPs (e.g., Brevo 77.32.148.x), NOT 10.0.20.x
 ```

+## SNI-routed internal :443 frontend (2026-06-10)
+
+`internal_https_443` binds `10.0.20.1:443` + `10.0.10.1:443` and completes
+the internal port table of the mail front door so `mail.viktorbarzin.me`
+(internal A record → 10.0.20.1) serves webmail too. Routing (Viktor's
+design — route by what the client asked for):
+
+| Client connects with | Routed to |
+|---|---|
+| SNI = `pfsense.viktorbarzin.{lan,me}` | webgui backend `127.0.0.1:8443` |
+| any other SNI (hostnames, e.g. `mail.…`) | Traefik `10.0.20.203:443`, send-proxy-v2 |
+| no SNI (bare IP — `https://10.0.20.1`) | webgui backend `127.0.0.1:8443` |
+
+The **pfSense webGUI was moved to `:8443`** (config.xml
+`system.webgui.port`, 2026-06-10) to free the 443 socket; admin access by
+IP keeps working through the no-SNI route, and `:8443` remains a direct
+fallback if HAProxy is down. The `pfsense.viktorbarzin.me` Traefik ingress
+(stacks/reverse-proxy) targets `:8443` directly. Traefik leg mirrors the
+IPv6 bridge: send-proxy-v2 (Traefik trusts 10.0.20.1), **no health check**
+(PROXY-expecting receivers reject bare probes — gotcha above). All of this
+is declared in `pfsense-haproxy-bootstrap.php` — re-run to reset.
+
 ## Bootstrap / restore from scratch

 pfSense HAProxy config lives in `/cf/conf/config.xml` under