From 7d7a0ad4741e5659fb6a1c8f8b5b3a415419e9c8 Mon Sep 17 00:00:00 2001
From: Viktor Barzin <vbarzin@gmail.com>
Date: Wed, 3 Jun 2026 10:23:13 +0000
Subject: [PATCH] infra: fix stale Traefik LB-IP refs + accurate LB-IP registry

Part of the L4 LB-IP review (docs/plans/2026-06-03-lb-ip-hygiene-design.md).
The 2026-05-30 Traefik .200->.203 move left consumers pointing at .200, where
Traefik no longer listens; this fixes the two in-Terraform ones and replaces
the stale networking doc with an accurate registry + a renumber checklist.

- woodpecker: forgejo.viktorbarzin.me hostAlias hardcoded 10.0.20.200
  (.200:443 refuses TLS now; the next woodpecker apply would re-pin it and
  break pipeline creation). Now reads the Traefik ClusterIP dynamically via a
  kubernetes_service data source -- cannot rot on a future renumber and avoids
  the ETP=Local hairpin trap.
- monitoring: ViktorBarzinApexDrift alert summary said "expected 10.0.20.200"
  -> 10.0.20.203 (cosmetic; alert logic already correct).
- docs/architecture/networking.md: rewrote the MetalLB section (it wrongly had
  KMS on .200, mailserver on a LB IP, "two dedicated") into an accurate 4-IP
  registry + LB-IP renumber checklist (in-band + out-of-band consumers).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
---
 docs/architecture/networking.md               |  38 ++++---
 docs/plans/2026-06-03-lb-ip-hygiene-design.md | 103 ++++++++++++++++++
 .../monitoring/prometheus_chart_values.tpl    |   2 +-
 stacks/woodpecker/main.tf                     |  28 +++--
 4 files changed, 147 insertions(+), 24 deletions(-)
 create mode 100644 docs/plans/2026-06-03-lb-ip-hygiene-design.md
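
Note on the pending DHCP shrink (item 5 of the design doc): the overlap between the MetalLB pool and the pfSense DHCP range can be sanity-checked offline. The sketch below is illustrative only. `ip_to_int`, `ranges_overlap` and `check` are hypothetical helper names, and the planned range end of `.254` is an assumption (the doc only states the new start, `.221`).

```shell
#!/usr/bin/env bash
set -euo pipefail

# Convert a dotted-quad IPv4 address to an integer so ranges can be compared.
ip_to_int() {
  local a b c d
  IFS=. read -r a b c d <<<"$1"
  echo $(( (a << 24) | (b << 16) | (c << 8) | d ))
}

# Succeed (exit 0) when the inclusive ranges [$1,$2] and [$3,$4] intersect.
ranges_overlap() {
  local s1 e1 s2 e2
  s1=$(ip_to_int "$1"); e1=$(ip_to_int "$2")
  s2=$(ip_to_int "$3"); e2=$(ip_to_int "$4")
  (( s1 <= e2 && s2 <= e1 ))
}

# MetalLB pool as stated in the design doc.
POOL_START=10.0.20.200
POOL_END=10.0.20.220

# Print whether a candidate DHCP range collides with the MetalLB pool.
check() {
  if ranges_overlap "$POOL_START" "$POOL_END" "$1" "$2"; then
    echo "OVERLAP: DHCP $1-$2 intersects MetalLB $POOL_START-$POOL_END"
  else
    echo "ok: DHCP $1-$2 is clear of MetalLB $POOL_START-$POOL_END"
  fi
}

check 10.0.20.200 10.0.20.254   # current DHCP range per the design doc
check 10.0.20.221 10.0.20.254   # planned range (end address assumed unchanged)
```

Run with bash: the first call should report an overlap and the second should not. The script does not query pfSense or MetalLB, so the literals have to be kept in step with the live config by hand.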

diff --git a/docs/architecture/networking.md b/docs/architecture/networking.md
index daf3b18c..434a6f79 100644
--- a/docs/architecture/networking.md
+++ b/docs/architecture/networking.md
@@ -267,25 +267,31 @@ The `websecure` entrypoint sets `respondingTimeouts` in `stacks/traefik/modules/
 
 ### MetalLB & Load Balancing
 
-MetalLB v0.15.3 allocates IPs from the range 10.0.20.200-10.0.20.220 in **Layer 2 mode**. Most LoadBalancer services share **10.0.20.200** using the `metallb.io/allow-shared-ip: shared` annotation. Two services have **dedicated IPs** with `externalTrafficPolicy: Local` to preserve real client source IPs: **Traefik (10.0.20.203)** — so CrowdSec sees real public IPs on the direct-ingress path and QUIC/HTTP3 works (a shared IP forbids the mixed ETP that QUIC's UDP listener needs) — and **Technitium DNS (10.0.20.201)** for query logging.
+MetalLB v0.15.3 allocates IPs from `10.0.20.200-10.0.20.220` (21 IPs) in **Layer 2 mode**; **four are in use**. Most LoadBalancer services share **10.0.20.200** (`metallb.io/allow-shared-ip: shared`, `externalTrafficPolicy: Cluster`). **Three services hold dedicated IPs with `externalTrafficPolicy: Local`** to preserve the real client source IP (and, for Traefik, to make QUIC/HTTP3 work — a shared IP forbids the mixed ETP the UDP listener needs).
 
-| Service | Namespace | IP | Ports |
-|---------|-----------|-----|-------|
-| traefik | traefik | **10.0.20.203 (dedicated, ETP=Local)** | 80, 443, 443/UDP (HTTP/3), 10200, 10300, 11434/TCP |
-| coturn | coturn | 10.0.20.200 (shared) | 3478/UDP (STUN/TURN), 49152-49252/UDP (relay) |
-| headscale | headscale | 10.0.20.200 (shared) | 41641/UDP, 3479/UDP |
-| windows-kms¹ | kms | 10.0.20.200 (shared) | 1688/TCP |
-| qbittorrent | servarr | 10.0.20.200 (shared) | 50000/TCP+UDP |
-| shadowsocks | shadowsocks | 10.0.20.200 (shared) | 8388/TCP+UDP |
-| torrserver-bt | tor-proxy | 10.0.20.200 (shared) | 5665/TCP |
-| wireguard | wireguard | 10.0.20.200 (shared) | 51820/UDP |
-| mailserver | mailserver | 10.0.20.200 (shared) | 25, 465, 587, 993/TCP |
-| xray-reality | xray | 10.0.20.200 (shared) | 7443/TCP |
-| **technitium-dns** | **technitium** | **10.0.20.201 (dedicated)** | **53/UDP+TCP** |
+> **Why not consolidate to fewer IPs?** The three dedicated IPs can't be merged. MetalLB L2 only lets `ETP=Local` services share an IP if they have *identical pod selectors* (Traefik/KMS/Technitium don't), and a shared `ETP=Local` IP announces from a single node — blackholing any service whose pods aren't on it. Traefik additionally can never leave a dedicated IP (QUIC needs the UDP listener on its own ETP=Local IP). Merging would cost client-IP preservation or HA, so the 4-IP layout is deliberate — not sprawl. Full analysis: `docs/plans/2026-06-03-lb-ip-hygiene-design.md`.
 
-pfSense aliases reference these IPs: `k8s_shared_lb` (10.0.20.200), `traefik_lb` (10.0.20.203), `technitium_dns` (10.0.20.201). NAT rules use aliases for maintainability — the WAN 443 (TCP+UDP) forward targets `traefik_lb`.
+| IP | ETP | Services (ns/name → ports) |
+|----|-----|----------------------------|
+| **10.0.20.200** (shared) | Cluster | dbaas/postgresql-lb→5432 · beads-server/dolt→3306 · coturn/coturn→3478 TCP+UDP, 49152-49252/UDP · headscale/headscale-server→41641/UDP, 3479/UDP · wireguard/wireguard→51820/UDP · servarr/qbittorrent-torrenting→50000 TCP+UDP · shadowsocks/shadowsocks→8388 TCP+UDP · tor-proxy/torrserver-bt→5665 TCP+UDP · xray/xray-reality→7443 |
+| **10.0.20.201** (dedicated) | Local | technitium/technitium-dns→53 UDP+TCP |
+| **10.0.20.202** (dedicated)¹ | Local | kms/windows-kms→1688 |
+| **10.0.20.203** (dedicated) | Local | traefik/traefik→80, 443, 443/UDP (HTTP/3), 10200 (piper), 10300 (whisper) |
 
-¹ **windows-kms is publicly WAN-exposed.** pfSense forwards WAN TCP/1688 → `k8s_shared_lb:1688` so any internet host can activate. The matching filter rule applies a per-source rate limit (`max-src-conn 50`, `max-src-conn-rate 10/60`) with `overload <virusprot>` flush — offenders are auto-added to pfSense's stock `virusprot` pf table for follow-on blocks. Operations (rate-limit tuning, log locations, revocation) are documented in `docs/runbooks/kms-public-exposure.md`.
+**Mailserver does NOT use a LB IP** — inbound mail enters via pfSense HAProxy on `10.0.20.1:{25,465,587,993}` → NodePorts `30125-30128` (PROXY-v2; see "Mail Server" below). (Earlier revisions of this table wrongly listed mailserver on `.200` and KMS on `.200` — both corrected 2026-06-03.)
+
+**pfSense aliases** map to these IPs: `k8s_shared_lb`→.200, `technitium_dns`→.201, `k8s_kms_lb`→.202, `traefik_lb`→.203 (plus a legacy `nginx`→.200 duplicate — cruft). NAT rules reference aliases, so repointing an alias cascades to its paired filter rule.
+
+¹ **windows-kms is publicly WAN-exposed.** pfSense forwards WAN TCP/1688 → `k8s_kms_lb` (.202) so any internet host can activate. The matching filter rule rate-limits per source (`max-src-conn 50`, `max-src-conn-rate 10/60`, `overload <virusprot>`). See `docs/runbooks/kms-public-exposure.md`.
+
+#### LB-IP renumber checklist
+
+These IPs are referenced by consumers that do **not** auto-follow when an IP moves — the 2026-05-30 Traefik `.200→.203` move broke five of them (cloudflared 502, woodpecker forge API, containerd pulls, the `.lan` + `.me` zones). **Before moving any LB IP, update every consumer below.** Bootstrap-critical literals (containerd mirror, PG state, node DNS) deliberately stay IP literals (DNS chicken-and-egg) — this list is their single source of truth.
+
+- **`.203` Traefik:** assigner `stacks/traefik/modules/traefik/main.tf` · split-horizon translation `stacks/technitium/modules/technitium/main.tf` (`externalToInternalTranslation`) · prometheus apex-alert summary `stacks/monitoring/.../prometheus_chart_values.tpl` · containerd Forgejo mirror `modules/create-template-vm/k8s-node-containerd-setup.sh` + `scripts/setup-forgejo-containerd-mirror.sh` (OOB, per node) · cloudflared origin (already IP-independent → `traefik.traefik.svc`) · woodpecker forge alias (now reads the Traefik **ClusterIP** dynamically — no literal) · pfSense NAT 80/443 → `traefik_lb`.
+- **`.201` Technitium:** assigner `stacks/technitium/modules/technitium/main.tf` · DNS records `config.tfvars` (ns1/ns2/`viktorbarzin.lan`, dnscrypt forwarder) · `modules/create-template-vm/cloud_init.yaml` FallbackDNS · `scripts/provision-k8s-worker` · pfSense NAT 53 (**literal `10.0.20.201`**, not the `technitium_dns` alias — known inconsistency).
+- **`.202` KMS:** assigner `stacks/kms/main.tf` · pfSense NAT 1688 → `k8s_kms_lb` · Cloudflare `vlmcs` public A → WAN → `.202`.
+- **`.200` shared:** the 9 assigners above · PG state backend `scripts/tg` + `scripts/migrate-state-to-pg` (`@10.0.20.200:5432`) · pfSense NAT (wireguard/shadowsocks/coturn/headscale-STUN/qbittorrent/xray) → `k8s_shared_lb`, outbound-NAT self rule, CrowdSec syslog `remoteserver .200:30514`.
 
 Critical services are scaled to **3 replicas**:
 - Traefik (PDB: minAvailable=2)
diff --git a/docs/plans/2026-06-03-lb-ip-hygiene-design.md b/docs/plans/2026-06-03-lb-ip-hygiene-design.md
new file mode 100644
index 00000000..d6154924
--- /dev/null
+++ b/docs/plans/2026-06-03-lb-ip-hygiene-design.md
@@ -0,0 +1,103 @@
+# L4 LoadBalancer IP review & pfSense hygiene — design + decisions
+
+**Date:** 2026-06-03
+**Status:** repo changes implemented; pfSense DHCP shrink pending live-change approval
+**Trigger:** "Review the L4 LB IPs we give away, consolidate, and use pfSense Virtual IPs instead of hardcoding IPs in rules."
+
+## TL;DR
+
+The headline ask — **consolidate to fewer MetalLB IPs** — is a verified dead end. The
+real, worthwhile outcome is a **single source of truth (this doc + the renumber
+checklist in `architecture/networking.md`) plus two stale-reference fixes**. We
+deliberately did **not** reduce the IP count and did **not** do the high-risk pfSense
+mail-VIP surgery.
+
+## Current state (verified live, 2026-06-03)
+
+MetalLB L2, pool `10.0.20.200-220` (21 IPs, **17 free**). Four in use:
+
+| IP | ETP | What | Why dedicated |
+|----|-----|------|---------------|
+| `.200` | Cluster (shared) | 9 svcs: postgresql-lb (TF state), dolt, coturn, headscale, wireguard, qbittorrent, shadowsocks, torrserver, xray | already maximally consolidated (the 2026-03 "5→1" merge) |
+| `.201` | Local | technitium-dns | real client IP → network-scoped split-horizon |
+| `.202` | Local | windows-kms | real client IP → notifier source labeling |
+| `.203` | Local | traefik | real client IP (CrowdSec) + QUIC/HTTP3 (UDP) |
+
+## Why consolidation fails (the core finding)
+
+MetalLB L2 only lets multiple `ETP=Local` services **share** an IP if they have
+**identical pod selectors** (so traffic to the single announcing node lands on the
+right pods). Traefik / KMS / Technitium have disjoint selectors and disjoint pods,
+and a shared `ETP=Local` IP announces from one node (stateless `node+VIP` hash) —
+blackholing any service whose pods aren't there. Refs:
+[MetalLB L2 concepts](https://metallb.universe.tf/concepts/layer2/),
+[Usage / IP sharing](https://metallb.universe.tf/usage/),
+[issue #271](https://github.com/metallb/metallb/issues/271).
+
+Consequences:
+- **Traefik can never leave a dedicated `ETP=Local` IP** — QUIC's UDP listener needs it; QUIC can't traverse pfSense HAProxy either.
+- The trio is **fewer-IPs XOR client-IP preservation** — not both. The only "both" is making all three DaemonSets (breaks Technitium's primary/secondary AXFR design; burns resources to save 2 of 17 free IPs). Not worth it.
+
+**Decision (user, 2026-06-03):** keep all 4 IPs (one shared, three dedicated), preserve client IPs everywhere. No MetalLB changes.
+
+## Why a doc registry instead of a `config.tfvars`/Terraform IP variable
+
+The cascade risk is **consumers that hardcode another service's IP and get forgotten**
+(the 2026-05-30 Traefik `.200→.203` move broke cloudflared, woodpecker, containerd,
+and the `.lan`+`.me` zones). A Terraform-var single-source was considered and rejected:
+
+1. Editing `terragrunt.hcl`/`config.tfvars` triggers the CI "global change → apply ALL
+   ~37 platform stacks" path (`.woodpecker/default.yml`) — a 37-stack apply for what
+   are no-op refactors (rendered IPs unchanged), risking unrelated drift surfacing.
+2. It can't cover the **out-of-band** consumers (cloudflared via CF-API, containerd
+   `hosts.toml` on each node), which were two of the five 2026-05-30 breakages.
+3. Bootstrap-critical literals (PG state in `scripts/tg`, node DNS) must stay literals
+   (DNS chicken-and-egg) regardless.
+
+A **documentation registry** (the "LB-IP renumber checklist" in
+`architecture/networking.md`) covers *all* consumers — in-band and OOB — at zero
+apply-risk, and is the complete pre-move checklist. That is the single source of truth.
+
+## Changes made (minimal-hygiene scope)
+
+1. **`architecture/networking.md`** — rewrote the stale MetalLB section into an accurate
+   registry (it had KMS on `.200`, mailserver on a LB IP, "two dedicated" — all wrong)
+   + added the **renumber checklist**.
+2. **woodpecker** (`stacks/woodpecker/main.tf`) — the `forgejo.viktorbarzin.me`
+   hostAlias hardcoded the **dead** `10.0.20.200` (Traefik moved to `.203`; `.200:443`
+   refuses TLS). Now reads the Traefik **ClusterIP dynamically** (`data
+   "kubernetes_service" "traefik"`) so it can't rot on a future renumber and avoids the
+   ETP=Local hairpin trap. (Real fix — the next woodpecker apply would otherwise
+   re-pin the dead IP and break pipeline creation.)
+3. **monitoring** (`prometheus_chart_values.tpl`) — `ViktorBarzinApexDrift` alert
+   summary said "expected 10.0.20.200" (stale post-Traefik-move) → `.203`. Cosmetic
+   (alert logic was already correct) but prevents a misleading incident message.
+4. **`backend.tf`** — 72 stale generated copies were tracked in git with a plaintext
+   (Vault-rotated, ~expired) PG password + `.200` literal, despite already being in
+   `.gitignore`. `git rm --cached` (they regenerate from `PG_CONN_STR`). History scrub
+   deferred (creds rotate weekly → low urgency).
+5. **pfSense DHCP range** (`opt1`/K8s VLAN) — `.200-.254` overlaps the MetalLB pool
+   `.200-.220` (latent IP-conflict: DHCP could hand out a live LB IP). Plan: shrink to
+   start at `.221`. Verified zero leases/statics in the band. **PENDING** — live
+   pfSense change, applied separately after explicit approval (live network device).
+
+## Explicitly NOT done (rationale)
+
+- **No MetalLB IP merging** — infeasible without losing client-IP/QUIC/HA (above).
+- **No mail Virtual IP** — mail binds pfSense's own `10.0.20.1`, the most stable IP in
+  the system; the 2026-06-02 incident was a *DNS split-horizon* bug, not an IP move.
+  A mail VIP is 4 NAT + 5 filter + HAProxy cutover on the live mail path for marginal
+  "identity" benefit. Skipped.
+- **No `nginx`-alias delete / NAT literal→alias** — pfSense rule cosmetics; left for a
+  later pfSense-focused pass (would also need the web filter F#2/F#3 `nginx`→`traefik_lb`
+  repoint to avoid breaking 80/443).
+- **No Terraform IP variable** — see registry rationale above.
+
+## Known latent items (documented, not fixed here)
+
+- pfSense web filter rules F#2/F#3 reference `nginx` (.200) while their NAT targets
+  `traefik_lb` (.203) — inconsistent but currently passing; fix in a pfSense pass.
+- pfSense NAT 53 hardcodes literal `10.0.20.201` instead of the `technitium_dns` alias.
+- In-cluster `*.viktorbarzin.me` split-horizon still resolves some hosts to the dead
+  `.200` (beads `code-yh33`) — the woodpecker hostAlias is the per-app workaround.
+- CrowdSec syslog `remoteserver` doc/config drift (`.200` vs comment `.202`).
diff --git a/stacks/monitoring/modules/monitoring/prometheus_chart_values.tpl b/stacks/monitoring/modules/monitoring/prometheus_chart_values.tpl
index be54fdf7..444aab66 100755
--- a/stacks/monitoring/modules/monitoring/prometheus_chart_values.tpl
+++ b/stacks/monitoring/modules/monitoring/prometheus_chart_values.tpl
@@ -2455,7 +2455,7 @@ serverFiles:
             labels:
               severity: critical
             annotations:
-              summary: "viktorbarzin.me apex A drifted from expected 10.0.20.200"
+              summary: "viktorbarzin.me apex A drifted from expected 10.0.20.203"
               description: "Technitium serves the split-horizon apex for ~80 *.viktorbarzin.me CNAMEs. If this is wrong, every internal service (auth, vault, immich, ha-sofia, ...) breaks. Check Technitium primary zone records via API or web console."
           - alert: ViktorBarzinApexProbeStale
             expr: (time() - viktorbarzin_apex_last_correct_timestamp{job="viktorbarzin-apex-probe"}) > 900
diff --git a/stacks/woodpecker/main.tf b/stacks/woodpecker/main.tf
index 96803307..2b389cd8 100644
--- a/stacks/woodpecker/main.tf
+++ b/stacks/woodpecker/main.tf
@@ -178,20 +178,34 @@ resource "helm_release" "woodpecker" {
 # Keeps the OAuth/forge-API path off the WAN gateway (forgejo.viktorbarzin.me
 # resolves to the public IP via DNS, which round-trips through Cloudflare
 # and routinely tripped 30s context-deadline timeouts when fetching pipeline
-# config). 10.0.20.200 is the Traefik LB that fronts forgejo internally;
-# Traefik serves the *.viktorbarzin.me wildcard so SNI verification still
-# passes.
+# config). We pin forgejo.viktorbarzin.me to the in-cluster Traefik Service
+# ClusterIP (read dynamically below); Traefik serves the *.viktorbarzin.me
+# wildcard so SNI verification still passes. ClusterIP (not the LB IP) avoids
+# two traps: (1) the Traefik LB IP is ETP=Local and not reliably hairpin-
+# reachable from an arbitrary pod's node; (2) a hard-coded LB IP rots on every
+# Traefik renumber — this previously pinned 10.0.20.200, which went dead when
+# Traefik moved to its dedicated .203 (2026-05-30), the same failure class the
+# cloudflared origin hit (also fixed by targeting the in-cluster Service).
+data "kubernetes_service" "traefik" {
+  metadata {
+    name      = "traefik"
+    namespace = "traefik"
+  }
+}
+
 resource "null_resource" "woodpecker_server_host_alias" {
   triggers = {
-    # Re-run on every helm_release version bump or values change.
-    helm_version = helm_release.woodpecker.version
-    helm_values  = sha256(join("", helm_release.woodpecker.values))
+    # Re-run on helm version/values change, and when the Traefik ClusterIP
+    # changes (e.g. the Service is recreated) so the alias self-heals.
+    helm_version       = helm_release.woodpecker.version
+    helm_values        = sha256(join("", helm_release.woodpecker.values))
+    traefik_cluster_ip = data.kubernetes_service.traefik.spec[0].cluster_ip
   }
 
   provisioner "local-exec" {
     command     = <<-BASH
       set -euo pipefail
-      kubectl -n woodpecker patch statefulset/woodpecker-server --type=strategic --patch '{"spec":{"template":{"spec":{"hostAliases":[{"ip":"10.0.20.200","hostnames":["forgejo.viktorbarzin.me"]}]}}}}'
+      kubectl -n woodpecker patch statefulset/woodpecker-server --type=strategic --patch '{"spec":{"template":{"spec":{"hostAliases":[{"ip":"${data.kubernetes_service.traefik.spec[0].cluster_ip}","hostnames":["forgejo.viktorbarzin.me"]}]}}}}'
       kubectl -n woodpecker rollout status statefulset/woodpecker-server --timeout=120s
     BASH
     interpreter = ["/bin/bash", "-c"]