infra: fix stale Traefik LB-IP refs + accurate LB-IP registry

Part of the L4 LB-IP review (docs/plans/2026-06-03-lb-ip-hygiene-design.md).
The 2026-05-30 Traefik .200->.203 move left consumers pointing at the dead
.200; this fixes the two in-Terraform ones and replaces the stale networking
doc with an accurate registry + a renumber checklist.

- woodpecker: forgejo.viktorbarzin.me hostAlias hardcoded 10.0.20.200
  (.200:443 refuses TLS now; the next woodpecker apply would re-pin it and
  break pipeline creation). Now reads the Traefik ClusterIP dynamically via a
  kubernetes_service data source -- cannot rot on a future renumber and avoids
  the ETP=Local hairpin trap.
- monitoring: ViktorBarzinApexDrift alert summary said "expected 10.0.20.200"
  -> 10.0.20.203 (cosmetic; alert logic already correct).
- docs/architecture/networking.md: rewrote the MetalLB section (it wrongly had
  KMS on .200, mailserver on a LB IP, "two dedicated") into an accurate 4-IP
  registry + LB-IP renumber checklist (in-band + out-of-band consumers).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Viktor Barzin 2026-06-03 10:23:13 +00:00
parent dcb7c74531
commit 7d7a0ad474
4 changed files with 147 additions and 24 deletions

@@ -267,25 +267,31 @@ The `websecure` entrypoint sets `respondingTimeouts` in `stacks/traefik/modules/
### MetalLB & Load Balancing
MetalLB v0.15.3 allocates IPs from the range 10.0.20.200-10.0.20.220 in **Layer 2 mode**. Most LoadBalancer services share **10.0.20.200** using the `metallb.io/allow-shared-ip: shared` annotation. Two services have **dedicated IPs** with `externalTrafficPolicy: Local` to preserve real client source IPs: **Traefik (10.0.20.203)** — so CrowdSec sees real public IPs on the direct-ingress path and QUIC/HTTP3 works (a shared IP forbids the mixed ETP that QUIC's UDP listener needs) — and **Technitium DNS (10.0.20.201)** for query logging.
MetalLB v0.15.3 allocates IPs from `10.0.20.200-10.0.20.220` (21 IPs) in **Layer 2 mode**; **four are in use**. Most LoadBalancer services share **10.0.20.200** (`metallb.io/allow-shared-ip: shared`, `externalTrafficPolicy: Cluster`). **Three services hold dedicated IPs with `externalTrafficPolicy: Local`** to preserve the real client source IP (and, for Traefik, to make QUIC/HTTP3 work — a shared IP forbids the mixed ETP the UDP listener needs).
| Service | Namespace | IP | Ports |
|---------|-----------|-----|-------|
| traefik | traefik | **10.0.20.203 (dedicated, ETP=Local)** | 80, 443, 443/UDP (HTTP/3), 10200, 10300, 11434/TCP |
| coturn | coturn | 10.0.20.200 (shared) | 3478/UDP (STUN/TURN), 49152-49252/UDP (relay) |
| headscale | headscale | 10.0.20.200 (shared) | 41641/UDP, 3479/UDP |
| windows-kms¹ | kms | 10.0.20.200 (shared) | 1688/TCP |
| qbittorrent | servarr | 10.0.20.200 (shared) | 50000/TCP+UDP |
| shadowsocks | shadowsocks | 10.0.20.200 (shared) | 8388/TCP+UDP |
| torrserver-bt | tor-proxy | 10.0.20.200 (shared) | 5665/TCP |
| wireguard | wireguard | 10.0.20.200 (shared) | 51820/UDP |
| mailserver | mailserver | 10.0.20.200 (shared) | 25, 465, 587, 993/TCP |
| xray-reality | xray | 10.0.20.200 (shared) | 7443/TCP |
| **technitium-dns** | **technitium** | **10.0.20.201 (dedicated)** | **53/UDP+TCP** |
> **Why not consolidate to fewer IPs?** The three dedicated IPs can't be merged. MetalLB L2 only lets `ETP=Local` services share an IP if they have *identical pod selectors* (Traefik/KMS/Technitium don't), and a shared `ETP=Local` IP announces from a single node — blackholing any service whose pods aren't on it. Traefik additionally can never leave a dedicated IP (QUIC needs the UDP listener on its own ETP=Local IP). Merging would cost client-IP preservation or HA, so the 4-IP layout is deliberate — not sprawl. Full analysis: `docs/plans/2026-06-03-lb-ip-hygiene-design.md`.
pfSense aliases reference these IPs: `k8s_shared_lb` (10.0.20.200), `traefik_lb` (10.0.20.203), `technitium_dns` (10.0.20.201). NAT rules use aliases for maintainability — the WAN 443 (TCP+UDP) forward targets `traefik_lb`.
| IP | ETP | Services (ns/name → ports) |
|----|-----|----------------------------|
| **10.0.20.200** (shared) | Cluster | dbaas/postgresql-lb→5432 · beads-server/dolt→3306 · coturn/coturn→3478 TCP+UDP, 49152-49252/UDP · headscale/headscale-server→41641/UDP, 3479/UDP · wireguard/wireguard→51820/UDP · servarr/qbittorrent-torrenting→50000 TCP+UDP · shadowsocks/shadowsocks→8388 TCP+UDP · tor-proxy/torrserver-bt→5665 TCP+UDP · xray/xray-reality→7443 |
| **10.0.20.201** (dedicated) | Local | technitium/technitium-dns→53 UDP+TCP |
| **10.0.20.202** (dedicated)¹ | Local | kms/windows-kms→1688 |
| **10.0.20.203** (dedicated) | Local | traefik/traefik→80, 443, 443/UDP (HTTP/3), 10200 (piper), 10300 (whisper) |
¹ **windows-kms is publicly WAN-exposed.** pfSense forwards WAN TCP/1688 → `k8s_shared_lb:1688` so any internet host can activate. The matching filter rule applies a per-source rate limit (`max-src-conn 50`, `max-src-conn-rate 10/60`) with `overload <virusprot>` flush — offenders are auto-added to pfSense's stock `virusprot` pf table for follow-on blocks. Operations (rate-limit tuning, log locations, revocation) are documented in `docs/runbooks/kms-public-exposure.md`.
**Mailserver does NOT use a LB IP** — inbound mail enters via pfSense HAProxy on `10.0.20.1:{25,465,587,993}` → NodePorts `30125-30128` (PROXY-v2; see "Mail Server" below). (Earlier revisions of this table wrongly listed mailserver on `.200` and KMS on `.200` — both corrected 2026-06-03.)
**pfSense aliases** map to these IPs: `k8s_shared_lb`→.200, `technitium_dns`→.201, `k8s_kms_lb`→.202, `traefik_lb`→.203 (plus a legacy `nginx`→.200 duplicate — cruft). NAT rules reference aliases, so repointing an alias cascades to its paired filter rule.
¹ **windows-kms is publicly WAN-exposed.** pfSense forwards WAN TCP/1688 → `k8s_kms_lb` (.202) so any internet host can activate. The matching filter rule rate-limits per source (`max-src-conn 50`, `max-src-conn-rate 10/60`, `overload <virusprot>`). See `docs/runbooks/kms-public-exposure.md`.
#### LB-IP renumber checklist
These IPs are referenced by consumers that do **not** auto-follow when an IP moves — the 2026-05-30 Traefik `.200→.203` move broke five of them (cloudflared 502, woodpecker forge API, containerd pulls, the `.lan` + `.me` zones). **Before moving any LB IP, update every consumer below.** Bootstrap-critical literals (containerd mirror, PG state, node DNS) deliberately stay IP literals (DNS chicken-and-egg) — this list is their single source of truth.
- **`.203` Traefik:** assigner `stacks/traefik/modules/traefik/main.tf` · split-horizon translation `stacks/technitium/modules/technitium/main.tf` (`externalToInternalTranslation`) · prometheus apex-alert summary `stacks/monitoring/.../prometheus_chart_values.tpl` · containerd Forgejo mirror `modules/create-template-vm/k8s-node-containerd-setup.sh` + `scripts/setup-forgejo-containerd-mirror.sh` (OOB, per node) · cloudflared origin (already IP-independent → `traefik.traefik.svc`) · woodpecker forge alias (now reads the Traefik **ClusterIP** dynamically — no literal) · pfSense NAT 80/443 → `traefik_lb`.
- **`.201` Technitium:** assigner `stacks/technitium/modules/technitium/main.tf` · DNS records `config.tfvars` (ns1/ns2/`viktorbarzin.lan`, dnscrypt forwarder) · `modules/create-template-vm/cloud_init.yaml` FallbackDNS · `scripts/provision-k8s-worker` · pfSense NAT 53 (**literal `10.0.20.201`**, not the `technitium_dns` alias — known inconsistency).
- **`.202` KMS:** assigner `stacks/kms/main.tf` · pfSense NAT 1688 → `k8s_kms_lb` · Cloudflare `vlmcs` public A → WAN → `.202`.
- **`.200` shared:** the 9 assigners above · PG state backend `scripts/tg` + `scripts/migrate-state-to-pg` (`@10.0.20.200:5432`) · pfSense NAT (wireguard/shadowsocks/coturn/headscale-STUN/qbittorrent/xray) → `k8s_shared_lb`, outbound-NAT self rule, CrowdSec syslog `remoteserver .200:30514`.
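
The in-repo half of this checklist can be backed by a mechanical scan for the old literal before a move. The following is a minimal sketch, assuming a local checkout; `find_ip_literals` and its `skip_dirs` default are hypothetical names, not existing tooling. It only finds full literals such as `10.0.20.200` (not the `.200` shorthand used in prose) and cannot see the out-of-band consumers (pfSense, per-node containerd config, Cloudflare).

```python
import os
import re


def find_ip_literals(root, ip, skip_dirs=(".git", ".terraform")):
    """Return (path, line_number, line) for every line under root that contains ip.

    The match is anchored on both sides so that 10.0.20.20 does not match
    inside 10.0.20.200, and 10.0.20.200 does not match inside 110.0.20.200.
    """
    pattern = re.compile(r"(?<![\d.])" + re.escape(ip) + r"(?!\d)")
    hits = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune in place so os.walk does not descend into skipped directories.
        dirnames[:] = [d for d in dirnames if d not in skip_dirs]
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                with open(path, encoding="utf-8", errors="ignore") as handle:
                    for number, line in enumerate(handle, start=1):
                        if pattern.search(line):
                            hits.append((path, number, line.rstrip("\n")))
            except OSError:
                continue  # unreadable file (broken symlink, permissions): skip it
    return hits
```

Calling `find_ip_literals(".", "10.0.20.200")` from the repo root lists every remaining literal. Hits in bootstrap-critical files are expected and still need a manual update at move time.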
Critical services are scaled to **3 replicas**:
- Traefik (PDB: minAvailable=2)

@@ -0,0 +1,103 @@
# L4 LoadBalancer IP review & pfSense hygiene — design + decisions
**Date:** 2026-06-03
**Status:** repo changes implemented; pfSense DHCP shrink pending live-change approval
**Trigger:** "Review the L4 LB IPs we give away, consolidate, and use pfSense Virtual IPs instead of hardcoding IPs in rules."
## TL;DR
The headline ask — **consolidate to fewer MetalLB IPs** — is a verified dead end. The
real, worthwhile outcome is a **single source of truth (this doc + the renumber
checklist in `architecture/networking.md`) plus two stale-reference fixes**. We
deliberately did **not** reduce the IP count and did **not** do the high-risk pfSense
mail-VIP surgery.
## Current state (verified live, 2026-06-03)
MetalLB L2, pool `10.0.20.200-220` (21 IPs, **17 free**). Four in use:
| IP | ETP | What | Why dedicated |
|----|-----|------|---------------|
| `.200` | Cluster (shared) | ~9 svcs: postgresql-lb (TF state), dolt, coturn, headscale, wireguard, qbittorrent, shadowsocks, torrserver, xray | already maximally consolidated (the 2026-03 "5→1" merge) |
| `.201` | Local | technitium-dns | real client IP → network-scoped split-horizon |
| `.202` | Local | windows-kms | real client IP → notifier source labeling |
| `.203` | Local | traefik | real client IP (CrowdSec) + QUIC/HTTP3 (UDP) |
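
The pool arithmetic above (21 addresses, 4 in use, 17 free) can be reproduced with the standard library. This is only an illustrative sketch; the variable names are hypothetical, and the in-use set is copied from the table rather than queried from the cluster.

```python
import ipaddress

# Pool bounds and in-use addresses as listed in the table above.
pool_start = ipaddress.IPv4Address("10.0.20.200")
pool_end = ipaddress.IPv4Address("10.0.20.220")
in_use = {ipaddress.IPv4Address(f"10.0.20.{n}") for n in (200, 201, 202, 203)}

# The range is inclusive on both ends, hence the +1.
pool_size = int(pool_end) - int(pool_start) + 1
free = [pool_start + i for i in range(pool_size) if pool_start + i not in in_use]

print(pool_size, len(free), free[0])
```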
## Why consolidation fails (the core finding)
MetalLB L2 only lets multiple `ETP=Local` services **share** an IP if they have
**identical pod selectors** (so traffic to the single announcing node lands on the
right pods). Traefik / KMS / Technitium have disjoint selectors and disjoint pods,
and a shared `ETP=Local` IP announces from one node (stateless `node+VIP` hash) —
blackholing any service whose pods aren't there. Refs:
[MetalLB L2 concepts](https://metallb.universe.tf/concepts/layer2/),
[Usage / IP sharing](https://metallb.universe.tf/usage/),
[issue #271](https://github.com/metallb/metallb/issues/271).
Consequences:
- **Traefik can never leave a dedicated `ETP=Local` IP** — QUIC's UDP listener needs it; QUIC can't traverse pfSense HAProxy either.
- The trio is **fewer-IPs XOR client-IP preservation** — not both. The only "both" is making all three DaemonSets (breaks Technitium's primary/secondary AXFR design; burns resources to save 2 of 17 free IPs). Not worth it.
**Decision (user, 2026-06-03):** keep all 4 IPs (1 shared + 3 dedicated), preserve client IPs everywhere. No MetalLB changes.
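
The sharing rule behind this finding can be written down as a small predicate. What follows is a toy model, not MetalLB code: it encodes the rule as described above plus the distinct-ports condition from the MetalLB usage docs. The `Service` type, the `app=...` selectors, and the `shared` key on the dedicated services are hypothetical, chosen only to show what would happen if a merge were attempted.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Service:
    name: str
    sharing_key: str      # value of the metallb.io/allow-shared-ip annotation
    etp: str              # externalTrafficPolicy: "Cluster" or "Local"
    selector: frozenset   # pod selector as (label, value) pairs
    ports: frozenset      # (protocol, port) pairs


def can_share_ip(a, b):
    """Toy model of whether two services may be co-located on one LB IP."""
    if a.sharing_key != b.sharing_key:
        return False
    if a.ports & b.ports:
        return False  # the same protocol+port cannot be claimed twice on one IP
    if a.etp == "Cluster" and b.etp == "Cluster":
        return True
    # As soon as ETP=Local is involved, the pod selectors must be identical.
    return a.selector == b.selector


# Hypothetical selectors; ports taken from the registry.
traefik = Service("traefik", "shared", "Local",
                  frozenset({("app", "traefik")}),
                  frozenset({("TCP", 80), ("TCP", 443), ("UDP", 443)}))
technitium = Service("technitium-dns", "shared", "Local",
                     frozenset({("app", "technitium")}),
                     frozenset({("TCP", 53), ("UDP", 53)}))
wireguard = Service("wireguard", "shared", "Cluster",
                    frozenset({("app", "wireguard")}),
                    frozenset({("UDP", 51820)}))
coturn = Service("coturn", "shared", "Cluster",
                 frozenset({("app", "coturn")}),
                 frozenset({("UDP", 3478)}))

print(can_share_ip(traefik, technitium))
print(can_share_ip(wireguard, coturn))
```

In this model the `.200` services can share an IP because both sides are `Cluster`, while any pairing among the dedicated trio fails on the selector comparison.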
## Why a doc registry instead of a `config.tfvars`/Terraform IP variable
The cascade risk is **consumers that hardcode another service's IP and get forgotten**
(the 2026-05-30 Traefik `.200→.203` move broke cloudflared, woodpecker, containerd,
and the `.lan`+`.me` zones). A Terraform-var single-source was considered and rejected:
1. Editing `terragrunt.hcl`/`config.tfvars` triggers the CI "global change → apply ALL
~37 platform stacks" path (`.woodpecker/default.yml`) — a 37-stack apply for what
are no-op refactors (rendered IPs unchanged), risking unrelated drift surfacing.
2. It can't cover the **out-of-band** consumers (cloudflared via CF-API, containerd
`hosts.toml` on each node) — which were half the 2026-05-30 breakage.
3. Bootstrap-critical literals (PG state in `scripts/tg`, node DNS) must stay literals
(DNS chicken-and-egg) regardless.
A **documentation registry** (the "LB-IP renumber checklist" in
`architecture/networking.md`) covers *all* consumers — in-band and OOB — at zero
apply-risk, and is the complete pre-move checklist. That is the single source of truth.
## Changes made (minimal-hygiene scope)
1. **`architecture/networking.md`** — rewrote the stale MetalLB section into an accurate
registry (it had KMS on `.200`, mailserver on a LB IP, "two dedicated" — all wrong)
+ added the **renumber checklist**.
2. **woodpecker** (`stacks/woodpecker/main.tf`) — the `forgejo.viktorbarzin.me`
hostAlias hardcoded the **dead** `10.0.20.200` (Traefik moved to `.203`; `.200:443`
refuses TLS). Now reads the Traefik **ClusterIP dynamically** (`data
"kubernetes_service" "traefik"`) so it can't rot on a future renumber and avoids the
ETP=Local hairpin trap. (Real fix — the next woodpecker apply would otherwise
re-pin the dead IP and break pipeline creation.)
3. **monitoring** (`prometheus_chart_values.tpl`) — `ViktorBarzinApexDrift` alert
summary said "expected 10.0.20.200" (stale post-Traefik-move) → `.203`. Cosmetic
(alert logic was already correct) but prevents a misleading incident message.
4. **`backend.tf`** — 72 stale generated copies were tracked in git with a plaintext
(Vault-rotated, ~expired) PG password + `.200` literal, despite already being in
`.gitignore`. `git rm --cached` (they regenerate from `PG_CONN_STR`). History scrub
deferred (creds rotate weekly → low urgency).
5. **pfSense DHCP range** (`opt1`/K8s VLAN) — `.200-.254` overlaps the MetalLB pool
`.200-.220` (latent IP-conflict: DHCP could hand out a live LB IP). Plan: shrink to
start at `.221`. Verified zero leases/statics in the band. **PENDING** — live
pfSense change, applied separately after explicit approval (live network device).
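
The DHCP/MetalLB overlap in item 5 is easy to check mechanically. A minimal sketch with the standard `ipaddress` module; `range_overlap` is a hypothetical helper, and the `10.0.20.` prefix on the DHCP bounds is an assumption, since the text above gives only the last octets.

```python
import ipaddress


def range_overlap(start_a, end_a, start_b, end_b):
    """Return the (first, last) addresses shared by two inclusive IPv4 ranges, or None."""
    first = max(ipaddress.IPv4Address(start_a), ipaddress.IPv4Address(start_b))
    last = min(ipaddress.IPv4Address(end_a), ipaddress.IPv4Address(end_b))
    return (first, last) if first <= last else None


metallb_pool = ("10.0.20.200", "10.0.20.220")
dhcp_before = ("10.0.20.200", "10.0.20.254")  # current range (prefix assumed)
dhcp_after = ("10.0.20.221", "10.0.20.254")   # planned range after the shrink

print(range_overlap(*dhcp_before, *metallb_pool))
print(range_overlap(*dhcp_after, *metallb_pool))
```

The current range shares the whole MetalLB pool, and the planned range starting at `.221` shares nothing with it.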
## Explicitly NOT done (rationale)
- **No MetalLB IP merging** — infeasible without losing client-IP/QUIC/HA (above).
- **No mail Virtual IP** — mail binds pfSense's own `10.0.20.1`, the most stable IP in
the system; the 2026-06-02 incident was a *DNS split-horizon* bug, not an IP move.
A mail VIP is 4 NAT + 5 filter + HAProxy cutover on the live mail path for marginal
"identity" benefit. Skipped.
- **No `nginx`-alias delete / NAT literal→alias** — pfSense rule cosmetics; left for a
later pfSense-focused pass (would also need the web filter F#2/F#3 `nginx`→`traefik_lb`
repoint to avoid breaking 80/443).
- **No Terraform IP variable** — see registry rationale above.
## Known latent items (documented, not fixed here)
- pfSense web filter rules F#2/F#3 reference `nginx` (.200) while their NAT targets
`traefik_lb` (.203) — inconsistent but currently passing; fix in a pfSense pass.
- pfSense NAT 53 hardcodes literal `10.0.20.201` instead of the `technitium_dns` alias.
- In-cluster `*.viktorbarzin.me` split-horizon still resolves some hosts to the dead
`.200` (beads `code-yh33`) — the woodpecker hostAlias is the per-app workaround.
- CrowdSec syslog `remoteserver` doc/config drift (`.200` vs comment `.202`).

@@ -2455,7 +2455,7 @@ serverFiles:
labels:
severity: critical
annotations:
summary: "viktorbarzin.me apex A drifted from expected 10.0.20.200"
summary: "viktorbarzin.me apex A drifted from expected 10.0.20.203"
description: "Technitium serves the split-horizon apex for ~80 *.viktorbarzin.me CNAMEs. If this is wrong, every internal service (auth, vault, immich, ha-sofia, ...) breaks. Check Technitium primary zone records via API or web console."
- alert: ViktorBarzinApexProbeStale
expr: (time() - viktorbarzin_apex_last_correct_timestamp{job="viktorbarzin-apex-probe"}) > 900

@@ -178,20 +178,34 @@ resource "helm_release" "woodpecker" {
# Keeps the OAuth/forge-API path off the WAN gateway (forgejo.viktorbarzin.me
# resolves to the public IP via DNS, which round-trips through Cloudflare
# and routinely tripped 30s context-deadline timeouts when fetching pipeline
# config). 10.0.20.200 is the Traefik LB that fronts forgejo internally;
# Traefik serves the *.viktorbarzin.me wildcard so SNI verification still
# passes.
# config). We pin forgejo.viktorbarzin.me to the in-cluster Traefik Service
# ClusterIP (read dynamically below); Traefik serves the *.viktorbarzin.me
# wildcard so SNI verification still passes. ClusterIP (not the LB IP) avoids
# two traps: (1) the Traefik LB IP is ETP=Local and not reliably hairpin-
# reachable from an arbitrary pod's node; (2) a hard-coded LB IP rots on every
# Traefik renumber. This previously pinned 10.0.20.200, which went dead when
# Traefik moved to its dedicated .203 (2026-05-30), the same failure class the
# cloudflared origin hit (also fixed by targeting the in-cluster Service).
data "kubernetes_service" "traefik" {
metadata {
name = "traefik"
namespace = "traefik"
}
}
resource "null_resource" "woodpecker_server_host_alias" {
triggers = {
# Re-run on every helm_release version bump or values change.
helm_version = helm_release.woodpecker.version
helm_values = sha256(join("", helm_release.woodpecker.values))
# Re-run on helm version/values change, and when the Traefik ClusterIP
# changes (e.g. the Service is recreated) so the alias self-heals.
helm_version = helm_release.woodpecker.version
helm_values = sha256(join("", helm_release.woodpecker.values))
traefik_cluster_ip = data.kubernetes_service.traefik.spec[0].cluster_ip
}
provisioner "local-exec" {
command = <<-BASH
set -euo pipefail
kubectl -n woodpecker patch statefulset/woodpecker-server --type=strategic --patch '{"spec":{"template":{"spec":{"hostAliases":[{"ip":"10.0.20.200","hostnames":["forgejo.viktorbarzin.me"]}]}}}}'
kubectl -n woodpecker patch statefulset/woodpecker-server --type=strategic --patch '{"spec":{"template":{"spec":{"hostAliases":[{"ip":"${data.kubernetes_service.traefik.spec[0].cluster_ip}","hostnames":["forgejo.viktorbarzin.me"]}]}}}}'
kubectl -n woodpecker rollout status statefulset/woodpecker-server --timeout=120s
BASH
interpreter = ["/bin/bash", "-c"]