From 7d7a0ad4741e5659fb6a1c8f8b5b3a415419e9c8 Mon Sep 17 00:00:00 2001
From: Viktor Barzin
Date: Wed, 3 Jun 2026 10:23:13 +0000
Subject: [PATCH] infra: fix stale Traefik LB-IP refs + accurate LB-IP registry

Part of the L4 LB-IP review (docs/plans/2026-06-03-lb-ip-hygiene-design.md).
The 2026-05-30 Traefik .200->.203 move left consumers pointing at the dead
.200; this fixes the two in-Terraform ones and replaces the stale networking
doc with an accurate registry + a renumber checklist.

- woodpecker: forgejo.viktorbarzin.me hostAlias hardcoded 10.0.20.200
  (.200:443 refuses TLS now; the next woodpecker apply would re-pin it and
  break pipeline creation). Now reads the Traefik ClusterIP dynamically via
  a kubernetes_service data source -- cannot rot on a future renumber and
  avoids the ETP=Local hairpin trap.
- monitoring: ViktorBarzinApexDrift alert summary said "expected
  10.0.20.200" -> 10.0.20.203 (cosmetic; alert logic already correct).
- docs/architecture/networking.md: rewrote the MetalLB section (it wrongly
  had KMS on .200, mailserver on a LB IP, "two dedicated") into an accurate
  4-IP registry + LB-IP renumber checklist (in-band + out-of-band consumers).

Co-Authored-By: Claude Opus 4.8
---
 docs/architecture/networking.md               |  38 ++++---
 docs/plans/2026-06-03-lb-ip-hygiene-design.md | 103 ++++++++++++++++++
 .../monitoring/prometheus_chart_values.tpl    |   2 +-
 stacks/woodpecker/main.tf                     |  28 +++--
 4 files changed, 147 insertions(+), 24 deletions(-)
 create mode 100644 docs/plans/2026-06-03-lb-ip-hygiene-design.md

diff --git a/docs/architecture/networking.md b/docs/architecture/networking.md
index daf3b18c..434a6f79 100644
--- a/docs/architecture/networking.md
+++ b/docs/architecture/networking.md
@@ -267,25 +267,31 @@ The `websecure` entrypoint sets `respondingTimeouts` in `stacks/traefik/modules/
 
 ### MetalLB & Load Balancing
 
-MetalLB v0.15.3 allocates IPs from the range 10.0.20.200-10.0.20.220 in **Layer 2 mode**. Most LoadBalancer services share **10.0.20.200** using the `metallb.io/allow-shared-ip: shared` annotation. Two services have **dedicated IPs** with `externalTrafficPolicy: Local` to preserve real client source IPs: **Traefik (10.0.20.203)** — so CrowdSec sees real public IPs on the direct-ingress path and QUIC/HTTP3 works (a shared IP forbids the mixed ETP that QUIC's UDP listener needs) — and **Technitium DNS (10.0.20.201)** for query logging.
+MetalLB v0.15.3 allocates IPs from `10.0.20.200-10.0.20.220` (21 IPs) in **Layer 2 mode**; **four are in use**. Most LoadBalancer services share **10.0.20.200** (`metallb.io/allow-shared-ip: shared`, `externalTrafficPolicy: Cluster`). **Three services hold dedicated IPs with `externalTrafficPolicy: Local`** to preserve the real client source IP (and, for Traefik, to make QUIC/HTTP3 work — a shared IP forbids the mixed ETP the UDP listener needs).
 
-| Service | Namespace | IP | Ports |
-|---------|-----------|-----|-------|
-| traefik | traefik | **10.0.20.203 (dedicated, ETP=Local)** | 80, 443, 443/UDP (HTTP/3), 10200, 10300, 11434/TCP |
-| coturn | coturn | 10.0.20.200 (shared) | 3478/UDP (STUN/TURN), 49152-49252/UDP (relay) |
-| headscale | headscale | 10.0.20.200 (shared) | 41641/UDP, 3479/UDP |
-| windows-kms¹ | kms | 10.0.20.200 (shared) | 1688/TCP |
-| qbittorrent | servarr | 10.0.20.200 (shared) | 50000/TCP+UDP |
-| shadowsocks | shadowsocks | 10.0.20.200 (shared) | 8388/TCP+UDP |
-| torrserver-bt | tor-proxy | 10.0.20.200 (shared) | 5665/TCP |
-| wireguard | wireguard | 10.0.20.200 (shared) | 51820/UDP |
-| mailserver | mailserver | 10.0.20.200 (shared) | 25, 465, 587, 993/TCP |
-| xray-reality | xray | 10.0.20.200 (shared) | 7443/TCP |
-| **technitium-dns** | **technitium** | **10.0.20.201 (dedicated)** | **53/UDP+TCP** |
+> **Why not consolidate to fewer IPs?** The three dedicated IPs can't be merged. MetalLB L2 only lets `ETP=Local` services share an IP if they have *identical pod selectors* (Traefik/KMS/Technitium don't), and a shared `ETP=Local` IP announces from a single node — blackholing any service whose pods aren't on it. Traefik additionally can never leave a dedicated IP (QUIC needs the UDP listener on its own ETP=Local IP). Merging would cost client-IP preservation or HA, so the 4-IP layout is deliberate — not sprawl. Full analysis: `docs/plans/2026-06-03-lb-ip-hygiene-design.md`.
 
-pfSense aliases reference these IPs: `k8s_shared_lb` (10.0.20.200), `traefik_lb` (10.0.20.203), `technitium_dns` (10.0.20.201). NAT rules use aliases for maintainability — the WAN 443 (TCP+UDP) forward targets `traefik_lb`.
+| IP | ETP | Services (ns/name → ports) |
+|----|-----|----------------------------|
+| **10.0.20.200** (shared) | Cluster | dbaas/postgresql-lb→5432 · beads-server/dolt→3306 · coturn/coturn→3478 TCP+UDP, 49152-49252/UDP · headscale/headscale-server→41641/UDP, 3479/UDP · wireguard/wireguard→51820/UDP · servarr/qbittorrent-torrenting→50000 TCP+UDP · shadowsocks/shadowsocks→8388 TCP+UDP · tor-proxy/torrserver-bt→5665 TCP+UDP · xray/xray-reality→7443 |
+| **10.0.20.201** (dedicated) | Local | technitium/technitium-dns→53 UDP+TCP |
+| **10.0.20.202** (dedicated)¹ | Local | kms/windows-kms→1688 |
+| **10.0.20.203** (dedicated) | Local | traefik/traefik→80, 443, 443/UDP (HTTP/3), 10200 (piper), 10300 (whisper) |
 
-¹ **windows-kms is publicly WAN-exposed.** pfSense forwards WAN TCP/1688 → `k8s_shared_lb:1688` so any internet host can activate. The matching filter rule applies a per-source rate limit (`max-src-conn 50`, `max-src-conn-rate 10/60`) with `overload ` flush — offenders are auto-added to pfSense's stock `virusprot` pf table for follow-on blocks. Operations (rate-limit tuning, log locations, revocation) are documented in `docs/runbooks/kms-public-exposure.md`.
+**Mailserver does NOT use a LB IP** — inbound mail enters via pfSense HAProxy on `10.0.20.1:{25,465,587,993}` → NodePorts `30125-30128` (PROXY-v2; see "Mail Server" below). (Earlier revisions of this table wrongly listed mailserver on `.200` and KMS on `.200` — both corrected 2026-06-03.)
+
+**pfSense aliases** map to these IPs: `k8s_shared_lb`→.200, `technitium_dns`→.201, `k8s_kms_lb`→.202, `traefik_lb`→.203 (plus a legacy `nginx`→.200 duplicate — cruft). NAT rules reference aliases, so repointing an alias cascades to its paired filter rule.
+
+¹ **windows-kms is publicly WAN-exposed.** pfSense forwards WAN TCP/1688 → `k8s_kms_lb` (.202) so any internet host can activate. The matching filter rule rate-limits per source (`max-src-conn 50`, `max-src-conn-rate 10/60`, `overload `). See `docs/runbooks/kms-public-exposure.md`.
+
+#### LB-IP renumber checklist
+
+These IPs are referenced by consumers that do **not** auto-follow when an IP moves — the 2026-05-30 Traefik `.200→.203` move broke five of them (cloudflared 502, woodpecker forge API, containerd pulls, the `.lan` + `.me` zones). **Before moving any LB IP, update every consumer below.** Bootstrap-critical literals (containerd mirror, PG state, node DNS) deliberately stay IP literals (DNS chicken-and-egg) — this list is their single source of truth.
+
+- **`.203` Traefik:** assigner `stacks/traefik/modules/traefik/main.tf` · split-horizon translation `stacks/technitium/modules/technitium/main.tf` (`externalToInternalTranslation`) · prometheus apex-alert summary `stacks/monitoring/.../prometheus_chart_values.tpl` · containerd Forgejo mirror `modules/create-template-vm/k8s-node-containerd-setup.sh` + `scripts/setup-forgejo-containerd-mirror.sh` (OOB, per node) · cloudflared origin (already IP-independent → `traefik.traefik.svc`) · woodpecker forge alias (now reads the Traefik **ClusterIP** dynamically — no literal) · pfSense NAT 80/443 → `traefik_lb`.
+- **`.201` Technitium:** assigner `stacks/technitium/modules/technitium/main.tf` · DNS records `config.tfvars` (ns1/ns2/`viktorbarzin.lan`, dnscrypt forwarder) · `modules/create-template-vm/cloud_init.yaml` FallbackDNS · `scripts/provision-k8s-worker` · pfSense NAT 53 (**literal `10.0.20.201`**, not the `technitium_dns` alias — known inconsistency).
+- **`.202` KMS:** assigner `stacks/kms/main.tf` · pfSense NAT 1688 → `k8s_kms_lb` · Cloudflare `vlmcs` public A → WAN → `.202`.
+- **`.200` shared:** the 9 assigners above · PG state backend `scripts/tg` + `scripts/migrate-state-to-pg` (`@10.0.20.200:5432`) · pfSense NAT (wireguard/shadowsocks/coturn/headscale-STUN/qbittorrent/xray) → `k8s_shared_lb`, outbound-NAT self rule, CrowdSec syslog `remoteserver .200:30514`.
 
 Critical services are scaled to **3 replicas**:
 - Traefik (PDB: minAvailable=2)
diff --git a/docs/plans/2026-06-03-lb-ip-hygiene-design.md b/docs/plans/2026-06-03-lb-ip-hygiene-design.md
new file mode 100644
index 00000000..d6154924
--- /dev/null
+++ b/docs/plans/2026-06-03-lb-ip-hygiene-design.md
@@ -0,0 +1,103 @@
+# L4 LoadBalancer IP review & pfSense hygiene — design + decisions
+
+**Date:** 2026-06-03
+**Status:** repo changes implemented; pfSense DHCP shrink pending live-change approval
+**Trigger:** "Review the L4 LB IPs we give away, consolidate, and use pfSense Virtual IPs instead of hardcoding IPs in rules."
+
+## TL;DR
+
+The headline ask — **consolidate to fewer MetalLB IPs** — is a verified dead end. The
+real, worthwhile outcome is a **single source of truth (this doc + the renumber
+checklist in `architecture/networking.md`) plus two stale-reference fixes**. We
+deliberately did **not** reduce the IP count and did **not** do the high-risk pfSense
+mail-VIP surgery.
+
+## Current state (verified live, 2026-06-03)
+
+MetalLB L2, pool `10.0.20.200-220` (21 IPs, **17 free**). Four in use:
+
+| IP | ETP | What | Why dedicated |
+|----|-----|------|---------------|
+| `.200` | Cluster (shared) | ~9 svcs: postgresql-lb (TF state), dolt, coturn, headscale, wireguard, qbittorrent, shadowsocks, torrserver, xray | already maximally consolidated (the 2026-03 "5→1" merge) |
+| `.201` | Local | technitium-dns | real client IP → network-scoped split-horizon |
+| `.202` | Local | windows-kms | real client IP → notifier source labeling |
+| `.203` | Local | traefik | real client IP (CrowdSec) + QUIC/HTTP3 (UDP) |
+
+## Why consolidation fails (the core finding)
+
+MetalLB L2 only lets multiple `ETP=Local` services **share** an IP if they have
+**identical pod selectors** (so traffic to the single announcing node lands on the
+right pods). Traefik / KMS / Technitium have disjoint selectors and disjoint pods,
+and a shared `ETP=Local` IP announces from one node (stateless `node+VIP` hash) —
+blackholing any service whose pods aren't there. Refs:
+[MetalLB L2 concepts](https://metallb.universe.tf/concepts/layer2/),
+[Usage / IP sharing](https://metallb.universe.tf/usage/),
+[issue #271](https://github.com/metallb/metallb/issues/271).
+
+Consequences:
+- **Traefik can never leave a dedicated `ETP=Local` IP** — QUIC's UDP listener needs it; QUIC can't traverse pfSense HAProxy either.
+- The trio is **fewer-IPs XOR client-IP preservation** — not both. The only "both" is making all three DaemonSets (breaks Technitium's primary/secondary AXFR design; burns resources to save 2 of 17 free IPs). Not worth it.
+
+**Decision (user, 2026-06-03):** keep all 4 dedicated, preserve client IPs everywhere. No MetalLB changes.
+
+## Why a doc registry instead of a `config.tfvars`/Terraform IP variable
+
+The cascade risk is **consumers that hardcode another service's IP and get forgotten**
+(the 2026-05-30 Traefik `.200→.203` move broke cloudflared, woodpecker, containerd,
+and the `.lan`+`.me` zones). A Terraform-var single-source was considered and rejected:
+
+1. Editing `terragrunt.hcl`/`config.tfvars` triggers the CI "global change → apply ALL
+   ~37 platform stacks" path (`.woodpecker/default.yml`) — a 37-stack apply for what
+   are no-op refactors (rendered IPs unchanged), risking unrelated drift surfacing.
+2. It can't cover the **out-of-band** consumers (cloudflared via CF-API, containerd
+   `hosts.toml` on each node) — which were half the 2026-05-30 breakage.
+3. Bootstrap-critical literals (PG state in `scripts/tg`, node DNS) must stay literals
+   (DNS chicken-and-egg) regardless.
+
+A **documentation registry** (the "LB-IP renumber checklist" in
+`architecture/networking.md`) covers *all* consumers — in-band and OOB — at zero
+apply-risk, and is the complete pre-move checklist. That is the single source of truth.
+
+## Changes made (minimal-hygiene scope)
+
+1. **`architecture/networking.md`** — rewrote the stale MetalLB section into an accurate
+   registry (it had KMS on `.200`, mailserver on a LB IP, "two dedicated" — all wrong)
+   + added the **renumber checklist**.
+2. **woodpecker** (`stacks/woodpecker/main.tf`) — the `forgejo.viktorbarzin.me`
+   hostAlias hardcoded the **dead** `10.0.20.200` (Traefik moved to `.203`; `.200:443`
+   refuses TLS). Now reads the Traefik **ClusterIP dynamically** (`data
+   "kubernetes_service" "traefik"`) so it can't rot on a future renumber and avoids the
+   ETP=Local hairpin trap. (Real fix — the next woodpecker apply would otherwise
+   re-pin the dead IP and break pipeline creation.)
+3. **monitoring** (`prometheus_chart_values.tpl`) — `ViktorBarzinApexDrift` alert
+   summary said "expected 10.0.20.200" (stale post-Traefik-move) → `.203`. Cosmetic
+   (alert logic was already correct) but prevents a misleading incident message.
+4. **`backend.tf`** — 72 stale generated copies were tracked in git with a plaintext
+   (Vault-rotated, ~expired) PG password + `.200` literal, despite already being in
+   `.gitignore`. `git rm --cached` (they regenerate from `PG_CONN_STR`). History scrub
+   deferred (creds rotate weekly → low urgency).
+5. **pfSense DHCP range** (`opt1`/K8s VLAN) — `.200-.254` overlaps the MetalLB pool
+   `.200-.220` (latent IP-conflict: DHCP could hand out a live LB IP). Plan: shrink to
+   start at `.221`. Verified zero leases/statics in the band. **PENDING** — live
+   pfSense change, applied separately after explicit approval (live network device).
+
+## Explicitly NOT done (rationale)
+
+- **No MetalLB IP merging** — infeasible without losing client-IP/QUIC/HA (above).
+- **No mail Virtual IP** — mail binds pfSense's own `10.0.20.1`, the most stable IP in
+  the system; the 2026-06-02 incident was a *DNS split-horizon* bug, not an IP move.
+  A mail VIP is 4 NAT + 5 filter + HAProxy cutover on the live mail path for marginal
+  "identity" benefit. Skipped.
+- **No `nginx`-alias delete / NAT literal→alias** — pfSense rule cosmetics; left for a
+  later pfSense-focused pass (would also need the web filter F#2/F#3 `nginx`→`traefik_lb`
+  repoint to avoid breaking 80/443).
+- **No Terraform IP variable** — see registry rationale above.
+
+## Known latent items (documented, not fixed here)
+
+- pfSense web filter rules F#2/F#3 reference `nginx` (.200) while their NAT targets
+  `traefik_lb` (.203) — inconsistent but currently passing; fix in a pfSense pass.
+- pfSense NAT 53 hardcodes literal `10.0.20.201` instead of the `technitium_dns` alias.
+- In-cluster `*.viktorbarzin.me` split-horizon still resolves some hosts to the dead
+  `.200` (beads `code-yh33`) — the woodpecker hostAlias is the per-app workaround.
+- CrowdSec syslog `remoteserver` doc/config drift (`.200` vs comment `.202`).
diff --git a/stacks/monitoring/modules/monitoring/prometheus_chart_values.tpl b/stacks/monitoring/modules/monitoring/prometheus_chart_values.tpl
index be54fdf7..444aab66 100755
--- a/stacks/monitoring/modules/monitoring/prometheus_chart_values.tpl
+++ b/stacks/monitoring/modules/monitoring/prometheus_chart_values.tpl
@@ -2455,7 +2455,7 @@ serverFiles:
             labels:
               severity: critical
             annotations:
-              summary: "viktorbarzin.me apex A drifted from expected 10.0.20.200"
+              summary: "viktorbarzin.me apex A drifted from expected 10.0.20.203"
               description: "Technitium serves the split-horizon apex for ~80 *.viktorbarzin.me CNAMEs. If this is wrong, every internal service (auth, vault, immich, ha-sofia, ...) breaks. Check Technitium primary zone records via API or web console."
           - alert: ViktorBarzinApexProbeStale
             expr: (time() - viktorbarzin_apex_last_correct_timestamp{job="viktorbarzin-apex-probe"}) > 900
diff --git a/stacks/woodpecker/main.tf b/stacks/woodpecker/main.tf
index 96803307..2b389cd8 100644
--- a/stacks/woodpecker/main.tf
+++ b/stacks/woodpecker/main.tf
@@ -178,20 +178,34 @@ resource "helm_release" "woodpecker" {
 # Keeps the OAuth/forge-API path off the WAN gateway (forgejo.viktorbarzin.me
 # resolves to the public IP via DNS, which round-trips through Cloudflare
 # and routinely tripped 30s context-deadline timeouts when fetching pipeline
-# config). 10.0.20.200 is the Traefik LB that fronts forgejo internally;
-# Traefik serves the *.viktorbarzin.me wildcard so SNI verification still
-# passes.
+# config). We pin forgejo.viktorbarzin.me to the in-cluster Traefik Service
+# ClusterIP (read dynamically below); Traefik serves the *.viktorbarzin.me
+# wildcard so SNI verification still passes. ClusterIP (not the LB IP) avoids
+# two traps: (1) the Traefik LB IP is ETP=Local and not reliably hairpin-
+# reachable from an arbitrary pod's node; (2) a hard-coded LB IP rots on every
+# Traefik renumber — this previously pinned 10.0.20.200, which went dead when
+# Traefik moved to its dedicated .203 (2026-05-30), the same failure class the
+# cloudflared origin hit (also fixed by targeting the in-cluster Service).
+data "kubernetes_service" "traefik" {
+  metadata {
+    name      = "traefik"
+    namespace = "traefik"
+  }
+}
+
 resource "null_resource" "woodpecker_server_host_alias" {
   triggers = {
-    # Re-run on every helm_release version bump or values change.
-    helm_version = helm_release.woodpecker.version
-    helm_values  = sha256(join("", helm_release.woodpecker.values))
+    # Re-run on helm version/values change, and when the Traefik ClusterIP
+    # changes (e.g. the Service is recreated) so the alias self-heals.
+    helm_version       = helm_release.woodpecker.version
+    helm_values        = sha256(join("", helm_release.woodpecker.values))
+    traefik_cluster_ip = data.kubernetes_service.traefik.spec[0].cluster_ip
   }
 
   provisioner "local-exec" {
     command = <<-BASH
       set -euo pipefail
-      kubectl -n woodpecker patch statefulset/woodpecker-server --type=strategic --patch '{"spec":{"template":{"spec":{"hostAliases":[{"ip":"10.0.20.200","hostnames":["forgejo.viktorbarzin.me"]}]}}}}'
+      kubectl -n woodpecker patch statefulset/woodpecker-server --type=strategic --patch '{"spec":{"template":{"spec":{"hostAliases":[{"ip":"${data.kubernetes_service.traefik.spec[0].cluster_ip}","hostnames":["forgejo.viktorbarzin.me"]}]}}}}'
       kubectl -n woodpecker rollout status statefulset/woodpecker-server --timeout=120s
     BASH
     interpreter = ["/bin/bash", "-c"]