docs/plans: Traefik dedicated IP + ETP=Local migration (design + plan)

Move Traefik off shared MetalLB IP 10.0.20.200 to a dedicated 10.0.20.203 with externalTrafficPolicy=Local, to (1) restore real client IPs for CrowdSec on the 24 non-proxied apps (currently SNAT'd to a node IP) and (2) enable QUIC. Forced off the shared IP because MetalLB forbids mixed ETP on a shared IP (10.0.20.200 also carries the Terraform state DB). In-place cutover selected.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-30 00:27:04 +00:00
commit 09a0c1fad4
parent 0f26bf030b
2 changed files with 234 additions and 0 deletions
--- a/docs/plans/2026-05-30-traefik-dedicated-ip-etp-local-design.md
+++ b/docs/plans/2026-05-30-traefik-dedicated-ip-etp-local-design.md
@@ -0,0 +1,110 @@
+# Design: Dedicated MetalLB IP for Traefik with externalTrafficPolicy=Local
+
+**Date:** 2026-05-30
+**Status:** Draft — for review (no changes applied yet)
+**Author:** Viktor + Claude
+
+## Problem
+
+Two issues share one root cause on the Traefik ingress LoadBalancer:
+
+1. **CrowdSec is blind to real client IPs on the 24 non-proxied/direct apps.**
+   Traefik logs `10.0.20.103` (k8s-node3's IP) as the client for the
+   overwhelming majority of direct-app requests (measured: 2522 hits vs 3
+   real external IPs). Cause: the Traefik LB is `externalTrafficPolicy:
+   Cluster`, so kube-proxy SNATs every external client to the MetalLB-elected
+   node's IP before Traefik sees it. CrowdSec therefore makes ban decisions
+   against an internal node IP it would never block → **no effective IP-based
+   protection on the direct apps** (immich, forgejo, send, ytdlp, servarr,
+   ebooks, novelapp, freedify, affine, health, f1-stream, kms, k8s-portal,
+   etc. — 24 total).
+   *Proxied apps are unaffected — they arrive via the cloudflared tunnel and
+   get real IPs through Cloudflare's `X-Forwarded-For`.*
+
+2. **HTTP/3 / QUIC does not complete for the direct apps.** An external probe
+   (`http3check.net`) confirms "QUIC connection could not be established"
+   despite `Alt-Svc: h3` being advertised and UDP 443 reaching Traefik
+   (verified: pfSense NATs UDP 443 → Traefik LB; Traefik binds UDP 8443).
+   Same root cause: `ETP=Cluster` + 3 replicas means kube-proxy SNATs and can
+   spread the UDP flow across pods, which breaks the QUIC handshake.
+
+Both are fixed by `externalTrafficPolicy: Local` on the Traefik LB (no SNAT →
+real client IPs preserved → QUIC stays pinned to one pod).
+
+## Why we can't just flip ETP on the current IP
+
+Traefik currently shares MetalLB IP **`10.0.20.200`** with **9 other services**
+via `metallb.io/allow-shared-ip`:
+
+`dbaas/postgresql-lb` (**Terraform state backend**), `headscale/headscale-server`,
+`wireguard/wireguard`, `coturn/coturn`, `xray/xray-reality`,
+`shadowsocks/shadowsocks`, `beads-server/dolt`, `servarr/qbittorrent-torrenting`,
+`tor-proxy/torrserver-bt`.
+
+Per MetalLB docs, services sharing an IP **must all use `Cluster`** (or point
+to identical pods). Mixing `Local` and `Cluster` on a shared IP is **not
+allowed** and would break the IP allocation — taking down all ingress **and
+the Terraform state DB** (locking out `terragrunt` itself), plus VPN/DNS path.
+→ Traefik must move to its **own** IP.
+
+## Target state
+
+- New dedicated MetalLB IP **`10.0.20.203`** (free; pool is `10.0.20.200-220`),
+  **not** shared, `externalTrafficPolicy: Local`, for the Traefik LB.
+- `10.0.20.200` keeps the other 9 services unchanged (still all `Cluster`).
+- Internal split-horizon DNS apex `viktorbarzin.me A` → `10.0.20.203`
+  (currently `10.0.20.200`). All `*.viktorbarzin.me` CNAME → apex, so this one
+  record moves every internal ingress hostname.
+- pfSense: the WAN 443 (TCP **and** UDP) port-forward target moves from the
+  `<nginx>` alias to a **new pfSense alias** for `10.0.20.203`
+  (per request: define a VIP/alias, do **not** hardcode the IP in rules —
+  matches the existing `<nginx>` / `<k8s_shared_lb>` alias pattern).
+
+## Key decisions
+
+- **Dedicated IP, not shared** — forced by the MetalLB mixed-ETP rule above.
+- **`10.0.20.203`** — first free IP after technitium (.201) and kms (.202).
+- **pfSense reference by alias, not literal IP** (user requirement) — create
+  alias e.g. `traefik_lb` = `10.0.20.203`, reference it in the rdr + firewall
+  pass rule. One place to change later.
+- **Cutover style** — two options, decided at review (see plan):
+  - *In-place* (recommended for maintainability): change the Helm Service to
+    the new IP + ETP=Local in one edit; brief cutover window (mitigated by
+    pre-lowering DNS TTL + staging the pfSense change).
+  - *Additive* (zero-downtime): stand up a second LB Service on `.203`
+    (ETP=Local) alongside the existing `.200` one, cut DNS/pfSense over, then
+    retire Traefik from `.200`. More moving parts to maintain.
+
+## Risks & watch-items
+
+- **Terraform state backend lives on `.200`**: every phase must verify
+  `dbaas/postgresql-lb:5432` stays reachable. The other `.200` services are never
+  reconfigured; Traefik just leaves the IP. Low risk, but explicitly checked.
+- **Live-firewall edit** (pfSense rdr + alias) — done via the pfSense UI
+  (persisted in config.xml); CLI `pfctl` edits don't persist. Per the
+  network-device rule, this step is operator-driven/confirmed, not automated.
+- **CrowdSec behavior change** — once it sees *real* public IPs on direct
+  apps, it will start making real ban decisions there. Confirm the security
+  allowlist (source-IP allowlist `10.0.20.0/22`, `192.168.1.0/24`, tailnet;
+  identity `me@viktorbarzin.me`) is correct so family/legit IPs aren't banned.
+- **MetalLB ETP=Local node election**: `.203` is announced only from a node
+  running a ready Traefik pod. Traefik has 3 replicas (node4, node5, +1) and
+  PDB minAvailable=2, so ≥2 eligible nodes survive drains; re-elects on failure.
+- **Cloudflare-proxied apps** route via the cloudflared tunnel → Traefik
+  ClusterIP, **not** the LB IP, so they are unaffected — verified in plan.
+- **Cutover window** for the in-place option — keep it short; have rollback
+  staged.
+
+## Out of scope
+
+- No change to the 9 services on `10.0.20.200`.
+- No change to Cloudflare-proxied apps' path.
+- No re-architecture of the pfSense↔K8s ingress beyond the 443 target move.
+
+## Affected docs (update on apply)
+
+- `.claude/CLAUDE.md` (Networking & Resilience / Service-Specific notes)
+- `docs/architecture/networking.md` (or equivalent — Traefik LB IP, ETP)
+- `docs/runbooks/` — add a short "Traefik LB IP / ETP" runbook entry
+- `.claude/reference/service-catalog.md` if it records LB IPs
+- memory: update the QUIC/ingress entries (ids 3241-3246)
--- a/docs/plans/2026-05-30-traefik-dedicated-ip-etp-local-plan.md
+++ b/docs/plans/2026-05-30-traefik-dedicated-ip-etp-local-plan.md
@@ -0,0 +1,124 @@
+# Plan: Migrate Traefik to dedicated IP 10.0.20.203 + ETP=Local
+
+**Date:** 2026-05-30 · **Pairs with:** `2026-05-30-traefik-dedicated-ip-etp-local-design.md`
+**Status:** Draft — review required before executing. Nothing applied yet.
+
+Goal: real client IPs to CrowdSec + working QUIC on the 24 direct apps, by
+moving Traefik off the shared `10.0.20.200` onto its own `10.0.20.203` with
+`externalTrafficPolicy: Local`. The other services on `.200` (incl. the TF
+state DB) are left untouched throughout; only Traefik leaves it (Phase 1).
+
+> Recommended cutover: **in-place** (simplest, most maintainable) inside a short
+> planned window. Additive/zero-downtime variant noted at the end.
+
+## Phase 0 — Pre-flight (read-only, ~10 min)
+
+- [ ] Snapshot current state (already captured in chat; re-confirm at execution):
+  - Traefik svc: IP `10.0.20.200`, `allow-shared-ip=shared`, ETP=Cluster.
+  - `.200` shared by 10 services (Traefik + 9 others) incl. `dbaas/postgresql-lb:5432` (TF state).
+  - DNS apex `viktorbarzin.me A = 10.0.20.200` (Technitium primary, split-horizon).
+  - pfSense rdr: WAN 443 tcp+udp → alias `<nginx>` (=10.0.20.200); `admin@10.0.20.1`.
+  - Traefik 3 replicas (node4, node5, +1), PDB minAvailable=2.
+- [ ] Confirm `10.0.20.203` still free in pool `10.0.20.200-220`.
+- [ ] **Lower DNS TTL** on the apex record to 60s (Technitium) ~30 min ahead of
+      cutover to shrink the window. (Restore to normal afterward.)
+- [ ] Baseline checks to compare against (run now, save output):
+  - `curl -sI https://immich.viktorbarzin.me` (direct app) → 200/redirect
+  - `curl -sI https://<a-proxied-app>` → 200 (proxied path)
+  - PG state reachable: `nc -vz 10.0.20.200 5432` (or a `terragrunt plan` no-op)
+  - Traefik access log shows `10.0.20.103` for a direct app (the bug we're fixing)
+  - `http3check.net` for immich → QUIC FAILS (baseline)
+
+## Phase 1 — Terraform: dedicated IP + ETP=Local (reversible)
+
+Edit `stacks/traefik/modules/traefik/main.tf`, Helm `service` block (~L165-173):
+
+```hcl
+service = {
+  type = "LoadBalancer"
+  annotations = {
+    "metallb.io/loadBalancerIPs" = "10.0.20.203"   # was 10.0.20.200
+    # allow-shared-ip REMOVED — Traefik no longer shares an IP
+  }
+  spec = {
+    externalTrafficPolicy = "Local"                 # was Cluster
+  }
+}
+```
+
+- [ ] `scripts/tg plan` in `stacks/traefik` — review: only the Traefik Service
+      changes (new IP, ETP, annotation removed). No change to other stacks.
+- [ ] `scripts/tg apply`.
+- [ ] **Immediately verify** (ingress is briefly broken until DNS+pfSense move):
+  - `kubectl get svc traefik -n traefik -o jsonpath='{.status.loadBalancer.ingress[0].ip} {.spec.externalTrafficPolicy}'` → `10.0.20.203 Local`.
+  - `kubectl get svc -A | grep 10.0.20.200` → the other 9 services still hold `.200`.
+  - **`nc -vz 10.0.20.200 5432`** → TF state DB still reachable (critical).
+  - `curl -sI --resolve <direct-app>:443:10.0.20.203 https://<direct-app>` → 200
+    (proves `.203` serves before DNS moves).
+
+**Rollback (Phase 1):** revert the three lines → `scripts/tg apply`. Back to `.200`.
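+
+For reference, a sketch of what the reverted block would look like, reconstructed
+from the Phase 0 snapshot (IP `.200`, `allow-shared-ip=shared`, ETP=Cluster). The
+exact pre-change contents of `main.tf` are not shown in this plan, so the
+annotation layout here is an assumption; prefer restoring the real file from
+version control over retyping it:
+
+```hcl
+service = {
+  type = "LoadBalancer"
+  annotations = {
+    "metallb.io/loadBalancerIPs" = "10.0.20.200"   # back on the shared IP
+    "metallb.io/allow-shared-ip" = "shared"        # sharing key per the Phase 0 snapshot
+  }
+  spec = {
+    externalTrafficPolicy = "Cluster"               # required while sharing .200
+  }
+}
+```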
+
+## Phase 2 — Internal DNS cutover (Technitium)
+
+- [ ] Update split-horizon apex: `viktorbarzin.me A → 10.0.20.203` (primary;
+      AXFR replicates to secondary/tertiary, or kick `technitium-zone-sync`).
+- [ ] Verify internal resolution: `dig +short immich.viktorbarzin.me` → `10.0.20.203`
+      from a cluster/LAN client; `curl -sI https://immich.viktorbarzin.me` → 200.
+
+**Rollback (Phase 2):** apex A → `10.0.20.200`.
+
+## Phase 3 — pfSense (live firewall — operator-driven, alias not literal)
+
+Per the "create a VIP/alias, don't hardcode" requirement:
+
+- [ ] **Create a pfSense Firewall Alias** (Firewall ▸ Aliases), type Host:
+      name `traefik_lb`, value `10.0.20.203`. *(This is the correct pfSense
+      object for a NAT-forward target — same kind as the existing `<nginx>`
+      alias. If a CARP/IP-Alias Virtual IP is intended instead, confirm at
+      review; a routed K8s LB IP normally uses an Alias, not a VIP.)*
+- [ ] **Repoint the 443 forward** (Firewall ▸ NAT ▸ Port Forward): change the
+      existing WAN `https` (TCP **and** UDP) rule's target from `nginx` →
+      `traefik_lb`. Leave the auto firewall rule linked. Do **not** touch the
+      `http-alt`/`7443` rules (those are xray on `<k8s_shared_lb>`).
+- [ ] Apply pfSense changes.
+- [ ] Verify externally:
+  - `http3check.net` for immich → **QUIC OK** (h3 established).
+  - External `curl` to a few direct apps → 200.
+  - Traefik access log now shows **real client IPs** for direct apps (not `10.0.20.103`).
+
+**Rollback (Phase 3):** point the 443 rule's target back to `nginx`.
+
+## Phase 4 — Verify CrowdSec + the fleet (the real prize)
+
+- [ ] Traefik logs: real public IPs on direct apps (sample several).
+- [ ] CrowdSec: confirm it now ingests real IPs (a test decision / metrics);
+      **confirm the source-IP allowlist** (`10.0.20.0/22`, `192.168.1.0/24`,
+      tailnet) is active so family/LAN aren't banned.
+- [ ] Proxied apps unaffected (spot-check 2-3 — still real IPs via Cloudflare).
+- [ ] All other `.200` services healthy (PG state, headscale, wireguard, coturn,
+      xray, etc.).
+- [ ] Restore DNS TTL to normal.
+
+## Phase 5 — Cleanup / docs
+
+- [ ] Confirm Traefik no longer answers on `.200` (it shouldn't after Phase 1).
+- [ ] Update docs (design doc "Affected docs" list): `.claude/CLAUDE.md`,
+      `docs/architecture/networking.md`, service-catalog, memory ids 3241-3246.
+- [ ] Commit TF + docs.
+
+## Rollback (full)
+
+Reverse order: pfSense 443 target → `nginx`; apex A → `.200`; revert the
+Traefik Service TF (IP `.200`, `allow-shared-ip=shared`, ETP=Cluster) → apply.
+kubectl/Helm reach the API server directly (not via Traefik), so control is
+retained even if ingress is down mid-cutover.
+
+## Additive (zero-downtime) variant — if the window is unacceptable
+
+Instead of editing the Helm Service in place: add a second raw
+`kubernetes_service` (type LoadBalancer, IP `.203`, ETP=Local, ports
+web/80→8000, websecure/443→8443 TCP, websecure-http3/443→8443 UDP, selector =
+Traefik pod labels). Both `.200` (old) and `.203` (new) serve Traefik. Cut
+DNS+pfSense to `.203`, verify, then convert the Helm Service to ClusterIP
+(drops `.200`). More config to carry long-term (a hand-maintained Service
+duplicating Helm) — weigh against the brief in-place window.
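+
+A minimal sketch of that second Service, assuming the `kubernetes_service`
+resource from the Terraform Kubernetes provider. The resource name, Service
+name, and selector labels are hypothetical; copy the real labels from the
+running Traefik pods, and confirm the cluster accepts TCP and UDP on one
+LoadBalancer Service before relying on it:
+
+```hcl
+resource "kubernetes_service" "traefik_dedicated_lb" { # hypothetical resource name
+  metadata {
+    name      = "traefik-dedicated-lb" # hypothetical Service name
+    namespace = "traefik"
+    annotations = {
+      "metallb.io/loadBalancerIPs" = "10.0.20.203" # dedicated IP, no allow-shared-ip
+    }
+  }
+
+  spec {
+    type                    = "LoadBalancer"
+    external_traffic_policy = "Local"
+
+    # Assumption: replace with the exact labels carried by the Traefik pods.
+    selector = {
+      "app.kubernetes.io/name" = "traefik"
+    }
+
+    port {
+      name        = "web"
+      port        = 80
+      target_port = 8000
+      protocol    = "TCP"
+    }
+    port {
+      name        = "websecure"
+      port        = 443
+      target_port = 8443
+      protocol    = "TCP"
+    }
+    port {
+      name        = "websecure-http3"
+      port        = 443
+      target_port = 8443
+      protocol    = "UDP"
+    }
+  }
+}
+```
+
+Because the selector targets the same pods as the Helm-managed Service, both
+`.200` and `.203` serve Traefik until the Helm Service is converted to ClusterIP.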