From 16c9aafafab5ba817e51b13e4c0848a6ea74cdc4 Mon Sep 17 00:00:00 2001
From: Viktor Barzin
Date: Sat, 30 May 2026 08:12:57 +0000
Subject: [PATCH] docs: Traefik dedicated-IP + ETP=Local cutover SUCCEEDED (attempt 2)

Records the successful cutover and the key fix that made it safe:
decouple cloudflared from the LB IP first (point its tunnel ingress at
the in-cluster Traefik Service), so moving Traefik 10.0.20.200 ->
10.0.20.203 no longer breaks proxied apps or Vault's ingress. Updates
infra CLAUDE.md Networking notes with the new Traefik LB IP / ETP=Local /
cloudflared->ClusterIP state.

Co-Authored-By: Claude Opus 4.7
---
 .claude/CLAUDE.md | 3 +-
 ...-30-traefik-dedicated-ip-etp-local-plan.md | 36 +++++++++++++++++++
 2 files changed, 38 insertions(+), 1 deletion(-)

diff --git a/.claude/CLAUDE.md b/.claude/CLAUDE.md
index d579a35b..6282e551 100755
--- a/.claude/CLAUDE.md
+++ b/.claude/CLAUDE.md
@@ -124,7 +124,8 @@ Repo IDs: infra=1, Website=2, finance=3, health=4, travel_blog=5, webhook-handle
 - **CrowdSec bouncer**: graceful degradation mode (fail-open on error).
 - **Rate limiting**: Return 429 (not 503). Per-service tuning: Immich/Nextcloud need higher limits.
 - **Retry middleware**: 2 attempts, 100ms — in default ingress chain.
-- **HTTP/3 (QUIC)**: Enabled cluster-wide via Traefik.
+- **HTTP/3 (QUIC)**: Enabled on Traefik. Works for **direct (non-proxied) apps** via the dedicated LB IP below (ETP=Local). Proxied apps get QUIC at the Cloudflare edge.
+- **Traefik LB IP = `10.0.20.203`, `externalTrafficPolicy: Local`** (dedicated, NOT the shared `.200`). Moved off the shared `.200` on 2026-05-30 so direct/non-proxied apps preserve the **real client IP for CrowdSec** (ETP=Cluster SNAT'd them to the node IP) and so QUIC works. **The shared `10.0.20.200` keeps the other 10 LB services** (PG state-backend `postgresql-lb`, headscale, wireguard, coturn, xray, etc. — all ETP=Cluster; MetalLB forbids mixed ETP on a shared IP, hence Traefik's own IP).
**cloudflared targets the in-cluster Traefik Service** (`https://traefik.traefik.svc.cluster.local:443`, remote/dashboard tunnel config — edit via CF Global API Key in `secret/platform`), so proxied apps are decoupled from the LB IP. pfSense WAN 443 (tcp+udp) NAT → alias `traefik_lb` (`.203`). Internal split-horizon apex `viktorbarzin.me A` → `.203`. Full runbook + post-mortem: `docs/plans/2026-05-30-traefik-dedicated-ip-etp-local-*`.
 - **IPAM & DNS auto-registration**: pfSense Kea DHCP serves all 3 subnets (VLAN 10, VLAN 20, 192.168.1.x). Kea DDNS auto-registers every DHCP client in Technitium (RFC 2136, A+PTR). CronJob `phpipam-pfsense-import` (hourly) pulls Kea leases + ARP into phpIPAM via SSH (passive, no scanning). CronJob `phpipam-dns-sync` (15min) bidirectional sync phpIPAM ↔ Technitium. 42 MAC reservations for 192.168.1.x.
 
 ## Service-Specific Notes
diff --git a/docs/plans/2026-05-30-traefik-dedicated-ip-etp-local-plan.md b/docs/plans/2026-05-30-traefik-dedicated-ip-etp-local-plan.md
index fb78448f..6fa9d70a 100644
--- a/docs/plans/2026-05-30-traefik-dedicated-ip-etp-local-plan.md
+++ b/docs/plans/2026-05-30-traefik-dedicated-ip-etp-local-plan.md
@@ -161,3 +161,39 @@ before retrying:
 
 **Left in place for retry:** pfSense alias `traefik_lb` (=`10.0.20.203`, NAT
 reverted to `nginx`); pfSense `config.xml` backups `config.xml.bak-traefik-*`.
+
+## Attempt 2 — 2026-05-30 — SUCCESS
+
+Live and verified, **no proxied/Vault outage** this time. Key change vs attempt 1:
+**decouple cloudflared from the LB IP FIRST**, so moving Traefik no longer
+touches the proxied path or Vault's ingress.
+
+Executed order (all lessons applied — `tg` always run to a file, creds
+pre-fetched while Vault up):
+1. **Cloudflare tunnel ingress repointed** `https://10.0.20.200:443` →
+   `https://traefik.traefik.svc.cluster.local:443` (both `*.viktorbarzin.me`
+   and apex rules; `noTLSVerify` kept; catch-all 404 kept).
Done via the
+   **Cloudflare Global API Key** (`secret/platform` → `cloudflare_api_key`,
+   email `vbarzin@gmail.com`, `X-Auth-Email`+`X-Auth-Key` headers — NOT the
+   tunnel token, which is not an API credential). Tunnel: account
+   `02e035473cfc4834fb10c5d35470d8b4`, id `75182cd7-bb91-4310-b961-5d8967da8b41`.
+   → proxied apps now IP-independent.
+2. Traefik Service → `10.0.20.203` + `ETP=Local` (single service; `tg apply`).
+   Proxied apps + Vault stayed up (cloudflared → ClusterIP).
+3. Technitium apex `viktorbarzin.me A` → `10.0.20.203` (ttl 60).
+4. pfSense 443 (tcp+udp) NAT `nginx` → `traefik_lb` (`.203`); `/etc/rc.filter_configure`.
+
+**Verified:** proxied 307/200 throughout; direct apps 200; **real external
+client IPs now reach Traefik/CrowdSec** (`216.73.217.51`, `54.x`, `52.x` — not
+node `10.0.20.103`); PG state DB OK; TF state reconciled (`tg apply` exit 0).
+
+**Notes / follow-ups:**
+- **Out-of-band (not in TF):** the cloudflared tunnel ingress (remote/dashboard
+  config) and the pfSense `traefik_lb` alias + NAT. Codify the tunnel config in
+  TF (`cloudflare_zero_trust_tunnel_cloudflared_config`) so `→ClusterIP` is
+  declarative — pre-existing gap (tunnel was already remote-managed).
+- **QUIC:** infra correct (ETP=Local + UDP 443 → `.203` + Traefik h3 listener).
+  `http3check.net` is unreliable here — it hits the IPv6 AAAA
+  (`2001:470:6e:43d::2`, separate HE-tunnel path, unchanged) and fails before
+  reaching Traefik. Confirm QUIC from a real device (Chrome → Protocol `h3`).
+- pfSense `nginx` alias (=`.200`) is now unused; `traefik_lb` (=`.203`) is live.
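
Reviewer note (not part of the patch): step 1 of the executed order, repointing the tunnel ingress with the Global API Key, could be scripted roughly as below. This is a minimal sketch, not the commands actually run. It assumes the Cloudflare v4 endpoint `PUT /accounts/{account_id}/cfd_tunnel/{tunnel_id}/configurations` with a `{"config": {"ingress": [...]}}` body, and that the tunnel config holds only these three rules (a PUT replaces the whole remote config, so any other settings would need to be carried over). The function names `build_ingress` and `build_request` are hypothetical, and the sketch only builds the request without sending it.

```python
import json
import urllib.request

API_BASE = "https://api.cloudflare.com/client/v4"
# In-cluster Traefik Service URL taken from the plan document.
TRAEFIK_SVC = "https://traefik.traefik.svc.cluster.local:443"


def build_ingress(zone, service=TRAEFIK_SVC):
    """Return wildcard and apex rules pointing at `service`, then the catch-all 404."""
    rules = []
    for hostname in (f"*.{zone}", zone):
        rules.append({
            "hostname": hostname,
            "service": service,
            # `noTLSVerify` is kept, as in the plan.
            "originRequest": {"noTLSVerify": True},
        })
    # The rule without a hostname is the catch-all and has to come last.
    rules.append({"service": "http_status:404"})
    return rules


def build_request(account_id, tunnel_id, email, api_key, ingress):
    """Build (but do not send) the PUT that replaces the remote tunnel config."""
    url = f"{API_BASE}/accounts/{account_id}/cfd_tunnel/{tunnel_id}/configurations"
    body = json.dumps({"config": {"ingress": ingress}}).encode("utf-8")
    headers = {
        # Global API Key auth: email plus key, not the tunnel token.
        "X-Auth-Email": email,
        "X-Auth-Key": api_key,
        "Content-Type": "application/json",
    }
    return urllib.request.Request(url, data=body, method="PUT", headers=headers)
```

Passing the returned object to `urllib.request.urlopen` would apply the change. Keeping construction separate from sending lets the payload be reviewed or diffed against the current remote config first.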