docs: Traefik dedicated-IP + ETP=Local cutover SUCCEEDED (attempt 2)

Records the successful cutover and the key fix that made it safe: decouple cloudflared from the LB IP first (point its tunnel ingress at the in-cluster Traefik Service), so moving Traefik 10.0.20.200 -> 10.0.20.203 no longer breaks proxied apps or Vault's ingress.

Updates infra CLAUDE.md Networking notes with the new Traefik LB IP / ETP=Local / cloudflared->ClusterIP state.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

parent 0c01adac95
commit 16c9aafafa
2 changed files with 38 additions and 1 deletions
@@ -124,7 +124,8 @@ Repo IDs: infra=1, Website=2, finance=3, health=4, travel_blog=5, webhook-handle
 - **CrowdSec bouncer**: graceful degradation mode (fail-open on error).
 - **Rate limiting**: Return 429 (not 503). Per-service tuning: Immich/Nextcloud need higher limits.
 - **Retry middleware**: 2 attempts, 100ms — in default ingress chain.
-- **HTTP/3 (QUIC)**: Enabled cluster-wide via Traefik.
+- **HTTP/3 (QUIC)**: Enabled on Traefik. Works for **direct (non-proxied) apps** via the dedicated LB IP below (ETP=Local). Proxied apps get QUIC at the Cloudflare edge.
+- **Traefik LB IP = `10.0.20.203`, `externalTrafficPolicy: Local`** (dedicated, NOT the shared `.200`). Moved off the shared `.200` on 2026-05-30 so direct/non-proxied apps preserve the **real client IP for CrowdSec** (ETP=Cluster SNAT'd them to the node IP) and so QUIC works. **The shared `10.0.20.200` keeps the other 10 LB services** (PG state-backend `postgresql-lb`, headscale, wireguard, coturn, xray, etc. — all ETP=Cluster; MetalLB forbids mixed ETP on a shared IP, hence Traefik's own IP). **cloudflared targets the in-cluster Traefik Service** (`https://traefik.traefik.svc.cluster.local:443`, remote/dashboard tunnel config — edit via CF Global API Key in `secret/platform`), so proxied apps are decoupled from the LB IP. pfSense WAN 443 (tcp+udp) NAT → alias `traefik_lb` (`.203`). Internal split-horizon apex `viktorbarzin.me A` → `.203`. Full runbook + post-mortem: `docs/plans/2026-05-30-traefik-dedicated-ip-etp-local-*`.
 - **IPAM & DNS auto-registration**: pfSense Kea DHCP serves all 3 subnets (VLAN 10, VLAN 20, 192.168.1.x). Kea DDNS auto-registers every DHCP client in Technitium (RFC 2136, A+PTR). CronJob `phpipam-pfsense-import` (hourly) pulls Kea leases + ARP into phpIPAM via SSH (passive, no scanning). CronJob `phpipam-dns-sync` (15min) bidirectional sync phpIPAM ↔ Technitium. 42 MAC reservations for 192.168.1.x.
 
 ## Service-Specific Notes
@@ -161,3 +161,39 @@ before retrying:
 
 **Left in place for retry:** pfSense alias `traefik_lb` (=`10.0.20.203`, NAT
 reverted to `nginx`); pfSense `config.xml` backups `config.xml.bak-traefik-*`.
+
+## Attempt 2 — 2026-05-30 — SUCCESS
+
+Live and verified, **no proxied/Vault outage** this time. Key change vs attempt 1:
+**decouple cloudflared from the LB IP FIRST**, so moving Traefik no longer
+touches the proxied path or Vault's ingress.
+
+Executed order (all lessons applied — `tg` always run to a file, creds
+pre-fetched while Vault up):
+1. **Cloudflare tunnel ingress repointed** `https://10.0.20.200:443` →
+`https://traefik.traefik.svc.cluster.local:443` (both `*.viktorbarzin.me`
+and apex rules; `noTLSVerify` kept; catch-all 404 kept). Done via the
+**Cloudflare Global API Key** (`secret/platform` → `cloudflare_api_key`,
+email `vbarzin@gmail.com`, `X-Auth-Email`+`X-Auth-Key` headers — NOT the
+tunnel token, which is not an API credential). Tunnel: account
+`02e035473cfc4834fb10c5d35470d8b4`, id `75182cd7-bb91-4310-b961-5d8967da8b41`.
+→ proxied apps now IP-independent.
+2. Traefik Service → `10.0.20.203` + `ETP=Local` (single service; `tg apply`).
+Proxied apps + Vault stayed up (cloudflared → ClusterIP).
+3. Technitium apex `viktorbarzin.me A` → `10.0.20.203` (ttl 60).
+4. pfSense 443 (tcp+udp) NAT `nginx` → `traefik_lb` (`.203`); `/etc/rc.filter_configure`.
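
Step 1 of the executed order can be sketched in Python. This is a minimal sketch, not the command that was actually run: the endpoint path (`/accounts/{account}/cfd_tunnel/{tunnel}/configurations`) is the Cloudflare tunnel-configuration API as I understand it and should be checked against the current API reference, the rule order (wildcard before apex) is an assumption, and `build_ingress` / `build_request` are hypothetical helper names. The request is only built, never sent.

```python
import json
import urllib.request

# Identifiers copied from the notes above.
ACCOUNT_ID = "02e035473cfc4834fb10c5d35470d8b4"
TUNNEL_ID = "75182cd7-bb91-4310-b961-5d8967da8b41"
ORIGIN = "https://traefik.traefik.svc.cluster.local:443"


def build_ingress(origin: str) -> list:
    """Wildcard + apex rules pointing at the in-cluster Service, then the catch-all 404."""
    def rule(hostname: str) -> dict:
        return {
            "hostname": hostname,
            "service": origin,
            "originRequest": {"noTLSVerify": True},  # kept, as in the notes
        }

    # Assumption: wildcard rule first, apex second; the notes do not give the order.
    return [rule("*.viktorbarzin.me"), rule("viktorbarzin.me"), {"service": "http_status:404"}]


def build_request(email: str, global_api_key: str) -> urllib.request.Request:
    """Build (but do not send) the PUT that replaces the remote tunnel config."""
    url = (
        "https://api.cloudflare.com/client/v4/accounts/"
        f"{ACCOUNT_ID}/cfd_tunnel/{TUNNEL_ID}/configurations"
    )
    body = json.dumps({"config": {"ingress": build_ingress(ORIGIN)}}).encode()
    headers = {
        "X-Auth-Email": email,          # Global API Key auth, not the tunnel token
        "X-Auth-Key": global_api_key,
        "Content-Type": "application/json",
    }
    return urllib.request.Request(url, data=body, headers=headers, method="PUT")
```

Passing the returned object to `urllib.request.urlopen` would perform the change. Building it separately makes it possible to review the payload first, which fits the "creds pre-fetched while Vault up" lesson.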
+
+**Verified:** proxied 307/200 throughout; direct apps 200; **real external
+client IPs now reach Traefik/CrowdSec** (`216.73.217.51`, `54.x`, `52.x` — not
+node `10.0.20.103`); PG state DB OK; TF state reconciled (`tg apply` exit 0).
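
The client-IP verification can be made mechanical with a small check. This is a minimal sketch using Python's standard `ipaddress` module; `is_real_client` and `snat_suspects` are hypothetical helper names, and the only inputs below are the two full addresses quoted in the notes.

```python
import ipaddress


def is_real_client(ip: str) -> bool:
    """True if the address is publicly routable, i.e. not a node/private IP."""
    return ipaddress.ip_address(ip).is_global


def snat_suspects(client_ips: list) -> list:
    """Return the logged client IPs that look like SNAT'd (non-global) sources."""
    return [ip for ip in client_ips if not is_real_client(ip)]
```

For example, `snat_suspects(["216.73.217.51", "10.0.20.103"])` flags only the node address. With ETP=Local working as described, the list built from Traefik or CrowdSec logs for direct apps would be expected to come back empty.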
+
+**Notes / follow-ups:**
+- **Out-of-band (not in TF):** the cloudflared tunnel ingress (remote/dashboard
+config) and the pfSense `traefik_lb` alias + NAT. Codify the tunnel config in
+TF (`cloudflare_zero_trust_tunnel_cloudflared_config`) so `→ClusterIP` is
+declarative — pre-existing gap (tunnel was already remote-managed).
+- **QUIC:** infra correct (ETP=Local + UDP 443 → `.203` + Traefik h3 listener).
+`http3check.net` is unreliable here — it hits the IPv6 AAAA
+(`2001:470:6e:43d::2`, separate HE-tunnel path, unchanged) and fails before
+reaching Traefik. Confirm QUIC from a real device (Chrome → Protocol `h3`).
+- pfSense `nginx` alias (=`.200`) is now unused; `traefik_lb` (=`.203`) is live.
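
As a complement to the real-device QUIC check, the `Alt-Svc` response header from a direct app can be inspected to see whether HTTP/3 is advertised. This is a minimal sketch; `advertises_h3` is a hypothetical helper and the header values in the usage note are illustrative, not captured from this cluster. An advertised `h3` entry only shows that the listener announces HTTP/3 over the TCP path. It does not prove that UDP 443 reaches `.203`, so the Chrome check still stands.

```python
def advertises_h3(alt_svc: str) -> bool:
    """Return True if an Alt-Svc header value lists an h3 (or draft h3-NN) alternative."""
    value = alt_svc.strip()
    if not value or value.lower() == "clear":
        return False
    for entry in value.split(","):
        # Each entry looks like: h3=":443"; ma=2592000
        protocol_id = entry.split("=", 1)[0].strip().lower()
        if protocol_id == "h3" or protocol_id.startswith("h3-"):
            return True
    return False
```

For instance, `advertises_h3('h3=":443"; ma=2592000')` is true, while `advertises_h3('h2=":443"')` and `advertises_h3("clear")` are false.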