Viktor Barzin 1473a94f29 docs/plans: Traefik dedicated-IP cutover attempt 1 post-mortem (rolled back)

Attempt rolled back to .200 baseline. Root blocker: cloudflared is a
token/dashboard-managed tunnel whose ingress targets the Traefik LB IP
(10.0.20.200), so moving Traefik to .203 took down all proxied apps. Retry
must also repoint the tunnel ingress (Cloudflare API). Also documents the
vault-ingress circular dep, SIGPIPE->stuck PG state-lock gotcha, and the
ETP=Local hairpin caveat.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

2026-05-30 01:27:29 +00:00

8.5 KiB

Raw Blame History

Plan: Migrate Traefik to dedicated IP 10.0.20.203 + ETP=Local

Date: 2026-05-30 · Pairs with: 2026-05-30-traefik-dedicated-ip-etp-local-design.md Status: Draft — review required before executing. Nothing applied yet.

Goal: real client IPs to CrowdSec + working QUIC on the 24 direct apps, by moving Traefik off the shared 10.0.20.200 onto its own 10.0.20.203 with externalTrafficPolicy: Local. Shared IP .200 (incl. the TF state DB) is left untouched until the final cleanup step.

Recommended cutover: in-place (simplest, most maintainable) inside a short planned window. Additive/zero-downtime variant noted at the end.

Phase 0 — Pre-flight (read-only, ~10 min)

Snapshot current state (already captured in chat; re-confirm at execution):
- Traefik svc: IP 10.0.20.200, allow-shared-ip=shared, ETP=Cluster.
- .200 shared by 10 services incl. dbaas/postgresql-lb:5432 (TF state).
- DNS apex viktorbarzin.me A = 10.0.20.200 (Technitium primary, split-horizon).
- pfSense rdr: WAN 443 tcp+udp → alias <nginx> (=10.0.20.200); admin@10.0.20.1.
- Traefik 3 replicas (node4, node5, +1), PDB minAvailable=2.
Confirm 10.0.20.203 still free in pool 10.0.20.200-220.
Lower DNS TTL on the apex record to 60s (Technitium) ~30 min ahead of cutover to shrink the window. (Restore to normal afterward.)
Baseline checks to compare against (run now, save output):
- curl -sI https://immich.viktorbarzin.me (direct app) → 200/redirect
- curl -sI https://<a-proxied-app> → 200 (proxied path)
- PG state reachable: nc -vz 10.0.20.200 5432 (or a terragrunt plan no-op)
- Traefik access log shows 10.0.20.103 for a direct app (the bug we're fixing)
- http3check.net for immich → QUIC FAILS (baseline)

Phase 1 — Terraform: dedicated IP + ETP=Local (reversible)

Edit stacks/traefik/modules/traefik/main.tf, Helm service block (~L165-173):

service = {
  type = "LoadBalancer"
  annotations = {
    "metallb.io/loadBalancerIPs" = "10.0.20.203"   # was 10.0.20.200
    # allow-shared-ip REMOVED — Traefik no longer shares an IP
  }
  spec = {
    externalTrafficPolicy = "Local"                 # was Cluster
  }
}

scripts/tg plan in stacks/traefik — review: only the Traefik Service changes (new IP, ETP, annotation removed). No change to other stacks.
scripts/tg apply.
Immediately verify (ingress is briefly broken until DNS+pfSense move):
- kubectl get svc traefik -n traefik → IP 10.0.20.203, ETP=Local.
- kubectl get svc -A | grep 10.0.20.200 → the other 9 services still hold .200.
- nc -vz 10.0.20.200 5432 → TF state DB still reachable (critical).
- curl -sI --resolve <app>:443:10.0.20.203 https://<direct-app> → 200 (proves .203 serves before DNS moves).

Rollback (Phase 1): revert the three lines → scripts/tg apply. Back to .200.

Phase 2 — Internal DNS cutover (Technitium)

Update split-horizon apex: viktorbarzin.me A → 10.0.20.203 (primary; AXFR replicates to secondary/tertiary, or kick technitium-zone-sync).
Verify internal resolution: dig +short immich.viktorbarzin.me → 10.0.20.203 from a cluster/LAN client; curl -sI https://immich.viktorbarzin.me → 200.

Rollback (Phase 2): apex A → 10.0.20.200.

Phase 3 — pfSense (live firewall — operator-driven, alias not literal)

Per the "create a VIP/alias, don't hardcode" requirement:

Create a pfSense Firewall Alias (Firewall ▸ Aliases), type Host: name traefik_lb, value 10.0.20.203. (This is the correct pfSense object for a NAT-forward target — same kind as the existing <nginx> alias. If a CARP/IP-Alias Virtual IP is intended instead, confirm at review; a routed K8s LB IP normally uses an Alias, not a VIP.)
Repoint the 443 forward (Firewall ▸ NAT ▸ Port Forward): change the existing WAN https (TCP and UDP) rule's target from nginx → traefik_lb. Leave the auto firewall rule linked. Do not touch the http-alt/7443 rules (those are xray on <k8s_shared_lb>).
Apply pfSense changes.
Verify externally:
- http3check.net for immich → QUIC OK (h3 established).
- External curl to a few direct apps → 200.
- Traefik access log now shows real client IPs for direct apps (not 10.0.20.103).

Rollback (Phase 3): point the 443 rule's target back to nginx.

Phase 4 — Verify CrowdSec + the fleet (the real prize)

Traefik logs: real public IPs on direct apps (sample several).
CrowdSec: confirm it now ingests real IPs (a test decision / metrics); confirm the source-IP allowlist (10.0.20.0/22, 192.168.1.0/24, tailnet) is active so family/LAN aren't banned.
Proxied apps unaffected (spot-check 2-3 — still real IPs via Cloudflare).
All other .200 services healthy (PG state, headscale, wireguard, coturn, xray, etc.).
Restore DNS TTL to normal.

Phase 5 — Cleanup / docs

Confirm Traefik no longer answers on .200 (it shouldn't after Phase 1).
Update docs (design doc "Affected docs" list): .claude/CLAUDE.md, docs/architecture/networking.md, service-catalog, memory ids 3241-3246.
Commit TF + docs.

Rollback (full)

Reverse order: pfSense 443 target → nginx; apex A → .200; revert the Traefik Service TF (IP .200, allow-shared-ip=shared, ETP=Cluster) → apply. kubectl/Helm reach the API server directly (not via Traefik), so control is retained even if ingress is down mid-cutover.

Additive (zero-downtime) variant — if the window is unacceptable

Instead of editing the Helm Service in place: add a second raw kubernetes_service (type LoadBalancer, IP .203, ETP=Local, ports web/80→8000, websecure/443→8443 TCP, websecure-http3/443→8443 UDP, selector = Traefik pod labels). Both .200 (old) and .203 (new) serve Traefik. Cut DNS+pfSense to .203, verify, then convert the Helm Service to ClusterIP (drops .200). More config to carry long-term (a hand-maintained Service duplicating Helm) — weigh against the brief in-place window.

Attempt 1 — 2026-05-30 — ROLLED BACK (post-mortem)

First execution was rolled back to the .200 baseline; all service restored, TF state reconciled (No changes). The cutover achieved its primary goal mid-flight (real external client IPs reached CrowdSec — confirmed real IPs like 34.107.119.124 in Traefik logs instead of node 10.0.20.103), but a missed dependency took proxied apps down, forcing rollback. Fix the plan before retrying:

BLOCKER — cloudflared targets the LB IP. The cloudflared tunnel is token-based / Cloudflare-dashboard-managed (args: [tunnel] + TUNNEL_TOKEN; no local config.yaml). Its ingress sends *.viktorbarzin.me to the Traefik LB IP 10.0.20.200. Moving Traefik to .203 left cloudflared pointing at a dead IP → every proxied app (vault, home, …) went down. The retry MUST also repoint the tunnel ingress .200 → .203 in Cloudflare (API/dashboard) as part of the same cutover — ideally point cloudflared at the Traefik ClusterIP/service so it's IP-independent.
Vault-ingress circular dependency. Fetching the Technitium password from Vault during the window failed (Vault's ingress was down). Fix used: pre-fetch all creds before touching Traefik (worked). The DNS step then restored Vault.
SIGPIPE → stuck PG state locks. Piping scripts/tg through head/grep (early pipe close) SIGPIPE-killed terragrunt before it released the PG advisory lock, leaving an idle terraform_state connection holding the lock (force-unlock can't release another session's advisory lock). Always run tg to a file, never pipe through early-closing filters. Clear a stuck one by terminating the idle backend: pg_terminate_backend(<pid>) for the idle conn holding pg_locks.objid of the workspace.
ETP=Local + hairpin. Internal hosts that resolve *.viktorbarzin.me via public DNS and hairpin (e.g. the devvm) become flaky under ETP=Local. True external clients and internal-direct (.203) clients work. Ensure such hosts resolve internally (Technitium split-horizon).
QUIC verification. http3check.net was unreliable here (failed on TCP while real clients got 200s) — don't rely on it; confirm from a real device on cellular.

Left in place for retry: pfSense alias traefik_lb (=10.0.20.203, NAT reverted to nginx); pfSense config.xml backups config.xml.bak-traefik-*.

8.5 KiB Raw Blame History