The Cloudflare tunnel routed *.viktorbarzin.me and the apex to
https://10.0.20.200:443, but Traefik moved off the shared MetalLB .200
onto its dedicated 10.0.20.203 on 2026-05-30 (commit 0c01adac). Nothing
serves HTTPS on .200:443 anymore, so cloudflared could not reach its
origin (no route to host / i/o timeout) and Cloudflare returned 502 for
every externally-proxied service. Internal/LAN access (split-horizon ->
.203) was unaffected, which masked the outage.
Repoint both ingress rules at the in-cluster Traefik Service DNS
(https://traefik.traefik.svc.cluster.local:443) -- the design the docs
already described but the code never implemented -- so the tunnel is
decoupled from the Traefik LB IP and this cannot recur on a future move.
Applied live via targeted apply on the tunnel config resource only;
[ci skip] because live already matches and a full stack apply would
churn unrelated pre-existing drift (Keel annotations, DKIM re-chunk).
Post-mortem: docs/post-mortems/2026-06-01-cloudflared-stale-traefik-origin.md
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Post-Mortem: Cloudflare Tunnel Pointed at Traefik's Old LB IP → Full External 502
| Field | Value |
|---|---|
| Date | 2026-06-01 |
| Duration | Misconfiguration latent since 2026-05-30 08:09Z (Traefik LB-IP move). Confirmed external outage in cloudflared logs from ~20:58Z; root-caused and fixed at 21:15Z; all pods converged by 21:16Z. Detection→fix window ~17 min. |
| Severity | SEV1 — every Cloudflare-proxied hostname (viktorbarzin.me + all *.viktorbarzin.me) returned HTTP 502 to external clients. Internal/LAN access was unaffected (split-horizon → Traefik direct), which is why it stayed hidden. |
| Affected Services | All external ingress: viktorbarzin.me, nextcloud, vault, authentik, vaultwarden, immich, linkwarden, nas, technitium, terminal, speedtest, and every other proxied app. |
| Issue | None filed (diagnosed and fixed in-session). |
| Status | Resolved. |
| Recurrence count | 1st of this kind. Same class as the 2026-06-01 forgejo-registry .200→.203 redirect breakage (containerd mirror) — both are fallout from the 2026-05-30 Traefik LB-IP move leaving a hard-coded 10.0.20.200 reference behind. |
Summary
On 2026-05-30 (commit 0c01adac) Traefik was moved off the shared MetalLB IP 10.0.20.200 onto its own dedicated IP 10.0.20.203 (with externalTrafficPolicy: Local). The Cloudflare tunnel's ingress rules — Terraform-managed in stacks/cloudflared/modules/cloudflared/cloudflare.tf — still routed *.viktorbarzin.me and viktorbarzin.me to https://10.0.20.200:443. After the move, nothing serves HTTPS on .200:443 (the shared IP keeps only the non-HTTP LB services: postgresql-lb, headscale, wireguard, coturn, xray). cloudflared therefore could not reach its origin (connect: no route to host / i/o timeout), and Cloudflare returned 502 for the entire public surface.
The fix: repoint both ingress rules at the in-cluster Traefik Service DNS https://traefik.traefik.svc.cluster.local:443 — the design the docs already described (CLAUDE.md "Networking" §) but which the code never actually implemented. Service DNS decouples the tunnel from the LB IP, so a future Traefik IP change cannot reproduce this.
Impact
- User-facing: 100% of externally reachable services returned 502 via Cloudflare. LAN/internal access (which resolves *.viktorbarzin.me → 10.0.20.203 via Technitium split-horizon, bypassing Cloudflare) kept working, which masked the outage.
- Blast radius: every proxied hostname. Origin (Traefik) was healthy the entire time; this was purely a tunnel-origin routing fault.
- Data loss: none.
- Collateral: Vault's own public hostname (vault.viktorbarzin.me) also returned 502, creating a bootstrap problem for the fix: terragrunt apply needs Vault for the PG state-backend creds, but from the dev box Vault was reachable only via the broken tunnel. Worked around with a temporary /etc/hosts entry pointing vault.viktorbarzin.me → 10.0.20.203 (internal Traefik), removed after the apply.
Root Cause
A hard-coded LB IP (10.0.20.200) in the tunnel origin survived the Traefik dedicated-IP migration. The 2026-05-30 migration updated Traefik's Service and the split-horizon DNS but did not grep for every consumer of the old .200 HTTPS endpoint. The cloudflared tunnel origin (and, separately, the containerd forgejo-registry redirect — fixed earlier the same day in 42db69a2) were missed.
Contributing factors:
- Docs described intent as reality. CLAUDE.md stated cloudflared targets traefik.traefik.svc.cluster.local:443 "so proxied apps are decoupled from the LB IP." The code used a raw IP. The doc gave false confidence that the decoupling existed.
- No guard tied the tunnel origin to Traefik's actual address; a stale value plans/applies cleanly.
- Detection gap (masking). Split-horizon means LAN users never see external-only breakage. The [External] Uptime-Kuma monitors + ExternalAccessDivergence alert are the only signal for this failure mode.
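The missing guard noted above could take many forms. One minimal sketch, assuming a pre-apply check that receives the tunnel origin as a plain URL string, is to reject any origin whose host is a raw IP address. The function name `origin_uses_ip_literal` is hypothetical and not part of the repository.

```python
import ipaddress
from urllib.parse import urlsplit


def origin_uses_ip_literal(origin_url: str) -> bool:
    """Return True if the URL's host is a raw IPv4/IPv6 address, not a DNS name.

    Hypothetical helper: a CI or pre-apply step could fail the plan when this
    returns True for a tunnel origin.
    """
    host = urlsplit(origin_url).hostname
    if host is None:
        # Values without a host part are rejected rather than silently
        # passed, so the caller has to handle them explicitly.
        raise ValueError(f"no host in origin URL: {origin_url!r}")
    try:
        ipaddress.ip_address(host)
    except ValueError:
        # Not parseable as an IP address, so treat it as a DNS name.
        return False
    return True


if __name__ == "__main__":
    for url in (
        "https://10.0.20.200:443",
        "https://traefik.traefik.svc.cluster.local:443",
    ):
        print(url, "->", "IP literal" if origin_uses_ip_literal(url) else "DNS name")
```

A check like this only catches hard-coded IPs; it does not prove that the DNS name resolves to a healthy Traefik, so it complements the external monitors rather than replacing them.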
Timeline (UTC)
| Time | Event |
|---|---|
| 2026-05-30 08:09 | Commit 0c01adac — Traefik moves to dedicated LB IP 10.0.20.203. .200:443 stops serving HTTPS. Tunnel origin still .200. Outage latent from here. |
| 2026-06-01 ~20:51 | Keel auto-patches the cloudflared image; all 3 pods roll (coincidental — not the cause; the misconfig predates it). |
| 2026-06-01 ~20:58 | cloudflared logs show every proxied hostname failing: originService=https://10.0.20.200:443 … no route to host / i/o timeout. |
| 2026-06-01 ~21:08 | User reports "no ingress coming in." Investigation starts. |
| 21:09 | Isolated: origin healthy (direct to .203 → 200/302), public path → 502. cloudflared logs pin origin to dead .200:443. |
| 21:10 | Confirmed tunnel config is Terraform-managed (cloudflare_zero_trust_tunnel_cloudflared_config.sof), origin = .200 on both ingress rules. |
| 21:13 | Vault unreachable via public name (circular dep); worked around with temp /etc/hosts → .203. tg init -reconfigure (rotated PG backend creds). |
| 21:15:25 | Targeted apply: both ingress origins → https://traefik.traefik.svc.cluster.local:443. Apply complete! 1 changed. |
| 21:15:34–50 | cloudflared pushes config version=253; pods converge. |
| 21:16 | 10/10 curls to viktorbarzin.me → 200; 0 .200 errors across all pods; vault.viktorbarzin.me via real Cloudflare path → 200. Temp hosts entry removed. Resolved. |
Resolution
Changed both ingress_rule blocks in cloudflare.tf from https://10.0.20.200:443 to https://traefik.traefik.svc.cluster.local:443 (no_tls_verify = true retained). Applied surgically with -target on the tunnel config resource only, to avoid touching two pre-existing, unrelated drift items the full plan surfaced (see below).
Pre-existing drift (NOT part of this incident, left untouched)
The full cloudflared stack plan showed two extra in-place changes, deliberately not applied:
- kubernetes_deployment.cloudflared: TF would strip Keel's runtime annotations (keel.sh/policy|pollSchedule|trigger|update-time). The deployment ignores dns_config but not metadata.annotations, so Keel's enrollment annotations look like drift. Self-healing (Keel re-adds within its 1h poll), but a clean fix is to add metadata[0].annotations (and the template equivalent) to ignore_changes, or codify the policy annotation in TF.
- cloudflare_record.mail_domainkey_rspamd: cosmetic re-chunking of the DKIM TXT record (identical key, different 255-char split). Benign.
Action Items
- Repoint tunnel origin to Traefik Service DNS (this fix).
- Post-mortem written; CLAUDE.md networking claim is now actually true.
- Pin exact outage-start via Uptime-Kuma [External] monitor history / ExternalAccessDivergence firing time (confirm whether it began at the 05-30 move and went unnoticed, or at a later tunnel re-apply).
- Verify ExternalAccessDivergence is wired to a channel that gets seen. This is the only alert that catches external-only breakage; it apparently did not prompt action for ≤2.5 days.
- Migration checklist: when an LB IP changes, grep the whole repo for the old IP before declaring done (this and the forgejo redirect were both missed .200 references on 2026-05-30).
- (Optional) Address the cloudflared Keel-annotation drift so the stack plans clean.
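The migration-checklist item amounts to a repo-wide search for the old address. A minimal sketch of such a scan follows, assuming the repository is checked out locally. The function name `find_references` and the default skip list are illustrative choices, not existing tooling. The pattern refuses matches that are part of a longer address, so 10.0.20.2001 or 110.0.20.200 are not reported.

```python
import os
import re
from pathlib import Path


def find_references(root, needle, skip_dirs=(".git", ".terraform")):
    """Return (path, line_number, line) for every line under root mentioning needle.

    Hypothetical helper. The needle is matched as a whole address: it must not
    be preceded by a digit or dot, nor followed by a digit.
    """
    pattern = re.compile(r"(?<![\d.])" + re.escape(needle) + r"(?!\d)")
    hits = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune directories in place so os.walk does not descend into them.
        dirnames[:] = [d for d in dirnames if d not in skip_dirs]
        for name in sorted(filenames):
            path = Path(dirpath) / name
            try:
                text = path.read_text(encoding="utf-8")
            except (UnicodeDecodeError, OSError):
                continue  # skip binary or unreadable files
            for lineno, line in enumerate(text.splitlines(), start=1):
                if pattern.search(line):
                    hits.append((str(path), lineno, line.strip()))
    return hits


if __name__ == "__main__":
    for path, lineno, line in find_references(".", "10.0.20.200"):
        print(f"{path}:{lineno}: {line}")
```

Hits in documentation, including this post-mortem, are expected; the ones that matter are in code and config that still send traffic to the old address.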
Lessons
- Reference shared infra (Traefik) by stable Service DNS, not LB IP, from anything that can use cluster DNS. IPs are migration landmines.
- Keep docs honest: a doc that describes intended design as current reality hides exactly this class of bug.
- External-only outages are invisible from the LAN (split-horizon). The [External] divergence signal is load-bearing: it must be trustworthy and seen.
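The divergence idea in the last lesson can be illustrated as a comparison of two probe results for the same hostname, one taken on the internal path and one through Cloudflare. This is a sketch only: how the real ExternalAccessDivergence alert is computed is not described here, and the names `Verdict` and `classify` are hypothetical. Treating any 2xx or 3xx status as healthy is an assumption, chosen because the direct check during the incident returned 200/302.

```python
from enum import Enum


class Verdict(Enum):
    HEALTHY = "healthy"
    EXTERNAL_ONLY = "external-only outage"
    INTERNAL_ONLY = "internal-only outage"
    DOWN = "down everywhere"


def _ok(status: int) -> bool:
    # Assumption: 2xx and 3xx both count as a healthy response.
    return 200 <= status < 400


def classify(internal_status: int, external_status: int) -> Verdict:
    """Compare HTTP status codes from the internal and external paths."""
    internal_ok = _ok(internal_status)
    external_ok = _ok(external_status)
    if internal_ok and external_ok:
        return Verdict.HEALTHY
    if internal_ok:
        return Verdict.EXTERNAL_ONLY
    if external_ok:
        return Verdict.INTERNAL_ONLY
    return Verdict.DOWN


if __name__ == "__main__":
    # The pattern seen in this incident: origin fine, public path 502.
    print(classify(200, 502).value)
```

The EXTERNAL_ONLY case is the one that LAN users cannot observe themselves, which is why it needs to page someone rather than sit on a dashboard.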