The real root cause of the 2026-06-01 full-site 502 was not a missed
reference but an out-of-band fix that Terraform reverted. During the
2026-05-30 Traefik .200->.203 migration, the Cloudflare tunnel was
repointed at the Traefik Service DNS using the CF Global API Key, but
that change never landed in cloudflare.tf, which still pointed at .200.
A terragrunt apply on 2026-06-01 reconciled the live tunnel config back
to the stale .200 origin, breaking all external ingress.
Rewrite the post-mortem around the "codify out-of-band fixes or TF
reverts them" lesson (a Terraform-Only-rule violation).
Also fix docs/runbooks/kms-public-exposure.md, which still claimed
Traefik served on 10.0.20.200:443 (now .203) — same migration fallout.
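The revert mechanism can be sketched as a comparison between the
origins declared in code and the origins live on the tunnel: any rule
where the two differ is one the next apply moves back to the declared
value. This is an illustrative sketch only. The function name and the
dict shape are hypothetical and are not part of this repo, Terraform,
or the Cloudflare API; the hostnames and URLs are the ones named in the
incident.

```python
def origins_apply_would_revert(declared, live):
    """Return {hostname: (live_origin, declared_origin)} for each rule
    whose live origin differs from the one declared in code.

    An apply reconciles live state to the declared state, so every entry
    returned here is an out-of-band change that would be undone.
    """
    reverted = {}
    for hostname, declared_origin in declared.items():
        live_origin = live.get(hostname)
        if live_origin is not None and live_origin != declared_origin:
            reverted[hostname] = (live_origin, declared_origin)
    return reverted


# State before the 2026-06-01 apply: code still at .200, live repointed.
declared = {
    "*.viktorbarzin.me": "https://10.0.20.200:443",
    "viktorbarzin.me": "https://10.0.20.200:443",
}
live = {
    "*.viktorbarzin.me": "https://traefik.traefik.svc.cluster.local:443",
    "viktorbarzin.me": "https://traefik.traefik.svc.cluster.local:443",
}

for host, (was, becomes) in sorted(
    origins_apply_would_revert(declared, live).items()
):
    print(f"{host}: {was} -> {becomes}")
```

An empty result means code and live agree, which is the state a
codified fix should leave behind.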
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

The Cloudflare tunnel routed *.viktorbarzin.me and the apex to
https://10.0.20.200:443, but Traefik moved off the shared MetalLB .200
onto its dedicated 10.0.20.203 on 2026-05-30 (commit 0c01adac). Nothing
serves HTTPS on .200:443 anymore, so cloudflared could not reach its
origin (no route to host / i/o timeout) and Cloudflare returned 502 for
every externally-proxied service. Internal/LAN access (split-horizon ->
.203) was unaffected, which masked the outage.
Repoint both ingress rules at the in-cluster Traefik Service DNS
(https://traefik.traefik.svc.cluster.local:443) -- the design the docs
already described but the code never implemented -- so the tunnel is
decoupled from the Traefik LB IP and a future LB IP move cannot strand
the origin again.
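The difference between the old and new origins can be expressed as a
small check: an origin whose host is an IP literal is tied to wherever
the load balancer currently lives, while a Service DNS name is not. A
minimal sketch, assuming origins are plain URLs; the helper name is
hypothetical and nothing here is part of this repo's tooling.

```python
import ipaddress
from urllib.parse import urlsplit


def is_ip_literal_origin(service_url):
    """Return True when the origin host is an IP literal, not a DNS name."""
    host = urlsplit(service_url).hostname
    if host is None:
        return False
    try:
        ipaddress.ip_address(host)
    except ValueError:
        # Not parseable as an IP address, so treat it as a DNS name.
        return False
    return True


for url in (
    "https://10.0.20.200:443",
    "https://traefik.traefik.svc.cluster.local:443",
):
    kind = "IP literal" if is_ip_literal_origin(url) else "DNS name"
    print(f"{url}: {kind}")
```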
Applied live with a targeted apply on the tunnel config resource only.
Marked [ci skip] because the live config already matches this commit and
a full stack apply would churn unrelated pre-existing drift (Keel
annotations, DKIM re-chunk).
Post-mortem: docs/post-mortems/2026-06-01-cloudflared-stale-traefik-origin.md
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>