docs: correct cloudflared-502 post-mortem + fix stale .200 Traefik ref [ci skip]

The real root cause of the 2026-06-01 full-site 502 was not a missed
reference but an out-of-band fix that Terraform reverted: the 2026-05-30
Traefik .200->.203 migration repointed the Cloudflare tunnel to the
Traefik service DNS via the CF Global API Key, but that change never
landed in cloudflare.tf (left at .200). A terragrunt apply on 2026-06-01
reconciled the live config back to the stale .200, breaking all external
ingress. Rewrite the post-mortem around the "codify out-of-band fixes or
TF reverts them" lesson (a Terraform-Only-rule violation).

Also fix docs/runbooks/kms-public-exposure.md, which still claimed
Traefik served on 10.0.20.200:443 (now .203) — same migration fallout.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Viktor Barzin 2026-06-01 21:25:33 +00:00
parent f807050eb5
commit 9fb3e6e851
2 changed files with 40 additions and 35 deletions


@@ -216,7 +216,7 @@ If the activation surface needs to come down (abuse, legal, audit):
 The k8s service stays reachable on the LAN
 (`10.0.20.202:1688` directly, and the website at `kms.viktorbarzin.lan`
-via Traefik on `10.0.20.200:443`) — only the WAN port-forward is removed.
+via Traefik on `10.0.20.203:443`) — only the WAN port-forward is removed.
 To put it back, recreate the NAT rule (target alias `k8s_kms_lb`,
 port `1688`) and the filter rule with the same per-source caps. The alias