infra: fix stale Traefik LB-IP refs + accurate LB-IP registry

Part of the L4 LB-IP review (docs/plans/2026-06-03-lb-ip-hygiene-design.md).
The 2026-05-30 Traefik .200->.203 move left consumers pointing at the dead
.200; this fixes the two in-Terraform ones and replaces the stale networking
doc with an accurate registry + a renumber checklist.

- woodpecker: forgejo.viktorbarzin.me hostAlias hardcoded 10.0.20.200
  (.200:443 refuses TLS now; the next woodpecker apply would re-pin it and
  break pipeline creation). Now reads the Traefik ClusterIP dynamically via a
  kubernetes_service data source -- cannot rot on a future renumber and avoids
  the ETP=Local hairpin trap.
- monitoring: ViktorBarzinApexDrift alert summary said "expected 10.0.20.200"
  -> 10.0.20.203 (cosmetic; alert logic already correct).
- docs/architecture/networking.md: rewrote the MetalLB section (it wrongly had
  KMS on .200, mailserver on a LB IP, "two dedicated") into an accurate 4-IP
  registry + LB-IP renumber checklist (in-band + out-of-band consumers).
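
As a rough illustration of the in-band half of a renumber check, the sketch below lists files that still mention a retired LB IP. The helper name find_stale_ip is hypothetical and not part of this repo, and the actual checklist in docs/architecture/networking.md may use different steps.

```shell
#!/usr/bin/env bash
# Sketch only: find_stale_ip is a hypothetical helper, not part of this repo.
set -euo pipefail

# Print every file:line under ROOT (default ".") that mentions the full IP.
find_stale_ip() {
  local ip="$1" root="${2:-.}"
  # -F treats the IP as a literal string, so "." is not a regex wildcard.
  # -w requires a whole-word match, so 10.0.20.200 does not match 10.0.20.2001.
  # grep exits 1 when nothing matches; "|| true" keeps set -e from aborting.
  grep -rnwF --exclude-dir=.git -- "$ip" "$root" || true
}
```

For example, `find_stale_ip 10.0.20.200 .` run from the repo root would surface hard-coded references such as the old woodpecker hostAlias. It only covers what lives in the repo and only the full address (not shorthand like ".200"); out-of-band consumers would need a separate check.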

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Viktor Barzin 2026-06-03 10:23:13 +00:00
parent dcb7c74531
commit 7d7a0ad474
4 changed files with 147 additions and 24 deletions

@@ -2455,7 +2455,7 @@ serverFiles:
labels:
severity: critical
annotations:
- summary: "viktorbarzin.me apex A drifted from expected 10.0.20.200"
+ summary: "viktorbarzin.me apex A drifted from expected 10.0.20.203"
description: "Technitium serves the split-horizon apex for ~80 *.viktorbarzin.me CNAMEs. If this is wrong, every internal service (auth, vault, immich, ha-sofia, ...) breaks. Check Technitium primary zone records via API or web console."
- alert: ViktorBarzinApexProbeStale
expr: (time() - viktorbarzin_apex_last_correct_timestamp{job="viktorbarzin-apex-probe"}) > 900

@@ -178,20 +178,34 @@ resource "helm_release" "woodpecker" {
# Keeps the OAuth/forge-API path off the WAN gateway (forgejo.viktorbarzin.me
# resolves to the public IP via DNS, which round-trips through Cloudflare
# and routinely tripped 30s context-deadline timeouts when fetching pipeline
- # config). 10.0.20.200 is the Traefik LB that fronts forgejo internally;
- # Traefik serves the *.viktorbarzin.me wildcard so SNI verification still
- # passes.
+ # config). We pin forgejo.viktorbarzin.me to the in-cluster Traefik Service
+ # ClusterIP (read dynamically below); Traefik serves the *.viktorbarzin.me
+ # wildcard so SNI verification still passes. ClusterIP (not the LB IP) avoids
+ # two traps: (1) the Traefik LB IP is ETP=Local and not reliably hairpin-
+ # reachable from an arbitrary pod's node; (2) a hard-coded LB IP rots on every
+ # Traefik renumber -- this previously pinned 10.0.20.200, which went dead when
+ # Traefik moved to its dedicated .203 (2026-05-30), the same failure class the
+ # cloudflared origin hit (also fixed by targeting the in-cluster Service).
+ data "kubernetes_service" "traefik" {
+ metadata {
+ name = "traefik"
+ namespace = "traefik"
+ }
+ }
resource "null_resource" "woodpecker_server_host_alias" {
triggers = {
- # Re-run on every helm_release version bump or values change.
- helm_version = helm_release.woodpecker.version
- helm_values = sha256(join("", helm_release.woodpecker.values))
+ # Re-run on helm version/values change, and when the Traefik ClusterIP
+ # changes (e.g. the Service is recreated) so the alias self-heals.
+ helm_version = helm_release.woodpecker.version
+ helm_values = sha256(join("", helm_release.woodpecker.values))
+ traefik_cluster_ip = data.kubernetes_service.traefik.spec[0].cluster_ip
}
provisioner "local-exec" {
command = <<-BASH
set -euo pipefail
- kubectl -n woodpecker patch statefulset/woodpecker-server --type=strategic --patch '{"spec":{"template":{"spec":{"hostAliases":[{"ip":"10.0.20.200","hostnames":["forgejo.viktorbarzin.me"]}]}}}}'
+ kubectl -n woodpecker patch statefulset/woodpecker-server --type=strategic --patch '{"spec":{"template":{"spec":{"hostAliases":[{"ip":"${data.kubernetes_service.traefik.spec[0].cluster_ip}","hostnames":["forgejo.viktorbarzin.me"]}]}}}}'
kubectl -n woodpecker rollout status statefulset/woodpecker-server --timeout=120s
BASH
interpreter = ["/bin/bash", "-c"]