forgejo pulls: route *.viktorbarzin.me to Technitium, drop /etc/hosts pins [ci skip]

Supersedes this morning's per-node /etc/hosts pin (no hardcoded service
IPs on nodes, per Viktor). Technitium's split-horizon zone already
resolves forgejo.viktorbarzin.me -> CNAME apex -> live Traefik LB IP
(ingress-dns-sync auto-CNAMEs every ingress host; apex drift probe
alerts) -- the nodes just never queried it. Rolled the devvm's
systemd-resolved routing-domain pattern (~viktorbarzin.me ->
10.0.20.201) to all 7 nodes, removed the pins, verified getent +
crictl pull via pure DNS.

Also demoted node5/6's cloud-init global-dns.conf (DNS=8.8.8.8 1.1.1.1)
to FallbackDNS-only: public servers in the global set race the routing
domain. Its justification ("Technitium NXDOMAINs forgejo") was obsolete
-- exactly the stale comment that pointed new nodes at the hairpin.

hosts.toml mirror kept but documented as vestigial (Traefik 404s
bare-IP requests; registry auth realm is an absolute URL).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
parent b6976ce014
commit 1ee1bf0817
7 changed files with 135 additions and 66 deletions
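The routing-domain rollout described in the commit message can be sketched as a small helper. This is a minimal illustration, not the repository's actual rollout script: `write_routing_domain` is a hypothetical name, and the demo writes into a scratch directory rather than `/etc/systemd/resolved.conf.d`.

```shell
#!/bin/sh
# Sketch only: write a systemd-resolved routing-domain drop-in.
# write_routing_domain is a hypothetical helper, not taken from the repo.
write_routing_domain() {
    dropin_dir="$1"   # /etc/systemd/resolved.conf.d on a real node
    dns_server="$2"   # Technitium address (10.0.20.201 in the commit)
    domain="$3"       # routed domain, given without the leading "~"
    mkdir -p "$dropin_dir" || return 1
    # "~domain" makes this a routing-only domain: queries for names under it
    # go to dns_server, and it is not used as a search suffix.
    printf '[Resolve]\nDNS=%s\nDomains=~%s\n' "$dns_server" "$domain" \
        > "$dropin_dir/viktorbarzin.conf"
}

# Demo against a scratch directory so nothing under /etc is touched.
demo_dir="$(mktemp -d)"
write_routing_domain "$demo_dir" 10.0.20.201 viktorbarzin.me
cat "$demo_dir/viktorbarzin.conf"
```

On a real node the drop-in would presumably be followed by `systemctl restart systemd-resolved` and a look at `resolvectl status` to confirm the routing domain is active.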
@@ -59,28 +59,55 @@ CoreDNS forgejo rewrite (2026-06-04) covers pods only, not kubelet.
 
 ## Fix
 
-`/etc/hosts` pin on every k8s node (hot, no drain, no containerd restart):
+**Initial mitigation (same morning):** `/etc/hosts` pin
+`10.0.20.203 forgejo.viktorbarzin.me` on every node — restored service
+immediately (resolve + token + blob legs all internal with correct SNI).
+
+**Superseded same day (Viktor: "no hardcoded IPs in nodes") by a DNS-based
+fix.** Discovery: Technitium's split-horizon zone *already* resolves
+`forgejo.viktorbarzin.me → CNAME viktorbarzin.me → A <live Traefik IP>` —
+the `technitium-ingress-dns-sync` CronJob auto-CNAMEs every ingress host
+hourly, the apex A record tracks the live Traefik LB IP, and the
+`viktorbarzin-apex-probe` canary alerts on drift. The nodes simply never
+queried Technitium (resolv chain: pfSense + public AdGuard fallback). The
+devvm already solved this with a systemd-resolved **routing domain**
+drop-in; the same was rolled to all 7 nodes:
 
 ```
-10.0.20.203 forgejo.viktorbarzin.me # forgejo-internal-pin (managed: setup-forgejo-containerd-mirror.sh)
+# /etc/systemd/resolved.conf.d/viktorbarzin.conf
+[Resolve]
+DNS=10.0.20.201
+Domains=~viktorbarzin.me
 ```
 
-Go's resolver (containerd) consults `/etc/hosts` first, so resolve + token
-+ blob legs all go to internal Traefik with correct SNI and a valid
-wildcard cert (no `skip_verify` needed on this path). Applied live to all
-7 nodes; persisted in `modules/create-template-vm/k8s-node-containerd-setup.sh`
-(new nodes) and `scripts/setup-forgejo-containerd-mirror.sh` (existing-node
-rollout). hosts.toml mirror left in place (harmless, uniform config).
+The `/etc/hosts` pins were then removed (verified `getent` still returns
+the Traefik IP via DNS, and `crictl pull` succeeds). On node5/6 the
+cloud-init `global-dns.conf` (`DNS=8.8.8.8 1.1.1.1`) was demoted to
+`FallbackDNS=` only — public servers in the global set merge with and
+race the routing domain. That file's original justification ("Technitium
+NXDOMAINs forgejo.viktorbarzin.me") was obsolete: the ingress-dns-sync
+has since added forgejo to the zone — a stale comment that actively
+pointed new nodes at the hairpin.
 
-**Renumber hazard:** the pin hardcodes Traefik's LB IP, same as the
-hosts.toml mirror and the 5 literals broken by the 2026-05-30 `.200→.203`
-move. Any future Traefik LB renumber must update both (grep nodes for
-`forgejo-internal-pin`).
+Persisted in `modules/create-template-vm/cloud_init.yaml` (new nodes; DNS
+drop-ins) and `scripts/setup-forgejo-containerd-mirror.sh` (existing-node
+rollout). hosts.toml mirror left in place but documented as vestigial.
+
+**Renumber hazard: resolved.** A future Traefik LB renumber propagates
+via the apex A record automatically (drift probe alerts if it doesn't);
+only the vestigial hosts.toml literal goes stale. **New trade-off:**
+`*.viktorbarzin.me` resolution from nodes now depends on in-cluster
+Technitium (3 replicas); in a full cluster outage these names SERVFAIL —
+acceptable, the services are down anyway, and bootstrap images pull via
+the IP-addressed `10.0.20.10` mirrors.
 
 ## Verification
 
-- `getent hosts forgejo.viktorbarzin.me` → `10.0.20.203` on all 7 nodes;
-`curl https://forgejo.viktorbarzin.me/v2/` → 401 (internal route, valid TLS).
+- `getent hosts forgejo.viktorbarzin.me` → `10.0.20.203` on all 7 nodes
+**with no `/etc/hosts` entry** (pure DNS via the routing domain);
+`resolvectl status` shows `~viktorbarzin.me` routed to `10.0.20.201`;
+general resolution (`getent hosts google.com`) intact on every node;
+`crictl pull` of the tuya_bridge image succeeds via the DNS path.
 - tuya-bridge pod Running; `/health` `ok=true`; 27/27 devices
 `success=true`; 7/7 `*_tuya_cloud_up` gauges = 1; no tuya-related alerts.
 
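Removing the earlier pins, which the hunk above describes, amounts to dropping every hosts line that carries the `forgejo-internal-pin` marker. A minimal sketch follows, assuming the marker comment is present on each pinned line as in the removed snippet; `remove_forgejo_pin` is a hypothetical helper and the demo operates on a scratch file, not on `/etc/hosts`.

```shell
#!/bin/sh
# Sketch only: strip lines tagged with the forgejo-internal-pin marker.
# remove_forgejo_pin is a hypothetical helper, not taken from the repo.
remove_forgejo_pin() {
    hosts_file="$1"
    tmp="$(mktemp)" || return 1
    grep -v 'forgejo-internal-pin' "$hosts_file" > "$tmp"
    status=$?
    # grep exits 1 when no line survives the filter; only >1 is a real error.
    if [ "$status" -gt 1 ]; then
        rm -f "$tmp"
        return 1
    fi
    # Overwrite in place so the file keeps its owner and mode.
    cat "$tmp" > "$hosts_file"
    rm -f "$tmp"
}

# Demo on a scratch copy with one ordinary entry and one pin.
demo_hosts="$(mktemp)"
printf '127.0.0.1 localhost\n' > "$demo_hosts"
printf '10.0.20.203 forgejo.viktorbarzin.me # forgejo-internal-pin\n' >> "$demo_hosts"
remove_forgejo_pin "$demo_hosts"
cat "$demo_hosts"
```

After a real removal, `getent hosts forgejo.viktorbarzin.me` should still return the Traefik IP, this time answered by DNS rather than the hosts file.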
@@ -90,10 +117,16 @@ move. Any future Traefik LB renumber must update both (grep nodes for
 latency bomb with the blast delayed until the cache misses.
 - Registry token realms are absolute URLs: any "redirect the registry"
 scheme must also redirect the *name*, not just the endpoint.
-- The remaining hairpin-exposed leg is **devvm git** (manual `/etc/hosts`
-workaround documented in memory); a durable LAN-wide fix would need
-pfSense Unbound host overrides (live network device — deliberate,
-separate change).
+- Before inventing a redirect mechanism, check what the DNS authority
+already serves: the Technitium split-horizon zone had the correct,
+auto-maintained answer all along — the clients just weren't asking it.
+- Stale config comments are load-bearing: the obsolete "Technitium
+NXDOMAINs forgejo" comment in cloud-init steered new nodes onto public
+DNS, recreating the hairpin exposure on every node added after it.
+- All `10.0.x` legs are now DNS-routed (nodes + devvm via routing domain,
+pods via CoreDNS rewrite). pfSense Unbound host overrides remain an
+option for other LAN segments if a non-Technitium client ever needs
+internal answers (live network device — deliberate, separate change).
 
 ## Related
 
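The `global-dns.conf` demotion mentioned in the commit (global `DNS=` turned into `FallbackDNS=`) can be sketched as a one-line rewrite. This is an illustration under assumptions: `demote_global_dns` is a hypothetical helper, the real file's full contents are not shown in the source, and the demo runs on a scratch file.

```shell
#!/bin/sh
# Sketch only: turn a global DNS= line into FallbackDNS= in a resolved drop-in.
# demote_global_dns is a hypothetical helper, not taken from the repo.
demote_global_dns() {
    conf="$1"
    # Only lines that begin with "DNS=" are renamed; an existing
    # "FallbackDNS=" line does not match the anchored pattern.
    sed 's/^DNS=/FallbackDNS=/' "$conf" > "$conf.tmp" || return 1
    mv "$conf.tmp" "$conf"
}

# Demo on a scratch file shaped like the setting quoted in the commit.
demo_conf="$(mktemp)"
printf '[Resolve]\nDNS=8.8.8.8 1.1.1.1\n' > "$demo_conf"
demote_global_dns "$demo_conf"
cat "$demo_conf"
```

With the public servers listed only as fallback, systemd-resolved should consult them only when no other DNS server is configured, so they no longer compete with the routing domain.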