infra/docs/post-mortems/2026-06-10-tuya-bridge-forgejo-pull-hairpin.md
Viktor Barzin 1ee1bf0817 forgejo pulls: route *.viktorbarzin.me to Technitium, drop /etc/hosts pins [ci skip]
Supersedes this morning's per-node /etc/hosts pin (no hardcoded service
IPs on nodes, per Viktor). Technitium's split-horizon zone already
resolves forgejo.viktorbarzin.me -> CNAME apex -> live Traefik LB IP
(ingress-dns-sync auto-CNAMEs every ingress host; apex drift probe
alerts) -- the nodes just never queried it. Rolled the devvm's
systemd-resolved routing-domain pattern (~viktorbarzin.me ->
10.0.20.201) to all 7 nodes, removed the pins, verified getent +
crictl pull via pure DNS.

Also demoted node5/6's cloud-init global-dns.conf (DNS=8.8.8.8 1.1.1.1)
to FallbackDNS-only: public servers in the global set race the routing
domain. Its justification ("Technitium NXDOMAINs forgejo") was obsolete
-- exactly the stale comment that pointed new nodes at the hairpin.

hosts.toml mirror kept but documented as vestigial (Traefik 404s
bare-IP requests; registry auth realm is an absolute URL).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 07:56:31 +00:00

138 lines
7 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# 2026-06-10 — tuya-bridge down 7.5h: forgejo image pulls ride the public-IP hairpin
## Impact
- `tuya-bridge` (Flask/tinytuya bridge feeding HA-Sofia's ATS, fuse-main,
fuse-garage and 4 thermostat REST sensors) unavailable ~02:1509:50 EEST.
HA REST sensors 503'd; the official-tuya integration devices were
unaffected (hybrid architecture limited the blast radius to the 3 power
devices' advanced telemetry + thermostats extras).
- Third incident from the same root cause class:
Woodpecker buildkit pushes (2026-06-04, code-yh33), tripit
ImagePullBackOff on node2/node3 + devvm git timeouts (2026-06-09),
tuya-bridge (this one).
## Timeline (EEST)
- **02:15** — tuya-bridge pod rescheduled onto `k8s-node3` (its previous
node5/6-era home was rebuilt 14d ago; the forgejo-path image was never
cached on node3 — only stale `docker.io/*` copies). Kubelet must pull
`forgejo.viktorbarzin.me/viktor/tuya_bridge:3216c87a`.
- **02:15→09:30** — 51 consecutive pull failures:
`dial tcp 176.12.22.76:443: i/o timeout` → ImagePullBackOff. HA shows
503s (emo observed at 02:20).
- **09:40** — investigation: forgejo healthy via internal Traefik
(`10.0.20.203`), manifest exists; node3's hosts.toml mirror present and
correct; bare-IP request to the mirror returns **404 from Traefik**;
registry auth realm is the **absolute** public URL.
- **09:48** — `/etc/hosts` pin `10.0.20.203 forgejo.viktorbarzin.me` added
on node3; `crictl pull` succeeds immediately; pod replaced → Running;
`/health` ok; all 27 device `getstatus()` calls succeed; all 7
`*_tuya_cloud_up` Prometheus gauges = 1.
- **10:05** — pin rolled to all 7 nodes; provisioning scripts + docs updated.
## Root cause
Fresh kubelet pulls of `forgejo.viktorbarzin.me` images depend on pfSense
NAT reflection of the public IP `176.12.22.76`, which is intermittently
broken from the `10.0.20.0/24` network. The containerd
`certs.d/.../hosts.toml` mirror that was *believed* to keep pulls internal
cannot do so, for two independent reasons:
1. **Traefik routes by Host/SNI.** The mirror entry
`[host."https://10.0.20.203"]` makes containerd dial the bare IP (no
SNI, `Host: 10.0.20.203`) — no Traefik router matches → **404** → con-
tainerd treats the mirror as a miss and falls back to
`server = "https://forgejo.viktorbarzin.me"` → public DNS → hairpin.
2. **The Bearer auth realm is absolute.** `/v2/` challenges with
`realm="https://forgejo.viktorbarzin.me/v2/token"`; containerd fetches
that URL verbatim — this leg never goes through the mirror at all.
So every fresh pull silently depended on hairpin luck. Cached images masked
the problem; it only fired when a pod landed on a node without the image
(node rebuilds, new nodes, evictions, new tags).
Why DNS-side fixes don't reach this path: nodes resolve via systemd-resolved
→ pfSense (10.0.20.1) + public fallback (94.140.14.14), so Technitium
split-horizon (scoped to `192.168.1.0/24` clients) never applies; the
CoreDNS forgejo rewrite (2026-06-04) covers pods only, not kubelet.
## Fix
**Initial mitigation (same morning):** `/etc/hosts` pin
`10.0.20.203 forgejo.viktorbarzin.me` on every node — restored service
immediately (resolve + token + blob legs all internal with correct SNI).
**Superseded same day (Viktor: "no hardcoded IPs in nodes") by a DNS-based
fix.** Discovery: Technitium's split-horizon zone *already* resolves
`forgejo.viktorbarzin.me → CNAME viktorbarzin.me → A <live Traefik IP>`
the `technitium-ingress-dns-sync` CronJob auto-CNAMEs every ingress host
hourly, the apex A record tracks the live Traefik LB IP, and the
`viktorbarzin-apex-probe` canary alerts on drift. The nodes simply never
queried Technitium (resolv chain: pfSense + public AdGuard fallback). The
devvm already solved this with a systemd-resolved **routing domain**
drop-in; the same was rolled to all 7 nodes:
```
# /etc/systemd/resolved.conf.d/viktorbarzin.conf
[Resolve]
DNS=10.0.20.201
Domains=~viktorbarzin.me
```
The `/etc/hosts` pins were then removed (verified `getent` still returns
the Traefik IP via DNS, and `crictl pull` succeeds). On node5/6 the
cloud-init `global-dns.conf` (`DNS=8.8.8.8 1.1.1.1`) was demoted to
`FallbackDNS=` only — public servers in the global set merge with and
race the routing domain. That file's original justification ("Technitium
NXDOMAINs forgejo.viktorbarzin.me") was obsolete: the ingress-dns-sync
has since added forgejo to the zone — a stale comment that actively
pointed new nodes at the hairpin.
Persisted in `modules/create-template-vm/cloud_init.yaml` (new nodes; DNS
drop-ins) and `scripts/setup-forgejo-containerd-mirror.sh` (existing-node
rollout). hosts.toml mirror left in place but documented as vestigial.
**Renumber hazard: resolved.** A future Traefik LB renumber propagates
via the apex A record automatically (drift probe alerts if it doesn't);
only the vestigial hosts.toml literal goes stale. **New trade-off:**
`*.viktorbarzin.me` resolution from nodes now depends on in-cluster
Technitium (3 replicas); in a full cluster outage these names SERVFAIL —
acceptable, the services are down anyway, and bootstrap images pull via
the IP-addressed `10.0.20.10` mirrors.
## Verification
- `getent hosts forgejo.viktorbarzin.me``10.0.20.203` on all 7 nodes
**with no `/etc/hosts` entry** (pure DNS via the routing domain);
`resolvectl status` shows `~viktorbarzin.me` routed to `10.0.20.201`;
general resolution (`getent hosts google.com`) intact on every node;
`crictl pull` of the tuya_bridge image succeeds via the DNS path.
- tuya-bridge pod Running; `/health` `ok=true`; 27/27 devices
`success=true`; 7/7 `*_tuya_cloud_up` gauges = 1; no tuya-related alerts.
## Lessons
- A mirror that *can* fall back to a broken path is not a fix — it's a
latency bomb with the blast delayed until the cache misses.
- Registry token realms are absolute URLs: any "redirect the registry"
scheme must also redirect the *name*, not just the endpoint.
- Before inventing a redirect mechanism, check what the DNS authority
already serves: the Technitium split-horizon zone had the correct,
auto-maintained answer all along — the clients just weren't asking it.
- Stale config comments are load-bearing: the obsolete "Technitium
NXDOMAINs forgejo" comment in cloud-init steered new nodes onto public
DNS, recreating the hairpin exposure on every node added after it.
- All `10.0.x` legs are now DNS-routed (nodes + devvm via routing domain,
pods via CoreDNS rewrite). pfSense Unbound host overrides remain an
option for other LAN segments if a non-Technitium client ever needs
internal answers (live network device — deliberate, separate change).
## Related
- Beads `code-2or8` (Tuya Cloud subscription) — verified resolved during
this incident: subscription is active again, all gauges green; closed.
- 2026-06-09 tripit ImagePullBackOff — same cause, self-recovered when the
hairpin flapped back; the two `ScrapeTargetDown[tripit]` alerts firing
during this investigation were scrapes of *Completed* cronjob pod
endpoints (separate monitoring wart, not this outage).