infra/docs/post-mortems/2026-06-10-tuya-bridge-forgejo-pull-hairpin.md
Viktor Barzin b6976ce014 forgejo pulls: pin registry name to internal Traefik in node /etc/hosts [ci skip]
tuya-bridge was down 7.5h (ImagePullBackOff on k8s-node3): fresh kubelet
pulls of forgejo.viktorbarzin.me images depended on the intermittently
broken public-IP hairpin. The containerd hosts.toml mirror cannot keep
pulls internal on its own — Traefik 404s its bare-IP requests (no
Host/SNI match) and the registry Bearer realm is an absolute public URL
fetched outside the mirror. Third incident of this class (buildkit
06-04, tripit/devvm 06-09).

Fix: /etc/hosts pin 10.0.20.203 forgejo.viktorbarzin.me on every node —
covers resolve + token + blob legs with correct SNI and valid cert.
Applied live to all 7 nodes; persisted in the cloud-init bootstrap and
the existing-node rollout script. Docs updated (registry bullet, dns.md
hairpin scope + stale .200 literals, runbook) + post-mortem.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 07:15:24 +00:00

105 lines
5.1 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# 2026-06-10 — tuya-bridge down 7.5h: forgejo image pulls ride the public-IP hairpin
## Impact
- `tuya-bridge` (Flask/tinytuya bridge feeding HA-Sofia's ATS, fuse-main,
fuse-garage and 4 thermostat REST sensors) unavailable ~02:1509:50 EEST.
HA REST sensors 503'd; the official-tuya integration devices were
unaffected (hybrid architecture limited the blast radius to the 3 power
devices' advanced telemetry + thermostats extras).
- Third incident from the same root cause class:
Woodpecker buildkit pushes (2026-06-04, code-yh33), tripit
ImagePullBackOff on node2/node3 + devvm git timeouts (2026-06-09),
tuya-bridge (this one).
## Timeline (EEST)
- **02:15** — tuya-bridge pod rescheduled onto `k8s-node3` (its previous
node5/6-era home was rebuilt 14d ago; the forgejo-path image was never
cached on node3 — only stale `docker.io/*` copies). Kubelet must pull
`forgejo.viktorbarzin.me/viktor/tuya_bridge:3216c87a`.
- **02:15→09:30** — 51 consecutive pull failures:
`dial tcp 176.12.22.76:443: i/o timeout` → ImagePullBackOff. HA shows
503s (emo observed at 02:20).
- **09:40** — investigation: forgejo healthy via internal Traefik
(`10.0.20.203`), manifest exists; node3's hosts.toml mirror present and
correct; bare-IP request to the mirror returns **404 from Traefik**;
registry auth realm is the **absolute** public URL.
- **09:48** — `/etc/hosts` pin `10.0.20.203 forgejo.viktorbarzin.me` added
on node3; `crictl pull` succeeds immediately; pod replaced → Running;
`/health` ok; all 27 device `getstatus()` calls succeed; all 7
`*_tuya_cloud_up` Prometheus gauges = 1.
- **10:05** — pin rolled to all 7 nodes; provisioning scripts + docs updated.
## Root cause
Fresh kubelet pulls of `forgejo.viktorbarzin.me` images depend on pfSense
NAT reflection of the public IP `176.12.22.76`, which is intermittently
broken from the `10.0.20.0/24` network. The containerd
`certs.d/.../hosts.toml` mirror that was *believed* to keep pulls internal
cannot do so, for two independent reasons:
1. **Traefik routes by Host/SNI.** The mirror entry
`[host."https://10.0.20.203"]` makes containerd dial the bare IP (no
SNI, `Host: 10.0.20.203`) — no Traefik router matches → **404** → con-
tainerd treats the mirror as a miss and falls back to
`server = "https://forgejo.viktorbarzin.me"` → public DNS → hairpin.
2. **The Bearer auth realm is absolute.** `/v2/` challenges with
`realm="https://forgejo.viktorbarzin.me/v2/token"`; containerd fetches
that URL verbatim — this leg never goes through the mirror at all.
So every fresh pull silently depended on hairpin luck. Cached images masked
the problem; it only fired when a pod landed on a node without the image
(node rebuilds, new nodes, evictions, new tags).
Why DNS-side fixes don't reach this path: nodes resolve via systemd-resolved
→ pfSense (10.0.20.1) + public fallback (94.140.14.14), so Technitium
split-horizon (scoped to `192.168.1.0/24` clients) never applies; the
CoreDNS forgejo rewrite (2026-06-04) covers pods only, not kubelet.
## Fix
`/etc/hosts` pin on every k8s node (hot, no drain, no containerd restart):
```
10.0.20.203 forgejo.viktorbarzin.me # forgejo-internal-pin (managed: setup-forgejo-containerd-mirror.sh)
```
Go's resolver (containerd) consults `/etc/hosts` first, so resolve + token
+ blob legs all go to internal Traefik with correct SNI and a valid
wildcard cert (no `skip_verify` needed on this path). Applied live to all
7 nodes; persisted in `modules/create-template-vm/k8s-node-containerd-setup.sh`
(new nodes) and `scripts/setup-forgejo-containerd-mirror.sh` (existing-node
rollout). hosts.toml mirror left in place (harmless, uniform config).
**Renumber hazard:** the pin hardcodes Traefik's LB IP, same as the
hosts.toml mirror and the 5 literals broken by the 2026-05-30 `.200→.203`
move. Any future Traefik LB renumber must update both (grep nodes for
`forgejo-internal-pin`).
## Verification
- `getent hosts forgejo.viktorbarzin.me``10.0.20.203` on all 7 nodes;
`curl https://forgejo.viktorbarzin.me/v2/` → 401 (internal route, valid TLS).
- tuya-bridge pod Running; `/health` `ok=true`; 27/27 devices
`success=true`; 7/7 `*_tuya_cloud_up` gauges = 1; no tuya-related alerts.
## Lessons
- A mirror that *can* fall back to a broken path is not a fix — it's a
latency bomb with the blast delayed until the cache misses.
- Registry token realms are absolute URLs: any "redirect the registry"
scheme must also redirect the *name*, not just the endpoint.
- The remaining hairpin-exposed leg is **devvm git** (manual `/etc/hosts`
workaround documented in memory); a durable LAN-wide fix would need
pfSense Unbound host overrides (live network device — deliberate,
separate change).
## Related
- Beads `code-2or8` (Tuya Cloud subscription) — verified resolved during
this incident: subscription is active again, all gauges green; closed.
- 2026-06-09 tripit ImagePullBackOff — same cause, self-recovered when the
hairpin flapped back; the two `ScrapeTargetDown[tripit]` alerts firing
during this investigation were scrapes of *Completed* cronjob pod
endpoints (separate monitoring wart, not this outage).