From 2b8c0def30cd304582f739716e719eff6a723dc6 Mon Sep 17 00:00:00 2001
From: Viktor Barzin
Date: Wed, 10 Jun 2026 08:32:34 +0000
Subject: [PATCH] dns: pfSense forward-zone for viktorbarzin.me, nodes fully stock [ci skip]
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Round 3 of the forgejo-pull hairpin fix (per Viktor: no per-node
customization — split-brain lives in the DNS infra):

- pfSense Unbound domain override viktorbarzin.me -> Technitium
  10.0.20.201 (applied via php write_config, backup on-box). Every
  Unbound client on every VLAN now gets the internal split-horizon
  answers (live Traefik IP via apex CNAME) with zero per-host config.
- CoreDNS carve-out (TF, applied): dedicated viktorbarzin.me:53 block —
  forgejo pinned to Traefik ClusterIP via data source (pods cannot reach
  the ETP=Local LB IP pfSense now returns), all other .me names kept on
  public resolvers (pods' pre-existing behavior). Replaces the .:53
  forgejo rewrite.
- Removed the same-day resolved routing-domain drop-ins from all 7
  nodes; node5/6 link DNS repointed Technitium -> pfSense (netplan +
  qm 205/206) for fleet parity; cloud-init no longer writes any DNS
  drop-ins.
- Docs: dns.md, pfsense-unbound runbook (override + rollback), registry
  bullet, post-mortem final-architecture addendum.

Verified: nodes resolve forgejo -> .203 via pfSense, crictl pull OK,
pods resolve forgejo -> ClusterIP / others -> public, mail record works,
.lan zone unaffected.

Co-Authored-By: Claude Fable 5 --- .claude/CLAUDE.md | 2 +- docs/architecture/dns.md | 26 +++++---- ...-06-10-tuya-bridge-forgejo-pull-hairpin.md | 53 ++++++++++++++----- docs/runbooks/pfsense-unbound.md | 49 +++++++++++++++++ modules/create-template-vm/cloud_init.yaml | 48 ++++++----------- .../k8s-node-containerd-setup.sh | 6 +-- scripts/setup-forgejo-containerd-mirror.sh | 51 ++++++------------ stacks/technitium/modules/technitium/main.tf | 48 +++++++++++++---- 8 files changed, 182 insertions(+), 101 deletions(-) diff --git a/.claude/CLAUDE.md b/.claude/CLAUDE.md index f3f736e8..d3ecbb1f 100755 --- a/.claude/CLAUDE.md +++ b/.claude/CLAUDE.md @@ -38,7 +38,7 @@ Violations cause state drift, which causes future applies to break or silently r - **DNS**: `dns_type = "proxied"` (Cloudflare CDN) or `"non-proxied"` (direct A/AAAA). DNS records are auto-created — no need to edit `config.tfvars`. Smoke-test target: `echo.viktorbarzin.me` (auth=public, header-reflecting backend). - **Anubis PoW challenge** (`modules/kubernetes/anubis_instance/`): per-site reverse proxy that issues a 30-day JWT cookie after a tiny PoW solve. Use for **public, content-bearing sites without app-level auth** (blog, docs, wikis, static landing pages). Pattern: declare `module "anubis" { source = "../../modules/kubernetes/anubis_instance"; name = "X"; namespace = ...; target_url = "http://..svc.cluster.local" }`, then in `ingress_factory` set `service_name = module.anubis.service_name`, `port = module.anubis.service_port`, `anti_ai_scraping = false`. Shared ed25519 key in Vault `secret/viktor` -> `anubis_ed25519_key`; cookie scoped to `viktorbarzin.me` so one solve covers all Anubis-fronted subdomains. **DO NOT put Anubis in front of Git/API/WebDAV/CLI endpoints** — clients without JS can't solve PoW. **Replicas default to 1** because Anubis stores in-flight challenges in process memory; a challenge issued by pod A and solved against pod B errors with `store: key not found` (HTTP 500). 
Bumping replicas requires wiring a shared Redis store (TODO). For path-level carve-outs (e.g. wrongmove has `/` behind Anubis but `/api` direct, blog has `/net-diag.sh` direct), declare a second `ingress_factory` with `ingress_path = ["/"]` pointing at the bare backend service. Active on: blog (except `/net-diag.sh`), www, kms, travel, f1, cc, json, pb (privatebin), home (homepage), wrongmove (UI only). See `.claude/reference/patterns.md` "Anti-AI Scraping" for full layering. - **Docker images**: Always build for `linux/amd64`. SHA-tag rule is being phased out — see `docs/plans/2026-05-16-auto-upgrade-apps-{design,plan}.md`. New model: CI pushes `:latest` (optionally also `:<8-char-sha>` for traceability), Keel polls and triggers rollouts. Cache-staleness concern from the old rule is resolved at the nginx layer (URL-split — manifests pass through, blobs cached). Until Phase 1 of the migration completes (per the plan), follow the SHA-tag rule for new services to match existing pattern. -- **Private registry**: `forgejo.viktorbarzin.me/viktor/` (Forgejo packages, OAuth-style PAT auth). Use `image: forgejo.viktorbarzin.me/viktor/:` + `imagePullSecrets: [{name: registry-credentials}]`. Kyverno auto-syncs the Secret to all namespaces. **Kubelet pulls** are kept off the hairpin by a systemd-resolved routing domain on every node (`/etc/systemd/resolved.conf.d/viktorbarzin.conf`: `DNS=10.0.20.201` + `Domains=~viktorbarzin.me`) — `*.viktorbarzin.me` lookups go to Technitium, whose split-horizon zone CNAMEs every ingress host (auto-synced hourly by `technitium-ingress-dns-sync`) to the zone apex whose A record tracks the **live** Traefik LB IP (canary: `viktorbarzin-apex-probe`, alerts ViktorBarzinApexDrift). No hardcoded service IPs on nodes; the devvm uses the same drop-in. 
The containerd `hosts.toml` mirror (`[host."https://10.0.20.203"]`, `skip_verify = true`) still exists but is **vestigial** — it can NOT keep pulls internal on its own: Traefik routes by Host/SNI and 404s the mirror's bare-IP requests, and the registry's Bearer auth realm is the absolute `https://forgejo.viktorbarzin.me/v2/token` URL fetched outside the mirror — without internal DNS every fresh pull degrades to public DNS → hairpin → intermittent `dial tcp 176.12.22.76:443: i/o timeout` ImagePullBackOff (tuya-bridge 7.5h outage 2026-06-10, tripit 2026-06-09; see `docs/post-mortems/2026-06-10-tuya-bridge-forgejo-pull-hairpin.md`). **Do not put public servers in the nodes' global resolved `DNS=` set** — they merge with the routing-domain set and race it (this was the node5/6 cloud-init `global-dns.conf` bug; now demoted to `FallbackDNS=`). In-cluster pods (notably Woodpecker buildkit build pods pushing images) resolve `forgejo.viktorbarzin.me` via a CoreDNS `rewrite name exact ... traefik.traefik.svc.cluster.local` (Corefile in `stacks/technitium/modules/technitium/main.tf`), since they use neither the node routing domain nor the containerd mirror; without it, buildkit pushes intermittently timed out on the public-IP hairpin (added 2026-06-04, beads code-yh33). **Was `.200` until 2026-06-01** — Traefik's 2026-05-30 move to its dedicated `.203` left the mirror pointing at the now-dead `.200:443`, silently breaking every *fresh* forgejo pull; a future LB renumber is now handled by DNS (apex record + drift probe) — only the vestigial hosts.toml literal would go stale. DNS drop-in + mirror source lives in `modules/create-template-vm/{cloud_init.yaml,k8s-node-containerd-setup.sh}` (new nodes) and `scripts/setup-forgejo-containerd-mirror.sh` (existing nodes). Push-side: viktor PAT in Vault `secret/ci/global/forgejo_push_token` (Forgejo container packages are scoped per-user; only the package owner can push, ci-pusher cannot write to viktor/*). 
Pull-side: cluster-puller PAT in Vault `secret/viktor/forgejo_pull_token`. Retention CronJob (`forgejo-cleanup` in `forgejo` ns, daily 04:00) keeps newest 10 versions + always `:latest` + any buildkit `*cache*` tag (so `--cache-from`/`--cache-to` refs survive retention — added 2026-06-09); **went live (DRY_RUN=false) 2026-06-09** after verifying 0 running images on the delete set — the registry PVC is at its 50Gi autoresize ceiling on the HDD (we did NOT move it to SSD, see beads code-oflt), so live retention is what keeps it from filling. Integrity probed every 15min by `forgejo-integrity-probe` in `monitoring` ns (catalog walk + manifest HEAD on every blob). See `docs/plans/2026-05-07-forgejo-registry-consolidation-{design,plan}.md` for the migration history. Pull-through caches for upstream registries (DockerHub, GHCR, Quay, k8s.gcr, Kyverno) stay on the registry VM at `10.0.20.10` ports 5000/5010/5020/5030/5040 — the old port-5050 R/W private registry was decommissioned 2026-05-07. +- **Private registry**: `forgejo.viktorbarzin.me/viktor/` (Forgejo packages, OAuth-style PAT auth). Use `image: forgejo.viktorbarzin.me/viktor/:` + `imagePullSecrets: [{name: registry-credentials}]`. Kyverno auto-syncs the Secret to all namespaces. **Kubelet pulls** are kept off the hairpin **at the resolver, with zero node-side DNS config**: pfSense Unbound carries a domain override forwarding the whole `viktorbarzin.me` zone to Technitium (added 2026-06-10, `docs/runbooks/pfsense-unbound.md`), whose split-horizon zone CNAMEs every ingress host (auto-synced hourly by `technitium-ingress-dns-sync`) to the zone apex whose A record tracks the **live** Traefik LB IP (canary: `viktorbarzin-apex-probe`, alerts ViktorBarzinApexDrift). Nodes are stock — link DNS `10.0.20.1 94.140.14.14` via `qm set --nameserver`, no `/etc/hosts` pins, no resolved drop-ins (two same-day interim approaches on 2026-06-10 were removed the same day). 
The containerd `hosts.toml` mirror (`[host."https://10.0.20.203"]`, `skip_verify = true`) still exists but is **vestigial** — it can NOT keep pulls internal on its own: Traefik routes by Host/SNI and 404s the mirror's bare-IP requests, and the registry's Bearer auth realm is the absolute `https://forgejo.viktorbarzin.me/v2/token` URL fetched outside the mirror — without internal DNS every fresh pull degrades to public DNS → hairpin → intermittent `dial tcp 176.12.22.76:443: i/o timeout` ImagePullBackOff (tuya-bridge 7.5h outage 2026-06-10, tripit 2026-06-09; see `docs/post-mortems/2026-06-10-tuya-bridge-forgejo-pull-hairpin.md`). **In-cluster pods are deliberately carved out** — the Traefik LB IP is ETP=Local and unreachable from pods, so CoreDNS has a dedicated `viktorbarzin.me:53` block (Corefile in `stacks/technitium/modules/technitium/main.tf`): forgejo pinned to Traefik's **ClusterIP** (TF-interpolated from the live Service; replaces the old `rewrite ... traefik.traefik.svc.cluster.local` in `.:53`), all other `.me` names forwarded to `8.8.8.8/1.1.1.1` (pods keep public answers; beads code-yh33). Do NOT remove that block while the pfSense override exists. **Was `.200` until 2026-06-01** — Traefik's 2026-05-30 move to its dedicated `.203` left the mirror pointing at the now-dead `.200:443`, silently breaking every *fresh* forgejo pull; a future LB renumber is now handled by DNS (apex record + drift probe) — only the vestigial hosts.toml literal would go stale. Mirror source lives in `modules/create-template-vm/k8s-node-containerd-setup.sh` (new nodes) and `scripts/setup-forgejo-containerd-mirror.sh` (existing nodes; also cleans up the legacy 2026-06-10 node-DNS customization). Push-side: viktor PAT in Vault `secret/ci/global/forgejo_push_token` (Forgejo container packages are scoped per-user; only the package owner can push, ci-pusher cannot write to viktor/*). Pull-side: cluster-puller PAT in Vault `secret/viktor/forgejo_pull_token`. 
Retention CronJob (`forgejo-cleanup` in `forgejo` ns, daily 04:00) keeps newest 10 versions + always `:latest` + any buildkit `*cache*` tag (so `--cache-from`/`--cache-to` refs survive retention — added 2026-06-09); **went live (DRY_RUN=false) 2026-06-09** after verifying 0 running images on the delete set — the registry PVC is at its 50Gi autoresize ceiling on the HDD (we did NOT move it to SSD, see beads code-oflt), so live retention is what keeps it from filling. Integrity probed every 15min by `forgejo-integrity-probe` in `monitoring` ns (catalog walk + manifest HEAD on every blob). See `docs/plans/2026-05-07-forgejo-registry-consolidation-{design,plan}.md` for the migration history. Pull-through caches for upstream registries (DockerHub, GHCR, Quay, k8s.gcr, Kyverno) stay on the registry VM at `10.0.20.10` ports 5000/5010/5020/5030/5040 — the old port-5050 R/W private registry was decommissioned 2026-05-07. - **LinuxServer.io containers**: `DOCKER_MODS` runs apt-get on every start — bake slow mods into a custom image (`RUN /docker-mods || true` then `ENV DOCKER_MODS=`). Set `NO_CHOWN=true` to skip recursive chown that hangs on NFS mounts. - **Node memory changes**: When changing VM memory on any k8s node, update kubelet `systemReserved`, `kubeReserved`, and eviction thresholds accordingly. Config: `/var/lib/kubelet/config.yaml`. Template: `stacks/infra/main.tf`. Current values: systemReserved=512Mi, kubeReserved=512Mi, evictionHard=500Mi, evictionSoft=1Gi. - **Node OS disk tuning** (in `stacks/infra/main.tf`): kubelet `imageGCHighThresholdPercent=70` (was 85), `imageGCLowThresholdPercent=60` (was 80), ext4 `commit=60` in fstab (was default 5s), journald `SystemMaxUse=200M` + `MaxRetentionSec=3day`. 
diff --git a/docs/architecture/dns.md b/docs/architecture/dns.md index 65074f5b..05a30ccc 100644 --- a/docs/architecture/dns.md +++ b/docs/architecture/dns.md @@ -269,11 +269,11 @@ Technitium's **Split Horizon AddressTranslation** app post-processes DNS respons - **Affected**: Non-proxied domains (ha-sofia, immich, headscale, calibre, vaultwarden, etc.) for 192.168.1.x clients - **Not affected**: Cloudflare-proxied domains (resolve to Cloudflare edge IPs, no translation needed) -- **Not affected**: 10.0.x.x and K8s clients — these resolve non-proxied domains to the public IP and rely on pfSense NAT reflection, which is **intermittently broken** (observed i/o timeouts to `176.12.22.76:443` from k8s nodes and the devvm, 2026-06-04 → 2026-06-10). Hairpin-sensitive paths on this network route `*.viktorbarzin.me` to Technitium instead, via a systemd-resolved **routing domain** (`/etc/systemd/resolved.conf.d/viktorbarzin.conf`: `DNS=10.0.20.201`, `Domains=~viktorbarzin.me`). Technitium's split-horizon zone answers with the zone apex A record, which auto-tracks the live Traefik LB IP (`technitium-ingress-dns-sync` CNAMEs every ingress host hourly; `viktorbarzin-apex-probe` is the drift canary) — no hardcoded service IPs on clients: - - **k8s nodes (kubelet image pulls of `forgejo.viktorbarzin.me`)**: routing-domain drop-in on all 7 nodes (2026-06-10, replacing a same-day `/etc/hosts` pin; deployed via `modules/create-template-vm/cloud_init.yaml` for new nodes, `scripts/setup-forgejo-containerd-mirror.sh` rollout for existing ones). The containerd hosts.toml mirror alone is insufficient — Traefik 404s its bare-IP requests (no Host/SNI match) and the registry's Bearer auth realm is an absolute public URL fetched outside the mirror. Caution: public servers must NOT sit in the nodes' global resolved `DNS=` set — they merge with and race the routing domain (the old node5/6 `global-dns.conf` did exactly this; now `FallbackDNS=` only). 
Root cause analysis: `docs/post-mortems/2026-06-10-tuya-bridge-forgejo-pull-hairpin.md`. - - **devvm**: same `viktorbarzin.conf` drop-in (predates the node rollout; provisioned by `setup-devvm.sh`). - - **in-cluster pods → forgejo**: CoreDNS `rewrite name exact forgejo.viktorbarzin.me traefik.traefik.svc.cluster.local` (2026-06-04, beads code-yh33) — pods bypass node resolved entirely. - - **Trade-off**: `*.viktorbarzin.me` resolution from nodes/devvm now depends on in-cluster Technitium (3 replicas). During a full cluster outage these names SERVFAIL — acceptable, the services behind them are down anyway; bootstrap images pull via the IP-addressed `10.0.20.10` mirrors, so cold-start self-unwinds. +- **10.0.x.x clients (k8s nodes, devvm, other VMs)** — handled at the resolver since 2026-06-10: **pfSense Unbound carries a domain override forwarding the whole `viktorbarzin.me` zone to Technitium** (`10.0.20.201`). Technitium's split-horizon zone answers with the zone apex A record, which auto-tracks the live Traefik LB IP (`technitium-ingress-dns-sync` CNAMEs every ingress host hourly; `viktorbarzin-apex-probe` is the drift canary). Every client of pfSense Unbound — all VLANs, k8s nodes included — therefore gets internal answers with **zero per-host configuration** (no `/etc/hosts` pins, no resolved drop-ins; both earlier same-day approaches were removed, nodes are stock). Names not behind Traefik keep distinct records in the zone (e.g. `mail.viktorbarzin.me → 10.0.20.1`, verified working on :993/:25). See `docs/runbooks/pfsense-unbound.md` for the override config + rollback, and `docs/post-mortems/2026-06-10-tuya-bridge-forgejo-pull-hairpin.md` for the incident that motivated this (kubelet forgejo pulls riding the broken hairpin; the containerd hosts.toml mirror cannot fix it — Traefik 404s bare-IP requests and the registry auth realm is an absolute public URL). 
+ - **devvm**: also covered by a `~viktorbarzin.me → 10.0.20.201` resolved routing domain (predates the pfSense override, provisioned by `setup-devvm.sh`) — redundant-but-harmless belt-and-suspenders. + - **in-cluster PODS are deliberately carved out**: the Traefik LB IP is `externalTrafficPolicy=Local` and unreachable from pods, so the `.203` answers pfSense now returns must NOT reach them. CoreDNS has a dedicated `viktorbarzin.me:53` block (in `stacks/technitium`, TF-managed): forgejo is pinned to Traefik's **ClusterIP** (interpolated from the live Service at plan time) and all other `.me` names forward to `8.8.8.8/1.1.1.1` — preserving pods' pre-existing public-IP behavior (beads code-yh33). + - **Trade-off**: `viktorbarzin.me` resolution via pfSense now depends on in-cluster Technitium (3 replicas). During a full cluster outage the zone SERVFAILs LAN-wide — acceptable, the services behind it are down anyway; node bootstrap images pull via the IP-addressed `10.0.20.10` mirrors, so cold-start self-unwinds. + - **Residual nondeterminism**: nodes keep `94.140.14.14` as a secondary resolver (netplan/qm `--nameserver`). If systemd-resolved fails over to it during a pfSense DNS blip, `.me` answers are public again until it switches back — a rare, self-healing window, accepted. Config is synced to all 3 Technitium instances by CronJob `technitium-split-horizon-sync` (every 6h). @@ -462,10 +462,18 @@ The zone-sync CronJob (runs every 30min) pushes the following to the Prometheus ### Hairpin NAT Not Working (LAN → *.viktorbarzin.me Fails) -Since 2026-04-19 (Workstream D), pfSense Unbound answers LAN DNS queries -directly instead of forwarding to Technitium, so the Technitium Split Horizon -post-processing does NOT run for 192.168.1.x clients anymore. Non-proxied -services break hairpin on LAN clients again. 
Options: +**Since 2026-06-10 this is largely solved at the resolver**: pfSense Unbound +carries a domain override forwarding the entire `viktorbarzin.me` zone to +Technitium, so ANY client that queries pfSense (all VLANs + 192.168.1.x +clients pointed at `192.168.1.2`) gets the internal Traefik answer. If +hairpin still fails for a client, first check which resolver it actually +uses — clients on the TP-Link's own DHCP DNS (router/ISP) bypass pfSense +entirely. Options for those: + +(Historical context: 2026-04-19 Workstream D made Unbound answer LAN +queries directly, which had removed the Technitium Split Horizon +post-processing from the LAN path until the 2026-06-10 domain override +restored internal answers at the zone level.) 1. **Switch service to proxied Cloudflare** (preferred) — set `dns_type = "proxied"` in the `ingress_factory` module call; DNS now resolves to Cloudflare edge, hairpin-independent. 2. **Add a local-data override on pfSense Unbound** — under `Services → DNS Resolver → Host Overrides`, set `.viktorbarzin.me → 10.0.20.203` (Traefik LB IP). This is equivalent to what Split Horizon did, applied at the resolver. diff --git a/docs/post-mortems/2026-06-10-tuya-bridge-forgejo-pull-hairpin.md b/docs/post-mortems/2026-06-10-tuya-bridge-forgejo-pull-hairpin.md index 09dce83d..6f07bc52 100644 --- a/docs/post-mortems/2026-06-10-tuya-bridge-forgejo-pull-hairpin.md +++ b/docs/post-mortems/2026-06-10-tuya-bridge-forgejo-pull-hairpin.md @@ -89,25 +89,50 @@ NXDOMAINs forgejo.viktorbarzin.me") was obsolete: the ingress-dns-sync has since added forgejo to the zone — a stale comment that actively pointed new nodes at the hairpin. -Persisted in `modules/create-template-vm/cloud_init.yaml` (new nodes; DNS -drop-ins) and `scripts/setup-forgejo-containerd-mirror.sh` (existing-node -rollout). hosts.toml mirror left in place but documented as vestigial. 
+**Final architecture (same day, round 3 — Viktor: "no customization, +everything handled by the DNS infra"):** the routing-domain drop-ins were +ALSO removed; nodes are now completely stock. Two resolver-side changes +replaced them: + +1. **pfSense Unbound domain override** `viktorbarzin.me → 10.0.20.201` + (forward-zone to Technitium). Every Unbound client on every VLAN gets + the internal split-horizon answers with zero per-host config. No + DNSSEC complications (zone unsigned), private-IP answers pass, mail's + non-Traefik record (`→ 10.0.20.1`) verified working. Runbook: + `docs/runbooks/pfsense-unbound.md`; on-box backup + `config.xml.bak-2026-06-10-pre-me-forward`. +2. **CoreDNS pod carve-out** (TF, `stacks/technitium`): a dedicated + `viktorbarzin.me:53` server block pins forgejo to Traefik's + **ClusterIP** (interpolated from the live Service — pods cannot reach + the ETP=Local LB IP that pfSense now returns) and forwards all other + `.me` names to `8.8.8.8/1.1.1.1`, preserving pods' pre-existing + public-IP behavior. Replaces the old forgejo rewrite in `.:53`. + +node5/6 were also re-pointed from link-DNS=Technitium to +`10.0.20.1 94.140.14.14` (netplan + `qm set --nameserver` on PVE VMs +205/206) for fleet parity, and their `global-dns.conf` was deleted. **Renumber hazard: resolved.** A future Traefik LB renumber propagates via the apex A record automatically (drift probe alerts if it doesn't); -only the vestigial hosts.toml literal goes stale. **New trade-off:** -`*.viktorbarzin.me` resolution from nodes now depends on in-cluster -Technitium (3 replicas); in a full cluster outage these names SERVFAIL — -acceptable, the services are down anyway, and bootstrap images pull via -the IP-addressed `10.0.20.10` mirrors. +only the vestigial hosts.toml literal goes stale. 
**Trade-offs:** +`viktorbarzin.me` resolution via pfSense depends on in-cluster Technitium +(3 replicas) — SERVFAIL during a full cluster outage (services down +anyway; bootstrap images pull via the IP-addressed `10.0.20.10` mirrors). +Nodes keep `94.140.14.14` as secondary DNS: a resolved failover during a +pfSense blip briefly re-exposes public answers — rare, self-healing, +accepted. -## Verification +## Verification (final architecture) -- `getent hosts forgejo.viktorbarzin.me` → `10.0.20.203` on all 7 nodes - **with no `/etc/hosts` entry** (pure DNS via the routing domain); - `resolvectl status` shows `~viktorbarzin.me` routed to `10.0.20.201`; - general resolution (`getent hosts google.com`) intact on every node; - `crictl pull` of the tuya_bridge image succeeds via the DNS path. +- All 7 nodes stock (no pins, no drop-ins); `getent hosts + forgejo.viktorbarzin.me` → `10.0.20.203` via pfSense → Technitium; + general resolution intact; `crictl pull` succeeds end-to-end. +- pfSense: forgejo/immich/vault → apex CNAME → `.203`; mail → + `10.0.20.1` (`:993` verified); `google.com` public; `.lan` auth-zone + unaffected. +- Pods: forgejo → `10.111.111.95` (Traefik ClusterIP), + immich → `176.12.22.76` (public, status quo) — verified in-pod after + CoreDNS reload. - tuya-bridge pod Running; `/health` `ok=true`; 27/27 devices `success=true`; 7/7 `*_tuya_cloud_up` gauges = 1; no tuya-related alerts. 
diff --git a/docs/runbooks/pfsense-unbound.md b/docs/runbooks/pfsense-unbound.md index 19e0e5dc..bfa2a19e 100644 --- a/docs/runbooks/pfsense-unbound.md +++ b/docs/runbooks/pfsense-unbound.md @@ -93,6 +93,55 @@ Verify via the Technitium API: curl -sk "http://127.0.0.1:5380/api/zones/options/get?token=$TOK&zone=viktorbarzin.lan" | jq .response.zoneTransfer ``` +## Domain Override: viktorbarzin.me → Technitium (2026-06-10) + +`$config['unbound']['domainoverrides']` carries one entry forwarding the +whole `viktorbarzin.me` zone to Technitium `10.0.20.201` (forward-zone, not +AXFR). Every Unbound client — all VLANs + 192.168.1.x via the WAN listener — +gets Technitium's internal split-horizon answers: ingress hosts CNAME to the +zone apex whose A record auto-tracks the live Traefik LB IP +(`technitium-ingress-dns-sync` + `viktorbarzin-apex-probe` canary). This is +what keeps kubelet forgejo image pulls (and everything else on 10.0.x) off +the broken public NAT-hairpin with zero per-host DNS config — see +`docs/post-mortems/2026-06-10-tuya-bridge-forgejo-pull-hairpin.md`. + +Notes: + +- The domain is NOT DNSSEC-signed (no DS records), so no `domain-insecure` + needed; private-IP answers pass without `private-domain` custom options + (verified empirically — pfSense handles domain overrides correctly). +- **Cluster-outage behavior**: the zone SERVFAILs while Technitium is down + (forward-zone, no local copy). Deliberate — the services are down anyway. + Contrast with `viktorbarzin.lan`, which is AXFR-slaved to survive outages. +- **In-cluster pods must NOT see these answers** (Traefik LB is ETP=Local, + unreachable from pods). CoreDNS has a dedicated `viktorbarzin.me:53` + carve-out (stacks/technitium) — do not remove it while this override exists. 
+- Added with the standard SSH + PHP pattern (see "host override" memories / + this file's style): + +```php +require_once("config.inc"); require_once("unbound.inc"); +global $config; +$config["unbound"]["domainoverrides"][] = [ + "domain" => "viktorbarzin.me", "ip" => "10.0.20.201", + "descr" => "...", "tls_hostname" => "", +]; +write_config("add viktorbarzin.me domain override -> Technitium"); +services_unbound_configure(); +``` + +Rollback: remove the entry from the array (match on `domain`), then +`write_config()` + `services_unbound_configure()`. Pre-change backup: +`/cf/conf/config.xml.bak-2026-06-10-pre-me-forward` (on-box). + +Verify: + +``` +dig +short @10.0.20.1 forgejo.viktorbarzin.me # apex CNAME + live Traefik IP +dig +short @10.0.20.1 mail.viktorbarzin.me # 10.0.20.1 (non-Traefik record) +dig +short @10.0.20.1 google.com # public, unaffected +``` + ## Operational Checks ```bash diff --git a/modules/create-template-vm/cloud_init.yaml b/modules/create-template-vm/cloud_init.yaml index fef4d391..c70057e8 100644 --- a/modules/create-template-vm/cloud_init.yaml +++ b/modules/create-template-vm/cloud_init.yaml @@ -90,37 +90,23 @@ runcmd: - sed -i 's/#Compress=yes/Compress=yes/' /etc/systemd/journald.conf - systemctl restart systemd-journald %{if is_k8s_template} - # systemd-resolved split DNS, two drop-ins (2026-06-10, replaces the - # public-first global DNS that was here before): - # - # viktorbarzin.conf — routing domain ~viktorbarzin.me -> Technitium - # (10.0.20.201). The technitium-ingress-dns-sync CronJob keeps a CNAME - # for every ingress host (incl. forgejo.viktorbarzin.me) chained to the - # zone apex, whose A record auto-tracks the live Traefik LB IP (canary: - # viktorbarzin-apex-probe). Keeps kubelet pulls of forgejo images off - # the flaky public NAT-hairpin with no hardcoded service IPs. (The old - # comment claiming Technitium NXDOMAINs forgejo.viktorbarzin.me is - # obsolete — ingress-dns-sync added it to the split-horizon zone. 
See - # docs/post-mortems/2026-06-10-tuya-bridge-forgejo-pull-hairpin.md.) - # - # global-dns.conf — emergency fallback only. Public servers must NOT - # sit in the global DNS= set: they merge with viktorbarzin.conf's set - # and race the ~viktorbarzin.me routing domain, intermittently - # returning the public IP again. General resolution uses the - # link-level DNS from Proxmox's `qm set --nameserver`. - - mkdir -p /etc/systemd/resolved.conf.d - - | - cat > /etc/systemd/resolved.conf.d/viktorbarzin.conf <<'EOF' - [Resolve] - DNS=10.0.20.201 - Domains=~viktorbarzin.me - EOF - - | - cat > /etc/systemd/resolved.conf.d/global-dns.conf <<'EOF' - [Resolve] - FallbackDNS=8.8.8.8 1.1.1.1 - EOF - - systemctl restart systemd-resolved + # Node DNS is intentionally STOCK — no resolved drop-ins, no /etc/hosts + # pins. Link nameservers come from Proxmox `qm set --nameserver + # "10.0.20.1 94.140.14.14"` (pfSense + public secondary; set this when + # cloning a new node VM). Internal split-horizon for *.viktorbarzin.me + # happens at pfSense Unbound: a domain override forwards the zone to + # Technitium (10.0.20.201), whose split-horizon zone CNAMEs every ingress + # host to the apex A record that auto-tracks the live Traefik LB IP — so + # every VLAN client, nodes included, gets internal answers with zero + # per-host config (2026-06-10; runbook: docs/runbooks/pfsense-unbound.md). + # Pods are carved out separately (CoreDNS `viktorbarzin.me:53` block: + # public answers + forgejo pinned to Traefik's ClusterIP — the LB IP is + # ETP=Local and unreachable from pods; stacks/technitium). + # History: a global-dns.conf drop-in (public DNS primary) lived here until + # 2026-06-10. Its rationale ("Technitium NXDOMAINs forgejo.viktorbarzin.me") + # had long been obsolete, and it steered fresh forgejo pulls onto the broken + # public NAT-hairpin (7.5h tuya-bridge outage — see + # docs/post-mortems/2026-06-10-tuya-bridge-forgejo-pull-hairpin.md). 
# Re-enabled 2026-05-10: unattended-upgrades is back on, but with a tight # Allowed-Origins list, a Package-Blacklist for k8s/containerd/runc/calico, # and Automatic-Reboot disabled (kured + sentinel-gate handles reboots in a diff --git a/modules/create-template-vm/k8s-node-containerd-setup.sh b/modules/create-template-vm/k8s-node-containerd-setup.sh index c4a108a7..98a71739 100755 --- a/modules/create-template-vm/k8s-node-containerd-setup.sh +++ b/modules/create-template-vm/k8s-node-containerd-setup.sh @@ -54,9 +54,9 @@ GHCR # Host/SNI and 404s the mirror's bare-IP requests, and the registry's # Bearer auth realm is the absolute https://forgejo.viktorbarzin.me/v2/token # URL fetched outside the mirror). What actually keeps forgejo pulls -# internal is the systemd-resolved routing domain ~viktorbarzin.me -> -# Technitium (viktorbarzin.conf, written by cloud_init.yaml), which -# resolves forgejo to the live Traefik LB via the split-horizon zone. +# internal is the pfSense Unbound domain override forwarding +# viktorbarzin.me -> Technitium, whose split-horizon zone serves the live +# Traefik LB IP (no node-side DNS config at all). # Kept for config uniformity; harmless. See # docs/post-mortems/2026-06-10-tuya-bridge-forgejo-pull-hairpin.md. mkdir -p /etc/containerd/certs.d/forgejo.viktorbarzin.me diff --git a/scripts/setup-forgejo-containerd-mirror.sh b/scripts/setup-forgejo-containerd-mirror.sh index e89db277..4c543d15 100755 --- a/scripts/setup-forgejo-containerd-mirror.sh +++ b/scripts/setup-forgejo-containerd-mirror.sh @@ -1,23 +1,23 @@ #!/usr/bin/env bash -# One-shot deployment of the forgejo pull path across every k8s node: -# systemd-resolved routing domain ~viktorbarzin.me -> Technitium, plus the -# (vestigial) containerd hosts.toml entry. Cloud-init only fires on VM -# provision, so existing nodes need this manual rollout. 
+# One-shot deployment of the (vestigial) forgejo containerd hosts.toml entry
+# across every k8s node, plus cleanup of legacy node-side DNS customization.
+# Cloud-init only fires on VM provision, so existing nodes need this manual
+# rollout.
 #
-# The routing domain is what actually makes pulls hairpin-proof: Technitium's
-# split-horizon zone resolves forgejo.viktorbarzin.me (CNAME, auto-synced from
-# ingresses) to the zone apex whose A record tracks the live Traefik LB IP —
-# no hardcoded service IPs on nodes. The hosts.toml mirror alone CANNOT do
-# this: Traefik 404s its bare-IP requests (no Host/SNI match) and the registry
-# Bearer auth realm is the absolute public URL fetched outside the mirror
-# (2026-06-10 tuya-bridge outage; see
-# docs/post-mortems/2026-06-10-tuya-bridge-forgejo-pull-hairpin.md).
+# Node DNS is intentionally STOCK: internal split-horizon for
+# *.viktorbarzin.me happens at pfSense Unbound (domain override ->
+# Technitium), whose split-horizon zone serves the live Traefik LB IP for
+# every ingress host — nodes need no resolved drop-ins or /etc/hosts pins.
+# The hosts.toml mirror alone CANNOT keep pulls internal: Traefik 404s its
+# bare-IP requests (no Host/SNI match) and the registry Bearer auth realm is
+# the absolute public URL fetched outside the mirror (2026-06-10 tuya-bridge
+# outage; see docs/post-mortems/2026-06-10-tuya-bridge-forgejo-pull-hairpin.md).
 #
 # What it does, per node:
 #   1. drain (ignore-daemonsets, delete-emptydir-data)
-#   2. ssh in: write /etc/systemd/resolved.conf.d/viktorbarzin.conf (routing
-#      domain), neuter any public global-dns.conf to FallbackDNS-only, drop
-#      legacy forgejo-internal-pin /etc/hosts lines, restart systemd-resolved,
+#   2. ssh in: remove legacy DNS customization (forgejo-internal-pin
+#      /etc/hosts lines, viktorbarzin.conf / global-dns.conf resolved
+#      drop-ins), restart systemd-resolved,
 #      write /etc/containerd/certs.d/forgejo.viktorbarzin.me/hosts.toml
 #   3. systemctl restart containerd
 #   4. uncordon
@@ -51,26 +51,9 @@ for n in $NODES; do
   ssh -o StrictHostKeyChecking=accept-new "wizard@$n" sudo bash < /etc/systemd/resolved.conf.d/viktorbarzin.conf <<'CONF'
-# Route *.viktorbarzin.me to Technitium (split-horizon zone -> live Traefik LB),
-# so kubelet image pulls of forgejo.viktorbarzin.me never traverse the public
-# NAT-hairpin. Everything else uses the link DNS.
-# Managed: setup-forgejo-containerd-mirror.sh / cloud_init.yaml
-[Resolve]
-DNS=10.0.20.201
-Domains=~viktorbarzin.me
-CONF
-# Public servers in the global DNS= set would race the routing domain —
-# demote any legacy global-dns.conf to emergency fallback only.
-if [ -f /etc/systemd/resolved.conf.d/global-dns.conf ]; then
-  cat > /etc/systemd/resolved.conf.d/global-dns.conf <<'CONF'
-# Emergency fallback only (used when no link DNS is configured at all).
-[Resolve]
-FallbackDNS=8.8.8.8 1.1.1.1
-CONF
-fi
 sed -i '/forgejo-internal-pin/d' /etc/hosts
+rm -f /etc/systemd/resolved.conf.d/viktorbarzin.conf \
+  /etc/systemd/resolved.conf.d/global-dns.conf
 systemctl restart systemd-resolved
 mkdir -p "$CERTS_DIR"
 cat > "$CERTS_DIR/hosts.toml" <<'TOML'
diff --git a/stacks/technitium/modules/technitium/main.tf b/stacks/technitium/modules/technitium/main.tf
index 087cef3b..0e6a8d2b 100644
--- a/stacks/technitium/modules/technitium/main.tf
+++ b/stacks/technitium/modules/technitium/main.tf
@@ -33,6 +33,18 @@ module "tls_secret" {
   tls_secret_name = var.tls_secret_name
 }

+# Traefik Service ClusterIP for the CoreDNS viktorbarzin.me block below.
+# Pods cannot use the Traefik LB IP (.203, externalTrafficPolicy=Local — only
+# nodes with a local Traefik endpoint answer), so in-cluster answers must
+# target the ClusterIP. Read from the live Service so a recreate can never
+# leave a stale literal (same pattern as the woodpecker-server hostAlias fix).
+data "kubernetes_service" "traefik" {
+  metadata {
+    name      = "traefik"
+    namespace = "traefik"
+  }
+}
+
 # CoreDNS Corefile - manages cluster DNS resolution
 # The viktorbarzin.lan block forwards to Technitium via ClusterIP (stable, LB-independent).
 # A template regex in the viktorbarzin.lan block short-circuits junk queries
@@ -60,15 +72,6 @@ resource "kubernetes_config_map" "coredns" {
             fallthrough in-addr.arpa ip6.arpa
             ttl 30
         }
-        # Pin forgejo.viktorbarzin.me to the in-cluster Traefik Service so pod
-        # builds/pulls/pushes resolve to its ClusterIP, not the public IP that
-        # hairpins through the WAN gateway and intermittently times out buildkit
-        # pushes (woodpecker build pods don't use the node containerd mirror that
-        # fixes kubelet pulls). Service-name target auto-tracks the ClusterIP (no
-        # rot); Traefik's *.viktorbarzin.me wildcard keeps SNI/TLS valid. The
-        # woodpecker-server hostAlias (main.tf) becomes belt-and-suspenders.
-        # (beads code-yh33 — in-cluster *.viktorbarzin.me hairpin)
-        rewrite name exact forgejo.viktorbarzin.me traefik.traefik.svc.cluster.local
         prometheus :9153
         forward . 10.0.20.1 8.8.8.8 1.1.1.1 {
             policy sequential
@@ -84,6 +87,33 @@ resource "kubernetes_config_map" "coredns" {
         reload
         loadbalance
     }
+    # Dedicated zone for *.viktorbarzin.me as seen by PODS. Needed because
+    # pfSense Unbound (first upstream of .:53) forwards this zone to
+    # Technitium since 2026-06-10, whose answers point at the Traefik LB
+    # (.203) — unreachable from pods (externalTrafficPolicy=Local). Pods
+    # therefore keep PUBLIC answers via 8.8.8.8/1.1.1.1 (their pre-existing
+    # behavior), except forgejo.viktorbarzin.me which is pinned to Traefik's
+    # ClusterIP (hosts plugin; the kubernetes plugin isn't in this block so
+    # a Service-name rewrite cannot resolve here). Replaces the old rewrite
+    # in .:53 (beads code-yh33 — in-cluster *.viktorbarzin.me hairpin).
+    viktorbarzin.me:53 {
+        errors
+        hosts {
+            ${data.kubernetes_service.traefik.spec.0.cluster_ip} forgejo.viktorbarzin.me
+            fallthrough
+        }
+        forward . 8.8.8.8 1.1.1.1 {
+            policy sequential
+            health_check 5s
+            max_fails 2
+        }
+        cache {
+            success 10000 300 6
+            denial 10000 300 60
+            serve_stale 86400s
+        }
+        reload
+    }
     viktorbarzin.lan:53 {
         #log
         errors