From eb6ceac5f549071e9d009b22d627494e1901555c Mon Sep 17 00:00:00 2001
From: Viktor Barzin
Date: Sun, 19 Apr 2026 15:43:49 +0000
Subject: [PATCH] =?UTF-8?q?[dns]=20static-client=20DNS=20=E2=80=94=20Proxm?=
 =?UTF-8?q?ox=20host,=20registry=20VM=20dual-resolver=20setup=20(WS=20F)?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Fixes single-upstream DNS brittleness on non-DHCP hosts. Each host now
has a primary internal resolver + external fallback (AdGuard) so DNS
keeps working if the primary resolver IP is unreachable.

New config:
- Proxmox host (192.168.1.127): plain /etc/resolv.conf with nameserver
  192.168.1.2 (pfSense LAN) + 94.140.14.14 (AdGuard). Previously:
  single nameserver 192.168.1.1 — could not resolve internal .lan
  names at all. Documented in docs/runbooks/proxmox-host.md.
- Registry VM (10.0.20.10): systemd-resolved drop-in at
  /etc/systemd/resolved.conf.d/10-internal-dns.conf (DNS=10.0.20.1,
  FallbackDNS=94.140.14.14, Domains=viktorbarzin.lan) plus matching
  per-link nameservers in /etc/netplan/50-cloud-init.yaml. Previously:
  1.1.1.1 + 8.8.8.8 only — image pulls referencing .lan hostnames
  would fail to resolve. Documented in docs/runbooks/registry-vm.md.
- TrueNAS (10.0.10.15): host unreachable during this session ("No
  route to host" on 10.0.10.0/24). Deferred best-effort per WS F
  instructions; noted on the beads task.

Both hosts have pre-change backups at /root/dns-backups/ for
one-command rollback. Fallback behaviour was validated by routing each
primary to a blackhole and confirming dig answered from the fallback.

Both runbooks include the verified resolvectl / resolv.conf state, the
fallback-test procedure, and the rollback steps.

Closes: code-dw8
---
 docs/runbooks/proxmox-host.md | 103 ++++++++++++++++++++++++
 docs/runbooks/registry-vm.md  | 147 ++++++++++++++++++++++++++++++++++
 2 files changed, 250 insertions(+)
 create mode 100644 docs/runbooks/proxmox-host.md
 create mode 100644 docs/runbooks/registry-vm.md

diff --git a/docs/runbooks/proxmox-host.md b/docs/runbooks/proxmox-host.md
new file mode 100644
index 00000000..449ab772
--- /dev/null
+++ b/docs/runbooks/proxmox-host.md
@@ -0,0 +1,103 @@
+# Runbook: Proxmox host (pve, 192.168.1.127)
+
+Last updated: 2026-04-19
+
+The Proxmox host is a baremetal hypervisor on the storage LAN
+(192.168.1.0/24) with a single IP `192.168.1.127`. It hosts every
+Kubernetes node VM and the NFS exports that back PVCs. It does **not**
+receive DHCP — its network config is static in
+`/etc/network/interfaces` (ifupdown). Because of that, DNS must be
+configured manually and stays out of the scope of Kea/DHCP-DDNS.
+
+## DNS configuration
+
+The host uses a plain `/etc/resolv.conf` with two nameservers. No
+`systemd-resolved`, no `resolvconf`, no NetworkManager — nothing
+manages `/etc/resolv.conf`; it is a regular file owned by root.
+
+### Why plain `/etc/resolv.conf` and not systemd-resolved
+
+1. Installing `systemd-resolved` on an active Proxmox node during
+   business hours is the kind of change that risks breaking the NFS
+   server or VM networking. PVE's Debian base does not ship
+   `systemd-resolved` by default.
+2. The ifupdown `/etc/network/interfaces` file does not manage
+   `/etc/resolv.conf` here — ifupdown's resolvconf integration is
+   only active if the `resolvconf` package is installed, which it is
+   not (`dpkg -l resolvconf` returns `un`).
+3. A plain file is the simplest mental model and avoids a second
+   layer of "which tool is running now" confusion during an incident.
+
+If you ever want to migrate to `systemd-resolved`, install the
+package, enable the service, symlink `/etc/resolv.conf` to
+`/run/systemd/resolve/stub-resolv.conf`, and drop the config in
+`/etc/systemd/resolved.conf.d/10-internal-dns.conf` — but do this
+during a maintenance window, not reactively.
+
+### Current state
+
+```
+# /etc/resolv.conf
+search viktorbarzin.lan
+nameserver 192.168.1.2
+nameserver 94.140.14.14
+options timeout:2 attempts:2
+```
+
+| Field | Value | Purpose |
+|---|---|---|
+| Primary | `192.168.1.2` | pfSense LAN interface (dnsmasq forwarder → Technitium LB) — resolves `.viktorbarzin.lan` |
+| Fallback | `94.140.14.14` | AdGuard public DNS — recursive only, used if pfSense LAN IP unreachable |
+| `search` | `viktorbarzin.lan` | Unqualified names (`technitium`, `idrac`, etc.) resolve against the internal zone |
+| `timeout:2 attempts:2` | — | Cap glibc resolver at 2s per server, 2 tries — reasonable fallback latency |
+
+### Verification commands
+
+```sh
+ssh root@192.168.1.127 '
+  cat /etc/resolv.conf                 # should show the two nameservers
+  dig +short idrac.viktorbarzin.lan    # expect an A record (192.168.1.4)
+  dig +short github.com                # expect an A record
+'
+```
+
+Simulated failover — force the primary unreachable and verify the
+fallback answers:
+
+```sh
+ssh root@192.168.1.127 '
+  ip route add blackhole 192.168.1.2
+  dig +short +time=3 github.com        # primary unreachable, dig moves on to 94.140.14.14 → A record returned
+  ip route del blackhole 192.168.1.2   # cleanup
+'
+```
+
+Expected behaviour: `dig` prints a warning about the UDP
+setup failing for 192.168.1.2 and then prints the GitHub A record
+(answered by 94.140.14.14).
+
+## Rollback
+
+A pre-change backup of `/etc/resolv.conf`, `/etc/network/interfaces`,
+and `/etc/network/interfaces.d/` lives on the host at
+`/root/dns-backups/dns-config-backup-YYYYMMDD-HHMMSS.tar.gz`.
+To roll back:
+
+```sh
+ssh root@192.168.1.127 '
+  # defaults to the newest backup; set BACKUP by hand if this runbook has been applied more than once
+  BACKUP=$(ls -t /root/dns-backups/dns-config-backup-*.tar.gz | head -1)
+  tar -xzf "$BACKUP" -C /
+  cat /etc/resolv.conf
+'
+```
+
+No service restart is needed — glibc picks up changes to
+`/etc/resolv.conf` on the next lookup.
+
+## Related docs
+
+- `docs/architecture/dns.md` — where each resolver IP lives and which
+  subnet it serves.
+- `docs/runbooks/nfs-prerequisites.md` — other operations on this
+  host; read before adding new NFS exports.
diff --git a/docs/runbooks/registry-vm.md b/docs/runbooks/registry-vm.md
new file mode 100644
index 00000000..4c6fcd16
--- /dev/null
+++ b/docs/runbooks/registry-vm.md
@@ -0,0 +1,147 @@
+# Runbook: Registry VM (docker-registry, 10.0.20.10)
+
+Last updated: 2026-04-19
+
+The registry VM hosts `registry.viktorbarzin.me` (private Docker
+registry, htpasswd-auth, NGINX → registry:2). It is an Ubuntu 24.04
+VM on the cluster LAN subnet `10.0.20.0/24`, with a static netplan
+config (no DHCP). Because it sits on a subnet that only has pfSense
+as its gateway, its DNS must be statically configured.
+
+## DNS configuration
+
+Ubuntu ships `systemd-resolved` and uses netplan to declare per-link
+`nameservers`. Netplan writes systemd-networkd or NetworkManager
+configs that resolved reads at runtime. There is **no automatic
+merging** of netplan DNS with the `[Resolve]` section of
+`/etc/systemd/resolved.conf` — per-link settings override the global ones.
+So both layers must be in sync:
+
+| Layer | File | Role |
+|---|---|---|
+| Netplan | `/etc/netplan/50-cloud-init.yaml` | Per-link DNS servers that resolved reports on `Link 2 (eth0)` |
+| Resolved global | `/etc/systemd/resolved.conf.d/10-internal-dns.conf` | `Global` scope `DNS=` / `FallbackDNS=` — also shown in `resolvectl status`; resolved uses `FallbackDNS=` only when no other DNS server is configured |
+
+### Current state
+
+`/etc/systemd/resolved.conf.d/10-internal-dns.conf`:
+
+```ini
+[Resolve]
+DNS=10.0.20.1
+FallbackDNS=94.140.14.14
+Domains=viktorbarzin.lan
+```
+
+`/etc/netplan/50-cloud-init.yaml` (eth0 block, simplified):
+
+```yaml
+nameservers:
+  addresses:
+    - 10.0.20.1
+    - 94.140.14.14
+  search:
+    - viktorbarzin.lan
+```
+
+`resolvectl status` output after the change:
+
+```
+Global
+    resolv.conf mode: stub
+    Current DNS Server: 10.0.20.1
+    DNS Servers: 10.0.20.1
+    Fallback DNS Servers: 94.140.14.14
+    DNS Domain: viktorbarzin.lan
+
+Link 2 (eth0)
+    Current Scopes: DNS
+    Current DNS Server: 10.0.20.1
+    DNS Servers: 10.0.20.1 94.140.14.14
+    DNS Domain: viktorbarzin.lan
+```
+
+| Field | Value | Purpose |
+|---|---|---|
+| Primary | `10.0.20.1` | pfSense OPT1 interface (dnsmasq forwarder → Technitium LB) — resolves `.viktorbarzin.lan` |
+| Fallback | `94.140.14.14` | AdGuard public DNS — second server on the link, used if pfSense unreachable (e.g., OPT1 flap) |
+| Search | `viktorbarzin.lan` | Unqualified names resolve against the internal zone |
+
+### Why this matters for the registry
+
+Container builds on this VM reference `.lan` hostnames (Technitium,
+NFS, etc.) and external hostnames (Docker Hub, GHCR). Before the
+hardening the netplan had `1.1.1.1` / `8.8.8.8` only, which meant:
+
+1. Internal hostname lookups silently failed (slow timeout) — the
+   VM could not resolve `idrac.viktorbarzin.lan` or any internal
+   helper.
+2. If both public resolvers were unreachable (e.g., a WAN outage),
+   the VM would lose DNS entirely.
+
+With the new config the VM can resolve both zones and keeps resolving
+external names if the primary DNS server is unreachable.
+
+## Apply / re-apply
+
+```sh
+ssh root@10.0.20.10 '
+  netplan generate
+  netplan apply
+  systemctl restart systemd-resolved
+  resolvectl status | head -20
+'
+```
+
+`netplan apply` is not disruptive when only `nameservers` change — it
+does not bounce the link.
+
+## Verification
+
+```sh
+ssh root@10.0.20.10 '
+  dig +short idrac.viktorbarzin.lan      # 192.168.1.4
+  dig +short github.com                  # GitHub A record
+  dig +short registry.viktorbarzin.me    # 10.0.20.10 + external A
+'
+```
+
+Fallback test — blackhole the primary and confirm external lookups
+still succeed through 94.140.14.14:
+
+```sh
+ssh root@10.0.20.10 '
+  ip route add blackhole 10.0.20.1
+  dig +short +time=5 +tries=2 github.com # should still answer
+  ip route del blackhole 10.0.20.1
+'
+```
+
+Internal lookups do fail during the blackhole (the fallback is a
+public resolver and does not know about the internal zone), which is
+expected — the fallback buys availability for external pulls, not
+internal hostnames.
+
+## Rollback
+
+A pre-change backup of `/etc/resolv.conf`, `/etc/systemd/resolved.conf`,
+and `/etc/netplan/` lives at
+`/root/dns-backups/dns-config-backup-YYYYMMDD-HHMMSS.tar.gz` on the
+VM. To roll back:
+
+```sh
+ssh root@10.0.20.10 '
+  BACKUP=$(ls -t /root/dns-backups/dns-config-backup-*.tar.gz | head -1)
+  tar -xzf "$BACKUP" -C /
+  rm -f /etc/systemd/resolved.conf.d/10-internal-dns.conf
+  netplan apply
+  systemctl restart systemd-resolved
+  resolvectl status | head -10
+'
+```
+
+## Related docs
+
+- `docs/architecture/dns.md` — resolver IP assignments per subnet.
+- `.claude/CLAUDE.md` (at repo root) — notes on the private registry
+  and `containerd` `hosts.toml` redirects.