[dns] static-client DNS: Proxmox host, registry VM dual-resolver setup (WS F)

Fixes single-upstream DNS brittleness on non-DHCP hosts. Each host now
has a primary internal resolver + external fallback (AdGuard) so DNS
keeps working if the primary resolver IP is unreachable.

New config:
- Proxmox host (192.168.1.127): plain /etc/resolv.conf with
  nameserver 192.168.1.2 (pfSense LAN) + 94.140.14.14 (AdGuard).
  Previously: single nameserver 192.168.1.1, which could not resolve
  internal .lan names at all. Documented in
  docs/runbooks/proxmox-host.md.
- Registry VM (10.0.20.10): systemd-resolved drop-in at
  /etc/systemd/resolved.conf.d/10-internal-dns.conf
  (DNS=10.0.20.1, FallbackDNS=94.140.14.14, Domains=viktorbarzin.lan)
  plus matching per-link nameservers in /etc/netplan/50-cloud-init.yaml.
  Previously: 1.1.1.1 + 8.8.8.8 only, so image pulls referencing .lan
  hostnames would fail to resolve. Documented in
  docs/runbooks/registry-vm.md.
- TrueNAS (10.0.10.15): host unreachable during this session
  ("No route to host" on 10.0.10.0/24). Deferred best-effort per
  WS F instructions; noted on the beads task.

Both hosts have pre-change backups at /root/dns-backups/ for
one-command rollback. Fallback behaviour was validated by routing
each primary to a blackhole and confirming dig answered from the
fallback.

Both runbooks include the verified resolvectl / resolv.conf state,
the fallback-test procedure, and the rollback steps.

Closes: code-dw8

parent 3b54983a9f
commit eb6ceac5f5
2 changed files with 250 additions and 0 deletions

docs/runbooks/proxmox-host.md (Normal file, 103 lines added)
@@ -0,0 +1,103 @@
# Runbook: Proxmox host (pve, 192.168.1.127)

Last updated: 2026-04-19

The Proxmox host is a bare-metal hypervisor on the storage LAN
(192.168.1.0/24) with a single IP `192.168.1.127`. It hosts every
Kubernetes node VM and the NFS exports that back PVCs. It does **not**
receive DHCP; its network config is static in
`/etc/network/interfaces` (ifupdown). Because of that, DNS must be
configured manually and stays out of the scope of Kea/DHCP-DDNS.

## DNS configuration

The host uses a plain `/etc/resolv.conf` with two nameservers. There is
no `systemd-resolved`, no `resolvconf`, and no NetworkManager: nothing
manages `/etc/resolv.conf`; it is a regular file owned by root.

### Why plain `/etc/resolv.conf` and not systemd-resolved

1. Installing `systemd-resolved` on an active Proxmox node during
   business hours is the kind of change that risks breaking the NFS
   server or VM networking. PVE's Debian base does not ship
   `systemd-resolved` by default.
2. The ifupdown `/etc/network/interfaces` file does not manage
   `/etc/resolv.conf` here. ifupdown's resolvconf integration is
   only active if the `resolvconf` package is installed, which it is
   not (`dpkg -l resolvconf` reports status `un`, i.e. not installed).
3. A plain file is the simplest mental model and avoids a second
   layer of "which tool is running now" confusion during an incident.

If you ever want to migrate to `systemd-resolved`, install the
package, enable the service, symlink `/etc/resolv.conf` to
`/run/systemd/resolve/stub-resolv.conf`, and drop the config in
`/etc/systemd/resolved.conf.d/10-internal-dns.conf`. Do this
during a maintenance window, not reactively.
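
The migration steps listed above could be sketched as a dry-run plan, so the commands can be reviewed before a maintenance window. This is only a sketch under assumptions: the function name `print_resolved_migration_plan` is hypothetical, the package name `systemd-resolved` assumes a Debian release that packages resolved separately, and the drop-in puts both servers in `DNS=` because this host has no per-link server list and resolved consults `FallbackDNS=` only when no other DNS server is configured. Nothing is executed; the function only prints.

```sh
# Hypothetical helper: prints the migration commands for review; it does
# not execute any of them.
print_resolved_migration_plan() {
  primary=$1    # internal resolver, e.g. 192.168.1.2
  fallback=$2   # public resolver, e.g. 94.140.14.14
  domain=$3     # search domain, e.g. viktorbarzin.lan
  dropin=/etc/systemd/resolved.conf.d/10-internal-dns.conf

  # Unquoted heredoc: the shell variables expand, printf escapes stay literal.
  cat <<EOF
apt-get install -y systemd-resolved
mkdir -p /etc/systemd/resolved.conf.d
printf '[Resolve]\nDNS=%s %s\nDomains=%s\n' '$primary' '$fallback' '$domain' > $dropin
systemctl enable systemd-resolved
systemctl restart systemd-resolved
ln -sf /run/systemd/resolve/stub-resolv.conf /etc/resolv.conf
resolvectl status
EOF
}
```

Calling `print_resolved_migration_plan 192.168.1.2 94.140.14.14 viktorbarzin.lan` prints the plan with this host's values filled in; the output can then be pasted into a root shell step by step.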

### Current state

```
# /etc/resolv.conf
search viktorbarzin.lan
nameserver 192.168.1.2
nameserver 94.140.14.14
options timeout:2 attempts:2
```

| Field | Value | Purpose |
|---|---|---|
| Primary | `192.168.1.2` | pfSense LAN interface (dnsmasq forwarder → Technitium LB); resolves `.viktorbarzin.lan` |
| Fallback | `94.140.14.14` | AdGuard public DNS; recursive only, used if pfSense LAN IP unreachable |
| `search` | `viktorbarzin.lan` | Unqualified names (`technitium`, `idrac`, etc.) resolve against the internal zone |
| `timeout:2 attempts:2` | n/a | Cap glibc resolver at 2s per server, 2 tries; reasonable fallback latency |
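
To check a host against the table above without eyeballing the file, a small parser can confirm the nameserver order. A minimal sketch, assuming a POSIX shell and `awk`; the function name `check_resolv_conf` and its arguments are illustrative and not part of the runbook.

```sh
# Hypothetical helper: succeed only if FILE lists exactly the expected
# nameservers, in order (primary first, fallback second).
check_resolv_conf() {
  file=$1
  shift
  expected="$*"   # e.g. "192.168.1.2 94.140.14.14"

  # Collect the address from every "nameserver <ip>" line, in file order.
  actual=$(awk '$1 == "nameserver" { printf "%s%s", sep, $2; sep = " " }' "$file")

  if [ "$actual" = "$expected" ]; then
    echo "OK: $actual"
  else
    echo "MISMATCH: expected '$expected', found '$actual'" >&2
    return 1
  fi
}
```

On the host this would be run as `check_resolv_conf /etc/resolv.conf 192.168.1.2 94.140.14.14`. Order matters because the resolver tries the servers in the order listed.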

### Verification commands

```sh
ssh root@192.168.1.127 '
  cat /etc/resolv.conf                   # should show the two nameservers
  dig +short idrac.viktorbarzin.lan      # expect an A record (192.168.1.4)
  dig +short github.com                  # expect an A record
'
```

Simulated failover (force the primary unreachable and verify the
fallback answers):

```sh
ssh root@192.168.1.127 '
  ip route add blackhole 192.168.1.2
  dig +short +time=3 github.com          # dig cannot reach the primary, then tries 94.140.14.14 → A record returned
  ip route del blackhole 192.168.1.2     # cleanup
'
```

Expected behaviour: `dig` prints a warning about the UDP
setup failing for 192.168.1.2 and then prints the GitHub A record
(answered by 94.140.14.14). Note that `dig` reads the nameserver list
from `/etc/resolv.conf` itself rather than going through glibc; to
exercise the glibc path as well, run `getent hosts github.com` while
the blackhole is in place.
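
The inline test above relies on its last line to restore the route. A sketch of a wrapper that ties the cleanup to the subshell's exit and reports pass or fail, assuming a POSIX shell; the function name `failover_test` is hypothetical, and it uses only the `ip route add/del blackhole` and `dig` invocations already shown in this runbook.

```sh
# Hypothetical wrapper around the blackhole test. It runs in a subshell
# so the EXIT trap removes the route when the subshell exits, whether
# the lookup passed or failed.
failover_test() {
  primary=$1   # resolver to blackhole, e.g. 192.168.1.2
  name=$2      # external name to look up, e.g. github.com
  (
    ip route add blackhole "$primary" || exit 1
    trap 'ip route del blackhole "$primary"' EXIT

    # dig writes its ";; ..." warnings to stdout, so drop them before
    # deciding whether a real answer came back.
    answer=$(dig +short +time=3 "$name" | grep -v '^;')
    if [ -n "$answer" ]; then
      echo "PASS: $name resolved with $primary blackholed"
    else
      echo "FAIL: no answer for $name with $primary blackholed" >&2
      exit 1
    fi
  )
}
```

On the Proxmox host this would be invoked as `failover_test 192.168.1.2 github.com`. The subshell keeps the trap from leaking into the calling shell.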

## Rollback

A pre-change backup of `/etc/resolv.conf`, `/etc/network/interfaces`,
and `/etc/network/interfaces.d/` lives at
`/root/dns-backups/dns-config-backup-YYYYMMDD-HHMMSS.tar.gz` on the
host. To roll back:

```sh
ssh root@192.168.1.127 '
  # pick the backup you want (there may be multiple if this runbook has been applied more than once)
  BACKUP=$(ls -t /root/dns-backups/dns-config-backup-*.tar.gz | head -1)
  tar -xzf "$BACKUP" -C /
  cat /etc/resolv.conf
'
```

No service restart is needed: glibc (2.26 and later) notices that
`/etc/resolv.conf` has changed and reloads it on the next lookup.
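
Before extracting over `/`, it can help to see which archive would be picked and what it contains. A minimal sketch, assuming the backup naming scheme above; `latest_backup` and `preview_backup` are hypothetical helper names, not scripts that exist on the host.

```sh
# Hypothetical helpers for inspecting a backup before rolling back.

# Print the most recently modified backup archive in DIR
# (defaults to the directory used by this runbook).
latest_backup() {
  dir=${1:-/root/dns-backups}
  ls -t "$dir"/dns-config-backup-*.tar.gz 2>/dev/null | head -n 1
}

# List the archive members without extracting anything.
preview_backup() {
  tar -tzf "$1"
}
```

A typical use would be `preview_backup "$(latest_backup)"`, followed by the `tar -xzf` step only once the listed paths look right. Because `ls -t` sorts by modification time, an archive that was copied or touched later can sort ahead of an older one; the timestamp in the file name is the more reliable guide.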

## Related docs

- `docs/architecture/dns.md`: where each resolver IP lives and which
  subnet it serves.
- `docs/runbooks/nfs-prerequisites.md`: other operations on this
  host; read before adding new NFS exports.

docs/runbooks/registry-vm.md (Normal file, 147 lines added)
@@ -0,0 +1,147 @@
# Runbook: Registry VM (docker-registry, 10.0.20.10)

Last updated: 2026-04-19

The registry VM hosts `registry.viktorbarzin.me` (private Docker
registry, htpasswd-auth, NGINX → registry:2). It is an Ubuntu 24.04
VM on the cluster LAN subnet `10.0.20.0/24`, with a static netplan
config (no DHCP). Because it sits on a subnet that only has pfSense
as its gateway, its DNS must be statically configured.

## DNS configuration

Ubuntu ships `systemd-resolved` and uses netplan to declare per-link
`nameservers`. Netplan writes systemd-networkd or NetworkManager
configs that resolved reads at runtime. There is **no automatic
merging** of netplan DNS with the `[Resolve]` section of
`/etc/systemd/resolved.conf`: resolved keeps the per-link servers and
the global servers as separate scopes (`Link 2 (eth0)` and `Global`
in `resolvectl status`). So both layers must be in sync:

| Layer | File | Role |
|---|---|---|
| Netplan | `/etc/netplan/50-cloud-init.yaml` | Per-link DNS servers that resolved reports on `Link 2 (eth0)` |
| Resolved global | `/etc/systemd/resolved.conf.d/10-internal-dns.conf` | `Global` scope `DNS=` / `FallbackDNS=`, also shown in `resolvectl status` |

### Current state

`/etc/systemd/resolved.conf.d/10-internal-dns.conf`:

```ini
[Resolve]
DNS=10.0.20.1
FallbackDNS=94.140.14.14
Domains=viktorbarzin.lan
```

`/etc/netplan/50-cloud-init.yaml` (eth0 block, simplified):

```yaml
nameservers:
  addresses:
    - 10.0.20.1
    - 94.140.14.14
  search:
    - viktorbarzin.lan
```

`resolvectl status` output after the change:

```
Global
    resolv.conf mode: stub
  Current DNS Server: 10.0.20.1
         DNS Servers: 10.0.20.1
Fallback DNS Servers: 94.140.14.14
          DNS Domain: viktorbarzin.lan

Link 2 (eth0)
    Current Scopes: DNS
Current DNS Server: 10.0.20.1
       DNS Servers: 10.0.20.1 94.140.14.14
        DNS Domain: viktorbarzin.lan
```

| Field | Value | Purpose |
|---|---|---|
| Primary | `10.0.20.1` | pfSense OPT1 interface (dnsmasq forwarder → Technitium LB); resolves `.viktorbarzin.lan` |
| Fallback | `94.140.14.14` | AdGuard public DNS; used if pfSense unreachable (e.g., OPT1 flap). Failover comes from its position as the second per-link server in netplan; resolved consults `FallbackDNS=` only when no other DNS server is configured |
| Search | `viktorbarzin.lan` | Unqualified names resolve against the internal zone |
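
Since the netplan layer and the resolved drop-in have to agree, a quick check can compare what each scope reports. A minimal sketch, assuming the `resolvectl status` layout shown above (scope headers in column 1, `DNS Servers:` lines beneath them); the function names `dns_servers_by_scope` and `primaries_in_sync` are hypothetical.

```sh
# Hypothetical helper: read `resolvectl status` text on stdin and print
# one "scope<TAB>servers" line per scope that has a "DNS Servers:" entry.
dns_servers_by_scope() {
  awk '
    # Scope headers start in column 1 and carry no "key: value" colon.
    /^[^ \t]/ && !/:/ { scope = $0; next }
    # "Fallback DNS Servers:" is a different key and is skipped here.
    /DNS Servers:/ && !/Fallback/ {
      sub(/^[ \t]*DNS Servers:[ \t]*/, "")
      printf "%s\t%s\n", scope, $0
    }
  '
}

# Succeed only if every scope lists the same first (primary) server.
primaries_in_sync() {
  dns_servers_by_scope | awk -F '\t' '
    { split($2, s, " "); seen[s[1]] = 1 }
    END { n = 0; for (k in seen) n++; exit (n == 1 ? 0 : 1) }
  '
}
```

On the VM this would be used as `resolvectl status | primaries_in_sync && echo "layers agree"`. The check only compares the primary of each scope, because the global scope lists one server and the link lists two.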

### Why this matters for the registry

Container builds on this VM reference `.lan` hostnames (Technitium,
NFS, etc.) and external hostnames (Docker Hub, GHCR). Before the
hardening, netplan had `1.1.1.1` / `8.8.8.8` only, which meant:

1. Internal hostname lookups silently failed (slow timeout): the
   VM could not resolve `idrac.viktorbarzin.lan` or any internal
   helper.
2. If Cloudflare's 1.1.1.1 had an outage, the VM would depend on
   8.8.8.8 alone, and losing both would mean no DNS at all.

With the new config the VM can resolve both zones and keeps resolving
external names if the primary DNS server is unreachable.

## Apply / re-apply

```sh
ssh root@10.0.20.10 '
  netplan generate
  netplan apply
  systemctl restart systemd-resolved
  resolvectl status | head -20
'
```

`netplan apply` is not disruptive when only `nameservers` change; it
does not bounce the link.

## Verification

```sh
ssh root@10.0.20.10 '
  dig +short idrac.viktorbarzin.lan      # 192.168.1.4
  dig +short github.com                  # GitHub A record
  dig +short registry.viktorbarzin.me    # 10.0.20.10 + external A
'
```

Fallback test (blackhole the primary and confirm external lookups
still succeed through 94.140.14.14):

```sh
ssh root@10.0.20.10 '
  ip route add blackhole 10.0.20.1
  dig +short +time=5 +tries=2 github.com   # should still answer
  ip route del blackhole 10.0.20.1
'
```

Internal lookups do fail during the blackhole (the fallback is a
public resolver and does not know about the internal zone), which is
expected: the fallback buys availability for external pulls, not
internal hostnames.
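
The expectations during the blackhole (external names resolve, internal names do not) can be written down as explicit checks. A sketch under assumptions: `expect_lookup` is a hypothetical helper, it treats any `dig +short` output other than `;;` diagnostic lines as a successful answer, and it is meant to be run on the VM while the blackhole route is in place.

```sh
# Hypothetical helper: assert that NAME does ("yes") or does not ("no")
# resolve, based on whether `dig +short` prints an answer line.
expect_lookup() {
  name=$1
  want=$2   # "yes" or "no"

  # Lines starting with ";" are dig diagnostics, not answers.
  if [ -n "$(dig +short +time=5 +tries=2 "$name" | grep -v '^;')" ]; then
    got=yes
  else
    got=no
  fi

  if [ "$got" = "$want" ]; then
    echo "ok: $name resolves=$got"
  else
    echo "unexpected: $name resolves=$got (wanted $want)" >&2
    return 1
  fi
}
```

During the blackhole the two checks would be `expect_lookup github.com yes` and `expect_lookup idrac.viktorbarzin.lan no`; once the route is removed, the second one should flip to `yes`.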

## Rollback

A pre-change backup of `/etc/resolv.conf`, `/etc/systemd/resolved.conf`,
and `/etc/netplan/` lives at
`/root/dns-backups/dns-config-backup-YYYYMMDD-HHMMSS.tar.gz` on the
VM. To roll back:

```sh
ssh root@10.0.20.10 '
  BACKUP=$(ls -t /root/dns-backups/dns-config-backup-*.tar.gz | head -1)
  tar -xzf "$BACKUP" -C /
  rm -f /etc/systemd/resolved.conf.d/10-internal-dns.conf
  netplan apply
  systemctl restart systemd-resolved
  resolvectl status | head -10
'
```

## Related docs

- `docs/architecture/dns.md`: resolver IP assignments per subnet.
- `.claude/CLAUDE.md` (at repo root): notes on the private registry
  and `containerd` `hosts.toml` redirects.