From f6685a23a9490fa083ebd3699410299aca1aff46 Mon Sep 17 00:00:00 2001
From: Viktor Barzin
Date: Sun, 19 Apr 2026 16:12:23 +0000
Subject: [PATCH] [dns] Kea: multi-IP DHCP option 6 (10.0.10, 10.0.20) + TSIG-signed DDNS (WS E)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Workstream E of the DNS hardening push. Two independent pfSense-side
changes to eliminate single-point DNS failures and the unauthenticated
RFC 2136 update vector.

Part 1 — Multi-IP DHCP option 6
- Before: clients on 10.0.10/24 got only 10.0.10.1; clients on
  10.0.20/24 got only 10.0.20.1. Internal resolver outage ==
  cluster-wide DNS dark.
- After:
  - 10.0.10/24 -> [10.0.10.1, 94.140.14.14]
  - 10.0.20/24 -> [10.0.20.1, 94.140.14.14]
  - 192.168.1/24 deliberately untouched (served by TP-Link AP, not
    pfSense Kea — pfSense WAN DHCP is disabled); already ships
    [192.168.1.2, 94.140.14.14] so the end state is consistent across
    all three subnets.
- Applied via PHP: set $cfg['dhcpd']['lan']['dnsserver'] and
  $cfg['dhcpd']['opt1']['dnsserver'] as arrays. pfSense's
  services_kea4_configure() implodes the array into "data: a, b" on
  the "domain-name-servers" option-data entry (services.inc L1214).
- Verified:
  - DevVM (10.0.10.10) resolv.conf shows "nameserver 10.0.10.1" +
    "nameserver 94.140.14.14" after networkd renew.
  - k8s-node1 (10.0.20.101) same after networkctl reload +
    systemd-resolved restart.
  - Fallback drill on k8s-node1:
    `ip route add blackhole 10.0.20.1/32`;
    dig @10.0.20.1 google.com -> "no servers could be reached";
    dig @94.140.14.14 google.com -> 216.58.204.110;
    system resolver (getent hosts) succeeds via the fallback IP.
    Blackhole route removed.

Part 2 — TSIG-signed Kea DHCP-DDNS
- Before: /usr/local/etc/kea/kea-dhcp-ddns.conf had `tsig-keys: []` and
  Technitium's viktorbarzin.lan zone had update=Deny. Unauthenticated
  update vector was latent (DDNS wiring in Kea DHCP4 is actually off
  today — "DDNS: disabled" in dhcpd.log) but would activate as soon as
  anyone turned on ddnsupdate on LAN/OPT1.
- Generated HMAC-SHA256 secret, base64-encoded 32 random bytes.
- Stored in Vault: secret/viktor/kea_ddns_tsig_secret (version 27).
- Created TSIG key "kea-ddns" on primary/secondary/tertiary Technitium
  instances via /api/settings/set (tsigKeys[]).
- Updated kea-dhcp-ddns.conf on pfSense with
  tsig-keys[]={name: "kea-ddns", algorithm: "HMAC-SHA256", secret: …}
  and key-name: kea-ddns on each forward-ddns / reverse-ddns domain.
  Pre-change backup at
  /usr/local/etc/kea/kea-dhcp-ddns.conf.2026-04-19-pre-tsig.
- Configured viktorbarzin.lan + 10.0.10.in-addr.arpa +
  20.0.10.in-addr.arpa + 1.168.192.in-addr.arpa on Technitium primary:
  - update = UseSpecifiedNetworkACL
  - updateNetworkACL = [10.0.20.1, 10.0.10.1, 192.168.1.2]
  - updateSecurityPolicies = [{tsigKeyName: kea-ddns, domain: "*.", allowedTypes: [ANY]}]
  Technitium requires BOTH a source-IP match AND a valid TSIG signature.
- Verified TSIG end-to-end:
  - Signed A-record update from pfSense -> "successfully processed",
    dig returns 10.99.99.99 (log: "TSIG KeyName: kea-ddns; TSIG Algo:
    hmac-sha256; TSIG Error: NoError; RCODE: NoError").
  - Signed PTR update same zone pattern -> dig -x returns tsig-test
    FQDN.
  - Unsigned update from pfSense IP (in ACL) -> "update failed:
    REFUSED" (log: "refused a zone UPDATE request [...] due to Dynamic
    Updates Security Policy").
  - Test records cleaned up via signed nsupdate.

Safety
- pfSense config backup:
  /cf/conf/config.xml.2026-04-19-pre-kea-multi-ip (145898 bytes,
  pre-change snapshot — keep 30d).
- DDNS config backup:
  /usr/local/etc/kea/kea-dhcp-ddns.conf.2026-04-19-pre-tsig.
- TSIG secret lives only in Vault + in config.xml/kea-dhcp-ddns.conf on
  pfSense; not committed to git.

Docs
- architecture/dns.md: zone dynamic-updates section records the TSIG
  policy; Incident History gets a WS E entry.
- architecture/networking.md: DHCP Coverage table now shows the DNS
  option 6 values per subnet; pfSense block notes the TSIG-signed DDNS
  and config backup path.
- runbooks/pfsense-unbound.md: new "Kea DHCP-DDNS TSIG" section covers
  key rotation, emergency bypass, and enforcement-verification.

Closes: code-o6j
Co-Authored-By: Claude Opus 4.7 (1M context)
---
 docs/architecture/dns.md         |  9 ++--
 docs/architecture/networking.md  | 27 ++++++-----
 docs/runbooks/pfsense-unbound.md | 79 ++++++++++++++++++++++++++++++++
 3 files changed, 100 insertions(+), 15 deletions(-)

diff --git a/docs/architecture/dns.md b/docs/architecture/dns.md
index f2a7688a..97e609f3 100644
--- a/docs/architecture/dns.md
+++ b/docs/architecture/dns.md
@@ -1,6 +1,6 @@
 # DNS Architecture

-Last updated: 2026-04-19 (WS C — NodeLocal DNSCache deployed; WS D — pfSense Unbound replaces dnsmasq)
+Last updated: 2026-04-19 (WS C — NodeLocal DNSCache deployed; WS D — pfSense Unbound replaces dnsmasq; WS E — Kea multi-IP DHCP option 6 + TSIG-signed DDNS)

 ## Overview

@@ -212,7 +212,7 @@ All three pods share the `dns-server=true` label, so the DNS LoadBalancer (10.0.
 | `0.168.192.in-addr.arpa` | Primary | PTR | Reverse DNS for Valchedrym site |
 | `emrsn.org` | Primary (stub) | — | Returns NXDOMAIN locally (avoids 27K+ daily corporate query floods) |

-**Dynamic updates**: Enabled via `UseSpecifiedNetworkACL` from pfSense IPs (10.0.20.1, 10.0.10.1, 192.168.1.2) for Kea DDNS RFC 2136 updates.
+**Dynamic updates**: Enabled via `UseSpecifiedNetworkACL` from pfSense IPs (10.0.20.1, 10.0.10.1, 192.168.1.2) **AND require a valid TSIG signature** on `viktorbarzin.lan`, `10.0.10.in-addr.arpa`, `20.0.10.in-addr.arpa`, `1.168.192.in-addr.arpa`. Policy: `updateSecurityPolicies = [{tsigKeyName: "kea-ddns", domain: "*.", allowedTypes: ["ANY"]}]`. Unsigned updates from the allowlisted pfSense source IPs are refused ("Dynamic Updates Security Policy"). TSIG key `kea-ddns` (HMAC-SHA256) present on primary/secondary/tertiary; secret in Vault `secret/viktor/kea_ddns_tsig_secret`. Applied 2026-04-19 (WS E, bd `code-o6j`).

 ### Resolver Settings

@@ -375,8 +375,8 @@ The Cloudflare tunnel uses a **wildcard rule** (`*.viktorbarzin.me → Traefik`)

 Devices get automatic DNS registration without manual intervention. See [networking.md § IPAM & DNS Auto-Registration](networking.md#ipam--dns-auto-registration) for the full data flow diagram. Summary:

-1. **Kea DHCP** on pfSense assigns IP (53 reservations across 3 subnets)
-2. **Kea DDNS** sends RFC 2136 dynamic update to Technitium (A + PTR records) — immediate
+1. **Kea DHCP** on pfSense assigns IP (53 reservations across 3 subnets). DHCP option 6 (DNS servers) is pushed with two IPs per internal subnet: internal resolver + AdGuard public fallback (`94.140.14.14`) — clients survive an internal DNS outage.
+2. **Kea DDNS** sends **TSIG-signed** RFC 2136 dynamic update to Technitium (A + PTR records) — immediate. Key `kea-ddns` (HMAC-SHA256); Technitium enforces both source-IP ACL and TSIG signature on `viktorbarzin.lan` + reverse zones.
 3. **phpipam-pfsense-import** CronJob (5min) pulls Kea leases + ARP table into phpIPAM
 4. **phpipam-dns-sync** CronJob (15min) pushes named phpIPAM hosts → Technitium A + PTR, pulls Technitium PTR → phpIPAM hostnames

@@ -502,6 +502,7 @@ For external `.viktorbarzin.me` records:

 - **2026-04-14 (SEV1)**: NFS `fsid=0` caused Technitium primary data loss on restart. Fixed by migrating all 3 instances to `proxmox-lvm-encrypted`, adding zone-sync CronJob (30min AXFR). See [post-mortem](../post-mortems/2026-04-14-nfs-fsid0-dns-vault-outage.md).
 - **2026-04-19 (hardening, not outage)**: Workstream D — pfSense Unbound replaces dnsmasq as the pfSense DNS service. Unbound AXFR-slaves `viktorbarzin.lan` from Technitium so LAN-side resolution survives a full K8s outage. WAN NAT rdr `192.168.1.2:53 → 10.0.20.201` removed (Unbound listens on WAN directly). DoT upstream via Cloudflare. See `docs/runbooks/pfsense-unbound.md` and bd `code-k0d`.
+- **2026-04-19 (hardening, not outage)**: Workstream E — Kea DHCP now pushes TWO DNS IPs (internal + AdGuard public fallback `94.140.14.14`) via option 6 to the internal subnets (10.0.10/24, 10.0.20/24); 192.168.1/24 was already dual-IP (served by TP-Link). Kea DHCP-DDNS now TSIG-signs its RFC 2136 updates (key `kea-ddns`, HMAC-SHA256) and the Technitium zones require both source-IP ACL AND TSIG signature. See `docs/runbooks/pfsense-unbound.md` § "Kea DHCP-DDNS TSIG" and bd `code-o6j`.

 ## Related

diff --git a/docs/architecture/networking.md b/docs/architecture/networking.md
index d0779a17..d9723578 100644
--- a/docs/architecture/networking.md
+++ b/docs/architecture/networking.md
@@ -1,6 +1,6 @@
 # Networking Architecture

-Last updated: 2026-04-12
+Last updated: 2026-04-19 (WS E — Kea DHCP pushes dual DNS per subnet; Kea DDNS TSIG-signed)

 ## Overview

@@ -144,14 +144,14 @@ flowchart LR

 ### DHCP Coverage

-| Subnet | DHCP Server | Reservations | DDNS | Notes |
-|--------|------------|--------------|------|-------|
-| 10.0.10.0/24 (Mgmt) | Kea on pfSense | 4 (devvm, truenas, pxe, ha) | Yes | VMs with static MACs |
-| 10.0.20.0/24 (K8s) | Kea on pfSense | 7 (master, nodes 1-5, registry) | Yes | K8s cluster nodes |
-| 192.168.1.0/24 (LAN) | Kea on pfSense | 42 (all home devices) | Yes | TP-Link is dumb AP only |
-| 10.3.2.0/24 (VPN) | Static | — | No | WireGuard peers |
-| 192.168.0.0/24 (Valchedrym) | OpenWRT | — | No | Remote site |
-| 192.168.8.0/24 (London) | GL-iNet | — | No | Remote site |
+| Subnet | DHCP Server | DNS option 6 | Reservations | DDNS | Notes |
+|--------|------------|--------------|--------------|------|-------|
+| 10.0.10.0/24 (Mgmt) | Kea on pfSense | `10.0.10.1, 94.140.14.14` | 4 (devvm, truenas, pxe, ha) | Yes (TSIG) | VMs with static MACs |
+| 10.0.20.0/24 (K8s) | Kea on pfSense | `10.0.20.1, 94.140.14.14` | 7 (master, nodes 1-5, registry) | Yes (TSIG) | K8s cluster nodes |
+| 192.168.1.0/24 (LAN) | **TP-Link AP** | `192.168.1.2, 94.140.14.14` | 42 (all home devices) | Yes | pfSense Kea WAN is disabled |
+| 10.3.2.0/24 (VPN) | Static | — | — | No | WireGuard peers |
+| 192.168.0.0/24 (Valchedrym) | OpenWRT | — | — | No | Remote site |
+| 192.168.8.0/24 (London) | GL-iNet | — | — | No | Remote site |

 ## How It Works

@@ -314,9 +314,14 @@ Containerd on all K8s nodes uses `hosts.toml` to redirect pulls to the local cac

 **pfSense**:
 - Config: Not Terraform-managed (pfSense web UI / config.xml)
-- DHCP: Kea DHCP4 on all 3 subnets (VLAN 10, VLAN 20, WAN/LAN 192.168.1.0/24)
+- DHCP: Kea DHCP4 on the two internal VLANs (VLAN 10 = 10.0.10.0/24, VLAN 20 = 10.0.20.0/24). WAN/192.168.1.0/24 is served by the TP-Link dumb AP — pfSense's Kea WAN subnet is disabled.
+- **DNS option 6** (per-subnet, WS E 2026-04-19):
+  - 10.0.10.0/24 → `10.0.10.1, 94.140.14.14` (internal Unbound + AdGuard Home public fallback)
+  - 10.0.20.0/24 → `10.0.20.1, 94.140.14.14`
+  - 192.168.1.0/24 → `192.168.1.2, 94.140.14.14` (served by TP-Link, unchanged by WS E)
+  - Rationale: clients survive an internal resolver outage by falling through to AdGuard (`94.140.14.14`) — confirmed via null-route drill on 2026-04-19.
 - 42 MAC→IP reservations for 192.168.1.0/24 (all known home devices)
-- DHCP DDNS: Kea DHCP-DDNS sends RFC 2136 updates to Technitium on every lease grant (forward A + reverse PTR)
+- DHCP DDNS: Kea DHCP-DDNS sends **TSIG-signed** RFC 2136 updates to Technitium (key `kea-ddns`, HMAC-SHA256; secret in Vault `secret/viktor/kea_ddns_tsig_secret`). Zone `viktorbarzin.lan` + reverse zones require both a pfSense-source IP AND a valid TSIG signature. Config: `/usr/local/etc/kea/kea-dhcp-ddns.conf` (hand-managed on pfSense; pre-WS-E backup at `kea-dhcp-ddns.conf.2026-04-19-pre-tsig`).
 - Firewall rules: Allow K8s egress, block inter-VLAN by default

 **Technitium**:
diff --git a/docs/runbooks/pfsense-unbound.md b/docs/runbooks/pfsense-unbound.md
index 25d51a3b..19e0e5dc 100644
--- a/docs/runbooks/pfsense-unbound.md
+++ b/docs/runbooks/pfsense-unbound.md
@@ -195,6 +195,85 @@ for the exact shape — tracker `1775670025`).
 `forward-addr: 1.1.1.1@853` (no `#`) which Cloudflare rejects with
 certificate hostname mismatch.

+## Kea DHCP-DDNS TSIG (WS E, 2026-04-19)
+
+Kea DHCP-DDNS on pfSense signs its RFC 2136 dynamic updates with an
+HMAC-SHA256 TSIG key (`kea-ddns`). Technitium's `viktorbarzin.lan` zone
+and reverse zones (`10.0.10.in-addr.arpa`, `20.0.10.in-addr.arpa`,
+`1.168.192.in-addr.arpa`) require both a pfSense-source IP (10.0.20.1 /
+10.0.10.1 / 192.168.1.2) AND a valid TSIG signature.
+
+### Config locations
+
+| Side | File | Notes |
+|------|------|-------|
+| pfSense | `/usr/local/etc/kea/kea-dhcp-ddns.conf` | Hand-managed. Pre-WS-E backup: `.2026-04-19-pre-tsig`. Daemon: `kea-dhcp-ddns` (`pkill -x kea-dhcp-ddns && /usr/local/sbin/kea-dhcp-ddns -c /usr/local/etc/kea/kea-dhcp-ddns.conf -d &`) |
+| Technitium | Zone options API: `POST /api/zones/options/set?zone=&updateSecurityPolicies=kea-ddns\|*.\|ANY&updateNetworkACL=10.0.20.1,10.0.10.1,192.168.1.2&update=UseSpecifiedNetworkACL` | Set on primary; replicates to secondary/tertiary via AXFR |
+| Technitium settings | TSIG keys array: `POST /api/settings/set` with `tsigKeys: [{keyName: "kea-ddns", sharedSecret: , algorithmName: "hmac-sha256"}]` | Must be set on all 3 Technitium instances (primary, secondary, tertiary) |
+| Vault | `secret/viktor/kea_ddns_tsig_secret` | Authoritative copy of the base64 secret |
+
+### Rotating the TSIG key
+
+1. Generate a new base64 32-byte secret: `openssl rand -base64 32` (any base64-encoded blob of reasonable length works; HMAC-SHA256 truncates/pads internally).
+2. Write it to Vault: `vault kv patch secret/viktor kea_ddns_tsig_secret=`.
+3. Add the new key under a **new name** (e.g., `kea-ddns-v2`) via the Technitium settings API on all 3 instances. Do NOT overwrite `kea-ddns` while Kea still uses it — you'd orphan in-flight updates.
+4. Update `/usr/local/etc/kea/kea-dhcp-ddns.conf` on pfSense to reference both keys in `tsig-keys`, set `key-name: kea-ddns-v2` on each `forward-ddns` / `reverse-ddns` domain, restart `kea-dhcp-ddns`.
+5. Update each affected zone's `updateSecurityPolicies` to use the new key name.
+6. After a lease-renewal cycle (default Kea lease = 7200s / 2h), verify with `kubectl -n technitium exec -- grep "TSIG KeyName: kea-ddns-v2" /etc/dns/logs/.log`.
+7. Remove the old `kea-ddns` key from Technitium settings + Kea config.
+
+### Emergency TSIG bypass (if rotation breaks DDNS)
+
+If DDNS updates are failing and you cannot quickly fix the key, temporarily
+downgrade the zone policy to IP-ACL only (pfSense source IPs) without
+TSIG:
+
+```bash
+kubectl -n technitium port-forward pod/ 5380:5380 &
+TOKEN=$(curl -s -X POST http://127.0.0.1:5380/api/user/login \
+  -d "user=admin&pass=$(vault kv get -field=technitium_password secret/platform)&includeInfo=false" | jq -r .token)
+
+for Z in viktorbarzin.lan 10.0.10.in-addr.arpa 20.0.10.in-addr.arpa 1.168.192.in-addr.arpa; do
+  curl -s -X POST "http://127.0.0.1:5380/api/zones/options/set?token=$TOKEN&zone=$Z&update=UseSpecifiedNetworkACL&updateNetworkACL=10.0.20.1,10.0.10.1,192.168.1.2&updateSecurityPolicies="
+done
+```
+
+This clears `updateSecurityPolicies` while keeping the IP ACL. Updates
+now flow unsigned from pfSense IPs — **weaker** than TSIG but restores
+service. Re-enable TSIG as soon as the key issue is resolved.
+
+### Verify TSIG is enforced
+
+```bash
+# Unsigned update should fail
+nsupdate < /tmp/kea-ddns.key <