From 0f6321ce865544f76d880105d994e3c4fb9c8270 Mon Sep 17 00:00:00 2001 From: Viktor Barzin Date: Sun, 19 Apr 2026 15:46:41 +0000 Subject: [PATCH] =?UTF-8?q?[dns]=20NodeLocal=20DNSCache=20=E2=80=94=20depl?= =?UTF-8?q?oy=20DaemonSet=20to=20all=20nodes=20(WS=20C)?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Adds a per-node DNS cache that transparently intercepts pod queries on 10.96.0.10 (kube-dns ClusterIP) AND 169.254.20.10 (link-local) via hostNetwork + NET_ADMIN iptables NOTRACK rules. Pods keep using their existing /etc/resolv.conf (nameserver 10.96.0.10) unchanged — no kubelet rollout needed for transparent mode. Layout mirrors existing stacks (technitium, descheduler, kured): stacks/nodelocal-dns/ main.tf # module wiring + IP params modules/nodelocal-dns/main.tf # SA, Services, ConfigMap, DS Key decisions: - Image: registry.k8s.io/dns/k8s-dns-node-cache:1.23.1 - Co-listens on 169.254.20.10 + 10.96.0.10 (transparent interception) - Upstream path: kube-dns-upstream (new ClusterIP svc) → CoreDNS pods (separate ClusterIP avoids cache looping back through itself) - viktorbarzin.lan zone forwards directly to Technitium ClusterIP (10.96.0.53), bypassing CoreDNS for internal names - priorityClassName: system-node-critical - tolerations: operator=Exists (runs on master + all tainted nodes) - No CPU limit (cluster-wide policy); mem requests=32Mi, limit=128Mi - Kyverno dns_config drift suppressed on the DaemonSet - Kubelet clusterDNS NOT changed — transparent mode is sufficient; rolling 5 nodes just to switch to 169.254.20.10 has no additional benefit and would only expand the blast radius. Verified: - DaemonSet 5/5 Ready across k8s-master + 4 workers - dig @169.254.20.10 idrac.viktorbarzin.lan -> 192.168.1.4 - dig @169.254.20.10 github.com -> 140.82.121.3 - Deleted all 3 CoreDNS pods; cached queries still resolved via NodeLocal DNSCache (resilience confirmed) Docs: architecture/dns.md — adds NodeLocal DNSCache to Components table, graph diagram, stacks table; rewrites pod DNS resolution paths to show the cache layer; adds troubleshooting entry. Closes: code-2k6 --- docs/architecture/dns.md | 57 ++- stacks/nodelocal-dns/main.tf | 16 + .../modules/nodelocal-dns/main.tf | 359 ++++++++++++++++++ stacks/nodelocal-dns/terragrunt.hcl | 11 + 4 files changed, 429 insertions(+), 14 deletions(-) create mode 100644 stacks/nodelocal-dns/main.tf create mode 100644 stacks/nodelocal-dns/modules/nodelocal-dns/main.tf create mode 100644 stacks/nodelocal-dns/terragrunt.hcl diff --git a/docs/architecture/dns.md b/docs/architecture/dns.md index d0491ca0..acf5abee 100644 --- a/docs/architecture/dns.md +++ b/docs/architecture/dns.md @@ -1,10 +1,10 @@ # DNS Architecture -Last updated: 2026-04-19 +Last updated: 2026-04-19 (NodeLocal DNSCache deployed — Workstream C) ## Overview -DNS is served by a split architecture: **Technitium DNS** handles internal resolution (`.viktorbarzin.lan`) and recursive lookups, while **Cloudflare DNS** manages all public domains (`.viktorbarzin.me`). Kubernetes pods use **CoreDNS** which forwards to Technitium for internal zones. All three Technitium instances run on encrypted block storage with zone replication via AXFR every 30 minutes. +DNS is served by a split architecture: **Technitium DNS** handles internal resolution (`.viktorbarzin.lan`) and recursive lookups, while **Cloudflare DNS** manages all public domains (`.viktorbarzin.me`). Kubernetes pods use **CoreDNS** which forwards to Technitium for internal zones.
All three Technitium instances run on encrypted block storage with zone replication via AXFR every 30 minutes. A **NodeLocal DNSCache** DaemonSet runs on every node and transparently intercepts pod DNS traffic, caching responses locally so pods keep resolving cached names even during CoreDNS, Technitium, or pfSense disruptions. ## Architecture Diagram @@ -29,7 +29,9 @@ graph TB end subgraph "Kubernetes Cluster" + NodeLocalDNS[NodeLocal DNSCache
DaemonSet, 5 nodes
169.254.20.10 + 10.96.0.10] CoreDNS[CoreDNS
kube-system
.:53 + viktorbarzin.lan:53] + KubeDNSUpstream[kube-dns-upstream
ClusterIP, selects CoreDNS pods] subgraph "Technitium HA (namespace: technitium)" Primary[Primary
technitium] @@ -59,6 +61,8 @@ graph TB pf_dnsmasq -->|.viktorbarzin.lan| LB_DNS pf_dnsmasq -->|public queries| CF + NodeLocalDNS -->|cache miss| KubeDNSUpstream + KubeDNSUpstream --> CoreDNS CoreDNS -->|.viktorbarzin.lan| ClusterIP CoreDNS -->|public queries| pf_dnsmasq @@ -80,6 +84,7 @@ graph TB |-----------|----------|---------|---------| | Technitium DNS | K8s namespace `technitium` | 14.3.0 | Primary internal DNS + recursive resolver | | CoreDNS | K8s `kube-system` | Cluster default | K8s service discovery + forwarding to Technitium | +| NodeLocal DNSCache | K8s `kube-system` (DaemonSet) | `k8s-dns-node-cache:1.23.1` | Per-node DNS cache, transparent interception on 10.96.0.10 + 169.254.20.10. Insulates pods from CoreDNS/Technitium/pfSense disruption. | | Cloudflare DNS | SaaS | N/A | Public domain management (~50 domains) | | pfSense dnsmasq | 10.0.20.1 | pfSense 2.7.x | DNS forwarder for management VLAN | | Kea DHCP-DDNS | 10.0.20.1 | pfSense 2.7.x | Automatic DNS registration on DHCP lease | @@ -90,6 +95,7 @@ graph TB | Stack | Path | DNS Resources | |-------|------|---------------| | Technitium | `stacks/technitium/` | 3 deployments, services, PVCs, 4 CronJobs, CoreDNS ConfigMap | +| NodeLocal DNSCache | `stacks/nodelocal-dns/` | DaemonSet (5 pods), ConfigMap, kube-dns-upstream Service, headless metrics Service | | Cloudflared | `stacks/cloudflared/` | Cloudflare DNS records (A, AAAA, CNAME, MX, TXT), tunnel config | | phpIPAM | `stacks/phpipam/` | dns-sync CronJob, pfsense-import CronJob | | pfSense | `stacks/pfsense/` | VM config (DNS config is via pfSense web UI) | @@ -99,10 +105,12 @@ graph TB ### K8s Pod → Internal Domain (.viktorbarzin.lan) ``` -Pod → CoreDNS (kube-dns:53) - → template: if 2+ labels before .viktorbarzin.lan → NXDOMAIN (ndots:5 junk filter) - → forward to Technitium ClusterIP (10.96.0.53) - → Technitium resolves from viktorbarzin.lan zone +Pod → NodeLocal DNSCache (intercepts on kube-dns:10.96.0.10) + → cache hit: serve locally (TTL 30s / stale up to 86400s via CoreDNS upstream) + → cache miss: forward to kube-dns-upstream (selects CoreDNS pods directly) + → CoreDNS: template matches 2+ labels before .viktorbarzin.lan → NXDOMAIN + → CoreDNS: forward to Technitium ClusterIP (10.96.0.53) + → Technitium resolves from viktorbarzin.lan zone ``` The ndots:5 template in CoreDNS short-circuits queries like `www.cloudflare.com.viktorbarzin.lan` (caused by K8s search domain expansion) by returning NXDOMAIN for any query with 2+ labels before `.viktorbarzin.lan`. Only single-label queries (e.g., `idrac.viktorbarzin.lan`) reach Technitium. @@ -110,9 +118,11 @@ The ndots:5 template in CoreDNS short-circuits queries like `www.cloudflare.com. ### K8s Pod → Public Domain ``` -Pod → CoreDNS (kube-dns:53) - → forward to pfSense (10.0.20.1), fallback 8.8.8.8, 1.1.1.1 - → pfSense dnsmasq → Cloudflare (1.1.1.1) +Pod → NodeLocal DNSCache (intercepts on kube-dns:10.96.0.10) + → cache hit: serve locally + → cache miss: forward to kube-dns-upstream (selects CoreDNS pods directly) + → CoreDNS: forward to pfSense (10.0.20.1), fallback 8.8.8.8, 1.1.1.1 + → pfSense dnsmasq → Cloudflare (1.1.1.1) ``` ### LAN Client (192.168.1.x) → Any Domain @@ -252,6 +262,23 @@ Technitium's **Split Horizon AddressTranslation** app post-processes DNS respons Config is synced to all 3 Technitium instances by CronJob `technitium-split-horizon-sync` (every 6h). 
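For reference, the hops in the pod resolution paths above can be exercised one at a time from any pod that has `dig`. This is a minimal sketch: the pod name is a placeholder, and the `kube-dns-upstream` ClusterIP is looked up rather than assumed.

```bash
# Hop 1: through the node-local cache (same path pods take transparently via 10.96.0.10).
kubectl exec -it some-debug-pod -- dig +short @169.254.20.10 idrac.viktorbarzin.lan

# Hop 2: straight at the upstream CoreDNS service, skipping the node-local cache.
UPSTREAM=$(kubectl -n kube-system get svc kube-dns-upstream -o jsonpath='{.spec.clusterIP}')
kubectl exec -it some-debug-pod -- dig +short "@${UPSTREAM}" idrac.viktorbarzin.lan

# Hop 3: straight at Technitium, skipping CoreDNS as well.
kubectl exec -it some-debug-pod -- dig +short @10.96.0.53 idrac.viktorbarzin.lan
```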
+## NodeLocal DNSCache + +A DaemonSet in `kube-system` (`node-local-dns`, image `registry.k8s.io/dns/k8s-dns-node-cache:1.23.1`) runs on every node including the control plane. Each pod uses `hostNetwork: true` + `NET_ADMIN` and installs iptables NOTRACK rules so it transparently serves DNS on both: + +- **169.254.20.10** — the canonical link-local IP from the upstream docs +- **10.96.0.10** — the `kube-dns` ClusterIP, so existing pods (which already use this as their nameserver) hit the on-node cache with no kubelet change + +Cache misses go to a separate `kube-dns-upstream` ClusterIP service (not `kube-dns`, to avoid looping back to ourselves) that selects the CoreDNS pods directly via `k8s-app=kube-dns`. + +Priority class is `system-node-critical`; tolerations are permissive (`operator: Exists`) so the DaemonSet runs on tainted master and other reserved nodes. Kyverno `dns_config` drift is suppressed via `ignore_changes` on the DaemonSet. + +**Caching**: `cluster.local:53` caches 9984 success / 9984 denial entries with 30s/5s TTLs. Other zones cache 30s. If CoreDNS is killed, nodes keep answering cached names — verified on 2026-04-19 by deleting all three CoreDNS pods and running `dig @169.254.20.10 idrac.viktorbarzin.lan` + `dig @169.254.20.10 github.com` from a pod (both returned answers). + +**Kubelet clusterDNS**: **Unchanged** — still `10.96.0.10`. NodeLocal DNSCache co-listens on that IP so traffic interception is transparent; switching kubelet to `169.254.20.10` would require a rolling reconfigure of every node and provides no additional cache benefit over transparent mode. + +**Metrics**: A headless Service `node-local-dns` (ClusterIP `None`) exposes each pod on port `9253` for Prometheus scraping (annotated `prometheus.io/scrape=true`). + ## CoreDNS Configuration CoreDNS is managed via Terraform in `stacks/technitium/modules/technitium/` — the Corefile ConfigMap lives in `main.tf`, and scaling/PDB are in `coredns.tf` (a `kubernetes_deployment_v1_patch` against the kubeadm-managed Deployment). @@ -401,11 +428,13 @@ The zone-sync CronJob (runs every 30min) pushes the following to the Prometheus ### DNS Not Resolving Internal Domains -1. Check Technitium pods: `kubectl get pod -n technitium` -2. Check all 3 are healthy: `kubectl get pod -n technitium -l dns-server=true` -3. Test from a pod: `kubectl exec -it -- nslookup idrac.viktorbarzin.lan 10.96.0.53` -4. Check CoreDNS logs: `kubectl logs -n kube-system -l k8s-app=kube-dns` -5. Verify ClusterIP service: `kubectl get svc -n technitium technitium-dns-internal` +1. Check NodeLocal DNSCache pods first — pod queries go through these: `kubectl -n kube-system get pod -l k8s-app=node-local-dns -o wide` +2. Check Technitium pods: `kubectl get pod -n technitium` +3. Check all 3 are healthy: `kubectl get pod -n technitium -l dns-server=true` +4. Test via NodeLocal DNSCache from a pod: `kubectl exec -it -- dig @169.254.20.10 idrac.viktorbarzin.lan` +5. Bypass NodeLocal DNSCache (test CoreDNS directly): `kubectl exec -it -- dig @ idrac.viktorbarzin.lan` (`kubectl get svc -n kube-system kube-dns-upstream`) +6. Check CoreDNS logs: `kubectl logs -n kube-system -l k8s-app=kube-dns` +7. 
Verify ClusterIP service: `kubectl get svc -n technitium technitium-dns-internal` ### LAN Clients Can't Resolve diff --git a/stacks/nodelocal-dns/main.tf b/stacks/nodelocal-dns/main.tf new file mode 100644 index 00000000..93626347 --- /dev/null +++ b/stacks/nodelocal-dns/main.tf @@ -0,0 +1,16 @@ +module "nodelocal_dns" { + source = "./modules/nodelocal-dns" + + # Canonical link-local IP from upstream NodeLocal DNSCache docs. + link_local_ip = "169.254.20.10" + + # kube-dns ClusterIP — co-listened so transparent interception works + # without mutating kubelet clusterDNS on every node. + kube_dns_ip = "10.96.0.10" + + # Technitium ClusterIP — upstream for .viktorbarzin.lan. + technitium_ip = "10.96.0.53" + + image = "registry.k8s.io/dns/k8s-dns-node-cache:1.23.1" + tier = local.tiers.core +} diff --git a/stacks/nodelocal-dns/modules/nodelocal-dns/main.tf b/stacks/nodelocal-dns/modules/nodelocal-dns/main.tf new file mode 100644 index 00000000..3dce76bc --- /dev/null +++ b/stacks/nodelocal-dns/modules/nodelocal-dns/main.tf @@ -0,0 +1,359 @@ +// NodeLocal DNSCache — per-node DNS cache as a DaemonSet. +// +// Why: insulates pods from transient CoreDNS / pfSense issues. Each node +// runs a CoreDNS-based cache listening on the link-local IP (169.254.20.10) +// AND on the kube-dns ClusterIP (10.96.0.10) via hostNetwork + NET_ADMIN +// iptables NOTRACK rules. Pods already use 10.96.0.10 as their resolver +// (verified in /etc/resolv.conf), so traffic is transparently intercepted +// on the node and served from the local cache — no kubelet clusterDNS +// change required. +// +// Upstream CoreDNS is reached via a separate ClusterIP service +// `kube-dns-upstream` that selects the CoreDNS pods directly (distinct +// ClusterIP from kube-dns so we can forward without looping back to +// ourselves). +// +// Sources: +// https://kubernetes.io/docs/tasks/administer-cluster/nodelocaldns/ +// https://github.com/kubernetes/kubernetes/blob/master/cluster/addons/dns/nodelocaldns/nodelocaldns.yaml + +variable "link_local_ip" { + type = string + default = "169.254.20.10" +} + +variable "kube_dns_ip" { + type = string + default = "10.96.0.10" +} + +variable "technitium_ip" { + type = string + default = "10.96.0.53" +} + +variable "image" { + type = string + default = "registry.k8s.io/dns/k8s-dns-node-cache:1.23.1" +} + +variable "tier" { + type = string + default = "0-core" +} + +locals { + namespace = "kube-system" + app_label = "node-local-dns" +} + +// --------------------------------------------------------------------------- +// ServiceAccount +// --------------------------------------------------------------------------- + +resource "kubernetes_service_account" "node_local_dns" { + metadata { + name = "node-local-dns" + namespace = local.namespace + labels = { + "k8s-app" = local.app_label + } + } +} + +// --------------------------------------------------------------------------- +// Upstream service — routes cache misses to CoreDNS pods (not the kube-dns +// ClusterIP, because we're co-listening on that IP ourselves).
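+// A quick post-apply sanity check (hypothetical pod name; the upstream
+// ClusterIP comes from `kubectl -n kube-system get svc kube-dns-upstream`):
+//   kubectl -n kube-system get endpoints kube-dns-upstream
+//   kubectl exec -it some-debug-pod -- dig @<kube-dns-upstream ClusterIP> kubernetes.default.svc.cluster.local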
+// --------------------------------------------------------------------------- + +resource "kubernetes_service" "kube_dns_upstream" { + metadata { + name = "kube-dns-upstream" + namespace = local.namespace + labels = { + "k8s-app" = "kube-dns" + "kubernetes.io/cluster-service" = "true" + "kubernetes.io/name" = "KubeDNSUpstream" + } + } + spec { + selector = { + "k8s-app" = "kube-dns" + } + port { + name = "dns" + port = 53 + protocol = "UDP" + target_port = "53" + } + port { + name = "dns-tcp" + port = 53 + protocol = "TCP" + target_port = "53" + } + } +} + +// --------------------------------------------------------------------------- +// Headless service — Prometheus metrics scrape target (one endpoint per node). +// --------------------------------------------------------------------------- + +resource "kubernetes_service" "node_local_dns" { + metadata { + name = "node-local-dns" + namespace = local.namespace + labels = { + "k8s-app" = local.app_label + "kubernetes.io/cluster-service" = "true" + } + annotations = { + "prometheus.io/port" = "9253" + "prometheus.io/scrape" = "true" + } + } + spec { + cluster_ip = "None" + selector = { + "k8s-app" = local.app_label + } + port { + name = "metrics" + port = 9253 + target_port = "9253" + } + } +} + +// --------------------------------------------------------------------------- +// Corefile — inline here so changes are reviewable via Terraform plan. +// The node-cache binary does string replacement for __PILLAR__ tokens at +// startup; we pre-fill LOCAL/DNS_SERVER with our real IPs and leave +// __PILLAR__CLUSTER__DNS__ for the runtime substitution from +// kube-dns-upstream endpoints. +// --------------------------------------------------------------------------- + +resource "kubernetes_config_map" "node_local_dns" { + metadata { + name = "node-local-dns" + namespace = local.namespace + labels = { + "k8s-app" = local.app_label + } + } + data = { + "Corefile" = <<-EOF + cluster.local:53 { + errors + cache { + success 9984 30 + denial 9984 5 + } + reload + loop + bind ${var.link_local_ip} ${var.kube_dns_ip} + forward . __PILLAR__CLUSTER__DNS__ { + force_tcp + } + prometheus :9253 + health ${var.link_local_ip}:8080 + } + in-addr.arpa:53 { + errors + cache 30 + reload + loop + bind ${var.link_local_ip} ${var.kube_dns_ip} + forward . __PILLAR__CLUSTER__DNS__ { + force_tcp + } + prometheus :9253 + } + ip6.arpa:53 { + errors + cache 30 + reload + loop + bind ${var.link_local_ip} ${var.kube_dns_ip} + forward . __PILLAR__CLUSTER__DNS__ { + force_tcp + } + prometheus :9253 + } + viktorbarzin.lan:53 { + errors + cache 30 + reload + loop + bind ${var.link_local_ip} ${var.kube_dns_ip} + forward . ${var.technitium_ip} + prometheus :9253 + } + .:53 { + errors + cache 30 + reload + loop + bind ${var.link_local_ip} ${var.kube_dns_ip} + forward . 
__PILLAR__CLUSTER__DNS__ + prometheus :9253 + } + EOF + } +} + +// --------------------------------------------------------------------------- +// DaemonSet +// --------------------------------------------------------------------------- + +resource "kubernetes_daemon_set_v1" "node_local_dns" { + metadata { + name = "node-local-dns" + namespace = local.namespace + labels = { + "k8s-app" = local.app_label + tier = var.tier + } + } + spec { + selector { + match_labels = { + "k8s-app" = local.app_label + } + } + strategy { + type = "RollingUpdate" + rolling_update { + max_unavailable = "10%" + } + } + template { + metadata { + labels = { + "k8s-app" = local.app_label + } + annotations = { + # Ensure pods pick up Corefile changes without waiting for a + # reload (CoreDNS reload plugin picks up changes within 30s, + # but a hash annotation forces an immediate rollout). + "node-local-dns/corefile-hash" = sha256(kubernetes_config_map.node_local_dns.data["Corefile"]) + } + } + spec { + priority_class_name = "system-node-critical" + service_account_name = kubernetes_service_account.node_local_dns.metadata[0].name + host_network = true + dns_policy = "Default" + termination_grace_period_seconds = 0 + + toleration { + operator = "Exists" + } + + container { + name = "node-cache" + image = var.image + image_pull_policy = "IfNotPresent" + + resources { + # Per cluster CPU-limits-removed policy: requests only, no limit. + requests = { + cpu = "25m" + memory = "32Mi" + } + limits = { + memory = "128Mi" + } + } + + args = [ + "-localip", + "${var.link_local_ip},${var.kube_dns_ip}", + "-conf", + "/etc/Corefile", + "-upstreamsvc", + kubernetes_service.kube_dns_upstream.metadata[0].name, + "-skipteardown=true", + ] + + security_context { + capabilities { + add = ["NET_ADMIN"] + } + } + + port { + name = "dns" + container_port = 53 + protocol = "UDP" + } + port { + name = "dns-tcp" + container_port = 53 + protocol = "TCP" + } + port { + name = "metrics" + container_port = 9253 + protocol = "TCP" + } + + liveness_probe { + http_get { + host = var.link_local_ip + path = "/health" + port = "8080" + } + initial_delay_seconds = 60 + timeout_seconds = 5 + } + + volume_mount { + name = "xtables-lock" + mount_path = "/run/xtables.lock" + read_only = false + } + volume_mount { + name = "config-volume" + mount_path = "/etc/coredns" + } + volume_mount { + name = "kube-dns-config" + mount_path = "/etc/kube-dns" + } + } + + volume { + name = "xtables-lock" + host_path { + path = "/run/xtables.lock" + type = "FileOrCreate" + } + } + volume { + name = "kube-dns-config" + config_map { + name = "kube-dns" + optional = true + } + } + volume { + name = "config-volume" + config_map { + name = kubernetes_config_map.node_local_dns.metadata[0].name + items { + key = "Corefile" + path = "Corefile.base" + } + } + } + } + } + } + + lifecycle { + # KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with + # ndots=2 on every pod; ignoring avoids spurious plan drift. + ignore_changes = [spec[0].template[0].spec[0].dns_config] + } +} diff --git a/stacks/nodelocal-dns/terragrunt.hcl b/stacks/nodelocal-dns/terragrunt.hcl new file mode 100644 index 00000000..ac8dda28 --- /dev/null +++ b/stacks/nodelocal-dns/terragrunt.hcl @@ -0,0 +1,11 @@ +include "root" { + path = find_in_parent_folders() +} + +# CoreDNS ConfigMap + kube-dns Service live in the technitium stack. +# NodeLocal DNSCache co-listens on the kube-dns ClusterIP (10.96.0.10) +# via hostNetwork + iptables NOTRACK — no kubelet clusterDNS change needed. 
+dependency "technitium" { + config_path = "../technitium" + skip_outputs = true +}