[dns] DNS reliability & hardening — Technitium + CoreDNS + alerts + readiness gate
Workstreams A, B, G, H, I of the DNS reliability plan (code-q2e). Follow-ups for C, D, E, F filed as code-2k6, code-k0d, code-o6j, code-dw8. **Technitium (WS A)** - Primary deployment: add Kyverno lifecycle ignore_changes on dns_config (secondary/tertiary already had it) — eliminates per-apply ndots drift. - All 3 instances: raise memory request+limit from 512Mi to 1Gi (primary was restarting near the ceiling; CPU limits stay off per cluster policy). - zone-sync CronJob: parse API responses, push status/failures/last-run and per-instance zone_count gauges to Pushgateway, fail the job on any create error (was silently passing). **CoreDNS (WS B)** - Corefile: add policy sequential + health_check 5s + max_fails 2 on root forward, health_check on viktorbarzin.lan forward, serve_stale 3600s/86400s on both cache blocks — pfSense flap no longer takes the cluster down; upstream outage keeps cached names resolving for 24h. - Scale deploy/coredns to 3 replicas with required pod anti-affinity on hostname via null_resource (hashicorp/kubernetes v3 dropped the _patch resources); readiness gate asserts state post-apply. - PDB coredns with minAvailable=2. **Observability (WS G)** - Fix DNSQuerySpike — rewrite to compare against avg_over_time(dns_anomaly_total_queries[1h] offset 15m); previous dns_anomaly_avg_queries was computed from a per-pod /tmp file so always equalled the current value (alert could never fire). - New: DNSQueryRateDropped, TechnitiumZoneSyncFailed, TechnitiumZoneSyncStale, TechnitiumZoneCountMismatch, CoreDNSForwardFailureRate. **Post-apply readiness gate (WS H)** - null_resource.technitium_readiness_gate runs at end of apply: kubectl rollout status on all 3 deployments (180s), per-pod /api/stats/get probe, zone-count parity across the 3 instances. Fails the apply on any check fail. Override: -var skip_readiness=true. **Docs (WS I)** - docs/architecture/dns.md: CoreDNS Corefile hardening, new alerts table, zone-sync metrics reference, why DNSQuerySpike was broken. - docs/runbooks/technitium-apply.md (new): what the gate checks, failure modes, emergency override. Out of scope for this commit (see beads follow-ups): - WS C: NodeLocal DNSCache (code-2k6) - WS D: pfSense Unbound replaces dnsmasq (code-k0d) - WS E: Kea multi-IP DHCP + TSIG (code-o6j) - WS F: static-client DNS fixes (code-dw8) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
parent
a5e097088a
commit
9a21c0f065
7 changed files with 390 additions and 50 deletions
|
|
@ -1,6 +1,6 @@
|
|||
# DNS Architecture
|
||||
|
||||
Last updated: 2026-04-15
|
||||
Last updated: 2026-04-19
|
||||
|
||||
## Overview
|
||||
|
||||
|
|
@ -254,27 +254,42 @@ Config is synced to all 3 Technitium instances by CronJob `technitium-split-hori
|
|||
|
||||
## CoreDNS Configuration
|
||||
|
||||
CoreDNS is managed via a Terraform `kubernetes_config_map` resource in `stacks/technitium/modules/technitium/main.tf`.
|
||||
CoreDNS is managed via Terraform in `stacks/technitium/modules/technitium/` — the Corefile ConfigMap lives in `main.tf`, and scaling/PDB are in `coredns.tf` (a `kubernetes_deployment_v1_patch` against the kubeadm-managed Deployment).
|
||||
|
||||
```
|
||||
.:53 {
|
||||
errors / health / ready
|
||||
kubernetes cluster.local in-addr.arpa ip6.arpa # K8s service discovery
|
||||
prometheus :9153 # Metrics
|
||||
forward . 10.0.20.1 8.8.8.8 1.1.1.1 # pfSense → Google → Cloudflare
|
||||
cache (success 10000 300, denial 10000 300)
|
||||
forward . 10.0.20.1 8.8.8.8 1.1.1.1 {
|
||||
policy sequential # try upstreams in order
|
||||
health_check 5s # mark unhealthy in 5s
|
||||
max_fails 2
|
||||
}
|
||||
cache {
|
||||
success 10000 300 6
|
||||
denial 10000 300 60
|
||||
serve_stale 3600s 86400s # resilience during upstream outage
|
||||
}
|
||||
loop / reload / loadbalance
|
||||
}
|
||||
|
||||
viktorbarzin.lan:53 {
|
||||
template: .*\..*\.viktorbarzin\.lan\.$ → NXDOMAIN # ndots:5 junk filter
|
||||
forward . 10.96.0.53 # Technitium ClusterIP
|
||||
cache (success 10000 300, denial 10000 300)
|
||||
forward . 10.96.0.53 { # Technitium ClusterIP
|
||||
health_check 5s
|
||||
max_fails 2
|
||||
}
|
||||
cache (success 10000 300, denial 10000 300, serve_stale 3600s 86400s)
|
||||
}
|
||||
```
|
||||
|
||||
**Scaling**: 3 replicas, `required` anti-affinity on `kubernetes.io/hostname` (spread across 3 distinct nodes). PodDisruptionBudget `coredns` with `minAvailable=2`.
|
||||
|
||||
**Kyverno ndots injection**: A Kyverno policy injects `ndots:2` on all pods cluster-wide to reduce search domain expansion noise. The template regex is a second layer of defense for any queries that still get expanded.
|
||||
|
||||
**Failover behaviour**: With `policy sequential` on the root forward block, CoreDNS tries pfSense first; if `health_check 5s` detects pfSense as down, it fails over to 8.8.8.8 then 1.1.1.1 within ~5s rather than timing out per-query. Combined with `serve_stale`, pods keep resolving cached names for up to 24h even with full upstream failure.
|
||||
|
||||
## Cloudflare DNS — External Domains
|
||||
|
||||
All public domains are under the `viktorbarzin.me` zone. DNS records are **auto-created per service** via the `ingress_factory` module's `dns_type` parameter. A small number of records (Helm-managed ingresses, special cases) remain centrally managed in `config.tfvars`.
|
||||
|
|
@ -360,9 +375,28 @@ Vault DB engine rotates password
|
|||
| Metric Source | Dashboard | Alerts |
|
||||
|---------------|-----------|--------|
|
||||
| Technitium query logs (PostgreSQL) | Grafana `technitium-dns.json` | — |
|
||||
| CoreDNS Prometheus metrics (:9153) | Grafana CoreDNS dashboard | — |
|
||||
| CoreDNS Prometheus metrics (:9153) | Grafana CoreDNS dashboard | `CoreDNSErrors`, `CoreDNSForwardFailureRate` |
|
||||
| Technitium zone-sync CronJob (Pushgateway) | — | `TechnitiumZoneSyncFailed`, `TechnitiumZoneSyncStale`, `TechnitiumZoneCountMismatch` |
|
||||
| Technitium DNS pod availability | — | `TechnitiumDNSDown` |
|
||||
| `dns-anomaly-monitor` CronJob (Pushgateway) | — | `DNSQuerySpike`, `DNSQueryRateDropped`, `DNSHighErrorRate` |
|
||||
| Uptime Kuma | External monitors for all proxied domains | ExternalAccessDivergence (15min) |
|
||||
|
||||
### Metrics pushed by `technitium-zone-sync`
|
||||
|
||||
The zone-sync CronJob (runs every 30min) pushes the following to the Prometheus Pushgateway under `job=technitium-zone-sync`:
|
||||
|
||||
| Metric | Labels | Meaning |
|
||||
|--------|--------|---------|
|
||||
| `technitium_zone_sync_status` | — | 0 = last run succeeded, 1 = at least one zone failed to create |
|
||||
| `technitium_zone_sync_failures` | — | Number of zones that failed to create this run |
|
||||
| `technitium_zone_sync_last_run` | — | Unix timestamp of last run (used by `TechnitiumZoneSyncStale`) |
|
||||
| `technitium_zone_count` | `instance=primary\|<replica-host>` | Zone count on each Technitium instance (drives `TechnitiumZoneCountMismatch`) |
|
||||
|
||||
### DNS alert rewrites
|
||||
|
||||
- `DNSQuerySpike` was previously broken: it compared current queries against `dns_anomaly_avg_queries`, which was computed from a per-pod `/tmp/dns_avg` file. Each CronJob run started with a fresh `/tmp`, so `NEW_AVG == TOTAL_QUERIES` every time and the spike condition could never fire. Rewritten to use `avg_over_time(dns_anomaly_total_queries[1h] offset 15m)` which compares against the actual 1h Prometheus history.
|
||||
- `DNSQueryRateDropped` (new): fires when query rate drops below 50% of 1h average — upstream clients may be failing to reach Technitium.
|
||||
|
||||
## Troubleshooting
|
||||
|
||||
### DNS Not Resolving Internal Domains
|
||||
|
|
|
|||
Loading…
Add table
Add a link
Reference in a new issue