[dns] DNS reliability & hardening — Technitium + CoreDNS + alerts + readiness gate
Workstreams A, B, G, H, I of the DNS reliability plan (code-q2e). Follow-ups for C, D, E, F filed as code-2k6, code-k0d, code-o6j, code-dw8. **Technitium (WS A)** - Primary deployment: add Kyverno lifecycle ignore_changes on dns_config (secondary/tertiary already had it) — eliminates per-apply ndots drift. - All 3 instances: raise memory request+limit from 512Mi to 1Gi (primary was restarting near the ceiling; CPU limits stay off per cluster policy). - zone-sync CronJob: parse API responses, push status/failures/last-run and per-instance zone_count gauges to Pushgateway, fail the job on any create error (was silently passing). **CoreDNS (WS B)** - Corefile: add policy sequential + health_check 5s + max_fails 2 on root forward, health_check on viktorbarzin.lan forward, serve_stale 3600s/86400s on both cache blocks — pfSense flap no longer takes the cluster down; upstream outage keeps cached names resolving for 24h. - Scale deploy/coredns to 3 replicas with required pod anti-affinity on hostname via null_resource (hashicorp/kubernetes v3 dropped the _patch resources); readiness gate asserts state post-apply. - PDB coredns with minAvailable=2. **Observability (WS G)** - Fix DNSQuerySpike — rewrite to compare against avg_over_time(dns_anomaly_total_queries[1h] offset 15m); previous dns_anomaly_avg_queries was computed from a per-pod /tmp file so always equalled the current value (alert could never fire). - New: DNSQueryRateDropped, TechnitiumZoneSyncFailed, TechnitiumZoneSyncStale, TechnitiumZoneCountMismatch, CoreDNSForwardFailureRate. **Post-apply readiness gate (WS H)** - null_resource.technitium_readiness_gate runs at end of apply: kubectl rollout status on all 3 deployments (180s), per-pod /api/stats/get probe, zone-count parity across the 3 instances. Fails the apply on any check fail. Override: -var skip_readiness=true. **Docs (WS I)** - docs/architecture/dns.md: CoreDNS Corefile hardening, new alerts table, zone-sync metrics reference, why DNSQuerySpike was broken. - docs/runbooks/technitium-apply.md (new): what the gate checks, failure modes, emergency override. Out of scope for this commit (see beads follow-ups): - WS C: NodeLocal DNSCache (code-2k6) - WS D: pfSense Unbound replaces dnsmasq (code-k0d) - WS E: Kea multi-IP DHCP + TSIG (code-o6j) - WS F: static-client DNS fixes (code-dw8) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
parent
a5e097088a
commit
9a21c0f065
7 changed files with 390 additions and 50 deletions
|
|
@ -60,10 +60,15 @@ resource "kubernetes_config_map" "coredns" {
|
|||
ttl 30
|
||||
}
|
||||
prometheus :9153
|
||||
forward . 10.0.20.1 8.8.8.8 1.1.1.1
|
||||
forward . 10.0.20.1 8.8.8.8 1.1.1.1 {
|
||||
policy sequential
|
||||
health_check 5s
|
||||
max_fails 2
|
||||
}
|
||||
cache {
|
||||
success 10000 300 6
|
||||
denial 10000 300 60
|
||||
serve_stale 3600s 86400s
|
||||
}
|
||||
loop
|
||||
reload
|
||||
|
|
@ -77,10 +82,14 @@ resource "kubernetes_config_map" "coredns" {
|
|||
rcode NXDOMAIN
|
||||
fallthrough
|
||||
}
|
||||
forward . 10.96.0.53 # Technitium ClusterIP (technitium-dns-internal)
|
||||
forward . 10.96.0.53 {
|
||||
health_check 5s
|
||||
max_fails 2
|
||||
}
|
||||
cache {
|
||||
success 10000 300 6
|
||||
denial 10000 300 60
|
||||
serve_stale 3600s 86400s
|
||||
}
|
||||
}
|
||||
EOF
|
||||
|
|
@ -161,11 +170,11 @@ resource "kubernetes_deployment" "technitium" {
|
|||
name = "technitium"
|
||||
resources {
|
||||
requests = {
|
||||
cpu = "25m"
|
||||
memory = "512Mi"
|
||||
cpu = "100m"
|
||||
memory = "1Gi"
|
||||
}
|
||||
limits = {
|
||||
memory = "512Mi"
|
||||
memory = "1Gi"
|
||||
}
|
||||
}
|
||||
port {
|
||||
|
|
@ -221,6 +230,10 @@ resource "kubernetes_deployment" "technitium" {
|
|||
}
|
||||
}
|
||||
}
|
||||
lifecycle {
|
||||
# KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
|
||||
ignore_changes = [spec[0].template[0].spec[0].dns_config]
|
||||
}
|
||||
}
|
||||
|
||||
resource "kubernetes_service" "technitium-web" {
|
||||
|
|
|
|||
Loading…
Add table
Add a link
Reference in a new issue