[dns] DNS reliability & hardening — Technitium + CoreDNS + alerts + readiness gate

Workstreams A, B, G, H, I of the DNS reliability plan (code-q2e).
Follow-ups for C, D, E, F filed as code-2k6, code-k0d, code-o6j, code-dw8.

**Technitium (WS A)**
- Primary deployment: add Kyverno lifecycle ignore_changes on dns_config
  (secondary/tertiary already had it) — eliminates per-apply ndots drift.
- All 3 instances: raise memory request+limit from 512Mi to 1Gi (primary
  was restarting near the ceiling; CPU limits stay off per cluster policy).
- zone-sync CronJob: parse API responses, push status/failures/last-run and
  per-instance zone_count gauges to Pushgateway, fail the job on any
  create error (was silently passing).
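
Illustrative sketch of the reporting step above, in Python. It is not the script from this commit, which may well be shell. The status, last-run and zone-count metric names match the alert rules in the diff; the `technitium_zone_sync_failures` name, the `instance` label, the job name and the Pushgateway URL are assumptions. Pushing `last_run` only on success is also an assumption, chosen so the staleness alert tracks the last good run.

```python
import urllib.request


def render_metrics(zone_counts, failures, now):
    """Render zone-sync results in the Prometheus text exposition format."""
    status = 0 if failures == 0 else 1  # TechnitiumZoneSyncFailed fires on != 0
    lines = [
        f"technitium_zone_sync_status {status}",
        f"technitium_zone_sync_failures {failures}",  # hypothetical metric name
    ]
    if status == 0:
        # Assumption: only successful runs refresh the last-run timestamp.
        lines.append(f"technitium_zone_sync_last_run {int(now)}")
    for instance, count in sorted(zone_counts.items()):
        # "instance" is an assumed label name for the per-instance gauge.
        lines.append(f'technitium_zone_count{{instance="{instance}"}} {count}')
    return "\n".join(lines) + "\n"  # the format requires a trailing newline


def push(payload, base_url, job="technitium-zone-sync"):
    """POST to Pushgateway; POST replaces only metrics with the same names."""
    req = urllib.request.Request(
        f"{base_url}/metrics/job/{job}", data=payload.encode(), method="POST"
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return resp.status


def exit_code(failures):
    """Non-zero exit so the CronJob is marked failed on any create error."""
    return 1 if failures else 0
```

Using POST rather than PUT keeps the previous `technitium_zone_sync_last_run` sample in the group when a failed run omits it.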

**CoreDNS (WS B)**
- Corefile: add policy sequential + health_check 5s + max_fails 2 on root
  forward, health_check on viktorbarzin.lan forward, serve_stale
  3600s/86400s on both cache blocks — pfSense flap no longer takes the
  cluster down; upstream outage keeps cached names resolving for 24h.
- Scale deploy/coredns to 3 replicas with required pod anti-affinity on
  hostname via null_resource (hashicorp/kubernetes v3 dropped the _patch
  resources); readiness gate asserts state post-apply.
- PDB coredns with minAvailable=2.
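
Rough Python model of how `policy sequential` with `max_fails 2` is expected to pick an upstream, to show why a single flapping upstream no longer takes resolution down. This is a simplification of the forward plugin's documented behaviour, not its code; the class, function names and addresses are hypothetical.

```python
import random


class Upstream:
    def __init__(self, addr):
        self.addr = addr
        self.fails = 0  # consecutive failed health checks

    def record_health_check(self, ok):
        # A successful check resets the counter; a failed one increments it.
        self.fails = 0 if ok else self.fails + 1


def pick_sequential(upstreams, max_fails=2, rng=random):
    """Return the first upstream, in configured order, not considered down."""
    for up in upstreams:
        # max_fails == 0 means an upstream is never marked down.
        if max_fails == 0 or up.fails < max_fails:
            return up
    # All upstreams look down: assume health checking itself is broken
    # and fall back to a random choice.
    return rng.choice(upstreams)
```

With two upstreams, the first keeps receiving queries after one failed check and is skipped only after the second consecutive failure; one successful check brings it back.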

**Observability (WS G)**
- Fix DNSQuerySpike — rewrite to compare against
  avg_over_time(dns_anomaly_total_queries[1h] offset 15m); previous
  dns_anomaly_avg_queries was computed from a per-pod /tmp file so always
  equalled the current value (alert could never fire).
- New: DNSQueryRateDropped, TechnitiumZoneSyncFailed,
  TechnitiumZoneSyncStale, TechnitiumZoneCountMismatch,
  CoreDNSForwardFailureRate.
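
Small Python model of the old and new DNSQuerySpike conditions, plus DNSQueryRateDropped, to make the "could never fire" point concrete. The function names and sample values are illustrative; `history` stands in for the samples inside the `[1h] offset 15m` window, which ends 15 minutes before the evaluation time so the spike itself does not inflate the baseline.

```python
from statistics import mean


def old_spike(current):
    # The per-pod /tmp file was fresh on each run, so the stored
    # "average" was just the current sample.
    avg = current
    return current > 2 * avg and current > 1000


def new_spike(current, history):
    # history: samples from the 1h window ending 15m ago.
    baseline = mean(history)
    return current > 2 * baseline and current > 1000


def rate_dropped(current, history):
    # Only meaningful when the baseline itself is above 1000 queries.
    baseline = mean(history)
    return current < 0.5 * baseline and baseline > 1000
```

For any positive sample, `old_spike` compares x > 2x and is always false, while `new_spike(5000, [1200, 1300, 1100])` is true against a baseline of 1200.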

**Post-apply readiness gate (WS H)**
- null_resource.technitium_readiness_gate runs at end of apply:
  kubectl rollout status on all 3 deployments (180s), per-pod
  /api/stats/get probe, zone-count parity across the 3 instances.
  Fails the apply if any check fails. Override: -var skip_readiness=true.
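
Sketch of the gate's decision logic in Python. The real gate is a Terraform null_resource that shells out to kubectl; this only models how the three checks combine, and the function and argument names are hypothetical.

```python
def readiness_gate(rollout_ok, stats_ok, zone_counts, skip_readiness=False):
    """Return failure messages; an empty list means the apply may pass.

    rollout_ok:  deployment name -> rollout finished within the timeout
    stats_ok:    pod name -> /api/stats/get probe succeeded
    zone_counts: instance name -> number of zones reported
    """
    if skip_readiness:
        # Emergency override, mirroring -var skip_readiness=true.
        return []
    failures = []
    for name, ok in rollout_ok.items():
        if not ok:
            failures.append(f"rollout not complete: {name}")
    for pod, ok in stats_ok.items():
        if not ok:
            failures.append(f"stats probe failed: {pod}")
    # Parity: every instance must report the same zone count.
    if len(set(zone_counts.values())) > 1:
        failures.append(f"zone-count mismatch: {zone_counts}")
    return failures
```

Any non-empty result maps to a non-zero exit from the provisioner, which is what fails the apply.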

**Docs (WS I)**
- docs/architecture/dns.md: CoreDNS Corefile hardening, new alerts table,
  zone-sync metrics reference, why DNSQuerySpike was broken.
- docs/runbooks/technitium-apply.md (new): what the gate checks, failure
  modes, emergency override.

Out of scope for this commit (see beads follow-ups):
- WS C: NodeLocal DNSCache (code-2k6)
- WS D: pfSense Unbound replaces dnsmasq (code-k0d)
- WS E: Kea multi-IP DHCP + TSIG (code-o6j)
- WS F: static-client DNS fixes (code-dw8)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Viktor Barzin 2026-04-19 14:53:41 +00:00
parent a5e097088a
commit 9a21c0f065
7 changed files with 390 additions and 50 deletions

@@ -1868,13 +1868,24 @@ serverFiles:
  summary: "NetFlow processing delay p50: {{ $value | printf \"%.0f\" }}s — softflowd may be overloaded"
  - name: "DNS Anomaly Detection"
  rules:
+ # Spike detection: compare current value against its own 1h history via
+ # avg_over_time. Previous version compared against dns_anomaly_avg_queries
+ # which was computed from a per-pod /tmp file and always equalled the
+ # current value (fresh /tmp each run), so the alert could never fire.
  - alert: DNSQuerySpike
- expr: dns_anomaly_total_queries > 2 * dns_anomaly_avg_queries and dns_anomaly_total_queries > 1000
+ expr: dns_anomaly_total_queries > 2 * avg_over_time(dns_anomaly_total_queries[1h] offset 15m) and dns_anomaly_total_queries > 1000
  for: 0m
  labels:
  severity: warning
  annotations:
- summary: "DNS query spike: {{ $value | printf \"%.0f\" }} queries (>2x average)"
+ summary: "DNS query spike: {{ $value | printf \"%.0f\" }} queries (>2x 1h avg)"
+ - alert: DNSQueryRateDropped
+ expr: dns_anomaly_total_queries < 0.5 * avg_over_time(dns_anomaly_total_queries[1h] offset 15m) and avg_over_time(dns_anomaly_total_queries[1h] offset 15m) > 1000
+ for: 10m
+ labels:
+ severity: warning
+ annotations:
+ summary: "DNS query volume dropped: {{ $value | printf \"%.0f\" }} queries (<50% of 1h avg) — upstream clients may be failing to reach Technitium"
  - alert: DNSHighErrorRate
  expr: dns_anomaly_server_failure > 100
  for: 0m
@@ -1882,6 +1893,34 @@ serverFiles:
  severity: warning
  annotations:
  summary: "High DNS SERVFAIL rate: {{ $value | printf \"%.0f\" }} failures detected"
+ - alert: TechnitiumZoneSyncFailed
+ expr: technitium_zone_sync_status != 0
+ for: 30m
+ labels:
+ severity: warning
+ annotations:
+ summary: "Technitium zone-sync CronJob has reported failure for 30m — replicas may be missing zones"
+ - alert: TechnitiumZoneSyncStale
+ expr: (time() - technitium_zone_sync_last_run) > 3600
+ for: 10m
+ labels:
+ severity: warning
+ annotations:
+ summary: "Technitium zone-sync has not run successfully in >1h (last: {{ $value | humanizeDuration }} ago)"
+ - alert: TechnitiumZoneCountMismatch
+ expr: (max(technitium_zone_count) - min(technitium_zone_count)) > 0
+ for: 15m
+ labels:
+ severity: warning
+ annotations:
+ summary: "Technitium zone counts differ across instances (max-min delta: {{ $value | printf \"%.0f\" }}) — replica has drifted from primary"
+ - alert: CoreDNSForwardFailureRate
+ expr: sum(rate(coredns_forward_responses_total{rcode=~"SERVFAIL|REFUSED"}[5m])) > 0.1
+ for: 5m
+ labels:
+ severity: warning
+ annotations:
+ summary: "CoreDNS forward SERVFAIL/REFUSED rate: {{ $value | printf \"%.2f\" }}/s — upstream DNS (pfSense/public) may be unhealthy"
  - name: qbittorrent
  rules:
  - alert: MAMMouseClass