[dns] DNS reliability & hardening — Technitium + CoreDNS + alerts + readiness gate

Workstreams A, B, G, H, I of the DNS reliability plan (code-q2e).
Follow-ups for C, D, E, F filed as code-2k6, code-k0d, code-o6j, code-dw8.

**Technitium (WS A)**
- Primary deployment: add Kyverno lifecycle ignore_changes on dns_config
  (secondary/tertiary already had it); this eliminates per-apply ndots drift.
- All 3 instances: raise memory request+limit from 512Mi to 1Gi (primary was
  restarting near the ceiling; CPU limits stay off per cluster policy).
- zone-sync CronJob: parse API responses, push status/failures/last-run and
  per-instance zone_count gauges to Pushgateway, and fail the job on any
  create error (it was silently passing).

**CoreDNS (WS B)**
- Corefile: add policy sequential + health_check 5s + max_fails 2 on the root
  forward, health_check on the viktorbarzin.lan forward, and serve_stale
  3600s/86400s on both cache blocks. A pfSense flap no longer takes the
  cluster down, and during an upstream outage cached names keep resolving
  for 24h.
- Scale deploy/coredns to 3 replicas with required pod anti-affinity on
  hostname via null_resource (hashicorp/kubernetes v3 dropped the _patch
  resources); the readiness gate asserts state post-apply.
- PDB coredns with minAvailable=2.

**Observability (WS G)**
- Fix DNSQuerySpike: rewrite it to compare against
  avg_over_time(dns_anomaly_total_queries[1h] offset 15m). The previous
  dns_anomaly_avg_queries was computed from a per-pod /tmp file, so it always
  equalled the current value (the alert could never fire).
- New: DNSQueryRateDropped, TechnitiumZoneSyncFailed, TechnitiumZoneSyncStale,
  TechnitiumZoneCountMismatch, CoreDNSForwardFailureRate.

**Post-apply readiness gate (WS H)**
- null_resource.technitium_readiness_gate runs at the end of apply: kubectl
  rollout status on all 3 deployments (180s), a per-pod /api/stats/get probe,
  and zone-count parity across the 3 instances. Fails the apply if any check
  fails. Override: -var skip_readiness=true.

**Docs (WS I)**
- docs/architecture/dns.md: CoreDNS Corefile hardening, new alerts table,
  zone-sync metrics reference, why DNSQuerySpike was broken.
- docs/runbooks/technitium-apply.md (new): what the gate checks, failure
  modes, emergency override.

Out of scope for this commit (see beads follow-ups):
- WS C: NodeLocal DNSCache (code-2k6)
- WS D: pfSense Unbound replaces dnsmasq (code-k0d)
- WS E: Kea multi-IP DHCP + TSIG (code-o6j)
- WS F: static-client DNS fixes (code-dw8)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
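The zone-sync metrics described in the commit message could be rendered roughly as below. This is a minimal sketch, not the CronJob's actual code: the metric names `technitium_zone_sync_status`, `technitium_zone_sync_last_run` and `technitium_zone_count` come from the alert rules in this commit, while the `technitium_zone_sync_failures` name, the `server` label and all function names are hypothetical. The sketch only builds the text-exposition body; a real job would send it to a Pushgateway `/metrics/job/<job>` endpoint.

```python
import time


def render_zone_sync_metrics(zone_counts, failures, last_run_ts):
    """Render zone-sync gauges in the Prometheus text exposition format.

    zone_counts: mapping of Technitium instance name -> number of zones.
    failures:    number of zone create errors seen in this run.
    last_run_ts: Unix timestamp of this run.
    """
    # 0 means success: the TechnitiumZoneSyncFailed rule fires on "!= 0".
    status = 0 if failures == 0 else 1
    lines = [
        "# TYPE technitium_zone_sync_status gauge",
        f"technitium_zone_sync_status {status}",
        # Hypothetical metric name; the commit only says "failures".
        "# TYPE technitium_zone_sync_failures gauge",
        f"technitium_zone_sync_failures {failures}",
        "# TYPE technitium_zone_sync_last_run gauge",
        f"technitium_zone_sync_last_run {int(last_run_ts)}",
        "# TYPE technitium_zone_count gauge",
    ]
    for name, count in sorted(zone_counts.items()):
        # "server" is a hypothetical label name for the per-instance gauge.
        lines.append(f'technitium_zone_count{{server="{name}"}} {count}')
    # The exposition format requires a trailing newline.
    return "\n".join(lines) + "\n"


def zone_count_delta(zone_counts):
    """Mirror of the TechnitiumZoneCountMismatch expression (max - min)."""
    return max(zone_counts.values()) - min(zone_counts.values())


def job_exit_code(failures):
    """Non-zero exit when any create error occurred, so the job fails."""
    return 1 if failures else 0


counts = {"primary": 12, "secondary": 12, "tertiary": 11}
print(render_zone_sync_metrics(counts, failures=1, last_run_ts=time.time()), end="")
print("zone count delta:", zone_count_delta(counts))
```

Keeping the status gauge separate from the failure count lets the alert stay a simple `!= 0` comparison, and the non-zero exit code is what makes the job itself fail on any create error.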
2026-04-19 14:53:41 +00:00
commit 9a21c0f065
parent a5e097088a
7 changed files with 390 additions and 50 deletions
--- a/stacks/monitoring/modules/monitoring/prometheus_chart_values.tpl
+++ b/stacks/monitoring/modules/monitoring/prometheus_chart_values.tpl
@@ -1868,13 +1868,24 @@ serverFiles:
              summary: "NetFlow processing delay p50: {{ $value | printf \"%.0f\" }}s — softflowd may be overloaded"
      - name: "DNS Anomaly Detection"
        rules:
+          # Spike detection: compare current value against its own 1h history via
+          # avg_over_time. Previous version compared against dns_anomaly_avg_queries
+          # which was computed from a per-pod /tmp file and always equalled the
+          # current value (fresh /tmp each run), so the alert could never fire.
          - alert: DNSQuerySpike
-            expr: dns_anomaly_total_queries > 2 * dns_anomaly_avg_queries and dns_anomaly_total_queries > 1000
+            expr: dns_anomaly_total_queries > 2 * avg_over_time(dns_anomaly_total_queries[1h] offset 15m) and dns_anomaly_total_queries > 1000
            for: 0m
            labels:
              severity: warning
            annotations:
-              summary: "DNS query spike: {{ $value | printf \"%.0f\" }} queries (>2x average)"
+              summary: "DNS query spike: {{ $value | printf \"%.0f\" }} queries (>2x 1h avg)"
+          - alert: DNSQueryRateDropped
+            expr: dns_anomaly_total_queries < 0.5 * avg_over_time(dns_anomaly_total_queries[1h] offset 15m) and avg_over_time(dns_anomaly_total_queries[1h] offset 15m) > 1000
+            for: 10m
+            labels:
+              severity: warning
+            annotations:
+              summary: "DNS query volume dropped: {{ $value | printf \"%.0f\" }} queries (<50% of 1h avg) — upstream clients may be failing to reach Technitium"
          - alert: DNSHighErrorRate
            expr: dns_anomaly_server_failure > 100
            for: 0m
@@ -1882,6 +1893,34 @@ serverFiles:
              severity: warning
            annotations:
              summary: "High DNS SERVFAIL rate: {{ $value | printf \"%.0f\" }} failures detected"
+          - alert: TechnitiumZoneSyncFailed
+            expr: technitium_zone_sync_status != 0
+            for: 30m
+            labels:
+              severity: warning
+            annotations:
+              summary: "Technitium zone-sync CronJob has reported failure for 30m — replicas may be missing zones"
+          - alert: TechnitiumZoneSyncStale
+            expr: (time() - technitium_zone_sync_last_run) > 3600
+            for: 10m
+            labels:
+              severity: warning
+            annotations:
+              summary: "Technitium zone-sync has not run successfully in >1h (last: {{ $value | humanizeDuration }} ago)"
+          - alert: TechnitiumZoneCountMismatch
+            expr: (max(technitium_zone_count) - min(technitium_zone_count)) > 0
+            for: 15m
+            labels:
+              severity: warning
+            annotations:
+              summary: "Technitium zone counts differ across instances (max-min delta: {{ $value | printf \"%.0f\" }}) — replica has drifted from primary"
+          - alert: CoreDNSForwardFailureRate
+            expr: sum(rate(coredns_forward_responses_total{rcode=~"SERVFAIL|REFUSED"}[5m])) > 0.1
+            for: 5m
+            labels:
+              severity: warning
+            annotations:
+              summary: "CoreDNS forward SERVFAIL/REFUSED rate: {{ $value | printf \"%.2f\" }}/s — upstream DNS (pfSense/public) may be unhealthy"
      - name: qbittorrent
        rules:
          - alert: MAMMouseClass
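
The DNSQuerySpike and DNSQueryRateDropped expressions in the hunks above can be approximated offline to see why the old rule could never fire. The sketch below is an illustration under stated assumptions, not Prometheus itself: samples are plain `(unix_ts, value)` tuples, the window is treated as left-open, the `for:` durations are not modeled, and all function names are hypothetical.

```python
from statistics import mean

WINDOW_S = 3600  # corresponds to the [1h] range
OFFSET_S = 900   # corresponds to "offset 15m"


def baseline(samples, now):
    """Approximate avg_over_time(metric[1h] offset 15m) at time `now`."""
    end = now - OFFSET_S
    start = end - WINDOW_S
    window = [v for t, v in samples if start < t <= end]
    return mean(window) if window else None


def current_value(samples, now):
    """Most recent sample value at or before `now` (staleness is ignored)."""
    past = [s for s in samples if s[0] <= now]
    return max(past, key=lambda s: s[0])[1] if past else None


def query_spike(samples, now):
    """New DNSQuerySpike: current > 2 * baseline and current > 1000."""
    cur, base = current_value(samples, now), baseline(samples, now)
    if cur is None or base is None:
        return False  # no history yet, so the expression yields no result
    return cur > 2 * base and cur > 1000


def query_rate_dropped(samples, now):
    """DNSQueryRateDropped: current < 0.5 * baseline and baseline > 1000."""
    cur, base = current_value(samples, now), baseline(samples, now)
    if cur is None or base is None:
        return False
    return cur < 0.5 * base and base > 1000


def old_query_spike(cur):
    """Old rule: per the commit message the 'average' equalled the current value."""
    avg = cur
    return cur > 2 * avg and cur > 1000  # never true for any cur


# Two hours of steady traffic sampled every 5 minutes, then a burst.
steady = [(t, 1200.0) for t in range(0, 7200, 300)]
burst = steady + [(7200, 3000.0)]
print("new rule fires:", query_spike(burst, 7200))
print("old rule fires:", old_query_spike(3000.0))
```

The 15-minute offset keeps the most recent samples out of the baseline, so a spike or drop that has already started does not pull the average toward the current value.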