# Runbook: Applying the Technitium Terraform stack
Last updated: 2026-04-19
The `stacks/technitium/` apply has a post-apply readiness gate that asserts all three DNS instances are healthy before the apply is allowed to finish. This runbook explains what the gate checks, how to interpret failures, and how to override it for emergency maintenance.
## What the gate checks
`stacks/technitium/modules/technitium/readiness.tf` defines `null_resource.technitium_readiness_gate`. It runs after the three Technitium deployments, the DNS LoadBalancer service, and the PDB are applied, and performs three checks (a shell sketch follows the list):
- **Rollout status:** `kubectl rollout status deploy/<name> --timeout=180s` for `technitium`, `technitium-secondary`, and `technitium-tertiary`. Fails if any deployment has not reached its desired pod count within 180s.
- **Per-pod API health:** for every pod with label `dns-server=true`, executes `wget http://127.0.0.1:5380/api/stats/get` inside the pod and asserts the response contains `"status":"ok"`. Catches Technitium process hangs that TCP probes miss.
- **Zone-count parity:** queries `technitium-web`, `technitium-secondary-web`, and `technitium-tertiary-web` and counts the zones each returns. Fails if the three counts differ, which would mean `technitium-zone-sync` has drifted or a replica has lost state.
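Conceptually, the gate's three checks boil down to the shell below. This is a minimal sketch, not the code in `readiness.tf`: the namespace, deployment names, pod label, port, and stats endpoint come from this runbook, while the `/api/zones/list` path and the `TECHNITIUM_TOKEN` variable are assumptions for illustration.

```bash
#!/usr/bin/env bash
# Sketch of the gate's three checks. The real logic lives in readiness.tf;
# the zone-listing endpoint and token handling here are assumed.
set -euo pipefail
NS=technitium

# 1. Rollout status on all three deployments (180s timeout each).
for d in technitium technitium-secondary technitium-tertiary; do
  kubectl -n "$NS" rollout status "deploy/$d" --timeout=180s
done

# 2. Per-pod API health: the stats endpoint must report "status":"ok".
for pod in $(kubectl -n "$NS" get pods -l dns-server=true -o name); do
  kubectl -n "$NS" exec "$pod" -- wget -qO- http://127.0.0.1:5380/api/stats/get \
    | grep -q '"status":"ok"' || { echo "API check failed: $pod" >&2; exit 1; }
done

# 3. Zone-count parity: every instance must report the same number of zones.
#    Export TECHNITIUM_TOKEN first; endpoint path is an illustrative assumption.
ref=""
for svc in technitium-web technitium-secondary-web technitium-tertiary-web; do
  n=$(wget -qO- "http://${svc}.${NS}.svc:5380/api/zones/list?token=${TECHNITIUM_TOKEN}" \
      | grep -o '"name":' | wc -l) || n=0
  ref=${ref:-$n}
  [ "$n" -eq "$ref" ] || { echo "zone mismatch: $svc has $n, expected $ref" >&2; exit 1; }
done
```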
The gate re-runs whenever the deployment container spec, the CoreDNS Corefile, or the apply timestamp changes (see the `triggers` block in `readiness.tf`). Since the apply timestamp is one of the triggers, the checks effectively run on every apply.
## Emergency override
Set `skip_readiness=true` via terragrunt inputs or pass it directly to the Terraform apply:
```bash
cd infra/stacks/technitium
scripts/tg apply -var skip_readiness=true
```
Only use this when you need to land a Terraform change while one Technitium instance is intentionally offline (e.g., you are replacing its PVC, migrating storage, or recovering a corrupted config DB). Re-apply without the flag once the instance is back.
You can also target around the gate during emergency work:
```bash
scripts/tg apply -target=kubernetes_config_map.coredns
```
`-target` bypasses the `depends_on` chain feeding the gate, so a single-resource push does not need the gate to pass.
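The reverse can also be handy: because the apply timestamp is one of the gate's triggers, targeting the gate on its own re-runs the health checks without modifying anything else. The resource address below is an assumption based on the module layout; confirm the exact path with a state listing first.

```bash
# Re-run only the readiness checks (address assumed from the module layout;
# verify it with `scripts/tg state list | grep readiness` before relying on it).
scripts/tg apply -target=module.technitium.null_resource.technitium_readiness_gate
```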
## Failure modes and responses
| Symptom | Likely cause | Fix |
|---|---|---|
| `rollout status` times out on one deployment | Pod stuck `Pending` (node pressure / anti-affinity with other `dns-server` pods) or `ImagePullBackOff` | `kubectl describe pod` for events. If anti-affinity is blocking, confirm 3 nodes are Ready. |
| API check fails on a pod but readiness probe passes | Technitium process hung but port 53 still accepting TCP (liveness probe is `tcp_socket` on `:53`) | `kubectl delete pod <name>`; the deployment will recreate it. |
| Zone count differs between instances | `technitium-zone-sync` CronJob is failing or AXFR is blocked | `kubectl logs -n technitium -l job-name=<latest-zone-sync-job>` (see the snippet below the table for finding the latest job). Check the `TechnitiumZoneSyncFailed` alert. |
| Gate passes but external clients still cannot resolve | Gate only checks the in-pod API and intra-cluster zone parity; the external path (LoadBalancer → Technitium pod) is not tested | Run the LAN-client drill in the `docs/architecture/dns.md` troubleshooting section. |
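For the zone-count row, `<latest-zone-sync-job>` has to be filled in by hand. Assuming the CronJob's child Jobs carry `zone-sync` in their names, this finds the newest one and tails its logs:

```bash
# Grab the newest zone-sync Job and tail its logs (name pattern assumed).
latest=$(kubectl -n technitium get jobs \
  --sort-by=.metadata.creationTimestamp -o name | grep zone-sync | tail -n 1)
kubectl -n technitium logs "$latest" --tail=100
```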
## What the gate does NOT check
- External reachability through the LoadBalancer IP `10.0.20.201` (that would require a LAN-side probe); see the manual check below.
- CoreDNS health (CoreDNS is patched by `coredns.tf`, not this module's deployments; the `CoreDNSErrors` / `CoreDNSForwardFailureRate` alerts catch regressions post-apply).
- Upstream resolver health (covered by `CoreDNSForwardFailureRate`).
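To cover the first gap manually, any LAN client with `dig` can exercise the LoadBalancer path directly. The record names below are placeholders; substitute a name the Technitium instances actually serve.

```bash
# External-path check the gate skips: resolve through the LoadBalancer IP.
dig @10.0.20.201 router.lan +short    # placeholder internal record
dig @10.0.20.201 google.com +short    # forwarded/recursive path
```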
For broader end-to-end verification, see `docs/architecture/dns.md` → "Verification" section, or run the Uptime Kuma external DNS probe.