infra/docs/architecture
Viktor Barzin 9a21c0f065 [dns] DNS reliability & hardening — Technitium + CoreDNS + alerts + readiness gate
Workstreams A, B, G, H, I of the DNS reliability plan (code-q2e).
Follow-ups for C, D, E, F filed as code-2k6, code-k0d, code-o6j, code-dw8.

**Technitium (WS A)**
- Primary deployment: add Kyverno lifecycle ignore_changes on dns_config
  (secondary/tertiary already had it) — eliminates per-apply ndots drift.
- All 3 instances: raise memory request+limit from 512Mi to 1Gi (primary
  was restarting near the ceiling; CPU limits stay off per cluster policy).
- zone-sync CronJob: parse API responses, push status/failures/last-run and
  per-instance zone_count gauges to Pushgateway, fail the job on any
  create error (was silently passing).

**CoreDNS (WS B)**
- Corefile: add policy sequential + health_check 5s + max_fails 2 on root
  forward, health_check on viktorbarzin.lan forward, serve_stale
  3600s/86400s on both cache blocks — pfSense flap no longer takes the
  cluster down; upstream outage keeps cached names resolving for 24h.
- Scale deploy/coredns to 3 replicas with required pod anti-affinity on
  hostname via null_resource (hashicorp/kubernetes v3 dropped the _patch
  resources); readiness gate asserts state post-apply.
- PDB coredns with minAvailable=2.

**Observability (WS G)**
- Fix DNSQuerySpike — rewrite to compare against
  avg_over_time(dns_anomaly_total_queries[1h] offset 15m); previous
  dns_anomaly_avg_queries was computed from a per-pod /tmp file so always
  equalled the current value (alert could never fire).
- New: DNSQueryRateDropped, TechnitiumZoneSyncFailed,
  TechnitiumZoneSyncStale, TechnitiumZoneCountMismatch,
  CoreDNSForwardFailureRate.

**Post-apply readiness gate (WS H)**
- null_resource.technitium_readiness_gate runs at end of apply:
  kubectl rollout status on all 3 deployments (180s), per-pod
  /api/stats/get probe, zone-count parity across the 3 instances.
  Fails the apply on any check fail. Override: -var skip_readiness=true.

**Docs (WS I)**
- docs/architecture/dns.md: CoreDNS Corefile hardening, new alerts table,
  zone-sync metrics reference, why DNSQuerySpike was broken.
- docs/runbooks/technitium-apply.md (new): what the gate checks, failure
  modes, emergency override.

Out of scope for this commit (see beads follow-ups):
- WS C: NodeLocal DNSCache (code-2k6)
- WS D: pfSense Unbound replaces dnsmasq (code-k0d)
- WS E: Kea multi-IP DHCP + TSIG (code-o6j)
- WS F: static-client DNS fixes (code-dw8)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 14:53:41 +00:00
..
agent-task-tracking.md Add agent task tracking documentation 2026-04-15 17:11:26 +00:00
authentication.md fix: cluster healthcheck fixes + Authentik upgrade to 2026.2.2 2026-04-15 06:41:56 +00:00
automated-upgrades.md [docs] automated-upgrades: document long-lived OAuth + expiry monitoring 2026-04-18 13:00:07 +00:00
backup-dr.md [infra] Fix rewrite-body plugin + cleanup TrueNAS + version bumps 2026-04-17 05:51:52 +00:00
ci-cd.md docs: comprehensive audit and update of all architecture docs and runbooks [ci skip] 2026-04-06 13:21:05 +03:00
compute.md docs: comprehensive audit and update of all architecture docs and runbooks [ci skip] 2026-04-06 13:21:05 +03:00
databases.md docs: comprehensive audit and update of all architecture docs and runbooks [ci skip] 2026-04-06 13:21:05 +03:00
dns.md [dns] DNS reliability & hardening — Technitium + CoreDNS + alerts + readiness gate 2026-04-19 14:53:41 +00:00
homepage.md add homepage auto-discovery documentation [ci skip] 2026-03-25 13:06:43 +02:00
incident-response.md [claude-agent-service] Migrate all pipelines from DevVM SSH to K8s HTTP 2026-04-18 10:12:02 +00:00
mailserver.md [docs] Mailserver architecture — richer diagrams + steady-state accuracy [ci skip] 2026-04-19 12:40:53 +00:00
monitoring.md [infra] Fix rewrite-body plugin + cleanup TrueNAS + version bumps 2026-04-17 05:51:52 +00:00
multi-tenancy.md add architecture documentation for all infrastructure subsystems [ci skip] 2026-03-24 00:55:25 +02:00
networking.md docs: add comprehensive DNS architecture documentation 2026-04-15 18:10:27 +00:00
overview.md deprecate TrueNAS: migrate Immich NFS to Proxmox, remove all 10.0.10.15 references [ci skip] 2026-04-13 14:42:07 +00:00
secrets.md docs: comprehensive audit and update of all architecture docs and runbooks [ci skip] 2026-04-06 13:21:05 +03:00
security.md [docs] Update anti-AI and rybbit docs after rewrite-body removal 2026-04-17 21:43:13 +00:00
storage.md docs(storage): add encrypted LVM documentation 2026-04-15 21:00:37 +00:00
vpn.md docs: update Technitium DNS docs after cache optimization 2026-04-12 18:29:25 +01:00