Workstreams A, B, G, H, I of the DNS reliability plan (code-q2e). Follow-ups for C, D, E, F filed as code-2k6, code-k0d, code-o6j, code-dw8. **Technitium (WS A)** - Primary deployment: add Kyverno lifecycle ignore_changes on dns_config (secondary/tertiary already had it) — eliminates per-apply ndots drift. - All 3 instances: raise memory request+limit from 512Mi to 1Gi (primary was restarting near the ceiling; CPU limits stay off per cluster policy). - zone-sync CronJob: parse API responses, push status/failures/last-run and per-instance zone_count gauges to Pushgateway, fail the job on any create error (was silently passing). **CoreDNS (WS B)** - Corefile: add policy sequential + health_check 5s + max_fails 2 on root forward, health_check on viktorbarzin.lan forward, serve_stale 3600s/86400s on both cache blocks — pfSense flap no longer takes the cluster down; upstream outage keeps cached names resolving for 24h. - Scale deploy/coredns to 3 replicas with required pod anti-affinity on hostname via null_resource (hashicorp/kubernetes v3 dropped the _patch resources); readiness gate asserts state post-apply. - PDB coredns with minAvailable=2. **Observability (WS G)** - Fix DNSQuerySpike — rewrite to compare against avg_over_time(dns_anomaly_total_queries[1h] offset 15m); previous dns_anomaly_avg_queries was computed from a per-pod /tmp file so always equalled the current value (alert could never fire). - New: DNSQueryRateDropped, TechnitiumZoneSyncFailed, TechnitiumZoneSyncStale, TechnitiumZoneCountMismatch, CoreDNSForwardFailureRate. **Post-apply readiness gate (WS H)** - null_resource.technitium_readiness_gate runs at end of apply: kubectl rollout status on all 3 deployments (180s), per-pod /api/stats/get probe, zone-count parity across the 3 instances. Fails the apply on any check fail. Override: -var skip_readiness=true. **Docs (WS I)** - docs/architecture/dns.md: CoreDNS Corefile hardening, new alerts table, zone-sync metrics reference, why DNSQuerySpike was broken. - docs/runbooks/technitium-apply.md (new): what the gate checks, failure modes, emergency override. Out of scope for this commit (see beads follow-ups): - WS C: NodeLocal DNSCache (code-2k6) - WS D: pfSense Unbound replaces dnsmasq (code-k0d) - WS E: Kea multi-IP DHCP + TSIG (code-o6j) - WS F: static-client DNS fixes (code-dw8) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
69 lines
2.3 KiB
HCL
69 lines
2.3 KiB
HCL
# =============================================================================
|
|
# CoreDNS — Scaling, Anti-Affinity, PDB
|
|
# =============================================================================
|
|
#
|
|
# CoreDNS is kube-system / kubeadm-managed. We only patch replicas + affinity
|
|
# here (the Corefile ConfigMap is in main.tf). The hashicorp/kubernetes v3
|
|
# provider removed the *_patch resource family from v2, so we apply the
|
|
# desired state via `kubectl patch` inside a null_resource. The patch is
|
|
# idempotent — a no-op when the deployment already matches.
|
|
#
|
|
# Kubeadm upgrades preserve the replica count on the existing deployment but
|
|
# reset the pod template (including affinity) from the ClusterConfiguration.
|
|
# Re-running `terraform apply` re-asserts the affinity patch; the readiness
|
|
# gate in `readiness.tf` catches regressions if the patch is reverted.
|
|
|
|
resource "null_resource" "coredns_scale_and_affinity" {
|
|
triggers = {
|
|
replicas = 3
|
|
spec_hash = sha256(file("${path.module}/coredns.tf"))
|
|
}
|
|
|
|
provisioner "local-exec" {
|
|
command = <<-BASH
|
|
set -euo pipefail
|
|
# 1. Scale to 3 replicas.
|
|
kubectl -n kube-system scale deploy/coredns --replicas=3
|
|
|
|
# 2. Switch anti-affinity from preferred → required on hostname.
|
|
kubectl -n kube-system patch deploy/coredns --type=json -p='[
|
|
{
|
|
"op": "replace",
|
|
"path": "/spec/template/spec/affinity/podAntiAffinity",
|
|
"value": {
|
|
"requiredDuringSchedulingIgnoredDuringExecution": [
|
|
{
|
|
"labelSelector": {
|
|
"matchExpressions": [
|
|
{"key": "k8s-app", "operator": "In", "values": ["kube-dns"]}
|
|
]
|
|
},
|
|
"topologyKey": "kubernetes.io/hostname"
|
|
}
|
|
]
|
|
}
|
|
}
|
|
]' || true
|
|
|
|
# 3. Wait for rollout to settle.
|
|
kubectl -n kube-system rollout status deploy/coredns --timeout=120s
|
|
BASH
|
|
interpreter = ["/bin/bash", "-c"]
|
|
}
|
|
}
|
|
|
|
# PDB — keep at least 2 CoreDNS pods running during voluntary disruptions.
|
|
resource "kubernetes_pod_disruption_budget_v1" "coredns" {
|
|
metadata {
|
|
name = "coredns"
|
|
namespace = "kube-system"
|
|
}
|
|
spec {
|
|
min_available = "2"
|
|
selector {
|
|
match_labels = {
|
|
"k8s-app" = "kube-dns"
|
|
}
|
|
}
|
|
}
|
|
}
|