technitium: add viktorbarzin.me apex DNS drift probe + alerts

Every internal *.viktorbarzin.me hostname (~80 services) chains through the
split-horizon `viktorbarzin.me` apex A record. If the apex drifts (ISP
rollover, accidental edit), every internal service breaks at once — the
2026-05-22 ha-sofia incident was exactly this.

This adds a backstop probe so the next drift surfaces in <10 min instead
of via user-reported outage:

- CronJob `viktorbarzin-apex-probe` in `technitium` namespace, every 5 min,
  resolves `viktorbarzin.me A` against the Technitium LB IP (10.0.20.201)
  and pushes `viktorbarzin_apex_correct` + `_last_correct_timestamp` to
  Pushgateway. Python+dnspython, ~30 LOC.

- 3 Prometheus alerts:
  - `ViktorBarzinApexDrift` (critical, 10m) — apex resolved to anything
    other than 10.0.20.200.
  - `ViktorBarzinApexProbeStale` (warning, 5m on 15m gap) — probe stopped
    succeeding.
  - `ViktorBarzinApexProbeNeverRun` (warning, 30m absent) — probe never
    reported.

- Added the new alert names to the Slack receiver matcher in both routes
  alongside EmailRoundtrip*.
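
As a rough illustration of how the three alerts relate to the two pushed metrics, the sketch below models only their instantaneous conditions in Python. The actual PromQL lives in the monitoring stack and is not part of this diff, so the function name `alert_conditions`, the use of `None` for an absent series, and the assumption that staleness is measured from `_last_correct_timestamp` are all hypothetical. The `for:` hold durations are not modelled.

```python
from typing import Optional, Set

STALE_AFTER_SECONDS = 15 * 60  # the "15m gap" named in the commit message


def alert_conditions(
    correct: Optional[int],
    last_correct_ts: Optional[float],
    now: float,
) -> Set[str]:
    """Return the alerts whose condition holds at time `now`.

    `None` stands for a series that is absent from Prometheus (assumption).
    The hold durations (10m, 5m, 30m) are deliberately left out.
    """
    active: Set[str] = set()
    if correct is None:
        # Assumed: the gauge has never been pushed.
        active.add("ViktorBarzinApexProbeNeverRun")
    elif correct == 0:
        # The last run saw an unexpected answer; the probe also
        # reports 0 when the lookup itself raised an error.
        active.add("ViktorBarzinApexDrift")
    if last_correct_ts is not None and now - last_correct_ts > STALE_AFTER_SECONDS:
        # Assumed: staleness is measured from the last correct timestamp.
        active.add("ViktorBarzinApexProbeStale")
    return active
```

Under these assumptions a sustained drift would trip `ViktorBarzinApexDrift` first and `ViktorBarzinApexProbeStale` once the last correct timestamp is more than 15 minutes old.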

Verified: rules loaded and inactive (apex is currently correct), metric
flowing, and a manually triggered probe job passed.
Viktor Barzin 2026-05-23 08:41:14 +00:00
parent 4713c3a6d9
commit 000d306542
2 changed files with 129 additions and 2 deletions

@@ -696,3 +696,106 @@ resource "kubernetes_cron_job_v1" "technitium_dns_optimization" {
}
}
# viktorbarzin.me apex DNS drift probe
# Resolves `viktorbarzin.me A` against the Technitium LoadBalancer IP every
# 5 min and pushes a Pushgateway gauge. Backstop for the entire
# split-horizon zone: every internal `*.viktorbarzin.me` CNAME chains through
# this apex, so if it drifts (ISP rollover, accidental edit), this is the
# canary. Alerts: ViktorBarzinApexDrift, ApexProbeStale, ApexProbeNeverRun
# in stacks/monitoring/.
resource "kubernetes_cron_job_v1" "viktorbarzin_apex_probe" {
  metadata {
    name      = "viktorbarzin-apex-probe"
    namespace = kubernetes_namespace.technitium.metadata[0].name
  }
  spec {
    concurrency_policy            = "Replace"
    schedule                      = "*/5 * * * *"
    successful_jobs_history_limit = 1
    failed_jobs_history_limit     = 3
    job_template {
      metadata {}
      spec {
        backoff_limit              = 1
        ttl_seconds_after_finished = 300
        template {
          metadata {}
          spec {
            container {
              name  = "probe"
              image = "docker.io/library/python:3.12-alpine"
              resources {
                requests = {
                  cpu    = "10m"
                  memory = "48Mi"
                }
                limits = {
                  memory = "96Mi"
                }
              }
              command = ["/bin/sh", "-c", <<-EOT
                pip install --quiet --disable-pip-version-check dnspython requests && python3 -c '
                import dns.resolver, requests, time, sys
                EXPECTED = {"10.0.20.200"}
                NAMESERVER = "10.0.20.201"  # Technitium LB IP
                NAME = "viktorbarzin.me"
                PUSHGATEWAY = "http://prometheus-prometheus-pushgateway.monitoring:9091/metrics/job/viktorbarzin-apex-probe"
                resolver = dns.resolver.Resolver(configure=False)
                resolver.nameservers = [NAMESERVER]
                resolver.timeout = 5
                resolver.lifetime = 8
                correct = 0
                observed = "unknown"
                try:
                    answer = resolver.resolve(NAME, "A")
                    ips = sorted(str(r) for r in answer)
                    observed = ",".join(ips)
                    correct = 1 if set(ips) <= EXPECTED and ips else 0
                    print(f"apex {NAME} -> {observed} (expected one of {EXPECTED}); correct={correct}")
                except Exception as e:
                    observed = f"error:{type(e).__name__}"
                    print(f"resolve error: {e}", file=sys.stderr)
                metric_lines = [
                    "# HELP viktorbarzin_apex_correct 1 if viktorbarzin.me apex resolves to expected IP, 0 otherwise",
                    "# TYPE viktorbarzin_apex_correct gauge",
                    f"viktorbarzin_apex_correct {correct}",
                ]
                if correct:
                    metric_lines += [
                        "# HELP viktorbarzin_apex_last_correct_timestamp Unix time of last correct resolution",
                        "# TYPE viktorbarzin_apex_last_correct_timestamp gauge",
                        f"viktorbarzin_apex_last_correct_timestamp {int(time.time())}",
                    ]
                metrics = "\n".join(metric_lines) + "\n"
                try:
                    r = requests.post(PUSHGATEWAY, data=metrics, timeout=10)
                    print(f"pushgateway: {r.status_code}")
                except Exception as e:
                    print(f"pushgateway error: {e}", file=sys.stderr)
                sys.exit(0 if correct else 1)
                '
              EOT
              ]
            }
            dns_config {
              option {
                name  = "ndots"
                value = "2"
              }
            }
            restart_policy = "OnFailure"
          }
        }
      }
    }
  }
  lifecycle {
    # KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
    ignore_changes = [spec[0].job_template[0].spec[0].template[0].spec[0].dns_config]
  }
}
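
The probe's two decisions can be exercised without a cluster: whether an answer counts as correct, and which metrics are pushed. The sketch below pulls both out into pure functions. The names `is_apex_correct` and `render_metrics` are hypothetical and do not appear in the commit; the logic mirrors the inline script above, with the `# HELP` lines omitted for brevity.

```python
from typing import Iterable, Set

EXPECTED: Set[str] = {"10.0.20.200"}  # same expected set as the probe


def is_apex_correct(ips: Iterable[str], expected: Set[str] = EXPECTED) -> int:
    """Return 1 only for a non-empty answer whose addresses are all expected."""
    answer = list(ips)
    return 1 if answer and set(answer) <= expected else 0


def render_metrics(correct: int, now: int) -> str:
    """Build the text exposition body the probe would POST to Pushgateway."""
    lines = [
        "# TYPE viktorbarzin_apex_correct gauge",
        f"viktorbarzin_apex_correct {correct}",
    ]
    if correct:
        # The timestamp is only pushed on success, as in the probe.
        lines += [
            "# TYPE viktorbarzin_apex_last_correct_timestamp gauge",
            f"viktorbarzin_apex_last_correct_timestamp {now}",
        ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(is_apex_correct(["10.0.20.200"]))                 # expected address only
    print(is_apex_correct(["10.0.20.200", "203.0.113.7"]))  # extra address
    print(render_metrics(0, 1700000000), end="")
```

Omitting the timestamp on failure appears to rely on Pushgateway's POST semantics: a POST replaces only the metrics whose names are in the pushed body, so the previous `viktorbarzin_apex_last_correct_timestamp` stays in the group and keeps ageing. A PUT would replace the whole group and drop it.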