technitium: add viktorbarzin.me apex DNS drift probe + alerts

Every internal *.viktorbarzin.me hostname (~80 services) chains through the split-horizon `viktorbarzin.me` apex A record. If the apex drifts (ISP rollover, accidental edit), every internal service breaks at once; the 2026-05-22 ha-sofia incident was exactly this.

This adds a backstop probe so the next drift surfaces as an alert within about 15 min (5 min probe interval plus the 10m alert window) instead of via a user-reported outage:

- CronJob `viktorbarzin-apex-probe` in the `technitium` namespace, every 5 min, resolves `viktorbarzin.me A` against the Technitium LB IP (10.0.20.201) and pushes `viktorbarzin_apex_correct` + `viktorbarzin_apex_last_correct_timestamp` to Pushgateway. Python+dnspython, ~30 LOC.
- 3 Prometheus alerts:
  - `ViktorBarzinApexDrift` (critical, for 10m): apex resolved to anything other than 10.0.20.200.
  - `ViktorBarzinApexProbeStale` (warning, for 5m once the last correct result is older than 15 min): probe stopped succeeding.
  - `ViktorBarzinApexProbeNeverRun` (warning, metric absent for 30m): probe never reported.
- Added `ViktorBarzinApexDrift` and `ViktorBarzinApexProbeStale` to the Alertmanager `target_matchers` of both the NodeDown and NFSServerUnresponsive rules, alongside EmailRoundtrip*.

Verified: rules loaded as inactive (apex is correct), metric flowing, and a manually triggered probe job was observed passing.
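The probe script itself is not part of the diff shown here, so the following is only a minimal sketch of what a probe along the lines described above might look like. The metric names, job name, and IP addresses come from the commit message and alert rules; the `PUSHGATEWAY_URL` variable, its default value, and the choice to report `0` when resolution fails are assumptions for illustration. It uses POST rather than PUT so that, on an incorrect run, the previously pushed `viktorbarzin_apex_last_correct_timestamp` is left in place in the Pushgateway.

```python
import os
import time
import urllib.request

NAME = "viktorbarzin.me"
NAMESERVER = "10.0.20.201"  # Technitium LB IP (from the commit message)
EXPECTED = "10.0.20.200"    # expected apex A record (from the alert summary)
JOB = "viktorbarzin-apex-probe"
# Hypothetical default: the real Pushgateway address is not shown in this commit.
PUSHGATEWAY = os.environ.get("PUSHGATEWAY_URL", "http://pushgateway:9091")


def resolve_a(name, nameserver):
    """Return the A records for `name` as served by `nameserver`."""
    # dnspython, imported lazily so the pure helpers below work without it.
    import dns.resolver

    resolver = dns.resolver.Resolver(configure=False)  # ignore /etc/resolv.conf
    resolver.nameservers = [nameserver]
    resolver.lifetime = 5.0  # total time budget for the query, in seconds
    return [rr.address for rr in resolver.resolve(name, "A")]


def is_correct(addresses, expected=EXPECTED):
    """The apex counts as correct only if it resolves to exactly `expected`."""
    return set(addresses) == {expected}


def render_metrics(correct, now):
    """Build a Prometheus text-format payload for the Pushgateway."""
    lines = [
        "# TYPE viktorbarzin_apex_correct gauge",
        f"viktorbarzin_apex_correct {1 if correct else 0}",
    ]
    if correct:
        # Only advance the timestamp on success, so the staleness alert can fire.
        lines += [
            "# TYPE viktorbarzin_apex_last_correct_timestamp gauge",
            f"viktorbarzin_apex_last_correct_timestamp {int(now)}",
        ]
    return "\n".join(lines) + "\n"


def push(payload):
    """POST replaces only the metric names present in the payload."""
    req = urllib.request.Request(
        f"{PUSHGATEWAY}/metrics/job/{JOB}", data=payload.encode(), method="POST"
    )
    urllib.request.urlopen(req, timeout=10).close()


def main():
    try:
        addresses = resolve_a(NAME, NAMESERVER)
    except Exception as exc:  # assumption: a failed lookup is reported as incorrect
        print(f"resolve failed: {exc}")
        addresses = []
    correct = is_correct(addresses)
    print(f"{NAME} -> {addresses} correct={correct}")
    push(render_metrics(correct, time.time()))
```

In this sketch the CronJob container would call `main()` as its entrypoint. Keeping `is_correct` and `render_metrics` free of network access makes the drift logic easy to check on its own.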
2026-05-23 08:41:14 +00:00
commit 000d306542
parent 4713c3a6d9
2 changed files with 129 additions and 2 deletions
--- a/stacks/monitoring/modules/monitoring/prometheus_chart_values.tpl
+++ b/stacks/monitoring/modules/monitoring/prometheus_chart_values.tpl
@@ -84,12 +84,12 @@ alertmanager:
      - source_matchers:
          - alertname = NodeDown
        target_matchers:
-          - alertname =~ "NodeNotReady|NodeConditionBad|PodCrashLooping|ContainerOOMKilled|DeploymentReplicasMismatch|StatefulSetReplicasMismatch|DaemonSetMissingPods|ScrapeTargetDown|NodeLowFreeMemory|PostgreSQLDown|RedisDown|HeadscaleDown|HeadscaleReplicasMismatch|AuthentikDown|PoisonFountainDown|HackmdDown|PrivatebinDown|MailServerDown|EmailRoundtripFailing|EmailRoundtripStale|NodeExporterDown|DockerRegistryDown|HomeAssistantDown|HomeAssistantCriticalSensorUnavailable|CloudflaredDown|TechnitiumDNSDown|iDRACRedfishMetricsMissing|iDRACSNMPMetricsMissing|HomeAssistantMetricsMissing"
+          - alertname =~ "NodeNotReady|NodeConditionBad|PodCrashLooping|ContainerOOMKilled|DeploymentReplicasMismatch|StatefulSetReplicasMismatch|DaemonSetMissingPods|ScrapeTargetDown|NodeLowFreeMemory|PostgreSQLDown|RedisDown|HeadscaleDown|HeadscaleReplicasMismatch|AuthentikDown|PoisonFountainDown|HackmdDown|PrivatebinDown|MailServerDown|EmailRoundtripFailing|EmailRoundtripStale|ViktorBarzinApexDrift|ViktorBarzinApexProbeStale|NodeExporterDown|DockerRegistryDown|HomeAssistantDown|HomeAssistantCriticalSensorUnavailable|CloudflaredDown|TechnitiumDNSDown|iDRACRedfishMetricsMissing|iDRACSNMPMetricsMissing|HomeAssistantMetricsMissing"
      # NFS down causes mass pod failures and NFS-dependent service outages
      - source_matchers:
          - alertname = NFSServerUnresponsive
        target_matchers:
-          - alertname =~ "PodCrashLooping|ContainerOOMKilled|DeploymentReplicasMismatch|StatefulSetReplicasMismatch|DaemonSetMissingPods|ScrapeTargetDown|PostgreSQLDown|RedisDown|AuthentikDown|PoisonFountainDown|HackmdDown|PrivatebinDown|MailServerDown|EmailRoundtripFailing|EmailRoundtripStale|HomeAssistantDown|HomeAssistantCriticalSensorUnavailable"
+          - alertname =~ "PodCrashLooping|ContainerOOMKilled|DeploymentReplicasMismatch|StatefulSetReplicasMismatch|DaemonSetMissingPods|ScrapeTargetDown|PostgreSQLDown|RedisDown|AuthentikDown|PoisonFountainDown|HackmdDown|PrivatebinDown|MailServerDown|EmailRoundtripFailing|EmailRoundtripStale|ViktorBarzinApexDrift|ViktorBarzinApexProbeStale|HomeAssistantDown|HomeAssistantCriticalSensorUnavailable"
      # Traefik down makes service-level alerts noise
      - source_matchers:
          - alertname = TraefikDown
@@ -2337,6 +2337,30 @@ serverFiles:
              severity: warning
            annotations:
              summary: "Email round-trip monitor never reported - check CronJob in mailserver namespace"
+          - alert: ViktorBarzinApexDrift
+            expr: viktorbarzin_apex_correct{job="viktorbarzin-apex-probe"} == 0
+            for: 10m
+            labels:
+              severity: critical
+            annotations:
+              summary: "viktorbarzin.me apex A drifted from expected 10.0.20.200"
+              description: "Technitium serves the split-horizon apex for ~80 *.viktorbarzin.me CNAMEs. If this is wrong, every internal service (auth, vault, immich, ha-sofia, ...) breaks. Check Technitium primary zone records via API or web console."
+          - alert: ViktorBarzinApexProbeStale
+            expr: (time() - viktorbarzin_apex_last_correct_timestamp{job="viktorbarzin-apex-probe"}) > 900
+            for: 5m
+            labels:
+              severity: warning
+            annotations:
+              summary: "viktorbarzin.me apex probe has not seen a correct result in >15 min"
+              description: "Probe may be failing intermittently or apex may be drifting. Check CronJob `viktorbarzin-apex-probe` in `technitium` namespace."
+          - alert: ViktorBarzinApexProbeNeverRun
+            expr: absent(viktorbarzin_apex_correct{job="viktorbarzin-apex-probe"})
+            for: 30m
+            labels:
+              severity: warning
+            annotations:
+              summary: "viktorbarzin.me apex probe never reported"
+              description: "Check `kubectl -n technitium get cronjob viktorbarzin-apex-probe` and the most recent job pod logs."
          - alert: AIOStreamsStreamCountLow
            expr: aiostreams_stream_count{job="aiostreams-stream-probe"} < 50
            for: 30m