infra

Viktor Barzin 000d306542 technitium: add viktorbarzin.me apex DNS drift probe + alerts	2026-05-23 08:41:14 +00:00

Every internal .viktorbarzin.me hostname (~80 services) chains through the split-horizon `viktorbarzin.me` apex A record. If the apex drifts (ISP rollover, accidental edit), every internal service breaks at once; the 2026-05-22 ha-sofia incident was exactly this.

This adds a backstop probe so the next drift surfaces in <10 min instead of via user-reported outage:

- CronJob `viktorbarzin-apex-probe` in `technitium` namespace, every 5 min, resolves `viktorbarzin.me A` against the Technitium LB IP (10.0.20.201) and pushes `viktorbarzin_apex_correct` + `_last_correct_timestamp` to Pushgateway. Python+dnspython, ~30 LOC.
- 3 Prometheus alerts:
  - `ViktorBarzinApexDrift` (critical, 10m): apex resolved to anything other than 10.0.20.200.
  - `ViktorBarzinApexProbeStale` (warning, 5m on 15m gap): probe stopped succeeding.
  - `ViktorBarzinApexProbeNeverRun` (warning, 30m absent): probe never reported.
- Added the new alert names to the Slack receiver matcher in both routes alongside EmailRoundtrip.

Verified: rules loaded as inactive (apex is correct), metric flowing, manual probe job pass observed.
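The probe described in the commit message might look roughly like the sketch below. It is not the committed script: the Pushgateway URL, the job name in that URL, the full name of the timestamp metric (`viktorbarzin_apex_last_correct_timestamp`), and the choice to report a failed lookup as "not correct" are all assumptions made for illustration. Only the resolver IP, the expected apex IP, and the `viktorbarzin_apex_correct` metric name come from the commit message.

```python
import time
import urllib.request

# Values taken from the commit message.
APEX = "viktorbarzin.me"
RESOLVER_IP = "10.0.20.201"  # Technitium LB IP
EXPECTED_IP = "10.0.20.200"  # address the apex A record should resolve to

# Assumption: hypothetical Pushgateway endpoint and job name.
PUSHGATEWAY_URL = (
    "http://pushgateway.monitoring.svc:9091/metrics/job/viktorbarzin-apex-probe"
)


def resolve_apex(name=APEX, nameserver=RESOLVER_IP, timeout=5.0):
    """Return the sorted A-record addresses for `name` from one nameserver."""
    import dns.resolver  # dnspython, imported lazily so the rest runs without it

    resolver = dns.resolver.Resolver(configure=False)  # ignore /etc/resolv.conf
    resolver.nameservers = [nameserver]
    resolver.lifetime = timeout
    return sorted(rr.address for rr in resolver.resolve(name, "A"))


def build_payload(addresses, expected=EXPECTED_IP, now=None):
    """Render the metrics in Prometheus text exposition format."""
    correct = list(addresses) == [expected]
    lines = [
        "# TYPE viktorbarzin_apex_correct gauge",
        f"viktorbarzin_apex_correct {1 if correct else 0}",
    ]
    if correct:
        # Assumption: the timestamp metric is only refreshed on a correct answer.
        ts = int(time.time() if now is None else now)
        lines += [
            "# TYPE viktorbarzin_apex_last_correct_timestamp gauge",
            f"viktorbarzin_apex_last_correct_timestamp {ts}",
        ]
    return "\n".join(lines) + "\n"


def push(payload, url=PUSHGATEWAY_URL):
    """POST the payload; POST replaces only the metric names that are sent."""
    request = urllib.request.Request(url, data=payload.encode(), method="POST")
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status


def main():
    try:
        addresses = resolve_apex()
    except Exception:
        # Assumption: a failed lookup is reported as "not correct".
        addresses = []
    push(build_payload(addresses))
```

The CronJob container would call `main()` once per run. Using POST rather than PUT means a run that observes drift leaves the previously pushed timestamp in place, so a staleness alert can compare it against the current time.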
modules/monitoring	technitium: add viktorbarzin.me apex DNS drift probe + alerts	2026-05-23 08:41:14 +00:00
main.tf	[forgejo] Tolerate missing Vault keys during Phase 0 bootstrap	2026-05-07 15:53:08 +00:00
secrets	extract monitoring, nvidia, mailserver, cloudflared, kyverno from platform [ci skip]	2026-03-17 21:34:11 +00:00
terragrunt.hcl	extract monitoring, nvidia, mailserver, cloudflared, kyverno from platform [ci skip]	2026-03-17 21:34:11 +00:00