infra/docs/architecture/dns.md
Viktor Barzin 9a21c0f065 [dns] DNS reliability & hardening — Technitium + CoreDNS + alerts + readiness gate
Workstreams A, B, G, H, I of the DNS reliability plan (code-q2e).
Follow-ups for C, D, E, F filed as code-2k6, code-k0d, code-o6j, code-dw8.

**Technitium (WS A)**
- Primary deployment: add lifecycle ignore_changes on dns_config for the
  Kyverno-injected ndots (secondary/tertiary already had it) — eliminates
  per-apply ndots drift.
- All 3 instances: raise memory request+limit from 512Mi to 1Gi (primary
  was restarting near the ceiling; CPU limits stay off per cluster policy).
- zone-sync CronJob: parse API responses, push status/failures/last-run and
  per-instance zone_count gauges to Pushgateway, fail the job on any
  create error (was silently passing).

**CoreDNS (WS B)**
- Corefile: add policy sequential + health_check 5s + max_fails 2 on root
  forward, health_check on viktorbarzin.lan forward, serve_stale
  3600s/86400s on both cache blocks — pfSense flap no longer takes the
  cluster down; upstream outage keeps cached names resolving for 24h.
- Scale deploy/coredns to 3 replicas with required pod anti-affinity on
  hostname via null_resource (hashicorp/kubernetes v3 dropped the _patch
  resources); readiness gate asserts state post-apply.
- PDB coredns with minAvailable=2.

**Observability (WS G)**
- Fix DNSQuerySpike — rewrite to compare against
  avg_over_time(dns_anomaly_total_queries[1h] offset 15m); previous
  dns_anomaly_avg_queries was computed from a per-pod /tmp file so always
  equalled the current value (alert could never fire).
- New: DNSQueryRateDropped, TechnitiumZoneSyncFailed,
  TechnitiumZoneSyncStale, TechnitiumZoneCountMismatch,
  CoreDNSForwardFailureRate.

**Post-apply readiness gate (WS H)**
- null_resource.technitium_readiness_gate runs at end of apply:
  kubectl rollout status on all 3 deployments (180s), per-pod
  /api/stats/get probe, zone-count parity across the 3 instances.
  Fails the apply on any check fail. Override: -var skip_readiness=true.

**Docs (WS I)**
- docs/architecture/dns.md: CoreDNS Corefile hardening, new alerts table,
  zone-sync metrics reference, why DNSQuerySpike was broken.
- docs/runbooks/technitium-apply.md (new): what the gate checks, failure
  modes, emergency override.

Out of scope for this commit (see beads follow-ups):
- WS C: NodeLocal DNSCache (code-2k6)
- WS D: pfSense Unbound replaces dnsmasq (code-k0d)
- WS E: Kea multi-IP DHCP + TSIG (code-o6j)
- WS F: static-client DNS fixes (code-dw8)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 14:53:41 +00:00


# DNS Architecture

Last updated: 2026-04-19

## Overview

DNS is served by a split architecture: Technitium DNS handles internal resolution (.viktorbarzin.lan) and recursive lookups, while Cloudflare DNS manages all public domains (.viktorbarzin.me). Kubernetes pods use CoreDNS, which forwards to Technitium for internal zones. All three Technitium instances run on encrypted block storage, with zone replication via AXFR every 30 minutes.

## Architecture Diagram

```mermaid
graph TB
    subgraph "External"
        Internet[Internet Clients]
        CF[Cloudflare DNS<br/>~50 domains<br/>viktorbarzin.me]
        CFTunnel[Cloudflared Tunnel<br/>3 replicas]
    end

    subgraph "LAN (192.168.1.0/24)"
        LAN[LAN Clients<br/>WiFi / Wired]
        TPLINK[TP-Link AP<br/>Dumb AP only]
    end

    subgraph "pfSense (10.0.20.1)"
        pf_dnsmasq[dnsmasq<br/>Forwarder]
        pf_kea[Kea DHCP4<br/>3 subnets, 53 reservations]
        pf_ddns[Kea DHCP-DDNS<br/>RFC 2136]
        pf_nat[NAT rdr<br/>UDP 53 → Technitium]
    end

    subgraph "Kubernetes Cluster"
        CoreDNS[CoreDNS<br/>kube-system<br/>.:53 + viktorbarzin.lan:53]

        subgraph "Technitium HA (namespace: technitium)"
            Primary[Primary<br/>technitium]
            Secondary[Secondary<br/>technitium-secondary]
            Tertiary[Tertiary<br/>technitium-tertiary]
        end

        LB_DNS[LoadBalancer<br/>10.0.20.201<br/>ETP=Local]
        ClusterIP[ClusterIP<br/>10.96.0.53<br/>pinned]

        subgraph "Automation CronJobs"
            ZoneSync[zone-sync<br/>every 30min]
            SplitHorizon[split-horizon-sync<br/>every 6h]
            DNSOpt[dns-optimization<br/>every 6h]
            PassSync[password-sync<br/>every 6h]
            DNSSync[phpipam-dns-sync<br/>every 15min]
        end
    end

    Internet -->|DNS query| CF
    CF -->|CNAME to tunnel| CFTunnel
    LAN -->|DNS query UDP 53| pf_nat
    pf_nat -->|forward| LB_DNS
    pf_kea -->|lease event| pf_ddns
    pf_ddns -->|A + PTR| LB_DNS

    pf_dnsmasq -->|.viktorbarzin.lan| LB_DNS
    pf_dnsmasq -->|public queries| CF

    CoreDNS -->|.viktorbarzin.lan| ClusterIP
    CoreDNS -->|public queries| pf_dnsmasq

    LB_DNS --> Primary
    LB_DNS --> Secondary
    LB_DNS --> Tertiary
    ClusterIP --> Primary
    ClusterIP --> Secondary
    ClusterIP --> Tertiary

    ZoneSync -->|AXFR| Primary
    ZoneSync -->|replicate| Secondary
    ZoneSync -->|replicate| Tertiary
```

## Components

| Component | Location | Version | Purpose |
|---|---|---|---|
| Technitium DNS | K8s namespace `technitium` | 14.3.0 | Primary internal DNS + recursive resolver |
| CoreDNS | K8s `kube-system` | Cluster default | K8s service discovery + forwarding to Technitium |
| Cloudflare DNS | SaaS | N/A | Public domain management (~50 domains) |
| pfSense dnsmasq | 10.0.20.1 | pfSense 2.7.x | DNS forwarder for management VLAN |
| Kea DHCP-DDNS | 10.0.20.1 | pfSense 2.7.x | Automatic DNS registration on DHCP lease |
| phpIPAM | K8s namespace `phpipam` | v1.7.0 | IPAM ↔ DNS bidirectional sync |

## Terraform Stacks

| Stack | Path | DNS Resources |
|---|---|---|
| Technitium | `stacks/technitium/` | 3 deployments, services, PVCs, 4 CronJobs, CoreDNS ConfigMap |
| Cloudflared | `stacks/cloudflared/` | Cloudflare DNS records (A, AAAA, CNAME, MX, TXT), tunnel config |
| phpIPAM | `stacks/phpipam/` | dns-sync CronJob, pfsense-import CronJob |
| pfSense | `stacks/pfsense/` | VM config (DNS config is via pfSense web UI) |

## DNS Resolution Paths

### K8s Pod → Internal Domain (.viktorbarzin.lan)

```
Pod → CoreDNS (kube-dns:53)
  → template: if 2+ labels before .viktorbarzin.lan → NXDOMAIN (ndots:5 junk filter)
  → forward to Technitium ClusterIP (10.96.0.53)
  → Technitium resolves from viktorbarzin.lan zone
```

The ndots:5 template in CoreDNS short-circuits queries like www.cloudflare.com.viktorbarzin.lan (caused by K8s search domain expansion) by returning NXDOMAIN for any query with 2+ labels before .viktorbarzin.lan. Only single-label queries (e.g., idrac.viktorbarzin.lan) reach Technitium.
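
A quick way to see the filter in action from inside the cluster (`deploy/some-app` is a placeholder for any running workload):

```sh
# Search-domain junk is short-circuited by CoreDNS — never reaches Technitium:
kubectl exec -it deploy/some-app -- nslookup www.cloudflare.com.viktorbarzin.lan
# → NXDOMAIN (answered by the template plugin)

# A real single-label internal name is forwarded to Technitium and resolves:
kubectl exec -it deploy/some-app -- nslookup idrac.viktorbarzin.lan
```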

### K8s Pod → Public Domain

```
Pod → CoreDNS (kube-dns:53)
  → forward to pfSense (10.0.20.1), fallback 8.8.8.8, 1.1.1.1
  → pfSense dnsmasq → Cloudflare (1.1.1.1)
```

### LAN Client (192.168.1.x) → Any Domain

```
Client gets DNS=192.168.1.2 (pfSense WAN) from DHCP
  → pfSense NAT rdr on WAN interface → Technitium LB (10.0.20.201)
  → Technitium resolves:
    - .viktorbarzin.lan → local zone
    - .viktorbarzin.me (non-proxied) → recursive, then Split Horizon translates
      176.12.22.76 → 10.0.20.200 for 192.168.1.0/24 clients
    - other → recursive to Cloudflare DoH (1.1.1.1)
```

Client source IPs are preserved (no SNAT on 192.168.1.x → 10.0.20.x path) — Technitium logs show real per-device IPs.
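
To verify both hops of this path (run the first command from a 192.168.1.x client):

```sh
# Through pfSense's NAT redirect:
dig @192.168.1.2 idrac.viktorbarzin.lan +short

# Directly against the Technitium LoadBalancer — should return the same answer:
dig @10.0.20.201 idrac.viktorbarzin.lan +short
```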

### Management VLAN (10.0.10.x) → Any Domain

```
Client gets DNS from Kea DHCP → pfSense (10.0.10.1)
  → pfSense dnsmasq:
    - .viktorbarzin.lan → forward to Technitium (10.0.20.201)
    - other → forward to Cloudflare (1.1.1.1)
```

### K8s VLAN (10.0.20.x) → Any Domain

```
Client gets DNS from Kea DHCP → pfSense (10.0.20.1)
  → pfSense dnsmasq:
    - .viktorbarzin.lan → forward to Technitium (10.0.20.201)
    - other → forward to Cloudflare (1.1.1.1)
```

## Technitium DNS — Internal DNS Server

### Deployment Topology

Three independent Technitium instances, each with its own encrypted block-storage PVC (`proxmox-lvm-encrypted`, 2Gi each):

| Instance | Deployment | PVC | Web Service | Role |
|---|---|---|---|---|
| Primary | `technitium` | `technitium-primary-config-encrypted` | `technitium-web:5380` | Authoritative primary, zone edits happen here |
| Secondary | `technitium-secondary` | `technitium-secondary-config-encrypted` | `technitium-secondary-web:5380` | AXFR replica |
| Tertiary | `technitium-tertiary` | `technitium-tertiary-config-encrypted` | `technitium-tertiary-web:5380` | AXFR replica |

All three pods share the `dns-server=true` label, so the DNS LoadBalancer (10.0.20.201) and ClusterIP (10.96.0.53) route queries to any healthy instance.
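
A one-liner to confirm all three instances are behind both Services:

```sh
kubectl get endpoints -n technitium technitium-dns technitium-dns-internal
# Expect three pod IPs on port 53 in each — one per instance carrying dns-server=true
```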

### High Availability

- Pod anti-affinity: required on `kubernetes.io/hostname` — all 3 pods run on different nodes
- PodDisruptionBudget: `minAvailable=2` — at least 2 DNS pods survive voluntary disruptions
- Recreate strategy: each deployment uses `Recreate` (RWO block storage)
- Zone sync CronJob (`technitium-zone-sync`, every 30min): replicates all primary zones to secondary/tertiary via AXFR. Idempotent — skips existing zones, creates missing ones as Secondary type (sketched below).
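
In spirit, the sync loop looks like the following. This is a minimal sketch against Technitium's HTTP API; endpoint paths, parameters, and response fields here are illustrative, so check the actual CronJob script for the real logic:

```sh
#!/usr/bin/env bash
# Sketch of the zone-sync loop: list zones on the primary, create any
# missing ones on each replica as Secondary zones that AXFR from the primary.
set -euo pipefail
FAILURES=0

TOKEN=$(curl -fsS "http://technitium-web:5380/api/user/login?user=admin&pass=${ADMIN_PASS}" | jq -r .token)
ZONES=$(curl -fsS "http://technitium-web:5380/api/zones/list?token=${TOKEN}" | jq -r '.response.zones[].name')

for replica in technitium-secondary-web technitium-tertiary-web; do
  RTOKEN=$(curl -fsS "http://${replica}:5380/api/user/login?user=admin&pass=${ADMIN_PASS}" | jq -r .token)
  for zone in $ZONES; do
    # Idempotent: an "already exists" response counts as a skip; any other non-ok is a failure.
    resp=$(curl -sS "http://${replica}:5380/api/zones/create?token=${RTOKEN}&zone=${zone}&type=Secondary&primaryNameServerAddresses=technitium-primary")
    if [ "$(jq -r .status <<<"$resp")" != "ok" ] && ! grep -qi exists <<<"$resp"; then
      echo "create failed: ${zone} on ${replica}"
      FAILURES=$((FAILURES + 1))
    fi
  done
done

exit $((FAILURES > 0))   # fail the Job on any create error
```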

### Services

| Service | Type | IP | Selector | Purpose |
|---|---|---|---|---|
| `technitium-dns` | LoadBalancer | 10.0.20.201 | `dns-server=true` | External LAN access, `externalTrafficPolicy: Local` |
| `technitium-dns-internal` | ClusterIP | 10.96.0.53 (pinned) | `dns-server=true` | CoreDNS forwarding, survives Service recreation |
| `technitium-primary` | ClusterIP | auto | `app=technitium` | Zone transfers (AXFR) + API access to primary only |
| `technitium-web` | ClusterIP | auto | `app=technitium` | Web UI (port 5380) + DoH (port 80) |
| `technitium-secondary-web` | ClusterIP | auto | `app=technitium-secondary` | Secondary API access |
| `technitium-tertiary-web` | ClusterIP | auto | `app=technitium-tertiary` | Tertiary API access |

### Zones

Primary zones (managed on the primary, replicated to secondary/tertiary):

| Zone | Type | Records | Notes |
|---|---|---|---|
| `viktorbarzin.lan` | Primary | 30+ A/CNAME | Internal hosts (idrac, grafana, proxmox, vaultwarden, etc.) |
| `10.0.10.in-addr.arpa` | Primary | PTR | Reverse DNS for management VLAN |
| `20.0.10.in-addr.arpa` | Primary | PTR | Reverse DNS for K8s VLAN |
| `1.168.192.in-addr.arpa` | Primary | PTR | Reverse DNS for LAN |
| `2.3.10.in-addr.arpa` | Primary | PTR | Reverse DNS for VPN |
| `0.168.192.in-addr.arpa` | Primary | PTR | Reverse DNS for Valchedrym site |
| `emrsn.org` | Primary (stub) | — | Returns NXDOMAIN locally (avoids 27K+ daily corporate query floods) |

Dynamic updates: Enabled via UseSpecifiedNetworkACL from pfSense IPs (10.0.20.1, 10.0.10.1, 192.168.1.2) for Kea DDNS RFC 2136 updates.
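
Kea's RFC 2136 updates can be reproduced by hand for testing; run this from one of the ACL'd pfSense IPs, since other sources are rejected (`testhost` and `192.168.1.50` are placeholders):

```sh
nsupdate <<'EOF'
server 10.0.20.201
zone viktorbarzin.lan
update add testhost.viktorbarzin.lan 300 A 192.168.1.50
send
EOF
```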

### Resolver Settings

| Setting | Value | Rationale |
|---|---|---|
| Forwarders | Cloudflare DoH (1.1.1.1, 1.0.0.1) | Encrypted upstream DNS |
| Cache max entries | 100K | Ample for homelab |
| Cache min TTL | 60s | Reduces re-queries for short-TTL domains (e.g., headscale: 18s) |
| Cache max TTL | 7 days | Long cache for stable records |
| Serve stale | Enabled (3 days) | Resilience during upstream failures |

### Ad Blocking

Technitium runs built-in DNS blocking with:

- OISD Big List (~486K domains)
- StevenBlack hosts list

Blocking is enabled on all three instances (`DNS_SERVER_ENABLE_BLOCKING=true` on secondary/tertiary).

### Query Logging

| Backend | Status | Retention | Purpose |
|---|---|---|---|
| MySQL (`technitium` DB) | Disabled | — | Legacy, disabled by password-sync CronJob |
| PostgreSQL (`technitium` DB on CNPG) | Enabled | 90 days | Primary query log store |

Grafana dashboard (`grafana-technitium-dashboard` ConfigMap) visualizes query logs from the PostgreSQL datasource, which is auto-provisioned via sidecar.

### Web UI & Ingress

- Web UI: technitium.viktorbarzin.me (Authentik-protected via `ingress_factory`)
- DNS-over-HTTPS: dns.viktorbarzin.me (separate ingress, port 80)
- Homepage widget: Technitium widget showing totalQueries, totalCached, totalBlocked, totalRecursive

## Split Horizon (Hairpin NAT Fix)

### Problem

The TP-Link AP (dumb AP on 192.168.1.x) does not support hairpin NAT. LAN clients resolving non-proxied *.viktorbarzin.me domains get the public IP 176.12.22.76, but can't reach it because the TP-Link won't route back to the local network.

### Solution

Technitium's Split Horizon AddressTranslation app post-processes DNS responses for 192.168.1.0/24 clients, translating the public IP to the internal Traefik LB IP:

```
176.12.22.76 → 10.0.20.200
```

DNS Rebinding Protection lists viktorbarzin.me in `privateDomains`, so the translated private IP is not stripped as a suspected rebinding attack.

### Scope

- Affected: non-proxied domains (ha-sofia, immich, headscale, calibre, vaultwarden, etc.) for 192.168.1.x clients
- Not affected: Cloudflare-proxied domains (resolve to Cloudflare edge IPs, no translation needed)
- Not affected: 10.0.x.x and K8s clients (reach the public IP via pfSense outbound NAT normally)

Config is synced to all 3 Technitium instances by the `technitium-split-horizon-sync` CronJob (every 6h).

## CoreDNS Configuration

CoreDNS is managed via Terraform in `stacks/technitium/modules/technitium/` — the Corefile ConfigMap lives in `main.tf`, while scaling and the PDB live in `coredns.tf`, applied to the kubeadm-managed Deployment via a `null_resource` running kubectl (hashicorp/kubernetes v3 dropped the `_patch` resources).

```
.:53 {
    errors
    health
    ready
    kubernetes cluster.local in-addr.arpa ip6.arpa   # K8s service discovery
    prometheus :9153                                 # metrics
    forward . 10.0.20.1 8.8.8.8 1.1.1.1 {
        policy sequential        # try upstreams in order
        health_check 5s          # mark an upstream unhealthy within 5s
        max_fails 2
    }
    cache {
        success 10000 300 6
        denial 10000 300 60
        serve_stale 3600s 86400s                     # resilience during upstream outage
    }
    loop
    reload
    loadbalance
}

viktorbarzin.lan:53 {
    # ndots:5 junk filter — 2+ labels before the zone → NXDOMAIN
    template IN ANY viktorbarzin.lan {
        match ".*\..*\.viktorbarzin\.lan\.$"
        rcode NXDOMAIN
        fallthrough
    }
    forward . 10.96.0.53 {                           # Technitium ClusterIP
        health_check 5s
        max_fails 2
    }
    cache {
        success 10000 300
        denial 10000 300
        serve_stale 3600s 86400s
    }
}
```

Scaling: 3 replicas, required anti-affinity on kubernetes.io/hostname (spread across 3 distinct nodes). PodDisruptionBudget coredns with minAvailable=2.
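
Roughly what that `null_resource` has to do, as a sketch (the actual provisioner lives in `coredns.tf`):

```sh
kubectl -n kube-system scale deployment coredns --replicas=3
kubectl -n kube-system patch deployment coredns --type=strategic -p '{
  "spec": {"template": {"spec": {"affinity": {"podAntiAffinity": {
    "requiredDuringSchedulingIgnoredDuringExecution": [{
      "topologyKey": "kubernetes.io/hostname",
      "labelSelector": {"matchLabels": {"k8s-app": "kube-dns"}}
    }]
  }}}}}}'
```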

Kyverno ndots injection: A Kyverno policy injects ndots:2 on all pods cluster-wide to reduce search domain expansion noise. The template regex is a second layer of defense for any queries that still get expanded.

Failover behaviour: With policy sequential on the root forward block, CoreDNS tries pfSense first; if health_check 5s detects pfSense as down, it fails over to 8.8.8.8 then 1.1.1.1 within ~5s rather than timing out per-query. Combined with serve_stale, pods keep resolving cached names for up to 24h even with full upstream failure.
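
The failover can be watched from CoreDNS's own forward metrics (metric names per the CoreDNS forward plugin; a port-forward is just one way to reach :9153):

```sh
kubectl -n kube-system port-forward deploy/coredns 9153:9153 &
curl -s localhost:9153/metrics | grep -E 'coredns_forward_(requests_total|healthcheck_failures_total)'
# Rising healthcheck failures for to="10.0.20.1:53" while requests shift to
# 8.8.8.8 means the sequential failover is doing its job.
```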

## Cloudflare DNS — External Domains

All public domains are under the viktorbarzin.me zone. DNS records are auto-created per service via the ingress_factory module's `dns_type` parameter. A small number of records (Helm-managed ingresses, special cases) remain centrally managed in `config.tfvars`.

### How DNS Records Are Created

```hcl
# stacks/<service>/main.tf
module "ingress" {
  source   = "..."        # the shared ingress_factory module
  dns_type = "proxied"    # ← auto-creates a Cloudflare DNS record
}
```

- `dns_type = "proxied"`: creates CNAME → `{tunnel_id}.cfargotunnel.com` (Cloudflare CDN)
- `dns_type = "non-proxied"`: creates A → public IP + AAAA → IPv6
- `dns_type = "none"` (default): no DNS record

The Cloudflare tunnel uses a wildcard rule (*.viktorbarzin.me → Traefik) — no per-hostname tunnel config needed. Traefik handles host-based routing via K8s Ingress resources.

### Record Types

| Type | Records | Target | Example |
|---|---|---|---|
| Proxied CNAME | ~100 domains | `{tunnel_id}.cfargotunnel.com` | blog, hackmd, homepage, ntfy |
| Non-proxied A | ~35 domains | 176.12.22.76 (public IP) | mail, headscale, immich |
| Non-proxied AAAA | ~35 domains | IPv6 (HE tunnel) | Same as non-proxied A |
| MX | 1 | mail.viktorbarzin.me | Inbound email |
| TXT (SPF) | 1 | `v=spf1 include:mailgun.org -all` | Email authentication |
| TXT (DKIM) | 4 | RSA keys (s1, mail, brevo1, brevo2) | Email signing |
| TXT (DMARC) | 1 | `v=DMARC1; p=quarantine; pct=100` | Email policy |
| TXT (MTA-STS) | 1 | `v=STSv1; id=20260412` | TLS enforcement |
| TXT (TLSRPT) | 1 | `v=TLSRPTv1; rua=mailto:postmaster@...` | TLS reporting |
| A (keyserver) | 1 | 130.162.165.220 (Oracle VPS) | PGP keyserver |

### Proxied vs Non-Proxied

- Proxied (orange cloud): traffic routes through Cloudflare CDN → Cloudflared tunnel → Traefik. Benefits: DDoS protection, caching, no public IP exposure.
- Non-proxied (grey cloud): DNS resolves directly to the public IP. Required for services needing direct connections (mail, VPN, WebSocket-heavy apps).

### Zone Settings

- HTTP/3 (QUIC): enabled globally via `cloudflare_zone_settings_override`

## DHCP → DNS Auto-Registration

Devices get automatic DNS registration without manual intervention. See networking.md § IPAM & DNS Auto-Registration for the full data flow diagram.

Summary:

1. Kea DHCP on pfSense assigns an IP (53 reservations across 3 subnets)
2. Kea DDNS sends an RFC 2136 dynamic update to Technitium (A + PTR records) — immediate
3. The `phpipam-pfsense-import` CronJob (5min) pulls Kea leases + the ARP table into phpIPAM
4. The `phpipam-dns-sync` CronJob (15min) pushes named phpIPAM hosts → Technitium A + PTR, and pulls Technitium PTR → phpIPAM hostnames
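
To spot-check that a fresh lease produced both records (hostname and IP are placeholders):

```sh
dig @10.0.20.201 newhost.viktorbarzin.lan +short   # A record written by Kea DDNS
dig @10.0.20.201 -x 192.168.1.50 +short            # matching PTR
```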

Automation CronJobs

CronJob Schedule Namespace Purpose
technitium-zone-sync */30 * * * * technitium AXFR replication to secondary/tertiary
technitium-password-sync 0 */6 * * * technitium Vault-rotated MySQL password → Technitium config, configure PG logging
technitium-split-horizon-sync 15 */6 * * * technitium Split Horizon + DNS Rebinding Protection on all 3 instances
technitium-dns-optimization 30 */6 * * * technitium Min cache TTL 60s, emrsn.org stub zone
phpipam-dns-sync */15 * * * * phpipam Bidirectional phpIPAM ↔ Technitium DNS sync
phpipam-pfsense-import */5 * * * * phpipam Import Kea DHCP leases + ARP from pfSense

### Password Rotation Flow

Vault's database engine rotates the Technitium MySQL password every 7 days. The flow:

```
Vault DB engine rotates password
  → ExternalSecret (refreshInterval=15m) pulls from static-creds/mysql-technitium
  → K8s Secret technitium-db-creds updated
  → CronJob technitium-password-sync (every 6h):
    1. Logs into Technitium API
    2. Disables MySQL query logging (migrated to PG)
    3. Checks PG plugin is loaded (warns if missing)
    4. Configures PG query logging (90-day retention)
```
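
A quick check that a rotation actually propagated end-to-end:

```sh
kubectl -n technitium get externalsecret               # expect Ready=True / SecretSynced
kubectl -n technitium get jobs | grep password-sync    # recent sync runs and their status
```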

## Monitoring

| Metric | Source | Dashboard | Alerts |
|---|---|---|---|
| Technitium query logs | PostgreSQL | Grafana `technitium-dns.json` | — |
| CoreDNS | Prometheus metrics (:9153) | Grafana CoreDNS dashboard | CoreDNSErrors, CoreDNSForwardFailureRate |
| Technitium zone-sync | CronJob (Pushgateway) | — | TechnitiumZoneSyncFailed, TechnitiumZoneSyncStale, TechnitiumZoneCountMismatch |
| Technitium DNS pod availability | — | — | TechnitiumDNSDown |
| dns-anomaly-monitor | CronJob (Pushgateway) | — | DNSQuerySpike, DNSQueryRateDropped, DNSHighErrorRate |
| Uptime Kuma | External monitors for all proxied domains | — | ExternalAccessDivergence (15min) |

### Metrics pushed by technitium-zone-sync

The zone-sync CronJob (runs every 30min) pushes the following to the Prometheus Pushgateway under `job=technitium-zone-sync`:

| Metric | Labels | Meaning |
|---|---|---|
| `technitium_zone_sync_status` | — | 0 = last run succeeded, 1 = at least one zone failed to create |
| `technitium_zone_sync_failures` | — | Number of zones that failed to create this run |
| `technitium_zone_sync_last_run` | — | Unix timestamp of last run (used by TechnitiumZoneSyncStale) |
| `technitium_zone_count` | `instance=primary\|<replica-host>` | Zone count on each Technitium instance (drives TechnitiumZoneCountMismatch) |
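
The push itself is the standard Pushgateway text protocol; something like the sketch below, where the Pushgateway address and the sample values are assumptions for illustration:

```sh
cat <<EOF | curl -fsS --data-binary @- http://pushgateway.monitoring:9091/metrics/job/technitium-zone-sync
technitium_zone_sync_status 0
technitium_zone_sync_failures 0
technitium_zone_sync_last_run $(date +%s)
technitium_zone_count{instance="primary"} 7
EOF
```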

### DNS alert rewrites

- DNSQuerySpike was previously broken: it compared current queries against `dns_anomaly_avg_queries`, which was computed from a per-pod `/tmp/dns_avg` file. Each CronJob run started with a fresh `/tmp`, so `NEW_AVG == TOTAL_QUERIES` every time and the spike condition could never fire. Rewritten to use `avg_over_time(dns_anomaly_total_queries[1h] offset 15m)`, which compares against the actual 1h Prometheus history.
- DNSQueryRateDropped (new): fires when the query rate drops below 50% of the 1h average — upstream clients may be failing to reach Technitium.
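
Both conditions can be evaluated by hand against the Prometheus HTTP API (the `prometheus` hostname and the 2× spike multiplier here are illustrative, not the rule's actual threshold):

```sh
# DNSQuerySpike — current count well above the 1h average:
curl -sG http://prometheus:9090/api/v1/query --data-urlencode \
  'query=dns_anomaly_total_queries > 2 * avg_over_time(dns_anomaly_total_queries[1h] offset 15m)'

# DNSQueryRateDropped — current count below 50% of the 1h average:
curl -sG http://prometheus:9090/api/v1/query --data-urlencode \
  'query=dns_anomaly_total_queries < 0.5 * avg_over_time(dns_anomaly_total_queries[1h] offset 15m)'
```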

## Troubleshooting

### DNS Not Resolving Internal Domains

1. Check Technitium pods: `kubectl get pod -n technitium`
2. Check all 3 are healthy: `kubectl get pod -n technitium -l dns-server=true`
3. Test from a pod: `kubectl exec -it <pod> -- nslookup idrac.viktorbarzin.lan 10.96.0.53`
4. Check CoreDNS logs: `kubectl logs -n kube-system -l k8s-app=kube-dns`
5. Verify the ClusterIP service: `kubectl get svc -n technitium technitium-dns-internal`

### LAN Clients Can't Resolve

1. Verify the pfSense NAT rule redirects UDP 53 on WAN to 10.0.20.201
2. Check the Technitium LB service: `kubectl get svc -n technitium technitium-dns`
3. Test from LAN: `dig @192.168.1.2 idrac.viktorbarzin.lan`
4. Check `externalTrafficPolicy: Local` — if no Technitium pod runs on the node receiving traffic, it drops
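
Because of ETP=Local, step 4 hinges on pod placement; check which nodes actually host a DNS pod:

```sh
kubectl get pod -n technitium -l dns-server=true -o wide   # NODE column
# Traffic to 10.0.20.201 is only answered on nodes that appear here
```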

### Hairpin NAT Not Working (LAN → *.viktorbarzin.me Fails)

1. Verify the Split Horizon app is installed on all instances
2. Check the CronJob status: `kubectl get cronjob -n technitium technitium-split-horizon-sync`
3. Run the job manually: `kubectl create job --from=cronjob/technitium-split-horizon-sync test-sh -n technitium`
4. Test: `dig @10.0.20.201 immich.viktorbarzin.me` — should return 10.0.20.200 for a 192.168.1.x source

### Zone Not Replicating to Secondary/Tertiary

1. Check the zone-sync CronJob: `kubectl get cronjob -n technitium technitium-zone-sync`
2. Check recent jobs: `kubectl get jobs -n technitium | grep zone-sync`
3. Verify AXFR is enabled on the primary: check zone options → Zone Transfer = Allow
4. Run sync manually: `kubectl create job --from=cronjob/technitium-zone-sync test-sync -n technitium`

### High NXDOMAIN Rate in Logs

Common causes:

- ndots:5 expansion: pods query `host.search.domain.viktorbarzin.lan` — mitigated by the CoreDNS template + Kyverno ndots:2
- Corporate domains (emrsn.org): 27K+ daily queries — mitigated by the stub zone returning NXDOMAIN locally
- Ad blocking: expected for blocked domains
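
To confirm the Kyverno-injected ndots actually landed in a given pod (`<pod>` is a placeholder):

```sh
kubectl exec -it <pod> -- grep ndots /etc/resolv.conf   # expect "options ndots:2"
```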

### Adding a New DNS Record

For internal .viktorbarzin.lan records:

1. Add the host in the phpIPAM web UI (phpipam.viktorbarzin.me) with a hostname
2. Wait 15 minutes for `phpipam-dns-sync` to push to Technitium
3. Or add it directly in the Technitium web UI (technitium.viktorbarzin.me)

For external .viktorbarzin.me records:

1. Add `dns_type = "proxied"` (or `"non-proxied"`) to the `ingress_factory` module call in the service stack
2. Run `scripts/tg apply` on the service stack — the DNS record is auto-created
3. For non-standard records (MX, TXT), add a `cloudflare_record` resource in `stacks/cloudflared/modules/cloudflared/cloudflare.tf`

## Incident History

- 2026-04-14 (SEV1): NFS `fsid=0` caused Technitium primary data loss on restart. Fixed by migrating all 3 instances to `proxmox-lvm-encrypted` and adding the zone-sync CronJob (30min AXFR). See post-mortem.