infra/docs/plans/2026-03-03-cluster-hardening-design.md
Viktor Barzin 197cef7f3f [ci skip] add auto-generated tiers.tf, planning docs, and helm chart cache
- tiers.tf: Terragrunt-generated tier locals for all standalone stacks
- .planning/: resource audit research and plans
- docs/plans/: cluster hardening design doc
- redis-25.3.2.tgz: Bitnami Redis Helm chart cache
2026-03-06 23:55:57 +00:00

2.7 KiB

Cluster Hardening Design

Date: 2026-03-03 Status: Approved Scope: Service availability, failure detection, DNS HA

Context

Reliability audit identified gaps in failure detection (most services lack health probes), NFS monitoring (backbone for 70+ services has no dedicated alerting), and DNS high availability (AXFR-based secondary doesn't sync settings/blocklists).

Decisions

  • No PDBs for now — revisit when adding more replicas
  • No NetworkPolicies in this phase — covered by security observability design
  • Replicate only critical infra (DNS); apps stay at 1 replica
  • Keep databases on NFS; harden via monitoring, not migration
  • Backup/DR items (MinIO, rsync, PBS, runbooks) deferred to a separate effort

Items

1. etcd Backup Alerts — DONE

  • EtcdBackupStale: fires critical if last successful backup > 36h
  • EtcdBackupNeverSucceeded: fires critical if backup has never completed
  • etcd backup image updated to registry.k8s.io/etcd:3.6.5-0 (matches cluster)
  • Applied 2026-03-03

2. Liveness & Readiness Probes

Add HTTP probes to Terraform-managed deployments. Conservative timing to avoid spamming:

  • periodSeconds: 30
  • failureThreshold: 5 (150s before restart)
  • initialDelaySeconds: 15
  • timeoutSeconds: 5

Use known health endpoints where available, fall back to GET / on container port. Start with tier-0/tier-1 services, then extend to tier-3/tier-4.

3. NFS Health Monitoring

  • Prometheus alert: NFSServerDown via blackbox exporter TCP probe on 10.0.10.15:2049, fires critical after 2 minutes
  • Uptime Kuma: TCP monitor on 10.0.10.15:2049

4. Technitium DNS Clustering

Migrate from AXFR zone transfers to Technitium's built-in clustering:

Architecture change:

  • Convert primary + secondary Deployments → single StatefulSet with 2 replicas
  • Add headless Service for stable pod DNS names
  • Separate NFS volumes per replica (existing pattern preserved)

Clustering setup:

  • Cluster domain: dns.viktorbarzin.lan (permanent)
  • Pod-0: primary (/api/admin/cluster/init)
  • Pod-1: secondary (/api/admin/cluster/initJoin)
  • HTTPS auto-enabled with self-signed certs (internal only)
  • One-shot setup Job after StatefulSet is running

What clustering syncs (vs AXFR which only syncs zone records):

  • Zones (via catalog zone — auto-syncs new zones)
  • Blocklists and allowed lists
  • DNS applications and their configs
  • Users, groups, permissions, API tokens
  • Settings

Requires maintenance window: brief DNS outage during StatefulSet migration.

Implementation Order

  1. NFS health monitoring (low effort, no disruption)
  2. Health probes (medium effort, rolling restarts)
  3. Technitium clustering (high effort, requires maintenance window)