infra/docs/plans/2026-03-03-cluster-hardening-design.md
Viktor Barzin fd0f4a0365 fix: restore tree dropped by 6d224861; land stem95su gdrive-sync (10m) [ci skip]
6d224861 came from a --no-checkout worktree whose empty index made the
commit drop every file except two. This restores 05b50d2b's full tree and
correctly adds stacks/stem95su/gdrive-sync.tf + the service-catalog stem95su
entry. Forward-only (parent=6d224861, no force-push); [ci skip] since the
live infra was never applied from the broken commit.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-09 08:45:33 +00:00

73 lines
2.7 KiB
Markdown

# Cluster Hardening Design
**Date**: 2026-03-03
**Status**: Approved
**Scope**: Service availability, failure detection, DNS HA
## Context
Reliability audit identified gaps in failure detection (most services lack health probes), NFS monitoring (backbone for 70+ services has no dedicated alerting), and DNS high availability (AXFR-based secondary doesn't sync settings/blocklists).
## Decisions
- No PDBs for now — revisit when adding more replicas
- No NetworkPolicies in this phase — covered by security observability design
- Replicate only critical infra (DNS); apps stay at 1 replica
- Keep databases on NFS; harden via monitoring, not migration
- Backup/DR items (MinIO, rsync, PBS, runbooks) deferred to a separate effort
## Items
### 1. etcd Backup Alerts — DONE
- `EtcdBackupStale`: fires critical if last successful backup > 36h
- `EtcdBackupNeverSucceeded`: fires critical if backup has never completed
- etcd backup image updated to `registry.k8s.io/etcd:3.6.5-0` (matches cluster)
- Applied 2026-03-03
### 2. Liveness & Readiness Probes
Add HTTP probes to Terraform-managed deployments. Conservative timing to avoid spamming:
- `periodSeconds: 30`
- `failureThreshold: 5` (150s before restart)
- `initialDelaySeconds: 15`
- `timeoutSeconds: 5`
Use known health endpoints where available, fall back to `GET /` on container port.
Start with tier-0/tier-1 services, then extend to tier-3/tier-4.
### 3. NFS Health Monitoring
- **Prometheus alert**: `NFSServerDown` via blackbox exporter TCP probe on `10.0.10.15:2049`, fires critical after 2 minutes
- **Uptime Kuma**: TCP monitor on `10.0.10.15:2049`
### 4. Technitium DNS Clustering
Migrate from AXFR zone transfers to Technitium's built-in clustering:
**Architecture change**:
- Convert primary + secondary Deployments → single StatefulSet with 2 replicas
- Add headless Service for stable pod DNS names
- Separate NFS volumes per replica (existing pattern preserved)
**Clustering setup**:
- Cluster domain: `dns.viktorbarzin.lan` (permanent)
- Pod-0: primary (`/api/admin/cluster/init`)
- Pod-1: secondary (`/api/admin/cluster/initJoin`)
- HTTPS auto-enabled with self-signed certs (internal only)
- One-shot setup Job after StatefulSet is running
**What clustering syncs** (vs AXFR which only syncs zone records):
- Zones (via catalog zone — auto-syncs new zones)
- Blocklists and allowed lists
- DNS applications and their configs
- Users, groups, permissions, API tokens
- Settings
**Requires maintenance window**: brief DNS outage during StatefulSet migration.
## Implementation Order
1. NFS health monitoring (low effort, no disruption)
2. Health probes (medium effort, rolling restarts)
3. Technitium clustering (high effort, requires maintenance window)