[ci skip] add auto-generated tiers.tf, planning docs, and helm chart cache

- tiers.tf: Terragrunt-generated tier locals for all standalone stacks
- .planning/: resource audit research and plans
- docs/plans/: cluster hardening design doc
- redis-25.3.2.tgz: Bitnami Redis Helm chart cache
This commit is contained in:
Viktor Barzin 2026-03-06 23:55:57 +00:00
parent 8d3db35b5e
commit 197cef7f3f
60 changed files with 3530 additions and 0 deletions

View file

@ -0,0 +1,73 @@
# Cluster Hardening Design
**Date**: 2026-03-03
**Status**: Approved
**Scope**: Service availability, failure detection, DNS HA
## Context
Reliability audit identified gaps in failure detection (most services lack health probes), NFS monitoring (backbone for 70+ services has no dedicated alerting), and DNS high availability (AXFR-based secondary doesn't sync settings/blocklists).
## Decisions
- No PDBs for now — revisit when adding more replicas
- No NetworkPolicies in this phase — covered by security observability design
- Replicate only critical infra (DNS); apps stay at 1 replica
- Keep databases on NFS; harden via monitoring, not migration
- Backup/DR items (MinIO, rsync, PBS, runbooks) deferred to a separate effort
## Items
### 1. etcd Backup Alerts — DONE
- `EtcdBackupStale`: fires critical if last successful backup > 36h
- `EtcdBackupNeverSucceeded`: fires critical if backup has never completed
- etcd backup image updated to `registry.k8s.io/etcd:3.6.5-0` (matches cluster)
- Applied 2026-03-03
### 2. Liveness & Readiness Probes
Add HTTP probes to Terraform-managed deployments. Conservative timing to avoid spamming:
- `periodSeconds: 30`
- `failureThreshold: 5` (150s before restart)
- `initialDelaySeconds: 15`
- `timeoutSeconds: 5`
Use known health endpoints where available, fall back to `GET /` on container port.
Start with tier-0/tier-1 services, then extend to tier-3/tier-4.
### 3. NFS Health Monitoring
- **Prometheus alert**: `NFSServerDown` via blackbox exporter TCP probe on `10.0.10.15:2049`, fires critical after 2 minutes
- **Uptime Kuma**: TCP monitor on `10.0.10.15:2049`
### 4. Technitium DNS Clustering
Migrate from AXFR zone transfers to Technitium's built-in clustering:
**Architecture change**:
- Convert primary + secondary Deployments → single StatefulSet with 2 replicas
- Add headless Service for stable pod DNS names
- Separate NFS volumes per replica (existing pattern preserved)
**Clustering setup**:
- Cluster domain: `dns.viktorbarzin.lan` (permanent)
- Pod-0: primary (`/api/admin/cluster/init`)
- Pod-1: secondary (`/api/admin/cluster/initJoin`)
- HTTPS auto-enabled with self-signed certs (internal only)
- One-shot setup Job after StatefulSet is running
**What clustering syncs** (vs AXFR which only syncs zone records):
- Zones (via catalog zone — auto-syncs new zones)
- Blocklists and allowed lists
- DNS applications and their configs
- Users, groups, permissions, API tokens
- Settings
**Requires maintenance window**: brief DNS outage during StatefulSet migration.
## Implementation Order
1. NFS health monitoring (low effort, no disruption)
2. Health probes (medium effort, rolling restarts)
3. Technitium clustering (high effort, requires maintenance window)