[ci skip] add auto-generated tiers.tf, planning docs, and helm chart cache

- tiers.tf: Terragrunt-generated tier locals for all standalone stacks
- .planning/: resource audit research and plans
- docs/plans/: cluster hardening design doc
- redis-25.3.2.tgz: Bitnami Redis Helm chart cache
This commit is contained in:
parent 8d3db35b5e
commit 197cef7f3f

60 changed files with 3530 additions and 0 deletions
docs/plans/2026-03-03-cluster-hardening-design.md (73 lines, new file)

@@ -0,0 +1,73 @@
# Cluster Hardening Design

**Date**: 2026-03-03
**Status**: Approved
**Scope**: Service availability, failure detection, DNS HA

## Context

A reliability audit identified gaps in failure detection (most services lack health probes), NFS monitoring (the backbone for 70+ services has no dedicated alerting), and DNS high availability (the AXFR-based secondary doesn't sync settings or blocklists).

## Decisions

- No PDBs for now — revisit when adding more replicas
- No NetworkPolicies in this phase — covered by the security observability design
- Replicate only critical infra (DNS); apps stay at 1 replica
- Keep databases on NFS; harden via monitoring, not migration
- Backup/DR items (MinIO, rsync, PBS, runbooks) deferred to a separate effort

## Items

### 1. etcd Backup Alerts — DONE

- `EtcdBackupStale`: fires critical if the last successful backup is more than 36h old
- `EtcdBackupNeverSucceeded`: fires critical if a backup has never completed
- etcd backup image updated to `registry.k8s.io/etcd:3.6.5-0` (matches the cluster version)
- Applied 2026-03-03 (rule sketch below)
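
For reference, a minimal sketch of the two rules as a PrometheusRule. The `etcd_backup_last_success_timestamp_seconds` metric name is an assumption about what the backup job exports (e.g. via Pushgateway); substitute whatever the CronJob actually publishes:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: etcd-backup-alerts
spec:
  groups:
    - name: etcd-backup
      rules:
        - alert: EtcdBackupStale
          # Fires when the last successful backup is more than 36h old.
          expr: time() - etcd_backup_last_success_timestamp_seconds > 36 * 3600
          labels:
            severity: critical
        - alert: EtcdBackupNeverSucceeded
          # The metric is absent entirely: no backup has ever completed.
          expr: absent(etcd_backup_last_success_timestamp_seconds)
          labels:
            severity: critical
```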

### 2. Liveness & Readiness Probes

Add HTTP probes to Terraform-managed deployments, with conservative timing to avoid restart storms (pod-spec sketch below):

- `periodSeconds: 30`
- `failureThreshold: 5` (150s of failures before restart)
- `initialDelaySeconds: 15`
- `timeoutSeconds: 5`

Use known health endpoints where available; otherwise fall back to `GET /` on the container port.
Start with tier-0/tier-1 services, then extend to tier-3/tier-4.
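
A pod-spec sketch of those settings; the `/healthz` path and port 8080 are placeholders for each service's known health endpoint (or `GET /` on its container port):

```yaml
# Probe settings from this design, applied to a container spec.
livenessProbe:
  httpGet:
    path: /healthz        # placeholder: use the service's real health endpoint
    port: 8080            # placeholder: the service's container port
  periodSeconds: 30
  failureThreshold: 5     # 5 x 30s = 150s of failures before restart
  initialDelaySeconds: 15
  timeoutSeconds: 5
readinessProbe:
  httpGet:
    path: /healthz
    port: 8080
  periodSeconds: 30
  failureThreshold: 5
  initialDelaySeconds: 15
  timeoutSeconds: 5
```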

### 3. NFS Health Monitoring

- **Prometheus alert**: `NFSServerDown` via a blackbox exporter TCP probe on `10.0.10.15:2049`; fires critical after 2 minutes (rule sketch below)
- **Uptime Kuma**: TCP monitor on `10.0.10.15:2049`
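
A sketch of the alert rule, assuming the blackbox exporter's standard `tcp_connect` module and a `job="blackbox-nfs"` scrape label (the label name is an assumption):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: nfs-health
spec:
  groups:
    - name: nfs
      rules:
        - alert: NFSServerDown
          # probe_success is 0 when the TCP connect to the NFS port fails.
          expr: probe_success{job="blackbox-nfs", instance="10.0.10.15:2049"} == 0
          for: 2m
          labels:
            severity: critical
```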

### 4. Technitium DNS Clustering

Migrate from AXFR zone transfers to Technitium's built-in clustering:

**Architecture change** (manifest sketch after this list):
- Convert the primary + secondary Deployments into a single StatefulSet with 2 replicas
- Add a headless Service for stable pod DNS names
- Keep separate NFS volumes per replica (existing pattern preserved)
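
A sketch of the target shape; names and image are illustrative, and the per-replica NFS PersistentVolumes are omitted since they follow the existing pattern:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: technitium-headless
spec:
  clusterIP: None            # headless: gives each pod a stable DNS name
  selector:
    app: technitium
  ports:
    - name: dns
      port: 53
      protocol: UDP
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: technitium
spec:
  serviceName: technitium-headless
  replicas: 2
  selector:
    matchLabels:
      app: technitium
  template:
    metadata:
      labels:
        app: technitium
    spec:
      containers:
        - name: technitium
          image: technitium/dns-server   # pin a specific tag in practice
```

With the headless Service, the replicas get stable names like `technitium-0.technitium-headless`, which the setup Job below relies on.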

**Clustering setup** (setup Job sketch after this list):
- Cluster domain: `dns.viktorbarzin.lan` (permanent)
- Pod-0: primary (`/api/admin/cluster/init`)
- Pod-1: secondary (`/api/admin/cluster/initJoin`)
- HTTPS auto-enabled with self-signed certs (internal only)
- One-shot setup Job runs after the StatefulSet is up
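
A sketch of the setup Job. Only the two endpoint paths above come from this design; the query parameters, port 5380, and the token Secret are assumptions to verify against Technitium's clustering API docs:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: technitium-cluster-init
spec:
  template:
    spec:
      restartPolicy: OnFailure
      containers:
        - name: init
          image: curlimages/curl
          env:
            - name: TOKEN
              valueFrom:
                secretKeyRef:
                  name: technitium-api-token   # hypothetical Secret holding the API token
                  key: token
          command:
            - sh
            - -c
            - |
              # Initialise the cluster on pod-0, then join pod-1 to it.
              # Parameter names below are assumptions, not confirmed API.
              curl -fsS "http://technitium-0.technitium-headless:5380/api/admin/cluster/init?token=${TOKEN}&clusterDomain=dns.viktorbarzin.lan"
              curl -fsS "http://technitium-1.technitium-headless:5380/api/admin/cluster/initJoin?token=${TOKEN}&clusterDomain=dns.viktorbarzin.lan"
```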

**What clustering syncs** (vs AXFR, which only syncs zone records):
- Zones (via a catalog zone — new zones sync automatically)
- Blocklists and allowed lists
- DNS applications and their configs
- Users, groups, permissions, API tokens
- Settings

**Requires a maintenance window**: brief DNS outage during the StatefulSet migration.

## Implementation Order

1. NFS health monitoring (low effort, no disruption)
2. Health probes (medium effort, rolling restarts)
3. Technitium clustering (high effort, requires a maintenance window)