[ci skip] update claude knowledge: infrastructure hardening changes
- NFS volumes now use var.nfs_server (not hardcoded IP)
- Shared infra variables documented (redis_host, postgresql_host, etc.)
- Tiers locals now generated by terragrunt.hcl, not duplicated in stacks
- Traefik security hardening documented (API, headers, rate limiting)
- Kyverno pod security policies documented (audit mode)
- Prometheus alert groups updated (Critical Services, PVPredictedFull)
- Loki retention updated to 30d, Alloy memory to 512Mi/1Gi
- Grampsweb now protected by Authentik
- MeshCentral registration disabled
This commit is contained in:
parent 36fd424107
commit c61c1744de
1 changed file with 41 additions and 10 deletions
@@ -39,12 +39,12 @@ Terragrunt-based infrastructure repository managing a home Kubernetes cluster on
 ## Key Patterns
 
 ### NFS Volume Pattern
 
-**Prefer inline NFS volumes** over separate PV/PVC resources:
+**Prefer inline NFS volumes** over separate PV/PVC resources. Use `var.nfs_server` (defined in `terraform.tfvars`, auto-loaded by Terragrunt):
 
 ```hcl
 volume {
   name = "data"
   nfs {
-    server = "10.0.10.15"
+    server = var.nfs_server
     path   = "/mnt/main/<service>"
   }
 }
@@ -60,7 +60,7 @@ Structure: `stacks/<service>/main.tf` + `factory/main.tf`. Examples: `actualbudg
 To add a user: export NFS share, add Cloudflare route in tfvars, add module block calling factory.
 
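The "add module block calling factory" step in the hunk above might look like this sketch (the module name, source path, and input variable names are illustrative assumptions, not taken from the repo):

```hcl
# Hypothetical stack main.tf entry for a new user share.
# "alice", the source path, and the input names are assumptions.
module "alice" {
  source = "../../factory"

  username   = "alice"
  nfs_server = var.nfs_server       # shared endpoint from terraform.tfvars
  nfs_path   = "/mnt/main/alice"    # matches the exported NFS share
}
```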
 ### SMTP/Email
-- **Use**: `mail.viktorbarzin.me` port 587 (STARTTLS). **NOT** `mailserver.mailserver.svc.cluster.local` (TLS cert mismatch).
+- **Use**: `var.mail_host` (defaults to `mail.viktorbarzin.me`) port 587 (STARTTLS). **NOT** `mailserver.mailserver.svc.cluster.local` (TLS cert mismatch).
 - **Credentials**: `mailserver_accounts` in tfvars. Common: `info@viktorbarzin.me`
 
 ### Anti-AI Scraping (5-Layer Defense)
 
@@ -76,12 +76,13 @@ Testing: `curl -s -H "Accept: text/html,application/xhtml+xml" https://vaultward
 Disable per-service: `anti_ai_scraping = false` in ingress_factory call.
 
 ### Terragrunt Architecture
-- Root `terragrunt.hcl` provides DRY provider, backend, and variable loading
+- Root `terragrunt.hcl` provides DRY provider, backend, variable loading, and shared `tiers` locals (via `generate "tiers"` block)
 - Each stack: `stacks/<service>/main.tf` with resources inline, state at `state/stacks/<service>/terraform.tfstate`
 - Platform modules: `stacks/platform/modules/<service>/`, shared modules: `modules/kubernetes/`
 - Dependencies via `dependency` block; variables from `terraform.tfvars` (unused silently ignored)
 - `secrets/` symlinks in stacks for TLS cert path resolution
 - Syntax: `--non-interactive` (not `--terragrunt-non-interactive`), `terragrunt run --all -- <command>` (not `run-all`)
+- **Tiers locals**: Auto-generated by Terragrunt into `tiers.tf` in every stack — do NOT add `locals { tiers = { ... } }` to stacks manually
 
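The `generate "tiers"` mechanism described above might be sketched as follows in the root `terragrunt.hcl` (the generated locals are an assumption for illustration; only the tier names come from this document):

```hcl
# Sketch: root terragrunt.hcl emitting a shared tiers.tf into every stack.
# The generated contents below are illustrative, not the repo's actual values.
generate "tiers" {
  path      = "tiers.tf"
  if_exists = "overwrite"
  contents  = <<-EOF
    locals {
      tiers = {
        core    = "tier-0-core"
        cluster = "tier-1-cluster"
        gpu     = "tier-2-gpu"
        edge    = "tier-3-edge"
        aux     = "tier-4-aux"
      }
    }
  EOF
}
```

This is why stacks can reference `local.tiers.core` etc. without declaring the locals themselves, and why adding them manually would collide with the generated file.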
 ### Adding a New Service
 Use the **`setup-project`** skill for the full workflow. Quick reference:
 
@@ -90,6 +91,18 @@ Use the **`setup-project`** skill for the full workflow. Quick reference:
 3. Apply platform stack (for DNS): `cd stacks/platform && terragrunt apply --non-interactive`
 4. Apply service: `cd stacks/<service> && terragrunt apply --non-interactive`
 
+### Shared Infrastructure Variables
+All stacks use variables from `terraform.tfvars` for shared service endpoints (auto-loaded by Terragrunt). **Never hardcode these values**:
+- `var.nfs_server` — NFS server IP (10.0.10.15)
+- `var.redis_host` — Redis hostname (redis.redis.svc.cluster.local)
+- `var.postgresql_host` — PostgreSQL hostname (postgresql.dbaas.svc.cluster.local)
+- `var.mysql_host` — MySQL hostname (mysql.dbaas.svc.cluster.local)
+- `var.ollama_host` — Ollama hostname (ollama.ollama.svc.cluster.local)
+- `var.mail_host` — Mail server hostname (mail.viktorbarzin.me)
+
+For standalone stacks: add `variable "nfs_server" { type = string }` (etc.) to `main.tf`.
+For platform submodules: add the variable AND pass it through in `stacks/platform/main.tf` module block.
 
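The two wiring cases above could be sketched as follows (the `some_service` module name is a hypothetical placeholder):

```hcl
# Standalone stack main.tf: declare the variable so the auto-loaded
# terraform.tfvars value can bind to it.
variable "nfs_server" {
  type = string
}

# Platform submodule case: stacks/platform/main.tf must declare the
# variable AND pass it through in the module block.
module "some_service" {
  source     = "./modules/some_service"   # hypothetical submodule
  nfs_server = var.nfs_server
}
```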
 ## Useful Commands
 ```bash
 bash scripts/cluster_healthcheck.sh    # Cluster health (24 checks)
@@ -108,7 +121,9 @@ terraform fmt -recursive # Format all
 ## Infrastructure
 - Proxmox hypervisor (192.168.1.127) — see `.claude/reference/proxmox-inventory.md` for full VM table
 - Kubernetes cluster: 5 nodes (k8s-master + k8s-node1-4, v1.34.2), GPU on node1 (Tesla T4)
-- NFS: `10.0.10.15`, Redis: `redis.redis.svc.cluster.local`
+- NFS: `var.nfs_server` (10.0.10.15), Redis: `var.redis_host` (redis.redis.svc.cluster.local)
+- PostgreSQL: `var.postgresql_host` (postgresql.dbaas.svc.cluster.local), MySQL: `var.mysql_host` (mysql.dbaas.svc.cluster.local)
+- Ollama: `var.ollama_host` (ollama.ollama.svc.cluster.local), Mail: `var.mail_host` (mail.viktorbarzin.me)
 - Docker registry pull-through cache at `10.0.20.10` (ports 5000/5010/5020/5030/5040)
 - GPU workloads need: `node_selector = { "gpu": "true" }` + `toleration { key = "nvidia.com/gpu", value = "true", effect = "NoSchedule" }`
 
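The GPU scheduling requirement above, placed inside a pod template, might look like this sketch (surrounding deployment resource omitted; treat the field layout as an approximation of the Terraform kubernetes provider schema):

```hcl
# Inside the pod spec of a kubernetes_deployment (sketch only):
spec {
  node_selector = {
    "gpu" = "true"
  }

  toleration {
    key      = "nvidia.com/gpu"
    operator = "Equal"    # assumed; the doc lists only key/value/effect
    value    = "true"
    effect   = "NoSchedule"
  }

  # container definitions, volumes, etc. omitted
}
```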
|
@ -134,17 +149,32 @@ To rebuild a K8s worker node from scratch (e.g., after disk failure or corruptio
|
|||
- Commit only specific files. **ALWAYS ask user before pushing**.
|
||||
|
||||
 ## Prometheus Alerts
-- Rules in `modules/kubernetes/monitoring/prometheus_chart_values.tpl`
-- Groups: "R730 Host", "Nvidia Tesla T4 GPU", "Power", "Cluster"
+- Rules in `stacks/platform/modules/monitoring/prometheus_chart_values.tpl`
+- Groups: "R730 Host", "Nvidia Tesla T4 GPU", "Power", "Storage", "K8s Health", "Infrastructure Health", "Critical Services", "Cluster", "Traefik Ingress"
+- **Critical Services** group (added 2026-02): PostgreSQLDown, MySQLDown, RedisDown, HeadscaleDown, AuthentikDown, LokiDown
+- **Predictive alerts**: PVPredictedFull (predict_linear 24h ahead)
 - Loki alert rules: HighErrorRate, PodCrashLoopBackOff, OOMKilled (in `loki.tf` ConfigMap)
 
 ## Tier System & Resource Governance
 - **0-core**: Critical infra (ingress, DNS, VPN, auth) | **1-cluster**: Redis, metrics, security | **2-gpu**: GPU workloads | **3-edge**: User-facing | **4-aux**: Optional
-- Kyverno-based governance in `modules/kubernetes/kyverno/resource-governance.tf`:
+- **Tiers locals**: Generated by root `terragrunt.hcl` into `tiers.tf` — available as `local.tiers.core`, `local.tiers.cluster`, etc. in all stacks
+- Kyverno-based governance in `stacks/platform/modules/kyverno/resource-governance.tf`:
   1. PriorityClasses: `tier-0-core` (1M) through `tier-4-aux` (200K, preemption=Never)
   2. LimitRange defaults (Kyverno generate): auto-created per namespace tier
   3. ResourceQuotas (Kyverno generate): auto-created per namespace tier (skip with label `resource-governance/custom-quota=true`)
   4. Priority injection (Kyverno mutate): sets priorityClassName on Pods
 - Custom quota override: monitoring, crowdsec, nvidia, realestate-crawler
+- **Pod Security Policies** (Kyverno, audit mode) in `stacks/platform/modules/kyverno/security-policies.tf`:
+  1. `deny-privileged-containers`: Blocks `privileged: true` (exempt: frigate, nvidia, monitoring)
+  2. `deny-host-namespaces`: Blocks hostNetwork/PID/IPC (exempt: frigate, monitoring)
+  3. `restrict-sys-admin`: Blocks SYS_ADMIN capability (exempt: nvidia, monitoring)
+  4. `require-trusted-registries`: Validates images from docker.io, ghcr.io, quay.io, registry.k8s.io, 10.0.20.10
 
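As an illustrative sketch, opting a namespace out of the Kyverno-generated ResourceQuota via the label mentioned above could look like:

```hcl
# Hypothetical namespace carrying the opt-out label; the Kyverno generate
# rule is documented to skip namespaces labeled this way.
resource "kubernetes_namespace" "monitoring" {
  metadata {
    name = "monitoring"
    labels = {
      "resource-governance/custom-quota" = "true"
    }
  }
}
```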
+### Traefik Security (hardened 2026-02)
+- **API dashboard**: `api.insecure = false` — dashboard only accessible via Authentik-protected ingress
+- **Forwarded headers**: `insecure=false` with `trustedIPs` set to Cloudflare IPv4 ranges + internal (10.0.0.0/8, 192.168.0.0/16)
+- **Security headers middleware** (`security-headers`): HSTS (31536000s, includeSubDomains), X-Frame-Options: DENY, X-Content-Type-Options: nosniff, Referrer-Policy: strict-origin-when-cross-origin, Permissions-Policy
+- **Rate limiting**: average=10, burst=50 (default); average=100, burst=1000 (immich-rate-limit for uploads)
 
 ---
@@ -189,10 +219,11 @@ To rebuild a K8s worker node from scratch (e.g., after disk failure or corruptio
 - **Image**: `ghcr.io/gramps-project/grampsweb:latest` | **Port**: 5000 | **URL**: `https://family.viktorbarzin.me`
 - **Components**: Web app + Celery worker (2 containers in 1 pod) | **Redis**: DB 2 (broker), DB 3 (rate limiting)
 - **Storage**: NFS `/mnt/main/grampsweb` with sub_paths
+- **Auth**: Protected by Authentik (added 2026-02)
 
 ### Loki + Alloy (Log Collection)
-- **Loki**: `grafana/loki:3.6.5` (single binary, 6Gi RAM, 7d retention)
-- **Alloy**: `grafana/alloy:v1.13.0` (DaemonSet, 128Mi/pod)
+- **Loki**: `grafana/loki:3.6.5` (single binary, 6Gi RAM, 30d retention / 720h)
+- **Alloy**: `grafana/alloy:v1.13.0` (DaemonSet, 512Mi requests / 1Gi limits)
 - **Storage**: NFS PV `/mnt/main/loki/loki` (15Gi), WAL on tmpfs (2Gi)
 - **Alert rules**: HighErrorRate, PodCrashLoopBackOff, OOMKilled (ConfigMap `loki-alert-rules`)
 - **Troubleshooting**: "entry too far behind" on first start → restart Alloy DaemonSet