Infrastructure Repository Knowledge
Instructions for Claude
- When the user says "remember" something: Always update this file (.claude/CLAUDE.md) with the information so it persists across sessions
- When discovering new patterns or versions: Add them to the appropriate section below
- When making infrastructure changes: Always update this file to reflect the current state (new services, removed services, version changes, config changes)
- After every significant change: Proactively update this file (.claude/CLAUDE.md) to reflect what changed (new services, config changes, version bumps, new patterns, etc.). This ensures knowledge persists across sessions automatically.
- After updating any .claude/ files: Always commit them immediately (git add .claude/ && git commit -m "[ci skip] update claude knowledge") to avoid building up unstaged changes.
- Skills available: Check the .claude/skills/ directory for specialized workflows (e.g., setup-project for deploying new services)
- Reference data: Check .claude/reference/ for inventory tables, API patterns, and current state snapshots
- CRITICAL: All infrastructure changes must go through Terraform/Terragrunt. NEVER modify cluster resources directly (kubectl apply/edit/patch, helm install, docker run). Use kubectl only for read-only operations and ephemeral debugging.
- CRITICAL: NEVER put sensitive data (API keys, passwords, tokens, credentials) into committed files unless encrypted via git-crypt. Secrets belong in terraform.tfvars or the secrets/ directory.
- CRITICAL: NEVER commit secrets; triple-check before every commit. Zero exceptions.
- New services MUST have CI/CD (Woodpecker CI pipeline) and monitoring (Prometheus alerts and/or Uptime Kuma).
Execution Environment
- Terraform/Terragrunt: Always run locally: cd stacks/<service> && terragrunt apply --non-interactive
- kubectl: kubectl --kubeconfig $(pwd)/config
- GitHub API: Use curl with tokens from tfvars (see .claude/reference/github-api.md). The gh CLI is blocked by the sandbox.
Overview
Terragrunt-based infrastructure repository managing a home Kubernetes cluster on Proxmox VMs, with per-service state isolation. Each service has its own Terragrunt stack under stacks/. Uses git-crypt for secrets encryption.
Key File Paths
- terraform.tfvars: All secrets, DNS, Cloudflare config, WireGuard peers (git-crypt encrypted)
- terragrunt.hcl: Root config (providers, backend, variable loading)
- stacks/<service>/: Individual service stacks (terragrunt.hcl + main.tf)
- stacks/platform/: Core infrastructure (~22 services in modules/ subdir)
- stacks/infra/: Proxmox VM resources
- modules/kubernetes/ingress_factory/, setup_tls_secret/: Shared utility modules
- secrets/: git-crypt encrypted TLS certs and keys
Domains
- Public: viktorbarzin.me (Cloudflare-managed)
- Internal: viktorbarzin.lan (Technitium DNS)
Key Patterns
NFS Volume Pattern
Prefer inline NFS volumes over separate PV/PVC resources. Use var.nfs_server (defined in terraform.tfvars, auto-loaded by Terragrunt):
volume {
  name = "data"
  nfs {
    server = var.nfs_server
    path   = "/mnt/main/<service>"
  }
}
Only use PV/PVC when a Helm chart requires existingClaim.
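For the Helm existingClaim case, a static NFS-backed PV/PVC pair could look roughly like the sketch below. The names example-data and example, the 10Gi size, the /mnt/main/example path, and the nfs-static class label are all hypothetical placeholders, not values from this repository.

```hcl
variable "nfs_server" {
  type = string
}

# Static PV pointing at an NFS export (path is a placeholder).
resource "kubernetes_persistent_volume" "example" {
  metadata {
    name = "example-data"
  }
  spec {
    capacity = {
      storage = "10Gi"
    }
    access_modes       = ["ReadWriteMany"]
    storage_class_name = "nfs-static" # hypothetical label, only used to pair PV and PVC
    persistent_volume_source {
      nfs {
        server = var.nfs_server
        path   = "/mnt/main/example"
      }
    }
  }
}

# Claim bound explicitly to the PV above; pass its name as existingClaim.
resource "kubernetes_persistent_volume_claim" "example" {
  metadata {
    name      = "example-data"
    namespace = "example"
  }
  spec {
    access_modes       = ["ReadWriteMany"]
    storage_class_name = "nfs-static"
    volume_name        = kubernetes_persistent_volume.example.metadata[0].name
    resources {
      requests = {
        storage = "10Gi"
      }
    }
  }
}
```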
Adding NFS Exports
- Edit secrets/nfs_directories.txt: add the path, keep the file sorted
- Run secrets/nfs_exports.sh from secrets/ to update TrueNAS
Factory Pattern (multi-user services)
Structure: stacks/<service>/main.tf + factory/main.tf. Examples: actualbudget, freedify.
To add a user: export NFS share, add Cloudflare route in tfvars, add module block calling factory.
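The "add module block calling factory" step might look like the following sketch. It assumes the factory lives at stacks/<service>/factory/ and takes inputs named name and nfs_server; both input names and the user alice are hypothetical, so check factory/main.tf for the real interface.

```hcl
# stacks/<service>/main.tf (sketch)
variable "nfs_server" {
  type = string
}

module "alice" {
  source = "./factory"

  # Hypothetical inputs: the real variable names are defined in factory/main.tf.
  name       = "alice"
  nfs_server = var.nfs_server
}
```

One module block per user keeps each instance separately addressable in the stack's state.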
SMTP/Email
- Use: var.mail_host (defaults to mail.viktorbarzin.me) port 587 (STARTTLS). NOT mailserver.mailserver.svc.cluster.local (TLS cert mismatch).
- Credentials: mailserver_accounts in tfvars. Common: info@viktorbarzin.me
Anti-AI Scraping (5-Layer Defense)
All services have anti_ai_scraping = true by default in ingress_factory. Layers:
- Bot blocking (traefik-ai-bot-block): ForwardAuth → poison-fountain/auth. Returns 403 for GPTBot, ClaudeBot, CCBot, etc.
- X-Robots-Tag (traefik-anti-ai-headers): Adds noai, noimageai
- Trap links (traefik-anti-ai-trap-links): rewrite-body injects 5 hidden links before </body> to poison.viktorbarzin.me/article/*
- Tarpit: /article/* drip-feeds at ~100 bytes/sec
- Poison content: 50 cached docs from rnsaffn.com/poison2/ (CronJob every 6h, --http1.1 required)
Key files: stacks/poison-fountain/, stacks/platform/modules/traefik/middleware.tf, modules/kubernetes/ingress_factory/main.tf
Testing: curl -s -H "Accept: text/html,application/xhtml+xml" https://vaultwarden.viktorbarzin.me/ | grep -oE 'href="https://poison[^"]*"'
Disable per-service: anti_ai_scraping = false in ingress_factory call.
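As a rough illustration of the per-service opt-out, an ingress_factory call might look like this. Only anti_ai_scraping comes from the notes above; the name and namespace inputs and the relative source path are assumptions about the module's interface.

```hcl
# stacks/<service>/main.tf (sketch)
module "ingress" {
  source = "../../modules/kubernetes/ingress_factory"

  # Hypothetical inputs: see ingress_factory/main.tf for the real ones.
  name      = "example"
  namespace = "example"

  # Opt this service out of all five layers; the default is true.
  anti_ai_scraping = false
}
```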
Terragrunt Architecture
- Root terragrunt.hcl provides DRY provider, backend, variable loading, and shared tiers locals (via generate "tiers" block)
- Each stack: stacks/<service>/main.tf with resources inline, state at state/stacks/<service>/terraform.tfstate
- Platform modules: stacks/platform/modules/<service>/, shared modules: modules/kubernetes/
- Dependencies via dependency block; variables from terraform.tfvars (unused silently ignored)
- secrets/ symlinks in stacks for TLS cert path resolution
- Syntax: --non-interactive (not --terragrunt-non-interactive), terragrunt run --all -- <command> (not run-all)
- Tiers locals: Auto-generated by Terragrunt into tiers.tf in every stack. Do NOT add locals { tiers = { ... } } to stacks manually
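A generate block of the kind described above could be sketched as follows. The map keys and values are assumptions inferred from the tier names (only local.tiers.core and local.tiers.cluster are confirmed by these notes); the real block in the root terragrunt.hcl is authoritative.

```hcl
# Root terragrunt.hcl (sketch)
generate "tiers" {
  path      = "tiers.tf"
  if_exists = "overwrite_terragrunt" # only overwrite files Terragrunt itself generated
  contents  = <<-EOF
    locals {
      tiers = {
        core    = "0-core"
        cluster = "1-cluster"
        gpu     = "2-gpu"
        edge    = "3-edge"
        aux     = "4-aux"
      }
    }
  EOF
}
```

Because Terragrunt writes tiers.tf into every stack, a second locals block defining tiers in a stack would collide with the generated one, which is why manual definitions are forbidden.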
Adding a New Service
Use the setup-project skill for the full workflow. Quick reference:
- Create stacks/<service>/ with terragrunt.hcl, main.tf, secrets symlink
- Add Cloudflare DNS in terraform.tfvars
- Apply platform stack (for DNS): cd stacks/platform && terragrunt apply --non-interactive
- Apply service: cd stacks/<service> && terragrunt apply --non-interactive
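A per-stack terragrunt.hcl might be as small as the sketch below. It assumes the stack only needs to inherit the root config and be ordered after the platform stack; whether a given service really depends on platform is an assumption, and the setup-project skill remains the full workflow.

```hcl
# stacks/<service>/terragrunt.hcl (sketch)

# Inherit providers, backend, variable loading, and generated tiers locals.
include "root" {
  path = find_in_parent_folders("terragrunt.hcl")
}

# Assumed ordering dependency; no outputs are read from it.
dependency "platform" {
  config_path  = "../platform"
  skip_outputs = true
}
```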
Shared Infrastructure Variables
All stacks use variables from terraform.tfvars for shared service endpoints (auto-loaded by Terragrunt). Never hardcode these values:
- var.nfs_server: NFS server IP (10.0.10.15)
- var.redis_host: Redis hostname (redis.redis.svc.cluster.local)
- var.postgresql_host: PostgreSQL hostname (postgresql.dbaas.svc.cluster.local)
- var.mysql_host: MySQL hostname (mysql.dbaas.svc.cluster.local)
- var.ollama_host: Ollama hostname (ollama.ollama.svc.cluster.local)
- var.mail_host: Mail server hostname (mail.viktorbarzin.me)
For standalone stacks: add variable "nfs_server" { type = string } (etc.) to main.tf.
For platform submodules: add the variable AND pass it through in stacks/platform/main.tf module block.
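The pass-through for a platform submodule can be sketched like this. The submodule name example is hypothetical, and redis_host stands in for whichever shared variable the module needs.

```hcl
# stacks/platform/main.tf (sketch)

# Declared at the stack level so Terragrunt can load it from terraform.tfvars.
variable "redis_host" {
  type = string
}

module "example" {
  source = "./modules/example" # hypothetical submodule

  # The submodule must declare its own: variable "redis_host" { type = string }
  redis_host = var.redis_host
}
```

A standalone stack needs only the variable block; the module pass-through is specific to the platform stack.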
Useful Commands
bash scripts/cluster_healthcheck.sh # Cluster health (24 checks)
bash scripts/cluster_healthcheck.sh --quiet # Only WARN/FAIL
cd stacks/<service> && terragrunt apply --non-interactive # Apply single stack
cd stacks && terragrunt run --all --non-interactive -- plan # Plan all
terraform fmt -recursive # Format all
CI/CD
- Woodpecker CI (.woodpecker/): pushes apply the platform stack, hosted at https://ci.viktorbarzin.me
- TLS renewal pipeline: cron-triggered renew2.sh (certbot + Cloudflare DNS)
- ALWAYS add [ci skip] to commit messages when you've already applied locally
- After committing, run git push origin master to sync
Infrastructure
- Proxmox hypervisor (192.168.1.127): see .claude/reference/proxmox-inventory.md for full VM table
- Kubernetes cluster: 5 nodes (k8s-master + k8s-node1-4, v1.34.2), GPU on node1 (Tesla T4)
- NFS: var.nfs_server (10.0.10.15), Redis: var.redis_host (redis.redis.svc.cluster.local)
- PostgreSQL: var.postgresql_host (postgresql.dbaas.svc.cluster.local), MySQL: var.mysql_host (mysql.dbaas.svc.cluster.local)
- Ollama: var.ollama_host (ollama.ollama.svc.cluster.local), Mail: var.mail_host (mail.viktorbarzin.me)
- Docker registry pull-through cache at 10.0.20.10 (ports 5000/5010/5020/5030/5040)
- GPU workloads need: node_selector = { "gpu": "true" } + toleration { key = "nvidia.com/gpu", value = "true", effect = "NoSchedule" }
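In a kubernetes_deployment, the GPU selector and toleration above would sit in the pod spec roughly as sketched here. The resource name, namespace, labels, and image are placeholders, not a real service from this repository.

```hcl
resource "kubernetes_deployment" "gpu_example" {
  metadata {
    name      = "gpu-example" # placeholder
    namespace = "gpu-example" # placeholder
  }

  spec {
    replicas = 1

    selector {
      match_labels = { app = "gpu-example" }
    }

    template {
      metadata {
        labels = { app = "gpu-example" }
      }

      spec {
        # Schedule only onto the node labelled gpu=true (k8s-node1).
        node_selector = {
          "gpu" = "true"
        }

        # Tolerate the taint that keeps non-GPU pods off that node.
        toleration {
          key    = "nvidia.com/gpu"
          value  = "true"
          effect = "NoSchedule"
        }

        container {
          name  = "app"
          image = "example/gpu-app:latest" # placeholder image
        }
      }
    }
  }
}
```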
Node Rebuild Procedure
To rebuild a K8s worker node from scratch (e.g., after disk failure or corruption):
- Drain the node (if still reachable): kubectl drain k8s-nodeX --ignore-daemonsets --delete-emptydir-data
- Delete the node from K8s: kubectl delete node k8s-nodeX
- Destroy the VM in Proxmox (or via Terraform: remove from stacks/infra/main.tf and apply)
- Ensure K8s template exists: The template ubuntu-2404-cloudinit-k8s-template (VMID 2000) must exist. If not, apply stacks/infra/ to recreate it.
- Get a fresh join command: ssh wizard@10.0.20.100 'sudo kubeadm token create --print-join-command'
- Update k8s_join_command in terraform.tfvars with the new join command
- Create the new VM: Add it back in stacks/infra/main.tf and cd stacks/infra && terragrunt apply --non-interactive
- Wait for cloud-init: The VM will install packages, configure containerd mirrors, and join the cluster automatically via cloud-init
- Verify the node joined: kubectl get nodes should show the new node as Ready
- For GPU node (k8s-node1) only: Apply the platform stack to re-apply GPU label and taint: cd stacks/platform && terragrunt apply --non-interactive (the null_resource.gpu_node_config in the nvidia module handles this)
- Verify containerd mirrors: ssh wizard@<node-ip> 'ls /etc/containerd/certs.d/' should show docker.io, ghcr.io, quay.io, registry.k8s.io, reg.kyverno.io
Note: kubeadm tokens expire after 24h by default. Generate a fresh one just before creating the VM.
Git Operations
- Git is slow: commands can take 30+ seconds. Use GIT_OPTIONAL_LOCKS=0 if git hangs.
- Commit only specific files. ALWAYS ask user before pushing.
Prometheus Alerts
- Rules in stacks/platform/modules/monitoring/prometheus_chart_values.tpl
- Groups: "R730 Host", "Nvidia Tesla T4 GPU", "Power", "Storage", "K8s Health", "Infrastructure Health", "Critical Services", "Cluster", "Traefik Ingress"
- Critical Services group (added 2026-02): PostgreSQLDown, MySQLDown, RedisDown, HeadscaleDown, AuthentikDown, LokiDown
- Predictive alerts: PVPredictedFull (predict_linear 24h ahead)
- Loki alert rules: HighErrorRate, PodCrashLoopBackOff, OOMKilled (in loki.tf ConfigMap)
Tier System & Resource Governance
- 0-core: Critical infra (ingress, DNS, VPN, auth) | 1-cluster: Redis, metrics, security | 2-gpu: GPU workloads | 3-edge: User-facing | 4-aux: Optional
- Tiers locals: Generated by root terragrunt.hcl into tiers.tf and available as local.tiers.core, local.tiers.cluster, etc. in all stacks
- Kyverno-based governance in stacks/platform/modules/kyverno/resource-governance.tf:
  - PriorityClasses: tier-0-core (1M) through tier-4-aux (200K, preemption=Never)
  - LimitRange defaults (Kyverno generate): auto-created per namespace tier
  - ResourceQuotas (Kyverno generate): auto-created per namespace tier (skip with label resource-governance/custom-quota=true)
  - Priority injection (Kyverno mutate): sets priorityClassName on Pods
- Custom quota override: monitoring, crowdsec, nvidia, realestate-crawler
- Pod Security Policies (Kyverno, audit mode) in stacks/platform/modules/kyverno/security-policies.tf:
  - deny-privileged-containers: Blocks privileged: true (exempt: frigate, nvidia, monitoring)
  - deny-host-namespaces: Blocks hostNetwork/PID/IPC (exempt: frigate, monitoring)
  - restrict-sys-admin: Blocks SYS_ADMIN capability (exempt: nvidia, monitoring)
  - require-trusted-registries: Validates images from docker.io, ghcr.io, quay.io, registry.k8s.io, 10.0.20.10
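To make "audit mode" concrete, a policy like deny-privileged-containers could be expressed as in the sketch below. This is not the definition in security-policies.tf: the use of kubernetes_manifest, the rule name, the message, and treating the exemptions as namespaces are all assumptions.

```hcl
resource "kubernetes_manifest" "deny_privileged_containers" {
  manifest = {
    apiVersion = "kyverno.io/v1"
    kind       = "ClusterPolicy"
    metadata = {
      name = "deny-privileged-containers"
    }
    spec = {
      # Audit: violations are reported in policy reports, not rejected.
      validationFailureAction = "Audit"
      background              = true
      rules = [
        {
          name = "deny-privileged" # hypothetical rule name
          match = {
            any = [{ resources = { kinds = ["Pod"] } }]
          }
          # Assumption: the exemptions listed above are namespaces.
          exclude = {
            any = [{ resources = { namespaces = ["frigate", "nvidia", "monitoring"] } }]
          }
          validate = {
            message = "Privileged containers are not allowed."
            pattern = {
              spec = {
                containers = [
                  {
                    # If securityContext.privileged is set, it must be false.
                    "=(securityContext)" = {
                      "=(privileged)" = "false"
                    }
                  }
                ]
              }
            }
          }
        }
      ]
    }
  }
}
```

Switching such a policy to enforcing mode means changing the failure action, so audit mode is a low-risk way to see what would break first.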
Traefik Security (hardened 2026-02)
- API dashboard: api.insecure = false, so the dashboard is only accessible via the Authentik-protected ingress
- Forwarded headers: insecure=false with trustedIPs set to Cloudflare IPv4 ranges + internal (10.0.0.0/8, 192.168.0.0/16)
- Security headers middleware (security-headers): HSTS (31536000s, includeSubDomains), X-Frame-Options: DENY, X-Content-Type-Options: nosniff, Referrer-Policy: strict-origin-when-cross-origin, Permissions-Policy
- Rate limiting: average=10, burst=50 (default); average=100, burst=1000 (immich-rate-limit for uploads)
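The header and rate-limit settings above map onto Traefik Middleware resources roughly as sketched here. The traefik namespace, the rate-limit middleware name, the kubernetes_manifest approach, and the traefik.io/v1alpha1 API group (older installs use traefik.containo.us) are assumptions; the real definitions live in stacks/platform/modules/traefik/middleware.tf. The Permissions-Policy value is not given in these notes, so it is left out.

```hcl
resource "kubernetes_manifest" "security_headers" {
  manifest = {
    apiVersion = "traefik.io/v1alpha1"
    kind       = "Middleware"
    metadata = {
      name      = "security-headers"
      namespace = "traefik" # assumed namespace
    }
    spec = {
      headers = {
        stsSeconds           = 31536000 # HSTS max-age
        stsIncludeSubdomains = true
        frameDeny            = true # X-Frame-Options: DENY
        contentTypeNosniff   = true # X-Content-Type-Options: nosniff
        referrerPolicy       = "strict-origin-when-cross-origin"
      }
    }
  }
}

resource "kubernetes_manifest" "rate_limit" {
  manifest = {
    apiVersion = "traefik.io/v1alpha1"
    kind       = "Middleware"
    metadata = {
      name      = "rate-limit" # hypothetical name for the default limiter
      namespace = "traefik"
    }
    spec = {
      rateLimit = {
        average = 10 # sustained requests per second
        burst   = 50 # short-term allowance above the average
      }
    }
  }
}
```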
User Preferences
- Calendar: Nextcloud at https://nextcloud.viktorbarzin.me
- Home Assistant: ha-london (default) at https://ha-london.viktorbarzin.me, ha-sofia at https://ha-sofia.viktorbarzin.me. "ha"/"HA" = ha-london.
- Frontend: Svelte for all new web apps
- Pod monitoring: Never use sleep; spawn a background subagent with kubectl get pods -w instead
Reference Data
- .claude/reference/service-catalog.md: Full service catalog (70+ services) with Cloudflare domains
- .claude/reference/proxmox-inventory.md: VM table, hardware specs, network topology, GPU config
- .claude/reference/github-api.md: GitHub API patterns with curl examples
- .claude/reference/authentik-state.md: Current applications, groups, users, login sources
Service-Specific Notes
Authentik (Identity Provider)
- URL: https://authentik.viktorbarzin.me | API: /api/v3/ | Token: authentik_api_token in tfvars
- Architecture: 3 server + 3 worker + 3 PgBouncer + embedded outpost
- Database: PostgreSQL via postgresql.dbaas:5432, PgBouncer at pgbouncer.authentik:6432
- Traefik integration: Forward auth via protected = true in ingress_factory
- OIDC for K8s: Issuer https://authentik.viktorbarzin.me/application/o/kubernetes/, client kubernetes (public)
- For management tasks, current state, and OIDC gotchas: see authentik and authentik-oidc-kubernetes skills
- For current apps/groups/users snapshot: see .claude/reference/authentik-state.md
AFFiNE (Visual Canvas)
- Image: ghcr.io/toeverything/affine:stable | Port: 3010 | Requires: PostgreSQL + Redis
- Migration: Init container runs node ./scripts/self-host-predeploy.js
- Storage: NFS /mnt/main/affine → /root/.affine/storage and /root/.affine/config
Wyoming Whisper (STT)
- Image: rhasspy/wyoming-whisper:latest | Port: 10300/TCP (Wyoming protocol)
- Model: small-int8 (CPU-only) | Access: 10.0.20.202:10300 (internal, no public DNS)
- HA Integration: Wyoming Protocol in ha-london
Gramps Web (Genealogy)
- Image: ghcr.io/gramps-project/grampsweb:latest | Port: 5000 | URL: https://family.viktorbarzin.me
- Components: Web app + Celery worker (2 containers in 1 pod) | Redis: DB 2 (broker), DB 3 (rate limiting)
- Storage: NFS /mnt/main/grampsweb with sub_paths
- Auth: Protected by Authentik (added 2026-02)
Loki + Alloy (Log Collection)
- Loki: grafana/loki:3.6.5 (single binary, 6Gi RAM, 30d retention / 720h)
- Alloy: grafana/alloy:v1.13.0 (DaemonSet, 512Mi requests / 1Gi limits)
- Storage: NFS PV /mnt/main/loki/loki (15Gi), WAL on tmpfs (2Gi)
- Alert rules: HighErrorRate, PodCrashLoopBackOff, OOMKilled (ConfigMap loki-alert-rules)
- Troubleshooting: "entry too far behind" on first start → restart Alloy DaemonSet
OpenClaw (AI Agent Gateway)
- Image: ghcr.io/openclaw/openclaw:2026.2.9 | Port: 18789 | URL: https://openclaw.viktorbarzin.me
- Init container: Downloads kubectl, terraform, git-crypt; clones infra repo
- ServiceAccount: openclaw with cluster-admin ClusterRoleBinding
- Model providers: Gemini (gemini-2.5-flash), Ollama (qwen2.5-coder:14b, deepseek-r1:14b), Llama API
Service Versions (as of 2026-02)
Immich v2.4.1 | AFFiNE stable | Whisper latest | Loki 3.6.5 | Alloy v1.13.0 | OpenClaw 2026.2.9