Infrastructure Repository Knowledge
Instructions for Claude
- When the user says "remember" something: Always update this file (.claude/CLAUDE.md) with the information so it persists across sessions
- When discovering new patterns or versions: Add them to the appropriate section below
- When making infrastructure changes: Always update this file to reflect the current state (new services, removed services, version changes, config changes)
- After every significant change: Proactively update this file (.claude/CLAUDE.md) to reflect what changed: new services, config changes, version bumps, new patterns, etc. This ensures knowledge persists across sessions automatically.
- After updating any .claude/ files: Always commit them immediately (git add .claude/ && git commit -m "[ci skip] update claude knowledge") to avoid building up unstaged changes.
- Skills available: Check the .claude/skills/ directory for specialized workflows (e.g., setup-project for deploying new services)
- Reference data: Check .claude/reference/ for inventory tables, API patterns, and current state snapshots
- CRITICAL: All infrastructure changes must go through Terraform/Terragrunt. NEVER modify cluster resources directly (kubectl apply/edit/patch, helm install, docker run). Use kubectl only for read-only operations and ephemeral debugging.
- CRITICAL: NEVER put sensitive data (API keys, passwords, tokens, credentials) into committed files unless encrypted via git-crypt. Secrets belong in terraform.tfvars or the secrets/ directory.
terraform.tfvarsorsecrets/directory. - CRITICAL: NEVER commit secrets — triple-check before every commit. Zero exceptions.
- New services MUST have CI/CD (Woodpecker CI pipeline) and monitoring (Prometheus alerts and/or Uptime Kuma).
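The rule against committing secrets can be backed by a quick check of the staged diff before each commit. The following is a minimal sketch, not an existing script in this repository: the function name scan_for_secrets and its keyword list are hypothetical, and a clean result does not prove a commit is safe.

```shell
# Hypothetical helper (not part of the repo): flags added lines in a unified
# diff that look like credential assignments. The keyword list is an
# illustrative assumption, not the repository's actual policy.
scan_for_secrets() {
  # Keep only added lines ("+..."), skipping the "+++ b/file" headers,
  # then look for "<keyword> =" or "<keyword>:" (case-insensitive).
  if grep -E '^\+[^+]' | grep -Eiq '(api[_-]?key|password|secret|token)[[:space:]]*[=:]'; then
    echo "possible secret in staged changes" >&2
    return 1
  fi
  return 0
}
```

Usage: git diff --cached | scan_for_secrets, followed by the commit only if it succeeds. Files that are expected to be encrypted can additionally be reviewed with git-crypt status.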
Execution Environment
- Terraform/Terragrunt: Always run locally: cd stacks/<service> && terragrunt apply --non-interactive
- kubectl: kubectl --kubeconfig $(pwd)/config
- GitHub API: Use curl with tokens from tfvars (see .claude/reference/github-api.md). The gh CLI is blocked by the sandbox.
Overview
Terragrunt-based infrastructure repository managing a home Kubernetes cluster on Proxmox VMs, with per-service state isolation. Each service has its own Terragrunt stack under stacks/. Uses git-crypt for secrets encryption.
Key File Paths
- terraform.tfvars: All secrets, DNS, Cloudflare config, WireGuard peers (git-crypt encrypted)
- terragrunt.hcl: Root config (providers, backend, variable loading)
- stacks/<service>/: Individual service stacks (terragrunt.hcl + main.tf)
- stacks/platform/: Core infrastructure (~22 services in modules/ subdir)
- stacks/infra/: Proxmox VM resources
- modules/kubernetes/ingress_factory/, setup_tls_secret/: Shared utility modules
- secrets/: git-crypt encrypted TLS certs and keys
Domains
- Public: viktorbarzin.me (Cloudflare-managed)
- Internal: viktorbarzin.lan (Technitium DNS)
Key Patterns
NFS Volume Pattern
Prefer inline NFS volumes over separate PV/PVC resources:
volume {
  name = "data"
  nfs {
    server = "10.0.10.15"
    path   = "/mnt/main/<service>"
  }
}
Only use PV/PVC when a Helm chart requires existingClaim.
Adding NFS Exports
- Edit secrets/nfs_directories.txt: add the path and keep the file sorted
- Run secrets/nfs_exports.sh from secrets/ to update TrueNAS
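Keeping the directory list sorted by hand is easy to get wrong, so the first step can be scripted. This is a sketch under assumptions: the helper name add_nfs_directory is hypothetical, the file is assumed to hold one path per line, and byte-order sorting is assumed to match the ordering the file already uses.

```shell
# Hypothetical helper: append an export path to a list file, then keep the
# file sorted and free of duplicates. Assumes one path per line.
add_nfs_directory() {
  local list_file="$1"   # e.g. secrets/nfs_directories.txt
  local new_path="$2"    # e.g. /mnt/main/<service>

  # If the file is non-empty and lacks a trailing newline, add one so the
  # new entry does not get glued onto the last line.
  if [ -s "$list_file" ] && [ -n "$(tail -c1 "$list_file")" ]; then
    printf '\n' >> "$list_file"
  fi
  printf '%s\n' "$new_path" >> "$list_file"

  # LC_ALL=C gives a stable byte-order sort; -u drops duplicates;
  # -o lets sort safely rewrite its own input file.
  LC_ALL=C sort -u -o "$list_file" "$list_file"
}
```

After calling it with the list file and the new path, run secrets/nfs_exports.sh from secrets/ as described in the steps above.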
Factory Pattern (multi-user services)
Structure: stacks/<service>/main.tf + factory/main.tf. Examples: actualbudget, freedify.
To add a user: export NFS share, add Cloudflare route in tfvars, add module block calling factory.
SMTP/Email
- Use: mail.viktorbarzin.me port 587 (STARTTLS). NOT mailserver.mailserver.svc.cluster.local (TLS cert mismatch).
- Credentials: mailserver_accounts in tfvars. Common: info@viktorbarzin.me
Anti-AI Scraping (5-Layer Defense)
All services have anti_ai_scraping = true by default in ingress_factory. Layers:
- Bot blocking (traefik-ai-bot-block): ForwardAuth → poison-fountain/auth. Returns 403 for GPTBot, ClaudeBot, CCBot, etc.
- X-Robots-Tag (traefik-anti-ai-headers): Adds noai, noimageai
- Trap links (traefik-anti-ai-trap-links): rewrite-body injects 5 hidden links before </body> to poison.viktorbarzin.me/article/*
- Tarpit: /article/* drip-feeds at ~100 bytes/sec
- Poison content: 50 cached docs from rnsaffn.com/poison2/ (CronJob every 6h, --http1.1 required)
Key files: stacks/poison-fountain/, stacks/platform/modules/traefik/middleware.tf, modules/kubernetes/ingress_factory/main.tf
Testing: curl -s -H "Accept: text/html,application/xhtml+xml" https://vaultwarden.viktorbarzin.me/ | grep -oE 'href="https://poison[^"]*"'
Disable per-service: anti_ai_scraping = false in ingress_factory call.
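The header and trap-link layers can be checked separately by piping curl output through two small filters. This is a sketch: the function names has_noai_header and count_trap_links are hypothetical, and the exact User-Agent string the bot-blocking layer matches is an assumption.

```shell
# Hypothetical helper: succeeds if the response headers on stdin include an
# X-Robots-Tag header whose value mentions "noai".
has_noai_header() {
  grep -i '^x-robots-tag:' | grep -qi 'noai'
}

# Hypothetical helper: counts trap links in an HTML body on stdin, using the
# same pattern as the testing command above.
count_trap_links() {
  grep -oE 'href="https://poison[^"]*"' | wc -l | tr -d ' '
}

# Example invocations (need network access, so they are left commented out):
#   curl -s -D - -o /dev/null https://vaultwarden.viktorbarzin.me/ | has_noai_header && echo "headers ok"
#   curl -s -H "Accept: text/html,application/xhtml+xml" https://vaultwarden.viktorbarzin.me/ | count_trap_links
#   curl -s -o /dev/null -w '%{http_code}\n' -A 'GPTBot' https://vaultwarden.viktorbarzin.me/
```

Per the layer list, a working setup should report the noai header, a non-zero trap-link count, and a 403 for a blocked bot User-Agent.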
Terragrunt Architecture
- Root terragrunt.hcl provides DRY provider, backend, and variable loading
- Each stack: stacks/<service>/main.tf with resources inline, state at state/stacks/<service>/terraform.tfstate
- Platform modules: stacks/platform/modules/<service>/, shared modules: modules/kubernetes/
- Dependencies via dependency block; variables from terraform.tfvars (unused silently ignored)
- secrets/ symlinks in stacks for TLS cert path resolution
- Syntax: --non-interactive (not --terragrunt-non-interactive), terragrunt run --all -- <command> (not run-all)
Adding a New Service
Use the setup-project skill for the full workflow. Quick reference:
- Create stacks/<service>/ with terragrunt.hcl, main.tf, secrets symlink
- Add Cloudflare DNS in terraform.tfvars
- Apply platform stack (for DNS): cd stacks/platform && terragrunt apply --non-interactive
- Apply service: cd stacks/<service> && terragrunt apply --non-interactive
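The first step of the quick reference can be sketched as a small scaffold function. This is an illustration only, and the setup-project skill remains the authoritative workflow. The function name new_stack is hypothetical, it copies terragrunt.hcl from an existing stack rather than guessing its contents, and the symlink target ../../secrets assumes the secrets/ directory sits at the repository root.

```shell
# Hypothetical scaffold; run from the repository root.
#   $1 = new service name, $2 = existing stack to copy terragrunt.hcl from
new_stack() {
  local service="$1" template="$2"
  local dir="stacks/$service"

  if [ -e "$dir" ]; then
    echo "$dir already exists" >&2
    return 1
  fi

  mkdir -p "$dir" || return 1
  # Reuse an existing stack's terragrunt.hcl instead of inventing its contents.
  cp "stacks/$template/terragrunt.hcl" "$dir/terragrunt.hcl" || return 1
  # Start with an empty main.tf; resources are added by hand afterwards.
  : > "$dir/main.tf"
  # Assumption: stack-level secrets symlinks point at the repo-root secrets/.
  ln -s ../../secrets "$dir/secrets"
}
```

For example, new_stack myservice actualbudget creates stacks/myservice/ with the three entries listed above; the DNS entry and both apply steps still have to be done as described.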
Useful Commands
bash scripts/cluster_healthcheck.sh # Cluster health (24 checks)
bash scripts/cluster_healthcheck.sh --quiet # Only WARN/FAIL
cd stacks/<service> && terragrunt apply --non-interactive # Apply single stack
cd stacks && terragrunt run --all --non-interactive -- plan # Plan all
terraform fmt -recursive # Format all
CI/CD
- Woodpecker CI (.woodpecker/): pushes apply the platform stack, hosted at https://ci.viktorbarzin.me
- TLS renewal pipeline: cron-triggered renew2.sh (certbot + Cloudflare DNS)
- ALWAYS add [ci skip] to commit messages when you've already applied locally
- After committing, run git push origin master to sync
Infrastructure
- Proxmox hypervisor (192.168.1.127): see .claude/reference/proxmox-inventory.md for full VM table
- Kubernetes cluster: 5 nodes (k8s-master + k8s-node1-4, v1.34.2), GPU on node1 (Tesla T4)
- NFS: 10.0.10.15, Redis: redis.redis.svc.cluster.local
- Docker registry pull-through cache at 10.0.20.10 (ports 5000/5010/5020/5030/5040)
- GPU workloads need: node_selector = { "gpu": "true" } + toleration { key = "nvidia.com/gpu", value = "true", effect = "NoSchedule" }
Node Rebuild Procedure
To rebuild a K8s worker node from scratch (e.g., after disk failure or corruption):
- Drain the node (if still reachable): kubectl drain k8s-nodeX --ignore-daemonsets --delete-emptydir-data
- Delete the node from K8s: kubectl delete node k8s-nodeX
- Destroy the VM in Proxmox (or via Terraform: remove from stacks/infra/main.tf and apply)
- Ensure K8s template exists: The template ubuntu-2404-cloudinit-k8s-template (VMID 2000) must exist. If not, apply stacks/infra/ to recreate it.
- Get a fresh join command: ssh wizard@10.0.20.100 'sudo kubeadm token create --print-join-command'
- Update k8s_join_command in terraform.tfvars with the new join command
- Create the new VM: Add it back in stacks/infra/main.tf and run cd stacks/infra && terragrunt apply --non-interactive
- Wait for cloud-init: The VM will install packages, configure containerd mirrors, and join the cluster automatically via cloud-init
- Verify the node joined: kubectl get nodes (should show the new node as Ready)
- For GPU node (k8s-node1) only: Apply the platform stack to re-apply GPU label and taint: cd stacks/platform && terragrunt apply --non-interactive (the null_resource.gpu_node_config in the nvidia module handles this)
- Verify containerd mirrors: ssh wizard@<node-ip> 'ls /etc/containerd/certs.d/' (should show docker.io, ghcr.io, quay.io, registry.k8s.io, reg.kyverno.io)
Note: kubeadm tokens expire after 24h by default. Generate a fresh one just before creating the VM.
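The drain, delete, and join-token steps of the procedure can be wrapped in a dry-run helper so the commands can be reviewed before anything touches the cluster. This is a sketch: the names run, prepare_node_rebuild, DRY_RUN, and KUBECONFIG_PATH are hypothetical, while the commands themselves are the ones listed in the procedure.

```shell
# Hypothetical wrapper: prints each command by default; set DRY_RUN=0 to
# actually execute it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '+ %s\n' "$*"
  else
    "$@"
  fi
}

# Hypothetical helper covering the drain, delete, and join-token steps for
# one worker node. KUBECONFIG_PATH defaults to ./config, matching the
# kubectl invocation used elsewhere in this file.
prepare_node_rebuild() {
  local node="$1"
  local kubeconfig="${KUBECONFIG_PATH:-$(pwd)/config}"

  run kubectl --kubeconfig "$kubeconfig" drain "$node" --ignore-daemonsets --delete-emptydir-data
  run kubectl --kubeconfig "$kubeconfig" delete node "$node"
  # Generate the join command last so the token is as fresh as possible.
  run ssh wizard@10.0.20.100 'sudo kubeadm token create --print-join-command'
}
```

Calling prepare_node_rebuild k8s-node2 prints the three commands; DRY_RUN=0 prepare_node_rebuild k8s-node2 runs them. A failed drain on an unreachable node does not stop the later steps in this sketch, which matches the "if still reachable" caveat in the procedure.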
Git Operations
- Git is slow: commands can take 30+ seconds. Use GIT_OPTIONAL_LOCKS=0 if git hangs.
- Commit only specific files. ALWAYS ask user before pushing.
Prometheus Alerts
- Rules in modules/kubernetes/monitoring/prometheus_chart_values.tpl
- Groups: "R730 Host", "Nvidia Tesla T4 GPU", "Power", "Cluster"
Tier System & Resource Governance
- 0-core: Critical infra (ingress, DNS, VPN, auth) | 1-cluster: Redis, metrics, security | 2-gpu: GPU workloads | 3-edge: User-facing | 4-aux: Optional
- Kyverno-based governance in modules/kubernetes/kyverno/resource-governance.tf:
  - PriorityClasses: tier-0-core (1M) through tier-4-aux (200K, preemption=Never)
  - LimitRange defaults (Kyverno generate): auto-created per namespace tier
  - ResourceQuotas (Kyverno generate): auto-created per namespace tier (skip with label resource-governance/custom-quota=true)
  - Priority injection (Kyverno mutate): sets priorityClassName on Pods
- Custom quota override: monitoring, crowdsec, nvidia, realestate-crawler
User Preferences
- Calendar: Nextcloud at https://nextcloud.viktorbarzin.me
- Home Assistant: ha-london (default) at https://ha-london.viktorbarzin.me, ha-sofia at https://ha-sofia.viktorbarzin.me. "ha"/"HA" = ha-london.
- Frontend: Svelte for all new web apps
- Pod monitoring: Never use sleep; spawn a background subagent with kubectl get pods -w instead
Reference Data
- .claude/reference/service-catalog.md: Full service catalog (70+ services) with Cloudflare domains
- .claude/reference/proxmox-inventory.md: VM table, hardware specs, network topology, GPU config
- .claude/reference/github-api.md: GitHub API patterns with curl examples
- .claude/reference/authentik-state.md: Current applications, groups, users, login sources
Service-Specific Notes
Authentik (Identity Provider)
- URL: https://authentik.viktorbarzin.me | API: /api/v3/ | Token: authentik_api_token in tfvars
- Architecture: 3 server + 3 worker + 3 PgBouncer + embedded outpost
- Database: PostgreSQL via postgresql.dbaas:5432, PgBouncer at pgbouncer.authentik:6432
- Traefik integration: Forward auth via protected = true in ingress_factory
- OIDC for K8s: Issuer https://authentik.viktorbarzin.me/application/o/kubernetes/, client kubernetes (public)
- For management tasks, current state, and OIDC gotchas: see the authentik and authentik-oidc-kubernetes skills
- For current apps/groups/users snapshot: see .claude/reference/authentik-state.md
AFFiNE (Visual Canvas)
- Image: ghcr.io/toeverything/affine:stable | Port: 3010 | Requires: PostgreSQL + Redis
- Migration: Init container runs node ./scripts/self-host-predeploy.js
- Storage: NFS /mnt/main/affine → /root/.affine/storage and /root/.affine/config
Wyoming Whisper (STT)
- Image: rhasspy/wyoming-whisper:latest | Port: 10300/TCP (Wyoming protocol)
- Model: small-int8 (CPU-only) | Access: 10.0.20.202:10300 (internal, no public DNS)
- HA Integration: Wyoming Protocol in ha-london
Gramps Web (Genealogy)
- Image: ghcr.io/gramps-project/grampsweb:latest | Port: 5000 | URL: https://family.viktorbarzin.me
- Components: Web app + Celery worker (2 containers in 1 pod) | Redis: DB 2 (broker), DB 3 (rate limiting)
- Storage: NFS /mnt/main/grampsweb with sub_paths
Loki + Alloy (Log Collection)
- Loki: grafana/loki:3.6.5 (single binary, 6Gi RAM, 7d retention)
- Alloy: grafana/alloy:v1.13.0 (DaemonSet, 128Mi/pod)
- Storage: NFS PV /mnt/main/loki/loki (15Gi), WAL on tmpfs (2Gi)
- Alert rules: HighErrorRate, PodCrashLoopBackOff, OOMKilled (ConfigMap loki-alert-rules)
- Troubleshooting: "entry too far behind" on first start → restart Alloy DaemonSet
OpenClaw (AI Agent Gateway)
- Image: ghcr.io/openclaw/openclaw:2026.2.9 | Port: 18789 | URL: https://openclaw.viktorbarzin.me
- Init container: Downloads kubectl, terraform, git-crypt; clones infra repo
- ServiceAccount: openclaw with cluster-admin ClusterRoleBinding
- Model providers: Gemini (gemini-2.5-flash), Ollama (qwen2.5-coder:14b, deepseek-r1:14b), Llama API
Service Versions (as of 2026-02)
Immich v2.4.1 | AFFiNE stable | Whisper latest | Loki 3.6.5 | Alloy v1.13.0 | OpenClaw 2026.2.9