Infrastructure Repository Knowledge
Instructions for Claude
- When the user says "remember" something: Always update this file (.claude/CLAUDE.md) with the information so it persists across sessions
- When discovering new patterns or versions: Add them to the appropriate section below
- When making infrastructure changes: Always update this file to reflect the current state (new services, removed services, version changes, config changes)
- After every significant change: Proactively update this file (.claude/CLAUDE.md) to reflect what changed: new services, config changes, version bumps, new patterns, etc. This ensures knowledge persists across sessions automatically.
- After updating any .claude/ files: Always commit them immediately (git add .claude/ && git commit -m "[ci skip] update claude knowledge") to avoid building up unstaged changes.
- Skills available: Check .claude/skills/ directory for specialized workflows (e.g., setup-project.md for deploying new services)
- CRITICAL: All infrastructure changes must go through Terraform/Terragrunt. NEVER modify cluster resources directly (e.g., via kubectl apply/edit/patch, helm install, docker run). Always make changes in the Terraform .tf files and apply with terragrunt apply. The real cluster state must never deviate from what's defined in Terraform. If a manual change is unavoidable (e.g., containerd config on running nodes), document it and ensure the Terraform templates match so future provisioning is consistent. Use kubectl only for read-only operations (get, describe, logs) and ephemeral debugging (run --rm, delete stuck pods), never for persistent state changes.
- CRITICAL: NEVER put sensitive data (API keys, passwords, tokens, credentials) into committed files unless they are encrypted (e.g., via git-crypt). Secrets belong in terraform.tfvars (which is git-crypt encrypted) or in the secrets/ directory. Never hardcode credentials in .tf files, scripts, .claude/ files, or any other unencrypted committed file. Always pass secrets through the Terraform variable chain (terraform.tfvars → main.tf → module variables).
- CRITICAL: NEVER commit secrets: triple-check before every commit that no API keys, passwords, tokens, or credentials are included in unencrypted files. This is a hard rule with zero exceptions.
- New services MUST have CI/CD: Set up Drone CI pipeline (.drone.yml) with GitHub/GitLab repo integration. Services should auto-build and auto-deploy.
- New services MUST have monitoring: Every new service should have monitoring via Prometheus (alerts/metrics) and/or Uptime Kuma (HTTP health checks). Add both when possible.
Execution Environment
- File operations: Read, Edit, Write, Glob, Grep tools
- Git commands: git status, git log, git diff, git add, git commit, git reset, etc.
- Shell commands: All tools (terraform, terragrunt, kubectl, helm, python, etc.) are available locally
- CRITICAL: Always run terragrunt/terraform locally, never on the remote server via SSH: cd stacks/<service> && terragrunt apply --non-interactive
- kubectl: Use kubectl --kubeconfig $(pwd)/config for cluster access
- GitHub API: Use curl with token from tfvars (see GitHub & Drone CI section below). gh CLI is blocked by sandbox restrictions.
- Drone CI API: Use curl with token from tfvars (see GitHub & Drone CI section below).
Overview
Terragrunt-based infrastructure repository managing a home Kubernetes cluster on Proxmox VMs, with per-service state isolation. Each service has its own Terragrunt stack under stacks/, enabling fast, independent plan/apply cycles. Uses git-crypt for secrets encryption.
Static File Paths (NEVER CHANGE)
- Main config: terraform.tfvars - All secrets, DNS, Cloudflare config, WireGuard peers
- Root Terragrunt: terragrunt.hcl - Root Terragrunt config (providers, backend, var loading)
- Service stacks: stacks/<service>/ - Individual service stacks (each has terragrunt.hcl + main.tf with resources inline)
- Infra stack: stacks/infra/ - Proxmox VM resources (templates, docker-registry, VMs)
- Platform stack: stacks/platform/ - Core infrastructure services (22 modules in modules/ subdir)
- Per-stack state: state/stacks/<service>/terraform.tfstate - Per-stack state files (gitignored)
- Service resources: stacks/<service>/main.tf - Service resources defined directly in stack root
- Platform modules: stacks/platform/modules/<service>/ - Platform service modules
- Shared modules: modules/kubernetes/ingress_factory/, modules/kubernetes/setup_tls_secret/
- Secrets: secrets/ - git-crypt encrypted TLS certs and keys
Network Topology (Static IPs)
┌─────────────────────────────────────────────────────────────────┐
│ 10.0.10.0/24 - Management Network │
├─────────────────────────────────────────────────────────────────┤
│ 10.0.10.10 - Wizard (main server) │
│ 10.0.10.15 - NFS Server (TrueNAS) - /mnt/main/* │
└─────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│ 10.0.20.0/24 - Kubernetes Network │
├─────────────────────────────────────────────────────────────────┤
│ 10.0.20.1 - pfSense Gateway │
│ 10.0.20.10 - Docker Registry VM (MAC: DE:AD:BE:EF:22:22) │
│ 10.0.20.100 - k8s-master │
│ 10.0.20.101 - Technitium DNS │
│ 10.0.20.102 - MetalLB IP Pool Start │
│ 10.0.20.200 - MetalLB IP Pool End │
└─────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│ 192.168.1.0/24 - Physical Network │
├─────────────────────────────────────────────────────────────────┤
│ 192.168.1.127 - Proxmox Hypervisor │
└─────────────────────────────────────────────────────────────────┘
Domains
- Public: viktorbarzin.me (Cloudflare-managed)
- Internal: viktorbarzin.lan (Technitium DNS)
Directory Structure
- terragrunt.hcl - Root Terragrunt configuration (providers, backend, variable loading)
- stacks/ - Individual Terragrunt stacks (one per service)
- stacks/infra/ - Proxmox VM resources (templates, docker-registry)
- stacks/platform/ - Core infrastructure (22 services in stacks/platform/modules/)
- stacks/<service>/ - Individual service stacks (resources directly in main.tf)
- stacks/platform/modules/<service>/ - Platform service module source code
- modules/kubernetes/ - Only shared utility modules: ingress_factory/, setup_tls_secret/
- modules/create-vm/ - Proxmox VM creation module
- state/ - Per-stack Terraform state files (gitignored)
- secrets/ - Encrypted secrets (TLS certs, keys) via git-crypt
- cli/ - Go CLI tool for infrastructure management
- scripts/ - Helper scripts (cluster management, node updates)
- playbooks/ - Ansible playbooks for node configuration
- diagram/ - Infrastructure diagrams (Python-based)
Key Patterns
- Each service in modules/kubernetes/<service>/main.tf defines its own namespace, deployments, services, and ingress
- NFS storage from 10.0.10.15 for persistent data
- TLS secrets managed via setup_tls_secret module
- Ingress uses Traefik (Helm chart, 3 replicas) with HTTP/3 (QUIC) enabled, Middleware CRDs for rate limiting, auth, CSP headers, CrowdSec bouncer, and analytics injection
- HTTP/3 enabled on Traefik (http3.enabled=true, advertisedPort=443 on websecure entrypoint) and Cloudflare (cloudflare_zone_settings_override with http3="on")
- GPU workloads use node_selector = { "gpu": "true" }
- Services expose to *.viktorbarzin.me domains
NFS Volume Pattern
Prefer inline NFS volumes over separate PV/PVC resources. Use the nfs {} block directly in pod/deployment/cronjob specs:
volume {
  name = "data"
  nfs {
    server = "10.0.10.15"
    path   = "/mnt/main/<service>"
  }
}
Only use PV/PVC when the Helm chart requires existingClaim (like the Nextcloud Helm chart).
Adding NFS Exports
To add a new NFS exported directory:
- Edit secrets/nfs_directories.txt - add the new directory path, keep the list sorted
- Run secrets/nfs_exports.sh from the secrets/ directory to update the NFS share via TrueNAS API
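The "keep the list sorted" step is easy to get wrong by hand. Below is a minimal sketch of one way to do it, assuming secrets/nfs_directories.txt holds one path per line and that plain byte-order sorting matches the file's existing order. The add_nfs_dir helper is hypothetical and not part of the repository.

```shell
# Hypothetical helper: add a directory to the export list and keep it sorted.
# Usage: add_nfs_dir <list-file> <directory>
add_nfs_dir() {
  local list_file="$1" dir="$2" tmp
  tmp="$(mktemp)" || return 1
  # Append the new path, drop blank lines, then sort and de-duplicate.
  # LC_ALL=C (byte order) is an assumption about the file's sort order.
  { cat "$list_file"; printf '\n%s\n' "$dir"; } \
    | grep -v '^$' | LC_ALL=C sort -u > "$tmp" || { rm -f "$tmp"; return 1; }
  # Write back in place so the original file's permissions are preserved.
  cat "$tmp" > "$list_file"
  rm -f "$tmp"
}

# Example, run from the repository root:
#   add_nfs_dir secrets/nfs_directories.txt /mnt/main/myservice
# Then run nfs_exports.sh from the secrets/ directory as described above.
```

Running the helper twice with the same path leaves the file unchanged, so it is safe to re-run.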
Factory Pattern (for multi-user services)
Used when a service needs one instance per user. Structure:
stacks/<service>/
├── main.tf          # Namespace, TLS secret, user module calls
└── factory/
    └── main.tf      # Deployment, service, ingress templates with ${var.name}
Examples: actualbudget, freedify
To add a new user:
- Export NFS share at /mnt/main/<service>/<username> in TrueNAS
- Add Cloudflare route in tfvars
- Add module block in main.tf calling factory
Init Container Pattern (for database migrations)
Use when a service needs to run database migrations before starting:
init_container {
  name    = "migration"
  image   = "service-image:tag"
  command = ["sh", "-c", "migration-command"]

  dynamic "env" {
    for_each = local.common_env
    content {
      name  = env.value.name
      value = env.value.value
    }
  }
}
Example: AFFiNE runs node ./scripts/self-host-predeploy.js in init container.
SMTP/Email Configuration
When configuring services to use the mailserver:
- Use public hostname: mail.viktorbarzin.me (for TLS cert validation)
- Do NOT use: mailserver.mailserver.svc.cluster.local (TLS cert mismatch)
- Port: 587 (STARTTLS)
- Credentials: Use existing accounts from mailserver_accounts in tfvars
- Common email: info@viktorbarzin.me for service notifications
Terragrunt Architecture
- Root terragrunt.hcl provides DRY provider, backend, and variable loading for all stacks
- Each stack contains its resources directly: stacks/<service>/main.tf has variable declarations, locals, and all Terraform resources inline
- Platform modules live at stacks/platform/modules/<service>/, referenced as source = "./modules/<service>"
- Shared utility modules (ingress_factory, setup_tls_secret, dockerhub_secret, oauth-proxy) remain at modules/kubernetes/ and are referenced with relative paths from each module
- State isolation: each stack has its own state file at state/stacks/<service>/terraform.tfstate
- Dependencies: service stacks depend on platform stack via dependency block in their terragrunt.hcl
- Variables loaded from terraform.tfvars automatically (unused vars silently ignored via extra_arguments)
- secrets/ symlinks in each stack for TLS cert resolution (path.root workaround)
- Terragrunt v0.99+: use --non-interactive (not --terragrunt-non-interactive)
- run-all syntax: terragrunt run --all -- <command> (not terragrunt run-all)
- The platform stack bundles ~22 core services that have cross-dependencies (traefik, monitoring, authentik, etc.)
- Individual service stacks are for services that can be deployed independently
Adding a New Service
When adding a new service to the cluster:
- Create stacks/<service>/ directory with:
  - terragrunt.hcl - Include root config, declare platform dependency
  - main.tf - All resources defined directly (variables, locals, namespace, deployments, services, ingress)
  - secrets - Symlink to ../../secrets (for TLS cert path resolution)
- Add Cloudflare DNS record in terraform.tfvars (cloudflare_proxied_names or cloudflare_non_proxied_names)
- Apply the platform stack (which includes cloudflared): cd stacks/platform && terragrunt apply --non-interactive
- Apply the new service: cd stacks/<service> && terragrunt apply --non-interactive
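The first step above can be sketched as a small scaffolding function. This is an illustration only: scaffold_stack is a hypothetical helper, and the bodies of the include and dependency blocks are assumptions about how a stack's terragrunt.hcl might look, not copies of an existing stack. Compare against a real stack before relying on it.

```shell
# Hypothetical helper: scaffold stacks/<service>/ with the three entries
# described above (terragrunt.hcl, main.tf, secrets symlink).
# Usage: scaffold_stack <repo-root> <service>
scaffold_stack() {
  local root="$1" service="$2"
  local dir="$root/stacks/$service"

  # Refuse to touch an existing stack.
  if [ -e "$dir" ]; then
    echo "$dir already exists" >&2
    return 1
  fi
  mkdir -p "$dir" || return 1

  # Assumed layout: include the root config and depend on the platform stack.
  cat > "$dir/terragrunt.hcl" <<'EOF'
include "root" {
  path = find_in_parent_folders()
}

dependency "platform" {
  config_path = "../platform"
}
EOF

  # Empty placeholder; variables, locals, namespace, deployments,
  # services, and ingress are defined here.
  : > "$dir/main.tf"

  # Relative symlink so TLS cert paths resolve from the stack root.
  ln -s ../../secrets "$dir/secrets"
}

# Example, run from the repository root:
#   scaffold_stack . myservice
```

The DNS record in terraform.tfvars and the two terragrunt apply steps still have to be done as listed above; the sketch only covers the directory layout.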
Common Variables
- tls_secret_name - TLS certificate secret name
- tier - Deployment tier label
- Service-specific passwords passed as variables
Service Versions (as of 2026-02)
- Immich: v2.4.1
- Freedify: latest (music streaming, factory pattern)
- AFFiNE: stable (visual canvas, uses PostgreSQL + Redis)
- Wyoming Whisper: latest (STT for Home Assistant, CPU on GPU node)
- Health: latest (Apple Health data dashboard, Svelte + FastAPI + Caddy, uses PostgreSQL)
- Gramps Web: latest (genealogy, uses Redis + Celery)
- Loki: 3.6.5 (log aggregation, single binary, 6Gi RAM, 24h in-memory chunks)
- Alloy: v1.13.0 (log collector DaemonSet, forwards to Loki)
- OpenClaw: 2026.2.9 (AI agent gateway, authentik-protected)
Useful Commands
# Cluster health check — ALWAYS use this to check cluster status
bash scripts/cluster_healthcheck.sh # Full color report
bash scripts/cluster_healthcheck.sh --quiet # Only WARN/FAIL
bash scripts/cluster_healthcheck.sh --json # Machine-readable
bash scripts/cluster_healthcheck.sh --fix # Auto-delete evicted pods
# Apply a single service stack
cd stacks/<service> && terragrunt apply --non-interactive
# Plan a single service stack
cd stacks/<service> && terragrunt plan --non-interactive
# Plan all stacks (full DAG)
cd stacks && terragrunt run --all --non-interactive -- plan
# Apply all stacks (full DAG)
cd stacks && terragrunt run --all --non-interactive -- apply
# Format all terraform files
terraform fmt -recursive
kubectl get pods -A
Cluster Health Check (scripts/cluster_healthcheck.sh):
- ALWAYS use this script to check cluster health — whether the user asks explicitly, after deploying/updating services, or whenever you need to verify cluster state. Never use ad-hoc kubectl commands to assess overall cluster health; use the script instead.
- Runs 24 checks: nodes, resources, conditions, pods, evicted, DaemonSets, deployments, PVCs, HPAs, CronJobs, CrowdSec, ingress, Prometheus alerts, Uptime Kuma, ResourceQuota pressure, StatefulSets, node disk, Helm releases, Kyverno, NFS, DNS, TLS certs, GPU, Cloudflare tunnel
- When adding new healthchecks or monitoring: Always update this script to validate the new component
Terragrunt apply examples:
- cd stacks/monitoring && terragrunt apply --non-interactive - Apply monitoring
- cd stacks/immich && terragrunt apply --non-interactive - Apply immich
- cd stacks/infra && terragrunt apply --non-interactive - Apply Proxmox VMs / docker registry
- cd stacks/platform && terragrunt apply --non-interactive - Apply all core/platform services
IMPORTANT: When deploying a new service, you must ALSO apply the platform stack (which includes cloudflared) to create the Cloudflare DNS record:
cd stacks/platform && terragrunt apply --non-interactive
Adding a name to cloudflare_non_proxied_names or cloudflare_proxied_names in terraform.tfvars only defines the record — it won't be created until the platform stack (which contains cloudflared) is applied.
Stack Structure
Terragrunt stacks under stacks/:
- stacks/infra/ - Proxmox VMs, templates, docker-registry
- stacks/platform/ - Core infrastructure (~22 services in modules/ subdir)
- stacks/<service>/ - Individual service stacks (resources directly in main.tf)
Each stack's terragrunt.hcl includes the root terragrunt.hcl which provides:
- Kubernetes + Helm providers (configured from terraform.tfvars)
- Local backend with per-stack state file (state/stacks/<service>/terraform.tfstate)
- Automatic loading of terraform.tfvars with unused vars ignored
Complete Service Catalog
Critical - Network & Auth (Tier: core)
| Service | Description | Stack |
|---|---|---|
| wireguard | VPN server | platform |
| technitium | DNS server (10.0.20.101) | platform |
| headscale | Tailscale control server | platform |
| traefik | Ingress controller (Helm) | platform |
| xray | Proxy/tunnel | platform |
| authentik | Identity provider (SSO) | platform |
| cloudflared | Cloudflare tunnel | platform |
| authelia | Auth middleware | platform |
| monitoring | Prometheus/Grafana/Loki stack | platform |
Storage & Security (Tier: cluster)
| Service | Description | Stack |
|---|---|---|
| vaultwarden | Bitwarden-compatible password manager | platform |
| redis | Shared Redis at redis.redis.svc.cluster.local | platform |
| immich | Photo management (GPU) | immich |
| nvidia | GPU device plugin | platform |
| metrics-server | K8s metrics | platform |
| uptime-kuma | Status monitoring | platform |
| crowdsec | Security/WAF | platform |
| kyverno | Policy engine | platform |
Admin
| Service | Description | Stack |
|---|---|---|
| k8s-dashboard | Kubernetes dashboard | platform |
| reverse-proxy | Generic reverse proxy | platform |
Active Use
| Service | Description | Stack |
|---|---|---|
| mailserver | Email (docker-mailserver) | mailserver |
| shadowsocks | Proxy | shadowsocks |
| webhook_handler | Webhook processing | webhook_handler |
| tuya-bridge | Smart home bridge | tuya-bridge |
| dawarich | Location history | dawarich |
| owntracks | Location tracking | owntracks |
| nextcloud | File sync/share | nextcloud |
| calibre | E-book management | calibre |
| onlyoffice | Document editing | onlyoffice |
| f1-stream | F1 streaming | f1-stream |
| rybbit | Analytics | rybbit |
| isponsorblocktv | SponsorBlock for TV | isponsorblocktv |
| actualbudget | Budgeting (factory pattern) | actualbudget |
Optional
| Service | Description | Stack |
|---|---|---|
| blog | Personal blog | blog |
| descheduler | Pod descheduler | descheduler |
| drone | CI/CD | drone |
| hackmd | Collaborative markdown | hackmd |
| kms | Key management | kms |
| privatebin | Encrypted pastebin | privatebin |
| vault | HashiCorp Vault | vault |
| reloader | ConfigMap/Secret reloader | reloader |
| city-guesser | Game | city-guesser |
| echo | Echo server | echo |
| url | URL shortener | url |
| excalidraw | Whiteboard | excalidraw |
| travel_blog | Travel blog | travel_blog |
| dashy | Dashboard | dashy |
| send | Firefox Send | send |
| ytdlp | YouTube downloader | ytdlp |
| wealthfolio | Finance tracking | wealthfolio |
| audiobookshelf | Audiobook server | audiobookshelf |
| paperless-ngx | Document management | paperless-ngx |
| jsoncrack | JSON visualizer | jsoncrack |
| servarr | Media automation (Sonarr/Radarr/etc) | servarr |
| ntfy | Push notifications | ntfy |
| cyberchef | Data transformation | cyberchef |
| diun | Docker image update notifier | diun |
| meshcentral | Remote management | meshcentral |
| homepage | Dashboard/startpage | homepage |
| matrix | Matrix chat server | matrix |
| linkwarden | Bookmark manager | linkwarden |
| changedetection | Web change detection | changedetection |
| tandoor | Recipe manager | tandoor |
| n8n | Workflow automation | n8n |
| real-estate-crawler | Property crawler | real-estate-crawler |
| tor-proxy | Tor proxy | tor-proxy |
| forgejo | Git forge | forgejo |
| freshrss | RSS reader | freshrss |
| navidrome | Music streaming | navidrome |
| networking-toolbox | Network tools | networking-toolbox |
| stirling-pdf | PDF tools | stirling-pdf |
| speedtest | Speed testing | speedtest |
| freedify | Music streaming (factory pattern) | freedify |
| netbox | Network documentation | netbox |
| infra-maintenance | Maintenance jobs | infra-maintenance |
| ollama | LLM server (GPU) | ollama |
| frigate | NVR/camera (GPU) | frigate |
| ebook2audiobook | E-book to audio (GPU) | ebook2audiobook |
| affine | Visual canvas/whiteboard (PostgreSQL + Redis) | affine |
| health | Apple Health data dashboard (PostgreSQL) | health |
| whisper | Wyoming Faster Whisper STT (CPU on GPU node) | whisper |
| grampsweb | Genealogy web app (Gramps Web) | grampsweb |
| openclaw | AI agent gateway (OpenClaw) | openclaw |
Cloudflare Domains
Proxied (CDN + WAF enabled)
blog, hackmd, privatebin, url, echo, f1tv, excalidraw, send,
audiobookshelf, jsoncrack, ntfy, cyberchef, homepage, linkwarden,
changedetection, tandoor, n8n, stirling-pdf, dashy, city-guesser,
travel, netbox
Non-Proxied (Direct DNS)
mail, wg, headscale, immich, calibre, vaultwarden, drone,
mailserver-antispam, mailserver-admin, webhook, uptime,
owntracks, dawarich, tuya, meshcentral, nextcloud, actualbudget,
onlyoffice, forgejo, freshrss, navidrome, ollama, openwebui,
isponsorblocktv, speedtest, freedify, rybbit, paperless,
servarr, prowlarr, bazarr, radarr, sonarr, flaresolverr,
jellyfin, jellyseerr, tdarr, affine, health, family, openclaw
Special Subdomains
- *.viktor.actualbudget - Actualbudget factory instances
- *.freedify - Freedify factory instances
- mailserver.* - Mail server components (antispam, admin)
CI/CD
- Drone CI (.drone.yml) for automated deployments
- Auto-updates TLS certificates
- ALWAYS add [ci skip] to commit messages when you've already run terraform apply to avoid triggering CI redundantly
- After committing, run git push origin master to sync changes
GitHub & Drone CI
GitHub API Access
- Username: ViktorBarzin
- Token location: terraform.tfvars as github_pat (git-crypt encrypted)
- Read token: grep github_pat terraform.tfvars | cut -d'"' -f2
- Scopes: Full access, including repo, admin:public_key, admin:repo_hook, delete_repo, admin:org, workflow, write:packages, and more
- gh CLI: Blocked by sandbox restrictions; use curl with the GitHub API instead
Common API Patterns
# Read token from tfvars
GITHUB_TOKEN=$(grep github_pat terraform.tfvars | cut -d'"' -f2)
# List repos
curl -s -H "Authorization: token $GITHUB_TOKEN" "https://api.github.com/users/ViktorBarzin/repos?per_page=100"
# Create repo
curl -s -X POST -H "Authorization: token $GITHUB_TOKEN" "https://api.github.com/user/repos" \
-d '{"name":"repo-name","private":true}'
# Add deploy key
curl -s -X POST -H "Authorization: token $GITHUB_TOKEN" "https://api.github.com/repos/ViktorBarzin/<repo>/keys" \
-d '{"title":"key-name","key":"ssh-ed25519 ...","read_only":false}'
# Create webhook (e.g., for Drone CI)
curl -s -X POST -H "Authorization: token $GITHUB_TOKEN" "https://api.github.com/repos/ViktorBarzin/<repo>/hooks" \
-d '{"config":{"url":"https://drone.viktorbarzin.me/hook","content_type":"json","secret":"..."},"events":["push","pull_request"]}'
# Get repo info
curl -s -H "Authorization: token $GITHUB_TOKEN" "https://api.github.com/repos/ViktorBarzin/<repo>"
Drone CI API Access
- Server: https://drone.viktorbarzin.me
- Token location: terraform.tfvars as drone_api_token (git-crypt encrypted)
- Read token: grep drone_api_token terraform.tfvars | cut -d'"' -f2
- Username: ViktorBarzin
Common API Patterns
# Read token from tfvars
DRONE_TOKEN=$(grep drone_api_token terraform.tfvars | cut -d'"' -f2)
# List repos
curl -s -H "Authorization: Bearer $DRONE_TOKEN" "https://drone.viktorbarzin.me/api/repos"
# Activate repo in Drone
curl -s -X POST -H "Authorization: Bearer $DRONE_TOKEN" "https://drone.viktorbarzin.me/api/repos/ViktorBarzin/<repo>"
# Trigger build
curl -s -X POST -H "Authorization: Bearer $DRONE_TOKEN" "https://drone.viktorbarzin.me/api/repos/ViktorBarzin/<repo>/builds"
# Get build info
curl -s -H "Authorization: Bearer $DRONE_TOKEN" "https://drone.viktorbarzin.me/api/repos/ViktorBarzin/<repo>/builds/<build-number>"
# Add secret to repo
curl -s -X POST -H "Authorization: Bearer $DRONE_TOKEN" "https://drone.viktorbarzin.me/api/repos/ViktorBarzin/<repo>/secrets" \
-d '{"name":"secret_name","data":"secret_value"}'
Capabilities
With these tokens, Claude can:
- GitHub: Create/delete repos, push code, manage SSH/deploy keys, manage webhooks, manage org settings, manage packages
- Drone CI: Activate repos, trigger/monitor builds, manage secrets, configure pipelines
Infrastructure
- Proxmox hypervisor for VMs (192.168.1.127)
- Kubernetes cluster with GPU node (5 nodes: k8s-master + k8s-node1-4, running v1.34.2)
- NFS server at 10.0.10.15 for storage
- Redis shared service at redis.redis.svc.cluster.local
- Docker registry pull-through cache at 10.0.20.10 (static IP via cloud-init)
  - Port 5000: docker.io (Docker Hub, with auth)
  - Port 5010: ghcr.io
  - Port 5020: quay.io
  - Port 5030: registry.k8s.io
  - Port 5040: reg.kyverno.io
  - Worker nodes use config_path = "/etc/containerd/certs.d" with per-registry hosts.toml files
  - k8s-master does NOT use pull-through cache (containerd 1.6.x incompatibility with config_path + mirrors)
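For quick reference when debugging image pulls, the port assignments above can be expressed as a lookup. This is a sketch only: registry_mirror is a hypothetical function name, and it prints just host:port because the scheme used in the hosts.toml files is not stated here.

```shell
# Map an upstream registry host to its pull-through cache endpoint.
# Ports come from the list above; the function itself is hypothetical.
registry_mirror() {
  local port
  case "$1" in
    docker.io)       port=5000 ;;
    ghcr.io)         port=5010 ;;
    quay.io)         port=5020 ;;
    registry.k8s.io) port=5030 ;;
    reg.kyverno.io)  port=5040 ;;
    *)
      echo "no pull-through cache for $1" >&2
      return 1
      ;;
  esac
  printf '10.0.20.10:%s\n' "$port"
}

# Example:
#   registry_mirror quay.io
```

Registries without a cache entry return a non-zero status, which makes the function usable in conditionals.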
Proxmox Host Hardware
- CPU: Intel Xeon E5-2699 v4 @ 2.20GHz (22 cores / 44 threads, single socket)
- RAM: 142 GB (Dell R730 server)
- GPU: NVIDIA Tesla T4 (PCIe passthrough to k8s-node1)
- Disks: 1.1TB + 931GB + 10.7TB (local storage)
- Proxmox access: ssh root@192.168.1.127
Proxmox Network Bridges
- vmbr0: Physical bridge on eno1, IP 192.168.1.127/24; connects to physical/home network (192.168.1.0/24)
- vmbr1: Internal-only bridge (no physical port), VLAN-aware; carries VLAN 10 (management 10.0.10.0/24) and VLAN 20 (kubernetes 10.0.20.0/24)
Proxmox VM Inventory
| VMID | Name | Status | CPUs | RAM | Network | Disk | Notes |
|---|---|---|---|---|---|---|---|
| 101 | pfsense | running | 8 | 16GB | vmbr0, vmbr1:vlan10, vmbr1:vlan20 | 32G | Gateway/firewall, routes between all networks |
| 102 | devvm | running | 16 | 8GB | vmbr1:vlan10 | 100G | Development VM on management network |
| 103 | home-assistant | running | 8 | 16GB | vmbr1:vlan10(down), vmbr0 | 32G | Home Assistant, net0 link disabled, uses vmbr0 |
| 105 | pbs | stopped | 16 | 8GB | vmbr1:vlan10 | 32G | Proxmox Backup Server (not in use) |
| 200 | k8s-master | running | 8 | 16GB | vmbr1:vlan20 | 64G | Kubernetes control plane (10.0.20.100) |
| 201 | k8s-node1 | running | 16 | 24GB | vmbr1:vlan20 | 128G | GPU node, Tesla T4 passthrough (hostpci0) |
| 202 | k8s-node2 | running | 8 | 16GB | vmbr1:vlan20 | 64G | K8s worker node |
| 203 | k8s-node3 | running | 8 | 16GB | vmbr1:vlan20 | 64G | K8s worker node |
| 204 | k8s-node4 | running | 8 | 16GB | vmbr1:vlan20 | 64G | K8s worker node |
| 220 | docker-registry | running | 4 | 4GB | vmbr1:vlan20 | 64G | Terraform-managed, MAC DE:AD:BE:EF:22:22 (10.0.20.10) |
| 300 | Windows10 | running | 16 | 8GB | vmbr0 | 100G | Windows VM on physical network |
| 9000 | truenas | running | 16 | 16GB | vmbr1:vlan10 | 32G+7×256G+1T | NFS server (10.0.10.15), multiple data disks |
VM Templates (stopped, used for cloning)
| VMID | Name | Purpose |
|---|---|---|
| 1000 | ubuntu-2404-cloudinit-non-k8s-template | Base template for non-K8s VMs |
| 1001 | docker-registry-template | Template for docker registry VM |
| 2000 | ubuntu-2404-cloudinit-k8s-template | Base template for K8s nodes |
Network Connectivity Summary
- pfSense (101) bridges all three networks: physical (vmbr0), management VLAN 10, and kubernetes VLAN 20
- K8s cluster (200-204) + docker-registry (220) are all on VLAN 20 (kubernetes network)
- TrueNAS (9000) + devvm (102) + PBS (105) are on VLAN 10 (management network)
- Home Assistant (103) is on physical network (vmbr0), with a disabled VLAN 10 interface
- Windows10 (300) is on physical network (vmbr0) only
GPU Node (k8s-node1)
- VMID: 201
- PCIe Passthrough: 0000:06:00.0 (NVIDIA Tesla T4)
- Taint: nvidia.com/gpu=true:NoSchedule - Only GPU workloads can run here
- Label: gpu=true
- GPU workloads must have both:
  - node_selector = { "gpu": "true" }
  - toleration { key = "nvidia.com/gpu", operator = "Equal", value = "true", effect = "NoSchedule" }
- Taint is applied via null_resource.gpu_node_taint in modules/kubernetes/nvidia/main.tf
Git Operations (IMPORTANT)
- Git is slow on this repo due to many files - commands can take 30+ seconds
- Use GIT_OPTIONAL_LOCKS=0 prefix if git hangs
- Always commit only specific files you changed, not everything
- ALWAYS ask user before pushing to remote - never push without explicit confirmation
Prometheus Alerts
- Alert rules are in modules/kubernetes/monitoring/prometheus_chart_values.tpl
- Under serverFiles.alerting_rules.yml.groups
- Groups: "R730 Host", "Nvidia Tesla T4 GPU", "Power", "Cluster"
- kube-state-metrics provides: kube_deployment_*, kube_statefulset_*, kube_daemonset_*
Tier System
- 0-core: Critical infrastructure (ingress, DNS, VPN, auth)
- 1-cluster: Cluster services (Redis, metrics, security)
- 2-gpu: GPU workloads (Immich, Ollama, Frigate)
- 3-edge: User-facing services
- 4-aux: Optional/auxiliary services
Resource Governance (Kyverno-based)
Four layers of noisy-neighbor protection, all defined in modules/kubernetes/kyverno/resource-governance.tf:
- PriorityClasses: tier-0-core (1M) through tier-4-aux (200K). tier-4-aux uses preemption_policy=Never.
- LimitRange defaults (Kyverno generate): Auto-creates tier-defaults LimitRange in namespaces based on tier label. Only affects containers without explicit resources.
- ResourceQuotas (Kyverno generate): Auto-creates tier-quota ResourceQuota in namespaces with tier labels. Excludes namespaces with resource-governance/custom-quota=true label.
- Priority injection (Kyverno mutate): Sets priorityClassName on Pods based on namespace tier label.
Custom quota override: Add label resource-governance/custom-quota: "true" to namespace, then define a custom kubernetes_resource_quota in the service's Terraform module. Currently used by: monitoring, crowdsec.
LimitRange defaults by tier:
| Tier | Default Req | Default Limit | Max |
|---|---|---|---|
| 0-core | 100m/128Mi | 2/4Gi | 8/16Gi |
| 1-cluster | 100m/128Mi | 2/4Gi | 4/8Gi |
| 2-gpu | 100m/256Mi | 4/8Gi | 8/16Gi |
| 3-edge | 50m/128Mi | 1/2Gi | 4/8Gi |
| 4-aux | 25m/64Mi | 500m/1Gi | 2/4Gi |
ResourceQuota hard limits by tier:
| Tier | Req CPU | Req Mem | Lim CPU | Lim Mem | Pods |
|---|---|---|---|---|---|
| 0-core | 8 | 8Gi | 32 | 64Gi | 100 |
| 1-cluster | 4 | 4Gi | 16 | 32Gi | 30 |
| 2-gpu | 8 | 8Gi | 48 | 96Gi | 40 |
| 3-edge | 4 | 4Gi | 16 | 32Gi | 30 |
| 4-aux | 2 | 2Gi | 8 | 16Gi | 20 |
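When sizing a new service, it can help to check a planned pod count against the tier's quota before applying. The sketch below copies the hard limits from the table above into a lookup. Both function names and the space-separated output format are hypothetical, and the values need updating by hand if the Kyverno policy changes.

```shell
# Print the ResourceQuota hard limits for a tier, copied from the table above.
# Output format (hypothetical): req_cpu req_mem lim_cpu lim_mem pods
tier_quota() {
  case "$1" in
    0-core)    echo "8 8Gi 32 64Gi 100" ;;
    1-cluster) echo "4 4Gi 16 32Gi 30" ;;
    2-gpu)     echo "8 8Gi 48 96Gi 40" ;;
    3-edge)    echo "4 4Gi 16 32Gi 30" ;;
    4-aux)     echo "2 2Gi 8 16Gi 20" ;;
    *)
      echo "unknown tier: $1" >&2
      return 1
      ;;
  esac
}

# Succeed when a namespace in the given tier may hold that many pods.
# Usage: pods_fit_tier <tier> <pod-count>
pods_fit_tier() {
  local max
  max="$(tier_quota "$1" | awk '{print $5}')"
  [ -n "$max" ] && [ "$2" -le "$max" ]
}

# Example:
#   pods_fit_tier 3-edge 12 && echo "fits"
```

Namespaces labelled resource-governance/custom-quota=true are outside this lookup, since their quota is defined separately.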
User Preferences
Calendar
- Default calendar: Nextcloud (always use unless otherwise specified)
- Nextcloud URL: https://nextcloud.viktorbarzin.me
- CalDAV endpoint: https://nextcloud.viktorbarzin.me/remote.php/dav/calendars/<username>/<calendar-name>/
Home Assistant
- Default smart home: Home Assistant (always use for smart home control)
- Two deployments:
  - ha-london (default): https://ha-london.viktorbarzin.me | Script: .claude/home-assistant.py | SSH: ssh pi@192.168.8.103, config at /home/pi/docker/homeAssistant/
  - ha-sofia: https://ha-sofia.viktorbarzin.me | Script: .claude/home-assistant-sofia.py | SSH: ssh vbarzin@192.168.1.8, config at /config/
- Aliases: "ha" or "HA" = ha-london. "ha sofia" or "ha-sofia" = ha-sofia.
Development
- Frontend framework: Svelte (user is learning it, so use Svelte for all new web apps)
Pod Monitoring After Updates
- Never use sleep to wait for pods. Instead, spawn a background subagent (Task tool with run_in_background: true) that continuously checks pod state (e.g., kubectl get pods -n <namespace> -w) and reports back when the pod is ready or if errors occur. This catches CrashLoopBackOff, ImagePullBackOff, and other failures much sooner than periodic sleep-based polling.
Skills & Workflows
Skills are specialized workflows for common tasks. Located in .claude/skills/.
Available Skills
setup-project (.claude/skills/setup-project/SKILL.md)
- Deploy new self-hosted services from GitHub repos
- Automated workflow: Docker image → Terraform module → Deploy
- Handles database setup, ingress, DNS configuration
- When to use: User provides GitHub URL or wants to deploy a new service
- Example: "Deploy [GitHub repo] to the cluster"
extend-vm-storage (.claude/skills/extend-vm-storage/SKILL.md)
- Extend disk storage on K8s node VMs (Proxmox-hosted)
- Automates: drain → shutdown → resize → boot → expand filesystem → uncordon
- When to use: A k8s node needs more disk space
- Example: "Extend storage on k8s-node2 by 64G"
Service-Specific Notes
Authentik (Identity Provider)
- Helm Chart: authentik v2025.10.3 from https://charts.goauthentik.io/
- URL: https://authentik.viktorbarzin.me
- API: https://authentik.viktorbarzin.me/api/v3/
- API Token: Stored in terraform.tfvars as authentik_api_token (non-expiring, superuser, identifier: claude-code-permanent). Read with: grep authentik_api_token terraform.tfvars | cut -d'"' -f2
- Namespace: authentik (tier: cluster)
- Architecture: 3 server replicas + 3 worker replicas + 3 PgBouncer replicas + 1 embedded outpost
- Database: PostgreSQL via postgresql.dbaas:5432, pooled through PgBouncer at pgbouncer.authentik:6432
- Redis: Shared at redis.redis.svc.cluster.local
- Terraform: modules/kubernetes/authentik/main.tf (Helm), pgbouncer.tf (connection pooling)
Authentik API Management
To call the API, use:
curl -s -H "Authorization: Bearer <TOKEN>" "https://authentik.viktorbarzin.me/api/v3/<endpoint>/"
Key API endpoints:
- core/users/ - List/create/update/delete users
- core/groups/ - List/create/update/delete groups
- core/applications/ - List/create applications
- providers/all/ - List all providers (OAuth2, Proxy, etc.)
- providers/oauth2/ - OAuth2/OIDC providers specifically
- providers/proxy/ - Proxy providers (forward auth)
- flows/instances/ - List flows
- stages/all/ - List stages
- sources/all/ - List sources (Google, GitHub, etc.)
- outposts/instances/ - List outposts
- propertymappings/all/ - List property mappings
- rbac/roles/ - List roles
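The token lookup and the curl pattern above can be combined into one read-only helper. This is a sketch under stated assumptions: authentik_get is a hypothetical name, it only issues GET requests, and it assumes the token line in terraform.tfvars has the quoted form that the documented grep/cut command expects.

```shell
# Hypothetical wrapper around the curl pattern above (GET requests only).
# Usage: authentik_get <tfvars-file> <endpoint>
authentik_get() {
  local tfvars="$1" endpoint="${2%/}" token
  # Same token lookup as documented in the Authentik notes.
  token="$(grep authentik_api_token "$tfvars" | cut -d'"' -f2)"
  if [ -z "$token" ]; then
    echo "authentik_api_token not found in $tfvars" >&2
    return 1
  fi
  # The documented pattern ends every endpoint with a slash, so one is
  # appended here after stripping any the caller supplied.
  curl -s -H "Authorization: Bearer $token" \
    "https://authentik.viktorbarzin.me/api/v3/${endpoint}/"
}

# Example, run from the repository root:
#   authentik_get terraform.tfvars core/users
```

Reading the token inside the function keeps it out of shell history, though it is still passed to curl as a command-line argument, as in the original pattern.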
Current Applications (9)
| Application | Provider Type | Auth Flow |
|---|---|---|
| Cloudflare Access | OAuth2/OIDC | explicit consent |
| Domain wide catch all | Proxy (forward auth) | implicit consent |
| Grafana | OAuth2/OIDC | implicit consent |
| Headscale | OAuth2/OIDC | explicit consent |
| Immich | OAuth2/OIDC | explicit consent |
| Kubernetes | OAuth2/OIDC (public) | implicit consent |
| linkwarden | OAuth2/OIDC | explicit consent |
| Matrix | OAuth2/OIDC | implicit consent |
| wrongmove | OAuth2/OIDC | implicit consent |
Current Groups (9)
| Group | Parent | Superuser | Purpose |
|---|---|---|---|
| Allow Login Users | — | No | Parent group for login-permitted users |
| authentik Admins | — | Yes | Full admin access |
| authentik Read-only | — | No | Read-only access (has role) |
| Headscale Users | Allow Login Users | No | VPN access |
| Home Server Admins | Allow Login Users | No | Server admin access |
| Wrongmove Users | Allow Login Users | No | Real-estate app access |
| kubernetes-admins | — | No | K8s cluster-admin RBAC |
| kubernetes-power-users | — | No | K8s power-user RBAC |
| kubernetes-namespace-owners | — | No | K8s namespace-owner RBAC |
Current Users (7 real users)
| Username | Name | Type | Groups |
|---|---|---|---|
| akadmin | authentik Default Admin | internal | authentik Admins, Home Server Admins, Headscale Users |
| vbarzin@gmail.com | Viktor Barzin | internal | authentik Admins, Home Server Admins, Wrongmove Users, Headscale Users |
| emil.barzin@gmail.com | Emil Barzin | internal | Home Server Admins, Headscale Users |
| ancaelena98@gmail.com | Anca Milea | external | Wrongmove Users, Headscale Users |
| vabbit81@gmail.com | GHEORGHE Milea | external | Headscale Users |
| valentinakolevabarzina@gmail.com | Валентина Колева-Барзина | internal | Headscale Users |
| anca.r.cristian10@gmail.com | — | internal | Wrongmove Users |
| kadir.tugan@gmail.com | Kadir | internal | Wrongmove Users |
Login Sources (Social Login)
- Google (OAuth) — user matching by identifier
- GitHub (OAuth) — user matching by email_link
- Facebook (OAuth) — user matching by email_link
- All use the same authentication flow (1a779f24) and enrollment flow (87572804)
Authorization Flows
- Explicit consent (default-provider-authorization-explicit-consent): Shows a consent screen before redirecting. Used for Immich, Linkwarden, Headscale, Cloudflare
- Implicit consent (default-provider-authorization-implicit-consent): Auto-redirects without consent. Used for Grafana, Matrix, Kubernetes, Domain catch-all, Wrongmove
Traefik Integration
- Forward auth middleware: authentik-forward-auth in Traefik namespace
- Outpost endpoint: http://ak-outpost-authentik-embedded-outpost.authentik.svc.cluster.local:9000/outpost.goauthentik.io/auth/traefik
- Services opt in via protected = true in ingress_factory
- Response headers: X-authentik-username, X-authentik-uid, X-authentik-email, X-authentik-name, X-authentik-groups, Set-Cookie
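As an illustration of what a protected backend receives, here is a small sketch that turns the forwarded response headers into an identity object. The AuthentikIdentity type and identity_from_headers helper are hypothetical, and the sketch assumes X-authentik-groups carries group names separated by a pipe character.

```python
from dataclasses import dataclass
from typing import List, Mapping, Optional


@dataclass
class AuthentikIdentity:
    username: str
    uid: str
    email: str
    name: str
    groups: List[str]


def identity_from_headers(headers: Mapping[str, str]) -> Optional[AuthentikIdentity]:
    """Return the forwarded identity, or None if the request was not authenticated."""
    # HTTP header names are case-insensitive, so compare in lowercase.
    lowered = {key.lower(): value for key, value in headers.items()}
    username = lowered.get("x-authentik-username")
    if not username:
        return None
    # Assumption: groups arrive as a single pipe-separated string.
    groups = [g for g in lowered.get("x-authentik-groups", "").split("|") if g]
    return AuthentikIdentity(
        username=username,
        uid=lowered.get("x-authentik-uid", ""),
        email=lowered.get("x-authentik-email", ""),
        name=lowered.get("x-authentik-name", ""),
        groups=groups,
    )
```

A backend should only trust these headers when it is reachable exclusively through Traefik, since any direct client could otherwise set them.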
OIDC for Kubernetes API
- Issuer: https://authentik.viktorbarzin.me/application/o/kubernetes/
- Client ID: kubernetes (public client, no secret)
- Username claim: email, Groups claim: groups
- Signing key: authentik Self-signed Certificate (must be assigned to the provider or JWKS will be empty)
- Redirect URIs: Regex mode http://localhost:.* and http://127\.0\.0\.1:.* (kubelogin picks random ports)
- Configured via: SSH to kube-apiserver manifest (modules/kubernetes/rbac/apiserver-oidc.tf)
- RBAC module: modules/kubernetes/rbac/main.tf (admin/power-user/namespace-owner roles)
- Self-service portal: modules/kubernetes/k8s-portal/ (SvelteKit app at https://k8s-portal.viktorbarzin.me)
- User definition: k8s_users variable in terraform.tfvars
- Audit logging: Enabled via modules/kubernetes/rbac/audit-policy.tf, logs at /var/log/kubernetes/audit.log
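To see why the two regex redirect URIs accept any kubelogin callback port, the patterns can be exercised locally. This sketch assumes the provider applies each pattern as a full match against the redirect URI; the redirect_allowed helper is hypothetical.

```python
import re

# The two regex-mode redirect URIs configured on the kubernetes provider.
ALLOWED_REDIRECT_PATTERNS = [
    r"http://localhost:.*",
    r"http://127\.0\.0\.1:.*",
]


def redirect_allowed(uri: str) -> bool:
    """True if the URI fully matches one of the configured patterns."""
    return any(re.fullmatch(pattern, uri) for pattern in ALLOWED_REDIRECT_PATTERNS)
```

Under that assumption any loopback port passes, while a URI that merely contains a loopback address somewhere in its query string does not.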
CRITICAL GOTCHAS when setting up Authentik OIDC for Kubernetes:
- Signing key MUST be assigned to the OAuth2 provider. Without it, the JWKS endpoint returns {} and kube-apiserver can't validate tokens.
- Email mapping must set email_verified: True. The default Authentik email scope mapping hardcodes email_verified: False, which causes kube-apiserver to reject the token with oidc: email not verified. Use a custom scope mapping: return {"email": request.user.email, "email_verified": True}
- kubelogin needs --oidc-extra-scope for email, profile, groups. Without these, only openid is requested and the token lacks the email claim, causing oidc: parse username claims "email": claim not present.
- Redirect URIs must use regex mode (http://localhost:.*) because kubelogin picks random ports, not just 8000/18000.
- Kubelet static pod manifest changes require a full cycle to take effect: remove manifest, stop kubelet, remove containers via crictl, re-add manifest, start kubelet. Simple touch or kubelet restart is not enough.
- Property mappings endpoint in Authentik 2025.10.x is propertymappings/provider/scope/ (not the older propertymappings/scope/).
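When debugging the claim-related gotchas above, it can help to inspect the ID token that kubelogin obtained. The following is a rough sketch with hypothetical helper names; it decodes the JWT payload without verifying the signature, so it is only suitable for local troubleshooting.

```python
import base64
import json
from typing import List


def decode_claims(token: str) -> dict:
    """Decode the payload segment of a JWT WITHOUT verifying its signature."""
    payload = token.split(".")[1]
    # JWT segments are base64url without padding; restore it before decoding.
    padded = payload + "=" * (-len(payload) % 4)
    return json.loads(base64.urlsafe_b64decode(padded))


def claim_problems(claims: dict) -> List[str]:
    """List the claim issues that would make kube-apiserver reject or mis-map a token."""
    problems = []
    if "email" not in claims:
        problems.append("email claim missing (request the email scope)")
    elif claims.get("email_verified") is not True:
        problems.append("email_verified is not true (use the custom scope mapping)")
    if "groups" not in claims:
        problems.append("groups claim missing (request the groups scope)")
    return problems
```

An empty list from claim_problems means the token carries the claims the API server is configured to read; it says nothing about signature validity or expiry.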
Common Management Tasks
Add a new OAuth2 application:
- Create OAuth2 provider: POST /api/v3/providers/oauth2/ with client_id, client_secret, redirect_uris, authorization_flow, etc.
- Create application: POST /api/v3/core/applications/ with name, slug, provider pk
- (Optional) Bind to group policy for access control
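The two POST bodies from the steps above might look roughly like this. This is a sketch under assumptions, not the confirmed schema: the function names are hypothetical, and the exact field shapes (redirect_uris as a list of matching_mode/url objects, a required invalidation_flow) are assumed for Authentik 2025.10.x and should be checked against the instance's API browser before use.

```python
def oauth2_provider_payload(name: str, client_id: str, client_secret: str,
                            redirect_uri: str, authorization_flow_pk: str,
                            invalidation_flow_pk: str) -> dict:
    """Body for POST /api/v3/providers/oauth2/ (field shapes are assumptions)."""
    return {
        "name": name,
        "client_type": "confidential",
        "client_id": client_id,
        "client_secret": client_secret,
        "authorization_flow": authorization_flow_pk,
        "invalidation_flow": invalidation_flow_pk,
        # Assumed shape: one entry per URI, with a matching mode.
        "redirect_uris": [{"matching_mode": "strict", "url": redirect_uri}],
    }


def application_payload(name: str, slug: str, provider_pk: int) -> dict:
    """Body for POST /api/v3/core/applications/, linking the provider created first."""
    return {"name": name, "slug": slug, "provider": provider_pk}
```

The provider must be created first, because the application body references the pk returned by the provider POST.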
Add a user to a group:
# Get group pk, then PATCH with updated users list
curl -X PATCH -H "Authorization: Bearer <TOKEN>" -H "Content-Type: application/json" \
"https://authentik.viktorbarzin.me/api/v3/core/groups/<group-pk>/" \
-d '{"users": [<existing_user_pks>, <new_user_pk>]}'
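Because the PATCH sends the full users list, the new user has to be merged into the existing members rather than sent alone. A minimal sketch of building that body (users_patch_body is a hypothetical helper; the existing pks are assumed to come from a prior GET on the same group):

```python
from typing import Dict, Iterable, List


def users_patch_body(existing_user_pks: Iterable[int], new_user_pk: int) -> Dict[str, List[int]]:
    """Build the PATCH body that keeps current members and adds one user."""
    merged = list(existing_user_pks)
    # Skip the append if the user is already a member, so the call is idempotent.
    if new_user_pk not in merged:
        merged.append(new_user_pk)
    return {"users": merged}
```

Sending only the new pk would be expected to drop every existing member from the group, which is why the merge step matters.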
Protect a service with forward auth:
Set protected = true in the service's ingress_factory call in Terraform.
AFFiNE (Visual Canvas)
- Image: ghcr.io/toeverything/affine:stable
- Port: 3010
- Requires: PostgreSQL + Redis
- Migration: Init container runs node ./scripts/self-host-predeploy.js
- Storage: NFS at /mnt/main/affine mounted to /root/.affine/storage and /root/.affine/config
- Key env vars:
  - AFFINE_SERVER_EXTERNAL_URL - Public URL (e.g., https://affine.viktorbarzin.me)
  - AFFINE_SERVER_HTTPS - Set to true behind TLS ingress
  - DATABASE_URL - PostgreSQL connection string
  - REDIS_SERVER_HOST - Redis hostname
  - MAILER_* - SMTP configuration for email invites
- Local-first: Data stored in browser by default; syncs to server when user creates account
- Docs: https://docs.affine.pro/self-host-affine
Wyoming Whisper (STT for Home Assistant)
- Image: rhasspy/wyoming-whisper:latest
- Port: 10300/TCP (Wyoming protocol)
- Model: small-int8 (CPU-optimized, no CUDA variant available from upstream)
- Runs on: GPU node (node_selector gpu=true + nvidia toleration) but uses CPU only
- Storage: NFS at /mnt/main/whisper → /data (model cache)
- Exposure: Internal only via Traefik TCP entrypoint whisper-tcp → IngressRouteTCP
- Access: 10.0.20.202:10300 (Traefik LB IP, no public DNS)
- HA Integration: Wyoming Protocol integration in ha-london, host 10.0.20.202, port 10300
- No GPU acceleration: Official image is CPU-only (Debian + PyTorch CPU). The mib1185/wyoming-faster-whisper-cuda image exists but requires self-build.
Gramps Web (Genealogy)
- Image: ghcr.io/gramps-project/grampsweb:latest
- Port: 5000
- URL: https://family.viktorbarzin.me
- Components: Web app + Celery worker (2 containers in 1 pod)
- Requires: Shared Redis (DB 2 for Celery broker/backend, DB 3 for rate limiting)
- Storage: NFS at /mnt/main/grampsweb with sub_paths: users, indexdir, thumbnail_cache, cache, secret, grampsdb, media, tmp
- Key env vars:
  - GRAMPSWEB_SECRET_KEY - Flask secret key (generated via random_password)
  - GRAMPSWEB_TREE - Tree name
  - GRAMPSWEB_BASE_URL - Public URL
  - GRAMPSWEB_CELERY_CONFIG__broker_url / result_backend - Redis connection
  - GRAMPSWEB_REGISTRATION_DISABLED - Set to True
  - GRAMPSWEB_EMAIL_* - SMTP configuration
  - GRAMPSWEB_LLM_* - Ollama AI integration
- Celery command: celery -A gramps_webapi.celery worker --loglevel=INFO --concurrency=2
- Registration: Disabled; first user created via UI setup wizard
Loki + Alloy (Centralized Log Collection)
- Loki image: grafana/loki:3.6.5 (Helm chart, single binary mode)
- Alloy image: grafana/alloy:v1.13.0 (Helm chart, DaemonSet)
- Config files: modules/kubernetes/monitoring/loki.tf, loki.yaml, alloy.yaml
- Port: 3100/TCP (Loki API)
- Storage: NFS PV at /mnt/main/loki/loki (15Gi), WAL on tmpfs (2Gi in-memory)
- Memory: Loki 6Gi limit, Alloy 128Mi per pod (4 worker nodes)
- Disk-friendly tuning: max_chunk_age: 24h, chunk_idle_period: 12h (holds chunks in memory, flushes ~once/day)
- Retention: 7 days (retention_period: 168h), compactor enforces deletion
- Crash policy: WAL on tmpfs, so up to 24h of logs can be lost on a crash (alerts still fire in real time)
- Ruler: Evaluates LogQL alert rules, fires to http://prometheus-alertmanager.monitoring.svc.cluster.local:9093
- Alert rules: HighErrorRate, PodCrashLoopBackOff, OOMKilled (ConfigMap loki-alert-rules)
- Grafana: Datasource UID P8E80F9AEF21F6940, dashboard "Loki Kubernetes Logs" (stored in MySQL, not file-provisioned)
- Sysctl DaemonSet: sysctl-inotify sets fs.inotify.max_user_watches=1048576 on all nodes (required for Alloy fsnotify)
- Disabled components: gateway, chunksCache, resultsCache (not needed for single binary)
- Key paths: Compactor at /var/loki/compactor, ruler scratch at /var/loki/scratch (must be under /var/loki because the root FS is read-only)
- Querying: Grafana Explore with LogQL, e.g. {namespace="monitoring"} |= "error"
- Troubleshooting: If "entry too far behind" errors appear on first start, restart the Alloy DaemonSet (kubectl rollout restart ds -n monitoring alloy). Alloy reads historical logs on first boot, which Loki rejects; the errors clear after the restart
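Outside Grafana, the same LogQL can be sent to Loki's HTTP API on port 3100. The sketch below only builds the query_range URL. The in-cluster hostname is an assumption (the service name is not stated in these notes), and query_range_url is a hypothetical helper.

```python
from urllib.parse import urlencode

# Assumption: Loki is reachable at this in-cluster address; adjust to the real service.
LOKI_URL = "http://loki.monitoring.svc.cluster.local:3100"


def query_range_url(logql: str, start_ns: int, end_ns: int,
                    limit: int = 100, base: str = LOKI_URL) -> str:
    """Build a /loki/api/v1/query_range URL for a LogQL query.

    start_ns and end_ns are Unix timestamps in nanoseconds.
    """
    params = {"query": logql, "start": start_ns, "end": end_ns, "limit": limit}
    return f"{base}/loki/api/v1/query_range?{urlencode(params)}"
```

With 7-day retention, a start timestamp older than 168h would not be expected to return data.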
OpenClaw (AI Agent Gateway)
- Image: ghcr.io/openclaw/openclaw:2026.2.9
- Port: 18789
- URL: https://openclaw.viktorbarzin.me (authentik-protected)
- Namespace: openclaw (tier: aux)
- Formerly: moltbot (renamed in Feb 2026)
- Architecture: Single pod with init container (tools download + repo clone) + main container (OpenClaw gateway)
- Init container: Downloads kubectl v1.34.2, terraform 1.14.5, git-crypt; clones infra repo; runs terraform init
- ServiceAccount: openclaw with cluster-admin ClusterRoleBinding (for managing cluster resources)
- Storage: NFS at /mnt/main/openclaw/workspace (git repo) and /mnt/main/openclaw/data (persistent data)
- Config: openclaw.json ConfigMap with model providers (Gemini, Ollama, Llama API), tool permissions, and agent defaults
- Variables: openclaw_ssh_key, openclaw_skill_secrets in terraform.tfvars
- Skill secrets: Home Assistant tokens (london + sofia), Uptime Kuma password, passed as env vars
- Model providers: Gemini (gemini-2.5-flash), Ollama (qwen2.5-coder:14b, deepseek-r1:14b), Llama API (Llama-3.3-70B, Llama-4-Scout/Maverick)