infra/.claude/CLAUDE.md
Viktor Barzin 116c4d9c30 [ci skip] Remove legacy files and orphaned modules
Delete 20 orphaned module directories and 3 stray files from
modules/kubernetes/ that are no longer referenced by any stack.
Remove 7 root-level legacy files including the empty tfstate,
27MB terraform zip, commented-out main.tf, and migration notes.
Clean up commented-out dockerhub_secret and oauth-proxy references
in blog, travel_blog, and city-guesser stacks. Remove stale
frigate config.yaml entry from .gitignore. Remove ephemeral
docs/plans/ directory.
2026-02-22 15:23:27 +00:00


Infrastructure Repository Knowledge

Instructions for Claude

  • When the user says "remember" something: Always update this file (.claude/CLAUDE.md) with the information so it persists across sessions
  • When discovering new patterns or versions: Add them to the appropriate section below
  • When making infrastructure changes: Always update this file to reflect the current state (new services, removed services, version changes, config changes)
  • After every significant change: Proactively update this file (.claude/CLAUDE.md) to reflect what changed — new services, config changes, version bumps, new patterns, etc. This ensures knowledge persists across sessions automatically.
  • After updating any .claude/ files: Always commit them immediately (git add .claude/ && git commit -m "[ci skip] update claude knowledge") to avoid building up unstaged changes.
  • Skills available: Check .claude/skills/ directory for specialized workflows (e.g., setup-project.md for deploying new services)
  • CRITICAL: All infrastructure changes must go through Terraform/Terragrunt. NEVER modify cluster resources directly (e.g., via kubectl apply/edit/patch, helm install, docker run). Always make changes in the Terraform .tf files and apply with terragrunt apply. The real cluster state must never deviate from what's defined in Terraform — if a manual change is unavoidable (e.g., containerd config on running nodes), document it and ensure the Terraform templates match so future provisioning is consistent. Use kubectl only for read-only operations (get, describe, logs) and ephemeral debugging (run --rm, delete stuck pods), never for persistent state changes.
  • CRITICAL: NEVER put sensitive data (API keys, passwords, tokens, credentials) into committed files unless they are encrypted (e.g., via git-crypt). Secrets belong in terraform.tfvars (which is git-crypt encrypted) or in the secrets/ directory. Never hardcode credentials in .tf files, scripts, .claude/ files, or any other unencrypted committed file. Always pass secrets through the Terraform variable chain (terraform.tfvars → main.tf → module variables).
  • CRITICAL: NEVER commit secrets — triple-check before every commit that no API keys, passwords, tokens, or credentials are included in unencrypted files. This is a hard rule with zero exceptions.
  • New services MUST have CI/CD: Set up Drone CI pipeline (.drone.yml) with GitHub/GitLab repo integration. Services should auto-build and auto-deploy.
  • New services MUST have monitoring: Every new service should have monitoring via Prometheus (alerts/metrics) and/or Uptime Kuma (HTTP health checks). Add both when possible.
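
One way to back the "triple-check before every commit" rule with a mechanical check is to scan staged changes for obvious credential patterns. The sketch below is an assumption, not something that exists in this repo: the function name scan_added_lines and the regex are hypothetical, and a clean result is only a hint, never proof that a commit is safe.

```shell
# Hypothetical helper (not part of this repo): read a unified diff on stdin
# and print added lines that look like hardcoded credentials.
# Exit status: 0 if nothing suspicious was found, 1 otherwise.
scan_added_lines() {
  # Keep only added lines ("+..."), skip the "+++ b/file" headers,
  # then look for key = "value" style assignments with secret-ish names.
  if grep -E '^\+[^+]' \
     | grep -E -i '(api[_-]?key|password|passwd|secret|token)[a-z0-9_]*[[:space:]]*[:=][[:space:]]*"[^"]+"'; then
    return 1
  fi
  return 0
}

# Intended use from the repo root, excluding the git-crypt encrypted paths:
#   git diff --cached -- . ':!terraform.tfvars' ':!secrets' | scan_added_lines \
#     || echo "possible secret staged, review before committing"
```

The pattern list is deliberately short; names such as github_pat would not match, so a manual review of the staged diff is still required.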

Execution Environment

  • File operations: Read, Edit, Write, Glob, Grep tools
  • Git commands: git status, git log, git diff, git add, git commit, git reset, etc.
  • Shell commands: All tools (terraform, terragrunt, kubectl, helm, python, etc.) are available locally
  • CRITICAL: Always run terragrunt/terraform locally, never on the remote server via SSH:
    cd stacks/<service> && terragrunt apply --non-interactive
    
  • kubectl: Use kubectl --kubeconfig $(pwd)/config for cluster access
  • GitHub API: Use curl with token from tfvars (see GitHub & Drone CI section below). gh CLI is blocked by sandbox restrictions.
  • Drone CI API: Use curl with token from tfvars (see GitHub & Drone CI section below).

Overview

Terragrunt-based infrastructure repository managing a home Kubernetes cluster on Proxmox VMs, with per-service state isolation. Each service has its own Terragrunt stack under stacks/, enabling fast, independent plan/apply cycles. Uses git-crypt for secrets encryption.

Static File Paths (NEVER CHANGE)

  • Main config: terraform.tfvars - All secrets, DNS, Cloudflare config, WireGuard peers
  • Root Terragrunt: terragrunt.hcl - Root Terragrunt config (providers, backend, var loading)
  • Service stacks: stacks/<service>/ - Individual service stacks (each has terragrunt.hcl + main.tf with resources inline)
  • Infra stack: stacks/infra/ - Proxmox VM resources (templates, docker-registry, VMs)
  • Platform stack: stacks/platform/ - Core infrastructure services (22 modules in modules/ subdir)
  • Per-stack state: state/stacks/<service>/terraform.tfstate - Per-stack state files (gitignored)
  • Service resources: stacks/<service>/main.tf - Service resources defined directly in stack root
  • Platform modules: stacks/platform/modules/<service>/ - Platform service modules
  • Shared modules: modules/kubernetes/ingress_factory/, modules/kubernetes/setup_tls_secret/
  • Secrets: secrets/ - git-crypt encrypted TLS certs and keys

Network Topology (Static IPs)

┌─────────────────────────────────────────────────────────────────┐
│ 10.0.10.0/24 - Management Network                               │
├─────────────────────────────────────────────────────────────────┤
│ 10.0.10.10  - Wizard (main server)                               │
│ 10.0.10.15  - NFS Server (TrueNAS) - /mnt/main/*                │
└─────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────┐
│ 10.0.20.0/24 - Kubernetes Network                               │
├─────────────────────────────────────────────────────────────────┤
│ 10.0.20.1   - pfSense Gateway                                   │
│ 10.0.20.10  - Docker Registry VM (MAC: DE:AD:BE:EF:22:22)       │
│ 10.0.20.100 - k8s-master                                        │
│ 10.0.20.101 - Technitium DNS                                    │
│ 10.0.20.102 - MetalLB IP Pool Start                             │
│ 10.0.20.200 - MetalLB IP Pool End                               │
└─────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────┐
│ 192.168.1.0/24 - Physical Network                               │
├─────────────────────────────────────────────────────────────────┤
│ 192.168.1.127 - Proxmox Hypervisor                              │
└─────────────────────────────────────────────────────────────────┘

Domains

  • Public: viktorbarzin.me (Cloudflare-managed)
  • Internal: viktorbarzin.lan (Technitium DNS)

Directory Structure

  • terragrunt.hcl - Root Terragrunt configuration (providers, backend, variable loading)
  • stacks/ - Individual Terragrunt stacks (one per service)
  • stacks/infra/ - Proxmox VM resources (templates, docker-registry)
  • stacks/platform/ - Core infrastructure (22 services in stacks/platform/modules/)
  • stacks/<service>/ - Individual service stacks (resources directly in main.tf)
  • stacks/platform/modules/<service>/ - Platform service module source code
  • modules/kubernetes/ - Only shared utility modules: ingress_factory/, setup_tls_secret/
  • modules/create-vm/ - Proxmox VM creation module
  • state/ - Per-stack Terraform state files (gitignored)
  • secrets/ - Encrypted secrets (TLS certs, keys) via git-crypt
  • cli/ - Go CLI tool for infrastructure management
  • scripts/ - Helper scripts (cluster management, node updates)
  • playbooks/ - Ansible playbooks for node configuration
  • diagram/ - Infrastructure diagrams (Python-based)

Key Patterns

  • Each service stack defines its own namespace, deployments, services, and ingress in stacks/<service>/main.tf (platform services live in stacks/platform/modules/<service>/)
  • NFS storage from 10.0.10.15 for persistent data
  • TLS secrets managed via setup_tls_secret module
  • Ingress uses Traefik (Helm chart, 3 replicas) with HTTP/3 (QUIC) enabled, Middleware CRDs for rate limiting, auth, CSP headers, CrowdSec bouncer, and analytics injection
  • HTTP/3 enabled on Traefik (http3.enabled=true, advertisedPort=443 on websecure entrypoint) and Cloudflare (cloudflare_zone_settings_override with http3="on")
  • GPU workloads use node_selector = { "gpu": "true" }
  • Services expose to *.viktorbarzin.me domains

NFS Volume Pattern

Prefer inline NFS volumes over separate PV/PVC resources. Use the nfs {} block directly in pod/deployment/cronjob specs:

volume {
  name = "data"
  nfs {
    server = "10.0.10.15"
    path   = "/mnt/main/<service>"
  }
}

Only use PV/PVC when the Helm chart requires existingClaim (like the Nextcloud Helm chart).

Adding NFS Exports

To add a new NFS exported directory:

  1. Edit secrets/nfs_directories.txt - add the new directory path, keep the list sorted
  2. Run secrets/nfs_exports.sh from the secrets/ directory to update the NFS share via TrueNAS API
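
Step 1 above (add the path, keep the list sorted) can be sketched as a small helper. This is a minimal sketch under assumptions: the function name add_nfs_directory is hypothetical, the list file is assumed to hold one path per line and to end with a newline, and the sort order the repo actually uses may differ from the byte-order sort chosen here.

```shell
# Hypothetical helper: add one export path to a newline-separated list file
# and keep the file sorted and free of duplicates.
#   $1 = path to the list file (secrets/nfs_directories.txt in this repo)
#   $2 = directory to export, e.g. /mnt/main/<service>
add_nfs_directory() {
  list_file=$1
  new_dir=$2
  # Assumes the file already ends with a newline; otherwise the new entry
  # would be glued onto the last existing line.
  printf '%s\n' "$new_dir" >> "$list_file"
  # LC_ALL=C gives a stable byte-order sort regardless of the caller's locale.
  LC_ALL=C sort -u -o "$list_file" "$list_file"
}
```

After the list is updated, step 2 still applies: run nfs_exports.sh from the secrets/ directory to push the change to TrueNAS.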

Factory Pattern (for multi-user services)

Used when a service needs one instance per user. Structure:

stacks/<service>/
├── main.tf           # Namespace, TLS secret, user module calls
└── factory/
    └── main.tf       # Deployment, service, ingress templates with ${var.name}

Examples: actualbudget, freedify

To add a new user:

  1. Export NFS share at /mnt/main/<service>/<username> in TrueNAS
  2. Add Cloudflare route in tfvars
  3. Add module block in main.tf calling factory

Init Container Pattern (for database migrations)

Use when a service needs to run database migrations before starting:

init_container {
  name    = "migration"
  image   = "service-image:tag"
  command = ["sh", "-c", "migration-command"]

  dynamic "env" {
    for_each = local.common_env
    content {
      name  = env.value.name
      value = env.value.value
    }
  }
}

Example: AFFiNE runs node ./scripts/self-host-predeploy.js in init container.

SMTP/Email Configuration

When configuring services to use the mailserver:

  • Use public hostname: mail.viktorbarzin.me (for TLS cert validation)
  • Do NOT use: mailserver.mailserver.svc.cluster.local (TLS cert mismatch)
  • Port: 587 (STARTTLS)
  • Credentials: Use existing accounts from mailserver_accounts in tfvars
  • Common email: info@viktorbarzin.me for service notifications

Terragrunt Architecture

  • Root terragrunt.hcl provides DRY provider, backend, and variable loading for all stacks
  • Each stack contains its resources directly: stacks/<service>/main.tf has variable declarations, locals, and all Terraform resources inline
  • Platform modules live at stacks/platform/modules/<service>/, referenced as source = "./modules/<service>"
  • Shared utility modules (ingress_factory, setup_tls_secret) remain at modules/kubernetes/ and are referenced with relative paths from each module
  • State isolation: each stack has its own state file at state/stacks/<service>/terraform.tfstate
  • Dependencies: service stacks depend on platform stack via dependency block in their terragrunt.hcl
  • Variables loaded from terraform.tfvars automatically (unused vars silently ignored via extra_arguments)
  • secrets/ symlinks in each stack for TLS cert resolution (path.root workaround)
  • Terragrunt v0.99+: use --non-interactive (not --terragrunt-non-interactive)
  • run-all syntax: terragrunt run --all -- <command> (not terragrunt run-all)
  • The platform stack bundles ~22 core services that have cross-dependencies (traefik, monitoring, authentik, etc.)
  • Individual service stacks are for services that can be deployed independently

Adding a New Service

When adding a new service to the cluster:

  1. Create stacks/<service>/ directory with:
    • terragrunt.hcl - Include root config, declare platform dependency
    • main.tf - All resources defined directly (variables, locals, namespace, deployments, services, ingress)
    • secrets - Symlink to ../../secrets (for TLS cert path resolution)
  2. Add Cloudflare DNS record in terraform.tfvars (cloudflare_proxied_names or cloudflare_non_proxied_names)
  3. Apply the platform stack (which includes cloudflared) to create the DNS record: cd stacks/platform && terragrunt apply --non-interactive
  4. Apply the new service: cd stacks/<service> && terragrunt apply --non-interactive
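
Step 1 above can be sketched as a scaffold function. Treat it as a starting point under stated assumptions: scaffold_stack is a hypothetical name, and the contents written to terragrunt.hcl (the include label, the dependency on ../platform with skip_outputs) are a guess at a typical layout, so compare against an existing stack before relying on it.

```shell
# Hypothetical scaffold (not part of this repo): create the three entries
# a new stack needs under stacks/<service>/.
#   $1 = stacks directory, $2 = service name
scaffold_stack() {
  stack_dir="$1/$2"
  # Refuse to touch an existing stack so nothing is overwritten.
  if [ -e "$stack_dir" ]; then
    echo "stack already exists: $stack_dir" >&2
    return 1
  fi
  mkdir -p "$stack_dir" || return 1

  # terragrunt.hcl: include the root config and order this stack after platform.
  # The block contents are an assumption; copy an existing stack's file if it differs.
  cat > "$stack_dir/terragrunt.hcl" <<'EOF'
include "root" {
  path = find_in_parent_folders("terragrunt.hcl")
}

dependency "platform" {
  config_path  = "../platform"
  skip_outputs = true
}
EOF

  # main.tf starts empty; variables, locals and resources are added inline.
  : > "$stack_dir/main.tf"

  # Relative symlink so TLS cert paths resolve from the stack root.
  ln -s ../../secrets "$stack_dir/secrets"
}
```

For example, scaffold_stack stacks myservice would create stacks/myservice/ with the three entries; steps 2 to 4 are unchanged.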

Common Variables

  • tls_secret_name - TLS certificate secret name
  • tier - Deployment tier label
  • Service-specific passwords passed as variables

Service Versions (as of 2026-02)

  • Immich: v2.4.1
  • Freedify: latest (music streaming, factory pattern)
  • AFFiNE: stable (visual canvas, uses PostgreSQL + Redis)
  • Wyoming Whisper: latest (STT for Home Assistant, CPU on GPU node)
  • Health: latest (Apple Health data dashboard, Svelte + FastAPI + Caddy, uses PostgreSQL)
  • Gramps Web: latest (genealogy, uses Redis + Celery)
  • Loki: 3.6.5 (log aggregation, single binary, 6Gi RAM, 24h in-memory chunks)
  • Alloy: v1.13.0 (log collector DaemonSet, forwards to Loki)
  • OpenClaw: 2026.2.9 (AI agent gateway, authentik-protected)

Useful Commands

# Cluster health check — ALWAYS use this to check cluster status
bash scripts/cluster_healthcheck.sh            # Full color report
bash scripts/cluster_healthcheck.sh --quiet    # Only WARN/FAIL
bash scripts/cluster_healthcheck.sh --json     # Machine-readable
bash scripts/cluster_healthcheck.sh --fix      # Auto-delete evicted pods

# Apply a single service stack
cd stacks/<service> && terragrunt apply --non-interactive

# Plan a single service stack
cd stacks/<service> && terragrunt plan --non-interactive

# Plan all stacks (full DAG)
cd stacks && terragrunt run --all --non-interactive -- plan

# Apply all stacks (full DAG)
cd stacks && terragrunt run --all --non-interactive -- apply

# Format all terraform files
terraform fmt -recursive

kubectl get pods -A

Cluster Health Check (scripts/cluster_healthcheck.sh):

  • ALWAYS use this script to check cluster health — whether the user asks explicitly, after deploying/updating services, or whenever you need to verify cluster state. Never use ad-hoc kubectl commands to assess overall cluster health; use the script instead.
  • Runs 24 checks: nodes, resources, conditions, pods, evicted, DaemonSets, deployments, PVCs, HPAs, CronJobs, CrowdSec, ingress, Prometheus alerts, Uptime Kuma, ResourceQuota pressure, StatefulSets, node disk, Helm releases, Kyverno, NFS, DNS, TLS certs, GPU, Cloudflare tunnel
  • When adding new healthchecks or monitoring: Always update this script to validate the new component

Terragrunt apply examples:

  • cd stacks/monitoring && terragrunt apply --non-interactive - Apply monitoring
  • cd stacks/immich && terragrunt apply --non-interactive - Apply immich
  • cd stacks/infra && terragrunt apply --non-interactive - Apply Proxmox VMs / docker registry
  • cd stacks/platform && terragrunt apply --non-interactive - Apply all core/platform services

IMPORTANT: When deploying a new service, you must ALSO apply the platform stack (which includes cloudflared) to create the Cloudflare DNS record:

cd stacks/platform && terragrunt apply --non-interactive

Adding a name to cloudflare_non_proxied_names or cloudflare_proxied_names in terraform.tfvars only defines the record — it won't be created until the platform stack (which contains cloudflared) is applied.

Stack Structure

Terragrunt stacks under stacks/:

  • stacks/infra/ - Proxmox VMs, templates, docker-registry
  • stacks/platform/ - Core infrastructure (~22 services in modules/ subdir)
  • stacks/<service>/ - Individual service stacks (resources directly in main.tf)

Each stack's terragrunt.hcl includes the root terragrunt.hcl which provides:

  • Kubernetes + Helm providers (configured from terraform.tfvars)
  • Local backend with per-stack state file (state/stacks/<service>/terraform.tfstate)
  • Automatic loading of terraform.tfvars with unused vars ignored

Complete Service Catalog

Critical - Network & Auth (Tier: core)

| Service | Description | Stack |
|---|---|---|
| wireguard | VPN server | platform |
| technitium | DNS server (10.0.20.101) | platform |
| headscale | Tailscale control server | platform |
| traefik | Ingress controller (Helm) | platform |
| xray | Proxy/tunnel | platform |
| authentik | Identity provider (SSO) | platform |
| cloudflared | Cloudflare tunnel | platform |
| authelia | Auth middleware | platform |
| monitoring | Prometheus/Grafana/Loki stack | platform |

Storage & Security (Tier: cluster)

| Service | Description | Stack |
|---|---|---|
| vaultwarden | Bitwarden-compatible password manager | platform |
| redis | Shared Redis at redis.redis.svc.cluster.local | platform |
| immich | Photo management (GPU) | immich |
| nvidia | GPU device plugin | platform |
| metrics-server | K8s metrics | platform |
| uptime-kuma | Status monitoring | platform |
| crowdsec | Security/WAF | platform |
| kyverno | Policy engine | platform |

Admin

| Service | Description | Stack |
|---|---|---|
| k8s-dashboard | Kubernetes dashboard | platform |
| reverse-proxy | Generic reverse proxy | platform |

Active Use

| Service | Description | Stack |
|---|---|---|
| mailserver | Email (docker-mailserver) | mailserver |
| shadowsocks | Proxy | shadowsocks |
| webhook_handler | Webhook processing | webhook_handler |
| tuya-bridge | Smart home bridge | tuya-bridge |
| dawarich | Location history | dawarich |
| owntracks | Location tracking | owntracks |
| nextcloud | File sync/share | nextcloud |
| calibre | E-book management | calibre |
| onlyoffice | Document editing | onlyoffice |
| f1-stream | F1 streaming | f1-stream |
| rybbit | Analytics | rybbit |
| isponsorblocktv | SponsorBlock for TV | isponsorblocktv |
| actualbudget | Budgeting (factory pattern) | actualbudget |

Optional

| Service | Description | Stack |
|---|---|---|
| blog | Personal blog | blog |
| descheduler | Pod descheduler | descheduler |
| drone | CI/CD | drone |
| hackmd | Collaborative markdown | hackmd |
| kms | Key management | kms |
| privatebin | Encrypted pastebin | privatebin |
| vault | HashiCorp Vault | vault |
| reloader | ConfigMap/Secret reloader | reloader |
| city-guesser | Game | city-guesser |
| echo | Echo server | echo |
| url | URL shortener | url |
| excalidraw | Whiteboard | excalidraw |
| travel_blog | Travel blog | travel_blog |
| dashy | Dashboard | dashy |
| send | Firefox Send | send |
| ytdlp | YouTube downloader | ytdlp |
| wealthfolio | Finance tracking | wealthfolio |
| audiobookshelf | Audiobook server | audiobookshelf |
| paperless-ngx | Document management | paperless-ngx |
| jsoncrack | JSON visualizer | jsoncrack |
| servarr | Media automation (Sonarr/Radarr/etc) | servarr |
| ntfy | Push notifications | ntfy |
| cyberchef | Data transformation | cyberchef |
| diun | Docker image update notifier | diun |
| meshcentral | Remote management | meshcentral |
| homepage | Dashboard/startpage | homepage |
| matrix | Matrix chat server | matrix |
| linkwarden | Bookmark manager | linkwarden |
| changedetection | Web change detection | changedetection |
| tandoor | Recipe manager | tandoor |
| n8n | Workflow automation | n8n |
| real-estate-crawler | Property crawler | real-estate-crawler |
| tor-proxy | Tor proxy | tor-proxy |
| forgejo | Git forge | forgejo |
| freshrss | RSS reader | freshrss |
| navidrome | Music streaming | navidrome |
| networking-toolbox | Network tools | networking-toolbox |
| stirling-pdf | PDF tools | stirling-pdf |
| speedtest | Speed testing | speedtest |
| freedify | Music streaming (factory pattern) | freedify |
| netbox | Network documentation | netbox |
| infra-maintenance | Maintenance jobs | infra-maintenance |
| ollama | LLM server (GPU) | ollama |
| frigate | NVR/camera (GPU) | frigate |
| ebook2audiobook | E-book to audio (GPU) | ebook2audiobook |
| affine | Visual canvas/whiteboard (PostgreSQL + Redis) | affine |
| health | Apple Health data dashboard (PostgreSQL) | health |
| whisper | Wyoming Faster Whisper STT (CPU on GPU node) | whisper |
| grampsweb | Genealogy web app (Gramps Web) | grampsweb |
| openclaw | AI agent gateway (OpenClaw) | openclaw |

Cloudflare Domains

Proxied (CDN + WAF enabled)

blog, hackmd, privatebin, url, echo, f1tv, excalidraw, send,
audiobookshelf, jsoncrack, ntfy, cyberchef, homepage, linkwarden,
changedetection, tandoor, n8n, stirling-pdf, dashy, city-guesser,
travel, netbox

Non-Proxied (Direct DNS)

mail, wg, headscale, immich, calibre, vaultwarden, drone,
mailserver-antispam, mailserver-admin, webhook, uptime,
owntracks, dawarich, tuya, meshcentral, nextcloud, actualbudget,
onlyoffice, forgejo, freshrss, navidrome, ollama, openwebui,
isponsorblocktv, speedtest, freedify, rybbit, paperless,
servarr, prowlarr, bazarr, radarr, sonarr, flaresolverr,
jellyfin, jellyseerr, tdarr, affine, health, family, openclaw

Special Subdomains

  • *.viktor.actualbudget - Actualbudget factory instances
  • *.freedify - Freedify factory instances
  • mailserver.* - Mail server components (antispam, admin)

CI/CD

  • Drone CI (.drone.yml) for automated deployments
  • Auto-updates TLS certificates
  • ALWAYS add [ci skip] to commit messages when you've already run terraform apply to avoid triggering CI redundantly
  • After committing, sync with git push origin master, but only once the user has explicitly confirmed the push (see Git Operations)
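
The [ci skip] rule above is easy to forget, so a tiny wrapper can enforce it. This is a sketch only: ci_skip_message is a hypothetical helper name, and the tag position (a prefix) is taken from the style of existing commit messages in this repo.

```shell
# Hypothetical helper: print a commit message that is guaranteed to start
# with "[ci skip]", without doubling the tag if it is already there.
ci_skip_message() {
  case $1 in
    "[ci skip]"*) printf '%s\n' "$1" ;;
    *)            printf '[ci skip] %s\n' "$1" ;;
  esac
}

# Intended use, committing only the files that were changed:
#   git commit -m "$(ci_skip_message "Bump immich")" -- stacks/immich/main.tf
```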

GitHub & Drone CI

GitHub API Access

  • Username: ViktorBarzin
  • Token location: terraform.tfvars as github_pat (git-crypt encrypted)
  • Read token: grep github_pat terraform.tfvars | cut -d'"' -f2
  • Scopes: Full access — repo, admin:public_key, admin:repo_hook, delete_repo, admin:org, workflow, write:packages, and more
  • gh CLI: Blocked by sandbox restrictions — use curl with the GitHub API instead

Common API Patterns

# Read token from tfvars
GITHUB_TOKEN=$(grep github_pat terraform.tfvars | cut -d'"' -f2)

# List repos
curl -s -H "Authorization: token $GITHUB_TOKEN" "https://api.github.com/users/ViktorBarzin/repos?per_page=100"

# Create repo
curl -s -X POST -H "Authorization: token $GITHUB_TOKEN" "https://api.github.com/user/repos" \
  -d '{"name":"repo-name","private":true}'

# Add deploy key
curl -s -X POST -H "Authorization: token $GITHUB_TOKEN" "https://api.github.com/repos/ViktorBarzin/<repo>/keys" \
  -d '{"title":"key-name","key":"ssh-ed25519 ...","read_only":false}'

# Create webhook (e.g., for Drone CI)
curl -s -X POST -H "Authorization: token $GITHUB_TOKEN" "https://api.github.com/repos/ViktorBarzin/<repo>/hooks" \
  -d '{"config":{"url":"https://drone.viktorbarzin.me/hook","content_type":"json","secret":"..."},"events":["push","pull_request"]}'

# Get repo info
curl -s -H "Authorization: token $GITHUB_TOKEN" "https://api.github.com/repos/ViktorBarzin/<repo>"
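
The grep | cut pattern used above returns every line that contains the key, which can yield more than one value if another variable name includes the same substring. A stricter reader could look like the sketch below. The function name read_tfvar is hypothetical, and it only handles simple one-line key = "value" assignments without escaped quotes.

```shell
# Hypothetical helper: print the value of a top-level string variable
# (key = "value") from a tfvars file. Matching the key at the start of the
# line avoids picking up longer names that merely contain it.
#   $1 = variable name, $2 = tfvars file (defaults to terraform.tfvars)
read_tfvar() {
  sed -n "s/^$1[[:space:]]*=[[:space:]]*\"\([^\"]*\)\".*/\1/p" "${2:-terraform.tfvars}" | head -n 1
}

# Intended use:
#   GITHUB_TOKEN=$(read_tfvar github_pat)
#   DRONE_TOKEN=$(read_tfvar drone_api_token)
```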

Drone CI API Access

  • Server: https://drone.viktorbarzin.me
  • Token location: terraform.tfvars as drone_api_token (git-crypt encrypted)
  • Read token: grep drone_api_token terraform.tfvars | cut -d'"' -f2
  • Username: ViktorBarzin

Common API Patterns

# Read token from tfvars
DRONE_TOKEN=$(grep drone_api_token terraform.tfvars | cut -d'"' -f2)

# List repos
curl -s -H "Authorization: Bearer $DRONE_TOKEN" "https://drone.viktorbarzin.me/api/repos"

# Activate repo in Drone
curl -s -X POST -H "Authorization: Bearer $DRONE_TOKEN" "https://drone.viktorbarzin.me/api/repos/ViktorBarzin/<repo>"

# Trigger build
curl -s -X POST -H "Authorization: Bearer $DRONE_TOKEN" "https://drone.viktorbarzin.me/api/repos/ViktorBarzin/<repo>/builds"

# Get build info
curl -s -H "Authorization: Bearer $DRONE_TOKEN" "https://drone.viktorbarzin.me/api/repos/ViktorBarzin/<repo>/builds/<build-number>"

# Add secret to repo
curl -s -X POST -H "Authorization: Bearer $DRONE_TOKEN" "https://drone.viktorbarzin.me/api/repos/ViktorBarzin/<repo>/secrets" \
  -d '{"name":"secret_name","data":"secret_value"}'

Capabilities

With these tokens, Claude can:

  • GitHub: Create/delete repos, push code, manage SSH/deploy keys, manage webhooks, manage org settings, manage packages
  • Drone CI: Activate repos, trigger/monitor builds, manage secrets, configure pipelines

Infrastructure

  • Proxmox hypervisor for VMs (192.168.1.127)
  • Kubernetes cluster with GPU node (5 nodes: k8s-master + k8s-node1-4, running v1.34.2)
  • NFS server at 10.0.10.15 for storage
  • Redis shared service at redis.redis.svc.cluster.local
  • Docker registry pull-through cache at 10.0.20.10 (static IP via cloud-init)
    • Port 5000: docker.io (Docker Hub, with auth)
    • Port 5010: ghcr.io
    • Port 5020: quay.io
    • Port 5030: registry.k8s.io
    • Port 5040: reg.kyverno.io
    • Worker nodes use config_path = "/etc/containerd/certs.d" with per-registry hosts.toml files
    • k8s-master does NOT use pull-through cache (containerd 1.6.x incompatibility with config_path + mirrors)
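
For reference, a per-registry hosts.toml under /etc/containerd/certs.d/<registry>/ generally has the shape printed by the sketch below. The generator itself (render_hosts_toml) is hypothetical, and the plain-http scheme for the cache is an assumption; the real files come from the Terraform templates, which remain the source of truth.

```shell
# Hypothetical generator: print a containerd hosts.toml that sends pulls
# for one upstream registry through the pull-through cache first.
#   $1 = registry name as used in image references (e.g. ghcr.io)
#   $2 = cache port on 10.0.20.10 (e.g. 5010)
render_hosts_toml() {
  registry=$1
  port=$2
  # Docker Hub's registry API lives on a different host than its image prefix.
  case $registry in
    docker.io) upstream="https://registry-1.docker.io" ;;
    *)         upstream="https://$registry" ;;
  esac
  # The http:// scheme for the cache is an assumption; adjust to match the templates.
  cat <<EOF
server = "$upstream"

[host."http://10.0.20.10:$port"]
  capabilities = ["pull", "resolve"]
EOF
}
```

The function only prints to stdout, which makes it suitable for comparing against what a worker node actually has, not for writing node config by hand.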

Proxmox Host Hardware

  • CPU: Intel Xeon E5-2699 v4 @ 2.20GHz (22 cores / 44 threads, single socket)
  • RAM: 142 GB (Dell R730 server)
  • GPU: NVIDIA Tesla T4 (PCIe passthrough to k8s-node1)
  • Disks: 1.1TB + 931GB + 10.7TB (local storage)
  • Proxmox access: ssh root@192.168.1.127

Proxmox Network Bridges

  • vmbr0: Physical bridge on eno1, IP 192.168.1.127/24 — connects to physical/home network (192.168.1.0/24)
  • vmbr1: Internal-only bridge (no physical port), VLAN-aware — carries VLAN 10 (management 10.0.10.0/24) and VLAN 20 (kubernetes 10.0.20.0/24)

Proxmox VM Inventory

| VMID | Name | Status | CPUs | RAM | Network | Disk | Notes |
|---|---|---|---|---|---|---|---|
| 101 | pfsense | running | 8 | 16GB | vmbr0, vmbr1:vlan10, vmbr1:vlan20 | 32G | Gateway/firewall, routes between all networks |
| 102 | devvm | running | 16 | 8GB | vmbr1:vlan10 | 100G | Development VM on management network |
| 103 | home-assistant | running | 8 | 16GB | vmbr1:vlan10(down), vmbr0 | 32G | Home Assistant, net0 link disabled, uses vmbr0 |
| 105 | pbs | stopped | 16 | 8GB | vmbr1:vlan10 | 32G | Proxmox Backup Server (not in use) |
| 200 | k8s-master | running | 8 | 16GB | vmbr1:vlan20 | 64G | Kubernetes control plane (10.0.20.100) |
| 201 | k8s-node1 | running | 16 | 24GB | vmbr1:vlan20 | 128G | GPU node, Tesla T4 passthrough (hostpci0) |
| 202 | k8s-node2 | running | 8 | 16GB | vmbr1:vlan20 | 64G | K8s worker node |
| 203 | k8s-node3 | running | 8 | 16GB | vmbr1:vlan20 | 64G | K8s worker node |
| 204 | k8s-node4 | running | 8 | 16GB | vmbr1:vlan20 | 64G | K8s worker node |
| 220 | docker-registry | running | 4 | 4GB | vmbr1:vlan20 | 64G | Terraform-managed, MAC DE:AD:BE:EF:22:22 (10.0.20.10) |
| 300 | Windows10 | running | 16 | 8GB | vmbr0 | 100G | Windows VM on physical network |
| 9000 | truenas | running | 16 | 16GB | vmbr1:vlan10 | 32G+7×256G+1T | NFS server (10.0.10.15), multiple data disks |

VM Templates (stopped, used for cloning)

| VMID | Name | Purpose |
|---|---|---|
| 1000 | ubuntu-2404-cloudinit-non-k8s-template | Base template for non-K8s VMs |
| 1001 | docker-registry-template | Template for docker registry VM |
| 2000 | ubuntu-2404-cloudinit-k8s-template | Base template for K8s nodes |

Network Connectivity Summary

  • pfSense (101) bridges all three networks: physical (vmbr0), management VLAN 10, and kubernetes VLAN 20
  • K8s cluster (200-204) + docker-registry (220) are all on VLAN 20 (kubernetes network)
  • TrueNAS (9000) + devvm (102) + PBS (105) are on VLAN 10 (management network)
  • Home Assistant (103) is on physical network (vmbr0), with a disabled VLAN 10 interface
  • Windows10 (300) is on physical network (vmbr0) only

GPU Node (k8s-node1)

  • VMID: 201
  • PCIe Passthrough: 0000:06:00.0 (NVIDIA Tesla T4)
  • Taint: nvidia.com/gpu=true:NoSchedule - Only GPU workloads can run here
  • Label: gpu=true
  • GPU workloads must have both:
    • node_selector = { "gpu": "true" }
    • toleration { key = "nvidia.com/gpu", operator = "Equal", value = "true", effect = "NoSchedule" }
  • Taint is applied via null_resource.gpu_node_taint in modules/kubernetes/nvidia/main.tf

Git Operations (IMPORTANT)

  • Git is slow on this repo due to many files - commands can take 30+ seconds
  • Use GIT_OPTIONAL_LOCKS=0 prefix if git hangs
  • Always commit only specific files you changed, not everything
  • ALWAYS ask user before pushing to remote - never push without explicit confirmation

Prometheus Alerts

  • Alert rules are in modules/kubernetes/monitoring/prometheus_chart_values.tpl
  • Under serverFiles.alerting_rules.yml.groups
  • Groups: "R730 Host", "Nvidia Tesla T4 GPU", "Power", "Cluster"
  • kube-state-metrics provides: kube_deployment_*, kube_statefulset_*, kube_daemonset_*

Tier System

  • 0-core: Critical infrastructure (ingress, DNS, VPN, auth)
  • 1-cluster: Cluster services (Redis, metrics, security)
  • 2-gpu: GPU workloads (Immich, Ollama, Frigate)
  • 3-edge: User-facing services
  • 4-aux: Optional/auxiliary services

Resource Governance (Kyverno-based)

Four layers of noisy-neighbor protection, all defined in modules/kubernetes/kyverno/resource-governance.tf:

  1. PriorityClasses: tier-0-core (1M) through tier-4-aux (200K). tier-4-aux uses preemption_policy=Never.
  2. LimitRange defaults (Kyverno generate): Auto-creates tier-defaults LimitRange in namespaces based on tier label. Only affects containers without explicit resources.
  3. ResourceQuotas (Kyverno generate): Auto-creates tier-quota ResourceQuota in namespaces with tier labels. Excludes namespaces with resource-governance/custom-quota=true label.
  4. Priority injection (Kyverno mutate): Sets priorityClassName on Pods based on namespace tier label.

Custom quota override: Add label resource-governance/custom-quota: "true" to namespace, then define a custom kubernetes_resource_quota in the service's Terraform module. Currently used by: monitoring, crowdsec.

LimitRange defaults by tier:

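| Tier | Default Request (CPU/Mem) | Default Limit (CPU/Mem) | Max (CPU/Mem) |
|---|---|---|---|
| 0-core | 100m/128Mi | 2/4Gi | 8/16Gi |
| 1-cluster | 100m/128Mi | 2/4Gi | 4/8Gi |
| 2-gpu | 100m/256Mi | 4/8Gi | 8/16Gi |
| 3-edge | 50m/128Mi | 1/2Gi | 4/8Gi |
| 4-aux | 25m/64Mi | 500m/1Gi | 2/4Gi |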

ResourceQuota hard limits by tier:

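| Tier | Req CPU | Req Mem | Lim CPU | Lim Mem | Pods |
|---|---|---|---|---|---|
| 0-core | 8 | 8Gi | 32 | 64Gi | 100 |
| 1-cluster | 4 | 4Gi | 16 | 32Gi | 30 |
| 2-gpu | 8 | 8Gi | 48 | 96Gi | 40 |
| 3-edge | 4 | 4Gi | 16 | 32Gi | 30 |
| 4-aux | 2 | 2Gi | 8 | 16Gi | 20 |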
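
When planning a deployment it can help to check the ResourceQuota numbers above from a script. The lookup below is a convenience sketch, not repo code: the function names tier_quota and pods_remaining are hypothetical, and the values are copied from the table, so they go stale if the Kyverno policy changes.

```shell
# Hypothetical lookup of the ResourceQuota table above. Prints
# "requests.cpu requests.memory limits.cpu limits.memory pods" for a tier label.
tier_quota() {
  case $1 in
    0-core)    echo "8 8Gi 32 64Gi 100" ;;
    1-cluster) echo "4 4Gi 16 32Gi 30" ;;
    2-gpu)     echo "8 8Gi 48 96Gi 40" ;;
    3-edge)    echo "4 4Gi 16 32Gi 30" ;;
    4-aux)     echo "2 2Gi 8 16Gi 20" ;;
    *)         echo "unknown tier: $1" >&2; return 1 ;;
  esac
}

# Pods still allowed in a namespace of the given tier.
#   $1 = tier label, $2 = number of pods currently in the namespace
pods_remaining() {
  quota=$(tier_quota "$1") || return 1
  max_pods=${quota##* }          # last field of the quota line is the pod cap
  echo $((max_pods - $2))
}
```

For example, a 4-aux namespace already running 18 pods has room for 2 more before a custom quota (resource-governance/custom-quota: "true") is needed.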

User Preferences

Calendar

  • Default calendar: Nextcloud (always use unless otherwise specified)
  • Nextcloud URL: https://nextcloud.viktorbarzin.me
  • CalDAV endpoint: https://nextcloud.viktorbarzin.me/remote.php/dav/calendars/<username>/<calendar-name>/

Home Assistant

  • Default smart home: Home Assistant (always use for smart home control)
  • Two deployments:
    • ha-london (default): https://ha-london.viktorbarzin.me | Script: .claude/home-assistant.py | SSH: ssh pi@192.168.8.103, config at /home/pi/docker/homeAssistant/
    • ha-sofia: https://ha-sofia.viktorbarzin.me | Script: .claude/home-assistant-sofia.py | SSH: ssh vbarzin@192.168.1.8, config at /config/
  • Aliases: "ha" or "HA" = ha-london. "ha sofia" or "ha-sofia" = ha-sofia.

Development

  • Frontend framework: Svelte (user is learning it, so use Svelte for all new web apps)

Pod Monitoring After Updates

  • Never use sleep to wait for pods — instead, spawn a background subagent (Task tool with run_in_background: true) that continuously checks pod state (e.g., kubectl get pods -n <namespace> -w) and reports back when the pod is ready or if errors occur. This catches CrashLoopBackOff, ImagePullBackOff, and other failures much sooner than periodic sleep-based polling.
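
The watching subagent needs a way to turn kubectl get pods output into a clear pass/fail signal. A minimal sketch of such a filter follows. The function name report_failing_pods and the list of failure states are assumptions, and it expects the default column layout (NAME READY STATUS RESTARTS AGE), so it does not apply to -A output, where a NAMESPACE column comes first.

```shell
# Hypothetical filter: read `kubectl get pods` output (with or without -w)
# on stdin and print one line per pod that is in a known failure state.
# Exit status: 1 if any failing pod was seen, 0 otherwise.
report_failing_pods() {
  awk '
    # Column 3 is STATUS in the default output layout.
    $3 ~ /^(CrashLoopBackOff|ImagePullBackOff|ErrImagePull|Error|CreateContainerConfigError)$/ {
      print "FAIL " $1 " " $3
      failed = 1
    }
    END { exit failed }
  '
}

# Intended use (read-only, so allowed by the kubectl rules in this file):
#   kubectl --kubeconfig $(pwd)/config get pods -n <namespace> | report_failing_pods
```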

Skills & Workflows

Skills are specialized workflows for common tasks. Located in .claude/skills/.

Available Skills

setup-project (.claude/skills/setup-project/SKILL.md)

  • Deploy new self-hosted services from GitHub repos
  • Automated workflow: Docker image → Terraform module → Deploy
  • Handles database setup, ingress, DNS configuration
  • When to use: User provides GitHub URL or wants to deploy a new service
  • Example: "Deploy [GitHub repo] to the cluster"

extend-vm-storage (.claude/skills/extend-vm-storage/SKILL.md)

  • Extend disk storage on K8s node VMs (Proxmox-hosted)
  • Automates: drain → shutdown → resize → boot → expand filesystem → uncordon
  • When to use: A k8s node needs more disk space
  • Example: "Extend storage on k8s-node2 by 64G"

Service-Specific Notes

Authentik (Identity Provider)

  • Helm Chart: authentik v2025.10.3 from https://charts.goauthentik.io/
  • URL: https://authentik.viktorbarzin.me
  • API: https://authentik.viktorbarzin.me/api/v3/
  • API Token: Stored in terraform.tfvars as authentik_api_token (non-expiring, superuser, identifier: claude-code-permanent). Read with: grep authentik_api_token terraform.tfvars | cut -d'"' -f2
  • Namespace: authentik (tier: cluster)
  • Architecture: 3 server replicas + 3 worker replicas + 3 PgBouncer replicas + 1 embedded outpost
  • Database: PostgreSQL via postgresql.dbaas:5432, pooled through PgBouncer at pgbouncer.authentik:6432
  • Redis: Shared at redis.redis.svc.cluster.local
  • Terraform: modules/kubernetes/authentik/main.tf (Helm), pgbouncer.tf (connection pooling)

Authentik API Management

To call the API, use:

curl -s -H "Authorization: Bearer <TOKEN>" "https://authentik.viktorbarzin.me/api/v3/<endpoint>/"

Key API endpoints:

  • core/users/ — List/create/update/delete users
  • core/groups/ — List/create/update/delete groups
  • core/applications/ — List/create applications
  • providers/all/ — List all providers (OAuth2, Proxy, etc.)
  • providers/oauth2/ — OAuth2/OIDC providers specifically
  • providers/proxy/ — Proxy providers (forward auth)
  • flows/instances/ — List flows
  • stages/all/ — List stages
  • sources/all/ — List sources (Google, GitHub, etc.)
  • outposts/instances/ — List outposts
  • propertymappings/all/ — List property mappings
  • rbac/roles/ — List roles
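
The curl call above can be wrapped in two small helpers. This is only a sketch: the names `ak_token`, `ak_get` and the `AK_TFVARS` override are hypothetical, and it assumes the token line in terraform.tfvars has the form `authentik_api_token = "..."`.

```shell
# Hypothetical wrappers around the curl call shown above.
AK_BASE="https://authentik.viktorbarzin.me/api/v3"

# Read the token from terraform.tfvars (path overridable via AK_TFVARS).
ak_token() {
  grep authentik_api_token "${AK_TFVARS:-terraform.tfvars}" | cut -d'"' -f2
}

# GET an endpoint such as "core/users" or "providers/oauth2/".
ak_get() {
  endpoint="${1#/}"        # drop a leading slash, if any
  endpoint="${endpoint%/}" # drop a trailing slash, if any
  curl -s -H "Authorization: Bearer $(ak_token)" "$AK_BASE/$endpoint/"
}
```

For example, `ak_get core/groups` lists groups and `ak_get outposts/instances` lists outposts.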

Current Applications (9)

| Application | Provider Type | Auth Flow |
|---|---|---|
| Cloudflare Access | OAuth2/OIDC | explicit consent |
| Domain wide catch all | Proxy (forward auth) | implicit consent |
| Grafana | OAuth2/OIDC | implicit consent |
| Headscale | OAuth2/OIDC | explicit consent |
| Immich | OAuth2/OIDC | explicit consent |
| Kubernetes | OAuth2/OIDC (public) | implicit consent |
| linkwarden | OAuth2/OIDC | explicit consent |
| Matrix | OAuth2/OIDC | implicit consent |
| wrongmove | OAuth2/OIDC | implicit consent |

Current Groups (9)

| Group | Parent | Superuser | Purpose |
|---|---|---|---|
| Allow Login Users | | No | Parent group for login-permitted users |
| authentik Admins | | Yes | Full admin access |
| authentik Read-only | | No | Read-only access (has role) |
| Headscale Users | Allow Login Users | No | VPN access |
| Home Server Admins | Allow Login Users | No | Server admin access |
| Wrongmove Users | Allow Login Users | No | Real-estate app access |
| kubernetes-admins | | No | K8s cluster-admin RBAC |
| kubernetes-power-users | | No | K8s power-user RBAC |
| kubernetes-namespace-owners | | No | K8s namespace-owner RBAC |

Current Users (7 real users, plus the default akadmin account)

| Username | Name | Type | Groups |
|---|---|---|---|
| akadmin | authentik Default Admin | internal | authentik Admins, Home Server Admins, Headscale Users |
| vbarzin@gmail.com | Viktor Barzin | internal | authentik Admins, Home Server Admins, Wrongmove Users, Headscale Users |
| emil.barzin@gmail.com | Emil Barzin | internal | Home Server Admins, Headscale Users |
| ancaelena98@gmail.com | Anca Milea | external | Wrongmove Users, Headscale Users |
| vabbit81@gmail.com | GHEORGHE Milea | external | Headscale Users |
| valentinakolevabarzina@gmail.com | Валентина Колева-Барзина | internal | Headscale Users |
| anca.r.cristian10@gmail.com | | internal | Wrongmove Users |
| kadir.tugan@gmail.com | Kadir | internal | Wrongmove Users |

Login Sources (Social Login)

  • Google (OAuth) — user matching by identifier
  • GitHub (OAuth) — user matching by email_link
  • Facebook (OAuth) — user matching by email_link
  • All use the same authentication flow (1a779f24) and enrollment flow (87572804)

Authorization Flows

  • Explicit consent (default-provider-authorization-explicit-consent): Shows consent screen before redirecting — used for Immich, Linkwarden, Headscale, Cloudflare
  • Implicit consent (default-provider-authorization-implicit-consent): Auto-redirects without a consent screen. Used for Grafana, Matrix, Kubernetes, Domain catch-all, Wrongmove

Traefik Integration

  • Forward auth middleware: authentik-forward-auth in Traefik namespace
  • Outpost endpoint: http://ak-outpost-authentik-embedded-outpost.authentik.svc.cluster.local:9000/outpost.goauthentik.io/auth/traefik
  • Services opt in via protected = true in ingress_factory
  • Response headers: X-authentik-username, X-authentik-uid, X-authentik-email, X-authentik-name, X-authentik-groups, Set-Cookie

OIDC for Kubernetes API

  • Issuer: https://authentik.viktorbarzin.me/application/o/kubernetes/
  • Client ID: kubernetes (public client, no secret)
  • Username claim: email, Groups claim: groups
  • Signing key: authentik Self-signed Certificate (must be assigned to the provider or JWKS will be empty)
  • Redirect URIs: Regex mode http://localhost:.* and http://127\.0\.0\.1:.* (kubelogin picks random ports)
  • Configured via: OIDC flags written into the kube-apiserver static pod manifest over SSH (modules/kubernetes/rbac/apiserver-oidc.tf)
  • RBAC module: modules/kubernetes/rbac/main.tf — admin/power-user/namespace-owner roles
  • Self-service portal: modules/kubernetes/k8s-portal/ — SvelteKit app at https://k8s-portal.viktorbarzin.me
  • User definition: k8s_users variable in terraform.tfvars
  • Audit logging: Enabled via modules/kubernetes/rbac/audit-policy.tf, logs at /var/log/kubernetes/audit.log
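
A quick way to confirm that a signing key is assigned is to look at the JWKS document. The sketch below assumes the JWKS is served at `jwks/` under the issuer URL (Authentik's usual per-application layout); the function names are hypothetical.

```shell
# Hypothetical check: succeeds when a JWKS document on stdin lists at least one key.
jwks_has_keys() {
  # Every JWK carries a "kty" member; an empty JWKS is just {}.
  grep -q '"kty"'
}

# Fetch the JWKS for the kubernetes provider (URL assumed from the issuer above).
check_k8s_jwks() {
  curl -s "https://authentik.viktorbarzin.me/application/o/kubernetes/jwks/" | jwks_has_keys
}
```

`check_k8s_jwks && echo "signing key present" || echo "JWKS empty: assign the signing key"` gives a one-line verdict.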

CRITICAL GOTCHAS when setting up Authentik OIDC for Kubernetes:

  1. Signing key MUST be assigned to the OAuth2 provider. Without it, the JWKS endpoint returns {} and kube-apiserver can't validate tokens.
  2. Email mapping must set email_verified: True. The default Authentik email scope mapping hardcodes email_verified: False, which causes kube-apiserver to reject the token with oidc: email not verified. Use a custom scope mapping: return {"email": request.user.email, "email_verified": True}
  3. kubelogin needs --oidc-extra-scope for email, profile, groups. Without these, only openid is requested and the token lacks the email claim, causing oidc: parse username claims "email": claim not present.
  4. Redirect URIs must use regex mode (http://localhost:.*) because kubelogin picks random ports, not just 8000/18000.
  5. Kubelet static pod manifest changes require a full cycle to take effect: remove manifest, stop kubelet, remove containers via crictl, re-add manifest, start kubelet. Simple touch or kubelet restart is not enough.
  6. Property mappings endpoint in Authentik 2025.10.x is propertymappings/provider/scope/ (not the older propertymappings/scope/).
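
Gotchas 3 and 4 translate into a client-side setup along these lines. This is a sketch, not the repository's actual kubeconfig: it assumes kubelogin is installed as the `kubectl oidc-login` plugin, and the user name `oidc` and the function name are placeholders.

```shell
# Sketch: register a kubeconfig user that obtains tokens through kubelogin.
# "$1" is the kubeconfig user name (placeholder default: oidc).
setup_oidc_user() {
  kubectl config set-credentials "${1:-oidc}" \
    --exec-api-version=client.authentication.k8s.io/v1beta1 \
    --exec-command=kubectl \
    --exec-arg=oidc-login \
    --exec-arg=get-token \
    --exec-arg=--oidc-issuer-url=https://authentik.viktorbarzin.me/application/o/kubernetes/ \
    --exec-arg=--oidc-client-id=kubernetes \
    --exec-arg=--oidc-extra-scope=email \
    --exec-arg=--oidc-extra-scope=profile \
    --exec-arg=--oidc-extra-scope=groups
}
```

Without the three extra scopes only `openid` is requested, which leads to the missing `email` claim described in gotcha 3.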

Common Management Tasks

Add a new OAuth2 application:

  1. Create OAuth2 provider: POST /api/v3/providers/oauth2/ with client_id, client_secret, redirect_uris, authorization_flow, etc.
  2. Create application: POST /api/v3/core/applications/ with name, slug, provider pk
  3. (Optional) Bind to group policy for access control
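
The first two steps can be sketched with curl as follows. The helper names and the `AK_TOKEN` variable are hypothetical, the payload builder does no JSON escaping, and the provider body is left to a user-supplied file because its exact fields depend on the Authentik version.

```shell
# Hypothetical helpers for the two POST calls described above.
AK_BASE="https://authentik.viktorbarzin.me/api/v3"

# POST a JSON body ($2) to an endpoint ($1); the token is read from AK_TOKEN.
ak_post() {
  curl -s -X POST \
    -H "Authorization: Bearer $AK_TOKEN" \
    -H "Content-Type: application/json" \
    "$AK_BASE/${1%/}/" -d "$2"
}

# Build the application body from name, slug and provider pk.
# Naive formatting: values must not contain quotes or backslashes.
application_payload() {
  printf '{"name": "%s", "slug": "%s", "provider": %s}' "$1" "$2" "$3"
}
```

A run would look like `ak_post providers/oauth2 "$(cat provider.json)"`, then reading the provider `pk` from the response and calling `ak_post core/applications "$(application_payload "My App" my-app <pk>)"`.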

Add a user to a group:

# Get group pk, then PATCH with updated users list
curl -X PATCH -H "Authorization: Bearer <TOKEN>" -H "Content-Type: application/json" \
  "https://authentik.viktorbarzin.me/api/v3/core/groups/<group-pk>/" \
  -d '{"users": [<existing_user_pks>, <new_user_pk>]}'

Protect a service with forward auth: Set protected = true in the service's ingress_factory call in Terraform.

AFFiNE (Visual Canvas)

  • Image: ghcr.io/toeverything/affine:stable
  • Port: 3010
  • Requires: PostgreSQL + Redis
  • Migration: Init container runs node ./scripts/self-host-predeploy.js
  • Storage: NFS at /mnt/main/affine mounted to /root/.affine/storage and /root/.affine/config
  • Key env vars:
    • AFFINE_SERVER_EXTERNAL_URL - Public URL (e.g., https://affine.viktorbarzin.me)
    • AFFINE_SERVER_HTTPS - Set to true behind TLS ingress
    • DATABASE_URL - PostgreSQL connection string
    • REDIS_SERVER_HOST - Redis hostname
    • MAILER_* - SMTP configuration for email invites
  • Local-first: Data stored in browser by default; syncs to server when user creates account
  • Docs: https://docs.affine.pro/self-host-affine

Wyoming Whisper (STT for Home Assistant)

  • Image: rhasspy/wyoming-whisper:latest
  • Port: 10300/TCP (Wyoming protocol)
  • Model: small-int8 (CPU-optimized, no CUDA variant available from upstream)
  • Runs on: GPU node (node_selector gpu=true + nvidia toleration) but uses CPU only
  • Storage: NFS at /mnt/main/whisper/data (model cache)
  • Exposure: Internal only via Traefik TCP entrypoint whisper-tcp → IngressRouteTCP
  • Access: 10.0.20.202:10300 (Traefik LB IP, no public DNS)
  • HA Integration: Wyoming Protocol integration in ha-london, host 10.0.20.202, port 10300
  • No GPU acceleration: Official image is CPU-only (Debian + PyTorch CPU). The mib1185/wyoming-faster-whisper-cuda image exists but requires self-build.

Gramps Web (Genealogy)

  • Image: ghcr.io/gramps-project/grampsweb:latest
  • Port: 5000
  • URL: https://family.viktorbarzin.me
  • Components: Web app + Celery worker (2 containers in 1 pod)
  • Requires: Shared Redis (DB 2 for Celery broker/backend, DB 3 for rate limiting)
  • Storage: NFS at /mnt/main/grampsweb with sub_paths: users, indexdir, thumbnail_cache, cache, secret, grampsdb, media, tmp
  • Key env vars:
    • GRAMPSWEB_SECRET_KEY - Flask secret key (generated via random_password)
    • GRAMPSWEB_TREE - Tree name
    • GRAMPSWEB_BASE_URL - Public URL
    • GRAMPSWEB_CELERY_CONFIG__broker_url / result_backend - Redis connection
    • GRAMPSWEB_REGISTRATION_DISABLED - Set to True
    • GRAMPSWEB_EMAIL_* - SMTP configuration
    • GRAMPSWEB_LLM_* - Ollama AI integration
  • Celery command: celery -A gramps_webapi.celery worker --loglevel=INFO --concurrency=2
  • Registration: Disabled; first user created via UI setup wizard

Loki + Alloy (Centralized Log Collection)

  • Loki image: grafana/loki:3.6.5 (Helm chart, single binary mode)
  • Alloy image: grafana/alloy:v1.13.0 (Helm chart, DaemonSet)
  • Config files: modules/kubernetes/monitoring/loki.tf, loki.yaml, alloy.yaml
  • Port: 3100/TCP (Loki API)
  • Storage: NFS PV at /mnt/main/loki/loki (15Gi), WAL on tmpfs (2Gi in-memory)
  • Memory: Loki 6Gi limit, Alloy 128Mi per pod (4 worker nodes)
  • Disk-friendly tuning: max_chunk_age: 24h, chunk_idle_period: 12h — holds chunks in memory, flushes ~once/day
  • Retention: 7 days (retention_period: 168h), compactor enforces deletion
  • Crash policy: WAL on tmpfs — up to 24h log loss on crash (alerts still fire in real-time)
  • Ruler: Evaluates LogQL alert rules, fires to http://prometheus-alertmanager.monitoring.svc.cluster.local:9093
  • Alert rules: HighErrorRate, PodCrashLoopBackOff, OOMKilled (ConfigMap loki-alert-rules)
  • Grafana: Datasource UID P8E80F9AEF21F6940, dashboard "Loki Kubernetes Logs" (stored in MySQL, not file-provisioned)
  • Sysctl DaemonSet: sysctl-inotify sets fs.inotify.max_user_watches=1048576 on all nodes (required for Alloy fsnotify)
  • Disabled components: gateway, chunksCache, resultsCache (not needed for single binary)
  • Key paths: Compactor at /var/loki/compactor, ruler scratch at /var/loki/scratch (must be under /var/loki — root FS is read-only)
  • Querying: Grafana Explore with LogQL, e.g. {namespace="monitoring"} |= "error"
  • Troubleshooting: If "entry too far behind" errors on first start, restart Alloy DaemonSet (kubectl rollout restart ds -n monitoring alloy) — Alloy reads historical logs on first boot, which Loki rejects; clears after restart
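
The same LogQL can be run outside Grafana against Loki's HTTP API. A minimal sketch, assuming Loki is reachable in-cluster at the address below (the service name is a guess, override it with `LOKI_URL`); the function name is hypothetical.

```shell
# Assumed in-cluster address; only port 3100 is confirmed by the notes above.
LOKI_URL="${LOKI_URL:-http://loki.monitoring.svc.cluster.local:3100}"

# Run a LogQL query ($1) via the query_range endpoint, returning up to $2 lines (default 100).
loki_query() {
  curl -s -G "$LOKI_URL/loki/api/v1/query_range" \
    --data-urlencode "query=$1" \
    --data-urlencode "limit=${2:-100}"
}
```

For example, `loki_query '{namespace="monitoring"} |= "error"' 20` returns JSON for up to 20 matching lines from Loki's default time window.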

OpenClaw (AI Agent Gateway)

  • Image: ghcr.io/openclaw/openclaw:2026.2.9
  • Port: 18789
  • URL: https://openclaw.viktorbarzin.me (authentik-protected)
  • Namespace: openclaw (tier: aux)
  • Formerly: moltbot — renamed in Feb 2026
  • Architecture: Single pod with init container (tools download + repo clone) + main container (OpenClaw gateway)
  • Init container: Downloads kubectl v1.34.2, terraform 1.14.5, git-crypt; clones infra repo; runs terraform init
  • ServiceAccount: openclaw with cluster-admin ClusterRoleBinding (for managing cluster resources)
  • Storage: NFS at /mnt/main/openclaw/workspace (git repo) and /mnt/main/openclaw/data (persistent data)
  • Config: openclaw.json ConfigMap with model providers (Gemini, Ollama, Llama API), tool permissions, and agent defaults
  • Variables: openclaw_ssh_key, openclaw_skill_secrets in terraform.tfvars
  • Skill secrets: Home Assistant tokens (london + sofia), Uptime Kuma password — passed as env vars
  • Model providers: Gemini (gemini-2.5-flash), Ollama (qwen2.5-coder:14b, deepseek-r1:14b), Llama API (Llama-3.3-70B, Llama-4-Scout/Maverick)