Viktor Barzin 116c4d9c30 [ci skip] Remove legacy files and orphaned modules

Delete 20 orphaned module directories and 3 stray files from
modules/kubernetes/ that are no longer referenced by any stack.
Remove 7 root-level legacy files including the empty tfstate,
27MB terraform zip, commented-out main.tf, and migration notes.
Clean up commented-out dockerhub_secret and oauth-proxy references
in blog, travel_blog, and city-guesser stacks. Remove stale
frigate config.yaml entry from .gitignore. Remove ephemeral
docs/plans/ directory.

2026-02-22 15:23:27 +00:00

46 KiB

Executable file

Raw Blame History

Infrastructure Repository Knowledge

Instructions for Claude

When the user says "remember" something: Always update this file (.claude/CLAUDE.md) with the information so it persists across sessions
When discovering new patterns or versions: Add them to the appropriate section below
When making infrastructure changes: Always update this file to reflect the current state (new services, removed services, version changes, config changes)
After every significant change: Proactively update this file (.claude/CLAUDE.md) to reflect what changed — new services, config changes, version bumps, new patterns, etc. This ensures knowledge persists across sessions automatically.
After updating any .claude/ files: Always commit them immediately (git add .claude/ && git commit -m "[ci skip] update claude knowledge") to avoid building up unstaged changes.
Skills available: Check .claude/skills/ directory for specialized workflows (e.g., setup-project.md for deploying new services)
CRITICAL: All infrastructure changes must go through Terraform/Terragrunt. NEVER modify cluster resources directly (e.g., via kubectl apply/edit/patch, helm install, docker run). Always make changes in the Terraform .tf files and apply with terragrunt apply. The real cluster state must never deviate from what's defined in Terraform — if a manual change is unavoidable (e.g., containerd config on running nodes), document it and ensure the Terraform templates match so future provisioning is consistent. Use kubectl only for read-only operations (get, describe, logs) and ephemeral debugging (run --rm, delete stuck pods), never for persistent state changes.
CRITICAL: NEVER put sensitive data (API keys, passwords, tokens, credentials) into committed files unless they are encrypted (e.g., via git-crypt). Secrets belong in terraform.tfvars (which is git-crypt encrypted) or in the secrets/ directory. Never hardcode credentials in .tf files, scripts, .claude/ files, or any other unencrypted committed file. Always pass secrets through the Terraform variable chain (terraform.tfvars → main.tf → module variables).
CRITICAL: NEVER commit secrets — triple-check before every commit that no API keys, passwords, tokens, or credentials are included in unencrypted files. This is a hard rule with zero exceptions.
New services MUST have CI/CD: Set up Drone CI pipeline (.drone.yml) with GitHub/GitLab repo integration. Services should auto-build and auto-deploy.
New services MUST have monitoring: Every new service should have monitoring via Prometheus (alerts/metrics) and/or Uptime Kuma (HTTP health checks). Add both when possible.

Execution Environment

File operations: Read, Edit, Write, Glob, Grep tools
Git commands: git status, git log, git diff, git add, git commit, git reset, etc.
Shell commands: All tools (terraform, terragrunt, kubectl, helm, python, etc.) are available locally
CRITICAL: Always run terragrunt/terraform locally, never on the remote server via SSH:
```
cd stacks/<service> && terragrunt apply --non-interactive
```
kubectl: Use kubectl --kubeconfig $(pwd)/config for cluster access
GitHub API: Use curl with token from tfvars (see GitHub & Drone CI section below). gh CLI is blocked by sandbox restrictions.
Drone CI API: Use curl with token from tfvars (see GitHub & Drone CI section below).

Overview

Terragrunt-based infrastructure repository managing a home Kubernetes cluster on Proxmox VMs, with per-service state isolation. Each service has its own Terragrunt stack under stacks/, enabling fast, independent plan/apply cycles. Uses git-crypt for secrets encryption.

Static File Paths (NEVER CHANGE)

Main config: terraform.tfvars - All secrets, DNS, Cloudflare config, WireGuard peers
Root Terragrunt: terragrunt.hcl - Root Terragrunt config (providers, backend, var loading)
Service stacks: stacks/<service>/ - Individual service stacks (each has terragrunt.hcl + main.tf with resources inline)
Infra stack: stacks/infra/ - Proxmox VM resources (templates, docker-registry, VMs)
Platform stack: stacks/platform/ - Core infrastructure services (22 modules in modules/ subdir)
Per-stack state: state/stacks/<service>/terraform.tfstate - Per-stack state files (gitignored)
Service resources: stacks/<service>/main.tf - Service resources defined directly in stack root
Platform modules: stacks/platform/modules/<service>/ - Platform service modules
Shared modules: modules/kubernetes/ingress_factory/, modules/kubernetes/setup_tls_secret/
Secrets: secrets/ - git-crypt encrypted TLS certs and keys

Network Topology (Static IPs)

┌─────────────────────────────────────────────────────────────────┐
│ 10.0.10.0/24 - Management Network                               │
├─────────────────────────────────────────────────────────────────┤
│ 10.0.10.10  - Wizard (main server)                               │
│ 10.0.10.15  - NFS Server (TrueNAS) - /mnt/main/*                │
└─────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────┐
│ 10.0.20.0/24 - Kubernetes Network                               │
├─────────────────────────────────────────────────────────────────┤
│ 10.0.20.1   - pfSense Gateway                                   │
│ 10.0.20.10  - Docker Registry VM (MAC: DE:AD:BE:EF:22:22)       │
│ 10.0.20.100 - k8s-master                                        │
│ 10.0.20.101 - Technitium DNS                                    │
│ 10.0.20.102 - MetalLB IP Pool Start                             │
│ 10.0.20.200 - MetalLB IP Pool End                               │
└─────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────┐
│ 192.168.1.0/24 - Physical Network                               │
├─────────────────────────────────────────────────────────────────┤
│ 192.168.1.127 - Proxmox Hypervisor                              │
└─────────────────────────────────────────────────────────────────┘

Domains

Public: viktorbarzin.me (Cloudflare-managed)
Internal: viktorbarzin.lan (Technitium DNS)

Directory Structure

terragrunt.hcl - Root Terragrunt configuration (providers, backend, variable loading)
stacks/ - Individual Terragrunt stacks (one per service)
stacks/infra/ - Proxmox VM resources (templates, docker-registry)
stacks/platform/ - Core infrastructure (22 services in stacks/platform/modules/)
stacks/<service>/ - Individual service stacks (resources directly in main.tf)
stacks/platform/modules/<service>/ - Platform service module source code
modules/kubernetes/ - Only shared utility modules: ingress_factory/, setup_tls_secret/
modules/create-vm/ - Proxmox VM creation module
state/ - Per-stack Terraform state files (gitignored)
secrets/ - Encrypted secrets (TLS certs, keys) via git-crypt
cli/ - Go CLI tool for infrastructure management
scripts/ - Helper scripts (cluster management, node updates)
playbooks/ - Ansible playbooks for node configuration
diagram/ - Infrastructure diagrams (Python-based)

Key Patterns

Each service in modules/kubernetes/<service>/main.tf defines its own namespace, deployments, services, and ingress
NFS storage from 10.0.10.15 for persistent data
TLS secrets managed via setup_tls_secret module
Ingress uses Traefik (Helm chart, 3 replicas) with HTTP/3 (QUIC) enabled, Middleware CRDs for rate limiting, auth, CSP headers, CrowdSec bouncer, and analytics injection
HTTP/3 enabled on Traefik (http3.enabled=true, advertisedPort=443 on websecure entrypoint) and Cloudflare (cloudflare_zone_settings_override with http3="on")
GPU workloads use node_selector = { "gpu": "true" }
Services expose to *.viktorbarzin.me domains

NFS Volume Pattern

Prefer inline NFS volumes over separate PV/PVC resources. Use the nfs {} block directly in pod/deployment/cronjob specs:

volume {
  name = "data"
  nfs {
    server = "10.0.10.15"
    path   = "/mnt/main/<service>"
  }
}

Only use PV/PVC when the Helm chart requires existingClaim (like the Nextcloud Helm chart).

Adding NFS Exports

To add a new NFS exported directory:

Edit secrets/nfs_directories.txt - add the new directory path, keep the list sorted
Run secrets/nfs_exports.sh from the secrets/ directory to update the NFS share via TrueNAS API

Factory Pattern (for multi-user services)

Used when a service needs one instance per user. Structure:

stacks/<service>/
├── main.tf           # Namespace, TLS secret, user module calls
└── factory/
    └── main.tf       # Deployment, service, ingress templates with ${var.name}

Examples: actualbudget, freedify

To add a new user:

Export NFS share at /mnt/main/<service>/<username> in TrueNAS
Add Cloudflare route in tfvars
Add module block in main.tf calling factory

Init Container Pattern (for database migrations)

Use when a service needs to run database migrations before starting:

init_container {
  name    = "migration"
  image   = "service-image:tag"
  command = ["sh", "-c", "migration-command"]

  dynamic "env" {
    for_each = local.common_env
    content {
      name  = env.value.name
      value = env.value.value
    }
  }
}

Example: AFFiNE runs node ./scripts/self-host-predeploy.js in init container.

SMTP/Email Configuration

When configuring services to use the mailserver:

Use public hostname: mail.viktorbarzin.me (for TLS cert validation)
Do NOT use: mailserver.mailserver.svc.cluster.local (TLS cert mismatch)
Port: 587 (STARTTLS)
Credentials: Use existing accounts from mailserver_accounts in tfvars
Common email: info@viktorbarzin.me for service notifications

Terragrunt Architecture

Root terragrunt.hcl provides DRY provider, backend, and variable loading for all stacks
Each stack contains its resources directly: stacks/<service>/main.tf has variable declarations, locals, and all Terraform resources inline
Platform modules live at stacks/platform/modules/<service>/, referenced as source = "./modules/<service>"
Shared utility modules (ingress_factory, setup_tls_secret, dockerhub_secret, oauth-proxy) remain at modules/kubernetes/ and are referenced with relative paths from each module
State isolation: each stack has its own state file at state/stacks/<service>/terraform.tfstate
Dependencies: service stacks depend on platform stack via dependency block in their terragrunt.hcl
Variables loaded from terraform.tfvars automatically (unused vars silently ignored via extra_arguments)
secrets/ symlinks in each stack for TLS cert resolution (path.root workaround)
Terragrunt v0.99+: use --non-interactive (not --terragrunt-non-interactive)
run-all syntax: terragrunt run --all -- <command> (not terragrunt run-all)
The platform stack bundles ~22 core services that have cross-dependencies (traefik, monitoring, authentik, etc.)
Individual service stacks are for services that can be deployed independently

Adding a New Service

When adding a new service to the cluster:

Create stacks/<service>/ directory with:
- terragrunt.hcl - Include root config, declare platform dependency
- main.tf - All resources defined directly (variables, locals, namespace, deployments, services, ingress)
- secrets - Symlink to ../../secrets (for TLS cert path resolution)
Add Cloudflare DNS record in terraform.tfvars (cloudflare_proxied_names or cloudflare_non_proxied_names)
Apply the cloudflared stack: cd stacks/platform && terragrunt apply --non-interactive
Apply the new service: cd stacks/<service> && terragrunt apply --non-interactive

Common Variables

tls_secret_name - TLS certificate secret name
tier - Deployment tier label
Service-specific passwords passed as variables

Service Versions (as of 2026-02)

Immich: v2.4.1
Freedify: latest (music streaming, factory pattern)
AFFiNE: stable (visual canvas, uses PostgreSQL + Redis)
Wyoming Whisper: latest (STT for Home Assistant, CPU on GPU node)
Health: latest (Apple Health data dashboard, Svelte + FastAPI + Caddy, uses PostgreSQL)
Gramps Web: latest (genealogy, uses Redis + Celery)
Loki: 3.6.5 (log aggregation, single binary, 6Gi RAM, 24h in-memory chunks)
Alloy: v1.13.0 (log collector DaemonSet, forwards to Loki)
OpenClaw: 2026.2.9 (AI agent gateway, authentik-protected)

Useful Commands

# Cluster health check — ALWAYS use this to check cluster status
bash scripts/cluster_healthcheck.sh            # Full color report
bash scripts/cluster_healthcheck.sh --quiet    # Only WARN/FAIL
bash scripts/cluster_healthcheck.sh --json     # Machine-readable
bash scripts/cluster_healthcheck.sh --fix      # Auto-delete evicted pods

# Apply a single service stack
cd stacks/<service> && terragrunt apply --non-interactive

# Plan a single service stack
cd stacks/<service> && terragrunt plan --non-interactive

# Plan all stacks (full DAG)
cd stacks && terragrunt run --all --non-interactive -- plan

# Apply all stacks (full DAG)
cd stacks && terragrunt run --all --non-interactive -- apply

# Format all terraform files
terraform fmt -recursive

kubectl get pods -A

Cluster Health Check (scripts/cluster_healthcheck.sh):

ALWAYS use this script to check cluster health — whether the user asks explicitly, after deploying/updating services, or whenever you need to verify cluster state. Never use ad-hoc kubectl commands to assess overall cluster health; use the script instead.
Runs 24 checks: nodes, resources, conditions, pods, evicted, DaemonSets, deployments, PVCs, HPAs, CronJobs, CrowdSec, ingress, Prometheus alerts, Uptime Kuma, ResourceQuota pressure, StatefulSets, node disk, Helm releases, Kyverno, NFS, DNS, TLS certs, GPU, Cloudflare tunnel
When adding new healthchecks or monitoring: Always update this script to validate the new component

Terragrunt apply examples:

cd stacks/monitoring && terragrunt apply --non-interactive - Apply monitoring
cd stacks/immich && terragrunt apply --non-interactive - Apply immich
cd stacks/infra && terragrunt apply --non-interactive - Apply Proxmox VMs / docker registry
cd stacks/platform && terragrunt apply --non-interactive - Apply all core/platform services

IMPORTANT: When deploying a new service, you must ALSO apply the platform stack (which includes cloudflared) to create the Cloudflare DNS record:

cd stacks/platform && terragrunt apply --non-interactive

Adding a name to cloudflare_non_proxied_names or cloudflare_proxied_names in terraform.tfvars only defines the record — it won't be created until the platform stack (which contains cloudflared) is applied.

Stack Structure

Terragrunt stacks under stacks/:

stacks/infra/ - Proxmox VMs, templates, docker-registry
stacks/platform/ - Core infrastructure (~22 services in modules/ subdir)
stacks/<service>/ - Individual service stacks (resources directly in main.tf)

Each stack's terragrunt.hcl includes the root terragrunt.hcl which provides:

Kubernetes + Helm providers (configured from terraform.tfvars)
Local backend with per-stack state file (state/stacks/<service>/terraform.tfstate)
Automatic loading of terraform.tfvars with unused vars ignored

Complete Service Catalog

Critical - Network & Auth (Tier: core)

Service	Description	Stack
wireguard	VPN server	platform
technitium	DNS server (10.0.20.101)	platform
headscale	Tailscale control server	platform
traefik	Ingress controller (Helm)	platform
xray	Proxy/tunnel	platform
authentik	Identity provider (SSO)	platform
cloudflared	Cloudflare tunnel	platform
authelia	Auth middleware	platform
monitoring	Prometheus/Grafana/Loki stack	platform

Storage & Security (Tier: cluster)

Service	Description	Stack
vaultwarden	Bitwarden-compatible password manager	platform
redis	Shared Redis at `redis.redis.svc.cluster.local`	platform
immich	Photo management (GPU)	immich
nvidia	GPU device plugin	platform
metrics-server	K8s metrics	platform
uptime-kuma	Status monitoring	platform
crowdsec	Security/WAF	platform
kyverno	Policy engine	platform

Admin

Service	Description	Stack
k8s-dashboard	Kubernetes dashboard	platform
reverse-proxy	Generic reverse proxy	platform

Active Use

Service	Description	Stack
mailserver	Email (docker-mailserver)	mailserver
shadowsocks	Proxy	shadowsocks
webhook_handler	Webhook processing	webhook_handler
tuya-bridge	Smart home bridge	tuya-bridge
dawarich	Location history	dawarich
owntracks	Location tracking	owntracks
nextcloud	File sync/share	nextcloud
calibre	E-book management	calibre
onlyoffice	Document editing	onlyoffice
f1-stream	F1 streaming	f1-stream
rybbit	Analytics	rybbit
isponsorblocktv	SponsorBlock for TV	isponsorblocktv
actualbudget	Budgeting (factory pattern)	actualbudget

Optional

Service	Description	Stack
blog	Personal blog	blog
descheduler	Pod descheduler	descheduler
drone	CI/CD	drone
hackmd	Collaborative markdown	hackmd
kms	Key management	kms
privatebin	Encrypted pastebin	privatebin
vault	HashiCorp Vault	vault
reloader	ConfigMap/Secret reloader	reloader
city-guesser	Game	city-guesser
echo	Echo server	echo
url	URL shortener	url
excalidraw	Whiteboard	excalidraw
travel_blog	Travel blog	travel_blog
dashy	Dashboard	dashy
send	Firefox Send	send
ytdlp	YouTube downloader	ytdlp
wealthfolio	Finance tracking	wealthfolio
audiobookshelf	Audiobook server	audiobookshelf
paperless-ngx	Document management	paperless-ngx
jsoncrack	JSON visualizer	jsoncrack
servarr	Media automation (Sonarr/Radarr/etc)	servarr
ntfy	Push notifications	ntfy
cyberchef	Data transformation	cyberchef
diun	Docker image update notifier	diun
meshcentral	Remote management	meshcentral
homepage	Dashboard/startpage	homepage
matrix	Matrix chat server	matrix
linkwarden	Bookmark manager	linkwarden
changedetection	Web change detection	changedetection
tandoor	Recipe manager	tandoor
n8n	Workflow automation	n8n
real-estate-crawler	Property crawler	real-estate-crawler
tor-proxy	Tor proxy	tor-proxy
forgejo	Git forge	forgejo
freshrss	RSS reader	freshrss
navidrome	Music streaming	navidrome
networking-toolbox	Network tools	networking-toolbox
stirling-pdf	PDF tools	stirling-pdf
speedtest	Speed testing	speedtest
freedify	Music streaming (factory pattern)	freedify
netbox	Network documentation	netbox
infra-maintenance	Maintenance jobs	infra-maintenance
ollama	LLM server (GPU)	ollama
frigate	NVR/camera (GPU)	frigate
ebook2audiobook	E-book to audio (GPU)	ebook2audiobook
affine	Visual canvas/whiteboard (PostgreSQL + Redis)	affine
health	Apple Health data dashboard (PostgreSQL)	health
whisper	Wyoming Faster Whisper STT (CPU on GPU node)	whisper
grampsweb	Genealogy web app (Gramps Web)	grampsweb
openclaw	AI agent gateway (OpenClaw)	openclaw

Cloudflare Domains

Proxied (CDN + WAF enabled)

blog, hackmd, privatebin, url, echo, f1tv, excalidraw, send,
audiobookshelf, jsoncrack, ntfy, cyberchef, homepage, linkwarden,
changedetection, tandoor, n8n, stirling-pdf, dashy, city-guesser,
travel, netbox

Non-Proxied (Direct DNS)

mail, wg, headscale, immich, calibre, vaultwarden, drone,
mailserver-antispam, mailserver-admin, webhook, uptime,
owntracks, dawarich, tuya, meshcentral, nextcloud, actualbudget,
onlyoffice, forgejo, freshrss, navidrome, ollama, openwebui,
isponsorblocktv, speedtest, freedify, rybbit, paperless,
servarr, prowlarr, bazarr, radarr, sonarr, flaresolverr,
jellyfin, jellyseerr, tdarr, affine, health, family, openclaw

Special Subdomains

*.viktor.actualbudget - Actualbudget factory instances
*.freedify - Freedify factory instances
mailserver.* - Mail server components (antispam, admin)

CI/CD

Drone CI (.drone.yml) for automated deployments
Auto-updates TLS certificates
ALWAYS add [ci skip] to commit messages when you've already run terraform apply to avoid triggering CI redundantly
After committing, run git push origin master to sync changes

GitHub & Drone CI

GitHub API Access

Username: ViktorBarzin
Token location: terraform.tfvars as github_pat (git-crypt encrypted)
Read token: grep github_pat terraform.tfvars | cut -d'"' -f2
Scopes: Full access — repo, admin:public_key, admin:repo_hook, delete_repo, admin:org, workflow, write:packages, and more
gh CLI: Blocked by sandbox restrictions — use curl with the GitHub API instead

Common API Patterns

# Read token from tfvars
GITHUB_TOKEN=$(grep github_pat terraform.tfvars | cut -d'"' -f2)

# List repos
curl -s -H "Authorization: token $GITHUB_TOKEN" "https://api.github.com/users/ViktorBarzin/repos?per_page=100"

# Create repo
curl -s -X POST -H "Authorization: token $GITHUB_TOKEN" "https://api.github.com/user/repos" \
  -d '{"name":"repo-name","private":true}'

# Add deploy key
curl -s -X POST -H "Authorization: token $GITHUB_TOKEN" "https://api.github.com/repos/ViktorBarzin/<repo>/keys" \
  -d '{"title":"key-name","key":"ssh-ed25519 ...","read_only":false}'

# Create webhook (e.g., for Drone CI)
curl -s -X POST -H "Authorization: token $GITHUB_TOKEN" "https://api.github.com/repos/ViktorBarzin/<repo>/hooks" \
  -d '{"config":{"url":"https://drone.viktorbarzin.me/hook","content_type":"json","secret":"..."},"events":["push","pull_request"]}'

# Get repo info
curl -s -H "Authorization: token $GITHUB_TOKEN" "https://api.github.com/repos/ViktorBarzin/<repo>"

Drone CI API Access

Server: https://drone.viktorbarzin.me
Token location: terraform.tfvars as drone_api_token (git-crypt encrypted)
Read token: grep drone_api_token terraform.tfvars | cut -d'"' -f2
Username: ViktorBarzin

Common API Patterns

# Read token from tfvars
DRONE_TOKEN=$(grep drone_api_token terraform.tfvars | cut -d'"' -f2)

# List repos
curl -s -H "Authorization: Bearer $DRONE_TOKEN" "https://drone.viktorbarzin.me/api/repos"

# Activate repo in Drone
curl -s -X POST -H "Authorization: Bearer $DRONE_TOKEN" "https://drone.viktorbarzin.me/api/repos/ViktorBarzin/<repo>"

# Trigger build
curl -s -X POST -H "Authorization: Bearer $DRONE_TOKEN" "https://drone.viktorbarzin.me/api/repos/ViktorBarzin/<repo>/builds"

# Get build info
curl -s -H "Authorization: Bearer $DRONE_TOKEN" "https://drone.viktorbarzin.me/api/repos/ViktorBarzin/<repo>/builds/<build-number>"

# Add secret to repo
curl -s -X POST -H "Authorization: Bearer $DRONE_TOKEN" "https://drone.viktorbarzin.me/api/repos/ViktorBarzin/<repo>/secrets" \
  -d '{"name":"secret_name","data":"secret_value"}'

Capabilities

With these tokens, Claude can:

GitHub: Create/delete repos, push code, manage SSH/deploy keys, manage webhooks, manage org settings, manage packages
Drone CI: Activate repos, trigger/monitor builds, manage secrets, configure pipelines

Infrastructure

Proxmox hypervisor for VMs (192.168.1.127)
Kubernetes cluster with GPU node (5 nodes: k8s-master + k8s-node1-4, running v1.34.2)
NFS server at 10.0.10.15 for storage
Redis shared service at redis.redis.svc.cluster.local
Docker registry pull-through cache at 10.0.20.10 (static IP via cloud-init)
- Port 5000: docker.io (Docker Hub, with auth)
- Port 5010: ghcr.io
- Port 5020: quay.io
- Port 5030: registry.k8s.io
- Port 5040: reg.kyverno.io
- Worker nodes use config_path = "/etc/containerd/certs.d" with per-registry hosts.toml files
- k8s-master does NOT use pull-through cache (containerd 1.6.x incompatibility with config_path + mirrors)

Proxmox Host Hardware

CPU: Intel Xeon E5-2699 v4 @ 2.20GHz (22 cores / 44 threads, single socket)
RAM: 142 GB (Dell R730 server)
GPU: NVIDIA Tesla T4 (PCIe passthrough to k8s-node1)
Disks: 1.1TB + 931GB + 10.7TB (local storage)
Proxmox access: ssh root@192.168.1.127

Proxmox Network Bridges

vmbr0: Physical bridge on eno1, IP 192.168.1.127/24 — connects to physical/home network (192.168.1.0/24)
vmbr1: Internal-only bridge (no physical port), VLAN-aware — carries VLAN 10 (management 10.0.10.0/24) and VLAN 20 (kubernetes 10.0.20.0/24)

Proxmox VM Inventory

VMID	Name	Status	CPUs	RAM	Network	Disk	Notes
101	pfsense	running	8	16GB	vmbr0, vmbr1:vlan10, vmbr1:vlan20	32G	Gateway/firewall, routes between all networks
102	devvm	running	16	8GB	vmbr1:vlan10	100G	Development VM on management network
103	home-assistant	running	8	16GB	vmbr1:vlan10(down), vmbr0	32G	Home Assistant, net0 link disabled, uses vmbr0
105	pbs	stopped	16	8GB	vmbr1:vlan10	32G	Proxmox Backup Server (not in use)
200	k8s-master	running	8	16GB	vmbr1:vlan20	64G	Kubernetes control plane (10.0.20.100)
201	k8s-node1	running	16	24GB	vmbr1:vlan20	128G	GPU node, Tesla T4 passthrough (hostpci0)
202	k8s-node2	running	8	16GB	vmbr1:vlan20	64G	K8s worker node
203	k8s-node3	running	8	16GB	vmbr1:vlan20	64G	K8s worker node
204	k8s-node4	running	8	16GB	vmbr1:vlan20	64G	K8s worker node
220	docker-registry	running	4	4GB	vmbr1:vlan20	64G	Terraform-managed, MAC DE:AD:BE:EF:22:22 (10.0.20.10)
300	Windows10	running	16	8GB	vmbr0	100G	Windows VM on physical network
9000	truenas	running	16	16GB	vmbr1:vlan10	32G+7×256G+1T	NFS server (10.0.10.15), multiple data disks

VM Templates (stopped, used for cloning)

VMID	Name	Purpose
1000	ubuntu-2404-cloudinit-non-k8s-template	Base template for non-K8s VMs
1001	docker-registry-template	Template for docker registry VM
2000	ubuntu-2404-cloudinit-k8s-template	Base template for K8s nodes

Network Connectivity Summary

pfSense (101) bridges all three networks: physical (vmbr0), management VLAN 10, and kubernetes VLAN 20
K8s cluster (200-204) + docker-registry (220) are all on VLAN 20 (kubernetes network)
TrueNAS (9000) + devvm (102) + PBS (105) are on VLAN 10 (management network)
Home Assistant (103) is on physical network (vmbr0), with a disabled VLAN 10 interface
Windows10 (300) is on physical network (vmbr0) only

GPU Node (k8s-node1)

VMID: 201
PCIe Passthrough: 0000:06:00.0 (NVIDIA Tesla T4)
Taint: nvidia.com/gpu=true:NoSchedule - Only GPU workloads can run here
Label: gpu=true
GPU workloads must have both:
- node_selector = { "gpu": "true" }
- toleration { key = "nvidia.com/gpu", operator = "Equal", value = "true", effect = "NoSchedule" }
Taint is applied via null_resource.gpu_node_taint in modules/kubernetes/nvidia/main.tf

Git Operations (IMPORTANT)

Git is slow on this repo due to many files - commands can take 30+ seconds
Use GIT_OPTIONAL_LOCKS=0 prefix if git hangs
Always commit only specific files you changed, not everything
ALWAYS ask user before pushing to remote - never push without explicit confirmation

Prometheus Alerts

Alert rules are in modules/kubernetes/monitoring/prometheus_chart_values.tpl
Under serverFiles.alerting_rules.yml.groups
Groups: "R730 Host", "Nvidia Tesla T4 GPU", "Power", "Cluster"
kube-state-metrics provides: kube_deployment_*, kube_statefulset_*, kube_daemonset_*

Tier System

0-core: Critical infrastructure (ingress, DNS, VPN, auth)
1-cluster: Cluster services (Redis, metrics, security)
2-gpu: GPU workloads (Immich, Ollama, Frigate)
3-edge: User-facing services
4-aux: Optional/auxiliary services

Resource Governance (Kyverno-based)

Four layers of noisy-neighbor protection, all defined in modules/kubernetes/kyverno/resource-governance.tf:

PriorityClasses: tier-0-core (1M) through tier-4-aux (200K). tier-4-aux uses preemption_policy=Never.
LimitRange defaults (Kyverno generate): Auto-creates tier-defaults LimitRange in namespaces based on tier label. Only affects containers without explicit resources.
ResourceQuotas (Kyverno generate): Auto-creates tier-quota ResourceQuota in namespaces with tier labels. Excludes namespaces with resource-governance/custom-quota=true label.
Priority injection (Kyverno mutate): Sets priorityClassName on Pods based on namespace tier label.

Custom quota override: Add label resource-governance/custom-quota: "true" to namespace, then define a custom kubernetes_resource_quota in the service's Terraform module. Currently used by: monitoring, crowdsec.

LimitRange defaults by tier:

Tier	Default Req	Default Limit	Max
0-core	100m/128Mi	2/4Gi	8/16Gi
1-cluster	100m/128Mi	2/4Gi	4/8Gi
2-gpu	100m/256Mi	4/8Gi	8/16Gi
3-edge	50m/128Mi	1/2Gi	4/8Gi
4-aux	25m/64Mi	500m/1Gi	2/4Gi

ResourceQuota hard limits by tier:

Tier	Req CPU	Req Mem	Lim CPU	Lim Mem	Pods
0-core	8	8Gi	32	64Gi	100
1-cluster	4	4Gi	16	32Gi	30
2-gpu	8	8Gi	48	96Gi	40
3-edge	4	4Gi	16	32Gi	30
4-aux	2	2Gi	8	16Gi	20

User Preferences

Calendar

Default calendar: Nextcloud (always use unless otherwise specified)
Nextcloud URL: https://nextcloud.viktorbarzin.me
CalDAV endpoint: https://nextcloud.viktorbarzin.me/remote.php/dav/calendars/<username>/<calendar-name>/

Home Assistant

Default smart home: Home Assistant (always use for smart home control)
Two deployments:
- ha-london (default): https://ha-london.viktorbarzin.me | Script: .claude/home-assistant.py | SSH: ssh pi@192.168.8.103, config at /home/pi/docker/homeAssistant/
- ha-sofia: https://ha-sofia.viktorbarzin.me | Script: .claude/home-assistant-sofia.py | SSH: ssh vbarzin@192.168.1.8, config at /config/
Aliases: "ha" or "HA" = ha-london. "ha sofia" or "ha-sofia" = ha-sofia.

Development

Frontend framework: Svelte (user is learning it, so use Svelte for all new web apps)

Pod Monitoring After Updates

Never use sleep to wait for pods — instead, spawn a background subagent (Task tool with run_in_background: true) that continuously checks pod state (e.g., kubectl get pods -n <namespace> -w) and reports back when the pod is ready or if errors occur. This catches CrashLoopBackOff, ImagePullBackOff, and other failures much sooner than periodic sleep-based polling.

Skills & Workflows

Skills are specialized workflows for common tasks. Located in .claude/skills/.

Available Skills

setup-project (.claude/skills/setup-project/SKILL.md)

Deploy new self-hosted services from GitHub repos
Automated workflow: Docker image → Terraform module → Deploy
Handles database setup, ingress, DNS configuration
When to use: User provides GitHub URL or wants to deploy a new service
Example: "Deploy [GitHub repo] to the cluster"

extend-vm-storage (.claude/skills/extend-vm-storage/SKILL.md)

Extend disk storage on K8s node VMs (Proxmox-hosted)
Automates: drain → shutdown → resize → boot → expand filesystem → uncordon
When to use: A k8s node needs more disk space
Example: "Extend storage on k8s-node2 by 64G"

Service-Specific Notes

Authentik (Identity Provider)

Helm Chart: authentik v2025.10.3 from https://charts.goauthentik.io/
URL: https://authentik.viktorbarzin.me
API: https://authentik.viktorbarzin.me/api/v3/
API Token: Stored in terraform.tfvars as authentik_api_token (non-expiring, superuser, identifier: claude-code-permanent). Read with: grep authentik_api_token terraform.tfvars | cut -d'"' -f2
Namespace: authentik (tier: cluster)
Architecture: 3 server replicas + 3 worker replicas + 3 PgBouncer replicas + 1 embedded outpost
Database: PostgreSQL via postgresql.dbaas:5432, pooled through PgBouncer at pgbouncer.authentik:6432
Redis: Shared at redis.redis.svc.cluster.local
Terraform: modules/kubernetes/authentik/main.tf (Helm), pgbouncer.tf (connection pooling)

Authentik API Management

To call the API, use:

curl -s -H "Authorization: Bearer <TOKEN>" "https://authentik.viktorbarzin.me/api/v3/<endpoint>/"

Key API endpoints:

core/users/ — List/create/update/delete users
core/groups/ — List/create/update/delete groups
core/applications/ — List/create applications
providers/all/ — List all providers (OAuth2, Proxy, etc.)
providers/oauth2/ — OAuth2/OIDC providers specifically
providers/proxy/ — Proxy providers (forward auth)
flows/instances/ — List flows
stages/all/ — List stages
sources/all/ — List sources (Google, GitHub, etc.)
outposts/instances/ — List outposts
propertymappings/all/ — List property mappings
rbac/roles/ — List roles

Current Applications (9)

Application	Provider Type	Auth Flow
Cloudflare Access	OAuth2/OIDC	explicit consent
Domain wide catch all	Proxy (forward auth)	implicit consent
Grafana	OAuth2/OIDC	implicit consent
Headscale	OAuth2/OIDC	explicit consent
Immich	OAuth2/OIDC	explicit consent
Kubernetes	OAuth2/OIDC (public)	implicit consent
linkwarden	OAuth2/OIDC	explicit consent
Matrix	OAuth2/OIDC	implicit consent
wrongmove	OAuth2/OIDC	implicit consent

Current Groups (9)

Group	Parent	Superuser	Purpose
Allow Login Users	—	No	Parent group for login-permitted users
authentik Admins	—	Yes	Full admin access
authentik Read-only	—	No	Read-only access (has role)
Headscale Users	Allow Login Users	No	VPN access
Home Server Admins	Allow Login Users	No	Server admin access
Wrongmove Users	Allow Login Users	No	Real-estate app access
kubernetes-admins	—	No	K8s cluster-admin RBAC
kubernetes-power-users	—	No	K8s power-user RBAC
kubernetes-namespace-owners	—	No	K8s namespace-owner RBAC

Current Users (7 real users)

Username	Name	Type	Groups
akadmin	authentik Default Admin	internal	authentik Admins, Home Server Admins, Headscale Users
vbarzin@gmail.com	Viktor Barzin	internal	authentik Admins, Home Server Admins, Wrongmove Users, Headscale Users
emil.barzin@gmail.com	Emil Barzin	internal	Home Server Admins, Headscale Users
ancaelena98@gmail.com	Anca Milea	external	Wrongmove Users, Headscale Users
vabbit81@gmail.com	GHEORGHE Milea	external	Headscale Users
valentinakolevabarzina@gmail.com	Валентина Колева-Барзина	internal	Headscale Users
anca.r.cristian10@gmail.com	—	internal	Wrongmove Users
kadir.tugan@gmail.com	Kadir	internal	Wrongmove Users

Google (OAuth) — user matching by identifier
GitHub (OAuth) — user matching by email_link
Facebook (OAuth) — user matching by email_link
All use the same authentication flow (1a779f24) and enrollment flow (87572804)

Authorization Flows

Explicit consent (default-provider-authorization-explicit-consent): Shows consent screen before redirecting — used for Immich, Linkwarden, Headscale, Cloudflare
Implicit consent (default-provider-authorization-implicit-consent): Auto-redirects without consent — used for Grafana, Matrix, Domain catch-all, Wrongmove

Traefik Integration

Forward auth middleware: authentik-forward-auth in Traefik namespace
Outpost endpoint: http://ak-outpost-authentik-embedded-outpost.authentik.svc.cluster.local:9000/outpost.goauthentik.io/auth/traefik
Services opt in via protected = true in ingress_factory
Response headers: X-authentik-username, X-authentik-uid, X-authentik-email, X-authentik-name, X-authentik-groups, Set-Cookie

OIDC for Kubernetes API

Issuer: https://authentik.viktorbarzin.me/application/o/kubernetes/
Client ID: kubernetes (public client, no secret)
Username claim: email, Groups claim: groups
Signing key: authentik Self-signed Certificate (must be assigned to the provider or JWKS will be empty)
Redirect URIs: Regex mode http://localhost:.* and http://127\.0\.0\.1:.* (kubelogin picks random ports)
Configured via: SSH to kube-apiserver manifest (modules/kubernetes/rbac/apiserver-oidc.tf)
RBAC module: modules/kubernetes/rbac/main.tf — admin/power-user/namespace-owner roles
Self-service portal: modules/kubernetes/k8s-portal/ — SvelteKit app at https://k8s-portal.viktorbarzin.me
User definition: k8s_users variable in terraform.tfvars
Audit logging: Enabled via modules/kubernetes/rbac/audit-policy.tf, logs at /var/log/kubernetes/audit.log

CRITICAL GOTCHAS when setting up Authentik OIDC for Kubernetes:

Signing key MUST be assigned to the OAuth2 provider. Without it, the JWKS endpoint returns {} and kube-apiserver can't validate tokens.
Email mapping must set email_verified: True. The default Authentik email scope mapping hardcodes email_verified: False, which causes kube-apiserver to reject the token with oidc: email not verified. Use a custom scope mapping: return {"email": request.user.email, "email_verified": True}
kubelogin needs --oidc-extra-scope for email, profile, groups. Without these, only openid is requested and the token lacks the email claim, causing oidc: parse username claims "email": claim not present.
Redirect URIs must use regex mode (http://localhost:.*) because kubelogin picks random ports, not just 8000/18000.
Kubelet static pod manifest changes require a full cycle to take effect: remove manifest, stop kubelet, remove containers via crictl, re-add manifest, start kubelet. Simple touch or kubelet restart is not enough.
Property mappings endpoint in Authentik 2025.10.x is propertymappings/provider/scope/ (not the older propertymappings/scope/).

Common Management Tasks

Add a new OAuth2 application:

Create OAuth2 provider: POST /api/v3/providers/oauth2/ with client_id, client_secret, redirect_uris, authorization_flow, etc.
Create application: POST /api/v3/core/applications/ with name, slug, provider pk
(Optional) Bind to group policy for access control

Add a user to a group:

# Get group pk, then PATCH with updated users list
curl -X PATCH -H "Authorization: Bearer <TOKEN>" -H "Content-Type: application/json" \
  "https://authentik.viktorbarzin.me/api/v3/core/groups/<group-pk>/" \
  -d '{"users": [<existing_user_pks>, <new_user_pk>]}'

Protect a service with forward auth: Set protected = true in the service's ingress_factory call in Terraform.

AFFiNE (Visual Canvas)

Image: ghcr.io/toeverything/affine:stable
Port: 3010
Requires: PostgreSQL + Redis
Migration: Init container runs node ./scripts/self-host-predeploy.js
Storage: NFS at /mnt/main/affine mounted to /root/.affine/storage and /root/.affine/config
Key env vars:
- AFFINE_SERVER_EXTERNAL_URL - Public URL (e.g., https://affine.viktorbarzin.me)
- AFFINE_SERVER_HTTPS - Set to true behind TLS ingress
- DATABASE_URL - PostgreSQL connection string
- REDIS_SERVER_HOST - Redis hostname
- MAILER_* - SMTP configuration for email invites
Local-first: Data stored in browser by default; syncs to server when user creates account
Docs: https://docs.affine.pro/self-host-affine

Wyoming Whisper (STT for Home Assistant)

Image: rhasspy/wyoming-whisper:latest
Port: 10300/TCP (Wyoming protocol)
Model: small-int8 (CPU-optimized, no CUDA variant available from upstream)
Runs on: GPU node (node_selector gpu=true + nvidia toleration) but uses CPU only
Storage: NFS at /mnt/main/whisper → /data (model cache)
Exposure: Internal only via Traefik TCP entrypoint whisper-tcp → IngressRouteTCP
Access: 10.0.20.202:10300 (Traefik LB IP, no public DNS)
HA Integration: Wyoming Protocol integration in ha-london, host 10.0.20.202, port 10300
No GPU acceleration: Official image is CPU-only (Debian + PyTorch CPU). The mib1185/wyoming-faster-whisper-cuda image exists but requires self-build.

Gramps Web (Genealogy)

Image: ghcr.io/gramps-project/grampsweb:latest
Port: 5000
URL: https://family.viktorbarzin.me
Components: Web app + Celery worker (2 containers in 1 pod)
Requires: Shared Redis (DB 2 for Celery broker/backend, DB 3 for rate limiting)
Storage: NFS at /mnt/main/grampsweb with sub_paths: users, indexdir, thumbnail_cache, cache, secret, grampsdb, media, tmp
Key env vars:
- GRAMPSWEB_SECRET_KEY - Flask secret key (generated via random_password)
- GRAMPSWEB_TREE - Tree name
- GRAMPSWEB_BASE_URL - Public URL
- GRAMPSWEB_CELERY_CONFIG__broker_url / result_backend - Redis connection
- GRAMPSWEB_REGISTRATION_DISABLED - Set to True
- GRAMPSWEB_EMAIL_* - SMTP configuration
- GRAMPSWEB_LLM_* - Ollama AI integration
Celery command: celery -A gramps_webapi.celery worker --loglevel=INFO --concurrency=2
Registration: Disabled; first user created via UI setup wizard

Loki + Alloy (Centralized Log Collection)

Loki image: grafana/loki:3.6.5 (Helm chart, single binary mode)
Alloy image: grafana/alloy:v1.13.0 (Helm chart, DaemonSet)
Config files: modules/kubernetes/monitoring/loki.tf, loki.yaml, alloy.yaml
Port: 3100/TCP (Loki API)
Storage: NFS PV at /mnt/main/loki/loki (15Gi), WAL on tmpfs (2Gi in-memory)
Memory: Loki 6Gi limit, Alloy 128Mi per pod (4 worker nodes)
Disk-friendly tuning: max_chunk_age: 24h, chunk_idle_period: 12h — holds chunks in memory, flushes ~once/day
Retention: 7 days (retention_period: 168h), compactor enforces deletion
Crash policy: WAL on tmpfs — up to 24h log loss on crash (alerts still fire in real-time)
Ruler: Evaluates LogQL alert rules, fires to http://prometheus-alertmanager.monitoring.svc.cluster.local:9093
Alert rules: HighErrorRate, PodCrashLoopBackOff, OOMKilled (ConfigMap loki-alert-rules)
Grafana: Datasource UID P8E80F9AEF21F6940, dashboard "Loki Kubernetes Logs" (stored in MySQL, not file-provisioned)
Sysctl DaemonSet: sysctl-inotify sets fs.inotify.max_user_watches=1048576 on all nodes (required for Alloy fsnotify)
Disabled components: gateway, chunksCache, resultsCache (not needed for single binary)
Key paths: Compactor at /var/loki/compactor, ruler scratch at /var/loki/scratch (must be under /var/loki — root FS is read-only)
Querying: Grafana Explore with LogQL, e.g. {namespace="monitoring"} |= "error"
Troubleshooting: If "entry too far behind" errors on first start, restart Alloy DaemonSet (kubectl rollout restart ds -n monitoring alloy) — Alloy reads historical logs on first boot, which Loki rejects; clears after restart

OpenClaw (AI Agent Gateway)

Image: ghcr.io/openclaw/openclaw:2026.2.9
Port: 18789
URL: https://openclaw.viktorbarzin.me (authentik-protected)
Namespace: openclaw (tier: aux)
Formerly: moltbot — renamed in Feb 2026
Architecture: Single pod with init container (tools download + repo clone) + main container (OpenClaw gateway)
Init container: Downloads kubectl v1.34.2, terraform 1.14.5, git-crypt; clones infra repo; runs terraform init
ServiceAccount: openclaw with cluster-admin ClusterRoleBinding (for managing cluster resources)
Storage: NFS at /mnt/main/openclaw/workspace (git repo) and /mnt/main/openclaw/data (persistent data)
Config: openclaw.json ConfigMap with model providers (Gemini, Ollama, Llama API), tool permissions, and agent defaults
Variables: openclaw_ssh_key, openclaw_skill_secrets in terraform.tfvars
Skill secrets: Home Assistant tokens (london + sofia), Uptime Kuma password — passed as env vars
Model providers: Gemini (gemini-2.5-flash), Ollama (qwen2.5-coder:14b, deepseek-r1:14b), Llama API (Llama-3.3-70B, Llama-4-Scout/Maverick)

46 KiB Executable file Raw Blame History Unescape Escape

Infrastructure Repository Knowledge

Instructions for Claude

Execution Environment

Overview

Static File Paths (NEVER CHANGE)

Network Topology (Static IPs)

Domains

Directory Structure

Key Patterns

NFS Volume Pattern

Adding NFS Exports

Factory Pattern (for multi-user services)

Init Container Pattern (for database migrations)

SMTP/Email Configuration

Terragrunt Architecture

Adding a New Service

Common Variables

Service Versions (as of 2026-02)

Useful Commands

Stack Structure

Complete Service Catalog

Critical - Network & Auth (Tier: core)

Storage & Security (Tier: cluster)

Admin

Active Use

Optional

Cloudflare Domains

Proxied (CDN + WAF enabled)

Non-Proxied (Direct DNS)

Special Subdomains

CI/CD

GitHub & Drone CI

GitHub API Access

Common API Patterns

Drone CI API Access

Common API Patterns

Capabilities

Infrastructure

Proxmox Host Hardware

Proxmox Network Bridges

Proxmox VM Inventory

VM Templates (stopped, used for cloning)

Network Connectivity Summary

GPU Node (k8s-node1)

Git Operations (IMPORTANT)

Prometheus Alerts

Tier System

Resource Governance (Kyverno-based)

User Preferences

Calendar

Home Assistant

Development

Pod Monitoring After Updates

Skills & Workflows

Available Skills

Service-Specific Notes

Authentik (Identity Provider)

Authentik API Management

Current Applications (9)

Current Groups (9)

Current Users (7 real users)

Login Sources (Social Login)

Authorization Flows

Traefik Integration

OIDC for Kubernetes API

Common Management Tasks

AFFiNE (Visual Canvas)

Wyoming Whisper (STT for Home Assistant)

Gramps Web (Genealogy)

Loki + Alloy (Centralized Log Collection)

OpenClaw (AI Agent Gateway)

46 KiB

Executable file

Raw Blame History