infra/.claude/CLAUDE.md
2026-02-06 20:10:02 +00:00

Infrastructure Repository Knowledge

Instructions for Claude

  • When the user says "remember" something: Always update this file (.claude/CLAUDE.md) with the information so it persists across sessions
  • When discovering new patterns or versions: Add them to the appropriate section below
  • Skills available: Check .claude/skills/ directory for specialized workflows (e.g., setup-project.md for deploying new services)

Execution Environment (CRITICAL)

  • Prefer running commands directly first - only use remote executor as fallback if local execution fails

Commands that work LOCALLY (macOS)

  • File operations: Read, Edit, Write, Glob, Grep tools
  • Git commands: git status, git log, git diff, git add, git commit, git reset, etc.
  • Basic shell: ls, cat, head, tail, find, grep, etc.

Commands that REQUIRE REMOTE EXECUTOR

  • terraform: apply, plan, init, state - needs cluster access
  • kubectl: all k8s commands - needs cluster access
  • helm: chart operations - needs cluster access
  • docker: container operations on remote hosts
  • ssh: connections to infrastructure nodes
  • python/pip: ALL Python and pip commands must run via remote executor
  • Any command interacting with: Proxmox, Kubernetes cluster, NFS server, other infrastructure

Remote Command Execution (FALLBACK)

For commands that fail locally, use the file-based relay, which talks to a shared executor at ~/.claude/ on the remote VM.

IMPORTANT: Always use multi-session mode - create a session at the start of each conversation.

Shared Executor Architecture

The executor lives at ~/.claude/ on the remote VM (wizard@10.0.10.10) and serves all projects:

  • ~/.claude/remote-executor.sh - The shared command executor
  • ~/.claude/session-exec.sh - Shared session management
  • ~/.claude/sessions/ → symlink to project sessions (or shared sessions directory)

Each session includes a workdir.txt specifying which project directory to use.

Multi-Session Mode (REQUIRED)

Each Claude session gets isolated command execution:

# 1. Create a session (once per Claude session)
SESSION_ID=$(.claude/session-exec.sh)

# 2. Write command to your session
echo "your-command-here" > .claude/sessions/$SESSION_ID/cmd_input.txt

# 3. Wait and check status
sleep 1 && cat .claude/sessions/$SESSION_ID/cmd_status.txt

# 4. Read output (when status is "done:*")
cat .claude/sessions/$SESSION_ID/cmd_output.txt

# 5. Cleanup when done (optional - auto-cleaned after 24h)
.claude/session-exec.sh $SESSION_ID cleanup

Status values: ready | running | done:N (N = exit code)

The user must start the shared executor in another terminal before any session command will run:

# On wizard@10.0.10.10:
~/.claude/remote-executor.sh

Session helper commands:

  • .claude/session-exec.sh - Create new session (returns session ID)
  • .claude/session-exec.sh <id> status - Check session status
  • .claude/session-exec.sh <id> cleanup - Remove a session
  • .claude/session-exec.sh _ list - List all active sessions
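
The five steps above can be wrapped in one helper. This is a minimal sketch, not something that ships in the repository: the function names run_remote and remote_exit_code, the 60-second default timeout, and the exit code 124 on timeout are all assumptions. It also assumes the status file does not still hold a done:* value from an earlier command, since how the executor resets the status is not documented here.

```shell
# remote_exit_code STATUS
# Prints N when STATUS is "done:N"; fails for "ready", "running", or anything else.
remote_exit_code() {
  case "$1" in
    done:*) printf '%s\n' "${1#done:}" ;;
    *) return 1 ;;
  esac
}

# run_remote SESSION_DIR COMMAND [TIMEOUT_SECONDS]
# Writes COMMAND to the session, polls the status file once per second,
# then prints the captured output and returns the remote exit code.
run_remote() {
  local dir="$1" cmd="$2" timeout="${3:-60}" status code waited=0
  printf '%s\n' "$cmd" > "$dir/cmd_input.txt"
  while [ "$waited" -lt "$timeout" ]; do
    sleep 1
    waited=$((waited + 1))
    status="$(cat "$dir/cmd_status.txt" 2>/dev/null || true)"
    if code="$(remote_exit_code "$status")"; then
      cat "$dir/cmd_output.txt"
      return "$code"
    fi
  done
  echo "timed out waiting for $dir" >&2
  return 124   # hypothetical convention, borrowed from timeout(1)
}
```

Typical use, once a session exists: run_remote ".claude/sessions/$SESSION_ID" "kubectl get pods -A"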

Overview

Terraform-based infrastructure repository managing a home Kubernetes cluster on Proxmox VMs. Uses git-crypt for secrets encryption.

Static File Paths (NEVER CHANGE)

  • Main config: terraform.tfvars - All secrets, DNS, Cloudflare config, WireGuard peers
  • Root terraform: main.tf - Proxmox provider, VM templates, kubernetes_cluster module
  • K8s services: modules/kubernetes/main.tf - All service module definitions
  • Secrets: secrets/ - git-crypt encrypted TLS certs and keys

Network Topology (Static IPs)

┌─────────────────────────────────────────────────────────────────┐
│ 10.0.10.0/24 - Management Network                               │
├─────────────────────────────────────────────────────────────────┤
│ 10.0.10.10  - Wizard (main server / remote executor host)       │
│ 10.0.10.15  - NFS Server (TrueNAS) - /mnt/main/*                │
└─────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────┐
│ 10.0.20.0/24 - Kubernetes Network                               │
├─────────────────────────────────────────────────────────────────┤
│ 10.0.20.1   - pfSense Gateway                                   │
│ 10.0.20.10  - Docker Registry VM (MAC: DE:AD:BE:EF:22:22)       │
│ 10.0.20.100 - k8s-master                                        │
│ 10.0.20.101 - Technitium DNS                                    │
│ 10.0.20.102 - MetalLB IP Pool Start                             │
│ 10.0.20.200 - MetalLB IP Pool End                               │
└─────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────┐
│ 192.168.1.0/24 - Physical Network                               │
├─────────────────────────────────────────────────────────────────┤
│ 192.168.1.127 - Proxmox Hypervisor                              │
└─────────────────────────────────────────────────────────────────┘

Domains

  • Public: viktorbarzin.me (Cloudflare-managed)
  • Internal: viktorbarzin.lan (Technitium DNS)

Directory Structure

  • main.tf - Main Terraform entry point, imports all modules
  • modules/kubernetes/ - Kubernetes service deployments (one folder per service)
  • modules/create-vm/ - Proxmox VM creation module
  • secrets/ - Encrypted secrets (TLS certs, keys) via git-crypt
  • cli/ - Go CLI tool for infrastructure management
  • scripts/ - Helper scripts (cluster management, node updates)
  • playbooks/ - Ansible playbooks for node configuration
  • diagram/ - Infrastructure diagrams (Python-based)

Key Patterns

  • Each service in modules/kubernetes/<service>/main.tf defines its own namespace, deployments, services, and ingress
  • NFS storage from 10.0.10.15 for persistent data
  • TLS secrets managed via setup_tls_secret module
  • Ingress uses nginx-ingress with annotations for customization
  • GPU workloads use node_selector = { "gpu": "true" }
  • Services are exposed under *.viktorbarzin.me domains

NFS Volume Pattern

Prefer inline NFS volumes over separate PV/PVC resources. Use the nfs {} block directly in pod/deployment/cronjob specs:

volume {
  name = "data"
  nfs {
    server = "10.0.10.15"
    path   = "/mnt/main/<service>"
  }
}

Only use PV/PVC when a Helm chart requires existingClaim (for example, the Nextcloud Helm chart).

Adding NFS Exports

To add a new NFS exported directory (on the remote VM via executor):

  1. Edit nfs_directories.txt - add the new directory path, keep the list sorted
  2. Run nfs_exports.sh to create the NFS export
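
Step 1 can be done without hand-sorting. The sketch below assumes nfs_directories.txt holds one entry per line and ends with a newline; the function name add_nfs_dir and the example path are hypothetical.

```shell
# add_nfs_dir LIST_FILE DIRECTORY
# Appends DIRECTORY to LIST_FILE, then rewrites the file sorted and without
# duplicates. LC_ALL=C keeps the ordering byte-wise and locale-independent.
add_nfs_dir() {
  local list="$1" dir="$2"
  printf '%s\n' "$dir" >> "$list"
  LC_ALL=C sort -u -o "$list" "$list"
}
```

For example, add_nfs_dir nfs_directories.txt /mnt/main/myservice, followed by running nfs_exports.sh as in step 2. Both must go through the remote executor.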

Factory Pattern (for multi-user services)

Used when a service needs one instance per user. Structure:

modules/kubernetes/<service>/
├── main.tf           # Namespace, TLS secret, user module calls
└── factory/
    └── main.tf       # Deployment, service, ingress templates with ${var.name}

Examples: actualbudget, freedify

To add a new user:

  1. Export NFS share at /mnt/main/<service>/<username> in TrueNAS
  2. Add Cloudflare route in tfvars
  3. Add module block in main.tf calling factory

Init Container Pattern (for database migrations)

Use when a service needs to run database migrations before starting:

init_container {
  name    = "migration"
  image   = "service-image:tag"
  command = ["sh", "-c", "migration-command"]

  dynamic "env" {
    for_each = local.common_env
    content {
      name  = env.value.name
      value = env.value.value
    }
  }
}

Example: AFFiNE runs node ./scripts/self-host-predeploy.js in an init container.

SMTP/Email Configuration

When configuring services to use the mailserver:

  • Use public hostname: mail.viktorbarzin.me (for TLS cert validation)
  • Do NOT use: mailserver.mailserver.svc.cluster.local (TLS cert mismatch)
  • Port: 587 (STARTTLS)
  • Credentials: Use existing accounts from mailserver_accounts in tfvars
  • Common email: info@viktorbarzin.me for service notifications
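
To confirm that a hostname presents a certificate that matches it over STARTTLS, a check along these lines may help. It is a sketch that assumes OpenSSL 1.1.0 or newer (for -verify_hostname) and network access to port 587 from wherever it runs; the function name smtp_tls_check is hypothetical.

```shell
# smtp_tls_check HOST [PORT]
# Opens a STARTTLS SMTP session and succeeds only if the handshake completes
# and the certificate verifies for HOST. Output is discarded; only the exit
# status is meaningful.
smtp_tls_check() {
  local host="$1" port="${2:-587}"
  openssl s_client -starttls smtp -connect "$host:$port" \
    -servername "$host" -verify_hostname "$host" -verify_return_error \
    </dev/null >/dev/null 2>&1
}
```

Running smtp_tls_check mail.viktorbarzin.me should succeed, while the same check against the in-cluster service name is expected to fail because of the certificate mismatch noted above.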

Common Variables

  • tls_secret_name - TLS certificate secret name
  • tier - Deployment tier label
  • Service-specific passwords passed as variables

Service Versions (as of 2025-01)

  • Immich: v2.4.1
  • Freedify: latest (music streaming, factory pattern)
  • AFFiNE: stable (visual canvas, uses PostgreSQL + Redis)

Useful Commands

# ALWAYS use -target for terraform apply (speeds up execution)
terraform apply -target=module.kubernetes_cluster.module.<service_name>
terraform plan -target=module.kubernetes_cluster.module.<service_name>
terraform fmt -recursive
kubectl get pods -A

Terraform target examples:

  • terraform apply -target=module.kubernetes_cluster.module.monitoring - Apply monitoring
  • terraform apply -target=module.kubernetes_cluster.module.immich - Apply immich
  • terraform apply -target=module.docker-registry-vm - Apply docker registry VM
  • Only skip -target when explicitly told to apply everything
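
The service targets all share one address prefix, so a small wrapper can build them. This is a sketch with hypothetical function names (service_target, tf_service); it covers only modules nested under module.kubernetes_cluster, so top-level modules such as module.docker-registry-vm still need their address passed to terraform directly.

```shell
# service_target SERVICE
# Prints the Terraform address of a service module inside the cluster module.
service_target() {
  printf 'module.kubernetes_cluster.module.%s\n' "$1"
}

# tf_service ACTION SERVICE
# Runs a targeted "terraform plan" or "terraform apply" for one service.
# Any other ACTION is rejected with exit code 2.
tf_service() {
  local action="$1" service="$2"
  case "$action" in
    plan|apply)
      terraform "$action" -target="$(service_target "$service")"
      ;;
    *)
      echo "usage: tf_service plan|apply SERVICE" >&2
      return 2
      ;;
  esac
}
```

For example, tf_service plan immich followed by tf_service apply immich. As with any terraform command here, these run via the remote executor.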

Module Structure

Top-level modules in main.tf:

  • module.k8s-node-template - K8s node VM template
  • module.non-k8s-node-template - Non-k8s VM template
  • module.docker-registry-template - Docker registry template
  • module.docker-registry-vm - Docker registry VM
  • module.kubernetes_cluster - Main K8s cluster (contains all services)

Complete Service Catalog

DEFCON Level 1 (Critical - Network & Auth)

Service | Description | Tier
wireguard | VPN server | core
technitium | DNS server (10.0.20.101) | core
headscale | Tailscale control server | core
nginx-ingress | Ingress controller | core
xray | Proxy/tunnel | core
authentik | Identity provider (SSO) | core
cloudflared | Cloudflare tunnel | core
authelia | Auth middleware | core
monitoring | Prometheus/Grafana stack | core

DEFCON Level 2 (Storage & Security)

Service | Description | Tier
vaultwarden | Bitwarden-compatible password manager | cluster
redis | Shared Redis at redis.redis.svc.cluster.local | cluster
immich | Photo management (GPU) | gpu
nvidia | GPU device plugin | gpu
metrics-server | K8s metrics | cluster
uptime-kuma | Status monitoring | cluster
crowdsec | Security/WAF | cluster
kyverno | Policy engine | cluster

DEFCON Level 3 (Admin)

Service | Description | Tier
k8s-dashboard | Kubernetes dashboard | edge
reverse-proxy | Generic reverse proxy | edge

DEFCON Level 4 (Active Use)

Service | Description | Tier
mailserver | Email (docker-mailserver) | edge
shadowsocks | Proxy | edge
webhook_handler | Webhook processing | edge
tuya-bridge | Smart home bridge | edge
dawarich | Location history | edge
owntracks | Location tracking | edge
nextcloud | File sync/share | edge
calibre | E-book management | edge
onlyoffice | Document editing | edge
f1-stream | F1 streaming | edge
rybbit | Analytics | edge
isponsorblocktv | SponsorBlock for TV | edge
actualbudget | Budgeting (factory pattern) | aux

DEFCON Level 5 (Optional)

Service | Description | Tier
blog | Personal blog | aux
descheduler | Pod descheduler | aux
drone | CI/CD | aux
hackmd | Collaborative markdown | aux
kms | Key management | aux
privatebin | Encrypted pastebin | aux
vault | HashiCorp Vault | aux
reloader | ConfigMap/Secret reloader | aux
city-guesser | Game | aux
echo | Echo server | aux
url | URL shortener | aux
excalidraw | Whiteboard | aux
travel_blog | Travel blog | aux
dashy | Dashboard | aux
send | Firefox Send | aux
ytdlp | YouTube downloader | aux
wealthfolio | Finance tracking | aux
audiobookshelf | Audiobook server | aux
paperless-ngx | Document management | aux
jsoncrack | JSON visualizer | aux
servarr | Media automation (Sonarr/Radarr/etc) | aux
ntfy | Push notifications | aux
cyberchef | Data transformation | aux
diun | Docker image update notifier | aux
meshcentral | Remote management | aux
homepage | Dashboard/startpage | aux
matrix | Matrix chat server | aux
linkwarden | Bookmark manager | aux
changedetection | Web change detection | aux
tandoor | Recipe manager | aux
n8n | Workflow automation | aux
real-estate-crawler | Property crawler | aux
tor-proxy | Tor proxy | aux
forgejo | Git forge | aux
freshrss | RSS reader | aux
navidrome | Music streaming | aux
networking-toolbox | Network tools | aux
stirling-pdf | PDF tools | aux
speedtest | Speed testing | aux
freedify | Music streaming (factory pattern) | aux
netbox | Network documentation | aux
infra-maintenance | Maintenance jobs | aux
ollama | LLM server (GPU) | gpu
frigate | NVR/camera (GPU) | gpu
ebook2audiobook | E-book to audio (GPU) | gpu
affine | Visual canvas/whiteboard (PostgreSQL + Redis) | aux

Cloudflare Domains

Proxied (CDN + WAF enabled)

blog, hackmd, privatebin, url, echo, f1tv, excalidraw, send,
audiobookshelf, jsoncrack, ntfy, cyberchef, homepage, linkwarden,
changedetection, tandoor, n8n, stirling-pdf, dashy, city-guesser,
travel, netbox

Non-Proxied (Direct DNS)

mail, wg, headscale, immich, calibre, vaultwarden, drone,
mailserver-antispam, mailserver-admin, webhook, uptime,
owntracks, dawarich, tuya, meshcentral, nextcloud, actualbudget,
onlyoffice, forgejo, freshrss, navidrome, ollama, openwebui,
isponsorblocktv, speedtest, freedify, rybbit, paperless,
servarr, prowlarr, bazarr, radarr, sonarr, flaresolverr,
jellyfin, jellyseerr, tdarr, affine

Special Subdomains

  • *.viktor.actualbudget - Actualbudget factory instances
  • *.freedify - Freedify factory instances
  • mailserver-* - Mail server components (mailserver-antispam, mailserver-admin)

CI/CD

  • Drone CI (.drone.yml) for automated deployments
  • Auto-updates TLS certificates
  • ALWAYS add [ci skip] to commit messages when you've already run terraform apply to avoid triggering CI redundantly
  • After committing, run git push origin master to sync changes
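
The [ci skip] rule is easy to forget, so the marker can be added mechanically. A minimal sketch follows; the function name with_ci_skip is hypothetical.

```shell
# with_ci_skip MESSAGE
# Prints MESSAGE with " [ci skip]" appended, unless the marker is already
# present somewhere in the message.
with_ci_skip() {
  case "$1" in
    *"[ci skip]"*) printf '%s\n' "$1" ;;
    *)             printf '%s [ci skip]\n' "$1" ;;
  esac
}
```

For example, after an apply has already been run for a service: git commit -m "$(with_ci_skip "immich: bump image")" -- modules/kubernetes/immich/main.tf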

Infrastructure

  • Proxmox hypervisor for VMs (192.168.1.127)
  • Kubernetes cluster with GPU node (5 nodes: k8s-master + k8s-node1-4, running v1.34.2)
  • NFS server at 10.0.10.15 for storage
  • Redis shared service at redis.redis.svc.cluster.local
  • Docker registry at 10.0.20.10

GPU Node (k8s-node1)

  • Taint: nvidia.com/gpu=true:NoSchedule - Only GPU workloads can run here
  • Label: gpu=true
  • GPU workloads must have both:
    • node_selector = { "gpu": "true" }
    • toleration { key = "nvidia.com/gpu", operator = "Equal", value = "true", effect = "NoSchedule" }
  • Taint is applied via null_resource.gpu_node_taint in modules/kubernetes/nvidia/main.tf
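
To check that the taint and label described above are actually in place, something like the following may be useful. It is a sketch with hypothetical function names (node_has_gpu_taint, gpu_nodes) and relies only on standard kubectl output options; like all kubectl commands here, it runs via the remote executor.

```shell
# node_has_gpu_taint NODE
# Succeeds if NODE carries the taint nvidia.com/gpu=true:NoSchedule.
# The jsonpath template prints one "key=value:effect" line per taint.
node_has_gpu_taint() {
  kubectl get node "$1" \
    -o jsonpath='{range .spec.taints[*]}{.key}={.value}:{.effect}{"\n"}{end}' |
    grep -qxF 'nvidia.com/gpu=true:NoSchedule'
}

# gpu_nodes
# Prints the nodes labelled gpu=true, one per line, in "node/<name>" form.
gpu_nodes() {
  kubectl get nodes -l gpu=true -o name
}
```

For example: node_has_gpu_taint k8s-node1 && echo "taint present"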

Git Operations (IMPORTANT)

  • Git is slow on this repo due to many files - commands can take 30+ seconds
  • Use GIT_OPTIONAL_LOCKS=0 prefix if git hangs
  • Local SSH is blocked - use the remote executor to push, via the session file (not the legacy single-file path): echo "git push origin master" > .claude/sessions/$SESSION_ID/cmd_input.txt
  • Always commit only specific files you changed, not everything
  • ALWAYS ask user before pushing to remote - never push without explicit confirmation

Prometheus Alerts

  • Alert rules are in modules/kubernetes/monitoring/prometheus_chart_values.tpl
  • Under serverFiles.alerting_rules.yml.groups
  • Groups: "R730 Host", "Nvidia Tesla T4 GPU", "Power", "Cluster"
  • kube-state-metrics provides: kube_deployment_*, kube_statefulset_*, kube_daemonset_*

Tier System

  • 0-core: Critical infrastructure (ingress, DNS, VPN, auth)
  • 1-cluster: Cluster services (Redis, metrics, security)
  • 2-gpu: GPU workloads (Immich, Ollama, Frigate)
  • 3-edge: User-facing services
  • 4-aux: Optional/auxiliary services

User Preferences

Calendar

  • Default calendar: Nextcloud (always use unless otherwise specified)
  • Nextcloud URL: https://nextcloud.viktorbarzin.me
  • CalDAV endpoint: https://nextcloud.viktorbarzin.me/remote.php/dav/calendars/<username>/<calendar-name>/

Home Assistant

  • Default smart home: Home Assistant (always use for smart home control)
  • HA URL: https://ha-london.viktorbarzin.me
  • Script: .claude/home-assistant.py
  • Aliases: "ha" or "HA" = Home Assistant

Remote Executor

  • Always use multi-session mode - never use legacy single-file mode
  • Create a session at the start of each conversation with .claude/session-exec.sh
  • Use session files at .claude/sessions/$SESSION_ID/ for all remote commands

Development

  • Frontend framework: Svelte (user is learning it, so use Svelte for all new web apps)

Skills & Workflows

Skills are specialized workflows for common tasks. Located in .claude/skills/.

Available Skills

setup-project (.claude/skills/setup-project.md)

  • Deploy new self-hosted services from GitHub repos
  • Automated workflow: Docker image → Terraform module → Deploy
  • Handles database setup, ingress, DNS configuration
  • When to use: User provides GitHub URL or wants to deploy a new service
  • Example: "Deploy [GitHub repo] to the cluster"

setup-remote-executor (.claude/skills/setup-remote-executor.md)

  • Set up shared remote executor in new projects
  • Creates session-exec.sh wrapper for the shared executor
  • When to use: Adding Claude Code support to a new project
  • Example: "Set up remote executor for this project"

Service-Specific Notes

AFFiNE (Visual Canvas)

  • Image: ghcr.io/toeverything/affine:stable
  • Port: 3010
  • Requires: PostgreSQL + Redis
  • Migration: Init container runs node ./scripts/self-host-predeploy.js
  • Storage: NFS at /mnt/main/affine mounted to /root/.affine/storage and /root/.affine/config
  • Key env vars:
    • AFFINE_SERVER_EXTERNAL_URL - Public URL (e.g., https://affine.viktorbarzin.me)
    • AFFINE_SERVER_HTTPS - Set to true behind TLS ingress
    • DATABASE_URL - PostgreSQL connection string
    • REDIS_SERVER_HOST - Redis hostname
    • MAILER_* - SMTP configuration for email invites
  • Local-first: Data stored in browser by default; syncs to server when user creates account
  • Docs: https://docs.affine.pro/self-host-affine