# Infrastructure Repository Knowledge

## Instructions for Claude
- **When the user says "remember" something**: Always update this file (`.claude/CLAUDE.md`) with the information so it persists across sessions
- **When discovering new patterns or versions**: Add them to the appropriate section below
- **Skills available**: Check `.claude/skills/` directory for specialized workflows (e.g., `setup-project.md` for deploying new services)

## Execution Environment (CRITICAL)
- **Prefer running commands directly first** - only use remote executor as fallback if local execution fails

### Commands that work LOCALLY (macOS)
- **File operations**: Read, Edit, Write, Glob, Grep tools
- **Git commands**: git status, git log, git diff, git add, git commit, git reset, etc.
- **Basic shell**: ls, cat, head, tail, find, grep, etc.

### Commands that REQUIRE REMOTE EXECUTOR
- **terraform**: apply, plan, init, state - needs cluster access
- **kubectl**: all k8s commands - needs cluster access
- **helm**: chart operations - needs cluster access
- **docker**: container operations on remote hosts
- **ssh**: connections to infrastructure nodes
- **python/pip**: ALL Python and pip commands must run via remote executor
- **Any command interacting with**: Proxmox, Kubernetes cluster, NFS server, other infrastructure

### Remote Command Execution (FALLBACK)
For commands that fail locally, use the file-based relay. It uses a **shared executor** at `~/.claude/` on the remote VM.

**IMPORTANT: Always use multi-session mode** - create a session at the start of each conversation.

#### Shared Executor Architecture
The executor lives at `~/.claude/` on the remote VM (wizard@10.0.10.10) and serves all projects:
- `~/.claude/remote-executor.sh` - The shared command executor
- `~/.claude/session-exec.sh` - Shared session management
- `~/.claude/sessions/` → symlink to project sessions (or shared sessions directory)

Each session includes a `workdir.txt` specifying which project directory to use.

#### Multi-Session Mode (REQUIRED)
Each Claude session gets isolated command execution:

```bash
# 1. Create a session (once per Claude session)
SESSION_ID=$(.claude/session-exec.sh)

# 2. Write command to your session
echo "your-command-here" > ".claude/sessions/$SESSION_ID/cmd_input.txt"

# 3. Wait and check status
sleep 1 && cat ".claude/sessions/$SESSION_ID/cmd_status.txt"

# 4. Read output (when status is "done:*")
cat ".claude/sessions/$SESSION_ID/cmd_output.txt"

# 5. Cleanup when done (optional - auto-cleaned after 24h)
.claude/session-exec.sh "$SESSION_ID" cleanup
```

**Status values:** `ready` | `running` | `done:N` (N = exit code)

**Requires the user to start the shared executor in another terminal:**
```bash
# On wizard@10.0.10.10:
~/.claude/remote-executor.sh
```

**Session helper commands:**
- `.claude/session-exec.sh` - Create new session (returns session ID)
- `.claude/session-exec.sh <id> status` - Check session status
- `.claude/session-exec.sh <id> cleanup` - Remove a session
- `.claude/session-exec.sh _ list` - List all active sessions

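Steps 3 and 4 above can be wrapped in a polling loop. The following is a minimal sketch, not something that exists in the repository: `wait_for_session` is a hypothetical helper name, the 60-second default timeout and one-second interval are arbitrary choices, and it assumes only the session file layout and `done:N` status format described above.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not in the repo): poll a session's status file until
# it reads "done:N", print the captured output, and return N as exit code.
# Usage: wait_for_session <session_dir> [timeout_seconds]
wait_for_session() {
  local session_dir="$1"
  local timeout="${2:-60}"   # assumed default; raise it for slow terraform runs
  local status=""
  local waited=0

  while [ "$waited" -lt "$timeout" ]; do
    # The status file may not exist yet right after the command is written.
    status="$(cat "$session_dir/cmd_status.txt" 2>/dev/null || true)"
    case "$status" in
      done:*)
        cat "$session_dir/cmd_output.txt"
        return "${status#done:}"
        ;;
    esac
    sleep 1
    waited=$((waited + 1))
  done

  echo "timed out after ${timeout}s (last status: ${status:-none})" >&2
  return 124
}
```

A call might look like `wait_for_session ".claude/sessions/$SESSION_ID" 120` right after writing `cmd_input.txt`. If the executor does not reset the status file before it starts a new command, a stale `done:N` from the previous command could be picked up; whether that can happen depends on executor behavior not documented here.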
---

## Overview
Terraform-based infrastructure repository managing a home Kubernetes cluster on Proxmox VMs. Uses git-crypt for secrets encryption.

## Static File Paths (NEVER CHANGE)
- **Main config**: `terraform.tfvars` - All secrets, DNS, Cloudflare config, WireGuard peers
- **Root terraform**: `main.tf` - Proxmox provider, VM templates, kubernetes_cluster module
- **K8s services**: `modules/kubernetes/main.tf` - All service module definitions
- **Secrets**: `secrets/` - git-crypt encrypted TLS certs and keys

## Network Topology (Static IPs)
```
┌─────────────────────────────────────────────────────────────────┐
│ 10.0.10.0/24 - Management Network                               │
├─────────────────────────────────────────────────────────────────┤
│ 10.0.10.10 - Wizard (main server / remote executor host)        │
│ 10.0.10.15 - NFS Server (TrueNAS) - /mnt/main/*                 │
└─────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────┐
│ 10.0.20.0/24 - Kubernetes Network                               │
├─────────────────────────────────────────────────────────────────┤
│ 10.0.20.1 - pfSense Gateway                                     │
│ 10.0.20.10 - Docker Registry VM (MAC: DE:AD:BE:EF:22:22)        │
│ 10.0.20.100 - k8s-master                                        │
│ 10.0.20.101 - Technitium DNS                                    │
│ 10.0.20.102 - MetalLB IP Pool Start                             │
│ 10.0.20.200 - MetalLB IP Pool End                               │
└─────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────┐
│ 192.168.1.0/24 - Physical Network                               │
├─────────────────────────────────────────────────────────────────┤
│ 192.168.1.127 - Proxmox Hypervisor                              │
└─────────────────────────────────────────────────────────────────┘
```

## Domains
- **Public**: `viktorbarzin.me` (Cloudflare-managed)
- **Internal**: `viktorbarzin.lan` (Technitium DNS)

## Directory Structure
- `main.tf` - Main Terraform entry point, imports all modules
- `modules/kubernetes/` - Kubernetes service deployments (one folder per service)
- `modules/create-vm/` - Proxmox VM creation module
- `secrets/` - Encrypted secrets (TLS certs, keys) via git-crypt
- `cli/` - Go CLI tool for infrastructure management
- `scripts/` - Helper scripts (cluster management, node updates)
- `playbooks/` - Ansible playbooks for node configuration
- `diagram/` - Infrastructure diagrams (Python-based)

## Key Patterns
- Each service in `modules/kubernetes/<service>/main.tf` defines its own namespace, deployments, services, and ingress
- NFS storage from `10.0.10.15` for persistent data
- TLS secrets managed via `setup_tls_secret` module
- Ingress uses nginx-ingress with annotations for customization
- GPU workloads use `node_selector = { "gpu": "true" }`
- Services expose to `*.viktorbarzin.me` domains

### NFS Volume Pattern
**Prefer inline NFS volumes** over separate PV/PVC resources. Use the `nfs {}` block directly in pod/deployment/cronjob specs:
```hcl
volume {
  name = "data"
  nfs {
    server = "10.0.10.15"
    path   = "/mnt/main/<service>"
  }
}
```
Only use PV/PVC when the Helm chart requires `existingClaim` (like the Nextcloud Helm chart).

### Adding NFS Exports
To add a new NFS exported directory (on the remote VM via executor):
1. Edit `nfs_directories.txt` - add the new directory path, keep the list sorted
2. Run `nfs_exports.sh` to create the NFS export

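Step 1 can be done without hand-sorting the file. The sketch below is an illustration only: `add_nfs_directory` is a hypothetical helper, and it assumes `nfs_directories.txt` holds one path per line and that plain byte-order (`LC_ALL=C`) sorting matches how the list is currently ordered.

```bash
#!/usr/bin/env bash
# Hypothetical helper: add a path to a directory list and keep the file
# sorted and free of duplicates. Assumes one path per line.
# Usage: add_nfs_directory <list_file> <directory_path>
add_nfs_directory() {
  local list_file="$1"
  local dir_path="$2"

  # Make sure the list exists so sort can read it as an input file.
  touch "$list_file"
  # Merge the new entry (stdin) with the existing entries, de-duplicate,
  # and write the result back in place. POSIX allows -o to name an input.
  printf '%s\n' "$dir_path" | LC_ALL=C sort -u -o "$list_file" - "$list_file"
}
```

For example, `add_nfs_directory nfs_directories.txt /mnt/main/<service>` followed by step 2. Running it twice with the same path leaves a single entry.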
### Factory Pattern (for multi-user services)
Used when a service needs one instance per user. Structure:
```
modules/kubernetes/<service>/
├── main.tf          # Namespace, TLS secret, user module calls
└── factory/
    └── main.tf      # Deployment, service, ingress templates with ${var.name}
```
Examples: `actualbudget`, `freedify`

To add a new user:
1. Export NFS share at `/mnt/main/<service>/<username>` in TrueNAS
2. Add Cloudflare route in tfvars
3. Add module block in main.tf calling factory

### Init Container Pattern (for database migrations)
Use when a service needs to run database migrations before starting:
```hcl
init_container {
  name    = "migration"
  image   = "service-image:tag"
  command = ["sh", "-c", "migration-command"]

  dynamic "env" {
    for_each = local.common_env
    content {
      name  = env.value.name
      value = env.value.value
    }
  }
}
```
Example: AFFiNE runs `node ./scripts/self-host-predeploy.js` in init container.

### SMTP/Email Configuration
When configuring services to use the mailserver:
- **Use public hostname**: `mail.viktorbarzin.me` (for TLS cert validation)
- **Do NOT use**: `mailserver.mailserver.svc.cluster.local` (TLS cert mismatch)
- **Port**: 587 (STARTTLS)
- **Credentials**: Use existing accounts from `mailserver_accounts` in tfvars
- **Common email**: `info@viktorbarzin.me` for service notifications

## Common Variables
- `tls_secret_name` - TLS certificate secret name
- `tier` - Deployment tier label
- Service-specific passwords passed as variables

## Service Versions (as of 2025-01)
- Immich: v2.4.1
- Freedify: latest (music streaming, factory pattern)
- AFFiNE: stable (visual canvas, uses PostgreSQL + Redis)

## Useful Commands
```bash
# ALWAYS use -target for terraform apply (speeds up execution)
terraform apply -target=module.kubernetes_cluster.module.<service_name>
terraform plan -target=module.kubernetes_cluster.module.<service_name>
terraform fmt -recursive
kubectl get pods -A
```

**Terraform target examples:**
- `terraform apply -target=module.kubernetes_cluster.module.monitoring` - Apply monitoring
- `terraform apply -target=module.kubernetes_cluster.module.immich` - Apply immich
- `terraform apply -target=module.docker-registry-vm` - Apply docker registry VM
- Only skip `-target` when explicitly told to apply everything

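Because every service target shares the `module.kubernetes_cluster.module.` prefix, the address can be built from the service name. This is a small sketch under that assumption; `tf_target` is a hypothetical helper, not something defined in the repository.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the -target address for a service module
# nested under module.kubernetes_cluster.
# Usage: tf_target <service_name>
tf_target() {
  local service="${1:-}"
  if [ -z "$service" ]; then
    echo "usage: tf_target <service_name>" >&2
    return 1
  fi
  printf 'module.kubernetes_cluster.module.%s\n' "$service"
}
```

It would be used as `terraform plan -target="$(tf_target immich)"`, run through the remote executor like any other terraform command. Top-level modules such as `module.docker-registry-vm` are not nested under the cluster module and still need to be targeted by their full name.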
## Module Structure
Top-level modules in `main.tf`:
- `module.k8s-node-template` - K8s node VM template
- `module.non-k8s-node-template` - Non-k8s VM template
- `module.docker-registry-template` - Docker registry template
- `module.docker-registry-vm` - Docker registry VM
- `module.kubernetes_cluster` - Main K8s cluster (contains all services)

---

## Complete Service Catalog

### DEFCON Level 1 (Critical - Network & Auth)
| Service | Description | Tier |
|---------|-------------|------|
| wireguard | VPN server | core |
| technitium | DNS server (10.0.20.101) | core |
| headscale | Tailscale control server | core |
| nginx-ingress | Ingress controller | core |
| xray | Proxy/tunnel | core |
| authentik | Identity provider (SSO) | core |
| cloudflared | Cloudflare tunnel | core |
| authelia | Auth middleware | core |
| monitoring | Prometheus/Grafana stack | core |

### DEFCON Level 2 (Storage & Security)
| Service | Description | Tier |
|---------|-------------|------|
| vaultwarden | Bitwarden-compatible password manager | cluster |
| redis | Shared Redis at `redis.redis.svc.cluster.local` | cluster |
| immich | Photo management (GPU) | gpu |
| nvidia | GPU device plugin | gpu |
| metrics-server | K8s metrics | cluster |
| uptime-kuma | Status monitoring | cluster |
| crowdsec | Security/WAF | cluster |
| kyverno | Policy engine | cluster |

### DEFCON Level 3 (Admin)
| Service | Description | Tier |
|---------|-------------|------|
| k8s-dashboard | Kubernetes dashboard | edge |
| reverse-proxy | Generic reverse proxy | edge |

### DEFCON Level 4 (Active Use)
| Service | Description | Tier |
|---------|-------------|------|
| mailserver | Email (docker-mailserver) | edge |
| shadowsocks | Proxy | edge |
| webhook_handler | Webhook processing | edge |
| tuya-bridge | Smart home bridge | edge |
| dawarich | Location history | edge |
| owntracks | Location tracking | edge |
| nextcloud | File sync/share | edge |
| calibre | E-book management | edge |
| onlyoffice | Document editing | edge |
| f1-stream | F1 streaming | edge |
| rybbit | Analytics | edge |
| isponsorblocktv | SponsorBlock for TV | edge |
| actualbudget | Budgeting (factory pattern) | aux |

### DEFCON Level 5 (Optional)
| Service | Description | Tier |
|---------|-------------|------|
| blog | Personal blog | aux |
| descheduler | Pod descheduler | aux |
| drone | CI/CD | aux |
| hackmd | Collaborative markdown | aux |
| kms | Key management | aux |
| privatebin | Encrypted pastebin | aux |
| vault | HashiCorp Vault | aux |
| reloader | ConfigMap/Secret reloader | aux |
| city-guesser | Game | aux |
| echo | Echo server | aux |
| url | URL shortener | aux |
| excalidraw | Whiteboard | aux |
| travel_blog | Travel blog | aux |
| dashy | Dashboard | aux |
| send | Firefox Send | aux |
| ytdlp | YouTube downloader | aux |
| wealthfolio | Finance tracking | aux |
| audiobookshelf | Audiobook server | aux |
| paperless-ngx | Document management | aux |
| jsoncrack | JSON visualizer | aux |
| servarr | Media automation (Sonarr/Radarr/etc) | aux |
| ntfy | Push notifications | aux |
| cyberchef | Data transformation | aux |
| diun | Docker image update notifier | aux |
| meshcentral | Remote management | aux |
| homepage | Dashboard/startpage | aux |
| matrix | Matrix chat server | aux |
| linkwarden | Bookmark manager | aux |
| changedetection | Web change detection | aux |
| tandoor | Recipe manager | aux |
| n8n | Workflow automation | aux |
| real-estate-crawler | Property crawler | aux |
| tor-proxy | Tor proxy | aux |
| forgejo | Git forge | aux |
| freshrss | RSS reader | aux |
| navidrome | Music streaming | aux |
| networking-toolbox | Network tools | aux |
| stirling-pdf | PDF tools | aux |
| speedtest | Speed testing | aux |
| freedify | Music streaming (factory pattern) | aux |
| netbox | Network documentation | aux |
| infra-maintenance | Maintenance jobs | aux |
| ollama | LLM server (GPU) | gpu |
| frigate | NVR/camera (GPU) | gpu |
| ebook2audiobook | E-book to audio (GPU) | gpu |
| affine | Visual canvas/whiteboard (PostgreSQL + Redis) | aux |

---

## Cloudflare Domains

### Proxied (CDN + WAF enabled)
```
blog, hackmd, privatebin, url, echo, f1tv, excalidraw, send,
audiobookshelf, jsoncrack, ntfy, cyberchef, homepage, linkwarden,
changedetection, tandoor, n8n, stirling-pdf, dashy, city-guesser,
travel, netbox
```

### Non-Proxied (Direct DNS)
```
mail, wg, headscale, immich, calibre, vaultwarden, drone,
mailserver-antispam, mailserver-admin, webhook, uptime,
owntracks, dawarich, tuya, meshcentral, nextcloud, actualbudget,
onlyoffice, forgejo, freshrss, navidrome, ollama, openwebui,
isponsorblocktv, speedtest, freedify, rybbit, paperless,
servarr, prowlarr, bazarr, radarr, sonarr, flaresolverr,
jellyfin, jellyseerr, tdarr, affine
```

### Special Subdomains
- `*.viktor.actualbudget` - Actualbudget factory instances
- `*.freedify` - Freedify factory instances
- `mailserver.*` - Mail server components (antispam, admin)

---

## CI/CD
- Drone CI (`.drone.yml`) for automated deployments
- Auto-updates TLS certificates
- **ALWAYS add `[ci skip]` to commit messages** when you've already run `terraform apply` to avoid triggering CI redundantly
- **After committing, run `git push origin master`** to sync changes

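The `[ci skip]` rule is easy to forget when writing commit messages by hand. One way to make it mechanical is sketched below; `ci_skip_message` is a hypothetical helper, not part of the repository, and it assumes the marker is always appended at the end of the message.

```bash
#!/usr/bin/env bash
# Hypothetical helper: append "[ci skip]" to a commit message unless the
# marker is already present.
# Usage: ci_skip_message <message>
ci_skip_message() {
  local message="${1:-}"
  case "$message" in
    *"[ci skip]"*) printf '%s\n' "$message" ;;
    *)             printf '%s [ci skip]\n' "$message" ;;
  esac
}
```

For example, `git commit -m "$(ci_skip_message "bump immich")" -- modules/kubernetes/immich/main.tf` commits only the named file (the path is illustrative) with the marker attached.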
## Infrastructure
- Proxmox hypervisor for VMs (192.168.1.127)
- Kubernetes cluster with GPU node (5 nodes: k8s-master + k8s-node1-4, running v1.34.2)
- NFS server at 10.0.10.15 for storage
- Redis shared service at `redis.redis.svc.cluster.local`
- Docker registry at 10.0.20.10

### GPU Node (k8s-node1)
- **Taint**: `nvidia.com/gpu=true:NoSchedule` - Only GPU workloads can run here
- **Label**: `gpu=true`
- GPU workloads must have both:
  - `node_selector = { "gpu": "true" }`
  - `toleration { key = "nvidia.com/gpu", operator = "Equal", value = "true", effect = "NoSchedule" }`
- Taint is applied via `null_resource.gpu_node_taint` in `modules/kubernetes/nvidia/main.tf`

## Git Operations (IMPORTANT)
- **Git is slow** on this repo due to many files - commands can take 30+ seconds
- Use `GIT_OPTIONAL_LOCKS=0` prefix if git hangs
- **Local SSH is blocked** - push through the remote executor session: `echo "git push origin master" > .claude/sessions/$SESSION_ID/cmd_input.txt`
- Always commit only specific files you changed, not everything
- **ALWAYS ask user before pushing to remote** - never push without explicit confirmation

## Prometheus Alerts
- Alert rules are in `modules/kubernetes/monitoring/prometheus_chart_values.tpl`
- Under `serverFiles.alerting_rules.yml.groups`
- Groups: "R730 Host", "Nvidia Tesla T4 GPU", "Power", "Cluster"
- kube-state-metrics provides: `kube_deployment_*`, `kube_statefulset_*`, `kube_daemonset_*`

## Tier System
- **0-core**: Critical infrastructure (ingress, DNS, VPN, auth)
- **1-cluster**: Cluster services (Redis, metrics, security)
- **2-gpu**: GPU workloads (Immich, Ollama, Frigate)
- **3-edge**: User-facing services
- **4-aux**: Optional/auxiliary services

---

## User Preferences

### Calendar
- **Default calendar**: Nextcloud (always use unless otherwise specified)
- **Nextcloud URL**: `https://nextcloud.viktorbarzin.me`
- **CalDAV endpoint**: `https://nextcloud.viktorbarzin.me/remote.php/dav/calendars/<username>/<calendar-name>/`

### Home Assistant
- **Default smart home**: Home Assistant (always use for smart home control)
- **HA URL**: `https://ha-london.viktorbarzin.me`
- **Script**: `.claude/home-assistant.py`
- **Aliases**: "ha" or "HA" = Home Assistant

### Remote Executor
- **Always use multi-session mode** - never use legacy single-file mode
- Create a session at the start of each conversation with `.claude/session-exec.sh`
- Use session files at `.claude/sessions/$SESSION_ID/` for all remote commands

### Development
- **Frontend framework**: Svelte (user is learning it, so use Svelte for all new web apps)

---

## Skills & Workflows

Skills are specialized workflows for common tasks. Located in `.claude/skills/`.

### Available Skills

**setup-project** (`.claude/skills/setup-project.md`)
- Deploy new self-hosted services from GitHub repos
- Automated workflow: Docker image → Terraform module → Deploy
- Handles database setup, ingress, DNS configuration
- **When to use**: User provides GitHub URL or wants to deploy a new service
- **Example**: "Deploy [GitHub repo] to the cluster"

**setup-remote-executor** (`.claude/skills/setup-remote-executor.md`)
- Set up shared remote executor in new projects
- Creates session-exec.sh wrapper for the shared executor
- **When to use**: Adding Claude Code support to a new project
- **Example**: "Set up remote executor for this project"

---

## Service-Specific Notes

### AFFiNE (Visual Canvas)
- **Image**: `ghcr.io/toeverything/affine:stable`
- **Port**: 3010
- **Requires**: PostgreSQL + Redis
- **Migration**: Init container runs `node ./scripts/self-host-predeploy.js`
- **Storage**: NFS at `/mnt/main/affine` mounted to `/root/.affine/storage` and `/root/.affine/config`
- **Key env vars**:
  - `AFFINE_SERVER_EXTERNAL_URL` - Public URL (e.g., `https://affine.viktorbarzin.me`)
  - `AFFINE_SERVER_HTTPS` - Set to `true` behind TLS ingress
  - `DATABASE_URL` - PostgreSQL connection string
  - `REDIS_SERVER_HOST` - Redis hostname
  - `MAILER_*` - SMTP configuration for email invites
- **Local-first**: Data stored in browser by default; syncs to server when user creates account
- **Docs**: https://docs.affine.pro/self-host-affine