2026-01-18 11:04:51 +00:00
# Infrastructure Repository Knowledge
## Instructions for Claude
- **When the user says "remember" something**: Always update this file (`.claude/CLAUDE.md` ) with the information so it persists across sessions
- **When discovering new patterns or versions**: Add them to the appropriate section below
2026-02-07 10:43:43 +00:00
- **When making infrastructure changes**: Always update this file to reflect the current state (new services, removed services, version changes, config changes)
- **After every significant change**: Proactively update this file (`.claude/CLAUDE.md` ) to reflect what changed — new services, config changes, version bumps, new patterns, etc. This ensures knowledge persists across sessions automatically.
- **After updating any `.claude/` files**: Always commit them immediately (`git add .claude/ && git commit -m "[ci skip] update claude knowledge"` ) to avoid building up unstaged changes.
2026-02-06 20:10:02 +00:00
- **Skills available**: Check `.claude/skills/` directory for specialized workflows (e.g., `setup-project.md` for deploying new services)
2026-02-22 14:12:37 +00:00
- **CRITICAL: All infrastructure changes must go through Terraform/Terragrunt**. NEVER modify cluster resources directly (e.g., via kubectl apply/edit/patch, helm install, docker run). Always make changes in the Terraform `.tf` files and apply with `terragrunt apply` . The real cluster state must never deviate from what's defined in Terraform — if a manual change is unavoidable (e.g., containerd config on running nodes), document it and ensure the Terraform templates match so future provisioning is consistent. Use `kubectl` only for read-only operations (get, describe, logs) and ephemeral debugging (run --rm, delete stuck pods), never for persistent state changes.
2026-02-16 23:06:41 +00:00
- **CRITICAL: NEVER put sensitive data (API keys, passwords, tokens, credentials) into committed files** unless they are encrypted (e.g., via git-crypt). Secrets belong in `terraform.tfvars` (which is git-crypt encrypted) or in the `secrets/` directory. Never hardcode credentials in `.tf` files, scripts, `.claude/` files, or any other unencrypted committed file. Always pass secrets through the Terraform variable chain (`terraform.tfvars` → `main.tf` → module variables).
2026-02-19 23:48:44 +00:00
- **CRITICAL: NEVER commit secrets** — triple-check before every commit that no API keys, passwords, tokens, or credentials are included in unencrypted files. This is a hard rule with zero exceptions.
- **New services MUST have CI/CD**: Set up Drone CI pipeline (`.drone.yml` ) with GitHub/GitLab repo integration. Services should auto-build and auto-deploy.
- **New services MUST have monitoring**: Every new service should have monitoring via Prometheus (alerts/metrics) and/or Uptime Kuma (HTTP health checks). Add both when possible.
2026-01-18 11:04:51 +00:00
2026-02-07 10:43:43 +00:00
## Execution Environment
2026-01-18 15:40:43 +00:00
- **File operations**: Read, Edit, Write, Glob, Grep tools
- **Git commands**: git status, git log, git diff, git add, git commit, git reset, etc.
2026-02-22 14:12:37 +00:00
- **Shell commands**: All tools (terraform, terragrunt, kubectl, helm, python, etc.) are available locally
- **CRITICAL: Always run terragrunt/terraform locally**, never on the remote server via SSH:
2026-02-07 13:41:41 +00:00
```bash
2026-02-22 14:12:37 +00:00
cd stacks/< service > & & terragrunt apply --non-interactive
2026-02-07 13:41:41 +00:00
```
- **kubectl**: Use `kubectl --kubeconfig $(pwd)/config` for cluster access
2026-02-21 19:14:41 +00:00
- **GitHub API**: Use `curl` with token from tfvars (see GitHub & Drone CI section below). `gh` CLI is blocked by sandbox restrictions.
- **Drone CI API**: Use `curl` with token from tfvars (see GitHub & Drone CI section below).
2026-01-18 15:40:43 +00:00
2026-01-18 11:04:51 +00:00
---
## Overview
2026-02-22 14:12:37 +00:00
Terragrunt-based infrastructure repository managing a home Kubernetes cluster on Proxmox VMs, with per-service state isolation. Each service has its own Terragrunt stack under `stacks/` , enabling fast, independent plan/apply cycles. Uses git-crypt for secrets encryption.
2026-01-18 11:04:51 +00:00
2026-02-06 20:10:02 +00:00
## Static File Paths (NEVER CHANGE)
- **Main config**: `terraform.tfvars` - All secrets, DNS, Cloudflare config, WireGuard peers
2026-02-22 14:12:37 +00:00
- **Root Terragrunt**: `terragrunt.hcl` - Root Terragrunt config (providers, backend, var loading)
2026-02-22 15:13:55 +00:00
- **Service stacks**: `stacks/<service>/` - Individual service stacks (each has `terragrunt.hcl` + `main.tf` with resources inline)
2026-02-22 14:12:37 +00:00
- **Infra stack**: `stacks/infra/` - Proxmox VM resources (templates, docker-registry, VMs)
2026-02-22 14:39:22 +00:00
- **Platform stack**: `stacks/platform/` - Core infrastructure services (22 modules in `modules/` subdir)
2026-02-22 14:12:37 +00:00
- **Per-stack state**: `state/stacks/<service>/terraform.tfstate` - Per-stack state files (gitignored)
2026-02-22 15:13:55 +00:00
- **Service resources**: `stacks/<service>/main.tf` - Service resources defined directly in stack root
2026-02-22 14:39:22 +00:00
- **Platform modules**: `stacks/platform/modules/<service>/` - Platform service modules
2026-02-22 15:23:27 +00:00
- **Shared modules**: `modules/kubernetes/ingress_factory/` , `modules/kubernetes/setup_tls_secret/`
2026-02-06 20:10:02 +00:00
- **Secrets**: `secrets/` - git-crypt encrypted TLS certs and keys
## Network Topology (Static IPs)
```
┌─────────────────────────────────────────────────────────────────┐
│ 10.0.10.0/24 - Management Network │
├─────────────────────────────────────────────────────────────────┤
2026-02-07 10:43:43 +00:00
│ 10.0.10.10 - Wizard (main server) │
2026-02-06 20:10:02 +00:00
│ 10.0.10.15 - NFS Server (TrueNAS) - /mnt/main/* │
└─────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│ 10.0.20.0/24 - Kubernetes Network │
├─────────────────────────────────────────────────────────────────┤
│ 10.0.20.1 - pfSense Gateway │
│ 10.0.20.10 - Docker Registry VM (MAC: DE:AD:BE:EF:22:22) │
│ 10.0.20.100 - k8s-master │
│ 10.0.20.101 - Technitium DNS │
│ 10.0.20.102 - MetalLB IP Pool Start │
│ 10.0.20.200 - MetalLB IP Pool End │
└─────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│ 192.168.1.0/24 - Physical Network │
├─────────────────────────────────────────────────────────────────┤
│ 192.168.1.127 - Proxmox Hypervisor │
└─────────────────────────────────────────────────────────────────┘
```
## Domains
- **Public**: `viktorbarzin.me` (Cloudflare-managed)
- **Internal**: `viktorbarzin.lan` (Technitium DNS)
2026-01-18 11:04:51 +00:00
## Directory Structure
2026-02-22 14:12:37 +00:00
- `terragrunt.hcl` - Root Terragrunt configuration (providers, backend, variable loading)
- `stacks/` - Individual Terragrunt stacks (one per service)
- `stacks/infra/` - Proxmox VM resources (templates, docker-registry)
2026-02-22 14:39:22 +00:00
- `stacks/platform/` - Core infrastructure (22 services in `stacks/platform/modules/` )
2026-02-22 15:13:55 +00:00
- `stacks/<service>/` - Individual service stacks (resources directly in `main.tf` )
2026-02-22 14:39:22 +00:00
- `stacks/platform/modules/<service>/` - Platform service module source code
2026-02-22 15:23:27 +00:00
- `modules/kubernetes/` - **Only shared utility modules** : `ingress_factory/` , `setup_tls_secret/`
2026-01-18 11:04:51 +00:00
- `modules/create-vm/` - Proxmox VM creation module
2026-02-22 14:12:37 +00:00
- `state/` - Per-stack Terraform state files (gitignored)
2026-01-18 11:04:51 +00:00
- `secrets/` - Encrypted secrets (TLS certs, keys) via git-crypt
- `cli/` - Go CLI tool for infrastructure management
- `scripts/` - Helper scripts (cluster management, node updates)
- `playbooks/` - Ansible playbooks for node configuration
- `diagram/` - Infrastructure diagrams (Python-based)
## Key Patterns
- Each service in `modules/kubernetes/<service>/main.tf` defines its own namespace, deployments, services, and ingress
- NFS storage from `10.0.10.15` for persistent data
- TLS secrets managed via `setup_tls_secret` module
2026-02-07 20:46:14 +00:00
- Ingress uses Traefik (Helm chart, 3 replicas) with HTTP/3 (QUIC) enabled, Middleware CRDs for rate limiting, auth, CSP headers, CrowdSec bouncer, and analytics injection
- HTTP/3 enabled on Traefik (`http3.enabled=true` , `advertisedPort=443` on websecure entrypoint) and Cloudflare (`cloudflare_zone_settings_override` with `http3="on"` )
2026-01-18 11:04:51 +00:00
- GPU workloads use `node_selector = { "gpu": "true" }`
- Services expose to `*.viktorbarzin.me` domains
### NFS Volume Pattern
**Prefer inline NFS volumes** over separate PV/PVC resources. Use the `nfs {}` block directly in pod/deployment/cronjob specs:
```hcl
volume {
name = "data"
nfs {
server = "10.0.10.15"
path = "/mnt/main/< service > "
}
}
```
Only use PV/PVC when the Helm chart requires `existingClaim` (like the Nextcloud Helm chart).
2026-02-06 20:10:02 +00:00
### Adding NFS Exports
2026-02-07 10:43:43 +00:00
To add a new NFS exported directory:
2026-02-08 02:41:42 +00:00
1. Edit `secrets/nfs_directories.txt` - add the new directory path, keep the list sorted
2. Run `secrets/nfs_exports.sh` from the `secrets/` directory to update the NFS share via TrueNAS API
2026-02-06 20:10:02 +00:00
2026-01-18 11:04:51 +00:00
### Factory Pattern (for multi-user services)
Used when a service needs one instance per user. Structure:
```
2026-02-22 15:13:55 +00:00
stacks/< service > /
2026-01-18 11:04:51 +00:00
├── main.tf # Namespace, TLS secret, user module calls
└── factory/
└── main.tf # Deployment, service, ingress templates with ${var.name}
```
Examples: `actualbudget` , `freedify`
To add a new user:
1. Export NFS share at `/mnt/main/<service>/<username>` in TrueNAS
2. Add Cloudflare route in tfvars
3. Add module block in main.tf calling factory
2026-02-06 20:10:02 +00:00
### Init Container Pattern (for database migrations)
Use when a service needs to run database migrations before starting:
```hcl
init_container {
name = "migration"
image = "service-image:tag"
command = ["sh", "-c", "migration-command"]
dynamic "env" {
for_each = local.common_env
content {
name = env.value.name
value = env.value.value
}
}
}
```
Example: AFFiNE runs `node ./scripts/self-host-predeploy.js` in init container.
### SMTP/Email Configuration
When configuring services to use the mailserver:
- **Use public hostname**: `mail.viktorbarzin.me` (for TLS cert validation)
- **Do NOT use**: `mailserver.mailserver.svc.cluster.local` (TLS cert mismatch)
- **Port**: 587 (STARTTLS)
- **Credentials**: Use existing accounts from `mailserver_accounts` in tfvars
- **Common email**: `info@viktorbarzin.me` for service notifications
2026-02-22 14:12:37 +00:00
### Terragrunt Architecture
- Root `terragrunt.hcl` provides DRY provider, backend, and variable loading for all stacks
2026-02-22 15:13:55 +00:00
- Each stack contains its resources directly: `stacks/<service>/main.tf` has variable declarations, locals, and all Terraform resources inline
2026-02-22 14:39:22 +00:00
- Platform modules live at `stacks/platform/modules/<service>/` , referenced as `source = "./modules/<service>"`
- Shared utility modules (`ingress_factory` , `setup_tls_secret` , `dockerhub_secret` , `oauth-proxy` ) remain at `modules/kubernetes/` and are referenced with relative paths from each module
2026-02-22 14:12:37 +00:00
- State isolation: each stack has its own state file at `state/stacks/<service>/terraform.tfstate`
- Dependencies: service stacks depend on `platform` stack via `dependency` block in their `terragrunt.hcl`
- Variables loaded from `terraform.tfvars` automatically (unused vars silently ignored via `extra_arguments` )
- `secrets/` symlinks in each stack for TLS cert resolution (`path.root` workaround)
- Terragrunt v0.99+: use `--non-interactive` (not `--terragrunt-non-interactive` )
- run-all syntax: `terragrunt run --all -- <command>` (not `terragrunt run-all` )
- The `platform` stack bundles ~22 core services that have cross-dependencies (traefik, monitoring, authentik, etc.)
- Individual service stacks are for services that can be deployed independently
### Adding a New Service
When adding a new service to the cluster:
2026-02-22 14:39:22 +00:00
1. Create `stacks/<service>/` directory with:
2026-02-22 14:12:37 +00:00
- `terragrunt.hcl` - Include root config, declare `platform` dependency
2026-02-22 15:13:55 +00:00
- `main.tf` - All resources defined directly (variables, locals, namespace, deployments, services, ingress)
2026-02-22 14:12:37 +00:00
- `secrets` - Symlink to `../../secrets` (for TLS cert path resolution)
2026-02-22 14:39:22 +00:00
2. Add Cloudflare DNS record in `terraform.tfvars` (`cloudflare_proxied_names` or `cloudflare_non_proxied_names` )
3. Apply the cloudflared stack: `cd stacks/platform && terragrunt apply --non-interactive`
4. Apply the new service: `cd stacks/<service> && terragrunt apply --non-interactive`
2026-02-22 14:12:37 +00:00
2026-01-18 11:04:51 +00:00
## Common Variables
- `tls_secret_name` - TLS certificate secret name
- `tier` - Deployment tier label
- Service-specific passwords passed as variables
2026-02-18 22:39:58 +00:00
## Service Versions (as of 2026-02)
2026-01-18 11:04:51 +00:00
- Immich: v2.4.1
- Freedify: latest (music streaming, factory pattern)
2026-02-06 20:10:02 +00:00
- AFFiNE: stable (visual canvas, uses PostgreSQL + Redis)
2026-02-08 00:09:03 +00:00
- Wyoming Whisper: latest (STT for Home Assistant, CPU on GPU node)
2026-02-08 01:55:30 +00:00
- Health: latest (Apple Health data dashboard, Svelte + FastAPI + Caddy, uses PostgreSQL)
2026-02-08 02:18:14 +00:00
- Gramps Web: latest (genealogy, uses Redis + Celery)
2026-02-13 23:46:01 +00:00
- Loki: 3.6.5 (log aggregation, single binary, 6Gi RAM, 24h in-memory chunks)
- Alloy: v1.13.0 (log collector DaemonSet, forwards to Loki)
2026-02-18 22:39:58 +00:00
- OpenClaw: 2026.2.9 (AI agent gateway, authentik-protected)
2026-01-18 11:04:51 +00:00
## Useful Commands
```bash
2026-02-16 19:45:31 +00:00
# Cluster health check — ALWAYS use this to check cluster status
bash scripts/cluster_healthcheck.sh # Full color report
bash scripts/cluster_healthcheck.sh --quiet # Only WARN/FAIL
bash scripts/cluster_healthcheck.sh --json # Machine-readable
bash scripts/cluster_healthcheck.sh --fix # Auto-delete evicted pods
2026-02-22 14:12:37 +00:00
# Apply a single service stack
cd stacks/< service > & & terragrunt apply --non-interactive
# Plan a single service stack
cd stacks/< service > & & terragrunt plan --non-interactive
# Plan all stacks (full DAG)
cd stacks & & terragrunt run --all --non-interactive -- plan
# Apply all stacks (full DAG)
cd stacks & & terragrunt run --all --non-interactive -- apply
# Format all terraform files
2026-01-18 11:04:51 +00:00
terraform fmt -recursive
2026-02-22 14:12:37 +00:00
2026-01-18 11:04:51 +00:00
kubectl get pods -A
```
2026-02-16 19:45:31 +00:00
**Cluster Health Check** (`scripts/cluster_healthcheck.sh` ):
2026-02-16 21:19:49 +00:00
- **ALWAYS use this script** to check cluster health — whether the user asks explicitly, after deploying/updating services, or whenever you need to verify cluster state. Never use ad-hoc kubectl commands to assess overall cluster health; use the script instead.
2026-02-21 23:57:04 +00:00
- Runs 24 checks: nodes, resources, conditions, pods, evicted, DaemonSets, deployments, PVCs, HPAs, CronJobs, CrowdSec, ingress, Prometheus alerts, Uptime Kuma, ResourceQuota pressure, StatefulSets, node disk, Helm releases, Kyverno, NFS, DNS, TLS certs, GPU, Cloudflare tunnel
2026-02-16 19:45:31 +00:00
- **When adding new healthchecks or monitoring**: Always update this script to validate the new component
2026-02-22 14:12:37 +00:00
**Terragrunt apply examples:**
- `cd stacks/monitoring && terragrunt apply --non-interactive` - Apply monitoring
- `cd stacks/immich && terragrunt apply --non-interactive` - Apply immich
- `cd stacks/infra && terragrunt apply --non-interactive` - Apply Proxmox VMs / docker registry
- `cd stacks/platform && terragrunt apply --non-interactive` - Apply all core/platform services
2026-01-18 11:04:51 +00:00
2026-02-22 14:12:37 +00:00
**IMPORTANT: When deploying a new service**, you must ALSO apply the `platform` stack (which includes `cloudflared` ) to create the Cloudflare DNS record:
2026-02-08 02:21:29 +00:00
```bash
2026-02-22 14:12:37 +00:00
cd stacks/platform & & terragrunt apply --non-interactive
2026-02-08 02:21:29 +00:00
```
2026-02-22 14:12:37 +00:00
Adding a name to `cloudflare_non_proxied_names` or `cloudflare_proxied_names` in `terraform.tfvars` only defines the record — it won't be created until the platform stack (which contains cloudflared) is applied.
2026-02-08 02:21:29 +00:00
2026-02-22 14:12:37 +00:00
## Stack Structure
Terragrunt stacks under `stacks/` :
- `stacks/infra/` - Proxmox VMs, templates, docker-registry
2026-02-22 14:39:22 +00:00
- `stacks/platform/` - Core infrastructure (~22 services in `modules/` subdir)
2026-02-22 15:13:55 +00:00
- `stacks/<service>/` - Individual service stacks (resources directly in `main.tf` )
2026-02-22 14:12:37 +00:00
Each stack's `terragrunt.hcl` includes the root `terragrunt.hcl` which provides:
- Kubernetes + Helm providers (configured from `terraform.tfvars` )
- Local backend with per-stack state file (`state/stacks/<service>/terraform.tfstate` )
- Automatic loading of `terraform.tfvars` with unused vars ignored
2026-01-18 11:04:51 +00:00
2026-02-06 20:10:02 +00:00
---
## Complete Service Catalog
2026-02-22 14:12:37 +00:00
### Critical - Network & Auth (Tier: core)
| Service | Description | Stack |
|---------|-------------|-------|
| wireguard | VPN server | platform |
| technitium | DNS server (10.0.20.101) | platform |
| headscale | Tailscale control server | platform |
| traefik | Ingress controller (Helm) | platform |
| xray | Proxy/tunnel | platform |
| authentik | Identity provider (SSO) | platform |
| cloudflared | Cloudflare tunnel | platform |
| authelia | Auth middleware | platform |
| monitoring | Prometheus/Grafana/Loki stack | platform |
### Storage & Security (Tier: cluster)
| Service | Description | Stack |
|---------|-------------|-------|
| vaultwarden | Bitwarden-compatible password manager | platform |
| redis | Shared Redis at `redis.redis.svc.cluster.local` | platform |
| immich | Photo management (GPU) | immich |
| nvidia | GPU device plugin | platform |
| metrics-server | K8s metrics | platform |
| uptime-kuma | Status monitoring | platform |
| crowdsec | Security/WAF | platform |
| kyverno | Policy engine | platform |
### Admin
| Service | Description | Stack |
|---------|-------------|-------|
| k8s-dashboard | Kubernetes dashboard | platform |
| reverse-proxy | Generic reverse proxy | platform |
### Active Use
| Service | Description | Stack |
|---------|-------------|-------|
| mailserver | Email (docker-mailserver) | mailserver |
| shadowsocks | Proxy | shadowsocks |
| webhook_handler | Webhook processing | webhook_handler |
| tuya-bridge | Smart home bridge | tuya-bridge |
| dawarich | Location history | dawarich |
| owntracks | Location tracking | owntracks |
| nextcloud | File sync/share | nextcloud |
| calibre | E-book management | calibre |
| onlyoffice | Document editing | onlyoffice |
| f1-stream | F1 streaming | f1-stream |
| rybbit | Analytics | rybbit |
| isponsorblocktv | SponsorBlock for TV | isponsorblocktv |
| actualbudget | Budgeting (factory pattern) | actualbudget |
### Optional
| Service | Description | Stack |
|---------|-------------|-------|
| blog | Personal blog | blog |
| descheduler | Pod descheduler | descheduler |
| drone | CI/CD | drone |
| hackmd | Collaborative markdown | hackmd |
| kms | Key management | kms |
| privatebin | Encrypted pastebin | privatebin |
| vault | HashiCorp Vault | vault |
| reloader | ConfigMap/Secret reloader | reloader |
| city-guesser | Game | city-guesser |
| echo | Echo server | echo |
| url | URL shortener | url |
| excalidraw | Whiteboard | excalidraw |
| travel_blog | Travel blog | travel_blog |
| dashy | Dashboard | dashy |
| send | Firefox Send | send |
| ytdlp | YouTube downloader | ytdlp |
| wealthfolio | Finance tracking | wealthfolio |
| audiobookshelf | Audiobook server | audiobookshelf |
| paperless-ngx | Document management | paperless-ngx |
| jsoncrack | JSON visualizer | jsoncrack |
| servarr | Media automation (Sonarr/Radarr/etc) | servarr |
| ntfy | Push notifications | ntfy |
| cyberchef | Data transformation | cyberchef |
| diun | Docker image update notifier | diun |
| meshcentral | Remote management | meshcentral |
| homepage | Dashboard/startpage | homepage |
| matrix | Matrix chat server | matrix |
| linkwarden | Bookmark manager | linkwarden |
| changedetection | Web change detection | changedetection |
| tandoor | Recipe manager | tandoor |
| n8n | Workflow automation | n8n |
| real-estate-crawler | Property crawler | real-estate-crawler |
| tor-proxy | Tor proxy | tor-proxy |
| forgejo | Git forge | forgejo |
| freshrss | RSS reader | freshrss |
| navidrome | Music streaming | navidrome |
| networking-toolbox | Network tools | networking-toolbox |
| stirling-pdf | PDF tools | stirling-pdf |
| speedtest | Speed testing | speedtest |
| freedify | Music streaming (factory pattern) | freedify |
| netbox | Network documentation | netbox |
| infra-maintenance | Maintenance jobs | infra-maintenance |
| ollama | LLM server (GPU) | ollama |
| frigate | NVR/camera (GPU) | frigate |
| ebook2audiobook | E-book to audio (GPU) | ebook2audiobook |
| affine | Visual canvas/whiteboard (PostgreSQL + Redis) | affine |
| health | Apple Health data dashboard (PostgreSQL) | health |
| whisper | Wyoming Faster Whisper STT (CPU on GPU node) | whisper |
| grampsweb | Genealogy web app (Gramps Web) | grampsweb |
| openclaw | AI agent gateway (OpenClaw) | openclaw |
2026-02-06 20:10:02 +00:00
---
## Cloudflare Domains
### Proxied (CDN + WAF enabled)
```
blog, hackmd, privatebin, url, echo, f1tv, excalidraw, send,
audiobookshelf, jsoncrack, ntfy, cyberchef, homepage, linkwarden,
changedetection, tandoor, n8n, stirling-pdf, dashy, city-guesser,
travel, netbox
```
### Non-Proxied (Direct DNS)
```
mail, wg, headscale, immich, calibre, vaultwarden, drone,
mailserver-antispam, mailserver-admin, webhook, uptime,
owntracks, dawarich, tuya, meshcentral, nextcloud, actualbudget,
onlyoffice, forgejo, freshrss, navidrome, ollama, openwebui,
isponsorblocktv, speedtest, freedify, rybbit, paperless,
servarr, prowlarr, bazarr, radarr, sonarr, flaresolverr,
2026-02-18 22:39:58 +00:00
jellyfin, jellyseerr, tdarr, affine, health, family, openclaw
2026-02-06 20:10:02 +00:00
```
### Special Subdomains
- `*.viktor.actualbudget` - Actualbudget factory instances
- `*.freedify` - Freedify factory instances
- `mailserver.*` - Mail server components (antispam, admin)
---
2026-01-18 11:04:51 +00:00
## CI/CD
- Drone CI (`.drone.yml` ) for automated deployments
2026-02-22 17:47:06 +00:00
- **Default pipeline**: On push, runs `terragrunt run --all -- apply -auto-approve` across all stacks (installs Terraform 1.5.7 + Terragrunt 0.99.4 in Alpine)
- **TLS renewal pipeline**: Cron-triggered, runs `renew2.sh` (certbot + Cloudflare DNS) — no Terraform/Terragrunt needed
- **Build CLI pipeline**: Builds Docker image from `cli/Dockerfile` (unchanged)
2026-01-18 11:04:51 +00:00
- **ALWAYS add `[ci skip]` to commit messages** when you've already run `terraform apply` to avoid triggering CI redundantly
- **After committing, run `git push origin master` ** to sync changes
2026-02-21 19:14:41 +00:00
## GitHub & Drone CI
### GitHub API Access
- **Username**: `ViktorBarzin`
- **Token location**: `terraform.tfvars` as `github_pat` (git-crypt encrypted)
- **Read token**: `grep github_pat terraform.tfvars | cut -d'"' -f2`
- **Scopes**: Full access — `repo` , `admin:public_key` , `admin:repo_hook` , `delete_repo` , `admin:org` , `workflow` , `write:packages` , and more
- **`gh` CLI**: Blocked by sandbox restrictions — use `curl` with the GitHub API instead
#### Common API Patterns
```bash
# Read token from tfvars
GITHUB_TOKEN=$(grep github_pat terraform.tfvars | cut -d'"' -f2)
# List repos
curl -s -H "Authorization: token $GITHUB_TOKEN" "https://api.github.com/users/ViktorBarzin/repos?per_page=100"
# Create repo
curl -s -X POST -H "Authorization: token $GITHUB_TOKEN" "https://api.github.com/user/repos" \
-d '{"name":"repo-name","private":true}'
# Add deploy key
curl -s -X POST -H "Authorization: token $GITHUB_TOKEN" "https://api.github.com/repos/ViktorBarzin/< repo > /keys" \
-d '{"title":"key-name","key":"ssh-ed25519 ...","read_only":false}'
# Create webhook (e.g., for Drone CI)
curl -s -X POST -H "Authorization: token $GITHUB_TOKEN" "https://api.github.com/repos/ViktorBarzin/< repo > /hooks" \
-d '{"config":{"url":"https://drone.viktorbarzin.me/hook","content_type":"json","secret":"..."},"events":["push","pull_request"]}'
# Get repo info
curl -s -H "Authorization: token $GITHUB_TOKEN" "https://api.github.com/repos/ViktorBarzin/< repo > "
```
### Drone CI API Access
- **Server**: `https://drone.viktorbarzin.me`
- **Token location**: `terraform.tfvars` as `drone_api_token` (git-crypt encrypted)
- **Read token**: `grep drone_api_token terraform.tfvars | cut -d'"' -f2`
- **Username**: `ViktorBarzin`
#### Common API Patterns
```bash
# Read token from tfvars
DRONE_TOKEN=$(grep drone_api_token terraform.tfvars | cut -d'"' -f2)
# List repos
curl -s -H "Authorization: Bearer $DRONE_TOKEN" "https://drone.viktorbarzin.me/api/repos"
# Activate repo in Drone
curl -s -X POST -H "Authorization: Bearer $DRONE_TOKEN" "https://drone.viktorbarzin.me/api/repos/ViktorBarzin/< repo > "
# Trigger build
curl -s -X POST -H "Authorization: Bearer $DRONE_TOKEN" "https://drone.viktorbarzin.me/api/repos/ViktorBarzin/< repo > /builds"
# Get build info
curl -s -H "Authorization: Bearer $DRONE_TOKEN" "https://drone.viktorbarzin.me/api/repos/ViktorBarzin/< repo > /builds/< build-number > "
# Add secret to repo
curl -s -X POST -H "Authorization: Bearer $DRONE_TOKEN" "https://drone.viktorbarzin.me/api/repos/ViktorBarzin/< repo > /secrets" \
-d '{"name":"secret_name","data":"secret_value"}'
```
### Capabilities
With these tokens, Claude can:
- **GitHub**: Create/delete repos, push code, manage SSH/deploy keys, manage webhooks, manage org settings, manage packages
- **Drone CI**: Activate repos, trigger/monitor builds, manage secrets, configure pipelines
2026-01-18 11:04:51 +00:00
## Infrastructure
2026-02-06 20:10:02 +00:00
- Proxmox hypervisor for VMs (192.168.1.127)
2026-01-18 13:40:31 +00:00
- Kubernetes cluster with GPU node (5 nodes: k8s-master + k8s-node1-4, running v1.34.2)
2026-01-18 11:04:51 +00:00
- NFS server at 10.0.10.15 for storage
- Redis shared service at `redis.redis.svc.cluster.local`
2026-02-14 13:09:03 +00:00
- Docker registry pull-through cache at 10.0.20.10 (static IP via cloud-init)
- Port 5000: docker.io (Docker Hub, with auth)
- Port 5010: ghcr.io
- Port 5020: quay.io
- Port 5030: registry.k8s.io
- Port 5040: reg.kyverno.io
- Worker nodes use `config_path = "/etc/containerd/certs.d"` with per-registry `hosts.toml` files
- k8s-master does NOT use pull-through cache (containerd 1.6.x incompatibility with config_path + mirrors)
2026-02-06 20:10:02 +00:00
2026-02-07 21:37:38 +00:00
### Proxmox Host Hardware
- **CPU**: Intel Xeon E5-2699 v4 @ 2.20GHz (22 cores / 44 threads, single socket)
- **RAM**: 142 GB (Dell R730 server)
- **GPU**: NVIDIA Tesla T4 (PCIe passthrough to k8s-node1)
- **Disks**: 1.1TB + 931GB + 10.7TB (local storage)
- **Proxmox access**: `ssh root@192.168.1.127`
### Proxmox Network Bridges
- **vmbr0**: Physical bridge on `eno1` , IP `192.168.1.127/24` — connects to physical/home network (192.168.1.0/24)
- **vmbr1**: Internal-only bridge (no physical port), VLAN-aware — carries VLAN 10 (management 10.0.10.0/24) and VLAN 20 (kubernetes 10.0.20.0/24)
### Proxmox VM Inventory
| VMID | Name | Status | CPUs | RAM | Network | Disk | Notes |
|------|------|--------|------|-----|---------|------|-------|
| 101 | pfsense | running | 8 | 16GB | vmbr0, vmbr1:vlan10, vmbr1:vlan20 | 32G | Gateway/firewall, routes between all networks |
| 102 | devvm | running | 16 | 8GB | vmbr1:vlan10 | 100G | Development VM on management network |
| 103 | home-assistant | running | 8 | 16GB | vmbr1:vlan10(down), vmbr0 | 32G | Home Assistant, net0 link disabled, uses vmbr0 |
| 105 | pbs | stopped | 16 | 8GB | vmbr1:vlan10 | 32G | Proxmox Backup Server (not in use) |
| 200 | k8s-master | running | 8 | 16GB | vmbr1:vlan20 | 64G | Kubernetes control plane (10.0.20.100) |
| 201 | k8s-node1 | running | 16 | 24GB | vmbr1:vlan20 | 128G | GPU node, Tesla T4 passthrough (hostpci0) |
| 202 | k8s-node2 | running | 8 | 16GB | vmbr1:vlan20 | 64G | K8s worker node |
| 203 | k8s-node3 | running | 8 | 16GB | vmbr1:vlan20 | 64G | K8s worker node |
| 204 | k8s-node4 | running | 8 | 16GB | vmbr1:vlan20 | 64G | K8s worker node |
| 220 | docker-registry | running | 4 | 4GB | vmbr1:vlan20 | 64G | Terraform-managed, MAC DE:AD:BE:EF:22:22 (10.0.20.10) |
| 300 | Windows10 | running | 16 | 8GB | vmbr0 | 100G | Windows VM on physical network |
| 9000 | truenas | running | 16 | 16GB | vmbr1:vlan10 | 32G+7× 256G+1T | NFS server (10.0.10.15), multiple data disks |
#### VM Templates (stopped, used for cloning)
| VMID | Name | Purpose |
|------|------|---------|
| 1000 | ubuntu-2404-cloudinit-non-k8s-template | Base template for non-K8s VMs |
| 1001 | docker-registry-template | Template for docker registry VM |
| 2000 | ubuntu-2404-cloudinit-k8s-template | Base template for K8s nodes |
#### Network Connectivity Summary
- **pfSense (101)** bridges all three networks: physical (vmbr0), management VLAN 10, and kubernetes VLAN 20
- **K8s cluster** (200-204) + **docker-registry** (220) are all on VLAN 20 (kubernetes network)
- **TrueNAS** (9000) + **devvm** (102) + **PBS** (105) are on VLAN 10 (management network)
- **Home Assistant** (103) is on physical network (vmbr0), with a disabled VLAN 10 interface
- **Windows10** (300) is on physical network (vmbr0) only
2026-02-06 20:10:02 +00:00
### GPU Node (k8s-node1)
2026-02-07 21:37:38 +00:00
- **VMID**: 201
- **PCIe Passthrough**: `0000:06:00.0` (NVIDIA Tesla T4)
2026-02-06 20:10:02 +00:00
- **Taint**: `nvidia.com/gpu=true:NoSchedule` - Only GPU workloads can run here
- **Label**: `gpu=true`
- GPU workloads must have both:
- `node_selector = { "gpu": "true" }`
- `toleration { key = "nvidia.com/gpu", operator = "Equal", value = "true", effect = "NoSchedule" }`
- Taint is applied via `null_resource.gpu_node_taint` in `modules/kubernetes/nvidia/main.tf`
2026-01-18 13:40:31 +00:00
## Git Operations (IMPORTANT)
- **Git is slow** on this repo due to many files - commands can take 30+ seconds
- Use `GIT_OPTIONAL_LOCKS=0` prefix if git hangs
- Always commit only specific files you changed, not everything
2026-01-18 15:40:43 +00:00
- **ALWAYS ask user before pushing to remote** - never push without explicit confirmation
2026-01-18 13:40:31 +00:00
## Prometheus Alerts
- Alert rules are in `modules/kubernetes/monitoring/prometheus_chart_values.tpl`
- Under `serverFiles.alerting_rules.yml.groups`
- Groups: "R730 Host", "Nvidia Tesla T4 GPU", "Power", "Cluster"
- kube-state-metrics provides: `kube_deployment_*` , `kube_statefulset_*` , `kube_daemonset_*`
2026-02-06 20:10:02 +00:00
## Tier System
- **0-core**: Critical infrastructure (ingress, DNS, VPN, auth)
- **1-cluster**: Cluster services (Redis, metrics, security)
- **2-gpu**: GPU workloads (Immich, Ollama, Frigate)
- **3-edge**: User-facing services
- **4-aux**: Optional/auxiliary services
2026-02-15 18:48:33 +00:00
### Resource Governance (Kyverno-based)
Four layers of noisy-neighbor protection, all defined in `modules/kubernetes/kyverno/resource-governance.tf` :
1. **PriorityClasses** : `tier-0-core` (1M) through `tier-4-aux` (200K). `tier-4-aux` uses `preemption_policy=Never` .
2. **LimitRange defaults** (Kyverno generate): Auto-creates `tier-defaults` LimitRange in namespaces based on tier label. Only affects containers without explicit resources.
3. **ResourceQuotas** (Kyverno generate): Auto-creates `tier-quota` ResourceQuota in namespaces with tier labels. Excludes namespaces with `resource-governance/custom-quota=true` label.
4. **Priority injection** (Kyverno mutate): Sets `priorityClassName` on Pods based on namespace tier label.
**Custom quota override**: Add label `resource-governance/custom-quota: "true"` to namespace, then define a custom `kubernetes_resource_quota` in the service's Terraform module. Currently used by: monitoring, crowdsec.
**LimitRange defaults by tier**:
| Tier | Default Req | Default Limit | Max |
|------|------------|--------------|-----|
| 0-core | 100m/128Mi | 2/4Gi | 8/16Gi |
| 1-cluster | 100m/128Mi | 2/4Gi | 4/8Gi |
| 2-gpu | 100m/256Mi | 4/8Gi | 8/16Gi |
| 3-edge | 50m/128Mi | 1/2Gi | 4/8Gi |
| 4-aux | 25m/64Mi | 500m/1Gi | 2/4Gi |
[ci skip] Increase tier-based resource quotas to prevent quota exhaustion
Tier 2-gpu: 32→48 CPU limits, 64→96Gi mem limits, 30→40 pods
Tier 3-edge: 2→4 req CPU, 8→16 CPU limits, 16→32Gi mem limits, 20→30 pods
Tier 4-aux: 1→2 req CPU, 4→8 CPU limits, 8→16Gi mem limits, 15→20 pods
Fixes realestate-crawler (100% quota), nvidia (89.7%), resume/website (75%),
and actualbudget (75%) quota exhaustion causing pod creation failures.
2026-02-21 23:26:00 +00:00
**ResourceQuota hard limits by tier**:
| Tier | Req CPU | Req Mem | Lim CPU | Lim Mem | Pods |
|------|---------|---------|---------|---------|------|
| 0-core | 8 | 8Gi | 32 | 64Gi | 100 |
| 1-cluster | 4 | 4Gi | 16 | 32Gi | 30 |
| 2-gpu | 8 | 8Gi | 48 | 96Gi | 40 |
| 3-edge | 4 | 4Gi | 16 | 32Gi | 30 |
| 4-aux | 2 | 2Gi | 8 | 16Gi | 20 |
2026-02-06 20:10:02 +00:00
---
## User Preferences
### Calendar
- **Default calendar**: Nextcloud (always use unless otherwise specified)
- **Nextcloud URL**: `https://nextcloud.viktorbarzin.me`
- **CalDAV endpoint**: `https://nextcloud.viktorbarzin.me/remote.php/dav/calendars/<username>/<calendar-name>/`
### Home Assistant
- **Default smart home**: Home Assistant (always use for smart home control)
2026-02-07 21:26:05 +00:00
- **Two deployments**:
2026-02-07 23:26:43 +00:00
- **ha-london** (default): `https://ha-london.viktorbarzin.me` | Script: `.claude/home-assistant.py` | SSH: `ssh pi@192.168.8.103` , config at `/home/pi/docker/homeAssistant/`
2026-02-07 22:21:30 +00:00
- **ha-sofia**: `https://ha-sofia.viktorbarzin.me` | Script: `.claude/home-assistant-sofia.py` | SSH: `ssh vbarzin@192.168.1.8` , config at `/config/`
2026-02-07 21:26:05 +00:00
- **Aliases**: "ha" or "HA" = ha-london. "ha sofia" or "ha-sofia" = ha-sofia.
2026-02-06 20:10:02 +00:00
### Development
- **Frontend framework**: Svelte (user is learning it, so use Svelte for all new web apps)
2026-02-15 17:48:42 +00:00
### Pod Monitoring After Updates
- **Never use `sleep` to wait for pods** — instead, spawn a background subagent (Task tool with `run_in_background: true` ) that continuously checks pod state (e.g., `kubectl get pods -n <namespace> -w` ) and reports back when the pod is ready or if errors occur. This catches CrashLoopBackOff, ImagePullBackOff, and other failures much sooner than periodic sleep-based polling.
2026-02-06 20:10:02 +00:00
---
## Skills & Workflows
Skills are specialized workflows for common tasks. Located in `.claude/skills/` .
### Available Skills
2026-02-17 21:09:12 +00:00
**setup-project** (`.claude/skills/setup-project/SKILL.md` )
2026-02-06 20:10:02 +00:00
- Deploy new self-hosted services from GitHub repos
- Automated workflow: Docker image → Terraform module → Deploy
- Handles database setup, ingress, DNS configuration
- **When to use**: User provides GitHub URL or wants to deploy a new service
- **Example**: "Deploy [GitHub repo] to the cluster"
2026-02-17 21:09:12 +00:00
**extend-vm-storage** (`.claude/skills/extend-vm-storage/SKILL.md` )
2026-02-13 22:08:46 +00:00
- Extend disk storage on K8s node VMs (Proxmox-hosted)
- Automates: drain → shutdown → resize → boot → expand filesystem → uncordon
- **When to use**: A k8s node needs more disk space
- **Example**: "Extend storage on k8s-node2 by 64G"
2026-02-06 20:10:02 +00:00
---
## Service-Specific Notes
2026-02-17 21:10:40 +00:00
### Authentik (Identity Provider)
- **Helm Chart**: `authentik` v2025.10.3 from `https://charts.goauthentik.io/`
- **URL**: `https://authentik.viktorbarzin.me`
- **API**: `https://authentik.viktorbarzin.me/api/v3/`
2026-02-17 22:03:55 +00:00
- **API Token**: Stored in `terraform.tfvars` as `authentik_api_token` (non-expiring, superuser, identifier: `claude-code-permanent` ). Read with: `grep authentik_api_token terraform.tfvars | cut -d'"' -f2`
2026-02-17 21:10:40 +00:00
- **Namespace**: `authentik` (tier: cluster)
- **Architecture**: 3 server replicas + 3 worker replicas + 3 PgBouncer replicas + 1 embedded outpost
- **Database**: PostgreSQL via `postgresql.dbaas:5432` , pooled through PgBouncer at `pgbouncer.authentik:6432`
- **Redis**: Shared at `redis.redis.svc.cluster.local`
- **Terraform**: `modules/kubernetes/authentik/main.tf` (Helm), `pgbouncer.tf` (connection pooling)
#### Authentik API Management
To call the API, use:
```bash
curl -s -H "Authorization: Bearer < TOKEN > " "https://authentik.viktorbarzin.me/api/v3/< endpoint > /"
```
Key API endpoints:
- `core/users/` — List/create/update/delete users
- `core/groups/` — List/create/update/delete groups
- `core/applications/` — List/create applications
- `providers/all/` — List all providers (OAuth2, Proxy, etc.)
- `providers/oauth2/` — OAuth2/OIDC providers specifically
- `providers/proxy/` — Proxy providers (forward auth)
- `flows/instances/` — List flows
- `stages/all/` — List stages
- `sources/all/` — List sources (Google, GitHub, etc.)
- `outposts/instances/` — List outposts
- `propertymappings/all/` — List property mappings
- `rbac/roles/` — List roles
2026-02-17 22:16:46 +00:00
#### Current Applications (9)
2026-02-17 21:10:40 +00:00
| Application | Provider Type | Auth Flow |
|-------------|--------------|-----------|
| Cloudflare Access | OAuth2/OIDC | explicit consent |
| Domain wide catch all | Proxy (forward auth) | implicit consent |
| Grafana | OAuth2/OIDC | implicit consent |
| Headscale | OAuth2/OIDC | explicit consent |
| Immich | OAuth2/OIDC | explicit consent |
2026-02-17 22:16:46 +00:00
| Kubernetes | OAuth2/OIDC (public) | implicit consent |
2026-02-17 21:10:40 +00:00
| linkwarden | OAuth2/OIDC | explicit consent |
| Matrix | OAuth2/OIDC | implicit consent |
| wrongmove | OAuth2/OIDC | implicit consent |
2026-02-17 22:16:46 +00:00
#### Current Groups (9)
2026-02-17 21:10:40 +00:00
| Group | Parent | Superuser | Purpose |
|-------|--------|-----------|---------|
| Allow Login Users | — | No | Parent group for login-permitted users |
| authentik Admins | — | Yes | Full admin access |
| authentik Read-only | — | No | Read-only access (has role) |
| Headscale Users | Allow Login Users | No | VPN access |
| Home Server Admins | Allow Login Users | No | Server admin access |
| Wrongmove Users | Allow Login Users | No | Real-estate app access |
2026-02-17 22:16:46 +00:00
| kubernetes-admins | — | No | K8s cluster-admin RBAC |
| kubernetes-power-users | — | No | K8s power-user RBAC |
| kubernetes-namespace-owners | — | No | K8s namespace-owner RBAC |
2026-02-17 21:10:40 +00:00
#### Current Users (7 real users)
| Username | Name | Type | Groups |
|----------|------|------|--------|
| akadmin | authentik Default Admin | internal | authentik Admins, Home Server Admins, Headscale Users |
| vbarzin@gmail .com | Viktor Barzin | internal | authentik Admins, Home Server Admins, Wrongmove Users, Headscale Users |
| emil.barzin@gmail .com | Emil Barzin | internal | Home Server Admins, Headscale Users |
| ancaelena98@gmail .com | Anca Milea | external | Wrongmove Users, Headscale Users |
| vabbit81@gmail .com | GHEORGHE Milea | external | Headscale Users |
| valentinakolevabarzina@gmail .com | Валентина Колева-Барзина | internal | Headscale Users |
| anca.r.cristian10@gmail .com | — | internal | Wrongmove Users |
| kadir.tugan@gmail .com | Kadir | internal | Wrongmove Users |
#### Login Sources (Social Login)
- **Google** (OAuth) — user matching by identifier
- **GitHub** (OAuth) — user matching by email_link
- **Facebook** (OAuth) — user matching by email_link
- All use the same authentication flow (`1a779f24` ) and enrollment flow (`87572804` )
#### Authorization Flows
- **Explicit consent** (`default-provider-authorization-explicit-consent` ): Shows consent screen before redirecting — used for Immich, Linkwarden, Headscale, Cloudflare
- **Implicit consent** (`default-provider-authorization-implicit-consent` ): Auto-redirects without consent — used for Grafana, Matrix, Domain catch-all, Wrongmove
#### Traefik Integration
- Forward auth middleware: `authentik-forward-auth` in Traefik namespace
- Outpost endpoint: `http://ak-outpost-authentik-embedded-outpost.authentik.svc.cluster.local:9000/outpost.goauthentik.io/auth/traefik`
- Services opt in via `protected = true` in `ingress_factory`
- Response headers: `X-authentik-username` , `X-authentik-uid` , `X-authentik-email` , `X-authentik-name` , `X-authentik-groups` , `Set-Cookie`
#### OIDC for Kubernetes API
2026-02-17 22:16:46 +00:00
- **Issuer**: `https://authentik.viktorbarzin.me/application/o/kubernetes/`
- **Client ID**: `kubernetes` (public client, no secret)
- **Username claim**: `email` , **Groups claim** : `groups`
- **Signing key**: `authentik Self-signed Certificate` (must be assigned to the provider or JWKS will be empty)
- **Redirect URIs**: Regex mode `http://localhost:.*` and `http://127\.0\.0\.1:.*` (kubelogin picks random ports)
- **Configured via**: SSH to kube-apiserver manifest (`modules/kubernetes/rbac/apiserver-oidc.tf` )
- **RBAC module**: `modules/kubernetes/rbac/main.tf` — admin/power-user/namespace-owner roles
- **Self-service portal**: `modules/kubernetes/k8s-portal/` — SvelteKit app at `https://k8s-portal.viktorbarzin.me`
- **User definition**: `k8s_users` variable in `terraform.tfvars`
- **Audit logging**: Enabled via `modules/kubernetes/rbac/audit-policy.tf` , logs at `/var/log/kubernetes/audit.log`
**CRITICAL GOTCHAS when setting up Authentik OIDC for Kubernetes:**
1. **Signing key MUST be assigned** to the OAuth2 provider. Without it, the JWKS endpoint returns `{}` and kube-apiserver can't validate tokens.
2. **Email mapping must set `email_verified: True`** . The default Authentik email scope mapping hardcodes `email_verified: False` , which causes kube-apiserver to reject the token with `oidc: email not verified` . Use a custom scope mapping: `return {"email": request.user.email, "email_verified": True}`
3. **kubelogin needs `--oidc-extra-scope`** for `email` , `profile` , `groups` . Without these, only `openid` is requested and the token lacks the `email` claim, causing `oidc: parse username claims "email": claim not present` .
4. **Redirect URIs must use regex mode** (`http://localhost:.*` ) because kubelogin picks random ports, not just 8000/18000.
5. **Kubelet static pod manifest changes** require a full cycle to take effect: remove manifest, stop kubelet, remove containers via crictl, re-add manifest, start kubelet. Simple `touch` or kubelet restart is not enough.
6. **Property mappings endpoint** in Authentik 2025.10.x is `propertymappings/provider/scope/` (not the older `propertymappings/scope/` ).
2026-02-17 21:10:40 +00:00
#### Common Management Tasks
**Add a new OAuth2 application:**
1. Create OAuth2 provider: `POST /api/v3/providers/oauth2/` with client_id, client_secret, redirect_uris, authorization_flow, etc.
2. Create application: `POST /api/v3/core/applications/` with name, slug, provider pk
3. (Optional) Bind to group policy for access control
**Add a user to a group:**
```bash
# Get group pk, then PATCH with updated users list
curl -X PATCH -H "Authorization: Bearer < TOKEN > " -H "Content-Type: application/json" \
"https://authentik.viktorbarzin.me/api/v3/core/groups/< group-pk > /" \
-d '{"users": [< existing_user_pks > , < new_user_pk > ]}'
```
**Protect a service with forward auth:**
Set `protected = true` in the service's `ingress_factory` call in Terraform.
2026-02-06 20:10:02 +00:00
### AFFiNE (Visual Canvas)
- **Image**: `ghcr.io/toeverything/affine:stable`
- **Port**: 3010
- **Requires**: PostgreSQL + Redis
- **Migration**: Init container runs `node ./scripts/self-host-predeploy.js`
- **Storage**: NFS at `/mnt/main/affine` mounted to `/root/.affine/storage` and `/root/.affine/config`
- **Key env vars**:
- `AFFINE_SERVER_EXTERNAL_URL` - Public URL (e.g., `https://affine.viktorbarzin.me` )
- `AFFINE_SERVER_HTTPS` - Set to `true` behind TLS ingress
- `DATABASE_URL` - PostgreSQL connection string
- `REDIS_SERVER_HOST` - Redis hostname
- `MAILER_*` - SMTP configuration for email invites
- **Local-first**: Data stored in browser by default; syncs to server when user creates account
- **Docs**: https://docs.affine.pro/self-host-affine
2026-02-08 00:09:03 +00:00
### Wyoming Whisper (STT for Home Assistant)
- **Image**: `rhasspy/wyoming-whisper:latest`
- **Port**: 10300/TCP (Wyoming protocol)
- **Model**: `small-int8` (CPU-optimized, no CUDA variant available from upstream)
- **Runs on**: GPU node (node_selector gpu=true + nvidia toleration) but uses CPU only
- **Storage**: NFS at `/mnt/main/whisper` → `/data` (model cache)
- **Exposure**: Internal only via Traefik TCP entrypoint `whisper-tcp` → IngressRouteTCP
- **Access**: `10.0.20.202:10300` (Traefik LB IP, no public DNS)
- **HA Integration**: Wyoming Protocol integration in ha-london, host `10.0.20.202` , port `10300`
- **No GPU acceleration**: Official image is CPU-only (Debian + PyTorch CPU). The `mib1185/wyoming-faster-whisper-cuda` image exists but requires self-build.
2026-02-08 02:18:14 +00:00
### Gramps Web (Genealogy)
- **Image**: `ghcr.io/gramps-project/grampsweb:latest`
- **Port**: 5000
- **URL**: `https://family.viktorbarzin.me`
- **Components**: Web app + Celery worker (2 containers in 1 pod)
- **Requires**: Shared Redis (DB 2 for Celery broker/backend, DB 3 for rate limiting)
- **Storage**: NFS at `/mnt/main/grampsweb` with sub_paths: users, indexdir, thumbnail_cache, cache, secret, grampsdb, media, tmp
- **Key env vars**:
- `GRAMPSWEB_SECRET_KEY` - Flask secret key (generated via `random_password` )
- `GRAMPSWEB_TREE` - Tree name
- `GRAMPSWEB_BASE_URL` - Public URL
- `GRAMPSWEB_CELERY_CONFIG__broker_url` / `result_backend` - Redis connection
- `GRAMPSWEB_REGISTRATION_DISABLED` - Set to `True`
- `GRAMPSWEB_EMAIL_*` - SMTP configuration
- `GRAMPSWEB_LLM_*` - Ollama AI integration
- **Celery command**: `celery -A gramps_webapi.celery worker --loglevel=INFO --concurrency=2`
- **Registration**: Disabled; first user created via UI setup wizard
2026-02-13 23:46:01 +00:00
### Loki + Alloy (Centralized Log Collection)
- **Loki image**: `grafana/loki:3.6.5` (Helm chart, single binary mode)
- **Alloy image**: `grafana/alloy:v1.13.0` (Helm chart, DaemonSet)
- **Config files**: `modules/kubernetes/monitoring/loki.tf` , `loki.yaml` , `alloy.yaml`
- **Port**: 3100/TCP (Loki API)
- **Storage**: NFS PV at `/mnt/main/loki/loki` (15Gi), WAL on tmpfs (2Gi in-memory)
- **Memory**: Loki 6Gi limit, Alloy 128Mi per pod (4 worker nodes)
- **Disk-friendly tuning**: `max_chunk_age: 24h` , `chunk_idle_period: 12h` — holds chunks in memory, flushes ~once/day
- **Retention**: 7 days (`retention_period: 168h` ), compactor enforces deletion
- **Crash policy**: WAL on tmpfs — up to 24h log loss on crash (alerts still fire in real-time)
- **Ruler**: Evaluates LogQL alert rules, fires to `http://prometheus-alertmanager.monitoring.svc.cluster.local:9093`
- **Alert rules**: HighErrorRate, PodCrashLoopBackOff, OOMKilled (ConfigMap `loki-alert-rules` )
- **Grafana**: Datasource UID `P8E80F9AEF21F6940` , dashboard "Loki Kubernetes Logs" (stored in MySQL, not file-provisioned)
- **Sysctl DaemonSet**: `sysctl-inotify` sets `fs.inotify.max_user_watches=1048576` on all nodes (required for Alloy fsnotify)
- **Disabled components**: gateway, chunksCache, resultsCache (not needed for single binary)
- **Key paths**: Compactor at `/var/loki/compactor` , ruler scratch at `/var/loki/scratch` (must be under `/var/loki` — root FS is read-only)
- **Querying**: Grafana Explore with LogQL, e.g. `{namespace="monitoring"} |= "error"`
- **Troubleshooting**: If "entry too far behind" errors on first start, restart Alloy DaemonSet (`kubectl rollout restart ds -n monitoring alloy` ) — Alloy reads historical logs on first boot, which Loki rejects; clears after restart
2026-02-18 22:39:58 +00:00
### OpenClaw (AI Agent Gateway)
- **Image**: `ghcr.io/openclaw/openclaw:2026.2.9`
- **Port**: 18789
- **URL**: `https://openclaw.viktorbarzin.me` (authentik-protected)
- **Namespace**: `openclaw` (tier: aux)
- **Formerly**: `moltbot` — renamed in Feb 2026
- **Architecture**: Single pod with init container (tools download + repo clone) + main container (OpenClaw gateway)
- **Init container**: Downloads kubectl v1.34.2, terraform 1.14.5, git-crypt; clones infra repo; runs terraform init
- **ServiceAccount**: `openclaw` with `cluster-admin` ClusterRoleBinding (for managing cluster resources)
- **Storage**: NFS at `/mnt/main/openclaw/workspace` (git repo) and `/mnt/main/openclaw/data` (persistent data)
- **Config**: `openclaw.json` ConfigMap with model providers (Gemini, Ollama, Llama API), tool permissions, and agent defaults
- **Variables**: `openclaw_ssh_key` , `openclaw_skill_secrets` in `terraform.tfvars`
- **Skill secrets**: Home Assistant tokens (london + sofia), Uptime Kuma password — passed as env vars
- **Model providers**: Gemini (gemini-2.5-flash), Ollama (qwen2.5-coder:14b, deepseek-r1:14b), Llama API (Llama-3.3-70B, Llama-4-Scout/Maverick)