[ci skip] Refactor knowledge: CLAUDE.md 881→190 lines, extract reference data

CLAUDE.md changes:
- Extract service catalog + Cloudflare domains → .claude/reference/service-catalog.md
- Extract Proxmox VMs, hardware, network → .claude/reference/proxmox-inventory.md
- Extract GitHub/Drone API patterns → .claude/reference/github-drone-api.md
- Extract Authentik state snapshot → .claude/reference/authentik-state.md
- Remove Init Container pattern (duplicates setup-project skill)
- Remove Poison Fountain service notes (duplicates Anti-AI section)
- Consolidate Authentik section (link to skills + reference)
- Remove resource limit tables (kept tier definitions inline)

Skill merges (37→32):
- helm-release-force-rerender + helm-stuck-release-recovery → helm-release-troubleshooting
- containerd-multi-registry-pull-through-cache + k8s-docker-registry-cache-bypass → k8s-container-image-caching
- (traefik merges in previous commits)
This commit is contained in:
Viktor Barzin 2026-02-22 22:11:31 +00:00
parent d3d0b4281c
commit abe89c926e
10 changed files with 749 additions and 1166 deletions


@@ -6,100 +6,40 @@
- **When making infrastructure changes**: Always update this file to reflect the current state (new services, removed services, version changes, config changes)
- **After every significant change**: Proactively update this file (`.claude/CLAUDE.md`) to reflect what changed — new services, config changes, version bumps, new patterns, etc. This ensures knowledge persists across sessions automatically.
- **After updating any `.claude/` files**: Always commit them immediately (`git add .claude/ && git commit -m "[ci skip] update claude knowledge"`) to avoid building up unstaged changes.
- **Skills available**: Check `.claude/skills/` directory for specialized workflows (e.g., `setup-project.md` for deploying new services)
- **CRITICAL: All infrastructure changes must go through Terraform/Terragrunt**. NEVER modify cluster resources directly (e.g., via kubectl apply/edit/patch, helm install, docker run). Always make changes in the Terraform `.tf` files and apply with `terragrunt apply`. The real cluster state must never deviate from what's defined in Terraform — if a manual change is unavoidable (e.g., containerd config on running nodes), document it and ensure the Terraform templates match so future provisioning is consistent. Use `kubectl` only for read-only operations (get, describe, logs) and ephemeral debugging (run --rm, delete stuck pods), never for persistent state changes.
- **CRITICAL: NEVER put sensitive data (API keys, passwords, tokens, credentials) into committed files** unless they are encrypted (e.g., via git-crypt). Secrets belong in `terraform.tfvars` (which is git-crypt encrypted) or in the `secrets/` directory. Never hardcode credentials in `.tf` files, scripts, `.claude/` files, or any other unencrypted committed file. Always pass secrets through the Terraform variable chain (`terraform.tfvars` → `main.tf` → module variables).
- **CRITICAL: NEVER commit secrets** — triple-check before every commit that no API keys, passwords, tokens, or credentials are included in unencrypted files. This is a hard rule with zero exceptions.
- **New services MUST have CI/CD**: Set up Drone CI pipeline (`.drone.yml`) with GitHub/GitLab repo integration. Services should auto-build and auto-deploy.
- **New services MUST have monitoring**: Every new service should have monitoring via Prometheus (alerts/metrics) and/or Uptime Kuma (HTTP health checks). Add both when possible.
- **Skills available**: Check `.claude/skills/` directory for specialized workflows (e.g., `setup-project` for deploying new services)
- **Reference data**: Check `.claude/reference/` for inventory tables, API patterns, and current state snapshots
- **CRITICAL: All infrastructure changes must go through Terraform/Terragrunt**. NEVER modify cluster resources directly (kubectl apply/edit/patch, helm install, docker run). Use `kubectl` only for read-only operations and ephemeral debugging.
- **CRITICAL: NEVER put sensitive data** (API keys, passwords, tokens, credentials) into committed files unless encrypted via git-crypt. Secrets belong in `terraform.tfvars` or `secrets/` directory.
- **CRITICAL: NEVER commit secrets** — triple-check before every commit. Zero exceptions.
- **New services MUST have CI/CD** (Drone CI pipeline) and **monitoring** (Prometheus alerts and/or Uptime Kuma).
## Execution Environment
- **File operations**: Read, Edit, Write, Glob, Grep tools
- **Git commands**: git status, git log, git diff, git add, git commit, git reset, etc.
- **Shell commands**: All tools (terraform, terragrunt, kubectl, helm, python, etc.) are available locally
- **CRITICAL: Always run terragrunt/terraform locally**, never on the remote server via SSH:
```bash
cd stacks/<service> && terragrunt apply --non-interactive
```
- **kubectl**: Use `kubectl --kubeconfig $(pwd)/config` for cluster access
- **GitHub API**: Use `curl` with token from tfvars (see GitHub & Drone CI section below). `gh` CLI is blocked by sandbox restrictions.
- **Drone CI API**: Use `curl` with token from tfvars (see GitHub & Drone CI section below).
- **Terraform/Terragrunt**: Always run locally: `cd stacks/<service> && terragrunt apply --non-interactive`
- **kubectl**: `kubectl --kubeconfig $(pwd)/config`
- **GitHub/Drone API**: Use `curl` with tokens from tfvars (see `.claude/reference/github-drone-api.md`). `gh` CLI is blocked by sandbox.
---
## Overview
Terragrunt-based infrastructure repository managing a home Kubernetes cluster on Proxmox VMs, with per-service state isolation. Each service has its own Terragrunt stack under `stacks/`, enabling fast, independent plan/apply cycles. Uses git-crypt for secrets encryption.
Terragrunt-based infrastructure repository managing a home Kubernetes cluster on Proxmox VMs, with per-service state isolation. Each service has its own Terragrunt stack under `stacks/`. Uses git-crypt for secrets encryption.
## Static File Paths (NEVER CHANGE)
- **Main config**: `terraform.tfvars` - All secrets, DNS, Cloudflare config, WireGuard peers
- **Root Terragrunt**: `terragrunt.hcl` - Root Terragrunt config (providers, backend, var loading)
- **Service stacks**: `stacks/<service>/` - Individual service stacks (each has `terragrunt.hcl` + `main.tf` with resources inline)
- **Infra stack**: `stacks/infra/` - Proxmox VM resources (templates, docker-registry, VMs)
- **Platform stack**: `stacks/platform/` - Core infrastructure services (22 modules in `modules/` subdir)
- **Per-stack state**: `state/stacks/<service>/terraform.tfstate` - Per-stack state files (gitignored)
- **Service resources**: `stacks/<service>/main.tf` - Service resources defined directly in stack root
- **Platform modules**: `stacks/platform/modules/<service>/` - Platform service modules
- **Shared modules**: `modules/kubernetes/ingress_factory/`, `modules/kubernetes/setup_tls_secret/`
- **Secrets**: `secrets/` - git-crypt encrypted TLS certs and keys
## Network Topology (Static IPs)
```
┌─────────────────────────────────────────────────────────────────┐
│ 10.0.10.0/24 - Management Network │
├─────────────────────────────────────────────────────────────────┤
│ 10.0.10.10 - Wizard (main server) │
│ 10.0.10.15 - NFS Server (TrueNAS) - /mnt/main/* │
└─────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│ 10.0.20.0/24 - Kubernetes Network │
├─────────────────────────────────────────────────────────────────┤
│ 10.0.20.1 - pfSense Gateway │
│ 10.0.20.10 - Docker Registry VM (MAC: DE:AD:BE:EF:22:22) │
│ 10.0.20.100 - k8s-master │
│ 10.0.20.101 - Technitium DNS │
│ 10.0.20.102 - MetalLB IP Pool Start │
│ 10.0.20.200 - MetalLB IP Pool End │
└─────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│ 192.168.1.0/24 - Physical Network │
├─────────────────────────────────────────────────────────────────┤
│ 192.168.1.127 - Proxmox Hypervisor │
└─────────────────────────────────────────────────────────────────┘
```
## Key File Paths
- `terraform.tfvars` — All secrets, DNS, Cloudflare config, WireGuard peers (git-crypt encrypted)
- `terragrunt.hcl` — Root config (providers, backend, variable loading)
- `stacks/<service>/` — Individual service stacks (`terragrunt.hcl` + `main.tf`)
- `stacks/platform/` — Core infrastructure (~22 services in `modules/` subdir)
- `stacks/infra/` — Proxmox VM resources
- `modules/kubernetes/ingress_factory/`, `setup_tls_secret/` — Shared utility modules
- `secrets/` — git-crypt encrypted TLS certs and keys
## Domains
- **Public**: `viktorbarzin.me` (Cloudflare-managed)
- **Internal**: `viktorbarzin.lan` (Technitium DNS)
## Directory Structure
- `terragrunt.hcl` - Root Terragrunt configuration (providers, backend, variable loading)
- `stacks/` - Individual Terragrunt stacks (one per service)
- `stacks/infra/` - Proxmox VM resources (templates, docker-registry)
- `stacks/platform/` - Core infrastructure (22 services in `stacks/platform/modules/`)
- `stacks/<service>/` - Individual service stacks (resources directly in `main.tf`)
- `stacks/platform/modules/<service>/` - Platform service module source code
- `modules/kubernetes/` - **Only shared utility modules**: `ingress_factory/`, `setup_tls_secret/`
- `modules/create-vm/` - Proxmox VM creation module
- `state/` - Per-stack Terraform state files (gitignored)
- `secrets/` - Encrypted secrets (TLS certs, keys) via git-crypt
- `cli/` - Go CLI tool for infrastructure management
- `scripts/` - Helper scripts (cluster management, node updates)
- `playbooks/` - Ansible playbooks for node configuration
- `diagram/` - Infrastructure diagrams (Python-based)
## Key Patterns
- Each service in `modules/kubernetes/<service>/main.tf` defines its own namespace, deployments, services, and ingress
- NFS storage from `10.0.10.15` for persistent data
- TLS secrets managed via `setup_tls_secret` module
- Ingress uses Traefik (Helm chart, 3 replicas) with HTTP/3 (QUIC) enabled, Middleware CRDs for rate limiting, auth, CSP headers, CrowdSec bouncer, and analytics injection
- HTTP/3 enabled on Traefik (`http3.enabled=true`, `advertisedPort=443` on websecure entrypoint) and Cloudflare (`cloudflare_zone_settings_override` with `http3="on"`)
- GPU workloads use `node_selector = { "gpu": "true" }`
- Services expose to `*.viktorbarzin.me` domains
### NFS Volume Pattern
**Prefer inline NFS volumes** over separate PV/PVC resources. Use the `nfs {}` block directly in pod/deployment/cronjob specs:
**Prefer inline NFS volumes** over separate PV/PVC resources:
```hcl
volume {
name = "data"
@@ -109,773 +49,142 @@ volume {
}
}
```
Only use PV/PVC when the Helm chart requires `existingClaim` (like the Nextcloud Helm chart).
Only use PV/PVC when a Helm chart requires `existingClaim`.
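For reference, a minimal sketch of the inline pattern inside a `kubernetes_deployment` pod spec — the service name and mount path are hypothetical; the NFS server follows the convention above:
```hcl
# Hypothetical service "myapp" — inline NFS volume, no separate PV/PVC needed.
volume {
  name = "data"
  nfs {
    server = "10.0.10.15"        # TrueNAS NFS server
    path   = "/mnt/main/myapp"   # must be exported (see "Adding NFS Exports" below)
  }
}

# Inside the container block:
volume_mount {
  name       = "data"
  mount_path = "/data"
}
```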
### Adding NFS Exports
To add a new NFS exported directory:
1. Edit `secrets/nfs_directories.txt` - add the new directory path, keep the list sorted
2. Run `secrets/nfs_exports.sh` from the `secrets/` directory to update the NFS share via TrueNAS API
1. Edit `secrets/nfs_directories.txt` — add path, keep sorted
2. Run `secrets/nfs_exports.sh` from `secrets/` to update TrueNAS
### Factory Pattern (for multi-user services)
Used when a service needs one instance per user. Structure:
```
stacks/<service>/
├── main.tf # Namespace, TLS secret, user module calls
└── factory/
└── main.tf # Deployment, service, ingress templates with ${var.name}
```
Examples: `actualbudget`, `freedify`
### Factory Pattern (multi-user services)
Structure: `stacks/<service>/main.tf` + `factory/main.tf`. Examples: `actualbudget`, `freedify`.
To add a user: export NFS share, add Cloudflare route in tfvars, add module block calling factory.
To add a new user:
1. Export NFS share at `/mnt/main/<service>/<username>` in TrueNAS
2. Add Cloudflare route in tfvars
3. Add module block in main.tf calling factory
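For step 3, a minimal sketch of the per-user module block — the user name and inputs are illustrative, since each factory defines its own variables:
```hcl
# Hypothetical per-user instance calling the service's factory module.
module "actualbudget_alice" {
  source          = "./factory"
  name            = "alice"               # illustrative: used for NFS path and subdomain
  tls_secret_name = var.tls_secret_name
}
```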
### Init Container Pattern (for database migrations)
Use when a service needs to run database migrations before starting:
```hcl
init_container {
name = "migration"
image = "service-image:tag"
command = ["sh", "-c", "migration-command"]
dynamic "env" {
for_each = local.common_env
content {
name = env.value.name
value = env.value.value
}
}
}
```
Example: AFFiNE runs `node ./scripts/self-host-predeploy.js` in init container.
### SMTP/Email Configuration
When configuring services to use the mailserver:
- **Use public hostname**: `mail.viktorbarzin.me` (for TLS cert validation)
- **Do NOT use**: `mailserver.mailserver.svc.cluster.local` (TLS cert mismatch)
- **Port**: 587 (STARTTLS)
- **Credentials**: Use existing accounts from `mailserver_accounts` in tfvars
- **Common email**: `info@viktorbarzin.me` for service notifications
### SMTP/Email
- **Use**: `mail.viktorbarzin.me` port 587 (STARTTLS). **NOT** `mailserver.mailserver.svc.cluster.local` (TLS cert mismatch).
- **Credentials**: `mailserver_accounts` in tfvars. Common: `info@viktorbarzin.me`
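A hedged sketch of the SMTP wiring inside a container spec — the env var names and password variable are illustrative and vary per service; host, port, and account follow the rules above:
```hcl
env {
  name  = "SMTP_HOST"
  value = "mail.viktorbarzin.me"   # public hostname so the TLS cert validates
}
env {
  name  = "SMTP_PORT"
  value = "587"                    # STARTTLS
}
env {
  name  = "SMTP_USER"
  value = "info@viktorbarzin.me"   # existing account from mailserver_accounts
}
env {
  name  = "SMTP_PASSWORD"
  value = var.smtp_password        # illustrative: passed via tfvars, never hardcoded
}
```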
### Anti-AI Scraping (5-Layer Defense)
All services have anti-AI scraping enabled by default via `anti_ai_scraping = true` in `ingress_factory`. The 5 layers are:
All services have `anti_ai_scraping = true` by default in `ingress_factory`. Layers:
1. **Bot blocking** (`traefik-ai-bot-block`): ForwardAuth → poison-fountain `/auth`. Returns 403 for GPTBot, ClaudeBot, CCBot, etc.
2. **X-Robots-Tag** (`traefik-anti-ai-headers`): Adds `noai, noimageai`
3. **Trap links** (`traefik-anti-ai-trap-links`): rewrite-body injects 5 hidden links before `</body>` to `poison.viktorbarzin.me/article/*`
4. **Tarpit**: `/article/*` drip-feeds at ~100 bytes/sec
5. **Poison content**: 50 cached docs from rnsaffn.com/poison2/ (CronJob every 6h, `--http1.1` required)
1. **Bot blocking** (`traefik-ai-bot-block`): ForwardAuth middleware → poison-fountain `/auth` endpoint. Checks `User-Agent` against known AI crawlers (GPTBot, ClaudeBot, CCBot, Google-Extended, etc.). Returns 403 for bots, 200 for normal users.
2. **X-Robots-Tag header** (`traefik-anti-ai-headers`): Adds `noai, noimageai` to all responses.
3. **Trap links** (`traefik-anti-ai-trap-links`): rewrite-body plugin injects 5 hidden `<a>` tags before `</body>` linking to `poison.viktorbarzin.me/article/*`. Only injected when request `Accept` header contains `text/html` (browsers/scrapers, not API calls).
4. **Tarpit** (poison-fountain service): `/article/*` endpoints drip-feed responses at ~100 bytes/sec via chunked transfer encoding, wasting scraper time.
5. **Poison content**: Cached documents from rnsaffn.com/poison2/ (50 docs, refreshed every 6h via CronJob) served through the tarpit to pollute AI training data.
**Key files:**
- `stacks/poison-fountain/` — Terraform stack (deployment, service, ingress, CronJob)
- `stacks/poison-fountain/app/server.py` — Python HTTP server (ForwardAuth + tarpit)
- `stacks/poison-fountain/app/fetch-poison.sh` — CronJob fetcher (uses `--http1.1`, upstream hangs on HTTP/2)
- `stacks/platform/modules/traefik/middleware.tf` — 3 Traefik middleware CRDs
- `modules/kubernetes/ingress_factory/main.tf` — `anti_ai_scraping` variable (default: true)
**Testing:**
```bash
# Trap links (need Accept: text/html for rewrite-body plugin to process)
curl -s -H "Accept: text/html,application/xhtml+xml" https://vaultwarden.viktorbarzin.me/ | grep -oE 'href="https://poison[^"]*"'
# X-Robots-Tag header
curl -sI -H "Accept: text/html" https://vaultwarden.viktorbarzin.me/ | grep -i x-robots
# Bot blocking (403 for AI bots, 200 for normal users)
curl -s -o /dev/null -w "%{http_code}" -A "GPTBot/1.0" https://vaultwarden.viktorbarzin.me/
# Tarpit slow-drip (~100 bytes/sec)
curl -s -H "Accept: text/html" https://poison.viktorbarzin.me/article/test
```
**Gotchas:**
- rewrite-body plugin only processes responses when `Accept` header contains `text/html` — `curl` default `Accept: */*` does NOT match. Use `-H "Accept: text/html"` for testing.
- rnsaffn.com/poison2/ hangs on HTTP/2 — fetcher must use `--http1.1`
- NFS cache dir (`/mnt/main/poison-fountain/cache`) must be world-writable (chmod 777) because `curlimages/curl` runs as uid 101
- To disable for a specific service: set `anti_ai_scraping = false` in its `ingress_factory` call
Key files: `stacks/poison-fountain/`, `stacks/platform/modules/traefik/middleware.tf`, `modules/kubernetes/ingress_factory/main.tf`
Testing: `curl -s -H "Accept: text/html,application/xhtml+xml" https://vaultwarden.viktorbarzin.me/ | grep -oE 'href="https://poison[^"]*"'`
Disable per-service: `anti_ai_scraping = false` in ingress_factory call.
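Illustrative opt-out in a service's `ingress_factory` call — only the `anti_ai_scraping` flag is documented here; the other arguments are placeholders:
```hcl
module "ingress" {
  source = "../../modules/kubernetes/ingress_factory"
  # ... service-specific arguments (name, namespace, port, TLS secret) ...
  anti_ai_scraping = false   # disable the 5-layer defense for this service only
}
```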
### Terragrunt Architecture
- Root `terragrunt.hcl` provides DRY provider, backend, and variable loading for all stacks
- Each stack contains its resources directly: `stacks/<service>/main.tf` has variable declarations, locals, and all Terraform resources inline
- Platform modules live at `stacks/platform/modules/<service>/`, referenced as `source = "./modules/<service>"`
- Shared utility modules (`ingress_factory`, `setup_tls_secret`, `dockerhub_secret`, `oauth-proxy`) remain at `modules/kubernetes/` and are referenced with relative paths from each module
- State isolation: each stack has its own state file at `state/stacks/<service>/terraform.tfstate`
- Dependencies: service stacks depend on `platform` stack via `dependency` block in their `terragrunt.hcl`
- Variables loaded from `terraform.tfvars` automatically (unused vars silently ignored via `extra_arguments`)
- `secrets/` symlinks in each stack for TLS cert resolution (`path.root` workaround)
- Terragrunt v0.99+: use `--non-interactive` (not `--terragrunt-non-interactive`)
- run-all syntax: `terragrunt run --all -- <command>` (not `terragrunt run-all`)
- The `platform` stack bundles ~22 core services that have cross-dependencies (traefik, monitoring, authentik, etc.)
- Individual service stacks are for services that can be deployed independently
- Root `terragrunt.hcl` provides DRY provider, backend, and variable loading
- Each stack: `stacks/<service>/main.tf` with resources inline, state at `state/stacks/<service>/terraform.tfstate`
- Platform modules: `stacks/platform/modules/<service>/`, shared modules: `modules/kubernetes/`
- Dependencies via `dependency` block; variables from `terraform.tfvars` (unused silently ignored)
- `secrets/` symlinks in stacks for TLS cert path resolution
- Syntax: `--non-interactive` (not `--terragrunt-non-interactive`), `terragrunt run --all -- <command>` (not `run-all`)
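A hypothetical `stacks/<service>/terragrunt.hcl` following these conventions — the block labels and dependency path are assumptions, not copied from the repo:
```hcl
include "root" {
  path = find_in_parent_folders()   # pulls providers, backend, and tfvars loading from the root config
}

dependency "platform" {
  config_path = "../platform"       # service stacks depend on the platform stack
}
```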
### Adding a New Service
When adding a new service to the cluster:
1. Create `stacks/<service>/` directory with:
- `terragrunt.hcl` - Include root config, declare `platform` dependency
- `main.tf` - All resources defined directly (variables, locals, namespace, deployments, services, ingress)
- `secrets` - Symlink to `../../secrets` (for TLS cert path resolution)
2. Add Cloudflare DNS record in `terraform.tfvars` (`cloudflare_proxied_names` or `cloudflare_non_proxied_names`)
3. Apply the platform stack (contains cloudflared, which creates the DNS record): `cd stacks/platform && terragrunt apply --non-interactive`
4. Apply the new service: `cd stacks/<service> && terragrunt apply --non-interactive`
## Common Variables
- `tls_secret_name` - TLS certificate secret name
- `tier` - Deployment tier label
- Service-specific passwords passed as variables
## Service Versions (as of 2026-02)
- Immich: v2.4.1
- Freedify: latest (music streaming, factory pattern)
- AFFiNE: stable (visual canvas, uses PostgreSQL + Redis)
- Wyoming Whisper: latest (STT for Home Assistant, CPU on GPU node)
- Health: latest (Apple Health data dashboard, Svelte + FastAPI + Caddy, uses PostgreSQL)
- Gramps Web: latest (genealogy, uses Redis + Celery)
- Loki: 3.6.5 (log aggregation, single binary, 6Gi RAM, 24h in-memory chunks)
- Alloy: v1.13.0 (log collector DaemonSet, forwards to Loki)
- OpenClaw: 2026.2.9 (AI agent gateway, authentik-protected)
Use the **`setup-project`** skill for the full workflow. Quick reference:
1. Create `stacks/<service>/` with `terragrunt.hcl`, `main.tf`, `secrets` symlink
2. Add Cloudflare DNS in `terraform.tfvars`
3. Apply platform stack (for DNS): `cd stacks/platform && terragrunt apply --non-interactive`
4. Apply service: `cd stacks/<service> && terragrunt apply --non-interactive`
## Useful Commands
```bash
# Cluster health check — ALWAYS use this to check cluster status
bash scripts/cluster_healthcheck.sh # Full color report
bash scripts/cluster_healthcheck.sh # Cluster health (24 checks)
bash scripts/cluster_healthcheck.sh --quiet # Only WARN/FAIL
bash scripts/cluster_healthcheck.sh --json # Machine-readable
bash scripts/cluster_healthcheck.sh --fix # Auto-delete evicted pods
# Apply a single service stack
cd stacks/<service> && terragrunt apply --non-interactive
# Plan a single service stack
cd stacks/<service> && terragrunt plan --non-interactive
# Plan all stacks (full DAG)
cd stacks && terragrunt run --all --non-interactive -- plan
# Apply all stacks (full DAG)
cd stacks && terragrunt run --all --non-interactive -- apply
# Format all terraform files
terraform fmt -recursive
kubectl get pods -A
cd stacks/<service> && terragrunt apply --non-interactive # Apply single stack
cd stacks && terragrunt run --all --non-interactive -- plan # Plan all
terraform fmt -recursive # Format all
```
**Cluster Health Check** (`scripts/cluster_healthcheck.sh`):
- **ALWAYS use this script** to check cluster health — whether the user asks explicitly, after deploying/updating services, or whenever you need to verify cluster state. Never use ad-hoc kubectl commands to assess overall cluster health; use the script instead.
- Runs 24 checks: nodes, resources, conditions, pods, evicted, DaemonSets, deployments, PVCs, HPAs, CronJobs, CrowdSec, ingress, Prometheus alerts, Uptime Kuma, ResourceQuota pressure, StatefulSets, node disk, Helm releases, Kyverno, NFS, DNS, TLS certs, GPU, Cloudflare tunnel
- **When adding new healthchecks or monitoring**: Always update this script to validate the new component
**Terragrunt apply examples:**
- `cd stacks/monitoring && terragrunt apply --non-interactive` - Apply monitoring
- `cd stacks/immich && terragrunt apply --non-interactive` - Apply immich
- `cd stacks/infra && terragrunt apply --non-interactive` - Apply Proxmox VMs / docker registry
- `cd stacks/platform && terragrunt apply --non-interactive` - Apply all core/platform services
**IMPORTANT: When deploying a new service**, you must ALSO apply the `platform` stack (which includes `cloudflared`) to create the Cloudflare DNS record:
```bash
cd stacks/platform && terragrunt apply --non-interactive
```
Adding a name to `cloudflare_non_proxied_names` or `cloudflare_proxied_names` in `terraform.tfvars` only defines the record — it won't be created until the platform stack (which contains cloudflared) is applied.
## Stack Structure
Terragrunt stacks under `stacks/`:
- `stacks/infra/` - Proxmox VMs, templates, docker-registry
- `stacks/platform/` - Core infrastructure (~22 services in `modules/` subdir)
- `stacks/<service>/` - Individual service stacks (resources directly in `main.tf`)
Each stack's `terragrunt.hcl` includes the root `terragrunt.hcl` which provides:
- Kubernetes + Helm providers (configured from `terraform.tfvars`)
- Local backend with per-stack state file (`state/stacks/<service>/terraform.tfstate`)
- Automatic loading of `terraform.tfvars` with unused vars ignored
---
## Complete Service Catalog
### Critical - Network & Auth (Tier: core)
| Service | Description | Stack |
|---------|-------------|-------|
| wireguard | VPN server | platform |
| technitium | DNS server (10.0.20.101) | platform |
| headscale | Tailscale control server | platform |
| traefik | Ingress controller (Helm) | platform |
| xray | Proxy/tunnel | platform |
| authentik | Identity provider (SSO) | platform |
| cloudflared | Cloudflare tunnel | platform |
| authelia | Auth middleware | platform |
| monitoring | Prometheus/Grafana/Loki stack | platform |
### Storage & Security (Tier: cluster)
| Service | Description | Stack |
|---------|-------------|-------|
| vaultwarden | Bitwarden-compatible password manager | platform |
| redis | Shared Redis at `redis.redis.svc.cluster.local` | platform |
| immich | Photo management (GPU) | immich |
| nvidia | GPU device plugin | platform |
| metrics-server | K8s metrics | platform |
| uptime-kuma | Status monitoring | platform |
| crowdsec | Security/WAF | platform |
| kyverno | Policy engine | platform |
### Admin
| Service | Description | Stack |
|---------|-------------|-------|
| k8s-dashboard | Kubernetes dashboard | platform |
| reverse-proxy | Generic reverse proxy | platform |
### Active Use
| Service | Description | Stack |
|---------|-------------|-------|
| mailserver | Email (docker-mailserver) | mailserver |
| shadowsocks | Proxy | shadowsocks |
| webhook_handler | Webhook processing | webhook_handler |
| tuya-bridge | Smart home bridge | tuya-bridge |
| dawarich | Location history | dawarich |
| owntracks | Location tracking | owntracks |
| nextcloud | File sync/share | nextcloud |
| calibre | E-book management | calibre |
| onlyoffice | Document editing | onlyoffice |
| f1-stream | F1 streaming | f1-stream |
| rybbit | Analytics | rybbit |
| isponsorblocktv | SponsorBlock for TV | isponsorblocktv |
| actualbudget | Budgeting (factory pattern) | actualbudget |
### Optional
| Service | Description | Stack |
|---------|-------------|-------|
| blog | Personal blog | blog |
| descheduler | Pod descheduler | descheduler |
| drone | CI/CD | drone |
| hackmd | Collaborative markdown | hackmd |
| kms | Key management | kms |
| privatebin | Encrypted pastebin | privatebin |
| vault | HashiCorp Vault | vault |
| reloader | ConfigMap/Secret reloader | reloader |
| city-guesser | Game | city-guesser |
| echo | Echo server | echo |
| url | URL shortener | url |
| excalidraw | Whiteboard | excalidraw |
| travel_blog | Travel blog | travel_blog |
| dashy | Dashboard | dashy |
| send | Firefox Send | send |
| ytdlp | YouTube downloader | ytdlp |
| wealthfolio | Finance tracking | wealthfolio |
| audiobookshelf | Audiobook server | audiobookshelf |
| paperless-ngx | Document management | paperless-ngx |
| jsoncrack | JSON visualizer | jsoncrack |
| servarr | Media automation (Sonarr/Radarr/etc) | servarr |
| ntfy | Push notifications | ntfy |
| cyberchef | Data transformation | cyberchef |
| diun | Docker image update notifier | diun |
| meshcentral | Remote management | meshcentral |
| homepage | Dashboard/startpage | homepage |
| matrix | Matrix chat server | matrix |
| linkwarden | Bookmark manager | linkwarden |
| changedetection | Web change detection | changedetection |
| tandoor | Recipe manager | tandoor |
| n8n | Workflow automation | n8n |
| real-estate-crawler | Property crawler | real-estate-crawler |
| tor-proxy | Tor proxy | tor-proxy |
| forgejo | Git forge | forgejo |
| freshrss | RSS reader | freshrss |
| navidrome | Music streaming | navidrome |
| networking-toolbox | Network tools | networking-toolbox |
| stirling-pdf | PDF tools | stirling-pdf |
| speedtest | Speed testing | speedtest |
| freedify | Music streaming (factory pattern) | freedify |
| netbox | Network documentation | netbox |
| infra-maintenance | Maintenance jobs | infra-maintenance |
| ollama | LLM server (GPU) | ollama |
| frigate | NVR/camera (GPU) | frigate |
| ebook2audiobook | E-book to audio (GPU) | ebook2audiobook |
| affine | Visual canvas/whiteboard (PostgreSQL + Redis) | affine |
| health | Apple Health data dashboard (PostgreSQL) | health |
| whisper | Wyoming Faster Whisper STT (CPU on GPU node) | whisper |
| grampsweb | Genealogy web app (Gramps Web) | grampsweb |
| openclaw | AI agent gateway (OpenClaw) | openclaw |
| poison-fountain | Anti-AI scraping (tarpit + poison) | poison-fountain |
---
## Cloudflare Domains
### Proxied (CDN + WAF enabled)
```
blog, hackmd, privatebin, url, echo, f1tv, excalidraw, send,
audiobookshelf, jsoncrack, ntfy, cyberchef, homepage, linkwarden,
changedetection, tandoor, n8n, stirling-pdf, dashy, city-guesser,
travel, netbox
```
### Non-Proxied (Direct DNS)
```
mail, wg, headscale, immich, calibre, vaultwarden, drone,
mailserver-antispam, mailserver-admin, webhook, uptime,
owntracks, dawarich, tuya, meshcentral, nextcloud, actualbudget,
onlyoffice, forgejo, freshrss, navidrome, ollama, openwebui,
isponsorblocktv, speedtest, freedify, rybbit, paperless,
servarr, prowlarr, bazarr, radarr, sonarr, flaresolverr,
jellyfin, jellyseerr, tdarr, affine, health, family, openclaw
```
### Special Subdomains
- `*.viktor.actualbudget` - Actualbudget factory instances
- `*.freedify` - Freedify factory instances
- `mailserver.*` - Mail server components (antispam, admin)
---
## CI/CD
- Drone CI (`.drone.yml`) for automated deployments
- **Default pipeline**: On push, applies the `platform` stack via `terragrunt apply` (core infrastructure services; installs Terraform 1.5.7 + Terragrunt 0.99.4 in Alpine)
- **TLS renewal pipeline**: Cron-triggered, runs `renew2.sh` (certbot + Cloudflare DNS) — no Terraform/Terragrunt needed
- **Build CLI pipeline**: Builds Docker image from `cli/Dockerfile` (unchanged)
- **ALWAYS add `[ci skip]` to commit messages** when you've already run `terraform apply` to avoid triggering CI redundantly
- **After committing, run `git push origin master`** to sync changes
## GitHub & Drone CI
### GitHub API Access
- **Username**: `ViktorBarzin`
- **Token location**: `terraform.tfvars` as `github_pat` (git-crypt encrypted)
- **Read token**: `grep github_pat terraform.tfvars | cut -d'"' -f2`
- **Scopes**: Full access — `repo`, `admin:public_key`, `admin:repo_hook`, `delete_repo`, `admin:org`, `workflow`, `write:packages`, and more
- **`gh` CLI**: Blocked by sandbox restrictions — use `curl` with the GitHub API instead
#### Common API Patterns
```bash
# Read token from tfvars
GITHUB_TOKEN=$(grep github_pat terraform.tfvars | cut -d'"' -f2)
# List repos
curl -s -H "Authorization: token $GITHUB_TOKEN" "https://api.github.com/users/ViktorBarzin/repos?per_page=100"
# Create repo
curl -s -X POST -H "Authorization: token $GITHUB_TOKEN" "https://api.github.com/user/repos" \
-d '{"name":"repo-name","private":true}'
# Add deploy key
curl -s -X POST -H "Authorization: token $GITHUB_TOKEN" "https://api.github.com/repos/ViktorBarzin/<repo>/keys" \
-d '{"title":"key-name","key":"ssh-ed25519 ...","read_only":false}'
# Create webhook (e.g., for Drone CI)
curl -s -X POST -H "Authorization: token $GITHUB_TOKEN" "https://api.github.com/repos/ViktorBarzin/<repo>/hooks" \
-d '{"config":{"url":"https://drone.viktorbarzin.me/hook","content_type":"json","secret":"..."},"events":["push","pull_request"]}'
# Get repo info
curl -s -H "Authorization: token $GITHUB_TOKEN" "https://api.github.com/repos/ViktorBarzin/<repo>"
```
### Drone CI API Access
- **Server**: `https://drone.viktorbarzin.me`
- **Token location**: `terraform.tfvars` as `drone_api_token` (git-crypt encrypted)
- **Read token**: `grep drone_api_token terraform.tfvars | cut -d'"' -f2`
- **Username**: `ViktorBarzin`
#### Common API Patterns
```bash
# Read token from tfvars
DRONE_TOKEN=$(grep drone_api_token terraform.tfvars | cut -d'"' -f2)
# List repos
curl -s -H "Authorization: Bearer $DRONE_TOKEN" "https://drone.viktorbarzin.me/api/repos"
# Activate repo in Drone
curl -s -X POST -H "Authorization: Bearer $DRONE_TOKEN" "https://drone.viktorbarzin.me/api/repos/ViktorBarzin/<repo>"
# Trigger build
curl -s -X POST -H "Authorization: Bearer $DRONE_TOKEN" "https://drone.viktorbarzin.me/api/repos/ViktorBarzin/<repo>/builds"
# Get build info
curl -s -H "Authorization: Bearer $DRONE_TOKEN" "https://drone.viktorbarzin.me/api/repos/ViktorBarzin/<repo>/builds/<build-number>"
# Add secret to repo
curl -s -X POST -H "Authorization: Bearer $DRONE_TOKEN" "https://drone.viktorbarzin.me/api/repos/ViktorBarzin/<repo>/secrets" \
-d '{"name":"secret_name","data":"secret_value"}'
```
### Capabilities
With these tokens, Claude can:
- **GitHub**: Create/delete repos, push code, manage SSH/deploy keys, manage webhooks, manage org settings, manage packages
- **Drone CI**: Activate repos, trigger/monitor builds, manage secrets, configure pipelines
- Drone CI (`.drone.yml`): pushes apply `platform` stack (Terraform 1.5.7 + Terragrunt 0.99.4)
- TLS renewal pipeline: cron-triggered `renew2.sh` (certbot + Cloudflare DNS)
- **ALWAYS add `[ci skip]`** to commit messages when you've already applied locally
- **After committing, run `git push origin master`** to sync
## Infrastructure
- Proxmox hypervisor for VMs (192.168.1.127)
- Kubernetes cluster with GPU node (5 nodes: k8s-master + k8s-node1-4, running v1.34.2)
- NFS server at 10.0.10.15 for storage
- Redis shared service at `redis.redis.svc.cluster.local`
- Docker registry pull-through cache at 10.0.20.10 (static IP via cloud-init)
- Port 5000: docker.io (Docker Hub, with auth)
- Port 5010: ghcr.io
- Port 5020: quay.io
- Port 5030: registry.k8s.io
- Port 5040: reg.kyverno.io
- Worker nodes use `config_path = "/etc/containerd/certs.d"` with per-registry `hosts.toml` files
- k8s-master does NOT use pull-through cache (containerd 1.6.x incompatibility with config_path + mirrors)
- Proxmox hypervisor (192.168.1.127) — see `.claude/reference/proxmox-inventory.md` for full VM table
- Kubernetes cluster: 5 nodes (k8s-master + k8s-node1-4, v1.34.2), GPU on node1 (Tesla T4)
- NFS: `10.0.10.15`, Redis: `redis.redis.svc.cluster.local`
- Docker registry pull-through cache at `10.0.20.10` (ports 5000/5010/5020/5030/5040)
- GPU workloads need: `node_selector = { "gpu": "true" }` + `toleration { key = "nvidia.com/gpu", value = "true", effect = "NoSchedule" }`
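Sketch of the two GPU scheduling stanzas named above, as they appear inside a pod/deployment spec (kubernetes provider syntax):
```hcl
node_selector = {
  "gpu" = "true"            # matches the label on the GPU node
}

toleration {
  key      = "nvidia.com/gpu"
  operator = "Equal"
  value    = "true"
  effect   = "NoSchedule"   # tolerate the GPU node taint
}
```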
### Proxmox Host Hardware
- **CPU**: Intel Xeon E5-2699 v4 @ 2.20GHz (22 cores / 44 threads, single socket)
- **RAM**: 142 GB (Dell R730 server)
- **GPU**: NVIDIA Tesla T4 (PCIe passthrough to k8s-node1)
- **Disks**: 1.1TB + 931GB + 10.7TB (local storage)
- **Proxmox access**: `ssh root@192.168.1.127`
### Proxmox Network Bridges
- **vmbr0**: Physical bridge on `eno1`, IP `192.168.1.127/24` — connects to physical/home network (192.168.1.0/24)
- **vmbr1**: Internal-only bridge (no physical port), VLAN-aware — carries VLAN 10 (management 10.0.10.0/24) and VLAN 20 (kubernetes 10.0.20.0/24)
### Proxmox VM Inventory
| VMID | Name | Status | CPUs | RAM | Network | Disk | Notes |
|------|------|--------|------|-----|---------|------|-------|
| 101 | pfsense | running | 8 | 16GB | vmbr0, vmbr1:vlan10, vmbr1:vlan20 | 32G | Gateway/firewall, routes between all networks |
| 102 | devvm | running | 16 | 8GB | vmbr1:vlan10 | 100G | Development VM on management network |
| 103 | home-assistant | running | 8 | 16GB | vmbr1:vlan10(down), vmbr0 | 32G | Home Assistant, net0 link disabled, uses vmbr0 |
| 105 | pbs | stopped | 16 | 8GB | vmbr1:vlan10 | 32G | Proxmox Backup Server (not in use) |
| 200 | k8s-master | running | 8 | 16GB | vmbr1:vlan20 | 64G | Kubernetes control plane (10.0.20.100) |
| 201 | k8s-node1 | running | 16 | 24GB | vmbr1:vlan20 | 128G | GPU node, Tesla T4 passthrough (hostpci0) |
| 202 | k8s-node2 | running | 8 | 16GB | vmbr1:vlan20 | 64G | K8s worker node |
| 203 | k8s-node3 | running | 8 | 16GB | vmbr1:vlan20 | 64G | K8s worker node |
| 204 | k8s-node4 | running | 8 | 16GB | vmbr1:vlan20 | 64G | K8s worker node |
| 220 | docker-registry | running | 4 | 4GB | vmbr1:vlan20 | 64G | Terraform-managed, MAC DE:AD:BE:EF:22:22 (10.0.20.10) |
| 300 | Windows10 | running | 16 | 8GB | vmbr0 | 100G | Windows VM on physical network |
| 9000 | truenas | running | 16 | 16GB | vmbr1:vlan10 | 32G+7×256G+1T | NFS server (10.0.10.15), multiple data disks |
#### VM Templates (stopped, used for cloning)
| VMID | Name | Purpose |
|------|------|---------|
| 1000 | ubuntu-2404-cloudinit-non-k8s-template | Base template for non-K8s VMs |
| 1001 | docker-registry-template | Template for docker registry VM |
| 2000 | ubuntu-2404-cloudinit-k8s-template | Base template for K8s nodes |
#### Network Connectivity Summary
- **pfSense (101)** bridges all three networks: physical (vmbr0), management VLAN 10, and kubernetes VLAN 20
- **K8s cluster** (200-204) + **docker-registry** (220) are all on VLAN 20 (kubernetes network)
- **TrueNAS** (9000) + **devvm** (102) + **PBS** (105) are on VLAN 10 (management network)
- **Home Assistant** (103) is on physical network (vmbr0), with a disabled VLAN 10 interface
- **Windows10** (300) is on physical network (vmbr0) only
### GPU Node (k8s-node1)
- **VMID**: 201
- **PCIe Passthrough**: `0000:06:00.0` (NVIDIA Tesla T4)
- **Taint**: `nvidia.com/gpu=true:NoSchedule` - Only GPU workloads can run here
- **Label**: `gpu=true`
- GPU workloads must have both:
- `node_selector = { "gpu": "true" }`
- `toleration { key = "nvidia.com/gpu", operator = "Equal", value = "true", effect = "NoSchedule" }`
- Taint is applied via `null_resource.gpu_node_taint` in `modules/kubernetes/nvidia/main.tf`
## Git Operations (IMPORTANT)
- **Git is slow** on this repo due to many files - commands can take 30+ seconds
- Use `GIT_OPTIONAL_LOCKS=0` prefix if git hangs
- Always commit only specific files you changed, not everything
- **ALWAYS ask user before pushing to remote** - never push without explicit confirmation
## Git Operations
- **Git is slow** — commands can take 30+ seconds. Use `GIT_OPTIONAL_LOCKS=0` if git hangs.
- Commit only specific files. **ALWAYS ask user before pushing**.
## Prometheus Alerts
- Alert rules are in `modules/kubernetes/monitoring/prometheus_chart_values.tpl`
- Under `serverFiles.alerting_rules.yml.groups`
- Rules in `modules/kubernetes/monitoring/prometheus_chart_values.tpl`
- Groups: "R730 Host", "Nvidia Tesla T4 GPU", "Power", "Cluster"
- kube-state-metrics provides: `kube_deployment_*`, `kube_statefulset_*`, `kube_daemonset_*`
## Tier System
- **0-core**: Critical infrastructure (ingress, DNS, VPN, auth)
- **1-cluster**: Cluster services (Redis, metrics, security)
- **2-gpu**: GPU workloads (Immich, Ollama, Frigate)
- **3-edge**: User-facing services
- **4-aux**: Optional/auxiliary services
### Resource Governance (Kyverno-based)
Four layers of noisy-neighbor protection, all defined in `modules/kubernetes/kyverno/resource-governance.tf`:
1. **PriorityClasses**: `tier-0-core` (1M) through `tier-4-aux` (200K). `tier-4-aux` uses `preemption_policy=Never`.
2. **LimitRange defaults** (Kyverno generate): Auto-creates `tier-defaults` LimitRange in namespaces based on tier label. Only affects containers without explicit resources.
3. **ResourceQuotas** (Kyverno generate): Auto-creates `tier-quota` ResourceQuota in namespaces with tier labels. Excludes namespaces with `resource-governance/custom-quota=true` label.
4. **Priority injection** (Kyverno mutate): Sets `priorityClassName` on Pods based on namespace tier label.
**Custom quota override**: Add label `resource-governance/custom-quota: "true"` to namespace, then define a custom `kubernetes_resource_quota` in the service's Terraform module. Currently used by: monitoring, crowdsec.
**LimitRange defaults by tier**:
| Tier | Default Req | Default Limit | Max |
|------|------------|--------------|-----|
| 0-core | 100m/128Mi | 2/4Gi | 8/16Gi |
| 1-cluster | 100m/128Mi | 2/4Gi | 4/8Gi |
| 2-gpu | 100m/256Mi | 4/8Gi | 8/16Gi |
| 3-edge | 50m/128Mi | 1/2Gi | 4/8Gi |
| 4-aux | 25m/64Mi | 500m/1Gi | 2/4Gi |
**ResourceQuota hard limits by tier**:
| Tier | Req CPU | Req Mem | Lim CPU | Lim Mem | Pods |
|------|---------|---------|---------|---------|------|
| 0-core | 8 | 8Gi | 32 | 64Gi | 100 |
| 1-cluster | 4 | 4Gi | 16 | 32Gi | 30 |
| 2-gpu | 8 | 8Gi | 48 | 96Gi | 40 |
| 3-edge | 4 | 4Gi | 16 | 32Gi | 30 |
| 4-aux | 2 | 2Gi | 8 | 16Gi | 20 |
## Tier System & Resource Governance
- **0-core**: Critical infra (ingress, DNS, VPN, auth) | **1-cluster**: Redis, metrics, security | **2-gpu**: GPU workloads | **3-edge**: User-facing | **4-aux**: Optional
- Kyverno-based governance in `modules/kubernetes/kyverno/resource-governance.tf`:
1. PriorityClasses: `tier-0-core` (1M) through `tier-4-aux` (200K, preemption=Never)
2. LimitRange defaults (Kyverno generate): auto-created per namespace tier
3. ResourceQuotas (Kyverno generate): auto-created per namespace tier (skip with label `resource-governance/custom-quota=true`)
4. Priority injection (Kyverno mutate): sets priorityClassName on Pods
- Custom quota override: monitoring, crowdsec
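A hedged sketch of the custom-quota override — the tier label value and the quota numbers are illustrative; only the `resource-governance/custom-quota` label is documented here:
```hcl
resource "kubernetes_namespace" "myservice" {
  metadata {
    name = "myservice"
    labels = {
      tier                               = "3-edge"   # assumed label value
      "resource-governance/custom-quota" = "true"     # skip the auto-generated tier-quota
    }
  }
}

resource "kubernetes_resource_quota" "myservice" {
  metadata {
    name      = "custom-quota"
    namespace = kubernetes_namespace.myservice.metadata[0].name
  }
  spec {
    hard = {
      "requests.cpu"    = "6"      # illustrative limits
      "requests.memory" = "6Gi"
      "limits.cpu"      = "24"
      "limits.memory"   = "48Gi"
    }
  }
}
```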
---
## User Preferences
### Calendar
- **Default calendar**: Nextcloud (always use unless otherwise specified)
- **Nextcloud URL**: `https://nextcloud.viktorbarzin.me`
- **CalDAV endpoint**: `https://nextcloud.viktorbarzin.me/remote.php/dav/calendars/<username>/<calendar-name>/`
### Home Assistant
- **Default smart home**: Home Assistant (always use for smart home control)
- **Two deployments**:
- **ha-london** (default): `https://ha-london.viktorbarzin.me` | Script: `.claude/home-assistant.py` | SSH: `ssh pi@192.168.8.103`, config at `/home/pi/docker/homeAssistant/`
- **ha-sofia**: `https://ha-sofia.viktorbarzin.me` | Script: `.claude/home-assistant-sofia.py` | SSH: `ssh vbarzin@192.168.1.8`, config at `/config/`
- **Aliases**: "ha" or "HA" = ha-london. "ha sofia" or "ha-sofia" = ha-sofia.
### Development
- **Frontend framework**: Svelte (user is learning it, so use Svelte for all new web apps)
### Pod Monitoring After Updates
- **Never use `sleep` to wait for pods** — instead, spawn a background subagent (Task tool with `run_in_background: true`) that continuously checks pod state (e.g., `kubectl get pods -n <namespace> -w`) and reports back when the pod is ready or if errors occur. This catches CrashLoopBackOff, ImagePullBackOff, and other failures much sooner than periodic sleep-based polling.
- **Calendar**: Nextcloud at `https://nextcloud.viktorbarzin.me`
- **Home Assistant**: ha-london (default) at `https://ha-london.viktorbarzin.me`, ha-sofia at `https://ha-sofia.viktorbarzin.me`. "ha"/"HA" = ha-london.
- **Frontend**: Svelte for all new web apps
- **Pod monitoring**: Never use `sleep` — spawn background subagent with `kubectl get pods -w` instead
---
## Skills & Workflows
Skills are specialized workflows for common tasks. Located in `.claude/skills/`.
### Available Skills
**setup-project** (`.claude/skills/setup-project/SKILL.md`)
- Deploy new self-hosted services from GitHub repos
- Automated workflow: Docker image → Terraform module → Deploy
- Handles database setup, ingress, DNS configuration
- **When to use**: User provides GitHub URL or wants to deploy a new service
- **Example**: "Deploy [GitHub repo] to the cluster"
**extend-vm-storage** (`.claude/skills/extend-vm-storage/SKILL.md`)
- Extend disk storage on K8s node VMs (Proxmox-hosted)
- Automates: drain → shutdown → resize → boot → expand filesystem → uncordon
- **When to use**: A k8s node needs more disk space
- **Example**: "Extend storage on k8s-node2 by 64G"
## Reference Data
- `.claude/reference/service-catalog.md` — Full service catalog (70+ services) with Cloudflare domains
- `.claude/reference/proxmox-inventory.md` — VM table, hardware specs, network topology, GPU config
- `.claude/reference/github-drone-api.md` — GitHub & Drone CI API patterns with curl examples
- `.claude/reference/authentik-state.md` — Current applications, groups, users, login sources
---
## Service-Specific Notes
### Authentik (Identity Provider)
- **Helm Chart**: `authentik` v2025.10.3 from `https://charts.goauthentik.io/`
- **URL**: `https://authentik.viktorbarzin.me`
- **API**: `https://authentik.viktorbarzin.me/api/v3/`
- **API Token**: Stored in `terraform.tfvars` as `authentik_api_token` (non-expiring, superuser, identifier: `claude-code-permanent`). Read with: `grep authentik_api_token terraform.tfvars | cut -d'"' -f2`
- **Namespace**: `authentik` (tier: cluster)
- **Architecture**: 3 server replicas + 3 worker replicas + 3 PgBouncer replicas + 1 embedded outpost
- **Database**: PostgreSQL via `postgresql.dbaas:5432`, pooled through PgBouncer at `pgbouncer.authentik:6432`
- **Redis**: Shared at `redis.redis.svc.cluster.local`
- **Terraform**: `modules/kubernetes/authentik/main.tf` (Helm), `pgbouncer.tf` (connection pooling)
#### Authentik API Management
To call the API, use:
```bash
curl -s -H "Authorization: Bearer <TOKEN>" "https://authentik.viktorbarzin.me/api/v3/<endpoint>/"
```
Key API endpoints:
- `core/users/` — List/create/update/delete users
- `core/groups/` — List/create/update/delete groups
- `core/applications/` — List/create applications
- `providers/all/` — List all providers (OAuth2, Proxy, etc.)
- `providers/oauth2/` — OAuth2/OIDC providers specifically
- `providers/proxy/` — Proxy providers (forward auth)
- `flows/instances/` — List flows
- `stages/all/` — List stages
- `sources/all/` — List sources (Google, GitHub, etc.)
- `outposts/instances/` — List outposts
- `propertymappings/all/` — List property mappings
- `rbac/roles/` — List roles
#### Current Applications (9)
| Application | Provider Type | Auth Flow |
|-------------|--------------|-----------|
| Cloudflare Access | OAuth2/OIDC | explicit consent |
| Domain wide catch all | Proxy (forward auth) | implicit consent |
| Grafana | OAuth2/OIDC | implicit consent |
| Headscale | OAuth2/OIDC | explicit consent |
| Immich | OAuth2/OIDC | explicit consent |
| Kubernetes | OAuth2/OIDC (public) | implicit consent |
| linkwarden | OAuth2/OIDC | explicit consent |
| Matrix | OAuth2/OIDC | implicit consent |
| wrongmove | OAuth2/OIDC | implicit consent |
#### Current Groups (9)
| Group | Parent | Superuser | Purpose |
|-------|--------|-----------|---------|
| Allow Login Users | — | No | Parent group for login-permitted users |
| authentik Admins | — | Yes | Full admin access |
| authentik Read-only | — | No | Read-only access (has role) |
| Headscale Users | Allow Login Users | No | VPN access |
| Home Server Admins | Allow Login Users | No | Server admin access |
| Wrongmove Users | Allow Login Users | No | Real-estate app access |
| kubernetes-admins | — | No | K8s cluster-admin RBAC |
| kubernetes-power-users | — | No | K8s power-user RBAC |
| kubernetes-namespace-owners | — | No | K8s namespace-owner RBAC |
#### Current Users (7 real users)
| Username | Name | Type | Groups |
|----------|------|------|--------|
| akadmin | authentik Default Admin | internal | authentik Admins, Home Server Admins, Headscale Users |
| vbarzin@gmail.com | Viktor Barzin | internal | authentik Admins, Home Server Admins, Wrongmove Users, Headscale Users |
| emil.barzin@gmail.com | Emil Barzin | internal | Home Server Admins, Headscale Users |
| ancaelena98@gmail.com | Anca Milea | external | Wrongmove Users, Headscale Users |
| vabbit81@gmail.com | GHEORGHE Milea | external | Headscale Users |
| valentinakolevabarzina@gmail.com | Валентина Колева-Барзина | internal | Headscale Users |
| anca.r.cristian10@gmail.com | — | internal | Wrongmove Users |
| kadir.tugan@gmail.com | Kadir | internal | Wrongmove Users |
#### Login Sources (Social Login)
- **Google** (OAuth) — user matching by identifier
- **GitHub** (OAuth) — user matching by email_link
- **Facebook** (OAuth) — user matching by email_link
- All use the same authentication flow (`1a779f24`) and enrollment flow (`87572804`)
#### Authorization Flows
- **Explicit consent** (`default-provider-authorization-explicit-consent`): Shows consent screen before redirecting — used for Immich, Linkwarden, Headscale, Cloudflare
- **Implicit consent** (`default-provider-authorization-implicit-consent`): Auto-redirects without consent — used for Grafana, Matrix, Domain catch-all, Wrongmove
#### Traefik Integration
- Forward auth middleware: `authentik-forward-auth` in Traefik namespace
- Outpost endpoint: `http://ak-outpost-authentik-embedded-outpost.authentik.svc.cluster.local:9000/outpost.goauthentik.io/auth/traefik`
- Services opt in via `protected = true` in `ingress_factory`
- Response headers: `X-authentik-username`, `X-authentik-uid`, `X-authentik-email`, `X-authentik-name`, `X-authentik-groups`, `Set-Cookie`
#### OIDC for Kubernetes API
- **Issuer**: `https://authentik.viktorbarzin.me/application/o/kubernetes/`
- **Client ID**: `kubernetes` (public client, no secret)
- **Username claim**: `email`, **Groups claim**: `groups`
- **Signing key**: `authentik Self-signed Certificate` (must be assigned to the provider or JWKS will be empty)
- **Redirect URIs**: Regex mode `http://localhost:.*` and `http://127\.0\.0\.1:.*` (kubelogin picks random ports)
- **Configured via**: SSH to kube-apiserver manifest (`modules/kubernetes/rbac/apiserver-oidc.tf`)
- **RBAC module**: `modules/kubernetes/rbac/main.tf` — admin/power-user/namespace-owner roles
- **Self-service portal**: `modules/kubernetes/k8s-portal/` — SvelteKit app at `https://k8s-portal.viktorbarzin.me`
- **User definition**: `k8s_users` variable in `terraform.tfvars`
- **Audit logging**: Enabled via `modules/kubernetes/rbac/audit-policy.tf`, logs at `/var/log/kubernetes/audit.log`
**CRITICAL GOTCHAS when setting up Authentik OIDC for Kubernetes:**
1. **Signing key MUST be assigned** to the OAuth2 provider. Without it, the JWKS endpoint returns `{}` and kube-apiserver can't validate tokens.
2. **Email mapping must set `email_verified: True`**. The default Authentik email scope mapping hardcodes `email_verified: False`, which causes kube-apiserver to reject the token with `oidc: email not verified`. Use a custom scope mapping: `return {"email": request.user.email, "email_verified": True}`
3. **kubelogin needs `--oidc-extra-scope`** for `email`, `profile`, `groups`. Without these, only `openid` is requested and the token lacks the `email` claim, causing `oidc: parse username claims "email": claim not present`.
4. **Redirect URIs must use regex mode** (`http://localhost:.*`) because kubelogin picks random ports, not just 8000/18000.
5. **Kubelet static pod manifest changes** require a full cycle to take effect: remove manifest, stop kubelet, remove containers via crictl, re-add manifest, start kubelet. Simple `touch` or kubelet restart is not enough.
6. **Property mappings endpoint** in Authentik 2025.10.x is `propertymappings/provider/scope/` (not the older `propertymappings/scope/`).
#### Common Management Tasks
**Add a new OAuth2 application:**
1. Create OAuth2 provider: `POST /api/v3/providers/oauth2/` with client_id, client_secret, redirect_uris, authorization_flow, etc.
2. Create application: `POST /api/v3/core/applications/` with name, slug, provider pk
3. (Optional) Bind to group policy for access control
**Add a user to a group:**
```bash
# Get group pk, then PATCH with updated users list
curl -X PATCH -H "Authorization: Bearer <TOKEN>" -H "Content-Type: application/json" \
"https://authentik.viktorbarzin.me/api/v3/core/groups/<group-pk>/" \
-d '{"users": [<existing_user_pks>, <new_user_pk>]}'
```
**Protect a service with forward auth:**
Set `protected = true` in the service's `ingress_factory` call in Terraform.
- **URL**: `https://authentik.viktorbarzin.me` | **API**: `/api/v3/` | **Token**: `authentik_api_token` in tfvars
- **Architecture**: 3 server + 3 worker + 3 PgBouncer + embedded outpost
- **Database**: PostgreSQL via `postgresql.dbaas:5432`, PgBouncer at `pgbouncer.authentik:6432`
- **Traefik integration**: Forward auth via `protected = true` in ingress_factory
- **OIDC for K8s**: Issuer `https://authentik.viktorbarzin.me/application/o/kubernetes/`, client `kubernetes` (public)
- For management tasks, current state, and OIDC gotchas: see `authentik` and `authentik-oidc-kubernetes` skills
- For current apps/groups/users snapshot: see `.claude/reference/authentik-state.md`
### AFFiNE (Visual Canvas)
- **Image**: `ghcr.io/toeverything/affine:stable`
- **Port**: 3010
- **Requires**: PostgreSQL + Redis
- **Image**: `ghcr.io/toeverything/affine:stable` | **Port**: 3010 | **Requires**: PostgreSQL + Redis
- **Migration**: Init container runs `node ./scripts/self-host-predeploy.js`
- **Storage**: NFS at `/mnt/main/affine` mounted to `/root/.affine/storage` and `/root/.affine/config`
- **Key env vars**:
- `AFFINE_SERVER_EXTERNAL_URL` - Public URL (e.g., `https://affine.viktorbarzin.me`)
- `AFFINE_SERVER_HTTPS` - Set to `true` behind TLS ingress
- `DATABASE_URL` - PostgreSQL connection string
- `REDIS_SERVER_HOST` - Redis hostname
- `MAILER_*` - SMTP configuration for email invites
- **Local-first**: Data stored in browser by default; syncs to server when user creates account
- **Docs**: https://docs.affine.pro/self-host-affine
- **Storage**: NFS `/mnt/main/affine``/root/.affine/storage` and `/root/.affine/config`
### Wyoming Whisper (STT for Home Assistant)
- **Image**: `rhasspy/wyoming-whisper:latest`
- **Port**: 10300/TCP (Wyoming protocol)
- **Model**: `small-int8` (CPU-optimized, no CUDA variant available from upstream)
- **Runs on**: GPU node (node_selector gpu=true + nvidia toleration) but uses CPU only
- **Storage**: NFS at `/mnt/main/whisper``/data` (model cache)
- **Exposure**: Internal only via Traefik TCP entrypoint `whisper-tcp` → IngressRouteTCP
- **Access**: `10.0.20.202:10300` (Traefik LB IP, no public DNS)
- **HA Integration**: Wyoming Protocol integration in ha-london, host `10.0.20.202`, port `10300`
- **No GPU acceleration**: Official image is CPU-only (Debian + PyTorch CPU). The `mib1185/wyoming-faster-whisper-cuda` image exists but requires self-build.
### Wyoming Whisper (STT)
- **Image**: `rhasspy/wyoming-whisper:latest` | **Port**: 10300/TCP (Wyoming protocol)
- **Model**: `small-int8` (CPU-only) | **Access**: `10.0.20.202:10300` (internal, no public DNS)
- **HA Integration**: Wyoming Protocol in ha-london
### Gramps Web (Genealogy)
- **Image**: `ghcr.io/gramps-project/grampsweb:latest`
- **Port**: 5000
- **URL**: `https://family.viktorbarzin.me`
- **Components**: Web app + Celery worker (2 containers in 1 pod)
- **Requires**: Shared Redis (DB 2 for Celery broker/backend, DB 3 for rate limiting)
- **Storage**: NFS at `/mnt/main/grampsweb` with sub_paths: users, indexdir, thumbnail_cache, cache, secret, grampsdb, media, tmp
- **Key env vars**:
- `GRAMPSWEB_SECRET_KEY` - Flask secret key (generated via `random_password`)
- `GRAMPSWEB_TREE` - Tree name
- `GRAMPSWEB_BASE_URL` - Public URL
- `GRAMPSWEB_CELERY_CONFIG__broker_url` / `result_backend` - Redis connection
- `GRAMPSWEB_REGISTRATION_DISABLED` - Set to `True`
- `GRAMPSWEB_EMAIL_*` - SMTP configuration
- `GRAMPSWEB_LLM_*` - Ollama AI integration
- **Celery command**: `celery -A gramps_webapi.celery worker --loglevel=INFO --concurrency=2`
- **Registration**: Disabled; first user created via UI setup wizard
- **Image**: `ghcr.io/gramps-project/grampsweb:latest` | **Port**: 5000 | **URL**: `https://family.viktorbarzin.me`
- **Components**: Web app + Celery worker (2 containers in 1 pod) | **Redis**: DB 2 (broker), DB 3 (rate limiting)
- **Storage**: NFS `/mnt/main/grampsweb` with sub_paths
### Loki + Alloy (Centralized Log Collection)
- **Loki image**: `grafana/loki:3.6.5` (Helm chart, single binary mode)
- **Alloy image**: `grafana/alloy:v1.13.0` (Helm chart, DaemonSet)
- **Config files**: `modules/kubernetes/monitoring/loki.tf`, `loki.yaml`, `alloy.yaml`
- **Port**: 3100/TCP (Loki API)
- **Storage**: NFS PV at `/mnt/main/loki/loki` (15Gi), WAL on tmpfs (2Gi in-memory)
- **Memory**: Loki 6Gi limit, Alloy 128Mi per pod (4 worker nodes)
- **Disk-friendly tuning**: `max_chunk_age: 24h`, `chunk_idle_period: 12h` — holds chunks in memory, flushes ~once/day
- **Retention**: 7 days (`retention_period: 168h`), compactor enforces deletion
- **Crash policy**: WAL on tmpfs — up to 24h log loss on crash (alerts still fire in real-time)
- **Ruler**: Evaluates LogQL alert rules, fires to `http://prometheus-alertmanager.monitoring.svc.cluster.local:9093`
### Loki + Alloy (Log Collection)
- **Loki**: `grafana/loki:3.6.5` (single binary, 6Gi RAM, 7d retention)
- **Alloy**: `grafana/alloy:v1.13.0` (DaemonSet, 128Mi/pod)
- **Storage**: NFS PV `/mnt/main/loki/loki` (15Gi), WAL on tmpfs (2Gi)
- **Alert rules**: HighErrorRate, PodCrashLoopBackOff, OOMKilled (ConfigMap `loki-alert-rules`)
- **Grafana**: Datasource UID `P8E80F9AEF21F6940`, dashboard "Loki Kubernetes Logs" (stored in MySQL, not file-provisioned)
- **Sysctl DaemonSet**: `sysctl-inotify` sets `fs.inotify.max_user_watches=1048576` on all nodes (required for Alloy fsnotify)
- **Disabled components**: gateway, chunksCache, resultsCache (not needed for single binary)
- **Key paths**: Compactor at `/var/loki/compactor`, ruler scratch at `/var/loki/scratch` (must be under `/var/loki` — root FS is read-only)
- **Querying**: Grafana Explore with LogQL, e.g. `{namespace="monitoring"} |= "error"`
- **Troubleshooting**: If "entry too far behind" errors on first start, restart Alloy DaemonSet (`kubectl rollout restart ds -n monitoring alloy`) — Alloy reads historical logs on first boot, which Loki rejects; clears after restart
- **Troubleshooting**: "entry too far behind" on first start → restart Alloy DaemonSet
### OpenClaw (AI Agent Gateway)
- **Image**: `ghcr.io/openclaw/openclaw:2026.2.9`
- **Port**: 18789
- **URL**: `https://openclaw.viktorbarzin.me` (authentik-protected)
- **Namespace**: `openclaw` (tier: aux)
- **Formerly**: `moltbot` — renamed in Feb 2026
- **Architecture**: Single pod with init container (tools download + repo clone) + main container (OpenClaw gateway)
- **Init container**: Downloads kubectl v1.34.2, terraform 1.14.5, git-crypt; clones infra repo; runs terraform init
- **ServiceAccount**: `openclaw` with `cluster-admin` ClusterRoleBinding (for managing cluster resources)
- **Storage**: NFS at `/mnt/main/openclaw/workspace` (git repo) and `/mnt/main/openclaw/data` (persistent data)
- **Config**: `openclaw.json` ConfigMap with model providers (Gemini, Ollama, Llama API), tool permissions, and agent defaults
- **Variables**: `openclaw_ssh_key`, `openclaw_skill_secrets` in `terraform.tfvars`
- **Skill secrets**: Home Assistant tokens (london + sofia), Uptime Kuma password — passed as env vars
- **Model providers**: Gemini (gemini-2.5-flash), Ollama (qwen2.5-coder:14b, deepseek-r1:14b), Llama API (Llama-3.3-70B, Llama-4-Scout/Maverick)
- **Image**: `ghcr.io/openclaw/openclaw:2026.2.9` | **Port**: 18789 | **URL**: `https://openclaw.viktorbarzin.me`
- **Init container**: Downloads kubectl, terraform, git-crypt; clones infra repo
- **ServiceAccount**: `openclaw` with `cluster-admin` ClusterRoleBinding
- **Model providers**: Gemini (gemini-2.5-flash), Ollama (qwen2.5-coder:14b, deepseek-r1:14b), Llama API
### Poison Fountain (Anti-AI Scraping Service)
- **Image**: `python:3.12-slim` (runs custom `server.py` from ConfigMap)
- **Port**: 8080
- **URL**: `https://poison.viktorbarzin.me` (public, no auth)
- **Namespace**: `poison-fountain` (tier: aux)
- **Stack**: `stacks/poison-fountain/`
- **Architecture**: 1 Deployment (Python HTTP server) + 1 CronJob (fetcher, every 6h)
- **Storage**: NFS at `/mnt/main/poison-fountain``cache/` subdir for poison docs (chmod 777 for curl uid 101)
- **Endpoints** (curl checks after this list):
- `/auth` — ForwardAuth: checks User-Agent, returns 200 (allow) or 403 (block AI bots)
- `/article/*` — Tarpit: drip-feeds poison content at ~100 bytes/sec (DRIP_BYTES=50, DRIP_DELAY=0.5s)
- `/healthz` — Health check
- **CronJob**: Fetches 50 documents from `rnsaffn.com/poison2/` using `--http1.1` (HTTP/2 hangs)
- **Ingress**: Uses `anti_ai_scraping = false` (doesn't protect itself), `skip_default_rate_limit = true`, `exclude_crowdsec = true`
- **DNS**: `poison.viktorbarzin.me` in `cloudflare_non_proxied_names`
- **Traefik middlewares** (in `stacks/platform/modules/traefik/middleware.tf`):
- `ai-bot-block` — ForwardAuth to poison-fountain `/auth`
- `anti-ai-headers` — X-Robots-Tag: noai, noimageai
- `anti-ai-trap-links` — rewrite-body plugin injecting 5 hidden links before `</body>`
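A minimal sketch for exercising the endpoints above from a shell; the AI User-Agent string is only an example and the expected codes follow the `/auth` semantics listed above:
```bash
# ForwardAuth: a browser UA should be allowed (expect 200)
curl -s -o /dev/null -w "%{http_code}\n" -A "Mozilla/5.0" https://poison.viktorbarzin.me/auth

# ForwardAuth: an AI-crawler UA should be blocked (expect 403; example UA, depends on the blocklist)
curl -s -o /dev/null -w "%{http_code}\n" -A "GPTBot/1.0" https://poison.viktorbarzin.me/auth

# Tarpit: cap the request time, since /article/* drips ~100 bytes/sec by design
curl -s --max-time 10 https://poison.viktorbarzin.me/article/test | wc -c   # roughly 1 KB or less

# Health check
curl -s https://poison.viktorbarzin.me/healthz
```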
## Service Versions (as of 2026-02)
Immich v2.4.1 | AFFiNE stable | Whisper latest | Loki 3.6.5 | Alloy v1.13.0 | OpenClaw 2026.2.9


@ -0,0 +1,50 @@
# Authentik Current State
> Snapshot of applications, groups, users, and flows. Use `authentik` skill for management tasks.
## Applications (9)
| Application | Provider Type | Auth Flow |
|-------------|--------------|-----------|
| Cloudflare Access | OAuth2/OIDC | explicit consent |
| Domain wide catch all | Proxy (forward auth) | implicit consent |
| Grafana | OAuth2/OIDC | implicit consent |
| Headscale | OAuth2/OIDC | explicit consent |
| Immich | OAuth2/OIDC | explicit consent |
| Kubernetes | OAuth2/OIDC (public) | implicit consent |
| linkwarden | OAuth2/OIDC | explicit consent |
| Matrix | OAuth2/OIDC | implicit consent |
| wrongmove | OAuth2/OIDC | implicit consent |
## Groups (9)
| Group | Parent | Superuser | Purpose |
|-------|--------|-----------|---------|
| Allow Login Users | — | No | Parent group for login-permitted users |
| authentik Admins | — | Yes | Full admin access |
| authentik Read-only | — | No | Read-only access (has role) |
| Headscale Users | Allow Login Users | No | VPN access |
| Home Server Admins | Allow Login Users | No | Server admin access |
| Wrongmove Users | Allow Login Users | No | Real-estate app access |
| kubernetes-admins | — | No | K8s cluster-admin RBAC |
| kubernetes-power-users | — | No | K8s power-user RBAC |
| kubernetes-namespace-owners | — | No | K8s namespace-owner RBAC |
## Users (8 listed: 7 real + akadmin)
| Username | Name | Type | Groups |
|----------|------|------|--------|
| akadmin | authentik Default Admin | internal | authentik Admins, Home Server Admins, Headscale Users |
| vbarzin@gmail.com | Viktor Barzin | internal | authentik Admins, Home Server Admins, Wrongmove Users, Headscale Users |
| emil.barzin@gmail.com | Emil Barzin | internal | Home Server Admins, Headscale Users |
| ancaelena98@gmail.com | Anca Milea | external | Wrongmove Users, Headscale Users |
| vabbit81@gmail.com | GHEORGHE Milea | external | Headscale Users |
| valentinakolevabarzina@gmail.com | Валентина Колева-Барзина | internal | Headscale Users |
| anca.r.cristian10@gmail.com | — | internal | Wrongmove Users |
| kadir.tugan@gmail.com | Kadir | internal | Wrongmove Users |
## Login Sources
- **Google** (OAuth) — user matching by identifier
- **GitHub** (OAuth) — user matching by email_link
- **Facebook** (OAuth) — user matching by email_link
## Authorization Flows
- **Explicit consent** (`default-provider-authorization-explicit-consent`): Shows consent screen
- **Implicit consent** (`default-provider-authorization-implicit-consent`): Auto-redirects


@ -0,0 +1,50 @@
# GitHub & Drone CI API Reference
> Token locations and common API patterns.
## GitHub API
- **Username**: `ViktorBarzin`
- **Token**: `grep github_pat terraform.tfvars | cut -d'"' -f2` (git-crypt encrypted)
- **Scopes**: Full access (repo, admin:public_key, admin:repo_hook, delete_repo, admin:org, workflow, write:packages)
- **`gh` CLI**: Blocked by sandbox — use `curl` instead
```bash
GITHUB_TOKEN=$(grep github_pat terraform.tfvars | cut -d'"' -f2)
# List repos
curl -s -H "Authorization: token $GITHUB_TOKEN" "https://api.github.com/users/ViktorBarzin/repos?per_page=100"
# Create repo
curl -s -X POST -H "Authorization: token $GITHUB_TOKEN" "https://api.github.com/user/repos" \
-d '{"name":"repo-name","private":true}'
# Add deploy key
curl -s -X POST -H "Authorization: token $GITHUB_TOKEN" "https://api.github.com/repos/ViktorBarzin/<repo>/keys" \
-d '{"title":"key-name","key":"ssh-ed25519 ...","read_only":false}'
# Create webhook
curl -s -X POST -H "Authorization: token $GITHUB_TOKEN" "https://api.github.com/repos/ViktorBarzin/<repo>/hooks" \
-d '{"config":{"url":"https://drone.viktorbarzin.me/hook","content_type":"json","secret":"..."},"events":["push","pull_request"]}'
```
## Drone CI API
- **Server**: `https://drone.viktorbarzin.me`
- **Token**: `grep drone_api_token terraform.tfvars | cut -d'"' -f2`
```bash
DRONE_TOKEN=$(grep drone_api_token terraform.tfvars | cut -d'"' -f2)
# Activate repo
curl -s -X POST -H "Authorization: Bearer $DRONE_TOKEN" "https://drone.viktorbarzin.me/api/repos/ViktorBarzin/<repo>"
# Trigger build
curl -s -X POST -H "Authorization: Bearer $DRONE_TOKEN" "https://drone.viktorbarzin.me/api/repos/ViktorBarzin/<repo>/builds"
# Add secret
curl -s -X POST -H "Authorization: Bearer $DRONE_TOKEN" "https://drone.viktorbarzin.me/api/repos/ViktorBarzin/<repo>/secrets" \
-d '{"name":"secret_name","data":"secret_value"}'
```
## Capabilities
- **GitHub**: Create/delete repos, push code, manage SSH/deploy keys, manage webhooks, manage org settings, manage packages
- **Drone CI**: Activate repos, trigger/monitor builds, manage secrets, configure pipelines
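For the build-monitoring capability, the same API exposes build listings and per-build detail; a minimal sketch, with `<repo>` as a placeholder:
```bash
DRONE_TOKEN=$(grep drone_api_token terraform.tfvars | cut -d'"' -f2)
# List recent builds (newest first) to find the latest build number and status
curl -s -H "Authorization: Bearer $DRONE_TOKEN" \
  "https://drone.viktorbarzin.me/api/repos/ViktorBarzin/<repo>/builds"
# Inspect one build (e.g. number 42); the detail response includes per-stage status
curl -s -H "Authorization: Bearer $DRONE_TOKEN" \
  "https://drone.viktorbarzin.me/api/repos/ViktorBarzin/<repo>/builds/42"
```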


@ -0,0 +1,52 @@
# Proxmox Inventory & Infrastructure
> Static reference for VMs, hardware, and network topology.
## Proxmox Host Hardware
- **CPU**: Intel Xeon E5-2699 v4 @ 2.20GHz (22 cores / 44 threads, single socket)
- **RAM**: 142 GB (Dell R730 server)
- **GPU**: NVIDIA Tesla T4 (PCIe passthrough to k8s-node1)
- **Disks**: 1.1TB + 931GB + 10.7TB (local storage)
- **Proxmox access**: `ssh root@192.168.1.127`
## Network Topology
```
10.0.10.0/24 - Management: Wizard (10.0.10.10), TrueNAS NFS (10.0.10.15)
10.0.20.0/24 - Kubernetes: pfSense GW (10.0.20.1), Registry (10.0.20.10),
k8s-master (10.0.20.100), DNS (10.0.20.101), MetalLB (10.0.20.102-200)
192.168.1.0/24 - Physical: Proxmox (192.168.1.127)
```
## Network Bridges
- **vmbr0**: Physical bridge on `eno1`, IP `192.168.1.127/24` — physical/home network
- **vmbr1**: Internal-only bridge, VLAN-aware — VLAN 10 (management) and VLAN 20 (kubernetes)
## VM Inventory
| VMID | Name | Status | CPUs | RAM | Network | Disk | Notes |
|------|------|--------|------|-----|---------|------|-------|
| 101 | pfsense | running | 8 | 16GB | vmbr0, vmbr1:vlan10, vmbr1:vlan20 | 32G | Gateway/firewall |
| 102 | devvm | running | 16 | 8GB | vmbr1:vlan10 | 100G | Development VM |
| 103 | home-assistant | running | 8 | 16GB | vmbr0 | 32G | HA, net0(vlan10) disabled |
| 105 | pbs | stopped | 16 | 8GB | vmbr1:vlan10 | 32G | Proxmox Backup (unused) |
| 200 | k8s-master | running | 8 | 16GB | vmbr1:vlan20 | 64G | Control plane (10.0.20.100) |
| 201 | k8s-node1 | running | 16 | 24GB | vmbr1:vlan20 | 128G | GPU node, Tesla T4 |
| 202 | k8s-node2 | running | 8 | 16GB | vmbr1:vlan20 | 64G | Worker |
| 203 | k8s-node3 | running | 8 | 16GB | vmbr1:vlan20 | 64G | Worker |
| 204 | k8s-node4 | running | 8 | 16GB | vmbr1:vlan20 | 64G | Worker |
| 220 | docker-registry | running | 4 | 4GB | vmbr1:vlan20 | 64G | MAC DE:AD:BE:EF:22:22 (10.0.20.10) |
| 300 | Windows10 | running | 16 | 8GB | vmbr0 | 100G | Windows VM |
| 9000 | truenas | running | 16 | 16GB | vmbr1:vlan10 | 32G+7x256G+1T | NFS (10.0.10.15) |
## VM Templates
| VMID | Name | Purpose |
|------|------|---------|
| 1000 | ubuntu-2404-cloudinit-non-k8s-template | Base for non-K8s VMs |
| 1001 | docker-registry-template | Docker registry VM |
| 2000 | ubuntu-2404-cloudinit-k8s-template | Base for K8s nodes |
## GPU Node (k8s-node1)
- **VMID**: 201, **PCIe**: `0000:06:00.0` (NVIDIA Tesla T4)
- **Taint**: `nvidia.com/gpu=true:NoSchedule`, **Label**: `gpu=true`
- GPU workloads need: `node_selector = { "gpu": "true" }` + nvidia toleration (verification commands after this list)
- Taint applied via `null_resource.gpu_node_taint` in `modules/kubernetes/nvidia/main.tf`
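A quick verification sketch for the taint, label, and advertised GPU resource, assuming the kubeconfig convention used elsewhere in this repo:
```bash
# Label and taint on the GPU node
kubectl --kubeconfig $(pwd)/config get node k8s-node1 --show-labels | grep -o 'gpu=true'
kubectl --kubeconfig $(pwd)/config describe node k8s-node1 | grep -A1 Taints

# NVIDIA device plugin should advertise the GPU as an allocatable resource
kubectl --kubeconfig $(pwd)/config describe node k8s-node1 | grep -i 'nvidia.com/gpu'
```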


@ -0,0 +1,132 @@
# Service Catalog
> Auto-maintained reference. See `.claude/CLAUDE.md` for operational guidance.
## Critical - Network & Auth (Tier: core)
| Service | Description | Stack |
|---------|-------------|-------|
| wireguard | VPN server | platform |
| technitium | DNS server (10.0.20.101) | platform |
| headscale | Tailscale control server | platform |
| traefik | Ingress controller (Helm) | platform |
| xray | Proxy/tunnel | platform |
| authentik | Identity provider (SSO) | platform |
| cloudflared | Cloudflare tunnel | platform |
| authelia | Auth middleware | platform |
| monitoring | Prometheus/Grafana/Loki stack | platform |
## Storage & Security (Tier: cluster)
| Service | Description | Stack |
|---------|-------------|-------|
| vaultwarden | Bitwarden-compatible password manager | platform |
| redis | Shared Redis at `redis.redis.svc.cluster.local` | platform |
| immich | Photo management (GPU) | immich |
| nvidia | GPU device plugin | platform |
| metrics-server | K8s metrics | platform |
| uptime-kuma | Status monitoring | platform |
| crowdsec | Security/WAF | platform |
| kyverno | Policy engine | platform |
## Admin
| Service | Description | Stack |
|---------|-------------|-------|
| k8s-dashboard | Kubernetes dashboard | platform |
| reverse-proxy | Generic reverse proxy | platform |
## Active Use
| Service | Description | Stack |
|---------|-------------|-------|
| mailserver | Email (docker-mailserver) | mailserver |
| shadowsocks | Proxy | shadowsocks |
| webhook_handler | Webhook processing | webhook_handler |
| tuya-bridge | Smart home bridge | tuya-bridge |
| dawarich | Location history | dawarich |
| owntracks | Location tracking | owntracks |
| nextcloud | File sync/share | nextcloud |
| calibre | E-book management | calibre |
| onlyoffice | Document editing | onlyoffice |
| f1-stream | F1 streaming | f1-stream |
| rybbit | Analytics | rybbit |
| isponsorblocktv | SponsorBlock for TV | isponsorblocktv |
| actualbudget | Budgeting (factory pattern) | actualbudget |
## Optional
| Service | Description | Stack |
|---------|-------------|-------|
| blog | Personal blog | blog |
| descheduler | Pod descheduler | descheduler |
| drone | CI/CD | drone |
| hackmd | Collaborative markdown | hackmd |
| kms | Key management | kms |
| privatebin | Encrypted pastebin | privatebin |
| vault | HashiCorp Vault | vault |
| reloader | ConfigMap/Secret reloader | reloader |
| city-guesser | Game | city-guesser |
| echo | Echo server | echo |
| url | URL shortener | url |
| excalidraw | Whiteboard | excalidraw |
| travel_blog | Travel blog | travel_blog |
| dashy | Dashboard | dashy |
| send | Firefox Send | send |
| ytdlp | YouTube downloader | ytdlp |
| wealthfolio | Finance tracking | wealthfolio |
| audiobookshelf | Audiobook server | audiobookshelf |
| paperless-ngx | Document management | paperless-ngx |
| jsoncrack | JSON visualizer | jsoncrack |
| servarr | Media automation (Sonarr/Radarr/etc) | servarr |
| ntfy | Push notifications | ntfy |
| cyberchef | Data transformation | cyberchef |
| diun | Docker image update notifier | diun |
| meshcentral | Remote management | meshcentral |
| homepage | Dashboard/startpage | homepage |
| matrix | Matrix chat server | matrix |
| linkwarden | Bookmark manager | linkwarden |
| changedetection | Web change detection | changedetection |
| tandoor | Recipe manager | tandoor |
| n8n | Workflow automation | n8n |
| real-estate-crawler | Property crawler | real-estate-crawler |
| tor-proxy | Tor proxy | tor-proxy |
| forgejo | Git forge | forgejo |
| freshrss | RSS reader | freshrss |
| navidrome | Music streaming | navidrome |
| networking-toolbox | Network tools | networking-toolbox |
| stirling-pdf | PDF tools | stirling-pdf |
| speedtest | Speed testing | speedtest |
| freedify | Music streaming (factory pattern) | freedify |
| netbox | Network documentation | netbox |
| infra-maintenance | Maintenance jobs | infra-maintenance |
| ollama | LLM server (GPU) | ollama |
| frigate | NVR/camera (GPU) | frigate |
| ebook2audiobook | E-book to audio (GPU) | ebook2audiobook |
| affine | Visual canvas/whiteboard (PostgreSQL + Redis) | affine |
| health | Apple Health data dashboard (PostgreSQL) | health |
| whisper | Wyoming Faster Whisper STT (CPU on GPU node) | whisper |
| grampsweb | Genealogy web app (Gramps Web) | grampsweb |
| openclaw | AI agent gateway (OpenClaw) | openclaw |
| poison-fountain | Anti-AI scraping (tarpit + poison) | poison-fountain |
## Cloudflare Domains
### Proxied (CDN + WAF enabled)
```
blog, hackmd, privatebin, url, echo, f1tv, excalidraw, send,
audiobookshelf, jsoncrack, ntfy, cyberchef, homepage, linkwarden,
changedetection, tandoor, n8n, stirling-pdf, dashy, city-guesser,
travel, netbox
```
### Non-Proxied (Direct DNS)
```
mail, wg, headscale, immich, calibre, vaultwarden, drone,
mailserver-antispam, mailserver-admin, webhook, uptime,
owntracks, dawarich, tuya, meshcentral, nextcloud, actualbudget,
onlyoffice, forgejo, freshrss, navidrome, ollama, openwebui,
isponsorblocktv, speedtest, freedify, rybbit, paperless,
servarr, prowlarr, bazarr, radarr, sonarr, flaresolverr,
jellyfin, jellyseerr, tdarr, affine, health, family, openclaw
```
### Special Subdomains
- `*.viktor.actualbudget` - Actualbudget factory instances
- `*.freedify` - Freedify factory instances
- `mailserver.*` - Mail server components (antispam, admin)


@ -1,138 +0,0 @@
---
name: containerd-multi-registry-pull-through-cache
description: |
Set up pull-through caches for multiple container registries (ghcr.io, quay.io,
registry.k8s.io, reg.kyverno.io) using Docker Registry v2 instances. Use when:
(1) ImagePullBackOff for non-Docker-Hub images routed through a wildcard mirror,
(2) containerd has deprecated `registry.mirrors."*"` catching all image pulls,
(3) need to add pull-through cache for a new upstream registry,
(4) `mirrors` cannot be set when `config_path` is provided error in containerd,
(5) containerd 1.6.x vs 1.7.x config_path compatibility issues.
Docker Registry v2 can only proxy ONE upstream per instance, so multiple
containers are needed for multiple registries.
author: Claude Code
version: 1.0.0
date: 2026-02-14
---
# Containerd Multi-Registry Pull-Through Cache
## Problem
Docker Registry v2 can only proxy **one upstream registry per instance**. A common
misconfiguration is using a containerd wildcard mirror (`registry.mirrors."*"`) pointing
to a single Docker Hub proxy, which breaks pulls from ghcr.io, quay.io, registry.k8s.io,
and other registries — they get routed to the Docker Hub proxy which can't serve them,
causing `ImagePullBackOff`.
## Context / Trigger Conditions
- `ImagePullBackOff` for images from ghcr.io, quay.io, registry.k8s.io, or other non-Docker-Hub registries
- Containerd config has deprecated `[plugins."io.containerd.grpc.v1.cri".registry.mirrors."*"]`
- Error: `failed to load plugin io.containerd.grpc.v1.cri: invalid plugin config: mirrors cannot be set when config_path is provided`
- Need to migrate from deprecated wildcard mirrors to modern `config_path` approach
## Solution
### 1. Run one Registry v2 container per upstream
Each upstream needs its own Docker Registry v2 instance on a different port:
| Port | Registry | Container Name |
|------|----------|---------------|
| 5000 | docker.io | registry |
| 5010 | ghcr.io | registry-ghcr |
| 5020 | quay.io | registry-quay |
| 5030 | registry.k8s.io | registry-k8s |
| 5040 | reg.kyverno.io | registry-kyverno |
Config for non-Docker-Hub proxies (no auth needed — they're public):
```yaml
version: 0.1
storage:
  cache:
    blobdescriptor: inmemory
  filesystem:
    rootdirectory: /var/lib/registry
http:
  addr: :5000
proxy:
  remoteurl: https://ghcr.io # change per registry
```
```bash
docker run -p 5010:5000 -d --restart always --name registry-ghcr \
-v /etc/docker-registry/ghcr/config.yml:/etc/docker/registry/config.yml registry:2
```
### 2. Replace deprecated wildcard mirror with `config_path`
Instead of:
```toml
# DEPRECATED - breaks non-Docker-Hub registries
[plugins."io.containerd.grpc.v1.cri".registry.mirrors."*"]
endpoint = ["http://10.0.20.10:5000"]
```
Use the modern `config_path` approach:
```toml
[plugins."io.containerd.grpc.v1.cri".registry]
config_path = "/etc/containerd/certs.d"
```
Then create per-registry `hosts.toml` files:
```bash
mkdir -p /etc/containerd/certs.d/docker.io
cat > /etc/containerd/certs.d/docker.io/hosts.toml <<'EOF'
server = "https://registry-1.docker.io"
[host."http://10.0.20.10:5000"]
capabilities = ["pull", "resolve"]
EOF
```
Registries without a `hosts.toml` entry **fall through to direct pull** (no breakage).
### 3. Critical: `config_path` and `mirrors` cannot coexist
Containerd will **refuse to start the CRI plugin** if both `config_path` and any
`mirrors` entries exist in `config.toml`. You must remove ALL `mirrors` entries
(including the `[plugins."...registry.mirrors"]` parent section) before setting
`config_path`.
This is especially dangerous on containerd 1.6.x (used on older nodes like k8s-master)
where the config format is slightly different. If unsure, either:
- Don't use config_path on that node (skip the pull-through cache)
- Remove the entire `mirrors` section first, then add `config_path`
### 4. Static IP for registry VM
If the registry VM uses DHCP and gets the wrong IP, all mirrors break. Use static IP
via cloud-init `ipconfig0 = "ip=10.0.20.10/24,gw=10.0.20.1"` instead of DHCP.
## Verification
```bash
# Test each proxy responds
for port in 5000 5010 5020 5030 5040; do
curl -s http://10.0.20.10:$port/v2/_catalog
done
# Test containerd can pull through cache
crictl pull ghcr.io/some/image:tag
# Check containerd logs for mirror usage
journalctl -u containerd --since "5 minutes ago" | grep -i "mirror\|registry"
```
## Notes
- **Fallback behavior**: If the local mirror is unreachable, containerd falls through to
direct pull from the upstream `server` URL. This provides graceful degradation.
- **GC crontabs**: Add weekly garbage collection for each registry container, staggered
to avoid I/O spikes.
- **Hourly restart**: Registry v2 has known memory-leak issues; an hourly restart mitigates them.
- **Cache is ephemeral**: VM recreation clears the cache. Images re-cache on demand.
See also: `k8s-docker-registry-cache-bypass` (for stale cached image issues)


@ -1,27 +1,33 @@
---
name: helm-release-force-rerender
name: helm-release-troubleshooting
description: |
Fix for Helm releases managed by Terraform where changing Helm values doesn't update
the actual Kubernetes resources. Use when: (1) Terraform applies successfully but
K8s resources (Service, Deployment) don't reflect new Helm values,
(2) New ports/volumes/containers from Helm chart values don't appear in the deployed resources,
Troubleshoot and fix Helm release issues managed by Terraform. Use when:
(1) Terraform applies successfully but K8s resources don't reflect new Helm values,
(2) New ports/volumes/containers from Helm chart values don't appear in deployed resources,
(3) helm upgrade --reuse-values doesn't re-render templates for structural changes,
(4) Terraform thinks Helm release is up-to-date but actual K8s resources are stale.
Solution involves removing from Terraform state, reimporting, and force upgrading.
(4) Terraform thinks Helm release is up-to-date but actual K8s resources are stale,
(5) terraform apply fails with "another operation (install/upgrade/rollback) is in progress",
(6) helm history shows status "pending-upgrade" or "pending-rollback",
(7) a Helm upgrade was interrupted by network timeout, etcd timeout, or VPN drop,
(8) helm upgrade fails with "an error occurred while finding last successful release".
Covers force re-rendering via state removal/reimport and stuck release recovery via
secret cleanup.
author: Claude Code
version: 1.0.0
date: 2026-02-07
date: 2026-02-22
---
# Helm Release Force Re-render via Terraform
# Helm Release Troubleshooting
## Problem
## Force Re-render
### Problem
After changing Helm chart values in a Terraform `helm_release` resource, Terraform applies
successfully but the actual Kubernetes resources (Services, Deployments, etc.) don't reflect
the new values. For example, adding a new port in Helm values doesn't result in that port
appearing in the Service spec.
## Context / Trigger Conditions
### Context / Trigger Conditions
- Terraform `helm_release` applies with "1 changed" but `kubectl get svc -o yaml` shows
the old configuration
- Structural changes to Helm values (new ports, new containers, new volumes) are not
@ -30,7 +36,7 @@ appearing in the Service spec.
- Common with Traefik, ingress-nginx, and other charts where template logic conditionally
includes resources based on values
## Root Cause
### Root Cause
Terraform's `helm_release` resource uses `helm upgrade` under the hood. When values are
changed, Helm may use `--reuse-values` behavior where it merges new values into existing
ones rather than doing a full template re-render. For structural changes (like enabling
@ -41,9 +47,9 @@ Additionally, Terraform may see the stored Helm release state as matching the de
even though the actual Kubernetes resources don't reflect it, creating a state drift that
Terraform doesn't detect.
## Solution
### Solution
### Step 1: Verify the Discrepancy
#### Step 1: Verify the Discrepancy
Confirm that K8s resources don't match Helm values:
```bash
@ -55,7 +61,7 @@ helm get values <release-name> -n <namespace>
helm get manifest <release-name> -n <namespace> | grep -A10 "<expected-config>"
```
### Step 2: Remove Helm Release from Terraform State
#### Step 2: Remove Helm Release from Terraform State
```bash
terraform state rm 'module.kubernetes_cluster.module.<service>.helm_release.<name>'
@ -64,7 +70,7 @@ terraform state rm 'module.kubernetes_cluster.module.<service>.helm_release.<nam
**IMPORTANT**: This only removes from Terraform state. The actual Helm release and K8s
resources remain untouched in the cluster.
### Step 3: Import the Helm Release Back
#### Step 3: Import the Helm Release Back
```bash
terraform import 'module.kubernetes_cluster.module.<service>.helm_release.<name>' '<namespace>/<release-name>'
@ -72,7 +78,7 @@ terraform import 'module.kubernetes_cluster.module.<service>.helm_release.<name>
For Helm releases, the import ID format is `namespace/release-name`.
### Step 4: Force Apply with Terraform
#### Step 4: Force Apply with Terraform
After reimporting, run terraform apply. Terraform should now detect the drift between
the desired Helm values and the actual release state:
@ -87,7 +93,7 @@ terraform taint 'module.kubernetes_cluster.module.<service>.helm_release.<name>'
terraform apply -target=module.kubernetes_cluster.module.<service>
```
### Step 5: Manual Helm Force Upgrade (Last Resort)
#### Step 5: Manual Helm Force Upgrade (Last Resort)
If Terraform still doesn't fix it, use Helm directly as a one-time fix, then reimport:
@ -109,7 +115,7 @@ terraform apply -target=module.kubernetes_cluster.module.<service>
**WARNING**: Direct Helm operations bypass Terraform. Always reimport into Terraform state
afterward, and use `terraform apply` to verify Terraform is back in sync.
## Verification
### Verification
```bash
# Check the K8s resources now match expected configuration
@ -121,7 +127,7 @@ terraform plan -target=module.kubernetes_cluster.module.<service>
# Should show "No changes" or minimal expected drift
```
## Example: Traefik HTTP/3 UDP Port Not Appearing
### Example: Traefik HTTP/3 UDP Port Not Appearing
**Problem**: Added `http3.enabled=true` to Traefik Helm values. Terraform applied
successfully, but the Traefik Service only had TCP port 443, missing the expected
@ -143,21 +149,102 @@ kubectl get svc traefik -n traefik -o yaml | grep -A3 "websecure-http3"
# Should show: port: 443, protocol: UDP
```
## Notes
### Notes
- This issue is more common with structural Helm value changes (new ports, new sidecars,
conditional template blocks) than with simple value changes (image tags, replica counts)
- The `helm upgrade --force` flag deletes and recreates resources that have changed,
which causes brief downtime. Use with caution on production ingress controllers.
- Always verify with `terraform plan` after fixing to ensure Terraform state is consistent
- This is different from the `terraform-state-identity-mismatch` skill, which covers
provider-level identity errors. This skill covers Helm template rendering issues where
the state looks correct but the actual resources don't match.
---
## Stuck Release Recovery
### Problem
Helm releases can get stuck in `pending-upgrade`, `pending-rollback`, or `pending-install`
states when an upgrade is interrupted (network drop, etcd timeout, resource exhaustion).
Subsequent upgrades or terraform applies fail because Helm thinks an operation is in progress.
### Context / Trigger Conditions
- `terraform apply` fails with: `another operation (install/upgrade/rollback) is in progress`
- `helm history <release> -n <namespace>` shows `pending-upgrade`, `pending-rollback`, or `pending-install`
- A previous Helm upgrade was interrupted by network timeout, VPN drop, or etcd timeout
- `helm upgrade` fails with: `an error occurred while finding last successful release`
### Solution
#### Step 1: Identify the stuck release
```bash
helm --kubeconfig $(pwd)/config history <release> -n <namespace> | tail -5
```
Look for revisions with status `pending-upgrade`, `pending-rollback`, or `pending-install`.
#### Step 2: Delete the stuck Helm release secrets
Each Helm revision is stored as a Kubernetes secret named `sh.helm.release.v1.<release>.v<revision>`.
Delete all stuck revisions:
```bash
# Delete specific stuck revision (e.g., revision 5)
kubectl --kubeconfig $(pwd)/config delete secret sh.helm.release.v1.<release>.v5 -n <namespace>
# If multiple stuck revisions exist, delete all of them
kubectl --kubeconfig $(pwd)/config delete secret sh.helm.release.v1.<release>.v6 -n <namespace>
```
#### Step 3: Verify the release is clean
```bash
helm --kubeconfig $(pwd)/config history <release> -n <namespace> | tail -3
```
The latest revision should now show `deployed` status.
#### Step 4: Retry the upgrade
```bash
terraform apply -target=module.kubernetes_cluster.module.<service> -var="kube_config_path=$(pwd)/config" -auto-approve
```
### Important Notes
- **Never patch the secret labels** (e.g., changing `status: pending-rollback` to `status: failed`).
This changes the label but not the encoded release data inside the secret, leaving Helm in an
inconsistent state. Always delete the stuck secrets entirely.
- If the failed upgrade partially applied changes to the cluster (e.g., modified a Deployment),
the next successful upgrade will reconcile the state.
- When VPN/network is unstable, prefer direct `helm upgrade --reuse-values --set key=value`
over `terraform apply`, since Helm upgrades are faster than the full Terraform refresh cycle.
### Verification
After deleting stuck secrets and re-applying:
- `helm history` shows the new revision as `deployed`
- `terraform apply` completes without errors
### Example
```bash
# Helm history shows stuck state
$ helm history nextcloud -n nextcloud | tail -3
4 deployed nextcloud-8.8.1 Upgrade complete
5 failed nextcloud-8.8.1 Upgrade failed: etcd timeout
6 pending-rollback nextcloud-8.8.1 Rollback to 4
# Fix: delete stuck revisions
$ kubectl delete secret sh.helm.release.v1.nextcloud.v5 sh.helm.release.v1.nextcloud.v6 -n nextcloud
# Verify clean state
$ helm history nextcloud -n nextcloud | tail -1
4 deployed nextcloud-8.8.1 Upgrade complete
# Re-apply
$ terraform apply -target=module.kubernetes_cluster.module.nextcloud -auto-approve
```
---
## See Also
- `terraform-state-identity-mismatch` - For Terraform provider identity errors
- `traefik-http3-quic` - For enabling HTTP/3 on Traefik (common trigger for this issue)
- `traefik-http3-quic` - For enabling HTTP/3 on Traefik (common trigger for force re-render)
## References


@ -1,93 +0,0 @@
---
name: helm-stuck-release-recovery
description: |
Fix Helm releases stuck in pending-upgrade, pending-rollback, or pending-install states.
Use when: (1) terraform apply fails with "another operation (install/upgrade/rollback) is
in progress", (2) helm history shows status "pending-upgrade" or "pending-rollback",
(3) a Helm upgrade was interrupted by network timeout, etcd timeout, or VPN drop,
(4) helm upgrade fails with "an error occurred while finding last successful release".
Covers manual secret cleanup to restore Helm release to a deployable state.
author: Claude Code
version: 1.0.0
date: 2026-02-15
---
# Helm Stuck Release Recovery
## Problem
Helm releases can get stuck in `pending-upgrade`, `pending-rollback`, or `pending-install`
states when an upgrade is interrupted (network drop, etcd timeout, resource exhaustion).
Subsequent upgrades or terraform applies fail because Helm thinks an operation is in progress.
## Context / Trigger Conditions
- `terraform apply` fails with: `another operation (install/upgrade/rollback) is in progress`
- `helm history <release> -n <namespace>` shows `pending-upgrade`, `pending-rollback`, or `pending-install`
- A previous Helm upgrade was interrupted by network timeout, VPN drop, or etcd timeout
- `helm upgrade` fails with: `an error occurred while finding last successful release`
## Solution
### Step 1: Identify the stuck release
```bash
helm --kubeconfig $(pwd)/config history <release> -n <namespace> | tail -5
```
Look for revisions with status `pending-upgrade`, `pending-rollback`, or `pending-install`.
### Step 2: Delete the stuck Helm release secrets
Each Helm revision is stored as a Kubernetes secret named `sh.helm.release.v1.<release>.v<revision>`.
Delete all stuck revisions:
```bash
# Delete specific stuck revision (e.g., revision 5)
kubectl --kubeconfig $(pwd)/config delete secret sh.helm.release.v1.<release>.v5 -n <namespace>
# If multiple stuck revisions exist, delete all of them
kubectl --kubeconfig $(pwd)/config delete secret sh.helm.release.v1.<release>.v6 -n <namespace>
```
### Step 3: Verify the release is clean
```bash
helm --kubeconfig $(pwd)/config history <release> -n <namespace> | tail -3
```
The latest revision should now show `deployed` status.
### Step 4: Retry the upgrade
```bash
terraform apply -target=module.kubernetes_cluster.module.<service> -var="kube_config_path=$(pwd)/config" -auto-approve
```
## Important Notes
- **Never patch the secret labels** (e.g., changing `status: pending-rollback` to `status: failed`).
This changes the label but not the encoded release data inside the secret, leaving Helm in an
inconsistent state. Always delete the stuck secrets entirely.
- If the failed upgrade partially applied changes to the cluster (e.g., modified a Deployment),
the next successful upgrade will reconcile the state.
- When VPN/network is unstable, prefer direct `helm upgrade --reuse-values --set key=value`
over `terraform apply`, since Helm upgrades are faster than the full Terraform refresh cycle.
## Verification
After deleting stuck secrets and re-applying:
- `helm history` shows the new revision as `deployed`
- `terraform apply` completes without errors
## Example
```bash
# Helm history shows stuck state
$ helm history nextcloud -n nextcloud | tail -3
4 deployed nextcloud-8.8.1 Upgrade complete
5 failed nextcloud-8.8.1 Upgrade failed: etcd timeout
6 pending-rollback nextcloud-8.8.1 Rollback to 4
# Fix: delete stuck revisions
$ kubectl delete secret sh.helm.release.v1.nextcloud.v5 sh.helm.release.v1.nextcloud.v6 -n nextcloud
# Verify clean state
$ helm history nextcloud -n nextcloud | tail -1
4 deployed nextcloud-8.8.1 Upgrade complete
# Re-apply
$ terraform apply -target=module.kubernetes_cluster.module.nextcloud -auto-approve
```


@ -0,0 +1,244 @@
---
name: k8s-container-image-caching
description: |
Set up and troubleshoot container image pull-through caches in Kubernetes. Use when:
(1) ImagePullBackOff for non-Docker-Hub images routed through a wildcard mirror,
(2) containerd has deprecated `registry.mirrors."*"` catching all image pulls,
(3) need to add pull-through cache for a new upstream registry,
(4) `mirrors` cannot be set when `config_path` is provided error in containerd,
(5) containerd 1.6.x vs 1.7.x config_path compatibility issues,
(6) kubectl shows correct image tag but container runs old code,
(7) local registry mirror caches stale images,
(8) imagePullPolicy: Always doesn't force fresh pulls,
(9) containerd config has mirror that intercepts pulls serving stale images.
Covers multi-registry pull-through cache setup (Docker Registry v2) and cache bypass
via image digest pinning.
author: Claude Code
version: 1.0.0
date: 2026-02-22
---
# Kubernetes Container Image Caching
## Pull-Through Cache Setup
### Problem
Docker Registry v2 can only proxy **one upstream registry per instance**. A common
misconfiguration is using a containerd wildcard mirror (`registry.mirrors."*"`) pointing
to a single Docker Hub proxy, which breaks pulls from ghcr.io, quay.io, registry.k8s.io,
and other registries -- they get routed to the Docker Hub proxy which can't serve them,
causing `ImagePullBackOff`.
### Context / Trigger Conditions
- `ImagePullBackOff` for images from ghcr.io, quay.io, registry.k8s.io, or other non-Docker-Hub registries
- Containerd config has deprecated `[plugins."io.containerd.grpc.v1.cri".registry.mirrors."*"]`
- Error: `failed to load plugin io.containerd.grpc.v1.cri: invalid plugin config: mirrors cannot be set when config_path is provided`
- Need to migrate from deprecated wildcard mirrors to modern `config_path` approach
### Solution
#### 1. Run one Registry v2 container per upstream
Each upstream needs its own Docker Registry v2 instance on a different port:
| Port | Registry | Container Name |
|------|----------|---------------|
| 5000 | docker.io | registry |
| 5010 | ghcr.io | registry-ghcr |
| 5020 | quay.io | registry-quay |
| 5030 | registry.k8s.io | registry-k8s |
| 5040 | reg.kyverno.io | registry-kyverno |
Config for non-Docker-Hub proxies (no auth needed -- they're public):
```yaml
version: 0.1
storage:
  cache:
    blobdescriptor: inmemory
  filesystem:
    rootdirectory: /var/lib/registry
http:
  addr: :5000
proxy:
  remoteurl: https://ghcr.io # change per registry
```
```bash
docker run -p 5010:5000 -d --restart always --name registry-ghcr \
-v /etc/docker-registry/ghcr/config.yml:/etc/docker/registry/config.yml registry:2
```
#### 2. Replace deprecated wildcard mirror with `config_path`
Instead of:
```toml
# DEPRECATED - breaks non-Docker-Hub registries
[plugins."io.containerd.grpc.v1.cri".registry.mirrors."*"]
endpoint = ["http://10.0.20.10:5000"]
```
Use the modern `config_path` approach:
```toml
[plugins."io.containerd.grpc.v1.cri".registry]
config_path = "/etc/containerd/certs.d"
```
Then create per-registry `hosts.toml` files:
```bash
mkdir -p /etc/containerd/certs.d/docker.io
cat > /etc/containerd/certs.d/docker.io/hosts.toml <<'EOF'
server = "https://registry-1.docker.io"
[host."http://10.0.20.10:5000"]
capabilities = ["pull", "resolve"]
EOF
```
Registries without a `hosts.toml` entry **fall through to direct pull** (no breakage).
#### 3. Critical: `config_path` and `mirrors` cannot coexist
Containerd will **refuse to start the CRI plugin** if both `config_path` and any
`mirrors` entries exist in `config.toml`. You must remove ALL `mirrors` entries
(including the `[plugins."...registry.mirrors"]` parent section) before setting
`config_path`.
This is especially dangerous on containerd 1.6.x (used on older nodes like k8s-master)
where the config format is slightly different. If unsure, either:
- Don't use config_path on that node (skip the pull-through cache)
- Remove the entire `mirrors` section first, then add `config_path`
#### 4. Static IP for registry VM
If the registry VM uses DHCP and gets the wrong IP, all mirrors break. Use static IP
via cloud-init `ipconfig0 = "ip=10.0.20.10/24,gw=10.0.20.1"` instead of DHCP.
### Verification
```bash
# Test each proxy responds
for port in 5000 5010 5020 5030 5040; do
curl -s http://10.0.20.10:$port/v2/_catalog
done
# Test containerd can pull through cache
crictl pull ghcr.io/some/image:tag
# Check containerd logs for mirror usage
journalctl -u containerd --since "5 minutes ago" | grep -i "mirror\|registry"
```
### Notes
- **Fallback behavior**: If the local mirror is unreachable, containerd falls through to
direct pull from the upstream `server` URL. This provides graceful degradation.
- **GC crontabs**: Add weekly garbage collection for each registry container, staggered
  to avoid I/O spikes (example crontab after this list).
- **Hourly restart**: Registry v2 has known memory-leak issues; an hourly restart mitigates them.
- **Cache is ephemeral**: VM recreation clears the cache. Images re-cache on demand.
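A hedged example of the GC and restart crontab entries mentioned in the notes above; container names come from the table in step 1, while the exact schedule and config path are illustrative:
```bash
# /etc/crontab on the registry VM (system crontab format, so the user column is included)
# Weekly garbage collection, staggered one hour apart to avoid I/O spikes
0 1 * * 0 root docker exec registry      registry garbage-collect /etc/docker/registry/config.yml
0 2 * * 0 root docker exec registry-ghcr registry garbage-collect /etc/docker/registry/config.yml
0 3 * * 0 root docker exec registry-quay registry garbage-collect /etc/docker/registry/config.yml
# Hourly restart to work around the Registry v2 memory leak
30 * * * * root docker restart registry registry-ghcr registry-quay registry-k8s registry-kyverno
```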
---
## Cache Bypass / Stale Image Fix
### Problem
Kubernetes pods continue running old Docker images even after pushing new versions with
the same tag (e.g., `:latest`). This happens when a local registry mirror caches images
and serves stale versions, ignoring `imagePullPolicy: Always`.
### Context / Trigger Conditions
- Pod is running but application code is outdated
- `docker push` succeeded with new layers
- `kubectl describe pod` shows correct image tag
- Cluster has a local registry mirror configured (e.g., in containerd config)
- `imagePullPolicy: Always` doesn't fix the issue
- Nodes configured with registry mirrors at `/etc/containerd/certs.d/` or similar
### Solution
#### 1. Get the image digest after pushing
```bash
docker push viktorbarzin/myimage:latest
# Output includes: latest: digest: sha256:abc123... size: 856
```
#### 2. Use digest instead of tag in deployment
```hcl
# Terraform
container {
  # Use digest to bypass local registry cache
  image             = "docker.io/viktorbarzin/myimage@sha256:abc123..."
  image_pull_policy = "Always"
  name              = "myimage"
}
```
```yaml
# Kubernetes YAML
containers:
  - name: myimage
    image: docker.io/viktorbarzin/myimage@sha256:abc123...
    imagePullPolicy: Always
```
#### 3. Apply and restart
```bash
terraform apply -target=module.kubernetes_cluster.module.myservice
kubectl rollout restart deployment/myservice -n mynamespace
```
### Why This Works
- Registry mirrors match by tag, not digest
- When you specify a digest, the node must fetch that exact manifest
- The mirror may not have the digest cached, forcing a pull from upstream
- Even if cached, the digest guarantees the exact image version
### Verification
```bash
# Check the pod is using the new image
kubectl get pod -n mynamespace -o jsonpath='{.items[*].spec.containers[*].image}'
# Verify application behavior reflects new code
kubectl exec -n mynamespace deploy/myservice -- <verification-command>
```
### Example
Before (problematic):
```hcl
image = "docker.io/viktorbarzin/audiblez-web:latest"
```
After (fixed):
```hcl
image = "docker.io/viktorbarzin/audiblez-web@sha256:4d0e2c839555e2229bc91a0b1273569bac88529e8b3c3cadad3c3cf9d865fa29"
```
### Notes
- You must update the digest each time you push a new image
- Consider automating digest extraction in CI/CD pipelines (sketch after this list)
- This is a workaround; ideally fix the registry mirror configuration
- To find your registry mirror config: `cat /etc/containerd/config.toml` on nodes
- Common mirror locations: `/etc/containerd/certs.d/docker.io/hosts.toml`
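A sketch of automating the digest lookup right after `docker push`; `RepoDigests` is populated on the pushing host, and the Terraform variable name at the end is hypothetical:
```bash
IMAGE="viktorbarzin/myimage:latest"
docker push "$IMAGE"
# Grab the pinned repo@digest reference from the local image metadata
DIGEST_REF=$(docker image inspect --format '{{index .RepoDigests 0}}' "$IMAGE")
echo "Pinned reference: docker.io/$DIGEST_REF"
# Feed it into Terraform (variable name is hypothetical; wire it to the container image field)
terraform apply -target=module.kubernetes_cluster.module.myservice \
  -var="image=docker.io/$DIGEST_REF"
```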
### Diagnosing Registry Mirror Issues
```bash
# On a k8s node, check containerd config
cat /etc/containerd/config.toml | grep -A5 mirrors
# Check if mirror is intercepting
crictl pull docker.io/library/alpine:latest --debug 2>&1 | grep -i mirror
# List cached images on node
crictl images | grep myimage
```
---
## References
- [Kubernetes imagePullPolicy documentation](https://kubernetes.io/docs/concepts/containers/images/#image-pull-policy)
- [containerd registry configuration](https://github.com/containerd/containerd/blob/main/docs/hosts.md)


@ -1,110 +0,0 @@
---
name: k8s-docker-registry-cache-bypass
description: |
Fix for Kubernetes pods running old Docker images despite pushing new versions.
Use when: (1) kubectl shows correct image tag but container runs old code,
(2) Local registry mirror caches stale images, (3) imagePullPolicy: Always
doesn't force fresh pulls, (4) containerd config has mirror that intercepts pulls.
Solution: Use image digest instead of tag to bypass cache entirely.
author: Claude Code
version: 1.0.0
date: 2025-01-31
---
# Kubernetes Docker Registry Cache Bypass
## Problem
Kubernetes pods continue running old Docker images even after pushing new versions with
the same tag (e.g., `:latest`). This happens when a local registry mirror caches images
and serves stale versions, ignoring `imagePullPolicy: Always`.
## Context / Trigger Conditions
- Pod is running but application code is outdated
- `docker push` succeeded with new layers
- `kubectl describe pod` shows correct image tag
- Cluster has a local registry mirror configured (e.g., in containerd config)
- `imagePullPolicy: Always` doesn't fix the issue
- Nodes configured with registry mirrors at `/etc/containerd/certs.d/` or similar
## Solution
### 1. Get the image digest after pushing
```bash
docker push viktorbarzin/myimage:latest
# Output includes: latest: digest: sha256:abc123... size: 856
```
### 2. Use digest instead of tag in deployment
```hcl
# Terraform
container {
  # Use digest to bypass local registry cache
  image             = "docker.io/viktorbarzin/myimage@sha256:abc123..."
  image_pull_policy = "Always"
  name              = "myimage"
}
```
```yaml
# Kubernetes YAML
containers:
  - name: myimage
    image: docker.io/viktorbarzin/myimage@sha256:abc123...
    imagePullPolicy: Always
```
### 3. Apply and restart
```bash
terraform apply -target=module.kubernetes_cluster.module.myservice
kubectl rollout restart deployment/myservice -n mynamespace
```
## Why This Works
- Registry mirrors match by tag, not digest
- When you specify a digest, the node must fetch that exact manifest
- The mirror may not have the digest cached, forcing a pull from upstream
- Even if cached, the digest guarantees the exact image version
## Verification
```bash
# Check the pod is using the new image
kubectl get pod -n mynamespace -o jsonpath='{.items[*].spec.containers[*].image}'
# Verify application behavior reflects new code
kubectl exec -n mynamespace deploy/myservice -- <verification-command>
```
## Example
Before (problematic):
```hcl
image = "docker.io/viktorbarzin/audiblez-web:latest"
```
After (fixed):
```hcl
image = "docker.io/viktorbarzin/audiblez-web@sha256:4d0e2c839555e2229bc91a0b1273569bac88529e8b3c3cadad3c3cf9d865fa29"
```
## Notes
- You must update the digest each time you push a new image
- Consider automating digest extraction in CI/CD pipelines
- This is a workaround; ideally fix the registry mirror configuration
- To find your registry mirror config: `cat /etc/containerd/config.toml` on nodes
- Common mirror locations: `/etc/containerd/certs.d/docker.io/hosts.toml`
## Diagnosing Registry Mirror Issues
```bash
# On a k8s node, check containerd config
cat /etc/containerd/config.toml | grep -A5 mirrors
# Check if mirror is intercepting
crictl pull docker.io/library/alpine:latest --debug 2>&1 | grep -i mirror
# List cached images on node
crictl images | grep myimage
```
## References
- [Kubernetes imagePullPolicy documentation](https://kubernetes.io/docs/concepts/containers/images/#image-pull-policy)
- [containerd registry configuration](https://github.com/containerd/containerd/blob/main/docs/hosts.md)