From abe89c926e05ef18ef736c0e83edfb5583d80ec6 Mon Sep 17 00:00:00 2001 From: Viktor Barzin Date: Sun, 22 Feb 2026 22:11:31 +0000 Subject: [PATCH] =?UTF-8?q?[ci=20skip]=20Refactor=20knowledge:=20CLAUDE.md?= =?UTF-8?q?=20881=E2=86=92190=20lines,=20extract=20reference=20data?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit CLAUDE.md changes: - Extract service catalog + Cloudflare domains → .claude/reference/service-catalog.md - Extract Proxmox VMs, hardware, network → .claude/reference/proxmox-inventory.md - Extract GitHub/Drone API patterns → .claude/reference/github-drone-api.md - Extract Authentik state snapshot → .claude/reference/authentik-state.md - Remove Init Container pattern (duplicates setup-project skill) - Remove Poison Fountain service notes (duplicates Anti-AI section) - Consolidate Authentik section (link to skills + reference) - Remove resource limit tables (kept tier definitions inline) Skill merges (37→32): - helm-release-force-rerender + helm-stuck-release-recovery → helm-release-troubleshooting - containerd-multi-registry-pull-through-cache + k8s-docker-registry-cache-bypass → k8s-container-image-caching - (traefik merges in previous commits) --- .claude/CLAUDE.md | 909 +++--------------- .claude/reference/authentik-state.md | 50 + .claude/reference/github-drone-api.md | 50 + .claude/reference/proxmox-inventory.md | 52 + .claude/reference/service-catalog.md | 132 +++ .../SKILL.md | 138 --- .../SKILL.md | 137 ++- .../helm-stuck-release-recovery/SKILL.md | 93 -- .../k8s-container-image-caching/SKILL.md | 244 +++++ .../k8s-docker-registry-cache-bypass/SKILL.md | 110 --- 10 files changed, 749 insertions(+), 1166 deletions(-) create mode 100644 .claude/reference/authentik-state.md create mode 100644 .claude/reference/github-drone-api.md create mode 100644 .claude/reference/proxmox-inventory.md create mode 100644 .claude/reference/service-catalog.md delete mode 100644 .claude/skills/containerd-multi-registry-pull-through-cache/SKILL.md rename .claude/skills/{helm-release-force-rerender => helm-release-troubleshooting}/SKILL.md (55%) delete mode 100644 .claude/skills/helm-stuck-release-recovery/SKILL.md create mode 100644 .claude/skills/k8s-container-image-caching/SKILL.md delete mode 100644 .claude/skills/k8s-docker-registry-cache-bypass/SKILL.md diff --git a/.claude/CLAUDE.md b/.claude/CLAUDE.md index 0de01bf6..612343d0 100755 --- a/.claude/CLAUDE.md +++ b/.claude/CLAUDE.md @@ -6,100 +6,40 @@ - **When making infrastructure changes**: Always update this file to reflect the current state (new services, removed services, version changes, config changes) - **After every significant change**: Proactively update this file (`.claude/CLAUDE.md`) to reflect what changed — new services, config changes, version bumps, new patterns, etc. This ensures knowledge persists across sessions automatically. - **After updating any `.claude/` files**: Always commit them immediately (`git add .claude/ && git commit -m "[ci skip] update claude knowledge"`) to avoid building up unstaged changes. -- **Skills available**: Check `.claude/skills/` directory for specialized workflows (e.g., `setup-project.md` for deploying new services) -- **CRITICAL: All infrastructure changes must go through Terraform/Terragrunt**. NEVER modify cluster resources directly (e.g., via kubectl apply/edit/patch, helm install, docker run). Always make changes in the Terraform `.tf` files and apply with `terragrunt apply`. 
The real cluster state must never deviate from what's defined in Terraform — if a manual change is unavoidable (e.g., containerd config on running nodes), document it and ensure the Terraform templates match so future provisioning is consistent. Use `kubectl` only for read-only operations (get, describe, logs) and ephemeral debugging (run --rm, delete stuck pods), never for persistent state changes.
-- **CRITICAL: NEVER put sensitive data (API keys, passwords, tokens, credentials) into committed files** unless they are encrypted (e.g., via git-crypt). Secrets belong in `terraform.tfvars` (which is git-crypt encrypted) or in the `secrets/` directory. Never hardcode credentials in `.tf` files, scripts, `.claude/` files, or any other unencrypted committed file. Always pass secrets through the Terraform variable chain (`terraform.tfvars` → `main.tf` → module variables).
-- **CRITICAL: NEVER commit secrets** — triple-check before every commit that no API keys, passwords, tokens, or credentials are included in unencrypted files. This is a hard rule with zero exceptions.
-- **New services MUST have CI/CD**: Set up Drone CI pipeline (`.drone.yml`) with GitHub/GitLab repo integration. Services should auto-build and auto-deploy.
-- **New services MUST have monitoring**: Every new service should have monitoring via Prometheus (alerts/metrics) and/or Uptime Kuma (HTTP health checks). Add both when possible.
+- **Skills available**: Check `.claude/skills/` directory for specialized workflows (e.g., `setup-project` for deploying new services)
+- **Reference data**: Check `.claude/reference/` for inventory tables, API patterns, and current state snapshots
+- **CRITICAL: All infrastructure changes must go through Terraform/Terragrunt**. NEVER modify cluster resources directly (kubectl apply/edit/patch, helm install, docker run). Use `kubectl` only for read-only operations and ephemeral debugging.
+- **CRITICAL: NEVER put sensitive data** (API keys, passwords, tokens, credentials) into committed files unless encrypted via git-crypt. Secrets belong in `terraform.tfvars` or the `secrets/` directory.
+- **CRITICAL: NEVER commit secrets** — triple-check before every commit. Zero exceptions.
+- **New services MUST have CI/CD** (Drone CI pipeline) and **monitoring** (Prometheus alerts and/or Uptime Kuma).
 
 ## Execution Environment
 
-- **File operations**: Read, Edit, Write, Glob, Grep tools
-- **Git commands**: git status, git log, git diff, git add, git commit, git reset, etc.
-- **Shell commands**: All tools (terraform, terragrunt, kubectl, helm, python, etc.) are available locally
-- **CRITICAL: Always run terragrunt/terraform locally**, never on the remote server via SSH:
-  ```bash
-  cd stacks/<service> && terragrunt apply --non-interactive
-  ```
-- **kubectl**: Use `kubectl --kubeconfig $(pwd)/config` for cluster access
-- **GitHub API**: Use `curl` with token from tfvars (see GitHub & Drone CI section below). `gh` CLI is blocked by sandbox restrictions.
-- **Drone CI API**: Use `curl` with token from tfvars (see GitHub & Drone CI section below).
+- **Terraform/Terragrunt**: Always run locally: `cd stacks/<service> && terragrunt apply --non-interactive`
+- **kubectl**: `kubectl --kubeconfig $(pwd)/config`
+- **GitHub/Drone API**: Use `curl` with tokens from tfvars (see `.claude/reference/github-drone-api.md`). `gh` CLI is blocked by sandbox.
 
 ---
 
 ## Overview
 
-Terragrunt-based infrastructure repository managing a home Kubernetes cluster on Proxmox VMs, with per-service state isolation.
Each service has its own Terragrunt stack under `stacks/`, enabling fast, independent plan/apply cycles. Uses git-crypt for secrets encryption.
+Terragrunt-based infrastructure repository managing a home Kubernetes cluster on Proxmox VMs, with per-service state isolation. Each service has its own Terragrunt stack under `stacks/`. Uses git-crypt for secrets encryption.
 
-## Static File Paths (NEVER CHANGE)
-- **Main config**: `terraform.tfvars` - All secrets, DNS, Cloudflare config, WireGuard peers
-- **Root Terragrunt**: `terragrunt.hcl` - Root Terragrunt config (providers, backend, var loading)
-- **Service stacks**: `stacks/<service>/` - Individual service stacks (each has `terragrunt.hcl` + `main.tf` with resources inline)
-- **Infra stack**: `stacks/infra/` - Proxmox VM resources (templates, docker-registry, VMs)
-- **Platform stack**: `stacks/platform/` - Core infrastructure services (22 modules in `modules/` subdir)
-- **Per-stack state**: `state/stacks/<stack>/terraform.tfstate` - Per-stack state files (gitignored)
-- **Service resources**: `stacks/<service>/main.tf` - Service resources defined directly in stack root
-- **Platform modules**: `stacks/platform/modules/<service>/` - Platform service modules
-- **Shared modules**: `modules/kubernetes/ingress_factory/`, `modules/kubernetes/setup_tls_secret/`
-- **Secrets**: `secrets/` - git-crypt encrypted TLS certs and keys
-
-## Network Topology (Static IPs)
-```
-┌─────────────────────────────────────────────────────────────────┐
-│ 10.0.10.0/24 - Management Network                               │
-├─────────────────────────────────────────────────────────────────┤
-│ 10.0.10.10 - Wizard (main server)                               │
-│ 10.0.10.15 - NFS Server (TrueNAS) - /mnt/main/*                 │
-└─────────────────────────────────────────────────────────────────┘
-
-┌─────────────────────────────────────────────────────────────────┐
-│ 10.0.20.0/24 - Kubernetes Network                               │
-├─────────────────────────────────────────────────────────────────┤
-│ 10.0.20.1   - pfSense Gateway                                   │
-│ 10.0.20.10  - Docker Registry VM (MAC: DE:AD:BE:EF:22:22)       │
-│ 10.0.20.100 - k8s-master                                        │
-│ 10.0.20.101 - Technitium DNS                                    │
-│ 10.0.20.102 - MetalLB IP Pool Start                             │
-│ 10.0.20.200 - MetalLB IP Pool End                               │
-└─────────────────────────────────────────────────────────────────┘
-
-┌─────────────────────────────────────────────────────────────────┐
-│ 192.168.1.0/24 - Physical Network                               │
-├─────────────────────────────────────────────────────────────────┤
-│ 192.168.1.127 - Proxmox Hypervisor                              │
-└─────────────────────────────────────────────────────────────────┘
-```
+## Key File Paths
+- `terraform.tfvars` — All secrets, DNS, Cloudflare config, WireGuard peers (git-crypt encrypted)
+- `terragrunt.hcl` — Root config (providers, backend, variable loading)
+- `stacks/<service>/` — Individual service stacks (`terragrunt.hcl` + `main.tf`)
+- `stacks/platform/` — Core infrastructure (~22 services in `modules/` subdir)
+- `stacks/infra/` — Proxmox VM resources
+- `modules/kubernetes/ingress_factory/`, `setup_tls_secret/` — Shared utility modules
+- `secrets/` — git-crypt encrypted TLS certs and keys
 
 ## Domains
 - **Public**: `viktorbarzin.me` (Cloudflare-managed)
 - **Internal**: `viktorbarzin.lan` (Technitium DNS)
 
-## Directory Structure
-- `terragrunt.hcl` - Root Terragrunt configuration (providers, backend, variable loading)
-- `stacks/` - Individual Terragrunt stacks (one per service)
-- `stacks/infra/` - Proxmox VM resources (templates, docker-registry)
-- `stacks/platform/` - Core infrastructure (22 services in `stacks/platform/modules/`)
-- `stacks/<service>/` - Individual service stacks
(resources directly in `main.tf`)
-- `stacks/platform/modules/<service>/` - Platform service module source code
-- `modules/kubernetes/` - **Only shared utility modules**: `ingress_factory/`, `setup_tls_secret/`
-- `modules/create-vm/` - Proxmox VM creation module
-- `state/` - Per-stack Terraform state files (gitignored)
-- `secrets/` - Encrypted secrets (TLS certs, keys) via git-crypt
-- `cli/` - Go CLI tool for infrastructure management
-- `scripts/` - Helper scripts (cluster management, node updates)
-- `playbooks/` - Ansible playbooks for node configuration
-- `diagram/` - Infrastructure diagrams (Python-based)
-
 ## Key Patterns
-- Each service in `modules/kubernetes/<service>/main.tf` defines its own namespace, deployments, services, and ingress
-- NFS storage from `10.0.10.15` for persistent data
-- TLS secrets managed via `setup_tls_secret` module
-- Ingress uses Traefik (Helm chart, 3 replicas) with HTTP/3 (QUIC) enabled, Middleware CRDs for rate limiting, auth, CSP headers, CrowdSec bouncer, and analytics injection
-- HTTP/3 enabled on Traefik (`http3.enabled=true`, `advertisedPort=443` on websecure entrypoint) and Cloudflare (`cloudflare_zone_settings_override` with `http3="on"`)
-- GPU workloads use `node_selector = { "gpu": "true" }`
-- Services expose to `*.viktorbarzin.me` domains
 
 ### NFS Volume Pattern
-**Prefer inline NFS volumes** over separate PV/PVC resources. Use the `nfs {}` block directly in pod/deployment/cronjob specs:
+**Prefer inline NFS volumes** over separate PV/PVC resources:
 ```hcl
 volume {
   name = "data"
@@ -109,773 +49,142 @@ volume {
   }
 }
 ```
-Only use PV/PVC when the Helm chart requires `existingClaim` (like the Nextcloud Helm chart).
+Only use PV/PVC when a Helm chart requires `existingClaim`.
 
 ### Adding NFS Exports
-To add a new NFS exported directory:
-1. Edit `secrets/nfs_directories.txt` - add the new directory path, keep the list sorted
-2. Run `secrets/nfs_exports.sh` from the `secrets/` directory to update the NFS share via TrueNAS API
+1. Edit `secrets/nfs_directories.txt` — add path, keep sorted
+2. Run `secrets/nfs_exports.sh` from `secrets/` to update TrueNAS
 
-### Factory Pattern (for multi-user services)
-Used when a service needs one instance per user. Structure:
-```
-stacks/<service>/
-├── main.tf          # Namespace, TLS secret, user module calls
-└── factory/
-    └── main.tf      # Deployment, service, ingress templates with ${var.name}
-```
-Examples: `actualbudget`, `freedify`
+### Factory Pattern (multi-user services)
+Structure: `stacks/<service>/main.tf` + `factory/main.tf`. Examples: `actualbudget`, `freedify`.
+To add a user: export NFS share, add Cloudflare route in tfvars, add module block calling factory (see the sketch below).
 
-To add a new user:
-1. Export NFS share at `/mnt/main/<service>/<user>` in TrueNAS
-2. Add Cloudflare route in tfvars
-3. Add module block in main.tf calling factory
-
-### Init Container Pattern (for database migrations)
-Use when a service needs to run database migrations before starting:
-```hcl
-init_container {
-  name    = "migration"
-  image   = "service-image:tag"
-  command = ["sh", "-c", "migration-command"]
-
-  dynamic "env" {
-    for_each = local.common_env
-    content {
-      name  = env.value.name
-      value = env.value.value
-    }
-  }
-}
-```
-Example: AFFiNE runs `node ./scripts/self-host-predeploy.js` in init container.
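Returning to the Factory Pattern above: a minimal sketch of the per-user module block referenced in step 3. The argument names are illustrative assumptions (they follow the pattern described here, not the repo's literal variables):

```hcl
# Hypothetical per-user instance in stacks/actualbudget/main.tf (argument names assumed)
module "viktor" {
  source = "./factory"

  name            = "viktor"                        # substituted as ${var.name} in the factory templates
  tls_secret_name = var.tls_secret_name
  nfs_path        = "/mnt/main/actualbudget/viktor" # NFS export must already exist in TrueNAS
}
```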
-
-### SMTP/Email Configuration
-When configuring services to use the mailserver:
-- **Use public hostname**: `mail.viktorbarzin.me` (for TLS cert validation)
-- **Do NOT use**: `mailserver.mailserver.svc.cluster.local` (TLS cert mismatch)
-- **Port**: 587 (STARTTLS)
-- **Credentials**: Use existing accounts from `mailserver_accounts` in tfvars
-- **Common email**: `info@viktorbarzin.me` for service notifications
+### SMTP/Email
+- **Use**: `mail.viktorbarzin.me` port 587 (STARTTLS). **NOT** `mailserver.mailserver.svc.cluster.local` (TLS cert mismatch).
+- **Credentials**: `mailserver_accounts` in tfvars. Common: `info@viktorbarzin.me`
 
 ### Anti-AI Scraping (5-Layer Defense)
-All services have anti-AI scraping enabled by default via `anti_ai_scraping = true` in `ingress_factory`. The 5 layers are:
+All services have `anti_ai_scraping = true` by default in `ingress_factory`. Layers:
+1. **Bot blocking** (`traefik-ai-bot-block`): ForwardAuth → poison-fountain `/auth`. Returns 403 for GPTBot, ClaudeBot, CCBot, etc.
+2. **X-Robots-Tag** (`traefik-anti-ai-headers`): Adds `noai, noimageai`
+3. **Trap links** (`traefik-anti-ai-trap-links`): rewrite-body injects 5 hidden links before `</body>` to `poison.viktorbarzin.me/article/*`
+4. **Tarpit**: `/article/*` drip-feeds at ~100 bytes/sec
+5. **Poison content**: 50 cached docs from rnsaffn.com/poison2/ (CronJob every 6h, `--http1.1` required)
-
-1. **Bot blocking** (`traefik-ai-bot-block`): ForwardAuth middleware → poison-fountain `/auth` endpoint. Checks `User-Agent` against known AI crawlers (GPTBot, ClaudeBot, CCBot, Google-Extended, etc.). Returns 403 for bots, 200 for normal users.
-2. **X-Robots-Tag header** (`traefik-anti-ai-headers`): Adds `noai, noimageai` to all responses.
-3. **Trap links** (`traefik-anti-ai-trap-links`): rewrite-body plugin injects 5 hidden `<a>` tags before `</body>` linking to `poison.viktorbarzin.me/article/*`. Only injected when request `Accept` header contains `text/html` (browsers/scrapers, not API calls).
-4. **Tarpit** (poison-fountain service): `/article/*` endpoints drip-feed responses at ~100 bytes/sec via chunked transfer encoding, wasting scraper time.
-5. **Poison content**: Cached documents from rnsaffn.com/poison2/ (50 docs, refreshed every 6h via CronJob) served through the tarpit to pollute AI training data.
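Opting a single service out of this defense is one flag on its `ingress_factory` call. A hedged sketch: the module path matches the shared-modules layout documented here, but the other arguments are illustrative:

```hcl
# Hypothetical ingress_factory call for a service that must stay scrapeable
module "ingress" {
  source = "../../modules/kubernetes/ingress_factory"

  name             = "myservice"
  namespace        = "myservice"
  anti_ai_scraping = false # default is true; false skips the bot-block/trap-link middlewares
}
```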
-
-**Key files:**
-- `stacks/poison-fountain/` — Terraform stack (deployment, service, ingress, CronJob)
-- `stacks/poison-fountain/app/server.py` — Python HTTP server (ForwardAuth + tarpit)
-- `stacks/poison-fountain/app/fetch-poison.sh` — CronJob fetcher (uses `--http1.1`, upstream hangs on HTTP/2)
-- `stacks/platform/modules/traefik/middleware.tf` — 3 Traefik middleware CRDs
-- `modules/kubernetes/ingress_factory/main.tf` — `anti_ai_scraping` variable (default: true)
-
-**Testing:**
-```bash
-# Trap links (need Accept: text/html for rewrite-body plugin to process)
-curl -s -H "Accept: text/html,application/xhtml+xml" https://vaultwarden.viktorbarzin.me/ | grep -oE 'href="https://poison[^"]*"'
-
-# X-Robots-Tag header
-curl -sI -H "Accept: text/html" https://vaultwarden.viktorbarzin.me/ | grep -i x-robots
-
-# Bot blocking (403 for AI bots, 200 for normal users)
-curl -s -o /dev/null -w "%{http_code}" -A "GPTBot/1.0" https://vaultwarden.viktorbarzin.me/
-
-# Tarpit slow-drip (~100 bytes/sec)
-curl -s -H "Accept: text/html" https://poison.viktorbarzin.me/article/test
-```
-
-**Gotchas:**
-- rewrite-body plugin only processes responses when `Accept` header contains `text/html` — `curl` default `Accept: */*` does NOT match. Use `-H "Accept: text/html"` for testing.
-- rnsaffn.com/poison2/ hangs on HTTP/2 — fetcher must use `--http1.1`
-- NFS cache dir (`/mnt/main/poison-fountain/cache`) must be world-writable (chmod 777) because `curlimages/curl` runs as uid 101
-- To disable for a specific service: set `anti_ai_scraping = false` in its `ingress_factory` call
+Key files: `stacks/poison-fountain/`, `stacks/platform/modules/traefik/middleware.tf`, `modules/kubernetes/ingress_factory/main.tf`
+Testing: `curl -s -H "Accept: text/html,application/xhtml+xml" https://vaultwarden.viktorbarzin.me/ | grep -oE 'href="https://poison[^"]*"'`
+Disable per-service: `anti_ai_scraping = false` in ingress_factory call.
 
 ### Terragrunt Architecture
-- Root `terragrunt.hcl` provides DRY provider, backend, and variable loading for all stacks
-- Each stack contains its resources directly: `stacks/<service>/main.tf` has variable declarations, locals, and all Terraform resources inline
-- Platform modules live at `stacks/platform/modules/<service>/`, referenced as `source = "./modules/<service>"`
-- Shared utility modules (`ingress_factory`, `setup_tls_secret`, `dockerhub_secret`, `oauth-proxy`) remain at `modules/kubernetes/` and are referenced with relative paths from each module
-- State isolation: each stack has its own state file at `state/stacks/<stack>/terraform.tfstate`
-- Dependencies: service stacks depend on `platform` stack via `dependency` block in their `terragrunt.hcl`
-- Variables loaded from `terraform.tfvars` automatically (unused vars silently ignored via `extra_arguments`)
-- `secrets/` symlinks in each stack for TLS cert resolution (`path.root` workaround)
-- Terragrunt v0.99+: use `--non-interactive` (not `--terragrunt-non-interactive`)
-- run-all syntax: `terragrunt run --all -- <command>` (not `terragrunt run-all`)
-- The `platform` stack bundles ~22 core services that have cross-dependencies (traefik, monitoring, authentik, etc.)
-- Individual service stacks are for services that can be deployed independently
+- Root `terragrunt.hcl` provides DRY provider, backend, and variable loading
+- Each stack: `stacks/<service>/main.tf` with resources inline, state at `state/stacks/<service>/terraform.tfstate`
+- Platform modules: `stacks/platform/modules/<service>/`, shared modules: `modules/kubernetes/`
+- Dependencies via `dependency` block; variables from `terraform.tfvars` (unused silently ignored)
+- `secrets/` symlinks in stacks for TLS cert path resolution
+- Syntax: `--non-interactive` (not `--terragrunt-non-interactive`), `terragrunt run --all -- <command>` (not `run-all`)
 
 ### Adding a New Service
-When adding a new service to the cluster:
-1. Create `stacks/<service>/` directory with:
-   - `terragrunt.hcl` - Include root config, declare `platform` dependency
-   - `main.tf` - All resources defined directly (variables, locals, namespace, deployments, services, ingress)
-   - `secrets` - Symlink to `../../secrets` (for TLS cert path resolution)
-2. Add Cloudflare DNS record in `terraform.tfvars` (`cloudflare_proxied_names` or `cloudflare_non_proxied_names`)
-3. Apply the cloudflared stack: `cd stacks/platform && terragrunt apply --non-interactive`
-4. Apply the new service: `cd stacks/<service> && terragrunt apply --non-interactive`
-
-## Common Variables
-- `tls_secret_name` - TLS certificate secret name
-- `tier` - Deployment tier label
-- Service-specific passwords passed as variables
-
-## Service Versions (as of 2026-02)
-- Immich: v2.4.1
-- Freedify: latest (music streaming, factory pattern)
-- AFFiNE: stable (visual canvas, uses PostgreSQL + Redis)
-- Wyoming Whisper: latest (STT for Home Assistant, CPU on GPU node)
-- Health: latest (Apple Health data dashboard, Svelte + FastAPI + Caddy, uses PostgreSQL)
-- Gramps Web: latest (genealogy, uses Redis + Celery)
-- Loki: 3.6.5 (log aggregation, single binary, 6Gi RAM, 24h in-memory chunks)
-- Alloy: v1.13.0 (log collector DaemonSet, forwards to Loki)
-- OpenClaw: 2026.2.9 (AI agent gateway, authentik-protected)
+Use the **`setup-project`** skill for the full workflow. Quick reference (see the `terragrunt.hcl` sketch after the Useful Commands section):
+1. Create `stacks/<service>/` with `terragrunt.hcl`, `main.tf`, `secrets` symlink
+2. Add Cloudflare DNS in `terraform.tfvars`
+3. Apply platform stack (for DNS): `cd stacks/platform && terragrunt apply --non-interactive`
+4.
Apply service: `cd stacks/<service> && terragrunt apply --non-interactive`
 
 ## Useful Commands
 ```bash
-# Cluster health check — ALWAYS use this to check cluster status
-bash scripts/cluster_healthcheck.sh         # Full color report
+bash scripts/cluster_healthcheck.sh         # Cluster health (24 checks)
 bash scripts/cluster_healthcheck.sh --quiet # Only WARN/FAIL
-bash scripts/cluster_healthcheck.sh --json  # Machine-readable
-bash scripts/cluster_healthcheck.sh --fix   # Auto-delete evicted pods
-
-# Apply a single service stack
-cd stacks/<service> && terragrunt apply --non-interactive
-
-# Plan a single service stack
-cd stacks/<service> && terragrunt plan --non-interactive
-
-# Plan all stacks (full DAG)
-cd stacks && terragrunt run --all --non-interactive -- plan
-
-# Apply all stacks (full DAG)
-cd stacks && terragrunt run --all --non-interactive -- apply
-
-# Format all terraform files
-terraform fmt -recursive
-
-kubectl get pods -A
+cd stacks/<service> && terragrunt apply --non-interactive    # Apply single stack
+cd stacks && terragrunt run --all --non-interactive -- plan  # Plan all
+terraform fmt -recursive                                     # Format all
 ```
 
-**Cluster Health Check** (`scripts/cluster_healthcheck.sh`):
-- **ALWAYS use this script** to check cluster health — whether the user asks explicitly, after deploying/updating services, or whenever you need to verify cluster state. Never use ad-hoc kubectl commands to assess overall cluster health; use the script instead.
-- Runs 24 checks: nodes, resources, conditions, pods, evicted, DaemonSets, deployments, PVCs, HPAs, CronJobs, CrowdSec, ingress, Prometheus alerts, Uptime Kuma, ResourceQuota pressure, StatefulSets, node disk, Helm releases, Kyverno, NFS, DNS, TLS certs, GPU, Cloudflare tunnel
-- **When adding new healthchecks or monitoring**: Always update this script to validate the new component
-
-**Terragrunt apply examples:**
-- `cd stacks/monitoring && terragrunt apply --non-interactive` - Apply monitoring
-- `cd stacks/immich && terragrunt apply --non-interactive` - Apply immich
-- `cd stacks/infra && terragrunt apply --non-interactive` - Apply Proxmox VMs / docker registry
-- `cd stacks/platform && terragrunt apply --non-interactive` - Apply all core/platform services
-
-**IMPORTANT: When deploying a new service**, you must ALSO apply the `platform` stack (which includes `cloudflared`) to create the Cloudflare DNS record:
-```bash
-cd stacks/platform && terragrunt apply --non-interactive
-```
-Adding a name to `cloudflare_non_proxied_names` or `cloudflare_proxied_names` in `terraform.tfvars` only defines the record — it won't be created until the platform stack (which contains cloudflared) is applied.
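As promised above, here is a minimal sketch of the `terragrunt.hcl` a new `stacks/<service>/` gets in step 1. The include/dependency pattern is documented in this file; the exact block contents are assumptions, not copied from the repo:

```hcl
# stacks/<service>/terragrunt.hcl — illustrative sketch, not the repo's literal file
include "root" {
  # Inherit providers, local per-stack state backend, and tfvars loading from the repo root
  path = find_in_parent_folders()
}

# Service stacks depend on the platform stack (Traefik, cloudflared DNS, TLS),
# per the Terragrunt Architecture section above
dependency "platform" {
  config_path = "../platform"
}
```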
-
-## Stack Structure
-Terragrunt stacks under `stacks/`:
-- `stacks/infra/` - Proxmox VMs, templates, docker-registry
-- `stacks/platform/` - Core infrastructure (~22 services in `modules/` subdir)
-- `stacks/<service>/` - Individual service stacks (resources directly in `main.tf`)
-
-Each stack's `terragrunt.hcl` includes the root `terragrunt.hcl` which provides:
-- Kubernetes + Helm providers (configured from `terraform.tfvars`)
-- Local backend with per-stack state file (`state/stacks/<stack>/terraform.tfstate`)
-- Automatic loading of `terraform.tfvars` with unused vars ignored
-
----
-
-## Complete Service Catalog
-
-### Critical - Network & Auth (Tier: core)
-| Service | Description | Stack |
-|---------|-------------|-------|
-| wireguard | VPN server | platform |
-| technitium | DNS server (10.0.20.101) | platform |
-| headscale | Tailscale control server | platform |
-| traefik | Ingress controller (Helm) | platform |
-| xray | Proxy/tunnel | platform |
-| authentik | Identity provider (SSO) | platform |
-| cloudflared | Cloudflare tunnel | platform |
-| authelia | Auth middleware | platform |
-| monitoring | Prometheus/Grafana/Loki stack | platform |
-
-### Storage & Security (Tier: cluster)
-| Service | Description | Stack |
-|---------|-------------|-------|
-| vaultwarden | Bitwarden-compatible password manager | platform |
-| redis | Shared Redis at `redis.redis.svc.cluster.local` | platform |
-| immich | Photo management (GPU) | immich |
-| nvidia | GPU device plugin | platform |
-| metrics-server | K8s metrics | platform |
-| uptime-kuma | Status monitoring | platform |
-| crowdsec | Security/WAF | platform |
-| kyverno | Policy engine | platform |
-
-### Admin
-| Service | Description | Stack |
-|---------|-------------|-------|
-| k8s-dashboard | Kubernetes dashboard | platform |
-| reverse-proxy | Generic reverse proxy | platform |
-
-### Active Use
-| Service | Description | Stack |
-|---------|-------------|-------|
-| mailserver | Email (docker-mailserver) | mailserver |
-| shadowsocks | Proxy | shadowsocks |
-| webhook_handler | Webhook processing | webhook_handler |
-| tuya-bridge | Smart home bridge | tuya-bridge |
-| dawarich | Location history | dawarich |
-| owntracks | Location tracking | owntracks |
-| nextcloud | File sync/share | nextcloud |
-| calibre | E-book management | calibre |
-| onlyoffice | Document editing | onlyoffice |
-| f1-stream | F1 streaming | f1-stream |
-| rybbit | Analytics | rybbit |
-| isponsorblocktv | SponsorBlock for TV | isponsorblocktv |
-| actualbudget | Budgeting (factory pattern) | actualbudget |
-
-### Optional
-| Service | Description | Stack |
-|---------|-------------|-------|
-| blog | Personal blog | blog |
-| descheduler | Pod descheduler | descheduler |
-| drone | CI/CD | drone |
-| hackmd | Collaborative markdown | hackmd |
-| kms | Key management | kms |
-| privatebin | Encrypted pastebin | privatebin |
-| vault | HashiCorp Vault | vault |
-| reloader | ConfigMap/Secret reloader | reloader |
-| city-guesser | Game | city-guesser |
-| echo | Echo server | echo |
-| url | URL shortener | url |
-| excalidraw | Whiteboard | excalidraw |
-| travel_blog | Travel blog | travel_blog |
-| dashy | Dashboard | dashy |
-| send | Firefox Send | send |
-| ytdlp | YouTube downloader | ytdlp |
-| wealthfolio | Finance tracking | wealthfolio |
-| audiobookshelf | Audiobook server | audiobookshelf |
-| paperless-ngx | Document management | paperless-ngx |
-| jsoncrack | JSON visualizer | jsoncrack |
-| servarr | Media automation (Sonarr/Radarr/etc) |
servarr | -| ntfy | Push notifications | ntfy | -| cyberchef | Data transformation | cyberchef | -| diun | Docker image update notifier | diun | -| meshcentral | Remote management | meshcentral | -| homepage | Dashboard/startpage | homepage | -| matrix | Matrix chat server | matrix | -| linkwarden | Bookmark manager | linkwarden | -| changedetection | Web change detection | changedetection | -| tandoor | Recipe manager | tandoor | -| n8n | Workflow automation | n8n | -| real-estate-crawler | Property crawler | real-estate-crawler | -| tor-proxy | Tor proxy | tor-proxy | -| forgejo | Git forge | forgejo | -| freshrss | RSS reader | freshrss | -| navidrome | Music streaming | navidrome | -| networking-toolbox | Network tools | networking-toolbox | -| stirling-pdf | PDF tools | stirling-pdf | -| speedtest | Speed testing | speedtest | -| freedify | Music streaming (factory pattern) | freedify | -| netbox | Network documentation | netbox | -| infra-maintenance | Maintenance jobs | infra-maintenance | -| ollama | LLM server (GPU) | ollama | -| frigate | NVR/camera (GPU) | frigate | -| ebook2audiobook | E-book to audio (GPU) | ebook2audiobook | -| affine | Visual canvas/whiteboard (PostgreSQL + Redis) | affine | -| health | Apple Health data dashboard (PostgreSQL) | health | -| whisper | Wyoming Faster Whisper STT (CPU on GPU node) | whisper | -| grampsweb | Genealogy web app (Gramps Web) | grampsweb | -| openclaw | AI agent gateway (OpenClaw) | openclaw | -| poison-fountain | Anti-AI scraping (tarpit + poison) | poison-fountain | - ---- - -## Cloudflare Domains - -### Proxied (CDN + WAF enabled) -``` -blog, hackmd, privatebin, url, echo, f1tv, excalidraw, send, -audiobookshelf, jsoncrack, ntfy, cyberchef, homepage, linkwarden, -changedetection, tandoor, n8n, stirling-pdf, dashy, city-guesser, -travel, netbox -``` - -### Non-Proxied (Direct DNS) -``` -mail, wg, headscale, immich, calibre, vaultwarden, drone, -mailserver-antispam, mailserver-admin, webhook, uptime, -owntracks, dawarich, tuya, meshcentral, nextcloud, actualbudget, -onlyoffice, forgejo, freshrss, navidrome, ollama, openwebui, -isponsorblocktv, speedtest, freedify, rybbit, paperless, -servarr, prowlarr, bazarr, radarr, sonarr, flaresolverr, -jellyfin, jellyseerr, tdarr, affine, health, family, openclaw -``` - -### Special Subdomains -- `*.viktor.actualbudget` - Actualbudget factory instances -- `*.freedify` - Freedify factory instances -- `mailserver.*` - Mail server components (antispam, admin) - ---- - ## CI/CD -- Drone CI (`.drone.yml`) for automated deployments -- **Default pipeline**: On push, applies the `platform` stack via `terragrunt apply` (core infrastructure services; installs Terraform 1.5.7 + Terragrunt 0.99.4 in Alpine) -- **TLS renewal pipeline**: Cron-triggered, runs `renew2.sh` (certbot + Cloudflare DNS) — no Terraform/Terragrunt needed -- **Build CLI pipeline**: Builds Docker image from `cli/Dockerfile` (unchanged) -- **ALWAYS add `[ci skip]` to commit messages** when you've already run `terraform apply` to avoid triggering CI redundantly -- **After committing, run `git push origin master`** to sync changes - -## GitHub & Drone CI - -### GitHub API Access -- **Username**: `ViktorBarzin` -- **Token location**: `terraform.tfvars` as `github_pat` (git-crypt encrypted) -- **Read token**: `grep github_pat terraform.tfvars | cut -d'"' -f2` -- **Scopes**: Full access — `repo`, `admin:public_key`, `admin:repo_hook`, `delete_repo`, `admin:org`, `workflow`, `write:packages`, and more -- **`gh` CLI**: Blocked by sandbox 
restrictions — use `curl` with the GitHub API instead
-
-#### Common API Patterns
-```bash
-# Read token from tfvars
-GITHUB_TOKEN=$(grep github_pat terraform.tfvars | cut -d'"' -f2)
-
-# List repos
-curl -s -H "Authorization: token $GITHUB_TOKEN" "https://api.github.com/users/ViktorBarzin/repos?per_page=100"
-
-# Create repo
-curl -s -X POST -H "Authorization: token $GITHUB_TOKEN" "https://api.github.com/user/repos" \
-  -d '{"name":"repo-name","private":true}'
-
-# Add deploy key
-curl -s -X POST -H "Authorization: token $GITHUB_TOKEN" "https://api.github.com/repos/ViktorBarzin/<repo>/keys" \
-  -d '{"title":"key-name","key":"ssh-ed25519 ...","read_only":false}'
-
-# Create webhook (e.g., for Drone CI)
-curl -s -X POST -H "Authorization: token $GITHUB_TOKEN" "https://api.github.com/repos/ViktorBarzin/<repo>/hooks" \
-  -d '{"config":{"url":"https://drone.viktorbarzin.me/hook","content_type":"json","secret":"..."},"events":["push","pull_request"]}'
-
-# Get repo info
-curl -s -H "Authorization: token $GITHUB_TOKEN" "https://api.github.com/repos/ViktorBarzin/<repo>"
-```
-
-### Drone CI API Access
-- **Server**: `https://drone.viktorbarzin.me`
-- **Token location**: `terraform.tfvars` as `drone_api_token` (git-crypt encrypted)
-- **Read token**: `grep drone_api_token terraform.tfvars | cut -d'"' -f2`
-- **Username**: `ViktorBarzin`
-
-#### Common API Patterns
-```bash
-# Read token from tfvars
-DRONE_TOKEN=$(grep drone_api_token terraform.tfvars | cut -d'"' -f2)
-
-# List repos
-curl -s -H "Authorization: Bearer $DRONE_TOKEN" "https://drone.viktorbarzin.me/api/repos"
-
-# Activate repo in Drone
-curl -s -X POST -H "Authorization: Bearer $DRONE_TOKEN" "https://drone.viktorbarzin.me/api/repos/ViktorBarzin/<repo>"
-
-# Trigger build
-curl -s -X POST -H "Authorization: Bearer $DRONE_TOKEN" "https://drone.viktorbarzin.me/api/repos/ViktorBarzin/<repo>/builds"
-
-# Get build info
-curl -s -H "Authorization: Bearer $DRONE_TOKEN" "https://drone.viktorbarzin.me/api/repos/ViktorBarzin/<repo>/builds/<build>"
-
-# Add secret to repo
-curl -s -X POST -H "Authorization: Bearer $DRONE_TOKEN" "https://drone.viktorbarzin.me/api/repos/ViktorBarzin/<repo>/secrets" \
-  -d '{"name":"secret_name","data":"secret_value"}'
-```
-
-### Capabilities
-With these tokens, Claude can:
-- **GitHub**: Create/delete repos, push code, manage SSH/deploy keys, manage webhooks, manage org settings, manage packages
-- **Drone CI**: Activate repos, trigger/monitor builds, manage secrets, configure pipelines
+## CI/CD
+- Drone CI (`.drone.yml`): pushes apply `platform` stack (Terraform 1.5.7 + Terragrunt 0.99.4)
+- TLS renewal pipeline: cron-triggered `renew2.sh` (certbot + Cloudflare DNS)
+- **ALWAYS add `[ci skip]`** to commit messages when you've already applied locally
+- **After committing, run `git push origin master`** to sync
 
 ## Infrastructure
-- Proxmox hypervisor for VMs (192.168.1.127)
-- Kubernetes cluster with GPU node (5 nodes: k8s-master + k8s-node1-4, running v1.34.2)
-- NFS server at 10.0.10.15 for storage
-- Redis shared service at `redis.redis.svc.cluster.local`
-- Docker registry pull-through cache at 10.0.20.10 (static IP via cloud-init)
-  - Port 5000: docker.io (Docker Hub, with auth)
-  - Port 5010: ghcr.io
-  - Port 5020: quay.io
-  - Port 5030: registry.k8s.io
-  - Port 5040: reg.kyverno.io
-  - Worker nodes use `config_path = "/etc/containerd/certs.d"` with per-registry `hosts.toml` files
-  - k8s-master does NOT use pull-through cache (containerd 1.6.x incompatibility with config_path + mirrors)
+- Proxmox hypervisor (192.168.1.127) — see
`.claude/reference/proxmox-inventory.md` for full VM table +- Kubernetes cluster: 5 nodes (k8s-master + k8s-node1-4, v1.34.2), GPU on node1 (Tesla T4) +- NFS: `10.0.10.15`, Redis: `redis.redis.svc.cluster.local` +- Docker registry pull-through cache at `10.0.20.10` (ports 5000/5010/5020/5030/5040) +- GPU workloads need: `node_selector = { "gpu": "true" }` + `toleration { key = "nvidia.com/gpu", value = "true", effect = "NoSchedule" }` -### Proxmox Host Hardware -- **CPU**: Intel Xeon E5-2699 v4 @ 2.20GHz (22 cores / 44 threads, single socket) -- **RAM**: 142 GB (Dell R730 server) -- **GPU**: NVIDIA Tesla T4 (PCIe passthrough to k8s-node1) -- **Disks**: 1.1TB + 931GB + 10.7TB (local storage) -- **Proxmox access**: `ssh root@192.168.1.127` - -### Proxmox Network Bridges -- **vmbr0**: Physical bridge on `eno1`, IP `192.168.1.127/24` — connects to physical/home network (192.168.1.0/24) -- **vmbr1**: Internal-only bridge (no physical port), VLAN-aware — carries VLAN 10 (management 10.0.10.0/24) and VLAN 20 (kubernetes 10.0.20.0/24) - -### Proxmox VM Inventory - -| VMID | Name | Status | CPUs | RAM | Network | Disk | Notes | -|------|------|--------|------|-----|---------|------|-------| -| 101 | pfsense | running | 8 | 16GB | vmbr0, vmbr1:vlan10, vmbr1:vlan20 | 32G | Gateway/firewall, routes between all networks | -| 102 | devvm | running | 16 | 8GB | vmbr1:vlan10 | 100G | Development VM on management network | -| 103 | home-assistant | running | 8 | 16GB | vmbr1:vlan10(down), vmbr0 | 32G | Home Assistant, net0 link disabled, uses vmbr0 | -| 105 | pbs | stopped | 16 | 8GB | vmbr1:vlan10 | 32G | Proxmox Backup Server (not in use) | -| 200 | k8s-master | running | 8 | 16GB | vmbr1:vlan20 | 64G | Kubernetes control plane (10.0.20.100) | -| 201 | k8s-node1 | running | 16 | 24GB | vmbr1:vlan20 | 128G | GPU node, Tesla T4 passthrough (hostpci0) | -| 202 | k8s-node2 | running | 8 | 16GB | vmbr1:vlan20 | 64G | K8s worker node | -| 203 | k8s-node3 | running | 8 | 16GB | vmbr1:vlan20 | 64G | K8s worker node | -| 204 | k8s-node4 | running | 8 | 16GB | vmbr1:vlan20 | 64G | K8s worker node | -| 220 | docker-registry | running | 4 | 4GB | vmbr1:vlan20 | 64G | Terraform-managed, MAC DE:AD:BE:EF:22:22 (10.0.20.10) | -| 300 | Windows10 | running | 16 | 8GB | vmbr0 | 100G | Windows VM on physical network | -| 9000 | truenas | running | 16 | 16GB | vmbr1:vlan10 | 32G+7×256G+1T | NFS server (10.0.10.15), multiple data disks | - -#### VM Templates (stopped, used for cloning) -| VMID | Name | Purpose | -|------|------|---------| -| 1000 | ubuntu-2404-cloudinit-non-k8s-template | Base template for non-K8s VMs | -| 1001 | docker-registry-template | Template for docker registry VM | -| 2000 | ubuntu-2404-cloudinit-k8s-template | Base template for K8s nodes | - -#### Network Connectivity Summary -- **pfSense (101)** bridges all three networks: physical (vmbr0), management VLAN 10, and kubernetes VLAN 20 -- **K8s cluster** (200-204) + **docker-registry** (220) are all on VLAN 20 (kubernetes network) -- **TrueNAS** (9000) + **devvm** (102) + **PBS** (105) are on VLAN 10 (management network) -- **Home Assistant** (103) is on physical network (vmbr0), with a disabled VLAN 10 interface -- **Windows10** (300) is on physical network (vmbr0) only - -### GPU Node (k8s-node1) -- **VMID**: 201 -- **PCIe Passthrough**: `0000:06:00.0` (NVIDIA Tesla T4) -- **Taint**: `nvidia.com/gpu=true:NoSchedule` - Only GPU workloads can run here -- **Label**: `gpu=true` -- GPU workloads must have both: - - `node_selector = { "gpu": "true" }` - - 
`toleration { key = "nvidia.com/gpu", operator = "Equal", value = "true", effect = "NoSchedule" }` -- Taint is applied via `null_resource.gpu_node_taint` in `modules/kubernetes/nvidia/main.tf` - -## Git Operations (IMPORTANT) -- **Git is slow** on this repo due to many files - commands can take 30+ seconds -- Use `GIT_OPTIONAL_LOCKS=0` prefix if git hangs -- Always commit only specific files you changed, not everything -- **ALWAYS ask user before pushing to remote** - never push without explicit confirmation +## Git Operations +- **Git is slow** — commands can take 30+ seconds. Use `GIT_OPTIONAL_LOCKS=0` if git hangs. +- Commit only specific files. **ALWAYS ask user before pushing**. ## Prometheus Alerts -- Alert rules are in `modules/kubernetes/monitoring/prometheus_chart_values.tpl` -- Under `serverFiles.alerting_rules.yml.groups` +- Rules in `modules/kubernetes/monitoring/prometheus_chart_values.tpl` - Groups: "R730 Host", "Nvidia Tesla T4 GPU", "Power", "Cluster" -- kube-state-metrics provides: `kube_deployment_*`, `kube_statefulset_*`, `kube_daemonset_*` -## Tier System -- **0-core**: Critical infrastructure (ingress, DNS, VPN, auth) -- **1-cluster**: Cluster services (Redis, metrics, security) -- **2-gpu**: GPU workloads (Immich, Ollama, Frigate) -- **3-edge**: User-facing services -- **4-aux**: Optional/auxiliary services - -### Resource Governance (Kyverno-based) -Four layers of noisy-neighbor protection, all defined in `modules/kubernetes/kyverno/resource-governance.tf`: - -1. **PriorityClasses**: `tier-0-core` (1M) through `tier-4-aux` (200K). `tier-4-aux` uses `preemption_policy=Never`. -2. **LimitRange defaults** (Kyverno generate): Auto-creates `tier-defaults` LimitRange in namespaces based on tier label. Only affects containers without explicit resources. -3. **ResourceQuotas** (Kyverno generate): Auto-creates `tier-quota` ResourceQuota in namespaces with tier labels. Excludes namespaces with `resource-governance/custom-quota=true` label. -4. **Priority injection** (Kyverno mutate): Sets `priorityClassName` on Pods based on namespace tier label. - -**Custom quota override**: Add label `resource-governance/custom-quota: "true"` to namespace, then define a custom `kubernetes_resource_quota` in the service's Terraform module. Currently used by: monitoring, crowdsec. - -**LimitRange defaults by tier**: -| Tier | Default Req | Default Limit | Max | -|------|------------|--------------|-----| -| 0-core | 100m/128Mi | 2/4Gi | 8/16Gi | -| 1-cluster | 100m/128Mi | 2/4Gi | 4/8Gi | -| 2-gpu | 100m/256Mi | 4/8Gi | 8/16Gi | -| 3-edge | 50m/128Mi | 1/2Gi | 4/8Gi | -| 4-aux | 25m/64Mi | 500m/1Gi | 2/4Gi | - -**ResourceQuota hard limits by tier**: -| Tier | Req CPU | Req Mem | Lim CPU | Lim Mem | Pods | -|------|---------|---------|---------|---------|------| -| 0-core | 8 | 8Gi | 32 | 64Gi | 100 | -| 1-cluster | 4 | 4Gi | 16 | 32Gi | 30 | -| 2-gpu | 8 | 8Gi | 48 | 96Gi | 40 | -| 3-edge | 4 | 4Gi | 16 | 32Gi | 30 | -| 4-aux | 2 | 2Gi | 8 | 16Gi | 20 | +## Tier System & Resource Governance +- **0-core**: Critical infra (ingress, DNS, VPN, auth) | **1-cluster**: Redis, metrics, security | **2-gpu**: GPU workloads | **3-edge**: User-facing | **4-aux**: Optional +- Kyverno-based governance in `modules/kubernetes/kyverno/resource-governance.tf`: + 1. PriorityClasses: `tier-0-core` (1M) through `tier-4-aux` (200K, preemption=Never) + 2. LimitRange defaults (Kyverno generate): auto-created per namespace tier + 3. 
ResourceQuotas (Kyverno generate): auto-created per namespace tier (skip with label `resource-governance/custom-quota=true`)
+  4. Priority injection (Kyverno mutate): sets priorityClassName on Pods
+- Custom quota override: monitoring, crowdsec
 
 ---
 
 ## User Preferences
-
-### Calendar
-- **Default calendar**: Nextcloud (always use unless otherwise specified)
-- **Nextcloud URL**: `https://nextcloud.viktorbarzin.me`
-- **CalDAV endpoint**: `https://nextcloud.viktorbarzin.me/remote.php/dav/calendars/<user>/<calendar>/`
-
-### Home Assistant
-- **Default smart home**: Home Assistant (always use for smart home control)
-- **Two deployments**:
-  - **ha-london** (default): `https://ha-london.viktorbarzin.me` | Script: `.claude/home-assistant.py` | SSH: `ssh pi@192.168.8.103`, config at `/home/pi/docker/homeAssistant/`
-  - **ha-sofia**: `https://ha-sofia.viktorbarzin.me` | Script: `.claude/home-assistant-sofia.py` | SSH: `ssh vbarzin@192.168.1.8`, config at `/config/`
-- **Aliases**: "ha" or "HA" = ha-london. "ha sofia" or "ha-sofia" = ha-sofia.
-
-### Development
-- **Frontend framework**: Svelte (user is learning it, so use Svelte for all new web apps)
-
-### Pod Monitoring After Updates
-- **Never use `sleep` to wait for pods** — instead, spawn a background subagent (Task tool with `run_in_background: true`) that continuously checks pod state (e.g., `kubectl get pods -n <namespace> -w`) and reports back when the pod is ready or if errors occur. This catches CrashLoopBackOff, ImagePullBackOff, and other failures much sooner than periodic sleep-based polling.
+- **Calendar**: Nextcloud at `https://nextcloud.viktorbarzin.me`
+- **Home Assistant**: ha-london (default) at `https://ha-london.viktorbarzin.me`, ha-sofia at `https://ha-sofia.viktorbarzin.me`. "ha"/"HA" = ha-london.
+- **Frontend**: Svelte for all new web apps
+- **Pod monitoring**: Never use `sleep` — spawn background subagent with `kubectl get pods -w` instead
 
 ---
 
-## Skills & Workflows
-
-Skills are specialized workflows for common tasks. Located in `.claude/skills/`.
-
-### Available Skills
-
-**setup-project** (`.claude/skills/setup-project/SKILL.md`)
-- Deploy new self-hosted services from GitHub repos
-- Automated workflow: Docker image → Terraform module → Deploy
-- Handles database setup, ingress, DNS configuration
-- **When to use**: User provides GitHub URL or wants to deploy a new service
-- **Example**: "Deploy [GitHub repo] to the cluster"
-
-**extend-vm-storage** (`.claude/skills/extend-vm-storage/SKILL.md`)
-- Extend disk storage on K8s node VMs (Proxmox-hosted)
-- Automates: drain → shutdown → resize → boot → expand filesystem → uncordon
-- **When to use**: A k8s node needs more disk space
-- **Example**: "Extend storage on k8s-node2 by 64G"
+## Reference Data
+- `.claude/reference/service-catalog.md` — Full service catalog (70+ services) with Cloudflare domains
+- `.claude/reference/proxmox-inventory.md` — VM table, hardware specs, network topology, GPU config
+- `.claude/reference/github-drone-api.md` — GitHub & Drone CI API patterns with curl examples
+- `.claude/reference/authentik-state.md` — Current applications, groups, users, login sources
 
 ---
 
 ## Service-Specific Notes
 
 ### Authentik (Identity Provider)
-- **Helm Chart**: `authentik` v2025.10.3 from `https://charts.goauthentik.io/`
-- **URL**: `https://authentik.viktorbarzin.me`
-- **API**: `https://authentik.viktorbarzin.me/api/v3/`
-- **API Token**: Stored in `terraform.tfvars` as `authentik_api_token` (non-expiring, superuser, identifier: `claude-code-permanent`).
Read with: `grep authentik_api_token terraform.tfvars | cut -d'"' -f2`
-- **Namespace**: `authentik` (tier: cluster)
-- **Architecture**: 3 server replicas + 3 worker replicas + 3 PgBouncer replicas + 1 embedded outpost
-- **Database**: PostgreSQL via `postgresql.dbaas:5432`, pooled through PgBouncer at `pgbouncer.authentik:6432`
-- **Redis**: Shared at `redis.redis.svc.cluster.local`
-- **Terraform**: `modules/kubernetes/authentik/main.tf` (Helm), `pgbouncer.tf` (connection pooling)
-
-#### Authentik API Management
-To call the API, use:
-```bash
-curl -s -H "Authorization: Bearer <token>" "https://authentik.viktorbarzin.me/api/v3/<endpoint>/"
-```
-
-Key API endpoints:
-- `core/users/` — List/create/update/delete users
-- `core/groups/` — List/create/update/delete groups
-- `core/applications/` — List/create applications
-- `providers/all/` — List all providers (OAuth2, Proxy, etc.)
-- `providers/oauth2/` — OAuth2/OIDC providers specifically
-- `providers/proxy/` — Proxy providers (forward auth)
-- `flows/instances/` — List flows
-- `stages/all/` — List stages
-- `sources/all/` — List sources (Google, GitHub, etc.)
-- `outposts/instances/` — List outposts
-- `propertymappings/all/` — List property mappings
-- `rbac/roles/` — List roles
-
-#### Current Applications (9)
-| Application | Provider Type | Auth Flow |
-|-------------|--------------|-----------|
-| Cloudflare Access | OAuth2/OIDC | explicit consent |
-| Domain wide catch all | Proxy (forward auth) | implicit consent |
-| Grafana | OAuth2/OIDC | implicit consent |
-| Headscale | OAuth2/OIDC | explicit consent |
-| Immich | OAuth2/OIDC | explicit consent |
-| Kubernetes | OAuth2/OIDC (public) | implicit consent |
-| linkwarden | OAuth2/OIDC | explicit consent |
-| Matrix | OAuth2/OIDC | implicit consent |
-| wrongmove | OAuth2/OIDC | implicit consent |
-
-#### Current Groups (9)
-| Group | Parent | Superuser | Purpose |
-|-------|--------|-----------|---------|
-| Allow Login Users | — | No | Parent group for login-permitted users |
-| authentik Admins | — | Yes | Full admin access |
-| authentik Read-only | — | No | Read-only access (has role) |
-| Headscale Users | Allow Login Users | No | VPN access |
-| Home Server Admins | Allow Login Users | No | Server admin access |
-| Wrongmove Users | Allow Login Users | No | Real-estate app access |
-| kubernetes-admins | — | No | K8s cluster-admin RBAC |
-| kubernetes-power-users | — | No | K8s power-user RBAC |
-| kubernetes-namespace-owners | — | No | K8s namespace-owner RBAC |
-
-#### Current Users (8 real users)
-| Username | Name | Type | Groups |
-|----------|------|------|--------|
-| akadmin | authentik Default Admin | internal | authentik Admins, Home Server Admins, Headscale Users |
-| vbarzin@gmail.com | Viktor Barzin | internal | authentik Admins, Home Server Admins, Wrongmove Users, Headscale Users |
-| emil.barzin@gmail.com | Emil Barzin | internal | Home Server Admins, Headscale Users |
-| ancaelena98@gmail.com | Anca Milea | external | Wrongmove Users, Headscale Users |
-| vabbit81@gmail.com | GHEORGHE Milea | external | Headscale Users |
-| valentinakolevabarzina@gmail.com | Валентина Колева-Барзина | internal | Headscale Users |
-| anca.r.cristian10@gmail.com | — | internal | Wrongmove Users |
-| kadir.tugan@gmail.com | Kadir | internal | Wrongmove Users |
-
-#### Login Sources (Social Login)
-- **Google** (OAuth) — user matching by identifier
-- **GitHub** (OAuth) — user matching by email_link
-- **Facebook** (OAuth) — user matching by email_link
-- All use the same authentication
flow (`1a779f24`) and enrollment flow (`87572804`) - -#### Authorization Flows -- **Explicit consent** (`default-provider-authorization-explicit-consent`): Shows consent screen before redirecting — used for Immich, Linkwarden, Headscale, Cloudflare -- **Implicit consent** (`default-provider-authorization-implicit-consent`): Auto-redirects without consent — used for Grafana, Matrix, Domain catch-all, Wrongmove - -#### Traefik Integration -- Forward auth middleware: `authentik-forward-auth` in Traefik namespace -- Outpost endpoint: `http://ak-outpost-authentik-embedded-outpost.authentik.svc.cluster.local:9000/outpost.goauthentik.io/auth/traefik` -- Services opt in via `protected = true` in `ingress_factory` -- Response headers: `X-authentik-username`, `X-authentik-uid`, `X-authentik-email`, `X-authentik-name`, `X-authentik-groups`, `Set-Cookie` - -#### OIDC for Kubernetes API -- **Issuer**: `https://authentik.viktorbarzin.me/application/o/kubernetes/` -- **Client ID**: `kubernetes` (public client, no secret) -- **Username claim**: `email`, **Groups claim**: `groups` -- **Signing key**: `authentik Self-signed Certificate` (must be assigned to the provider or JWKS will be empty) -- **Redirect URIs**: Regex mode `http://localhost:.*` and `http://127\.0\.0\.1:.*` (kubelogin picks random ports) -- **Configured via**: SSH to kube-apiserver manifest (`modules/kubernetes/rbac/apiserver-oidc.tf`) -- **RBAC module**: `modules/kubernetes/rbac/main.tf` — admin/power-user/namespace-owner roles -- **Self-service portal**: `modules/kubernetes/k8s-portal/` — SvelteKit app at `https://k8s-portal.viktorbarzin.me` -- **User definition**: `k8s_users` variable in `terraform.tfvars` -- **Audit logging**: Enabled via `modules/kubernetes/rbac/audit-policy.tf`, logs at `/var/log/kubernetes/audit.log` - -**CRITICAL GOTCHAS when setting up Authentik OIDC for Kubernetes:** -1. **Signing key MUST be assigned** to the OAuth2 provider. Without it, the JWKS endpoint returns `{}` and kube-apiserver can't validate tokens. -2. **Email mapping must set `email_verified: True`**. The default Authentik email scope mapping hardcodes `email_verified: False`, which causes kube-apiserver to reject the token with `oidc: email not verified`. Use a custom scope mapping: `return {"email": request.user.email, "email_verified": True}` -3. **kubelogin needs `--oidc-extra-scope`** for `email`, `profile`, `groups`. Without these, only `openid` is requested and the token lacks the `email` claim, causing `oidc: parse username claims "email": claim not present`. -4. **Redirect URIs must use regex mode** (`http://localhost:.*`) because kubelogin picks random ports, not just 8000/18000. -5. **Kubelet static pod manifest changes** require a full cycle to take effect: remove manifest, stop kubelet, remove containers via crictl, re-add manifest, start kubelet. Simple `touch` or kubelet restart is not enough. -6. **Property mappings endpoint** in Authentik 2025.10.x is `propertymappings/provider/scope/` (not the older `propertymappings/scope/`). - -#### Common Management Tasks -**Add a new OAuth2 application:** -1. Create OAuth2 provider: `POST /api/v3/providers/oauth2/` with client_id, client_secret, redirect_uris, authorization_flow, etc. -2. Create application: `POST /api/v3/core/applications/` with name, slug, provider pk -3. 
(Optional) Bind to group policy for access control
-
-**Add a user to a group:**
-```bash
-# Get group pk, then PATCH with updated users list
-curl -X PATCH -H "Authorization: Bearer <token>" -H "Content-Type: application/json" \
-  "https://authentik.viktorbarzin.me/api/v3/core/groups/<group_pk>/" \
-  -d '{"users": [<user_pk_1>, <user_pk_2>]}'
-```
-
-**Protect a service with forward auth:**
-Set `protected = true` in the service's `ingress_factory` call in Terraform.
+- **URL**: `https://authentik.viktorbarzin.me` | **API**: `/api/v3/` | **Token**: `authentik_api_token` in tfvars
+- **Architecture**: 3 server + 3 worker + 3 PgBouncer + embedded outpost
+- **Database**: PostgreSQL via `postgresql.dbaas:5432`, PgBouncer at `pgbouncer.authentik:6432`
+- **Traefik integration**: Forward auth via `protected = true` in ingress_factory
+- **OIDC for K8s**: Issuer `https://authentik.viktorbarzin.me/application/o/kubernetes/`, client `kubernetes` (public)
+- For management tasks, current state, and OIDC gotchas: see `authentik` and `authentik-oidc-kubernetes` skills
+- For current apps/groups/users snapshot: see `.claude/reference/authentik-state.md`
 
 ### AFFiNE (Visual Canvas)
-- **Image**: `ghcr.io/toeverything/affine:stable`
-- **Port**: 3010
-- **Requires**: PostgreSQL + Redis
+- **Image**: `ghcr.io/toeverything/affine:stable` | **Port**: 3010 | **Requires**: PostgreSQL + Redis
 - **Migration**: Init container runs `node ./scripts/self-host-predeploy.js`
-- **Storage**: NFS at `/mnt/main/affine` mounted to `/root/.affine/storage` and `/root/.affine/config`
-- **Key env vars**:
-  - `AFFINE_SERVER_EXTERNAL_URL` - Public URL (e.g., `https://affine.viktorbarzin.me`)
-  - `AFFINE_SERVER_HTTPS` - Set to `true` behind TLS ingress
-  - `DATABASE_URL` - PostgreSQL connection string
-  - `REDIS_SERVER_HOST` - Redis hostname
-  - `MAILER_*` - SMTP configuration for email invites
-- **Local-first**: Data stored in browser by default; syncs to server when user creates account
-- **Docs**: https://docs.affine.pro/self-host-affine
+- **Storage**: NFS `/mnt/main/affine` → `/root/.affine/storage` and `/root/.affine/config`
 
-### Wyoming Whisper (STT for Home Assistant)
-- **Image**: `rhasspy/wyoming-whisper:latest`
-- **Port**: 10300/TCP (Wyoming protocol)
-- **Model**: `small-int8` (CPU-optimized, no CUDA variant available from upstream)
-- **Runs on**: GPU node (node_selector gpu=true + nvidia toleration) but uses CPU only
-- **Storage**: NFS at `/mnt/main/whisper` → `/data` (model cache)
-- **Exposure**: Internal only via Traefik TCP entrypoint `whisper-tcp` → IngressRouteTCP
-- **Access**: `10.0.20.202:10300` (Traefik LB IP, no public DNS)
-- **HA Integration**: Wyoming Protocol integration in ha-london, host `10.0.20.202`, port `10300`
-- **No GPU acceleration**: Official image is CPU-only (Debian + PyTorch CPU). The `mib1185/wyoming-faster-whisper-cuda` image exists but requires self-build.
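The TCP exposure noted above (`whisper-tcp` entrypoint → IngressRouteTCP) could be expressed in Terraform roughly as follows. This is a sketch under stated assumptions: the resource layout and field values are illustrative, not copied from the repo.

```hcl
# Sketch: raw-TCP routing for Wyoming Whisper through Traefik (values assumed)
resource "kubernetes_manifest" "whisper_tcp" {
  manifest = {
    apiVersion = "traefik.io/v1alpha1"
    kind       = "IngressRouteTCP"
    metadata = {
      name      = "whisper"
      namespace = "whisper"
    }
    spec = {
      entryPoints = ["whisper-tcp"] # custom entrypoint exposed on the Traefik LB IP
      routes = [{
        match = "HostSNI(`*`)" # plain TCP, no TLS/SNI, so match everything on this entrypoint
        services = [{
          name = "whisper"
          port = 10300
        }]
      }]
    }
  }
}
```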
+### Wyoming Whisper (STT) +- **Image**: `rhasspy/wyoming-whisper:latest` | **Port**: 10300/TCP (Wyoming protocol) +- **Model**: `small-int8` (CPU-only) | **Access**: `10.0.20.202:10300` (internal, no public DNS) +- **HA Integration**: Wyoming Protocol in ha-london ### Gramps Web (Genealogy) -- **Image**: `ghcr.io/gramps-project/grampsweb:latest` -- **Port**: 5000 -- **URL**: `https://family.viktorbarzin.me` -- **Components**: Web app + Celery worker (2 containers in 1 pod) -- **Requires**: Shared Redis (DB 2 for Celery broker/backend, DB 3 for rate limiting) -- **Storage**: NFS at `/mnt/main/grampsweb` with sub_paths: users, indexdir, thumbnail_cache, cache, secret, grampsdb, media, tmp -- **Key env vars**: - - `GRAMPSWEB_SECRET_KEY` - Flask secret key (generated via `random_password`) - - `GRAMPSWEB_TREE` - Tree name - - `GRAMPSWEB_BASE_URL` - Public URL - - `GRAMPSWEB_CELERY_CONFIG__broker_url` / `result_backend` - Redis connection - - `GRAMPSWEB_REGISTRATION_DISABLED` - Set to `True` - - `GRAMPSWEB_EMAIL_*` - SMTP configuration - - `GRAMPSWEB_LLM_*` - Ollama AI integration -- **Celery command**: `celery -A gramps_webapi.celery worker --loglevel=INFO --concurrency=2` -- **Registration**: Disabled; first user created via UI setup wizard +- **Image**: `ghcr.io/gramps-project/grampsweb:latest` | **Port**: 5000 | **URL**: `https://family.viktorbarzin.me` +- **Components**: Web app + Celery worker (2 containers in 1 pod) | **Redis**: DB 2 (broker), DB 3 (rate limiting) +- **Storage**: NFS `/mnt/main/grampsweb` with sub_paths -### Loki + Alloy (Centralized Log Collection) -- **Loki image**: `grafana/loki:3.6.5` (Helm chart, single binary mode) -- **Alloy image**: `grafana/alloy:v1.13.0` (Helm chart, DaemonSet) -- **Config files**: `modules/kubernetes/monitoring/loki.tf`, `loki.yaml`, `alloy.yaml` -- **Port**: 3100/TCP (Loki API) -- **Storage**: NFS PV at `/mnt/main/loki/loki` (15Gi), WAL on tmpfs (2Gi in-memory) -- **Memory**: Loki 6Gi limit, Alloy 128Mi per pod (4 worker nodes) -- **Disk-friendly tuning**: `max_chunk_age: 24h`, `chunk_idle_period: 12h` — holds chunks in memory, flushes ~once/day -- **Retention**: 7 days (`retention_period: 168h`), compactor enforces deletion -- **Crash policy**: WAL on tmpfs — up to 24h log loss on crash (alerts still fire in real-time) -- **Ruler**: Evaluates LogQL alert rules, fires to `http://prometheus-alertmanager.monitoring.svc.cluster.local:9093` +### Loki + Alloy (Log Collection) +- **Loki**: `grafana/loki:3.6.5` (single binary, 6Gi RAM, 7d retention) +- **Alloy**: `grafana/alloy:v1.13.0` (DaemonSet, 128Mi/pod) +- **Storage**: NFS PV `/mnt/main/loki/loki` (15Gi), WAL on tmpfs (2Gi) - **Alert rules**: HighErrorRate, PodCrashLoopBackOff, OOMKilled (ConfigMap `loki-alert-rules`) -- **Grafana**: Datasource UID `P8E80F9AEF21F6940`, dashboard "Loki Kubernetes Logs" (stored in MySQL, not file-provisioned) -- **Sysctl DaemonSet**: `sysctl-inotify` sets `fs.inotify.max_user_watches=1048576` on all nodes (required for Alloy fsnotify) -- **Disabled components**: gateway, chunksCache, resultsCache (not needed for single binary) -- **Key paths**: Compactor at `/var/loki/compactor`, ruler scratch at `/var/loki/scratch` (must be under `/var/loki` — root FS is read-only) -- **Querying**: Grafana Explore with LogQL, e.g. 
`{namespace="monitoring"} |= "error"` -- **Troubleshooting**: If "entry too far behind" errors on first start, restart Alloy DaemonSet (`kubectl rollout restart ds -n monitoring alloy`) — Alloy reads historical logs on first boot, which Loki rejects; clears after restart +- **Troubleshooting**: "entry too far behind" on first start → restart Alloy DaemonSet ### OpenClaw (AI Agent Gateway) -- **Image**: `ghcr.io/openclaw/openclaw:2026.2.9` -- **Port**: 18789 -- **URL**: `https://openclaw.viktorbarzin.me` (authentik-protected) -- **Namespace**: `openclaw` (tier: aux) -- **Formerly**: `moltbot` — renamed in Feb 2026 -- **Architecture**: Single pod with init container (tools download + repo clone) + main container (OpenClaw gateway) -- **Init container**: Downloads kubectl v1.34.2, terraform 1.14.5, git-crypt; clones infra repo; runs terraform init -- **ServiceAccount**: `openclaw` with `cluster-admin` ClusterRoleBinding (for managing cluster resources) -- **Storage**: NFS at `/mnt/main/openclaw/workspace` (git repo) and `/mnt/main/openclaw/data` (persistent data) -- **Config**: `openclaw.json` ConfigMap with model providers (Gemini, Ollama, Llama API), tool permissions, and agent defaults -- **Variables**: `openclaw_ssh_key`, `openclaw_skill_secrets` in `terraform.tfvars` -- **Skill secrets**: Home Assistant tokens (london + sofia), Uptime Kuma password — passed as env vars -- **Model providers**: Gemini (gemini-2.5-flash), Ollama (qwen2.5-coder:14b, deepseek-r1:14b), Llama API (Llama-3.3-70B, Llama-4-Scout/Maverick) +- **Image**: `ghcr.io/openclaw/openclaw:2026.2.9` | **Port**: 18789 | **URL**: `https://openclaw.viktorbarzin.me` +- **Init container**: Downloads kubectl, terraform, git-crypt; clones infra repo +- **ServiceAccount**: `openclaw` with `cluster-admin` ClusterRoleBinding +- **Model providers**: Gemini (gemini-2.5-flash), Ollama (qwen2.5-coder:14b, deepseek-r1:14b), Llama API -### Poison Fountain (Anti-AI Scraping Service) -- **Image**: `python:3.12-slim` (runs custom `server.py` from ConfigMap) -- **Port**: 8080 -- **URL**: `https://poison.viktorbarzin.me` (public, no auth) -- **Namespace**: `poison-fountain` (tier: aux) -- **Stack**: `stacks/poison-fountain/` -- **Architecture**: 1 Deployment (Python HTTP server) + 1 CronJob (fetcher, every 6h) -- **Storage**: NFS at `/mnt/main/poison-fountain` — `cache/` subdir for poison docs (chmod 777 for curl uid 101) -- **Endpoints**: - - `/auth` — ForwardAuth: checks User-Agent, returns 200 (allow) or 403 (block AI bots) - - `/article/*` — Tarpit: drip-feeds poison content at ~100 bytes/sec (DRIP_BYTES=50, DRIP_DELAY=0.5s) - - `/healthz` — Health check -- **CronJob**: Fetches 50 documents from `rnsaffn.com/poison2/` using `--http1.1` (HTTP/2 hangs) -- **Ingress**: Uses `anti_ai_scraping = false` (doesn't protect itself), `skip_default_rate_limit = true`, `exclude_crowdsec = true` -- **DNS**: `poison.viktorbarzin.me` in `cloudflare_non_proxied_names` -- **Traefik middlewares** (in `stacks/platform/modules/traefik/middleware.tf`): - - `ai-bot-block` — ForwardAuth to poison-fountain `/auth` - - `anti-ai-headers` — X-Robots-Tag: noai, noimageai - - `anti-ai-trap-links` — rewrite-body plugin injecting 5 hidden links before `` +## Service Versions (as of 2026-02) +Immich v2.4.1 | AFFiNE stable | Whisper latest | Loki 3.6.5 | Alloy v1.13.0 | OpenClaw 2026.2.9 diff --git a/.claude/reference/authentik-state.md b/.claude/reference/authentik-state.md new file mode 100644 index 00000000..34fef453 --- /dev/null +++ 
b/.claude/reference/authentik-state.md @@ -0,0 +1,50 @@ +# Authentik Current State + +> Snapshot of applications, groups, users, and flows. Use `authentik` skill for management tasks. + +## Applications (9) +| Application | Provider Type | Auth Flow | +|-------------|--------------|-----------| +| Cloudflare Access | OAuth2/OIDC | explicit consent | +| Domain wide catch all | Proxy (forward auth) | implicit consent | +| Grafana | OAuth2/OIDC | implicit consent | +| Headscale | OAuth2/OIDC | explicit consent | +| Immich | OAuth2/OIDC | explicit consent | +| Kubernetes | OAuth2/OIDC (public) | implicit consent | +| linkwarden | OAuth2/OIDC | explicit consent | +| Matrix | OAuth2/OIDC | implicit consent | +| wrongmove | OAuth2/OIDC | implicit consent | + +## Groups (9) +| Group | Parent | Superuser | Purpose | +|-------|--------|-----------|---------| +| Allow Login Users | — | No | Parent group for login-permitted users | +| authentik Admins | — | Yes | Full admin access | +| authentik Read-only | — | No | Read-only access (has role) | +| Headscale Users | Allow Login Users | No | VPN access | +| Home Server Admins | Allow Login Users | No | Server admin access | +| Wrongmove Users | Allow Login Users | No | Real-estate app access | +| kubernetes-admins | — | No | K8s cluster-admin RBAC | +| kubernetes-power-users | — | No | K8s power-user RBAC | +| kubernetes-namespace-owners | — | No | K8s namespace-owner RBAC | + +## Users (7 real) +| Username | Name | Type | Groups | +|----------|------|------|--------| +| akadmin | authentik Default Admin | internal | authentik Admins, Home Server Admins, Headscale Users | +| vbarzin@gmail.com | Viktor Barzin | internal | authentik Admins, Home Server Admins, Wrongmove Users, Headscale Users | +| emil.barzin@gmail.com | Emil Barzin | internal | Home Server Admins, Headscale Users | +| ancaelena98@gmail.com | Anca Milea | external | Wrongmove Users, Headscale Users | +| vabbit81@gmail.com | GHEORGHE Milea | external | Headscale Users | +| valentinakolevabarzina@gmail.com | Валентина Колева-Барзина | internal | Headscale Users | +| anca.r.cristian10@gmail.com | — | internal | Wrongmove Users | +| kadir.tugan@gmail.com | Kadir | internal | Wrongmove Users | + +## Login Sources +- **Google** (OAuth) — user matching by identifier +- **GitHub** (OAuth) — user matching by email_link +- **Facebook** (OAuth) — user matching by email_link + +## Authorization Flows +- **Explicit consent** (`default-provider-authorization-explicit-consent`): Shows consent screen +- **Implicit consent** (`default-provider-authorization-implicit-consent`): Auto-redirects diff --git a/.claude/reference/github-drone-api.md b/.claude/reference/github-drone-api.md new file mode 100644 index 00000000..6760b5fe --- /dev/null +++ b/.claude/reference/github-drone-api.md @@ -0,0 +1,50 @@ +# GitHub & Drone CI API Reference + +> Token locations and common API patterns. 
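+
+Quick sanity check that both tokens still work (a minimal sketch: GitHub's `/user` and Drone's `/api/user` both return the authenticated account; adjust the grep patterns if the tfvars keys change):
+
+```bash
+# Verify the GitHub PAT resolves to the expected account
+GITHUB_TOKEN=$(grep github_pat terraform.tfvars | cut -d'"' -f2)
+curl -s -H "Authorization: token $GITHUB_TOKEN" "https://api.github.com/user" | grep '"login"'
+
+# Verify the Drone token is accepted by the server
+DRONE_TOKEN=$(grep drone_api_token terraform.tfvars | cut -d'"' -f2)
+curl -s -H "Authorization: Bearer $DRONE_TOKEN" "https://drone.viktorbarzin.me/api/user" | grep '"login"'
+```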
+
+## GitHub API
+- **Username**: `ViktorBarzin`
+- **Token**: `grep github_pat terraform.tfvars | cut -d'"' -f2` (git-crypt encrypted)
+- **Scopes**: Full access (repo, admin:public_key, admin:repo_hook, delete_repo, admin:org, workflow, write:packages)
+- **`gh` CLI**: Blocked by sandbox — use `curl` instead
+
+```bash
+GITHUB_TOKEN=$(grep github_pat terraform.tfvars | cut -d'"' -f2)
+
+# List repos
+curl -s -H "Authorization: token $GITHUB_TOKEN" "https://api.github.com/users/ViktorBarzin/repos?per_page=100"
+
+# Create repo
+curl -s -X POST -H "Authorization: token $GITHUB_TOKEN" "https://api.github.com/user/repos" \
+ -d '{"name":"repo-name","private":true}'
+
+# Add deploy key
+curl -s -X POST -H "Authorization: token $GITHUB_TOKEN" "https://api.github.com/repos/ViktorBarzin/<repo>/keys" \
+ -d '{"title":"key-name","key":"ssh-ed25519 ...","read_only":false}'
+
+# Create webhook
+curl -s -X POST -H "Authorization: token $GITHUB_TOKEN" "https://api.github.com/repos/ViktorBarzin/<repo>/hooks" \
+ -d '{"config":{"url":"https://drone.viktorbarzin.me/hook","content_type":"json","secret":"..."},"events":["push","pull_request"]}'
+```
+
+## Drone CI API
+- **Server**: `https://drone.viktorbarzin.me`
+- **Token**: `grep drone_api_token terraform.tfvars | cut -d'"' -f2`
+
+```bash
+DRONE_TOKEN=$(grep drone_api_token terraform.tfvars | cut -d'"' -f2)
+
+# Activate repo
+curl -s -X POST -H "Authorization: Bearer $DRONE_TOKEN" "https://drone.viktorbarzin.me/api/repos/ViktorBarzin/<repo>"
+
+# Trigger build
+curl -s -X POST -H "Authorization: Bearer $DRONE_TOKEN" "https://drone.viktorbarzin.me/api/repos/ViktorBarzin/<repo>/builds"
+
+# Add secret
+curl -s -X POST -H "Authorization: Bearer $DRONE_TOKEN" "https://drone.viktorbarzin.me/api/repos/ViktorBarzin/<repo>/secrets" \
+ -d '{"name":"secret_name","data":"secret_value"}'
+```
+
+## Capabilities
+- **GitHub**: Create/delete repos, push code, manage SSH/deploy keys, manage webhooks, manage org settings, manage packages
+- **Drone CI**: Activate repos, trigger/monitor builds, manage secrets, configure pipelines
diff --git a/.claude/reference/proxmox-inventory.md b/.claude/reference/proxmox-inventory.md
new file mode 100644
index 00000000..daec0ff7
--- /dev/null
+++ b/.claude/reference/proxmox-inventory.md
@@ -0,0 +1,52 @@
+# Proxmox Inventory & Infrastructure
+
+> Static reference for VMs, hardware, and network topology.
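+
+To refresh this snapshot against live state, a minimal sketch using standard Proxmox CLI commands (assumes the SSH access noted below; VMID 201 is k8s-node1 per the inventory table):
+
+```bash
+# List all VMs with VMID, name, status, memory and boot disk
+ssh root@192.168.1.127 qm list
+
+# Inspect a single VM's CPU/RAM/network/disk config (e.g., k8s-node1)
+ssh root@192.168.1.127 qm config 201
+```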
+ +## Proxmox Host Hardware +- **CPU**: Intel Xeon E5-2699 v4 @ 2.20GHz (22 cores / 44 threads, single socket) +- **RAM**: 142 GB (Dell R730 server) +- **GPU**: NVIDIA Tesla T4 (PCIe passthrough to k8s-node1) +- **Disks**: 1.1TB + 931GB + 10.7TB (local storage) +- **Proxmox access**: `ssh root@192.168.1.127` + +## Network Topology +``` +10.0.10.0/24 - Management: Wizard (10.0.10.10), TrueNAS NFS (10.0.10.15) +10.0.20.0/24 - Kubernetes: pfSense GW (10.0.20.1), Registry (10.0.20.10), + k8s-master (10.0.20.100), DNS (10.0.20.101), MetalLB (10.0.20.102-200) +192.168.1.0/24 - Physical: Proxmox (192.168.1.127) +``` + +## Network Bridges +- **vmbr0**: Physical bridge on `eno1`, IP `192.168.1.127/24` — physical/home network +- **vmbr1**: Internal-only bridge, VLAN-aware — VLAN 10 (management) and VLAN 20 (kubernetes) + +## VM Inventory + +| VMID | Name | Status | CPUs | RAM | Network | Disk | Notes | +|------|------|--------|------|-----|---------|------|-------| +| 101 | pfsense | running | 8 | 16GB | vmbr0, vmbr1:vlan10, vmbr1:vlan20 | 32G | Gateway/firewall | +| 102 | devvm | running | 16 | 8GB | vmbr1:vlan10 | 100G | Development VM | +| 103 | home-assistant | running | 8 | 16GB | vmbr0 | 32G | HA, net0(vlan10) disabled | +| 105 | pbs | stopped | 16 | 8GB | vmbr1:vlan10 | 32G | Proxmox Backup (unused) | +| 200 | k8s-master | running | 8 | 16GB | vmbr1:vlan20 | 64G | Control plane (10.0.20.100) | +| 201 | k8s-node1 | running | 16 | 24GB | vmbr1:vlan20 | 128G | GPU node, Tesla T4 | +| 202 | k8s-node2 | running | 8 | 16GB | vmbr1:vlan20 | 64G | Worker | +| 203 | k8s-node3 | running | 8 | 16GB | vmbr1:vlan20 | 64G | Worker | +| 204 | k8s-node4 | running | 8 | 16GB | vmbr1:vlan20 | 64G | Worker | +| 220 | docker-registry | running | 4 | 4GB | vmbr1:vlan20 | 64G | MAC DE:AD:BE:EF:22:22 (10.0.20.10) | +| 300 | Windows10 | running | 16 | 8GB | vmbr0 | 100G | Windows VM | +| 9000 | truenas | running | 16 | 16GB | vmbr1:vlan10 | 32G+7x256G+1T | NFS (10.0.10.15) | + +## VM Templates +| VMID | Name | Purpose | +|------|------|---------| +| 1000 | ubuntu-2404-cloudinit-non-k8s-template | Base for non-K8s VMs | +| 1001 | docker-registry-template | Docker registry VM | +| 2000 | ubuntu-2404-cloudinit-k8s-template | Base for K8s nodes | + +## GPU Node (k8s-node1) +- **VMID**: 201, **PCIe**: `0000:06:00.0` (NVIDIA Tesla T4) +- **Taint**: `nvidia.com/gpu=true:NoSchedule`, **Label**: `gpu=true` +- GPU workloads need: `node_selector = { "gpu": "true" }` + nvidia toleration +- Taint applied via `null_resource.gpu_node_taint` in `modules/kubernetes/nvidia/main.tf` diff --git a/.claude/reference/service-catalog.md b/.claude/reference/service-catalog.md new file mode 100644 index 00000000..f0c5ad48 --- /dev/null +++ b/.claude/reference/service-catalog.md @@ -0,0 +1,132 @@ +# Service Catalog + +> Auto-maintained reference. See `.claude/CLAUDE.md` for operational guidance. 
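+
+To cross-check this catalog against what is actually deployed, a minimal sketch (assumes namespaces roughly track service names, which holds for most entries below):
+
+```bash
+# List cluster namespaces and compare against the tables below
+kubectl --kubeconfig $(pwd)/config get ns --no-headers | awk '{print $1}' | sort
+```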
+ +## Critical - Network & Auth (Tier: core) +| Service | Description | Stack | +|---------|-------------|-------| +| wireguard | VPN server | platform | +| technitium | DNS server (10.0.20.101) | platform | +| headscale | Tailscale control server | platform | +| traefik | Ingress controller (Helm) | platform | +| xray | Proxy/tunnel | platform | +| authentik | Identity provider (SSO) | platform | +| cloudflared | Cloudflare tunnel | platform | +| authelia | Auth middleware | platform | +| monitoring | Prometheus/Grafana/Loki stack | platform | + +## Storage & Security (Tier: cluster) +| Service | Description | Stack | +|---------|-------------|-------| +| vaultwarden | Bitwarden-compatible password manager | platform | +| redis | Shared Redis at `redis.redis.svc.cluster.local` | platform | +| immich | Photo management (GPU) | immich | +| nvidia | GPU device plugin | platform | +| metrics-server | K8s metrics | platform | +| uptime-kuma | Status monitoring | platform | +| crowdsec | Security/WAF | platform | +| kyverno | Policy engine | platform | + +## Admin +| Service | Description | Stack | +|---------|-------------|-------| +| k8s-dashboard | Kubernetes dashboard | platform | +| reverse-proxy | Generic reverse proxy | platform | + +## Active Use +| Service | Description | Stack | +|---------|-------------|-------| +| mailserver | Email (docker-mailserver) | mailserver | +| shadowsocks | Proxy | shadowsocks | +| webhook_handler | Webhook processing | webhook_handler | +| tuya-bridge | Smart home bridge | tuya-bridge | +| dawarich | Location history | dawarich | +| owntracks | Location tracking | owntracks | +| nextcloud | File sync/share | nextcloud | +| calibre | E-book management | calibre | +| onlyoffice | Document editing | onlyoffice | +| f1-stream | F1 streaming | f1-stream | +| rybbit | Analytics | rybbit | +| isponsorblocktv | SponsorBlock for TV | isponsorblocktv | +| actualbudget | Budgeting (factory pattern) | actualbudget | + +## Optional +| Service | Description | Stack | +|---------|-------------|-------| +| blog | Personal blog | blog | +| descheduler | Pod descheduler | descheduler | +| drone | CI/CD | drone | +| hackmd | Collaborative markdown | hackmd | +| kms | Key management | kms | +| privatebin | Encrypted pastebin | privatebin | +| vault | HashiCorp Vault | vault | +| reloader | ConfigMap/Secret reloader | reloader | +| city-guesser | Game | city-guesser | +| echo | Echo server | echo | +| url | URL shortener | url | +| excalidraw | Whiteboard | excalidraw | +| travel_blog | Travel blog | travel_blog | +| dashy | Dashboard | dashy | +| send | Firefox Send | send | +| ytdlp | YouTube downloader | ytdlp | +| wealthfolio | Finance tracking | wealthfolio | +| audiobookshelf | Audiobook server | audiobookshelf | +| paperless-ngx | Document management | paperless-ngx | +| jsoncrack | JSON visualizer | jsoncrack | +| servarr | Media automation (Sonarr/Radarr/etc) | servarr | +| ntfy | Push notifications | ntfy | +| cyberchef | Data transformation | cyberchef | +| diun | Docker image update notifier | diun | +| meshcentral | Remote management | meshcentral | +| homepage | Dashboard/startpage | homepage | +| matrix | Matrix chat server | matrix | +| linkwarden | Bookmark manager | linkwarden | +| changedetection | Web change detection | changedetection | +| tandoor | Recipe manager | tandoor | +| n8n | Workflow automation | n8n | +| real-estate-crawler | Property crawler | real-estate-crawler | +| tor-proxy | Tor proxy | tor-proxy | +| forgejo | Git forge | forgejo | +| 
freshrss | RSS reader | freshrss | +| navidrome | Music streaming | navidrome | +| networking-toolbox | Network tools | networking-toolbox | +| stirling-pdf | PDF tools | stirling-pdf | +| speedtest | Speed testing | speedtest | +| freedify | Music streaming (factory pattern) | freedify | +| netbox | Network documentation | netbox | +| infra-maintenance | Maintenance jobs | infra-maintenance | +| ollama | LLM server (GPU) | ollama | +| frigate | NVR/camera (GPU) | frigate | +| ebook2audiobook | E-book to audio (GPU) | ebook2audiobook | +| affine | Visual canvas/whiteboard (PostgreSQL + Redis) | affine | +| health | Apple Health data dashboard (PostgreSQL) | health | +| whisper | Wyoming Faster Whisper STT (CPU on GPU node) | whisper | +| grampsweb | Genealogy web app (Gramps Web) | grampsweb | +| openclaw | AI agent gateway (OpenClaw) | openclaw | +| poison-fountain | Anti-AI scraping (tarpit + poison) | poison-fountain | + +## Cloudflare Domains + +### Proxied (CDN + WAF enabled) +``` +blog, hackmd, privatebin, url, echo, f1tv, excalidraw, send, +audiobookshelf, jsoncrack, ntfy, cyberchef, homepage, linkwarden, +changedetection, tandoor, n8n, stirling-pdf, dashy, city-guesser, +travel, netbox +``` + +### Non-Proxied (Direct DNS) +``` +mail, wg, headscale, immich, calibre, vaultwarden, drone, +mailserver-antispam, mailserver-admin, webhook, uptime, +owntracks, dawarich, tuya, meshcentral, nextcloud, actualbudget, +onlyoffice, forgejo, freshrss, navidrome, ollama, openwebui, +isponsorblocktv, speedtest, freedify, rybbit, paperless, +servarr, prowlarr, bazarr, radarr, sonarr, flaresolverr, +jellyfin, jellyseerr, tdarr, affine, health, family, openclaw +``` + +### Special Subdomains +- `*.viktor.actualbudget` - Actualbudget factory instances +- `*.freedify` - Freedify factory instances +- `mailserver.*` - Mail server components (antispam, admin) diff --git a/.claude/skills/containerd-multi-registry-pull-through-cache/SKILL.md b/.claude/skills/containerd-multi-registry-pull-through-cache/SKILL.md deleted file mode 100644 index 7b519b48..00000000 --- a/.claude/skills/containerd-multi-registry-pull-through-cache/SKILL.md +++ /dev/null @@ -1,138 +0,0 @@ ---- -name: containerd-multi-registry-pull-through-cache -description: | - Set up pull-through caches for multiple container registries (ghcr.io, quay.io, - registry.k8s.io, reg.kyverno.io) using Docker Registry v2 instances. Use when: - (1) ImagePullBackOff for non-Docker-Hub images routed through a wildcard mirror, - (2) containerd has deprecated `registry.mirrors."*"` catching all image pulls, - (3) need to add pull-through cache for a new upstream registry, - (4) `mirrors` cannot be set when `config_path` is provided error in containerd, - (5) containerd 1.6.x vs 1.7.x config_path compatibility issues. - Docker Registry v2 can only proxy ONE upstream per instance, so multiple - containers are needed for multiple registries. -author: Claude Code -version: 1.0.0 -date: 2026-02-14 ---- - -# Containerd Multi-Registry Pull-Through Cache - -## Problem - -Docker Registry v2 can only proxy **one upstream registry per instance**. A common -misconfiguration is using a containerd wildcard mirror (`registry.mirrors."*"`) pointing -to a single Docker Hub proxy, which breaks pulls from ghcr.io, quay.io, registry.k8s.io, -and other registries — they get routed to the Docker Hub proxy which can't serve them, -causing `ImagePullBackOff`. 
- -## Context / Trigger Conditions - -- `ImagePullBackOff` for images from ghcr.io, quay.io, registry.k8s.io, or other non-Docker-Hub registries -- Containerd config has deprecated `[plugins."io.containerd.grpc.v1.cri".registry.mirrors."*"]` -- Error: `failed to load plugin io.containerd.grpc.v1.cri: invalid plugin config: mirrors cannot be set when config_path is provided` -- Need to migrate from deprecated wildcard mirrors to modern `config_path` approach - -## Solution - -### 1. Run one Registry v2 container per upstream - -Each upstream needs its own Docker Registry v2 instance on a different port: - -| Port | Registry | Container Name | -|------|----------|---------------| -| 5000 | docker.io | registry | -| 5010 | ghcr.io | registry-ghcr | -| 5020 | quay.io | registry-quay | -| 5030 | registry.k8s.io | registry-k8s | -| 5040 | reg.kyverno.io | registry-kyverno | - -Config for non-Docker-Hub proxies (no auth needed — they're public): - -```yaml -version: 0.1 -storage: - cache: - blobdescriptor: inmemory - filesystem: - rootdirectory: /var/lib/registry -http: - addr: :5000 -proxy: - remoteurl: https://ghcr.io # change per registry -``` - -```bash -docker run -p 5010:5000 -d --restart always --name registry-ghcr \ - -v /etc/docker-registry/ghcr/config.yml:/etc/docker/registry/config.yml registry:2 -``` - -### 2. Replace deprecated wildcard mirror with `config_path` - -Instead of: -```toml -# DEPRECATED - breaks non-Docker-Hub registries -[plugins."io.containerd.grpc.v1.cri".registry.mirrors."*"] - endpoint = ["http://10.0.20.10:5000"] -``` - -Use the modern `config_path` approach: -```toml -[plugins."io.containerd.grpc.v1.cri".registry] - config_path = "/etc/containerd/certs.d" -``` - -Then create per-registry `hosts.toml` files: -```bash -mkdir -p /etc/containerd/certs.d/docker.io -cat > /etc/containerd/certs.d/docker.io/hosts.toml <<'EOF' -server = "https://registry-1.docker.io" - -[host."http://10.0.20.10:5000"] - capabilities = ["pull", "resolve"] -EOF -``` - -Registries without a `hosts.toml` entry **fall through to direct pull** (no breakage). - -### 3. Critical: `config_path` and `mirrors` cannot coexist - -Containerd will **refuse to start the CRI plugin** if both `config_path` and any -`mirrors` entries exist in `config.toml`. You must remove ALL `mirrors` entries -(including the `[plugins."...registry.mirrors"]` parent section) before setting -`config_path`. - -This is especially dangerous on containerd 1.6.x (used on older nodes like k8s-master) -where the config format is slightly different. If unsure, either: -- Don't use config_path on that node (skip the pull-through cache) -- Remove the entire `mirrors` section first, then add `config_path` - -### 4. Static IP for registry VM - -If the registry VM uses DHCP and gets the wrong IP, all mirrors break. Use static IP -via cloud-init `ipconfig0 = "ip=10.0.20.10/24,gw=10.0.20.1"` instead of DHCP. - -## Verification - -```bash -# Test each proxy responds -for port in 5000 5010 5020 5030 5040; do - curl -s http://10.0.20.10:$port/v2/_catalog -done - -# Test containerd can pull through cache -crictl pull ghcr.io/some/image:tag - -# Check containerd logs for mirror usage -journalctl -u containerd --since "5 minutes ago" | grep -i "mirror\|registry" -``` - -## Notes - -- **Fallback behavior**: If the local mirror is unreachable, containerd falls through to - direct pull from the upstream `server` URL. This provides graceful degradation. 
-- **GC crontabs**: Add weekly garbage collection for each registry container, staggered - to avoid I/O spikes. -- **Hourly restart**: Registry v2 has known memory leak issues; hourly restart mitigates. -- **Cache is ephemeral**: VM recreation clears the cache. Images re-cache on demand. - -See also: `k8s-docker-registry-cache-bypass` (for stale cached image issues) diff --git a/.claude/skills/helm-release-force-rerender/SKILL.md b/.claude/skills/helm-release-troubleshooting/SKILL.md similarity index 55% rename from .claude/skills/helm-release-force-rerender/SKILL.md rename to .claude/skills/helm-release-troubleshooting/SKILL.md index d0648c15..a402ca45 100644 --- a/.claude/skills/helm-release-force-rerender/SKILL.md +++ b/.claude/skills/helm-release-troubleshooting/SKILL.md @@ -1,27 +1,33 @@ --- -name: helm-release-force-rerender +name: helm-release-troubleshooting description: | - Fix for Helm releases managed by Terraform where changing Helm values doesn't update - the actual Kubernetes resources. Use when: (1) Terraform applies successfully but - K8s resources (Service, Deployment) don't reflect new Helm values, - (2) New ports/volumes/containers from Helm chart values don't appear in the deployed resources, + Troubleshoot and fix Helm release issues managed by Terraform. Use when: + (1) Terraform applies successfully but K8s resources don't reflect new Helm values, + (2) New ports/volumes/containers from Helm chart values don't appear in deployed resources, (3) helm upgrade --reuse-values doesn't re-render templates for structural changes, - (4) Terraform thinks Helm release is up-to-date but actual K8s resources are stale. - Solution involves removing from Terraform state, reimporting, and force upgrading. + (4) Terraform thinks Helm release is up-to-date but actual K8s resources are stale, + (5) terraform apply fails with "another operation (install/upgrade/rollback) is in progress", + (6) helm history shows status "pending-upgrade" or "pending-rollback", + (7) a Helm upgrade was interrupted by network timeout, etcd timeout, or VPN drop, + (8) helm upgrade fails with "an error occurred while finding last successful release". + Covers force re-rendering via state removal/reimport and stuck release recovery via + secret cleanup. author: Claude Code version: 1.0.0 -date: 2026-02-07 +date: 2026-02-22 --- -# Helm Release Force Re-render via Terraform +# Helm Release Troubleshooting -## Problem +## Force Re-render + +### Problem After changing Helm chart values in a Terraform `helm_release` resource, Terraform applies successfully but the actual Kubernetes resources (Services, Deployments, etc.) don't reflect the new values. For example, adding a new port in Helm values doesn't result in that port appearing in the Service spec. -## Context / Trigger Conditions +### Context / Trigger Conditions - Terraform `helm_release` applies with "1 changed" but `kubectl get svc -o yaml` shows the old configuration - Structural changes to Helm values (new ports, new containers, new volumes) are not @@ -30,7 +36,7 @@ appearing in the Service spec. - Common with Traefik, ingress-nginx, and other charts where template logic conditionally includes resources based on values -## Root Cause +### Root Cause Terraform's `helm_release` resource uses `helm upgrade` under the hood. When values are changed, Helm may use `--reuse-values` behavior where it merges new values into existing ones rather than doing a full template re-render. 
For structural changes (like enabling @@ -41,9 +47,9 @@ Additionally, Terraform may see the stored Helm release state as matching the de even though the actual Kubernetes resources don't reflect it, creating a state drift that Terraform doesn't detect. -## Solution +### Solution -### Step 1: Verify the Discrepancy +#### Step 1: Verify the Discrepancy Confirm that K8s resources don't match Helm values: ```bash @@ -55,7 +61,7 @@ helm get values -n helm get manifest -n | grep -A10 "" ``` -### Step 2: Remove Helm Release from Terraform State +#### Step 2: Remove Helm Release from Terraform State ```bash terraform state rm 'module.kubernetes_cluster.module..helm_release.' @@ -64,7 +70,7 @@ terraform state rm 'module.kubernetes_cluster.module..helm_release..helm_release.' '/' @@ -72,7 +78,7 @@ terraform import 'module.kubernetes_cluster.module..helm_release. For Helm releases, the import ID format is `namespace/release-name`. -### Step 4: Force Apply with Terraform +#### Step 4: Force Apply with Terraform After reimporting, run terraform apply. Terraform should now detect the drift between the desired Helm values and the actual release state: @@ -87,7 +93,7 @@ terraform taint 'module.kubernetes_cluster.module..helm_release.' terraform apply -target=module.kubernetes_cluster.module. ``` -### Step 5: Manual Helm Force Upgrade (Last Resort) +#### Step 5: Manual Helm Force Upgrade (Last Resort) If Terraform still doesn't fix it, use Helm directly as a one-time fix, then reimport: @@ -109,7 +115,7 @@ terraform apply -target=module.kubernetes_cluster.module. **WARNING**: Direct Helm operations bypass Terraform. Always reimport into Terraform state afterward, and use `terraform apply` to verify Terraform is back in sync. -## Verification +### Verification ```bash # Check the K8s resources now match expected configuration @@ -121,7 +127,7 @@ terraform plan -target=module.kubernetes_cluster.module. # Should show "No changes" or minimal expected drift ``` -## Example: Traefik HTTP/3 UDP Port Not Appearing +### Example: Traefik HTTP/3 UDP Port Not Appearing **Problem**: Added `http3.enabled=true` to Traefik Helm values. Terraform applied successfully, but the Traefik Service only had TCP port 443, missing the expected @@ -143,21 +149,102 @@ kubectl get svc traefik -n traefik -o yaml | grep -A3 "websecure-http3" # Should show: port: 443, protocol: UDP ``` -## Notes +### Notes - This issue is more common with structural Helm value changes (new ports, new sidecars, conditional template blocks) than with simple value changes (image tags, replica counts) - The `helm upgrade --force` flag deletes and recreates resources that have changed, which causes brief downtime. Use with caution on production ingress controllers. - Always verify with `terraform plan` after fixing to ensure Terraform state is consistent -- This is different from the `terraform-state-identity-mismatch` skill, which covers - provider-level identity errors. This skill covers Helm template rendering issues where - the state looks correct but the actual resources don't match. + +--- + +## Stuck Release Recovery + +### Problem +Helm releases can get stuck in `pending-upgrade`, `pending-rollback`, or `pending-install` +states when an upgrade is interrupted (network drop, etcd timeout, resource exhaustion). +Subsequent upgrades or terraform applies fail because Helm thinks an operation is in progress. 
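+
+Before deleting anything, it can help to enumerate the stored revision secrets for the release. A minimal sketch (Helm 3 labels its storage secrets with `owner=helm`, `name=<release>` and a `status` label; `<release>` and `<namespace>` are placeholders):
+
+```bash
+# List every stored revision for a release, oldest first, with its status label
+kubectl --kubeconfig $(pwd)/config get secrets -n <namespace> \
+  -l owner=helm,name=<release> \
+  -L status --sort-by=.metadata.creationTimestamp
+```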
+ +### Context / Trigger Conditions +- `terraform apply` fails with: `another operation (install/upgrade/rollback) is in progress` +- `helm history -n ` shows `pending-upgrade`, `pending-rollback`, or `pending-install` +- A previous Helm upgrade was interrupted by network timeout, VPN drop, or etcd timeout +- `helm upgrade` fails with: `an error occurred while finding last successful release` + +### Solution + +#### Step 1: Identify the stuck release +```bash +helm --kubeconfig $(pwd)/config history -n | tail -5 +``` + +Look for revisions with status `pending-upgrade`, `pending-rollback`, or `pending-install`. + +#### Step 2: Delete the stuck Helm release secrets +Each Helm revision is stored as a Kubernetes secret named `sh.helm.release.v1..v`. +Delete all stuck revisions: + +```bash +# Delete specific stuck revision (e.g., revision 5) +kubectl --kubeconfig $(pwd)/config delete secret sh.helm.release.v1..v5 -n + +# If multiple stuck revisions exist, delete all of them +kubectl --kubeconfig $(pwd)/config delete secret sh.helm.release.v1..v6 -n +``` + +#### Step 3: Verify the release is clean +```bash +helm --kubeconfig $(pwd)/config history -n | tail -3 +``` + +The latest revision should now show `deployed` status. + +#### Step 4: Retry the upgrade +```bash +terraform apply -target=module.kubernetes_cluster.module. -var="kube_config_path=$(pwd)/config" -auto-approve +``` + +### Important Notes + +- **Never patch the secret labels** (e.g., changing `status: pending-rollback` to `status: failed`). + This changes the label but not the encoded release data inside the secret, leaving Helm in an + inconsistent state. Always delete the stuck secrets entirely. +- If the failed upgrade partially applied changes to the cluster (e.g., modified a Deployment), + the next successful upgrade will reconcile the state. +- When VPN/network is unstable, prefer direct `helm upgrade --reuse-values --set key=value` + over `terraform apply`, since Helm upgrades are faster than the full Terraform refresh cycle. + +### Verification +After deleting stuck secrets and re-applying: +- `helm history` shows the new revision as `deployed` +- `terraform apply` completes without errors + +### Example +```bash +# Helm history shows stuck state +$ helm history nextcloud -n nextcloud | tail -3 +4 deployed nextcloud-8.8.1 Upgrade complete +5 failed nextcloud-8.8.1 Upgrade failed: etcd timeout +6 pending-rollback nextcloud-8.8.1 Rollback to 4 + +# Fix: delete stuck revisions +$ kubectl delete secret sh.helm.release.v1.nextcloud.v5 sh.helm.release.v1.nextcloud.v6 -n nextcloud + +# Verify clean state +$ helm history nextcloud -n nextcloud | tail -1 +4 deployed nextcloud-8.8.1 Upgrade complete + +# Re-apply +$ terraform apply -target=module.kubernetes_cluster.module.nextcloud -auto-approve +``` + +--- ## See Also - `terraform-state-identity-mismatch` - For Terraform provider identity errors -- `traefik-http3-quic` - For enabling HTTP/3 on Traefik (common trigger for this issue) +- `traefik-http3-quic` - For enabling HTTP/3 on Traefik (common trigger for force re-render) ## References diff --git a/.claude/skills/helm-stuck-release-recovery/SKILL.md b/.claude/skills/helm-stuck-release-recovery/SKILL.md deleted file mode 100644 index ac9f999d..00000000 --- a/.claude/skills/helm-stuck-release-recovery/SKILL.md +++ /dev/null @@ -1,93 +0,0 @@ ---- -name: helm-stuck-release-recovery -description: | - Fix Helm releases stuck in pending-upgrade, pending-rollback, or pending-install states. 
- Use when: (1) terraform apply fails with "another operation (install/upgrade/rollback) is - in progress", (2) helm history shows status "pending-upgrade" or "pending-rollback", - (3) a Helm upgrade was interrupted by network timeout, etcd timeout, or VPN drop, - (4) helm upgrade fails with "an error occurred while finding last successful release". - Covers manual secret cleanup to restore Helm release to a deployable state. -author: Claude Code -version: 1.0.0 -date: 2026-02-15 ---- - -# Helm Stuck Release Recovery - -## Problem -Helm releases can get stuck in `pending-upgrade`, `pending-rollback`, or `pending-install` -states when an upgrade is interrupted (network drop, etcd timeout, resource exhaustion). -Subsequent upgrades or terraform applies fail because Helm thinks an operation is in progress. - -## Context / Trigger Conditions -- `terraform apply` fails with: `another operation (install/upgrade/rollback) is in progress` -- `helm history -n ` shows `pending-upgrade`, `pending-rollback`, or `pending-install` -- A previous Helm upgrade was interrupted by network timeout, VPN drop, or etcd timeout -- `helm upgrade` fails with: `an error occurred while finding last successful release` - -## Solution - -### Step 1: Identify the stuck release -```bash -helm --kubeconfig $(pwd)/config history -n | tail -5 -``` - -Look for revisions with status `pending-upgrade`, `pending-rollback`, or `pending-install`. - -### Step 2: Delete the stuck Helm release secrets -Each Helm revision is stored as a Kubernetes secret named `sh.helm.release.v1..v`. -Delete all stuck revisions: - -```bash -# Delete specific stuck revision (e.g., revision 5) -kubectl --kubeconfig $(pwd)/config delete secret sh.helm.release.v1..v5 -n - -# If multiple stuck revisions exist, delete all of them -kubectl --kubeconfig $(pwd)/config delete secret sh.helm.release.v1..v6 -n -``` - -### Step 3: Verify the release is clean -```bash -helm --kubeconfig $(pwd)/config history -n | tail -3 -``` - -The latest revision should now show `deployed` status. - -### Step 4: Retry the upgrade -```bash -terraform apply -target=module.kubernetes_cluster.module. -var="kube_config_path=$(pwd)/config" -auto-approve -``` - -## Important Notes - -- **Never patch the secret labels** (e.g., changing `status: pending-rollback` to `status: failed`). - This changes the label but not the encoded release data inside the secret, leaving Helm in an - inconsistent state. Always delete the stuck secrets entirely. -- If the failed upgrade partially applied changes to the cluster (e.g., modified a Deployment), - the next successful upgrade will reconcile the state. -- When VPN/network is unstable, prefer direct `helm upgrade --reuse-values --set key=value` - over `terraform apply`, since Helm upgrades are faster than the full Terraform refresh cycle. 
- -## Verification -After deleting stuck secrets and re-applying: -- `helm history` shows the new revision as `deployed` -- `terraform apply` completes without errors - -## Example -```bash -# Helm history shows stuck state -$ helm history nextcloud -n nextcloud | tail -3 -4 deployed nextcloud-8.8.1 Upgrade complete -5 failed nextcloud-8.8.1 Upgrade failed: etcd timeout -6 pending-rollback nextcloud-8.8.1 Rollback to 4 - -# Fix: delete stuck revisions -$ kubectl delete secret sh.helm.release.v1.nextcloud.v5 sh.helm.release.v1.nextcloud.v6 -n nextcloud - -# Verify clean state -$ helm history nextcloud -n nextcloud | tail -1 -4 deployed nextcloud-8.8.1 Upgrade complete - -# Re-apply -$ terraform apply -target=module.kubernetes_cluster.module.nextcloud -auto-approve -``` diff --git a/.claude/skills/k8s-container-image-caching/SKILL.md b/.claude/skills/k8s-container-image-caching/SKILL.md new file mode 100644 index 00000000..76304dc7 --- /dev/null +++ b/.claude/skills/k8s-container-image-caching/SKILL.md @@ -0,0 +1,244 @@ +--- +name: k8s-container-image-caching +description: | + Set up and troubleshoot container image pull-through caches in Kubernetes. Use when: + (1) ImagePullBackOff for non-Docker-Hub images routed through a wildcard mirror, + (2) containerd has deprecated `registry.mirrors."*"` catching all image pulls, + (3) need to add pull-through cache for a new upstream registry, + (4) `mirrors` cannot be set when `config_path` is provided error in containerd, + (5) containerd 1.6.x vs 1.7.x config_path compatibility issues, + (6) kubectl shows correct image tag but container runs old code, + (7) local registry mirror caches stale images, + (8) imagePullPolicy: Always doesn't force fresh pulls, + (9) containerd config has mirror that intercepts pulls serving stale images. + Covers multi-registry pull-through cache setup (Docker Registry v2) and cache bypass + via image digest pinning. +author: Claude Code +version: 1.0.0 +date: 2026-02-22 +--- + +# Kubernetes Container Image Caching + +## Pull-Through Cache Setup + +### Problem + +Docker Registry v2 can only proxy **one upstream registry per instance**. A common +misconfiguration is using a containerd wildcard mirror (`registry.mirrors."*"`) pointing +to a single Docker Hub proxy, which breaks pulls from ghcr.io, quay.io, registry.k8s.io, +and other registries -- they get routed to the Docker Hub proxy which can't serve them, +causing `ImagePullBackOff`. + +### Context / Trigger Conditions + +- `ImagePullBackOff` for images from ghcr.io, quay.io, registry.k8s.io, or other non-Docker-Hub registries +- Containerd config has deprecated `[plugins."io.containerd.grpc.v1.cri".registry.mirrors."*"]` +- Error: `failed to load plugin io.containerd.grpc.v1.cri: invalid plugin config: mirrors cannot be set when config_path is provided` +- Need to migrate from deprecated wildcard mirrors to modern `config_path` approach + +### Solution + +#### 1. 
Run one Registry v2 container per upstream + +Each upstream needs its own Docker Registry v2 instance on a different port: + +| Port | Registry | Container Name | +|------|----------|---------------| +| 5000 | docker.io | registry | +| 5010 | ghcr.io | registry-ghcr | +| 5020 | quay.io | registry-quay | +| 5030 | registry.k8s.io | registry-k8s | +| 5040 | reg.kyverno.io | registry-kyverno | + +Config for non-Docker-Hub proxies (no auth needed -- they're public): + +```yaml +version: 0.1 +storage: + cache: + blobdescriptor: inmemory + filesystem: + rootdirectory: /var/lib/registry +http: + addr: :5000 +proxy: + remoteurl: https://ghcr.io # change per registry +``` + +```bash +docker run -p 5010:5000 -d --restart always --name registry-ghcr \ + -v /etc/docker-registry/ghcr/config.yml:/etc/docker/registry/config.yml registry:2 +``` + +#### 2. Replace deprecated wildcard mirror with `config_path` + +Instead of: +```toml +# DEPRECATED - breaks non-Docker-Hub registries +[plugins."io.containerd.grpc.v1.cri".registry.mirrors."*"] + endpoint = ["http://10.0.20.10:5000"] +``` + +Use the modern `config_path` approach: +```toml +[plugins."io.containerd.grpc.v1.cri".registry] + config_path = "/etc/containerd/certs.d" +``` + +Then create per-registry `hosts.toml` files: +```bash +mkdir -p /etc/containerd/certs.d/docker.io +cat > /etc/containerd/certs.d/docker.io/hosts.toml <<'EOF' +server = "https://registry-1.docker.io" + +[host."http://10.0.20.10:5000"] + capabilities = ["pull", "resolve"] +EOF +``` + +Registries without a `hosts.toml` entry **fall through to direct pull** (no breakage). + +#### 3. Critical: `config_path` and `mirrors` cannot coexist + +Containerd will **refuse to start the CRI plugin** if both `config_path` and any +`mirrors` entries exist in `config.toml`. You must remove ALL `mirrors` entries +(including the `[plugins."...registry.mirrors"]` parent section) before setting +`config_path`. + +This is especially dangerous on containerd 1.6.x (used on older nodes like k8s-master) +where the config format is slightly different. If unsure, either: +- Don't use config_path on that node (skip the pull-through cache) +- Remove the entire `mirrors` section first, then add `config_path` + +#### 4. Static IP for registry VM + +If the registry VM uses DHCP and gets the wrong IP, all mirrors break. Use static IP +via cloud-init `ipconfig0 = "ip=10.0.20.10/24,gw=10.0.20.1"` instead of DHCP. + +### Verification + +```bash +# Test each proxy responds +for port in 5000 5010 5020 5030 5040; do + curl -s http://10.0.20.10:$port/v2/_catalog +done + +# Test containerd can pull through cache +crictl pull ghcr.io/some/image:tag + +# Check containerd logs for mirror usage +journalctl -u containerd --since "5 minutes ago" | grep -i "mirror\|registry" +``` + +### Notes + +- **Fallback behavior**: If the local mirror is unreachable, containerd falls through to + direct pull from the upstream `server` URL. This provides graceful degradation. +- **GC crontabs**: Add weekly garbage collection for each registry container, staggered + to avoid I/O spikes. +- **Hourly restart**: Registry v2 has known memory leak issues; hourly restart mitigates. +- **Cache is ephemeral**: VM recreation clears the cache. Images re-cache on demand. + +--- + +## Cache Bypass / Stale Image Fix + +### Problem +Kubernetes pods continue running old Docker images even after pushing new versions with +the same tag (e.g., `:latest`). 
This happens when a local registry mirror caches images +and serves stale versions, ignoring `imagePullPolicy: Always`. + +### Context / Trigger Conditions +- Pod is running but application code is outdated +- `docker push` succeeded with new layers +- `kubectl describe pod` shows correct image tag +- Cluster has a local registry mirror configured (e.g., in containerd config) +- `imagePullPolicy: Always` doesn't fix the issue +- Nodes configured with registry mirrors at `/etc/containerd/certs.d/` or similar + +### Solution + +#### 1. Get the image digest after pushing +```bash +docker push viktorbarzin/myimage:latest +# Output includes: latest: digest: sha256:abc123... size: 856 +``` + +#### 2. Use digest instead of tag in deployment +```hcl +# Terraform +container { + # Use digest to bypass local registry cache + image = "docker.io/viktorbarzin/myimage@sha256:abc123..." + image_pull_policy = "Always" + name = "myimage" +} +``` + +```yaml +# Kubernetes YAML +containers: + - name: myimage + image: docker.io/viktorbarzin/myimage@sha256:abc123... + imagePullPolicy: Always +``` + +#### 3. Apply and restart +```bash +terraform apply -target=module.kubernetes_cluster.module.myservice +kubectl rollout restart deployment/myservice -n mynamespace +``` + +### Why This Works +- Registry mirrors match by tag, not digest +- When you specify a digest, the node must fetch that exact manifest +- The mirror may not have the digest cached, forcing a pull from upstream +- Even if cached, the digest guarantees the exact image version + +### Verification +```bash +# Check the pod is using the new image +kubectl get pod -n mynamespace -o jsonpath='{.items[*].spec.containers[*].image}' + +# Verify application behavior reflects new code +kubectl exec -n mynamespace deploy/myservice -- +``` + +### Example + +Before (problematic): +```hcl +image = "docker.io/viktorbarzin/audiblez-web:latest" +``` + +After (fixed): +```hcl +image = "docker.io/viktorbarzin/audiblez-web@sha256:4d0e2c839555e2229bc91a0b1273569bac88529e8b3c3cadad3c3cf9d865fa29" +``` + +### Notes +- You must update the digest each time you push a new image +- Consider automating digest extraction in CI/CD pipelines +- This is a workaround; ideally fix the registry mirror configuration +- To find your registry mirror config: `cat /etc/containerd/config.toml` on nodes +- Common mirror locations: `/etc/containerd/certs.d/docker.io/hosts.toml` + +### Diagnosing Registry Mirror Issues +```bash +# On a k8s node, check containerd config +cat /etc/containerd/config.toml | grep -A5 mirrors + +# Check if mirror is intercepting +crictl pull docker.io/library/alpine:latest --debug 2>&1 | grep -i mirror + +# List cached images on node +crictl images | grep myimage +``` + +--- + +## References + +- [Kubernetes imagePullPolicy documentation](https://kubernetes.io/docs/concepts/containers/images/#image-pull-policy) +- [containerd registry configuration](https://github.com/containerd/containerd/blob/main/docs/hosts.md) diff --git a/.claude/skills/k8s-docker-registry-cache-bypass/SKILL.md b/.claude/skills/k8s-docker-registry-cache-bypass/SKILL.md deleted file mode 100644 index 85a761f4..00000000 --- a/.claude/skills/k8s-docker-registry-cache-bypass/SKILL.md +++ /dev/null @@ -1,110 +0,0 @@ ---- -name: k8s-docker-registry-cache-bypass -description: | - Fix for Kubernetes pods running old Docker images despite pushing new versions. 
- Use when: (1) kubectl shows correct image tag but container runs old code, - (2) Local registry mirror caches stale images, (3) imagePullPolicy: Always - doesn't force fresh pulls, (4) containerd config has mirror that intercepts pulls. - Solution: Use image digest instead of tag to bypass cache entirely. -author: Claude Code -version: 1.0.0 -date: 2025-01-31 ---- - -# Kubernetes Docker Registry Cache Bypass - -## Problem -Kubernetes pods continue running old Docker images even after pushing new versions with -the same tag (e.g., `:latest`). This happens when a local registry mirror caches images -and serves stale versions, ignoring `imagePullPolicy: Always`. - -## Context / Trigger Conditions -- Pod is running but application code is outdated -- `docker push` succeeded with new layers -- `kubectl describe pod` shows correct image tag -- Cluster has a local registry mirror configured (e.g., in containerd config) -- `imagePullPolicy: Always` doesn't fix the issue -- Nodes configured with registry mirrors at `/etc/containerd/certs.d/` or similar - -## Solution - -### 1. Get the image digest after pushing -```bash -docker push viktorbarzin/myimage:latest -# Output includes: latest: digest: sha256:abc123... size: 856 -``` - -### 2. Use digest instead of tag in deployment -```hcl -# Terraform -container { - # Use digest to bypass local registry cache - image = "docker.io/viktorbarzin/myimage@sha256:abc123..." - image_pull_policy = "Always" - name = "myimage" -} -``` - -```yaml -# Kubernetes YAML -containers: - - name: myimage - image: docker.io/viktorbarzin/myimage@sha256:abc123... - imagePullPolicy: Always -``` - -### 3. Apply and restart -```bash -terraform apply -target=module.kubernetes_cluster.module.myservice -kubectl rollout restart deployment/myservice -n mynamespace -``` - -## Why This Works -- Registry mirrors match by tag, not digest -- When you specify a digest, the node must fetch that exact manifest -- The mirror may not have the digest cached, forcing a pull from upstream -- Even if cached, the digest guarantees the exact image version - -## Verification -```bash -# Check the pod is using the new image -kubectl get pod -n mynamespace -o jsonpath='{.items[*].spec.containers[*].image}' - -# Verify application behavior reflects new code -kubectl exec -n mynamespace deploy/myservice -- -``` - -## Example - -Before (problematic): -```hcl -image = "docker.io/viktorbarzin/audiblez-web:latest" -``` - -After (fixed): -```hcl -image = "docker.io/viktorbarzin/audiblez-web@sha256:4d0e2c839555e2229bc91a0b1273569bac88529e8b3c3cadad3c3cf9d865fa29" -``` - -## Notes -- You must update the digest each time you push a new image -- Consider automating digest extraction in CI/CD pipelines -- This is a workaround; ideally fix the registry mirror configuration -- To find your registry mirror config: `cat /etc/containerd/config.toml` on nodes -- Common mirror locations: `/etc/containerd/certs.d/docker.io/hosts.toml` - -## Diagnosing Registry Mirror Issues -```bash -# On a k8s node, check containerd config -cat /etc/containerd/config.toml | grep -A5 mirrors - -# Check if mirror is intercepting -crictl pull docker.io/library/alpine:latest --debug 2>&1 | grep -i mirror - -# List cached images on node -crictl images | grep myimage -``` - -## References -- [Kubernetes imagePullPolicy documentation](https://kubernetes.io/docs/concepts/containers/images/#image-pull-policy) -- [containerd registry configuration](https://github.com/containerd/containerd/blob/main/docs/hosts.md)