[ci skip] Refactor knowledge: CLAUDE.md 881→190 lines, extract reference data

CLAUDE.md changes:
- Extract service catalog + Cloudflare domains → .claude/reference/service-catalog.md
- Extract Proxmox VMs, hardware, network → .claude/reference/proxmox-inventory.md
- Extract GitHub/Drone API patterns → .claude/reference/github-drone-api.md
- Extract Authentik state snapshot → .claude/reference/authentik-state.md
- Remove Init Container pattern (duplicates setup-project skill)
- Remove Poison Fountain service notes (duplicates Anti-AI section)
- Consolidate Authentik section (link to skills + reference)
- Remove resource limit tables (kept tier definitions inline)

Skill merges (37→32):
- helm-release-force-rerender + helm-stuck-release-recovery → helm-release-troubleshooting
- containerd-multi-registry-pull-through-cache + k8s-docker-registry-cache-bypass → k8s-container-image-caching
- (traefik merges in previous commits)
Parent: d3d0b4281c
Commit: abe89c926e
10 changed files with 749 additions and 1166 deletions

@@ -6,100 +6,40 @@
- **When making infrastructure changes**: Always update this file to reflect the current state (new services, removed services, version changes, config changes)
- **After every significant change**: Proactively update this file (`.claude/CLAUDE.md`) to reflect what changed — new services, config changes, version bumps, new patterns, etc. This ensures knowledge persists across sessions automatically.
- **After updating any `.claude/` files**: Always commit them immediately (`git add .claude/ && git commit -m "[ci skip] update claude knowledge"`) to avoid building up unstaged changes.
- **Skills available**: Check `.claude/skills/` directory for specialized workflows (e.g., `setup-project.md` for deploying new services)
- **CRITICAL: All infrastructure changes must go through Terraform/Terragrunt**. NEVER modify cluster resources directly (e.g., via kubectl apply/edit/patch, helm install, docker run). Always make changes in the Terraform `.tf` files and apply with `terragrunt apply`. The real cluster state must never deviate from what's defined in Terraform — if a manual change is unavoidable (e.g., containerd config on running nodes), document it and ensure the Terraform templates match so future provisioning is consistent. Use `kubectl` only for read-only operations (get, describe, logs) and ephemeral debugging (run --rm, delete stuck pods), never for persistent state changes.
- **CRITICAL: NEVER put sensitive data (API keys, passwords, tokens, credentials) into committed files** unless they are encrypted (e.g., via git-crypt). Secrets belong in `terraform.tfvars` (which is git-crypt encrypted) or in the `secrets/` directory. Never hardcode credentials in `.tf` files, scripts, `.claude/` files, or any other unencrypted committed file. Always pass secrets through the Terraform variable chain (`terraform.tfvars` → `main.tf` → module variables).
- **CRITICAL: NEVER commit secrets** — triple-check before every commit that no API keys, passwords, tokens, or credentials are included in unencrypted files. This is a hard rule with zero exceptions.
- **New services MUST have CI/CD**: Set up a Drone CI pipeline (`.drone.yml`) with GitHub/GitLab repo integration. Services should auto-build and auto-deploy.
- **New services MUST have monitoring**: Every new service should have monitoring via Prometheus (alerts/metrics) and/or Uptime Kuma (HTTP health checks). Add both when possible.
- **Skills available**: Check `.claude/skills/` directory for specialized workflows (e.g., `setup-project` for deploying new services)
- **Reference data**: Check `.claude/reference/` for inventory tables, API patterns, and current state snapshots
- **CRITICAL: All infrastructure changes must go through Terraform/Terragrunt**. NEVER modify cluster resources directly (kubectl apply/edit/patch, helm install, docker run). Use `kubectl` only for read-only operations and ephemeral debugging.
- **CRITICAL: NEVER put sensitive data** (API keys, passwords, tokens, credentials) into committed files unless encrypted via git-crypt. Secrets belong in `terraform.tfvars` or the `secrets/` directory.
- **CRITICAL: NEVER commit secrets** — triple-check before every commit. Zero exceptions.
- **New services MUST have CI/CD** (Drone CI pipeline) and **monitoring** (Prometheus alerts and/or Uptime Kuma).

## Execution Environment

- **File operations**: Read, Edit, Write, Glob, Grep tools
- **Git commands**: git status, git log, git diff, git add, git commit, git reset, etc.
- **Shell commands**: All tools (terraform, terragrunt, kubectl, helm, python, etc.) are available locally
- **CRITICAL: Always run terragrunt/terraform locally**, never on the remote server via SSH:

```bash
cd stacks/<service> && terragrunt apply --non-interactive
```

- **kubectl**: Use `kubectl --kubeconfig $(pwd)/config` for cluster access
- **GitHub API**: Use `curl` with token from tfvars (see GitHub & Drone CI section below). `gh` CLI is blocked by sandbox restrictions.
- **Drone CI API**: Use `curl` with token from tfvars (see GitHub & Drone CI section below).
- **Terraform/Terragrunt**: Always run locally: `cd stacks/<service> && terragrunt apply --non-interactive`
- **kubectl**: `kubectl --kubeconfig $(pwd)/config`
- **GitHub/Drone API**: Use `curl` with tokens from tfvars (see `.claude/reference/github-drone-api.md`). `gh` CLI is blocked by sandbox.

---

## Overview

Terragrunt-based infrastructure repository managing a home Kubernetes cluster on Proxmox VMs, with per-service state isolation. Each service has its own Terragrunt stack under `stacks/`, enabling fast, independent plan/apply cycles. Uses git-crypt for secrets encryption.

Terragrunt-based infrastructure repository managing a home Kubernetes cluster on Proxmox VMs, with per-service state isolation. Each service has its own Terragrunt stack under `stacks/`. Uses git-crypt for secrets encryption.

## Static File Paths (NEVER CHANGE)

- **Main config**: `terraform.tfvars` - All secrets, DNS, Cloudflare config, WireGuard peers
- **Root Terragrunt**: `terragrunt.hcl` - Root Terragrunt config (providers, backend, var loading)
- **Service stacks**: `stacks/<service>/` - Individual service stacks (each has `terragrunt.hcl` + `main.tf` with resources inline)
- **Infra stack**: `stacks/infra/` - Proxmox VM resources (templates, docker-registry, VMs)
- **Platform stack**: `stacks/platform/` - Core infrastructure services (22 modules in `modules/` subdir)
- **Per-stack state**: `state/stacks/<service>/terraform.tfstate` - Per-stack state files (gitignored)
- **Service resources**: `stacks/<service>/main.tf` - Service resources defined directly in stack root
- **Platform modules**: `stacks/platform/modules/<service>/` - Platform service modules
- **Shared modules**: `modules/kubernetes/ingress_factory/`, `modules/kubernetes/setup_tls_secret/`
- **Secrets**: `secrets/` - git-crypt encrypted TLS certs and keys

## Network Topology (Static IPs)

```
┌─────────────────────────────────────────────────────────────────┐
│ 10.0.10.0/24 - Management Network                               │
├─────────────────────────────────────────────────────────────────┤
│ 10.0.10.10  - Wizard (main server)                              │
│ 10.0.10.15  - NFS Server (TrueNAS) - /mnt/main/*                │
└─────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────┐
│ 10.0.20.0/24 - Kubernetes Network                               │
├─────────────────────────────────────────────────────────────────┤
│ 10.0.20.1   - pfSense Gateway                                   │
│ 10.0.20.10  - Docker Registry VM (MAC: DE:AD:BE:EF:22:22)       │
│ 10.0.20.100 - k8s-master                                        │
│ 10.0.20.101 - Technitium DNS                                    │
│ 10.0.20.102 - MetalLB IP Pool Start                             │
│ 10.0.20.200 - MetalLB IP Pool End                               │
└─────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────┐
│ 192.168.1.0/24 - Physical Network                               │
├─────────────────────────────────────────────────────────────────┤
│ 192.168.1.127 - Proxmox Hypervisor                              │
└─────────────────────────────────────────────────────────────────┘
```

## Key File Paths

- `terraform.tfvars` — All secrets, DNS, Cloudflare config, WireGuard peers (git-crypt encrypted)
- `terragrunt.hcl` — Root config (providers, backend, variable loading)
- `stacks/<service>/` — Individual service stacks (`terragrunt.hcl` + `main.tf`)
- `stacks/platform/` — Core infrastructure (~22 services in `modules/` subdir)
- `stacks/infra/` — Proxmox VM resources
- `modules/kubernetes/ingress_factory/`, `setup_tls_secret/` — Shared utility modules
- `secrets/` — git-crypt encrypted TLS certs and keys

## Domains

- **Public**: `viktorbarzin.me` (Cloudflare-managed)
- **Internal**: `viktorbarzin.lan` (Technitium DNS)

## Directory Structure

- `terragrunt.hcl` - Root Terragrunt configuration (providers, backend, variable loading)
- `stacks/` - Individual Terragrunt stacks (one per service)
- `stacks/infra/` - Proxmox VM resources (templates, docker-registry)
- `stacks/platform/` - Core infrastructure (22 services in `stacks/platform/modules/`)
- `stacks/<service>/` - Individual service stacks (resources directly in `main.tf`)
- `stacks/platform/modules/<service>/` - Platform service module source code
- `modules/kubernetes/` - **Only shared utility modules**: `ingress_factory/`, `setup_tls_secret/`
- `modules/create-vm/` - Proxmox VM creation module
- `state/` - Per-stack Terraform state files (gitignored)
- `secrets/` - Encrypted secrets (TLS certs, keys) via git-crypt
- `cli/` - Go CLI tool for infrastructure management
- `scripts/` - Helper scripts (cluster management, node updates)
- `playbooks/` - Ansible playbooks for node configuration
- `diagram/` - Infrastructure diagrams (Python-based)

## Key Patterns

- Each service in `modules/kubernetes/<service>/main.tf` defines its own namespace, deployments, services, and ingress
- NFS storage from `10.0.10.15` for persistent data
- TLS secrets managed via the `setup_tls_secret` module
- Ingress uses Traefik (Helm chart, 3 replicas) with HTTP/3 (QUIC) enabled, Middleware CRDs for rate limiting, auth, CSP headers, CrowdSec bouncer, and analytics injection
- HTTP/3 enabled on Traefik (`http3.enabled=true`, `advertisedPort=443` on the websecure entrypoint) and Cloudflare (`cloudflare_zone_settings_override` with `http3="on"`)
- GPU workloads use `node_selector = { "gpu": "true" }`
- Services are exposed at `*.viktorbarzin.me` domains

### NFS Volume Pattern

**Prefer inline NFS volumes** over separate PV/PVC resources. Use the `nfs {}` block directly in pod/deployment/cronjob specs:

**Prefer inline NFS volumes** over separate PV/PVC resources:

```hcl
volume {
  name = "data"

@@ -109,773 +49,142 @@ volume {
  }
}
```

Only use PV/PVC when the Helm chart requires `existingClaim` (like the Nextcloud Helm chart).

Only use PV/PVC when a Helm chart requires `existingClaim`.
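
The hunk boundary above cuts the example short. A minimal complete sketch of the inline pattern, assuming the NFS server from the topology above and a hypothetical export path:

```hcl
volume {
  name = "data"

  nfs {
    # NFS server (TrueNAS) from the network topology above
    server = "10.0.10.15"
    # hypothetical export path for an example service
    path = "/mnt/main/example-service"
  }
}
```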

### Adding NFS Exports

To add a new NFS exported directory:

1. Edit `secrets/nfs_directories.txt` - add the new directory path, keep the list sorted
2. Run `secrets/nfs_exports.sh` from the `secrets/` directory to update the NFS share via the TrueNAS API

1. Edit `secrets/nfs_directories.txt` — add path, keep sorted
2. Run `secrets/nfs_exports.sh` from `secrets/` to update TrueNAS

### Factory Pattern (for multi-user services)

Used when a service needs one instance per user. Structure:

```
stacks/<service>/
├── main.tf          # Namespace, TLS secret, user module calls
└── factory/
    └── main.tf      # Deployment, service, ingress templates with ${var.name}
```

Examples: `actualbudget`, `freedify`

### Factory Pattern (multi-user services)

Structure: `stacks/<service>/main.tf` + `factory/main.tf`. Examples: `actualbudget`, `freedify`.

To add a user: export the NFS share, add the Cloudflare route in tfvars, add a module block calling the factory (see the sketch after this list).

To add a new user:

1. Export NFS share at `/mnt/main/<service>/<username>` in TrueNAS
2. Add Cloudflare route in tfvars
3. Add module block in main.tf calling factory
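
A sketch of the per-user module block from step 3, assuming a hypothetical user `alice` and a factory interface matching the structure above:

```hcl
module "alice" {
  source = "./factory"

  # hypothetical user + variables; match what factory/main.tf declares
  name            = "alice"
  tls_secret_name = var.tls_secret_name
}
```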

### Init Container Pattern (for database migrations)

Use when a service needs to run database migrations before starting:

```hcl
init_container {
  name    = "migration"
  image   = "service-image:tag"
  command = ["sh", "-c", "migration-command"]

  dynamic "env" {
    for_each = local.common_env
    content {
      name  = env.value.name
      value = env.value.value
    }
  }
}
```

Example: AFFiNE runs `node ./scripts/self-host-predeploy.js` in an init container.

### SMTP/Email Configuration

When configuring services to use the mailserver:

- **Use public hostname**: `mail.viktorbarzin.me` (for TLS cert validation)
- **Do NOT use**: `mailserver.mailserver.svc.cluster.local` (TLS cert mismatch)
- **Port**: 587 (STARTTLS)
- **Credentials**: Use existing accounts from `mailserver_accounts` in tfvars
- **Common email**: `info@viktorbarzin.me` for service notifications

### SMTP/Email

- **Use**: `mail.viktorbarzin.me` port 587 (STARTTLS). **NOT** `mailserver.mailserver.svc.cluster.local` (TLS cert mismatch). See the sketch after this list.
- **Credentials**: `mailserver_accounts` in tfvars. Common: `info@viktorbarzin.me`
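
For a concrete shape, a hedged sketch of wiring a container to the mailserver in Terraform; the env var names are hypothetical, since each service defines its own:

```hcl
env {
  # hypothetical variable name; use whatever the service expects
  name  = "SMTP_HOST"
  value = "mail.viktorbarzin.me" # public hostname so the TLS cert validates
}
env {
  name  = "SMTP_PORT"
  value = "587" # STARTTLS
}
```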

### Anti-AI Scraping (5-Layer Defense)

All services have anti-AI scraping enabled by default via `anti_ai_scraping = true` in `ingress_factory`. The 5 layers are:

All services have `anti_ai_scraping = true` by default in `ingress_factory`. Layers:

1. **Bot blocking** (`traefik-ai-bot-block`): ForwardAuth → poison-fountain `/auth`. Returns 403 for GPTBot, ClaudeBot, CCBot, etc.
2. **X-Robots-Tag** (`traefik-anti-ai-headers`): Adds `noai, noimageai`
3. **Trap links** (`traefik-anti-ai-trap-links`): rewrite-body injects 5 hidden links before `</body>` pointing to `poison.viktorbarzin.me/article/*`
4. **Tarpit**: `/article/*` drip-feeds at ~100 bytes/sec
5. **Poison content**: 50 cached docs from rnsaffn.com/poison2/ (CronJob every 6h, `--http1.1` required)

1. **Bot blocking** (`traefik-ai-bot-block`): ForwardAuth middleware → poison-fountain `/auth` endpoint. Checks `User-Agent` against known AI crawlers (GPTBot, ClaudeBot, CCBot, Google-Extended, etc.). Returns 403 for bots, 200 for normal users.
2. **X-Robots-Tag header** (`traefik-anti-ai-headers`): Adds `noai, noimageai` to all responses.
3. **Trap links** (`traefik-anti-ai-trap-links`): rewrite-body plugin injects 5 hidden `<a>` tags before `</body>` linking to `poison.viktorbarzin.me/article/*`. Only injected when the request `Accept` header contains `text/html` (browsers/scrapers, not API calls).
4. **Tarpit** (poison-fountain service): `/article/*` endpoints drip-feed responses at ~100 bytes/sec via chunked transfer encoding, wasting scraper time.
5. **Poison content**: Cached documents from rnsaffn.com/poison2/ (50 docs, refreshed every 6h via CronJob) served through the tarpit to pollute AI training data.

**Key files:**

- `stacks/poison-fountain/` — Terraform stack (deployment, service, ingress, CronJob)
- `stacks/poison-fountain/app/server.py` — Python HTTP server (ForwardAuth + tarpit)
- `stacks/poison-fountain/app/fetch-poison.sh` — CronJob fetcher (uses `--http1.1`; upstream hangs on HTTP/2)
- `stacks/platform/modules/traefik/middleware.tf` — 3 Traefik middleware CRDs
- `modules/kubernetes/ingress_factory/main.tf` — `anti_ai_scraping` variable (default: true)

**Testing:**

```bash
# Trap links (need Accept: text/html for the rewrite-body plugin to process)
curl -s -H "Accept: text/html,application/xhtml+xml" https://vaultwarden.viktorbarzin.me/ | grep -oE 'href="https://poison[^"]*"'

# X-Robots-Tag header
curl -sI -H "Accept: text/html" https://vaultwarden.viktorbarzin.me/ | grep -i x-robots

# Bot blocking (403 for AI bots, 200 for normal users)
curl -s -o /dev/null -w "%{http_code}" -A "GPTBot/1.0" https://vaultwarden.viktorbarzin.me/

# Tarpit slow-drip (~100 bytes/sec)
curl -s -H "Accept: text/html" https://poison.viktorbarzin.me/article/test
```

**Gotchas:**

- The rewrite-body plugin only processes responses when the `Accept` header contains `text/html` — curl's default `Accept: */*` does NOT match. Use `-H "Accept: text/html"` for testing.
- rnsaffn.com/poison2/ hangs on HTTP/2 — the fetcher must use `--http1.1`
- The NFS cache dir (`/mnt/main/poison-fountain/cache`) must be world-writable (chmod 777) because `curlimages/curl` runs as uid 101
- To disable for a specific service: set `anti_ai_scraping = false` in its `ingress_factory` call

Key files: `stacks/poison-fountain/`, `stacks/platform/modules/traefik/middleware.tf`, `modules/kubernetes/ingress_factory/main.tf`

Testing: `curl -s -H "Accept: text/html,application/xhtml+xml" https://vaultwarden.viktorbarzin.me/ | grep -oE 'href="https://poison[^"]*"'`

Disable per-service: `anti_ai_scraping = false` in the `ingress_factory` call.
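
A hedged sketch of what that opt-out looks like in an `ingress_factory` call; any variable names beyond `protected` and `anti_ai_scraping` are assumptions about the module's interface:

```hcl
module "ingress" {
  source = "../../modules/kubernetes/ingress_factory"

  # hypothetical service values; only the two flags below are documented here
  name             = "example-service"
  tls_secret_name  = var.tls_secret_name
  protected        = true  # Authentik forward auth (see Authentik notes below)
  anti_ai_scraping = false # opt this service out of the 5-layer defense
}
```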

### Terragrunt Architecture

- Root `terragrunt.hcl` provides DRY provider, backend, and variable loading for all stacks
- Each stack contains its resources directly: `stacks/<service>/main.tf` has variable declarations, locals, and all Terraform resources inline
- Platform modules live at `stacks/platform/modules/<service>/`, referenced as `source = "./modules/<service>"`
- Shared utility modules (`ingress_factory`, `setup_tls_secret`, `dockerhub_secret`, `oauth-proxy`) remain at `modules/kubernetes/` and are referenced with relative paths from each module
- State isolation: each stack has its own state file at `state/stacks/<service>/terraform.tfstate`
- Dependencies: service stacks depend on the `platform` stack via a `dependency` block in their `terragrunt.hcl`
- Variables loaded from `terraform.tfvars` automatically (unused vars silently ignored via `extra_arguments`)
- `secrets/` symlinks in each stack for TLS cert resolution (`path.root` workaround)
- Terragrunt v0.99+: use `--non-interactive` (not `--terragrunt-non-interactive`)
- run-all syntax: `terragrunt run --all -- <command>` (not `terragrunt run-all`)
- The `platform` stack bundles ~22 core services that have cross-dependencies (traefik, monitoring, authentik, etc.)
- Individual service stacks are for services that can be deployed independently
- Root `terragrunt.hcl` provides DRY provider, backend, and variable loading
- Each stack: `stacks/<service>/main.tf` with resources inline, state at `state/stacks/<service>/terraform.tfstate`
- Platform modules: `stacks/platform/modules/<service>/`, shared modules: `modules/kubernetes/`
- Dependencies via `dependency` block (see the sketch after this list); variables from `terraform.tfvars` (unused silently ignored)
- `secrets/` symlinks in stacks for TLS cert path resolution
- Syntax: `--non-interactive` (not `--terragrunt-non-interactive`), `terragrunt run --all -- <command>` (not `run-all`)
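
A minimal sketch of the per-stack wiring described above; the block labels are assumptions, the real files live in each `stacks/<service>/`:

```hcl
# stacks/<service>/terragrunt.hcl
include "root" {
  path = find_in_parent_folders()
}

# service stacks depend on the platform stack
dependency "platform" {
  config_path = "../platform"
}
```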

### Adding a New Service

When adding a new service to the cluster:

1. Create the `stacks/<service>/` directory with:
   - `terragrunt.hcl` - Include root config, declare `platform` dependency
   - `main.tf` - All resources defined directly (variables, locals, namespace, deployments, services, ingress)
   - `secrets` - Symlink to `../../secrets` (for TLS cert path resolution)
2. Add Cloudflare DNS record in `terraform.tfvars` (`cloudflare_proxied_names` or `cloudflare_non_proxied_names`)
3. Apply the cloudflared stack: `cd stacks/platform && terragrunt apply --non-interactive`
4. Apply the new service: `cd stacks/<service> && terragrunt apply --non-interactive`

## Common Variables

- `tls_secret_name` - TLS certificate secret name
- `tier` - Deployment tier label
- Service-specific passwords passed as variables

## Service Versions (as of 2026-02)

- Immich: v2.4.1
- Freedify: latest (music streaming, factory pattern)
- AFFiNE: stable (visual canvas, uses PostgreSQL + Redis)
- Wyoming Whisper: latest (STT for Home Assistant, CPU on GPU node)
- Health: latest (Apple Health data dashboard, Svelte + FastAPI + Caddy, uses PostgreSQL)
- Gramps Web: latest (genealogy, uses Redis + Celery)
- Loki: 3.6.5 (log aggregation, single binary, 6Gi RAM, 24h in-memory chunks)
- Alloy: v1.13.0 (log collector DaemonSet, forwards to Loki)
- OpenClaw: 2026.2.9 (AI agent gateway, authentik-protected)

Use the **`setup-project`** skill for the full workflow. Quick reference:

1. Create `stacks/<service>/` with `terragrunt.hcl`, `main.tf`, `secrets` symlink
2. Add Cloudflare DNS in `terraform.tfvars`
3. Apply platform stack (for DNS): `cd stacks/platform && terragrunt apply --non-interactive`
4. Apply service: `cd stacks/<service> && terragrunt apply --non-interactive`

## Useful Commands

```bash
# Cluster health check — ALWAYS use this to check cluster status
bash scripts/cluster_healthcheck.sh          # Full color report
bash scripts/cluster_healthcheck.sh          # Cluster health (24 checks)
bash scripts/cluster_healthcheck.sh --quiet  # Only WARN/FAIL
bash scripts/cluster_healthcheck.sh --json   # Machine-readable
bash scripts/cluster_healthcheck.sh --fix    # Auto-delete evicted pods

# Apply a single service stack
cd stacks/<service> && terragrunt apply --non-interactive

# Plan a single service stack
cd stacks/<service> && terragrunt plan --non-interactive

# Plan all stacks (full DAG)
cd stacks && terragrunt run --all --non-interactive -- plan

# Apply all stacks (full DAG)
cd stacks && terragrunt run --all --non-interactive -- apply

# Format all terraform files
terraform fmt -recursive

kubectl get pods -A
cd stacks/<service> && terragrunt apply --non-interactive    # Apply single stack
cd stacks && terragrunt run --all --non-interactive -- plan  # Plan all
terraform fmt -recursive                                     # Format all
```

**Cluster Health Check** (`scripts/cluster_healthcheck.sh`):

- **ALWAYS use this script** to check cluster health — whether the user asks explicitly, after deploying/updating services, or whenever you need to verify cluster state. Never use ad-hoc kubectl commands to assess overall cluster health; use the script instead.
- Runs 24 checks: nodes, resources, conditions, pods, evicted, DaemonSets, deployments, PVCs, HPAs, CronJobs, CrowdSec, ingress, Prometheus alerts, Uptime Kuma, ResourceQuota pressure, StatefulSets, node disk, Helm releases, Kyverno, NFS, DNS, TLS certs, GPU, Cloudflare tunnel
- **When adding new healthchecks or monitoring**: Always update this script to validate the new component

**Terragrunt apply examples:**

- `cd stacks/monitoring && terragrunt apply --non-interactive` - Apply monitoring
- `cd stacks/immich && terragrunt apply --non-interactive` - Apply immich
- `cd stacks/infra && terragrunt apply --non-interactive` - Apply Proxmox VMs / docker registry
- `cd stacks/platform && terragrunt apply --non-interactive` - Apply all core/platform services

**IMPORTANT: When deploying a new service**, you must ALSO apply the `platform` stack (which includes `cloudflared`) to create the Cloudflare DNS record:

```bash
cd stacks/platform && terragrunt apply --non-interactive
```

Adding a name to `cloudflare_non_proxied_names` or `cloudflare_proxied_names` in `terraform.tfvars` only defines the record — it won't be created until the platform stack (which contains cloudflared) is applied.

## Stack Structure

Terragrunt stacks under `stacks/`:

- `stacks/infra/` - Proxmox VMs, templates, docker-registry
- `stacks/platform/` - Core infrastructure (~22 services in `modules/` subdir)
- `stacks/<service>/` - Individual service stacks (resources directly in `main.tf`)

Each stack's `terragrunt.hcl` includes the root `terragrunt.hcl`, which provides:

- Kubernetes + Helm providers (configured from `terraform.tfvars`)
- Local backend with per-stack state file (`state/stacks/<service>/terraform.tfstate`)
- Automatic loading of `terraform.tfvars` with unused vars ignored

---

## Complete Service Catalog

### Critical - Network & Auth (Tier: core)

| Service | Description | Stack |
|---------|-------------|-------|
| wireguard | VPN server | platform |
| technitium | DNS server (10.0.20.101) | platform |
| headscale | Tailscale control server | platform |
| traefik | Ingress controller (Helm) | platform |
| xray | Proxy/tunnel | platform |
| authentik | Identity provider (SSO) | platform |
| cloudflared | Cloudflare tunnel | platform |
| authelia | Auth middleware | platform |
| monitoring | Prometheus/Grafana/Loki stack | platform |

### Storage & Security (Tier: cluster)

| Service | Description | Stack |
|---------|-------------|-------|
| vaultwarden | Bitwarden-compatible password manager | platform |
| redis | Shared Redis at `redis.redis.svc.cluster.local` | platform |
| immich | Photo management (GPU) | immich |
| nvidia | GPU device plugin | platform |
| metrics-server | K8s metrics | platform |
| uptime-kuma | Status monitoring | platform |
| crowdsec | Security/WAF | platform |
| kyverno | Policy engine | platform |

### Admin

| Service | Description | Stack |
|---------|-------------|-------|
| k8s-dashboard | Kubernetes dashboard | platform |
| reverse-proxy | Generic reverse proxy | platform |

### Active Use

| Service | Description | Stack |
|---------|-------------|-------|
| mailserver | Email (docker-mailserver) | mailserver |
| shadowsocks | Proxy | shadowsocks |
| webhook_handler | Webhook processing | webhook_handler |
| tuya-bridge | Smart home bridge | tuya-bridge |
| dawarich | Location history | dawarich |
| owntracks | Location tracking | owntracks |
| nextcloud | File sync/share | nextcloud |
| calibre | E-book management | calibre |
| onlyoffice | Document editing | onlyoffice |
| f1-stream | F1 streaming | f1-stream |
| rybbit | Analytics | rybbit |
| isponsorblocktv | SponsorBlock for TV | isponsorblocktv |
| actualbudget | Budgeting (factory pattern) | actualbudget |

### Optional

| Service | Description | Stack |
|---------|-------------|-------|
| blog | Personal blog | blog |
| descheduler | Pod descheduler | descheduler |
| drone | CI/CD | drone |
| hackmd | Collaborative markdown | hackmd |
| kms | Key management | kms |
| privatebin | Encrypted pastebin | privatebin |
| vault | HashiCorp Vault | vault |
| reloader | ConfigMap/Secret reloader | reloader |
| city-guesser | Game | city-guesser |
| echo | Echo server | echo |
| url | URL shortener | url |
| excalidraw | Whiteboard | excalidraw |
| travel_blog | Travel blog | travel_blog |
| dashy | Dashboard | dashy |
| send | Firefox Send | send |
| ytdlp | YouTube downloader | ytdlp |
| wealthfolio | Finance tracking | wealthfolio |
| audiobookshelf | Audiobook server | audiobookshelf |
| paperless-ngx | Document management | paperless-ngx |
| jsoncrack | JSON visualizer | jsoncrack |
| servarr | Media automation (Sonarr/Radarr/etc) | servarr |
| ntfy | Push notifications | ntfy |
| cyberchef | Data transformation | cyberchef |
| diun | Docker image update notifier | diun |
| meshcentral | Remote management | meshcentral |
| homepage | Dashboard/startpage | homepage |
| matrix | Matrix chat server | matrix |
| linkwarden | Bookmark manager | linkwarden |
| changedetection | Web change detection | changedetection |
| tandoor | Recipe manager | tandoor |
| n8n | Workflow automation | n8n |
| real-estate-crawler | Property crawler | real-estate-crawler |
| tor-proxy | Tor proxy | tor-proxy |
| forgejo | Git forge | forgejo |
| freshrss | RSS reader | freshrss |
| navidrome | Music streaming | navidrome |
| networking-toolbox | Network tools | networking-toolbox |
| stirling-pdf | PDF tools | stirling-pdf |
| speedtest | Speed testing | speedtest |
| freedify | Music streaming (factory pattern) | freedify |
| netbox | Network documentation | netbox |
| infra-maintenance | Maintenance jobs | infra-maintenance |
| ollama | LLM server (GPU) | ollama |
| frigate | NVR/camera (GPU) | frigate |
| ebook2audiobook | E-book to audio (GPU) | ebook2audiobook |
| affine | Visual canvas/whiteboard (PostgreSQL + Redis) | affine |
| health | Apple Health data dashboard (PostgreSQL) | health |
| whisper | Wyoming Faster Whisper STT (CPU on GPU node) | whisper |
| grampsweb | Genealogy web app (Gramps Web) | grampsweb |
| openclaw | AI agent gateway (OpenClaw) | openclaw |
| poison-fountain | Anti-AI scraping (tarpit + poison) | poison-fountain |

---

## Cloudflare Domains

### Proxied (CDN + WAF enabled)

```
blog, hackmd, privatebin, url, echo, f1tv, excalidraw, send,
audiobookshelf, jsoncrack, ntfy, cyberchef, homepage, linkwarden,
changedetection, tandoor, n8n, stirling-pdf, dashy, city-guesser,
travel, netbox
```

### Non-Proxied (Direct DNS)

```
mail, wg, headscale, immich, calibre, vaultwarden, drone,
mailserver-antispam, mailserver-admin, webhook, uptime,
owntracks, dawarich, tuya, meshcentral, nextcloud, actualbudget,
onlyoffice, forgejo, freshrss, navidrome, ollama, openwebui,
isponsorblocktv, speedtest, freedify, rybbit, paperless,
servarr, prowlarr, bazarr, radarr, sonarr, flaresolverr,
jellyfin, jellyseerr, tdarr, affine, health, family, openclaw
```

### Special Subdomains

- `*.viktor.actualbudget` - Actualbudget factory instances
- `*.freedify` - Freedify factory instances
- `mailserver.*` - Mail server components (antispam, admin)

---

## CI/CD

- Drone CI (`.drone.yml`) for automated deployments
- **Default pipeline**: On push, applies the `platform` stack via `terragrunt apply` (core infrastructure services; installs Terraform 1.5.7 + Terragrunt 0.99.4 in Alpine)
- **TLS renewal pipeline**: Cron-triggered, runs `renew2.sh` (certbot + Cloudflare DNS) — no Terraform/Terragrunt needed
- **Build CLI pipeline**: Builds the Docker image from `cli/Dockerfile` (unchanged)
- **ALWAYS add `[ci skip]` to commit messages** when you've already run `terraform apply`, to avoid triggering CI redundantly
- **After committing, run `git push origin master`** to sync changes

## GitHub & Drone CI

### GitHub API Access

- **Username**: `ViktorBarzin`
- **Token location**: `terraform.tfvars` as `github_pat` (git-crypt encrypted)
- **Read token**: `grep github_pat terraform.tfvars | cut -d'"' -f2`
- **Scopes**: Full access — `repo`, `admin:public_key`, `admin:repo_hook`, `delete_repo`, `admin:org`, `workflow`, `write:packages`, and more
- **`gh` CLI**: Blocked by sandbox restrictions — use `curl` with the GitHub API instead

#### Common API Patterns

```bash
# Read token from tfvars
GITHUB_TOKEN=$(grep github_pat terraform.tfvars | cut -d'"' -f2)

# List repos
curl -s -H "Authorization: token $GITHUB_TOKEN" "https://api.github.com/users/ViktorBarzin/repos?per_page=100"

# Create repo
curl -s -X POST -H "Authorization: token $GITHUB_TOKEN" "https://api.github.com/user/repos" \
  -d '{"name":"repo-name","private":true}'

# Add deploy key
curl -s -X POST -H "Authorization: token $GITHUB_TOKEN" "https://api.github.com/repos/ViktorBarzin/<repo>/keys" \
  -d '{"title":"key-name","key":"ssh-ed25519 ...","read_only":false}'

# Create webhook (e.g., for Drone CI)
curl -s -X POST -H "Authorization: token $GITHUB_TOKEN" "https://api.github.com/repos/ViktorBarzin/<repo>/hooks" \
  -d '{"config":{"url":"https://drone.viktorbarzin.me/hook","content_type":"json","secret":"..."},"events":["push","pull_request"]}'

# Get repo info
curl -s -H "Authorization: token $GITHUB_TOKEN" "https://api.github.com/repos/ViktorBarzin/<repo>"
```

### Drone CI API Access

- **Server**: `https://drone.viktorbarzin.me`
- **Token location**: `terraform.tfvars` as `drone_api_token` (git-crypt encrypted)
- **Read token**: `grep drone_api_token terraform.tfvars | cut -d'"' -f2`
- **Username**: `ViktorBarzin`

#### Common API Patterns

```bash
# Read token from tfvars
DRONE_TOKEN=$(grep drone_api_token terraform.tfvars | cut -d'"' -f2)

# List repos
curl -s -H "Authorization: Bearer $DRONE_TOKEN" "https://drone.viktorbarzin.me/api/repos"

# Activate repo in Drone
curl -s -X POST -H "Authorization: Bearer $DRONE_TOKEN" "https://drone.viktorbarzin.me/api/repos/ViktorBarzin/<repo>"

# Trigger build
curl -s -X POST -H "Authorization: Bearer $DRONE_TOKEN" "https://drone.viktorbarzin.me/api/repos/ViktorBarzin/<repo>/builds"

# Get build info
curl -s -H "Authorization: Bearer $DRONE_TOKEN" "https://drone.viktorbarzin.me/api/repos/ViktorBarzin/<repo>/builds/<build-number>"

# Add secret to repo
curl -s -X POST -H "Authorization: Bearer $DRONE_TOKEN" "https://drone.viktorbarzin.me/api/repos/ViktorBarzin/<repo>/secrets" \
  -d '{"name":"secret_name","data":"secret_value"}'
```

### Capabilities

With these tokens, Claude can:

- **GitHub**: Create/delete repos, push code, manage SSH/deploy keys, manage webhooks, manage org settings, manage packages
- **Drone CI**: Activate repos, trigger/monitor builds, manage secrets, configure pipelines

- Drone CI (`.drone.yml`): pushes apply the `platform` stack (Terraform 1.5.7 + Terragrunt 0.99.4)
- TLS renewal pipeline: cron-triggered `renew2.sh` (certbot + Cloudflare DNS)
- **ALWAYS add `[ci skip]`** to commit messages when you've already applied locally
- **After committing, run `git push origin master`** to sync

## Infrastructure

- Proxmox hypervisor for VMs (192.168.1.127)
- Kubernetes cluster with GPU node (5 nodes: k8s-master + k8s-node1-4, running v1.34.2)
- NFS server at 10.0.10.15 for storage
- Redis shared service at `redis.redis.svc.cluster.local`
- Docker registry pull-through cache at 10.0.20.10 (static IP via cloud-init)
  - Port 5000: docker.io (Docker Hub, with auth)
  - Port 5010: ghcr.io
  - Port 5020: quay.io
  - Port 5030: registry.k8s.io
  - Port 5040: reg.kyverno.io
- Worker nodes use `config_path = "/etc/containerd/certs.d"` with per-registry `hosts.toml` files (see the sketch after this list)
- k8s-master does NOT use the pull-through cache (containerd 1.6.x incompatibility with config_path + mirrors)
- Proxmox hypervisor (192.168.1.127) — see `.claude/reference/proxmox-inventory.md` for the full VM table
- Kubernetes cluster: 5 nodes (k8s-master + k8s-node1-4, v1.34.2), GPU on node1 (Tesla T4)
- NFS: `10.0.10.15`, Redis: `redis.redis.svc.cluster.local`
- Docker registry pull-through cache at `10.0.20.10` (ports 5000/5010/5020/5030/5040)
- GPU workloads need: `node_selector = { "gpu": "true" }` + `toleration { key = "nvidia.com/gpu", value = "true", effect = "NoSchedule" }`
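
A hedged sketch of one per-registry mirror file, using the docker.io cache port from the list above; treat the exact contents as an assumption, since the authoritative copy lives in the Terraform templates:

```bash
# Mirror docker.io pulls through the cache at 10.0.20.10:5000 (sketch only)
sudo mkdir -p /etc/containerd/certs.d/docker.io
sudo tee /etc/containerd/certs.d/docker.io/hosts.toml <<'EOF'
server = "https://registry-1.docker.io"

[host."http://10.0.20.10:5000"]
  capabilities = ["pull", "resolve"]
EOF
```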

### Proxmox Host Hardware

- **CPU**: Intel Xeon E5-2699 v4 @ 2.20GHz (22 cores / 44 threads, single socket)
- **RAM**: 142 GB (Dell R730 server)
- **GPU**: NVIDIA Tesla T4 (PCIe passthrough to k8s-node1)
- **Disks**: 1.1TB + 931GB + 10.7TB (local storage)
- **Proxmox access**: `ssh root@192.168.1.127`

### Proxmox Network Bridges

- **vmbr0**: Physical bridge on `eno1`, IP `192.168.1.127/24` — connects to the physical/home network (192.168.1.0/24)
- **vmbr1**: Internal-only bridge (no physical port), VLAN-aware — carries VLAN 10 (management, 10.0.10.0/24) and VLAN 20 (kubernetes, 10.0.20.0/24)

### Proxmox VM Inventory

| VMID | Name | Status | CPUs | RAM | Network | Disk | Notes |
|------|------|--------|------|-----|---------|------|-------|
| 101 | pfsense | running | 8 | 16GB | vmbr0, vmbr1:vlan10, vmbr1:vlan20 | 32G | Gateway/firewall, routes between all networks |
| 102 | devvm | running | 16 | 8GB | vmbr1:vlan10 | 100G | Development VM on management network |
| 103 | home-assistant | running | 8 | 16GB | vmbr1:vlan10(down), vmbr0 | 32G | Home Assistant, net0 link disabled, uses vmbr0 |
| 105 | pbs | stopped | 16 | 8GB | vmbr1:vlan10 | 32G | Proxmox Backup Server (not in use) |
| 200 | k8s-master | running | 8 | 16GB | vmbr1:vlan20 | 64G | Kubernetes control plane (10.0.20.100) |
| 201 | k8s-node1 | running | 16 | 24GB | vmbr1:vlan20 | 128G | GPU node, Tesla T4 passthrough (hostpci0) |
| 202 | k8s-node2 | running | 8 | 16GB | vmbr1:vlan20 | 64G | K8s worker node |
| 203 | k8s-node3 | running | 8 | 16GB | vmbr1:vlan20 | 64G | K8s worker node |
| 204 | k8s-node4 | running | 8 | 16GB | vmbr1:vlan20 | 64G | K8s worker node |
| 220 | docker-registry | running | 4 | 4GB | vmbr1:vlan20 | 64G | Terraform-managed, MAC DE:AD:BE:EF:22:22 (10.0.20.10) |
| 300 | Windows10 | running | 16 | 8GB | vmbr0 | 100G | Windows VM on physical network |
| 9000 | truenas | running | 16 | 16GB | vmbr1:vlan10 | 32G+7×256G+1T | NFS server (10.0.10.15), multiple data disks |

#### VM Templates (stopped, used for cloning)

| VMID | Name | Purpose |
|------|------|---------|
| 1000 | ubuntu-2404-cloudinit-non-k8s-template | Base template for non-K8s VMs |
| 1001 | docker-registry-template | Template for the docker registry VM |
| 2000 | ubuntu-2404-cloudinit-k8s-template | Base template for K8s nodes |

#### Network Connectivity Summary

- **pfSense (101)** bridges all three networks: physical (vmbr0), management VLAN 10, and kubernetes VLAN 20
- **K8s cluster** (200-204) + **docker-registry** (220) are all on VLAN 20 (kubernetes network)
- **TrueNAS** (9000) + **devvm** (102) + **PBS** (105) are on VLAN 10 (management network)
- **Home Assistant** (103) is on the physical network (vmbr0), with a disabled VLAN 10 interface
- **Windows10** (300) is on the physical network (vmbr0) only

### GPU Node (k8s-node1)

- **VMID**: 201
- **PCIe Passthrough**: `0000:06:00.0` (NVIDIA Tesla T4)
- **Taint**: `nvidia.com/gpu=true:NoSchedule` - Only GPU workloads can run here
- **Label**: `gpu=true`
- GPU workloads must have both (see the sketch after this list):
  - `node_selector = { "gpu": "true" }`
  - `toleration { key = "nvidia.com/gpu", operator = "Equal", value = "true", effect = "NoSchedule" }`
- Taint is applied via `null_resource.gpu_node_taint` in `modules/kubernetes/nvidia/main.tf`
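
How the two settings sit together in a pod template, as a minimal sketch with the Terraform kubernetes provider; the container itself is hypothetical:

```hcl
spec {
  # schedule onto the labeled GPU node...
  node_selector = { "gpu" = "true" }

  # ...and tolerate its NoSchedule taint
  toleration {
    key      = "nvidia.com/gpu"
    operator = "Equal"
    value    = "true"
    effect   = "NoSchedule"
  }

  container {
    name  = "gpu-workload"         # hypothetical
    image = "example/image:latest" # hypothetical
  }
}
```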

## Git Operations (IMPORTANT)

- **Git is slow** on this repo due to many files - commands can take 30+ seconds
- Use the `GIT_OPTIONAL_LOCKS=0` prefix if git hangs
- Always commit only the specific files you changed, not everything
- **ALWAYS ask the user before pushing to remote** - never push without explicit confirmation

## Git Operations

- **Git is slow** — commands can take 30+ seconds. Use `GIT_OPTIONAL_LOCKS=0` if git hangs.
- Commit only specific files. **ALWAYS ask the user before pushing**.

## Prometheus Alerts

- Alert rules are in `modules/kubernetes/monitoring/prometheus_chart_values.tpl`
- Under `serverFiles.alerting_rules.yml.groups`
- Rules in `modules/kubernetes/monitoring/prometheus_chart_values.tpl` (see the shape sketch after this list)
- Groups: "R730 Host", "Nvidia Tesla T4 GPU", "Power", "Cluster"
- kube-state-metrics provides: `kube_deployment_*`, `kube_statefulset_*`, `kube_daemonset_*`
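
For orientation, a hedged sketch of the rule shape under `serverFiles.alerting_rules.yml.groups`; the alert name, threshold, and duration are illustrative, not copied from the repo:

```yaml
groups:
  - name: Cluster
    rules:
      - alert: DeploymentReplicasMismatch # illustrative name
        # both metrics are standard kube-state-metrics series
        expr: kube_deployment_status_replicas_available < kube_deployment_spec_replicas
        for: 15m # illustrative duration
        labels:
          severity: warning
        annotations:
          summary: "Deployment {{ $labels.deployment }} has unavailable replicas"
```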

## Tier System

- **0-core**: Critical infrastructure (ingress, DNS, VPN, auth)
- **1-cluster**: Cluster services (Redis, metrics, security)
- **2-gpu**: GPU workloads (Immich, Ollama, Frigate)
- **3-edge**: User-facing services
- **4-aux**: Optional/auxiliary services

### Resource Governance (Kyverno-based)

Four layers of noisy-neighbor protection, all defined in `modules/kubernetes/kyverno/resource-governance.tf`:

1. **PriorityClasses**: `tier-0-core` (1M) through `tier-4-aux` (200K). `tier-4-aux` uses `preemption_policy=Never`.
2. **LimitRange defaults** (Kyverno generate): Auto-creates `tier-defaults` LimitRange in namespaces based on tier label. Only affects containers without explicit resources.
3. **ResourceQuotas** (Kyverno generate): Auto-creates `tier-quota` ResourceQuota in namespaces with tier labels. Excludes namespaces with `resource-governance/custom-quota=true` label.
4. **Priority injection** (Kyverno mutate): Sets `priorityClassName` on Pods based on namespace tier label.

**Custom quota override**: Add label `resource-governance/custom-quota: "true"` to namespace, then define a custom `kubernetes_resource_quota` in the service's Terraform module. Currently used by: monitoring, crowdsec.

**LimitRange defaults by tier**:

| Tier | Default Req | Default Limit | Max |
|------|------------|--------------|-----|
| 0-core | 100m/128Mi | 2/4Gi | 8/16Gi |
| 1-cluster | 100m/128Mi | 2/4Gi | 4/8Gi |
| 2-gpu | 100m/256Mi | 4/8Gi | 8/16Gi |
| 3-edge | 50m/128Mi | 1/2Gi | 4/8Gi |
| 4-aux | 25m/64Mi | 500m/1Gi | 2/4Gi |

**ResourceQuota hard limits by tier**:

| Tier | Req CPU | Req Mem | Lim CPU | Lim Mem | Pods |
|------|---------|---------|---------|---------|------|
| 0-core | 8 | 8Gi | 32 | 64Gi | 100 |
| 1-cluster | 4 | 4Gi | 16 | 32Gi | 30 |
| 2-gpu | 8 | 8Gi | 48 | 96Gi | 40 |
| 3-edge | 4 | 4Gi | 16 | 32Gi | 30 |
| 4-aux | 2 | 2Gi | 8 | 16Gi | 20 |

## Tier System & Resource Governance

- **0-core**: Critical infra (ingress, DNS, VPN, auth) | **1-cluster**: Redis, metrics, security | **2-gpu**: GPU workloads | **3-edge**: User-facing | **4-aux**: Optional
- Kyverno-based governance in `modules/kubernetes/kyverno/resource-governance.tf`:
  1. PriorityClasses: `tier-0-core` (1M) through `tier-4-aux` (200K, preemption=Never)
  2. LimitRange defaults (Kyverno generate): auto-created per namespace tier
  3. ResourceQuotas (Kyverno generate): auto-created per namespace tier (skip with label `resource-governance/custom-quota=true`)
  4. Priority injection (Kyverno mutate): sets `priorityClassName` on Pods
- Custom quota override currently used by: monitoring, crowdsec (see the namespace sketch after this list)
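
A hedged sketch of how a namespace opts into a tier and, optionally, out of the generated quota; the namespace name is hypothetical and the exact `tier` label key is an assumption based on the bullets above:

```hcl
resource "kubernetes_namespace" "example" {
  metadata {
    name = "example-service" # hypothetical
    labels = {
      tier = "3-edge" # drives PriorityClass, LimitRange, and ResourceQuota

      # uncomment to skip the generated tier-quota and define a custom
      # kubernetes_resource_quota in the module instead:
      # "resource-governance/custom-quota" = "true"
    }
  }
}
```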

---

## User Preferences

### Calendar

- **Default calendar**: Nextcloud (always use unless otherwise specified)
- **Nextcloud URL**: `https://nextcloud.viktorbarzin.me`
- **CalDAV endpoint**: `https://nextcloud.viktorbarzin.me/remote.php/dav/calendars/<username>/<calendar-name>/`

### Home Assistant

- **Default smart home**: Home Assistant (always use for smart home control)
- **Two deployments**:
  - **ha-london** (default): `https://ha-london.viktorbarzin.me` | Script: `.claude/home-assistant.py` | SSH: `ssh pi@192.168.8.103`, config at `/home/pi/docker/homeAssistant/`
  - **ha-sofia**: `https://ha-sofia.viktorbarzin.me` | Script: `.claude/home-assistant-sofia.py` | SSH: `ssh vbarzin@192.168.1.8`, config at `/config/`
- **Aliases**: "ha" or "HA" = ha-london. "ha sofia" or "ha-sofia" = ha-sofia.

### Development

- **Frontend framework**: Svelte (user is learning it, so use Svelte for all new web apps)

### Pod Monitoring After Updates

- **Never use `sleep` to wait for pods** — instead, spawn a background subagent (Task tool with `run_in_background: true`) that continuously checks pod state (e.g., `kubectl get pods -n <namespace> -w`) and reports back when the pod is ready or if errors occur. This catches CrashLoopBackOff, ImagePullBackOff, and other failures much sooner than periodic sleep-based polling.

- **Calendar**: Nextcloud at `https://nextcloud.viktorbarzin.me`
- **Home Assistant**: ha-london (default) at `https://ha-london.viktorbarzin.me`, ha-sofia at `https://ha-sofia.viktorbarzin.me`. "ha"/"HA" = ha-london.
- **Frontend**: Svelte for all new web apps
- **Pod monitoring**: Never use `sleep` — spawn a background subagent with `kubectl get pods -w` instead

---

## Skills & Workflows

Skills are specialized workflows for common tasks, located in `.claude/skills/`.

### Available Skills

**setup-project** (`.claude/skills/setup-project/SKILL.md`)

- Deploy new self-hosted services from GitHub repos
- Automated workflow: Docker image → Terraform module → Deploy
- Handles database setup, ingress, DNS configuration
- **When to use**: User provides a GitHub URL or wants to deploy a new service
- **Example**: "Deploy [GitHub repo] to the cluster"

**extend-vm-storage** (`.claude/skills/extend-vm-storage/SKILL.md`)

- Extend disk storage on K8s node VMs (Proxmox-hosted)
- Automates: drain → shutdown → resize → boot → expand filesystem → uncordon
- **When to use**: A k8s node needs more disk space
- **Example**: "Extend storage on k8s-node2 by 64G"

## Reference Data

- `.claude/reference/service-catalog.md` — Full service catalog (70+ services) with Cloudflare domains
- `.claude/reference/proxmox-inventory.md` — VM table, hardware specs, network topology, GPU config
- `.claude/reference/github-drone-api.md` — GitHub & Drone CI API patterns with curl examples
- `.claude/reference/authentik-state.md` — Current applications, groups, users, login sources

---

## Service-Specific Notes

### Authentik (Identity Provider)

- **Helm Chart**: `authentik` v2025.10.3 from `https://charts.goauthentik.io/`
- **URL**: `https://authentik.viktorbarzin.me`
- **API**: `https://authentik.viktorbarzin.me/api/v3/`
- **API Token**: Stored in `terraform.tfvars` as `authentik_api_token` (non-expiring, superuser, identifier: `claude-code-permanent`). Read with: `grep authentik_api_token terraform.tfvars | cut -d'"' -f2`
- **Namespace**: `authentik` (tier: cluster)
- **Architecture**: 3 server replicas + 3 worker replicas + 3 PgBouncer replicas + 1 embedded outpost
- **Database**: PostgreSQL via `postgresql.dbaas:5432`, pooled through PgBouncer at `pgbouncer.authentik:6432`
- **Redis**: Shared at `redis.redis.svc.cluster.local`
- **Terraform**: `modules/kubernetes/authentik/main.tf` (Helm), `pgbouncer.tf` (connection pooling)

#### Authentik API Management

To call the API, use:

```bash
curl -s -H "Authorization: Bearer <TOKEN>" "https://authentik.viktorbarzin.me/api/v3/<endpoint>/"
```

Key API endpoints:

- `core/users/` — List/create/update/delete users
- `core/groups/` — List/create/update/delete groups
- `core/applications/` — List/create applications
- `providers/all/` — List all providers (OAuth2, Proxy, etc.)
- `providers/oauth2/` — OAuth2/OIDC providers specifically
- `providers/proxy/` — Proxy providers (forward auth)
- `flows/instances/` — List flows
- `stages/all/` — List stages
- `sources/all/` — List sources (Google, GitHub, etc.)
- `outposts/instances/` — List outposts
- `propertymappings/all/` — List property mappings
- `rbac/roles/` — List roles

#### Current Applications (9)

| Application | Provider Type | Auth Flow |
|-------------|--------------|-----------|
| Cloudflare Access | OAuth2/OIDC | explicit consent |
| Domain wide catch all | Proxy (forward auth) | implicit consent |
| Grafana | OAuth2/OIDC | implicit consent |
| Headscale | OAuth2/OIDC | explicit consent |
| Immich | OAuth2/OIDC | explicit consent |
| Kubernetes | OAuth2/OIDC (public) | implicit consent |
| linkwarden | OAuth2/OIDC | explicit consent |
| Matrix | OAuth2/OIDC | implicit consent |
| wrongmove | OAuth2/OIDC | implicit consent |

#### Current Groups (9)

| Group | Parent | Superuser | Purpose |
|-------|--------|-----------|---------|
| Allow Login Users | — | No | Parent group for login-permitted users |
| authentik Admins | — | Yes | Full admin access |
| authentik Read-only | — | No | Read-only access (has role) |
| Headscale Users | Allow Login Users | No | VPN access |
| Home Server Admins | Allow Login Users | No | Server admin access |
| Wrongmove Users | Allow Login Users | No | Real-estate app access |
| kubernetes-admins | — | No | K8s cluster-admin RBAC |
| kubernetes-power-users | — | No | K8s power-user RBAC |
| kubernetes-namespace-owners | — | No | K8s namespace-owner RBAC |

#### Current Users (7 real users + akadmin)

| Username | Name | Type | Groups |
|----------|------|------|--------|
| akadmin | authentik Default Admin | internal | authentik Admins, Home Server Admins, Headscale Users |
| vbarzin@gmail.com | Viktor Barzin | internal | authentik Admins, Home Server Admins, Wrongmove Users, Headscale Users |
| emil.barzin@gmail.com | Emil Barzin | internal | Home Server Admins, Headscale Users |
| ancaelena98@gmail.com | Anca Milea | external | Wrongmove Users, Headscale Users |
| vabbit81@gmail.com | GHEORGHE Milea | external | Headscale Users |
| valentinakolevabarzina@gmail.com | Валентина Колева-Барзина | internal | Headscale Users |
| anca.r.cristian10@gmail.com | — | internal | Wrongmove Users |
| kadir.tugan@gmail.com | Kadir | internal | Wrongmove Users |

#### Login Sources (Social Login)

- **Google** (OAuth) — user matching by identifier
- **GitHub** (OAuth) — user matching by email_link
- **Facebook** (OAuth) — user matching by email_link
- All use the same authentication flow (`1a779f24`) and enrollment flow (`87572804`)

#### Authorization Flows

- **Explicit consent** (`default-provider-authorization-explicit-consent`): Shows a consent screen before redirecting — used for Immich, Linkwarden, Headscale, Cloudflare
- **Implicit consent** (`default-provider-authorization-implicit-consent`): Auto-redirects without consent — used for Grafana, Matrix, Domain catch-all, Wrongmove

#### Traefik Integration

- Forward auth middleware: `authentik-forward-auth` in the Traefik namespace
- Outpost endpoint: `http://ak-outpost-authentik-embedded-outpost.authentik.svc.cluster.local:9000/outpost.goauthentik.io/auth/traefik`
- Services opt in via `protected = true` in `ingress_factory`
- Response headers: `X-authentik-username`, `X-authentik-uid`, `X-authentik-email`, `X-authentik-name`, `X-authentik-groups`, `Set-Cookie`

#### OIDC for Kubernetes API

- **Issuer**: `https://authentik.viktorbarzin.me/application/o/kubernetes/`
- **Client ID**: `kubernetes` (public client, no secret)
- **Username claim**: `email`, **Groups claim**: `groups`
- **Signing key**: `authentik Self-signed Certificate` (must be assigned to the provider or the JWKS will be empty)
- **Redirect URIs**: Regex mode `http://localhost:.*` and `http://127\.0\.0\.1:.*` (kubelogin picks random ports)
- **Configured via**: SSH to the kube-apiserver manifest (`modules/kubernetes/rbac/apiserver-oidc.tf`)
- **RBAC module**: `modules/kubernetes/rbac/main.tf` — admin/power-user/namespace-owner roles
- **Self-service portal**: `modules/kubernetes/k8s-portal/` — SvelteKit app at `https://k8s-portal.viktorbarzin.me`
- **User definition**: `k8s_users` variable in `terraform.tfvars`
- **Audit logging**: Enabled via `modules/kubernetes/rbac/audit-policy.tf`, logs at `/var/log/kubernetes/audit.log`

**CRITICAL GOTCHAS when setting up Authentik OIDC for Kubernetes:**

1. **The signing key MUST be assigned** to the OAuth2 provider. Without it, the JWKS endpoint returns `{}` and kube-apiserver can't validate tokens.
2. **The email mapping must set `email_verified: True`**. The default Authentik email scope mapping hardcodes `email_verified: False`, which causes kube-apiserver to reject the token with `oidc: email not verified`. Use a custom scope mapping: `return {"email": request.user.email, "email_verified": True}`
3. **kubelogin needs `--oidc-extra-scope`** for `email`, `profile`, `groups`. Without these, only `openid` is requested and the token lacks the `email` claim, causing `oidc: parse username claims "email": claim not present`.
4. **Redirect URIs must use regex mode** (`http://localhost:.*`) because kubelogin picks random ports, not just 8000/18000.
5. **Kubelet static pod manifest changes** require a full cycle to take effect: remove the manifest, stop kubelet, remove containers via crictl, re-add the manifest, start kubelet. A simple `touch` or kubelet restart is not enough.
6. **The property mappings endpoint** in Authentik 2025.10.x is `propertymappings/provider/scope/` (not the older `propertymappings/scope/`).

#### Common Management Tasks

**Add a new OAuth2 application** (sketch after the steps):

1. Create the OAuth2 provider: `POST /api/v3/providers/oauth2/` with client_id, client_secret, redirect_uris, authorization_flow, etc.
2. Create the application: `POST /api/v3/core/applications/` with name, slug, provider pk
3. (Optional) Bind to a group policy for access control
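
A hedged sketch of steps 1–2 with curl; the field values are illustrative and the exact required fields should be checked against the Authentik API schema:

```bash
TOKEN=$(grep authentik_api_token terraform.tfvars | cut -d'"' -f2)

# 1. Create the OAuth2 provider (illustrative fields)
curl -s -X POST -H "Authorization: Bearer $TOKEN" -H "Content-Type: application/json" \
  "https://authentik.viktorbarzin.me/api/v3/providers/oauth2/" \
  -d '{"name":"myapp","client_id":"myapp","authorization_flow":"<flow-pk>","redirect_uris":"https://myapp.viktorbarzin.me/callback"}'

# 2. Create the application pointing at the provider pk returned above
curl -s -X POST -H "Authorization: Bearer $TOKEN" -H "Content-Type: application/json" \
  "https://authentik.viktorbarzin.me/api/v3/core/applications/" \
  -d '{"name":"MyApp","slug":"myapp","provider":<provider-pk>}'
```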

**Add a user to a group:**

```bash
# Get the group pk, then PATCH with the updated users list
curl -X PATCH -H "Authorization: Bearer <TOKEN>" -H "Content-Type: application/json" \
  "https://authentik.viktorbarzin.me/api/v3/core/groups/<group-pk>/" \
  -d '{"users": [<existing_user_pks>, <new_user_pk>]}'
```

**Protect a service with forward auth:**

Set `protected = true` in the service's `ingress_factory` call in Terraform.

- **URL**: `https://authentik.viktorbarzin.me` | **API**: `/api/v3/` | **Token**: `authentik_api_token` in tfvars
- **Architecture**: 3 server + 3 worker + 3 PgBouncer + embedded outpost
- **Database**: PostgreSQL via `postgresql.dbaas:5432`, PgBouncer at `pgbouncer.authentik:6432`
- **Traefik integration**: Forward auth via `protected = true` in ingress_factory
- **OIDC for K8s**: Issuer `https://authentik.viktorbarzin.me/application/o/kubernetes/`, client `kubernetes` (public)
- For management tasks, current state, and OIDC gotchas: see the `authentik` and `authentik-oidc-kubernetes` skills
- For a current apps/groups/users snapshot: see `.claude/reference/authentik-state.md`

### AFFiNE (Visual Canvas)

- **Image**: `ghcr.io/toeverything/affine:stable`
- **Port**: 3010
- **Requires**: PostgreSQL + Redis
- **Image**: `ghcr.io/toeverything/affine:stable` | **Port**: 3010 | **Requires**: PostgreSQL + Redis
- **Migration**: Init container runs `node ./scripts/self-host-predeploy.js`
- **Storage**: NFS at `/mnt/main/affine` mounted to `/root/.affine/storage` and `/root/.affine/config`
- **Key env vars**:
  - `AFFINE_SERVER_EXTERNAL_URL` - Public URL (e.g., `https://affine.viktorbarzin.me`)
  - `AFFINE_SERVER_HTTPS` - Set to `true` behind TLS ingress
  - `DATABASE_URL` - PostgreSQL connection string
  - `REDIS_SERVER_HOST` - Redis hostname
  - `MAILER_*` - SMTP configuration for email invites
- **Local-first**: Data stored in browser by default; syncs to server when user creates account
- **Docs**: https://docs.affine.pro/self-host-affine
- **Storage**: NFS `/mnt/main/affine` → `/root/.affine/storage` and `/root/.affine/config`

### Wyoming Whisper (STT for Home Assistant)

- **Image**: `rhasspy/wyoming-whisper:latest`
- **Port**: 10300/TCP (Wyoming protocol)
- **Model**: `small-int8` (CPU-optimized, no CUDA variant available from upstream)
- **Runs on**: GPU node (node_selector gpu=true + nvidia toleration) but uses CPU only
- **Storage**: NFS at `/mnt/main/whisper` → `/data` (model cache)
- **Exposure**: Internal only via Traefik TCP entrypoint `whisper-tcp` → IngressRouteTCP
- **Access**: `10.0.20.202:10300` (Traefik LB IP, no public DNS)
- **HA Integration**: Wyoming Protocol integration in ha-london, host `10.0.20.202`, port `10300`
- **No GPU acceleration**: Official image is CPU-only (Debian + PyTorch CPU). The `mib1185/wyoming-faster-whisper-cuda` image exists but requires self-build.

### Wyoming Whisper (STT)

- **Image**: `rhasspy/wyoming-whisper:latest` | **Port**: 10300/TCP (Wyoming protocol)
- **Model**: `small-int8` (CPU-only) | **Access**: `10.0.20.202:10300` (internal, no public DNS)
- **HA Integration**: Wyoming Protocol in ha-london
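
The Wyoming protocol runs over plain TCP, so reachability is easy to probe from a VPN client or a cluster host; a sketch assuming `nc` is available:

```bash
# Confirm Traefik is forwarding the whisper-tcp entrypoint to the pod
nc -vz 10.0.20.202 10300
```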

### Gramps Web (Genealogy)

- **Image**: `ghcr.io/gramps-project/grampsweb:latest`
- **Port**: 5000
- **URL**: `https://family.viktorbarzin.me`
- **Components**: Web app + Celery worker (2 containers in 1 pod)
- **Requires**: Shared Redis (DB 2 for Celery broker/backend, DB 3 for rate limiting)
- **Storage**: NFS at `/mnt/main/grampsweb` with sub_paths: users, indexdir, thumbnail_cache, cache, secret, grampsdb, media, tmp
- **Key env vars**:
  - `GRAMPSWEB_SECRET_KEY` - Flask secret key (generated via `random_password`)
  - `GRAMPSWEB_TREE` - Tree name
  - `GRAMPSWEB_BASE_URL` - Public URL
  - `GRAMPSWEB_CELERY_CONFIG__broker_url` / `result_backend` - Redis connection
  - `GRAMPSWEB_REGISTRATION_DISABLED` - Set to `True`
  - `GRAMPSWEB_EMAIL_*` - SMTP configuration
  - `GRAMPSWEB_LLM_*` - Ollama AI integration
- **Celery command**: `celery -A gramps_webapi.celery worker --loglevel=INFO --concurrency=2`
- **Registration**: Disabled; first user created via UI setup wizard
- **Image**: `ghcr.io/gramps-project/grampsweb:latest` | **Port**: 5000 | **URL**: `https://family.viktorbarzin.me`
- **Components**: Web app + Celery worker (2 containers in 1 pod) | **Redis**: DB 2 (broker), DB 3 (rate limiting)
- **Storage**: NFS `/mnt/main/grampsweb` with sub_paths

### Loki + Alloy (Centralized Log Collection)

- **Loki image**: `grafana/loki:3.6.5` (Helm chart, single binary mode)
- **Alloy image**: `grafana/alloy:v1.13.0` (Helm chart, DaemonSet)
- **Config files**: `modules/kubernetes/monitoring/loki.tf`, `loki.yaml`, `alloy.yaml`
- **Port**: 3100/TCP (Loki API)
- **Storage**: NFS PV at `/mnt/main/loki/loki` (15Gi), WAL on tmpfs (2Gi in-memory)
- **Memory**: Loki 6Gi limit, Alloy 128Mi per pod (4 worker nodes)
- **Disk-friendly tuning**: `max_chunk_age: 24h`, `chunk_idle_period: 12h` — holds chunks in memory, flushes ~once/day
- **Retention**: 7 days (`retention_period: 168h`), compactor enforces deletion
- **Crash policy**: WAL on tmpfs — up to 24h log loss on crash (alerts still fire in real-time)
- **Ruler**: Evaluates LogQL alert rules, fires to `http://prometheus-alertmanager.monitoring.svc.cluster.local:9093`

### Loki + Alloy (Log Collection)

- **Loki**: `grafana/loki:3.6.5` (single binary, 6Gi RAM, 7d retention)
- **Alloy**: `grafana/alloy:v1.13.0` (DaemonSet, 128Mi/pod)
- **Storage**: NFS PV `/mnt/main/loki/loki` (15Gi), WAL on tmpfs (2Gi)
- **Alert rules**: HighErrorRate, PodCrashLoopBackOff, OOMKilled (ConfigMap `loki-alert-rules`)
- **Grafana**: Datasource UID `P8E80F9AEF21F6940`, dashboard "Loki Kubernetes Logs" (stored in MySQL, not file-provisioned)
- **Sysctl DaemonSet**: `sysctl-inotify` sets `fs.inotify.max_user_watches=1048576` on all nodes (required for Alloy fsnotify)
- **Disabled components**: gateway, chunksCache, resultsCache (not needed for single binary)
- **Key paths**: Compactor at `/var/loki/compactor`, ruler scratch at `/var/loki/scratch` (must be under `/var/loki` — root FS is read-only)
- **Querying**: Grafana Explore with LogQL, e.g. `{namespace="monitoring"} |= "error"` (API sketch below)
- **Troubleshooting**: If "entry too far behind" errors appear on first start, restart the Alloy DaemonSet (`kubectl rollout restart ds -n monitoring alloy`). Alloy reads historical logs on first boot, which Loki rejects; the errors clear after a restart.
- **Troubleshooting**: "entry too far behind" on first start → restart Alloy DaemonSet

### OpenClaw (AI Agent Gateway)

- **Image**: `ghcr.io/openclaw/openclaw:2026.2.9`
- **Port**: 18789
- **URL**: `https://openclaw.viktorbarzin.me` (authentik-protected)
- **Namespace**: `openclaw` (tier: aux)
- **Formerly**: `moltbot` — renamed in Feb 2026
- **Architecture**: Single pod with init container (tools download + repo clone) + main container (OpenClaw gateway)
- **Init container**: Downloads kubectl v1.34.2, terraform 1.14.5, git-crypt; clones infra repo; runs terraform init
- **ServiceAccount**: `openclaw` with `cluster-admin` ClusterRoleBinding (for managing cluster resources)
- **Storage**: NFS at `/mnt/main/openclaw/workspace` (git repo) and `/mnt/main/openclaw/data` (persistent data)
- **Config**: `openclaw.json` ConfigMap with model providers (Gemini, Ollama, Llama API), tool permissions, and agent defaults
- **Variables**: `openclaw_ssh_key`, `openclaw_skill_secrets` in `terraform.tfvars`
- **Skill secrets**: Home Assistant tokens (london + sofia), Uptime Kuma password — passed as env vars
- **Model providers**: Gemini (gemini-2.5-flash), Ollama (qwen2.5-coder:14b, deepseek-r1:14b), Llama API (Llama-3.3-70B, Llama-4-Scout/Maverick)
- **Image**: `ghcr.io/openclaw/openclaw:2026.2.9` | **Port**: 18789 | **URL**: `https://openclaw.viktorbarzin.me`
- **Init container**: Downloads kubectl, terraform, git-crypt; clones infra repo
- **ServiceAccount**: `openclaw` with `cluster-admin` ClusterRoleBinding
- **Model providers**: Gemini (gemini-2.5-flash), Ollama (qwen2.5-coder:14b, deepseek-r1:14b), Llama API

### Poison Fountain (Anti-AI Scraping Service)

- **Image**: `python:3.12-slim` (runs custom `server.py` from ConfigMap)
- **Port**: 8080
- **URL**: `https://poison.viktorbarzin.me` (public, no auth)
- **Namespace**: `poison-fountain` (tier: aux)
- **Stack**: `stacks/poison-fountain/`
- **Architecture**: 1 Deployment (Python HTTP server) + 1 CronJob (fetcher, every 6h)
- **Storage**: NFS at `/mnt/main/poison-fountain` — `cache/` subdir for poison docs (chmod 777 for curl uid 101)
- **Endpoints**:
  - `/auth` — ForwardAuth: checks User-Agent, returns 200 (allow) or 403 (block AI bots)
  - `/article/*` — Tarpit: drip-feeds poison content at ~100 bytes/sec (DRIP_BYTES=50, DRIP_DELAY=0.5s)
  - `/healthz` — Health check
- **CronJob**: Fetches 50 documents from `rnsaffn.com/poison2/` using `--http1.1` (HTTP/2 hangs)
- **Ingress**: Uses `anti_ai_scraping = false` (doesn't protect itself), `skip_default_rate_limit = true`, `exclude_crowdsec = true`
- **DNS**: `poison.viktorbarzin.me` in `cloudflare_non_proxied_names`
- **Traefik middlewares** (in `stacks/platform/modules/traefik/middleware.tf`):
  - `ai-bot-block` — ForwardAuth to poison-fountain `/auth`
  - `anti-ai-headers` — X-Robots-Tag: noai, noimageai
  - `anti-ai-trap-links` — rewrite-body plugin injecting 5 hidden links before `</body>`
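
A behavioral smoke test for the endpoints; the `GPTBot` user-agent is an assumed example of what the blocklist matches:

```bash
# A known AI crawler UA should be blocked, a browser UA allowed
curl -s -o /dev/null -w '%{http_code}\n' -A 'GPTBot' https://poison.viktorbarzin.me/auth        # expect 403
curl -s -o /dev/null -w '%{http_code}\n' -A 'Mozilla/5.0' https://poison.viktorbarzin.me/auth   # expect 200

# The tarpit should stream slowly - reading even 500 bytes takes several seconds
time curl -s --max-time 10 https://poison.viktorbarzin.me/article/test | head -c 500 >/dev/null
```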

## Service Versions (as of 2026-02)

Immich v2.4.1 | AFFiNE stable | Whisper latest | Loki 3.6.5 | Alloy v1.13.0 | OpenClaw 2026.2.9
50 .claude/reference/authentik-state.md Normal file
@@ -0,0 +1,50 @@
# Authentik Current State

> Snapshot of applications, groups, users, and flows. Use `authentik` skill for management tasks.

## Applications (9)

| Application | Provider Type | Auth Flow |
|-------------|--------------|-----------|
| Cloudflare Access | OAuth2/OIDC | explicit consent |
| Domain wide catch all | Proxy (forward auth) | implicit consent |
| Grafana | OAuth2/OIDC | implicit consent |
| Headscale | OAuth2/OIDC | explicit consent |
| Immich | OAuth2/OIDC | explicit consent |
| Kubernetes | OAuth2/OIDC (public) | implicit consent |
| linkwarden | OAuth2/OIDC | explicit consent |
| Matrix | OAuth2/OIDC | implicit consent |
| wrongmove | OAuth2/OIDC | implicit consent |

## Groups (9)

| Group | Parent | Superuser | Purpose |
|-------|--------|-----------|---------|
| Allow Login Users | — | No | Parent group for login-permitted users |
| authentik Admins | — | Yes | Full admin access |
| authentik Read-only | — | No | Read-only access (has role) |
| Headscale Users | Allow Login Users | No | VPN access |
| Home Server Admins | Allow Login Users | No | Server admin access |
| Wrongmove Users | Allow Login Users | No | Real-estate app access |
| kubernetes-admins | — | No | K8s cluster-admin RBAC |
| kubernetes-power-users | — | No | K8s power-user RBAC |
| kubernetes-namespace-owners | — | No | K8s namespace-owner RBAC |

## Users (7 real + akadmin)

| Username | Name | Type | Groups |
|----------|------|------|--------|
| akadmin | authentik Default Admin | internal | authentik Admins, Home Server Admins, Headscale Users |
| vbarzin@gmail.com | Viktor Barzin | internal | authentik Admins, Home Server Admins, Wrongmove Users, Headscale Users |
| emil.barzin@gmail.com | Emil Barzin | internal | Home Server Admins, Headscale Users |
| ancaelena98@gmail.com | Anca Milea | external | Wrongmove Users, Headscale Users |
| vabbit81@gmail.com | GHEORGHE Milea | external | Headscale Users |
| valentinakolevabarzina@gmail.com | Валентина Колева-Барзина | internal | Headscale Users |
| anca.r.cristian10@gmail.com | — | internal | Wrongmove Users |
| kadir.tugan@gmail.com | Kadir | internal | Wrongmove Users |

## Login Sources

- **Google** (OAuth) — user matching by identifier
- **GitHub** (OAuth) — user matching by email_link
- **Facebook** (OAuth) — user matching by email_link

## Authorization Flows

- **Explicit consent** (`default-provider-authorization-explicit-consent`): Shows consent screen
- **Implicit consent** (`default-provider-authorization-implicit-consent`): Auto-redirects
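
To refresh this snapshot from the live instance, the standard v3 list endpoints can be queried read-only; a sketch:

```bash
AUTHENTIK_TOKEN=$(grep authentik_api_token terraform.tfvars | cut -d'"' -f2)
API=https://authentik.viktorbarzin.me/api/v3

curl -s -H "Authorization: Bearer $AUTHENTIK_TOKEN" "$API/core/applications/" | jq -r '.results[].name'
curl -s -H "Authorization: Bearer $AUTHENTIK_TOKEN" "$API/core/groups/"       | jq -r '.results[].name'
curl -s -H "Authorization: Bearer $AUTHENTIK_TOKEN" "$API/core/users/"        | jq -r '.results[].username'
```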

50 .claude/reference/github-drone-api.md Normal file
@@ -0,0 +1,50 @@

# GitHub & Drone CI API Reference

> Token locations and common API patterns.

## GitHub API

- **Username**: `ViktorBarzin`
- **Token**: `grep github_pat terraform.tfvars | cut -d'"' -f2` (git-crypt encrypted)
- **Scopes**: Full access (repo, admin:public_key, admin:repo_hook, delete_repo, admin:org, workflow, write:packages)
- **`gh` CLI**: Blocked by sandbox — use `curl` instead

```bash
GITHUB_TOKEN=$(grep github_pat terraform.tfvars | cut -d'"' -f2)

# List repos
curl -s -H "Authorization: token $GITHUB_TOKEN" "https://api.github.com/users/ViktorBarzin/repos?per_page=100"

# Create repo
curl -s -X POST -H "Authorization: token $GITHUB_TOKEN" "https://api.github.com/user/repos" \
  -d '{"name":"repo-name","private":true}'

# Add deploy key
curl -s -X POST -H "Authorization: token $GITHUB_TOKEN" "https://api.github.com/repos/ViktorBarzin/<repo>/keys" \
  -d '{"title":"key-name","key":"ssh-ed25519 ...","read_only":false}'

# Create webhook
curl -s -X POST -H "Authorization: token $GITHUB_TOKEN" "https://api.github.com/repos/ViktorBarzin/<repo>/hooks" \
  -d '{"config":{"url":"https://drone.viktorbarzin.me/hook","content_type":"json","secret":"..."},"events":["push","pull_request"]}'
```

## Drone CI API

- **Server**: `https://drone.viktorbarzin.me`
- **Token**: `grep drone_api_token terraform.tfvars | cut -d'"' -f2`

```bash
DRONE_TOKEN=$(grep drone_api_token terraform.tfvars | cut -d'"' -f2)

# Activate repo
curl -s -X POST -H "Authorization: Bearer $DRONE_TOKEN" "https://drone.viktorbarzin.me/api/repos/ViktorBarzin/<repo>"

# Trigger build
curl -s -X POST -H "Authorization: Bearer $DRONE_TOKEN" "https://drone.viktorbarzin.me/api/repos/ViktorBarzin/<repo>/builds"

# Add secret
curl -s -X POST -H "Authorization: Bearer $DRONE_TOKEN" "https://drone.viktorbarzin.me/api/repos/ViktorBarzin/<repo>/secrets" \
  -d '{"name":"secret_name","data":"secret_value"}'
```

## Capabilities

- **GitHub**: Create/delete repos, push code, manage SSH/deploy keys, manage webhooks, manage org settings, manage packages
- **Drone CI**: Activate repos, trigger/monitor builds, manage secrets, configure pipelines

52 .claude/reference/proxmox-inventory.md Normal file
@@ -0,0 +1,52 @@

# Proxmox Inventory & Infrastructure

> Static reference for VMs, hardware, and network topology.

## Proxmox Host Hardware

- **CPU**: Intel Xeon E5-2699 v4 @ 2.20GHz (22 cores / 44 threads, single socket)
- **RAM**: 142 GB (Dell R730 server)
- **GPU**: NVIDIA Tesla T4 (PCIe passthrough to k8s-node1)
- **Disks**: 1.1TB + 931GB + 10.7TB (local storage)
- **Proxmox access**: `ssh root@192.168.1.127`

## Network Topology

```
10.0.10.0/24   - Management: Wizard (10.0.10.10), TrueNAS NFS (10.0.10.15)
10.0.20.0/24   - Kubernetes: pfSense GW (10.0.20.1), Registry (10.0.20.10),
                 k8s-master (10.0.20.100), DNS (10.0.20.101), MetalLB (10.0.20.102-200)
192.168.1.0/24 - Physical: Proxmox (192.168.1.127)
```

## Network Bridges

- **vmbr0**: Physical bridge on `eno1`, IP `192.168.1.127/24` — physical/home network
- **vmbr1**: Internal-only bridge, VLAN-aware — VLAN 10 (management) and VLAN 20 (kubernetes)

## VM Inventory

| VMID | Name | Status | CPUs | RAM | Network | Disk | Notes |
|------|------|--------|------|-----|---------|------|-------|
| 101 | pfsense | running | 8 | 16GB | vmbr0, vmbr1:vlan10, vmbr1:vlan20 | 32G | Gateway/firewall |
| 102 | devvm | running | 16 | 8GB | vmbr1:vlan10 | 100G | Development VM |
| 103 | home-assistant | running | 8 | 16GB | vmbr0 | 32G | HA, net0(vlan10) disabled |
| 105 | pbs | stopped | 16 | 8GB | vmbr1:vlan10 | 32G | Proxmox Backup (unused) |
| 200 | k8s-master | running | 8 | 16GB | vmbr1:vlan20 | 64G | Control plane (10.0.20.100) |
| 201 | k8s-node1 | running | 16 | 24GB | vmbr1:vlan20 | 128G | GPU node, Tesla T4 |
| 202 | k8s-node2 | running | 8 | 16GB | vmbr1:vlan20 | 64G | Worker |
| 203 | k8s-node3 | running | 8 | 16GB | vmbr1:vlan20 | 64G | Worker |
| 204 | k8s-node4 | running | 8 | 16GB | vmbr1:vlan20 | 64G | Worker |
| 220 | docker-registry | running | 4 | 4GB | vmbr1:vlan20 | 64G | MAC DE:AD:BE:EF:22:22 (10.0.20.10) |
| 300 | Windows10 | running | 16 | 8GB | vmbr0 | 100G | Windows VM |
| 9000 | truenas | running | 16 | 16GB | vmbr1:vlan10 | 32G+7x256G+1T | NFS (10.0.10.15) |

## VM Templates

| VMID | Name | Purpose |
|------|------|---------|
| 1000 | ubuntu-2404-cloudinit-non-k8s-template | Base for non-K8s VMs |
| 1001 | docker-registry-template | Docker registry VM |
| 2000 | ubuntu-2404-cloudinit-k8s-template | Base for K8s nodes |

## GPU Node (k8s-node1)

- **VMID**: 201, **PCIe**: `0000:06:00.0` (NVIDIA Tesla T4)
- **Taint**: `nvidia.com/gpu=true:NoSchedule`, **Label**: `gpu=true`
- GPU workloads need: `node_selector = { "gpu": "true" }` + nvidia toleration
- Taint applied via `null_resource.gpu_node_taint` in `modules/kubernetes/nvidia/main.tf`
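
A read-only check (safe under the Terraform-only rule) that the label and taint are actually present:

```bash
# Label should print "true"; Taints should show nvidia.com/gpu=true:NoSchedule
kubectl get node k8s-node1 -o jsonpath='{.metadata.labels.gpu}{"\n"}'
kubectl describe node k8s-node1 | grep -A1 Taints
```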

132 .claude/reference/service-catalog.md Normal file
@@ -0,0 +1,132 @@

# Service Catalog

> Auto-maintained reference. See `.claude/CLAUDE.md` for operational guidance.

## Critical - Network & Auth (Tier: core)

| Service | Description | Stack |
|---------|-------------|-------|
| wireguard | VPN server | platform |
| technitium | DNS server (10.0.20.101) | platform |
| headscale | Tailscale control server | platform |
| traefik | Ingress controller (Helm) | platform |
| xray | Proxy/tunnel | platform |
| authentik | Identity provider (SSO) | platform |
| cloudflared | Cloudflare tunnel | platform |
| authelia | Auth middleware | platform |
| monitoring | Prometheus/Grafana/Loki stack | platform |

## Storage & Security (Tier: cluster)

| Service | Description | Stack |
|---------|-------------|-------|
| vaultwarden | Bitwarden-compatible password manager | platform |
| redis | Shared Redis at `redis.redis.svc.cluster.local` | platform |
| immich | Photo management (GPU) | immich |
| nvidia | GPU device plugin | platform |
| metrics-server | K8s metrics | platform |
| uptime-kuma | Status monitoring | platform |
| crowdsec | Security/WAF | platform |
| kyverno | Policy engine | platform |

## Admin

| Service | Description | Stack |
|---------|-------------|-------|
| k8s-dashboard | Kubernetes dashboard | platform |
| reverse-proxy | Generic reverse proxy | platform |

## Active Use

| Service | Description | Stack |
|---------|-------------|-------|
| mailserver | Email (docker-mailserver) | mailserver |
| shadowsocks | Proxy | shadowsocks |
| webhook_handler | Webhook processing | webhook_handler |
| tuya-bridge | Smart home bridge | tuya-bridge |
| dawarich | Location history | dawarich |
| owntracks | Location tracking | owntracks |
| nextcloud | File sync/share | nextcloud |
| calibre | E-book management | calibre |
| onlyoffice | Document editing | onlyoffice |
| f1-stream | F1 streaming | f1-stream |
| rybbit | Analytics | rybbit |
| isponsorblocktv | SponsorBlock for TV | isponsorblocktv |
| actualbudget | Budgeting (factory pattern) | actualbudget |

## Optional

| Service | Description | Stack |
|---------|-------------|-------|
| blog | Personal blog | blog |
| descheduler | Pod descheduler | descheduler |
| drone | CI/CD | drone |
| hackmd | Collaborative markdown | hackmd |
| kms | Key management | kms |
| privatebin | Encrypted pastebin | privatebin |
| vault | HashiCorp Vault | vault |
| reloader | ConfigMap/Secret reloader | reloader |
| city-guesser | Game | city-guesser |
| echo | Echo server | echo |
| url | URL shortener | url |
| excalidraw | Whiteboard | excalidraw |
| travel_blog | Travel blog | travel_blog |
| dashy | Dashboard | dashy |
| send | Firefox Send | send |
| ytdlp | YouTube downloader | ytdlp |
| wealthfolio | Finance tracking | wealthfolio |
| audiobookshelf | Audiobook server | audiobookshelf |
| paperless-ngx | Document management | paperless-ngx |
| jsoncrack | JSON visualizer | jsoncrack |
| servarr | Media automation (Sonarr/Radarr/etc) | servarr |
| ntfy | Push notifications | ntfy |
| cyberchef | Data transformation | cyberchef |
| diun | Docker image update notifier | diun |
| meshcentral | Remote management | meshcentral |
| homepage | Dashboard/startpage | homepage |
| matrix | Matrix chat server | matrix |
| linkwarden | Bookmark manager | linkwarden |
| changedetection | Web change detection | changedetection |
| tandoor | Recipe manager | tandoor |
| n8n | Workflow automation | n8n |
| real-estate-crawler | Property crawler | real-estate-crawler |
| tor-proxy | Tor proxy | tor-proxy |
| forgejo | Git forge | forgejo |
| freshrss | RSS reader | freshrss |
| navidrome | Music streaming | navidrome |
| networking-toolbox | Network tools | networking-toolbox |
| stirling-pdf | PDF tools | stirling-pdf |
| speedtest | Speed testing | speedtest |
| freedify | Music streaming (factory pattern) | freedify |
| netbox | Network documentation | netbox |
| infra-maintenance | Maintenance jobs | infra-maintenance |
| ollama | LLM server (GPU) | ollama |
| frigate | NVR/camera (GPU) | frigate |
| ebook2audiobook | E-book to audio (GPU) | ebook2audiobook |
| affine | Visual canvas/whiteboard (PostgreSQL + Redis) | affine |
| health | Apple Health data dashboard (PostgreSQL) | health |
| whisper | Wyoming Faster Whisper STT (CPU on GPU node) | whisper |
| grampsweb | Genealogy web app (Gramps Web) | grampsweb |
| openclaw | AI agent gateway (OpenClaw) | openclaw |
| poison-fountain | Anti-AI scraping (tarpit + poison) | poison-fountain |

## Cloudflare Domains

### Proxied (CDN + WAF enabled)

```
blog, hackmd, privatebin, url, echo, f1tv, excalidraw, send,
audiobookshelf, jsoncrack, ntfy, cyberchef, homepage, linkwarden,
changedetection, tandoor, n8n, stirling-pdf, dashy, city-guesser,
travel, netbox
```

### Non-Proxied (Direct DNS)

```
mail, wg, headscale, immich, calibre, vaultwarden, drone,
mailserver-antispam, mailserver-admin, webhook, uptime,
owntracks, dawarich, tuya, meshcentral, nextcloud, actualbudget,
onlyoffice, forgejo, freshrss, navidrome, ollama, openwebui,
isponsorblocktv, speedtest, freedify, rybbit, paperless,
servarr, prowlarr, bazarr, radarr, sonarr, flaresolverr,
jellyfin, jellyseerr, tdarr, affine, health, family, openclaw
```

### Special Subdomains

- `*.viktor.actualbudget` - Actualbudget factory instances
- `*.freedify` - Freedify factory instances
- `mailserver.*` - Mail server components (antispam, admin)

@@ -1,138 +0,0 @@
---
name: containerd-multi-registry-pull-through-cache
description: |
  Set up pull-through caches for multiple container registries (ghcr.io, quay.io,
  registry.k8s.io, reg.kyverno.io) using Docker Registry v2 instances. Use when:
  (1) ImagePullBackOff for non-Docker-Hub images routed through a wildcard mirror,
  (2) containerd has deprecated `registry.mirrors."*"` catching all image pulls,
  (3) need to add pull-through cache for a new upstream registry,
  (4) `mirrors` cannot be set when `config_path` is provided error in containerd,
  (5) containerd 1.6.x vs 1.7.x config_path compatibility issues.
  Docker Registry v2 can only proxy ONE upstream per instance, so multiple
  containers are needed for multiple registries.
author: Claude Code
version: 1.0.0
date: 2026-02-14
---

# Containerd Multi-Registry Pull-Through Cache

## Problem

Docker Registry v2 can only proxy **one upstream registry per instance**. A common
misconfiguration is using a containerd wildcard mirror (`registry.mirrors."*"`) pointing
to a single Docker Hub proxy, which breaks pulls from ghcr.io, quay.io, registry.k8s.io,
and other registries — they get routed to the Docker Hub proxy which can't serve them,
causing `ImagePullBackOff`.

## Context / Trigger Conditions

- `ImagePullBackOff` for images from ghcr.io, quay.io, registry.k8s.io, or other non-Docker-Hub registries
- Containerd config has deprecated `[plugins."io.containerd.grpc.v1.cri".registry.mirrors."*"]`
- Error: `failed to load plugin io.containerd.grpc.v1.cri: invalid plugin config: mirrors cannot be set when config_path is provided`
- Need to migrate from deprecated wildcard mirrors to modern `config_path` approach

## Solution

### 1. Run one Registry v2 container per upstream

Each upstream needs its own Docker Registry v2 instance on a different port:

| Port | Registry | Container Name |
|------|----------|----------------|
| 5000 | docker.io | registry |
| 5010 | ghcr.io | registry-ghcr |
| 5020 | quay.io | registry-quay |
| 5030 | registry.k8s.io | registry-k8s |
| 5040 | reg.kyverno.io | registry-kyverno |

Config for non-Docker-Hub proxies (no auth needed — they're public):

```yaml
version: 0.1
storage:
  cache:
    blobdescriptor: inmemory
  filesystem:
    rootdirectory: /var/lib/registry
http:
  addr: :5000
proxy:
  remoteurl: https://ghcr.io  # change per registry
```

```bash
docker run -p 5010:5000 -d --restart always --name registry-ghcr \
  -v /etc/docker-registry/ghcr/config.yml:/etc/docker/registry/config.yml registry:2
```

### 2. Replace deprecated wildcard mirror with `config_path`

Instead of:
```toml
# DEPRECATED - breaks non-Docker-Hub registries
[plugins."io.containerd.grpc.v1.cri".registry.mirrors."*"]
  endpoint = ["http://10.0.20.10:5000"]
```

Use the modern `config_path` approach:
```toml
[plugins."io.containerd.grpc.v1.cri".registry]
  config_path = "/etc/containerd/certs.d"
```

Then create per-registry `hosts.toml` files:
```bash
mkdir -p /etc/containerd/certs.d/docker.io
cat > /etc/containerd/certs.d/docker.io/hosts.toml <<'EOF'
server = "https://registry-1.docker.io"

[host."http://10.0.20.10:5000"]
  capabilities = ["pull", "resolve"]
EOF
```

Registries without a `hosts.toml` entry **fall through to direct pull** (no breakage).

### 3. Critical: `config_path` and `mirrors` cannot coexist

Containerd will **refuse to start the CRI plugin** if both `config_path` and any
`mirrors` entries exist in `config.toml`. You must remove ALL `mirrors` entries
(including the `[plugins."...registry.mirrors"]` parent section) before setting
`config_path`.

This is especially dangerous on containerd 1.6.x (used on older nodes like k8s-master)
where the config format is slightly different. If unsure, either:
- Don't use config_path on that node (skip the pull-through cache)
- Remove the entire `mirrors` section first, then add `config_path`

### 4. Static IP for registry VM

If the registry VM uses DHCP and gets the wrong IP, all mirrors break. Use static IP
via cloud-init `ipconfig0 = "ip=10.0.20.10/24,gw=10.0.20.1"` instead of DHCP.

## Verification

```bash
# Test each proxy responds
for port in 5000 5010 5020 5030 5040; do
  curl -s http://10.0.20.10:$port/v2/_catalog
done

# Test containerd can pull through cache
crictl pull ghcr.io/some/image:tag

# Check containerd logs for mirror usage
journalctl -u containerd --since "5 minutes ago" | grep -i "mirror\|registry"
```

## Notes

- **Fallback behavior**: If the local mirror is unreachable, containerd falls through to
  direct pull from the upstream `server` URL. This provides graceful degradation.
- **GC crontabs**: Add weekly garbage collection for each registry container, staggered
  to avoid I/O spikes.
- **Hourly restart**: Registry v2 has known memory leak issues; hourly restart mitigates.
- **Cache is ephemeral**: VM recreation clears the cache. Images re-cache on demand.

See also: `k8s-docker-registry-cache-bypass` (for stale cached image issues)

@@ -1,27 +1,33 @@
---
name: helm-release-force-rerender
name: helm-release-troubleshooting
description: |
  Fix for Helm releases managed by Terraform where changing Helm values doesn't update
  the actual Kubernetes resources. Use when: (1) Terraform applies successfully but
  K8s resources (Service, Deployment) don't reflect new Helm values,
  (2) New ports/volumes/containers from Helm chart values don't appear in the deployed resources,
  Troubleshoot and fix Helm release issues managed by Terraform. Use when:
  (1) Terraform applies successfully but K8s resources don't reflect new Helm values,
  (2) New ports/volumes/containers from Helm chart values don't appear in deployed resources,
  (3) helm upgrade --reuse-values doesn't re-render templates for structural changes,
  (4) Terraform thinks Helm release is up-to-date but actual K8s resources are stale.
  Solution involves removing from Terraform state, reimporting, and force upgrading.
  (4) Terraform thinks Helm release is up-to-date but actual K8s resources are stale,
  (5) terraform apply fails with "another operation (install/upgrade/rollback) is in progress",
  (6) helm history shows status "pending-upgrade" or "pending-rollback",
  (7) a Helm upgrade was interrupted by network timeout, etcd timeout, or VPN drop,
  (8) helm upgrade fails with "an error occurred while finding last successful release".
  Covers force re-rendering via state removal/reimport and stuck release recovery via
  secret cleanup.
author: Claude Code
version: 1.0.0
date: 2026-02-07
date: 2026-02-22
---

# Helm Release Force Re-render via Terraform
# Helm Release Troubleshooting

## Problem
## Force Re-render

### Problem
After changing Helm chart values in a Terraform `helm_release` resource, Terraform applies
successfully but the actual Kubernetes resources (Services, Deployments, etc.) don't reflect
the new values. For example, adding a new port in Helm values doesn't result in that port
appearing in the Service spec.

## Context / Trigger Conditions
### Context / Trigger Conditions
- Terraform `helm_release` applies with "1 changed" but `kubectl get svc -o yaml` shows
  the old configuration
- Structural changes to Helm values (new ports, new containers, new volumes) are not

@@ -30,7 +36,7 @@ appearing in the Service spec.
- Common with Traefik, ingress-nginx, and other charts where template logic conditionally
  includes resources based on values

## Root Cause
### Root Cause
Terraform's `helm_release` resource uses `helm upgrade` under the hood. When values are
changed, Helm may use `--reuse-values` behavior where it merges new values into existing
ones rather than doing a full template re-render. For structural changes (like enabling

@@ -41,9 +47,9 @@ Additionally, Terraform may see the stored Helm release state as matching the de
even though the actual Kubernetes resources don't reflect it, creating a state drift that
Terraform doesn't detect.

## Solution
### Solution

### Step 1: Verify the Discrepancy
#### Step 1: Verify the Discrepancy

Confirm that K8s resources don't match Helm values:
```bash

@@ -55,7 +61,7 @@ helm get values <release-name> -n <namespace>
helm get manifest <release-name> -n <namespace> | grep -A10 "<expected-config>"
```

### Step 2: Remove Helm Release from Terraform State
#### Step 2: Remove Helm Release from Terraform State

```bash
terraform state rm 'module.kubernetes_cluster.module.<service>.helm_release.<name>'
```

@@ -64,7 +70,7 @@ terraform state rm 'module.kubernetes_cluster.module.<service>.helm_release.<nam
**IMPORTANT**: This only removes from Terraform state. The actual Helm release and K8s
resources remain untouched in the cluster.

### Step 3: Import the Helm Release Back
#### Step 3: Import the Helm Release Back

```bash
terraform import 'module.kubernetes_cluster.module.<service>.helm_release.<name>' '<namespace>/<release-name>'
```

@@ -72,7 +78,7 @@ terraform import 'module.kubernetes_cluster.module.<service>.helm_release.<name>

For Helm releases, the import ID format is `namespace/release-name`.

### Step 4: Force Apply with Terraform
#### Step 4: Force Apply with Terraform

After reimporting, run terraform apply. Terraform should now detect the drift between
the desired Helm values and the actual release state:

@@ -87,7 +93,7 @@ terraform taint 'module.kubernetes_cluster.module.<service>.helm_release.<name>'
```bash
terraform apply -target=module.kubernetes_cluster.module.<service>
```

### Step 5: Manual Helm Force Upgrade (Last Resort)
#### Step 5: Manual Helm Force Upgrade (Last Resort)

If Terraform still doesn't fix it, use Helm directly as a one-time fix, then reimport:

@@ -109,7 +115,7 @@ terraform apply -target=module.kubernetes_cluster.module.<service>
**WARNING**: Direct Helm operations bypass Terraform. Always reimport into Terraform state
afterward, and use `terraform apply` to verify Terraform is back in sync.

## Verification
### Verification

```bash
# Check the K8s resources now match expected configuration

@@ -121,7 +127,7 @@ terraform plan -target=module.kubernetes_cluster.module.<service>
# Should show "No changes" or minimal expected drift
```

## Example: Traefik HTTP/3 UDP Port Not Appearing
### Example: Traefik HTTP/3 UDP Port Not Appearing

**Problem**: Added `http3.enabled=true` to Traefik Helm values. Terraform applied
successfully, but the Traefik Service only had TCP port 443, missing the expected

@@ -143,21 +149,102 @@ kubectl get svc traefik -n traefik -o yaml | grep -A3 "websecure-http3"
```bash
# Should show: port: 443, protocol: UDP
```

## Notes
### Notes

- This issue is more common with structural Helm value changes (new ports, new sidecars,
  conditional template blocks) than with simple value changes (image tags, replica counts)
- The `helm upgrade --force` flag deletes and recreates resources that have changed,
  which causes brief downtime. Use with caution on production ingress controllers.
- Always verify with `terraform plan` after fixing to ensure Terraform state is consistent
- This is different from the `terraform-state-identity-mismatch` skill, which covers
  provider-level identity errors. This skill covers Helm template rendering issues where
  the state looks correct but the actual resources don't match.

---

## Stuck Release Recovery

### Problem
Helm releases can get stuck in `pending-upgrade`, `pending-rollback`, or `pending-install`
states when an upgrade is interrupted (network drop, etcd timeout, resource exhaustion).
Subsequent upgrades or terraform applies fail because Helm thinks an operation is in progress.

### Context / Trigger Conditions
- `terraform apply` fails with: `another operation (install/upgrade/rollback) is in progress`
- `helm history <release> -n <namespace>` shows `pending-upgrade`, `pending-rollback`, or `pending-install`
- A previous Helm upgrade was interrupted by network timeout, VPN drop, or etcd timeout
- `helm upgrade` fails with: `an error occurred while finding last successful release`

### Solution

#### Step 1: Identify the stuck release
```bash
helm --kubeconfig $(pwd)/config history <release> -n <namespace> | tail -5
```

Look for revisions with status `pending-upgrade`, `pending-rollback`, or `pending-install`.

#### Step 2: Delete the stuck Helm release secrets
Each Helm revision is stored as a Kubernetes secret named `sh.helm.release.v1.<release>.v<revision>`.
Delete all stuck revisions:

```bash
# Delete specific stuck revision (e.g., revision 5)
kubectl --kubeconfig $(pwd)/config delete secret sh.helm.release.v1.<release>.v5 -n <namespace>

# If multiple stuck revisions exist, delete all of them
kubectl --kubeconfig $(pwd)/config delete secret sh.helm.release.v1.<release>.v6 -n <namespace>
```
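
Rather than guessing revision numbers, stuck revisions can be listed directly: Helm 3 stores `owner`, `name`, and `status` labels on each release secret. A sketch:

```bash
# List all release secrets for the release together with their status labels
kubectl --kubeconfig $(pwd)/config get secret -n <namespace> \
  -l "owner=helm,name=<release>" \
  -o custom-columns='NAME:.metadata.name,STATUS:.metadata.labels.status'
```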

#### Step 3: Verify the release is clean
```bash
helm --kubeconfig $(pwd)/config history <release> -n <namespace> | tail -3
```

The latest revision should now show `deployed` status.

#### Step 4: Retry the upgrade
```bash
terraform apply -target=module.kubernetes_cluster.module.<service> -var="kube_config_path=$(pwd)/config" -auto-approve
```

### Important Notes

- **Never patch the secret labels** (e.g., changing `status: pending-rollback` to `status: failed`).
  This changes the label but not the encoded release data inside the secret, leaving Helm in an
  inconsistent state. Always delete the stuck secrets entirely.
- If the failed upgrade partially applied changes to the cluster (e.g., modified a Deployment),
  the next successful upgrade will reconcile the state.
- When VPN/network is unstable, prefer direct `helm upgrade --reuse-values --set key=value`
  over `terraform apply`, since Helm upgrades are faster than the full Terraform refresh cycle.

### Verification
After deleting stuck secrets and re-applying:
- `helm history` shows the new revision as `deployed`
- `terraform apply` completes without errors

### Example
```bash
# Helm history shows stuck state
$ helm history nextcloud -n nextcloud | tail -3
4   deployed          nextcloud-8.8.1   Upgrade complete
5   failed            nextcloud-8.8.1   Upgrade failed: etcd timeout
6   pending-rollback  nextcloud-8.8.1   Rollback to 4

# Fix: delete stuck revisions
$ kubectl delete secret sh.helm.release.v1.nextcloud.v5 sh.helm.release.v1.nextcloud.v6 -n nextcloud

# Verify clean state
$ helm history nextcloud -n nextcloud | tail -1
4   deployed          nextcloud-8.8.1   Upgrade complete

# Re-apply
$ terraform apply -target=module.kubernetes_cluster.module.nextcloud -auto-approve
```

---

## See Also

- `terraform-state-identity-mismatch` - For Terraform provider identity errors
- `traefik-http3-quic` - For enabling HTTP/3 on Traefik (common trigger for this issue)
- `traefik-http3-quic` - For enabling HTTP/3 on Traefik (common trigger for force re-render)

## References

@@ -1,93 +0,0 @@
---
name: helm-stuck-release-recovery
description: |
  Fix Helm releases stuck in pending-upgrade, pending-rollback, or pending-install states.
  Use when: (1) terraform apply fails with "another operation (install/upgrade/rollback) is
  in progress", (2) helm history shows status "pending-upgrade" or "pending-rollback",
  (3) a Helm upgrade was interrupted by network timeout, etcd timeout, or VPN drop,
  (4) helm upgrade fails with "an error occurred while finding last successful release".
  Covers manual secret cleanup to restore Helm release to a deployable state.
author: Claude Code
version: 1.0.0
date: 2026-02-15
---

# Helm Stuck Release Recovery

## Problem
Helm releases can get stuck in `pending-upgrade`, `pending-rollback`, or `pending-install`
states when an upgrade is interrupted (network drop, etcd timeout, resource exhaustion).
Subsequent upgrades or terraform applies fail because Helm thinks an operation is in progress.

## Context / Trigger Conditions
- `terraform apply` fails with: `another operation (install/upgrade/rollback) is in progress`
- `helm history <release> -n <namespace>` shows `pending-upgrade`, `pending-rollback`, or `pending-install`
- A previous Helm upgrade was interrupted by network timeout, VPN drop, or etcd timeout
- `helm upgrade` fails with: `an error occurred while finding last successful release`

## Solution

### Step 1: Identify the stuck release
```bash
helm --kubeconfig $(pwd)/config history <release> -n <namespace> | tail -5
```

Look for revisions with status `pending-upgrade`, `pending-rollback`, or `pending-install`.

### Step 2: Delete the stuck Helm release secrets
Each Helm revision is stored as a Kubernetes secret named `sh.helm.release.v1.<release>.v<revision>`.
Delete all stuck revisions:

```bash
# Delete specific stuck revision (e.g., revision 5)
kubectl --kubeconfig $(pwd)/config delete secret sh.helm.release.v1.<release>.v5 -n <namespace>

# If multiple stuck revisions exist, delete all of them
kubectl --kubeconfig $(pwd)/config delete secret sh.helm.release.v1.<release>.v6 -n <namespace>
```

### Step 3: Verify the release is clean
```bash
helm --kubeconfig $(pwd)/config history <release> -n <namespace> | tail -3
```

The latest revision should now show `deployed` status.

### Step 4: Retry the upgrade
```bash
terraform apply -target=module.kubernetes_cluster.module.<service> -var="kube_config_path=$(pwd)/config" -auto-approve
```

## Important Notes

- **Never patch the secret labels** (e.g., changing `status: pending-rollback` to `status: failed`).
  This changes the label but not the encoded release data inside the secret, leaving Helm in an
  inconsistent state. Always delete the stuck secrets entirely.
- If the failed upgrade partially applied changes to the cluster (e.g., modified a Deployment),
  the next successful upgrade will reconcile the state.
- When VPN/network is unstable, prefer direct `helm upgrade --reuse-values --set key=value`
  over `terraform apply`, since Helm upgrades are faster than the full Terraform refresh cycle.

## Verification
After deleting stuck secrets and re-applying:
- `helm history` shows the new revision as `deployed`
- `terraform apply` completes without errors

## Example
```bash
# Helm history shows stuck state
$ helm history nextcloud -n nextcloud | tail -3
4   deployed          nextcloud-8.8.1   Upgrade complete
5   failed            nextcloud-8.8.1   Upgrade failed: etcd timeout
6   pending-rollback  nextcloud-8.8.1   Rollback to 4

# Fix: delete stuck revisions
$ kubectl delete secret sh.helm.release.v1.nextcloud.v5 sh.helm.release.v1.nextcloud.v6 -n nextcloud

# Verify clean state
$ helm history nextcloud -n nextcloud | tail -1
4   deployed          nextcloud-8.8.1   Upgrade complete

# Re-apply
$ terraform apply -target=module.kubernetes_cluster.module.nextcloud -auto-approve
```

244 .claude/skills/k8s-container-image-caching/SKILL.md Normal file
@@ -0,0 +1,244 @@

---
name: k8s-container-image-caching
description: |
  Set up and troubleshoot container image pull-through caches in Kubernetes. Use when:
  (1) ImagePullBackOff for non-Docker-Hub images routed through a wildcard mirror,
  (2) containerd has deprecated `registry.mirrors."*"` catching all image pulls,
  (3) need to add pull-through cache for a new upstream registry,
  (4) `mirrors` cannot be set when `config_path` is provided error in containerd,
  (5) containerd 1.6.x vs 1.7.x config_path compatibility issues,
  (6) kubectl shows correct image tag but container runs old code,
  (7) local registry mirror caches stale images,
  (8) imagePullPolicy: Always doesn't force fresh pulls,
  (9) containerd config has mirror that intercepts pulls serving stale images.
  Covers multi-registry pull-through cache setup (Docker Registry v2) and cache bypass
  via image digest pinning.
author: Claude Code
version: 1.0.0
date: 2026-02-22
---

# Kubernetes Container Image Caching

## Pull-Through Cache Setup

### Problem

Docker Registry v2 can only proxy **one upstream registry per instance**. A common
misconfiguration is using a containerd wildcard mirror (`registry.mirrors."*"`) pointing
to a single Docker Hub proxy, which breaks pulls from ghcr.io, quay.io, registry.k8s.io,
and other registries -- they get routed to the Docker Hub proxy which can't serve them,
causing `ImagePullBackOff`.

### Context / Trigger Conditions

- `ImagePullBackOff` for images from ghcr.io, quay.io, registry.k8s.io, or other non-Docker-Hub registries
- Containerd config has deprecated `[plugins."io.containerd.grpc.v1.cri".registry.mirrors."*"]`
- Error: `failed to load plugin io.containerd.grpc.v1.cri: invalid plugin config: mirrors cannot be set when config_path is provided`
- Need to migrate from deprecated wildcard mirrors to modern `config_path` approach

### Solution

#### 1. Run one Registry v2 container per upstream

Each upstream needs its own Docker Registry v2 instance on a different port:

| Port | Registry | Container Name |
|------|----------|----------------|
| 5000 | docker.io | registry |
| 5010 | ghcr.io | registry-ghcr |
| 5020 | quay.io | registry-quay |
| 5030 | registry.k8s.io | registry-k8s |
| 5040 | reg.kyverno.io | registry-kyverno |

Config for non-Docker-Hub proxies (no auth needed -- they're public):

```yaml
version: 0.1
storage:
  cache:
    blobdescriptor: inmemory
  filesystem:
    rootdirectory: /var/lib/registry
http:
  addr: :5000
proxy:
  remoteurl: https://ghcr.io  # change per registry
```

```bash
docker run -p 5010:5000 -d --restart always --name registry-ghcr \
  -v /etc/docker-registry/ghcr/config.yml:/etc/docker/registry/config.yml registry:2
```

#### 2. Replace deprecated wildcard mirror with `config_path`

Instead of:
```toml
# DEPRECATED - breaks non-Docker-Hub registries
[plugins."io.containerd.grpc.v1.cri".registry.mirrors."*"]
  endpoint = ["http://10.0.20.10:5000"]
```

Use the modern `config_path` approach:
```toml
[plugins."io.containerd.grpc.v1.cri".registry]
  config_path = "/etc/containerd/certs.d"
```

Then create per-registry `hosts.toml` files:
```bash
mkdir -p /etc/containerd/certs.d/docker.io
cat > /etc/containerd/certs.d/docker.io/hosts.toml <<'EOF'
server = "https://registry-1.docker.io"

[host."http://10.0.20.10:5000"]
  capabilities = ["pull", "resolve"]
EOF
```
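
Since every upstream follows the same two-line pattern, the per-registry files can be generated in one loop; a sketch assuming the port mapping from the table above:

```bash
# registry:port pairs from the table above
for entry in docker.io:5000 ghcr.io:5010 quay.io:5020 registry.k8s.io:5030 reg.kyverno.io:5040; do
  reg=${entry%%:*}; port=${entry##*:}
  mkdir -p /etc/containerd/certs.d/$reg
  # docker.io's real upstream is registry-1.docker.io; all others match their own name
  server=$([ "$reg" = docker.io ] && echo registry-1.docker.io || echo $reg)
  cat > /etc/containerd/certs.d/$reg/hosts.toml <<EOF
server = "https://$server"

[host."http://10.0.20.10:$port"]
  capabilities = ["pull", "resolve"]
EOF
done
```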

Registries without a `hosts.toml` entry **fall through to direct pull** (no breakage).

#### 3. Critical: `config_path` and `mirrors` cannot coexist

Containerd will **refuse to start the CRI plugin** if both `config_path` and any
`mirrors` entries exist in `config.toml`. You must remove ALL `mirrors` entries
(including the `[plugins."...registry.mirrors"]` parent section) before setting
`config_path`.

This is especially dangerous on containerd 1.6.x (used on older nodes like k8s-master)
where the config format is slightly different. If unsure, either:
- Don't use config_path on that node (skip the pull-through cache)
- Remove the entire `mirrors` section first, then add `config_path`

#### 4. Static IP for registry VM

If the registry VM uses DHCP and gets the wrong IP, all mirrors break. Use static IP
via cloud-init `ipconfig0 = "ip=10.0.20.10/24,gw=10.0.20.1"` instead of DHCP.

### Verification

```bash
# Test each proxy responds
for port in 5000 5010 5020 5030 5040; do
  curl -s http://10.0.20.10:$port/v2/_catalog
done

# Test containerd can pull through cache
crictl pull ghcr.io/some/image:tag

# Check containerd logs for mirror usage
journalctl -u containerd --since "5 minutes ago" | grep -i "mirror\|registry"
```

### Notes

- **Fallback behavior**: If the local mirror is unreachable, containerd falls through to
  direct pull from the upstream `server` URL. This provides graceful degradation.
- **GC crontabs**: Add weekly garbage collection for each registry container, staggered
  to avoid I/O spikes.
- **Hourly restart**: Registry v2 has known memory leak issues; hourly restart mitigates.
- **Cache is ephemeral**: VM recreation clears the cache. Images re-cache on demand.

---

## Cache Bypass / Stale Image Fix

### Problem
Kubernetes pods continue running old Docker images even after pushing new versions with
the same tag (e.g., `:latest`). This happens when a local registry mirror caches images
and serves stale versions, ignoring `imagePullPolicy: Always`.

### Context / Trigger Conditions
- Pod is running but application code is outdated
- `docker push` succeeded with new layers
- `kubectl describe pod` shows correct image tag
- Cluster has a local registry mirror configured (e.g., in containerd config)
- `imagePullPolicy: Always` doesn't fix the issue
- Nodes configured with registry mirrors at `/etc/containerd/certs.d/` or similar

### Solution

#### 1. Get the image digest after pushing
```bash
docker push viktorbarzin/myimage:latest
# Output includes: latest: digest: sha256:abc123... size: 856
```

#### 2. Use digest instead of tag in deployment
```hcl
# Terraform
container {
  # Use digest to bypass local registry cache
  image             = "docker.io/viktorbarzin/myimage@sha256:abc123..."
  image_pull_policy = "Always"
  name              = "myimage"
}
```

```yaml
# Kubernetes YAML
containers:
  - name: myimage
    image: docker.io/viktorbarzin/myimage@sha256:abc123...
    imagePullPolicy: Always
```

#### 3. Apply and restart
```bash
terraform apply -target=module.kubernetes_cluster.module.myservice
kubectl rollout restart deployment/myservice -n mynamespace
```

### Why This Works
- Registry mirrors match by tag, not digest
- When you specify a digest, the node must fetch that exact manifest
- The mirror may not have the digest cached, forcing a pull from upstream
- Even if cached, the digest guarantees the exact image version

### Verification
```bash
# Check the pod is using the new image
kubectl get pod -n mynamespace -o jsonpath='{.items[*].spec.containers[*].image}'

# Verify application behavior reflects new code
kubectl exec -n mynamespace deploy/myservice -- <verification-command>
```

### Example

Before (problematic):
```hcl
image = "docker.io/viktorbarzin/audiblez-web:latest"
```

After (fixed):
```hcl
image = "docker.io/viktorbarzin/audiblez-web@sha256:4d0e2c839555e2229bc91a0b1273569bac88529e8b3c3cadad3c3cf9d865fa29"
```

### Notes
- You must update the digest each time you push a new image
- Consider automating digest extraction in CI/CD pipelines (see the sketch after this list)
- This is a workaround; ideally fix the registry mirror configuration
- To find your registry mirror config: `cat /etc/containerd/config.toml` on nodes
- Common mirror locations: `/etc/containerd/certs.d/docker.io/hosts.toml`
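
One way to automate that in CI, a sketch using `docker inspect` on the just-pushed tag (the image name and the `-var` wiring are placeholders):

```bash
# Push, then resolve the tag to its immutable digest reference
docker push docker.io/viktorbarzin/myimage:latest
IMAGE_REF=$(docker inspect --format='{{index .RepoDigests 0}}' docker.io/viktorbarzin/myimage:latest)
echo "$IMAGE_REF"   # e.g. viktorbarzin/myimage@sha256:... (host prefix may be omitted)

# Feed the pinned reference into the deploy step, e.g. as a Terraform variable
terraform apply -target=module.kubernetes_cluster.module.myservice -var="image=$IMAGE_REF"
```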

### Diagnosing Registry Mirror Issues
```bash
# On a k8s node, check containerd config
cat /etc/containerd/config.toml | grep -A5 mirrors

# Check if mirror is intercepting
crictl pull docker.io/library/alpine:latest --debug 2>&1 | grep -i mirror

# List cached images on node
crictl images | grep myimage
```

---

## References

- [Kubernetes imagePullPolicy documentation](https://kubernetes.io/docs/concepts/containers/images/#image-pull-policy)
- [containerd registry configuration](https://github.com/containerd/containerd/blob/main/docs/hosts.md)

@@ -1,110 +0,0 @@
---
name: k8s-docker-registry-cache-bypass
description: |
  Fix for Kubernetes pods running old Docker images despite pushing new versions.
  Use when: (1) kubectl shows correct image tag but container runs old code,
  (2) Local registry mirror caches stale images, (3) imagePullPolicy: Always
  doesn't force fresh pulls, (4) containerd config has mirror that intercepts pulls.
  Solution: Use image digest instead of tag to bypass cache entirely.
author: Claude Code
version: 1.0.0
date: 2025-01-31
---

# Kubernetes Docker Registry Cache Bypass

## Problem
Kubernetes pods continue running old Docker images even after pushing new versions with
the same tag (e.g., `:latest`). This happens when a local registry mirror caches images
and serves stale versions, ignoring `imagePullPolicy: Always`.

## Context / Trigger Conditions
- Pod is running but application code is outdated
- `docker push` succeeded with new layers
- `kubectl describe pod` shows correct image tag
- Cluster has a local registry mirror configured (e.g., in containerd config)
- `imagePullPolicy: Always` doesn't fix the issue
- Nodes configured with registry mirrors at `/etc/containerd/certs.d/` or similar

## Solution

### 1. Get the image digest after pushing
```bash
docker push viktorbarzin/myimage:latest
# Output includes: latest: digest: sha256:abc123... size: 856
```

### 2. Use digest instead of tag in deployment
```hcl
# Terraform
container {
  # Use digest to bypass local registry cache
  image             = "docker.io/viktorbarzin/myimage@sha256:abc123..."
  image_pull_policy = "Always"
  name              = "myimage"
}
```

```yaml
# Kubernetes YAML
containers:
  - name: myimage
    image: docker.io/viktorbarzin/myimage@sha256:abc123...
    imagePullPolicy: Always
```

### 3. Apply and restart
```bash
terraform apply -target=module.kubernetes_cluster.module.myservice
kubectl rollout restart deployment/myservice -n mynamespace
```

## Why This Works
- Registry mirrors match by tag, not digest
- When you specify a digest, the node must fetch that exact manifest
- The mirror may not have the digest cached, forcing a pull from upstream
- Even if cached, the digest guarantees the exact image version

## Verification
```bash
# Check the pod is using the new image
kubectl get pod -n mynamespace -o jsonpath='{.items[*].spec.containers[*].image}'

# Verify application behavior reflects new code
kubectl exec -n mynamespace deploy/myservice -- <verification-command>
```

## Example

Before (problematic):
```hcl
image = "docker.io/viktorbarzin/audiblez-web:latest"
```

After (fixed):
```hcl
image = "docker.io/viktorbarzin/audiblez-web@sha256:4d0e2c839555e2229bc91a0b1273569bac88529e8b3c3cadad3c3cf9d865fa29"
```

## Notes
- You must update the digest each time you push a new image
- Consider automating digest extraction in CI/CD pipelines
- This is a workaround; ideally fix the registry mirror configuration
- To find your registry mirror config: `cat /etc/containerd/config.toml` on nodes
- Common mirror locations: `/etc/containerd/certs.d/docker.io/hosts.toml`

## Diagnosing Registry Mirror Issues
```bash
# On a k8s node, check containerd config
cat /etc/containerd/config.toml | grep -A5 mirrors

# Check if mirror is intercepting
crictl pull docker.io/library/alpine:latest --debug 2>&1 | grep -i mirror

# List cached images on node
crictl images | grep myimage
```
|
||||
|
||||
## References
|
||||
- [Kubernetes imagePullPolicy documentation](https://kubernetes.io/docs/concepts/containers/images/#image-pull-policy)
|
||||
- [containerd registry configuration](https://github.com/containerd/containerd/blob/main/docs/hosts.md)
|
||||