Claude Code — Project Configuration
Shared knowledge: Read `AGENTS.md` at repo root for architecture, patterns, rules, and operations. This file adds Claude-specific features on top.
Claude-Specific Resources
- Skills: `.claude/skills/` (7 active). Archived runbooks: `.claude/skills/archived/`
- Agents: `.claude/agents/cluster-health-checker` (haiku, autonomous health checks)
- Reference: `.claude/reference/`: patterns.md, service-catalog.md, proxmox-inventory.md, github-api.md, authentik-state.md
- GitHub API: `curl` with tokens from tfvars (`gh` CLI blocked by sandbox)
Instructions
- "remember X": Use `memory-tool store "content" --category facts --tags "tag1,tag2"` (via exec) for persistent cross-session memory. Also update this file + `AGENTS.md` (if shared knowledge), commit with `[ci skip]`. To recall: `memory-tool recall "query"`. To list: `memory-tool list`. To delete: `memory-tool delete <id>`. The native `memory_search` and `memory_get` tools are also available for searching indexed memory files. For storing new memories, always use the `memory-tool` CLI via exec.
- Apply: Authenticate via `vault login -method=oidc`, then use `scripts/tg` (preferred, handles state decrypt/encrypt) or `terragrunt` directly. `scripts/tg` adds `-auto-approve` for `--non-interactive` applies.
- New services need CI/CD and monitoring (Prometheus/Uptime Kuma)
- New service: Use `setup-project` skill for full workflow
- Ingress: `ingress_factory` module. Auth: `protected = true`. Anti-AI: on by default.
- Docker images: Always build for `linux/amd64`. Use 8-char git SHA tags, because `:latest` causes stale pull-through cache.
- LinuxServer.io containers: `DOCKER_MODS` runs apt-get on every start, so bake slow mods into a custom image (`RUN /docker-mods || true` then `ENV DOCKER_MODS=`). Set `NO_CHOWN=true` to skip recursive chown that hangs on NFS mounts.
- Node memory changes: When changing VM memory on any k8s node, update kubelet `systemReserved`, `kubeReserved`, and eviction thresholds accordingly. Config: `/var/lib/kubelet/config.yaml`. Template: `stacks/infra/main.tf`. Current values: systemReserved=512Mi, kubeReserved=512Mi, evictionHard=500Mi, evictionSoft=1Gi.
- Sealed Secrets: User-managed secrets go in `sealed-*.yaml` files in the stack directory. Stacks pick them up via `kubernetes_manifest` + `fileset(path.module, "sealed-*.yaml")`. See AGENTS.md for full workflow.
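The Docker image rule above (build for `linux/amd64`, tag with an 8-char git SHA) can be sketched as a pair of shell helpers. The function names `image_ref` and `build_image` and the repository name in the usage note are hypothetical, not part of this repo.

```shell
# image_ref REPO FULL_SHA
# Prints REPO:<first 8 characters of FULL_SHA>; %.8s truncates the string.
image_ref() {
  printf '%s:%.8s\n' "$1" "$2"
}

# build_image REPO FULL_SHA [CONTEXT_DIR]
# Builds for linux/amd64 and tags with the short SHA instead of :latest.
build_image() {
  docker build --platform linux/amd64 -t "$(image_ref "$1" "$2")" "${3:-.}"
}
```

Possible usage, assuming a placeholder repository name: `build_image example/app "$(git rev-parse HEAD)"`.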
Terraform State — SOPS-Encrypted in Git
- State is local (`backend "local"`), encrypted with SOPS and committed as `.tfstate.enc` files.
- Decrypt priority: Vault Transit (primary, uses existing `vault login` session) → age key fallback (`~/.config/sops/age/keys.txt`, for bootstrap/DR).
- Encrypt: Always encrypts to both Vault Transit (`transit/keys/sops-state`) + age recipients.
- Scripts: `scripts/state-sync {encrypt|decrypt|commit} [stack]` handles all state sync. `scripts/tg` auto-decrypts before and auto-encrypts+commits after mutating ops (apply/destroy/import).
- Workflow: `git pull` → `scripts/tg plan` → `scripts/tg apply` → `git push`. State sync is transparent.
- Config: `.sops.yaml` at repo root defines encryption rules. age public keys listed there.
- Backups disabled: `terragrunt.hcl` passes `-backup=-` to prevent `.backup` file accumulation.
- Adding operator: Generate age key (`age-keygen`), add pubkey to `.sops.yaml`, run `sops updatekeys` on all `.enc` files.
- Two workstations: Laptop (macOS) + DevVM (10.0.10.10, Linux). Both have age keys + Vault access. Keys backed up in Vault (`secret/viktor/sops_age_key_laptop`, `sops_age_key_devvm`).
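For the "Adding operator" step, re-keying every encrypted state file might look like the sketch below. The helper names are hypothetical, and the sketch assumes the encrypted files are the only `*.enc` files under the given directory; the new pubkey must already be in `.sops.yaml`.

```shell
# list_encrypted_states [ROOT]
# Prints every *.enc file under ROOT (default: current directory), sorted.
list_encrypted_states() {
  find "${1:-.}" -type f -name '*.enc' | sort
}

# rekey_states [ROOT]
# Runs `sops updatekeys` on each encrypted file; -y skips the confirmation prompt.
rekey_states() {
  find "${1:-.}" -type f -name '*.enc' -exec sops updatekeys -y {} \;
}
```

Running `list_encrypted_states` first gives a preview of what `rekey_states` would touch.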
Secrets Management — Vault KV
- Vault is the sole source of truth for secrets.
- `secret/viktor`: go-to path for ALL personal secrets (135 keys). Contains every API key, token, password, SSH key, and config from the old terraform.tfvars. Check here first: `vault kv get -field=KEY secret/viktor`.
- Auth: `vault login -method=oidc` (Authentik SSO) → `~/.vault-token` → read by Vault TF provider.
- Vault stack self-reads: `data "vault_kv_secret_v2" "vault"` reads its own OIDC creds from `secret/vault`.
- ESO (External Secrets Operator): `stacks/external-secrets/` (43 ExternalSecrets + 9 DB-creds ExternalSecrets). API version `v1beta1`. Two ClusterSecretStores: `vault-kv` and `vault-database`.
- Plan-time pattern: Former plan-time stacks use `data "kubernetes_secret"` to read ESO-created K8s Secrets at plan time (no Vault dependency). First-apply gotcha: must `terragrunt apply -target=kubernetes_manifest.external_secret` first, then full apply. `count` on resources using secret values fails, so remove conditional counts.
- 14 hybrid stacks still keep `data "vault_kv_secret_v2"` for plan-time needs (job commands, Helm templatefile, module inputs). Platform has 48 plan-time refs; no migration possible without restructuring modules.
- Database rotation: Vault DB engine rotates passwords every 24h. MySQL: speedtest, wrongmove, codimd, nextcloud, shlink, grafana. PostgreSQL: trading, health, linkwarden, affine, woodpecker, claude_memory. Excluded: authentik (PgBouncer), technitium/crowdsec (Helm-baked), root users.
- K8s credentials: Vault K8s secrets engine. Roles: `dashboard-admin`, `ci-deployer`, `openclaw`, `local-admin`. Use `vault write kubernetes/creds/ROLE kubernetes_namespace=NS`. Helper: `scripts/vault-kubeconfig`.
- CI/CD (GHA + Woodpecker): Docker builds run on GitHub Actions (free on public repos). Woodpecker is deploy-only: it receives the image tag via API POST and runs `kubectl set image`. Woodpecker authenticates via K8s SA JWT → Vault K8s auth. Sync CronJob pushes `secret/ci/global` → Woodpecker API every 6h. Shell scripts in HCL heredocs: escape `${` → `$${` and `%{` → `%%{`.
- Platform cannot depend on vault (circular). Apply order: vault first, then platform. Platform has 48 vault refs, all in module inputs; no ESO migration possible.
- Complex types (maps/lists like `homepage_credentials`, `k8s_users`) stored as JSON strings in KV, decoded with `jsondecode()` in consuming stack `locals` blocks.
- New stacks: Add secret in Vault UI/CLI at `secret/<stack-name>`, add ExternalSecret + `data "kubernetes_secret"` for plan-time, `secret_key_ref` for env vars. Use `data "vault_kv_secret_v2"` only if `data "kubernetes_secret"` won't work (e.g., first-apply bootstrap).
- Backup CronJob: `vault-raft-backup` uses manually-created `vault-root-token` K8s Secret (independent of automation).
- Bootstrap (fresh cluster): Comment out data source + OIDC → apply Helm → init+unseal → populate `secret/vault` → uncomment → re-apply.
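The heredoc escaping rule from the CI/CD bullet can be applied mechanically. This is a minimal sketch, assuming Terraform's template syntax, where only the sequences `${` and `%{` start interpolation and are written literally as `$${` and `%%{`. The function name `escape_hcl_heredoc` is hypothetical.

```shell
# escape_hcl_heredoc
# Reads a shell script on stdin and prints it with Terraform template
# sequences escaped, so it can be pasted into an HCL heredoc.
# A bare $VAR is left alone; only ${ and %{ are doubled.
# Note: input that is already escaped would be escaped again.
escape_hcl_heredoc() {
  sed -e 's/\${/$${/g' -e 's/%{/%%{/g'
}
```

Possible usage, with a placeholder file name: `escape_hcl_heredoc < sync.sh`.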
Resource Management Patterns
- CPU: All CPU limits removed cluster-wide (CFS throttling). Only set CPU requests based on actual usage.
- Memory: Set explicit `requests=limits` based on VPA upperBound. Target: upperBound x 1.2 for stable services, x 1.3 for GPU/volatile workloads.
- VPA (Goldilocks): Must be `Initial` mode (not `Auto`), because Auto conflicts with Terraform's declarative resource management.
- LimitRange: Tier-based defaults silently apply to pods with `resources: {}`. Always set explicit resources on containers needing more than defaults. Tier 3-edge and 4-aux now use Burstable QoS (request < limit) to reduce scheduler pressure.
- Democratic-CSI sidecars: Must set explicit resources (32-80Mi) in Helm values, since 17 sidecars default to 256Mi each via LimitRange. `csiProxy` is a TOP-LEVEL chart key, not nested under controller/node.
- ResourceQuota blocks rolling updates: When quota is tight, scale to 0 then back to 1 instead of RollingUpdate. Or use Recreate strategy.
- Kyverno ndots drift: Kyverno injects dns_config on all pods. Add `lifecycle { ignore_changes = [spec[0].template[0].spec[0].dns_config] }` to kubernetes_deployment resources to prevent perpetual TF plan drift.
- NVIDIA GPU operator resources: dcgm-exporter and cuda-validator resources configurable via `dcgmExporter.resources` and `validator.resources` in nvidia values.yaml.
- Pin database versions: Disable Diun (image update monitoring) for MySQL, PostgreSQL, Redis.
- Quarterly right-sizing: Check Goldilocks dashboard. Compare VPA upperBound to current request. Also check for under-provisioned (VPA upper > request x 0.8).
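The memory sizing and right-sizing rules above reduce to simple arithmetic. The sketch below works in whole Mi with integer math and rounds up, which is an assumption. The function names and the `stable`/`volatile` class labels are hypothetical.

```shell
# size_memory_mi UPPER_BOUND_MI [stable|volatile]
# Prints the requests=limits value in Mi: upperBound x 1.2 for stable
# services, x 1.3 for GPU/volatile workloads, rounded up.
size_memory_mi() {
  upper_mi=$1
  case "${2:-stable}" in
    stable)   factor_tenths=12 ;;
    volatile) factor_tenths=13 ;;
    *) echo "unknown workload class: $2" >&2; return 1 ;;
  esac
  echo $(( (upper_mi * factor_tenths + 9) / 10 ))
}

# is_under_provisioned UPPER_BOUND_MI REQUEST_MI
# Succeeds when VPA upper > request x 0.8 (compared as upper*10 > request*8).
is_under_provisioned() {
  [ $(( $1 * 10 )) -gt $(( $2 * 8 )) ]
}
```

For example, an upperBound of 500Mi gives 600Mi for a stable service and 650Mi for a volatile one.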
CI/CD Architecture — GHA Builds + Woodpecker Deploy
Flow: git push → GHA build+push DockerHub (8-char SHA) → POST Woodpecker API → kubectl set image
Migrated to GHA (7): Website, k8s-portal, f1-stream, claude-memory-mcp, apple-health-data, audiblez-web, plotting-book
Woodpecker-only: travel_blog (1.4GB content too large for GHA), infra pipelines (terragrunt apply, certbot, build-cli, which need cluster access)
Per-project files:
- `.github/workflows/build-and-deploy.yml` (GHA): checkout, build, push DockerHub, POST Woodpecker API
- `.woodpecker/deploy.yml` (Woodpecker): `kubectl set image` + Slack notify (event: `[manual, push]`)
- `.woodpecker/build-fallback.yml`: Old full build pipeline preserved (event: `deployment`, never auto-fires)
Woodpecker API: Uses numeric repo IDs (/api/repos/2/pipelines), NOT owner/name paths (those return HTML).
Repo IDs: infra=1, Website=2, finance=3, health=4, travel_blog=5, webhook-handler=6, audiblez-web=9, f1-stream=10, plotting-book=43, claude-memory-mcp=78, infra-onboarding=79
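Triggering a deploy by numeric repo ID could be sketched as follows. The repo IDs are copied from the list above. Everything else is an assumption to check against the real pipelines: the `WOODPECKER_URL` variable, the `main` branch, the `IMAGE_TAG` variable name, and the JSON payload shape.

```shell
# woodpecker_repo_id NAME
# Maps a repo name to its numeric Woodpecker ID (from the list above).
woodpecker_repo_id() {
  case "$1" in
    infra)             echo 1 ;;
    Website)           echo 2 ;;
    finance)           echo 3 ;;
    health)            echo 4 ;;
    travel_blog)       echo 5 ;;
    webhook-handler)   echo 6 ;;
    audiblez-web)      echo 9 ;;
    f1-stream)         echo 10 ;;
    plotting-book)     echo 43 ;;
    claude-memory-mcp) echo 78 ;;
    infra-onboarding)  echo 79 ;;
    *) echo "unknown repo: $1" >&2; return 1 ;;
  esac
}

# pipelines_path NAME
# Prints the numeric-ID API path; owner/name paths return HTML instead.
pipelines_path() {
  repo_id=$(woodpecker_repo_id "$1") || return 1
  echo "/api/repos/${repo_id}/pipelines"
}

# trigger_deploy NAME IMAGE_TAG
# Assumed payload: branch "main" plus an IMAGE_TAG pipeline variable.
trigger_deploy() {
  api_path=$(pipelines_path "$1") || return 1
  curl -fsS -X POST "${WOODPECKER_URL}${api_path}" \
    -H "Authorization: Bearer ${WOODPECKER_TOKEN}" \
    -H "Content-Type: application/json" \
    -d "{\"branch\":\"main\",\"variables\":{\"IMAGE_TAG\":\"$2\"}}"
}
```

Keeping the name-to-ID mapping in one function means an unknown repo fails before any request is sent.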
Woodpecker YAML gotchas:
- Commands with `${VAR}:${VAR}` must be quoted: an unquoted `:` triggers YAML map parsing when vars are empty
- Use `bitnami/kubectl:latest` (not pinned versions, which have entrypoint compatibility issues)
- Global secrets must have `manual` in their events list for API-triggered pipelines
GitHub repo secrets (set on all repos): DOCKERHUB_USERNAME, DOCKERHUB_TOKEN, WOODPECKER_TOKEN
Infra pipelines unchanged: default.yml (terragrunt apply), renew-tls.yml (certbot cron), build-cli.yml (dual registry push), k8s-portal.yml (path-filtered build) — all stay on Woodpecker.
Database Host
postgresql_host in config.tfvars is pg-cluster-rw.dbaas.svc.cluster.local (the CNPG primary). The legacy postgresql.dbaas service has no endpoints — never use it. This variable is shared by ~12 stacks.
Networking & Resilience
- Critical path services scaled to 3: Traefik, Authentik, CrowdSec LAPI, PgBouncer, Cloudflared.
- PDBs: minAvailable=2 on Traefik and Authentik.
- Fallback proxies: basicAuth when Authentik is down, fail-open when poison-fountain is down.
- CrowdSec bouncer: graceful degradation mode (fail-open on error).
- Rate limiting: Return 429 (not 503). Per-service tuning: Immich/Nextcloud need higher limits.
- Retry middleware: 2 attempts, 100ms — in default ingress chain.
- HTTP/3 (QUIC): Enabled cluster-wide via Traefik.
Service-Specific Notes
| Service | Key Operational Knowledge |
|---|---|
| Nextcloud | MaxRequestWorkers=150, needs 4Gi memory, very generous startup probe |
| Immich | ML on SSD, disable ModSecurity (breaks streaming), CUDA for ML, frequent upgrades |
| CrowdSec | Pin version, disable Metabase when not needed (CPU hog), LAPI scaled to 3 |
| Frigate | GPU stall detection in liveness probe (inference speed check), high CPU |
| Authentik | 3 replicas, PgBouncer in front of PostgreSQL, strip auth headers before forwarding |
| Kyverno | failurePolicy=Ignore to prevent blocking cluster, pin chart version |
| MySQL InnoDB | Enable auto-recovery, anti-affinity excludes node2 (SIGBUS), 4.4Gi req but ~1Gi used |
Monitoring & Alerting
- Alert cascade inhibitions: if node is down, suppress pod alerts on that node.
- Exclude completed CronJob pods from "pod not ready" alerts.
- Every new service gets Prometheus scrape config + Uptime Kuma monitor.
- Key alerts: OOMKill, pod replica mismatch, 4xx/5xx error rates, UPS battery, CPU temp, SSD writes, NFS responsiveness, ClusterMemoryRequestsHigh (>85%), ContainerNearOOM (>85% limit), PodUnschedulable.
Known Issues
- CrowdSec Helm upgrade times out: `terragrunt apply` on platform stack causes CrowdSec Helm release to get stuck in `pending-upgrade`. Workaround: `helm rollback crowdsec <rev> -n crowdsec`. Root cause: likely ResourceQuota CPU at 302% preventing pods from passing readiness probes. Needs investigation.
- OpenClaw config is writable: OpenClaw writes to `openclaw.json` at runtime (doctor --fix, plugin auto-enable). Never use subPath ConfigMap mounts for it; use an init container to copy into a writable volume. Needs 2Gi memory + `NODE_OPTIONS=--max-old-space-size=1536`.
- Goldilocks VPA sets limits: When increasing memory requests, always set explicit `limits` too, since Goldilocks may have added a limit that blocks the change.
User Preferences
- Calendar: Nextcloud at `nextcloud.viktorbarzin.me`
- Home Assistant: ha-london (default), ha-sofia. "ha"/"HA" = ha-london
- Frontend: Svelte for all new web apps
- Tools: Docker containers only, never `brew install` locally
- Pod monitoring: Never use `sleep`; spawn background subagent with `kubectl get pods -w`