Claude Code — Project Configuration
Shared knowledge: Read `AGENTS.md` at repo root for architecture, patterns, rules, and operations. This file adds Claude-specific features on top.
Claude-Specific Resources
- Skills: `.claude/skills/` (7 active). Archived runbooks: `.claude/skills/archived/`
- Agents: All agents are global (`~/.claude/agents/`, shared via dotfiles). Install Viktor's dotfiles for the full set.
  - Infra specialists: cluster-health-checker, dba, home-automation-engineer, network-engineer, observability-engineer, platform-engineer, security-engineer, sre
  - Incident pipeline: post-mortem → sev-triage → sev-historian → sev-report-writer
  - DevOps: devops-engineer, deploy-app, review-loop
- Reference: `.claude/reference/` (patterns.md, service-catalog.md, proxmox-inventory.md, github-api.md, authentik-state.md)
- GitHub API: `curl` with tokens from tfvars (`gh` CLI blocked by sandbox)
Instructions
- "remember X": Use `memory-tool store "content" --category facts --tags "tag1,tag2"` (via exec) for persistent cross-session memory. Also update this file + `AGENTS.md` (if shared knowledge), commit with `[ci skip]`. To recall: `memory-tool recall "query"`. To list: `memory-tool list`. To delete: `memory-tool delete <id>`. The native `memory_search` and `memory_get` tools are also available for searching indexed memory files. For storing new memories, always use the `memory-tool` CLI via exec.
- Apply: Authenticate via `vault login -method=oidc`, then use `scripts/tg` (preferred; handles state decrypt/encrypt) or `terragrunt` directly. `scripts/tg` adds `-auto-approve` for `--non-interactive` applies.
- New services need CI/CD and monitoring (Prometheus/Uptime Kuma)
- New service: Use `setup-project` skill for full workflow
- Ingress: `ingress_factory` module. Auth: `protected = true`. Anti-AI: on by default.
- Docker images: Always build for `linux/amd64`. Use 8-char git SHA tags; `:latest` causes stale pull-through cache.
- Private registry: `registry.viktorbarzin.me` (htpasswd auth, credentials in Vault `secret/viktor`). Use `image: registry.viktorbarzin.me/<name>:<tag>` + `imagePullSecrets: [{name: registry-credentials}]`. Kyverno auto-syncs the secret to all namespaces. Build & push from registry VM (10.0.20.10). Containerd `hosts.toml` redirects pulls to LAN IP directly. Web UI at `docker.viktorbarzin.me` (Authentik-protected).
- LinuxServer.io containers: `DOCKER_MODS` runs apt-get on every start, so bake slow mods into a custom image (`RUN /docker-mods || true` then `ENV DOCKER_MODS=`). Set `NO_CHOWN=true` to skip recursive chown that hangs on NFS mounts.
- Node memory changes: When changing VM memory on any k8s node, update kubelet `systemReserved`, `kubeReserved`, and eviction thresholds accordingly. Config: `/var/lib/kubelet/config.yaml`. Template: `stacks/infra/main.tf`. Current values: systemReserved=512Mi, kubeReserved=512Mi, evictionHard=500Mi, evictionSoft=1Gi.
- Sealed Secrets: User-managed secrets go in `sealed-*.yaml` files in the stack directory. Stacks pick them up via `kubernetes_manifest` + `fileset(path.module, "sealed-*.yaml")`. See AGENTS.md for full workflow.
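The image tagging and private-registry rules above can be sketched as a small helper. This is an illustration only, not a script from the repo: the function name `image_ref` and its validation rule are assumptions; the registry host and the 8-character SHA convention come from the notes above.

```python
import re

REGISTRY = "registry.viktorbarzin.me"  # private registry host from the notes above


def image_ref(name: str, git_sha: str) -> str:
    """Return '<registry>/<name>:<8-char sha>' (hypothetical helper)."""
    if not re.fullmatch(r"[0-9a-f]{8,40}", git_sha):
        # Rejects ':latest' and other mutable tags, which lead to stale
        # pull-through cache hits.
        raise ValueError(f"expected a hex git SHA, got {git_sha!r}")
    return f"{REGISTRY}/{name}:{git_sha[:8]}"
```

Passing a full 40-character commit SHA yields a reference truncated to the first 8 characters; passing `latest` raises `ValueError`.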
Terraform State — SOPS-Encrypted in Git
- State is local (`backend "local"`), encrypted with SOPS and committed as `.tfstate.enc` files.
- Decrypt priority: Vault Transit (primary, uses existing `vault login` session) → age key fallback (`~/.config/sops/age/keys.txt`, for bootstrap/DR).
- Encrypt: Always encrypts to both Vault Transit (`transit/keys/sops-state`) + age recipients.
- Scripts: `scripts/state-sync {encrypt|decrypt|commit} [stack]` handles all state sync. `scripts/tg` auto-decrypts before and auto-encrypts+commits after mutating ops (apply/destroy/import).
- Workflow: `git pull` → `scripts/tg plan` → `scripts/tg apply` → `git push`. State sync is transparent.
- Config: `.sops.yaml` at repo root defines encryption rules. age public keys listed there.
- Backups disabled: `terragrunt.hcl` passes `-backup=-` to prevent `.backup` file accumulation.
- Adding operator: Generate age key (`age-keygen`), add pubkey to `.sops.yaml`, run `sops updatekeys` on all `.enc` files.
- Two workstations: Laptop (macOS) + DevVM (10.0.10.10, Linux). Both have age keys + Vault access. Keys backed up in Vault (`secret/viktor/sops_age_key_laptop`, `sops_age_key_devvm`).
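The decrypt priority and the "mutating ops" rule above can be modelled in a few lines. This is a sketch of the decision logic only, not the actual `scripts/tg` or `scripts/state-sync` code; the function names and return strings are assumptions.

```python
# Operations after which state must be re-encrypted and committed.
MUTATING_OPS = {"apply", "destroy", "import"}


def decrypt_method(has_vault_session: bool, has_age_key: bool) -> str:
    """Pick the decrypt path in the documented priority order."""
    if has_vault_session:
        return "vault-transit"  # primary: reuses the existing vault login session
    if has_age_key:
        return "age"  # fallback for bootstrap/DR (~/.config/sops/age/keys.txt)
    raise RuntimeError("no Vault session and no age key: cannot decrypt state")


def needs_encrypt_and_commit(op: str) -> bool:
    """True for operations that change state (plan and similar read-only ops do not)."""
    return op in MUTATING_OPS
```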
Secrets Management — Vault KV
- Vault is the sole source of truth for secrets.
- `secret/viktor`: go-to path for ALL personal secrets (135 keys). Contains every API key, token, password, SSH key, and config from the old terraform.tfvars. Check here first: `vault kv get -field=KEY secret/viktor`.
- Auth: `vault login -method=oidc` (Authentik SSO) → `~/.vault-token` → read by Vault TF provider.
- Vault stack self-reads: `data "vault_kv_secret_v2" "vault"` reads its own OIDC creds from `secret/vault`.
- ESO (External Secrets Operator): `stacks/external-secrets/` holds 43 ExternalSecrets + 9 DB-creds ExternalSecrets. API version `v1beta1`. Two ClusterSecretStores: `vault-kv` and `vault-database`.
- Plan-time pattern: Former plan-time stacks use `data "kubernetes_secret"` to read ESO-created K8s Secrets at plan time (no Vault dependency). First-apply gotcha: must `terragrunt apply -target=kubernetes_manifest.external_secret` first, then full apply. `count` on resources using secret values fails, so remove conditional counts.
- 14 hybrid stacks still keep `data "vault_kv_secret_v2"` for plan-time needs (job commands, Helm templatefile, module inputs). Platform has 48 plan-time refs; no migration possible without restructuring modules.
- Database rotation: Vault DB engine rotates passwords every 24h. MySQL: speedtest, wrongmove, codimd, nextcloud, shlink, grafana. PostgreSQL: trading, health, linkwarden, affine, woodpecker, claude_memory. Excluded: authentik (PgBouncer), technitium/crowdsec (Helm-baked), root users.
- K8s credentials: Vault K8s secrets engine. Roles: `dashboard-admin`, `ci-deployer`, `openclaw`, `local-admin`. Use `vault write kubernetes/creds/ROLE kubernetes_namespace=NS`. Helper: `scripts/vault-kubeconfig`.
- CI/CD (GHA + Woodpecker): Docker builds run on GitHub Actions (free on public repos). Woodpecker is deploy-only: it receives the image tag via API POST and runs `kubectl set image`. Woodpecker authenticates via K8s SA JWT → Vault K8s auth. Sync CronJob pushes `secret/ci/global` → Woodpecker API every 6h. Shell scripts in HCL heredocs: escape `$` → `$$`, `%{}` → `%%{}`.
- Platform cannot depend on vault (circular). Apply order: vault first, then platform. Platform has 48 vault refs, all in module inputs, so no ESO migration is possible.
- Complex types (maps/lists like `homepage_credentials`, `k8s_users`) stored as JSON strings in KV, decoded with `jsondecode()` in consuming stack `locals` blocks.
- New stacks: Add secret in Vault UI/CLI at `secret/<stack-name>`, add ExternalSecret + `data "kubernetes_secret"` for plan-time, `secret_key_ref` for env vars. Use `data "vault_kv_secret_v2"` only if `data "kubernetes_secret"` won't work (e.g., first-apply bootstrap).
- Backup CronJob: `vault-raft-backup` uses manually-created `vault-root-token` K8s Secret (independent of automation).
- Bootstrap (fresh cluster): Comment out data source + OIDC → apply Helm → init+unseal → populate `secret/vault` → uncomment → re-apply.
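The heredoc escaping rule from the CI/CD bullet above is easy to get wrong by hand. Below is a minimal sketch of it as a text transform. It assumes only the template openers `${` and `%{` need escaping, since those are the sequences Terraform interprets; a bare `$VAR` is left alone. The function name is hypothetical.

```python
def escape_for_hcl_heredoc(script: str) -> str:
    """Escape Terraform template openers so shell text survives an HCL heredoc."""
    # '%{' becomes '%%{' and '${' becomes '$${'; the two replacements
    # do not interact, so the order is not significant.
    return script.replace("%{", "%%{").replace("${", "$${")
```

For example, `curl -w "%{http_code}" "${URL}"` becomes `curl -w "%%{http_code}" "$${URL}"`, which Terraform renders back to the original shell text.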
Resource Management Patterns
- CPU: All CPU limits removed cluster-wide (CFS throttling). Only set CPU requests based on actual usage.
- Memory: Set explicit `requests=limits` based on VPA upperBound. Target: upperBound x 1.2 for stable services, x 1.3 for GPU/volatile workloads.
- VPA (Goldilocks): Must be `Initial` mode (not `Auto`); Auto conflicts with Terraform's declarative resource management.
- LimitRange: Tier-based defaults silently apply to pods with `resources: {}`. Always set explicit resources on containers needing more than defaults. Tier 3-edge and 4-aux now use Burstable QoS (request < limit) to reduce scheduler pressure.
- Democratic-CSI sidecars: Must set explicit resources (32-80Mi) in Helm values; 17 sidecars default to 256Mi each via LimitRange. `csiProxy` is a TOP-LEVEL chart key, not nested under controller/node.
- ResourceQuota blocks rolling updates: When quota is tight, scale to 0 then back to 1 instead of RollingUpdate. Or use Recreate strategy.
- Kyverno ndots drift: Kyverno injects dns_config on all pods. Add `lifecycle { ignore_changes = [spec[0].template[0].spec[0].dns_config] }` to kubernetes_deployment resources to prevent perpetual TF plan drift.
- NVIDIA GPU operator resources: dcgm-exporter and cuda-validator resources configurable via `dcgmExporter.resources` and `validator.resources` in nvidia values.yaml.
- Pin database versions: Disable Diun (image update monitoring) for MySQL, PostgreSQL, Redis.
- Quarterly right-sizing: Check Goldilocks dashboard. Compare VPA upperBound to current request. Also flag under-provisioned workloads (VPA upperBound > request x 0.8).
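The sizing arithmetic above can be sketched as follows. The 1.2/1.3 multipliers and the 0.8 threshold come from the notes; rounding up to a 64Mi step is an assumption for illustration, as are the function names and the sample inputs.

```python
def recommended_memory_mi(vpa_upper_mi: int, volatile: bool = False, step_mi: int = 64) -> int:
    """VPA upperBound x 1.2 (stable) or x 1.3 (GPU/volatile), rounded up to step_mi."""
    percent = 130 if volatile else 120
    scaled = vpa_upper_mi * percent        # integer math avoids float rounding
    unit = 100 * step_mi
    return -(-scaled // unit) * step_mi    # ceiling division, then back to Mi


def under_provisioned(vpa_upper_mi: int, request_mi: int) -> bool:
    """True when VPA upperBound exceeds 80% of the current request."""
    return vpa_upper_mi * 10 > request_mi * 8
```

With an upperBound of 505Mi, `recommended_memory_mi(505)` returns 640; with 311Mi it returns 384. Set both request and limit to the result.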
CI/CD Architecture — GHA Builds + Woodpecker Deploy
Flow: git push → GHA build+push DockerHub (8-char SHA) → POST Woodpecker API → kubectl set image
Migrated to GHA (7): Website, k8s-portal, f1-stream, claude-memory-mcp, apple-health-data, audiblez-web, plotting-book
Woodpecker-only: travel_blog (1.4GB content too large for GHA), infra pipelines (terragrunt apply, certbot, build-cli; these need cluster access)
Per-project files:
- `.github/workflows/build-and-deploy.yml` (GHA): checkout, build, push DockerHub, POST Woodpecker API
- `.woodpecker/deploy.yml` (Woodpecker): `kubectl set image` + Slack notify (event: `[manual, push]`)
- `.woodpecker/build-fallback.yml`: Old full build pipeline preserved (event: `deployment`, never auto-fires)
Woodpecker API: Uses numeric repo IDs (/api/repos/2/pipelines), NOT owner/name paths (those return HTML).
Repo IDs: infra=1, Website=2, finance=3, health=4, travel_blog=5, webhook-handler=6, audiblez-web=9, f1-stream=10, plotting-book=43, claude-memory-mcp=78, infra-onboarding=79
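Because the API only accepts numeric IDs, a lookup table avoids hard-coding them in each caller. A minimal sketch: the ID mapping is copied from the list above, while the base URL in the test and usage is a placeholder and the function name is hypothetical.

```python
REPO_IDS = {
    "infra": 1, "Website": 2, "finance": 3, "health": 4, "travel_blog": 5,
    "webhook-handler": 6, "audiblez-web": 9, "f1-stream": 10,
    "plotting-book": 43, "claude-memory-mcp": 78, "infra-onboarding": 79,
}


def pipeline_url(base_url: str, repo: str) -> str:
    """Build the pipelines endpoint using the numeric repo ID, not owner/name."""
    repo_id = REPO_IDS[repo]  # a KeyError for unknown repos is intentional
    return f"{base_url.rstrip('/')}/api/repos/{repo_id}/pipelines"
```

`pipeline_url("https://ci.example.com", "Website")` gives `https://ci.example.com/api/repos/2/pipelines`, matching the path shape shown above.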
Woodpecker YAML gotchas:
- Commands with `${VAR}:${VAR}` must be quoted; an unquoted `:` triggers YAML map parsing when vars are empty
- Use `bitnami/kubectl:latest` (not pinned versions, which have entrypoint compatibility issues)
- Global secrets must have `manual` in their events list for API-triggered pipelines
GitHub repo secrets (set on all repos): DOCKERHUB_USERNAME, DOCKERHUB_TOKEN, WOODPECKER_TOKEN
Infra pipelines unchanged: default.yml (terragrunt apply), renew-tls.yml (certbot cron), build-cli.yml (dual registry push), k8s-portal.yml (path-filtered build) — all stay on Woodpecker.
Database Host
postgresql_host in config.tfvars is pg-cluster-rw.dbaas.svc.cluster.local (the CNPG primary). The legacy postgresql.dbaas service has no endpoints — never use it. This variable is shared by ~12 stacks.
Networking & Resilience
- Critical path services scaled to 3: Traefik, Authentik, CrowdSec LAPI, PgBouncer, Cloudflared.
- PDBs: minAvailable=2 on Traefik and Authentik.
- Fallback proxies: basicAuth when Authentik is down, fail-open when poison-fountain is down.
- CrowdSec bouncer: graceful degradation mode (fail-open on error).
- Rate limiting: Return 429 (not 503). Per-service tuning: Immich/Nextcloud need higher limits.
- Retry middleware: 2 attempts, 100ms — in default ingress chain.
- HTTP/3 (QUIC): Enabled cluster-wide via Traefik.
Service-Specific Notes
| Service | Key Operational Knowledge |
|---|---|
| Nextcloud | MaxRequestWorkers=150, needs 8Gi limit (Apache transient memory spikes, see commit eb94144), very generous startup probe |
| Immich | ML on SSD, disable ModSecurity (breaks streaming), CUDA for ML, frequent upgrades |
| CrowdSec | Pin version, disable Metabase when not needed (CPU hog), LAPI scaled to 3 |
| Frigate | GPU stall detection in liveness probe (inference speed check), high CPU |
| Authentik | 3 replicas, PgBouncer in front of PostgreSQL, strip auth headers before forwarding |
| Kyverno | failurePolicy=Ignore to prevent blocking cluster, pin chart version |
| MySQL InnoDB | Enable auto-recovery, anti-affinity excludes node2 (SIGBUS), 4.4Gi req but ~1Gi used |
Monitoring & Alerting
- Alert cascade inhibitions: if node is down, suppress pod alerts on that node.
- Exclude completed CronJob pods from "pod not ready" alerts.
- Every new service gets Prometheus scrape config + Uptime Kuma monitor.
- Key alerts: OOMKill, pod replica mismatch, 4xx/5xx error rates, UPS battery, CPU temp, SSD writes, NFS responsiveness, ClusterMemoryRequestsHigh (>85%), ContainerNearOOM (>85% limit), PodUnschedulable.
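The cascade inhibition in the first bullet amounts to a filter over active alerts. In a real setup this would normally live in the alerting system's own inhibition config; the sketch below only illustrates the logic, and the `scope` and `node` label names are assumptions.

```python
from typing import Dict, Iterable, List, Set


def inhibit_pod_alerts(alerts: Iterable[Dict[str, str]], down_nodes: Set[str]) -> List[Dict[str, str]]:
    """Drop pod-level alerts whose node is already reported down."""
    kept = []
    for alert in alerts:
        is_pod_alert = alert.get("scope") == "pod"
        if is_pod_alert and alert.get("node") in down_nodes:
            continue  # the node-down alert already explains this one
        kept.append(alert)
    return kept
```

Node-level alerts and pod alerts on healthy nodes pass through unchanged.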
Storage & Backup Architecture
Cloud Sync (TrueNAS → Synology NAS)
- Task 1: Weekly push (Monday 09:00) of `/mnt/main` NFS data to `nas.viktorbarzin.lan:/Backup/Viki/truenas`
- zfs diff optimization: Pre-script diffs `main@cloudsync-prev` vs `main@cloudsync-new`, writes changed files to `/tmp/cloudsync_files.txt`. Args: `--files-from /tmp/cloudsync_files.txt --no-traverse`. Post-script rotates snapshots. Falls back to full `find` if no prev snapshot or >100k changes.
- Excludes: ytldp, prometheus, logs, post, crowdsec, servarr/downloads, iscsi, iscsi-snaps
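The pre-script's choice between the snapshot diff and the full `find` fallback can be sketched as a single decision. This models the rule as described above, not the actual pre-script; the function name and return strings are assumptions.

```python
MAX_CHANGES = 100_000  # above this, the diff is no cheaper than a full listing


def listing_method(prev_snapshot_exists: bool, changed_count: int) -> str:
    """Return 'zfs-diff' for the incremental path, 'find' for the full fallback."""
    if not prev_snapshot_exists or changed_count > MAX_CHANGES:
        return "find"
    return "zfs-diff"
```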
iSCSI Backup Architecture
- iSCSI zvols are raw block devices exported to k8s nodes via democratic-csi
- TrueNAS cannot read filesystem contents inside zvols — only the k8s pod can
- Local protection: ZFS snapshots (every 12h, 24h retention + daily, 3-week retention) cover zvols automatically
- Offsite protection: Application-level backup CronJobs dump data to NFS paths, which Task 1 syncs to Synology
- Current CronJob coverage: MySQL (mysqldump), PostgreSQL (pg_dumpall), Vault (raft snapshot), Redis (BGSAVE), Vaultwarden (sqlite3 .backup)
- Convention: Any new iSCSI-backed app MUST add a backup CronJob to its Terraform stack that writes to `/mnt/main/<app>-backup/`
- Uncovered (acceptable): Prometheus (disposable metrics), Loki (disposable logs), plotting-book and novelapp (small, low-priority)
Known Issues
- CrowdSec Helm upgrade times out: `terragrunt apply` on platform stack causes CrowdSec Helm release to get stuck in `pending-upgrade`. Workaround: `helm rollback crowdsec <rev> -n crowdsec`. Root cause: likely ResourceQuota CPU at 302% preventing pods from passing readiness probes. Needs investigation.
- OpenClaw config is writable: OpenClaw writes to `openclaw.json` at runtime (doctor --fix, plugin auto-enable). Never use subPath ConfigMap mounts for it; use an init container to copy into a writable volume. Needs 2Gi memory + `NODE_OPTIONS=--max-old-space-size=1536`.
- Goldilocks VPA sets limits: When increasing memory requests, always set explicit `limits` too, because Goldilocks may have added a limit that blocks the change.
User Preferences
- Calendar: Nextcloud at `nextcloud.viktorbarzin.me`
- Home Assistant: ha-london (default), ha-sofia. "ha"/"HA" = ha-london
- Frontend: Svelte for all new web apps
- Tools: Docker containers only; never `brew install` locally
- Pod monitoring: Never use `sleep`; spawn background subagent with `kubectl get pods -w`