infra

Author	SHA1	Message	Date
Viktor Barzin	a90ce27923	infra: add kubectl + authentik providers across 6 stacks Provider declarations were applied across freshrss, linkwarden, navidrome, openclaw, tandoor, vault in prior sessions; lock files regenerated for the 4 stacks where init had run. Commits the WIP so downstream Terraform plans can proceed. - kubectl (gavinbunney/kubectl ~> 1.14): kubernetes_manifest panic workaround for Kyverno CRDs (beads code-e2dp) - authentik (goauthentik/authentik ~> 2024.10): used where stacks manage their own Authentik objects	2026-05-22 14:17:00 +00:00
Viktor Barzin	fa2b57f177	openclaw: enable recruiter-api plugin (allowlist + manifest contracts) Plugin needs three things to load under OpenClaw 2026.5.x: 1. plugins.allow includes 'recruiter-api' (doctor --fix overwrites the ConfigMap-baked value, so re-patch via 'openclaw config patch --stdin' in the startup command after doctor runs). 2. 'openclaw plugins enable recruiter-api' to flip its registry entry. 3. manifest declares contracts.tools (added in recruiter-responder commit 83ffd9fa). Plus: VIKTOR_CHAT_ID env wired from secret/openclaw.viktor_chat_id so the plugin's polling loop knows which Telegram chat to deliver into. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:17:00 +00:00
Viktor Barzin	6417c770c1	recruiter-responder + openclaw: wire gpt-mini secret keys + VIKTOR_CHAT_ID recruiter-responder ExternalSecret gains GPT_MINI_ENDPOINT/_API_KEY/_MODEL (NIM-served qwen3-coder-480b — gpt-5.4-mini in OpenClaw is OAuth-only and not HTTP-accessible to external services). OpenClaw gains VIKTOR_CHAT_ID env consumed by the recruiter-api plugin's announcement loop. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:17:00 +00:00
Viktor Barzin	1dd8f4e2bf	openclaw: native MCP servers + daily claude-memory sync Wire ha-mcp, context7, and the in-pod playwright sidecar as native MCP servers on OpenClaw via `mcp set` in the container startup (ConfigMap-baked mcp.servers gets stripped by `doctor --fix`; CLI-set entries persist). HA URL pulled from new Vault key secret/openclaw.ha_sofia_mcp_url and passed via the HA_SOFIA_MCP_URL env var. Add a daily 03:00 UTC `memory-sync` CronJob in the openclaw namespace: pulls all non-sensitive memories from claude-memory.claude-memory.svc:80/api/memories, groups by category, writes 18 Markdown files into /workspace/memory/projects/claude-memory-sync/ (the path memory-core indexes), then triggers `openclaw memory index --force` via kubectl exec. Reuses the existing cluster-healthcheck SA (pods+pods/exec). Smoke test: 1488 memories synced, 25/25 files indexed, search returns hits. Also drops the legacy /app/extensions entry from plugins.load.paths (doctor warning), wires HA_SOFIA_MCP_URL env, and one-shot deletes the stale 2026-02-28 metaclaw-export.json from the openclaw home volume. claude_memory MCP intentionally NOT wired — its /mcp/mcp transport 404s on the deployed claude-memory-mcp:17 image (tracked as code-z1so). Shared knowledge is delivered via the CronJob's REST sync instead. Adding claude_memory to mcp.servers is a one-line follow-up once that's fixed.	2026-05-22 14:16:53 +00:00
Viktor Barzin	3027ab85a8	recruiter-responder: bump image_tag to 189ef901 OpenClaw can now answer 'what do we know about <company>?' from cache via the new recruiter_company_research tool, and recruiter_get embeds the cached research payload inline. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:49 +00:00
root	1177a82452	Woodpecker CI deploy [CI SKIP]	2026-05-22 14:16:47 +00:00
Viktor Barzin	a98b00324d	recruiter-responder: pin image tag + run plugin installer init as root - stacks/recruiter-responder/terragrunt.hcl: bump image_tag to 0500c3d3 (300s LLM timeouts + IMAP BODY.PEEK[] fix). - stacks/openclaw/main.tf: install-recruiter-plugin init container now runs as uid 0 — the openclaw NFS volume is owned by uid 1000 and the recruiter-responder image otherwise drops to uid 10001 which can't write or chown. Smoke-tested end-to-end 2026-05-15 ~23:15: Synthetic recruiter email -> IMAP IDLE EXISTS push -> qwen3-8b triage (12.1s, JSON output complete with company/role/salary/location/tech) -> 2 drafts persisted in Postgres -> Telegram sendMessage 200 OK. Then deleted 3 stale n8n workflows W992Nr7..., 1AU4k7..., IisDNx... from the n8n Postgres workflow_entity table. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:47 +00:00
Viktor Barzin	7e1580ba8c	recruiter-responder: deploy stack + llama-cpp qwen3-8b + openclaw plugin mount Three coupled changes for the new recruiter-responder pipeline: 1. stacks/llama-cpp/: add qwen3-8b text-only model to llama-swap. Uses unsloth/Qwen3-8B-GGUF Q4_K_M, 16k context, no mmproj. Refactored the download Job script + cmd renderer to handle text_only=true (skip mmproj download + --mmproj flag). The 3 existing vision models stay on text_only=false; no behaviour change for them. 2. stacks/recruiter-responder/: new stack. Namespace, 2 ExternalSecrets (app secrets from secret/recruiter-responder, DB creds from Vault DB engine static-creds/pg-recruiter-responder), Deployment (replicas=1, Recreate -- IMAP IDLE + APScheduler want single leader), Service ClusterIP. Image: forgejo.viktorbarzin.me/viktor/recruiter-responder. 3. stacks/openclaw/: add init container `install-recruiter-plugin` that uses the recruiter-responder image to copy the .mjs plugin into /home/node/.openclaw/extensions/recruiter-api/ on NFS. Couples plugin version to the recruiter-responder image tag. Also injects RECRUITER_RESPONDER_URL + RECRUITER_RESPONDER_TOKEN env vars (token from openclaw-secrets.recruiter_responder_bearer_token, optional). Pre-apply checklist for recruiter-responder stack: - Vault: seed secret/recruiter-responder with webhook_bearer_token, imap_{me,spam}_{user,pass}, smtp_password, claude_agent_token, task_webhook_token. - Vault: add secret/openclaw.recruiter_responder_bearer_token (same as above webhook_bearer_token). - dbaas: create DB recruiter_responder + role recruiter_responder, and Vault DB-engine role static-creds/pg-recruiter-responder. - Build + push image via Woodpecker (recruiter-responder repo CI). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:46 +00:00
Viktor Barzin	dc134011eb	fix: pvc-autoresizer + TF drift safety — bulk add ignore_changes After fixing the threshold=80% misconfig and seeing two PVCs (prometheus + technitium primary) get stuck Terminating, a 3rd round showed four more PVCs (frigate, hackmd, immich-postgresql, paperless-ngx) in the same state. Same root cause: TF spec'd a smaller storage size than the autoresizer-grown live value, K8s rejected the shrink, TF force-replaced the PVC, and the pvc-protection finalizer held it in Terminating while the pod kept using the underlying volume. Bulk-inject lifecycle.ignore_changes = [spec[0].resources[0].requests] on every kubernetes_persistent_volume_claim block that has resize.topolvm.io/threshold annotations. The pattern was already documented in .claude/CLAUDE.md but ~63 stacks were missing it. Live PVCs are unaffected; this only prevents future TF applies from attempting the destroy+recreate. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:43 +00:00
Viktor Barzin	3f2b2f9d32	fix: pvc-autoresizer threshold should be 10%, not 80% topolvm/pvc-autoresizer's threshold annotation is the FREE-SPACE percentage below which expansion fires (per upstream README). Setting it to "80%" means "expand when free-space drops below 80%", i.e. as soon as the PVC crosses 20% utilization — which caused prometheus-data-proxmox to be repeatedly expanded from 200Gi to 433Gi in 70 minutes (six 10% bumps, all when the volume was only ~14% used). Once the SC opt-in fix landed (1e4eac53) and the inode metrics fix landed (02a12f1a), the autoresizer started actively misfiring across 75+ PVCs cluster-wide. Flip the value to "10%" everywhere — that's "expand when free-space drops below 10%", i.e. at 90% utilization, which is the conventional semantic and matches the alert thresholds in prometheus_chart_values.tpl (PVAutoExpanding fires at 80%, PVFillingUp at 95%). The CLAUDE.md PVC template was the source of the misconfig, so update it too. Live PVC annotations were patched in parallel via kubectl annotate; TF apply on each affected stack will be a no-op against those live values. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:43 +00:00
Viktor Barzin	ff5538a667	ingress_factory: replace `protected` bool with `auth` enum + audit pass across 100 stacks Phase 3+4 of default-deny ingress plan. Replaces the `protected = bool` (default false → unprotected) variable in `modules/kubernetes/ingress_factory` with `auth = string` enum (default "required" → fail-closed). Touches every ingress_factory caller so the audit decision is recorded explicitly in code. ingress_factory (Phase 3): - `auth = "required"`: standard Authentik forward-auth (the legacy `protected = true` semantic). - `auth = "public"`: forward-auth via the new `authentik-forward-auth-public` middleware → dedicated public outpost → guest auto-bind. Logged-in users keep their real identity. - `auth = "none"`: no Authentik middleware. For Anubis-fronted content, native client APIs (Git, /v2/, WebDAV), webhook receivers, the Authentik outpost itself. - `effective_anti_ai` default flips ON only when `auth = "none"` (auth-gated ingresses don't need anti-AI noise; the auth flow already discourages bots). Audit pass (Phase 4) across 96 ingress_factory call sites: - 49 explicit `protected = true` → `auth = "required"` - 8 explicit `protected = false` → `auth = "none"` (5) or `auth = "public"` (3) - 64 previously-default (no protected line) → `auth = "required"` ADDED, then reviewed individually: * 9 Anubis-fronted (blog, www, kms, travel, f1, cyberchef, jsoncrack, homepage, wrongmove UI, privatebin) → `auth = "none"` * 22 native-client / programmatic surfaces (Forgejo Git+/v2/, webhook handler, claude-memory MCP, Nextcloud WebDAV, Matrix, Vault CLI/OIDC, xray VPN, ntfy, woodpecker webhooks, n8n triggers, ntfy push, dawarich location ingestion, immich frame kiosk, headscale CP, send anonymous drops, rybbit beacon, vaultwarden API, Authentik UI itself + outposts) → `auth = "none"` * Remaining ~33 → `auth = "required"` confirmed (admin tools, internal UIs, services without app-level auth) - Smoke-test promotions to `auth = "public"`: fire-planner public UI, k8s-portal API, insta2spotify callback. Three call sites in wrapper modules (`stacks/freedify/factory/`, `stacks/reverse-proxy/modules/reverse_proxy/`) keep their internal `protected` bool — they translate to `auth` internally, out of scope for this rename. Behavior change: previously-default ingresses now fail closed (require Authentik login) unless explicitly flipped to `auth = "none"` or `auth = "public"`. This is the audit goal — no more accidentally-unprotected surfaces. Sites that were intentionally public (Anubis content, native APIs, webhooks) are now explicitly recorded as `auth = "none"`. Drive-by: `modules/create-vm/main.tf` picked up cosmetic alignment via `terraform fmt -recursive` during the audit. Behavior-neutral. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:42 +00:00
Viktor Barzin	29bb434e1e	ig-poster: 69e395f2 + sync IMMICH_PG_* via ESO for CLIP scoring; postiz publish-notify n8n workflow	2026-05-10 11:12:38 +00:00
Viktor Barzin	badc341669	openclaw: regenerate kubeconfig at pod start using projected SA tokenFile The previously-baked kubeconfig at /home/node/.openclaw/kubeconfig retained a service-account token bound to the original (long-dead) pod, so kubectl calls from inside the openclaw container failed with "the server has asked for the client to provide credentials" even though the openclaw SA has cluster-admin and kubelet projects a fresh token at /var/run/secrets/kubernetes.io/serviceaccount/token. Add init-container "setup-kubeconfig" that writes a kubeconfig with tokenFile + certificate-authority paths pointing at the projected SA volume — kubelet auto-rotates the token, kubectl always reads fresh creds, no Vault K8s-creds-engine refresh needed. Verified end-to-end: agent ran `kubectl get nodes -o wide` inside the pod and delivered a correct one-line summary to Telegram via openai-codex/gpt-5.4-mini.	2026-05-10 11:12:37 +00:00
Viktor Barzin	41655096c7	openclaw: realtime usage dashboard via Prometheus exporter sidecar Stdlib-only Python exporter ($1) reads ~/.openclaw/agents/*/sessions/*.jsonl (assistant messages with usage) plus auth-profiles.json (OAuth expiry, Plus-tier label) and exposes Prometheus text format on :9099/metrics. Container is python:3.12-slim; pod template gets prometheus.io/scrape annotations so the existing kubernetes-pods job picks it up — no ServiceMonitor needed. Metrics exported: openclaw_codex_messages_total{provider,model,session_kind} counter openclaw_codex_input/output/cache_read/cache_write_tokens_total openclaw_codex_message_errors_total{reason} openclaw_codex_active_sessions{kind} gauge openclaw_codex_oauth_expiry_seconds{provider,account,plan} gauge openclaw_codex_last_run_timestamp gauge Grafana dashboard "OpenClaw — Codex Usage" (Applications folder, 30s refresh): messages/5h vs Plus rate-card, % of 1,200 floor, tokens/5h, cache hit %, OAuth expiry days, active sessions, last-turn age, errors, plus per-model timeseries + bar gauge + error table. Plus rate-card thresholds in the gauge are conservative (1,200/5h floor; real cap is dynamic 1,200–7,000). Re-baseline if throttling shows up below 80%.	2026-05-07 23:29:32 +00:00
Viktor Barzin	115ca184ff	openclaw: switch primary to ChatGPT Plus OAuth (openai-codex/gpt-5.4-mini) Bumps image 2026.2.26 → 2026.5.4 (openai-codex provider plugin landed in 2026.4.21+). Auth profile is OAuth via the device-pairing flow against the Codex backend (account ancaelena98@gmail.com); token persists in /home/node/.openclaw/agents/main/agent/auth-state.json on NFS so it survives pod restarts. Plus tier accepts gpt-5.4-mini (1,200–7,000 local msgs/5h); gpt-5-mini and gpt-5.1-codex-mini both return errors on Plus, so we pin gpt-5.4-mini explicitly. doctor --fix auto-promotes the highest-tier model (gpt-5-pro) after model discovery, so the container command pins the mini back as default after doctor runs but before gateway start.	2026-05-07 23:29:32 +00:00
Viktor Barzin	8b180f7662	openclaw: switch primary model to qwen3-coder-480b (qwen3.5-397b dead on NIM) NVIDIA retired nim/qwen/qwen3.5-397b-a17b — modelrelay shows consistent TIMEOUTs over 24h+ of pings, and nim/nvidia/llama-3.1-nemotron-ultra-253b-v1 returns 404. With both gone the openclaw failover never reached mistral-large-3 in time, so every message hung until the 120s embedded-run timeout. Promote qwen3-coder-480b-a35b-instruct (already in models list, UP ~1-2s, 256k ctx) to primary; drop the dead nemotron-ultra fallback.	2026-05-07 23:29:31 +00:00
Viktor Barzin	a0d770d9a7	[cluster-health] Expand to 42 checks, remove pod CronJob path - scripts/cluster_healthcheck.sh: add 12 new checks (cert-manager readiness/expiry/requests, backup freshness per-DB/offsite/LVM, monitoring prom+AM/vault-sealed/CSS, external reachability cloudflared +authentik/ExternalAccessDivergence/traefik-5xx). Bump TOTAL_CHECKS to 42, add --no-fix flag. - Remove the duplicate pod-version .claude/cluster-health.sh (1728 lines) and the openclaw cluster_healthcheck CronJob (local CLI is now the single authoritative runner). Keep the healthcheck SA + Role + RoleBinding — still reused by task_processor CronJob. - Remove SLACK_WEBHOOK_URL env from openclaw deployment and delete the unused setup-monitoring.sh. - Rewrite .claude/skills/cluster-health/SKILL.md: mandates running the script first, refreshes the 42-check table, drops stale CronJob/Slack/post-mortem sections, documents the monorepo-canonical + hardlink layout. File is hardlinked to /home/wizard/code/.claude/skills/cluster-health/SKILL.md for dual discovery. - AGENTS.md + k8s-portal agent page: 25-check → 42-check. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-19 15:13:03 +00:00
Viktor Barzin	581aed5fcc	[openclaw,tor-proxy] Opt task-webhook + torrserver out of external monitoring Adds `external_monitor = false` to the ingress_factory calls for task-webhook and torrserver so the `external-monitor-sync` CronJob stops auto-creating `[External] <name>` monitors for them. Both services remain deployed and reachable; only the Uptime Kuma monitors are dropped.	2026-04-19 13:01:36 +00:00
Viktor Barzin	327ce215b9	[infra] Sweep dns_config ignore_changes across all pod-owning resources [ci skip] ## Context Wave 3A (commit `c9d221d5`) added the `# KYVERNO_LIFECYCLE_V1` marker to the 27 pre-existing `ignore_changes = [...dns_config]` sites so they could be grepped and audited. It did NOT address pod-owning resources that were simply missing the suppression entirely. Post-Wave-3A sampling (2026-04-18) found that navidrome, f1-stream, frigate, servarr, monitoring, crowdsec, and many other stacks showed perpetual `dns_config` drift every plan because their `kubernetes_deployment` / `kubernetes_stateful_set` / `kubernetes_cron_job_v1` resources had no `lifecycle {}` block at all. Root cause (same as Wave 3A): Kyverno's admission webhook stamps `dns_config { option { name = "ndots"; value = "2" } }` on every pod's `spec.template.spec.dns_config` to prevent NxDomain search-domain flooding (see `k8s-ndots-search-domain-nxdomain-flood` skill). Without `ignore_changes` on every Terraform-managed pod-owner, Terraform repeatedly tries to strip the injected field. ## This change Extends the Wave 3A convention by sweeping EVERY `kubernetes_deployment`, `kubernetes_stateful_set`, `kubernetes_daemon_set`, `kubernetes_cron_job_v1`, `kubernetes_job_v1` (+ their `_v1` variants) in the repo and ensuring each carries the right `ignore_changes` path: - kubernetes_deployment / stateful_set / daemon_set / job_v1: `spec[0].template[0].spec[0].dns_config` - kubernetes_cron_job_v1: `spec[0].job_template[0].spec[0].template[0].spec[0].dns_config` (extra `job_template[0]` nesting — the CronJob's PodTemplateSpec is one level deeper) Each injection / extension is tagged `# KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2` inline so the suppression is discoverable via `rg 'KYVERNO_LIFECYCLE_V1' stacks/`. Two insertion paths are handled by a Python pass (`/tmp/add_dns_config_ignore.py`): 1. No existing `lifecycle {}`: inject a brand-new block just before the resource's closing `}`. 108 new blocks on 93 files. 2. Existing `lifecycle {}` (usually for `DRIFT_WORKAROUND: CI owns image tag` from Wave 4, commit a62b43d1): extend its `ignore_changes` list with the dns_config path. Handles both inline (`= [x]`) and multiline (`= [\n x,\n]`) forms; ensures the last pre-existing list item carries a trailing comma so the extended list is valid HCL. 34 extensions. The script skips anything already mentioning `dns_config` inside an `ignore_changes`, so re-running is a no-op. ## Scale - 142 total lifecycle injections/extensions - 93 `.tf` files touched - 108 brand-new `lifecycle {}` blocks + 34 extensions of existing ones - Every Tier 0 and Tier 1 stack with a pod-owning resource is covered - Together with Wave 3A's 27 pre-existing markers → 169 greppable `KYVERNO_LIFECYCLE_V1` dns_config sites across the repo ## What is NOT in this change - `stacks/trading-bot/main.tf` — entirely commented-out block (`/* … */`). Python script touched the file, reverted manually. - `_template/main.tf.example` skeleton — kept minimal on purpose; any future stack created from it should either inherit the Wave 3A one-line form or add its own on first `kubernetes_deployment`. - `terraform fmt` fixes to pre-existing alignment issues in meshcentral, nvidia/modules/nvidia, vault — unrelated to this commit. Left for a separate fmt-only pass. - Non-pod resources (`kubernetes_service`, `kubernetes_secret`, `kubernetes_manifest`, etc.) — they don't own pods so they don't get Kyverno dns_config mutation. ## Verification Random sample post-commit: ``` $ cd stacks/navidrome && ../../scripts/tg plan → No changes. $ cd stacks/f1-stream && ../../scripts/tg plan → No changes. $ cd stacks/frigate && ../../scripts/tg plan → No changes. $ rg -c 'KYVERNO_LIFECYCLE_V1' stacks/ --include='*.tf' --include='*.tf.example' \ | awk -F: '{s+=$2} END {print s}' 169 ``` ## Reproduce locally 1. `git pull` 2. `rg 'KYVERNO_LIFECYCLE_V1' stacks/ | wc -l` → 169+ 3. `cd stacks/navidrome && ../../scripts/tg plan` → expect 0 drift on the deployment's dns_config field. Refs: code-seq (Wave 3B dns_config class closed; kubernetes_manifest annotation class handled separately in `8d94688d` for tls_secret) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-18 21:19:48 +00:00
Viktor Barzin	8b43692af0	[infra] Suppress Goldilocks vpa-update-mode label drift on all namespaces [ci skip] ## Context Wave 3B-continued: the Goldilocks VPA dashboard (stacks/vpa) runs a Kyverno ClusterPolicy `goldilocks-vpa-auto-mode` that mutates every namespace with `metadata.labels["goldilocks.fairwinds.com/vpa-update-mode"] = "off"`. This is intentional — Terraform owns container resource limits, and Goldilocks should only provide recommendations, never auto-update. The label is how Goldilocks decides per-namespace whether to run its VPA in `off` mode. Effect on Terraform: every `kubernetes_namespace` resource shows the label as pending-removal (`-> null`) on every `scripts/tg plan`. Dawarich survey 2026-04-18 confirmed the drift. Cluster-side count: 88 namespaces carry the label (`kubectl get ns -o json | jq ... | wc -l`). Every TF-managed namespace is affected. This commit brings the intentional admission drift under the same `# KYVERNO_LIFECYCLE_V1` discoverability marker introduced in `c9d221d5` for the ndots dns_config pattern. The marker now stands generically for any Kyverno admission-webhook drift suppression; the inline comment records which specific policy stamps which specific field so future grep audits show why each suppression exists. ## This change 107 `.tf` files touched — every stack's `resource "kubernetes_namespace"` resource gets: ```hcl lifecycle { # KYVERNO_LIFECYCLE_V1: goldilocks-vpa-auto-mode ClusterPolicy stamps this label on every namespace ignore_changes = [metadata[0].labels["goldilocks.fairwinds.com/vpa-update-mode"]] } ``` Injection was done with a brace-depth-tracking Python pass (`/tmp/add_goldilocks_ignore.py`): match `^resource "kubernetes_namespace" ` → track `{` / `}` until the outermost closing brace → insert the lifecycle block before the closing brace. The script is idempotent (skips any file that already mentions `goldilocks.fairwinds.com/vpa-update-mode`) so re-running is safe. Vault stack picked up 2 namespaces in the same file (k8s-users produces one, plus a second explicit ns) — confirmed via file diff (+8 lines). ## What is NOT in this change - `stacks/trading-bot/main.tf` — entire file is `/* … */` commented out (paused 2026-04-06 per user decision). Reverted after the script ran. - `stacks/_template/main.tf.example` — per-stack skeleton, intentionally minimal. User keeps it that way. Not touched by the script (file has no real `resource "kubernetes_namespace"` — only a placeholder comment). - `.terraform/` copies (e.g. `stacks/metallb/.terraform/modules/...`) — gitignored, won't commit; the live path was edited. - `terraform fmt` cleanup of adjacent pre-existing alignment issues in authentik, freedify, hermes-agent, nvidia, vault, meshcentral. Reverted to keep the commit scoped to the Goldilocks sweep. Those files will need a separate fmt-only commit or will be cleaned up on next real apply to that stack. ## Verification Dawarich (one of the hundred-plus touched stacks) showed the pattern before and after: ``` $ cd stacks/dawarich && ../../scripts/tg plan Before: Plan: 0 to add, 2 to change, 0 to destroy. # kubernetes_namespace.dawarich will be updated in-place (goldilocks.fairwinds.com/vpa-update-mode -> null) # module.tls_secret.kubernetes_secret.tls_secret will be updated in-place (Kyverno generate.* labels — fixed in `8d94688d`) After: No changes. Your infrastructure matches the configuration. ``` Injection count check: ``` $ rg -c 'KYVERNO_LIFECYCLE_V1: goldilocks-vpa-auto-mode' stacks/ | awk -F: '{s+=$2} END {print s}' 108 ``` ## Reproduce locally 1. `git pull` 2. Pick any stack: `cd stacks/<name> && ../../scripts/tg plan` 3. Expect: no drift on the namespace's goldilocks.fairwinds.com/vpa-update-mode label. Closes: code-dwx Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-18 21:15:27 +00:00
Viktor Barzin	c9d221d578	[infra] Establish KYVERNO_LIFECYCLE_V1 drift-suppression convention [ci skip] ## Context Phase 1 of the state-drift consolidation audit (plan Wave 3) identified that the entire repo leans on a repeated `lifecycle { ignore_changes = [...dns_config] }` snippet to suppress Kyverno's admission-webhook dns_config mutation (the ndots=2 override that prevents NxDomain search-domain flooding). 27 occurrences across 19 stacks. Without this suppression, every pod-owning resource shows perpetual TF plan drift. The original plan proposed a shared `modules/kubernetes/kyverno_lifecycle/` module emitting the ignore-paths list as an output that stacks would consume in their `ignore_changes` blocks. That approach is architecturally impossible: Terraform's `ignore_changes` meta-argument accepts only static attribute paths — it rejects module outputs, locals, variables, and any expression (the HCL spec evaluates `lifecycle` before the regular expression graph). So a DRY module cannot exist. The canonical pattern IS the repeated snippet. What the snippet was missing was a discoverability tag so that (a) new resources can be validated for compliance, (b) the existing 27 sites can be grep'd in a single command, and (c) future maintainers understand the convention rather than each reinventing it. ## This change - Introduces `# KYVERNO_LIFECYCLE_V1` as the canonical marker comment. Attached inline on every `spec[0].template[0].spec[0].dns_config` line (or `spec[0].job_template[0].spec[0]...` for CronJobs) across all 27 existing suppression sites. - Documents the convention with rationale and copy-paste snippets in `AGENTS.md` → new "Kyverno Drift Suppression" section. - Expands the existing `.claude/CLAUDE.md` Kyverno ndots note to reference the marker and explain why the module approach is blocked. - Updates `_template/main.tf.example` so every new stack starts compliant. ## What is NOT in this change - The `kubernetes_manifest` Kyverno annotation drift (beads `code-seq`) — that is Phase B with a sibling `# KYVERNO_MANIFEST_V1` marker. - Behavioral changes — every `ignore_changes` list is byte-identical save for the inline comment. - The fallback module the original plan anticipated — skipped because Terraform rejects expressions in `ignore_changes`. - `terraform fmt` cleanup on adjacent unrelated blocks in three files (claude-agent-service, freedify/factory, hermes-agent). Reverted to keep this commit scoped to the convention rollout. ## Before / after Before (cannot distinguish accidental-forgotten from intentional-convention): ```hcl lifecycle { ignore_changes = [spec[0].template[0].spec[0].dns_config] } ``` After (greppable, self-documenting, discoverable by tooling): ```hcl lifecycle { ignore_changes = [spec[0].template[0].spec[0].dns_config] # KYVERNO_LIFECYCLE_V1 } ``` ## Test Plan ### Automated ``` $ rg -c 'KYVERNO_LIFECYCLE_V1' stacks/ --include='*.tf' --include='*.tf.example' \ | awk -F: '{s+=$2} END {print s}' 27 $ git diff --stat | grep -E '\.(tf|tf\.example|md)$' | wc -l 21 # All code-file diffs are 1 insertion + 1 deletion per marker site, # except beads-server (3), ebooks (4), immich (3), uptime-kuma (2). $ git diff --stat stacks/ | tail -1 20 files changed, 45 insertions(+), 28 deletions(-) ``` ### Manual Verification No apply required — HCL comments only. Zero effect on any stack's plan output. Future audits: `rg 'KYVERNO_LIFECYCLE_V1' stacks/ | wc -l` must grow as new pod-owning resources are added. ## Reproduce locally 1. `cd infra && git pull` 2. `rg 'KYVERNO_LIFECYCLE_V1' stacks/` → expect 27 hits in 19 files 3. Grep any new `kubernetes_deployment` for the marker; absence = missing suppression. Closes: code-28m Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-18 14:15:51 +00:00
Viktor Barzin	e80b2f026f	[infra] Migrate Terraform state from local SOPS to PostgreSQL backend Two-tier state architecture: - Tier 0 (infra, platform, cnpg, vault, dbaas, external-secrets): local state with SOPS encryption in git — unchanged, required for bootstrap. - Tier 1 (105 app stacks): PostgreSQL backend on CNPG cluster at 10.0.20.200:5432/terraform_state with native pg_advisory_lock. Motivation: multi-operator friction (every workstation needed SOPS + age + git-crypt), bootstrap complexity for new operators, and headless agents/CI needing the full encryption toolchain just to read state. Changes: - terragrunt.hcl: conditional backend (local vs pg) based on tier0 list - scripts/tg: tier detection, auto-fetch PG creds from Vault for Tier 1, skip SOPS and Vault KV locking for Tier 1 stacks - scripts/state-sync: tier-aware encrypt/decrypt (skips Tier 1) - scripts/migrate-state-to-pg: one-shot migration script (idempotent) - stacks/vault/main.tf: pg-terraform-state static role + K8s auth role for claude-agent namespace - stacks/dbaas: terraform_state DB creation + MetalLB LoadBalancer service on shared IP 10.0.20.200 - Deleted 107 .tfstate.enc files for migrated Tier 1 stacks - Cleaned up per-stack tiers.tf (now generated by root terragrunt.hcl) [ci skip] Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-16 19:33:12 +00:00
Viktor Barzin	216d4240c9	[infra] Add Cloudflare provider to all stack lock files and generated providers Terragrunt now generates cloudflare_provider.tf (Vault-sourced API key) and includes cloudflare in required_providers. These are the generated files from running `terragrunt init -upgrade` across all stacks. [ci skip] Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-16 16:31:36 +00:00
Viktor Barzin	b1d152be1f	[infra] Auto-create Cloudflare DNS records from ingress_factory ## Context Deploying new services required manually adding hostnames to cloudflare_proxied_names/cloudflare_non_proxied_names in config.tfvars — a separate file from the service stack. This was frequently forgotten, leaving services unreachable externally. ## This change: - Add `dns_type` parameter to `ingress_factory` and `reverse_proxy/factory` modules. Setting `dns_type = "proxied"` or `"non-proxied"` auto-creates the Cloudflare DNS record (CNAME to tunnel or A/AAAA to public IP). - Simplify cloudflared tunnel from 100 per-hostname rules to wildcard `*.viktorbarzin.me → Traefik`. Traefik still handles host-based routing. - Add global Cloudflare provider via terragrunt.hcl (separate cloudflare_provider.tf with Vault-sourced API key). - Migrate 118 hostnames from centralized config.tfvars to per-service dns_type. 17 hostnames remain centrally managed (Helm ingresses, special cases). - Update docs, AGENTS.md, CLAUDE.md, dns.md runbook. ``` BEFORE: config.tfvars (manual list) -> stacks/cloudflared/ (for_each = list, cloudflare_record, tunnel per-hostname) AFTER: stacks/<svc>/main.tf (module "ingress" { dns_type = "proxied" }) -> auto-creates cloudflare_record + annotation ``` ## What is NOT in this change: - Uptime Kuma monitor migration (still reads from config.tfvars) - 17 remaining centrally-managed hostnames (Helm, special cases) - Removal of allow_overwrite (keep until migration confirmed stable) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-16 13:45:04 +00:00
Viktor Barzin	30cdeefb1c	chore: sync terraform state after nfsvers=4 convergence Applied all 20 NFS stacks to converge PV mount_options (nfsvers=4). State files encrypted and committed. [ci skip] Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-14 11:20:18 +00:00
Viktor Barzin	82b0f6c4cb	truenas deprecation: migrate all non-immich storage to proxmox NFS - Migrate 7 backup CronJobs to Proxmox host NFS (192.168.1.127) (etcd, mysql, postgresql, nextcloud, redis, vaultwarden, plotting-book) - Migrate headscale backup, ebook2audiobook, osm_routing to Proxmox NFS - Migrate servarr (lidarr, readarr, soulseek) NFS refs to Proxmox - Remove 79 orphaned TrueNAS NFS module declarations from 49 stacks - Delete stacks/platform/modules/ (27 dead module copies, 65MB) - Update nfs-truenas StorageClass to point to Proxmox (192.168.1.127) - Remove iscsi DNS record from config.tfvars - Fix woodpecker persistence config and alertmanager PV Only Immich (8 PVCs, ~1.4TB) remains on TrueNAS.	2026-04-12 14:35:39 +01:00
Viktor Barzin	cb8a808700	feat(storage): migrate 38 NFS PVCs to proxmox-lvm (Wave 2) Add proxmox-lvm PVCs with pvc-autoresizer annotations for all remaining single-pod app data services. Deployments updated to use new block storage PVCs. Old NFS modules retained for rollback. Services: affine, changedetection, diun, excalidraw, f1-stream, hackmd, isponsorblocktv, matrix, n8n, send, grampsweb, health, onlyoffice, owntracks, paperless-ngx, privatebin, resume, speedtest, stirling-pdf, tandoor, rybbit (clickhouse), tor-proxy (torrserver), whisper+piper, frigate (config), ollama (ui), servarr (prowlarr/listenarr/qbittorrent), aiostreams, freshrss (extensions), meshcentral (data+files), openclaw (data+home+openlobster), technitium, mailserver (data+roundcube html+enigma), dbaas (pgadmin). Strategy set to Recreate where needed for RWO volumes.	2026-04-04 19:25:12 +03:00
Viktor Barzin	46444e0306	openclaw: remove install-dotfiles init container to reduce NFS writes The init container was cloning the dotfiles repo via git on every pod start, causing 200+ small NFS writes that amplified through ZFS. Dotfiles already exist on NFS from a previous clone — no need to re-clone on every restart. To update dotfiles, run git pull manually. Also cleaned up stale Uptime Kuma files (1.6GB old SQLite DB + 289MB error log left over from migration to MariaDB).	2026-03-29 01:11:33 +02:00
Viktor Barzin	f0eb4fae8b	fix: openclaw task-processor use internal Forgejo URL The task-processor CronJob was failing every 5min because it used https://forgejo.viktorbarzin.me (external, via Cloudflare tunnel) which is unreachable from within the cluster. Changed to http://forgejo.forgejo.svc.cluster.local (internal ClusterIP).	2026-03-24 19:40:15 +02:00
Viktor Barzin	1c13af142d	sync regenerated providers.tf + upstream changes - Terragrunt-regenerated providers.tf across stacks (vault_root_token variable removed from root generate block) - Upstream monitoring/openclaw/CLAUDE.md changes from rebase	2026-03-22 02:56:04 +02:00
Viktor Barzin	6b8ce04d44	fix(openclaw): change agent workspace from /workspace/infra to /workspace Keeps infra repo as a subdirectory, allows OpenClaw to write to /workspace directly.	2026-03-19 23:32:28 +00:00
Viktor Barzin	21bb3036af	state(dbaas): update encrypted state	2026-03-19 20:23:59 +00:00
Viktor Barzin	1acf8cc4e8	migrate consuming stacks to ESO + remove k8s-dashboard static token Phase 9: ExternalSecret migration across 26 stacks: Fully migrated (vault data source removed, ESO delivers secrets): - speedtest, shadowsocks, wealthfolio, plotting-book, f1-stream, tandoor - n8n, dawarich, diun, netbox, onlyoffice, tuya-bridge - hackmd (ESO template for DB URL), health (ESO template for DB URL) - trading-bot (ESO template for DATABASE_URL + 7 secret env vars) - forgejo (removed unused vault data source) Partially migrated (vault kept for plan-time, ESO added for runtime): - immich, linkwarden, nextcloud, paperless-ngx (jsondecode for homepage) - claude-memory, rybbit, url, webhook_handler (plan-time in locals/jobs) - woodpecker, openclaw, resume (plan-time in helm values/jobs/modules) 17 stacks unchanged (all plan-time: homepage annotations, configmaps, module inputs) — vault data source works with OIDC auth. Phase 17a: Remove k8s-dashboard static admin token secret. Users now get tokens via: vault write kubernetes/creds/dashboard-admin	2026-03-15 19:05:04 +00:00
Viktor Barzin	1fe7798609	fix openclaw init container: escape shell vars, fix image path [ci skip] - Use $$ for shell variable escaping in Terraform ($ is Terraform interpolation) - Fix image: docker.io/alpine/git (not library/alpine/git) - Inline command instead of heredoc to avoid Terraform interpolation issues	2026-03-15 17:19:03 +00:00
Viktor Barzin	3aba29e7a3	remove SOPS pipeline, deploy ESO + Vault DB/K8s engines Vault is now the sole source of truth for secrets. SOPS pipeline removed entirely — auth via `vault login -method=oidc`. Part A: SOPS removal - vault/main.tf: delete 990 lines (93 vars + 43 KV write resources), add self-read data source for OIDC creds from secret/vault - terragrunt.hcl: remove SOPS var loading, vault_root_token, check_secrets hook - scripts/tg: remove SOPS decryption, keep -auto-approve logic - .woodpecker/default.yml: replace SOPS with Vault K8s auth via curl - Delete secrets.sops.json, .sops.yaml Part B: External Secrets Operator - New stack stacks/external-secrets/ with Helm chart + 2 ClusterSecretStores (vault-kv for KV v2, vault-database for DB engine) Part C: Database secrets engine (in vault/main.tf) - MySQL + PostgreSQL connections with static role rotation (24h) - 6 MySQL roles (speedtest, wrongmove, codimd, nextcloud, shlink, grafana) - 6 PostgreSQL roles (trading, health, linkwarden, affine, woodpecker, claude_memory) Part D: Kubernetes secrets engine (in vault/main.tf) - RBAC for Vault SA to manage K8s tokens - Roles: dashboard-admin, ci-deployer, openclaw, local-admin - New scripts/vault-kubeconfig helper for dynamic kubeconfig K8s auth method with scoped policies for CI, ESO, OpenClaw, Woodpecker sync.	2026-03-15 16:37:38 +00:00
Viktor Barzin	deeea5edab	openclaw: replace cc-config NFS with dotfiles repo clone [ci skip] - Add init container "install-dotfiles" that clones the dotfiles repo and installs skills/agents/hooks to OpenClaw's home directory - Remove nfs_cc_config module and its volume mount - Skills/agents now come from the same chezmoi-managed dotfiles repo that manages the Mac config, eliminating the dual-sync problem	2026-03-15 16:04:02 +00:00
Viktor Barzin	194281e527	right-size cluster memory: reduce overprovisioned, fix under-provisioned services Phase 1 - Quick wins (~4.5 Gi saved): - democratic-csi: add explicit sidecar resources (64-80Mi vs 256Mi LimitRange default) - caretta: 768Mi → 600Mi (VPA upper 485Mi) - immich-ml: 4Gi → 3584Mi (VPA upper 2.95Gi, GPU margin) - onlyoffice: 3Gi → 2304Mi (VPA upper 1.82Gi) Phase 2 - Safety fixes (prevent OOMKills): - frigate: 2Gi/8Gi → 5Gi/10Gi (VPA upper 7.7Gi, was 4% headroom) - openclaw: 1280Mi req → 2Gi req=limit (documented 2Gi requirement) Phase 3 - Additional right-sizing: - authentik workers: 1Gi → 896Mi x3 (VPA upper 722Mi) - shlink: 512Mi/768Mi → 960Mi req=limit (VPA upper 780Mi, safety increase) Phase 4 - Burstable QoS for lower tiers: - tier-3-edge: 128Mi/128Mi → 96Mi req / 192Mi limit - tier-4-aux: 128Mi/128Mi → 64Mi req / 256Mi limit Phase 5 - Monitoring: - Add ClusterMemoryRequestsHigh alert (>85% allocatable, 15m) - Add ContainerNearOOM alert (>85% limit, 30m) - Add PodUnschedulable alert (5m, critical) Cluster: 92.7% → 90.8% memory requests. Stirling-pdf now schedulable.	2026-03-15 15:30:18 +00:00
Viktor Barzin	18d012db11	fix: reduce openclaw memory requests for scheduling - openclaw: request 1280Mi (limit 2Gi), modelrelay request 128Mi (limit 256Mi). Total request 1408Mi fits available capacity.	2026-03-15 10:47:34 +00:00
Viktor Barzin	56ddee457a	fix: openclaw policy violation + reduce memory requests for capacity - openclaw: fix Kyverno policy violation (node:22-alpine -> docker.io/library/node:22-alpine), reduce request to 1536Mi with 2Gi limit for overcommit - rybbit/clickhouse: reduce 1Gi -> 768Mi (frees 256Mi) - stirling-pdf: reduce 1536Mi -> 1200Mi (frees 336Mi)	2026-03-15 10:37:58 +00:00
Viktor Barzin	4a27345057	enable memory-core plugin for OpenClaw [ci skip] - Add memory-core to plugins.allow and plugins.slots.memory - Add /app/extensions to plugin load paths - Update CLAUDE.md memory instructions to reference native tools	2026-03-15 03:22:07 +00:00
Viktor Barzin	6f562b5da6	add vaultwarden daily backup CronJob to NFS SQLite backup via Online Backup API + copy of RSA keys, attachments, sends, and config. 30-day retention with rotation. Pod affinity ensures co-scheduling with vaultwarden for RWO PVC access.	2026-03-15 00:03:59 +00:00
Viktor Barzin	46afa85b01	fix openclaw config mount and OOM: use init container, increase memory to 2Gi - Replace subPath ConfigMap mount with init container that copies openclaw.json to writable NFS home (OpenClaw writes back to the file at runtime) - Remove invalid memory-api plugin references causing "Config invalid" - Increase memory to 2Gi (req+limit) with NODE_OPTIONS=--max-old-space-size=1536 - Fix tg wrapper to inject -auto-approve when apply --non-interactive is used	2026-03-14 23:42:17 +00:00
Viktor Barzin	eb0301b02b	lower memory limits closer to actual usage openclaw: 1536Mi -> 768Mi, affine: 256Mi -> 128Mi, rybbit: 512Mi -> 384Mi. Also patched via kubectl: aiostreams, cloudflared, crowdsec, uptime-kuma, vaultwarden, pgadmin, phpmyadmin, goflow2, sealed-secrets, ebook2audiobook.	2026-03-14 21:15:26 +00:00
Viktor Barzin	f7c2c06009	right-size memory: set requests=limits based on actual usage - Set memory requests = limits across 56 stacks to prevent overcommit - Right-sized limits based on actual pod usage (2x actual, rounded up) - Scaled down trading-bot (replicas=0) to free memory - Fixed OOMKilled services: forgejo, dawarich, health, meshcentral, paperless-ngx, vault auto-unseal, rybbit, whisper, openclaw, clickhouse - Added startup+liveness probes to calibre-web - Bumped inotify limits on nodes 2,3 (max_user_instances 128->8192) Post node2 OOM incident (2026-03-14). Previous kubelet config had no kubeReserved/systemReserved set, allowing pods to starve the kernel.	2026-03-14 21:01:24 +00:00
Viktor Barzin	a8d944eb9b	migrate all secrets from SOPS to Vault KV - Add vault provider to root terragrunt.hcl (generated providers.tf) - Delete stacks/vault/vault_provider.tf (now in generated providers.tf) - Add 124 variable declarations + 43 vault_kv_secret_v2 resources to vault/main.tf to populate Vault KV at secret/<stack-name> - Migrate 43 consuming stacks to read secrets from Vault KV via data "vault_kv_secret_v2" instead of SOPS var-file - Add dependency "vault" to all migrated stacks' terragrunt.hcl - Complex types (maps/lists) stored as JSON strings, decoded with jsondecode() in locals blocks Bootstrap secrets (vault_root_token, vault_authentik_client_id, vault_authentik_client_secret) remain in SOPS permanently. Apply order: vault stack first (populates KV), then all others.	2026-03-14 17:15:48 +00:00
Viktor Barzin	39b7dac1a9	fix: bump openclaw memory limit to 1536Mi Was hitting V8 heap OOM at 768Mi during LLM orchestration.	2026-03-14 16:45:57 +00:00
Viktor Barzin	2be858f616	fix: eliminate memory overcommit to prevent node OOM crashes Set requests = limits (Guaranteed QoS) across LimitRange defaults and explicit pod resources. Node2 crashed 2026-03-14 from 250% memory overcommit (61GB limits on 24GB node). Changes: - LimitRange: default = defaultRequest for all 6 tiers - Grafana: 3 → 2 replicas - Grampsweb: document why replicas=0 - Prometheus: 1Gi/4Gi → 3Gi/3Gi - OpenClaw: 512Mi/2Gi → 768Mi/768Mi - Immich server: 256Mi/2Gi → 512Mi/512Mi - Immich postgresql: 256Mi/1Gi → 512Mi/512Mi - Calibre: 256Mi/1536Mi → 256Mi/256Mi - Linkwarden: 256Mi/1536Mi → 768Mi/768Mi - N8N: 256Mi/1Gi → 512Mi/512Mi - MySQL cluster: 1Gi/3-4Gi → 2Gi/2Gi - pg-cluster (CNPG): 512Mi/4Gi → 512Mi/512Mi - DBaaS ResourceQuota limits.memory: 64Gi → 12Gi [ci skip]	2026-03-14 16:01:41 +00:00
Viktor Barzin	b00f810d3d	Remove all CPU limits cluster-wide to eliminate CFS throttling CPU limits cause CFS throttling even when nodes have idle capacity. Move to a request-only CPU model: keep CPU requests for scheduling fairness but remove all CPU limits. Memory limits stay (incompressible). Changes across 108 files: - Kyverno LimitRange policy: remove cpu from default/max in all 6 tiers - Kyverno ResourceQuota policy: remove limits.cpu from all 5 tiers - Custom ResourceQuotas: remove limits.cpu from 8 namespace quotas - Custom LimitRanges: remove cpu from default/max (nextcloud, onlyoffice) - RBAC module: remove cpu_limits variable and quota reference - Freedify factory: remove cpu_limit variable and limits reference - 86 deployment files: remove cpu from all limits blocks - 6 Helm values files: remove cpu under limits sections	2026-03-14 08:51:45 +00:00
Viktor Barzin	76a4987eef	[ci skip] add Forgejo task pipeline for OpenClaw AI agent Forgejo issues as a task queue for OpenClaw: - Forgejo OAuth2 with Authentik SSO, self-registration disabled - Webhook-triggered task processing (instant) + CronJob backup (5min poll) - Tasks processed via Mistral Large 3 (NVIDIA NIM API) - Results posted as issue comments, auto-labeled and closed - Comment follow-ups and reopened issues supported - n8n RBAC for OpenClaw pod exec (future workflow integration)	2026-03-07 21:11:07 +00:00
Viktor Barzin	6bd3970579	[ci skip] add Homepage gethomepage.dev annotations to all services Add Kubernetes ingress annotations for Homepage auto-discovery across ~88 services organized into 11 groups. Enable serviceAccount for RBAC, configure group layouts, and add Grafana/Frigate/Speedtest widgets.	2026-03-07 20:39:54 +00:00

76 commits