infra

Author	SHA1	Message	Date
Viktor Barzin	5ef2642eda	claude-agent: replace unused 10Gi PVC with 5Gi NFS-backed /persistent The 10Gi proxmox-lvm-encrypted PVC `claude-agent-workspace-encrypted` was declared in TF but never wired into the deployment — the `workspace` volume_mount pointed at an emptyDir, so the PVC sat allocated and idle from 2026-04-15 to 2026-05-11. Restructured per the design intent: * `workspace` (emptyDir) — fast per-pod ephemeral scratch for git clones. Each agent job clones the infra repo fresh, so persistence doesn't buy anything and emptyDir avoids RWO contention if the deployment is ever scaled past 1 replica. * `persistent` (5Gi NFS-backed RWX) — mounted at /persistent for cases where the agent needs to write state that should survive pod restarts (caches, ad-hoc outputs). RWX so all replicas share it; the service's sequential-mutex lock prevents concurrent writes. Also fixed `fix-perms` init container: the Dockerfile's `WORKDIR /workspace/infra` causes kubelet to create that path inside the emptyDir as root:fsGroup with the setgid bit, which uid 1000 can't write to. Pre-create the path + chmod 0775 to make it writable. NFS export already exists on the PVE host (/srv/nfs/claude-agent-persistent, owned 1000:1000). Verified: pod runs 1/1; `/persistent` writable as agent uid 1000; git-init successfully clones infra into /workspace/infra.	2026-05-11 20:40:21 +00:00
Viktor Barzin	bf752dffa5	fix: pvc-autoresizer + TF drift safety — bulk add ignore_changes After fixing the threshold=80% misconfig and seeing two PVCs (prometheus + technitium primary) get stuck Terminating, a 3rd round showed four more PVCs (frigate, hackmd, immich-postgresql, paperless-ngx) in the same state. Same root cause: TF spec'd a smaller storage size than the autoresizer-grown live value, K8s rejected the shrink, TF force-replaced the PVC, and the pvc-protection finalizer held it in Terminating while the pod kept using the underlying volume. Bulk-inject lifecycle.ignore_changes = [spec[0].resources[0].requests] on every kubernetes_persistent_volume_claim block that has resize.topolvm.io/threshold annotations. The pattern was already documented in .claude/CLAUDE.md but ~63 stacks were missing it. Live PVCs are unaffected; this only prevents future TF applies from attempting the destroy+recreate. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-10 21:57:01 +00:00
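The failure mode in this commit (Terraform declaring a smaller size than the autoresizer-grown live value) can be spotted before an apply. This is a minimal Python sketch, assuming sizes are plain binary-suffix quantities such as `200Gi`. The helper names and the sample inventory are hypothetical and not part of the repo.

```python
# Hypothetical pre-apply check: flag PVCs whose declared (Terraform) size is
# smaller than the live size, i.e. the ones an apply would try to shrink.
UNITS = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40}


def parse_quantity(quantity: str) -> int:
    """Convert a binary-suffix quantity such as '200Gi' to bytes."""
    for suffix, factor in UNITS.items():
        if quantity.endswith(suffix):
            return int(float(quantity[: -len(suffix)]) * factor)
    return int(quantity)  # bare number of bytes


def would_shrink(declared: str, live: str) -> bool:
    """True when applying the declared size would request a shrink."""
    return parse_quantity(declared) < parse_quantity(live)


def shrink_candidates(pvcs: dict) -> list:
    """pvcs maps a PVC name to a (declared, live) pair; returns names at risk."""
    return [name for name, (declared, live) in sorted(pvcs.items())
            if would_shrink(declared, live)]
```

Any name this returns is a PVC where `ignore_changes` on the storage request is needed before the next apply.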
Viktor Barzin	fecfa211fd	fix: pvc-autoresizer threshold should be 10%, not 80% topolvm/pvc-autoresizer's threshold annotation is the FREE-SPACE percentage below which expansion fires (per upstream README). Setting it to "80%" means "expand when free-space drops below 80%", i.e. as soon as the PVC crosses 20% utilization — which caused prometheus-data-proxmox to be repeatedly expanded from 200Gi to 433Gi in 70 minutes (six 10% bumps, all when the volume was only ~14% used). Once the SC opt-in fix landed (`1e4eac53`) and the inode metrics fix landed (`02a12f1a`), the autoresizer started actively misfiring across 75+ PVCs cluster-wide. Flip the value to "10%" everywhere — that's "expand when free-space drops below 10%", i.e. at 90% utilization, which is the conventional semantic and matches the alert thresholds in prometheus_chart_values.tpl (PVAutoExpanding fires at 80%, PVFillingUp at 95%). The CLAUDE.md PVC template was the source of the misconfig, so update it too. Live PVC annotations were patched in parallel via kubectl annotate; TF apply on each affected stack will be a no-op against those live values. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-10 19:56:16 +00:00
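The free-space semantics described in this commit are easy to get backwards, so here is a small Python sketch of the rule as the message states it: expansion fires when free space drops below the threshold. The function names and sample numbers are illustrative assumptions. This is not the autoresizer's code.

```python
def parse_percent(value: str) -> float:
    """'10%' -> 10.0"""
    return float(value.strip().rstrip("%"))


def should_expand(capacity_bytes: int, used_bytes: int, threshold: str) -> bool:
    """Expansion fires when the FREE-space percentage drops below the threshold."""
    free_pct = 100.0 * (capacity_bytes - used_bytes) / capacity_bytes
    return free_pct < parse_percent(threshold)


def utilization_trigger_pct(threshold: str) -> float:
    """Utilization percentage at which a free-space threshold starts firing."""
    return 100.0 - parse_percent(threshold)
```

Under this reading, `"80%"` starts firing at 20% utilization and `"10%"` at 90% utilization, which is the distinction the fix relies on.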
Viktor Barzin	3148d15d5a	[forgejo] Phases 3+4+5: cutover, decommission, docs sweep End of forgejo-registry-consolidation. After Phase 0/1 already landed (Forgejo ready, dual-push CI, integrity probe, retention CronJob, images migrated via forgejo-migrate-orphan-images.sh), this commit flips everything off registry.viktorbarzin.me onto Forgejo and removes the legacy infrastructure. Phase 3 — image= flips: * infra/stacks/{payslip-ingest,job-hunter,claude-agent-service, fire-planner,freedify/factory,chrome-service,beads-server}/main.tf — image= now points to forgejo.viktorbarzin.me/viktor/<name>. * infra/stacks/claude-memory/main.tf — also moved off DockerHub (viktorbarzin/claude-memory-mcp:17 → forgejo.viktorbarzin.me/viktor/...). * infra/.woodpecker/{default,drift-detection}.yml — infra-ci pulled from Forgejo. build-ci-image.yml dual-pushes still until next build cycle confirms Forgejo as canonical. * /home/wizard/code/CLAUDE.md — claude-memory-mcp install URL updated. Phase 4 — decommission registry-private: * registry-credentials Secret: dropped registry.viktorbarzin.me / registry.viktorbarzin.me:5050 / 10.0.20.10:5050 auths entries. Forgejo entry is the only one left. * infra/stacks/infra/main.tf cloud-init: dropped containerd hosts.toml entries for registry.viktorbarzin.me + 10.0.20.10:5050. (Existing nodes already had the file removed manually by `setup-forgejo-containerd-mirror.sh` rollout — the cloud-init template only fires on new VM provision.) * infra/modules/docker-registry/docker-compose.yml: registry-private service block removed; nginx 5050 port mapping dropped. Pull- through caches for upstream registries (5000/5010/5020/5030/5040) stay on the VM permanently. * infra/modules/docker-registry/nginx_registry.conf: upstream `private` block + port 5050 server block removed. * infra/stacks/monitoring/modules/monitoring/main.tf: registry_ integrity_probe + registry_probe_credentials resources stripped. forgejo_integrity_probe is the only manifest probe now. 
Phase 5 — final docs sweep: * infra/docs/runbooks/registry-vm.md — VM scope reduced to pull- through caches; forgejo-registry-breakglass.md cross-ref added. * infra/docs/architecture/ci-cd.md — registry component table + diagram now reflect Forgejo. Pre-migration root-cause sentence preserved as historical context with a pointer to the design doc. * infra/docs/architecture/monitoring.md — Registry Integrity Probe row updated to point at the Forgejo probe. * infra/.claude/CLAUDE.md — Private registry section rewritten end- to-end (auth, retention, integrity, where the bake came from). * prometheus_chart_values.tpl — RegistryManifestIntegrityFailure alert annotation simplified now that only one registry is in scope. Operational follow-up (cannot be done from a TF apply): 1. ssh root@10.0.20.10 — edit /opt/registry/docker-compose.yml to match the new template AND `docker compose up -d --remove-orphans` to actually stop the registry-private container. Memory id=1078 confirms cloud-init won't redeploy on TF apply alone. 2. After 1 week of no incidents, `rm -rf /opt/registry/data/private/` on the VM (~2.6GB freed). 3. Open the dual-push step in build-ci-image.yml and drop registry.viktorbarzin.me:5050 from the `repo:` list — at that point the post-push integrity check at line 33-107 also needs to be repointed at Forgejo or removed (the per-build verify is redundant with the every-15min Forgejo probe). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-07 18:30:02 +00:00
Viktor Barzin	26ef97d294	[claude-agent-service] Add WOODPECKER_API_TOKEN + SLACK_WEBHOOK_URL env vars ## Context Companion fix to 2026-04-19's service-upgrade spec refactor. The agent pod has no Vault CLI auth (no VAULT_TOKEN, port 8200 refused), so every `vault kv get` in the spec returned empty: - `WOODPECKER_TOKEN=""` → 401 on /api/repos/1/pipelines → agent can't find its pipeline → 15m poll timeout → rollback loop → >30m cap. - `SLACK_WEBHOOK=""` → webhook POST to empty URL → no Slack messages for 3+ days (the surface symptom that kicked off bd code-3o3). ## This change Extends the `claude-agent-secrets` ExternalSecret with two more keys, making them available to the agent via `envFrom`: - `WOODPECKER_API_TOKEN` ← `secret/ci/global.woodpecker_api_token` (already used by the vault-woodpecker-sync CronJob, same key) - `SLACK_WEBHOOK_URL` ← `secret/viktor.alertmanager_slack_api_url` (shared webhook also consumed by Alertmanager) Pairs with commit `a5963169` which refactored service-upgrade.md to read these env vars directly instead of shelling out to `vault kv get`. ## What is NOT in this change - REGISTRY_USER / REGISTRY_PASSWORD — not needed on the agent side. The separate `.woodpecker/build-cli.yml` fix (bd code-3o3 fix C) will add those to `secret/ci/global` for the vault-woodpecker-sync CronJob to publish as Woodpecker secrets, not here. ## Test Plan ### Automated `terraform plan` reported `Plan: 0 to add, 2 to change, 0 to destroy` (ExternalSecret + a cosmetic `tier` label drop on the Deployment). Applied cleanly. ### Manual Verification ``` $ kubectl -n claude-agent get externalsecret claude-agent-secrets \ -o jsonpath='{.status.conditions[?(@.type=="Ready")].message}' secret synced $ kubectl -n claude-agent exec deploy/claude-agent-service -- sh -c \ 'echo "WP=${WOODPECKER_API_TOKEN:0:20}... SLACK=${SLACK_WEBHOOK_URL:0:40}..."' WP=eyJhbGciOiJIUzI1NiIs... SLACK=https://hooks.slack.com/services/T02SV75... 
$ kubectl -n claude-agent rollout status deploy/claude-agent-service deployment "claude-agent-service" successfully rolled out ``` Next step: fire one synthetic DIUN webhook to confirm the agent reaches Slack + lands a commit + exits cleanly, completing code-3o3. Refs: bd code-3o3 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-19 13:23:12 +00:00
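The empty-string failures described in this commit (a blank token leading to a 401, a blank webhook URL leading to silent drops) are the kind of problem a fail-fast lookup surfaces early. This is a minimal Python sketch, assuming the values arrive as environment variables via `envFrom`. `require_env` and `MissingSecretError` are hypothetical names, not part of the service.

```python
import os


class MissingSecretError(RuntimeError):
    """Raised when a required secret env var is unset or empty."""


def require_env(name: str, environ=None) -> str:
    """Return the value of `name`, or raise instead of passing on an empty string."""
    env = os.environ if environ is None else environ
    value = env.get(name, "").strip()
    if not value:
        raise MissingSecretError(f"{name} is unset or empty")
    return value
```

Calling `require_env("WOODPECKER_API_TOKEN")` at startup would turn a three-day silent failure into an immediate, named error.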
Viktor Barzin	5ea0aa70e3	[claude-agent-service] Bump image_tag to 2fd7670d (45m /execute timeout) ## Context Ships the monorepo commit (code@2fd7670d [claude-agent-service] Raise /execute default timeout from 15m to 45m) that raises ExecuteRequest.timeout_seconds from 900 to 2700. The auto-upgrade pipeline (DIUN → n8n → claude-agent-service → service-upgrade agent) had been silently timing out mid-run for 3 days: 139 × 202 Accepted + 6 × TimeoutError in the last 24h, zero commits to infra, zero Slack posts. Root cause was the 15-minute cap truncating CAUTION-class upgrades that need to summarise multi-release changelogs, poll Woodpecker CI, and wait on on-demand DB backup CronJobs. ## What changed `local.image_tag` 0c24c9b6 → 2fd7670d. Image built + pushed to registry.viktorbarzin.me/claude-agent-service:2fd7670d. Deployment is `Recreate`, so the single pod is dropped + recreated. ## Test Plan ### Automated `terraform plan` — `Plan: 0 to add, 1 to change, 0 to destroy` (3 container image refs flip from 0c24c9b6 → 2fd7670d). `terraform apply` — `Apply complete! Resources: 0 added, 1 changed, 0 destroyed.` ### Manual Verification ``` $ kubectl -n claude-agent rollout status deploy/claude-agent-service --timeout=120s deployment "claude-agent-service" successfully rolled out $ kubectl -n claude-agent get deploy claude-agent-service \ -o jsonpath='{.spec.template.spec.containers[0].image}' registry.viktorbarzin.me/claude-agent-service:2fd7670d $ kubectl -n claude-agent exec deploy/claude-agent-service -- \ sh -c 'cd /srv && python3 -c "from app.main import ExecuteRequest; \ print(ExecuteRequest(prompt=\"p\", agent=\"a\").timeout_seconds)"' 2700 ``` Next DIUN cycle (every 6h) should land ≥1 unattended upgrade as an infra commit + Slack message without TimeoutError in the agent logs. Closes: code-cfy Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-19 11:29:08 +00:00
Viktor Barzin	327ce215b9	[infra] Sweep dns_config ignore_changes across all pod-owning resources [ci skip] ## Context Wave 3A (commit `c9d221d5`) added the `# KYVERNO_LIFECYCLE_V1` marker to the 27 pre-existing `ignore_changes = [...dns_config]` sites so they could be grepped and audited. It did NOT address pod-owning resources that were simply missing the suppression entirely. Post-Wave-3A sampling (2026-04-18) found that navidrome, f1-stream, frigate, servarr, monitoring, crowdsec, and many other stacks showed perpetual `dns_config` drift every plan because their `kubernetes_deployment` / `kubernetes_stateful_set` / `kubernetes_cron_job_v1` resources had no `lifecycle {}` block at all. Root cause (same as Wave 3A): Kyverno's admission webhook stamps `dns_config { option { name = "ndots"; value = "2" } }` on every pod's `spec.template.spec.dns_config` to prevent NxDomain search-domain flooding (see `k8s-ndots-search-domain-nxdomain-flood` skill). Without `ignore_changes` on every Terraform-managed pod-owner, Terraform repeatedly tries to strip the injected field. ## This change Extends the Wave 3A convention by sweeping EVERY `kubernetes_deployment`, `kubernetes_stateful_set`, `kubernetes_daemon_set`, `kubernetes_cron_job_v1`, `kubernetes_job_v1` (+ their `_v1` variants) in the repo and ensuring each carries the right `ignore_changes` path: - kubernetes_deployment / stateful_set / daemon_set / job_v1: `spec[0].template[0].spec[0].dns_config` - kubernetes_cron_job_v1: `spec[0].job_template[0].spec[0].template[0].spec[0].dns_config` (extra `job_template[0]` nesting — the CronJob's PodTemplateSpec is one level deeper) Each injection / extension is tagged `# KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2` inline so the suppression is discoverable via `rg 'KYVERNO_LIFECYCLE_V1' stacks/`. Two insertion paths are handled by a Python pass (`/tmp/add_dns_config_ignore.py`): 1. 
No existing `lifecycle {}`: inject a brand-new block just before the resource's closing `}`. 108 new blocks on 93 files. 2. Existing `lifecycle {}` (usually for `DRIFT_WORKAROUND: CI owns image tag` from Wave 4, commit a62b43d1): extend its `ignore_changes` list with the dns_config path. Handles both inline (`= [x]`) and multiline (`= [\n x,\n]`) forms; ensures the last pre-existing list item carries a trailing comma so the extended list is valid HCL. 34 extensions. The script skips anything already mentioning `dns_config` inside an `ignore_changes`, so re-running is a no-op. ## Scale - 142 total lifecycle injections/extensions - 93 `.tf` files touched - 108 brand-new `lifecycle {}` blocks + 34 extensions of existing ones - Every Tier 0 and Tier 1 stack with a pod-owning resource is covered - Together with Wave 3A's 27 pre-existing markers → 169 greppable `KYVERNO_LIFECYCLE_V1` dns_config sites across the repo ## What is NOT in this change - `stacks/trading-bot/main.tf` — entirely commented-out block (`/* … */`). Python script touched the file, reverted manually. - `_template/main.tf.example` skeleton — kept minimal on purpose; any future stack created from it should either inherit the Wave 3A one-line form or add its own on first `kubernetes_deployment`. - `terraform fmt` fixes to pre-existing alignment issues in meshcentral, nvidia/modules/nvidia, vault — unrelated to this commit. Left for a separate fmt-only pass. - Non-pod resources (`kubernetes_service`, `kubernetes_secret`, `kubernetes_manifest`, etc.) — they don't own pods so they don't get Kyverno dns_config mutation. ## Verification Random sample post-commit: ``` $ cd stacks/navidrome && ../../scripts/tg plan → No changes. $ cd stacks/f1-stream && ../../scripts/tg plan → No changes. $ cd stacks/frigate && ../../scripts/tg plan → No changes. $ rg -c 'KYVERNO_LIFECYCLE_V1' stacks/ --include='*.tf' --include='*.tf.example' \ | awk -F: '{s+=$2} END {print s}' 169 ``` ## Reproduce locally 1. `git pull` 2. 
`rg 'KYVERNO_LIFECYCLE_V1' stacks/ | wc -l` → 169+ 3. `cd stacks/navidrome && ../../scripts/tg plan` → expect 0 drift on the deployment's dns_config field. Refs: code-seq (Wave 3B dns_config class closed; kubernetes_manifest annotation class handled separately in `8d94688d` for tls_secret) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-18 21:19:48 +00:00
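The second insertion path above (extending an existing `ignore_changes` list in inline or multiline form, with a trailing-comma fix and an idempotency skip) can be sketched in Python as follows. This is not the author's `/tmp/add_dns_config_ignore.py`, which is not shown in the log. It is an illustrative reconstruction that assumes one `ignore_changes` per block and no comments inside the list.

```python
DNS_PATH = "spec[0].template[0].spec[0].dns_config"
MARKER = "# KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2"


def _matching_bracket(text: str, open_idx: int) -> int:
    """Index of the ']' closing the '[' at open_idx (paths like spec[0] nest)."""
    depth = 0
    for i in range(open_idx, len(text)):
        if text[i] == "[":
            depth += 1
        elif text[i] == "]":
            depth -= 1
            if depth == 0:
                return i
    return -1


def extend_ignore_changes(block: str, path: str = DNS_PATH) -> str:
    """Append `path` to an existing ignore_changes list; no-op if already covered."""
    start = block.find("ignore_changes")
    if start == -1:
        return block
    open_idx = block.find("[", start)
    close_idx = _matching_bracket(block, open_idx) if open_idx != -1 else -1
    if close_idx == -1:
        return block
    body = block[open_idx + 1:close_idx]
    if "dns_config" in body:  # idempotency: already suppressed
        return block
    if "\n" not in body:  # inline form: = [x]
        items = body.strip()
        new_body = f"{items}, {path}" if items else path
        return block[:open_idx + 1] + new_body + "] " + MARKER + block[close_idx + 1:]
    # multiline form: keep existing items, make sure the last one ends with a comma
    kept = body.rstrip()
    tail = body[len(kept):]  # newline + indentation before the closing ']'
    item_lines = [ln for ln in kept.splitlines() if ln.strip()]
    indent = item_lines[0][:len(item_lines[0]) - len(item_lines[0].lstrip())] if item_lines else "    "
    if item_lines and not kept.endswith(","):
        kept += ","
    new_body = f"{kept}\n{indent}{path}, {MARKER}{tail}"
    return block[:open_idx + 1] + new_body + block[close_idx:]
```

The bracket matcher is needed because attribute paths such as `spec[0]` contain brackets themselves, so a naive search for the first `]` would cut the list short.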
Viktor Barzin	8b43692af0	[infra] Suppress Goldilocks vpa-update-mode label drift on all namespaces [ci skip] ## Context Wave 3B-continued: the Goldilocks VPA dashboard (stacks/vpa) runs a Kyverno ClusterPolicy `goldilocks-vpa-auto-mode` that mutates every namespace with `metadata.labels["goldilocks.fairwinds.com/vpa-update-mode"] = "off"`. This is intentional — Terraform owns container resource limits, and Goldilocks should only provide recommendations, never auto-update. The label is how Goldilocks decides per-namespace whether to run its VPA in `off` mode. Effect on Terraform: every `kubernetes_namespace` resource shows the label as pending-removal (`-> null`) on every `scripts/tg plan`. Dawarich survey 2026-04-18 confirmed the drift. Cluster-side count: 88 namespaces carry the label (`kubectl get ns -o json | jq ... | wc -l`). Every TF-managed namespace is affected. This commit brings the intentional admission drift under the same `# KYVERNO_LIFECYCLE_V1` discoverability marker introduced in `c9d221d5` for the ndots dns_config pattern. The marker now stands generically for any Kyverno admission-webhook drift suppression; the inline comment records which specific policy stamps which specific field so future grep audits show why each suppression exists. ## This change 107 `.tf` files touched — every stack's `resource "kubernetes_namespace"` resource gets: ```hcl lifecycle { # KYVERNO_LIFECYCLE_V1: goldilocks-vpa-auto-mode ClusterPolicy stamps this label on every namespace ignore_changes = [metadata[0].labels["goldilocks.fairwinds.com/vpa-update-mode"]] } ``` Injection was done with a brace-depth-tracking Python pass (`/tmp/add_goldilocks_ignore.py`): match `^resource "kubernetes_namespace" ` → track `{` / `}` until the outermost closing brace → insert the lifecycle block before the closing brace. The script is idempotent (skips any file that already mentions `goldilocks.fairwinds.com/vpa-update-mode`) so re-running is safe. 
Vault stack picked up 2 namespaces in the same file (k8s-users produces one, plus a second explicit ns) — confirmed via file diff (+8 lines). ## What is NOT in this change - `stacks/trading-bot/main.tf` — entire file is `/* … */` commented out (paused 2026-04-06 per user decision). Reverted after the script ran. - `stacks/_template/main.tf.example` — per-stack skeleton, intentionally minimal. User keeps it that way. Not touched by the script (file has no real `resource "kubernetes_namespace"` — only a placeholder comment). - `.terraform/` copies (e.g. `stacks/metallb/.terraform/modules/...`) — gitignored, won't commit; the live path was edited. - `terraform fmt` cleanup of adjacent pre-existing alignment issues in authentik, freedify, hermes-agent, nvidia, vault, meshcentral. Reverted to keep the commit scoped to the Goldilocks sweep. Those files will need a separate fmt-only commit or will be cleaned up on next real apply to that stack. ## Verification Dawarich (one of the hundred-plus touched stacks) showed the pattern before and after: ``` $ cd stacks/dawarich && ../../scripts/tg plan Before: Plan: 0 to add, 2 to change, 0 to destroy. # kubernetes_namespace.dawarich will be updated in-place (goldilocks.fairwinds.com/vpa-update-mode -> null) # module.tls_secret.kubernetes_secret.tls_secret will be updated in-place (Kyverno generate.* labels — fixed in `8d94688d`) After: No changes. Your infrastructure matches the configuration. ``` Injection count check: ``` $ rg -c 'KYVERNO_LIFECYCLE_V1: goldilocks-vpa-auto-mode' stacks/ | awk -F: '{s+=$2} END {print s}' 108 ``` ## Reproduce locally 1. `git pull` 2. Pick any stack: `cd stacks/<name> && ../../scripts/tg plan` 3. Expect: no drift on the namespace's goldilocks.fairwinds.com/vpa-update-mode label. Closes: code-dwx Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-18 21:15:27 +00:00
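The brace-depth-tracking pass described in this commit (match the resource header, count braces to the outermost closing brace, insert the lifecycle block just before it, skip files that already mention the label) might look roughly like the Python sketch below. The actual `/tmp/add_goldilocks_ignore.py` is not shown in the log, so this is an assumption-laden reconstruction. It ignores braces inside strings and comments.

```python
import re

LABEL = "goldilocks.fairwinds.com/vpa-update-mode"
LIFECYCLE = (
    "  lifecycle {\n"
    "    # KYVERNO_LIFECYCLE_V1: goldilocks-vpa-auto-mode ClusterPolicy stamps this label on every namespace\n"
    f'    ignore_changes = [metadata[0].labels["{LABEL}"]]\n'
    "  }\n"
)
HEADER_RE = re.compile(r'^resource "kubernetes_namespace" ')


def inject_lifecycle(source: str) -> str:
    """Insert the lifecycle block before each namespace resource's closing brace."""
    if LABEL in source:  # idempotency: file already handled
        return source
    out = []
    inside = False
    depth = 0
    for line in source.splitlines(keepends=True):
        if not inside and HEADER_RE.match(line):
            inside, depth = True, 0
        if inside:
            depth += line.count("{") - line.count("}")
            if depth == 0:
                if line.strip() == "}":  # outermost closing brace on its own line
                    out.append(LIFECYCLE)
                inside = False
        out.append(line)
    return "".join(out)
```

A file with two namespace resources gains two four-line blocks, which matches the +8 line diff reported for the Vault stack.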
Viktor Barzin	2b8bb849c0	[infra] Bump claude-agent-service + beadboard image tags ## Context Two rolling updates tied to the BeadBoard dispatch-button work (code-kel): 1. claude-agent-service now ships bd 1.0.2, a seed ConfigMap-equivalent (files in /usr/share/agent-seed/), the beads-task-runner agent, and hmac.compare_digest bearer verification. The tag moves from 382d6b14 to 0c24c9b6 (monorepo HEAD). 2. The beadboard Deployment in beads-server now consumes CLAUDE_AGENT_SERVICE_URL + CLAUDE_AGENT_BEARER_TOKEN, so the image needs the Dispatch button + /api/agent-dispatch + /api/agent-status routes. Tag moves from :latest to :17a38e43 (fork HEAD on github.com/ViktorBarzin/beadboard). ## What this change does - Flips `local.image_tag` in claude-agent-service main.tf. - Drops the "temporary" comment on `beadboard_image_tag` and sets the default to 17a38e43 (honours the infra 8-char-SHA rule — CLAUDE.md "Use 8-char git SHA tags — `:latest` causes stale pull-through cache"). ## Test Plan ## Automated - Both images already pushed to registry.viktorbarzin.me{:5050}/ : - claude-agent-service:0c24c9b6 verified via `docker run --rm … bd version` → 1.0.2 OK, /usr/share/agent-seed/ contains both seed files. - beadboard:17a38e43 pushed, digest cd0d3c47. - terraform fmt/validate clean on both stacks from the earlier commits. ## Manual Verification 1. Push triggers Woodpecker default.yml. 2. Expected: both stacks apply; claude-agent-service pod rolls (new seed-beads-agent init creates /workspace/.beads/ + /workspace/scratch + copies beads-task-runner.md), beadboard pod rolls with new env vars sourced from beadboard-agent-service ExternalSecret. 3. Cross-check: `kubectl -n claude-agent get pod -o yaml | grep image:` should show :0c24c9b6; `kubectl -n beads-server get deploy beadboard -o yaml | grep image:` should show :17a38e43. Closes: code-kel Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-18 19:24:37 +00:00
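The `hmac.compare_digest` bearer verification mentioned in this commit is a constant-time comparison that avoids leaking how many leading characters of a token matched. This is a minimal Python sketch of the idea. The header parsing and the function name are assumptions, since the service's actual handler is not shown here.

```python
import hmac


def bearer_token_valid(authorization_header, expected_token: str) -> bool:
    """Check an 'Authorization: Bearer <token>' value in constant time."""
    scheme, _, presented = (authorization_header or "").partition(" ")
    if scheme.lower() != "bearer" or not expected_token:
        return False
    # compare_digest takes time independent of where the inputs first differ
    return hmac.compare_digest(presented.strip().encode(), expected_token.encode())
```

Rejecting an empty `expected_token` up front keeps a misconfigured (blank) secret from matching a blank header.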
Viktor Barzin	e1d20457c4	[infra/claude-agent-service] Seed beads metadata + scratch dir at runtime ## Context Review of the BeadBoard Dispatch wiring found that the claude-agent-service Dockerfile's `COPY beads/metadata.json /workspace/.beads/metadata.json` and `COPY agents/beads-task-runner.md /home/agent/.claude/agents/...` both land on paths that are volume-mounted at runtime: - `/workspace` → `claude-agent-workspace-encrypted` PVC (main.tf:394-398) - `/home/agent/.claude` → `claude-home` emptyDir (main.tf:424-427) Kubernetes mounts hide image-layer content at those paths, so the COPYs are dead. The companion commit in `claude-agent-service` restages both files to `/usr/share/agent-seed/` (an image-layer path that is never mounted). Additionally, the beads-task-runner agent rails expect `/workspace/scratch/<job_id>/` to exist, but nothing was creating it. ## Layout before / after ``` Before (dead COPYs): image layer runtime (mounted volumes hide the files) ----------- ----------------------------------- /workspace/ <- hidden by PVC mount .beads/ metadata.json <- UNREACHABLE /home/agent/.claude/ <- hidden by emptyDir mount agents/ beads-task-runner.md <- UNREACHABLE After (init container seeds volumes at pod start): image layer runtime ----------- ------------------------------------ /usr/share/agent-seed/ beads-metadata.json --+ beads-task-runner.md --+-> copied by seed-beads-agent init container into the mounted volumes on every pod start: /workspace/.beads/metadata.json /workspace/scratch/ /home/agent/.claude/agents/beads-task-runner.md ``` ## What ### New init container: `seed-beads-agent` - Positioned AFTER `git-init`, BEFORE the main container. - Uses the same service image (`${local.image}:${local.image_tag}`) — the seed files are baked in at `/usr/share/agent-seed/`. - Runs as default uid 1000 (the PVCs are already chowned by `fix-perms`). 
- Shell body: mkdir -p /workspace/.beads /workspace/scratch /home/agent/.claude/agents cp /usr/share/agent-seed/beads-metadata.json /workspace/.beads/metadata.json cp /usr/share/agent-seed/beads-task-runner.md /home/agent/.claude/agents/beads-task-runner.md - Mounts: `workspace` at `/workspace`, `claude-home` at `/home/agent/.claude`. - Resources: 32Mi requests / 64Mi limits (matches `fix-perms`/`copy-claude-creds`). ### Formatting - `terraform fmt -recursive` also normalised whitespace in the token-expiry locals block and the CronJob container definition. No semantic change. ## What is NOT in this change - No image tag bump. The Dockerfile refactor that produces the `/usr/share/agent-seed/` path lands in the claude-agent-service repo and will roll in on the next CI build. Until that build ships and the tag is bumped in this file, the new init container will `cp` from a path that doesn't exist yet — so do NOT apply this commit until the corresponding image tag bump is ready. The commit is declarative prep. - No changes to storage class, RBAC, Service, or any other init. - The main container mounts remain unchanged — only the init containers prepare volume contents. ## Test Plan ### Automated ``` $ terraform fmt -check -recursive stacks/claude-agent-service/ (no output — clean) $ terraform -chdir=stacks/claude-agent-service/ init -backend=false Terraform has been successfully initialized! $ terraform -chdir=stacks/claude-agent-service/ validate Warning: Deprecated Resource (pre-existing; use kubernetes_namespace_v1) Success! The configuration is valid, but there were some validation warnings as shown above. ``` ### Manual Verification (after image bump + apply) 1. Bump `local.image_tag` in main.tf to the SHA of a build that has `/usr/share/agent-seed/*` (verify with `docker inspect $IMAGE | jq ...` or `kubectl run tmp --image ... -- ls /usr/share/agent-seed`). 2. `scripts/tg apply stacks/claude-agent-service` 3. 
`kubectl -n claude-agent get pods -w` — all init containers complete. 4. `kubectl -n claude-agent exec deploy/claude-agent-service -c claude-agent-service -- ls -la /workspace/.beads/metadata.json /home/agent/.claude/agents/beads-task-runner.md /workspace/scratch` Expected: all three paths exist; first two are regular files with the expected content, `scratch` is a directory. 5. `kubectl -n claude-agent exec deploy/claude-agent-service -c claude-agent-service -- jq -r .dolt_server_host /workspace/.beads/metadata.json` Expected: `dolt.beads-server.svc.cluster.local`. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-18 14:23:19 +00:00
Viktor Barzin	c9d221d578	[infra] Establish KYVERNO_LIFECYCLE_V1 drift-suppression convention [ci skip] ## Context Phase 1 of the state-drift consolidation audit (plan Wave 3) identified that the entire repo leans on a repeated `lifecycle { ignore_changes = [...dns_config] }` snippet to suppress Kyverno's admission-webhook dns_config mutation (the ndots=2 override that prevents NxDomain search-domain flooding). 27 occurrences across 19 stacks. Without this suppression, every pod-owning resource shows perpetual TF plan drift. The original plan proposed a shared `modules/kubernetes/kyverno_lifecycle/` module emitting the ignore-paths list as an output that stacks would consume in their `ignore_changes` blocks. That approach is architecturally impossible: Terraform's `ignore_changes` meta-argument accepts only static attribute paths — it rejects module outputs, locals, variables, and any expression (the HCL spec evaluates `lifecycle` before the regular expression graph). So a DRY module cannot exist. The canonical pattern IS the repeated snippet. What the snippet was missing was a discoverability tag so that (a) new resources can be validated for compliance, (b) the existing 27 sites can be grep'd in a single command, and (c) future maintainers understand the convention rather than each reinventing it. ## This change - Introduces `# KYVERNO_LIFECYCLE_V1` as the canonical marker comment. Attached inline on every `spec[0].template[0].spec[0].dns_config` line (or `spec[0].job_template[0].spec[0]...` for CronJobs) across all 27 existing suppression sites. - Documents the convention with rationale and copy-paste snippets in `AGENTS.md` → new "Kyverno Drift Suppression" section. - Expands the existing `.claude/CLAUDE.md` Kyverno ndots note to reference the marker and explain why the module approach is blocked. - Updates `_template/main.tf.example` so every new stack starts compliant. 
## What is NOT in this change - The `kubernetes_manifest` Kyverno annotation drift (beads `code-seq`) — that is Phase B with a sibling `# KYVERNO_MANIFEST_V1` marker. - Behavioral changes — every `ignore_changes` list is byte-identical save for the inline comment. - The fallback module the original plan anticipated — skipped because Terraform rejects expressions in `ignore_changes`. - `terraform fmt` cleanup on adjacent unrelated blocks in three files (claude-agent-service, freedify/factory, hermes-agent). Reverted to keep this commit scoped to the convention rollout. ## Before / after Before (cannot distinguish accidental-forgotten from intentional-convention): ```hcl lifecycle { ignore_changes = [spec[0].template[0].spec[0].dns_config] } ``` After (greppable, self-documenting, discoverable by tooling): ```hcl lifecycle { ignore_changes = [spec[0].template[0].spec[0].dns_config] # KYVERNO_LIFECYCLE_V1 } ``` ## Test Plan ### Automated ``` $ rg -c 'KYVERNO_LIFECYCLE_V1' stacks/ --include='*.tf' --include='*.tf.example' \ | awk -F: '{s+=$2} END {print s}' 27 $ git diff --stat | grep -E '\.(tf|tf\.example|md)$' | wc -l 21 # All code-file diffs are 1 insertion + 1 deletion per marker site, # except beads-server (3), ebooks (4), immich (3), uptime-kuma (2). $ git diff --stat stacks/ | tail -1 20 files changed, 45 insertions(+), 28 deletions(-) ``` ### Manual Verification No apply required — HCL comments only. Zero effect on any stack's plan output. Future audits: `rg 'KYVERNO_LIFECYCLE_V1' stacks/ | wc -l` must grow as new pod-owning resources are added. ## Reproduce locally 1. `cd infra && git pull` 2. `rg 'KYVERNO_LIFECYCLE_V1' stacks/` → expect 27 hits in 19 files 3. Grep any new `kubernetes_deployment` for the marker; absence = missing suppression. Closes: code-28m Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-18 14:15:51 +00:00
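The audit rule stated in this commit ("absence = missing suppression") can be automated per resource rather than per file. This is a hedged Python sketch, assuming one resource header per line and no braces inside strings or comments. The function name and the resource-type list are illustrative, drawn from the types named elsewhere in this log.

```python
import re

MARKER = "KYVERNO_LIFECYCLE_V1"
# Pod-owning resource types, with or without the _v1 suffix.
HEADER_RE = re.compile(
    r'^resource "(kubernetes_(?:deployment|stateful_set|daemon_set|cron_job|job)(?:_v1)?)" "([^"]+)"'
)


def resources_missing_marker(source: str) -> list:
    """Return 'type.name' for every pod-owning resource block lacking the marker."""
    missing = []
    current = None
    depth = 0
    has_marker = False
    for line in source.splitlines():
        if current is None:
            match = HEADER_RE.match(line)
            if not match:
                continue
            current = f"{match.group(1)}.{match.group(2)}"
            depth, has_marker = 0, False
        has_marker = has_marker or MARKER in line
        depth += line.count("{") - line.count("}")
        if depth <= 0:  # reached the resource's closing brace
            if not has_marker:
                missing.append(current)
            current = None
    return missing
```

Running this over each `.tf` file and failing CI on a non-empty result would enforce the convention for new resources, which a raw hit count cannot do.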
Viktor Barzin	82b7866bc9	[claude-agent-service] Remove orphaned DevVM SSH key wiring ## Context The remote-executor pattern that SSHed into the DevVM (10.0.10.10) to run `claude -p` was fully migrated to the in-cluster service `claude-agent-service.claude-agent.svc:8080/execute` in commits `42f1c3cf` and `99180bec` (2026-04-18). Five parallel codebase audits (GH Actions, Woodpecker + scripts, K8s CronJobs/Deployments, n8n, local scripts/hooks/docs) confirmed zero remaining SSH+claude sites. This commit removes two cleanup artifacts left behind by that migration. ## This change 1. Deletes `.claude/skills/archived/setup-remote-executor.md` — the archived skill doc for the obsolete SSH-based pattern. Already in `archived/`, harmless but noise; deleting prevents anyone copy-pasting the old approach. 2. Removes `kubernetes_secret.ssh_key` from `stacks/claude-agent-service/main.tf`. The Secret was created from the `devvm_ssh_key` field at Vault `secret/ci/infra` but was never mounted into the agent pod. The pod's `git-init` init container uses HTTPS + `$GITHUB_TOKEN` exclusively and force-rewrites every `git@github.com:` and `https://github.com/` URL via `git config url.insteadOf`, so no downstream `git` invocation could fall through to SSH even if it tried. 3. Removes the now-orphaned `data "vault_kv_secret_v2" "ci_secrets"` block — the SSH key resource was its only consumer. ## What is NOT in this change - The `devvm_ssh_key` field at Vault `secret/ci/infra` stays in place. Removing it requires read/modify/put of the full secret and the upside is one unused Vault key. Not worth it without strong justification. - DevVM host decommission is out of scope (separate audit needed for non-Claude users of the host). - Pre-existing `terraform fmt` warnings at lines 464-505 (CronJob alignment) left untouched per no-adjacent-refactor rule. 
## Test plan ### Automated - `terraform fmt -check stacks/claude-agent-service/main.tf` — only the pre-existing lines 464-505 are flagged; no new fmt warnings introduced by these deletions. ### Manual verification 1. `cd infra/stacks/claude-agent-service && ../../scripts/tg apply` 2. Expect exactly one resource destroyed: `kubernetes_secret.ssh_key`. The `ci_secrets` data source removal is plan-time only; does not appear in resource counts. 3. `kubectl -n claude-agent get secret ssh-key` → `NotFound`. 4. `kubectl -n claude-agent get pod` → both pods Running, no restart events. 5. Submit a synthetic agent job via HTTP API to confirm pipeline still works: curl -X POST http://claude-agent-service.claude-agent.svc.cluster.local:8080/execute with a minimal prompt; expect job completes with `exit_code=0`. Closes: code-bck Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-18 13:31:15 +00:00
Viktor Barzin	50dea8f0a7	[monitoring] Add Claude OAuth token expiry monitoring + alerts ## Context The new CLAUDE_CODE_OAUTH_TOKEN mechanism (commit `8a054752`) uses long-lived 1-year tokens minted via `claude setup-token`. Tokens don't auto-refresh — at the 1-year mark they expire hard and the upgrade agent stops working. We need to be told 30 days ahead, not find out when DIUN fires and gets 401 again. A cron rotator doesn't make sense here (tokens don't refresh, they just expire) so we alert instead. Two spares at `secret/claude-agent-service-spare-{1,2}` provide failover runway — monitor covers all three. ## This change CronJob (`claude-agent` ns, every 6h): reads a ConfigMap containing `<path> → expiry_unix_timestamp` entries, pushes `claude_oauth_token_expiry_timestamp{path="..."}` and `claude_oauth_expiry_monitor_last_push_timestamp` to Pushgateway at `prometheus-prometheus-pushgateway.monitoring:9091`. ConfigMap generated from a Terraform local `claude_oauth_token_mint_epochs` — source of truth for mint times. On rotation, update the map + apply. TTL is a shared local (365d). PrometheusRules (in prometheus_chart_values.tpl): - `ClaudeOAuthTokenExpiringSoon` — <30d, warning, for 1h - `ClaudeOAuthTokenCritical` — <7d, critical, for 10m - `ClaudeOAuthTokenMonitorStale` — last push >48h, warning - `ClaudeOAuthTokenMonitorNeverRun` — metric absent for 2h, warning Alert labels include `{{ $labels.path }}` so we know which token is expiring (primary / spare-1 / spare-2). 
## Verification ``` $ kubectl -n claude-agent create job --from=cronjob/claude-oauth-expiry-monitor manual $ curl pushgateway/metrics | grep claude_oauth_token_expiry claude_oauth_token_expiry_timestamp{...,path="primary"} 1.808064429e+09 claude_oauth_token_expiry_timestamp{...,path="spare-1"} 1.80806428e+09 claude_oauth_token_expiry_timestamp{...,path="spare-2"} 1.808064429e+09 $ query: (claude_oauth_token_expiry_timestamp - time()) / 86400 primary: 365.2 days spare-1: 365.2 days spare-2: 365.2 days ``` ## Rotation playbook (future) 1. `kubectl run -it --rm --image=registry.viktorbarzin.me/claude-agent-service:latest tokmint -- claude setup-token` (or harvest via `harvest3.py` pattern in memory for headless flow) 2. `vault kv patch secret/claude-agent-service claude_oauth_token=<new>` 3. Update `claude_oauth_token_mint_epochs["primary"]` in `stacks/claude-agent-service/main.tf` with new unix timestamp 4. `scripts/tg apply` claude-agent-service + monitoring 5. Alert clears within 6h (next cron tick) + 1h of the `ClaudeOAuthTokenExpiringSoon` "for:" duration Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-18 12:27:11 +00:00
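The expiry arithmetic behind the alerts in the commit above can be sketched in Python. This is a minimal illustration, not the CronJob's or Prometheus's actual logic: the function names and the sample mint epoch are hypothetical, and the `for:` hold durations are ignored. Only the 365-day TTL, the `/ 86400` conversion, and the 30-day and 7-day thresholds come from the commit message.

```python
TTL_SECONDS = 365 * 86400  # shared 365-day TTL, per the commit message


def expiry_timestamp(mint_epoch: int) -> int:
    """Expiry as a unix timestamp: mint time plus the fixed TTL."""
    return mint_epoch + TTL_SECONDS


def days_remaining(expiry: int, now: float) -> float:
    """Same arithmetic as the query (expiry - time()) / 86400."""
    return (expiry - now) / 86400


def severity(days_left: float) -> str:
    """Highest matching level: critical below 7 days, warning below 30.

    In Prometheus both rules would match below 7 days; this sketch
    reports only the more severe one.
    """
    if days_left < 7:
        return "critical"
    if days_left < 30:
        return "warning"
    return "ok"


if __name__ == "__main__":
    mint = 1_776_500_000  # hypothetical mint epoch, for illustration only
    expiry = expiry_timestamp(mint)
    for offset_days in (0, 340, 360):
        now = mint + offset_days * 86400
        left = days_remaining(expiry, now)
        print(f"day {offset_days}: {left:.1f} days left -> {severity(left)}")
```

Keeping the mint epoch as the single input mirrors the commit's choice of a Terraform local as the source of truth: a rotation only needs the new mint time, and the expiry follows from the shared TTL.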
Viktor Barzin	8a05475218	[claude-agent-service] Add CLAUDE_CODE_OAUTH_TOKEN env var — 1-year long-lived auth ## Context Earlier today we hit a silent auth failure on the upgrade agent: the short-lived `sk-ant-oat01-` access token in `.credentials.json` had expired and the CLI's refresh path failed (refresh token either stale or invalidated after the creds sat in Vault for 5 days). The real fix isn't "refresh more often" — it's switching to the long-lived auth mechanism `claude setup-token` provides. Unlike `claude login` (OAuth flow → 6–8h access token + refresh token JSON), `setup-token` mints a single opaque token valid for 1 year* that the CLI consumes via `CLAUDE_CODE_OAUTH_TOKEN` env var. No refresh dance, no JSON file, no rotation for a year. ## This change Adds `CLAUDE_CODE_OAUTH_TOKEN` to the existing `claude-agent-secrets` ExternalSecret, sourced from a new `claude_oauth_token` field at `secret/claude-agent-service`. The container already pulls that secret via `envFrom`, so no other wiring needed. The Claude CLI prefers `CLAUDE_CODE_OAUTH_TOKEN` over the OAuth JSON file when both are present, so this is additive — `.credentials.json` stays mounted as a fallback while we validate the long-lived path. Future cleanup can remove the JSON mount entirely. Verified E2E: synthetic DIUN webhook for `docker.io/library/httpd` → n8n → claude-agent-service /execute → agent job `fea5ff70dcfe` completed in 30s with exit_code=0, agent correctly identified no matching stack and aborted without changes. No API auth errors. ## Spares Harvested two additional long-lived tokens and stored them at `secret/claude-agent-service-spare-{1,2}` for failover if the primary is compromised or revoked. Verified both coexist with the primary (no revocation on mint). ## What is NOT in this change - No removal of `.credentials.json` mount or its Vault source (keep as fallback until we've run for 24h on env-var auth with no issues). 
- No cron rotator — 1-year TTL means this can be a yearly manual rotation, alerted on from Vault metadata. If we add rotation, we'll source from the spares pool rather than minting new tokens. ## Reproduce locally ``` 1. vault login -method=oidc 2. vault kv get -field=claude_oauth_token secret/claude-agent-service | head -c 25 3. cd stacks/claude-agent-service && ../../scripts/tg apply 4. kubectl -n claude-agent exec deploy/claude-agent-service -- printenv CLAUDE_CODE_OAUTH_TOKEN # should be 108 chars 5. Fire synthetic DIUN webhook (see docs/architecture/automated-upgrades.md) ``` Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-18 12:12:30 +00:00

14 commits