Commit graph

7 commits

Author SHA1 Message Date
Viktor Barzin
8b43692af0 [infra] Suppress Goldilocks vpa-update-mode label drift on all namespaces [ci skip]
## Context

Wave 3B-continued: the Goldilocks VPA dashboard (stacks/vpa) runs a Kyverno
ClusterPolicy `goldilocks-vpa-auto-mode` that mutates every namespace with
`metadata.labels["goldilocks.fairwinds.com/vpa-update-mode"] = "off"`. This
is intentional — Terraform owns container resource limits, and Goldilocks
should only provide recommendations, never auto-update. The label is how
Goldilocks decides per-namespace whether to run its VPA in `off` mode.

Effect on Terraform: every `kubernetes_namespace` resource shows the label
as pending-removal (`-> null`) on every `scripts/tg plan`. Dawarich survey
2026-04-18 confirmed the drift. Cluster-side count: 88 namespaces carry the
label (`kubectl get ns -o json | jq ... | wc -l`). Every TF-managed namespace
is affected.

This commit brings the intentional admission drift under the same
`# KYVERNO_LIFECYCLE_V1` discoverability marker introduced in c9d221d5 for
the ndots dns_config pattern. The marker now stands generically for any
Kyverno admission-webhook drift suppression; the inline comment records
which specific policy stamps which specific field so future grep audits
show why each suppression exists.

## This change

107 `.tf` files touched — every stack's `resource "kubernetes_namespace"`
block gets:

```hcl
lifecycle {
  # KYVERNO_LIFECYCLE_V1: goldilocks-vpa-auto-mode ClusterPolicy stamps this label on every namespace
  ignore_changes = [metadata[0].labels["goldilocks.fairwinds.com/vpa-update-mode"]]
}
```

Injection was done with a brace-depth-tracking Python pass (`/tmp/add_goldilocks_ignore.py`):
match `^resource "kubernetes_namespace" ` → track `{` / `}` until the
outermost closing brace → insert the lifecycle block before the closing
brace. The script is idempotent (skips any file that already mentions
`goldilocks.fairwinds.com/vpa-update-mode`) so re-running is safe.
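A minimal sketch of that pass (the real `/tmp/add_goldilocks_ignore.py` was not committed; the helper name is illustrative, and the naive brace count ignores braces inside heredocs and strings):

```python
import re

MARKER = "goldilocks.fairwinds.com/vpa-update-mode"
BLOCK = (
    "  lifecycle {\n"
    "    # KYVERNO_LIFECYCLE_V1: goldilocks-vpa-auto-mode ClusterPolicy stamps this label on every namespace\n"
    '    ignore_changes = [metadata[0].labels["goldilocks.fairwinds.com/vpa-update-mode"]]\n'
    "  }\n"
)

def inject(text: str) -> str:
    if MARKER in text:  # idempotency guard: skip already-patched files
        return text
    out, depth, in_ns = [], 0, False
    for line in text.splitlines(keepends=True):
        if not in_ns and re.match(r'^resource "kubernetes_namespace" ', line):
            in_ns = True
        if in_ns:
            depth += line.count("{") - line.count("}")
            if depth == 0:  # outermost closing brace: insert the block before it
                out.append(BLOCK)
                in_ns = False
        out.append(line)
    return "".join(out)
```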

The vault stack picked up 2 namespaces in the same file (k8s-users produces
one, plus a second explicit namespace) — confirmed via file diff (+8 lines).

## What is NOT in this change

- `stacks/trading-bot/main.tf` — the entire file is `/* … */` commented out
  (paused 2026-04-06 per user decision). The script's edits to it were
  reverted after it ran.
- `stacks/_template/main.tf.example` — per-stack skeleton, intentionally
  minimal. User keeps it that way. Not touched by the script (file
  has no real `resource "kubernetes_namespace"` — only a placeholder
  comment).
- `.terraform/` copies (e.g. `stacks/metallb/.terraform/modules/...`) —
  gitignored, won't commit; the live path was edited.
- `terraform fmt` cleanup of adjacent pre-existing alignment issues in
  authentik, freedify, hermes-agent, nvidia, vault, meshcentral. Reverted
  to keep the commit scoped to the Goldilocks sweep. Those files will
  need a separate fmt-only commit or will be cleaned up on next real
  apply to that stack.

## Verification

Dawarich (one of the hundred-plus touched stacks) showed the pattern
before and after:

```
$ cd stacks/dawarich && ../../scripts/tg plan

Before:
  Plan: 0 to add, 2 to change, 0 to destroy.
   # kubernetes_namespace.dawarich will be updated in-place
     (goldilocks.fairwinds.com/vpa-update-mode -> null)
   # module.tls_secret.kubernetes_secret.tls_secret will be updated in-place
     (Kyverno generate.* labels — fixed in 8d94688d)

After:
  No changes. Your infrastructure matches the configuration.
```

Injection count check:
```
$ rg -c 'KYVERNO_LIFECYCLE_V1: goldilocks-vpa-auto-mode' stacks/ | awk -F: '{s+=$2} END {print s}'
108
```

## Reproduce locally
1. `git pull`
2. Pick any stack: `cd stacks/<name> && ../../scripts/tg plan`
3. Expect: no drift on the namespace's `goldilocks.fairwinds.com/vpa-update-mode` label.

Closes: code-dwx

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 21:15:27 +00:00
Viktor Barzin
2b8bb849c0 [infra] Bump claude-agent-service + beadboard image tags
## Context
Two rolling updates tied to the BeadBoard dispatch-button work (code-kel):

1. claude-agent-service now ships bd 1.0.2, a seed ConfigMap-equivalent
   (files in /usr/share/agent-seed/), the beads-task-runner agent, and
   hmac.compare_digest bearer verification. The tag moves from 382d6b14
   to 0c24c9b6 (monorepo HEAD).
2. The beadboard Deployment in beads-server now consumes
   CLAUDE_AGENT_SERVICE_URL + CLAUDE_AGENT_BEARER_TOKEN, so the image
   needs the Dispatch button + /api/agent-dispatch + /api/agent-status
   routes. Tag moves from :latest to :17a38e43 (fork HEAD on
   github.com/ViktorBarzin/beadboard).
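The bearer verification mentioned in (1) presumably reduces to a constant-time token comparison; a sketch under that assumption (the handler shape and names are hypothetical, not the monorepo's actual code):

```python
import hmac

def is_authorized(auth_header: str, expected: str) -> bool:
    """Validate an 'Authorization: Bearer <token>' header in constant time.

    In the real service, `expected` would come from CLAUDE_AGENT_BEARER_TOKEN.
    """
    prefix = "Bearer "
    if not auth_header.startswith(prefix):
        return False
    # hmac.compare_digest avoids the timing side-channel a plain == would leak
    return hmac.compare_digest(auth_header[len(prefix):], expected)
```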

## What this change does
- Bumps `local.image_tag` in claude-agent-service `main.tf`.
- Drops the "temporary" comment on `beadboard_image_tag` and sets the
  default to 17a38e43 (honours the infra 8-char-SHA rule — CLAUDE.md
  "Use 8-char git SHA tags — `:latest` causes stale pull-through cache").

## Test Plan
### Automated
- Both images already pushed to registry.viktorbarzin.me{,:5050}/:
  - claude-agent-service:0c24c9b6 verified via
    `docker run --rm … bd version` → 1.0.2 OK, /usr/share/agent-seed/
    contains both seed files.
  - beadboard:17a38e43 pushed, digest cd0d3c47.
- terraform fmt/validate clean on both stacks from the earlier commits.

### Manual Verification
1. Push triggers Woodpecker default.yml.
2. Expected: both stacks apply; claude-agent-service pod rolls (new
   seed-beads-agent init creates /workspace/.beads/ + /workspace/scratch
   + copies beads-task-runner.md), beadboard pod rolls with new env vars
   sourced from beadboard-agent-service ExternalSecret.
3. Cross-check: `kubectl -n claude-agent get pod -o yaml | grep image:`
   should show :0c24c9b6; `kubectl -n beads-server get deploy beadboard
   -o yaml | grep image:` should show :17a38e43.

Closes: code-kel
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 19:24:37 +00:00
Viktor Barzin
e1d20457c4 [infra/claude-agent-service] Seed beads metadata + scratch dir at runtime
## Context

Review of the BeadBoard Dispatch wiring found that the claude-agent-service
Dockerfile's `COPY beads/metadata.json /workspace/.beads/metadata.json` and
`COPY agents/beads-task-runner.md /home/agent/.claude/agents/...` both land
on paths that are volume-mounted at runtime:

  - `/workspace` → `claude-agent-workspace-encrypted` PVC (main.tf:394-398)
  - `/home/agent/.claude` → `claude-home` emptyDir (main.tf:424-427)

Kubernetes mounts hide image-layer content at those paths, so the COPYs are
dead. The companion commit in `claude-agent-service` restages both files to
`/usr/share/agent-seed/` (an image-layer path that is never mounted).

Additionally, the beads-task-runner agent rails expect
`/workspace/scratch/<job_id>/` to exist, but nothing was creating it.

## Layout before / after

```
  Before (dead COPYs):

    image layer          runtime (mounted volumes hide the files)
    -----------          -----------------------------------
    /workspace/          <- hidden by PVC mount
      .beads/
        metadata.json    <- UNREACHABLE
    /home/agent/.claude/ <- hidden by emptyDir mount
      agents/
        beads-task-runner.md  <- UNREACHABLE

  After (init container seeds volumes at pod start):

    image layer          runtime
    -----------          ------------------------------------
    /usr/share/agent-seed/
      beads-metadata.json    --+
      beads-task-runner.md    --+-> copied by seed-beads-agent init
                                    container into the mounted volumes
                                    on every pod start:
                                      /workspace/.beads/metadata.json
                                      /workspace/scratch/
                                      /home/agent/.claude/agents/beads-task-runner.md
```

## What

### New init container: `seed-beads-agent`
  - Positioned AFTER `git-init`, BEFORE the main container.
  - Uses the same service image (`${local.image}:${local.image_tag}`) — the
    seed files are baked in at `/usr/share/agent-seed/`.
  - Runs as default uid 1000 (the PVCs are already chowned by `fix-perms`).
  - Shell body:
      mkdir -p /workspace/.beads /workspace/scratch /home/agent/.claude/agents
      cp /usr/share/agent-seed/beads-metadata.json     /workspace/.beads/metadata.json
      cp /usr/share/agent-seed/beads-task-runner.md    /home/agent/.claude/agents/beads-task-runner.md
  - Mounts: `workspace` at `/workspace`, `claude-home` at `/home/agent/.claude`.
  - Resources: 32Mi requests / 64Mi limits (matches `fix-perms`/`copy-claude-creds`).

### Formatting
  - `terraform fmt -recursive` also normalised whitespace in the token-expiry
    locals block and the CronJob container definition. No semantic change.

## What is NOT in this change

  - No image tag bump. The Dockerfile refactor that produces the
    `/usr/share/agent-seed/` path lands in the claude-agent-service repo
    and will roll in on the next CI build. Until that build ships and the
    tag is bumped in this file, the new init container will `cp` from a
    path that doesn't exist yet — so do NOT apply this commit until the
    corresponding image tag bump is ready. The commit is declarative prep.
  - No changes to storage class, RBAC, Service, or any other init.
  - The main container mounts remain unchanged — only the init containers
    prepare volume contents.

## Test Plan

### Automated

```
$ terraform fmt -check -recursive stacks/claude-agent-service/
(no output — clean)

$ terraform -chdir=stacks/claude-agent-service/ init -backend=false
Terraform has been successfully initialized!

$ terraform -chdir=stacks/claude-agent-service/ validate
Warning: Deprecated Resource (pre-existing; use kubernetes_namespace_v1)
Success! The configuration is valid, but there were some validation warnings
as shown above.
```

### Manual Verification (after image bump + apply)

1. Bump `local.image_tag` in main.tf to the SHA of a build that has
   `/usr/share/agent-seed/*` (verify with `docker inspect $IMAGE | jq ...`
   or `kubectl run tmp --image ... -- ls /usr/share/agent-seed`).
2. `scripts/tg apply stacks/claude-agent-service`
3. `kubectl -n claude-agent get pods -w` — all init containers complete.
4. `kubectl -n claude-agent exec deploy/claude-agent-service -c claude-agent-service -- ls -la /workspace/.beads/metadata.json /home/agent/.claude/agents/beads-task-runner.md /workspace/scratch`
   Expected: all three paths exist; first two are regular files with the
   expected content, `scratch` is a directory.
5. `kubectl -n claude-agent exec deploy/claude-agent-service -c claude-agent-service -- jq -r .dolt_server_host /workspace/.beads/metadata.json`
   Expected: `dolt.beads-server.svc.cluster.local`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 14:23:19 +00:00
Viktor Barzin
c9d221d578 [infra] Establish KYVERNO_LIFECYCLE_V1 drift-suppression convention [ci skip]
## Context

Phase 1 of the state-drift consolidation audit (plan Wave 3) identified that
the entire repo leans on a repeated `lifecycle { ignore_changes = [...dns_config] }`
snippet to suppress Kyverno's admission-webhook dns_config mutation (the ndots=2
override that prevents NxDomain search-domain flooding). 27 occurrences across
19 stacks. Without this suppression, every pod-owning resource shows perpetual
TF plan drift.

The original plan proposed a shared `modules/kubernetes/kyverno_lifecycle/`
module emitting the ignore-paths list as an output that stacks would consume in
their `ignore_changes` blocks. That approach is architecturally impossible:
Terraform's `ignore_changes` meta-argument accepts only static attribute paths
— it rejects module outputs, locals, variables, and any other expression
(Terraform evaluates `lifecycle` statically, before building the expression
graph). So a DRY
module cannot exist. The canonical pattern IS the repeated snippet.

What the snippet was missing was a *discoverability tag* so that (a) new
resources can be validated for compliance, (b) the existing 27 sites can be
grep'd in a single command, and (c) future maintainers understand the
convention rather than each reinventing it.

## This change

- Introduces `# KYVERNO_LIFECYCLE_V1` as the canonical marker comment.
  Attached inline on every `spec[0].template[0].spec[0].dns_config` line
  (or `spec[0].job_template[0].spec[0]...` for CronJobs) across all 27
  existing suppression sites.
- Documents the convention with rationale and copy-paste snippets in
  `AGENTS.md` → new "Kyverno Drift Suppression" section.
- Expands the existing `.claude/CLAUDE.md` Kyverno ndots note to reference
  the marker and explain why the module approach is blocked.
- Updates `_template/main.tf.example` so every new stack starts compliant.

## What is NOT in this change

- The `kubernetes_manifest` Kyverno annotation drift (beads `code-seq`)
  — that is Phase B with a sibling `# KYVERNO_MANIFEST_V1` marker.
- Behavioral changes — every `ignore_changes` list is byte-identical
  save for the inline comment.
- The fallback module the original plan anticipated — skipped because
  Terraform rejects expressions in `ignore_changes`.
- `terraform fmt` cleanup on adjacent unrelated blocks in three files
  (claude-agent-service, freedify/factory, hermes-agent). Reverted to
  keep this commit scoped to the convention rollout.

## Before / after

Before (an accidental omission is indistinguishable from the intentional convention):
```hcl
lifecycle {
  ignore_changes = [spec[0].template[0].spec[0].dns_config]
}
```

After (greppable, self-documenting, discoverable by tooling):
```hcl
lifecycle {
  ignore_changes = [spec[0].template[0].spec[0].dns_config] # KYVERNO_LIFECYCLE_V1
}
```

## Test Plan

### Automated
```
$ rg -c 'KYVERNO_LIFECYCLE_V1' stacks/ -g '*.tf' -g '*.tf.example' \
    | awk -F: '{s+=$2} END {print s}'
27

$ git diff --stat | grep -E '\.(tf|tf\.example|md)$' | wc -l
21

# All code-file diffs are 1 insertion + 1 deletion per marker site,
# except beads-server (3), ebooks (4), immich (3), uptime-kuma (2).
$ git diff --stat stacks/ | tail -1
20 files changed, 45 insertions(+), 28 deletions(-)
```

### Manual Verification

No apply required — HCL comments only. Zero effect on any stack's plan output.
Future audits: `rg 'KYVERNO_LIFECYCLE_V1' stacks/ | wc -l` must grow as new
pod-owning resources are added.

## Reproduce locally
1. `cd infra && git pull`
2. `rg 'KYVERNO_LIFECYCLE_V1' stacks/` → expect 27 hits in 19 files
3. Grep any new `kubernetes_deployment` for the marker; absence = missing
   suppression.

Closes: code-28m

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 14:15:51 +00:00
Viktor Barzin
82b7866bc9 [claude-agent-service] Remove orphaned DevVM SSH key wiring
## Context

The remote-executor pattern that SSHed into the DevVM (10.0.10.10) to run
`claude -p` was fully migrated to the in-cluster service
`claude-agent-service.claude-agent.svc:8080/execute` in commits 42f1c3cf and
99180bec (2026-04-18). Five parallel codebase audits (GH Actions, Woodpecker
+ scripts, K8s CronJobs/Deployments, n8n, local scripts/hooks/docs) confirmed
zero remaining SSH+claude sites.

This commit removes two cleanup artifacts left behind by that migration.

## This change

1. Deletes `.claude/skills/archived/setup-remote-executor.md` — the archived
   skill doc for the obsolete SSH-based pattern. Already in `archived/`,
   harmless but noise; deleting it prevents anyone from copy-pasting the old
   approach.

2. Removes `kubernetes_secret.ssh_key` from
   `stacks/claude-agent-service/main.tf`. The Secret was created from the
   `devvm_ssh_key` field at Vault `secret/ci/infra` but was never mounted
   into the agent pod. The pod's `git-init` init container uses HTTPS +
   `$GITHUB_TOKEN` exclusively and force-rewrites every `git@github.com:`
   and `https://github.com/` URL via `git config url.insteadOf`, so no
   downstream `git` invocation could fall through to SSH even if it tried.

3. Removes the now-orphaned `data "vault_kv_secret_v2" "ci_secrets"` block —
   the SSH key resource was its only consumer.

## What is NOT in this change

- The `devvm_ssh_key` field at Vault `secret/ci/infra` stays in place.
  Removing it requires read/modify/put of the full secret and the upside
  is one unused Vault key. Not worth it without strong justification.
- DevVM host decommission is out of scope (separate audit needed for
  non-Claude users of the host).
- Pre-existing `terraform fmt` warnings at lines 464-505 (CronJob alignment)
  left untouched per no-adjacent-refactor rule.

## Test plan

### Automated

- `terraform fmt -check stacks/claude-agent-service/main.tf` — only the
  pre-existing lines 464-505 are flagged; no new fmt warnings introduced
  by these deletions.

### Manual verification

1. `cd infra/stacks/claude-agent-service && ../../scripts/tg apply`
2. Expect exactly one resource destroyed: `kubernetes_secret.ssh_key`.
   The `ci_secrets` data source removal is plan-time only; does not appear
   in resource counts.
3. `kubectl -n claude-agent get secret ssh-key` → `NotFound`.
4. `kubectl -n claude-agent get pod` → both pods Running, no restart events.
5. Submit a synthetic agent job via the HTTP API to confirm the pipeline still works:
   `curl -X POST http://claude-agent-service.claude-agent.svc.cluster.local:8080/execute`
   with a minimal prompt; expect the job to complete with `exit_code=0`.

Closes: code-bck

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 13:31:15 +00:00
Viktor Barzin
50dea8f0a7 [monitoring] Add Claude OAuth token expiry monitoring + alerts
## Context

The new CLAUDE_CODE_OAUTH_TOKEN mechanism (commit 8a054752) uses
long-lived 1-year tokens minted via `claude setup-token`. Tokens don't
auto-refresh — at the 1-year mark they expire hard and the upgrade
agent stops working. We need to be told 30 days ahead, not find out
when DIUN fires and gets 401 again.

A cron rotator doesn't make sense here (tokens don't refresh, they
just expire) so we alert instead. Two spares at
`secret/claude-agent-service-spare-{1,2}` provide failover runway —
monitor covers all three.

## This change

**CronJob** (`claude-agent` ns, every 6h): reads a ConfigMap
containing `<path> → expiry_unix_timestamp` entries, pushes
`claude_oauth_token_expiry_timestamp{path="..."}` and
`claude_oauth_expiry_monitor_last_push_timestamp` to Pushgateway at
`prometheus-prometheus-pushgateway.monitoring:9091`.

**ConfigMap** generated from a Terraform local `claude_oauth_token_mint_epochs`
— source of truth for mint times. On rotation, update the map + apply.
TTL is a shared local (365d).

**PrometheusRules** (in prometheus_chart_values.tpl):
- `ClaudeOAuthTokenExpiringSoon`  — <30d, warning, for 1h
- `ClaudeOAuthTokenCritical`      — <7d,  critical, for 10m
- `ClaudeOAuthTokenMonitorStale`  — last push >48h, warning
- `ClaudeOAuthTokenMonitorNeverRun` — metric absent for 2h, warning

Alert labels include `{{ $labels.path }}` so we know which token is
expiring (primary / spare-1 / spare-2).
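The arithmetic the CronJob and the rules rely on is simple enough to sketch (the mint epochs below are placeholders mirroring the shape of the `claude_oauth_token_mint_epochs` local, not its real values):

```python
import time

TTL_SECONDS = 365 * 24 * 3600  # the shared 365d TTL local

# Placeholder mirror of the Terraform local claude_oauth_token_mint_epochs.
mint_epochs = {"primary": 1744979231, "spare-1": 1744979231, "spare-2": 1744979231}

def expiry_metrics(epochs):
    """path -> the claude_oauth_token_expiry_timestamp value the CronJob pushes."""
    return {path: mint + TTL_SECONDS for path, mint in epochs.items()}

def days_remaining(expiry_ts, now=None):
    # Same expression as the verification query: (expiry - time()) / 86400
    return (expiry_ts - (time.time() if now is None else now)) / 86400
```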

## Verification

```
$ kubectl -n claude-agent create job --from=cronjob/claude-oauth-expiry-monitor manual
$ curl pushgateway/metrics | grep claude_oauth_token_expiry
claude_oauth_token_expiry_timestamp{...,path="primary"} 1.808064429e+09
claude_oauth_token_expiry_timestamp{...,path="spare-1"} 1.80806428e+09
claude_oauth_token_expiry_timestamp{...,path="spare-2"} 1.808064429e+09

$ query: (claude_oauth_token_expiry_timestamp - time()) / 86400
  primary: 365.2 days
  spare-1: 365.2 days
  spare-2: 365.2 days
```

## Rotation playbook (future)

1. `kubectl run -it --rm --image=registry.viktorbarzin.me/claude-agent-service:latest tokmint -- claude setup-token`
   (or harvest via `harvest3.py` pattern in memory for headless flow)
2. `vault kv patch secret/claude-agent-service claude_oauth_token=<new>`
3. Update `claude_oauth_token_mint_epochs["primary"]` in
   `stacks/claude-agent-service/main.tf` with new unix timestamp
4. `scripts/tg apply` claude-agent-service + monitoring
5. Alert clears within 6h (the next cron tick) plus the 1h `for:` duration
   on `ClaudeOAuthTokenExpiringSoon`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 12:27:11 +00:00
Viktor Barzin
8a05475218 [claude-agent-service] Add CLAUDE_CODE_OAUTH_TOKEN env var — 1-year long-lived auth
## Context

Earlier today we hit a silent auth failure on the upgrade agent: the
short-lived `sk-ant-oat01-*` access token in `.credentials.json` had
expired and the CLI's refresh path failed (refresh token either stale
or invalidated after the creds sat in Vault for 5 days).

The real fix isn't "refresh more often" — it's switching to the
long-lived auth mechanism `claude setup-token` provides. Unlike
`claude login` (OAuth flow → 6–8h access token + refresh token JSON),
`setup-token` mints a single opaque token valid for **1 year** that
the CLI consumes via `CLAUDE_CODE_OAUTH_TOKEN` env var. No refresh
dance, no JSON file, no rotation for a year.

## This change

Adds `CLAUDE_CODE_OAUTH_TOKEN` to the existing
`claude-agent-secrets` ExternalSecret, sourced from a new
`claude_oauth_token` field at `secret/claude-agent-service`. The
container already pulls that secret via `envFrom`, so no other wiring
needed.

The Claude CLI prefers `CLAUDE_CODE_OAUTH_TOKEN` over the OAuth JSON
file when both are present, so this is additive — `.credentials.json`
stays mounted as a fallback while we validate the long-lived path.
Future cleanup can remove the JSON mount entirely.

Verified E2E: synthetic DIUN webhook for `docker.io/library/httpd`
→ n8n → claude-agent-service /execute → agent job `fea5ff70dcfe`
completed in 30s with exit_code=0, agent correctly identified no
matching stack and aborted without changes. No API auth errors.

## Spares

Harvested two additional long-lived tokens and stored them at
`secret/claude-agent-service-spare-{1,2}` for failover if the
primary is compromised or revoked. Verified both coexist with the
primary (no revocation on mint).

## What is NOT in this change

- No removal of `.credentials.json` mount or its Vault source (keep
  as fallback until we've run for 24h on env-var auth with no issues).
- No cron rotator — 1-year TTL means this can be a yearly manual
  rotation, alerted on from Vault metadata. If we add rotation, we'll
  source from the spares pool rather than minting new tokens.

## Reproduce locally

```
1. vault login -method=oidc
2. vault kv get -field=claude_oauth_token secret/claude-agent-service | head -c 25
3. cd stacks/claude-agent-service && ../../scripts/tg apply
4. kubectl -n claude-agent exec deploy/claude-agent-service -- \
     printenv CLAUDE_CODE_OAUTH_TOKEN   # should be 108 chars
5. Fire synthetic DIUN webhook (see docs/architecture/automated-upgrades.md)
```

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 12:12:30 +00:00