Commit graph

7 commits

Author SHA1 Message Date
Viktor Barzin
8b43692af0 [infra] Suppress Goldilocks vpa-update-mode label drift on all namespaces [ci skip]
## Context

Wave 3B-continued: the Goldilocks VPA dashboard (stacks/vpa) runs a Kyverno
ClusterPolicy `goldilocks-vpa-auto-mode` that mutates every namespace with
`metadata.labels["goldilocks.fairwinds.com/vpa-update-mode"] = "off"`. This
is intentional — Terraform owns container resource limits, and Goldilocks
should only provide recommendations, never auto-update. The label is how
Goldilocks decides per-namespace whether to run its VPA in `off` mode.

Effect on Terraform: every `kubernetes_namespace` resource shows the label
as pending-removal (`-> null`) on every `scripts/tg plan`. Dawarich survey
2026-04-18 confirmed the drift. Cluster-side count: 88 namespaces carry the
label (`kubectl get ns -o json | jq ... | wc -l`). Every TF-managed namespace
is affected.

This commit brings the intentional admission drift under the same
`# KYVERNO_LIFECYCLE_V1` discoverability marker introduced in c9d221d5 for
the ndots dns_config pattern. The marker now stands generically for any
Kyverno admission-webhook drift suppression; the inline comment records
which specific policy stamps which specific field so future grep audits
show why each suppression exists.

## This change

107 `.tf` files touched — every stack's `resource "kubernetes_namespace"`
block gets:

```hcl
lifecycle {
  # KYVERNO_LIFECYCLE_V1: goldilocks-vpa-auto-mode ClusterPolicy stamps this label on every namespace
  ignore_changes = [metadata[0].labels["goldilocks.fairwinds.com/vpa-update-mode"]]
}
```

Injection was done with a brace-depth-tracking Python pass (`/tmp/add_goldilocks_ignore.py`):
match `^resource "kubernetes_namespace" ` → track `{` / `}` until the
outermost closing brace → insert the lifecycle block before the closing
brace. The script is idempotent (skips any file that already mentions
`goldilocks.fairwinds.com/vpa-update-mode`) so re-running is safe.
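A minimal sketch of that pass (the real `/tmp/add_goldilocks_ignore.py` was not committed; the helper name is illustrative, and the naive brace count ignores braces inside heredocs and strings):

```python
import re

MARKER = "goldilocks.fairwinds.com/vpa-update-mode"
BLOCK = (
    "  lifecycle {\n"
    "    # KYVERNO_LIFECYCLE_V1: goldilocks-vpa-auto-mode ClusterPolicy stamps this label on every namespace\n"
    '    ignore_changes = [metadata[0].labels["goldilocks.fairwinds.com/vpa-update-mode"]]\n'
    "  }\n"
)

def inject(text: str) -> str:
    if MARKER in text:  # idempotency guard: skip already-patched files
        return text
    out, depth, in_ns = [], 0, False
    for line in text.splitlines(keepends=True):
        if not in_ns and re.match(r'^resource "kubernetes_namespace" ', line):
            in_ns = True
        if in_ns:
            depth += line.count("{") - line.count("}")
            if depth == 0:  # outermost closing brace: insert the block before it
                out.append(BLOCK)
                in_ns = False
        out.append(line)
    return "".join(out)
```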

The vault stack picked up 2 namespaces in the same file (k8s-users produces
one, plus a second explicit namespace) — confirmed via file diff (+8 lines).

## What is NOT in this change

- `stacks/trading-bot/main.tf` — the entire file is `/* … */` commented out
  (paused 2026-04-06 per user decision). The script's edits to it were
  reverted after it ran.
- `stacks/_template/main.tf.example` — per-stack skeleton, intentionally
  minimal. User keeps it that way. Not touched by the script (file
  has no real `resource "kubernetes_namespace"` — only a placeholder
  comment).
- `.terraform/` copies (e.g. `stacks/metallb/.terraform/modules/...`) —
  gitignored, won't commit; the live path was edited.
- `terraform fmt` cleanup of adjacent pre-existing alignment issues in
  authentik, freedify, hermes-agent, nvidia, vault, meshcentral. Reverted
  to keep the commit scoped to the Goldilocks sweep. Those files will
  need a separate fmt-only commit or will be cleaned up on next real
  apply to that stack.

## Verification

Dawarich (one of the hundred-plus touched stacks) showed the pattern
before and after:

```
$ cd stacks/dawarich && ../../scripts/tg plan

Before:
  Plan: 0 to add, 2 to change, 0 to destroy.
   # kubernetes_namespace.dawarich will be updated in-place
     (goldilocks.fairwinds.com/vpa-update-mode -> null)
   # module.tls_secret.kubernetes_secret.tls_secret will be updated in-place
     (Kyverno generate.* labels — fixed in 8d94688d)

After:
  No changes. Your infrastructure matches the configuration.
```

Injection count check:
```
$ rg -c 'KYVERNO_LIFECYCLE_V1: goldilocks-vpa-auto-mode' stacks/ | awk -F: '{s+=$2} END {print s}'
108
```

## Reproduce locally
1. `git pull`
2. Pick any stack: `cd stacks/<name> && ../../scripts/tg plan`
3. Expect: no drift on the namespace's `goldilocks.fairwinds.com/vpa-update-mode` label.

Closes: code-dwx

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 21:15:27 +00:00
Viktor Barzin
2b8bb849c0 [infra] Bump claude-agent-service + beadboard image tags
## Context
Two rolling updates tied to the BeadBoard dispatch-button work (code-kel):

1. claude-agent-service now ships bd 1.0.2, a seed ConfigMap-equivalent
   (files in /usr/share/agent-seed/), the beads-task-runner agent, and
   hmac.compare_digest bearer verification. The tag moves from 382d6b14
   to 0c24c9b6 (monorepo HEAD).
2. The beadboard Deployment in beads-server now consumes
   CLAUDE_AGENT_SERVICE_URL + CLAUDE_AGENT_BEARER_TOKEN, so the image
   needs the Dispatch button + /api/agent-dispatch + /api/agent-status
   routes. Tag moves from :latest to :17a38e43 (fork HEAD on
   github.com/ViktorBarzin/beadboard).
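The bearer verification mentioned in (1) presumably reduces to a constant-time token comparison; a sketch under that assumption (the handler shape and names are hypothetical, not the monorepo's actual code):

```python
import hmac

def is_authorized(auth_header: str, expected: str) -> bool:
    """Validate an 'Authorization: Bearer <token>' header in constant time.

    In the real service, `expected` would come from CLAUDE_AGENT_BEARER_TOKEN.
    """
    prefix = "Bearer "
    if not auth_header.startswith(prefix):
        return False
    # hmac.compare_digest avoids the timing side-channel a plain == would leak
    return hmac.compare_digest(auth_header[len(prefix):], expected)
```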

## What this change does
- Bumps `local.image_tag` in claude-agent-service `main.tf`.
- Drops the "temporary" comment on `beadboard_image_tag` and sets the
  default to 17a38e43 (honours the infra 8-char-SHA rule — CLAUDE.md
  "Use 8-char git SHA tags — `:latest` causes stale pull-through cache").

## Test Plan
### Automated
- Both images already pushed to registry.viktorbarzin.me{,:5050}/:
  - claude-agent-service:0c24c9b6 verified via
    `docker run --rm … bd version` → 1.0.2 OK, /usr/share/agent-seed/
    contains both seed files.
  - beadboard:17a38e43 pushed, digest cd0d3c47.
- terraform fmt/validate clean on both stacks from the earlier commits.

### Manual Verification
1. Push triggers Woodpecker default.yml.
2. Expected: both stacks apply; claude-agent-service pod rolls (new
   seed-beads-agent init creates /workspace/.beads/ + /workspace/scratch
   + copies beads-task-runner.md), beadboard pod rolls with new env vars
   sourced from beadboard-agent-service ExternalSecret.
3. Cross-check: `kubectl -n claude-agent get pod -o yaml | grep image:`
   should show :0c24c9b6; `kubectl -n beads-server get deploy beadboard
   -o yaml | grep image:` should show :17a38e43.

Closes: code-kel
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 19:24:37 +00:00
Viktor Barzin
e1d20457c4 [infra/claude-agent-service] Seed beads metadata + scratch dir at runtime
## Context

Review of the BeadBoard Dispatch wiring found that the claude-agent-service
Dockerfile's `COPY beads/metadata.json /workspace/.beads/metadata.json` and
`COPY agents/beads-task-runner.md /home/agent/.claude/agents/...` both land
on paths that are volume-mounted at runtime:

  - `/workspace` → `claude-agent-workspace-encrypted` PVC (main.tf:394-398)
  - `/home/agent/.claude` → `claude-home` emptyDir (main.tf:424-427)

Kubernetes mounts hide image-layer content at those paths, so the COPYs are
dead. The companion commit in `claude-agent-service` restages both files to
`/usr/share/agent-seed/` (an image-layer path that is never mounted).

Additionally, the beads-task-runner agent rails expect
`/workspace/scratch/<job_id>/` to exist, but nothing was creating it.

## Layout before / after

```
  Before (dead COPYs):

    image layer          runtime (mounted volumes hide the files)
    -----------          -----------------------------------
    /workspace/          <- hidden by PVC mount
      .beads/
        metadata.json    <- UNREACHABLE
    /home/agent/.claude/ <- hidden by emptyDir mount
      agents/
        beads-task-runner.md  <- UNREACHABLE

  After (init container seeds volumes at pod start):

    image layer          runtime
    -----------          ------------------------------------
    /usr/share/agent-seed/
      beads-metadata.json    --+
      beads-task-runner.md    --+-> copied by seed-beads-agent init
                                    container into the mounted volumes
                                    on every pod start:
                                      /workspace/.beads/metadata.json
                                      /workspace/scratch/
                                      /home/agent/.claude/agents/beads-task-runner.md
```

## What

### New init container: `seed-beads-agent`
  - Positioned AFTER `git-init`, BEFORE the main container.
  - Uses the same service image (`${local.image}:${local.image_tag}`) — the
    seed files are baked in at `/usr/share/agent-seed/`.
  - Runs as default uid 1000 (the PVCs are already chowned by `fix-perms`).
  - Shell body:
      mkdir -p /workspace/.beads /workspace/scratch /home/agent/.claude/agents
      cp /usr/share/agent-seed/beads-metadata.json     /workspace/.beads/metadata.json
      cp /usr/share/agent-seed/beads-task-runner.md    /home/agent/.claude/agents/beads-task-runner.md
  - Mounts: `workspace` at `/workspace`, `claude-home` at `/home/agent/.claude`.
  - Resources: 32Mi requests / 64Mi limits (matches `fix-perms`/`copy-claude-creds`).

### Formatting
  - `terraform fmt -recursive` also normalised whitespace in the token-expiry
    locals block and the CronJob container definition. No semantic change.

## What is NOT in this change

  - No image tag bump. The Dockerfile refactor that produces the
    `/usr/share/agent-seed/` path lands in the claude-agent-service repo
    and will roll in on the next CI build. Until that build ships and the
    tag is bumped in this file, the new init container will `cp` from a
    path that doesn't exist yet — so do NOT apply this commit until the
    corresponding image tag bump is ready. The commit is declarative prep.
  - No changes to storage class, RBAC, Service, or any other init.
  - The main container mounts remain unchanged — only the init containers
    prepare volume contents.

## Test Plan

### Automated

```
$ terraform fmt -check -recursive stacks/claude-agent-service/
(no output — clean)

$ terraform -chdir=stacks/claude-agent-service/ init -backend=false
Terraform has been successfully initialized!

$ terraform -chdir=stacks/claude-agent-service/ validate
Warning: Deprecated Resource (pre-existing; use kubernetes_namespace_v1)
Success! The configuration is valid, but there were some validation warnings
as shown above.
```

### Manual Verification (after image bump + apply)

1. Bump `local.image_tag` in main.tf to the SHA of a build that has
   `/usr/share/agent-seed/*` (verify with `docker inspect $IMAGE | jq ...`
   or `kubectl run tmp --image ... -- ls /usr/share/agent-seed`).
2. `scripts/tg apply stacks/claude-agent-service`
3. `kubectl -n claude-agent get pods -w` — all init containers complete.
4. `kubectl -n claude-agent exec deploy/claude-agent-service -c claude-agent-service -- ls -la /workspace/.beads/metadata.json /home/agent/.claude/agents/beads-task-runner.md /workspace/scratch`
   Expected: all three paths exist; first two are regular files with the
   expected content, `scratch` is a directory.
5. `kubectl -n claude-agent exec deploy/claude-agent-service -c claude-agent-service -- jq -r .dolt_server_host /workspace/.beads/metadata.json`
   Expected: `dolt.beads-server.svc.cluster.local`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 14:23:19 +00:00
Viktor Barzin
c9d221d578 [infra] Establish KYVERNO_LIFECYCLE_V1 drift-suppression convention [ci skip]
## Context

Phase 1 of the state-drift consolidation audit (plan Wave 3) identified that
the entire repo leans on a repeated `lifecycle { ignore_changes = [...dns_config] }`
snippet to suppress Kyverno's admission-webhook dns_config mutation (the ndots=2
override that prevents NxDomain search-domain flooding). 27 occurrences across
19 stacks. Without this suppression, every pod-owning resource shows perpetual
TF plan drift.

The original plan proposed a shared `modules/kubernetes/kyverno_lifecycle/`
module emitting the ignore-paths list as an output that stacks would consume in
their `ignore_changes` blocks. That approach is architecturally impossible:
Terraform's `ignore_changes` meta-argument accepts only static attribute paths
— it rejects module outputs, locals, variables, and any other expression
(Terraform evaluates `lifecycle` statically, before building the expression
graph). So a DRY
module cannot exist. The canonical pattern IS the repeated snippet.

What the snippet was missing was a *discoverability tag* so that (a) new
resources can be validated for compliance, (b) the existing 27 sites can be
grep'd in a single command, and (c) future maintainers understand the
convention rather than each reinventing it.

## This change

- Introduces `# KYVERNO_LIFECYCLE_V1` as the canonical marker comment.
  Attached inline on every `spec[0].template[0].spec[0].dns_config` line
  (or `spec[0].job_template[0].spec[0]...` for CronJobs) across all 27
  existing suppression sites.
- Documents the convention with rationale and copy-paste snippets in
  `AGENTS.md` → new "Kyverno Drift Suppression" section.
- Expands the existing `.claude/CLAUDE.md` Kyverno ndots note to reference
  the marker and explain why the module approach is blocked.
- Updates `_template/main.tf.example` so every new stack starts compliant.

## What is NOT in this change

- The `kubernetes_manifest` Kyverno annotation drift (beads `code-seq`)
  — that is Phase B with a sibling `# KYVERNO_MANIFEST_V1` marker.
- Behavioral changes — every `ignore_changes` list is byte-identical
  save for the inline comment.
- The fallback module the original plan anticipated — skipped because
  Terraform rejects expressions in `ignore_changes`.
- `terraform fmt` cleanup on adjacent unrelated blocks in three files
  (claude-agent-service, freedify/factory, hermes-agent). Reverted to
  keep this commit scoped to the convention rollout.

## Before / after

Before (an accidental omission is indistinguishable from the intentional convention):
```hcl
lifecycle {
  ignore_changes = [spec[0].template[0].spec[0].dns_config]
}
```

After (greppable, self-documenting, discoverable by tooling):
```hcl
lifecycle {
  ignore_changes = [spec[0].template[0].spec[0].dns_config] # KYVERNO_LIFECYCLE_V1
}
```

## Test Plan

### Automated
```
$ rg -c 'KYVERNO_LIFECYCLE_V1' stacks/ -g '*.tf' -g '*.tf.example' \
    | awk -F: '{s+=$2} END {print s}'
27

$ git diff --stat | grep -E '\.(tf|tf\.example|md)$' | wc -l
21

# All code-file diffs are 1 insertion + 1 deletion per marker site,
# except beads-server (3), ebooks (4), immich (3), uptime-kuma (2).
$ git diff --stat stacks/ | tail -1
20 files changed, 45 insertions(+), 28 deletions(-)
```

### Manual Verification

No apply required — HCL comments only. Zero effect on any stack's plan output.
Future audits: `rg 'KYVERNO_LIFECYCLE_V1' stacks/ | wc -l` must grow as new
pod-owning resources are added.

## Reproduce locally
1. `cd infra && git pull`
2. `rg 'KYVERNO_LIFECYCLE_V1' stacks/` → expect 27 hits in 19 files
3. Grep any new `kubernetes_deployment` for the marker; absence = missing
   suppression.

Closes: code-28m

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 14:15:51 +00:00
Viktor Barzin
82b7866bc9 [claude-agent-service] Remove orphaned DevVM SSH key wiring
## Context

The remote-executor pattern that SSHed into the DevVM (10.0.10.10) to run
`claude -p` was fully migrated to the in-cluster service
`claude-agent-service.claude-agent.svc:8080/execute` in commits 42f1c3cf and
99180bec (2026-04-18). Five parallel codebase audits (GH Actions, Woodpecker
+ scripts, K8s CronJobs/Deployments, n8n, local scripts/hooks/docs) confirmed
zero remaining SSH+claude sites.

This commit removes two cleanup artifacts left behind by that migration.

## This change

1. Deletes `.claude/skills/archived/setup-remote-executor.md` — the archived
   skill doc for the obsolete SSH-based pattern. Already in `archived/`,
   harmless but noise; deleting it prevents anyone from copy-pasting the old
   approach.

2. Removes `kubernetes_secret.ssh_key` from
   `stacks/claude-agent-service/main.tf`. The Secret was created from the
   `devvm_ssh_key` field at Vault `secret/ci/infra` but was never mounted
   into the agent pod. The pod's `git-init` init container uses HTTPS +
   `$GITHUB_TOKEN` exclusively and force-rewrites every `git@github.com:`
   and `https://github.com/` URL via `git config url.insteadOf`, so no
   downstream `git` invocation could fall through to SSH even if it tried.

3. Removes the now-orphaned `data "vault_kv_secret_v2" "ci_secrets"` block —
   the SSH key resource was its only consumer.

## What is NOT in this change

- The `devvm_ssh_key` field at Vault `secret/ci/infra` stays in place.
  Removing it requires read/modify/put of the full secret and the upside
  is one unused Vault key. Not worth it without strong justification.
- DevVM host decommission is out of scope (separate audit needed for
  non-Claude users of the host).
- Pre-existing `terraform fmt` warnings at lines 464-505 (CronJob alignment)
  left untouched per no-adjacent-refactor rule.

## Test plan

### Automated

- `terraform fmt -check stacks/claude-agent-service/main.tf` — only the
  pre-existing lines 464-505 are flagged; no new fmt warnings introduced
  by these deletions.

### Manual verification

1. `cd infra/stacks/claude-agent-service && ../../scripts/tg apply`
2. Expect exactly one resource destroyed: `kubernetes_secret.ssh_key`.
   The `ci_secrets` data source removal is plan-time only; does not appear
   in resource counts.
3. `kubectl -n claude-agent get secret ssh-key` → `NotFound`.
4. `kubectl -n claude-agent get pod` → both pods Running, no restart events.
5. Submit a synthetic agent job via the HTTP API to confirm the pipeline still works:
   `curl -X POST http://claude-agent-service.claude-agent.svc.cluster.local:8080/execute`
   with a minimal prompt; expect the job to complete with `exit_code=0`.

Closes: code-bck

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 13:31:15 +00:00
Viktor Barzin
50dea8f0a7 [monitoring] Add Claude OAuth token expiry monitoring + alerts
## Context

The new CLAUDE_CODE_OAUTH_TOKEN mechanism (commit 8a054752) uses
long-lived 1-year tokens minted via `claude setup-token`. Tokens don't
auto-refresh — at the 1-year mark they expire hard and the upgrade
agent stops working. We need to be told 30 days ahead, not find out
when DIUN fires and gets 401 again.

A cron rotator doesn't make sense here (tokens don't refresh, they
just expire) so we alert instead. Two spares at
`secret/claude-agent-service-spare-{1,2}` provide failover runway —
monitor covers all three.

## This change

**CronJob** (`claude-agent` ns, every 6h): reads a ConfigMap
containing `<path> → expiry_unix_timestamp` entries, pushes
`claude_oauth_token_expiry_timestamp{path="..."}` and
`claude_oauth_expiry_monitor_last_push_timestamp` to Pushgateway at
`prometheus-prometheus-pushgateway.monitoring:9091`.

**ConfigMap** generated from a Terraform local `claude_oauth_token_mint_epochs`
— source of truth for mint times. On rotation, update the map + apply.
TTL is a shared local (365d).

**PrometheusRules** (in prometheus_chart_values.tpl):
- `ClaudeOAuthTokenExpiringSoon`  — <30d, warning, for 1h
- `ClaudeOAuthTokenCritical`      — <7d,  critical, for 10m
- `ClaudeOAuthTokenMonitorStale`  — last push >48h, warning
- `ClaudeOAuthTokenMonitorNeverRun` — metric absent for 2h, warning

Alert labels include `{{ $labels.path }}` so we know which token is
expiring (primary / spare-1 / spare-2).
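The arithmetic the CronJob and the rules rely on is simple enough to sketch (the mint epochs below are placeholders mirroring the shape of the `claude_oauth_token_mint_epochs` local, not its real values):

```python
import time

TTL_SECONDS = 365 * 24 * 3600  # the shared 365d TTL local

# Placeholder mirror of the Terraform local claude_oauth_token_mint_epochs.
mint_epochs = {"primary": 1744979231, "spare-1": 1744979231, "spare-2": 1744979231}

def expiry_metrics(epochs):
    """path -> the claude_oauth_token_expiry_timestamp value the CronJob pushes."""
    return {path: mint + TTL_SECONDS for path, mint in epochs.items()}

def days_remaining(expiry_ts, now=None):
    # Same expression as the verification query: (expiry - time()) / 86400
    return (expiry_ts - (time.time() if now is None else now)) / 86400
```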

## Verification

```
$ kubectl -n claude-agent create job --from=cronjob/claude-oauth-expiry-monitor manual
$ curl pushgateway/metrics | grep claude_oauth_token_expiry
claude_oauth_token_expiry_timestamp{...,path="primary"} 1.808064429e+09
claude_oauth_token_expiry_timestamp{...,path="spare-1"} 1.80806428e+09
claude_oauth_token_expiry_timestamp{...,path="spare-2"} 1.808064429e+09

$ query: (claude_oauth_token_expiry_timestamp - time()) / 86400
  primary: 365.2 days
  spare-1: 365.2 days
  spare-2: 365.2 days
```

## Rotation playbook (future)

1. `kubectl run -it --rm --image=registry.viktorbarzin.me/claude-agent-service:latest tokmint -- claude setup-token`
   (or harvest via `harvest3.py` pattern in memory for headless flow)
2. `vault kv patch secret/claude-agent-service claude_oauth_token=<new>`
3. Update `claude_oauth_token_mint_epochs["primary"]` in
   `stacks/claude-agent-service/main.tf` with new unix timestamp
4. `scripts/tg apply` claude-agent-service + monitoring
5. Alert clears within 6h (the next cron tick) plus the 1h `for:` duration
   on `ClaudeOAuthTokenExpiringSoon`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 12:27:11 +00:00
Viktor Barzin
8a05475218 [claude-agent-service] Add CLAUDE_CODE_OAUTH_TOKEN env var — 1-year long-lived auth
## Context

Earlier today we hit a silent auth failure on the upgrade agent: the
short-lived `sk-ant-oat01-*` access token in `.credentials.json` had
expired and the CLI's refresh path failed (refresh token either stale
or invalidated after the creds sat in Vault for 5 days).

The real fix isn't "refresh more often" — it's switching to the
long-lived auth mechanism `claude setup-token` provides. Unlike
`claude login` (OAuth flow → 6–8h access token + refresh token JSON),
`setup-token` mints a single opaque token valid for **1 year** that
the CLI consumes via `CLAUDE_CODE_OAUTH_TOKEN` env var. No refresh
dance, no JSON file, no rotation for a year.

## This change

Adds `CLAUDE_CODE_OAUTH_TOKEN` to the existing
`claude-agent-secrets` ExternalSecret, sourced from a new
`claude_oauth_token` field at `secret/claude-agent-service`. The
container already pulls that secret via `envFrom`, so no other wiring
needed.

The Claude CLI prefers `CLAUDE_CODE_OAUTH_TOKEN` over the OAuth JSON
file when both are present, so this is additive — `.credentials.json`
stays mounted as a fallback while we validate the long-lived path.
Future cleanup can remove the JSON mount entirely.

Verified E2E: synthetic DIUN webhook for `docker.io/library/httpd`
→ n8n → claude-agent-service /execute → agent job `fea5ff70dcfe`
completed in 30s with exit_code=0, agent correctly identified no
matching stack and aborted without changes. No API auth errors.

## Spares

Harvested two additional long-lived tokens and stored them at
`secret/claude-agent-service-spare-{1,2}` for failover if the
primary is compromised or revoked. Verified both coexist with the
primary (no revocation on mint).

## What is NOT in this change

- No removal of `.credentials.json` mount or its Vault source (keep
  as fallback until we've run for 24h on env-var auth with no issues).
- No cron rotator — 1-year TTL means this can be a yearly manual
  rotation, alerted on from Vault metadata. If we add rotation, we'll
  source from the spares pool rather than minting new tokens.

## Reproduce locally

```
1. vault login -method=oidc
2. vault kv get -field=claude_oauth_token secret/claude-agent-service | head -c 25
3. cd stacks/claude-agent-service && ../../scripts/tg apply
4. kubectl -n claude-agent exec deploy/claude-agent-service -- \
     printenv CLAUDE_CODE_OAUTH_TOKEN   # should be 108 chars
5. Fire synthetic DIUN webhook (see docs/architecture/automated-upgrades.md)
```

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 12:12:30 +00:00