## Context
The remote-executor pattern that SSHed into the DevVM (10.0.10.10) to run
`claude -p` was fully migrated to the in-cluster service
`claude-agent-service.claude-agent.svc:8080/execute` in commits 42f1c3cf and
99180bec (2026-04-18). Five parallel codebase audits (GH Actions, Woodpecker
+ scripts, K8s CronJobs/Deployments, n8n, local scripts/hooks/docs) confirmed
zero remaining SSH+claude sites.
This commit removes two cleanup artifacts left behind by that migration.
## This change
1. Deletes `.claude/skills/archived/setup-remote-executor.md`, the archived
skill doc for the obsolete SSH-based pattern. It already lived in `archived/`
and was harmless, but it was noise; deleting it prevents anyone from
copy-pasting the old approach.
2. Removes `kubernetes_secret.ssh_key` from
`stacks/claude-agent-service/main.tf`. The Secret was created from the
`devvm_ssh_key` field at Vault `secret/ci/infra` but was never mounted
into the agent pod. The pod's `git-init` init container uses HTTPS +
`$GITHUB_TOKEN` exclusively and force-rewrites every `git@github.com:`
and `https://github.com/` URL via `git config url.insteadOf`, so no
downstream `git` invocation could fall through to SSH even if it tried.
3. Removes the now-orphaned `data "vault_kv_secret_v2" "ci_secrets"` block —
the SSH key resource was its only consumer.
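
The URL rewriting described in item 2 can be sketched as follows. This is a hypothetical reconstruction, not the actual `git-init` script (which this commit does not show): the function name and the `x-access-token` username are assumptions, and only `url.<base>.insteadOf` and `$GITHUB_TOKEN` come from the description above.

```shell
# Sketch only: the real git-init init container script is not part of this commit.
# configure_github_https is a hypothetical helper name.
configure_github_https() {
  token="$1"
  # Assumption: "x-access-token" as the HTTPS username for token auth.
  prefix="https://x-access-token:${token}@github.com/"

  # Rewrite SSH-style GitHub URLs to the token-bearing HTTPS prefix.
  # --replace-all keeps the function safe to run more than once.
  git config --global --replace-all "url.${prefix}.insteadOf" "git@github.com:"
  # Rewrite plain HTTPS GitHub URLs to the same prefix.
  git config --global --add "url.${prefix}.insteadOf" "https://github.com/"
}

# Hypothetical call site inside the init container:
# configure_github_https "$GITHUB_TOKEN"
```

After the function has run, `git ls-remote --get-url git@github.com:example/repo.git` prints the rewritten HTTPS URL without contacting the remote, which is a quick way to confirm that SSH-style URLs no longer resolve to SSH.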
## What is NOT in this change
- The `devvm_ssh_key` field at Vault `secret/ci/infra` stays in place.
Removing it requires a read/modify/put of the full secret, and the only
gain is one fewer unused Vault key. Not worth it without strong justification.
- DevVM host decommission is out of scope (separate audit needed for
non-Claude users of the host).
- Pre-existing `terraform fmt` warnings at lines 464-505 (CronJob alignment)
left untouched per no-adjacent-refactor rule.
## Test plan
### Automated
- `terraform fmt -check stacks/claude-agent-service/main.tf` — only the
pre-existing lines 464-505 are flagged; no new fmt warnings introduced
by these deletions.
### Manual verification
1. `cd infra/stacks/claude-agent-service && ../../scripts/tg apply`
2. Expect exactly one resource destroyed: `kubernetes_secret.ssh_key`.
The `ci_secrets` data source removal is plan-time only; does not appear
in resource counts.
3. `kubectl -n claude-agent get secret ssh-key` → `NotFound`.
4. `kubectl -n claude-agent get pod` → both pods Running, no restart events.
5. Submit a synthetic agent job via the HTTP API to confirm the pipeline still works:
`curl -X POST http://claude-agent-service.claude-agent.svc.cluster.local:8080/execute`
with a minimal prompt; expect the job to complete with `exit_code=0`.
Closes: code-bck
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
## Context
The new CLAUDE_CODE_OAUTH_TOKEN mechanism (commit 8a054752) uses
long-lived 1-year tokens minted via `claude setup-token`. Tokens don't
auto-refresh — at the 1-year mark they expire hard and the upgrade
agent stops working. We need to be told 30 days ahead, not find out
when DIUN fires and gets 401 again.
A cron rotator doesn't make sense here (tokens don't refresh, they
just expire), so we alert instead. Two spares at
`secret/claude-agent-service-spare-{1,2}` provide failover runway; the
monitor covers all three tokens.
## This change
**CronJob** (`claude-agent` ns, every 6h): reads a ConfigMap
containing `<path> → expiry_unix_timestamp` entries, pushes
`claude_oauth_token_expiry_timestamp{path="..."}` and
`claude_oauth_expiry_monitor_last_push_timestamp` to Pushgateway at
`prometheus-prometheus-pushgateway.monitoring:9091`.
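
A minimal sketch of what the CronJob's push step could look like, assuming the ConfigMap entries reach the script as whitespace-separated `<path> <expiry_unix_timestamp>` lines. The real script is not shown in this commit; the `render_metrics` name, the entries file location, and the Pushgateway job name in the commented wiring are all hypothetical.

```shell
# Sketch only: the real CronJob script is not part of this commit message.
# Render text-format metrics from "<path> <expiry_unix_timestamp>" lines on
# stdin. The current time is passed as $1 so the output is deterministic.
render_metrics() {
  now="$1"
  # The "|| [ -n ... ]" clause keeps a final line that lacks a trailing newline.
  while read -r path expiry || [ -n "$path" ]; do
    [ -n "$path" ] || continue   # skip blank lines
    printf 'claude_oauth_token_expiry_timestamp{path="%s"} %s\n' "$path" "$expiry"
  done
  printf 'claude_oauth_expiry_monitor_last_push_timestamp %s\n' "$now"
}

# Hypothetical wiring (not executed here). /config/expiries and the job name
# in the URL path are assumptions; the host and port are from the text above.
# render_metrics "$(date +%s)" < /config/expiries \
#   | curl --fail --silent --data-binary @- \
#       http://prometheus-prometheus-pushgateway.monitoring:9091/metrics/job/claude-oauth-expiry-monitor
```

Passing the timestamp in as an argument, rather than calling `date` inside the function, keeps the rendering step easy to check by hand before anything is pushed.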
**ConfigMap** generated from a Terraform local `claude_oauth_token_mint_epochs`
— source of truth for mint times. On rotation, update the map + apply.
TTL is a shared local (365d).
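
The mint-time-to-expiry arithmetic is simple enough to check by hand. The sketch below mirrors it in shell rather than Terraform; the function name is hypothetical, and the sample mint epoch is an assumption, back-computed from the `primary` expiry in the verification output using the 365d TTL.

```shell
# Sketch only: mirrors the described Terraform arithmetic in plain shell.
TTL_DAYS=365   # shared TTL stated in the commit message

# expiry = mint epoch + TTL in seconds (86400 seconds per day)
expiry_from_mint() {
  echo $(( $1 + $TTL_DAYS * 86400 ))
}

# Illustrative mint epoch (assumption, see lead-in).
expiry_from_mint 1776528429   # prints 1808064429
```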
**PrometheusRules** (in prometheus_chart_values.tpl):
- `ClaudeOAuthTokenExpiringSoon` — <30d, warning, for 1h
- `ClaudeOAuthTokenCritical` — <7d, critical, for 10m
- `ClaudeOAuthTokenMonitorStale` — last push >48h, warning
- `ClaudeOAuthTokenMonitorNeverRun` — metric absent for 2h, warning
Alert labels include `{{ $labels.path }}` so we know which token is
expiring (primary / spare-1 / spare-2).
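
To make the two expiry thresholds concrete, here is a rough shell equivalent of the comparison. It is an illustration only: the real rules are PromQL in `prometheus_chart_values.tpl`, they also apply the `for:` delays listed above, and `classify_expiry` is a hypothetical name. Below 7 days both expiry rules would match; this sketch reports only the more severe one.

```shell
# Sketch only: mirrors the <30d / <7d thresholds, ignoring "for:" durations.
# Usage: classify_expiry <expiry_unix_timestamp> <now_unix_timestamp>
classify_expiry() {
  remaining=$(( $1 - $2 ))   # seconds until the token expires
  if [ "$remaining" -lt $(( 7 * 86400 )) ]; then
    echo critical            # ClaudeOAuthTokenCritical territory
  elif [ "$remaining" -lt $(( 30 * 86400 )) ]; then
    echo warning             # ClaudeOAuthTokenExpiringSoon territory
  else
    echo ok
  fi
}
```

For example, `classify_expiry 1808064429 "$(date +%s)"` reports where the `primary` token from the verification output stands at the moment it is run.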
## Verification
```
$ kubectl -n claude-agent create job --from=cronjob/claude-oauth-expiry-monitor manual
$ curl pushgateway/metrics | grep claude_oauth_token_expiry
claude_oauth_token_expiry_timestamp{...,path="primary"} 1.808064429e+09
claude_oauth_token_expiry_timestamp{...,path="spare-1"} 1.80806428e+09
claude_oauth_token_expiry_timestamp{...,path="spare-2"} 1.808064429e+09
$ query: (claude_oauth_token_expiry_timestamp - time()) / 86400
primary: 365.2 days
spare-1: 365.2 days
spare-2: 365.2 days
```
## Rotation playbook (future)
1. `kubectl run -it --rm --image=registry.viktorbarzin.me/claude-agent-service:latest tokmint -- claude setup-token`
(or harvest via `harvest3.py` pattern in memory for headless flow)
2. `vault kv patch secret/claude-agent-service claude_oauth_token=<new>`
3. Update `claude_oauth_token_mint_epochs["primary"]` in
`stacks/claude-agent-service/main.tf` with new unix timestamp
4. `scripts/tg apply` claude-agent-service + monitoring
5. Alert clears once the next cron tick (within 6h) pushes the new expiry
and Prometheus re-evaluates the rule; the `ClaudeOAuthTokenExpiringSoon`
"for:" duration (1h) delays firing, not resolution
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
## Context
Earlier today we hit a silent auth failure on the upgrade agent: the
short-lived `sk-ant-oat01-*` access token in `.credentials.json` had
expired and the CLI's refresh path failed (refresh token either stale
or invalidated after the creds sat in Vault for 5 days).
The real fix isn't "refresh more often" — it's switching to the
long-lived auth mechanism `claude setup-token` provides. Unlike
`claude login` (OAuth flow → 6–8h access token + refresh token JSON),
`setup-token` mints a single opaque token valid for **1 year** that
the CLI consumes via `CLAUDE_CODE_OAUTH_TOKEN` env var. No refresh
dance, no JSON file, no rotation for a year.
## This change
Adds `CLAUDE_CODE_OAUTH_TOKEN` to the existing
`claude-agent-secrets` ExternalSecret, sourced from a new
`claude_oauth_token` field at `secret/claude-agent-service`. The
container already pulls that secret via `envFrom`, so no other wiring
needed.
The Claude CLI prefers `CLAUDE_CODE_OAUTH_TOKEN` over the OAuth JSON
file when both are present, so this is additive — `.credentials.json`
stays mounted as a fallback while we validate the long-lived path.
Future cleanup can remove the JSON mount entirely.
Verified E2E: synthetic DIUN webhook for `docker.io/library/httpd`
→ n8n → claude-agent-service /execute → agent job `fea5ff70dcfe`
completed in 30s with exit_code=0, agent correctly identified no
matching stack and aborted without changes. No API auth errors.
## Spares
Harvested two additional long-lived tokens and stored them at
`secret/claude-agent-service-spare-{1,2}` for failover if the
primary is compromised or revoked. Verified both coexist with the
primary (no revocation on mint).
## What is NOT in this change
- No removal of `.credentials.json` mount or its Vault source (keep
as fallback until we've run for 24h on env-var auth with no issues).
- No cron rotator — 1-year TTL means this can be a yearly manual
rotation, alerted on from Vault metadata. If we add rotation, we'll
source from the spares pool rather than minting new tokens.
## Reproduce locally
```
# 1. Authenticate to Vault
vault login -method=oidc
# 2. Confirm the field exists (prints only the first 25 characters)
vault kv get -field=claude_oauth_token secret/claude-agent-service | head -c 25
# 3. Apply the stack
cd stacks/claude-agent-service && ../../scripts/tg apply
# 4. Check the env var inside the pod; the value should be 108 chars
kubectl -n claude-agent exec deploy/claude-agent-service -- \
  printenv CLAUDE_CODE_OAUTH_TOKEN
# 5. Fire synthetic DIUN webhook (see docs/architecture/automated-upgrades.md)
```
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>