infra/.claude/agents
Viktor Barzin a5963169ec [service-upgrade] Drop vault-CLI assumptions + check default workflow only
## Context
Since the 2026-04-15 migration from SSH-on-DevVM to in-cluster
claude-agent-service, the agent spec's four `vault kv get ...` calls
have been dead code: the pod has no `VAULT_TOKEN`, no `~/.vault-token`,
no Vault login method, and port 8200 is refused. Every token fetch
returns empty, which silently breaks:

- **Slack**: `SLACK_WEBHOOK=""` → POSTs 404 → no messages for 3+ days
  (the exact user-visible symptom that started this thread).
- **Woodpecker CI polling**: `WOODPECKER_TOKEN=""` → 401 on
  `/api/repos/1/pipelines` → agent can't find its own pipeline → 15-min
  poll times out → jumps to rollback → same failure in the revert → hits
  n8n's 30-min ceiling → SIGKILL mid-saga → no commit, no Slack.
- **Changelog fetch**: `GITHUB_TOKEN=""` overrides the env var supplied
  by `envFrom: claude-agent-secrets`, crippling changelog lookups too.

Separately, Step 9 read the overall pipeline `status`, which is
`failure` any time a single workflow fails — e.g. the unrelated
`build-cli` workflow (docker image push to registry.viktorbarzin.me:5050
has been erroring since private-registry htpasswd was enabled on
2026-03-22). That made the agent spuriously rollback every otherwise-
successful upgrade.

## This change
- Replace the four `vault kv get ...` invocations with the matching
  env-var reads (`$GITHUB_TOKEN`, `$WOODPECKER_API_TOKEN`,
  `$SLACK_WEBHOOK_URL`) and document the env-var contract at the top
  of the "Environment" section. The env vars are expected to be
  pre-loaded via `envFrom: claude-agent-secrets` — that part is tracked
  as the companion ExternalSecret/Terraform change in bd code-3o3
  (must land before this spec is effective).
- Rewrite Step 9 to poll the `default` workflow's `state` instead of
  the overall pipeline `status`. Adds a jq example and explicitly
  documents the build-cli noise so future operators know why overall
  status is unreliable.

## What is NOT in this change
- The matching ExternalSecret / Terraform changes that feed
  WOODPECKER_API_TOKEN / SLACK_WEBHOOK_URL / REGISTRY_USER /
  REGISTRY_PASSWORD into the pod. Until those land, this spec still
  produces empty env vars at runtime — but at least the *shape* of the
  contract is correct and grep-friendly.
- The .woodpecker/build-cli.yml `logins:` entry for
  registry.viktorbarzin.me:5050. That's fix C in the same task.

## Test Plan
### Automated
None — this is pure markdown guidance for the model. Syntax-checked by
`grep -nE 'vault kv get|WOODPECKER_TOKEN|SLACK_WEBHOOK[^_]'
.claude/agents/service-upgrade.md` showing only the explanatory
warning on line 37 as a match.

### Manual Verification
After the companion ExternalSecret change lands and the pod has
WOODPECKER_API_TOKEN + SLACK_WEBHOOK_URL in env:
1. Trigger a DIUN-style webhook on a known slow service.
2. Watch `kubectl -n claude-agent logs -f deploy/claude-agent-service`.
3. Expect curl to `ci.viktorbarzin.me/api/...` return 200 and pipeline
   JSON (no 401), and Slack `$SLACK_WEBHOOK_URL` return 200.
4. Expect a Slack `[Upgrade Agent] Starting:` post inside the first
   minute, and a `SUCCESS` or `FAILED + ROLLED BACK` post on exit.

Refs: bd code-3o3

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 13:15:06 +00:00
..
issue-responder.md Add agent task tracking documentation 2026-04-15 17:11:26 +00:00
payslip-extractor.md [payslip-ingest] Update extractor agent + dashboard for v2 regex parser 2026-04-19 10:54:33 +00:00
post-mortem.md feat: add incident management system with user reporting 2026-04-14 20:00:31 +00:00
postmortem-todo-resolver.md feat: post-mortem automation pipeline 2026-04-14 15:34:42 +00:00
service-upgrade.md [service-upgrade] Drop vault-CLI assumptions + check default workflow only 2026-04-19 13:15:06 +00:00
sev-historian.md feat: add incident management system with user reporting 2026-04-14 20:00:31 +00:00
sev-report-writer.md feat: add incident management system with user reporting 2026-04-14 20:00:31 +00:00
sev-triage.md feat: add incident management system with user reporting 2026-04-14 20:00:31 +00:00