Automated Service Upgrades
Overview
OSS services are automatically upgraded via a pipeline that detects new container image versions, analyzes changelogs for breaking changes, backs up databases, applies version bumps through Terraform, and verifies health post-upgrade with automatic rollback on failure.
Architecture
DIUN (every 6h)
│ detects new image tags
│
▼
n8n Webhook (POST /webhook/<uuid>)
│ filters: skip databases, custom images, infra, :latest
│ rate limit: max 5 upgrades per 6h window
│
▼
HTTP POST → claude-agent-service (K8s)
│
▼
claude -p "upgrade agent prompt" (in-cluster)
│
▼
Service Upgrade Agent
├── 1. Identify service + .tf files (grep stacks/)
├── 2. Resolve GitHub repo (config overrides + auto-detect)
├── 3. Fetch changelogs via GitHub API (authenticated, 5000 req/hr)
├── 4. Classify risk (SAFE / CAUTION / UNKNOWN)
├── 5. Slack notification — starting
├── 6. DB backup (if DB-backed service)
├── 7. Edit .tf files (version bump + config changes)
├── 8. Commit + push (Woodpecker CI applies)
├── 9. Wait for CI (poll Woodpecker API)
├── 10. Verify (pod ready + HTTP + Uptime Kuma)
├── 11a. SUCCESS → Slack report
└── 11b. FAILURE → git revert + CI re-applies → Slack alert
Components
DIUN (Docker Image Update Notifier)
- Stack: `stacks/diun/`
- Schedule: Every 6 hours (`DIUN_WATCH_SCHEDULE=0 */6 * * *`)
- Role: Detection only; fires a webhook to n8n when a new image tag is found
- Skip patterns: Databases, `viktorbarzin/*`, `registry.viktorbarzin.me/*`, infrastructure images
- Webhook: `DIUN_NOTIF_WEBHOOK_ENDPOINT` from Vault `secret/diun` → `n8n_webhook_url`
n8n Workflow ("DIUN Upgrade Agent")
- Stack: `stacks/n8n/`
- Workflow backup: `stacks/n8n/workflows/diun-upgrade.json`
- Webhook path: UUID-based (`/webhook/<uuid>`)
- Filters:
  - Only `status=update` (skip `new`, `unchanged`)
  - Skip databases, custom images, infra images, `:latest`
- Rate limiting: Max 5 upgrades per 6-hour window using `$getWorkflowStaticData('global')`
- Action: HTTP POST to `claude-agent-service.claude-agent.svc:8080/execute` with the upgrade agent prompt
Upgrade Agent
- Prompt: `.claude/agents/service-upgrade.md`
- Config: `.claude/reference/upgrade-config.json`
- Contains:
  - 50+ Docker image → GitHub repo mappings
  - 22 Helm chart → GitHub repo mappings
  - 27 DB-backed service definitions with backup metadata
  - Skip patterns and breaking change keywords
Risk Classification
| Risk | Criteria | Verification | Version Jump |
|---|---|---|---|
| SAFE | Patch/minor bump, no breaking keywords in release notes | 2 minutes | Direct to target |
| CAUTION | Major bump, or breaking change keywords found, or in `version_jump_always_step` list | 10 minutes | Step through each version |
| UNKNOWN | Changelog unavailable | 2 minutes (SAFE defaults) | Direct to target |
Breaking change keywords: breaking, BREAKING, migration required, schema change, database migration, manual intervention, action required, removed, deprecated, renamed, incompatible
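The classification rules above can be sketched in a few lines of Python. This is only an illustration, not the agent's actual logic (which lives in the prompt): the function names, the `ALWAYS_STEP` set standing in for `version_jump_always_step`, and the order in which the rules are checked are all assumptions.

```python
import re

# Keywords copied from the list above; matched as plain substrings.
BREAKING_KEYWORDS = [
    "breaking", "BREAKING", "migration required", "schema change",
    "database migration", "manual intervention", "action required",
    "removed", "deprecated", "renamed", "incompatible",
]

# Hypothetical stand-in for version_jump_always_step in upgrade-config.json.
ALWAYS_STEP = {"authentik", "nextcloud", "immich"}


def major_version(tag):
    """Return the leading integer of a tag such as 'v2.7.4' or '1.35.4', else None."""
    match = re.match(r"v?(\d+)", tag)
    return int(match.group(1)) if match else None


def classify_risk(service, old_tag, new_tag, release_notes):
    """Return 'SAFE', 'CAUTION', or 'UNKNOWN'.

    release_notes is None when the changelog could not be fetched.
    Assumption: always-step services and major bumps win over UNKNOWN.
    """
    if service in ALWAYS_STEP:
        return "CAUTION"
    if major_version(old_tag) != major_version(new_tag):
        return "CAUTION"
    if release_notes is None:
        return "UNKNOWN"
    if any(keyword in release_notes for keyword in BREAKING_KEYWORDS):
        return "CAUTION"
    return "SAFE"
```

For example, `classify_risk("shlink", "4.3.4", "5.0.2", "New features")` returns `"CAUTION"` because the major version changes.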
Database Backup
DB-backed services trigger a pre-upgrade backup automatically:
- Shared PostgreSQL: `kubectl create job <job-name> --from=cronjob/postgresql-backup -n dbaas`
- Shared MySQL: `kubectl create job <job-name> --from=cronjob/mysql-backup -n dbaas`
- Dedicated databases (e.g., Immich): Trigger the existing backup CronJob in the service's namespace
If the backup fails, the upgrade is aborted.
Rollback
On verification failure:
- `git revert --no-edit <upgrade-commit-sha>`
- `git push` → Woodpecker CI re-applies the old version
- Re-verify rollback succeeded
- If rollback also fails → CRITICAL Slack alert for manual intervention
Version Patterns
The agent handles all three version patterns in Terraform:
| Pattern | Example | Agent Action |
|---|---|---|
| Variable-based | `variable "immich_version" { default = "v2.7.4" }` | Edit the default value |
| Hardcoded | `image = "vaultwarden/server:1.35.4"` | Replace tag in image string |
| Helm chart | `version = "2026.2.2"` in `helm_release` | Bump chart version |
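To make the three patterns concrete, here is a rough Python sketch of how each edit could be done with regular expressions. The agent itself edits files from its prompt, so these helper names and regexes are purely illustrative assumptions; a real implementation would want an HCL parser rather than regexes.

```python
import re


def bump_variable_default(tf_text, var_name, new_version):
    """Variable-based: rewrite the default of `variable "<var_name>"`."""
    pattern = re.compile(
        r'(variable\s+"%s"\s*\{[^}]*?default\s*=\s*")[^"]*(")' % re.escape(var_name),
        re.S,
    )
    # A callable replacement avoids ambiguity when the version starts with a digit.
    return pattern.sub(lambda m: m.group(1) + new_version + m.group(2), tf_text, count=1)


def bump_image_tag(tf_text, image, new_tag):
    """Hardcoded: replace the tag in every `image = "<image>:<tag>"` string."""
    pattern = re.compile(r'(image\s*=\s*"%s:)[^"]*(")' % re.escape(image))
    return pattern.sub(lambda m: m.group(1) + new_tag + m.group(2), tf_text)


def bump_helm_chart(tf_text, release_name, new_version):
    """Helm chart: bump the first `version = "..."` inside the named helm_release."""
    pattern = re.compile(
        r'(resource\s+"helm_release"\s+"%s"\s*\{.*?\bversion\s*=\s*")[^"]*(")'
        % re.escape(release_name),
        re.S,
    )
    return pattern.sub(lambda m: m.group(1) + new_version + m.group(2), tf_text, count=1)
```

The Helm helper assumes the named `helm_release` block actually contains a `version` attribute; if it does not, the non-greedy match could run into a later block.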
Configuration
Excluding images (handled by DIUN + n8n)
- Databases: `*postgres*`, `*mysql*`, `*redis*`, `*clickhouse*`, `*etcd*`
- Custom: `viktorbarzin/*`, `registry.viktorbarzin.me/*`, `ancamilea/*`, `mghee/*`
- Infrastructure: `registry.k8s.io/*`, `quay.io/tigera/*`, `nvcr.io/*`, `reg.kyverno.io/*`
- `:latest` tags
Rate limiting
- Max 5 upgrades per 6-hour DIUN scan cycle
- Counter resets when the window expires
- Configurable in the n8n "Filter and Rate Limit" code node
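The exclusion patterns and the rate limit above boil down to a filter plus a windowed counter. The live logic is JavaScript in the n8n Code node; the Python below is only a sketch of the same idea. The function names and the shape of the `state` dict are assumptions, and glob matching against the bare image name (no registry prefix handling) is a simplification.

```python
from fnmatch import fnmatchcase

# Patterns copied from the exclusion lists above.
SKIP_PATTERNS = [
    "*postgres*", "*mysql*", "*redis*", "*clickhouse*", "*etcd*",
    "viktorbarzin/*", "registry.viktorbarzin.me/*", "ancamilea/*", "mghee/*",
    "registry.k8s.io/*", "quay.io/tigera/*", "nvcr.io/*", "reg.kyverno.io/*",
]
MAX_UPGRADES_PER_WINDOW = 5
WINDOW_SECONDS = 6 * 60 * 60


def should_upgrade(payload):
    """Apply the status, :latest, and skip-pattern filters to a DIUN payload."""
    if payload.get("diun_entry_status") != "update":
        return False
    image = payload.get("diun_entry_image", "")
    if payload.get("diun_entry_imagetag") == "latest" or image.endswith(":latest"):
        return False
    return not any(fnmatchcase(image, pattern) for pattern in SKIP_PATTERNS)


def take_slot(state, now):
    """Consume one upgrade slot; `state` plays the role of the workflow static data."""
    if "window_start" not in state or now - state["window_start"] >= WINDOW_SECONDS:
        state["window_start"] = now  # window expired: start a fresh one
        state["count"] = 0
    if state["count"] >= MAX_UPGRADES_PER_WINDOW:
        return False
    state["count"] += 1
    return True
```

Starting from an empty `state`, the first five calls to `take_slot` within a window succeed and the sixth is rejected until the window expires.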
Services that always step through versions
- Authentik, Nextcloud, Immich (configured in `upgrade-config.json` → `version_jump_always_step`)
Monitoring
- Slack: All upgrade events reported (start, success, failure, rollback)
- Git: Detailed commit messages with changelog summaries, risk level, backup status
- DIUN Slack: Independent Slack channel for raw version detection (separate from upgrade agent)
Bulk Upgrades
To upgrade all outdated services at once, fire webhooks for each service:
WEBHOOK="https://n8n.viktorbarzin.me/webhook/<uuid>"
curl -s -X POST "$WEBHOOK" \
-H "Content-Type: application/json" \
-d '{"diun_entry_status":"update","diun_entry_image":"<image>","diun_entry_imagetag":"<new_tag>","diun_entry_provider":"kubernetes"}'
n8n processes all webhooks in parallel (one claude -p per webhook). Before bulk runs, increase the rate limit in the n8n Code node (MAX_UPGRADES_PER_WINDOW) and reset the counter:
-- Reset rate limiter
UPDATE workflow_entity SET "staticData" = '{}'::json WHERE name = 'DIUN Upgrade Agent';
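For more than a handful of services, the curl call above can be looped from a script. The following Python sketch builds the same JSON payload for each `(image, tag)` pair; `build_upgrade_request` and `fire_all` are hypothetical helper names, and nothing is sent unless `send=True` is passed.

```python
import json
import urllib.request


def build_upgrade_request(webhook_url, image, new_tag):
    """Build the POST request that mimics a DIUN 'update' notification."""
    payload = {
        "diun_entry_status": "update",
        "diun_entry_image": image,
        "diun_entry_imagetag": new_tag,
        "diun_entry_provider": "kubernetes",
    }
    return urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def fire_all(webhook_url, upgrades, send=False):
    """Build one request per (image, tag) pair; only hit the network if send=True."""
    requests = [build_upgrade_request(webhook_url, image, tag) for image, tag in upgrades]
    if send:
        for request in requests:
            with urllib.request.urlopen(request, timeout=30) as response:
                response.read()
    return requests
```

Calling `fire_all(webhook, [("vaultwarden/server", "1.35.4")])` first, without `send=True`, is a cheap way to inspect the payloads before a real bulk run.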
First Bulk Run (2026-04-16)
12 services upgraded in ~30 minutes, fully automated:
| Service | From | To | Notes |
|---|---|---|---|
| audiobookshelf | 2.32.1 | 2.33.1 | Security fixes (IDOR) |
| owntracks | 0.9.9 | 1.0.1 | Major version bump |
| open-webui | v0.7.2 | v0.8.12 | |
| immich | v2.7.4 | v2.7.5 | Patch, DB backup taken |
| coturn | 4.6.3-r1 | 4.10.0-r1 | Major version bump |
| shlink | 4.3.4 | 5.0.2 | Major, DB-backed |
| phpipam | v1.7.0 | v1.7.4 | Patch, DB-backed |
| onlyoffice | 8.2.3 | 9.3.1 | Major version bump |
| paperless-ngx | 2.16.4 | 2.20.14 | Agent also bumped memory 1Gi → 2Gi |
| linkwarden | v2.9.1 | v2.14.0 | 23 intermediate releases, 254M DB backup |
| synapse | v1.125.0 | v1.151.0 | Large jump, DB-backed |
| dawarich | 0.37.1 | 1.6.1 | Upgraded → verification failed → auto-rolled back → forward-fixed |
Key behaviors observed:
- Auto-rollback works: Dawarich upgrade failed verification, agent reverted, then re-applied with a forward fix
- Resource awareness: Paperless-ngx agent detected the new version needed more memory and bumped limits
- DB backups: All DB-backed services had pre-upgrade dumps taken automatically
- Changelog analysis: Linkwarden commit summarized 23 intermediate releases; vaultwarden (earlier test) identified 3 CVEs
- Parallel execution: 11 agents ran concurrently, handled git rebase conflicts automatically
Secrets
| Secret | Vault Path | Purpose |
|---|---|---|
| n8n webhook URL | `secret/diun` → `n8n_webhook_url` | DIUN → n8n trigger |
| Agent API bearer token | `secret/claude-agent-service` → `api_bearer_token` | n8n → claude-agent-service `/execute` auth. Synced into both the `claude-agent` ns (consumer) and the `n8n` ns (caller) via ESO. n8n exposes it to the container as the `CLAUDE_AGENT_API_TOKEN` env var. |
| Claude OAuth (primary) | `secret/claude-agent-service` → `claude_oauth_token` | Long-lived 1-year token from `claude setup-token`. Consumed by the CLI via the `CLAUDE_CODE_OAUTH_TOKEN` env var (set on the container via `envFrom`). Preferred over the short-lived `.credentials.json`: the CLI skips the refresh dance entirely. Rotate yearly; alert fires 30d out. |
| Claude OAuth (spares) | `secret/claude-agent-service-spare-{1,2}` → `claude_oauth_token` | Failover tokens. Minted alongside primary (verified Anthropic does NOT revoke earlier sessions on new mint). Swap one into primary on revocation or compromise. |
| GitHub PAT | `secret/viktor` → `github_pat` | Changelog fetch (5000 req/hr) |
| Slack webhook | `secret/platform` → `alertmanager_slack_api_url` | Upgrade notifications |
| Woodpecker token | `secret/viktor` → `woodpecker_token` | CI pipeline polling |
OAuth token lifecycle
The CLI supports two auth modes. We use the second — long-lived.
| Mode | How minted | TTL | Needs refresh? | When to use |
|---|---|---|---|---|
| `claude login` → `.credentials.json` | Interactive browser OAuth | Access ~6h + refresh token | Yes: CLI auto-refreshes on startup if refresh token valid | Human dev machines |
| `claude setup-token` → opaque `sk-ant-oat01-*` | Interactive browser OAuth | 1 year | No: expires hard | Headless / service accounts (us) |
When both are present, the `CLAUDE_CODE_OAUTH_TOKEN` env var takes precedence over the on-disk `.credentials.json`.
Harvesting headless: setup-token uses Ink (React for terminals) and needs a real PTY with ≥300-column width. At 80-col, Ink wraps and DROPS one character at the wrap boundary (107-char invalid instead of 108-char valid). Python wrapper pattern documented in memory; we harvested 2 spare tokens into Vault on 2026-04-18 using a temporary harvester pod.
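The PTY-width part of that wrapper can be sketched as follows. This is not the wrapper referred to above, only a minimal illustration of running a command under a pseudo-terminal whose window size is set before the child starts. The function names are hypothetical, the interactive browser step of `setup-token` is not handled, and the length check simply encodes the 108-character observation from the paragraph above.

```python
import fcntl
import os
import pty
import struct
import subprocess
import sys
import termios

TOKEN_PREFIX = "sk-ant-oat01-"
EXPECTED_TOKEN_LENGTH = 108  # per the note above; a wrapped 80-col capture yields 107


def run_in_wide_pty(argv, cols=300, rows=50):
    """Run argv attached to a PTY of the given size and return everything it printed."""
    master, slave = pty.openpty()
    # Set the window size on the slave before the child starts, so there is no race.
    fcntl.ioctl(slave, termios.TIOCSWINSZ, struct.pack("HHHH", rows, cols, 0, 0))
    proc = subprocess.Popen(argv, stdin=slave, stdout=slave, stderr=slave, close_fds=True)
    os.close(slave)
    chunks = []
    while True:
        try:
            data = os.read(master, 4096)
        except OSError:  # Linux raises EIO once the child side is closed
            break
        if not data:
            break
        chunks.append(data)
    os.close(master)
    proc.wait()
    return b"".join(chunks).decode(errors="replace")


def columns_seen_by_child(cols=300):
    """Sanity check: ask a child Python process how wide its terminal looks."""
    probe = "import os; print(os.get_terminal_size().columns)"
    return int(run_in_wide_pty([sys.executable, "-c", probe], cols=cols).strip())


def looks_complete(token):
    """Reject captures that lost a character at a wrap boundary."""
    return token.startswith(TOKEN_PREFIX) and len(token) == EXPECTED_TOKEN_LENGTH
```

Running `columns_seen_by_child(300)` before a harvest confirms the child really sees a 300-column terminal, and `looks_complete` catches a truncated token before it is written to Vault.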
Monitoring: CronJob claude-oauth-expiry-monitor (claude-agent ns, every 6h) pushes claude_oauth_token_expiry_timestamp{path="..."} to Pushgateway. Alerts: ClaudeOAuthTokenExpiringSoon (30d, warn), ClaudeOAuthTokenCritical (7d, crit), ClaudeOAuthTokenMonitorStale (48h no push, warn), ClaudeOAuthTokenMonitorNeverRun (metric absent, warn).
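As a rough illustration of what the monitor and alerts compute, the sketch below derives an expiry timestamp from a mint epoch, renders it in the Prometheus text format that Pushgateway accepts, and maps remaining lifetime to the two expiry alerts. The 365-day lifetime, the helper names, and returning only the most severe alert are assumptions; the real thresholds are evaluated by the alert rules, not by Python.

```python
SECONDS_PER_DAY = 86400
TOKEN_LIFETIME_DAYS = 365  # assumption: "1 year" is treated as 365 days


def expiry_timestamp(mint_epoch):
    """Expiry is derived from the mint time, since the token itself is opaque."""
    return mint_epoch + TOKEN_LIFETIME_DAYS * SECONDS_PER_DAY


def pushgateway_body(expiries):
    """Render {vault_path: expiry_ts} as Prometheus text exposition format."""
    lines = ["# TYPE claude_oauth_token_expiry_timestamp gauge"]
    for path, ts in sorted(expiries.items()):
        lines.append('claude_oauth_token_expiry_timestamp{path="%s"} %d' % (path, ts))
    return "\n".join(lines) + "\n"


def expiry_alert(expiry_ts, now):
    """Return the most severe expiry alert that would be firing, or None."""
    remaining_days = (expiry_ts - now) / SECONDS_PER_DAY
    if remaining_days <= 7:
        return "ClaudeOAuthTokenCritical"
    if remaining_days <= 30:
        return "ClaudeOAuthTokenExpiringSoon"
    return None
```

The body returned by `pushgateway_body` is what would be sent to a Pushgateway job endpoint; the two staleness alerts depend on push timing and are not modelled here.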
Rotation: on alert, harvest a new token, vault kv patch secret/claude-agent-service claude_oauth_token=<new>, update the claude_oauth_token_mint_epochs local in stacks/claude-agent-service/main.tf, scripts/tg apply → alert clears on next cron tick.
n8n workflow gotchas
The DIUN Upgrade Agent workflow is imported once into n8n's PG DB — it is not Terraform-managed. The JSON at stacks/n8n/workflows/diun-upgrade.json is a backup; the live state lives in workflow_entity.nodes. Drift between the two is possible.
- HTTP Request node header expressions must use template-literal form: `=Bearer {{ $env.CLAUDE_AGENT_API_TOKEN }}` works; `='Bearer ' + $env.CLAUDE_AGENT_API_TOKEN` does NOT evaluate and sends an empty/bogus header → 401 from claude-agent-service.
- `N8N_BLOCK_ENV_ACCESS_IN_NODE=false` must be set on the n8n deployment for expressions to read `$env.*` at all.
- Troubleshooting 401: the workflow will show `success` status on the webhook node but error on `Run Upgrade Agent`. Inspect in n8n UI → Executions, or query `execution_entity` + `execution_data` directly. Claude-agent-service logs will also show `POST /execute HTTP/1.1 401 Unauthorized`.
- Patching the live workflow (one-off, since it's not in TF):
UPDATE workflow_entity SET nodes = REPLACE(nodes::text, OLD, NEW)::json WHERE name = 'DIUN Upgrade Agent';