# Automated Service Upgrades

## Overview

OSS services are automatically upgraded via a pipeline that detects new container image versions, analyzes changelogs for breaking changes, backs up databases, applies version bumps through Terraform, and verifies health post-upgrade with automatic rollback on failure.

## Architecture

```
DIUN (every 6h)
  │ detects new image tags
  │
  ▼
n8n Webhook (POST /webhook/)
  │ filters: skip databases, custom images, infra, :latest
  │ rate limit: max 5 upgrades per 6h window
  │
  ▼
HTTP POST → claude-agent-service (K8s)
  │
  ▼
claude -p "upgrade agent prompt" (in-cluster)
  │
  ▼
Service Upgrade Agent
  ├── 1. Identify service + .tf files (grep stacks/)
  ├── 2. Resolve GitHub repo (config overrides + auto-detect)
  ├── 3. Fetch changelogs via GitHub API (authenticated, 5000 req/hr)
  ├── 4. Classify risk (SAFE / CAUTION / UNKNOWN)
  ├── 5. Slack notification — starting
  ├── 6. DB backup (if DB-backed service)
  ├── 7. Edit .tf files (version bump + config changes)
  ├── 8. Commit + push (Woodpecker CI applies)
  ├── 9. Wait for CI (poll Woodpecker API)
  ├── 10. Verify (pod ready + HTTP + Uptime Kuma)
  ├── 11a. SUCCESS → Slack report
  └── 11b. FAILURE → git revert + CI re-applies → Slack alert
```

## Components

### DIUN (Docker Image Update Notifier)

- **Stack**: `stacks/diun/`
- **Schedule**: Every 6 hours (`DIUN_WATCH_SCHEDULE=0 */6 * * *`)
- **Role**: Detection only — fires a webhook to n8n when a new image tag is found
- **Skip patterns**: Databases, `viktorbarzin/*`, `registry.viktorbarzin.me/*`, infrastructure images
- **Webhook**: `DIUN_NOTIF_WEBHOOK_ENDPOINT` from Vault `secret/diun` → `n8n_webhook_url`

### n8n Workflow ("DIUN Upgrade Agent")

- **Stack**: `stacks/n8n/`
- **Workflow backup**: `stacks/n8n/workflows/diun-upgrade.json`
- **Webhook path**: UUID-based (`/webhook/`)
- **Filters**:
  - Only `status=update` (skip `new`, `unchanged`)
  - Skip databases, custom images, infra images, `:latest`
- **Rate limiting**: Max 5 upgrades per 6-hour window using `$getWorkflowStaticData('global')`
- **Action**: HTTP POST to `claude-agent-service.claude-agent.svc:8080/execute` with the upgrade agent prompt

### Upgrade Agent

- **Prompt**: `.claude/agents/service-upgrade.md`
- **Config**: `.claude/reference/upgrade-config.json`
- Contains:
  - 50+ Docker image → GitHub repo mappings
  - 22 Helm chart → GitHub repo mappings
  - 27 DB-backed service definitions with backup metadata
  - Skip patterns and breaking change keywords

## Risk Classification

| Risk | Criteria | Verification | Version Jump |
|------|----------|-------------|-------------|
| **SAFE** | Patch/minor bump, no breaking keywords in release notes | 2 minutes | Direct to target |
| **CAUTION** | Major bump, or breaking change keywords found, or in `version_jump_always_step` list | 10 minutes | Step through each version |
| **UNKNOWN** | Changelog unavailable | 2 minutes (SAFE defaults) | Direct to target |

**Breaking change keywords**: `breaking`, `BREAKING`, `migration required`, `schema change`, `database migration`, `manual intervention`, `action required`, `removed`, `deprecated`, `renamed`, `incompatible`

## Database Backup

DB-backed services trigger a pre-upgrade backup automatically:

- **Shared PostgreSQL**: `kubectl create job --from=cronjob/postgresql-backup -n dbaas`
- **Shared MySQL**: `kubectl create job --from=cronjob/mysql-backup -n dbaas`
- **Dedicated databases** (e.g., Immich): Trigger existing backup CronJob in the service's namespace

If the backup fails, the upgrade is **aborted**.

## Rollback

On verification failure:

1. `git revert --no-edit `
2. `git push` → Woodpecker CI re-applies the old version
3. Re-verify rollback succeeded
4. If rollback also fails → CRITICAL Slack alert for manual intervention

## Version Patterns

The agent handles all three version patterns in Terraform:

| Pattern | Example | Agent Action |
|---------|---------|-------------|
| Variable-based | `variable "immich_version" { default = "v2.7.4" }` | Edit the `default` value |
| Hardcoded | `image = "vaultwarden/server:1.35.4"` | Replace tag in image string |
| Helm chart | `version = "2026.2.2"` in `helm_release` | Bump chart version |

## Configuration

### Excluding images (handled by DIUN + n8n)

- Databases: `*postgres*`, `*mysql*`, `*redis*`, `*clickhouse*`, `*etcd*`
- Custom: `viktorbarzin/*`, `registry.viktorbarzin.me/*`, `ancamilea/*`, `mghee/*`
- Infrastructure: `registry.k8s.io/*`, `quay.io/tigera/*`, `nvcr.io/*`, `reg.kyverno.io/*`
- `:latest` tags

### Rate limiting

- Max 5 upgrades per 6-hour DIUN scan cycle
- Counter resets when the window expires
- Configurable in the n8n "Filter and Rate Limit" code node

### Services that always step through versions

- Authentik, Nextcloud, Immich (configured in `upgrade-config.json` → `version_jump_always_step`)

## Monitoring

- **Slack**: All upgrade events reported (start, success, failure, rollback)
- **Git**: Detailed commit messages with changelog summaries, risk level, backup status
- **DIUN Slack**: Independent Slack channel for raw version detection (separate from upgrade agent)

## Bulk Upgrades

To upgrade all outdated services at once, fire webhooks for each service:

```bash
WEBHOOK="https://n8n.viktorbarzin.me/webhook/"
curl -s -X POST "$WEBHOOK" \
  -H "Content-Type: application/json" \
  -d '{"diun_entry_status":"update","diun_entry_image":"","diun_entry_imagetag":"","diun_entry_provider":"kubernetes"}'
```

n8n processes all webhooks in parallel (one `claude -p` per webhook). Before bulk runs, increase the rate limit in the n8n Code node (`MAX_UPGRADES_PER_WINDOW`) and reset the counter:

```sql
-- Reset rate limiter
UPDATE workflow_entity SET "staticData" = '{}'::json WHERE name = 'DIUN Upgrade Agent';
```

### First Bulk Run (2026-04-16)

12 services upgraded in ~30 minutes, fully automated:

| Service | From | To | Notes |
|---------|------|----|-------|
| audiobookshelf | 2.32.1 | 2.33.1 | Security fixes (IDOR) |
| owntracks | 0.9.9 | 1.0.1 | Major version bump |
| open-webui | v0.7.2 | v0.8.12 | |
| immich | v2.7.4 | v2.7.5 | Patch, DB backup taken |
| coturn | 4.6.3-r1 | 4.10.0-r1 | Major version bump |
| shlink | 4.3.4 | 5.0.2 | Major, DB-backed |
| phpipam | v1.7.0 | v1.7.4 | Patch, DB-backed |
| onlyoffice | 8.2.3 | 9.3.1 | Major version bump |
| paperless-ngx | 2.16.4 | 2.20.14 | Agent also bumped memory 1Gi → 2Gi |
| linkwarden | v2.9.1 | v2.14.0 | 23 intermediate releases, 254M DB backup |
| synapse | v1.125.0 | v1.151.0 | Large jump, DB-backed |
| dawarich | 0.37.1 | 1.6.1 | Upgraded → verification failed → auto-rolled back → forward-fixed |

Key behaviors observed:

- **Auto-rollback works**: Dawarich upgrade failed verification, agent reverted, then re-applied with a forward fix
- **Resource awareness**: Paperless-ngx agent detected the new version needed more memory and bumped limits
- **DB backups**: All DB-backed services had pre-upgrade dumps taken automatically
- **Changelog analysis**: Linkwarden commit summarized 23 intermediate releases; vaultwarden (earlier test) identified 3 CVEs
- **Parallel execution**: 11 agents ran concurrently, handled git rebase conflicts automatically

## Secrets

| Secret | Vault Path | Purpose |
|--------|-----------|---------|
| n8n webhook URL | `secret/diun` → `n8n_webhook_url` | DIUN → n8n trigger |
| Agent API bearer token | `secret/claude-agent-service` → `api_bearer_token` | n8n → claude-agent-service `/execute` auth. Synced into both `claude-agent` ns (consumer) and `n8n` ns (caller) via ESO. n8n exposes it to the container as `CLAUDE_AGENT_API_TOKEN` env var. |
| Claude OAuth (primary) | `secret/claude-agent-service` → `claude_oauth_token` | Long-lived 1-year token from `claude setup-token`. Consumed by the CLI via `CLAUDE_CODE_OAUTH_TOKEN` env var (set on the container via `envFrom`). Preferred over the short-lived `.credentials.json` — CLI skips the refresh dance entirely. Rotate yearly; alert fires 30d out. |
| Claude OAuth (spares) | `secret/claude-agent-service-spare-{1,2}` → `claude_oauth_token` | Failover tokens. Minted alongside primary (verified Anthropic does NOT revoke earlier sessions on new mint). Swap into primary if revocation or compromise. |
| GitHub PAT | `secret/viktor` → `github_pat` | Changelog fetch (5000 req/hr) |
| Slack webhook | `secret/platform` → `alertmanager_slack_api_url` | Upgrade notifications |
| Woodpecker token | `secret/viktor` → `woodpecker_token` | CI pipeline polling |

## OAuth token lifecycle

The CLI supports two auth modes. We use the second — long-lived.

| Mode | How minted | TTL | Needs refresh? | When to use |
|------|-----------|-----|----------------|-------------|
| `claude login` → `.credentials.json` | Interactive browser OAuth | Access ~6h + refresh token | Yes — CLI auto-refreshes on startup if refresh token valid | Human dev machines |
| `claude setup-token` → opaque `sk-ant-oat01-*` | Interactive browser OAuth | **1 year** | No — expires hard | **Headless / service accounts (us)** |

When both are present on disk, `CLAUDE_CODE_OAUTH_TOKEN` env var wins.

**Harvesting headless**: `setup-token` uses Ink (React for terminals) and needs a real PTY with **≥300-column width**. At 80-col, Ink wraps and DROPS one character at the wrap boundary (107-char invalid instead of 108-char valid). Python wrapper pattern documented in memory; we harvested 2 spare tokens into Vault on 2026-04-18 using a temporary harvester pod.

**Monitoring**: CronJob `claude-oauth-expiry-monitor` (claude-agent ns, every 6h) pushes `claude_oauth_token_expiry_timestamp{path="..."}` to Pushgateway. Alerts: `ClaudeOAuthTokenExpiringSoon` (30d, warn), `ClaudeOAuthTokenCritical` (7d, crit), `ClaudeOAuthTokenMonitorStale` (48h no push, warn), `ClaudeOAuthTokenMonitorNeverRun` (metric absent, warn).

**Rotation**: on alert, harvest a new token, `vault kv patch secret/claude-agent-service claude_oauth_token=`, update the `claude_oauth_token_mint_epochs` local in `stacks/claude-agent-service/main.tf`, `scripts/tg apply` → alert clears on next cron tick.

## n8n workflow gotchas

The `DIUN Upgrade Agent` workflow is imported once into n8n's PG DB — it is **not** Terraform-managed. The JSON at `stacks/n8n/workflows/diun-upgrade.json` is a backup; the live state lives in `workflow_entity.nodes`. Drift between the two is possible.

- **HTTP Request node header expressions must use template-literal form**: `=Bearer {{ $env.CLAUDE_AGENT_API_TOKEN }}` works; `='Bearer ' + $env.CLAUDE_AGENT_API_TOKEN` does NOT evaluate and sends an empty/bogus header → 401 from claude-agent-service.
- **`N8N_BLOCK_ENV_ACCESS_IN_NODE=false`** must be set on the n8n deployment for expressions to read `$env.*` at all.
- **Troubleshooting 401**: the workflow will show `success` status on the webhook node but error on `Run Upgrade Agent`. Inspect in n8n UI → Executions, or query `execution_entity` + `execution_data` directly. Claude-agent-service logs will also show `POST /execute HTTP/1.1 401 Unauthorized`.
- **Patching the live workflow** (one-off, since it's not in TF): `UPDATE workflow_entity SET nodes = REPLACE(nodes::text, OLD, NEW)::json WHERE name = 'DIUN Upgrade Agent';`
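
Because the backup JSON and the live `workflow_entity.nodes` column can drift, it can help to compare them before and after a one-off patch. The following is a minimal sketch, not part of the pipeline: it assumes both sides have already been parsed into lists of node objects and that each node carries a unique `name` key. The function name `diff_workflow_nodes`, the sample data, and the simplified `parameters` layout are hypothetical and only for illustration; how the live column is exported is left out.

```python
import json


def diff_workflow_nodes(backup_nodes, live_nodes):
    """Compare two lists of workflow nodes, keyed by node name.

    Returns which node names exist on only one side and which exist on
    both sides but differ in content.
    """
    backup = {node["name"]: node for node in backup_nodes}
    live = {node["name"]: node for node in live_nodes}
    shared = backup.keys() & live.keys()
    return {
        "only_in_backup": sorted(backup.keys() - live.keys()),
        "only_in_live": sorted(live.keys() - backup.keys()),
        "changed": sorted(n for n in shared if backup[n] != live[n]),
    }


if __name__ == "__main__":
    # Hypothetical sample data. In practice backup_nodes would come from the
    # "nodes" array of the backup file and live_nodes from an export of the
    # workflow_entity.nodes column.
    backup_nodes = [
        {"name": "Webhook", "parameters": {}},
        {
            "name": "Run Upgrade Agent",
            "parameters": {
                "authorization": "=Bearer {{ $env.CLAUDE_AGENT_API_TOKEN }}"
            },
        },
    ]
    live_nodes = [
        {"name": "Webhook", "parameters": {}},
        {
            "name": "Run Upgrade Agent",
            "parameters": {
                "authorization": "='Bearer ' + $env.CLAUDE_AGENT_API_TOKEN"
            },
        },
    ]
    print(json.dumps(diff_workflow_nodes(backup_nodes, live_nodes), indent=2))
```

Keying by node name rather than list position keeps the comparison stable if nodes were reordered in the n8n editor. A non-empty `changed` list is a hint to either refresh the backup file or re-apply the intended patch to the live workflow.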