feat(upgrade-agent): add automated service upgrade pipeline with n8n + DIUN

Pipeline: DIUN detects new image versions every 6h → webhook to n8n →
n8n filters (skip databases/custom/infra/:latest) and rate-limits
(max 5/6h) → SSH to dev VM → claude -p runs upgrade agent.

Agent workflow: resolve GitHub repo → fetch changelogs → classify risk
(SAFE/CAUTION) → backup DB if needed → bump version in .tf → commit+push
→ wait for CI → verify (pod ready + HTTP + Uptime Kuma) → rollback on
failure.

Changes:
- stacks/n8n: add N8N_PORT=5678 to fix K8s env var conflict
- stacks/n8n/workflows: version-controlled n8n workflow backup
- docs/architecture/automated-upgrades.md: full pipeline documentation
- AGENTS.md: add upgrade agent section
- service-catalog.md: update DIUN description

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Viktor Barzin 2026-04-15 21:38:27 +00:00
parent 27d7c91608
commit c33f597111
5 changed files with 206 additions and 1 deletions


@ -79,7 +79,7 @@
| servarr | Media automation (Sonarr/Radarr/etc) | servarr |
| ntfy | Push notifications | ntfy |
| cyberchef | Data transformation | cyberchef |
| diun | Docker image update notifier | diun |
| diun | Docker image update notifier — detects new versions, fires webhook to n8n upgrade agent | diun |
| meshcentral | Remote management | meshcentral |
| homepage | Dashboard/startpage | homepage |
| matrix | Matrix chat server | matrix |


@ -104,5 +104,14 @@ Terragrunt-based homelab managing a Kubernetes cluster (5 nodes, v1.34.2) on Pro
- **Add a secret**: `sops set secrets.sops.json '["key"]' '"value"'` then commit.
- **NFS exports**: Create dir on Proxmox host (`ssh root@192.168.1.127 "mkdir -p /srv/nfs/<service>"`), add to `/etc/exports`, run `exportfs -ra`.
## Automated Service Upgrades
- **Pipeline**: DIUN (detect) → n8n webhook (filter + rate limit) → SSH → `claude -p` (upgrade agent)
- **Agent**: `.claude/agents/service-upgrade.md` — analyzes changelogs, backs up DBs, bumps versions, verifies health, rolls back on failure
- **Config**: `.claude/reference/upgrade-config.json` — GitHub repo mappings, DB-backed services, skip patterns
- **Rate limit**: Max 5 upgrades per 6h DIUN scan cycle (configured in n8n workflow)
- **Skipped**: databases, `:latest`, custom images (`viktorbarzin/*`), infrastructure images
- **Risk**: SAFE (2min verify) vs CAUTION (10min, DB backup, step through versions) based on changelog analysis
- **Docs**: `docs/architecture/automated-upgrades.md`
## Detailed Reference
See `.claude/reference/patterns.md` for: NFS volume code examples, iSCSI details, Kyverno governance tables, anti-AI scraping layers, Terragrunt architecture, node rebuild procedure, archived troubleshooting runbooks index.


@ -0,0 +1,134 @@
# Automated Service Upgrades
## Overview
OSS services are automatically upgraded via a pipeline that detects new container image versions, analyzes changelogs for breaking changes, backs up databases, applies version bumps through Terraform, and verifies health post-upgrade with automatic rollback on failure.
## Architecture
```
DIUN (every 6h)
│ detects new image tags
n8n Webhook (POST /webhook/<uuid>)
│ filters: skip databases, custom images, infra, :latest
│ rate limit: max 5 upgrades per 6h window
SSH → Dev VM (10.0.10.10)
claude -p "upgrade agent prompt"
Service Upgrade Agent
├── 1. Identify service + .tf files (grep stacks/)
├── 2. Resolve GitHub repo (config overrides + auto-detect)
├── 3. Fetch changelogs via GitHub API (authenticated, 5000 req/hr)
├── 4. Classify risk (SAFE / CAUTION / UNKNOWN)
├── 5. Slack notification — starting
├── 6. DB backup (if DB-backed service)
├── 7. Edit .tf files (version bump + config changes)
├── 8. Commit + push (Woodpecker CI applies)
├── 9. Wait for CI (poll Woodpecker API)
├── 10. Verify (pod ready + HTTP + Uptime Kuma)
├── 11a. SUCCESS → Slack report
└── 11b. FAILURE → git revert + CI re-applies → Slack alert
```
## Components
### DIUN (Docker Image Update Notifier)
- **Stack**: `stacks/diun/`
- **Schedule**: Every 6 hours (`DIUN_WATCH_SCHEDULE=0 */6 * * *`)
- **Role**: Detection only — fires a webhook to n8n when a new image tag is found
- **Skip patterns**: Databases, `viktorbarzin/*`, `registry.viktorbarzin.me/*`, infrastructure images
- **Webhook**: `DIUN_NOTIF_WEBHOOK_ENDPOINT` from Vault `secret/diun` → `n8n_webhook_url`
### n8n Workflow ("DIUN Upgrade Agent")
- **Stack**: `stacks/n8n/`
- **Workflow backup**: `stacks/n8n/workflows/diun-upgrade.json`
- **Webhook path**: UUID-based (`/webhook/<uuid>`)
- **Filters**:
  - Only `status=update` (skip `new`, `unchanged`)
  - Skip databases, custom images, infra images, `:latest`
- **Rate limiting**: Max 5 upgrades per 6-hour window using `$getWorkflowStaticData('global')`
- **Action**: SSH to the dev VM and run `claude -p` with the upgrade agent prompt
### Upgrade Agent
- **Prompt**: `.claude/agents/service-upgrade.md`
- **Config**: `.claude/reference/upgrade-config.json`
- Contains:
  - 50+ Docker image → GitHub repo mappings
  - 22 Helm chart → GitHub repo mappings
  - 27 DB-backed service definitions with backup metadata
  - Skip patterns and breaking change keywords
## Risk Classification
| Risk | Criteria | Verification | Version Jump |
|------|----------|-------------|-------------|
| **SAFE** | Patch/minor bump, no breaking keywords in release notes | 2 minutes | Direct to target |
| **CAUTION** | Major bump, or breaking change keywords found, or in `version_jump_always_step` list | 10 minutes | Step through each version |
| **UNKNOWN** | Changelog unavailable | 2 minutes (SAFE defaults) | Direct to target |
**Breaking change keywords**: `breaking`, `BREAKING`, `migration required`, `schema change`, `database migration`, `manual intervention`, `action required`, `removed`, `deprecated`, `renamed`, `incompatible`
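
A minimal sketch of how the table above could be applied in code. The agent itself follows the prompt in `.claude/agents/service-upgrade.md`, so this is only an illustration. The names `classifyRisk` and `majorOf` are hypothetical, keyword matching is assumed to be a case-insensitive substring match, and treating a major bump as CAUTION even when the changelog is unavailable is an assumption.

```javascript
// Keywords copied from the list above. Matching is case-insensitive here
// (an assumption), so `breaking` and `BREAKING` collapse into one entry.
const BREAKING_KEYWORDS = [
  'breaking', 'migration required', 'schema change', 'database migration',
  'manual intervention', 'action required', 'removed', 'deprecated',
  'renamed', 'incompatible',
];

// Hypothetical helper: leading major version of tags like "v2.7.4" or "1.35.4".
function majorOf(tag) {
  const match = /^v?(\d+)/.exec(tag);
  return match ? Number(match[1]) : null;
}

// releaseNotes is a string, or null when no changelog could be fetched.
// alwaysStep is true for services in the `version_jump_always_step` list.
function classifyRisk({ currentTag, targetTag, releaseNotes, alwaysStep = false }) {
  if (alwaysStep) return 'CAUTION';
  const from = majorOf(currentTag);
  const to = majorOf(targetTag);
  if (from !== null && to !== null && to > from) return 'CAUTION';
  if (releaseNotes == null) return 'UNKNOWN';
  const text = releaseNotes.toLowerCase();
  if (BREAKING_KEYWORDS.some((keyword) => text.includes(keyword))) return 'CAUTION';
  return 'SAFE';
}
```

Because words such as `removed` and `deprecated` are common in release notes, a substring match like this will tend to err towards CAUTION.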
## Database Backup
DB-backed services trigger a pre-upgrade backup automatically:
- **Shared PostgreSQL**: `kubectl create job <job-name> --from=cronjob/postgresql-backup -n dbaas`
- **Shared MySQL**: `kubectl create job <job-name> --from=cronjob/mysql-backup -n dbaas`
- **Dedicated databases** (e.g., Immich): Trigger existing backup CronJob in the service's namespace
If the backup fails, the upgrade is **aborted**.
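
`kubectl create job` needs a name for the new Job, and that name must be unique within the namespace. The sketch below builds the argument list for the two shared-database cases. The function name `backupJobArgs` and the `pre-upgrade-<cronjob>-<timestamp>` naming scheme are assumptions for illustration, not something the agent prompt is known to use.

```javascript
// CronJob names taken from the list above; both live in the `dbaas` namespace.
const SHARED_BACKUP_CRONJOBS = {
  postgresql: 'postgresql-backup',
  mysql: 'mysql-backup',
};

// Returns the argument list for `kubectl`, creating a one-off Job from the CronJob.
// The timestamp suffix (UTC, yyyymmddhhmmss) keeps repeated runs from colliding.
function backupJobArgs(dbType, now = new Date()) {
  const cronJob = SHARED_BACKUP_CRONJOBS[dbType];
  if (!cronJob) throw new Error(`no shared backup CronJob for "${dbType}"`);
  const stamp = now.toISOString().replace(/[-:.TZ]/g, '').slice(0, 14);
  return [
    'create', 'job', `pre-upgrade-${cronJob}-${stamp}`,
    `--from=cronjob/${cronJob}`, '-n', 'dbaas',
  ];
}
```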
## Rollback
On verification failure:
1. `git revert --no-edit <upgrade-commit-sha>`
2. `git push` → Woodpecker CI re-applies the old version
3. Re-verify rollback succeeded
4. If rollback also fails → CRITICAL Slack alert for manual intervention
## Version Patterns
The agent handles all three version patterns in Terraform:
| Pattern | Example | Agent Action |
|---------|---------|-------------|
| Variable-based | `variable "immich_version" { default = "v2.7.4" }` | Edit the `default` value |
| Hardcoded | `image = "vaultwarden/server:1.35.4"` | Replace tag in image string |
| Helm chart | `version = "2026.2.2"` in `helm_release` | Bump chart version |
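
The three edits in the table can be sketched as text substitutions. The agent edits files directly under its prompt, so the helper names below are hypothetical. The regular expressions also assume the value appears before any nested block closes inside the surrounding `variable` or `resource` block. If nothing matches, the input is returned unchanged.

```javascript
// Escape a literal string for use inside a RegExp.
function escapeRegExp(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Replace the quoted value that follows `prefixPattern`, keeping everything else.
function replaceQuotedValue(tf, prefixPattern, newValue) {
  const pattern = new RegExp(`(${prefixPattern}")[^"]*(")`);
  return tf.replace(pattern, (_, head, tail) => head + newValue + tail);
}

// Variable-based: edit `default` inside `variable "<name>" { ... }`.
function bumpVariableDefault(tf, variableName, newVersion) {
  const prefix = `variable\\s+"${escapeRegExp(variableName)}"\\s*\\{[^}]*?\\bdefault\\s*=\\s*`;
  return replaceQuotedValue(tf, prefix, newVersion);
}

// Hardcoded: replace the tag in `image = "<repo>:<tag>"`.
function bumpImageTag(tf, imageRepo, newTag) {
  const pattern = new RegExp(`(image\\s*=\\s*"${escapeRegExp(imageRepo)}:)[^"]*(")`);
  return tf.replace(pattern, (_, head, tail) => head + newTag + tail);
}

// Helm chart: edit `version` inside `resource "helm_release" "<name>" { ... }`.
function bumpHelmChartVersion(tf, releaseName, newVersion) {
  const prefix = `resource\\s+"helm_release"\\s+"${escapeRegExp(releaseName)}"\\s*\\{[^}]*?\\bversion\\s*=\\s*`;
  return replaceQuotedValue(tf, prefix, newVersion);
}
```

For example, `bumpImageTag('image = "vaultwarden/server:1.35.4"', 'vaultwarden/server', '1.35.5')` rewrites only the tag and leaves the repository part untouched.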
## Configuration
### Excluding images (handled by DIUN + n8n)
- Databases: `*postgres*`, `*mysql*`, `*redis*`, `*clickhouse*`, `*etcd*`
- Custom: `viktorbarzin/*`, `registry.viktorbarzin.me/*`, `ancamilea/*`, `mghee/*`
- Infrastructure: `registry.k8s.io/*`, `quay.io/tigera/*`, `quay.io/metallb/*`, `nvcr.io/*`, `reg.kyverno.io/*`
- `:latest` tags
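
The patterns above are written as globs, but the n8n "Filter and Rate Limit" Code node evaluates them as substring checks (databases) and prefix checks (custom and infrastructure images). The sketch below restates that logic as a standalone function. The name `skipReason` and the reason strings are hypothetical, and the `quay.io/metallb/` prefix is taken from the workflow's Code node.

```javascript
const DB_SUBSTRINGS = ['postgres', 'mysql', 'redis', 'clickhouse', 'etcd'];
const CUSTOM_PREFIXES = ['viktorbarzin/', 'registry.viktorbarzin.me/', 'ancamilea/', 'mghee/'];
const INFRA_PREFIXES = [
  'registry.k8s.io/', 'quay.io/tigera/', 'quay.io/metallb/', 'nvcr.io/', 'reg.kyverno.io/',
];

// Returns a short skip reason, or null when the image is eligible for an upgrade.
function skipReason(image, tag) {
  // Database check: case-insensitive substring match anywhere in the image name.
  if (DB_SUBSTRINGS.some((p) => image.toLowerCase().includes(p))) return 'database';
  // Custom and infrastructure checks: case-sensitive prefix match.
  if (CUSTOM_PREFIXES.some((p) => image.startsWith(p))) return 'custom';
  if (INFRA_PREFIXES.some((p) => image.startsWith(p))) return 'infrastructure';
  // An empty tag is skipped together with `latest`, as in the workflow.
  if (tag === 'latest' || tag === '') return 'latest-or-empty-tag';
  return null;
}
```

Note that a prefix check only matches when the image name starts with the listed string, so an image reported with a leading registry host (for example `docker.io/viktorbarzin/...`) would not be caught by the `viktorbarzin/` prefix.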
### Rate limiting
- Max 5 upgrades per 6-hour DIUN scan cycle
- Counter resets on the first webhook received after the 6-hour window has elapsed
- Configurable in the n8n "Filter and Rate Limit" code node
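
The counter is a fixed window that starts with the first webhook after the previous window expired. A minimal standalone sketch of that logic follows. Here `state` stands in for the object returned by `$getWorkflowStaticData('global')`, and the function name `admit` is hypothetical. In the workflow itself the count is incremented only after the image has also passed the exclusion filters.

```javascript
const MAX_UPGRADES_PER_WINDOW = 5;
const WINDOW_MS = 6 * 60 * 60 * 1000; // 6 hours

// Mutates `state` in place, like n8n static data.
// Returns true when the upgrade may proceed and counts it against the window.
function admit(state, now) {
  // Start a new window if none exists yet or the current one has expired.
  if (!state.windowStart || now - state.windowStart > WINDOW_MS) {
    state.windowStart = now;
    state.count = 0;
  }
  if (state.count >= MAX_UPGRADES_PER_WINDOW) return false;
  state.count += 1;
  return true;
}
```

Changing the limit or the window length means editing the two constants at the top of the Code node.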
### Services that always step through versions
- Authentik, Nextcloud, Immich (configured in `upgrade-config.json` → `version_jump_always_step`)
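
One possible reading of "step through each version" is to apply every release between the current and the target tag in order. The sketch below assumes that reading and assumes an ascending list of release tags is available (for example from the GitHub releases the agent already fetches). The function name `planSteps` and the fallback to a direct jump when a tag is not found are assumptions.

```javascript
// releases: tags in ascending order, e.g. ['1.0', '1.1', '1.2', '2.0'].
// Returns the ordered list of versions to apply.
function planSteps(releases, current, target, stepThrough) {
  if (!stepThrough) return [target];
  const from = releases.indexOf(current);
  const to = releases.indexOf(target);
  // Fall back to a direct jump if either tag is unknown or the order is unexpected.
  if (from === -1 || to === -1 || to <= from) return [target];
  return releases.slice(from + 1, to + 1);
}
```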
## Monitoring
- **Slack**: All upgrade events reported (start, success, failure, rollback)
- **Git**: Detailed commit messages with changelog summaries, risk level, backup status
- **DIUN Slack**: Independent Slack channel for raw version detection (separate from upgrade agent)
## Secrets
| Secret | Vault Path | Purpose |
|--------|-----------|---------|
| n8n webhook URL | `secret/diun` → `n8n_webhook_url` | DIUN → n8n trigger |
| GitHub PAT | `secret/viktor` → `github_pat` | Changelog fetch (5000 req/hr) |
| Slack webhook | `secret/platform` → `alertmanager_slack_api_url` | Upgrade notifications |
| Woodpecker token | `secret/viktor` → `woodpecker_token` | CI pipeline polling |


@ -150,6 +150,10 @@ resource "kubernetes_deployment" "n8n" {
container {
name = "n8n"
image = "docker.n8n.io/n8nio/n8n:1.80.0"
env {
name = "N8N_PORT"
value = "5678"
}
env {
name = "DB_TYPE"
value = "postgresdb"


@ -0,0 +1,58 @@
{
"name": "DIUN Upgrade Agent",
"active": true,
"nodes": [
{
"parameters": {"httpMethod": "POST", "path": "30805ab6-7281-4d42-8aa1-fbfe5a9694fa", "options": {}},
"id": "webhook-trigger",
"name": "DIUN Webhook",
"type": "n8n-nodes-base.webhook",
"typeVersion": 2,
"position": [250, 300],
"webhookId": "30805ab6-7281-4d42-8aa1-fbfe5a9694fa"
},
{
"parameters": {
"conditions": {
"options": {"caseSensitive": true, "leftValue": "", "typeValidation": "strict"},
"conditions": [{"id": "cond-status", "leftValue": "={{ $json.body.diun_entry_status }}", "rightValue": "update", "operator": {"type": "string", "operation": "equals"}}],
"combinator": "and"
},
"options": {}
},
"id": "filter-status",
"name": "Filter Updates Only",
"type": "n8n-nodes-base.filter",
"typeVersion": 2,
"position": [470, 300]
},
{
"parameters": {
"jsCode": "const MAX_UPGRADES_PER_WINDOW = 5;\nconst WINDOW_HOURS = 6;\n\nconst staticData = $getWorkflowStaticData('global');\nconst now = Date.now();\nconst windowMs = WINDOW_HOURS * 60 * 60 * 1000;\n\nif (!staticData.windowStart || (now - staticData.windowStart) > windowMs) {\n staticData.windowStart = now;\n staticData.count = 0;\n}\n\nif (staticData.count >= MAX_UPGRADES_PER_WINDOW) {\n console.log('Rate limit reached: ' + staticData.count + '/' + MAX_UPGRADES_PER_WINDOW);\n return [];\n}\n\nconst image = $input.first().json.body.diun_entry_image || '';\nconst tag = $input.first().json.body.diun_entry_imagetag || '';\n\nconst dbPatterns = ['postgres', 'mysql', 'redis', 'clickhouse', 'etcd'];\nif (dbPatterns.some(p => image.toLowerCase().includes(p))) return [];\n\nconst skipPrefixes = ['viktorbarzin/', 'registry.viktorbarzin.me/', 'ancamilea/', 'mghee/'];\nif (skipPrefixes.some(p => image.startsWith(p))) return [];\n\nconst infraPrefixes = ['registry.k8s.io/', 'quay.io/tigera/', 'quay.io/metallb/', 'nvcr.io/', 'reg.kyverno.io/'];\nif (infraPrefixes.some(p => image.startsWith(p))) return [];\n\nif (tag === 'latest' || tag === '') return [];\n\nstaticData.count += 1;\nconsole.log('Upgrade ' + staticData.count + '/' + MAX_UPGRADES_PER_WINDOW + ': ' + image + ':' + tag);\n\nreturn [$input.first()];"
},
"id": "filter-images",
"name": "Filter and Rate Limit",
"type": "n8n-nodes-base.code",
"typeVersion": 2,
"position": [690, 300]
},
{
"parameters": {
"command": "='claude -p \"You are the service-upgrade agent. Read /home/wizard/code/infra/.claude/agents/service-upgrade.md for full instructions.\\n\\nUpgrade task:\\n- Image: ' + $json.body.diun_entry_image + '\\n- New tag: ' + $json.body.diun_entry_imagetag + '\\n- Hub link: ' + ($json.body.diun_entry_hublink || 'none') + '\\n\\nExecute the upgrade workflow now.\"'",
"cwd": "/home/wizard/code/infra"
},
"id": "ssh-execute",
"name": "Run Upgrade Agent",
"type": "n8n-nodes-base.ssh",
"typeVersion": 1,
"position": [910, 300],
"credentials": {"sshPassword": {"id": "REPLACE_WITH_SSH_CRED_ID", "name": "Dev VM SSH"}}
}
],
"connections": {
"DIUN Webhook": {"main": [[{"node": "Filter Updates Only", "type": "main", "index": 0}]]},
"Filter Updates Only": {"main": [[{"node": "Filter and Rate Limit", "type": "main", "index": 0}]]},
"Filter and Rate Limit": {"main": [[{"node": "Run Upgrade Agent", "type": "main", "index": 0}]]}
},
"settings": {"executionOrder": "v1"}
}