diff --git a/.claude/reference/service-catalog.md b/.claude/reference/service-catalog.md
index 3edf001e..3fa30de6 100644
--- a/.claude/reference/service-catalog.md
+++ b/.claude/reference/service-catalog.md
@@ -79,7 +79,7 @@
 | servarr | Media automation (Sonarr/Radarr/etc) | servarr |
 | ntfy | Push notifications | ntfy |
 | cyberchef | Data transformation | cyberchef |
-| diun | Docker image update notifier | diun |
+| diun | Docker image update notifier — detects new versions, fires webhook to n8n upgrade agent | diun |
 | meshcentral | Remote management | meshcentral |
 | homepage | Dashboard/startpage | homepage |
 | matrix | Matrix chat server | matrix |
diff --git a/AGENTS.md b/AGENTS.md
index 92325edd..1a8c79d2 100644
--- a/AGENTS.md
+++ b/AGENTS.md
@@ -104,5 +104,14 @@ Terragrunt-based homelab managing a Kubernetes cluster (5 nodes, v1.34.2) on Pro
 - **Add a secret**: `sops set secrets.sops.json '["key"]' '"value"'` then commit.
 - **NFS exports**: Create dir on Proxmox host (`ssh root@192.168.1.127 "mkdir -p /srv/nfs/"`), add to `/etc/exports`, run `exportfs -ra`.
 
+## Automated Service Upgrades
+- **Pipeline**: DIUN (detect) → n8n webhook (filter + rate limit) → SSH → `claude -p` (upgrade agent)
+- **Agent**: `.claude/agents/service-upgrade.md` — analyzes changelogs, backs up DBs, bumps versions, verifies health, rolls back on failure
+- **Config**: `.claude/reference/upgrade-config.json` — GitHub repo mappings, DB-backed services, skip patterns
+- **Rate limit**: Max 5 upgrades per 6h DIUN scan cycle (configured in n8n workflow)
+- **Skipped**: databases, `:latest`, custom images (`viktorbarzin/*`), infrastructure images
+- **Risk**: SAFE (2min verify) vs CAUTION (10min, DB backup, step through versions) based on changelog analysis
+- **Docs**: `docs/architecture/automated-upgrades.md`
+
 ## Detailed Reference
 See `.claude/reference/patterns.md` for: NFS volume code examples, iSCSI details, Kyverno governance tables, anti-AI scraping layers, Terragrunt architecture, node rebuild procedure, archived troubleshooting runbooks index.
diff --git a/docs/architecture/automated-upgrades.md b/docs/architecture/automated-upgrades.md
new file mode 100644
index 00000000..8df2fbda
--- /dev/null
+++ b/docs/architecture/automated-upgrades.md
@@ -0,0 +1,134 @@
+# Automated Service Upgrades
+
+## Overview
+
+OSS services are automatically upgraded via a pipeline that detects new container image versions, analyzes changelogs for breaking changes, backs up databases, applies version bumps through Terraform, and verifies health post-upgrade with automatic rollback on failure.
+
+## Architecture
+
+```
+DIUN (every 6h)
+  │  detects new image tags
+  │
+  ▼
+n8n Webhook (POST /webhook/)
+  │  filters: skip databases, custom images, infra, :latest
+  │  rate limit: max 5 upgrades per 6h window
+  │
+  ▼
+SSH → Dev VM (10.0.10.10)
+  │
+  ▼
+claude -p "upgrade agent prompt"
+  │
+  ▼
+Service Upgrade Agent
+  ├── 1. Identify service + .tf files (grep stacks/)
+  ├── 2. Resolve GitHub repo (config overrides + auto-detect)
+  ├── 3. Fetch changelogs via GitHub API (authenticated, 5000 req/hr)
+  ├── 4. Classify risk (SAFE / CAUTION / UNKNOWN)
+  ├── 5. Slack notification — starting
+  ├── 6. DB backup (if DB-backed service)
+  ├── 7. Edit .tf files (version bump + config changes)
+  ├── 8. Commit + push (Woodpecker CI applies)
+  ├── 9. Wait for CI (poll Woodpecker API)
+  ├── 10. Verify (pod ready + HTTP + Uptime Kuma)
+  ├── 11a. SUCCESS → Slack report
+  └── 11b. FAILURE → git revert + CI re-applies → Slack alert
+```
+
+## Components
+
+### DIUN (Docker Image Update Notifier)
+- **Stack**: `stacks/diun/`
+- **Schedule**: Every 6 hours (`DIUN_WATCH_SCHEDULE=0 */6 * * *`)
+- **Role**: Detection only — fires a webhook to n8n when a new image tag is found
+- **Skip patterns**: Databases, `viktorbarzin/*`, `registry.viktorbarzin.me/*`, infrastructure images
+- **Webhook**: `DIUN_NOTIF_WEBHOOK_ENDPOINT` from Vault `secret/diun` → `n8n_webhook_url`
+
+### n8n Workflow ("DIUN Upgrade Agent")
+- **Stack**: `stacks/n8n/`
+- **Workflow backup**: `stacks/n8n/workflows/diun-upgrade.json`
+- **Webhook path**: UUID-based (`/webhook/`)
+- **Filters**:
+  - Only `status=update` (skip `new`, `unchanged`)
+  - Skip databases, custom images, infra images, `:latest`
+- **Rate limiting**: Max 5 upgrades per 6-hour window using `$getWorkflowStaticData('global')`
+- **Action**: SSH to dev VM, runs `claude -p` with the upgrade agent prompt
+
+### Upgrade Agent
+- **Prompt**: `.claude/agents/service-upgrade.md`
+- **Config**: `.claude/reference/upgrade-config.json`
+- Contains:
+  - 50+ Docker image → GitHub repo mappings
+  - 22 Helm chart → GitHub repo mappings
+  - 27 DB-backed service definitions with backup metadata
+  - Skip patterns and breaking change keywords
+
+## Risk Classification
+
+| Risk | Criteria | Verification | Version Jump |
+|------|----------|-------------|-------------|
+| **SAFE** | Patch/minor bump, no breaking keywords in release notes | 2 minutes | Direct to target |
+| **CAUTION** | Major bump, or breaking change keywords found, or in `version_jump_always_step` list | 10 minutes | Step through each version |
+| **UNKNOWN** | Changelog unavailable | 2 minutes (SAFE defaults) | Direct to target |
+
+**Breaking change keywords**: `breaking`, `BREAKING`, `migration required`, `schema change`, `database migration`, `manual intervention`, `action required`, `removed`, `deprecated`, `renamed`, `incompatible`
+
+## Database Backup
+
+DB-backed services trigger a pre-upgrade backup automatically:
+- **Shared PostgreSQL**: `kubectl create job --from=cronjob/postgresql-backup -n dbaas`
+- **Shared MySQL**: `kubectl create job --from=cronjob/mysql-backup -n dbaas`
+- **Dedicated databases** (e.g., Immich): Trigger existing backup CronJob in the service's namespace
+
+If the backup fails, the upgrade is **aborted**.
+
+## Rollback
+
+On verification failure:
+1. `git revert --no-edit `
+2. `git push` → Woodpecker CI re-applies the old version
+3. Re-verify rollback succeeded
+4. If rollback also fails → CRITICAL Slack alert for manual intervention
+
+## Version Patterns
+
+The agent handles all three version patterns in Terraform:
+
+| Pattern | Example | Agent Action |
+|---------|---------|-------------|
+| Variable-based | `variable "immich_version" { default = "v2.7.4" }` | Edit the `default` value |
+| Hardcoded | `image = "vaultwarden/server:1.35.4"` | Replace tag in image string |
+| Helm chart | `version = "2026.2.2"` in `helm_release` | Bump chart version |
+
+## Configuration
+
+### Excluding images (handled by DIUN + n8n)
+- Databases: `*postgres*`, `*mysql*`, `*redis*`, `*clickhouse*`, `*etcd*`
+- Custom: `viktorbarzin/*`, `registry.viktorbarzin.me/*`, `ancamilea/*`, `mghee/*`
+- Infrastructure: `registry.k8s.io/*`, `quay.io/tigera/*`, `quay.io/metallb/*`, `nvcr.io/*`, `reg.kyverno.io/*`
+- `:latest` tags
+
+### Rate limiting
+- Max 5 upgrades per 6-hour DIUN scan cycle
+- Counter resets when the window expires
+- Configurable in the n8n "Filter and Rate Limit" code node
+
+### Services that always step through versions
+- Authentik, Nextcloud, Immich (configured in `upgrade-config.json` → `version_jump_always_step`)
+
+## Monitoring
+
+- **Slack**: All upgrade events reported (start, success, failure, rollback)
+- **Git**: Detailed commit messages with changelog summaries, risk level, backup status
+- **DIUN Slack**: Independent Slack channel for raw version detection (separate from upgrade agent)
+
+## Secrets
+
+| Secret | Vault Path | Purpose |
+|--------|-----------|---------|
+| n8n webhook URL | `secret/diun` → `n8n_webhook_url` | DIUN → n8n trigger |
+| GitHub PAT | `secret/viktor` → `github_pat` | Changelog fetch (5000 req/hr) |
+| Slack webhook | `secret/platform` → `alertmanager_slack_api_url` | Upgrade notifications |
+| Woodpecker token | `secret/viktor` → `woodpecker_token` | CI pipeline polling |
diff --git a/stacks/n8n/main.tf b/stacks/n8n/main.tf
index daff4662..8c7d5cf9 100644
--- a/stacks/n8n/main.tf
+++ b/stacks/n8n/main.tf
@@ -150,6 +150,10 @@ resource "kubernetes_deployment" "n8n" {
         container {
           name  = "n8n"
           image = "docker.n8n.io/n8nio/n8n:1.80.0"
+          env {
+            name  = "N8N_PORT"
+            value = "5678"
+          }
           env {
             name  = "DB_TYPE"
             value = "postgresdb"
diff --git a/stacks/n8n/workflows/diun-upgrade.json b/stacks/n8n/workflows/diun-upgrade.json
new file mode 100644
index 00000000..9246e339
--- /dev/null
+++ b/stacks/n8n/workflows/diun-upgrade.json
@@ -0,0 +1,58 @@
+{
+  "name": "DIUN Upgrade Agent",
+  "active": true,
+  "nodes": [
+    {
+      "parameters": {"httpMethod": "POST", "path": "30805ab6-7281-4d42-8aa1-fbfe5a9694fa", "options": {}},
+      "id": "webhook-trigger",
+      "name": "DIUN Webhook",
+      "type": "n8n-nodes-base.webhook",
+      "typeVersion": 2,
+      "position": [250, 300],
+      "webhookId": "30805ab6-7281-4d42-8aa1-fbfe5a9694fa"
+    },
+    {
+      "parameters": {
+        "conditions": {
+          "options": {"caseSensitive": true, "leftValue": "", "typeValidation": "strict"},
+          "conditions": [{"id": "cond-status", "leftValue": "={{ $json.body.diun_entry_status }}", "rightValue": "update", "operator": {"type": "string", "operation": "equals"}}],
+          "combinator": "and"
+        },
+        "options": {}
+      },
+      "id": "filter-status",
+      "name": "Filter Updates Only",
+      "type": "n8n-nodes-base.filter",
+      "typeVersion": 2,
+      "position": [470, 300]
+    },
+    {
+      "parameters": {
+        "jsCode": "const MAX_UPGRADES_PER_WINDOW = 5;\nconst WINDOW_HOURS = 6;\n\nconst staticData = $getWorkflowStaticData('global');\nconst now = Date.now();\nconst windowMs = WINDOW_HOURS * 60 * 60 * 1000;\n\nif (!staticData.windowStart || (now - staticData.windowStart) > windowMs) {\n staticData.windowStart = now;\n staticData.count = 0;\n}\n\nif (staticData.count >= MAX_UPGRADES_PER_WINDOW) {\n console.log('Rate limit reached: ' + staticData.count + '/' + MAX_UPGRADES_PER_WINDOW);\n return [];\n}\n\nconst image = $input.first().json.body.diun_entry_image || '';\nconst tag = $input.first().json.body.diun_entry_imagetag || '';\n\nconst dbPatterns = ['postgres', 'mysql', 'redis', 'clickhouse', 'etcd'];\nif (dbPatterns.some(p => image.toLowerCase().includes(p))) return [];\n\nconst skipPrefixes = ['viktorbarzin/', 'registry.viktorbarzin.me/', 'ancamilea/', 'mghee/'];\nif (skipPrefixes.some(p => image.startsWith(p))) return [];\n\nconst infraPrefixes = ['registry.k8s.io/', 'quay.io/tigera/', 'quay.io/metallb/', 'nvcr.io/', 'reg.kyverno.io/'];\nif (infraPrefixes.some(p => image.startsWith(p))) return [];\n\nif (tag === 'latest' || tag === '') return [];\n\nstaticData.count += 1;\nconsole.log('Upgrade ' + staticData.count + '/' + MAX_UPGRADES_PER_WINDOW + ': ' + image + ':' + tag);\n\nreturn [$input.first()];"
+      },
+      "id": "filter-images",
+      "name": "Filter and Rate Limit",
+      "type": "n8n-nodes-base.code",
+      "typeVersion": 2,
+      "position": [690, 300]
+    },
+    {
+      "parameters": {
+        "command": "={{ 'claude -p \"You are the service-upgrade agent. Read /home/wizard/code/infra/.claude/agents/service-upgrade.md for full instructions.\\n\\nUpgrade task:\\n- Image: ' + $json.body.diun_entry_image + '\\n- New tag: ' + $json.body.diun_entry_imagetag + '\\n- Hub link: ' + ($json.body.diun_entry_hublink || 'none') + '\\n\\nExecute the upgrade workflow now.\"' }}",
+        "cwd": "/home/wizard/code/infra"
+      },
+      "id": "ssh-execute",
+      "name": "Run Upgrade Agent",
+      "type": "n8n-nodes-base.ssh",
+      "typeVersion": 1,
+      "position": [910, 300],
+      "credentials": {"sshPassword": {"id": "REPLACE_WITH_SSH_CRED_ID", "name": "Dev VM SSH"}}
+    }
+  ],
+  "connections": {
+    "DIUN Webhook": {"main": [[{"node": "Filter Updates Only", "type": "main", "index": 0}]]},
+    "Filter Updates Only": {"main": [[{"node": "Filter and Rate Limit", "type": "main", "index": 0}]]},
+    "Filter and Rate Limit": {"main": [[{"node": "Run Upgrade Agent", "type": "main", "index": 0}]]}
+  },
+  "settings": {"executionOrder": "v1"}
+}
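
The "Filter and Rate Limit" code node in `diun-upgrade.json` is stored as a single escaped string, which makes it hard to review or test. The sketch below restates the same checks as a plain function so the logic can be exercised outside n8n. It is an illustration only, not part of the workflow: `shouldUpgrade`, the `state` object (standing in for `$getWorkflowStaticData('global')`), and the injected `now` timestamp are hypothetical names introduced here, and the two prefix lists from the node are merged into one for brevity.

```javascript
// Hypothetical standalone restatement of the n8n "Filter and Rate Limit" node.
// `state` stands in for $getWorkflowStaticData('global'); `now` is a
// millisecond timestamp passed in so the window logic can be tested.
const MAX_UPGRADES_PER_WINDOW = 5;
const WINDOW_MS = 6 * 60 * 60 * 1000;

const DB_PATTERNS = ['postgres', 'mysql', 'redis', 'clickhouse', 'etcd'];
const SKIP_PREFIXES = [
  // custom images
  'viktorbarzin/', 'registry.viktorbarzin.me/', 'ancamilea/', 'mghee/',
  // infrastructure images
  'registry.k8s.io/', 'quay.io/tigera/', 'quay.io/metallb/', 'nvcr.io/', 'reg.kyverno.io/',
];

function shouldUpgrade(image, tag, state, now) {
  // Open a fresh window on first use or once the previous one has expired.
  if (!state.windowStart || now - state.windowStart > WINDOW_MS) {
    state.windowStart = now;
    state.count = 0;
  }
  // Budget exhausted: drop the event regardless of image.
  if (state.count >= MAX_UPGRADES_PER_WINDOW) return false;

  // Substring match for databases, prefix match for custom and infra images.
  if (DB_PATTERNS.some((p) => image.toLowerCase().includes(p))) return false;
  if (SKIP_PREFIXES.some((p) => image.startsWith(p))) return false;
  if (tag === 'latest' || tag === '') return false;

  // Only events that pass every filter consume rate-limit budget.
  state.count += 1;
  return true;
}
```

Because the counter is incremented after the filters, skipped images (databases, custom images, `:latest`) do not use up any of the five slots. Note that the window starts at the first event after expiry rather than at DIUN's scan time, so it only approximately tracks the 6-hour scan cycle.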