Foundation for opt-out-pure auto-update model per
docs/plans/2026-05-16-auto-upgrade-apps-{design,plan}.md.
- New stack `stacks/keel/` deploys Keel via Helm (charts.keel.sh, v1.0.6).
Polls registries hourly per design decision #8. Default schedule
overridable per-workload via keel.sh/pollSchedule annotation.
- New Kyverno ClusterPolicy `inject-keel-annotations` mutates Deployments,
StatefulSets, and DaemonSets in namespaces labeled `keel.sh/enrolled=true`
with keel.sh/policy=force + trigger=poll + pollSchedule=@every 1h.
- Phase 0 enrolls no namespaces. Phase 1 (next session) labels the
self-hosted set.
- Per-workload opt-out: label `keel.sh/policy: never` (used by rollback
runbook and chrome-service-style deliberate pins).
- Keel namespace excluded from the mutate — supervisor self-update has
too-bad a failure mode (decision #11).
- AGENTS.md: KYVERNO_LIFECYCLE_V2 marker convention added for the
ignore_changes block enrolled workloads need.
- .claude/CLAUDE.md: docker-images rule flagged as transitional.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
322 lines
13 KiB
Markdown
322 lines
13 KiB
Markdown
# Auto-Upgrade Apps Implementation Plan
|
||
|
||
> **For Claude:** REQUIRED SUB-SKILL: Use superpowers:executing-plans to implement this plan task-by-task.
|
||
|
||
**Goal:** Move the cluster from a mix of pinned-SHA / pinned-semver / ad-hoc `:latest` references to a Keel-driven auto-update model where every workload tracks `:latest` (or a chosen `:major` floating tag) and rolls automatically when upstream advances.
|
||
|
||
**Architecture:** Kyverno cluster-wide `ClusterPolicy` mutates Deployments / StatefulSets / DaemonSets in opted-in namespaces with Keel annotations (`keel.sh/policy: force`, `keel.sh/trigger: poll`, `keel.sh/pollSchedule: @every 1h`). Keel polls registries, triggers rollout on new digest. kubelet pulls fresh manifest via the nginx URL-split cache (manifests passthrough, blobs cached). Phase advance = expand the `NamespaceSelector` allowlist.
|
||
|
||
**Tech Stack:** Keel, Kyverno, Terraform / Terragrunt, Helm, Diun (notification only), nginx, docker/distribution
|
||
|
||
**Design doc:** `docs/plans/2026-05-16-auto-upgrade-apps-design.md`
|
||
|
||
**Key context:**
|
||
- Cache is already correctly configured (nginx URL-split + `proxy.ttl: 0`). No cache changes needed.
|
||
- Per-stack `lifecycle.ignore_changes` is already required for the existing `dns_config` Kyverno mutation (KYVERNO_LIFECYCLE_V1 convention). This plan extends it with a V2 marker for Keel annotations.
|
||
- Service Upgrade Agent (Diun → n8n → claude bumps tfvars) is retired by this design. n8n workflow + supporting scripts are removed once Phase 9 completes.
|
||
- CLAUDE.md "use 8-char git SHA tags" rule is reversed by this design (see Open Q1 in design doc).
|
||
|
||
---
|
||
|
||
## Phase 0 — Foundation
|
||
|
||
### Task 0.1: Resolve remaining open question
|
||
|
||
Q1 and Q2 from the design doc are resolved (uniform `:latest` + Keel model for all self-hosted; per-service plan for no-CI services).
|
||
|
||
Remaining open question:
|
||
|
||
**Helm-atomic services.** Survey:
|
||
```bash
|
||
grep -rn 'atomic.*true' /home/wizard/code/infra/stacks/ /home/wizard/code/infra/modules/
|
||
```
|
||
|
||
For each match: either remove `atomic = true` (preferred) or add the namespace to a Kyverno exclusion list. Document inline before Phase 1 proceeds.
|
||
|
||
---
|
||
|
||
### Task 0.2: Create the Keel stack
|
||
|
||
**Files:**
|
||
- Create: `stacks/keel/terragrunt.hcl`
|
||
- Create: `stacks/keel/main.tf`
|
||
- Create: `stacks/keel/variables.tf`
|
||
- Create: `stacks/keel/modules/keel/main.tf`
|
||
|
||
**Step 1:** Add `keel` to `terragrunt.hcl` `locals.tier0_stacks` — **NO**. Keel is Tier 1 (depends on Kyverno + Keel image registry access). Keep it in Tier 1.
|
||
|
||
**Step 2:** Deploy via Helm chart `keel-hq/keel` (verify current version via context7 before pinning).
|
||
|
||
Key Helm values:
|
||
- `polling.enabled: true`
|
||
- `helmProvider.enabled: false` (we use annotations, not Helm hooks)
|
||
- `notifications.slack.enabled: true` with channel `#deployments` (verify channel exists)
|
||
- Registry credentials: mount Forgejo PAT from Vault via ExternalSecret (`secret/viktor/forgejo_pull_token`).
|
||
|
||
**Step 3:** Verify Keel can authenticate to all five registries (Docker Hub, ghcr, quay, k8s.io, kyverno via the local cache; Forgejo direct).
|
||
|
||
**Acceptance:**
|
||
- `kubectl -n keel get pod` shows Keel Ready.
|
||
- `kubectl -n keel logs deploy/keel | grep registry` shows successful manifest queries.
|
||
|
||
---
|
||
|
||
### Task 0.3: Author the Kyverno ClusterPolicy
|
||
|
||
**Files:**
|
||
- Create: `stacks/kyverno/modules/kyverno/keel-annotations.tf` (or extend `security-policies.tf`)
|
||
|
||
ClusterPolicy `inject-keel-annotations`:
|
||
|
||
```yaml
|
||
apiVersion: kyverno.io/v1
|
||
kind: ClusterPolicy
|
||
metadata:
|
||
name: inject-keel-annotations
|
||
spec:
|
||
background: true
|
||
rules:
|
||
- name: add-keel-annotation
|
||
match:
|
||
any:
|
||
- resources:
|
||
kinds: [Deployment, StatefulSet, DaemonSet]
|
||
namespaces: [] # populated per phase
|
||
exclude:
|
||
any:
|
||
- resources:
|
||
namespaces: ["keel"] # decision #11
|
||
- resources:
|
||
# Workloads can opt out by setting this label
|
||
selector:
|
||
matchLabels:
|
||
keel.sh/policy: never
|
||
mutate:
|
||
patchStrategicMerge:
|
||
metadata:
|
||
annotations:
|
||
+(keel.sh/policy): force
|
||
+(keel.sh/trigger): poll
|
||
+(keel.sh/pollSchedule): "@every 1h"
|
||
```
|
||
|
||
- `+()` syntax adds only if not present (preserves per-workload overrides).
|
||
- `exclude.selector.matchLabels[keel.sh/policy=never]` is the per-workload escape hatch (used during rollback per decision #9).
|
||
|
||
**Step 2:** Initially deploy with `namespaces: []` — policy exists but matches nothing.
|
||
|
||
**Acceptance:**
|
||
- `kubectl get clusterpolicy inject-keel-annotations` shows Ready.
|
||
- `kubectl get deploy -A -o yaml | grep keel.sh/policy` shows no matches yet (empty allowlist).
|
||
|
||
---
|
||
|
||
### Task 0.4: Define the KYVERNO_LIFECYCLE_V2 marker convention
|
||
|
||
**Files:**
|
||
- Modify: `AGENTS.md` — add the V2 snippet to the "Kyverno Drift Suppression" section
|
||
- Modify: `.claude/CLAUDE.md` — reference the V2 marker
|
||
|
||
Snippet to copy-paste:
|
||
|
||
```hcl
|
||
lifecycle {
|
||
ignore_changes = [
|
||
spec[0].template[0].spec[0].dns_config, # KYVERNO_LIFECYCLE_V1
|
||
metadata[0].annotations["keel.sh/policy"],
|
||
metadata[0].annotations["keel.sh/trigger"],
|
||
metadata[0].annotations["keel.sh/pollSchedule"], # KYVERNO_LIFECYCLE_V2
|
||
]
|
||
}
|
||
```
|
||
|
||
Backfill order: per-phase, only on workloads about to be enrolled. Not a mass sweep.
|
||
|
||
---
|
||
|
||
## Phase 1 — Self-hosted (uniform model)
|
||
|
||
**Set:** all self-hosted services. Three sub-categories:
|
||
|
||
- **Woodpecker-build-only (6):** `claude-agent-service`, `fire-planner`, `job-hunter`, `payslip-ingest`, `recruiter-responder`, `claude-memory-mcp`.
|
||
- **GHA-migrated (10, per memory id=388):** Website, k8s-portal, f1-stream, claude-memory-mcp, apple-health-data, audiblez-web, plotting-book, insta2spotify, audiobook-search, council-complaints. (Note: claude-memory-mcp appears in both lists — verify.)
|
||
- **No-CI (4, per design Q2):** `wealthfolio` (→ upstream), `chrome-service` (verify pin intent), `beadboard` (add CI), `freedify` (add CI).
|
||
- **Already-uniform (1):** `kms-website` — already pushes `:latest` AND SHA; just needs Keel annotation.
|
||
|
||
### Task 1.1: Audit current image refs
|
||
|
||
```bash
|
||
grep -rE 'image\s*=\s*"(forgejo\.viktorbarzin\.me|viktorbarzin)' /home/wizard/code/infra/stacks/ | sort
|
||
```
|
||
|
||
Tabulate per service: current tag, CI type (GHA / Woodpecker / none), action needed.
|
||
|
||
### Task 1.2: Per-service uniform conversion
|
||
|
||
For each Woodpecker-build-only service:
|
||
1. Edit Terraform: `local.image_tag` / `var.image_tag` → `"latest"`.
|
||
2. Add the KYVERNO_LIFECYCLE_V2 snippet (annotations ignore_changes).
|
||
3. Verify `.woodpecker.yml` pushes `:latest` on every build (most do via `auto_tag: true`).
|
||
|
||
For each GHA-migrated service:
|
||
1. Edit Terraform: switch `image_tag` from SHA reference to `"latest"`.
|
||
2. Add the KYVERNO_LIFECYCLE_V2 snippet.
|
||
3. Edit `.github/workflows/build-and-deploy.yml`: push `:latest` (in addition to `:<8-char-sha>` for traceability). Remove the Woodpecker API POST step.
|
||
4. Delete `.woodpecker/deploy.yml` and `.woodpecker/build-fallback.yml` from each repo (no longer needed).
|
||
5. Remove the Woodpecker repo config for these repos from Terraform if applicable.
|
||
|
||
For each no-CI service:
|
||
- `wealthfolio`: change Terraform image to `wealthfolio/wealthfolio:latest` (upstream DockerHub). Validate the image starts cleanly.
|
||
- `chrome-service`: check git blame on the `:v4` pin. If deliberate → label `keel.sh/policy: never`. If accidental → bump to upstream `:latest`.
|
||
- `beadboard`, `freedify`: write a minimal `.woodpecker.yml` (single build step pushing to Forgejo `:latest`). Trigger an initial build to populate `:latest`.
|
||
|
||
For `kms-website`: only add the Keel annotation; CI changes optional.
|
||
|
||
### Task 1.3: Add Phase 1 namespaces to Kyverno allowlist
|
||
|
||
Edit `stacks/kyverno/modules/kyverno/keel-annotations.tf`:
|
||
|
||
```yaml
|
||
namespaces:
|
||
- claude-agent-service
|
||
- fire-planner
|
||
- job-hunter
|
||
- payslip-ingest
|
||
- recruiter-responder
|
||
- claude-memory-mcp
|
||
- kms-website
|
||
# GHA-migrated set:
|
||
- website # or whatever the namespace is named per repo
|
||
- k8s-portal
|
||
- f1-stream
|
||
- apple-health-data
|
||
- audiblez-web
|
||
- plotting-book
|
||
- insta2spotify
|
||
- audiobook-search
|
||
- council-complaints
|
||
# No-CI set:
|
||
- beads-server
|
||
- chrome-service
|
||
- freedify
|
||
- wealthfolio
|
||
```
|
||
|
||
Verify each namespace name from `kubectl get ns` before locking in (some may differ from the repo name).
|
||
|
||
Apply. Watch `kubectl get deploy -n <ns> -o yaml | grep keel.sh` confirm annotations injected. Watch Keel logs for first poll cycle picking up the workloads.
|
||
|
||
### Task 1.4: Soak
|
||
|
||
1 week. Monitor:
|
||
- Slack `#deployments` for Keel rollout notifications.
|
||
- `KubePodCrashLooping` alerts.
|
||
- Manual `kubectl rollout status` on each service after a Keel-triggered rollout.
|
||
|
||
If any service breaks repeatedly: apply rollback runbook (decision #9), record the service in a "pin list" with reason, proceed.
|
||
|
||
**Acceptance:**
|
||
- All 7 services running latest digests within 24h of Phase 1 apply.
|
||
- No CrashLooping persisting >1h.
|
||
- No more than 2 services pinned-out during the soak week.
|
||
|
||
---
|
||
|
||
## Phase 2 — Stateless third-party web apps
|
||
|
||
**Set:** linkwarden, postiz, affine, isponsorblocktv, audiobookshelf, freshrss, tandoor, immich (verify it qualifies — has external DB so app-restart is safe), excalidraw, hackmd, send, jsoncrack, sparkyfitness, etc. (~15-20 services — full list from `kubectl get deploy -A` filtered against the phase-1 set + skip-bucket).
|
||
|
||
### Task 2.1: Audit current tags via Diun
|
||
|
||
```bash
|
||
# Diun's REST API or UI exports a "new tags available" report
|
||
# Use as the per-service decision source
|
||
```
|
||
|
||
For each service, pick floating tag:
|
||
- `:latest` if upstream publishes it and it's stable.
|
||
- `:<major>` (e.g. `:2`, `:v3`) if `:latest` is unreliable.
|
||
- `glob` + `ignore_changes` as last resort.
|
||
|
||
### Task 2.2: Catch-up PR
|
||
|
||
Single combined PR:
|
||
- Per-stack: switch image tag from pinned semver to chosen floating tag (Diun-informed).
|
||
- Per-stack: add KYVERNO_LIFECYCLE_V2 snippet.
|
||
- Append Phase 2 namespaces to Kyverno allowlist.
|
||
|
||
Apply with `-target=` per stack to pace rollouts (≤5 per hour to avoid cache burst — memory id=603).
|
||
|
||
### Task 2.3: Soak — 1 week, same monitoring as Phase 1.
|
||
|
||
---
|
||
|
||
## Phases 3–9 — same template
|
||
|
||
For each phase, repeat:
|
||
|
||
1. Define the set (precise namespace list).
|
||
2. Audit current tags (Diun + grep).
|
||
3. Pick floating tag per service.
|
||
4. Combined PR: image-ref change + lifecycle snippet + Kyverno allowlist update.
|
||
5. Apply paced (≤5/hr).
|
||
6. Soak 1 week. Pin-out any service that breaks repeatedly.
|
||
|
||
Set definitions per phase: see design doc Phase Ordering table.
|
||
|
||
**Special-handling phases:**
|
||
|
||
- **Phase 7 (Operators).** Restart of an operator can confuse its managed CRD reconciles. Use `imagePullPolicy: Always` + readiness check before declaring stable. Investigate cnpg-operator and ESO restart behavior in advance.
|
||
- **Phase 8 (Critical infra).** Calico/CSI DaemonSet rollouts impact each node briefly. Verify `updateStrategy.rollingUpdate.maxUnavailable: 1` on every DaemonSet before enrollment. Memory id=390 (26h Calico-cascade outage) is the cautionary tale.
|
||
- **Phase 9 (Bootstrap).** Vault, CNPG, mysql-standalone. Coordinate with backup window. Take a fresh snapshot of `/srv/nfs/<db>-backup/` before applying the phase enrollment.
|
||
|
||
---
|
||
|
||
## Cleanup tasks (after Phase 9 stable)
|
||
|
||
### Task C.1: Retire Service Upgrade Agent
|
||
|
||
**Files:**
|
||
- Modify: `stacks/n8n/` — remove the Service Upgrade Agent workflow
|
||
- Delete: any supporting scripts (`infra/scripts/service-upgrade-*.sh` if they exist)
|
||
- Modify: `stacks/diun/` — disable webhook notification to n8n (keep Slack notification for visibility)
|
||
|
||
### Task C.2: Update CLAUDE.md files
|
||
|
||
- Reverse the "use 8-char git SHA tags" rule in `infra/.claude/CLAUDE.md` "Docker images" line.
|
||
- Reverse same in root `/CLAUDE.md` if duplicated.
|
||
- Add a new section documenting the Keel model + KYVERNO_LIFECYCLE_V2 snippet.
|
||
- Update memory via `mcp__claude_memory__memory_update` on entries 388, 612, 604 (CI/CD architecture, Service Upgrade Agent retirement, cache TTL clarification).
|
||
|
||
### Task C.3: Add a runbook
|
||
|
||
**Files:**
|
||
- Create: `docs/runbooks/keel-rollback.md`
|
||
|
||
Document the rollback flow (decision #9): `kubectl rollout undo` → Terraform pin → annotation `keel.sh/policy: never`.
|
||
|
||
### Task C.4: Tidy Diun
|
||
|
||
Drop image-pin overrides for MySQL, PostgreSQL, Redis from Diun config (no longer needed since they're Keel-managed; the previous skip was for the retired changelog-agent path).
|
||
|
||
---
|
||
|
||
## Rollback (whole project)
|
||
|
||
If the auto-roll experiment goes badly cluster-wide (multiple cascading failures, repeated outages), revert:
|
||
|
||
1. Set Kyverno ClusterPolicy `inject-keel-annotations` to empty `namespaces: []`.
|
||
2. Existing annotations remain on workloads, but Keel continues to act on them — so also disable Keel: scale `keel` Deployment to 0.
|
||
3. Pin every workload's Terraform image_tag back to its current running digest (use `kubectl get deploy -A -o jsonpath='{range .items[*]}{.metadata.name}:{.spec.template.spec.containers[0].image}{"\n"}{end}'`).
|
||
4. Document failure modes in `post-mortems/2026-XX-XX-keel-rollback.md`.
|
||
5. Reconsider opt-in approach for next iteration.
|
||
|
||
---
|
||
|
||
## Success criteria
|
||
|
||
- All ~70 services running latest within 8 weeks of Phase 0 completion.
|
||
- Zero unrolled-back outages caused by Keel.
|
||
- ≤5 services on the "pin list" (i.e. ≥93% auto-roll success rate).
|
||
- `terragrunt plan` shows no spurious diffs from Kyverno-injected annotations (KYVERNO_LIFECYCLE_V2 working as intended).
|
||
- Service Upgrade Agent + supporting infra retired.
|