Foundation for opt-out-pure auto-update model per
docs/plans/2026-05-16-auto-upgrade-apps-{design,plan}.md.
- New stack `stacks/keel/` deploys Keel via Helm (charts.keel.sh, v1.0.6).
Polls registries hourly per design decision #8. Default schedule
overridable per-workload via keel.sh/pollSchedule annotation.
- New Kyverno ClusterPolicy `inject-keel-annotations` mutates Deployments,
StatefulSets, and DaemonSets in namespaces labeled `keel.sh/enrolled=true`
with keel.sh/policy=force + trigger=poll + pollSchedule=@every 1h.
- Phase 0 enrolls no namespaces. Phase 1 (next session) labels the
self-hosted set.
- Per-workload opt-out: label `keel.sh/policy: never` (used by rollback
runbook and chrome-service-style deliberate pins).
- Keel namespace excluded from the mutate: supervisor self-update has
too severe a failure mode (decision #11).
- AGENTS.md: KYVERNO_LIFECYCLE_V2 marker convention added for the
ignore_changes block enrolled workloads need.
- .claude/CLAUDE.md: docker-images rule flagged as transitional.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Auto-Upgrade Apps Implementation Plan
For Claude: REQUIRED SUB-SKILL: Use superpowers:executing-plans to implement this plan task-by-task.
Goal: Move the cluster from a mix of pinned-SHA / pinned-semver / ad-hoc :latest references to a Keel-driven auto-update model where every workload tracks :latest (or a chosen :major floating tag) and rolls automatically when upstream advances.
Architecture: Kyverno cluster-wide ClusterPolicy mutates Deployments / StatefulSets / DaemonSets in opted-in namespaces with Keel annotations (keel.sh/policy: force, keel.sh/trigger: poll, keel.sh/pollSchedule: @every 1h). Keel polls registries, triggers rollout on new digest. kubelet pulls fresh manifest via the nginx URL-split cache (manifests passthrough, blobs cached). Phase advance = expand the NamespaceSelector allowlist.
Tech Stack: Keel, Kyverno, Terraform / Terragrunt, Helm, Diun (notification only), nginx, docker/distribution
Design doc: docs/plans/2026-05-16-auto-upgrade-apps-design.md
Key context:
- Cache is already correctly configured (nginx URL-split + `proxy.ttl: 0`). No cache changes needed.
- Per-stack `lifecycle.ignore_changes` is already required for the existing `dns_config` Kyverno mutation (KYVERNO_LIFECYCLE_V1 convention). This plan extends it with a V2 marker for Keel annotations.
- Service Upgrade Agent (Diun → n8n → claude bumps tfvars) is retired by this design. n8n workflow + supporting scripts are removed once Phase 9 completes.
- CLAUDE.md "use 8-char git SHA tags" rule is reversed by this design (see Open Q1 in design doc).
Phase 0 — Foundation
Task 0.1: Resolve remaining open question
Q1 and Q2 from the design doc are resolved (uniform :latest + Keel model for all self-hosted; per-service plan for no-CI services).
Remaining open question:
Helm-atomic services. Survey:
grep -rn 'atomic.*true' /home/wizard/code/infra/stacks/ /home/wizard/code/infra/modules/
For each match: either remove atomic = true (preferred) or add the namespace to a Kyverno exclusion list. Document inline before Phase 1 proceeds.
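The survey above can also be scripted so that the result is easy to paste into the inline documentation. The following is a minimal sketch, not part of the plan itself: the function name `find_atomic_releases` is hypothetical, and the pattern is deliberately stricter than the grep (it only matches a line-leading `atomic = true` assignment, so commented-out mentions are skipped).

```python
import re
from pathlib import Path

# Assumption: `atomic = true` sits on its own line, as in a typical
# helm_release block. Lines starting with '#' do not match.
ATOMIC_RE = re.compile(r"^\s*atomic\s*=\s*true\b")


def find_atomic_releases(roots):
    """Return {file path: [line numbers]} for each `atomic = true` under roots."""
    hits = {}
    for root in roots:
        root_path = Path(root)
        if not root_path.is_dir():
            continue  # tolerate roots that do not exist on this machine
        for tf_file in sorted(root_path.rglob("*.tf")):
            lines = tf_file.read_text().splitlines()
            for lineno, line in enumerate(lines, start=1):
                if ATOMIC_RE.match(line):
                    hits.setdefault(str(tf_file), []).append(lineno)
    return hits
```

Calling `find_atomic_releases` with the two directories from the grep command yields one entry per file that still needs a decision (remove `atomic = true` or exclude the namespace).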
Task 0.2: Create the Keel stack
Files:
- Create: `stacks/keel/terragrunt.hcl`
- Create: `stacks/keel/main.tf`
- Create: `stacks/keel/variables.tf`
- Create: `stacks/keel/modules/keel/main.tf`
Step 1: Do NOT add keel to terragrunt.hcl locals.tier0_stacks. Keel is Tier 1 (depends on Kyverno + Keel image registry access), so keep it in Tier 1.
Step 2: Deploy via Helm chart keel-hq/keel (verify current version via context7 before pinning).
Key Helm values:
- `polling.enabled: true`
- `helmProvider.enabled: false` (we use annotations, not Helm hooks)
- `notifications.slack.enabled: true` with channel `#deployments` (verify channel exists)
- Registry credentials: mount Forgejo PAT from Vault via ExternalSecret (`secret/viktor/forgejo_pull_token`).
Step 3: Verify Keel can authenticate to every registry: the five reached via the local cache (Docker Hub, ghcr, quay, k8s.io, kyverno) and Forgejo directly.
Acceptance:
- `kubectl -n keel get pod` shows Keel Ready.
- `kubectl -n keel logs deploy/keel | grep registry` shows successful manifest queries.
Task 0.3: Author the Kyverno ClusterPolicy
Files:
- Create: `stacks/kyverno/modules/kyverno/keel-annotations.tf` (or extend `security-policies.tf`)
ClusterPolicy inject-keel-annotations:
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: inject-keel-annotations
spec:
  background: true
  rules:
    - name: add-keel-annotation
      match:
        any:
          - resources:
              kinds: [Deployment, StatefulSet, DaemonSet]
              namespaces: [] # populated per phase
      exclude:
        any:
          - resources:
              namespaces: ["keel"] # decision #11
          - resources:
              # Workloads can opt out by setting this label
              selector:
                matchLabels:
                  keel.sh/policy: never
      mutate:
        patchStrategicMerge:
          metadata:
            annotations:
              +(keel.sh/policy): force
              +(keel.sh/trigger): poll
              +(keel.sh/pollSchedule): "@every 1h"
- `+()` syntax adds only if not present (preserves per-workload overrides).
- `exclude.selector.matchLabels[keel.sh/policy=never]` is the per-workload escape hatch (used during rollback per decision #9).
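The add-if-absent behaviour of the `+(key)` anchor is what keeps per-workload overrides intact. The snippet below is only an illustration of that merge rule in plain Python, not Kyverno's implementation; `apply_add_if_absent` and `KEEL_DEFAULTS` are hypothetical names.

```python
# Defaults taken from the ClusterPolicy above.
KEEL_DEFAULTS = {
    "keel.sh/policy": "force",
    "keel.sh/trigger": "poll",
    "keel.sh/pollSchedule": "@every 1h",
}


def apply_add_if_absent(existing, defaults):
    """Mimic the `+(key)` anchor: add a default only when the key is missing."""
    merged = dict(existing)  # never mutate the caller's annotations
    for key, value in defaults.items():
        merged.setdefault(key, value)
    return merged


# A workload that already sets its own schedule keeps it and gains the rest.
custom = apply_add_if_absent({"keel.sh/pollSchedule": "@every 6h"}, KEEL_DEFAULTS)
```

In this sketch `custom` keeps the 6-hour schedule while picking up the `force` policy and `poll` trigger, which mirrors how a workload-level `keel.sh/pollSchedule` is expected to survive the mutation.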
Step 2: Initially deploy with namespaces: [] — policy exists but matches nothing.
Acceptance:
- `kubectl get clusterpolicy inject-keel-annotations` shows Ready.
- `kubectl get deploy -A -o yaml | grep keel.sh/policy` shows no matches yet (empty allowlist).
Task 0.4: Define the KYVERNO_LIFECYCLE_V2 marker convention
Files:
- Modify: `AGENTS.md` (add the V2 snippet to the "Kyverno Drift Suppression" section)
- Modify: `.claude/CLAUDE.md` (reference the V2 marker)
Snippet to copy-paste:
lifecycle {
  ignore_changes = [
    spec[0].template[0].spec[0].dns_config, # KYVERNO_LIFECYCLE_V1
    metadata[0].annotations["keel.sh/policy"],
    metadata[0].annotations["keel.sh/trigger"],
    metadata[0].annotations["keel.sh/pollSchedule"], # KYVERNO_LIFECYCLE_V2
  ]
}
Backfill order: per-phase, only on workloads about to be enrolled. Not a mass sweep.
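Because the backfill happens per phase, a quick check that a stack really lists all three annotation keys can catch a partial copy-paste before enrollment. This is a rough text-level sketch, not an HCL parser; `missing_v2_keys` is a hypothetical helper, and it assumes the keys are written exactly as in the snippet above.

```python
import re

# The three annotation keys covered by the KYVERNO_LIFECYCLE_V2 snippet.
V2_KEYS = ("keel.sh/policy", "keel.sh/trigger", "keel.sh/pollSchedule")

# Matches entries such as: metadata[0].annotations["keel.sh/policy"]
ANNOTATION_RE = re.compile(r'metadata\[0\]\.annotations\["([^"]+)"\]')


def missing_v2_keys(tf_source):
    """Return the Keel annotation keys that tf_source does not reference."""
    referenced = set(ANNOTATION_RE.findall(tf_source))
    return [key for key in V2_KEYS if key not in referenced]
```

An empty result means the text mentions all three keys; it does not prove they sit inside the right `lifecycle` block, so a `terragrunt plan` remains the real test.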
Phase 1 — Self-hosted (uniform model)
Set: all self-hosted services. Three sub-categories:
- Woodpecker-build-only (6): `claude-agent-service`, `fire-planner`, `job-hunter`, `payslip-ingest`, `recruiter-responder`, `claude-memory-mcp`.
- GHA-migrated (10, per memory id=388): Website, k8s-portal, f1-stream, claude-memory-mcp, apple-health-data, audiblez-web, plotting-book, insta2spotify, audiobook-search, council-complaints. (Note: claude-memory-mcp appears in both lists; verify.)
- No-CI (4, per design Q2): `wealthfolio` (→ upstream), `chrome-service` (verify pin intent), `beadboard` (add CI), `freedify` (add CI).
- Already-uniform (1): `kms-website`, which already pushes `:latest` AND SHA; just needs Keel annotation.
Task 1.1: Audit current image refs
grep -rE 'image\s*=\s*"(forgejo\.viktorbarzin\.me|viktorbarzin)' /home/wizard/code/infra/stacks/ | sort
Tabulate per service: current tag, CI type (GHA / Woodpecker / none), action needed.
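To fill the "current tag" column consistently, each image reference from the grep output can be classified by tag style. The sketch below is one possible approach under stated assumptions: `classify_image_ref` is a hypothetical helper, the categories are illustrative, and an all-hex tag of 7 to 40 characters is assumed to be a git SHA.

```python
import re

MAJOR_RE = re.compile(r"^v?\d{1,4}$")          # e.g. "2", "v4"
SHA_RE = re.compile(r"^[0-9a-f]{7,40}$")       # e.g. an 8-char git SHA
SEMVER_RE = re.compile(r"^v?\d+\.\d+(\.\d+)?([-+].*)?$")


def classify_image_ref(ref):
    """Split an image reference into (repository, tag, kind)."""
    if "@" in ref:
        repo, digest = ref.split("@", 1)
        return repo, digest, "digest"
    # Only a ':' after the last '/' separates a tag; earlier ones are ports.
    name = ref.rsplit("/", 1)[-1]
    if ":" not in name:
        return ref, "latest", "implicit-latest"
    repo, tag = ref.rsplit(":", 1)
    if tag == "latest":
        kind = "latest"
    elif MAJOR_RE.match(tag):
        kind = "major"
    elif SHA_RE.match(tag):
        kind = "git-sha"
    elif SEMVER_RE.match(tag):
        kind = "semver"
    else:
        kind = "other"
    return repo, tag, kind
```

Services classified as `git-sha` or `semver` are the ones that need the tag switch in Task 1.2; `latest` entries only need the annotation and lifecycle snippet.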
Task 1.2: Per-service uniform conversion
For each Woodpecker-build-only service:
- Edit Terraform: `local.image_tag` / `var.image_tag` → `"latest"`.
- Add the KYVERNO_LIFECYCLE_V2 snippet (annotations ignore_changes).
- Verify `.woodpecker.yml` pushes `:latest` on every build (most do via `auto_tag: true`).
For each GHA-migrated service:
- Edit Terraform: switch `image_tag` from SHA reference to `"latest"`.
- Add the KYVERNO_LIFECYCLE_V2 snippet.
- Edit `.github/workflows/build-and-deploy.yml`: push `:latest` (in addition to `:<8-char-sha>` for traceability). Remove the Woodpecker API POST step.
- Delete `.woodpecker/deploy.yml` and `.woodpecker/build-fallback.yml` from each repo (no longer needed).
- Remove the Woodpecker repo config for these repos from Terraform if applicable.
For each no-CI service:
- `wealthfolio`: change Terraform image to `wealthfolio/wealthfolio:latest` (upstream DockerHub). Validate the image starts cleanly.
- `chrome-service`: check git blame on the `:v4` pin. If deliberate → label `keel.sh/policy: never`. If accidental → bump to upstream `:latest`.
- `beadboard`, `freedify`: write a minimal `.woodpecker.yml` (single build step pushing to Forgejo `:latest`). Trigger an initial build to populate `:latest`.
For kms-website: only add the Keel annotation; CI changes optional.
Task 1.3: Add Phase 1 namespaces to Kyverno allowlist
Edit stacks/kyverno/modules/kyverno/keel-annotations.tf:
namespaces:
- claude-agent-service
- fire-planner
- job-hunter
- payslip-ingest
- recruiter-responder
- claude-memory-mcp
- kms-website
# GHA-migrated set:
- website # or whatever the namespace is named per repo
- k8s-portal
- f1-stream
- apple-health-data
- audiblez-web
- plotting-book
- insta2spotify
- audiobook-search
- council-complaints
# No-CI set:
- beads-server
- chrome-service
- freedify
- wealthfolio
Verify each namespace name from kubectl get ns before locking in (some may differ from the repo name).
Apply. Run `kubectl get deploy -n <ns> -o yaml | grep keel.sh` to confirm the annotations were injected. Watch Keel logs for the first poll cycle picking up the workloads.
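The namespace-name verification in this task amounts to a set difference between the allowlist and the namespaces the cluster reports. A minimal sketch, assuming the cluster side comes from `kubectl get ns -o name` (one `namespace/<name>` per line); both function names are hypothetical.

```python
def parse_ns_names(kubectl_output):
    """Parse `kubectl get ns -o name` output into bare namespace names."""
    names = []
    for line in kubectl_output.splitlines():
        if "/" in line:
            names.append(line.strip().split("/", 1)[1])
    return names


def unknown_namespaces(allowlist, cluster_namespaces):
    """Return allowlist entries missing from the cluster, in allowlist order."""
    existing = set(cluster_namespaces)
    return [ns for ns in allowlist if ns not in existing]
```

Any name returned by `unknown_namespaces` is either a typo or a namespace that differs from its repo name, and should be resolved before the allowlist is applied.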
Task 1.4: Soak
1 week. Monitor:
- Slack `#deployments` for Keel rollout notifications.
- `KubePodCrashLooping` alerts.
- Manual `kubectl rollout status` on each service after a Keel-triggered rollout.
If any service breaks repeatedly: apply rollback runbook (decision #9), record the service in a "pin list" with reason, proceed.
Acceptance:
- All enrolled Phase 1 services running latest digests within 24h of Phase 1 apply.
- No CrashLooping persisting >1h.
- No more than 2 services pinned-out during the soak week.
Phase 2 — Stateless third-party web apps
Set: linkwarden, postiz, affine, isponsorblocktv, audiobookshelf, freshrss, tandoor, immich (verify it qualifies — has external DB so app-restart is safe), excalidraw, hackmd, send, jsoncrack, sparkyfitness, etc. (~15-20 services — full list from kubectl get deploy -A filtered against the phase-1 set + skip-bucket).
Task 2.1: Audit current tags via Diun
# Diun's REST API or UI exports a "new tags available" report
# Use as the per-service decision source
For each service, pick floating tag:
- `:latest` if upstream publishes it and it's stable.
- `:<major>` (e.g. `:2`, `:v3`) if `:latest` is unreliable.
- `glob` + `ignore_changes` as last resort.
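The preference order above can be written down as a small decision function. This is a sketch under assumptions: `pick_floating_tag` is a hypothetical name, whether `:latest` is "reliable" is a human judgement passed in as a flag, and "major tag" is taken to mean a bare integer with an optional `v` prefix.

```python
import re


def pick_floating_tag(published_tags, latest_is_reliable=True):
    """Return "latest", the highest major tag, or None for the last-resort path."""
    tags = set(published_tags)
    if latest_is_reliable and "latest" in tags:
        return "latest"
    # Sorted first so the result is deterministic when "2" and "v2" both exist.
    majors = sorted(t for t in tags if re.fullmatch(r"v?\d+", t))
    if majors:
        return max(majors, key=lambda t: int(t.lstrip("v")))
    return None  # caller falls back to glob + ignore_changes
```

A `None` result marks the services that need the `glob` + `ignore_changes` treatment and are worth listing separately in the catch-up PR.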
Task 2.2: Catch-up PR
Single combined PR:
- Per-stack: switch image tag from pinned semver to chosen floating tag (Diun-informed).
- Per-stack: add KYVERNO_LIFECYCLE_V2 snippet.
- Append Phase 2 namespaces to Kyverno allowlist.
Apply with -target= per stack to pace rollouts (≤5 per hour to avoid cache burst — memory id=603).
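The pacing rule (no more than 5 rollouts per hour) can be planned ahead of time. The sketch below only computes a schedule; it does not run any apply, and `pace_rollouts` plus the stack names in the usage note are hypothetical.

```python
from datetime import datetime, timedelta


def pace_rollouts(stacks, start, per_hour=5):
    """Return (start time, stack) pairs, one batch of per_hour stacks per hour."""
    schedule = []
    for index, stack in enumerate(stacks):
        batch = index // per_hour  # 0 for the first five, 1 for the next five...
        schedule.append((start + timedelta(hours=batch), stack))
    return schedule
```

For example, twelve stacks starting at 09:00 would be split into batches at 09:00, 10:00 and 11:00, with the last batch holding two stacks.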
Task 2.3: Soak — 1 week, same monitoring as Phase 1.
Phases 3–9 — same template
For each phase, repeat:
- Define the set (precise namespace list).
- Audit current tags (Diun + grep).
- Pick floating tag per service.
- Combined PR: image-ref change + lifecycle snippet + Kyverno allowlist update.
- Apply paced (≤5/hr).
- Soak 1 week. Pin-out any service that breaks repeatedly.
Set definitions per phase: see design doc Phase Ordering table.
Special-handling phases:
- Phase 7 (Operators). Restart of an operator can confuse its managed CRD reconciles. Use `imagePullPolicy: Always` + readiness check before declaring stable. Investigate cnpg-operator and ESO restart behavior in advance.
- Phase 8 (Critical infra). Calico/CSI DaemonSet rollouts impact each node briefly. Verify `updateStrategy.rollingUpdate.maxUnavailable: 1` on every DaemonSet before enrollment. Memory id=390 (26h Calico-cascade outage) is the cautionary tale.
- Phase 9 (Bootstrap). Vault, CNPG, mysql-standalone. Coordinate with backup window. Take a fresh snapshot of `/srv/nfs/<db>-backup/` before applying the phase enrollment.
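The Phase 8 pre-check on `maxUnavailable` can be automated against the JSON form of the DaemonSet list. This is a sketch assuming the input is the output of `kubectl get daemonset -A -o json`; `daemonsets_needing_fix` is a hypothetical helper, and it treats anything other than a RollingUpdate strategy capped at 1 as needing attention.

```python
import json


def daemonsets_needing_fix(daemonset_list_json):
    """Return "namespace/name" for DaemonSets not capped at maxUnavailable 1."""
    offenders = []
    for item in json.loads(daemonset_list_json).get("items", []):
        strategy = item.get("spec", {}).get("updateStrategy") or {}
        rolling = strategy.get("rollingUpdate") or {}
        is_rolling = strategy.get("type", "RollingUpdate") == "RollingUpdate"
        # maxUnavailable may be an integer or a string such as "1" or "25%".
        capped = rolling.get("maxUnavailable") in (1, "1")
        if not (is_rolling and capped):
            meta = item["metadata"]
            offenders.append(f'{meta["namespace"]}/{meta["name"]}')
    return offenders
```

An empty list is the condition to meet before adding the Phase 8 namespaces to the allowlist.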
Cleanup tasks (after Phase 9 stable)
Task C.1: Retire Service Upgrade Agent
Files:
- Modify: `stacks/n8n/` (remove the Service Upgrade Agent workflow)
- Delete: any supporting scripts (`infra/scripts/service-upgrade-*.sh` if they exist)
- Modify: `stacks/diun/` (disable webhook notification to n8n; keep Slack notification for visibility)
Task C.2: Update CLAUDE.md files
- Reverse the "use 8-char git SHA tags" rule in `infra/.claude/CLAUDE.md` "Docker images" line.
- Reverse same in root `/CLAUDE.md` if duplicated.
- Add a new section documenting the Keel model + KYVERNO_LIFECYCLE_V2 snippet.
- Update memory via `mcp__claude_memory__memory_update` on entries 388, 612, 604 (CI/CD architecture, Service Upgrade Agent retirement, cache TTL clarification).
Task C.3: Add a runbook
Files:
- Create: `docs/runbooks/keel-rollback.md`
Document the rollback flow (decision #9): kubectl rollout undo → Terraform pin → annotation keel.sh/policy: never.
Task C.4: Tidy Diun
Drop image-pin overrides for MySQL, PostgreSQL, Redis from Diun config (no longer needed since they're Keel-managed; the previous skip was for the retired changelog-agent path).
Rollback (whole project)
If the auto-roll experiment goes badly cluster-wide (multiple cascading failures, repeated outages), revert:
- Set Kyverno ClusterPolicy `inject-keel-annotations` to empty `namespaces: []`.
- Existing annotations remain on workloads and Keel continues to act on them, so also disable Keel: scale the `keel` Deployment to 0.
- Pin every workload's Terraform image_tag back to its current running digest (use `kubectl get deploy -A -o jsonpath='{range .items[*]}{.metadata.name}:{.spec.template.spec.containers[0].image}{"\n"}{end}'`).
- Document failure modes in `post-mortems/2026-XX-XX-keel-rollback.md`.
- Reconsider opt-in approach for next iteration.
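One caveat when pinning: the Deployment spec printed by the jsonpath above holds whatever image reference Terraform set, which for enrolled workloads may be a floating tag rather than a digest. The resolved digest is normally reported per container in pod status under `imageID`. The sketch below assumes its input is the output of `kubectl get pods -A -o json`; `running_digests` is a hypothetical helper, and some runtimes may prefix or omit the `imageID` value.

```python
import json


def running_digests(pod_list_json):
    """Map "namespace/pod/container" to the digest reported in pod status."""
    digests = {}
    for pod in json.loads(pod_list_json).get("items", []):
        meta = pod["metadata"]
        statuses = pod.get("status", {}).get("containerStatuses") or []
        for status in statuses:
            image_id = status.get("imageID", "")
            if "@" not in image_id:
                continue  # not pulled yet, or the runtime reported no digest
            key = f'{meta["namespace"]}/{meta["name"]}/{status["name"]}'
            digests[key] = image_id.split("@", 1)[1]
    return digests
```

The resulting `sha256:...` values are what would go into the Terraform pin for each workload during a whole-project rollback.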
Success criteria
- All ~70 services running latest within 8 weeks of Phase 0 completion.
- Zero unrolled-back outages caused by Keel.
- ≤5 services on the "pin list" (i.e. ≥93% auto-roll success rate).
- `terragrunt plan` shows no spurious diffs from Kyverno-injected annotations (KYVERNO_LIFECYCLE_V2 working as intended).
- Service Upgrade Agent + supporting infra retired.