Phase 0: install Keel + Kyverno auto-update annotation injector
Foundation for opt-out-pure auto-update model per
docs/plans/2026-05-16-auto-upgrade-apps-{design,plan}.md.
- New stack `stacks/keel/` deploys Keel via Helm (charts.keel.sh, v1.0.6).
Polls registries hourly per design decision #8. Default schedule
overridable per-workload via keel.sh/pollSchedule annotation.
- New Kyverno ClusterPolicy `inject-keel-annotations` mutates Deployments,
StatefulSets, and DaemonSets in namespaces labeled `keel.sh/enrolled=true`
with keel.sh/policy=force + trigger=poll + pollSchedule=@every 1h.
- Phase 0 enrolls no namespaces. Phase 1 (next session) labels the
self-hosted set.
- Per-workload opt-out: label `keel.sh/policy: never` (used by rollback
runbook and chrome-service-style deliberate pins).
- Keel namespace excluded from the mutate — supervisor self-update has
too-bad a failure mode (decision #11).
- AGENTS.md: KYVERNO_LIFECYCLE_V2 marker convention added for the
ignore_changes block enrolled workloads need.
- .claude/CLAUDE.md: docker-images rule flagged as transitional.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
This commit is contained in:
parent
9476649539
commit
020f62555b
7 changed files with 679 additions and 1 deletions
|
|
@ -37,7 +37,7 @@ Violations cause state drift, which causes future applies to break or silently r
|
|||
- **Anti-AI**: on by default when `auth = "none"` or `auth = "app"` (no Authentik to discourage bots); redundant on `"required"` and `"public"`.
|
||||
- **DNS**: `dns_type = "proxied"` (Cloudflare CDN) or `"non-proxied"` (direct A/AAAA). DNS records are auto-created — no need to edit `config.tfvars`. Smoke-test target: `echo.viktorbarzin.me` (auth=public, header-reflecting backend).
|
||||
- **Anubis PoW challenge** (`modules/kubernetes/anubis_instance/`): per-site reverse proxy that issues a 30-day JWT cookie after a tiny PoW solve. Use for **public, content-bearing sites without app-level auth** (blog, docs, wikis, static landing pages). Pattern: declare `module "anubis" { source = "../../modules/kubernetes/anubis_instance"; name = "X"; namespace = ...; target_url = "http://<backend>.<ns>.svc.cluster.local" }`, then in `ingress_factory` set `service_name = module.anubis.service_name`, `port = module.anubis.service_port`, `anti_ai_scraping = false`. Shared ed25519 key in Vault `secret/viktor` -> `anubis_ed25519_key`; cookie scoped to `viktorbarzin.me` so one solve covers all Anubis-fronted subdomains. **DO NOT put Anubis in front of Git/API/WebDAV/CLI endpoints** — clients without JS can't solve PoW. **Replicas default to 1** because Anubis stores in-flight challenges in process memory; a challenge issued by pod A and solved against pod B errors with `store: key not found` (HTTP 500). Bumping replicas requires wiring a shared Redis store (TODO). For path-level carve-outs (e.g. wrongmove has `/` behind Anubis but `/api` direct), declare a second `ingress_factory` with `ingress_path = ["/api"]` pointing at the bare backend service. Active on: blog, www, kms, travel, f1, cc, json, pb (privatebin), home (homepage), wrongmove (UI only). See `.claude/reference/patterns.md` "Anti-AI Scraping" for full layering.
|
||||
- **Docker images**: Always build for `linux/amd64`. Use 8-char git SHA tags — `:latest` causes stale pull-through cache.
|
||||
- **Docker images**: Always build for `linux/amd64`. SHA-tag rule is being phased out — see `docs/plans/2026-05-16-auto-upgrade-apps-{design,plan}.md`. New model: CI pushes `:latest` (optionally also `:<8-char-sha>` for traceability), Keel polls and triggers rollouts. Cache-staleness concern from the old rule is resolved at the nginx layer (URL-split — manifests pass through, blobs cached). Until Phase 1 of the migration completes (per the plan), follow the SHA-tag rule for new services to match existing pattern.
|
||||
- **Private registry**: `forgejo.viktorbarzin.me/viktor/<name>` (Forgejo packages, OAuth-style PAT auth). Use `image: forgejo.viktorbarzin.me/viktor/<name>:<tag>` + `imagePullSecrets: [{name: registry-credentials}]`. Kyverno auto-syncs the Secret to all namespaces. Containerd `hosts.toml` on every node redirects to in-cluster Traefik LB `10.0.20.200` to avoid hairpin NAT. Push-side: viktor PAT in Vault `secret/ci/global/forgejo_push_token` (Forgejo container packages are scoped per-user; only the package owner can push, ci-pusher cannot write to viktor/*). Pull-side: cluster-puller PAT in Vault `secret/viktor/forgejo_pull_token`. Retention CronJob (`forgejo-cleanup` in `forgejo` ns, daily 04:00) keeps newest 10 versions + always `:latest`; integrity probed every 15min by `forgejo-integrity-probe` in `monitoring` ns (catalog walk + manifest HEAD on every blob). See `docs/plans/2026-05-07-forgejo-registry-consolidation-{design,plan}.md` for the migration history. Pull-through caches for upstream registries (DockerHub, GHCR, Quay, k8s.gcr, Kyverno) stay on the registry VM at `10.0.20.10` ports 5000/5010/5020/5030/5040 — the old port-5050 R/W private registry was decommissioned 2026-05-07.
|
||||
- **LinuxServer.io containers**: `DOCKER_MODS` runs apt-get on every start — bake slow mods into a custom image (`RUN /docker-mods || true` then `ENV DOCKER_MODS=`). Set `NO_CHOWN=true` to skip recursive chown that hangs on NFS mounts.
|
||||
- **Node memory changes**: When changing VM memory on any k8s node, update kubelet `systemReserved`, `kubeReserved`, and eviction thresholds accordingly. Config: `/var/lib/kubelet/config.yaml`. Template: `stacks/infra/main.tf`. Current values: systemReserved=512Mi, kubeReserved=512Mi, evictionHard=500Mi, evictionSoft=1Gi.
|
||||
|
|
|
|||
31
AGENTS.md
31
AGENTS.md
|
|
@ -154,6 +154,37 @@ lifecycle {
|
|||
|
||||
**Audit**: `rg "KYVERNO_LIFECYCLE_V1" stacks/ | wc -l` — should grow (never shrink). Add the marker to every new pod-owning resource. The `_template/main.tf.example` stub shows the canonical form.
|
||||
|
||||
### `# KYVERNO_LIFECYCLE_V2` — Keel auto-update annotations
|
||||
|
||||
When a namespace is labeled `keel.sh/enrolled=true`, the `inject-keel-annotations` ClusterPolicy (`stacks/kyverno/modules/kyverno/keel-annotations.tf`) injects three annotations on every Deployment / StatefulSet / DaemonSet:
|
||||
|
||||
```
|
||||
keel.sh/policy: force
|
||||
keel.sh/trigger: poll
|
||||
keel.sh/pollSchedule: "@every 1h"
|
||||
```
|
||||
|
||||
To suppress the resulting Terraform drift, **enrolled workloads** must extend their `ignore_changes` block:
|
||||
|
||||
```hcl
|
||||
lifecycle {
|
||||
ignore_changes = [
|
||||
spec[0].template[0].spec[0].dns_config, # KYVERNO_LIFECYCLE_V1
|
||||
metadata[0].annotations["keel.sh/policy"],
|
||||
metadata[0].annotations["keel.sh/trigger"],
|
||||
metadata[0].annotations["keel.sh/pollSchedule"], # KYVERNO_LIFECYCLE_V2
|
||||
]
|
||||
}
|
||||
```
|
||||
|
||||
The V2 snippet is added **per workload** as namespaces are phase-enrolled — not as a mass sweep. Workloads in un-enrolled namespaces do not receive the annotation and don't need the V2 block.
|
||||
|
||||
Per-workload opt-out: add the label `keel.sh/policy: never` on the Deployment metadata (not pod template); the policy's `exclude` clause respects it, no annotation gets injected, no `ignore_changes` needed.
|
||||
|
||||
**Audit**: `rg "KYVERNO_LIFECYCLE_V2" stacks/` — count should equal the number of enrolled workloads.
|
||||
|
||||
**Design context**: `docs/plans/2026-05-16-auto-upgrade-apps-{design,plan}.md`.
|
||||
|
||||
## Tier System
|
||||
`0-core` | `1-cluster` | `2-gpu` | `3-edge` | `4-aux` — Kyverno auto-generates LimitRange + ResourceQuota per namespace based on tier label.
|
||||
- Containers without explicit `resources {}` get default limits (256Mi for edge/aux — causes OOMKill for heavy apps)
|
||||
|
|
|
|||
165
docs/plans/2026-05-16-auto-upgrade-apps-design.md
Normal file
165
docs/plans/2026-05-16-auto-upgrade-apps-design.md
Normal file
|
|
@ -0,0 +1,165 @@
|
|||
# Auto-Upgrade Apps Design
|
||||
|
||||
**Date**: 2026-05-16
|
||||
**Status**: Approved (brainstorm + grill complete; implementation pending)
|
||||
|
||||
## Problem
|
||||
|
||||
Three constraints in tension across the cluster's ~70 services:
|
||||
|
||||
1. **Keep apps at latest.** Most services drift behind upstream; manual bumps don't scale.
|
||||
2. **Stay Terraform-compatible.** Image refs live in `.tf`; we want declarative source of truth.
|
||||
3. **Don't let the pull-through cache serve stale `:latest`.** Cache layer must not lie about what `:latest` means today.
|
||||
|
||||
The previous `Diun → n8n → Service Upgrade Agent` flow handled (1) via changelog-reviewed PR bumps for third-party. Self-hosted services have inconsistent CI: 1 of 11 fully wired (CI builds + pushes + rolls out), 6 partially wired (build but no rollout trigger), 4 with no CI at all. Self-hosted services typically pull `forgejo.viktorbarzin.me/viktor/<name>:<8-char-sha>` with Terraform tracking each SHA in `var.image_tag`.
|
||||
|
||||
The user wants to simplify by retiring the changelog-review agent and moving to a pure "latest, always" model, with the cache freshness concern handled at the cache layer (already done — see Architecture §1).
|
||||
|
||||
## Decisions
|
||||
|
||||
| # | Decision | Notes |
|
||||
|---|----------|-------|
|
||||
| 1 | **Auto-roll for everything** (no PR-bump gate) | Retires the Service Upgrade Agent; Diun's role narrows to notification only |
|
||||
| 2 | **Actuator: Keel** ([keel.sh](https://keel.sh)) | Annotation-driven Deployment/StatefulSet/DaemonSet auto-update operator |
|
||||
| 3 | **Tag scheme: `:latest` where it exists, `:major` where it doesn't, glob+`ignore_changes` last resort** | `keel.sh/policy: force` for `:latest` / `:major`; tag string stays in Terraform |
|
||||
| 4 | **Opt-out-pure (no skip-list)** | Every workload auto-rolls, including Vault, CNPG, operators, CNI, CSI. User accepts recoverability risk |
|
||||
| 5 | **Phased rollout (9 phases)** | Low-risk → bootstrap. Catch up to latest as we phase in. Each phase soaks ~1 week |
|
||||
| 6 | **Per-phase: single combined PR** | Switch image refs to floating tag + add to Kyverno mutate allowlist in same commit |
|
||||
| 7 | **Diun is the audit source for catch-up** | Existing 6h-poll already reports outdated images; export as worklist per phase |
|
||||
| 8 | **Polling, hourly** (`@every 1h`) | Not webhooks — single mechanism, all registries supported |
|
||||
| 9 | **Rollback: `kubectl rollout undo` → pin in Terraform → add `keel.sh/policy: never`** | (c) from grill: immediate undo, durable Terraform pin within ≤1h before next Keel poll |
|
||||
| 10 | **Implementation: Kyverno cluster-wide mutate** | One `ClusterPolicy` injects Keel annotations; phase boundary = `NamespaceSelector` allowlist |
|
||||
| 11 | **Keel exempt from its own mutate** | One-line `NamespaceSelector` exclusion. Supervisor self-update has uniquely bad failure mode |
|
||||
| 12 | **Uniform CI model for all self-hosted** | CI builds + pushes `:latest`, Keel polls and rolls. No per-repo `kubectl set image` step. Retires the GHA-migrated SHA-tag flow (memory id=388) |
|
||||
|
||||
## Architecture
|
||||
|
||||
### 1. Cache freshness — already correct
|
||||
|
||||
Pull-through cache at `10.0.20.10` already splits caching by URL at the nginx layer:
|
||||
|
||||
- `location ~ /v2/.*/blobs/` → `proxy_cache_valid 200 24h` — blobs cached (content-addressed, immutable)
|
||||
- `location /v2/` (manifests) → pass through, no cache
|
||||
|
||||
Combined with `registry.proxy.ttl: 0` at the docker-registry layer, mutable manifests revalidate against upstream on every pull. **No cache changes needed for this design.** The CLAUDE.md note "Use 8-char git SHA tags — `:latest` causes stale pull-through cache" predates the nginx URL-split fix and should be updated as part of this work.
|
||||
|
||||
### 2. Detection — Keel polls upstream
|
||||
|
||||
Keel runs as a Deployment in its own namespace. Every annotated workload polls its registry hourly (Keel-managed; configurable per workload). On detection of a new digest under the watched tag:
|
||||
|
||||
- `keel.sh/policy: force` (for mutable tags `:latest`, `:16`, `:7`, etc.) → trigger Deployment update (pod template hash changes → restart)
|
||||
- `keel.sh/policy: minor` / `major` / `glob` (only for images that publish neither `:latest` nor a stable floating tag) → rewrite tag string on the Deployment; requires `lifecycle { ignore_changes = [...image] }`
|
||||
|
||||
### 3. Application — kubelet pull through the cache
|
||||
|
||||
When Keel triggers restart:
|
||||
|
||||
1. kubelet asks the cache (via containerd hosts.toml) for `image:tag` manifest.
|
||||
2. nginx passes the manifest request through to the docker-registry layer.
|
||||
3. docker-registry (with `proxy.ttl: 0`) passes through to upstream.
|
||||
4. Upstream returns current digest.
|
||||
5. kubelet pulls blobs (mostly cached at nginx layer; new blobs from upstream).
|
||||
6. New pod runs new image.
|
||||
|
||||
### 4. Annotation injection — Kyverno mutate
|
||||
|
||||
Single `ClusterPolicy` adds these annotations to every Deployment / StatefulSet / DaemonSet in opted-in namespaces:
|
||||
|
||||
```yaml
|
||||
metadata:
|
||||
annotations:
|
||||
keel.sh/policy: force
|
||||
keel.sh/trigger: poll
|
||||
keel.sh/pollSchedule: "@every 1h"
|
||||
```
|
||||
|
||||
Phase = a `match.any[].resources.namespaces` list. Phase advance = append namespaces. Keel namespace is excluded.
|
||||
|
||||
### 5. Terraform drift handling
|
||||
|
||||
Existing convention (`# KYVERNO_LIFECYCLE_V1` marker) handles `dns_config` injection. We extend with a new marker:
|
||||
|
||||
```hcl
|
||||
lifecycle {
|
||||
ignore_changes = [
|
||||
spec[0].template[0].spec[0].dns_config, # KYVERNO_LIFECYCLE_V1
|
||||
metadata[0].annotations["keel.sh/policy"],
|
||||
metadata[0].annotations["keel.sh/trigger"],
|
||||
metadata[0].annotations["keel.sh/pollSchedule"], # KYVERNO_LIFECYCLE_V2
|
||||
]
|
||||
}
|
||||
```
|
||||
|
||||
This is added per workload as we phase in. Mechanical, grep-able.
|
||||
|
||||
## Phase ordering
|
||||
|
||||
| Phase | Set | Rationale |
|
||||
|-------|-----|-----------|
|
||||
| 0 | Foundation (Keel install, Kyverno ClusterPolicy with empty allowlist) | Build infra without enrolling anything |
|
||||
| 1 | Self-hosted (forgejo-hosted: ~11 services) | We own the code; failures are easy to diagnose |
|
||||
| 2 | Stateless third-party web apps (linkwarden, postiz, affine, etc.) | No migrations |
|
||||
| 3 | Exporters, sidecars, utilities | Stateless |
|
||||
| 4 | Stateful-but-tolerant (Grafana, Prometheus, etc.) | Restart-safe state |
|
||||
| 5 | State-coupled with migrations (Nextcloud, Forgejo, paperless-ngx, mailserver) | Schema-migration risk |
|
||||
| 6 | Authentik | Auth outage |
|
||||
| 7 | Operators (cnpg-operator, ESO, kured, descheduler) | Operator skew |
|
||||
| 8 | Critical infra (Calico, proxmox-csi, nfs-csi, traefik, metallb) | Node-level outage potential (memory id=390: 26h Calico cascade) |
|
||||
| 9 | Bootstrap (Vault, CNPG PG cluster, mysql-standalone) | Lose recoverability if broken |
|
||||
|
||||
Per-phase: combined PR → apply (catch-up rolls happen) → soak 1 week → next phase. If a service breaks repeatedly, apply rollback runbook (decision #9) and proceed; re-enroll later or leave pinned.
|
||||
|
||||
## Risk register
|
||||
|
||||
| Risk | Likelihood | Impact | Mitigation |
|
||||
|------|-----------|--------|------------|
|
||||
| Bad upstream image rolls into prod | High | Service-level outage | Existing alerts (`KubePodCrashLooping`, `KubeletImagePullErrors`, `PodsStuckContainerCreating`); rollback runbook (decision #9) |
|
||||
| Catch-up rollout overwhelms cache | Medium | ImagePullBackOff cascade (memory id=603) | Rate-limit catch-up to ~5 rollouts/6h via `-target=` per phase; same pacing as retired Service Upgrade Agent (memory id=612) |
|
||||
| Calico / CSI auto-roll cascades (memory id=390: 26h outage) | Low-Medium | Cluster-level outage | Phase 8 is intentionally late; user opted into the risk; rollback to pinned chart version via Terraform |
|
||||
| Vault auto-rolls to broken image | Low | Loss of secrets sync; 43 ExternalSecrets stop reconciling | Phase 9 last; Tier 0 SOPS state allows manual recovery |
|
||||
| CNPG PG cluster auto-rolls to broken image | Low | Tier 1 Terraform state inaccessible; 105 stacks can't apply | Phase 9 last; Tier 0 stack `cnpg` is bootstrap-capable |
|
||||
| Helm-atomic-trap services (memory id=981) | Medium | `terraform apply` hangs in pending-rollback | Identify `helm_release` services with `atomic = true`; either remove atomic or skip from Keel |
|
||||
| Keel itself rolls to broken version | Low | Supervisor down; no auto-rolls until manual pin | Decision #11: exempt Keel from mutate |
|
||||
| Terraform drift after Kyverno injects annotation | High at first | Spurious diffs on every plan | KYVERNO_LIFECYCLE_V2 marker (Architecture §5); applied incrementally per phase |
|
||||
|
||||
## What we give up
|
||||
|
||||
- **Terraform no longer tracks deployed version.** Image refs in `.tf` say `:latest` or `:16`, but the running digest is whatever Keel pulled. To know what's running: `kubectl describe pod`. This is a deliberate trade — the previous SHA-pinned flow tracked version in TF but required N stack edits per deploy.
|
||||
- **No changelog review before rollout.** The Service Upgrade Agent's risk classification is gone. We rely on alerts to catch breakage post-deploy, not prevent it.
|
||||
- **CLAUDE.md SHA-tag rule is reversed for this design.** The "use 8-char git SHA tags" rule predates the nginx URL-split fix. New rule (post-rollout): "use floating tags + Keel annotation" — to be updated in both `infra/.claude/CLAUDE.md` and the repo-root `CLAUDE.md` once Phase 1 is stable.
|
||||
|
||||
## Decisions resolved post-grill
|
||||
|
||||
### Q1 — Uniform CI model for ALL self-hosted (resolved 2026-05-16)
|
||||
|
||||
Every self-hosted service moves to the same shape:
|
||||
|
||||
```
|
||||
CI (GHA or Woodpecker) → build → push :latest (optionally also :<SHA> for traceability) → done
|
||||
Keel → poll registry → detect new digest → trigger rollout
|
||||
```
|
||||
|
||||
The 10 GHA-migrated repos (memory id=388: Website, k8s-portal, f1-stream, claude-memory-mcp, apple-health-data, audiblez-web, plotting-book, insta2spotify, audiobook-search, council-complaints) drop the `Woodpecker API → kubectl set image` step. Their `.woodpecker/deploy.yml` and `.woodpecker/build-fallback.yml` files become obsolete; remove during Phase 1.
|
||||
|
||||
Terraform image refs for all self-hosted: `<registry>/<repo>:latest` (with `${var.image_tag}` defaulting to `"latest"` where the variable exists).
|
||||
|
||||
### Q2 — No-CI self-hosted services (resolution: uniform participation)
|
||||
|
||||
| Service | Action |
|
||||
|---------|--------|
|
||||
| `wealthfolio` | Switch Terraform to upstream `wealthfolio/wealthfolio:latest` (DockerHub). No CI needed. |
|
||||
| `chrome-service` | Verify whether `:v4` is a deliberate pin. If yes → tag stays, add `keel.sh/policy: never` label. If no → switch to `:latest` or `:major`. Investigate during Phase 1 prep. |
|
||||
| `beadboard` (used by `beads-server`) | Add minimal Woodpecker CI: build on push → push `:latest`. User-owned. |
|
||||
| `freedify` | Add minimal Woodpecker CI: build on push → push `:latest`. User-owned. |
|
||||
|
||||
## Open questions (still need resolution before Phase 1)
|
||||
|
||||
1. **`helm_release atomic = true` services**: count and identify before Phase 1. Either remove `atomic` (preferred — eliminates the memory id=981 trap), or skip from Kyverno mutate via per-namespace exclusion. Survey command: `grep -rn 'atomic.*true' infra/stacks/ infra/modules/`.
|
||||
|
||||
## Out of scope
|
||||
|
||||
- Cache TTL changes — current config is already correct (nginx URL-split).
|
||||
- Webhook-based Keel triggers — polling is sufficient for this cadence.
|
||||
- Replacing Diun — kept for notification visibility into new tags not yet under Keel annotation (during phase rollout).
|
||||
- Keel approval gate (`keel.sh/approvals: N`) — user wants unattended auto-roll.
|
||||
- Keel auto-rollback on health-check failure — out of scope for v1; revisit if breakage rate is high.
|
||||
322
docs/plans/2026-05-16-auto-upgrade-apps-plan.md
Normal file
322
docs/plans/2026-05-16-auto-upgrade-apps-plan.md
Normal file
|
|
@ -0,0 +1,322 @@
|
|||
# Auto-Upgrade Apps Implementation Plan
|
||||
|
||||
> **For Claude:** REQUIRED SUB-SKILL: Use superpowers:executing-plans to implement this plan task-by-task.
|
||||
|
||||
**Goal:** Move the cluster from a mix of pinned-SHA / pinned-semver / ad-hoc `:latest` references to a Keel-driven auto-update model where every workload tracks `:latest` (or a chosen `:major` floating tag) and rolls automatically when upstream advances.
|
||||
|
||||
**Architecture:** Kyverno cluster-wide `ClusterPolicy` mutates Deployments / StatefulSets / DaemonSets in opted-in namespaces with Keel annotations (`keel.sh/policy: force`, `keel.sh/trigger: poll`, `keel.sh/pollSchedule: @every 1h`). Keel polls registries, triggers rollout on new digest. kubelet pulls fresh manifest via the nginx URL-split cache (manifests passthrough, blobs cached). Phase advance = expand the `NamespaceSelector` allowlist.
|
||||
|
||||
**Tech Stack:** Keel, Kyverno, Terraform / Terragrunt, Helm, Diun (notification only), nginx, docker/distribution
|
||||
|
||||
**Design doc:** `docs/plans/2026-05-16-auto-upgrade-apps-design.md`
|
||||
|
||||
**Key context:**
|
||||
- Cache is already correctly configured (nginx URL-split + `proxy.ttl: 0`). No cache changes needed.
|
||||
- Per-stack `lifecycle.ignore_changes` is already required for the existing `dns_config` Kyverno mutation (KYVERNO_LIFECYCLE_V1 convention). This plan extends it with a V2 marker for Keel annotations.
|
||||
- Service Upgrade Agent (Diun → n8n → claude bumps tfvars) is retired by this design. n8n workflow + supporting scripts are removed once Phase 9 completes.
|
||||
- CLAUDE.md "use 8-char git SHA tags" rule is reversed by this design (see Open Q1 in design doc).
|
||||
|
||||
---
|
||||
|
||||
## Phase 0 — Foundation
|
||||
|
||||
### Task 0.1: Resolve remaining open question
|
||||
|
||||
Q1 and Q2 from the design doc are resolved (uniform `:latest` + Keel model for all self-hosted; per-service plan for no-CI services).
|
||||
|
||||
Remaining open question:
|
||||
|
||||
**Helm-atomic services.** Survey:
|
||||
```bash
|
||||
grep -rn 'atomic.*true' /home/wizard/code/infra/stacks/ /home/wizard/code/infra/modules/
|
||||
```
|
||||
|
||||
For each match: either remove `atomic = true` (preferred) or add the namespace to a Kyverno exclusion list. Document inline before Phase 1 proceeds.
|
||||
|
||||
---
|
||||
|
||||
### Task 0.2: Create the Keel stack
|
||||
|
||||
**Files:**
|
||||
- Create: `stacks/keel/terragrunt.hcl`
|
||||
- Create: `stacks/keel/main.tf`
|
||||
- Create: `stacks/keel/variables.tf`
|
||||
- Create: `stacks/keel/modules/keel/main.tf`
|
||||
|
||||
**Step 1:** Add `keel` to `terragrunt.hcl` `locals.tier0_stacks` — **NO**. Keel is Tier 1 (depends on Kyverno + Keel image registry access). Keep it in Tier 1.
|
||||
|
||||
**Step 2:** Deploy via Helm chart `keel-hq/keel` (verify current version via context7 before pinning).
|
||||
|
||||
Key Helm values:
|
||||
- `polling.enabled: true`
|
||||
- `helmProvider.enabled: false` (we use annotations, not Helm hooks)
|
||||
- `notifications.slack.enabled: true` with channel `#deployments` (verify channel exists)
|
||||
- Registry credentials: mount Forgejo PAT from Vault via ExternalSecret (`secret/viktor/forgejo_pull_token`).
|
||||
|
||||
**Step 3:** Verify Keel can authenticate to all five registries (Docker Hub, ghcr, quay, k8s.io, kyverno via the local cache; Forgejo direct).
|
||||
|
||||
**Acceptance:**
|
||||
- `kubectl -n keel get pod` shows Keel Ready.
|
||||
- `kubectl -n keel logs deploy/keel | grep registry` shows successful manifest queries.
|
||||
|
||||
---
|
||||
|
||||
### Task 0.3: Author the Kyverno ClusterPolicy
|
||||
|
||||
**Files:**
|
||||
- Create: `stacks/kyverno/modules/kyverno/keel-annotations.tf` (or extend `security-policies.tf`)
|
||||
|
||||
ClusterPolicy `inject-keel-annotations`:
|
||||
|
||||
```yaml
|
||||
apiVersion: kyverno.io/v1
|
||||
kind: ClusterPolicy
|
||||
metadata:
|
||||
name: inject-keel-annotations
|
||||
spec:
|
||||
background: true
|
||||
rules:
|
||||
- name: add-keel-annotation
|
||||
match:
|
||||
any:
|
||||
- resources:
|
||||
kinds: [Deployment, StatefulSet, DaemonSet]
|
||||
namespaces: [] # populated per phase
|
||||
exclude:
|
||||
any:
|
||||
- resources:
|
||||
namespaces: ["keel"] # decision #11
|
||||
- resources:
|
||||
# Workloads can opt out by setting this label
|
||||
selector:
|
||||
matchLabels:
|
||||
keel.sh/policy: never
|
||||
mutate:
|
||||
patchStrategicMerge:
|
||||
metadata:
|
||||
annotations:
|
||||
+(keel.sh/policy): force
|
||||
+(keel.sh/trigger): poll
|
||||
+(keel.sh/pollSchedule): "@every 1h"
|
||||
```
|
||||
|
||||
- `+()` syntax adds only if not present (preserves per-workload overrides).
|
||||
- `exclude.selector.matchLabels[keel.sh/policy=never]` is the per-workload escape hatch (used during rollback per decision #9).
|
||||
|
||||
**Step 2:** Initially deploy with `namespaces: []` — policy exists but matches nothing.
|
||||
|
||||
**Acceptance:**
|
||||
- `kubectl get clusterpolicy inject-keel-annotations` shows Ready.
|
||||
- `kubectl get deploy -A -o yaml | grep keel.sh/policy` shows no matches yet (empty allowlist).
|
||||
|
||||
---
|
||||
|
||||
### Task 0.4: Define the KYVERNO_LIFECYCLE_V2 marker convention
|
||||
|
||||
**Files:**
|
||||
- Modify: `AGENTS.md` — add the V2 snippet to the "Kyverno Drift Suppression" section
|
||||
- Modify: `.claude/CLAUDE.md` — reference the V2 marker
|
||||
|
||||
Snippet to copy-paste:
|
||||
|
||||
```hcl
|
||||
lifecycle {
|
||||
ignore_changes = [
|
||||
spec[0].template[0].spec[0].dns_config, # KYVERNO_LIFECYCLE_V1
|
||||
metadata[0].annotations["keel.sh/policy"],
|
||||
metadata[0].annotations["keel.sh/trigger"],
|
||||
metadata[0].annotations["keel.sh/pollSchedule"], # KYVERNO_LIFECYCLE_V2
|
||||
]
|
||||
}
|
||||
```
|
||||
|
||||
Backfill order: per-phase, only on workloads about to be enrolled. Not a mass sweep.
|
||||
|
||||
---
|
||||
|
||||
## Phase 1 — Self-hosted (uniform model)
|
||||
|
||||
**Set:** all self-hosted services. Three sub-categories:
|
||||
|
||||
- **Woodpecker-build-only (6):** `claude-agent-service`, `fire-planner`, `job-hunter`, `payslip-ingest`, `recruiter-responder`, `claude-memory-mcp`.
|
||||
- **GHA-migrated (10, per memory id=388):** Website, k8s-portal, f1-stream, claude-memory-mcp, apple-health-data, audiblez-web, plotting-book, insta2spotify, audiobook-search, council-complaints. (Note: claude-memory-mcp appears in both lists — verify.)
|
||||
- **No-CI (4, per design Q2):** `wealthfolio` (→ upstream), `chrome-service` (verify pin intent), `beadboard` (add CI), `freedify` (add CI).
|
||||
- **Already-uniform (1):** `kms-website` — already pushes `:latest` AND SHA; just needs Keel annotation.
|
||||
|
||||
### Task 1.1: Audit current image refs
|
||||
|
||||
```bash
|
||||
grep -rE 'image\s*=\s*"(forgejo\.viktorbarzin\.me|viktorbarzin)' /home/wizard/code/infra/stacks/ | sort
|
||||
```
|
||||
|
||||
Tabulate per service: current tag, CI type (GHA / Woodpecker / none), action needed.
|
||||
|
||||
### Task 1.2: Per-service uniform conversion
|
||||
|
||||
For each Woodpecker-build-only service:
|
||||
1. Edit Terraform: `local.image_tag` / `var.image_tag` → `"latest"`.
|
||||
2. Add the KYVERNO_LIFECYCLE_V2 snippet (annotations ignore_changes).
|
||||
3. Verify `.woodpecker.yml` pushes `:latest` on every build (most do via `auto_tag: true`).
|
||||
|
||||
For each GHA-migrated service:
|
||||
1. Edit Terraform: switch `image_tag` from SHA reference to `"latest"`.
|
||||
2. Add the KYVERNO_LIFECYCLE_V2 snippet.
|
||||
3. Edit `.github/workflows/build-and-deploy.yml`: push `:latest` (in addition to `:<8-char-sha>` for traceability). Remove the Woodpecker API POST step.
|
||||
4. Delete `.woodpecker/deploy.yml` and `.woodpecker/build-fallback.yml` from each repo (no longer needed).
|
||||
5. Remove the Woodpecker repo config for these repos from Terraform if applicable.
|
||||
|
||||
For each no-CI service:
|
||||
- `wealthfolio`: change Terraform image to `wealthfolio/wealthfolio:latest` (upstream DockerHub). Validate the image starts cleanly.
|
||||
- `chrome-service`: check git blame on the `:v4` pin. If deliberate → label `keel.sh/policy: never`. If accidental → bump to upstream `:latest`.
|
||||
- `beadboard`, `freedify`: write a minimal `.woodpecker.yml` (single build step pushing to Forgejo `:latest`). Trigger an initial build to populate `:latest`.
|
||||
|
||||
For `kms-website`: only add the Keel annotation; CI changes optional.
|
||||
|
||||
### Task 1.3: Add Phase 1 namespaces to Kyverno allowlist
|
||||
|
||||
Edit `stacks/kyverno/modules/kyverno/keel-annotations.tf`:
|
||||
|
||||
```yaml
|
||||
namespaces:
|
||||
- claude-agent-service
|
||||
- fire-planner
|
||||
- job-hunter
|
||||
- payslip-ingest
|
||||
- recruiter-responder
|
||||
- claude-memory-mcp
|
||||
- kms-website
|
||||
# GHA-migrated set:
|
||||
- website # or whatever the namespace is named per repo
|
||||
- k8s-portal
|
||||
- f1-stream
|
||||
- apple-health-data
|
||||
- audiblez-web
|
||||
- plotting-book
|
||||
- insta2spotify
|
||||
- audiobook-search
|
||||
- council-complaints
|
||||
# No-CI set:
|
||||
- beads-server
|
||||
- chrome-service
|
||||
- freedify
|
||||
- wealthfolio
|
||||
```
|
||||
|
||||
Verify each namespace name from `kubectl get ns` before locking in (some may differ from the repo name).
|
||||
|
||||
Apply. Watch `kubectl get deploy -n <ns> -o yaml | grep keel.sh` confirm annotations injected. Watch Keel logs for first poll cycle picking up the workloads.
|
||||
|
||||
### Task 1.4: Soak
|
||||
|
||||
1 week. Monitor:
|
||||
- Slack `#deployments` for Keel rollout notifications.
|
||||
- `KubePodCrashLooping` alerts.
|
||||
- Manual `kubectl rollout status` on each service after a Keel-triggered rollout.
|
||||
|
||||
If any service breaks repeatedly: apply rollback runbook (decision #9), record the service in a "pin list" with reason, proceed.
|
||||
|
||||
**Acceptance:**
|
||||
- All 7 services running latest digests within 24h of Phase 1 apply.
|
||||
- No CrashLooping persisting >1h.
|
||||
- No more than 2 services pinned-out during the soak week.
|
||||
|
||||
---
|
||||
|
||||
## Phase 2 — Stateless third-party web apps
|
||||
|
||||
**Set:** linkwarden, postiz, affine, isponsorblocktv, audiobookshelf, freshrss, tandoor, immich (verify it qualifies — has external DB so app-restart is safe), excalidraw, hackmd, send, jsoncrack, sparkyfitness, etc. (~15-20 services — full list from `kubectl get deploy -A` filtered against the phase-1 set + skip-bucket).
|
||||
|
||||
### Task 2.1: Audit current tags via Diun
|
||||
|
||||
```bash
|
||||
# Diun's REST API or UI exports a "new tags available" report
|
||||
# Use as the per-service decision source
|
||||
```
|
||||
|
||||
For each service, pick floating tag:
|
||||
- `:latest` if upstream publishes it and it's stable.
|
||||
- `:<major>` (e.g. `:2`, `:v3`) if `:latest` is unreliable.
|
||||
- `glob` + `ignore_changes` as last resort.
|
||||
|
||||
### Task 2.2: Catch-up PR
|
||||
|
||||
Single combined PR:
|
||||
- Per-stack: switch image tag from pinned semver to chosen floating tag (Diun-informed).
|
||||
- Per-stack: add KYVERNO_LIFECYCLE_V2 snippet.
|
||||
- Append Phase 2 namespaces to Kyverno allowlist.
|
||||
|
||||
Apply with `-target=` per stack to pace rollouts (≤5 per hour to avoid cache burst — memory id=603).
|
||||
|
||||
### Task 2.3: Soak — 1 week, same monitoring as Phase 1.
|
||||
|
||||
---
|
||||
|
||||
## Phases 3–9 — same template
|
||||
|
||||
For each phase, repeat:
|
||||
|
||||
1. Define the set (precise namespace list).
|
||||
2. Audit current tags (Diun + grep).
|
||||
3. Pick floating tag per service.
|
||||
4. Combined PR: image-ref change + lifecycle snippet + Kyverno allowlist update.
|
||||
5. Apply paced (≤5/hr).
|
||||
6. Soak 1 week. Pin-out any service that breaks repeatedly.
|
||||
|
||||
Set definitions per phase: see design doc Phase Ordering table.
|
||||
|
||||
**Special-handling phases:**
|
||||
|
||||
- **Phase 7 (Operators).** Restart of an operator can confuse its managed CRD reconciles. Use `imagePullPolicy: Always` + readiness check before declaring stable. Investigate cnpg-operator and ESO restart behavior in advance.
|
||||
- **Phase 8 (Critical infra).** Calico/CSI DaemonSet rollouts impact each node briefly. Verify `updateStrategy.rollingUpdate.maxUnavailable: 1` on every DaemonSet before enrollment. Memory id=390 (26h Calico-cascade outage) is the cautionary tale.
|
||||
- **Phase 9 (Bootstrap).** Vault, CNPG, mysql-standalone. Coordinate with backup window. Take a fresh snapshot of `/srv/nfs/<db>-backup/` before applying the phase enrollment.
|
||||
|
||||
---
|
||||
|
||||
## Cleanup tasks (after Phase 9 stable)
|
||||
|
||||
### Task C.1: Retire Service Upgrade Agent
|
||||
|
||||
**Files:**
|
||||
- Modify: `stacks/n8n/` — remove the Service Upgrade Agent workflow
|
||||
- Delete: any supporting scripts (`infra/scripts/service-upgrade-*.sh` if they exist)
|
||||
- Modify: `stacks/diun/` — disable webhook notification to n8n (keep Slack notification for visibility)
|
||||
|
||||
### Task C.2: Update CLAUDE.md files
|
||||
|
||||
- Reverse the "use 8-char git SHA tags" rule in `infra/.claude/CLAUDE.md` "Docker images" line.
|
||||
- Reverse same in root `/CLAUDE.md` if duplicated.
|
||||
- Add a new section documenting the Keel model + KYVERNO_LIFECYCLE_V2 snippet.
|
||||
- Update memory via `mcp__claude_memory__memory_update` on entries 388, 612, 604 (CI/CD architecture, Service Upgrade Agent retirement, cache TTL clarification).
|
||||
|
||||
### Task C.3: Add a runbook
|
||||
|
||||
**Files:**
|
||||
- Create: `docs/runbooks/keel-rollback.md`
|
||||
|
||||
Document the rollback flow (decision #9): `kubectl rollout undo` → Terraform pin → annotation `keel.sh/policy: never`.
|
||||
|
||||
### Task C.4: Tidy Diun
|
||||
|
||||
Drop image-pin overrides for MySQL, PostgreSQL, Redis from Diun config (no longer needed since they're Keel-managed; the previous skip was for the retired changelog-agent path).
|
||||
|
||||
---
|
||||
|
||||
## Rollback (whole project)
|
||||
|
||||
If the auto-roll experiment goes badly cluster-wide (multiple cascading failures, repeated outages), revert:
|
||||
|
||||
1. Set Kyverno ClusterPolicy `inject-keel-annotations` to empty `namespaces: []`.
|
||||
2. Existing annotations remain on workloads, but Keel continues to act on them — so also disable Keel: scale `keel` Deployment to 0.
|
||||
3. Pin every workload's Terraform image_tag back to its current running digest (use `kubectl get deploy -A -o jsonpath='{range .items[*]}{.metadata.name}:{.spec.template.spec.containers[0].image}{"\n"}{end}'`).
|
||||
4. Document failure modes in `post-mortems/2026-XX-XX-keel-rollback.md`.
|
||||
5. Reconsider opt-in approach for next iteration.
|
||||
|
||||
---
|
||||
|
||||
## Success criteria
|
||||
|
||||
- All ~70 services running latest within 8 weeks of Phase 0 completion.
|
||||
- Zero unrolled-back outages caused by Keel.
|
||||
- ≤5 services on the "pin list" (i.e. ≥93% auto-roll success rate).
|
||||
- `terragrunt plan` shows no spurious diffs from Kyverno-injected annotations (KYVERNO_LIFECYCLE_V2 working as intended).
|
||||
- Service Upgrade Agent + supporting infra retired.
|
||||
65
stacks/keel/main.tf
Normal file
65
stacks/keel/main.tf
Normal file
|
|
@ -0,0 +1,65 @@
|
|||
# Keel — automated Kubernetes Deployment image updates.
|
||||
# Design: docs/plans/2026-05-16-auto-upgrade-apps-design.md
|
||||
# Plan: docs/plans/2026-05-16-auto-upgrade-apps-plan.md
|
||||
#
|
||||
# Operation: Keel polls each watched workload's registry hourly (default
|
||||
# schedule below; overridable per-workload via keel.sh/pollSchedule).
|
||||
# Detection of a new digest under the watched tag triggers a Deployment
|
||||
# update (pod template hash bump → rolling restart). Workloads opt in by
|
||||
# carrying keel.sh/policy + keel.sh/trigger annotations — those are
|
||||
# injected cluster-wide by the inject-keel-annotations ClusterPolicy
|
||||
# (stacks/kyverno/modules/kyverno/keel-annotations.tf) on namespaces
|
||||
# labeled keel.sh/enrolled=true.
|
||||
|
||||
resource "kubernetes_namespace" "keel" {
|
||||
metadata {
|
||||
name = "keel"
|
||||
labels = {
|
||||
tier = local.tiers.cluster
|
||||
}
|
||||
}
|
||||
lifecycle {
|
||||
# KYVERNO_LIFECYCLE_V1
|
||||
ignore_changes = [metadata[0].labels["goldilocks.fairwinds.com/vpa-update-mode"]]
|
||||
}
|
||||
}
|
||||
|
||||
resource "helm_release" "keel" {
|
||||
name = "keel"
|
||||
namespace = kubernetes_namespace.keel.metadata[0].name
|
||||
repository = "https://charts.keel.sh"
|
||||
chart = "keel"
|
||||
version = "1.0.6"
|
||||
|
||||
# Atomic mitigates partial-deploy state. Keel itself is exempt from
|
||||
# auto-update (Kyverno mutate excludes the keel namespace), so it only
|
||||
# rolls when this stack applies — making atomic safe here.
|
||||
atomic = true
|
||||
|
||||
values = [yamlencode({
|
||||
polling = {
|
||||
enabled = true
|
||||
# Default poll cadence for workloads that don't override per-Deployment
|
||||
# via keel.sh/pollSchedule. Decision #8 in the design doc.
|
||||
defaultSchedule = "@every 1h"
|
||||
}
|
||||
helmProvider = {
|
||||
enabled = false # We use annotations, not Helm hooks
|
||||
}
|
||||
notificationLevel = "info"
|
||||
persistence = {
|
||||
enabled = false
|
||||
}
|
||||
# Keel uses each watched Deployment's own imagePullSecrets to query
|
||||
# its registry. Forgejo creds (`registry-credentials`) are auto-synced
|
||||
# to every namespace by Kyverno already, so Keel pods don't need a
|
||||
# separate pull-secret for their own image (ghcr.io is public).
|
||||
rbac = {
|
||||
enabled = true
|
||||
}
|
||||
resources = {
|
||||
requests = { cpu = "50m", memory = "64Mi" }
|
||||
limits = { memory = "256Mi" }
|
||||
}
|
||||
})]
|
||||
}
|
||||
13
stacks/keel/terragrunt.hcl
Normal file
13
stacks/keel/terragrunt.hcl
Normal file
|
|
@ -0,0 +1,13 @@
|
|||
include "root" {
|
||||
path = find_in_parent_folders()
|
||||
}
|
||||
|
||||
dependency "platform" {
|
||||
config_path = "../platform"
|
||||
skip_outputs = true
|
||||
}
|
||||
|
||||
dependency "kyverno" {
|
||||
config_path = "../kyverno"
|
||||
skip_outputs = true
|
||||
}
|
||||
82
stacks/kyverno/modules/kyverno/keel-annotations.tf
Normal file
82
stacks/kyverno/modules/kyverno/keel-annotations.tf
Normal file
|
|
@ -0,0 +1,82 @@
|
|||
# =============================================================================
|
||||
# Keel Auto-Update Annotation Injector
|
||||
# =============================================================================
|
||||
# Design: infra/docs/plans/2026-05-16-auto-upgrade-apps-design.md
|
||||
# Plan: infra/docs/plans/2026-05-16-auto-upgrade-apps-plan.md
|
||||
#
|
||||
# Mutate policy that adds keel.sh/* annotations to Deployments,
|
||||
# StatefulSets and DaemonSets in *opted-in* namespaces. Opt-in is via a
|
||||
# label on the namespace:
|
||||
#
|
||||
# labels = { "keel.sh/enrolled" = "true" }
|
||||
#
|
||||
# Phase rollout = label more namespaces. No edit to this file per phase.
|
||||
#
|
||||
# Workloads can individually opt out with the label keel.sh/policy=never
|
||||
# (used by the rollback runbook). The keel namespace itself is always
|
||||
# excluded (design decision #11 — supervisor must not auto-update).
|
||||
|
||||
resource "kubernetes_manifest" "policy_inject_keel_annotations" {
|
||||
manifest = {
|
||||
apiVersion = "kyverno.io/v1"
|
||||
kind = "ClusterPolicy"
|
||||
metadata = {
|
||||
name = "inject-keel-annotations"
|
||||
annotations = {
|
||||
"policies.kyverno.io/title" = "Inject Keel Auto-Update Annotations"
|
||||
"policies.kyverno.io/category" = "Automation"
|
||||
"policies.kyverno.io/severity" = "low"
|
||||
"policies.kyverno.io/description" = "Adds keel.sh/policy: force + trigger: poll annotations to workloads in namespaces labeled keel.sh/enrolled=true. Phase rollout per docs/plans/2026-05-16-auto-upgrade-apps-{design,plan}.md."
|
||||
}
|
||||
}
|
||||
spec = {
|
||||
background = true
|
||||
rules = [{
|
||||
name = "add-keel-annotations"
|
||||
match = {
|
||||
any = [{
|
||||
resources = {
|
||||
kinds = ["Deployment", "StatefulSet", "DaemonSet"]
|
||||
namespaceSelector = {
|
||||
matchLabels = {
|
||||
"keel.sh/enrolled" = "true"
|
||||
}
|
||||
}
|
||||
}
|
||||
}]
|
||||
}
|
||||
exclude = {
|
||||
any = [
|
||||
{
|
||||
resources = {
|
||||
namespaces = ["keel"]
|
||||
}
|
||||
},
|
||||
{
|
||||
resources = {
|
||||
selector = {
|
||||
matchLabels = {
|
||||
"keel.sh/policy" = "never"
|
||||
}
|
||||
}
|
||||
}
|
||||
},
|
||||
]
|
||||
}
|
||||
mutate = {
|
||||
patchStrategicMerge = {
|
||||
metadata = {
|
||||
annotations = {
|
||||
# `+(...)` only adds if not present; per-workload overrides win.
|
||||
"+(keel.sh/policy)" = "force"
|
||||
"+(keel.sh/trigger)" = "poll"
|
||||
"+(keel.sh/pollSchedule)" = "@every 1h"
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
}]
|
||||
}
|
||||
}
|
||||
depends_on = [helm_release.kyverno]
|
||||
}
|
||||
Loading…
Add table
Add a link
Reference in a new issue