Re-enables Keel after the 2026-05-26 emergency stop, with a safer default.
Switch Kyverno-injected default from `force + match-tag=true` (proven
unreliable — it rewrote tag strings cluster-wide despite the design intent)
to `patch`, which is semver-parser-bounded:
- Only patch bumps within current major.minor (1.2.3 → 1.2.4, never
1.3.x or 2.x — the parser does the math, not string compare).
- Non-semver tags (`:latest`, `:v4`, `:2`, SHA, `:nightly`) are
IGNORED entirely. No tag rewriting under any code path.
- 151 stale `force` annotations migrated to `patch` cluster-wide
during this apply (anchor `+()` dropped, then re-added).
Live state after this commit:
0 workloads on `force`, 209 on `patch`, 22 on `never`.
Keel deployment back to 1/1 on `:0.21.1`.
Note: 22 workloads with `keel.sh/policy=never` LABEL had their annotation
mutated to `patch` during the migration despite Kyverno's
matchLabels-based exclude rule — appears to be a quirk of
`mutateExistingOnPolicyUpdate` not honoring `selector` excludes. Repatched
all 22 back to `annotation=never` via `kubectl annotate --overwrite`, then
restored the `+(keel.sh/policy)` anchor in the policy so future Kyverno
reconciles preserve them.
Also fixes CI build-cli workflow which was blocked by
`deny-privileged-containers` since wave 1 enforce flip on 2026-05-18:
woodpecker namespace added to the shared security_policy_exclude_namespaces
list (CI pipeline pods `wp-*` run privileged docker builds, legitimate use).
The `default` workflow (terragrunt apply) was already passing — only the
parallel `build-cli` workflow (which builds the infra-cli docker image) was
failing, but it took the overall pipeline status down with it.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Keel was rewriting tag strings (not just digests) despite the
keel.sh/match-tag=true annotation injected by the Kyverno
inject-keel-annotations ClusterPolicy. That annotation was supposed to
constrain Keel to digest-only watches under the deployment's CURRENT tag.
It didn't. Casualties confirmed today (live image rewritten to a lower
version): uptime-kuma (:2 → :1, 4h CrashLoopBackOff because v1 boots into
SQLite mode and can't read the v2 db-config.json → MariaDB store);
n8n (:1.80.5 → :0.1.2, silent — EEXIST mkdir /root/.n8n loop);
beads-server/dolt-workbench (:0.3.73 → :0.1.0, GraphQL schema mismatch on
addDatabaseConnection); wealthfolio (:3.2.1 → :2.0 → :3.2 string truncate);
plus historical ones previously fixed (claude-memory :71b32438 → :17,
forgejo 11.0.14 → 1.18, onlyoffice 9.3.1 → 4.0.0.9, shlink 5.0.2 → 1.16.1).
Changes:
* stacks/keel: replicaCount = 0 in the helm values. Pod went from 1/1 to
0/0. Keep off until either match-tag is root-caused or every enrolled
workload migrates to a content-addressed (SHA) pin.
* stacks/uptime-kuma: pin image to louislam/uptime-kuma:2.3.2 (was :2,
bumped to :1 by Keel). Full opt-out: keel.sh/policy=never on BOTH the
deployment label (matches Kyverno's exclude rule so the inject-keel-
annotations ClusterPolicy stops mutating) AND the annotation (so Keel
itself respects). Removed keel.sh/policy from lifecycle.ignore_changes
so TF owns it as `never` and can't drift back to `force`.
* stacks/beads-server: pin dolt-workbench to dolthub/dolt-workbench:0.3.73
on both seed-config and workbench containers (was :latest, Keel rolled
to :0.1.0).
* stacks/wealthfolio: pin to afadil/wealthfolio:3.2.1 (was :3.2 truncated
by Keel from the prior live :3.2.1).
* stacks/monitoring: monitoring-quota requests.memory 16Gi → 20Gi. Cluster
grew from 5 to 7 workers (k8s-node5/6 added 2026-05-26) and alloy's
per-pod request jumped 50Mi → 562Mi earlier today; combined with new-node
DS pods (loki-canary, node-exporter, sysctl-inotify) the quota tipped to
100% and blocked every new pod create with FailedCreate. Raising the cap
unblocked the four affected DaemonSets in one shot.
* stacks/immich: tier-quota requests.memory 20Gi → 24Gi, limits.memory
32Gi → 40Gi. Was at 88% with VPA still creeping up on immich-server's
face-detection burst behaviour.
* stacks/{excalidraw,immich,n8n}: providers.tf + .terraform.lock.hcl
updated by `tg init -upgrade` to record telmate/proxmox 3.0.2-rc07
(matches the 21 other stacks that already declare it).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Three autonomous-upgrade pipelines run independently — Keel for apps
(hourly registry polling), unattended-upgrades+kured for OS, and the
k8s-version-check chain for kubeadm/kubelet/kubectl. Until now there
was no single place to see whether each was healthy, what's pending,
or whether anything's stuck. The /upgrade-state skill collapses the
state of all three into one table you can run before each Sunday's
k8s-version-check fires.
- stacks/keel/main.tf: add Prometheus pod-annotation scrape on
container port 9300. Surfaces pending_approvals,
poll_trigger_tracked_images, and registries_scanned_total{image}
so the skill has a real timeseries (also opens the door to a
future "pending_approvals > 0 for 24h" alert).
- scripts/upgrade_state.sh: collector + renderer. Three-row table
(Apps / OS / K8s) + drill-down, --json for piping, exit 0/1/2.
SSH fan-out (parallel subshells) to all five nodes for apt
state + reboot-required + uu log; Prometheus query for Keel;
Pushgateway parse for k8s_upgrade_* gauges. Read-only.
- .claude/skills/upgrade-state/SKILL.md: hardlinked to
~/.claude/skills/upgrade-state/SKILL.md so the skill is
discoverable from both monorepo-rooted and global sessions.
Verification: ran the script, stress-tested the ✗ stalled path by
pushing in_flight=1 + started_timestamp=-100min to Pushgateway and
resetting after — script correctly raised ✗ and exit 2.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Wire Keel's Slack notifier to the existing bot token in Vault
(secret/viktor -> slack_bot_token). Posts to #general by default;
override via slack.channel in the Helm values if you want a dedicated
channel like #keel-notifications.
Notification level is "info" so we get every rollout event, not just
errors. Approval flow is OFF — opt-out-pure means all updates apply
unattended. If we later introduce approvals, add slack.approvalsChannel.
Resolves user request: 'keel should send notifications to slack everytime
it upgrades an app'.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The Helm repo at https://charts.keel.sh has versions 1.0.0–1.0.5,
1.1.0, 1.2.0. 1.0.6 is not published, so the Phase 0 apply failed
silently. Bump to 1.2.0 (app version 0.21.1, latest stable).
Foundation for opt-out-pure auto-update model per
docs/plans/2026-05-16-auto-upgrade-apps-{design,plan}.md.
- New stack `stacks/keel/` deploys Keel via Helm (charts.keel.sh, v1.0.6).
Polls registries hourly per design decision #8. Default schedule
overridable per-workload via keel.sh/pollSchedule annotation.
- New Kyverno ClusterPolicy `inject-keel-annotations` mutates Deployments,
StatefulSets, and DaemonSets in namespaces labeled `keel.sh/enrolled=true`
with keel.sh/policy=force + trigger=poll + pollSchedule=@every 1h.
- Phase 0 enrolls no namespaces. Phase 1 (next session) labels the
self-hosted set.
- Per-workload opt-out: label `keel.sh/policy: never` (used by rollback
runbook and chrome-service-style deliberate pins).
- Keel namespace excluded from the mutate — supervisor self-update has
too-bad a failure mode (decision #11).
- AGENTS.md: KYVERNO_LIFECYCLE_V2 marker convention added for the
ignore_changes block enrolled workloads need.
- .claude/CLAUDE.md: docker-images rule flagged as transitional.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>