infra/stacks/keel/main.tf

# Keel — automated Kubernetes Deployment image updates.
# Design: docs/plans/2026-05-16-auto-upgrade-apps-design.md
# Plan:   docs/plans/2026-05-16-auto-upgrade-apps-plan.md
#
# Operation: Keel polls each watched workload's registry hourly (default
# schedule below; overridable per-workload via keel.sh/pollSchedule).
# Under the current `patch` default, a newer patch-level semver tag
# (1.2.3 → 1.2.4) triggers a workload update (pod template hash bump →
# rolling restart); float / SHA / non-semver tags are ignored. Workloads
# opt in by carrying keel.sh/policy + keel.sh/trigger annotations, which
# are injected by the inject-keel-annotations ClusterPolicy
# (stacks/kyverno/modules/kyverno/keel-annotations.tf) on namespaces
# labeled keel.sh/enrolled=true.
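
# Illustrative sketch (not part of the original stack): the annotation set
# an enrolled workload is expected to carry after Kyverno mutation, assuming
# the `patch` default described in the helm_release comments in this file.
# The local name `keel_injected_annotations_example` is hypothetical and
# unused; the authoritative values live in
# stacks/kyverno/modules/kyverno/keel-annotations.tf.
locals {
  keel_injected_annotations_example = {
    "keel.sh/policy"       = "patch"     # semver patch bumps only
    "keel.sh/trigger"      = "poll"      # registry polling
    "keel.sh/pollSchedule" = "@every 1h" # same cadence as polling.defaultSchedule
  }
}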

# Slack bot token for posting upgrade notifications. Existing token in
# Vault — same one used elsewhere — see secret/viktor -> slack_bot_token.
data "vault_kv_secret_v2" "viktor" {
  mount = "secret"
  name  = "viktor"
}

resource "kubernetes_namespace" "keel" {
  metadata {
    name = "keel"
    labels = {
      tier = local.tiers.cluster
    }
  }
  lifecycle {
    # KYVERNO_LIFECYCLE_V1
    ignore_changes = [metadata[0].labels["goldilocks.fairwinds.com/vpa-update-mode"]]
  }
}

resource "helm_release" "keel" {
  name       = "keel"
  namespace  = kubernetes_namespace.keel.metadata[0].name
  repository = "https://charts.keel.sh"
  chart      = "keel"
  # Latest stable per `helm search repo keel/keel -l` 2026-05-16
  # (app version 0.21.1). 1.0.6 doesn't exist — verify before bumping.
  version    = "1.2.0"

  # Atomic mitigates partial-deploy state. Keel itself is exempt from
  # auto-update (Kyverno mutate excludes the keel namespace), so it only
  # rolls when this stack applies — making atomic safe here.
  atomic = true

  values = [yamlencode({
    # 2026-05-26 17:30: re-enabled after switching the Kyverno-injected
    # default from `force + match-tag=true` (proven unreliable — see
    # stacks/kyverno/modules/kyverno/keel-annotations.tf) to `patch` which
    # is semver-parser-bounded. Under `patch`:
    #   - Semver-tagged workloads get patch bumps only (1.2.3 → 1.2.4).
    #   - Float / SHA / non-semver tags are IGNORED — no tag rewriting.
    # The 2026-05-26 emergency-stop scope (replicaCount=0) is reverted now
    # that the default is safe. Workloads pinned out-of-band (uptime-kuma
    # via keel.sh/policy=never LABEL) stay opted-out via the Kyverno
    # exclude rule, not via Keel's own annotation.
    replicaCount = 1
    # Prometheus pod-annotation scrape — picks up Keel-specific metrics
    # (pending_approvals, poll_trigger_tracked_images, registries_scanned_total{image,registry})
    # on container port 9300 /metrics. The cluster's `kubernetes-pods`
    # Prometheus job keys on these annotations. Used by
    # infra/scripts/upgrade_state.sh (the /upgrade-state skill).
    podAnnotations = {
      "prometheus.io/scrape" = "true"
      "prometheus.io/port"   = "9300"
      "prometheus.io/path"   = "/metrics"
    }
    polling = {
      enabled = true
      # Default poll cadence for workloads that don't override per-Deployment
      # via keel.sh/pollSchedule. Decision #8 in the design doc.
      defaultSchedule = "@every 1h"
    }
    helmProvider = {
      enabled = false # Workloads are tracked via annotations, not Keel's Helm release provider
    }
    notificationLevel = "info"
    persistence = {
      enabled = false
    }
    # Slack notifications: post every rollout to the configured channel.
    # Bot token from Vault (secret/viktor -> slack_bot_token). The Keel
    # chart sets SLACK_BOT_TOKEN, SLACK_CHANNELS, etc. on the deployment
    # from these values.
    slack = {
      enabled  = true
      botToken = data.vault_kv_secret_v2.viktor.data["slack_bot_token"]
      channel  = "general"
      # No approval flow — opt-out-pure means everything auto-rolls.
      # If we ever introduce gated rollouts, set approvalsChannel here.
    }
    # Keel uses each watched Deployment's own imagePullSecrets to query
    # its registry. Forgejo creds (`registry-credentials`) are auto-synced
    # to every namespace by Kyverno already, so Keel pods don't need a
    # separate pull-secret for their own image (ghcr.io is public).
    rbac = {
      enabled = true
    }
    resources = {
      requests = { cpu = "50m", memory = "64Mi" }
      limits   = { memory = "256Mi" }
    }
  })]
}
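
# Illustrative sketch (not part of the original stack): the per-workload
# knobs mentioned in the comments above. The local names and the "@every 6h"
# value are hypothetical; only the keys and the `never` value come from
# this file.
locals {
  # Annotation a workload can set to override polling.defaultSchedule.
  keel_poll_override_example = {
    "keel.sh/pollSchedule" = "@every 6h"
  }
  # Label (not annotation) that opts a workload out via the Kyverno
  # exclude rule, as done for uptime-kuma.
  keel_opt_out_label_example = {
    "keel.sh/policy" = "never"
  }
}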
Phase 0: install Keel + Kyverno auto-update annotation injector

Foundation for opt-out-pure auto-update model per docs/plans/2026-05-16-auto-upgrade-apps-{design,plan}.md.

- New stack `stacks/keel/` deploys Keel via Helm (charts.keel.sh, v1.0.6). Polls registries hourly per design decision #8. Default schedule overridable per-workload via keel.sh/pollSchedule annotation.
- New Kyverno ClusterPolicy `inject-keel-annotations` mutates Deployments, StatefulSets, and DaemonSets in namespaces labeled `keel.sh/enrolled=true` with keel.sh/policy=force + trigger=poll + pollSchedule=@every 1h.
- Phase 0 enrolls no namespaces. Phase 1 (next session) labels the self-hosted set.
- Per-workload opt-out: label `keel.sh/policy: never` (used by rollback runbook and chrome-service-style deliberate pins).
- Keel namespace excluded from the mutate — supervisor self-update has too-bad a failure mode (decision #11).
- AGENTS.md: KYVERNO_LIFECYCLE_V2 marker convention added for the ignore_changes block enrolled workloads need.
- .claude/CLAUDE.md: docker-images rule flagged as transitional.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-16 12:19:34 +00:00

keel: enable Slack notifications on every upgrade

Wire Keel's Slack notifier to the existing bot token in Vault (secret/viktor -> slack_bot_token). Posts to #general by default; override via slack.channel in the Helm values if you want a dedicated channel like #keel-notifications.

Notification level is "info" so we get every rollout event, not just errors. Approval flow is OFF — opt-out-pure means all updates apply unattended. If we later introduce approvals, add slack.approvalsChannel.

Resolves user request: 'keel should send notifications to slack everytime it upgrades an app'.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-16 13:01:35 +00:00

keel: pin chart 1.0.6 → 1.2.0 (1.0.6 doesn't exist)

The Helm repo at https://charts.keel.sh has versions 1.0.0–1.0.5, 1.1.0, 1.2.0. 1.0.6 is not published, so the Phase 0 apply failed silently. Bump to 1.2.0 (app version 0.21.1, latest stable).

2026-05-16 12:30:19 +00:00

keel: re-enable with policy=patch (semver-bounded) + fix CI deny-privileged

Re-enables Keel after the 2026-05-26 emergency stop, with a safer default.

Switch Kyverno-injected default from `force + match-tag=true` (proven unreliable — it rewrote tag strings cluster-wide despite the design intent) to `patch`, which is semver-parser-bounded:

- Only patch bumps within current major.minor (1.2.3 → 1.2.4, never 1.3.x or 2.x — the parser does the math, not string compare).
- Non-semver tags (`:latest`, `:v4`, `:2`, SHA, `:nightly`) are IGNORED entirely. No tag rewriting under any code path.
- 151 stale `force` annotations migrated to `patch` cluster-wide during this apply (anchor `+()` dropped, then re-added).

Live state after this commit: 0 workloads on `force`, 209 on `patch`, 22 on `never`. Keel deployment back to 1/1 on `:0.21.1`.

Note: 22 workloads with `keel.sh/policy=never` LABEL had their annotation mutated to `patch` during the migration despite Kyverno's matchLabels-based exclude rule — appears to be a quirk of `mutateExistingOnPolicyUpdate` not honoring `selector` excludes. Repatched all 22 back to `annotation=never` via `kubectl annotate --overwrite`, then restored the `+(keel.sh/policy)` anchor in the policy so future Kyverno reconciles preserve them.

Also fixes CI build-cli workflow which was blocked by `deny-privileged-containers` since wave 1 enforce flip on 2026-05-18: woodpecker namespace added to the shared security_policy_exclude_namespaces list (CI pipeline pods `wp-*` run privileged docker builds, legitimate use). The `default` workflow (terragrunt apply) was already passing — only the parallel `build-cli` workflow (which builds the infra-cli docker image) was failing, but it took the overall pipeline status down with it.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-26 19:06:51 +00:00

upgrade-state: skill + script + Keel scrape for periodic three-pipeline audit

Three autonomous-upgrade pipelines run independently — Keel for apps (hourly registry polling), unattended-upgrades+kured for OS, and the k8s-version-check chain for kubeadm/kubelet/kubectl. Until now there was no single place to see whether each was healthy, what's pending, or whether anything's stuck. The /upgrade-state skill collapses the state of all three into one table you can run before each Sunday's k8s-version-check fires.

- stacks/keel/main.tf: add Prometheus pod-annotation scrape on container port 9300. Surfaces pending_approvals, poll_trigger_tracked_images, and registries_scanned_total{image} so the skill has a real timeseries (also opens the door to a future "pending_approvals > 0 for 24h" alert).
- scripts/upgrade_state.sh: collector + renderer. Three-row table (Apps / OS / K8s) + drill-down, --json for piping, exit 0/1/2. SSH fan-out (parallel subshells) to all five nodes for apt state + reboot-required + uu log; Prometheus query for Keel; Pushgateway parse for k8s_upgrade_* gauges. Read-only.
- .claude/skills/upgrade-state/SKILL.md: hardlinked to ~/.claude/skills/upgrade-state/SKILL.md so the skill is discoverable from both monorepo-rooted and global sessions.

Verification: ran the script, stress-tested the ✗ stalled path by pushing in_flight=1 + started_timestamp=-100min to Pushgateway and resetting after — script correctly raised ✗ and exit 2.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-18 10:50:43 +00:00