Keel was rewriting tag strings (not just digests) despite the
keel.sh/match-tag=true annotation injected by the Kyverno
inject-keel-annotations ClusterPolicy. That annotation was supposed to
constrain Keel to digest-only watches under the deployment's CURRENT tag.
It didn't.

Casualties confirmed today (live image rewritten to a lower version):
* uptime-kuma (:2 → :1): 4h CrashLoopBackOff because v1 boots into SQLite
mode and can't read the v2 db-config.json → MariaDB store.
* n8n (:1.80.5 → :0.1.2): silent, with an EEXIST mkdir /root/.n8n loop.
* beads-server/dolt-workbench (:0.3.73 → :0.1.0): GraphQL schema mismatch
on addDatabaseConnection.
* wealthfolio (:3.2.1 → :2.0 → :3.2): tag string truncated.

Historical cases, previously fixed: claude-memory (:71b32438 → :17),
forgejo (11.0.14 → 1.18), onlyoffice (9.3.1 → 4.0.0.9), shlink
(5.0.2 → 1.16.1).

Changes:
* stacks/keel: replicaCount = 0 in the helm values. Pod went from 1/1 to
0/0. Keep off until either match-tag is root-caused or every enrolled
workload migrates to a content-addressed (SHA) pin.
* stacks/uptime-kuma: pin image to louislam/uptime-kuma:2.3.2 (was :2,
bumped to :1 by Keel). Full opt-out: keel.sh/policy=never on BOTH the
deployment label (matches Kyverno's exclude rule, so the
inject-keel-annotations ClusterPolicy stops mutating it) AND the
annotation (so Keel itself respects it). Removed keel.sh/policy from
lifecycle.ignore_changes so TF owns it as `never` and it can't drift back
to `force`.
* stacks/beads-server: pin dolt-workbench to dolthub/dolt-workbench:0.3.73
on both seed-config and workbench containers (was :latest, Keel rolled
to :0.1.0).
* stacks/wealthfolio: pin to afadil/wealthfolio:3.2.1 (was :3.2, truncated
by Keel from the prior live :3.2.1).
* stacks/monitoring: monitoring-quota requests.memory 16Gi → 20Gi. Cluster
grew from 5 to 7 workers (k8s-node5/6 added 2026-05-26) and alloy's
per-pod request jumped 50Mi → 562Mi earlier today; combined with new-node
DS pods (loki-canary, node-exporter, sysctl-inotify) the quota tipped to
100% and blocked every new pod create with FailedCreate. Raising the cap
unblocked the four affected DaemonSets in one shot.
* stacks/immich: tier-quota requests.memory 20Gi → 24Gi, limits.memory
32Gi → 40Gi. Was at 88% with VPA still creeping up on immich-server's
face-detection burst behaviour.
* stacks/{excalidraw,immich,n8n}: providers.tf + .terraform.lock.hcl
updated by `tg init -upgrade` to record telmate/proxmox 3.0.2-rc07
(matches the 21 other stacks that already declare it).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
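
The uptime-kuma opt-out described above could look roughly like the sketch below. It is not the actual stacks/uptime-kuma code: the resource name, namespace, `app` label, and replica count are assumptions for illustration. Only the image tag and the keel.sh/policy=never label and annotation come from the change list.

```hcl
# Hypothetical sketch; resource name, namespace, and selector labels are assumed.
resource "kubernetes_deployment" "uptime_kuma" {
  metadata {
    name      = "uptime-kuma"
    namespace = "uptime-kuma"

    # Label: matches Kyverno's exclude rule, so the ClusterPolicy stops mutating.
    labels = {
      app              = "uptime-kuma"
      "keel.sh/policy" = "never"
    }

    # Annotation: the value Keel itself reads.
    annotations = {
      "keel.sh/policy" = "never"
    }
  }

  spec {
    replicas = 1

    selector {
      match_labels = { app = "uptime-kuma" }
    }

    template {
      metadata {
        labels = { app = "uptime-kuma" }
      }

      spec {
        container {
          name  = "uptime-kuma"
          image = "louislam/uptime-kuma:2.3.2" # full version pin instead of floating :2
        }
      }
    }
  }

  # No lifecycle.ignore_changes entry for keel.sh/policy: Terraform owns the
  # value, so a drift back to `force` would show up in the next plan.
}
```

A content-addressed pin would append `@sha256:<digest>` to the image reference. The digest is left out here because the commit does not record one.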
105 lines
4.2 KiB
HCL
# Keel: automated Kubernetes Deployment image updates.
# Design: docs/plans/2026-05-16-auto-upgrade-apps-design.md
# Plan: docs/plans/2026-05-16-auto-upgrade-apps-plan.md
#
# Operation: Keel polls each watched workload's registry hourly (default
# schedule below; overridable per-workload via keel.sh/pollSchedule).
# Detection of a new digest under the watched tag triggers a Deployment
# update (pod template hash bump → rolling restart). Workloads opt in by
# carrying keel.sh/policy + keel.sh/trigger annotations; those are
# injected cluster-wide by the inject-keel-annotations ClusterPolicy
# (stacks/kyverno/modules/kyverno/keel-annotations.tf) on namespaces
# labeled keel.sh/enrolled=true.

# Slack bot token for posting upgrade notifications. Existing token in
# Vault (same one used elsewhere); see secret/viktor -> slack_bot_token.
data "vault_kv_secret_v2" "viktor" {
  mount = "secret"
  name  = "viktor"
}

resource "kubernetes_namespace" "keel" {
  metadata {
    name = "keel"
    labels = {
      tier = local.tiers.cluster
    }
  }
  lifecycle {
    # KYVERNO_LIFECYCLE_V1
    ignore_changes = [metadata[0].labels["goldilocks.fairwinds.com/vpa-update-mode"]]
  }
}

resource "helm_release" "keel" {
  name       = "keel"
  namespace  = kubernetes_namespace.keel.metadata[0].name
  repository = "https://charts.keel.sh"
  chart      = "keel"
  # Latest stable per `helm search repo keel/keel -l` 2026-05-16
  # (app version 0.21.1). 1.0.6 doesn't exist; verify before bumping.
  version = "1.2.0"

  # Atomic mitigates partial-deploy state. Keel itself is exempt from
  # auto-update (Kyverno mutate excludes the keel namespace), so it only
  # rolls when this stack applies, making atomic safe here.
  atomic = true

  values = [yamlencode({
    # EMERGENCY STOP: scaled to 0 on 2026-05-26 16:42 UTC. Keel was actively
    # rewriting tag strings (not just digests) despite the
    # `keel.sh/match-tag=true` annotation injected by Kyverno that's supposed
    # to constrain it to digest-only watches. Known casualties this round:
    # uptime-kuma (2 → 1, 4h CrashLoopBackOff), n8n (1.80.5 → 0.1.2, silent
    # degradation), beads-server/dolt-workbench (0.3.73 → 0.1.0), and ~10
    # other deployments with downgrade-flavored change-cause annotations.
    # Re-enable only after root-causing why match-tag isn't being enforced,
    # OR after migrating each app to a content-addressed (SHA) tag pin.
    replicaCount = 0
    # Prometheus pod-annotation scrape: picks up Keel-specific metrics
    # (pending_approvals, poll_trigger_tracked_images, registries_scanned_total{image,registry})
    # on container port 9300 /metrics. The cluster's `kubernetes-pods`
    # Prometheus job keys on these annotations. Used by
    # infra/scripts/upgrade_state.sh (the /upgrade-state skill).
    podAnnotations = {
      "prometheus.io/scrape" = "true"
      "prometheus.io/port"   = "9300"
      "prometheus.io/path"   = "/metrics"
    }
    polling = {
      enabled = true
      # Default poll cadence for workloads that don't override per-Deployment
      # via keel.sh/pollSchedule. Decision #8 in the design doc.
      defaultSchedule = "@every 1h"
    }
    helmProvider = {
      enabled = false # We use annotations, not Helm hooks
    }
    notificationLevel = "info"
    persistence = {
      enabled = false
    }
    # Slack notifications: post every rollout to the configured channel.
    # Bot token from Vault (secret/viktor -> slack_bot_token). The Keel
    # chart sets SLACK_BOT_TOKEN, SLACK_CHANNELS, etc. on the deployment
    # from these values.
    slack = {
      enabled  = true
      botToken = data.vault_kv_secret_v2.viktor.data["slack_bot_token"]
      channel  = "general"
      # No approval flow: opt-out-pure means everything auto-rolls.
      # If we ever introduce gated rollouts, set approvalsChannel here.
    }
    # Keel uses each watched Deployment's own imagePullSecrets to query
    # its registry. Forgejo creds (`registry-credentials`) are auto-synced
    # to every namespace by Kyverno already, so Keel pods don't need a
    # separate pull-secret for their own image (ghcr.io is public).
    rbac = {
      enabled = true
    }
    resources = {
      requests = { cpu = "50m", memory = "64Mi" }
      limits   = { memory = "256Mi" }
    }
  })]
}
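
The header comment and the `polling` block mention a per-workload override via keel.sh/pollSchedule. A minimal sketch of such an override, on a hypothetical Deployment, might look like this. The resource name, namespace, image, and the 6h cadence are all assumptions, not taken from any stack in this repo.

```hcl
# Hypothetical workload; every name and value here is illustrative.
resource "kubernetes_deployment" "example_app" {
  metadata {
    name      = "example-app"
    namespace = "example" # assumed to be labeled keel.sh/enrolled=true

    annotations = {
      # Overrides polling.defaultSchedule ("@every 1h") for this Deployment only.
      # keel.sh/policy and keel.sh/trigger are not set here because the
      # inject-keel-annotations ClusterPolicy is described as adding them.
      "keel.sh/pollSchedule" = "@every 6h"
    }
  }

  spec {
    replicas = 1

    selector {
      match_labels = { app = "example-app" }
    }

    template {
      metadata {
        labels = { app = "example-app" }
      }

      spec {
        container {
          name  = "example-app"
          image = "registry.example.com/example-app:1.2.3"
        }
      }
    }
  }
}
```

While `replicaCount = 0` is in place, no Keel pod is running, so the schedule has no effect until the controller is re-enabled.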