cluster-health: emergency-stop Keel + roll back image downgrades + quota raises

Keel was rewriting tag strings (not just digests) despite the
keel.sh/match-tag=true annotation injected by the Kyverno
inject-keel-annotations ClusterPolicy. That annotation was supposed to
constrain Keel to digest-only watches under the deployment's CURRENT tag.
It didn't. Casualties confirmed today (live image rewritten to a lower
version): uptime-kuma (:2 → :1, 4h CrashLoopBackOff because v1 boots into
SQLite mode and can't read the v2 db-config.json → MariaDB store);
n8n (:1.80.5 → :0.1.2, silent — EEXIST mkdir /root/.n8n loop);
beads-server/dolt-workbench (:0.3.73 → :0.1.0, GraphQL schema mismatch on
addDatabaseConnection); wealthfolio (:3.2.1 → :2.0 → :3.2 string truncate);
plus historical ones previously fixed (claude-memory :71b32438 → :17,
forgejo 11.0.14 → 1.18, onlyoffice 9.3.1 → 4.0.0.9, shlink 5.0.2 → 1.16.1).

Changes:

* stacks/keel: replicaCount = 0 in the helm values. Pod went from 1/1 to
  0/0. Keep off until either match-tag is root-caused or every enrolled
  workload migrates to a content-addressed (SHA) pin.

* stacks/uptime-kuma: pin image to louislam/uptime-kuma:2.3.2 (was :2,
  bumped to :1 by Keel). Full opt-out: keel.sh/policy=never on BOTH the
  deployment label (matches Kyverno's exclude rule so the inject-keel-
  annotations ClusterPolicy stops mutating) AND the annotation (so Keel
  itself respects). Removed keel.sh/policy from lifecycle.ignore_changes
  so TF owns it as `never` and can't drift back to `force`.

* stacks/beads-server: pin dolt-workbench to dolthub/dolt-workbench:0.3.73
  on both seed-config and workbench containers (was :latest, Keel rolled
  to :0.1.0).

* stacks/wealthfolio: pin to afadil/wealthfolio:3.2.1 (was :3.2 truncated
  by Keel from the prior live :3.2.1).

* stacks/monitoring: monitoring-quota requests.memory 16Gi → 20Gi. Cluster
  grew from 5 to 7 workers (k8s-node5/6 added 2026-05-26) and alloy's
  per-pod request jumped 50Mi → 562Mi earlier today; combined with new-node
  DS pods (loki-canary, node-exporter, sysctl-inotify) the quota tipped to
  100% and blocked every new pod create with FailedCreate. Raising the cap
  unblocked the four affected DaemonSets in one shot.

* stacks/immich: tier-quota requests.memory 20Gi → 24Gi, limits.memory
  32Gi → 40Gi. Was at 88% with VPA still creeping up on immich-server's
  face-detection burst behaviour.

* stacks/{excalidraw,immich,n8n}: providers.tf + .terraform.lock.hcl
  updated by `tg init -upgrade` to record telmate/proxmox 3.0.2-rc07
  (matches the 21 other stacks that already declare it).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
This commit is contained in:
Viktor Barzin 2026-05-26 18:48:50 +00:00
parent 41fb7c4a76
commit 60b2b1cdfc
12 changed files with 97 additions and 24 deletions

View file

@ -81,9 +81,23 @@ resource "kubernetes_deployment" "uptime-kuma" {
labels = {
app = "uptime-kuma"
tier = var.tier
# Opt out of Kyverno's inject-keel-annotations ClusterPolicy. The Kyverno
# rule excludes any workload with this LABEL (see
# stacks/kyverno/modules/kyverno/keel-annotations.tf, exclude.any
# matchLabels keel.sh/policy=never). Without the label, Kyverno would
# silently re-add `keel.sh/policy=force` after every reconcile, undoing
# the annotation below.
"keel.sh/policy" = "never"
}
annotations = {
"reloader.stakater.com/search" = "true"
# Stop Keel polling for this workload. Even with match-tag=true,
# Keel auto-downgraded :2 :1 on 2026-05-26 12:14, which v1 booted
# into SQLite mode and couldn't read the existing MariaDB store
# (db-config.json) 4h CrashLoopBackOff. Pinning the image string
# alone isn't enough because Keel kept fighting the apply. Combined
# with the matching LABEL above, this fully bypasses Keel.
"keel.sh/policy" = "never"
}
}
spec {
@ -108,7 +122,14 @@ resource "kubernetes_deployment" "uptime-kuma" {
}
spec {
container {
image = "louislam/uptime-kuma:2"
# Pinned to 2.3.2 because Keel auto-downgraded :2 :1 on 2026-05-26
# 12:14 UTC despite the Kyverno-injected `keel.sh/match-tag=true` +
# `keel.sh/policy=force` annotation pair (which is supposed to gate
# digest changes only). The v1 image opens kuma.db (SQLite) at boot
# and can't read the v2 db-config.json 4h CrashLoopBackOff while
# the MariaDB store sat intact. Until the keel-match-tag regression
# is root-caused, pin minor versions explicitly.
image = "louislam/uptime-kuma:2.3.2"
name = "uptime-kuma"
resources {
@ -167,9 +188,12 @@ resource "kubernetes_deployment" "uptime-kuma" {
lifecycle {
ignore_changes = [
spec[0].template[0].spec[0].dns_config, # KYVERNO_LIFECYCLE_V1
metadata[0].annotations["keel.sh/policy"],
# `keel.sh/policy` is intentionally NOT ignored we want TF to own it
# as `never` so a Kyverno reconcile (or manual kubectl) can't flip it
# back to `force` and re-enable auto-updates.
metadata[0].annotations["keel.sh/trigger"],
metadata[0].annotations["keel.sh/pollSchedule"], # KYVERNO_LIFECYCLE_V2
metadata[0].annotations["keel.sh/pollSchedule"], # KYVERNO_LIFECYCLE_V2
metadata[0].annotations["keel.sh/match-tag"], # injected by Kyverno
]
}
}