upgrades: fix hourly gotenberg error + cap update notifications at weekly

Viktor was getting upgrade-error Slack messages every hour and wants update notifications at most weekly. Root cause of the errors: Keel kept trying to roll gotenberg 8.25->8.25.1 in paperless-ngx but kyverno's require-trusted-registries denied it — gotenberg/* (and apache/*, which tika will hit next) were never allowlisted, and Keel's Slack notifier at info level re-posted the identical failure to #general on every hourly poll since Jun 28. Changes: allowlist gotenberg/* + apache/* so the patch applies cleanly; disable Keel's direct Slack notifier and replace failure visibility with a KeelUpdateFailing Loki-ruler alert (alert-on-change: one notification plus the daily digest, never an hourly drip); remove diun's Slack notifier whose default message @channel-pinged #image-updates for every new upstream tag every 6h (the n8n upgrade-agent webhook feed is untouched). The k8s upgrade report is already weekly (Mon 06:07 UTC). Paperless-ngx itself stays paused (keel policy=never, user-managed) while the ingest runs. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-07-02 07:16:50 +00:00 · 2026-07-02 07:16:50 +00:00 · a64d2ba2b9
commit a64d2ba2b9
parent 5d5d9752cb
5 changed files with 39 additions and 25 deletions
--- a/stacks/monitoring/modules/monitoring/loki.tf
+++ b/stacks/monitoring/modules/monitoring/loki.tf
@ -225,6 +225,27 @@ resource "kubernetes_config_map" "loki_alert_rules" {
            },
          ]
        },
+        {
+          # App auto-upgrades (Keel). Keel's direct Slack notifier was disabled
+          # 2026-07-02 after a stuck update (gotenberg vs require-trusted-
+          # registries) re-posted an identical failure to #general on every
+          # hourly poll for days. This log alert is the replacement failure
+          # signal: alert-on-change routing notifies ONCE and the daily digest
+          # carries it while it persists — never an hourly drip.
+          name = "App auto-upgrades (Keel)"
+          rules = [
+            {
+              alert  = "KeelUpdateFailing"
+              expr   = "sum(count_over_time({namespace=\"keel\"} |= \"level=error\" |= \"got error while updating resource\" [3h])) > 2"
+              for    = "10m"
+              labels = { severity = "warning" }
+              annotations = {
+                summary     = "Keel repeatedly failing to roll out an image update"
+                description = "Keel failed the same resource update >2 times in 3h (its poll is hourly, so this means a persistently stuck rollout, not a blip). kubectl -n keel logs deploy/keel | grep level=error. Common causes: kyverno require-trusted-registries denying the new tag (extend the allowlist in stacks/kyverno/modules/kyverno/security-policies.tf), a ResourceQuota rejecting the surge pod, or a bad imagePullSecret."
+              }
+            },
+          ]
+        },
        {
          # t3 session-auth + auto-upgrade health (devvm host scripts → journald →
          # Loki). Backstops the gated-nightly t3 tracker: the dispatch logs every