nextcloud: re-enable Keel auto-upgrades with occ-upgrade self-heal + live-tag floor

Re-enrolls Nextcloud in Keel (opted out after the 2026-05-26 32.0.3->32.0.9 bump stuck the pod in maintenance mode for ~22h). Two safeguards engineer around both failure modes:

- F1 (interrupted occ upgrade -> 503): nextcloud-watchdog CronJob runs `occ upgrade` + clears maintenance mode when occ reports needsDbUpgrade=true; Job deadline bumped 120->600s so it isn't killed mid-migration.
- F2 (helm re-renders a tag below the Keel-bumped live image -> downgrade CrashLoop): chart_values renders the live tag via a plural kubernetes_resources data source (empty-list-on-absence -> floor 32.0.9 on fresh install/DR), so a re-render never downgrades below live.

Scope is patch: Kyverno's shared inject-keel-annotations policy stamps it and its background-controller overrides a TF-set value, and patch == minor for Nextcloud in practice (32.0.x only; major 33 stays manual). Dropped the per-workload keel.sh/policy override resources to avoid perpetual drift; ns enrollment + Kyverno now own the keel annotations like other workloads.

Also bumps the external-storage bootstrap Job create timeout 1m->12m to match its own 10m pod-wait, since Keel bumps now roll the pod mid-apply.

Verified: Keel auto-upgraded 32.0.9->32.0.10 on apply, entrypoint occ upgrade completed clean (no watchdog needed), pod 2/2, HTTP 200, plan shows no drift.
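
For readers who want to reason about F2 outside Terraform, the tag derivation can be approximated in plain shell. This is only a sketch: `resolve_tag` is a hypothetical helper name, not something in this repo, and it strips `-apache` as a suffix only, whereas the HCL `replace()` in `main.tf` removes every occurrence of that substring (equivalent for the tags seen here).

```shell
#!/bin/sh
# Hypothetical helper (not part of the stack): mirrors the Terraform locals
# that derive the chart's image tag from the live Deployment image.
#   $1 = live image reference, or "" when no Deployment exists yet
#   $2 = floor tag to use when no live tag can be derived
resolve_tag() {
  live_image="$1"
  floor="$2"

  # Keep only the last colon-separated segment, so a registry port
  # (registry:port/repo:tag) does not get mistaken for the tag.
  tag="${live_image##*:}"

  # Drop the optional flavor suffix so the value round-trips through
  # the chart's image.flavor=apache setting.
  tag="${tag%-apache}"

  # An empty result means "no live Deployment": fall back to the floor.
  if [ -n "$tag" ]; then
    printf '%s\n' "$tag"
  else
    printf '%s\n' "$floor"
  fi
}
```

Calling `resolve_tag "nextcloud:32.0.10-apache" "32.0.9"` yields the live tag, while `resolve_tag "" "32.0.9"` yields the floor, which is the fresh install / DR path.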
2026-06-01 18:43:51 +00:00 · fb1e47a20a
commit fb1e47a20a
parent 50d0f1affa
4 changed files with 133 additions and 56 deletions
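
The F1 gate from the commit message can be exercised in isolation. The sketch below assumes only what the watchdog script in this diff does (grep the `php occ status --output=json` output for `"needsDbUpgrade":true`); the function names and the sample payloads in the note are illustrative, not taken from the repo.

```shell
#!/bin/sh
# Hypothetical helper: exits 0 only when the status JSON passed as $1
# reports a pending DB upgrade. Empty or unparsable input counts as "no".
needs_db_upgrade() {
  printf '%s' "$1" | grep -q '"needsDbUpgrade":true'
}

# Hypothetical helper: prints the action the watchdog would take for a
# given status payload, without touching the cluster.
watchdog_action() {
  if needs_db_upgrade "$1"; then
    echo "run-occ-upgrade"
  else
    echo "leave-alone"
  fi
}
```

With a payload such as `{"maintenance":true,"needsDbUpgrade":false}` the sketch prints `leave-alone`, which is the property the commit relies on: a deliberate manual maintenance window is not cleared, because the gate keys on `needsDbUpgrade` and not on `maintenance`.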
--- a/docs/plans/2026-05-16-auto-upgrade-apps-design.md
+++ b/docs/plans/2026-05-16-auto-upgrade-apps-design.md
@@ -101,7 +101,7 @@ This is added per workload as we phase in. Mechanical, grep-able.
 | 2 | Stateless third-party web apps (linkwarden, postiz, affine, etc.) | No migrations |
 | 3 | Exporters, sidecars, utilities | Stateless |
 | 4 | Stateful-but-tolerant (Grafana, Prometheus, etc.) | Restart-safe state |
-| 5 | State-coupled with migrations (Nextcloud, Forgejo, paperless-ngx, mailserver) | Schema-migration risk |
+| 5 | State-coupled with migrations (Nextcloud, Forgejo, paperless-ngx, mailserver) | Schema-migration risk. **Nextcloud enrolled 2026-06-01** with two safeguards for the migration risk: F1 — `nextcloud-watchdog` CronJob runs `occ upgrade` when occ reports `needsDbUpgrade=true` (recovers an interrupted entrypoint upgrade); F2 — `chart_values.yaml` renders the live (Keel-bumped) image tag with a floor, so a helm re-render never downgrades below live. Scope is `patch` (Kyverno-stamped) == `minor` for Nextcloud (32.0.x only). See `stacks/nextcloud/main.tf`. |
 | 6 | Authentik | Auth outage |
 | 7 | Operators (cnpg-operator, ESO, kured, descheduler) | Operator skew |
 | 8 | Critical infra (Calico, proxmox-csi, nfs-csi, traefik, metallb) | Node-level outage potential (memory id=390: 26h Calico cascade) |
--- a/stacks/nextcloud/chart_values.yaml
+++ b/stacks/nextcloud/chart_values.yaml
@@ -1,13 +1,16 @@
-# Pin the image to 32.0.9 (apache). On 2026-05-26 Keel bumped the live
-# Deployment 32.0.3 → 32.0.9-apache and the DATA migrated to 32.0.9.2; Keel
-# was then disabled but chart_values was never pinned, so it kept defaulting
-# to the chart's appVersion (32.0.3). A 2026-06-01 `terragrunt apply`
-# reconciled that drift, rolled a 32.0.3 pod, and Nextcloud refused to
-# downgrade (data 32.0.9.2 > image 32.0.3.2) → CrashLoopBackOff. Pinning here
-# keeps TF the source of truth and matches the on-disk data version.
+# image.tag is rendered dynamically (templatefile var `image_tag`) from the
+# CURRENT live Deployment tag, falling back to var.nextcloud_image_tag_floor
+# (32.0.9) on fresh install / DR — see stacks/nextcloud/main.tf
+# `data.kubernetes_resources.nextcloud_live` + locals. This makes helm upgrades
+# image-no-ops in steady state and means a re-render can NEVER downgrade below
+# the Keel-bumped live tag (the 2026-06-01 CrashLoop: a pinned 32.0.3 lost to
+# live 32.0.9 and Nextcloud refused the downgrade). Keel (keel.sh/policy=patch)
+# bumps the live tag upward within 32.0.x; the next apply just follows it.
+# flavor=apache renders the bare apache-default tag (live image is
+# `nextcloud:<tag>`, no -apache suffix).
 image:
  flavor: apache
-  tag: "32.0.9"
+  tag: "${image_tag}"

 nextcloud:
  host: nextcloud.viktorbarzin.me
--- a/stacks/nextcloud/external_storage.tf
+++ b/stacks/nextcloud/external_storage.tf
@@ -116,6 +116,16 @@ resource "kubernetes_role_binding" "nextcloud_external_storage_bootstrap" {
 # ── Bootstrap Job ────────────────────────────────────────────────────────────

 resource "kubernetes_job_v1" "nextcloud_external_storage_bootstrap" {
+  # The bootstrap script (below) waits up to 10m for the NC pod to be Ready.
+  # kubernetes_job_v1's default create timeout is only 1m, which spuriously
+  # fails the apply whenever the NC pod takes >1m to come up — e.g. now that
+  # Keel auto-upgrades nextcloud, a bump mid-apply runs `occ upgrade` in the
+  # entrypoint and delays readiness past 1m (observed 2026-06-01). Match the
+  # script's 10m wait plus margin.
+  timeouts {
+    create = "12m"
+  }
+
  metadata {
    name      = "nextcloud-external-storage-bootstrap"
    namespace = kubernetes_namespace.nextcloud.metadata[0].name
--- a/stacks/nextcloud/main.tf
+++ b/stacks/nextcloud/main.tf
@@ -6,13 +6,60 @@ variable "nfs_server" { type = string }
 variable "redis_host" { type = string }
 variable "mysql_host" { type = string }

+# FLOOR only: Keel bumps the LIVE image tag upward (patch policy); the
+# data source below renders the current live tag so a helm apply never
+# downgrades below what Keel installed. This floor only wins on a fresh
+# install / DR (no live Deployment) or after deliberately restoring an
+# OLDER DB snapshot (bump this to match — see comment on the data source).
+variable "nextcloud_image_tag_floor" {
+  type    = string
+  default = "32.0.9"
+}
+
 data "vault_kv_secret_v2" "secrets" {
  mount = "secret"
  name  = "nextcloud"
 }

+# Render the CURRENT live image tag so helm upgrades are image-no-ops and
+# can NEVER downgrade below the Keel-bumped live tag (failure mode F2: the
+# 2026-06-01 CrashLoop where a pinned 32.0.3 re-render lost to live 32.0.9).
+# Helm-managed workloads can't use the raw-Deployment KEEL_IGNORE_IMAGE
+# `lifecycle.ignore_changes` trick (immich/freshrss main.tf), so we feed the
+# live tag back into the chart instead.
+#
+# Use the PLURAL `kubernetes_resources` (field-selected to name=nextcloud), NOT
+# the singular `kubernetes_resource`: in kubernetes provider 3.1.0 the singular
+# data source ERRORS ("Provider produced null object") when the target is
+# absent, and try() can't rescue it (the failure is at the provider read, not
+# the expression). The plural returns an empty `objects` list on no match, so
+# objects[0] + try() cleanly falls back to var.nextcloud_image_tag_floor on
+# fresh install / DR. (Verified empirically against provider 3.1.0.)
+#
+# namespace is the LITERAL "nextcloud", NOT
+# kubernetes_namespace.nextcloud.metadata[0].name, on purpose: referencing the
+# namespace resource makes Terraform defer this data read to apply time
+# whenever the namespace has a pending change (e.g. the keel.sh/enrolled label
+# add) — "(depends on a resource ... with changes pending)" — which leaves the
+# tag unknown at plan, turning every helm plan into an unverifiable
+# (known after apply) values churn. A static namespace decouples the read so it
+# resolves at plan time.
+data "kubernetes_resources" "nextcloud_live" {
+  api_version    = "apps/v1"
+  kind           = "Deployment"
+  namespace      = "nextcloud"
+  field_selector = "metadata.name=nextcloud"
+}
+
 locals {
  homepage_credentials = jsondecode(data.vault_kv_secret_v2.secrets.data["homepage_credentials"])
+
+  _live_image = try(data.kubernetes_resources.nextcloud_live.objects[0].spec.template.spec.containers[0].image, "")
+  # Last colon-segment is the tag (handles registry:port/repo:tag); strip the
+  # optional `-apache` flavor suffix so it round-trips through the chart's
+  # `image.flavor=apache` (which renders the bare apache-default tag).
+  _live_tag           = try(replace(element(split(":", local._live_image), length(split(":", local._live_image)) - 1), "-apache", ""), "")
+  nextcloud_image_tag = local._live_tag != "" ? local._live_tag : var.nextcloud_image_tag_floor
 }


@@ -30,14 +77,26 @@ resource "kubernetes_namespace" "nextcloud" {
      tier                                    = local.tiers.edge
      "resource-governance/custom-limitrange" = "true"
      "resource-governance/custom-quota"      = "true"
-      # Keel disabled for nextcloud: the 2026-05-26 Keel-driven bump
-      # 32.0.3-apache → 32.0.9-apache left the pod in maintenance mode
-      # (needsDbUpgrade=true) for ~22h because Keel doesn't run
-      # `occ upgrade` after rolling the image. Defense-in-depth:
-      # (a) namespace not enrolled here, (b) workload carries the
-      # `keel.sh/policy=never` label + annotation below so even if the
-      # ns label gets re-added, Kyverno excludes this Deployment.
-      # "keel.sh/enrolled"                    = "true"
+      # Keel re-enabled 2026-06-01 (was disabled after the 2026-05-26 bump
+      # 32.0.3→32.0.9 stuck the pod in maintenance mode for ~22h). Two
+      # safeguards make auto-upgrade safe, engineered around BOTH failure modes:
+      #   F1 — interrupted `occ upgrade` (entrypoint copies version.php before
+      #        occ upgrade finishes, so a probe-restart mid-upgrade leaves the
+      #        DB half-migrated → 503): the nextcloud-watchdog CronJob below
+      #        self-heals by running `occ upgrade` when occ reports
+      #        needsDbUpgrade=true.
+      #   F2 — helm re-renders a tag BELOW the Keel-bumped live image →
+      #        Nextcloud refuses the downgrade → CrashLoop (the 2026-06-01
+      #        incident): chart_values renders the live tag with a floor, so a
+      #        re-render is never below live.
+      # Scope: the shared Kyverno `inject-keel-annotations` policy stamps
+      # keel.sh/policy=patch (+ trigger=poll + pollSchedule) on enrolled
+      # workloads. For Nextcloud patch == minor in practice — it only ships
+      # 32.0.x maintenance releases (never 32.1.x), and major 33 needs `major`
+      # policy and stays manual (the entrypoint's +1-major limit enforces that
+      # anyway). We deliberately do NOT override the policy per-workload — see
+      # the note where the old override resources used to live, below.
+      "keel.sh/enrolled" = "true"
    }
  }
  lifecycle {
@@ -46,40 +105,24 @@ resource "kubernetes_namespace" "nextcloud" {
  }
 }

-# Workload-level Keel opt-out (see namespace comment above).
-# Keel reads the ANNOTATION `keel.sh/policy` (it's what un-tracks the
-# image watcher); the LABEL exists for the Kyverno exclude rule in
-# `inject-keel-annotations` (defense-in-depth in case the namespace
-# label gets re-added later). Both are set via these helper resources
-# because the nextcloud chart 8.8.1 doesn't expose Deployment-level
-# commonLabels / commonAnnotations.
-resource "kubernetes_labels" "nextcloud_keel_optout" {
-  api_version = "apps/v1"
-  kind        = "Deployment"
-  metadata {
-    name      = "nextcloud"
-    namespace = kubernetes_namespace.nextcloud.metadata[0].name
-  }
-  labels = {
-    "keel.sh/policy" = "never"
-  }
-  force      = true
-  depends_on = [helm_release.nextcloud]
-}
-
-resource "kubernetes_annotations" "nextcloud_keel_optout" {
-  api_version = "apps/v1"
-  kind        = "Deployment"
-  metadata {
-    name      = "nextcloud"
-    namespace = kubernetes_namespace.nextcloud.metadata[0].name
-  }
-  annotations = {
-    "keel.sh/policy" = "never"
-  }
-  force      = true
-  depends_on = [helm_release.nextcloud]
-}
+# No per-workload Keel override resources here, on purpose. Nextcloud is
+# enrolled via the namespace label above; the shared Kyverno
+# `inject-keel-annotations` policy then stamps keel.sh/policy=patch +
+# trigger=poll + pollSchedule, and Keel auto-upgrades within 32.0.x.
+#
+# This stack used to carry kubernetes_labels + kubernetes_annotations
+# resources forcing keel.sh/policy=minor (and before that =never, for the
+# opt-out). Both were removed 2026-06-01 after re-enabling Keel because each
+# produced perpetual drift:
+#   - Kyverno's background-controller overwrites a TF-set policy back to
+#     `patch` despite the policy's `+(keel.sh/policy)` add-if-missing anchor
+#     (observed live: the annotation's field manager was background-controller
+#     with value patch right after a Keel-bump admission).
+#   - The helm release strips the deployment's keel.sh/policy LABEL on every
+#     roll, so TF re-added it on every apply.
+# patch == minor for Nextcloud (32.0.x only; major 33 needs `major` and stays
+# manual), so letting Kyverno own the keel annotations — exactly like every
+# other enrolled workload (immich, freshrss) — is both correct and drift-free.

 resource "kubernetes_manifest" "external_secret" {
  manifest = {
@@ -191,7 +234,7 @@ resource "helm_release" "nextcloud" {
  atomic     = true
  version    = "8.8.1"

-  values     = [templatefile("${path.module}/chart_values.yaml", { tls_secret_name = var.tls_secret_name, mysql_host = var.mysql_host })]
+  values     = [templatefile("${path.module}/chart_values.yaml", { tls_secret_name = var.tls_secret_name, mysql_host = var.mysql_host, image_tag = local.nextcloud_image_tag })]
  timeout    = 6000
  depends_on = [kubernetes_manifest.db_external_secret]
 }
@@ -457,9 +500,12 @@ resource "kubernetes_config_map" "backup-script" {
  }
 }

-# Watchdog: auto-restart Nextcloud when Apache workers go runaway
-# Checks every 5 minutes if Apache has >40 active workers (normal is 5-15).
-# If runaway detected, restarts the deployment to recover node CPU.
+# Watchdog CronJob: runs every 5 minutes and performs two checks:
+#  1. Apache runaway recovery — if >40 workers (normal 5-15), rollout-restart
+#     to recover node CPU.
+#  2. F1 Keel self-heal — if occ reports needsDbUpgrade=true (an interrupted
+#     `occ upgrade` after a Keel image bump left the app in maintenance mode),
+#     re-run `occ upgrade` and clear maintenance mode.
 resource "kubernetes_service_account" "nextcloud_watchdog" {
  metadata {
    name      = "nextcloud-watchdog"
@@ -521,7 +567,9 @@ resource "kubernetes_cron_job_v1" "nextcloud_watchdog" {
    job_template {
      metadata {}
      spec {
-        active_deadline_seconds = 120
+        # 600s (was 120s) so the F1 self-heal `occ upgrade` isn't killed
+        # mid-migration. concurrency_policy=Forbid prevents overlap.
+        active_deadline_seconds = 600
        template {
          metadata {}
          spec {
@@ -554,6 +602,22 @@ resource "kubernetes_cron_job_v1" "nextcloud_watchdog" {
                else
                  echo "Apache workers within normal range ($WORKERS <= 40)"
                fi
+
+                # F1 self-heal: a Keel image bump runs `occ upgrade` in the
+                # entrypoint, but if that's interrupted (e.g. a probe restart
+                # mid-upgrade) occ reports needsDbUpgrade=true and the app sits
+                # in maintenance mode (503). Re-run the upgrade and clear
+                # maintenance mode. Gated on needsDbUpgrade only, so a
+                # deliberate manual maintenance window is left untouched.
+                ST=$(kubectl exec -n nextcloud "$POD" -c nextcloud -- php occ status --output=json 2>/dev/null || true)
+                if echo "$ST" | grep -q '"needsDbUpgrade":true'; then
+                  echo "$(date): needsDbUpgrade=true → running occ upgrade"
+                  kubectl exec -n nextcloud "$POD" -c nextcloud -- php occ upgrade --no-interaction || true
+                  kubectl exec -n nextcloud "$POD" -c nextcloud -- php occ maintenance:mode --off || true
+                  echo "$(date): self-heal occ upgrade attempted"
+                else
+                  echo "$(date): occ status healthy (no DB upgrade pending)"
+                fi
              EOF
              ]
            }