nextcloud: re-enable Keel auto-upgrades with occ-upgrade self-heal + live-tag floor

Re-enrolls Nextcloud in Keel (opted out after the 2026-05-26 32.0.3->32.0.9
bump stuck the pod in maintenance mode ~22h). Two safeguards engineer around
both failure modes:

- F1 (interrupted occ upgrade -> 503): nextcloud-watchdog CronJob runs
  `occ upgrade` + clears maintenance mode when occ reports needsDbUpgrade=true;
  Job deadline bumped 120->600s so it isn't killed mid-migration.
- F2 (helm re-renders a tag below the Keel-bumped live image -> downgrade
  CrashLoop): chart_values renders the live tag via a plural
  kubernetes_resources data source (empty-list-on-absence -> floor 32.0.9 on
  fresh install/DR), so a re-render never downgrades below live.
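
The decision logic behind the two safeguards can be sketched as follows. This is only an illustration in Python, not the Terraform or CronJob code from this commit: the function names `render_image_tag` and `needs_occ_upgrade` are hypothetical, the Deployment is assumed to arrive as a plain dict shaped like a Kubernetes object, the Nextcloud container is assumed to be the first one in the pod spec, and the status JSON is assumed to carry a `needsDbUpgrade` key as described in F1.

```python
import json

# Floor tag named in the commit message; used on fresh install / DR.
FLOOR_TAG = "32.0.9"


def render_image_tag(live_objects, floor=FLOOR_TAG):
    """F2 sketch: pick the tag to render into chart values.

    live_objects mimics the list a plural data source would return:
    empty when the Deployment does not exist yet, otherwise one dict
    per matching object.
    """
    if not live_objects:
        # Nothing live (fresh install / DR): fall back to the floor.
        return floor
    containers = live_objects[0]["spec"]["template"]["spec"]["containers"]
    # Assumption: the Nextcloud container is listed first.
    image = containers[0]["image"]
    _, sep, tag = image.rpartition(":")
    # An image without an explicit tag also falls back to the floor.
    return tag if sep else floor


def needs_occ_upgrade(status_json):
    """F1 sketch: True when the reported status asks for a DB upgrade."""
    status = json.loads(status_json)
    return bool(status.get("needsDbUpgrade"))
```

With an empty list the sketch returns the floor, and with a live Deployment running `nextcloud:32.0.10` it returns `32.0.10`, so a re-render follows whatever Keel last rolled out rather than a stale pin.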

Scope is patch: Kyverno's shared inject-keel-annotations policy stamps that
value and its background-controller overrides any TF-set one. Patch == minor
for Nextcloud in practice (32.0.x only; major 33 stays manual). Dropped the
per-workload keel.sh/policy override resources to avoid perpetual drift; ns
enrollment + Kyverno now own the keel annotations, as for other workloads.

Also bumps the external-storage bootstrap Job create timeout 1m->12m to match
its own 10m pod-wait, since Keel bumps now roll the pod mid-apply.

Verified: Keel auto-upgraded 32.0.9->32.0.10 on apply, entrypoint occ upgrade
completed clean (no watchdog needed), pod 2/2, HTTP 200, plan shows no drift.
Viktor Barzin 2026-06-01 18:43:51 +00:00
parent 50d0f1affa
commit fb1e47a20a
4 changed files with 133 additions and 56 deletions

@@ -1,13 +1,16 @@
-# Pin the image to 32.0.9 (apache). On 2026-05-26 Keel bumped the live
-# Deployment 32.0.3 → 32.0.9-apache and the DATA migrated to 32.0.9.2; Keel
-# was then disabled but chart_values was never pinned, so it kept defaulting
-# to the chart's appVersion (32.0.3). A 2026-06-01 `terragrunt apply`
-# reconciled that drift, rolled a 32.0.3 pod, and Nextcloud refused to
-# downgrade (data 32.0.9.2 > image 32.0.3.2) → CrashLoopBackOff. Pinning here
-# keeps TF the source of truth and matches the on-disk data version.
+# image.tag is rendered dynamically (templatefile var `image_tag`) from the
+# CURRENT live Deployment tag, falling back to var.nextcloud_image_tag_floor
+# (32.0.9) on fresh install / DR — see stacks/nextcloud/main.tf
+# `data.kubernetes_resource.nextcloud_live` + locals. This makes helm upgrades
+# image-no-ops in steady state and means a re-render can NEVER downgrade below
+# the Keel-bumped live tag (the 2026-06-01 CrashLoop: a pinned 32.0.3 lost to
+# live 32.0.9 and Nextcloud refused the downgrade). Keel (keel.sh/policy=minor)
+# bumps the live tag upward within major 32; the next apply just follows it.
+# flavor=apache renders the bare apache-default tag (live image is
+# `nextcloud:<tag>`, no -apache suffix).
 image:
   flavor: apache
-  tag: "32.0.9"
+  tag: "${image_tag}"
 nextcloud:
   host: nextcloud.viktorbarzin.me