nextcloud: re-enable Keel auto-upgrades with occ-upgrade self-heal + live-tag floor

Re-enrolls Nextcloud in Keel (opted out after the 2026-05-26 32.0.3->32.0.9
bump left the pod stuck in maintenance mode for ~22h). Two safeguards address
both failure modes:

- F1 (interrupted occ upgrade -> 503): nextcloud-watchdog CronJob runs
  `occ upgrade` + clears maintenance mode when occ reports needsDbUpgrade=true;
  Job deadline bumped 120->600s so it isn't killed mid-migration.
- F2 (helm re-renders a tag below the Keel-bumped live image -> downgrade
  CrashLoop): chart_values renders the live tag via a plural
  kubernetes_resources data source (empty-list-on-absence -> floor 32.0.9 on
  fresh install/DR), so a re-render never downgrades below live.
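
As a rough illustration of the F2 fallback above, this shell sketch derives the
tag to render from a live image reference and falls back to a floor when no
Deployment exists. `resolve_tag` is a hypothetical helper, not part of the
stack (the real logic lives in Terraform locals). Unlike the Terraform
expression, the sketch strips `-apache` only as a suffix and treats an untagged
reference as absent; both are assumptions made for the example.

```shell
#!/bin/sh
# Hypothetical helper: print the image tag to render.
#   $1  live image reference, or "" when no live Deployment exists
#   $2  floor tag used on fresh install / DR
resolve_tag() {
  live_image="$1"
  floor="$2"
  case "$live_image" in
    *:*) tag="${live_image##*:}" ;;  # last colon-segment is the tag
    *)   tag="" ;;                   # no tag found: fall through to the floor
  esac
  tag="${tag%-apache}"               # drop the optional flavor suffix
  if [ -n "$tag" ]; then
    printf '%s\n' "$tag"
  else
    printf '%s\n' "$floor"
  fi
}

resolve_tag 'nextcloud:32.0.10-apache' '32.0.9'   # prints 32.0.10
resolve_tag '' '32.0.9'                           # prints 32.0.9
```

Taking the last colon-segment rather than the first keeps a
`registry:port/repo:tag` reference from being split at the port.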

Scope is patch: Kyverno's shared inject-keel-annotations policy stamps it, and
Kyverno's background-controller overwrites any TF-set value. patch == minor for
Nextcloud in practice (32.0.x only; major 33 stays manual). Dropped the
per-workload keel.sh/policy override resources to avoid perpetual drift; ns
enrollment + Kyverno now own the keel annotations like other workloads.
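
To make the patch scope concrete, here is a minimal sketch of what "patch-level
bump" means for three-part tags such as 32.0.9. `is_patch_bump` is a
hypothetical name and this is not Keel's implementation; it assumes plain
numeric `major.minor.patch` tags with no suffix.

```shell
#!/bin/sh
# Hypothetical check: succeed only if $2 is a patch-level bump over $1,
# i.e. same major.minor and a strictly higher patch number.
is_patch_bump() {
  old="$1"
  new="$2"
  old_mm="${old%.*}"   # major.minor of the old tag
  new_mm="${new%.*}"   # major.minor of the new tag
  old_p="${old##*.}"   # patch component of the old tag
  new_p="${new##*.}"   # patch component of the new tag
  [ "$old_mm" = "$new_mm" ] && [ "$new_p" -gt "$old_p" ]
}

if is_patch_bump 32.0.9 32.0.10; then echo "in scope"; fi          # prints in scope
if ! is_patch_bump 32.0.9 33.0.0; then echo "out of scope"; fi     # prints out of scope
```

The patch components are compared numerically (`-gt`), since a string
comparison would order 10 before 9.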

Also bumps the external-storage bootstrap Job create timeout 1m->12m to match
its own 10m pod-wait, since Keel bumps now roll the pod mid-apply.

Verified: Keel auto-upgraded 32.0.9->32.0.10 on apply, entrypoint occ upgrade
completed clean (no watchdog needed), pod 2/2, HTTP 200, plan shows no drift.
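
For reference, the F1 gate can be exercised offline. The sketch below applies
the same `needsDbUpgrade` match the watchdog uses to a captured status string;
`needs_db_upgrade` is a hypothetical wrapper, and the sample JSON strings are
illustrative rather than captured from the cluster.

```shell
#!/bin/sh
# Hypothetical wrapper around the watchdog's gate.
#   $1  raw output of `php occ status --output=json` (empty if the exec failed)
# Succeeds only when occ reports a pending DB upgrade.
needs_db_upgrade() {
  printf '%s' "$1" | grep -q '"needsDbUpgrade":true'
}

# Illustrative inputs (assumed shape, not real cluster output).
interrupted='{"maintenance":true,"needsDbUpgrade":true}'
manual_window='{"maintenance":true,"needsDbUpgrade":false}'

if needs_db_upgrade "$interrupted"; then echo "self-heal"; fi         # prints self-heal
if ! needs_db_upgrade "$manual_window"; then echo "leave alone"; fi   # prints leave alone
```

Because the gate ignores the `maintenance` field, a deliberate manual
maintenance window is not touched, and an empty status from a failed exec is
treated as "nothing to do".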
Viktor Barzin 2026-06-01 18:43:51 +00:00
parent 50d0f1affa
commit fb1e47a20a
4 changed files with 133 additions and 56 deletions


@@ -101,7 +101,7 @@ This is added per workload as we phase in. Mechanical, grep-able.
| 2 | Stateless third-party web apps (linkwarden, postiz, affine, etc.) | No migrations |
| 3 | Exporters, sidecars, utilities | Stateless |
| 4 | Stateful-but-tolerant (Grafana, Prometheus, etc.) | Restart-safe state |
| 5 | State-coupled with migrations (Nextcloud, Forgejo, paperless-ngx, mailserver) | Schema-migration risk |
| 5 | State-coupled with migrations (Nextcloud, Forgejo, paperless-ngx, mailserver) | Schema-migration risk. **Nextcloud enrolled 2026-06-01** with two safeguards for the migration risk: F1 — `nextcloud-watchdog` CronJob runs `occ upgrade` when occ reports `needsDbUpgrade=true` (recovers an interrupted entrypoint upgrade); F2 — `chart_values.yaml` renders the live (Keel-bumped) image tag with a floor, so a helm re-render never downgrades below live. Scope is `patch` (Kyverno-stamped) == `minor` for Nextcloud (32.0.x only). See `stacks/nextcloud/main.tf`. |
| 6 | Authentik | Auth outage |
| 7 | Operators (cnpg-operator, ESO, kured, descheduler) | Operator skew |
| 8 | Critical infra (Calico, proxmox-csi, nfs-csi, traefik, metallb) | Node-level outage potential (memory id=390: 26h Calico cascade) |


@@ -1,13 +1,16 @@
# Pin the image to 32.0.9 (apache). On 2026-05-26 Keel bumped the live
# Deployment 32.0.3 → 32.0.9-apache and the DATA migrated to 32.0.9.2; Keel
# was then disabled but chart_values was never pinned, so it kept defaulting
# to the chart's appVersion (32.0.3). A 2026-06-01 `terragrunt apply`
# reconciled that drift, rolled a 32.0.3 pod, and Nextcloud refused to
# downgrade (data 32.0.9.2 > image 32.0.3.2) → CrashLoopBackOff. Pinning here
# keeps TF the source of truth and matches the on-disk data version.
# image.tag is rendered dynamically (templatefile var `image_tag`) from the
# CURRENT live Deployment tag, falling back to var.nextcloud_image_tag_floor
# (32.0.9) on fresh install / DR — see stacks/nextcloud/main.tf
# `data.kubernetes_resources.nextcloud_live` + locals. This makes helm upgrades
# image-no-ops in steady state and means a re-render can NEVER downgrade below
# the Keel-bumped live tag (the 2026-06-01 CrashLoop: a pinned 32.0.3 lost to
# live 32.0.9 and Nextcloud refused the downgrade). Keel (keel.sh/policy=patch,
# Kyverno-stamped) bumps the live tag upward within 32.0.x; the next apply just
# follows it.
# flavor=apache renders the bare apache-default tag (live image is
# `nextcloud:<tag>`, no -apache suffix).
image:
flavor: apache
tag: "32.0.9"
tag: "${image_tag}"
nextcloud:
host: nextcloud.viktorbarzin.me


@@ -116,6 +116,16 @@ resource "kubernetes_role_binding" "nextcloud_external_storage_bootstrap" {
# Bootstrap Job
resource "kubernetes_job_v1" "nextcloud_external_storage_bootstrap" {
# The bootstrap script (below) waits up to 10m for the NC pod to be Ready.
# kubernetes_job_v1's default create timeout is only 1m, which spuriously
# fails the apply whenever the NC pod takes >1m to come up, e.g. now that
# Keel auto-upgrades nextcloud, a bump mid-apply runs `occ upgrade` in the
# entrypoint and delays readiness past 1m (observed 2026-06-01). Match the
# script's 10m wait plus margin.
timeouts {
create = "12m"
}
metadata {
name = "nextcloud-external-storage-bootstrap"
namespace = kubernetes_namespace.nextcloud.metadata[0].name


@@ -6,13 +6,60 @@ variable "nfs_server" { type = string }
variable "redis_host" { type = string }
variable "mysql_host" { type = string }
# FLOOR only. Keel bumps the LIVE image tag upward (patch policy); the
# data source below renders the current live tag so a helm apply never
# downgrades below what Keel installed. This floor only wins on a fresh
# install / DR (no live Deployment) or after deliberately restoring an
# OLDER DB snapshot (bump this to match; see comment on the data source).
variable "nextcloud_image_tag_floor" {
type = string
default = "32.0.9"
}
data "vault_kv_secret_v2" "secrets" {
mount = "secret"
name = "nextcloud"
}
# Render the CURRENT live image tag so helm upgrades are image-no-ops and
# can NEVER downgrade below the Keel-bumped live tag (failure mode F2: the
# 2026-06-01 CrashLoop where a pinned 32.0.3 re-render lost to live 32.0.9).
# Helm-managed workloads can't use the raw-Deployment KEEL_IGNORE_IMAGE
# `lifecycle.ignore_changes` trick (immich/freshrss main.tf), so we feed the
# live tag back into the chart instead.
#
# Use the PLURAL `kubernetes_resources` (field-selected to name=nextcloud), NOT
# the singular `kubernetes_resource`: in kubernetes provider 3.1.0 the singular
# data source ERRORS ("Provider produced null object") when the target is
# absent, and try() can't rescue it (the failure is at the provider read, not
# the expression). The plural returns an empty `objects` list on no match, so
# objects[0] + try() cleanly falls back to var.nextcloud_image_tag_floor on
# fresh install / DR. (Verified empirically against provider 3.1.0.)
#
# namespace is the LITERAL "nextcloud", NOT
# kubernetes_namespace.nextcloud.metadata[0].name, on purpose: referencing the
# namespace resource makes Terraform defer this data read to apply time
# whenever the namespace has a pending change (e.g. the keel.sh/enrolled label
# add): "(depends on a resource ... with changes pending)". That leaves the
# tag unknown at plan, turning every helm plan into an unverifiable
# (known after apply) values churn. A static namespace decouples the read so it
# resolves at plan time.
data "kubernetes_resources" "nextcloud_live" {
api_version = "apps/v1"
kind = "Deployment"
namespace = "nextcloud"
field_selector = "metadata.name=nextcloud"
}
locals {
homepage_credentials = jsondecode(data.vault_kv_secret_v2.secrets.data["homepage_credentials"])
_live_image = try(data.kubernetes_resources.nextcloud_live.objects[0].spec.template.spec.containers[0].image, "")
# Last colon-segment is the tag (handles registry:port/repo:tag); strip the
# optional `-apache` flavor suffix so it round-trips through the chart's
# `image.flavor=apache` (which renders the bare apache-default tag).
_live_tag = try(replace(element(split(":", local._live_image), length(split(":", local._live_image)) - 1), "-apache", ""), "")
nextcloud_image_tag = local._live_tag != "" ? local._live_tag : var.nextcloud_image_tag_floor
}
@@ -30,14 +77,26 @@ resource "kubernetes_namespace" "nextcloud" {
tier = local.tiers.edge
"resource-governance/custom-limitrange" = "true"
"resource-governance/custom-quota" = "true"
# Keel disabled for nextcloud: the 2026-05-26 Keel-driven bump
# 32.0.3-apache -> 32.0.9-apache left the pod in maintenance mode
# (needsDbUpgrade=true) for ~22h because Keel doesn't run
# `occ upgrade` after rolling the image. Defense-in-depth:
# (a) namespace not enrolled here, (b) workload carries the
# `keel.sh/policy=never` label + annotation below so even if the
# ns label gets re-added, Kyverno excludes this Deployment.
# "keel.sh/enrolled" = "true"
# Keel re-enabled 2026-06-01 (was disabled after the 2026-05-26 bump
# 32.0.3 -> 32.0.9 stuck the pod in maintenance mode for ~22h). Two
# safeguards make auto-upgrade safe, engineered around BOTH failure modes:
# F1: interrupted `occ upgrade` (entrypoint copies version.php before
# occ upgrade finishes, so a probe-restart mid-upgrade leaves the
# DB half-migrated -> 503): the nextcloud-watchdog CronJob below
# self-heals by running `occ upgrade` when occ reports
# needsDbUpgrade=true.
# F2: helm re-renders a tag BELOW the Keel-bumped live image ->
# Nextcloud refuses the downgrade -> CrashLoop (the 2026-06-01
# incident): chart_values renders the live tag with a floor, so a
# re-render is never below live.
# Scope: the shared Kyverno `inject-keel-annotations` policy stamps
# keel.sh/policy=patch (+ trigger=poll + pollSchedule) on enrolled
# workloads. For Nextcloud patch == minor in practice: it only ships
# 32.0.x maintenance releases (never 32.1.x), and major 33 needs `major`
# policy and stays manual (the entrypoint's +1-major limit enforces that
# anyway). We deliberately do NOT override the policy per-workload; see
# the note where the old override resources used to live, below.
"keel.sh/enrolled" = "true"
}
}
lifecycle {
@@ -46,40 +105,24 @@ resource "kubernetes_namespace" "nextcloud" {
}
}
# Workload-level Keel opt-out (see namespace comment above).
# Keel reads the ANNOTATION `keel.sh/policy` (it's what un-tracks the
# image watcher); the LABEL exists for the Kyverno exclude rule in
# `inject-keel-annotations` (defense-in-depth in case the namespace
# label gets re-added later). Both are set via these helper resources
# because the nextcloud chart 8.8.1 doesn't expose Deployment-level
# commonLabels / commonAnnotations.
resource "kubernetes_labels" "nextcloud_keel_optout" {
api_version = "apps/v1"
kind = "Deployment"
metadata {
name = "nextcloud"
namespace = kubernetes_namespace.nextcloud.metadata[0].name
}
labels = {
"keel.sh/policy" = "never"
}
force = true
depends_on = [helm_release.nextcloud]
}
resource "kubernetes_annotations" "nextcloud_keel_optout" {
api_version = "apps/v1"
kind = "Deployment"
metadata {
name = "nextcloud"
namespace = kubernetes_namespace.nextcloud.metadata[0].name
}
annotations = {
"keel.sh/policy" = "never"
}
force = true
depends_on = [helm_release.nextcloud]
}
# No per-workload Keel override resources here, on purpose. Nextcloud is
# enrolled via the namespace label above; the shared Kyverno
# `inject-keel-annotations` policy then stamps keel.sh/policy=patch +
# trigger=poll + pollSchedule, and Keel auto-upgrades within 32.0.x.
#
# This stack used to carry kubernetes_labels + kubernetes_annotations
# resources forcing keel.sh/policy=minor (and before that =never, for the
# opt-out). Both were removed 2026-06-01 after re-enabling Keel because each
# produced perpetual drift:
# - Kyverno's background-controller overwrites a TF-set policy back to
# `patch` despite the policy's `+(keel.sh/policy)` add-if-missing anchor
# (observed live: the annotation's field manager was background-controller
# with value patch right after a Keel-bump admission).
# - The helm release strips the deployment's keel.sh/policy LABEL on every
# roll, so TF re-added it on every apply.
# patch == minor for Nextcloud (32.0.x only; major 33 needs `major` and stays
# manual), so letting Kyverno own the keel annotations exactly like every
# other enrolled workload (immich, freshrss) is both correct and drift-free.
resource "kubernetes_manifest" "external_secret" {
manifest = {
@@ -191,7 +234,7 @@ resource "helm_release" "nextcloud" {
atomic = true
version = "8.8.1"
values = [templatefile("${path.module}/chart_values.yaml", { tls_secret_name = var.tls_secret_name, mysql_host = var.mysql_host })]
values = [templatefile("${path.module}/chart_values.yaml", { tls_secret_name = var.tls_secret_name, mysql_host = var.mysql_host, image_tag = local.nextcloud_image_tag })]
timeout = 6000
depends_on = [kubernetes_manifest.db_external_secret]
}
@@ -457,9 +500,12 @@ resource "kubernetes_config_map" "backup-script" {
}
}
# Watchdog: auto-restart Nextcloud when Apache workers go runaway
# Checks every 5 minutes if Apache has >40 active workers (normal is 5-15).
# If runaway detected, restarts the deployment to recover node CPU.
# Watchdog: runs every 5 minutes with two jobs:
# 1. Apache runaway recovery: if >40 workers (normal 5-15), rollout-restart
# to recover node CPU.
# 2. F1 Keel self-heal: if occ reports needsDbUpgrade=true (an interrupted
# `occ upgrade` after a Keel image bump left the app in maintenance mode),
# re-run `occ upgrade` and clear maintenance mode.
resource "kubernetes_service_account" "nextcloud_watchdog" {
metadata {
name = "nextcloud-watchdog"
@@ -521,7 +567,9 @@ resource "kubernetes_cron_job_v1" "nextcloud_watchdog" {
job_template {
metadata {}
spec {
active_deadline_seconds = 120
# 600s (was 120s) so the F1 self-heal `occ upgrade` isn't killed
# mid-migration. concurrency_policy=Forbid prevents overlap.
active_deadline_seconds = 600
template {
metadata {}
spec {
@@ -554,6 +602,22 @@ resource "kubernetes_cron_job_v1" "nextcloud_watchdog" {
else
echo "Apache workers within normal range ($WORKERS <= 40)"
fi
# F1 self-heal: a Keel image bump runs `occ upgrade` in the
# entrypoint, but if that's interrupted (e.g. a probe restart
# mid-upgrade) occ reports needsDbUpgrade=true and the app sits
# in maintenance mode (503). Re-run the upgrade and clear
# maintenance mode. Gated on needsDbUpgrade only, so a
# deliberate manual maintenance window is left untouched.
ST=$(kubectl exec -n nextcloud "$POD" -c nextcloud -- php occ status --output=json 2>/dev/null || true)
if echo "$ST" | grep -q '"needsDbUpgrade":true'; then
echo "$(date): needsDbUpgrade=true → running occ upgrade"
kubectl exec -n nextcloud "$POD" -c nextcloud -- php occ upgrade --no-interaction || true
kubectl exec -n nextcloud "$POD" -c nextcloud -- php occ maintenance:mode --off || true
echo "$(date): self-heal occ upgrade complete"
else
echo "$(date): occ status healthy (no DB upgrade pending)"
fi
EOF
]
}