Phase 3 of the ESO 0.12->2.6 migration (the last k8s-1.35 compat-gate blocker).
Climbed external-secrets 0.16.2 -> 0.17.0 -> ... -> 2.6.0 one minor at a time,
each hop applied + verified (ES sync held at 109 Ready every hop; atomic=true
rollback safety net). Crossed the 0.17 cutoff (v1beta1 serving removed) only
after Phase 2 put all 104 ExternalSecrets + 2 ClusterSecretStores on
external-secrets.io/v1. Result: compat-gate now returns "OK: cluster is safe to
upgrade to 1.35.6" (EXIT 0) — the autonomous version-check chain will take k8s
1.34 -> 1.35 on its next nightly run.
Also fixes the repo-wide stale-lock issue that broke CI pipeline 332: the
terragrunt-generated providers.tf declares gavinbunney/kubectl + telmate/proxmox,
but ~28-39 stacks' committed .terraform.lock.hcl predated that ("Inconsistent
dependency lock file: no version selected"). Reconciled via `tg init -upgrade`
and committed so `terragrunt apply`/CI work cleanly again.
Docs: .claude/CLAUDE.md ESO line corrected (104 ESs, v1, chart 2.6.0); plan doc
marked COMPLETE.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
External Secrets Operator: 0.12.1 → 2.6.0 Migration (v1beta1 → v1) — Design Doc
Status: ✅ COMPLETE (2026-06-22). ESO at chart/app 2.6.0; all 104 ExternalSecrets + 2 ClusterSecretStores on `external-secrets.io/v1`; 109 ESs SecretSynced (2 pre-existing dead); compat-gate now returns `OK: cluster is safe to upgrade to 1.35.6` (EXIT 0), so the last k8s-1.35 blocker is cleared. Executed Phase 1 (climb to 0.16.2) → Phase 2 (v1 rewrite, validated GC-survival on tandoor) → Phase 3 (climb 0.16.2→2.6.0 across the 0.17 cutoff, ES sync held at 109 every hop). Side-finding fixed: repo-wide stale `.terraform.lock.hcl` files (missing gavinbunney/kubectl + telmate/proxmox from the generated providers.tf) had broken `terragrunt apply` for ~28 stacks (this is what failed CI pipeline 332); reconciled via `init -upgrade` + committed.
Scope: Upgrade the ESO Helm chart `0.12.1` (app `v0.12.1`) to `2.6.0` (app `v2.6.0`) and migrate every `external-secrets.io/v1beta1` custom resource to `external-secrets.io/v1`.
Owner: Viktor Barzin. Author: Claude (research + design only; no changes applied).
EXECUTION CORRECTION + STATUS (2026-06-21, "let's do the ESO migration"): The cluster is already on k8s 1.34.9 (all 7 nodes), NOT ≤1.31 as §4.3 assumed. ESO 0.12 runs fine on 1.34 (the support-matrix bands are conservative tested ranges, not hard limits). The entire ESO climb 0.12→2.6 therefore happens on k8s 1.34; there is NO k8s interleave, so IGNORE the "advance k8s to 1.32/1.33" steps in §4.3 / Phase 1 / Phase 3. Only AFTER ESO reaches 2.x does the nightly version-check chain take k8s 1.34→1.35 (gate clears).
Exact hop sequence (latest patch per minor): 0.13.0 → 0.14.4 → 0.15.1 → 0.16.2 [rewrite all 104 CRs to `v1` here] → 0.17.0 → 0.18.2 → 0.19.2 → 0.20.4 → 1.0.0 → 1.1.1 → 1.2.1 → 1.3.2 → 2.0.1 → 2.1.0 → 2.2.0 → 2.3.0 → 2.4.1 → 2.5.0 → 2.6.0.
Pre-flight done: CRD `storedVersions` are `["v1beta1"]` only (no v1alpha1 patch needed).
EXECUTION LOG:
- ✅ Phase 1 DONE (2026-06-21): ESO climbed 0.12.1 → 0.13.0 → 0.14.4 → 0.15.1 → 0.16.2, one hop at a time, each applied + verified (controller healthy; 108 live ExternalSecrets stayed SecretSynced; 2 pre-existing dead: `instagram-poster/instagram-poster-secrets` False since 2026-05-10, `payslip-ingest/payslip-ingest-secrets` False since 2026-04-25, both missing Vault data, untouched). Added `atomic=true` + `timeout=600` to the helm_release. At 0.16.2 both `v1beta1` and `v1` are served (110 each) and `storedVersions = ["v1beta1","v1"]`. Committed (`eso: Phase 1 …`); state auto-committed per hop by `scripts/tg`.
- ⏳ Phase 2 PENDING, findings confirmed (decisive for execution): (a) bumping a `kubernetes_manifest` ExternalSecret's apiVersion v1beta1→v1 forces a REPLACE (verified live on instagram-poster: `-/+ must be replaced`), NOT in-place. (b) Our ExternalSecrets use `creationPolicy=Owner` (default; confirmed on nextcloud) → target Secrets carry an ownerReference, so the replace's delete step can cascade-GC the Secret before ESO recreates it. → Phase 2 must be done carefully, NOT a blind bulk apply: (1) snapshot ALL target Secrets first (backstop); (2) empirically validate on the FIRST live stack: migrate one ES while watching its target Secret; ESO re-syncs the identical spec fast and should re-adopt before GC, but confirm before proceeding; (3) then the per-stack two-phase `-target`-then-full apply (the 15 plan-time-coupled stacks need `-target` first). If validation shows GC wins, pivot to `state rm` + `import {}` (adopts the already-v1-served object with zero delete → zero GC). Repo is clean at v1beta1 (the lone test edit was reverted, never applied).
- Phase 3 PENDING: hops 0.17.0 → 0.18.2 → 0.19.2 → 0.20.4 → 1.0.0 → 1.1.1 → 1.2.1 → 1.3.2 → 2.0.1 → 2.1.0 → 2.2.0 → 2.3.0 → 2.4.1 → 2.5.0 → 2.6.0 (all on k8s 1.34, CRs already v1). Crossing 0.17 is the point of no return.
1. Goal & why
ESO is the last remaining compatibility gate blocking the autonomous k8s 1.35 upgrade (Kyverno was cleared to 1.18.1 earlier today). The installed ESO 0.12.x supports only Kubernetes 1.19 → 1.31 (support matrix); the k8s-version-check chain will refuse to advance the cluster past 1.31 while ESO sits at 0.12. The 2.x series supports k8s 1.34–1.35, which clears the gate.
The hard part is not the chart bump itself — it is that ESO removed the external-secrets.io/v1beta1 API, and every one of our ExternalSecret / ClusterSecretStore resources is currently declared v1beta1. If we upgrade past the removal version without first rewriting the manifests to v1, ESO stops reconciling and synced Secrets go stale (apps keep their last-good Secret, but rotations and new secrets break).
Downtime tolerance: brief, recoverable downtime of the ESO controller is acceptable. What must NOT happen is loss/corruption of the downstream Kubernetes Secret objects that apps mount (DB creds, API keys). Those must survive continuously.
2. Current state
2.1 Versions
| Component | Current | Target |
|---|---|---|
| Helm chart `external-secrets` | `0.12.1` | `2.6.0` |
| App / controller image | `v0.12.1` | `v2.6.0` |
| API version of all CRs | `external-secrets.io/v1beta1` | `external-secrets.io/v1` |
| Repo: https://charts.external-secrets.io | (unchanged) | (unchanged) |
ESO stack: stacks/external-secrets/main.tf. helm_release.external_secrets pins version = "0.12.1", namespace external-secrets (separate kubernetes_namespace resource, not create_namespace), and the only chart value set is installCRDs = true (via yamlencode({ installCRDs = true })). No webhook/replica/resource overrides.
2.2 Inventory (live, from stacks/)
| Kind | Count | apiVersion | Where |
|---|---|---|---|
| ExternalSecret (`kubernetes_manifest`) | 104 | all v1beta1 (0 mismatches) | 73 `.tf` files |
| ClusterSecretStore (definitions) | 2 | both v1beta1 | `stacks/external-secrets/main.tf` |
| SecretStore | 0 | — | — |
| PushSecret | 0 | — | — |
| ClusterExternalSecret | 0 | — | — |
- Only ONE apiVersion string exists in the whole tree: `external-secrets.io/v1beta1` (106 occurrences = 104 ExternalSecret + 2 ClusterSecretStore). Zero `v1`, zero `v1alpha1`. → a clean single-target rewrite.
- `secretStoreRef` split: 78 ExternalSecrets → `vault-kv`, 26 → `vault-database` (78 + 26 = 104). The `kind = "ClusterSecretStore"` string also appears inside every `secretStoreRef`, so a naive `grep 'kind = "ClusterSecretStore"'` returns 106; only 2 are real store definitions.
- 21 files carry >1 ExternalSecret (max: `stacks/fire-planner/main.tf` = 5; then wealthfolio / real-estate-crawler / phpipam / payslip-ingest / n8n / job-hunter / ebooks = 3 each; 13 files = 2). The 104-vs-73 gap is these multi-secret files (4 + 7×2 + 13×1 = 31 extra ExternalSecrets).
- Nested-module ExternalSecrets (easy to miss when scripting the bump): `stacks/instagram-poster/modules/instagram-poster/main.tf`, `stacks/postiz/modules/postiz/main.tf`, `stacks/technitium/modules/technitium/main.tf`, `stacks/mailserver/modules/mailserver/main.tf`, `stacks/monitoring/modules/monitoring/grafana.tf`, `stacks/proxmox-csi/modules/proxmox-csi/main.tf`.
- Docs are STALE: `.claude/CLAUDE.md` says "43 ExternalSecrets + 9 DB-creds". Live count is 104 ExternalSecrets / 73 files / 26 db-refs. Fix in the migration PR.
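
The inventory counts above can be re-derived with a small helper. This is a minimal sketch, assuming the apiVersion appears in the `.tf` files as a double-quoted string (as it would in a `kubernetes_manifest` block); `count_api_version` is a hypothetical name, not something in the repo.

```shell
# Count occurrences of a quoted ESO apiVersion string under a directory tree.
# The closing quote is part of the pattern, so asking for "v1" does not also
# count "v1beta1". Assumption: the string is double-quoted in the HCL.
count_api_version() {
  local dir="$1" version="$2"
  # grep exits non-zero when nothing matches; "|| true" keeps the count at 0
  # instead of failing the pipeline.
  { grep -rhoF --include='*.tf' "\"external-secrets.io/${version}\"" "$dir" || true; } \
    | wc -l | tr -d ' '
}
```

Run as `count_api_version stacks v1beta1` before the rewrite and `count_api_version stacks v1` after it; the two numbers should swap.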
2.3 The two ClusterSecretStores (stacks/external-secrets/main.tf)
Both kubernetes_manifest, both external-secrets.io/v1beta1, both depends_on = [helm_release.external_secrets]:
- `vault-kv` → Vault KV v2 at `path = "secret"`, server `http://vault-active.vault.svc.cluster.local:8200`, auth `kubernetes` mount `kubernetes`, role `eso`, SA `external-secrets/external-secrets`.
- `vault-database` → identical except `path = "database"`, `version = "v1"` (Vault DB engine, KV-v1-style).
ESO's Vault auth role eso (stacks/vault/main.tf:486-511): policy eso-reader (secret/data/* read+list, deny secret/data/vault, database/static-creds/* read), token_ttl = token_period = 864000 (10d, periodic/auto-renew).
2.4 Tier-0 / state
ESO is Tier-0 (bootstrap) (.claude/CLAUDE.md "Terraform State — Two-Tier Backend"; root terragrunt.hcl tier0_stacks = ["infra","platform","cnpg","vault","dbaas","external-secrets"]). Tier-0 ⇒ local SOPS-encrypted state in git (state/stacks/external-secrets/terraform.tfstate), NOT the PG backend. Workflow: git pull → scripts/tg plan → scripts/tg apply → git push; SOPS decrypt via Vault Transit (primary) → age fallback. Tier-0 must apply before PG is reachable, so the ESO upgrade cannot depend on PG.
2.5 Provider versions (stacks/external-secrets/providers.tf)
- `required_providers` declares only `vault = hashicorp/vault, ~> 4.0`.
- `provider "kubernetes"` and `provider "helm"` are declared without version constraints (resolve from root / `.terraform.lock.hcl`). The `helm` block already uses the v3-style nested `kubernetes = {…}` argument (not the legacy `kubernetes {}` block) ⇒ helm provider is v3.x or v4.x in the lockfile. No `kubectl` provider in this stack. No `required_version` pinned here.
- ⚠️ Verify the resolved helm provider version in `.terraform.lock.hcl` before starting: the prompt referenced `~> 4.0` for helm; the stack only pins that for `vault`. Either way the v3-syntax helm block + an SDK-v3 provider is compatible with the chart (see §4.5).
2.6 Plan-time coupling (the cross-cutting risk)
15 stacks read ESO-created Secrets at plan time via data "kubernetes_secret" (avoids a Vault dependency at plan): actualbudget, affine, changedetection, coturn, ebooks, fire-planner, freedify, freshrss, grampsweb, k8s-dashboard (dashboard_injector.tf), navidrome, owntracks, real-estate-crawler, servarr, technitium (modules/technitium).
The documented first-apply gotcha (.claude/CLAUDE.md, docs/architecture/secrets.md:360, stacks/fire-planner/main.tf:574): the Secret must exist before the data "kubernetes_secret" plans, so on first creation you must terragrunt apply -target=kubernetes_manifest.<external_secret> first, then full apply. Why this matters for the migration: the kubernetes_manifest provider treats apiVersion as part of resource identity, so bumping v1beta1→v1 forces a replace of all 104 ExternalSecrets. During replace there is a window where the new CR (and thus the synced Secret) may not yet be materialized when the same stack's data "kubernetes_secret" plans → the two-phase -target apply is needed fleet-wide for the v1 rewrite step, not just fire-planner.
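
The `-target`-then-full pattern described above could be wrapped per stack roughly as follows. This is a sketch under assumptions: it is run from the repo root, `scripts/tg` is invoked from inside the stack directory (how the wrapper actually locates the stack is not stated in this doc), and `two_phase_apply` and the `TG` variable are hypothetical names.

```shell
# Apply wrapper; assumed to live at <repo root>/scripts/tg. Override TG
# (for example TG=echo) to dry-run the sequence without touching anything.
TG="${TG:-$PWD/scripts/tg}"

# two_phase_apply <stack dir> [resource address ...]
# Phase 1: targeted apply of the listed ExternalSecret manifests.
# Phase 2: full apply, so plan-time data "kubernetes_secret" reads resolve.
two_phase_apply() {
  local stack="$1" t
  shift
  if [ "$#" -gt 0 ]; then
    # Rewrite each remaining argument in place as -target=<address>.
    for t in "$@"; do
      set -- "$@" "-target=$t"
      shift
    done
    ( cd "$stack" && "$TG" apply "$@" ) || return 1
  fi
  ( cd "$stack" && "$TG" apply )
}
```

A dry run such as `TG=echo; two_phase_apply stacks/fire-planner kubernetes_manifest.<external_secret>` prints the two apply invocations in order; the function stops after the first phase if the targeted apply fails.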
2.7 Vault DB rotation (rotation interplay)
stacks/vault/main.tf: 25 vault_database_secret_backend_static_role, every one rotation_period = 604800 (7 days) (8 MySQL + 17 PostgreSQL static roles). ESO syncs these via vault-database → remoteRef.key = "static-creds/<role>". Apps reading a rotated secret only at startup carry a Stakater Reloader annotation. Implication: any ESO controller downtime longer than the gap to the next rotation could leave a Secret stale across a rotation; keep controller downtime short and re-sync promptly.
2.8 git-crypt landmine (adjacent, not in ESO stack)
.claude/CLAUDE.md:146 + docs/architecture/ci-cd.md:108 + stacks/kyverno/modules/kyverno/tls-secret-sync.tf: on a git-crypt-locked clone, kubernetes_secret.tls_secret reads secrets/fullchain.pem/privkey.pem via file() which returns ciphertext, corrupting the wildcard TLS secret Kyverno clones cluster-wide. The ESO stack itself has NO file() reads of git-crypt secrets — so this landmine does not bite the ESO upgrade directly. It is listed here only as a guardrail: do not piggyback unrelated kyverno applies during this work, and run all applies from an unlocked checkout.
3. Target
- Helm chart `external-secrets` 2.6.0 (app v2.6.0), repo `https://charts.external-secrets.io`.
- All ExternalSecret + ClusterSecretStore CRs on `external-secrets.io/v1`.
external-secrets.io/v1. - Cluster ESO compatible with k8s 1.34–1.35 ⇒ unblocks the autonomous 1.35 upgrade.
4. Key findings (the decisive facts)
Sourced from ESO official docs + GitHub release notes; verbatim quotes below.
4.1 Chart version == app version (premise check)
The chart version and app version are released in lockstep and are the same number. Chart.yaml: version: 0.12.1 / appVersion: v0.12.1; version: 2.6.0 / appVersion: v2.6.0. The app series ran …0.20.4 → 1.0.0 → … → 2.0.0 → … → 2.6.0. Crucially, the v1.0.0 and v2.0.0 APP releases are NOT the external-secrets.io/v1 API — v1.0.0 is just "continuation after 0.20.4" (release diff v0.20.4...v1.0.0, no API change), and v2.0.0's only breaking change is removing the unmaintained Alibaba + Device42 providers (we use neither — only Vault). The API migration happened back at 0.16/0.17. Source: v1.0.0 notes · v2.0.0 notes.
4.2 Version path: NO skipping minors — step one minor at a time
Official policy, verbatim (stability-support):
"Upgrade version by version — We strongly recommend upgrading one minor version at a time (e.g., 0.18.x → 0.19.x → 0.20.x) rather than skipping versions."
Maintainer (issue #4785, @gusfcarvalho): "We are pre release… Every minor bump should be treated as a major bump until we go 1.0." ⇒ You CANNOT helm-upgrade 0.12.1 → 2.6.0 directly. You must step each minor: 0.12 → 0.13 → 0.14 → 0.15 → 0.16 → 0.17 → 0.18 → 0.19 → 0.20 → 1.x → 2.x.
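
To make the no-skipping rule mechanical, the hop order can be encoded once and queried per step. A minimal sketch: the version list is copied from the execution log's hop sequence (latest patch per minor, prefixed with the 0.12.1 starting point); `ESO_HOPS` and `next_hop` are hypothetical names.

```shell
# Hop order from the execution log, starting at the installed 0.12.1.
ESO_HOPS="0.12.1 0.13.0 0.14.4 0.15.1 0.16.2 0.17.0 0.18.2 0.19.2 0.20.4 1.0.0 1.1.1 1.2.1 1.3.2 2.0.1 2.1.0 2.2.0 2.3.0 2.4.1 2.5.0 2.6.0"

# Print the version that follows $1 in the hop list.
# Returns non-zero if $1 is empty, not in the list, or already the last hop.
next_hop() {
  local current="$1" prev="" v
  [ -n "$current" ] || return 1
  for v in $ESO_HOPS; do
    if [ "$prev" = "$current" ]; then
      echo "$v"
      return 0
    fi
    prev="$v"
  done
  return 1
}
```

For example, `next_hop 0.16.2` prints `0.17.0`, which is the hop that must wait for the v1 rewrite.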
4.3 k8s ↔ ESO must advance roughly in lockstep
Each ESO release targets a narrow k8s band (support matrix):
| ESO | k8s band |
|---|---|
| 0.12.x | 1.19 → 1.31 |
| 0.16.x | 1.32 |
| 0.17.x | 1.33 |
| 2.0 – 2.5 | 1.34 – 1.35 |
| 2.6 (latest) | (matrix row not yet appended; 2.x band is consistently 1.34–1.35 — see Open Questions) |
This is the single most important sequencing constraint. ESO doesn't "support only ≤ its max k8s" in a wide range — older ESO may not run cleanly on a much newer k8s either. The bands imply the ESO upgrade and the k8s upgrade need to be interleaved, not "finish ESO, then bump k8s in one jump." Practical reading: the cluster is currently on k8s ≤1.31 (ESO 0.12 blocks past it). The 0.16/0.17 steps want k8s 1.32/1.33; the 2.x steps want 1.34/1.35. So this is a coordinated ESO+k8s climb, e.g. ESO→0.16 alongside k8s→1.32, ESO→0.17 alongside k8s→1.33, then ESO→2.x alongside k8s→1.34→1.35. (The k8s climb is itself sequential via the version-check chain; this doc focuses on the ESO half but flags the coupling — see Open Questions for who drives the interleave.)
4.4 API migration: must rewrite manifests to v1 FIRST — there is NO v1beta1→v1 conversion webhook
- `external-secrets.io/v1` promoted to STORAGE version: v0.16.0. v0.16.0 release notes "BREAKING CHANGES": "Promotion of ExternalSecret/v1 and SecretStore/v1 and their cluster counterparts" and "Removal of Conversion Webhooks and …/v1alpha1…". From 0.16, etcd stores `v1`. Source: v0.16.0 notes.
- `external-secrets.io/v1beta1` STOPS BEING SERVED (hard cutoff): v0.17.0. Verbatim (v0.17.0 notes): "v0.17.0 Stops serving `v1beta1` apis. You need to update your manifests from `v1beta1` to `v1` prior to updating from `v0.16` to `v0.17`. The only change needed is upgrading your manifests to `v1` (i.e. removing the `beta1` from `v1beta1`). … Be sure to do that to all your manifests prior to bumping to `v0.17.0`! `v0.16.2` already supports `v1` so this process should be smooth."
- No v1beta1→v1 conversion webhook. The only conversion webhook that ever existed was v1alpha1→v1beta1, removed in 0.16. Maintainer (issue #5478, @gusfcarvalho): the post-0.16 "drift" is simply that etcd now stores v1: "This isn't really a conversion issue." ⇒ old v1beta1 manifests do NOT keep working past 0.17 via any auto-conversion.
  - Verdict: MUST-REWRITE-FIRST. Rewrite all CRs to `v1` while on 0.16.x (which serves both v1beta1 and v1), then upgrade to 0.17. Real-world confirmation (issue #4785, @Dutchy-): "I was able to change v1beta1 to v1 on 0.16 without issues. After that I was able to upgrade to 0.17."
  - There is a deprecated escape hatch in chart 2.6.0: `unsafeServeV1Beta1: true` re-enables v1beta1 serving for stragglers, but its own values comment says "This flag will be removed on 2026.05.01" (i.e. already past, do not rely on it).
- Schema change is a PURE apiVersion string bump, ZERO field changes. CRD `openAPIV3Schema` diff (v0.16.2 bundle, which serves both): ExternalSecret / SecretStore / ClusterSecretStore / ClusterExternalSecret have byte-identical spec field sets between v1beta1 and v1 (`{data, dataFrom, refreshInterval, refreshPolicy, secretStoreRef, target}` for ExternalSecret). Maintainer (issue #4785, @Skarlso): "Just change your manifests to be v1 and upgrade… We don't have anything fancy that you need to do." PushSecret only ever had `v1alpha1` (no v1beta1), so it is unaffected (we have 0 anyway).
4.5 Helm chart values + CRD handling (0.12 → 2.6)
- No top-level values removed or renamed. `values.yaml` diff 0.12.1↔2.6.0 is additive only (new keys: `enableHTTP2, extraInitContainers, genericTargets, grafanaDashboard, hostAliases, hostUsers, leaderElectionID, livenessProbe, openshiftFinalizers, processClusterGenerator, processClusterPushSecret, processSecretStore, readinessProbe, strategy, systemAuthDelegator, vault`). Our single value `installCRDs = true` survives.
- `installCRDs` still works in 2.6.0 (defaults `true`, "install and upgrade CRDs through helm chart"). CRDs are templated into the single `external-secrets` chart and upgraded by `helm upgrade` automatically; there is no separate CRDs subchart, and no manual `kubectl apply` of CRDs is required by default. (Out-of-band bundle, if ever needed, lives at `deploy/crds/bundle.yaml` per release tag.) The only CRD-value change: `crds.conversion.enabled` defaults `true` in 0.12.1 (for the old v1alpha1 webhook) → `false` in 2.6.0 ("we stopped supporting v1alpha1"). We don't set it, so the new default is fine.
- CRD storedVersions bookkeeping (the one real pre-flight check): v0.16.0 notes warn to ensure no CRD still lists `v1alpha1` in `.status.storedVersions` before/at 0.16, with a `kubectl patch` to set it to `["v1","v1beta1"]` if needed. This is CRD metadata hygiene, NOT secret deletion.
- Helm provider: `Chart.yaml` `apiVersion: v2` (Helm 3 chart) in both 0.12.1 and 2.6.0; no minimum Helm version declared (only `kubeVersion: ">= 1.19.0-0"`). The Terraform helm provider on Helm SDK v3 (v3.x/v4.x) is compatible. The 2.x chart does NOT require a newer helm provider than 0.12 did; the v3-style helm block in `providers.tf` already satisfies it. (Still: pin/verify the resolved version in the lockfile; see Open Questions.)
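
The storedVersions pre-flight check could look like the sketch below. It only inspects the four CRDs named in §4.4 (ExternalSecret, SecretStore, ClusterSecretStore, ClusterExternalSecret), since PushSecret is legitimately `v1alpha1`-only; the function name `crds_with_v1alpha1` is hypothetical.

```shell
# Print the name of every core ESO CRD whose .status.storedVersions still
# lists v1alpha1. Prints nothing when the pre-flight check passes.
crds_with_v1alpha1() {
  local crd stored
  for crd in externalsecrets secretstores clustersecretstores clusterexternalsecrets; do
    stored="$(kubectl get crd "${crd}.external-secrets.io" \
      -o jsonpath='{.status.storedVersions}')" || return 1
    case "$stored" in
      *v1alpha1*) echo "${crd}.external-secrets.io" ;;
    esac
  done
}
```

Empty output means no patch is needed, which matches the recorded pre-flight result of `["v1beta1"]` only.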
4.6 Data migration: downstream Secrets survive
The synced Kubernetes Secret objects are not deleted or force-resynced by these upgrades. The change is an apiVersion bump on the custom resources, whose spec is schema-identical, so the controller keeps reconciling the same target Secrets. A controller restart triggers a normal reconcile (re-assert, not delete). Caveat: no release note says verbatim "synced Secrets are preserved"; the conclusion is from (a) schema identity, (b) maintainers calling it "100% compatible" (issue #5478), (c) absence of any "secrets recreated/deleted" note. Standard caution: snapshot/back up all ESO-created Secrets before the 0.16→0.17 step (see §8 verification). Unrelated watch-item: v0.14.0 flagged a stateful-generators change — we use no generators, so N/A.
5. Migration strategy (ordered, do-this-then-that)
Pre-reqs every step: run from an unlocked infra checkout (git-crypt unlocked); `vault login -method=oidc`; ESO is Tier-0 so use `scripts/tg plan` / `scripts/tg apply` against `stacks/external-secrets` and `git push` after each apply (SOPS state). Claim presence before each apply: `~/code/scripts/presence claim stack:external-secrets --purpose "ESO 0.12→2.x migration step N"`. Wait for the controller `Deployment` to roll out healthy before the next hop.
Phase 0 — Pre-flight (no changes)
1. Confirm cluster k8s version and the version-check chain's current target; coordinate with the k8s climb (see §4.3 / Open Questions). Decide who drives the interleave.
2. `kubectl get crd | grep external-secrets.io` and for each: `kubectl get crd <name> -o jsonpath='{.status.storedVersions}'`; confirm none still list `v1alpha1`. If any do, plan the `kubectl patch …/status storedVersions=["v1beta1"]` per the v0.16.0 note (do this before reaching 0.16).
3. Snapshot all ESO-managed Secrets (rollback safety net): `kubectl get externalsecrets -A` (record the 104) and `for ns/secret in <targets>: kubectl get secret -n <ns> <name> -o yaml > backup/<ns>-<name>.yaml`. Keep outside git-crypt or encrypt.
4. Inspect `.terraform.lock.hcl` in `stacks/external-secrets` and record the resolved `helm` + `kubernetes` provider versions. If helm provider < what 2.6.0 needs (it doesn't appear to need anything beyond SDK v3), bump the constraint as its own committed change first.
5. Read `docs/architecture/secrets.md` + the fire-planner first-apply comment to re-confirm the `-target` pattern for the v1 rewrite step.
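
The Secret snapshot in the pre-flight list could be scripted along these lines. A minimal sketch: `snapshot_eso_secrets` is a hypothetical helper, the backup directory is whatever the caller passes in, and the fallback to the ExternalSecret's own name when `spec.target.name` is empty reflects ESO's documented default.

```shell
# Back up the target Secret of every ExternalSecret into directory $1,
# one YAML file per Secret, named <namespace>-<secret>.yaml.
snapshot_eso_secrets() {
  local outdir="$1" ns es target name
  mkdir -p "$outdir"
  kubectl get externalsecrets -A \
    -o jsonpath='{range .items[*]}{.metadata.namespace}{"\t"}{.metadata.name}{"\t"}{.spec.target.name}{"\n"}{end}' \
  | while IFS="$(printf '\t')" read -r ns es target; do
      [ -n "$ns" ] || continue
      # An empty spec.target.name means the Secret shares the ES name.
      name="${target:-$es}"
      if ! kubectl get secret -n "$ns" "$name" -o yaml \
          > "${outdir}/${ns}-${name}.yaml" </dev/null; then
        # Dead ExternalSecrets have no Secret; drop the empty file and report.
        rm -f "${outdir}/${ns}-${name}.yaml"
        echo "no Secret for ${ns}/${es}" >&2
      fi
    done
}
```

The output contains live credentials, so the directory should sit outside the repo or be encrypted, as the snapshot step already requires.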
Phase 1 — Climb to 0.16.x (chart bump only, NO manifest change yet)
ESO 0.16.x is the transition version that serves both v1beta1 and v1. Climb to it one minor at a time, leaving all CRs as v1beta1:
6. For v in 0.13.0, 0.14.0, 0.15.x, 0.16.2 (use latest patch of each minor): set helm_release.external_secrets.version = "<v>", scripts/tg plan (expect: chart upgrade + CRD upgrade in place; no kubernetes_manifest replacements — apiVersion unchanged), scripts/tg apply, git push, wait for rollout, verify kubectl get externalsecrets -A all SecretSynced=True.
   - Interleave k8s as required: before/at 0.16 the cluster should be on k8s 1.32 (0.16 band). Advance k8s via the normal version-check chain to 1.32 around this point.
   - Watch the 0.14.0 notes (generators): N/A for us, but eyeball the plan diff anyway.
7. Land on 0.16.2 and STOP. Verify both APIs are served: `kubectl get externalsecrets.v1.external-secrets.io -A` and `kubectl get externalsecrets.v1beta1.external-secrets.io -A` both work.
Phase 2 — Rewrite all 104 CRs + 2 stores to v1 (while on 0.16.2)
This is the MUST-DO-FIRST API migration, done in the safe window where both versions are served.
8. Mechanical rewrite across stacks/: replace the apiVersion string external-secrets.io/v1beta1 → external-secrets.io/v1 in every ExternalSecret and ClusterSecretStore kubernetes_manifest (104 + 2 = 106 occurrences across 73 files, including the 6 nested-module files in §2.2). No other field changes (schema identical). Do this in a worktree, committed file-by-file.
   - Leave `secretStoreRef.kind = "ClusterSecretStore"` (that's a kind reference, not an apiVersion, so it is unaffected).
9. Two-phase apply because `kubernetes_manifest` replace + plan-time `data "kubernetes_secret"`:
   a. Stores first: `scripts/tg apply -target='kubernetes_manifest.css_vault_kv' -target='kubernetes_manifest.css_vault_db'` in `stacks/external-secrets` (they get replaced to v1; ESO still serves v1beta1 too, so in-flight ExternalSecrets keep syncing). `git push`.
   b. ExternalSecrets, per stack: for each of the 73 stacks, `scripts/tg apply -target=kubernetes_manifest.<external_secret_name>` FIRST (materializes the replaced v1 CR + its Secret), THEN a full `scripts/tg apply` for that stack (lets the 15 plan-time `data "kubernetes_secret"` reads resolve against the now-existing Secret). The 15 plan-time-coupled stacks (§2.6) absolutely need the `-target` first; the rest are lower-risk but follow the same pattern for safety. `git push` per stack (Tier-1 stacks use PG state; ESO stack is Tier-0).
   - Because the spec is identical, the replace re-creates an identical CR; ESO reconciles and re-asserts the same target Secret (no value change) → apps keep their Secret throughout.
10. Verify the rewrite fully landed: `grep -rc 'external-secrets.io/v1beta1' stacks/` reports 0 for every file (equivalently, `grep -rl 'external-secrets.io/v1beta1' stacks/` prints nothing); `kubectl get externalsecrets -A -o jsonpath` output used to confirm all are served as v1; all `SecretSynced=True`; spot-check a rotated DB cred (e.g. `nextcloud-db-creds`) still valid.
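
The mechanical rewrite in step 8 is a single string substitution, sketched below. Assumptions: the apiVersion is double-quoted in the `.tf` files, and `rewrite_eso_api_version` is a hypothetical helper to run inside the worktree before reviewing the diff and committing file by file.

```shell
# Replace "external-secrets.io/v1beta1" with "external-secrets.io/v1" in
# every .tf file under $1. The quotes are part of the pattern, so other
# strings (such as kind = "ClusterSecretStore") are left alone.
rewrite_eso_api_version() {
  local dir="$1" f
  { grep -rlF --include='*.tf' '"external-secrets.io/v1beta1"' "$dir" || true; } \
  | while IFS= read -r f; do
      # -i.bak works with both GNU and BSD sed; the backup is removed after.
      sed -i.bak 's|"external-secrets\.io/v1beta1"|"external-secrets.io/v1"|g' "$f" \
        && rm -f "${f}.bak"
    done
}
```

Because `grep -r` walks the whole tree, the nested-module files listed in §2.2 are picked up without special handling.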
Phase 3 — Cross the 0.17 cutoff, then climb to 2.6.0
Only after Phase 2 is 100% applied (zero v1beta1 in repo AND in etcd):
11. Bump chart 0.16.2 → 0.17.x. scripts/tg plan (expect chart/CRD upgrade; no manifest replacements — already v1), apply, push, rollout, verify all synced. k8s should be 1.33 (0.17 band) around here.
12. Continue one minor at a time: 0.18.x → 0.19.x → 0.20.x → 1.0.0 → 1.x (latest) → 2.0.0 → … → 2.6.0. At each: bump version, plan, apply, push, rollout, verify synced. k8s reaches 1.34 then 1.35 across the 2.x steps.
    - At 2.0.0: confirm the plan shows nothing odd from the Alibaba/Device42 provider removal (we use only Vault, so it should be a no-op).
13. Land on 2.6.0. Verify: controller image v2.6.0, all 104 ExternalSecrets SecretSynced=True, both ClusterSecretStores Valid=True.
Phase 4 — Close the gate + docs
14. Advance k8s to 1.35 via the version-check chain if not already; confirm the compat-gate now lists ESO as compatible and 1.35 is unblocked.
15. Update `.claude/CLAUDE.md` Secrets Management section: correct counts (104 ExternalSecrets / 73 files / 26 db-refs), apiVersion now `v1`. Update `docs/architecture/secrets.md`. Commit as part of the work (audit trail).
6. Risks & mitigations
| Risk | Likelihood | Mitigation |
|---|---|---|
| Secret-sync outage → app DB/API auth failures during controller restarts or the replace window | Med | Spec is identical so re-sync re-asserts the same value; keep each controller restart short; do Phase-2 replaces per stack (small blast radius); the 15 plan-time stacks use -target first so the Secret exists before dependents plan. Pre-step Secret snapshot (Phase 0.3) for instant restore. |
| Crossing 0.17 with any CR still v1beta1 → ESO stops reconciling those, secrets go stale | High if rushed | Phase 2 gate: `grep -rl v1beta1 stacks/` must print no files AND kubectl get …v1beta1… returns nothing live before Phase 3. Do not skip 0.16. |
| CRD removal/replace by helm dropping data | Low | Chart manages CRDs in-place via installCRDs=true (upgrade, not delete-recreate); CRs are the data and they're untouched by a CRD upgrade. Snapshot anyway. Never helm uninstall (that can GC CRDs). |
| No conversion webhook safety net (must-rewrite-first) | Certain (by design) | Whole strategy is built on rewriting at 0.16. The deprecated unsafeServeV1Beta1 is already past its 2026-05-01 removal — do NOT rely on it. |
| `kubernetes_manifest` forces replace on apiVersion bump → transient gap + plan-time read failures | High | Two-phase `-target` apply fleet-wide (Phase 2.9); identical spec ⇒ replacement CR is equivalent. |
| Vault 7-day DB rotation lands mid-migration → a Secret stale across rotation if controller down | Med | Keep controller downtime < rotation gap; re-sync immediately after each hop; Reloader annotations already re-roll pods on Secret change; if a rotation is imminent, sequence the affected db stacks last and verify those creds explicitly. |
| git-crypt tls-secret-sync landmine | Low (not in ESO stack) | ESO stack has no file() git-crypt reads; run from an unlocked checkout; do not piggyback kyverno applies during this work. |
| helm/k8s provider in lockfile too old for 2.x chart | Low | Phase 0.4 verify; bump constraint as a separate committed change if needed (chart needs only Helm SDK v3, already satisfied). |
| k8s/ESO band mismatch (e.g. ESO 0.12 on k8s 1.33) | Med | Interleave the climbs per §4.3; don't jump k8s far ahead of ESO or vice-versa. |
| Many small applies = long, error-prone session | Med | Script the per-stack -target-then-full loop; checkpoint with kubectl get externalsecrets -A after each; the rewrite itself is a single sed-class change so low semantic risk. |
7. Rollback plan (per hop)
- During Phase 1 (chart climb, still v1beta1): revert `version` to the previous minor in `stacks/external-secrets/main.tf`, `scripts/tg apply`, `git push`. Helm rolls the controller back; CRs unchanged. Clean.
- During Phase 2 (v1 rewrite, on 0.16.2): 0.16.2 serves both APIs, so you can `git revert` the apiVersion-bump commits and re-apply; the CRs flip back to v1beta1 cleanly (both served). Secrets unaffected (identical spec). This is the last point of easy rollback.
- After Phase 3 (≥0.17, v1beta1 no longer served): rollback is HARD. Once etcd stores v1-only and the controller is ≥0.17, downgrading cannot re-serve v1beta1 and v1 objects can't be auto-converted back (general guidance + maintainer position). Treat crossing 0.17 as the point of no return. If you must recover: re-install 0.16.2 (serves both), restore CRs from the Phase-0 manifest snapshot, and restore Secrets from the Secret snapshot. This is a disaster-recovery path, not a routine rollback, hence the Phase-2 gate must be airtight.
- Always available: the Phase-0.3 Secret backups let you `kubectl apply` the last-good Secret to keep an app authenticating while you fix ESO.
8. Verification
Per hop:
- `kubectl -n external-secrets get deploy,po` healthy; controller image tag == target.
- `kubectl get externalsecrets -A` → all 104 `STATUS=SecretSynced` / `READY=True`.
- `kubectl get clustersecretstores` → `vault-kv` + `vault-database` `Valid=True`.
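
The per-hop sync check can be reduced to "list anything that is not Ready". A minimal sketch; `unsynced_externalsecrets` is a hypothetical name, and the jsonpath filter reads the `Ready` condition from each ExternalSecret's status.

```shell
# Print "namespace/name<TAB>status" for every ExternalSecret whose Ready
# condition is anything other than "True" (including missing).
unsynced_externalsecrets() {
  kubectl get externalsecrets -A \
    -o jsonpath='{range .items[*]}{.metadata.namespace}/{.metadata.name}{"\t"}{.status.conditions[?(@.type=="Ready")].status}{"\n"}{end}' \
  | awk -F'\t' 'NF && $2 != "True" { print }'
}
```

After each hop the output should be limited to the known pre-existing dead entries; any new line is a regression to investigate before the next hop.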
After Phase 2 (v1 rewrite):
- `grep -rc 'external-secrets.io/v1beta1' stacks/` → 0 for every file.
- `kubectl get externalsecrets.v1beta1.external-secrets.io -A` → still served on 0.16 (sanity), but `kubectl get externalsecrets.v1.external-secrets.io -A` is the real check.
- Spot-check a rotated DB cred end-to-end: e.g. `nextcloud-db-creds` value matches `vault read database/static-creds/mysql-nextcloud` and the app authenticates.
Final (2.6.0):
- Controller image `v2.6.0`; all ExternalSecrets synced; both stores valid.
- Diff a sample of the 104 target Secrets against the Phase-0 backups → values unchanged (continuity proof).
- App health: spot-check 3–4 high-value consumers (nextcloud, immich, grafana, a `vault-database` consumer): pods running, no auth errors in logs.
- Compat-gate: run the upgrade-state / k8s-version-check audit. ESO is no longer flagged as a 1.35 blocker; k8s 1.35 upgrade proceeds.
9. Open questions
- k8s/ESO interleave ownership. §4.3 shows narrow per-version k8s bands (0.16→1.32, 0.17→1.33, 2.x→1.34-1.35). The cluster is currently ≤1.31. Who drives the interleave: does this migration also advance k8s step-by-step, or does the autonomous version-check chain advance k8s and we time ESO hops to it? Need the exact current k8s version and the chain's behavior when ESO is the only gate. (Decisive for sequencing Phases 1/3.)
- 2.6.0 ↔ k8s 1.35 explicit support. The support matrix table currently ends at 2.5 (k8s 1.34-1.35). 2.6.0 exists on GitHub but the matrix row isn't appended yet; the whole 2.x band is consistently 1.34-1.35, so 2.6 on 1.35 is a strong inference, not a quoted row. Confirm via `Chart.yaml` `kubeVersion` of 2.6.0 or a 2.6 release note before relying on it. (matrix)
- Resolved helm provider version. The stack only pins `vault ~> 4.0`; helm/k8s are unpinned (lockfile-resolved). Confirm the lockfile version and whether to pin it explicitly as part of this work. (Chart needs only Helm SDK v3, so likely a no-op, but verify.)
- Intermediate-minor patch selection. Use latest patch of each minor (0.13.x, 0.14.x, 0.15.x). Confirm 0.16.2 specifically (the note says 0.16.2 already supports v1) vs a later 0.16 patch.
- Per-stack apply automation. 73 stacks × (target + full) apply is large. Acceptable to script a loop, or prefer manual per-stack with checkpoints? Some stacks have other in-flight drift that a full apply would also push, so each stack needs a clean-plan check first.
- Stateful generators / advanced features. Confirmed we use none (0 SecretStore/PushSecret/ClusterExternalSecret/generators), so the v0.14 generator and v2.0 provider-removal breaking changes are N/A, but re-confirm no generator usage crept in before Phase 3.
10. Sources (decisive facts)
- Skip-version policy + k8s support matrix: https://external-secrets.io/latest/introduction/stability-support/
- `v1` promoted to storage version (0.16.0): https://github.com/external-secrets/external-secrets/releases/tag/v0.16.0
- `v1beta1` removed / "rewrite manifests to v1 first" (0.17.0): https://github.com/external-secrets/external-secrets/releases/tag/v0.17.0
- No conversion webhook / "not a conversion issue" (#5478): https://github.com/external-secrets/external-secrets/issues/5478
- v1beta1↔v1 schema identical / "nothing fancy" (#4785): https://github.com/external-secrets/external-secrets/issues/4785
- App v1.0.0 ≠ API v1: https://github.com/external-secrets/external-secrets/releases/tag/v1.0.0
- v2.0.0 only removes Alibaba/Device42: https://github.com/external-secrets/external-secrets/releases/tag/v2.0.0
- Chart 2.6.0 on ArtifactHub: https://artifacthub.io/packages/helm/external-secrets-operator/external-secrets