diff --git a/docs/post-mortems/2026-06-24-kubeadm-oidc-drift-apiserver-upgrade-stall.md b/docs/post-mortems/2026-06-24-kubeadm-oidc-drift-apiserver-upgrade-stall.md new file mode 100644 index 00000000..9a42d3ed --- /dev/null +++ b/docs/post-mortems/2026-06-24-kubeadm-oidc-drift-apiserver-upgrade-stall.md @@ -0,0 +1,90 @@ +# Post-mortem: kubeadm-config OIDC drift crash-looped the v1.35 apiserver upgrade (2026-06-24) + +**Impact:** The autonomous k8s-version-upgrade chain (23:00 UTC nightly) reached +the master control-plane phase for the first time — preflight passed, etcd +snapshot taken, master cordoned + drained, etcd upgraded 3.6.5→3.6.6 — then the +kube-apiserver upgrade to v1.35.6 **crash-looped**. kubeadm waited its 5-minute +static-pod-hash window across all internal retries, then auto-rolled-back to +v1.34.9. The cluster stayed healthy on 1.34.9 (apiserver, all 7 nodes Ready), but +the run left **k8s-master cordoned** and the chain **wedged on `in_flight=1`** +(which correctly blocks subsequent runs). No data loss; no user-facing outage +(the master carries control-plane taints, so no workloads were displaced). + +**Trigger:** the first *minor* upgrade the chain ever attempted (1.34→1.35). +Patch upgrades never hit this because the apiserver manifest content is identical +across patches; a minor upgrade is the first time kubeadm regenerates the +manifest with a new image. + +## Root cause + +apiserver authentication was configured in **two** places that were allowed to +drift from a **third**: + +1. `/etc/kubernetes/pki/auth-config.yaml` — a structured `AuthenticationConfiguration` + (apiserver.config.k8s.io/v1) carrying **two** JWT issuers (`kubernetes` for + kubectl/kubelogin + `k8s-dashboard` for the dashboard's oauth2-proxy), added + 2026-06-19 (`docs/plans/2026-06-04-k8s-dashboard-sso-design.md`). +2. the **live** kube-apiserver static-pod manifest — referenced it via + `--authentication-config=/etc/kubernetes/pki/auth-config.yaml`. +3. the **kubeadm-config `ClusterConfiguration` ConfigMap** — still carried the + **legacy single-issuer `--oidc-*` extraArgs** (`oidc-issuer-url`, + `oidc-client-id`, `oidc-username-claim`, `oidc-groups-claim`). Never updated + when (1)+(2) switched to structured auth. + +`kubeadm upgrade apply` **regenerates the static-pod manifests from +kubeadm-config**. So it dropped `--authentication-config` and re-added the four +`--oidc-*` flags. Proven by `kubeadm upgrade diff v1.35.6`: + +```diff +- - --authentication-config=/etc/kubernetes/pki/auth-config.yaml ++ - --oidc-issuer-url=https://authentik.viktorbarzin.me/application/o/kubernetes/ ++ - --oidc-client-id=kubernetes ++ - --oidc-username-claim=email ++ - --oidc-groups-claim=groups +``` + +The regenerated apiserver crash-looped (`CrashLoopBackOff`, `back-off 10s`, 8 +probe failures in the kubelet journal) — it exited within seconds, repeatedly, so +kubeadm's hash-watch never saw a stable new pod and timed out → rollback. (The +`--oidc-*` flags are NOT removed in 1.35; the crash is the auth-config swap in the +live control-plane environment, the only functional delta in the diff. Image +pull, etcd, OOM, and disk were all ruled out: all v1.35.6 images were pre-pulled, +etcd upgraded cleanly, no OOM, master root disk at 73%.) + +**Why the existing safety net missed it:** `stacks/rbac/modules/rbac/apiserver-oidc.tf` +already *knew* kubeadm drops `--authentication-config` and published a +`apiserver-oidc-restore` ConfigMap for the chain to re-run **after** the upgrade. +But the apiserver crashes *during* `kubeadm upgrade apply`, which never returns +success, so the post-upgrade restore step is never reached. + +## Resolution + +1. **Reconciled kubeadm-config live** (2026-06-24, zero cluster impact — the CM is + only read during an upgrade): rewrote `apiServer.extraArgs` to drop the + `--oidc-*` args and add `--authentication-config`, via `kubeadm init phase + upload-config kubeadm`. `kubeadm upgrade diff v1.35.6` then showed **only** the + control-plane image bumps — no auth-flag changes. +2. **Recovered:** uncordoned k8s-master, cleared the stuck `in_flight` gauge + + namespace annotation. + +## Prevention (all landed in this change) + +| Gap | Fix | +|-----|-----| +| kubeadm-config not managed alongside the live manifest | `apiserver-oidc.tf`'s remote script now **also** reconciles kubeadm-config (`kubeadm init phase upload-config`). It reaches the cluster two ways: the published `apiserver-oidc-restore` ConfigMap (a plain k8s resource — CI applies it with no ssh) which the chain's `phase_master` re-runs, and a local `-replace` apply with `TF_VAR_ssh_private_key`. (The null_resource trigger deliberately does NOT hash the script: CI has no ssh key, so it must stay a no-op on a plain CI apply.) | +| The chain drained the master into a crash with no pre-check | new **preflight gate 4b** in `upgrade-step.sh`: runs `kubeadm upgrade diff v$TARGET` and `block`s (k8s_upgrade_blocked=1 → K8sUpgradeBlocked alert) BEFORE snapshot/in-flight/drain if a `-` line would drop `--authentication-config`. Fails safe — blocks only on a positive drift signal. | +| The live fix had to be applied out-of-band (only `default` Vault policy on the workstation; CI can't ssh) | kubeadm-config reconciled live via `kubeadm init phase upload-config` on the master (2026-06-24); the committed code makes it durable for future upgrades. | + +## Lessons + +- **Out-of-band control-plane edits must be written back to kubeadm-config.** + Anything that edits a static-pod manifest directly (auth, admission, audit, API + flags) is silently reverted on the next `kubeadm upgrade` unless kubeadm-config + itself carries it. `kubeadm upgrade diff ` is the authoritative + pre-flight check for "what will the upgrade change?" and is non-mutating. +- **A post-upgrade fixup can't repair something that breaks the upgrade itself.** + The restore-after-upgrade design assumed the apiserver would come up (degraded) + and be fixed afterward; it actually crash-looped, so the fix has to be in + kubeadm-config *before* `apply`, plus a preflight gate. +- **Minor upgrades exercise manifest regeneration; patch upgrades don't.** First + minor bump is where this whole class of drift surfaces. diff --git a/docs/runbooks/k8s-version-upgrade.md b/docs/runbooks/k8s-version-upgrade.md index 08d43926..86d1d31e 100644 --- a/docs/runbooks/k8s-version-upgrade.md +++ b/docs/runbooks/k8s-version-upgrade.md @@ -41,6 +41,7 @@ Job 0 — preflight (pinned: k8s-node1) ├── halt-on-alert (kured-style ignore-list) ├── 24h-quiet baseline (no Ready transitions <24h ago) ├── kubeadm upgrade plan matches target (skipped when master already at target — partial-resume) + ├── apiserver-OIDC drift gate: kubeadm upgrade diff must NOT drop --authentication-config (else BLOCK+alert) ├── Push k8s_upgrade_in_flight=1, k8s_upgrade_started_timestamp=$(date +%s) ├── Trigger backup-etcd Job, wait, verify snapshot byte count ├── SSH master: containerd skew fix (if master < workers) @@ -222,22 +223,34 @@ Exposed in K8s via ExternalSecret `k8s-upgrade-creds` in the `k8s-upgrade` names ## Common Operations -### Post-upgrade: apiserver OIDC restore (AUTOMATED by the chain since 2026-06-19) +### apiserver OIDC + kubeadm upgrades (kubeadm-config reconciliation since 2026-06-24) `kubeadm upgrade apply` **regenerates `/etc/kubernetes/manifests/kube-apiserver.yaml` -and drops the `--authentication-config` flag**, silently disabling apiserver -OIDC (kubectl/kubelogin CLI **and** the web dashboard SSO break — tokens get -401). This used to require a manual re-apply after **every** control-plane bump. +from kubeadm-config**. apiserver auth uses a structured multi-issuer +`--authentication-config` (kubectl + dashboard SSO), but kubeadm-config used to +still carry the legacy single-issuer `--oidc-*` extraArgs — so every upgrade +reverted the flag. On the **1.34→1.35** bump that regenerated apiserver +**crash-looped and stalled the whole upgrade mid-flight** (master cordoned, etcd +already bumped); the post-upgrade restore below never ran because `kubeadm +upgrade apply` itself never returned success. Post-mortem: +`docs/post-mortems/2026-06-24-kubeadm-oidc-drift-apiserver-upgrade-stall.md`. -**Now automated:** the `rbac` stack publishes its OIDC restore script to the -`kube-system/apiserver-oidc-restore` ConfigMap, and the version-upgrade chain's -`phase_master` re-runs it on master immediately after `kubeadm upgrade apply` -(while tigera-operator is still quiesced, so the flag-add apiserver restart can't -crashloop the operator). It's idempotent, health-gates `/livez` with -auto-rollback, and is **non-fatal** — a failure only lags SSO until the next rbac -apply (the version upgrade itself already succeeded). So a chain-driven -control-plane bump no longer breaks SSO. The master phase self-skips when master -is already at target, so this only runs when master was actually upgraded. +**Primary fix (2026-06-24):** `stacks/rbac/modules/rbac/apiserver-oidc.tf` now +**reconciles kubeadm-config** (`kubeadm init phase upload-config kubeadm`, rewriting +`apiServer.extraArgs`: drop `--oidc-*`, add `--authentication-config`) as part of +its remote script. So kubeadm regenerates a **correct** manifest and the apiserver +upgrades with a pure image bump — `kubeadm upgrade diff ` shows only the +image change. Zero live impact (the CM is read only during an upgrade). + +**Backstops:** +- **Preflight gate 4b** runs `kubeadm upgrade diff` and BLOCKs (k8s_upgrade_blocked=1 + → alert) BEFORE draining the master if `--authentication-config` would still be + dropped — so this can never again drain into a crash. +- The `rbac` stack still publishes its restore script to the + `kube-system/apiserver-oidc-restore` ConfigMap, and `phase_master` re-runs it on + master right after `kubeadm upgrade apply` (idempotent, `/livez`-gated with + auto-rollback, non-fatal) — now redundant belt-and-suspenders that *also* + re-reconciles kubeadm-config. Self-skips when master is already at target. **Manual fallback** — only for an out-of-band/manual `kubeadm` upgrade, or if the chain logged `WARN: --authentication-config absent after re-apply`: diff --git a/stacks/k8s-version-upgrade/scripts/upgrade-step.sh b/stacks/k8s-version-upgrade/scripts/upgrade-step.sh index 76fdf157..5bafd241 100644 --- a/stacks/k8s-version-upgrade/scripts/upgrade-step.sh +++ b/stacks/k8s-version-upgrade/scripts/upgrade-step.sh @@ -416,6 +416,25 @@ phase_preflight() { fi fi + # 4b. apiserver-OIDC drift gate (backstop for the rbac stack's kubeadm-config + # reconciliation). A `kubeadm upgrade` REGENERATES the apiserver manifest from + # kubeadm-config; if kubeadm-config still carries the legacy single-issuer + # --oidc-* args instead of --authentication-config, the regenerated apiserver + # reverts structured multi-issuer auth and CRASH-LOOPS — stalling the chain + # mid-flight with the master cordoned and etcd already bumped (the 2026-06-24 + # v1.35 stall; docs/post-mortems/2026-06-24-kubeadm-oidc-drift-apiserver-upgrade-stall.md). + # `kubeadm upgrade diff` shows exactly what the manifest regen will change; a + # '-' line dropping --authentication-config means the drift is still present. + # Skip on an at-target master (resume — no apiserver regen). Best-effort: blocks + # only on a POSITIVE drift signal, never merely because diff is unavailable. + if [ "$master_kubelet_v" != "$TARGET_VERSION" ]; then + local apiserver_diff + apiserver_diff=$(ssh "${SSH_OPTS[@]}" "$(ssh_target k8s-master)" "sudo kubeadm upgrade diff v$TARGET_VERSION 2>/dev/null" || true) + if echo "$apiserver_diff" | grep -qE '^-[[:space:]].*--authentication-config'; then + block "kubeadm upgrade would DROP --authentication-config from kube-apiserver (kubeadm-config OIDC drift → apiserver crash-loop). Re-apply the rbac stack (apiserver-oidc.tf reconciles kubeadm-config), then retry. Master NOT drained." + fi + fi + # 5. Push in-flight + started_timestamp metrics + ns annotations $KUBECTL annotate ns "$NS" \ "viktorbarzin.me/k8s-upgrade-in-flight=$(date -u +%FT%TZ)" \ diff --git a/stacks/rbac/modules/rbac/apiserver-oidc.tf b/stacks/rbac/modules/rbac/apiserver-oidc.tf index 5165a7d1..a5ad08c8 100644 --- a/stacks/rbac/modules/rbac/apiserver-oidc.tf +++ b/stacks/rbac/modules/rbac/apiserver-oidc.tf @@ -10,16 +10,26 @@ # match the existing RBAC subjects (kind: User, name: ; group names # verbatim). Do NOT add a prefix or existing bindings break. # -# DRIFT WARNING: this edits the kube-apiserver static-pod manifest on the single -# master. A `kubeadm upgrade` regenerates that manifest and DROPS this flag (this -# is exactly how OIDC silently broke before — the flag was wiped and the -# content-hash trigger never re-fired). After any k8s control-plane upgrade, -# re-apply the rbac stack to restore apiserver OIDC. See -# docs/plans/2026-06-04-k8s-dashboard-sso-design.md. +# DRIFT WARNING (and how it's now handled): apiserver auth lives in THREE places +# that must stay in sync, because a `kubeadm upgrade` REGENERATES the static-pod +# manifest from kubeadm-config: +# 1. /etc/kubernetes/pki/auth-config.yaml — the structured authn file +# 2. the live kube-apiserver static-pod manifest — references it via the flag +# 3. the kubeadm-config ClusterConfiguration CM — what kubeadm regenerates from +# Originally only (1)+(2) were managed, so every kubeadm upgrade rewrote the +# manifest from the STALE CM, reverting --authentication-config to single-issuer +# --oidc-* flags. On k8s 1.35 that regenerated apiserver CRASH-LOOPED and stalled +# the whole upgrade mid-flight (master cordoned, etcd already bumped) — see +# docs/post-mortems/2026-06-24-kubeadm-oidc-drift-apiserver-upgrade-stall.md. The +# remote script below now ALSO reconciles (3) via `kubeadm init phase +# upload-config`, so a future kubeadm upgrade regenerates a CORRECT manifest. The +# k8s-version-upgrade chain additionally GATES on `kubeadm upgrade diff` in +# preflight and blocks+alerts if --authentication-config would still be dropped. # # SAFETY: the remote script health-gates on /livez and AUTO-ROLLS-BACK the # manifest from a timestamped backup if the apiserver does not recover, so a -# malformed config cannot leave the single master down. +# malformed config cannot leave the single master down. Reconciling kubeadm-config +# is zero-impact on the running cluster (the CM is only read during an upgrade). variable "k8s_master_host" { type = string @@ -97,6 +107,40 @@ locals { print('flag-inserted' if done else 'ANCHOR-NOT-FOUND') PY + # Reconciles the kubeadm-config ClusterConfiguration's apiServer.extraArgs: + # drops the stale single-issuer --oidc-* args and ensures --authentication-config + # is present (anchored after --authorization-mode). Stdlib-only (the master is + # only guaranteed python3, not pyyaml/yq). Idempotent; preserves all other + # fields (etcd args, audit args, extraVolumes) verbatim. Exits 3 if the + # authorization-mode anchor is missing (fail loud, leave the CM untouched). + kubeadm_oidc_reconcile_py = <<-PY + import sys + lines = sys.stdin.read().split('\n') + out, i, n = [], 0, len(lines) + have_authn = any('name: authentication-config' in l for l in lines) + inserted = have_authn + while i < n: + ln = lines[i]; s = ln.strip() + if s.startswith('- name: oidc-'): + i += 2 if (i + 1 < n and lines[i + 1].strip().startswith('value:')) else 1 + continue + out.append(ln) + if (not inserted) and s == '- name: authorization-mode': + indent = ln[:len(ln) - len(ln.lstrip())] + if i + 1 < n and lines[i + 1].strip().startswith('value:'): + out.append(lines[i + 1]); i += 2 + else: + i += 1 + out.append(indent + '- name: authentication-config') + out.append(indent + ' value: /etc/kubernetes/pki/auth-config.yaml') + inserted = True + continue + i += 1 + if not inserted: + sys.stderr.write('ANCHOR-NOT-FOUND: authorization-mode\n'); sys.exit(3) + sys.stdout.write('\n'.join(out)) + PY + # Whole remote operation, base64-embedded for byte-exact transfer (no # heredoc/escaping hazards across SSH). apiserver_auth_remote_script = <<-SH @@ -137,6 +181,30 @@ locals { echo "rolled back to previous manifest"; exit 1 fi echo "kube-apiserver healthy with multi-issuer --authentication-config" + + # 5. Reconcile kubeadm-config so a FUTURE `kubeadm upgrade` regenerates the + # apiserver manifest WITH --authentication-config instead of reverting to + # the stale single-issuer --oidc-* flags. Without this, kubeadm rewrote the + # manifest from kubeadm-config on every control-plane upgrade and the + # regenerated apiserver crash-looped (the 2026-06-24 v1.35 upgrade stall). + # Zero live impact (the CM is only read at upgrade time); idempotent; + # best-effort (the chain's `kubeadm upgrade diff` preflight gate is the + # backstop if this cannot run). + KC="sudo kubectl --kubeconfig /etc/kubernetes/admin.conf" + CC=$($KC -n kube-system get cm kubeadm-config -o jsonpath='{.data.ClusterConfiguration}' 2>/dev/null || true) + if [ -n "$CC" ] && { echo "$CC" | grep -q 'oidc-issuer-url' || ! echo "$CC" | grep -q 'authentication-config'; }; then + echo "Reconciling kubeadm-config (oidc-* -> authentication-config) so kubeadm upgrade keeps structured auth" + echo '${base64encode(local.kubeadm_oidc_reconcile_py)}' | base64 -d > /tmp/reconcile_kubeadm_oidc.py + if printf '%s' "$CC" | python3 /tmp/reconcile_kubeadm_oidc.py > /tmp/kubeadm-cc-new.yaml \ + && sudo kubeadm init phase upload-config kubeadm --config /tmp/kubeadm-cc-new.yaml; then + echo "kubeadm-config reconciled: future control-plane upgrades keep --authentication-config" + else + echo "WARN: kubeadm-config reconcile failed; the upgrade-chain preflight gate will block the next upgrade" + fi + rm -f /tmp/reconcile_kubeadm_oidc.py /tmp/kubeadm-cc-new.yaml + else + echo "kubeadm-config already uses --authentication-config (no oidc drift)" + fi SH } @@ -155,6 +223,14 @@ resource "null_resource" "apiserver_oidc_config" { } triggers = { + # Intentionally hash ONLY the issuer config, NOT the remote script. CI applies + # the rbac stack with no ssh_private_key (var defaults to ""), so a re-run of + # this SSH provisioner in CI would fail — hence the null_resource must stay a + # no-op on a plain CI apply. Script changes (e.g. the 2026-06-24 kubeadm-config + # reconciliation) reach the cluster via the apiserver-oidc-restore ConfigMap + # below (a plain k8s resource, no ssh) which the upgrade chain re-runs. To force + # this provisioner to re-run after a script change, apply locally with + # `-replace` + TF_VAR_ssh_private_key (see docs/runbooks/k8s-version-upgrade.md). auth_config = sha256(local.apiserver_auth_config_yaml) } }