k8s-upgrade: reconcile kubeadm-config OIDC drift that crash-looped the v1.35 apiserver upgrade
Last night's autonomous 1.34->1.35 run reached the master control-plane phase for the first time (preflight passed, etcd snapshot taken, etcd upgraded), then the kube-apiserver upgrade to v1.35.6 crash-looped and kubeadm auto-rolled-back to 1.34.9. The cluster stayed healthy but the master was left cordoned and the chain wedged on in_flight.

Root cause: kubeadm upgrade regenerates the apiserver static-pod manifest from the kubeadm-config ConfigMap. apiserver auth was switched on 2026-06-19 to a structured multi-issuer --authentication-config (kubectl + dashboard SSO), but kubeadm-config still carried the legacy single-issuer --oidc-* extraArgs, so the regenerated manifest reverted structured auth and the new apiserver crash-looped. Proven via `kubeadm upgrade diff`. The existing post-upgrade OIDC restore step never ran because the upgrade itself never succeeded.

Fix:
- rbac/apiserver-oidc.tf: the remote script now also reconciles kubeadm-config (kubeadm init phase upload-config: drop --oidc-*, add --authentication-config) so a future kubeadm upgrade regenerates a correct manifest. Delivered to the cluster via the apiserver-oidc-restore ConfigMap the chain re-runs (CI needs no ssh key); trigger deliberately not script-hashed since CI cannot ssh.
- k8s-version-upgrade/upgrade-step.sh: new preflight gate runs `kubeadm upgrade diff` and BLOCKS+alerts (never drains the master) if --authentication-config would still be dropped.
- Post-mortem + runbook updated.

The live kubeadm-config was reconciled directly on the master and verified (`kubeadm upgrade diff` now shows only the control-plane image bump), so tonight's run can complete the 1.34->1.35 upgrade.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
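The drift described in the root cause can be checked on the kubeadm-config side as well. The following is a minimal sketch, not the script shipped in rbac/apiserver-oidc.tf: the function name `kubeadm_config_auth_state` and the sample input are hypothetical, and it assumes the ClusterConfiguration text is piped in on stdin (on a live cluster that would plausibly come from `kubectl -n kube-system get cm kubeadm-config -o jsonpath='{.data.ClusterConfiguration}'`). It only pattern-matches flag names, so it works whether extraArgs is stored as a map or as a name/value list.

```shell
# Hypothetical helper: classify the apiserver auth args found in a
# ClusterConfiguration document read from stdin.
#   legacy     -> at least one legacy --oidc-* flag name is present
#   structured -> authentication-config is present and no legacy flag is
#   none       -> neither is present
# The body runs in a subshell "( ... )" so its variables stay local.
kubeadm_config_auth_state() (
  cfg=$(cat)
  # Flag names of the legacy single-issuer kube-apiserver OIDC options.
  legacy_flags='oidc-(issuer-url|client-id|username-claim|username-prefix|groups-claim|groups-prefix|ca-file|required-claim|signing-algs)'
  if printf '%s\n' "$cfg" | grep -qE "$legacy_flags"; then
    echo legacy
  elif printf '%s\n' "$cfg" | grep -q 'authentication-config'; then
    echo structured
  else
    echo none
  fi
)

# Sample input (made up): a config that still carries a legacy flag.
printf '%s\n' \
  'apiServer:' \
  '  extraArgs:' \
  '  - name: oidc-issuer-url' \
  '    value: https://idp.example.invalid' | kubeadm_config_auth_state
# prints: legacy
```

A config that carries both kinds of argument is reported as `legacy`, since the commit's fix requires the --oidc-* args to be dropped, not merely supplemented.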
parent c6bba1da6e
commit 60a1cb9a25
4 changed files with 218 additions and 20 deletions
@@ -416,6 +416,25 @@ phase_preflight() {
     fi
   fi
 
+  # 4b. apiserver-OIDC drift gate (backstop for the rbac stack's kubeadm-config
+  # reconciliation). A `kubeadm upgrade` REGENERATES the apiserver manifest from
+  # kubeadm-config; if kubeadm-config still carries the legacy single-issuer
+  # --oidc-* args instead of --authentication-config, the regenerated apiserver
+  # reverts structured multi-issuer auth and CRASH-LOOPS, stalling the chain
+  # mid-flight with the master cordoned and etcd already bumped (the 2026-06-24
+  # v1.35 stall; docs/post-mortems/2026-06-24-kubeadm-oidc-drift-apiserver-upgrade-stall.md).
+  # `kubeadm upgrade diff` shows exactly what the manifest regen will change; a
+  # '-' line dropping --authentication-config means the drift is still present.
+  # Skip on an at-target master (resume: no apiserver regen). Best-effort: blocks
+  # only on a POSITIVE drift signal, never merely because diff is unavailable.
+  if [ "$master_kubelet_v" != "$TARGET_VERSION" ]; then
+    local apiserver_diff
+    apiserver_diff=$(ssh "${SSH_OPTS[@]}" "$(ssh_target k8s-master)" "sudo kubeadm upgrade diff v$TARGET_VERSION 2>/dev/null" || true)
+    if echo "$apiserver_diff" | grep -qE '^-[[:space:]].*--authentication-config'; then
+      block "kubeadm upgrade would DROP --authentication-config from kube-apiserver (kubeadm-config OIDC drift → apiserver crash-loop). Re-apply the rbac stack (apiserver-oidc.tf reconciles kubeadm-config), then retry. Master NOT drained."
+    fi
+  fi
+
   # 5. Push in-flight + started_timestamp metrics + ns annotations
   $KUBECTL annotate ns "$NS" \
     "viktorbarzin.me/k8s-upgrade-in-flight=$(date -u +%FT%TZ)" \
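The gate's "block only on a positive drift signal" behaviour can be exercised without a cluster. The sketch below reuses the grep pattern from the hunk above; the wrapper `gate_decision`, the helper `diff_drops_auth_config`, and the sample diff lines are hypothetical stand-ins for the real `block` call and for real `kubeadm upgrade diff` output.

```shell
# Returns 0 only when the diff text on stdin contains a removed ('-') line
# that mentions --authentication-config (same pattern as the preflight gate).
diff_drops_auth_config() {
  grep -qE '^-[[:space:]].*--authentication-config'
}

# Hypothetical wrapper: prints "block" on a positive drift signal and "pass"
# otherwise, including when the diff text is empty (diff unavailable).
gate_decision() (
  diff_text=$1
  if printf '%s\n' "$diff_text" | diff_drops_auth_config; then
    echo block
  else
    echo pass
  fi
)

# Made-up sample: the regenerated manifest would drop structured auth.
gate_decision '-    - --authentication-config=/etc/kubernetes/auth-config.yaml'
# prints: block

# Made-up sample: only the control-plane image changes.
gate_decision '+    image: registry.k8s.io/kube-apiserver:v1.35.6'
# prints: pass

# Empty diff (for example ssh failed and "|| true" swallowed the error).
gate_decision ''
# prints: pass
```

Because the pattern requires whitespace right after the leading `-`, a unified-diff file header starting with `---` does not count as a drift signal.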