k8s-version-upgrade: auto-restore apiserver OIDC after control-plane bumps

kubeadm upgrade apply regenerates the apiserver static-pod manifest and drops the --authentication-config flag, silently breaking SSO (kubectl/kubelogin + the k8s dashboard) until someone manually re-applies the rbac stack. That manual step was needed after every control-plane upgrade, and it was the one thing keeping autonomous patch upgrades from being truly hands-off. It bit us this cycle: an earlier master bump left SSO broken until we noticed.

Automate it: the rbac stack now publishes its existing OIDC restore script (the same one its null_resource runs) to a kube-system/apiserver-oidc-restore ConfigMap, and the upgrade chain's phase_master re-runs it on master right after the kubeadm upgrade, while tigera-operator is still quiesced so the flag-add apiserver restart can't crashloop it.

The script is idempotent and health-gates /livez with auto-rollback. The step is non-fatal: a failure only lags SSO until the next rbac apply and won't abort the upgrade. phase_master already self-skips when master is at target, so this only fires when master was actually upgraded.

The chain SA gets a name-scoped get on that one ConfigMap.

Runbook updated: the manual restore is now a documented fallback. The command is corrected: it needs -replace, since the null_resource trigger hash never changes.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
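For anyone verifying the result by hand, the presence check that phase_master runs after the re-apply can be sketched as a small function. This is only an illustration: the helper name has_auth_config is hypothetical, and the default path is the manifest location that the upgrade step greps on master.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the commit): succeed if the given
# kube-apiserver manifest still carries the --authentication-config flag.
# The default path is the one the upgrade step checks on master.
has_auth_config() {
  local manifest="${1:-/etc/kubernetes/manifests/kube-apiserver.yaml}"
  # `--` stops grep from treating the leading dashes of the pattern as options.
  grep -q -- "--authentication-config=" "$manifest"
}
```

On master this would be run as `has_auth_config && echo present || echo absent`, likely under sudo since the manifest is normally root-readable only. A non-zero result means SSO is down and the documented fallback applies.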
2026-06-19 06:04:30 +00:00 · 2026-06-19 06:04:30 +00:00 · 077ac97df5
commit 077ac97df5
parent 48b63ffa6f
4 changed files with 80 additions and 10 deletions
--- a/stacks/k8s-version-upgrade/main.tf
+++ b/stacks/k8s-version-upgrade/main.tf
@@ -221,6 +221,15 @@ resource "kubernetes_cluster_role" "k8s_upgrade_job" {
    resource_names = [local.namespace]
    verbs          = ["get", "patch", "update"]
  }
+  # Read the apiserver-OIDC restore script (published by the rbac stack to
+  # kube-system) so phase_master can re-apply --authentication-config after a
+  # kubeadm control-plane upgrade drops it. Name-scoped get only.
+  rule {
+    api_groups     = [""]
+    resources      = ["configmaps"]
+    resource_names = ["apiserver-oidc-restore"]
+    verbs          = ["get"]
+  }
 }

 resource "kubernetes_cluster_role_binding" "k8s_upgrade_job" {
--- a/stacks/k8s-version-upgrade/scripts/upgrade-step.sh
+++ b/stacks/k8s-version-upgrade/scripts/upgrade-step.sh
@@ -521,6 +521,33 @@ phase_master() {
  alerts=$(halt_on_alert_query "RecentNodeReboot|IngressTTFBCritical")
  [ -n "$alerts" ] && { slack "ABORT master — alerts firing post-upgrade: $alerts"; exit 1; }

+  # Re-apply apiserver OIDC. `kubeadm upgrade apply` regenerates the apiserver
+  # static-pod manifest and DROPS --authentication-config, silently breaking SSO
+  # (kubectl/kubelogin + the dashboard) until re-applied — historically a manual
+  # `tg apply` of the rbac stack after every control-plane bump. Automate it here
+  # while tigera-operator is STILL quiesced, so the flag-add apiserver restart
+  # cannot crashloop the operator. Single source of truth: the rbac stack
+  # publishes the exact script its own null_resource runs to a kube-system
+  # ConfigMap; it is idempotent and health-gates /livez with auto-rollback, and a
+  # failure here is NON-FATAL (the version upgrade already succeeded — only SSO
+  # would lag until the next rbac apply).
+  local oidc_restore
+  oidc_restore=$($KUBECTL -n kube-system get configmap apiserver-oidc-restore \
+    -o jsonpath='{.data.restore\.sh}' 2>/dev/null || true)
+  if [ -n "$oidc_restore" ]; then
+    slack "Re-applying apiserver OIDC after master upgrade"
+    printf '%s' "$oidc_restore" | ssh "${SSH_OPTS[@]}" "$(ssh_target k8s-master)" 'bash -s' \
+      || slack "WARN: apiserver OIDC re-apply exited non-zero — verify SSO"
+    if ssh "${SSH_OPTS[@]}" "$(ssh_target k8s-master)" \
+         'sudo grep -q -- "--authentication-config=" /etc/kubernetes/manifests/kube-apiserver.yaml'; then
+      slack "apiserver OIDC restored (--authentication-config present)"
+    else
+      slack "WARN: --authentication-config absent after re-apply — SSO down; run the rbac apiserver_oidc_config apply"
+    fi
+  else
+    slack "WARN: apiserver-oidc-restore ConfigMap missing — skipping OIDC re-apply (apply the rbac stack)"
+  fi
+
  # Restore tigera-operator (happy path) + clear the safety-net EXIT trap.
  echo "Restoring tigera-operator"
  $KUBECTL -n tigera-operator scale deploy tigera-operator --replicas=1 2>&1 || true