rbac/apiserver-oidc: back up the apiserver manifest OUTSIDE /etc/kubernetes/manifests
All checks were successful
ci/woodpecker/push/default Pipeline was successful
The SSO restore script backed up the live manifest with `cp "$MANIFEST" "$MANIFEST.bak.$TS"`, i.e. INSIDE /etc/kubernetes/manifests/. The kubelet treats every file in that dir as a static pod, so the .bak became a SECOND kube-apiserver static pod.

While both copies were identical it was harmless, but the instant `kubeadm upgrade` changed the real manifest's image to v1.35.6, the kubelet saw two same-named pods with different specs and flip-flopped (pod attempt count hit 13). The new apiserver never stabilised, so kubeadm timed out on "static Pod hash did not change after 5m" and rolled back. THIS was the real cause of the 1.34->1.35 upgrade stalling for days (not etcd IO, which was a downstream symptom of the flip-flopping apiserver hammering etcd).

Fix: write backups to a dedicated dir OUTSIDE the static-pod dir (/etc/kubernetes/apiserver-oidc-bak/) and read the rollback copy from there. The stray .bak that planted the landmine on 2026-06-18 was moved out manually on 2026-06-26; this prevents the SSO script (and the upgrade chain's restore.sh, which is the same script) from ever re-creating it.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
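The failure mode described above can be checked for before an upgrade. The following is a minimal sketch, not part of this commit: `find_stray_manifests` is a hypothetical helper name, and it assumes every legitimate static-pod manifest on the node ends in `.yaml`, so any other regular file in the directory (such as `kube-apiserver.yaml.bak.<timestamp>`) is worth a look.

```shell
# Hypothetical helper (not in the repo): list regular files in a static-pod
# directory that do not look like ordinary manifests.
# Assumption: legitimate manifests on this node all end in .yaml.
find_stray_manifests() {
  dir=${1:-/etc/kubernetes/manifests}
  # Only look at the directory itself, and print every regular file whose
  # name does not end in .yaml, e.g. a leftover kube-apiserver.yaml.bak.<ts>.
  find "$dir" -maxdepth 1 -type f ! -name '*.yaml'
}
```

Running `find_stray_manifests` (as root, or against a copy of the directory) before `kubeadm upgrade` should print nothing on a clean node; any path it prints is a candidate for moving out of the directory.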
parent 5d33327c30
commit 196d0db4bd
1 changed files with 11 additions and 2 deletions
@@ -150,6 +150,15 @@ locals {
 MANIFEST=/etc/kubernetes/manifests/kube-apiserver.yaml
 AUTHCFG=/etc/kubernetes/pki/auth-config.yaml
 TS=$(date +%s)
+# Manifest backups MUST live OUTSIDE /etc/kubernetes/manifests/ — the kubelet
+# treats EVERY file in that dir as a static pod, so a kube-apiserver.yaml.bak
+# there becomes a SECOND apiserver static pod. On a kubeadm upgrade (when the
+# real manifest's image changes) the two conflict, the kubelet flip-flops, the
+# new apiserver never stabilises → kubeadm "static Pod hash did not change" →
+# rollback. This stalled the 1.34->1.35 upgrade for days (root cause found
+# 2026-06-26; the old `cp "$MANIFEST" "$MANIFEST.bak"` planted it on 2026-06-18).
+BAKDIR=/etc/kubernetes/apiserver-oidc-bak
+sudo install -d -m 700 "$BAKDIR"
 
 # 1. Write the structured AuthenticationConfiguration (hot-reloaded by the
 # apiserver on change; mounted into the pod via the existing pki hostPath).
@@ -159,7 +168,7 @@ locals {
 # 2. Ensure the apiserver references it. Only touch the manifest (→ restart)
 # when the flag is missing; otherwise the file write above hot-reloads.
 if ! sudo grep -q -- '--authentication-config=' "$MANIFEST"; then
-  sudo cp "$MANIFEST" "$MANIFEST.bak.$TS"
+  sudo cp "$MANIFEST" "$BAKDIR/kube-apiserver.yaml.$TS"
   sudo sed -i '/--oidc-issuer-url/d;/--oidc-client-id/d;/--oidc-username-claim/d;/--oidc-groups-claim/d' "$MANIFEST"
   echo '${base64encode(local.apiserver_flag_insert_py)}' | base64 -d | sudo python3 - "$MANIFEST"
 fi
@@ -178,7 +187,7 @@ locals {
 done
 if [ "$ok" != "1" ]; then
   echo "kube-apiserver UNHEALTHY after change — rolling back"
-  BAK=$(ls -t "$MANIFEST".bak.* 2>/dev/null | head -1)
+  BAK=$(ls -t "$BAKDIR"/kube-apiserver.yaml.* 2>/dev/null | head -1)
   if [ -n "$BAK" ]; then sudo cp "$BAK" "$MANIFEST"; fi
   for i in $(seq 1 60); do sleep 2; if curl -sk https://localhost:6443/livez 2>/dev/null | grep -q '^ok'; then break; fi; done
   echo "rolled back to previous manifest"; exit 1
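The backup-then-rollback pattern in the hunks above can be tried in isolation. This is a rough sketch under stated assumptions rather than the script from this repo: the function names `backup_manifest`, `latest_backup` and `restore_latest` are hypothetical, `sudo` is dropped, and the paths are passed as arguments so it can run against a scratch directory. Like the real script, it picks the rollback copy by modification time via `ls -t`.

```shell
# Sketch only; function names are hypothetical and paths are parameters so the
# pattern can be exercised in a scratch dir without sudo.

# Copy the manifest into a backup dir that is NOT the static-pod dir.
backup_manifest() {
  manifest=$1; bakdir=$2; ts=$3
  install -d -m 700 "$bakdir"   # create the backup dir with owner-only access
  cp "$manifest" "$bakdir/$(basename "$manifest").$ts"
}

# Print the most recently modified backup for the given manifest name, if any.
latest_backup() {
  bakdir=$1; name=$2
  ls -t "$bakdir/$name".* 2>/dev/null | head -1
}

# Overwrite the manifest with the newest backup; returns non-zero if none exists.
restore_latest() {
  manifest=$1; bakdir=$2
  bak=$(latest_backup "$bakdir" "$(basename "$manifest")")
  [ -n "$bak" ] && cp "$bak" "$manifest"
}
```

With the real paths this would correspond to `backup_manifest "$MANIFEST" "$BAKDIR" "$TS"` before editing and `restore_latest "$MANIFEST" "$BAKDIR"` on a failed health check. The point of the design is that the static-pod directory only ever contains the one live manifest.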