infra/stacks/rbac/modules/rbac
Viktor Barzin 196d0db4bd
All checks were successful
ci/woodpecker/push/default Pipeline was successful
rbac/apiserver-oidc: back up the apiserver manifest OUTSIDE /etc/kubernetes/manifests
The SSO restore script backed up the live manifest with
`cp "$MANIFEST" "$MANIFEST.bak.$TS"` — i.e. INSIDE /etc/kubernetes/manifests/.
The kubelet treats every file in that dir as a static pod, so the .bak became a
SECOND kube-apiserver static pod. While both copies were identical it was
harmless, but the instant `kubeadm upgrade` changed the real manifest's image to
v1.35.6, the kubelet saw two same-named pods with different specs and flip-flopped
(pod attempt count hit 13) — the new apiserver never stabilised, so kubeadm timed
out on "static Pod hash did not change after 5m" and rolled back. THIS was the
real cause of the 1.34->1.35 upgrade stalling for days (not etcd IO, which was a
downstream symptom of the flip-flopping apiserver hammering etcd).

Fix: write backups to a dedicated dir OUTSIDE the static-pod dir
(/etc/kubernetes/apiserver-oidc-bak/) and read the rollback copy from there. The
stray .bak that planted the landmine on 2026-06-18 was moved out manually
2026-06-26; this prevents the SSO script (and the upgrade chain's restore.sh,
which is the same script) from ever re-creating it.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-26 10:29:19 +00:00
..
apiserver-oidc.tf rbac/apiserver-oidc: back up the apiserver manifest OUTSIDE /etc/kubernetes/manifests 2026-06-26 10:29:19 +00:00
audit-policy.tf fix: restore tree dropped by 6d224861; land stem95su gdrive-sync (10m) [ci skip] 2026-06-09 08:45:33 +00:00
dashboard-sa.tf fix: restore tree dropped by 6d224861; land stem95su gdrive-sync (10m) [ci skip] 2026-06-09 08:45:33 +00:00
etcd-tuning.tf fix: restore tree dropped by 6d224861; land stem95su gdrive-sync (10m) [ci skip] 2026-06-09 08:45:33 +00:00
main.tf fix: restore tree dropped by 6d224861; land stem95su gdrive-sync (10m) [ci skip] 2026-06-09 08:45:33 +00:00