infra/docs/post-mortems
Viktor Barzin 60a1cb9a25
All checks were successful
ci/woodpecker/push/postmortem-todos Pipeline was successful
ci/woodpecker/push/default Pipeline was successful
k8s-upgrade: reconcile kubeadm-config OIDC drift that crash-looped the v1.35 apiserver upgrade
Last night's autonomous 1.34->1.35 run reached the master control-plane phase
for the first time (preflight passed, etcd snapshot taken, etcd upgraded), then
the kube-apiserver upgrade to v1.35.6 crash-looped and kubeadm auto-rolled-back
to 1.34.9. The cluster stayed healthy but the master was left cordoned and the
chain wedged on in_flight.

Root cause: kubeadm upgrade regenerates the apiserver static-pod manifest from
the kubeadm-config ConfigMap. apiserver auth was switched on 2026-06-19 to a
structured multi-issuer --authentication-config (kubectl + dashboard SSO), but
kubeadm-config still carried the legacy single-issuer --oidc-* extraArgs, so the
regenerated manifest reverted structured auth and the new apiserver crash-looped.
Proven via `kubeadm upgrade diff`. The existing post-upgrade OIDC restore step
never ran because the upgrade itself never succeeded.

Fix:
- rbac/apiserver-oidc.tf: the remote script now also reconciles kubeadm-config
  (kubeadm init phase upload-config: drop --oidc-*, add --authentication-config)
  so a future kubeadm upgrade regenerates a correct manifest. Delivered to the
  cluster via the apiserver-oidc-restore ConfigMap the chain re-runs (CI needs no
  ssh key); trigger deliberately not script-hashed since CI cannot ssh.
- k8s-version-upgrade/upgrade-step.sh: new preflight gate runs `kubeadm upgrade
  diff` and BLOCKS+alerts (never drains the master) if --authentication-config
  would still be dropped.
- Post-mortem + runbook updated.

The live kubeadm-config was reconciled directly on the master and verified
(`kubeadm upgrade diff` now shows only the control-plane image bump), so tonight's
run can complete the 1.34->1.35 upgrade.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-25 14:16:04 +00:00
..
2026-03-16-kured-containerd-cascade-outage.html fix: restore tree dropped by 6d224861; land stem95su gdrive-sync (10m) [ci skip] 2026-06-09 08:45:33 +00:00
2026-03-16-nfs-csi-cascade-failure.md fix: restore tree dropped by 6d224861; land stem95su gdrive-sync (10m) [ci skip] 2026-06-09 08:45:33 +00:00
2026-04-14-nfs-fsid0-dns-vault-outage.md fix: restore tree dropped by 6d224861; land stem95su gdrive-sync (10m) [ci skip] 2026-06-09 08:45:33 +00:00
2026-04-14-postmortem-pipeline-test.md fix: restore tree dropped by 6d224861; land stem95su gdrive-sync (10m) [ci skip] 2026-06-09 08:45:33 +00:00
2026-04-18-authentik-outpost-shm-full.md fix: restore tree dropped by 6d224861; land stem95su gdrive-sync (10m) [ci skip] 2026-06-09 08:45:33 +00:00
2026-04-19-registry-orphan-index.md fix: restore tree dropped by 6d224861; land stem95su gdrive-sync (10m) [ci skip] 2026-06-09 08:45:33 +00:00
2026-04-22-vault-raft-leader-deadlock.md fix: restore tree dropped by 6d224861; land stem95su gdrive-sync (10m) [ci skip] 2026-06-09 08:45:33 +00:00
2026-05-09-io-pressure-stale-nfs.md fix: restore tree dropped by 6d224861; land stem95su gdrive-sync (10m) [ci skip] 2026-06-09 08:45:33 +00:00
2026-05-16-kured-stalled-and-anubis-ha.md fix: restore tree dropped by 6d224861; land stem95su gdrive-sync (10m) [ci skip] 2026-06-09 08:45:33 +00:00
2026-05-17-gpu-driver-ubuntu2604-mismatch.md fix: restore tree dropped by 6d224861; land stem95su gdrive-sync (10m) [ci skip] 2026-06-09 08:45:33 +00:00
2026-05-17-nfs-csi-keel-upgrade-master-port-conflict.md fix: restore tree dropped by 6d224861; land stem95su gdrive-sync (10m) [ci skip] 2026-06-09 08:45:33 +00:00
2026-05-25-immich-anca-elements-io-storm.md immich: remove one-shot anca-elements-import Job + its PVC 2026-06-16 22:11:27 +00:00
2026-05-30-redis-split-brain.md fix: restore tree dropped by 6d224861; land stem95su gdrive-sync (10m) [ci skip] 2026-06-09 08:45:33 +00:00
2026-05-31-kured-sentinel-gate-oom.md fix: restore tree dropped by 6d224861; land stem95su gdrive-sync (10m) [ci skip] 2026-06-09 08:45:33 +00:00
2026-06-01-cloudflared-stale-traefik-origin.md fix: restore tree dropped by 6d224861; land stem95su gdrive-sync (10m) [ci skip] 2026-06-09 08:45:33 +00:00
2026-06-01-keel-match-tag-image-swap.md fix: restore tree dropped by 6d224861; land stem95su gdrive-sync (10m) [ci skip] 2026-06-09 08:45:33 +00:00
2026-06-02-immich-ml-ttl-gpu-oom-recruiter.md fix: restore tree dropped by 6d224861; land stem95su gdrive-sync (10m) [ci skip] 2026-06-09 08:45:33 +00:00
2026-06-09-t3-nightly-autoupdate-auth-outage.md t3: docs for the gated nightly tracker (runbook, post-mortem, service-catalog) 2026-06-16 11:33:49 +00:00
2026-06-10-authentik-downgrade-boot-storm.md authentik: incident hardening after the signin-speedup rollout storm 2026-06-11 00:26:52 +00:00
2026-06-10-forgejo-retention-orphaned-indexes.md forgejo retention: revert to DRY_RUN — first live run orphaned OCI indexes [ci skip] 2026-06-10 09:22:47 +00:00
2026-06-10-tuya-bridge-forgejo-pull-hairpin.md coredns: pods get internal split-horizon answers for viktorbarzin.me [ci skip] 2026-06-10 16:21:34 +00:00
2026-06-11-devvm-qemu-io-stall.md apply-mbps-caps: compare normalized option sets (true idempotency) + devvm I/O-stall post-mortem [ci skip] 2026-06-11 18:00:08 +00:00
2026-06-22-devvm-mem-io-overload-containment.md workstation: switch devvm OOM backstop from systemd-oomd to earlyoom 2026-06-22 10:39:16 +00:00
2026-06-24-kubeadm-oidc-drift-apiserver-upgrade-stall.md k8s-upgrade: reconcile kubeadm-config OIDC drift that crash-looped the v1.35 apiserver upgrade 2026-06-25 14:16:04 +00:00
index.html fix: restore tree dropped by 6d224861; land stem95su gdrive-sync (10m) [ci skip] 2026-06-09 08:45:33 +00:00