k8s-upgrade: reclaim+auto-prune kubeadm /etc/kubernetes/tmp leak; correct crash root cause to etcd IO (not OIDC)

Digging into "why did the apiserver crash" disproved the earlier OIDC explanation. An isolated v1.35.6 apiserver repro with authentik reachable initialises OIDC cleanly (oidc.go:313, no error) and runs fine, so the --authentication-config -> --oidc-* revert is NOT what crashed it.

etcd's surviving crash-window log shows the real cause: 1180 "apply request took too long" warnings in 16 min, with individual applies taking up to 4.3s (healthy: <100ms), right as kubeadm tried to bring up the new apiserver. That's etcd IO starvation on the shared sdc HDD (beads code-oflt).

A big contributor, and the reason the master root fs sat at 73%: kubeadm dumps a full ~400MB etcd DB backup into /etc/kubernetes/tmp/kubeadm-backup-etcd-<ts>/ before every etcd upgrade and never cleans it up. 145 dirs / 28GB had accumulated, driving image-GC churn and extra write IO onto etcd's spindle. Reclaimed live (73% -> 23%) and added a preflight prune (>3 days) so it can't re-accumulate.

Also corrected the OIDC handling: the kubeadm-config drift is real, but it only breaks dashboard/kubectl SSO AFTER a successful upgrade (recoverable via the chain's restore.sh + the kubeadm-config reconciliation); it does not crash the apiserver. So the preflight check is now an ALERT, not a block (the block was added on the wrong hypothesis). Post-mortem, runbook, and apiserver-oidc.tf header corrected.

Per Viktor: reclaim the disk and automate so the manual cleanup never recurs; the durable IO fix remains code-oflt (etcd off the shared HDD).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
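For reference, a count like the 1180 warnings and the 4.3s worst case quoted above could be pulled from a saved etcd log with something like the sketch below. It is not the command that was actually run. It assumes JSON-formatted etcd log lines where each slow-apply warning carries a "took" field with a Go-style duration in ms or s; `slow_apply_summary` and the log path in the usage line are hypothetical names.

```shell
# Sketch only: summarise etcd slow-apply warnings from a saved log file.
# Assumption: each warning line contains "took":"<number><ms|s>".
# Durations in other units (e.g. "1m2s") still count as warnings but are
# ignored when computing the worst case.
slow_apply_summary() {
  local log_file="$1" count worst_ms
  # grep -c exits non-zero on zero matches, so guard it with || true.
  count=$(grep -cF 'apply request took too long' "$log_file" || true)
  worst_ms=$(grep -F 'apply request took too long' "$log_file" \
    | sed -nE 's/.*"took":"([0-9.]+)(ms|s)".*/\1 \2/p' \
    | awk '{ ms = ($2 == "s") ? $1 * 1000 : $1; if (ms > max) max = ms }
           END { printf "%.0f", max + 0 }')
  echo "slow applies: ${count}, worst: ${worst_ms} ms"
}
```

Usage would be along the lines of `slow_apply_summary ./etcd-crash-window.log`; anything far above the ~100ms healthy figure points at disk latency rather than at the apiserver.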
2026-06-25 15:23:15 +00:00
commit 9c68d147e0
parent 60a1cb9a25
4 changed files with 112 additions and 87 deletions
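The prune that this commit adds as preflight step 4c can be previewed without deleting anything. The sketch below reuses the same `find` predicates but prints matches instead of removing them. `preview_kubeadm_scratch` is a hypothetical helper, not part of upgrade-step.sh, and its defaults simply mirror the path and 3-day threshold used in the diff.

```shell
# Sketch only: dry-run of the kubeadm scratch prune (prints, never deletes).
# Args: [scratch_dir] [min_age_days]; defaults mirror step 4c in the diff.
preview_kubeadm_scratch() {
  local scratch_dir="${1:-/etc/kubernetes/tmp}" min_age_days="${2:-3}"
  # Same name and age filters as the real prune, with -print in place of
  # -exec rm -rf {} +. Output is sorted for stable, diffable listings.
  find "$scratch_dir" -maxdepth 1 -type d \
    \( -name 'kubeadm-backup-*' -o -name 'kubeadm-upgraded-manifests*' \) \
    -mtime +"$min_age_days" -print | sort
}
```

Note that `-mtime +3` matches entries at least four full days old, because `find` discards fractional days before comparing, so the effective rollback window is slightly longer than three days.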
--- a/stacks/k8s-version-upgrade/scripts/upgrade-step.sh
+++ b/stacks/k8s-version-upgrade/scripts/upgrade-step.sh
@@ -416,25 +416,39 @@ phase_preflight() {
    fi
  fi

-  # 4b. apiserver-OIDC drift gate (backstop for the rbac stack's kubeadm-config
+  # 4b. apiserver-OIDC drift check (backstop for the rbac stack's kubeadm-config
  # reconciliation). A `kubeadm upgrade` REGENERATES the apiserver manifest from
  # kubeadm-config; if kubeadm-config still carries the legacy single-issuer
  # --oidc-* args instead of --authentication-config, the regenerated apiserver
-  # reverts structured multi-issuer auth and CRASH-LOOPS — stalling the chain
-  # mid-flight with the master cordoned and etcd already bumped (the 2026-06-24
-  # v1.35 stall; docs/post-mortems/2026-06-24-kubeadm-oidc-drift-apiserver-upgrade-stall.md).
-  # `kubeadm upgrade diff` shows exactly what the manifest regen will change; a
-  # '-' line dropping --authentication-config means the drift is still present.
-  # Skip on an at-target master (resume — no apiserver regen). Best-effort: blocks
-  # only on a POSITIVE drift signal, never merely because diff is unavailable.
+  # loses structured multi-issuer auth → kubectl + dashboard SSO break AFTER the
+  # upgrade. This is RECOVERABLE (the apiserver does NOT crash — verified by an
+  # isolated repro 2026-06-24; the chain's post-master restore.sh re-adds the flag,
+  # and the rbac stack reconciles kubeadm-config so it won't recur) — so this is an
+  # ALERT, not a block. (NB the 2026-06-24 stall was NOT this — it was etcd IO
+  # starvation; see docs/post-mortems/2026-06-24-kubeadm-oidc-drift-apiserver-upgrade-stall.md.)
+  # Skip on an at-target master (resume — no apiserver regen).
  if [ "$master_kubelet_v" != "$TARGET_VERSION" ]; then
    local apiserver_diff
    apiserver_diff=$(ssh "${SSH_OPTS[@]}" "$(ssh_target k8s-master)" "sudo kubeadm upgrade diff v$TARGET_VERSION 2>/dev/null" || true)
    if echo "$apiserver_diff" | grep -qE '^-[[:space:]].*--authentication-config'; then
-      block "kubeadm upgrade would DROP --authentication-config from kube-apiserver (kubeadm-config OIDC drift → apiserver crash-loop). Re-apply the rbac stack (apiserver-oidc.tf reconciles kubeadm-config), then retry. Master NOT drained."
+      slack "WARN preflight — kubeadm upgrade will DROP --authentication-config (kubeadm-config OIDC drift). SSO breaks post-upgrade until restore.sh re-adds it; re-apply the rbac stack to reconcile kubeadm-config. Proceeding (recoverable, not a crash)."
    fi
  fi

+  # 4c. Reclaim kubeadm scratch on master. `kubeadm upgrade apply` dumps a full
+  # ~400MB etcd DB backup into /etc/kubernetes/tmp/kubeadm-backup-etcd-<ts>/ before
+  # every etcd upgrade and NEVER cleans it up — 145 dirs / 28GB had accumulated by
+  # 2026-06-24, pushing master root fs to 73% (image-GC churn + extra write IO on
+  # the shared HDD where etcd lives — a contributor to the etcd IO starvation that
+  # stalled that run, see post-mortem). Real etcd backups go to NFS, so these are
+  # throwaway. Prune ones >3 days old (keeps a short rollback window). Best-effort;
+  # never aborts the chain.
+  if [ "$master_kubelet_v" != "$TARGET_VERSION" ]; then
+    ssh "${SSH_OPTS[@]}" "$(ssh_target k8s-master)" \
+      "sudo find /etc/kubernetes/tmp -maxdepth 1 -type d \( -name 'kubeadm-backup-*' -o -name 'kubeadm-upgraded-manifests*' \) -mtime +3 -exec rm -rf {} + 2>/dev/null; echo -n 'master root after prune: '; df -h / | awk 'NR==2{print \$5\" used, \"\$4\" free\"}'" \
+      || echo "kubeadm-scratch prune skipped (ssh/df failed) — non-fatal"
+  fi
+
  # 5. Push in-flight + started_timestamp metrics + ns annotations
  $KUBECTL annotate ns "$NS" \
    "viktorbarzin.me/k8s-upgrade-in-flight=$(date -u +%FT%TZ)" \