k8s-upgrade: reclaim+auto-prune kubeadm /etc/kubernetes/tmp leak; correct crash root cause to etcd IO (not OIDC)

Digging into "why did the apiserver crash" disproved the earlier OIDC explanation. An isolated v1.35.6 apiserver repro with authentik reachable initialises OIDC cleanly (oidc.go:313, no error) and runs fine, so the --authentication-config -> --oidc-* revert is NOT what crashed it.

etcd's surviving crash-window log is the real cause: 1180 "apply request took too long" warnings in 16 min, individual applies up to 4.3s (healthy <100ms) right as kubeadm tried to bring up the new apiserver. That's etcd IO starvation on the shared sdc HDD (beads code-oflt).

A big contributor + the reason master root fs sat at 73%: kubeadm dumps a full ~400MB etcd DB backup into /etc/kubernetes/tmp/kubeadm-backup-etcd-<ts>/ before every etcd upgrade and never cleans it up. 145 dirs / 28GB had accumulated, driving image-GC churn and extra write-IO onto etcd's spindle. Reclaimed live (73% -> 23%) and added a preflight prune (>3 days) so it can't re-accumulate.

Also corrected the OIDC handling: the kubeadm-config drift is real but only breaks dashboard/kubectl SSO AFTER a successful upgrade (recoverable via the chain's restore.sh + the kubeadm-config reconciliation); it does not crash the apiserver. So the preflight check is now an ALERT, not a block (was added on the wrong hypothesis). Post-mortem, runbook, and apiserver-oidc.tf header corrected.

Per Viktor: reclaim the disk and automate so the manual cleanup never recurs; the durable IO fix remains code-oflt (etcd off the shared HDD).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
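The slow-apply count quoted in the message can be re-derived from a saved copy of the etcd container log. The following is a minimal sketch, not the procedure used in the investigation: the function names are hypothetical, and it assumes one log entry per line with each line starting with an RFC 3339 timestamp (the usual layout of container logs under `/var/log/pods`).

```shell
#!/usr/bin/env bash
# Hypothetical helpers for summarising etcd slow-apply warnings.
# Assumption: the warning text matches the message quoted in the commit body.

# Print the number of slow-apply warnings in the given log file.
count_slow_applies() {
  local log_file="$1"
  # grep -c still prints 0 when nothing matches, but exits 1; ignore that.
  grep -c 'apply request took too long' "$log_file" || true
}

# Print "<count> <YYYY-MM-DDTHH:MM>" per minute, to show when warnings cluster.
# Assumption: the first 16 characters of each line are the timestamp up to
# the minute.
slow_applies_per_minute() {
  local log_file="$1"
  grep 'apply request took too long' "$log_file" | cut -c1-16 | sort | uniq -c
}
```

Running `slow_applies_per_minute` over the crash-window log would show whether the warnings cluster around the moment kubeadm brought up the new apiserver, which is the correlation the root-cause argument relies on.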
2026-06-25 15:23:15 +00:00 · 9c68d147e0
commit 9c68d147e0
parent 60a1cb9a25
4 changed files with 112 additions and 87 deletions
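Before trusting an automated prune, it can help to list what it would delete. The sketch below reuses the `find` predicate from the preflight step in this change but prints matches instead of removing them. The function name and its defaults are hypothetical. Note that with GNU `find`, `-mtime +3` ignores fractional days, so a directory only matches once it is at least four full days old.

```shell
#!/usr/bin/env bash
# Hypothetical dry-run helper: list kubeadm scratch directories that the
# preflight prune would remove, without deleting anything.
list_stale_kubeadm_scratch() {
  local scratch_dir="${1:-/etc/kubernetes/tmp}"  # default path from the change
  local max_age_days="${2:-3}"                   # same 3-day window as the prune
  find "$scratch_dir" -maxdepth 1 -type d \
    \( -name 'kubeadm-backup-*' -o -name 'kubeadm-upgraded-manifests*' \) \
    -mtime +"$max_age_days" -print
}
```

On the master this would typically need `sudo`, since `/etc/kubernetes/tmp` is root-owned. Piping the output to `wc -l` gives a quick count to compare against the accumulated-directory figure in the commit message.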
--- a/docs/post-mortems/2026-06-24-kubeadm-oidc-drift-apiserver-upgrade-stall.md
+++ b/docs/post-mortems/2026-06-24-kubeadm-oidc-drift-apiserver-upgrade-stall.md
@@ -1,4 +1,8 @@
-# Post-mortem: kubeadm-config OIDC drift crash-looped the v1.35 apiserver upgrade (2026-06-24)
+# Post-mortem: k8s 1.34→1.35 upgrade stalled — etcd IO starvation (2026-06-24)
+
+> Filename kept for inbound links. The originally-suspected cause (kubeadm-config
+> OIDC drift) turned out **not** to be the crash — see "Correction" below. The OIDC
+> drift was a real *separate* latent bug fixed in the same change.

 **Impact:** The autonomous k8s-version-upgrade chain (23:00 UTC nightly) reached
 the master control-plane phase for the first time — preflight passed, etcd
@@ -6,85 +10,88 @@ snapshot taken, master cordoned + drained, etcd upgraded 3.6.5→3.6.6 — then
 kube-apiserver upgrade to v1.35.6 **crash-looped**. kubeadm waited its 5-minute
 static-pod-hash window across all internal retries, then auto-rolled-back to
 v1.34.9. The cluster stayed healthy on 1.34.9 (apiserver, all 7 nodes Ready), but
-the run left **k8s-master cordoned** and the chain **wedged on `in_flight=1`**
-(which correctly blocks subsequent runs). No data loss; no user-facing outage
-(the master carries control-plane taints, so no workloads were displaced).
+the run left **k8s-master cordoned** and the chain **wedged on `in_flight=1`**.
+No data loss; no user-facing outage (the master carries control-plane taints, so
+no workloads were displaced).

-**Trigger:** the first *minor* upgrade the chain ever attempted (1.34→1.35).
-Patch upgrades never hit this because the apiserver manifest content is identical
-across patches; a minor upgrade is the first time kubeadm regenerates the
-manifest with a new image.
+**Trigger:** the first *minor* upgrade the chain ever attempted (1.34→1.35) — the
+first time kubeadm upgrades etcd (3.6.5→3.6.6) and regenerates the control-plane
+static pods, i.e. the first time the upgrade pushes real write-IO at etcd.

-## Root cause
+## Root cause — etcd IO starvation on the shared HDD

-apiserver authentication was configured in **two** places that were allowed to
-drift from a **third**:
+The new kube-apiserver could not establish/keep a working connection to etcd
+during the upgrade because **etcd was IO-starved**. etcd's surviving container log
+from the crash window (`/var/log/pods/.../etcd/0.log`, 23:04–23:20 UTC) shows:

-1. `/etc/kubernetes/pki/auth-config.yaml` — a structured `AuthenticationConfiguration`
-   (apiserver.config.k8s.io/v1) carrying **two** JWT issuers (`kubernetes` for
-   kubectl/kubelogin + `k8s-dashboard` for the dashboard's oauth2-proxy), added
-   2026-06-19 (`docs/plans/2026-06-04-k8s-dashboard-sso-design.md`).
-2. the **live** kube-apiserver static-pod manifest — referenced it via
-   `--authentication-config=/etc/kubernetes/pki/auth-config.yaml`.
-3. the **kubeadm-config `ClusterConfiguration` ConfigMap** — still carried the
-   **legacy single-issuer `--oidc-*` extraArgs** (`oidc-issuer-url`,
-   `oidc-client-id`, `oidc-username-claim`, `oidc-groups-claim`). Never updated
-   when (1)+(2) switched to structured auth.
+- **1,180** `apply request took too long` warnings in 16 minutes;
+- individual applies of **4.3s / 2.9s / 2.7s / 1.8s** (healthy is <100ms),
+  clustered at **23:18:51 UTC** — exactly when kubeadm's final attempt was trying
+  to bring the new apiserver up.

-`kubeadm upgrade apply` **regenerates the static-pod manifests from
-kubeadm-config**. So it dropped `--authentication-config` and re-added the four
-`--oidc-*` flags. Proven by `kubeadm upgrade diff v1.35.6`:
+A reproduced 1.35.6 apiserver with no etcd dies with
+`F instance.go:233 Error creating leases: error creating storage factory: context
+deadline exceeded` — the same failure mode a multi-second etcd produces. etcd
+lives on the contended `sdc` HDD (**beads code-oflt**: "etcd/critical VM disks on
+shared sdc HDD — recurring IO-storm root cause"). The upgrade itself piled IO onto
+that spindle:

-```diff
-    - --authentication-config=/etc/kubernetes/pki/auth-config.yaml
-+    - --oidc-issuer-url=https://authentik.viktorbarzin.me/application/o/kubernetes/
-+    - --oidc-client-id=kubernetes
-+    - --oidc-username-claim=email
-+    - --oidc-groups-claim=groups
-```
+1. etcd's own upgrade-restart + WAL/db re-read (it restarted ~23:04, re-elected);
+2. kubeadm dumping a full **~400MB etcd DB backup** to
+   `/etc/kubernetes/tmp/kubeadm-backup-etcd-<ts>/` (on the same HDD) before the
+   etcd upgrade — and **145 of these had accumulated to 28GB** (kubeadm never
+   cleans them up), pushing master root fs to **73%**, above the 70% kubelet
+   image-GC threshold, so image GC churned during the drain too;
+3. master-drain pod evictions.

-The regenerated apiserver crash-looped (`CrashLoopBackOff`, `back-off 10s`, 8
-probe failures in the kubelet journal) — it exited within seconds, repeatedly, so
-kubeadm's hash-watch never saw a stable new pod and timed out → rollback. (The
-`--oidc-*` flags are NOT removed in 1.35; the crash is the auth-config swap in the
-live control-plane environment, the only functional delta in the diff. Image
-pull, etcd, OOM, and disk were all ruled out: all v1.35.6 images were pre-pulled,
-etcd upgraded cleanly, no OOM, master root disk at 73%.)
+### Correction — it was NOT the OIDC flag swap

-**Why the existing safety net missed it:** `stacks/rbac/modules/rbac/apiserver-oidc.tf`
-already *knew* kubeadm drops `--authentication-config` and published a
-`apiserver-oidc-restore` ConfigMap for the chain to re-run **after** the upgrade.
-But the apiserver crashes *during* `kubeadm upgrade apply`, which never returns
-success, so the post-upgrade restore step is never reached.
+`kubeadm upgrade diff v1.35.6` showed the regenerated manifest also swaps
+`--authentication-config` (structured multi-issuer OIDC) back to legacy
+single-issuer `--oidc-*` flags (kubeadm-config drift, see secondary finding). That
+was the *first* hypothesis — but an isolated repro of the 1.35.6 apiserver with
+those exact `--oidc-*` flags **and authentik reachable** initialised OIDC cleanly
+(`oidc.go:313`, no error) and ran fine until it hit the (deliberately dead) test
+etcd. So the auth swap does **not** crash the apiserver; it was a red herring for
+the crash. Image pull (all v1.35.6 images pre-pulled), OOM (none), and disk-full
+were also ruled out.
+
+## Secondary finding (real, fixed separately) — kubeadm-config OIDC drift
+
+apiserver auth is configured in three places that must agree:
+(1) `/etc/kubernetes/pki/auth-config.yaml` (structured, two issuers: `kubernetes`
++ `k8s-dashboard`, added 2026-06-19); (2) the live static-pod manifest
+(`--authentication-config`); (3) the kubeadm-config `ClusterConfiguration` CM —
+which still carried the legacy `--oidc-*` extraArgs. `kubeadm upgrade` regenerates
+the manifest from (3), so it would have reverted structured auth → **dashboard +
+kubectl SSO break after a successful upgrade** (recoverable: the chain's
+post-master `restore.sh` re-adds the flag). This is a real bug, just not the crash.

 ## Resolution

-1. **Reconciled kubeadm-config live** (2026-06-24, zero cluster impact — the CM is
-   only read during an upgrade): rewrote `apiServer.extraArgs` to drop the
-   `--oidc-*` args and add `--authentication-config`, via `kubeadm init phase
-   upload-config kubeadm`. `kubeadm upgrade diff v1.35.6` then showed **only** the
-   control-plane image bumps — no auth-flag changes.
-2. **Recovered:** uncordoned k8s-master, cleared the stuck `in_flight` gauge +
-   namespace annotation.
+1. **Reclaimed the 28GB kubeadm scratch** on master (`/etc/kubernetes/tmp/kubeadm-backup-*`) — root fs 73% → 23%.
+2. **Reconciled kubeadm-config live** (zero cluster impact — CM only read at upgrade time): dropped `--oidc-*`, added `--authentication-config` via `kubeadm init phase upload-config kubeadm`. `kubeadm upgrade diff` then shows only the control-plane image bumps.
+3. **Recovered:** uncordoned k8s-master, cleared the stuck `in_flight` gauge + annotation, deleted last night's Complete/Failed `1-35-6` phase jobs (a Complete preflight would otherwise make the detector idempotent-skip the re-run).

-## Prevention (all landed in this change)
+## Prevention (landed in this change)

 | Gap | Fix |
 |-----|-----|
-| kubeadm-config not managed alongside the live manifest | `apiserver-oidc.tf`'s remote script now **also** reconciles kubeadm-config (`kubeadm init phase upload-config`). It reaches the cluster two ways: the published `apiserver-oidc-restore` ConfigMap (a plain k8s resource — CI applies it with no ssh) which the chain's `phase_master` re-runs, and a local `-replace` apply with `TF_VAR_ssh_private_key`. (The null_resource trigger deliberately does NOT hash the script: CI has no ssh key, so it must stay a no-op on a plain CI apply.) |
-| The chain drained the master into a crash with no pre-check | new **preflight gate 4b** in `upgrade-step.sh`: runs `kubeadm upgrade diff v$TARGET` and `block`s (k8s_upgrade_blocked=1 → K8sUpgradeBlocked alert) BEFORE snapshot/in-flight/drain if a `-` line would drop `--authentication-config`. Fails safe — blocks only on a positive drift signal. |
-| The live fix had to be applied out-of-band (only `default` Vault policy on the workstation; CI can't ssh) | kubeadm-config reconciled live via `kubeadm init phase upload-config` on the master (2026-06-24); the committed code makes it durable for future upgrades. |
+| kubeadm leaks ~400MB etcd-DB backups into `/etc/kubernetes/tmp` forever (→ disk fills, image-GC churn, write-IO on etcd's spindle) | **`upgrade-step.sh` preflight now prunes** `/etc/kubernetes/tmp/kubeadm-backup-*` + `kubeadm-upgraded-manifests*` older than 3 days on master, every run. Best-effort, never aborts. |
+| kubeadm-config drift would silently break SSO after an upgrade | `apiserver-oidc.tf`'s remote script now **also reconciles kubeadm-config** (`kubeadm init phase upload-config`), delivered via the `apiserver-oidc-restore` ConfigMap the chain re-runs (CI needs no ssh) or a local `-replace` apply. Preflight **alerts** (not blocks — SSO drift is recoverable) if `kubeadm upgrade diff` would still drop `--authentication-config`. |
+| etcd on the contended `sdc` HDD starves under upgrade IO | **Durable fix is beads code-oflt** (move etcd/critical VM disks off `sdc`). Not in this change. Mitigations above reduce the upgrade's own IO; reclaimed disk removes the image-GC variable. |

 ## Lessons

-- **Out-of-band control-plane edits must be written back to kubeadm-config.**
-  Anything that edits a static-pod manifest directly (auth, admission, audit, API
-  flags) is silently reverted on the next `kubeadm upgrade` unless kubeadm-config
-  itself carries it. `kubeadm upgrade diff <target>` is the authoritative
-  pre-flight check for "what will the upgrade change?" and is non-mutating.
-- **A post-upgrade fixup can't repair something that breaks the upgrade itself.**
-  The restore-after-upgrade design assumed the apiserver would come up (degraded)
-  and be fixed afterward; it actually crash-looped, so the fix has to be in
-  kubeadm-config *before* `apply`, plus a preflight gate.
-- **Minor upgrades exercise manifest regeneration; patch upgrades don't.** First
-  minor bump is where this whole class of drift surfaces.
+- **Capture the failing component's own logs before concluding.** The `kubeadm
+  upgrade diff` made the OIDC swap look like the cause; only etcd's log (multi-second
+  applies) + an isolated apiserver repro showed the truth (etcd IO). A clean diff is
+  "what config changes," not "why it crashed."
+- **etcd on shared HDD is the cluster's recurring fragility** (immich IO storm
+  2026-05-25, this stall). Upgrades concentrate IO (etcd restart + kubeadm's 400MB
+  backup copy + drain) onto that spindle. code-oflt is the real fix.
+- **Tools that leave per-operation scratch must be reaped.** kubeadm's
+  `/etc/kubernetes/tmp` etcd backups are throwaway (real backups → NFS) but never
+  GC'd; 28GB had silently accumulated.
+- **Out-of-band control-plane edits must be written back to kubeadm-config** — else
+  `kubeadm upgrade` silently reverts them (here: SSO; could be admission/audit/API flags).
--- a/docs/runbooks/k8s-version-upgrade.md
+++ b/docs/runbooks/k8s-version-upgrade.md
@@ -41,7 +41,8 @@ Job 0 — preflight       (pinned: k8s-node1)
  ├── halt-on-alert (kured-style ignore-list)
  ├── 24h-quiet baseline (no Ready transitions <24h ago)
  ├── kubeadm upgrade plan matches target (skipped when master already at target — partial-resume)
-  ├── apiserver-OIDC drift gate: kubeadm upgrade diff must NOT drop --authentication-config (else BLOCK+alert)
+  ├── apiserver-OIDC drift check: kubeadm upgrade diff drops --authentication-config? → Slack WARN (recoverable; not a block)
+  ├── reclaim kubeadm scratch: prune /etc/kubernetes/tmp/kubeadm-backup-* >3d on master (kubeadm leaks ~400MB etcd-db backups)
  ├── Push k8s_upgrade_in_flight=1, k8s_upgrade_started_timestamp=$(date +%s)
  ├── Trigger backup-etcd Job, wait, verify snapshot byte count
  ├── SSH master: containerd skew fix (if master < workers)
@@ -229,10 +230,10 @@ Exposed in K8s via ExternalSecret `k8s-upgrade-creds` in the `k8s-upgrade` names
 from kubeadm-config**. apiserver auth uses a structured multi-issuer
 `--authentication-config` (kubectl + dashboard SSO), but kubeadm-config used to
 still carry the legacy single-issuer `--oidc-*` extraArgs — so every upgrade
-reverted the flag. On the **1.34→1.35** bump that regenerated apiserver
-**crash-looped and stalled the whole upgrade mid-flight** (master cordoned, etcd
-already bumped); the post-upgrade restore below never ran because `kubeadm
-upgrade apply` itself never returned success. Post-mortem:
+reverted the flag, **silently breaking SSO after the upgrade** (the apiserver does
+NOT crash on this — verified by isolated repro; it's recoverable via the restore
+script below). NB: the **1.34→1.35 stall on 2026-06-24 was a *separate* issue —
+etcd IO starvation**, not this drift; post-mortem:
 `docs/post-mortems/2026-06-24-kubeadm-oidc-drift-apiserver-upgrade-stall.md`.

 **Primary fix (2026-06-24):** `stacks/rbac/modules/rbac/apiserver-oidc.tf` now
@@ -243,9 +244,9 @@ upgrades with a pure image bump — `kubeadm upgrade diff <target>` shows only t
 image change. Zero live impact (the CM is read only during an upgrade).

 **Backstops:**
-- **Preflight gate 4b** runs `kubeadm upgrade diff` and BLOCKs (k8s_upgrade_blocked=1
-  → alert) BEFORE draining the master if `--authentication-config` would still be
-  dropped — so this can never again drain into a crash.
+- **Preflight check 4b** runs `kubeadm upgrade diff` and **alerts** (Slack WARN, does
+  NOT block — the drift only breaks SSO, which is recoverable) if
+  `--authentication-config` would still be dropped.
 - The `rbac` stack still publishes its restore script to the
  `kube-system/apiserver-oidc-restore` ConfigMap, and `phase_master` re-runs it on
  master right after `kubeadm upgrade apply` (idempotent, `/livez`-gated with
--- a/stacks/k8s-version-upgrade/scripts/upgrade-step.sh
+++ b/stacks/k8s-version-upgrade/scripts/upgrade-step.sh
@@ -416,25 +416,39 @@ phase_preflight() {
    fi
  fi

-  # 4b. apiserver-OIDC drift gate (backstop for the rbac stack's kubeadm-config
+  # 4b. apiserver-OIDC drift check (backstop for the rbac stack's kubeadm-config
  # reconciliation). A `kubeadm upgrade` REGENERATES the apiserver manifest from
  # kubeadm-config; if kubeadm-config still carries the legacy single-issuer
  # --oidc-* args instead of --authentication-config, the regenerated apiserver
-  # reverts structured multi-issuer auth and CRASH-LOOPS — stalling the chain
-  # mid-flight with the master cordoned and etcd already bumped (the 2026-06-24
-  # v1.35 stall; docs/post-mortems/2026-06-24-kubeadm-oidc-drift-apiserver-upgrade-stall.md).
-  # `kubeadm upgrade diff` shows exactly what the manifest regen will change; a
-  # '-' line dropping --authentication-config means the drift is still present.
-  # Skip on an at-target master (resume — no apiserver regen). Best-effort: blocks
-  # only on a POSITIVE drift signal, never merely because diff is unavailable.
+  # loses structured multi-issuer auth → kubectl + dashboard SSO break AFTER the
+  # upgrade. This is RECOVERABLE (the apiserver does NOT crash — verified by an
+  # isolated repro 2026-06-24; the chain's post-master restore.sh re-adds the flag,
+  # and the rbac stack reconciles kubeadm-config so it won't recur) — so this is an
+  # ALERT, not a block. (NB the 2026-06-24 stall was NOT this — it was etcd IO
+  # starvation; see docs/post-mortems/2026-06-24-kubeadm-oidc-drift-apiserver-upgrade-stall.md.)
+  # Skip on an at-target master (resume — no apiserver regen).
  if [ "$master_kubelet_v" != "$TARGET_VERSION" ]; then
    local apiserver_diff
    apiserver_diff=$(ssh "${SSH_OPTS[@]}" "$(ssh_target k8s-master)" "sudo kubeadm upgrade diff v$TARGET_VERSION 2>/dev/null" || true)
    if echo "$apiserver_diff" | grep -qE '^-[[:space:]].*--authentication-config'; then
-      block "kubeadm upgrade would DROP --authentication-config from kube-apiserver (kubeadm-config OIDC drift → apiserver crash-loop). Re-apply the rbac stack (apiserver-oidc.tf reconciles kubeadm-config), then retry. Master NOT drained."
+      slack "WARN preflight — kubeadm upgrade will DROP --authentication-config (kubeadm-config OIDC drift). SSO breaks post-upgrade until restore.sh re-adds it; re-apply the rbac stack to reconcile kubeadm-config. Proceeding (recoverable, not a crash)."
    fi
  fi

+  # 4c. Reclaim kubeadm scratch on master. `kubeadm upgrade apply` dumps a full
+  # ~400MB etcd DB backup into /etc/kubernetes/tmp/kubeadm-backup-etcd-<ts>/ before
+  # every etcd upgrade and NEVER cleans it up — 145 dirs / 28GB had accumulated by
+  # 2026-06-24, pushing master root fs to 73% (image-GC churn + extra write IO on
+  # the shared HDD where etcd lives — a contributor to the etcd IO starvation that
+  # stalled that run, see post-mortem). Real etcd backups go to NFS, so these are
+  # throwaway. Prune ones >3 days old (keeps a short rollback window). Best-effort;
+  # never aborts the chain.
+  if [ "$master_kubelet_v" != "$TARGET_VERSION" ]; then
+    ssh "${SSH_OPTS[@]}" "$(ssh_target k8s-master)" \
+      "sudo find /etc/kubernetes/tmp -maxdepth 1 -type d \( -name 'kubeadm-backup-*' -o -name 'kubeadm-upgraded-manifests*' \) -mtime +3 -exec rm -rf {} + 2>/dev/null; echo -n 'master root after prune: '; df -h / | awk 'NR==2{print \$5\" used, \"\$4\" free\"}'" \
+      || echo "kubeadm-scratch prune skipped (ssh/df failed) — non-fatal"
+  fi
+
  # 5. Push in-flight + started_timestamp metrics + ns annotations
  $KUBECTL annotate ns "$NS" \
    "viktorbarzin.me/k8s-upgrade-in-flight=$(date -u +%FT%TZ)" \
--- a/stacks/rbac/modules/rbac/apiserver-oidc.tf
+++ b/stacks/rbac/modules/rbac/apiserver-oidc.tf
@@ -18,13 +18,16 @@
 #   3. the kubeadm-config ClusterConfiguration CM   — what kubeadm regenerates from
 # Originally only (1)+(2) were managed, so every kubeadm upgrade rewrote the
 # manifest from the STALE CM, reverting --authentication-config to single-issuer
-# --oidc-* flags. On k8s 1.35 that regenerated apiserver CRASH-LOOPED and stalled
-# the whole upgrade mid-flight (master cordoned, etcd already bumped) — see
-# docs/post-mortems/2026-06-24-kubeadm-oidc-drift-apiserver-upgrade-stall.md. The
+# --oidc-* flags. The consequence is SSO breakage AFTER the upgrade: kubectl +
+# dashboard lose multi-issuer auth (the apiserver does NOT crash on this — verified
+# by an isolated repro 2026-06-24; the 2026-06-24 v1.35 upgrade *stall* was a
+# separate etcd IO-starvation issue, see
+# docs/post-mortems/2026-06-24-kubeadm-oidc-drift-apiserver-upgrade-stall.md). The
 # remote script below now ALSO reconciles (3) via `kubeadm init phase
 # upload-config`, so a future kubeadm upgrade regenerates a CORRECT manifest. The
-# k8s-version-upgrade chain additionally GATES on `kubeadm upgrade diff` in
-# preflight and blocks+alerts if --authentication-config would still be dropped.
+# k8s-version-upgrade chain additionally ALERTS (does not block — SSO drift is
+# recoverable) via `kubeadm upgrade diff` in preflight if --authentication-config
+# would still be dropped.
 #
 # SAFETY: the remote script health-gates on /livez and AUTO-ROLLS-BACK the
 # manifest from a timestamped backup if the apiserver does not recover, so a