k8s-upgrade: reclaim+auto-prune kubeadm /etc/kubernetes/tmp leak; correct crash root cause to etcd IO (not OIDC)

Digging into "why did the apiserver crash" disproved the earlier OIDC
explanation. An isolated v1.35.6 apiserver repro with authentik reachable
initialises OIDC cleanly (oidc.go:313, no error) and runs fine — so the
--authentication-config -> --oidc-* revert is NOT what crashed it. etcd's
surviving crash-window log is the real cause: 1180 "apply request took too long"
warnings in 16 min, individual applies up to 4.3s (healthy <100ms) right as
kubeadm tried to bring up the new apiserver. That's etcd IO starvation on the
shared sdc HDD (beads code-oflt).

A big contributor + the reason master root fs sat at 73%: kubeadm dumps a full
~400MB etcd DB backup into /etc/kubernetes/tmp/kubeadm-backup-etcd-<ts>/ before
every etcd upgrade and never cleans it up — 145 dirs / 28GB had accumulated,
driving image-GC churn and extra write-IO onto etcd's spindle. Reclaimed live
(73% -> 23%) and added a preflight prune (>3 days) so it can't re-accumulate.

Also corrected the OIDC handling: the kubeadm-config drift is real but only
breaks dashboard/kubectl SSO AFTER a successful upgrade (recoverable via the
chain's restore.sh + the kubeadm-config reconciliation) — it does not crash the
apiserver. So the preflight check is now an ALERT, not a block (was added on the
wrong hypothesis). Post-mortem, runbook, and apiserver-oidc.tf header corrected.

Per Viktor: reclaim the disk and automate so the manual cleanup never recurs;
the durable IO fix remains code-oflt (etcd off the shared HDD).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Viktor Barzin 2026-06-25 15:23:15 +00:00
parent 60a1cb9a25
commit 9c68d147e0
4 changed files with 112 additions and 87 deletions


@@ -1,4 +1,8 @@
# Post-mortem: kubeadm-config OIDC drift crash-looped the v1.35 apiserver upgrade (2026-06-24)
# Post-mortem: k8s 1.34→1.35 upgrade stalled — etcd IO starvation (2026-06-24)
> Filename kept for inbound links. The originally-suspected cause (kubeadm-config
> OIDC drift) turned out **not** to be the crash — see "Correction" below. The OIDC
> drift was a real *separate* latent bug fixed in the same change.
**Impact:** The autonomous k8s-version-upgrade chain (23:00 UTC nightly) reached
the master control-plane phase for the first time — preflight passed, etcd
@@ -6,85 +10,88 @@ snapshot taken, master cordoned + drained, etcd upgraded 3.6.5→3.6.6 — then
kube-apiserver upgrade to v1.35.6 **crash-looped**. kubeadm waited its 5-minute
static-pod-hash window across all internal retries, then auto-rolled-back to
v1.34.9. The cluster stayed healthy on 1.34.9 (apiserver, all 7 nodes Ready), but
the run left **k8s-master cordoned** and the chain **wedged on `in_flight=1`**
(which correctly blocks subsequent runs). No data loss; no user-facing outage
(the master carries control-plane taints, so no workloads were displaced).
the run left **k8s-master cordoned** and the chain **wedged on `in_flight=1`**.
No data loss; no user-facing outage (the master carries control-plane taints, so
no workloads were displaced).
**Trigger:** the first *minor* upgrade the chain ever attempted (1.34→1.35).
Patch upgrades never hit this because the apiserver manifest content is identical
across patches; a minor upgrade is the first time kubeadm regenerates the
manifest with a new image.
**Trigger:** the first *minor* upgrade the chain ever attempted (1.34→1.35) — the
first time kubeadm upgrades etcd (3.6.5→3.6.6) and regenerates the control-plane
static pods, i.e. the first time the upgrade pushes real write-IO at etcd.
## Root cause
## Root cause — etcd IO starvation on the shared HDD
apiserver authentication was configured in **two** places that were allowed to
drift from a **third**:
The new kube-apiserver could not establish/keep a working connection to etcd
during the upgrade because **etcd was IO-starved**. etcd's surviving container log
from the crash window (`/var/log/pods/.../etcd/0.log`, 23:04–23:20 UTC) shows:
1. `/etc/kubernetes/pki/auth-config.yaml` — a structured `AuthenticationConfiguration`
(apiserver.config.k8s.io/v1) carrying **two** JWT issuers (`kubernetes` for
kubectl/kubelogin + `k8s-dashboard` for the dashboard's oauth2-proxy), added
2026-06-19 (`docs/plans/2026-06-04-k8s-dashboard-sso-design.md`).
2. the **live** kube-apiserver static-pod manifest — referenced it via
`--authentication-config=/etc/kubernetes/pki/auth-config.yaml`.
3. the **kubeadm-config `ClusterConfiguration` ConfigMap** — still carried the
**legacy single-issuer `--oidc-*` extraArgs** (`oidc-issuer-url`,
`oidc-client-id`, `oidc-username-claim`, `oidc-groups-claim`). Never updated
when (1)+(2) switched to structured auth.
- **1,180** `apply request took too long` warnings in 16 minutes;
- individual applies of **4.3s / 2.9s / 2.7s / 1.8s** (healthy is <100ms),
clustered at **23:18:51 UTC** — exactly when kubeadm's final attempt was trying
to bring the new apiserver up.
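
A minimal sketch of how the warning count above could be reproduced from a saved copy of the etcd container log. The function name `count_slow_applies` is hypothetical; only the message text is taken from this post-mortem, and the log path is left to the caller because the real path is elided above.

```shell
#!/bin/sh
# Sketch only: count_slow_applies is a hypothetical helper, not part of the chain.

# Print the number of log lines that carry etcd's slow-apply warning.
count_slow_applies() {
    log_file=$1
    # grep -c prints the match count but exits 1 when the count is zero,
    # so "|| true" keeps the helper usable under "set -e".
    grep -c 'apply request took too long' "$log_file" || true
}
```

Running `count_slow_applies` against a copy of the crash-window log should give a figure comparable to the count quoted above, provided the same time window is used.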
`kubeadm upgrade apply` **regenerates the static-pod manifests from
kubeadm-config**. So it dropped `--authentication-config` and re-added the four
`--oidc-*` flags. Proven by `kubeadm upgrade diff v1.35.6`:
A reproduced 1.35.6 apiserver with no etcd dies with
`F instance.go:233 Error creating leases: error creating storage factory: context
deadline exceeded`, the same failure mode that an etcd with multi-second apply latency produces. etcd
lives on the contended `sdc` HDD (**beads code-oflt**: "etcd/critical VM disks on
shared sdc HDD — recurring IO-storm root cause"). The upgrade itself piled IO onto
that spindle:
```diff
- - --authentication-config=/etc/kubernetes/pki/auth-config.yaml
+ - --oidc-issuer-url=https://authentik.viktorbarzin.me/application/o/kubernetes/
+ - --oidc-client-id=kubernetes
+ - --oidc-username-claim=email
+ - --oidc-groups-claim=groups
```
1. etcd's own upgrade-restart + WAL/db re-read (it restarted ~23:04, re-elected);
2. kubeadm dumping a full **~400MB etcd DB backup** to
`/etc/kubernetes/tmp/kubeadm-backup-etcd-<ts>/` (on the same HDD) before the
etcd upgrade — and **145 of these had accumulated to 28GB** (kubeadm never
cleans them up), pushing master root fs to **73%**, above the 70% kubelet
image-GC threshold, so image GC churned during the drain too;
3. master-drain pod evictions.
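
The disk-side contributors in the list above can be measured with a few lines of shell. This is a sketch under assumptions: the function names are hypothetical, the scratch path defaults to the one named in this post-mortem, and the 70% threshold is the figure quoted here rather than something read from the kubelet config.

```shell
#!/bin/sh
# Sketch only: function names are hypothetical, not taken from upgrade-step.sh.

# Count kubeadm etcd-backup directories directly under a scratch directory.
count_kubeadm_backups() {
    scratch_dir=${1:-/etc/kubernetes/tmp}
    find "$scratch_dir" -mindepth 1 -maxdepth 1 -type d \
        -name 'kubeadm-backup-etcd-*' 2>/dev/null | wc -l | tr -d ' '
}

# Print a filesystem's usage as a bare integer percentage (POSIX "df -P" layout).
fs_used_percent() {
    df -P "${1:-/}" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Succeed when usage is above the given threshold (70 in this post-mortem).
above_gc_threshold() {
    [ "$(fs_used_percent "${1:-/}")" -gt "${2:-70}" ]
}
```

On the master, `count_kubeadm_backups` and `above_gc_threshold /` together show whether the scratch directory has grown and whether root fs sits above the image-GC threshold.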
The regenerated apiserver crash-looped (`CrashLoopBackOff`, `back-off 10s`, 8
probe failures in the kubelet journal) — it exited within seconds, repeatedly, so
kubeadm's hash-watch never saw a stable new pod and timed out → rollback. (The
`--oidc-*` flags are NOT removed in 1.35; the crash is the auth-config swap in the
live control-plane environment, the only functional delta in the diff. Image
pull, etcd, OOM, and disk were all ruled out: all v1.35.6 images were pre-pulled,
etcd upgraded cleanly, no OOM, master root disk at 73%.)
### Correction — it was NOT the OIDC flag swap
**Why the existing safety net missed it:** `stacks/rbac/modules/rbac/apiserver-oidc.tf`
already *knew* kubeadm drops `--authentication-config` and published a
`apiserver-oidc-restore` ConfigMap for the chain to re-run **after** the upgrade.
But the apiserver crashes *during* `kubeadm upgrade apply`, which never returns
success, so the post-upgrade restore step is never reached.
`kubeadm upgrade diff v1.35.6` showed the regenerated manifest also swaps
`--authentication-config` (structured multi-issuer OIDC) back to legacy
single-issuer `--oidc-*` flags (kubeadm-config drift, see secondary finding). That
was the *first* hypothesis — but an isolated repro of the 1.35.6 apiserver with
those exact `--oidc-*` flags **and authentik reachable** initialised OIDC cleanly
(`oidc.go:313`, no error) and ran fine until it hit the (deliberately dead) test
etcd. So the auth swap does **not** crash the apiserver; it was a red herring for
the crash. Image pull (all v1.35.6 images pre-pulled), OOM (none), and disk-full
were also ruled out.
## Secondary finding (real, fixed separately) — kubeadm-config OIDC drift
apiserver auth is configured in three places that must agree:
(1) `/etc/kubernetes/pki/auth-config.yaml` (structured, two issuers: `kubernetes`
+ `k8s-dashboard`, added 2026-06-19); (2) the live static-pod manifest
(`--authentication-config`); (3) the kubeadm-config `ClusterConfiguration` CM —
which still carried the legacy `--oidc-*` extraArgs. `kubeadm upgrade` regenerates
the manifest from (3), so it would have reverted structured auth → **dashboard +
kubectl SSO break after a successful upgrade** (recoverable: the chain's
post-master `restore.sh` re-adds the flag). This is a real bug, just not the crash.
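
The drift described above can be detected from the diff text alone. The following is a minimal sketch, not the actual check in `upgrade-step.sh`: the function name `diff_drops_auth_config` is hypothetical, and it assumes the diff arrives on stdin in unified format, where a removed line starts with `-`.

```shell
#!/bin/sh
# Sketch only: diff_drops_auth_config is a hypothetical helper.

# Exit 0 (drift detected) when a unified diff on stdin contains a removed
# line ("-" prefix) that mentions --authentication-config; exit 1 otherwise.
diff_drops_auth_config() {
    grep -q -e '^-.*--authentication-config'
}
```

A preflight could pipe the output of `kubeadm upgrade diff` for the target version into this function and raise an alert on exit status 0; how the alert itself is emitted is outside this sketch.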
## Resolution
1. **Reconciled kubeadm-config live** (2026-06-24, zero cluster impact — the CM is
only read during an upgrade): rewrote `apiServer.extraArgs` to drop the
`--oidc-*` args and add `--authentication-config`, via `kubeadm init phase
upload-config kubeadm`. `kubeadm upgrade diff v1.35.6` then showed **only** the
control-plane image bumps — no auth-flag changes.
2. **Recovered:** uncordoned k8s-master, cleared the stuck `in_flight` gauge +
namespace annotation.
1. **Reclaimed the 28GB kubeadm scratch** on master (`/etc/kubernetes/tmp/kubeadm-backup-*`) — root fs 73% → 23%.
2. **Reconciled kubeadm-config live** (zero cluster impact — CM only read at upgrade time): dropped `--oidc-*`, added `--authentication-config` via `kubeadm init phase upload-config kubeadm`. `kubeadm upgrade diff` then shows only the control-plane image bumps.
3. **Recovered:** uncordoned k8s-master, cleared the stuck `in_flight` gauge + annotation, deleted last night's Complete/Failed `1-35-6` phase jobs (a Complete preflight would otherwise make the detector idempotent-skip the re-run).
## Prevention (all landed in this change)
## Prevention (landed in this change)
| Gap | Fix |
|-----|-----|
| kubeadm-config not managed alongside the live manifest | `apiserver-oidc.tf`'s remote script now **also** reconciles kubeadm-config (`kubeadm init phase upload-config`). It reaches the cluster two ways: the published `apiserver-oidc-restore` ConfigMap (a plain k8s resource — CI applies it with no ssh) which the chain's `phase_master` re-runs, and a local `-replace` apply with `TF_VAR_ssh_private_key`. (The null_resource trigger deliberately does NOT hash the script: CI has no ssh key, so it must stay a no-op on a plain CI apply.) |
| The chain drained the master into a crash with no pre-check | new **preflight gate 4b** in `upgrade-step.sh`: runs `kubeadm upgrade diff v$TARGET` and `block`s (k8s_upgrade_blocked=1 → K8sUpgradeBlocked alert) BEFORE snapshot/in-flight/drain if a `-` line would drop `--authentication-config`. Fails safe — blocks only on a positive drift signal. |
| The live fix had to be applied out-of-band (only `default` Vault policy on the workstation; CI can't ssh) | kubeadm-config reconciled live via `kubeadm init phase upload-config` on the master (2026-06-24); the committed code makes it durable for future upgrades. |
| kubeadm leaks ~400MB etcd-DB backups into `/etc/kubernetes/tmp` forever (→ disk fills, image-GC churn, write-IO on etcd's spindle) | **`upgrade-step.sh` preflight now prunes** `/etc/kubernetes/tmp/kubeadm-backup-*` + `kubeadm-upgraded-manifests*` older than 3 days on master, every run. Best-effort, never aborts. |
| kubeadm-config drift would silently break SSO after an upgrade | `apiserver-oidc.tf`'s remote script now **also reconciles kubeadm-config** (`kubeadm init phase upload-config`), delivered via the `apiserver-oidc-restore` ConfigMap the chain re-runs (CI needs no ssh) or a local `-replace` apply. Preflight **alerts** (not blocks — SSO drift is recoverable) if `kubeadm upgrade diff` would still drop `--authentication-config`. |
| etcd on the contended `sdc` HDD starves under upgrade IO | **Durable fix is beads code-oflt** (move etcd/critical VM disks off `sdc`). Not in this change. Mitigations above reduce the upgrade's own IO; reclaimed disk removes the image-GC variable. |
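
The prune row in the table above could look roughly like the sketch below. It is an illustration under assumptions, not the code that landed in `upgrade-step.sh`: the function name `prune_kubeadm_scratch` and its parameters are hypothetical, while the directory and name patterns are the ones listed in the table.

```shell
#!/bin/sh
# Sketch only: prune_kubeadm_scratch is a hypothetical helper.

# Best-effort removal of kubeadm scratch directories older than N days.
# Never fails the caller: a missing directory or a failed delete is ignored.
prune_kubeadm_scratch() {
    scratch_dir=${1:-/etc/kubernetes/tmp}
    max_age_days=${2:-3}
    [ -d "$scratch_dir" ] || return 0
    # -mtime +N matches entries whose age in whole days is greater than N.
    # -maxdepth 1 with -type d limits the match to top-level directories.
    find "$scratch_dir" -mindepth 1 -maxdepth 1 -type d \
        \( -name 'kubeadm-backup-*' -o -name 'kubeadm-upgraded-manifests*' \) \
        -mtime +"$max_age_days" -exec rm -rf -- {} + 2>/dev/null || true
}
```

Restricting the match to the two name patterns keeps unrelated content in the scratch directory untouched, and the trailing `|| true` matches the "best-effort, never aborts" behaviour described in the table.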
## Lessons
- **Out-of-band control-plane edits must be written back to kubeadm-config.**
Anything that edits a static-pod manifest directly (auth, admission, audit, API
flags) is silently reverted on the next `kubeadm upgrade` unless kubeadm-config
itself carries it. `kubeadm upgrade diff <target>` is the authoritative
pre-flight check for "what will the upgrade change?" and is non-mutating.
- **A post-upgrade fixup can't repair something that breaks the upgrade itself.**
The restore-after-upgrade design assumed the apiserver would come up (degraded)
and be fixed afterward; it actually crash-looped, so the fix has to be in
kubeadm-config *before* `apply`, plus a preflight gate.
- **Minor upgrades exercise manifest regeneration; patch upgrades don't.** First
minor bump is where this whole class of drift surfaces.
- **Capture the failing component's own logs before concluding.** The `kubeadm
upgrade diff` made the OIDC swap look like the cause; only etcd's log (multi-second
applies) + an isolated apiserver repro showed the truth (etcd IO). A diff answers
"what config changes," not "why it crashed."
- **etcd on shared HDD is the cluster's recurring fragility** (immich IO storm
2026-05-25, this stall). Upgrades concentrate IO (etcd restart + kubeadm's 400MB
backup copy + drain) onto that spindle. code-oflt is the real fix.
- **Tools that leave per-operation scratch must be reaped.** kubeadm's
`/etc/kubernetes/tmp` etcd backups are throwaway (real backups → NFS) but never
GC'd; 28GB had silently accumulated.
- **Out-of-band control-plane edits must be written back to kubeadm-config** — else
`kubeadm upgrade` silently reverts them (here: SSO; could be admission/audit/API flags).