2026-06-25 15:23:15 +00:00
# Post-mortem: k8s 1.34→1.35 upgrade stalled — etcd IO starvation (2026-06-24)
> Filename kept for inbound links. The originally-suspected cause (kubeadm-config
> OIDC drift) turned out **not** to be the crash — see "Correction" below. The OIDC
> drift was a real *separate* latent bug fixed in the same change.
2026-06-25 14:16:04 +00:00
**Impact:** The autonomous k8s-version-upgrade chain (23:00 UTC nightly) reached
the master control-plane phase for the first time — preflight passed, etcd
snapshot taken, master cordoned + drained, etcd upgraded 3.6.5→3.6.6 — then the
kube-apiserver upgrade to v1.35.6 **crash-looped** . kubeadm waited its 5-minute
static-pod-hash window across all internal retries, then auto-rolled-back to
v1.34.9. The cluster stayed healthy on 1.34.9 (apiserver, all 7 nodes Ready), but
2026-06-25 15:23:15 +00:00
the run left **k8s-master cordoned** and the chain **wedged on `in_flight=1`** .
No data loss; no user-facing outage (the master carries control-plane taints, so
no workloads were displaced).
**Trigger:** the first *minor* upgrade the chain ever attempted (1.34→1.35) — the
first time kubeadm upgrades etcd (3.6.5→3.6.6) and regenerates the control-plane
static pods, i.e. the first time the upgrade pushes real write-IO at etcd.
## Root cause — etcd IO starvation on the shared HDD
The new kube-apiserver could not establish/keep a working connection to etcd
during the upgrade because **etcd was IO-starved** . etcd's surviving container log
from the crash window (`/var/log/pods/.../etcd/0.log` , 23:04– 23:20 UTC) shows:
- **1,180** `apply request took too long` warnings in 16 minutes;
- individual applies of **4.3s / 2.9s / 2.7s / 1.8s** (healthy is < 100ms ),
clustered at **23:18:51 UTC** — exactly when kubeadm's final attempt was trying
to bring the new apiserver up.
A reproduced 1.35.6 apiserver with no etcd dies with
`F instance.go:233 Error creating leases: error creating storage factory: context
deadline exceeded` — the same failure mode a multi-second etcd produces. etcd
lives on the contended `sdc` HDD (**beads code-oflt**: "etcd/critical VM disks on
shared sdc HDD — recurring IO-storm root cause"). The upgrade itself piled IO onto
that spindle:
1. etcd's own upgrade-restart + WAL/db re-read (it restarted ~23:04, re-elected);
2. kubeadm dumping a full ** ~400MB etcd DB backup** to
`/etc/kubernetes/tmp/kubeadm-backup-etcd-<ts>/` (on the same HDD) before the
etcd upgrade — and **145 of these had accumulated to 28GB** (kubeadm never
cleans them up), pushing master root fs to **73%** , above the 70% kubelet
image-GC threshold, so image GC churned during the drain too;
3. master-drain pod evictions.
### Correction — it was NOT the OIDC flag swap
`kubeadm upgrade diff v1.35.6` showed the regenerated manifest also swaps
`--authentication-config` (structured multi-issuer OIDC) back to legacy
single-issuer `--oidc-*` flags (kubeadm-config drift, see secondary finding). That
was the *first* hypothesis — but an isolated repro of the 1.35.6 apiserver with
those exact `--oidc-*` flags **and authentik reachable** initialised OIDC cleanly
(`oidc.go:313` , no error) and ran fine until it hit the (deliberately dead) test
etcd. So the auth swap does **not** crash the apiserver; it was a red herring for
the crash. Image pull (all v1.35.6 images pre-pulled), OOM (none), and disk-full
were also ruled out.
## Secondary finding (real, fixed separately) — kubeadm-config OIDC drift
apiserver auth is configured in three places that must agree:
(1) `/etc/kubernetes/pki/auth-config.yaml` (structured, two issuers: `kubernetes`
+ `k8s-dashboard` , added 2026-06-19); (2) the live static-pod manifest
(`--authentication-config` ); (3) the kubeadm-config `ClusterConfiguration` CM —
which still carried the legacy `--oidc-*` extraArgs. `kubeadm upgrade` regenerates
the manifest from (3), so it would have reverted structured auth → **dashboard +
kubectl SSO break after a successful upgrade** (recoverable: the chain's
post-master `restore.sh` re-adds the flag). This is a real bug, just not the crash.
2026-06-25 14:16:04 +00:00
## Resolution
2026-06-25 15:23:15 +00:00
1. **Reclaimed the 28GB kubeadm scratch** on master (`/etc/kubernetes/tmp/kubeadm-backup-*` ) — root fs 73% → 23%.
2. **Reconciled kubeadm-config live** (zero cluster impact — CM only read at upgrade time): dropped `--oidc-*` , added `--authentication-config` via `kubeadm init phase upload-config kubeadm` . `kubeadm upgrade diff` then shows only the control-plane image bumps.
3. **Recovered:** uncordoned k8s-master, cleared the stuck `in_flight` gauge + annotation, deleted last night's Complete/Failed `1-35-6` phase jobs (a Complete preflight would otherwise make the detector idempotent-skip the re-run).
2026-06-25 14:16:04 +00:00
2026-06-25 15:23:15 +00:00
## Prevention (landed in this change)
2026-06-25 14:16:04 +00:00
| Gap | Fix |
|-----|-----|
2026-06-25 15:23:15 +00:00
| kubeadm leaks ~400MB etcd-DB backups into `/etc/kubernetes/tmp` forever (→ disk fills, image-GC churn, write-IO on etcd's spindle) | ** `upgrade-step.sh` preflight now prunes** `/etc/kubernetes/tmp/kubeadm-backup-*` + `kubeadm-upgraded-manifests*` older than 3 days on master, every run. Best-effort, never aborts. |
| kubeadm-config drift would silently break SSO after an upgrade | `apiserver-oidc.tf` 's remote script now **also reconciles kubeadm-config** (`kubeadm init phase upload-config` ), delivered via the `apiserver-oidc-restore` ConfigMap the chain re-runs (CI needs no ssh) or a local `-replace` apply. Preflight **alerts** (not blocks — SSO drift is recoverable) if `kubeadm upgrade diff` would still drop `--authentication-config` . |
| etcd on the contended `sdc` HDD starves under upgrade IO | **Durable fix is beads code-oflt** (move etcd/critical VM disks off `sdc` ). Not in this change. Mitigations above reduce the upgrade's own IO; reclaimed disk removes the image-GC variable. |
2026-06-25 14:16:04 +00:00
## Lessons
2026-06-25 15:23:15 +00:00
- **Capture the failing component's own logs before concluding.** The `kubeadm
upgrade diff` made the OIDC swap look like the cause; only etcd's log (multi-second
applies) + an isolated apiserver repro showed the truth (etcd IO). A clean diff is
"what config changes," not "why it crashed."
- **etcd on shared HDD is the cluster's recurring fragility** (immich IO storm
2026-05-25, this stall). Upgrades concentrate IO (etcd restart + kubeadm's 400MB
backup copy + drain) onto that spindle. code-oflt is the real fix.
- **Tools that leave per-operation scratch must be reaped.** kubeadm's
`/etc/kubernetes/tmp` etcd backups are throwaway (real backups → NFS) but never
GC'd; 28GB had silently accumulated.
- **Out-of-band control-plane edits must be written back to kubeadm-config** — else
`kubeadm upgrade` silently reverts them (here: SSO; could be admission/audit/API flags).