Viktor Barzin 9c68d147e0

k8s-upgrade: reclaim+auto-prune kubeadm /etc/kubernetes/tmp leak; correct crash root cause to etcd IO (not OIDC)

Digging into "why did the apiserver crash" disproved the earlier OIDC
explanation. An isolated v1.35.6 apiserver repro with authentik reachable
initialises OIDC cleanly (oidc.go:313, no error) and runs fine — so the
--authentication-config -> --oidc-* revert is NOT what crashed it. etcd's
surviving crash-window log is the real cause: 1180 "apply request took too long"
warnings in 16 min, individual applies up to 4.3s (healthy <100ms) right as
kubeadm tried to bring up the new apiserver. That's etcd IO starvation on the
shared sdc HDD (beads code-oflt).

A big contributor + the reason master root fs sat at 73%: kubeadm dumps a full
~400MB etcd DB backup into /etc/kubernetes/tmp/kubeadm-backup-etcd-<ts>/ before
every etcd upgrade and never cleans it up — 145 dirs / 28GB had accumulated,
driving image-GC churn and extra write-IO onto etcd's spindle. Reclaimed live
(73% -> 23%) and added a preflight prune (>3 days) so it can't re-accumulate.

Also corrected the OIDC handling: the kubeadm-config drift is real but only
breaks dashboard/kubectl SSO AFTER a successful upgrade (recoverable via the
chain's restore.sh + the kubeadm-config reconciliation) — it does not crash the
apiserver. So the preflight check is now an ALERT, not a block (was added on the
wrong hypothesis). Post-mortem, runbook, and apiserver-oidc.tf header corrected.

Per Viktor: reclaim the disk and automate so the manual cleanup never recurs;
the durable IO fix remains code-oflt (etcd off the shared HDD).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

2026-06-25 15:23:15 +00:00


Post-mortem: k8s 1.34→1.35 upgrade stalled — etcd IO starvation (2026-06-24)

Filename kept for inbound links. The originally-suspected cause (kubeadm-config OIDC drift) turned out not to be the crash — see "Correction" below. The OIDC drift was a real separate latent bug fixed in the same change.

Impact: The autonomous k8s-version-upgrade chain (23:00 UTC nightly) reached the master control-plane phase for the first time — preflight passed, etcd snapshot taken, master cordoned + drained, etcd upgraded 3.6.5→3.6.6 — then the kube-apiserver upgrade to v1.35.6 crash-looped. kubeadm waited its 5-minute static-pod-hash window across all internal retries, then auto-rolled-back to v1.34.9. The cluster stayed healthy on 1.34.9 (apiserver, all 7 nodes Ready), but the run left k8s-master cordoned and the chain wedged on in_flight=1. No data loss; no user-facing outage (the master carries control-plane taints, so no workloads were displaced).

Trigger: the first minor upgrade the chain ever attempted (1.34→1.35) — the first time kubeadm upgrades etcd (3.6.5→3.6.6) and regenerates the control-plane static pods, i.e. the first time the upgrade pushes real write-IO at etcd.

Root cause — etcd IO starvation on the shared HDD

The new kube-apiserver could not establish/keep a working connection to etcd during the upgrade because etcd was IO-starved. etcd's surviving container log from the crash window (/var/log/pods/.../etcd/0.log, 23:04–23:20 UTC) shows:

1,180 apply request took too long warnings in 16 minutes;
individual applies of 4.3s / 2.9s / 2.7s / 1.8s (healthy is <100ms), clustered at 23:18:51 UTC — exactly when kubeadm's final attempt was trying to bring the new apiserver up.
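
The tally above can be reproduced from a saved copy of the log. This is a minimal sketch, assuming the file holds etcd's JSON log lines in the kubelet's CRI text format and that the `took` values are plain `s` or `ms` durations. The function names are illustrative and not part of the chain's scripts.

```shell
#!/bin/sh
# Tally etcd slow-apply warnings from a saved container log.
# Assumption: each warning line contains
#   "msg":"apply request took too long","took":"<duration>"
# with <duration> like "4.3s" or "250ms". Minute-scale values are not handled.

# count_slow_applies FILE -> number of slow-apply warnings in FILE
count_slow_applies() {
  # grep -c exits 1 when the count is 0; keep the function best-effort.
  grep -c 'apply request took too long' "$1" || true
}

# worst_applies FILE [N] -> the N slowest applies in seconds, slowest first
worst_applies() {
  grep 'apply request took too long' "$1" \
    | grep -o '"took":"[0-9.]*m\{0,1\}s"' \
    | awk -F'"' '{
        v = $4
        if (v ~ /ms$/) { sub(/ms$/, "", v); v = v / 1000 }
        else           { sub(/s$/, "", v) }
        printf "%.3f\n", v
      }' \
    | sort -rn | head -n "${2:-5}"
}

# Usage (ETCD_LOG is a placeholder for the saved 0.log path):
#   count_slow_applies "$ETCD_LOG"
#   worst_applies "$ETCD_LOG" 4
```

Comparing the slowest values against the <100ms healthy figure gives the same picture as reading the log by hand, without scrolling through every warning.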

A reproduced 1.35.6 apiserver with no etcd dies with F instance.go:233 Error creating leases: error creating storage factory: context deadline exceeded — the same failure mode a multi-second etcd produces. etcd lives on the contended sdc HDD (beads code-oflt: "etcd/critical VM disks on shared sdc HDD — recurring IO-storm root cause"). The upgrade itself piled IO onto that spindle:

etcd's own upgrade-restart + WAL/db re-read (it restarted ~23:04, re-elected);
kubeadm dumping a full ~400MB etcd DB backup to /etc/kubernetes/tmp/kubeadm-backup-etcd-<ts>/ (on the same HDD) before the etcd upgrade — and 145 of these had accumulated to 28GB (kubeadm never cleans them up), pushing master root fs to 73%, above the 70% kubelet image-GC threshold, so image GC churned during the drain too;
master-drain pod evictions.

Correction — it was NOT the OIDC flag swap

kubeadm upgrade diff v1.35.6 showed the regenerated manifest also swaps --authentication-config (structured multi-issuer OIDC) back to legacy single-issuer --oidc-* flags (kubeadm-config drift, see secondary finding). That was the first hypothesis — but an isolated repro of the 1.35.6 apiserver with those exact --oidc-* flags and authentik reachable initialised OIDC cleanly (oidc.go:313, no error) and ran fine until it hit the (deliberately dead) test etcd. So the auth swap does not crash the apiserver; it was a red herring for the crash. Image pull (all v1.35.6 images pre-pulled), OOM (none), and disk-full were also ruled out.

Secondary finding (real, fixed separately) — kubeadm-config OIDC drift

apiserver auth is configured in three places that must agree: (1) /etc/kubernetes/pki/auth-config.yaml (structured, two issuers: kubernetes and k8s-dashboard, added 2026-06-19); (2) the live static-pod manifest (--authentication-config); (3) the kubeadm-config ClusterConfiguration CM, which still carried the legacy --oidc-* extraArgs. kubeadm upgrade regenerates the manifest from (3), so it would have reverted structured auth, breaking dashboard and kubectl SSO after a successful upgrade (recoverable: the chain's post-master restore.sh re-adds the flag). This is a real bug, just not the crash.
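
A rough way to spot this drift in place (3) before an upgrade. This is a sketch only: the function name and messages are hypothetical, and it assumes the legacy setup always sets `oidc-issuer-url` and the structured setup always sets `authentication-config` in the apiserver extraArgs.

```shell
#!/bin/sh
# Classify a kubeadm ClusterConfiguration (YAML on stdin) as "ok" or "drift".
# Plain substring matching, so it works whether extraArgs is a map or a
# name/value list. It does not validate the YAML.
kubeadm_config_auth_state() {
  cfg=$(cat)
  case $cfg in
    *oidc-issuer-url*)
      echo "drift: legacy oidc-* extraArgs present"
      return 1 ;;
  esac
  case $cfg in
    *authentication-config*)
      echo "ok" ;;
    *)
      echo "drift: authentication-config missing"
      return 1 ;;
  esac
}

# Usage against the live ConfigMap:
#   kubectl -n kube-system get cm kubeadm-config \
#     -o jsonpath='{.data.ClusterConfiguration}' | kubeadm_config_auth_state
```

A non-zero return here would only justify an alert, since the resulting SSO break is recoverable after the upgrade.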

Resolution

Reclaimed the 28GB kubeadm scratch on master (/etc/kubernetes/tmp/kubeadm-backup-*) — root fs 73% → 23%.
Reconciled kubeadm-config live (zero cluster impact — CM only read at upgrade time): dropped --oidc-*, added --authentication-config via kubeadm init phase upload-config kubeadm. kubeadm upgrade diff then shows only the control-plane image bumps.
Recovered: uncordoned k8s-master, cleared the stuck in_flight gauge + annotation, deleted last night's Complete/Failed 1-35-6 phase jobs (a Complete preflight would otherwise make the detector idempotent-skip the re-run).

Prevention (landed in this change)

| Gap | Fix |
| --- | --- |
| kubeadm leaks ~400MB etcd-DB backups into `/etc/kubernetes/tmp` forever (→ disk fills, image-GC churn, write-IO on etcd's spindle) | `upgrade-step.sh` preflight now prunes `/etc/kubernetes/tmp/kubeadm-backup-` + `kubeadm-upgraded-manifests` older than 3 days on master, every run. Best-effort, never aborts. |
| kubeadm-config drift would silently break SSO after an upgrade | `apiserver-oidc.tf`'s remote script now also reconciles kubeadm-config (`kubeadm init phase upload-config`), delivered via the `apiserver-oidc-restore` ConfigMap the chain re-runs (CI needs no ssh) or a local `-replace` apply. Preflight alerts (not blocks; SSO drift is recoverable) if `kubeadm upgrade diff` would still drop `--authentication-config`. |
| etcd on the contended `sdc` HDD starves under upgrade IO | Durable fix is beads code-oflt (move etcd/critical VM disks off `sdc`). Not in this change. Mitigations above reduce the upgrade's own IO; reclaimed disk removes the image-GC variable. |
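
For illustration, a best-effort prune of the kind described in the first row could look like the sketch below. This is not the actual `upgrade-step.sh` code: the function name and the two glob patterns are assumptions, and it relies on GNU or BSD `find` for `-mindepth`, `-maxdepth` and `-mmin`.

```shell
#!/bin/sh
# prune_kubeadm_scratch [DIR] [DAYS]
# Remove kubeadm scratch directories directly under DIR whose mtime is more
# than DAYS days old. Never fails the caller: every error path returns 0.
prune_kubeadm_scratch() {
  scratch_dir=${1:-/etc/kubernetes/tmp}
  max_age_days=${2:-3}

  # Nothing to do if the scratch directory does not exist.
  [ -d "$scratch_dir" ] || return 0

  # -mmin gives an exact "older than N days" cutoff; -mtime +N truncates to
  # whole days and would only match directories at least N+1 days old.
  find "$scratch_dir" -mindepth 1 -maxdepth 1 -type d \
    \( -name 'kubeadm-backup-*' -o -name 'kubeadm-upgraded-manifests*' \) \
    -mmin +"$((max_age_days * 1440))" \
    -exec rm -rf {} + 2>/dev/null || true
}

# Usage in a preflight step:
#   prune_kubeadm_scratch /etc/kubernetes/tmp 3
```

Matching only on the two name patterns, and only one level deep, keeps the prune from touching anything else kubeadm or an operator may have left in the directory.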

Lessons

Capture the failing component's own logs before concluding. The kubeadm upgrade diff made the OIDC swap look like the cause; only etcd's log (multi-second applies) + an isolated apiserver repro showed the truth (etcd IO). A diff answers "what config changes," not "why it crashed."
etcd on shared HDD is the cluster's recurring fragility (immich IO storm 2026-05-25, this stall). Upgrades concentrate IO (etcd restart + kubeadm's 400MB backup copy + drain) onto that spindle. code-oflt is the real fix.
Tools that leave per-operation scratch must be reaped. kubeadm's /etc/kubernetes/tmp etcd backups are throwaway (real backups → NFS) but never GC'd; 28GB had silently accumulated.
Out-of-band control-plane edits must be written back to kubeadm-config — else kubeadm upgrade silently reverts them (here: SSO; could be admission/audit/API flags).
