k8s-version-upgrade: unblock 1.34.9 — skip kubeadm CoreDNS addon + busybox-date fix
All checks were successful
ci/woodpecker/push/default Pipeline was successful
All checks were successful
ci/woodpecker/push/default Pipeline was successful
The 1.34.9 master upgrade hard-failed `kubeadm upgrade apply` preflight: CoreDNS
is at v1.12.4 (Keel auto-bumped it 1.12.1 -> 1.12.4 on 2026-05-26 via a stale
kube-system out-of-band annotation), and 1.12.4 is ahead of kubeadm 1.34.9's
bundled corefile-migration table ("start version not supported").
- scripts/update_k8s.sh: master `kubeadm upgrade apply` now runs with
`--ignore-preflight-errors=CoreDNSMigration,CoreDNSUnsupportedPlugins
--skip-phases=addon/coredns`. A dry-run proved --ignore ALONE would overwrite
our custom split-horizon Corefile with kubeadm's default AND downgrade the
image; --skip-phases leaves CoreDNS 100% untouched while the control plane
upgrades. CoreDNS is pinned off Keel (keel.sh/policy=never) to stop the drift.
- stacks/k8s-version-upgrade/scripts/upgrade-step.sh: fix the preflight
quiet-baseline (settle-window) check, which silently no-op'd on the ghcr
claude-agent-service image's busybox `date` (can't parse ISO8601). Now tries
GNU then busybox `-D`, and warns+skips on parse failure (no silent fail-open).
- docs: runbook + architecture document the CoreDNS handling.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
This commit is contained in:
parent
042d1ce1ac
commit
037a609f27
4 changed files with 50 additions and 5 deletions
|
|
@ -118,6 +118,26 @@ Pushed by upgrade-step.sh during phase execution; observed by the
|
|||
- **`K8sUpgradeChainJobFailed`** — `kube_job_status_failed{namespace="k8s-upgrade",job_name=~"k8s-upgrade-.*",reason=~"BackoffLimitExceeded|DeadlineExceeded"} > 0` for 15m (warning). Catches a phase Job that **terminally failed before `k8s_upgrade_in_flight` was set** — the preflight gates exit pre-metric, so the two `in_flight`-based alerts above are blind to a failed preflight (this is what hid the 5-day 1.34.9 wedge on 2026-06-12). Reason-scoped to terminal job conditions so a retry-success doesn't false-positive (a bare failed-pod-count would otherwise also block kured for the Job's 7d TTL).
|
||||
- All four alerts ALSO block kured (same `--prometheus-url` halt-on-alert mechanism) so the OS-reboot pipeline can't run on top of a half-done version upgrade.
|
||||
|
||||
### CoreDNS is NOT upgraded by kubeadm here
|
||||
|
||||
CoreDNS runs a **custom split-horizon Corefile** (owned by the technitium stack)
|
||||
and its image is tracked separately — it must NOT be touched by kubeadm. The
|
||||
master `kubeadm upgrade apply` therefore runs with
|
||||
`--ignore-preflight-errors=CoreDNSMigration,CoreDNSUnsupportedPlugins
|
||||
--skip-phases=addon/coredns` (in `scripts/update_k8s.sh`), so kubeadm upgrades
|
||||
the control plane but leaves CoreDNS 100% untouched (image + Corefile). Without
|
||||
the `--skip-phases`, forcing past the preflight makes kubeadm overwrite the
|
||||
Corefile with its default and downgrade the image (verified via
|
||||
`kubeadm upgrade apply --dry-run`).
|
||||
|
||||
**Keep CoreDNS off Keel.** On 2026-06-12 Keel had auto-bumped CoreDNS
|
||||
v1.12.1 → v1.12.4 (kube-system out-of-band annotation from the 2026-05-26 Keel
|
||||
cascade), and 1.12.4 is ahead of kubeadm 1.34.9's corefile-migration table —
|
||||
which is what blocked the 1.34.9 upgrade. CoreDNS is now `keel.sh/policy=never`
|
||||
(`kubectl -n kube-system annotate deploy/coredns keel.sh/policy=never`). If a
|
||||
future kubeadm minor ships a CoreDNS that DOES know the running version, drop the
|
||||
`--skip-phases` for that run to let kubeadm re-take ownership.
|
||||
|
||||
### Vault secrets
|
||||
|
||||
- `secret/k8s-upgrade/ssh_key` — ed25519 PRIVATE key, used by Jobs to SSH `wizard@<node>`
|
||||
|
|
|
|||
Loading…
Add table
Add a link
Reference in a new issue