k8s-version-upgrade: decompose into Job chain to fix self-preemption
The agent-based v1 ran inside claude-agent-service (replicas=1, no
nodeSelector) and self-evicted when it tried to drain its host (k8s-node4
on 2026-05-11). Cluster ended half-upgraded (master v1.34.7, workers
v1.34.2) until manual recovery.
Rewrite the pipeline as a chain of nodeSelector-pinned Jobs:
preflight (k8s-node1)
→ master (k8s-node1) drains k8s-master
→ worker × 3 (k8s-node1) drains k8s-node{4,3,2}
→ worker (k8s-master + control-plane toleration) drains k8s-node1
→ postflight (no pinning)
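The ordering above can be read as a small state machine. The sketch below is illustrative only: `next_step` and `scheduling_block` are hypothetical helper names (the real dispatch lives in scripts/upgrade-step.sh), and the YAML indentation emitted by `scheduling_block` is an assumption about where the fragment lands in the Job template. It prints the next phase, its drain target, and the node the next Job's pod is pinned to, so that the pin is never the drain target.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the chain ordering; not the actual upgrade-step.sh logic.

# Print "<next_phase> <drain_target> <pin_node>" for the step after $1/$2.
# "-" means "none" (postflight has no drain target and no pinning).
next_step() {
  local phase="$1" node="${2:-}"
  case "$phase:$node" in
    preflight:*)      echo "master k8s-master k8s-node1" ;;
    master:*)         echo "worker k8s-node4 k8s-node1" ;;
    worker:k8s-node4) echo "worker k8s-node3 k8s-node1" ;;
    worker:k8s-node3) echo "worker k8s-node2 k8s-node1" ;;
    worker:k8s-node2) echo "worker k8s-node1 k8s-master" ;;  # node1 is drained last, from the master
    worker:k8s-node1) echo "postflight - -" ;;
    *) echo "no next step for phase=$phase node=$node" >&2; return 1 ;;
  esac
}

# Print the pod-spec fragment that pins a Job to node $1 (no output for "-").
scheduling_block() {
  local pin="$1"
  [ "$pin" = "-" ] && return 0
  printf '      nodeSelector:\n        kubernetes.io/hostname: %s\n' "$pin"
  if [ "$pin" = "k8s-master" ]; then
    # Standard kubeadm control-plane taint; tolerate it so the pod can schedule there.
    printf '      tolerations:\n'
    printf '        - key: node-role.kubernetes.io/control-plane\n'
    printf '          operator: Exists\n'
    printf '          effect: NoSchedule\n'
  fi
}
```

Keeping the order in one lookup makes the invariant easy to audit: in every row the third field differs from the second.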
Each Job runs scripts/upgrade-step.sh (case-on-$PHASE) and ends by
envsubst-ing job-template.yaml into the next Job. Deterministic names
(k8s-upgrade-<phase>-<target_version>[-<node>]) make `kubectl apply`
idempotent — a failed Job can be re-created without duplicating
downstream.
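As a rough illustration of why deterministic names make re-runs safe, the sketch below builds a Job name from the phase, target version, and optional node. `job_name` is a hypothetical helper; the dot-to-dash substitution mirrors the `${TARGET//./-}` expansion the detection CronJob uses for the preflight Job, and applying it to the other phases is an assumption.

```shell
#!/usr/bin/env bash
# Hypothetical helper: k8s-upgrade-<phase>-<target_version>[-<node>]
job_name() {
  local phase="$1" version="$2" node="${3:-}"
  # Same inputs always yield the same name, so `kubectl apply` on a retry
  # targets the existing Job object instead of creating a second one.
  printf 'k8s-upgrade-%s-%s%s\n' "$phase" "${version//./-}" "${node:+-$node}"
}

job_name preflight 1.34.7            # k8s-upgrade-preflight-1-34-7
job_name worker 1.34.7 k8s-node4     # k8s-upgrade-worker-1-34-7-k8s-node4
```

Because the name depends only on its inputs, re-running a failed phase for the same target resolves to the same object rather than spawning a parallel chain.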
Also lands `predrain_unstick`: deletes pods on the target node whose PDB
has 0 disruptionsAllowed. Without this, drain loops indefinitely on
single-replica deployments (e.g. every Anubis instance — discovered the
hard way during 2026-05-11 manual recovery of k8s-node3).
Adds K8sUpgradeStalled alert (fires when an upgrade is in flight and its
started_timestamp is more than 90 min old).
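The stall condition can be sketched as a plain predicate. This only illustrates the rule's logic in shell; it is not the alert definition itself, which lives in the monitoring stack and is not shown here. The function and argument names are assumptions.

```shell
#!/usr/bin/env bash
# Hypothetical predicate mirroring the alert condition:
# an upgrade is in flight AND it started more than 90 minutes ago.
STALL_SECONDS=$(( 90 * 60 ))

is_stalled() {
  local in_flight="$1" started_ts="$2" now="${3:-$(date +%s)}"
  [ "$in_flight" = "1" ] && [ $(( now - started_ts )) -gt "$STALL_SECONDS" ]
}
```

The third argument exists only so the check can be exercised with a fixed clock; callers would normally omit it.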
Deprecates the agent prompt (renamed to *.deprecated.md with a header
pointer to the new code).
Apply order: k8s-version-upgrade first (consumes new SA + ConfigMaps),
then monitoring (loads the new alert). Both applied 2026-05-11.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
parent 8e13f1528e
commit 448bc0c0f6
7 changed files with 1063 additions and 394 deletions
88
stacks/k8s-version-upgrade/job-template.yaml
Normal file
@@ -0,0 +1,88 @@
# k8s-upgrade-chain Job template.
#
# Rendered by `envsubst` inside upgrade-step.sh (and the detection CronJob)
# before `kubectl apply`. All ${VAR} placeholders are envsubst-side; this file
# is NOT processed by Terraform.
#
# Required environment for envsubst:
#   JOB_NAME              unique-per-(phase, target_version[, target_node])
#   PHASE_NEXT            phase the Job runs (preflight|master|worker|postflight)
#   TARGET_NODE_NEXT      node the Job operates on (empty for preflight/postflight)
#   TARGET_VERSION        X.Y.Z
#   TARGET_VERSION_LABEL  X-Y-Z (label-safe)
#   KIND                  patch | minor
#   IMAGE                 container image to run upgrade-step.sh
#   SCHEDULING_BLOCK      YAML fragment with nodeSelector/tolerations (may be empty)
#
# Idempotency: name is deterministic per (phase, target_version[, target_node])
# so `kubectl apply` reconciles to a single Job per run.
apiVersion: batch/v1
kind: Job
metadata:
  name: ${JOB_NAME}
  namespace: k8s-upgrade
  labels:
    app: k8s-upgrade-chain
    phase: ${PHASE_NEXT}
    target-version: "${TARGET_VERSION_LABEL}"
spec:
  ttlSecondsAfterFinished: 604800  # 7 days for postmortem review
  backoffLimit: 1
  template:
    metadata:
      labels:
        app: k8s-upgrade-chain
        phase: ${PHASE_NEXT}
    spec:
      serviceAccountName: k8s-upgrade-job
      restartPolicy: Never
${SCHEDULING_BLOCK}
      imagePullSecrets:
        - name: registry-credentials
      containers:
        - name: upgrade-step
          image: ${IMAGE}
          env:
            - name: PHASE
              value: "${PHASE_NEXT}"
            - name: TARGET_NODE
              value: "${TARGET_NODE_NEXT}"
            - name: TARGET_VERSION
              value: "${TARGET_VERSION}"
            - name: KIND
              value: "${KIND}"
            - name: IMAGE
              value: "${IMAGE}"
            - name: HOME
              value: "/tmp"
          command: ["/bin/bash", "/scripts/upgrade-step.sh"]
          volumeMounts:
            - name: creds
              mountPath: /secrets/k8s-upgrade
              readOnly: true
            - name: scripts
              mountPath: /scripts
              readOnly: true
            - name: template
              mountPath: /template
              readOnly: true
          resources:
            requests:
              cpu: "100m"
              memory: "256Mi"
            limits:
              memory: "512Mi"
      volumes:
        - name: creds
          secret:
            secretName: k8s-upgrade-creds
            # 0444 so the non-root container can read; upgrade-step.sh copies
            # the SSH key to /tmp/ssh_key with mode 0400 for openssh.
            defaultMode: 0444
        - name: scripts
          configMap:
            name: k8s-upgrade-scripts
            defaultMode: 0755
        - name: template
          configMap:
            name: k8s-upgrade-job-template
|
|
@ -1,44 +1,48 @@
|
|||
# k8s-version-upgrade — Automated K8s component (kubeadm/kubelet/kubectl) upgrade
|
||||
#
|
||||
# Detects new patch/minor versions via a weekly CronJob, then dispatches the
|
||||
# `k8s-version-upgrade` agent (infra/.claude/agents/k8s-version-upgrade.md)
|
||||
# through claude-agent-service for the actual rolling upgrade.
|
||||
# Architecture: detection CronJob → chain of small Jobs, one per phase. Each
|
||||
# Job's pod runs on a node that is NOT its drain target — eliminates the
|
||||
# self-preemption bug that killed the agent-based v1 (2026-05-11 incident).
|
||||
#
|
||||
# Chain (Job 0 → Job 6):
|
||||
# preflight (pinned: k8s-node1)
|
||||
# master (pinned: k8s-node1; drains k8s-master)
|
||||
# worker (pinned: k8s-node1; drains k8s-node4 → 3 → 2)
|
||||
# worker (pinned: k8s-master + control-plane toleration; drains k8s-node1 last)
|
||||
# postflight (no pinning)
|
||||
#
|
||||
# Each phase Job's container runs scripts/upgrade-step.sh which:
|
||||
# - dispatches on $PHASE
|
||||
# - spawns the next Job via envsubst on job-template.yaml
|
||||
# - uses deterministic naming (k8s-upgrade-${phase}-${target_version}[-${node}])
|
||||
# so re-running on failure reconciles to a single Job per run.
|
||||
#
|
||||
# Reuse points:
|
||||
# - claude-agent-service.claude-agent.svc:8080 — agent job runner
|
||||
# - Vault secret/k8s-upgrade/* — operator populates ssh_key + slack_webhook
|
||||
# - Prometheus + Pushgateway + Upgrade Gates alert group (in monitoring stack)
|
||||
# - update_k8s.sh — library script the agent shells into nodes with
|
||||
#
|
||||
# Notes:
|
||||
# - Schedule is Sun 12:00 UTC — well outside the kured Mon-Fri 02:00-06:00
|
||||
# London window so OS reboots and K8s version rollouts can't overlap.
|
||||
# - Patch detection uses `apt-cache madison kubeadm` on master via SSH.
|
||||
# Minor detection probes the next-minor apt repo URL with HEAD.
|
||||
# - claude-agent-service image (kubectl + ssh + jq + curl + envsubst)
|
||||
# - Vault secret/k8s-upgrade/* (ssh_key, slack_webhook)
|
||||
# - Prometheus + Pushgateway + Upgrade Gates alerts
|
||||
# - default/backup-etcd CronJob (snapshot trigger)
|
||||
# - infra/scripts/update_k8s.sh (per-node upgrade body)
|
||||
|
||||
variable "schedule" {
|
||||
type = string
|
||||
default = "0 12 * * 0" # Sunday 12:00 UTC
|
||||
default = "0 12 * * 0" # Sunday 12:00 UTC — outside kured window
|
||||
}
|
||||
|
||||
# Toggle to suspend the detection CronJob without dropping the stack.
|
||||
variable "enabled" {
|
||||
type = bool
|
||||
default = true
|
||||
}
|
||||
|
||||
# Mirrors `local.image_tag` in stacks/claude-agent-service/main.tf — keep in
|
||||
# sync when the claude-agent-service image is rebuilt. Reused here because the
|
||||
# detection CronJob only needs kubectl, ssh-client, curl, jq — all of which
|
||||
# the claude-agent-service image already ships.
|
||||
variable "claude_agent_service_image_tag" {
|
||||
# Mirrors `local.image_tag` in stacks/claude-agent-service/main.tf — bump
|
||||
# in lockstep with claude-agent-service rebuilds. The image ships kubectl,
|
||||
# ssh-client, curl, jq, envsubst — everything the upgrade Jobs need.
|
||||
variable "image_tag" {
|
||||
type = string
|
||||
default = "2fd7670d"
|
||||
}
|
||||
|
||||
# If true, the CronJob runs the detection sequence but does NOT POST to
|
||||
# claude-agent-service. Used for Test 1 to confirm detection works without
|
||||
# firing a real upgrade.
|
||||
# When true, detection runs but does NOT spawn the preflight Job.
|
||||
variable "detection_dry_run" {
|
||||
type = bool
|
||||
default = false
|
||||
|
|
@ -46,9 +50,9 @@ variable "detection_dry_run" {
|
|||
|
||||
locals {
|
||||
namespace = "k8s-upgrade"
|
||||
ca_image = "forgejo.viktorbarzin.me/viktor/claude-agent-service:${var.claude_agent_service_image_tag}"
|
||||
image = "forgejo.viktorbarzin.me/viktor/claude-agent-service:${var.image_tag}"
|
||||
labels = {
|
||||
app = "k8s-version-check"
|
||||
app = "k8s-version-upgrade"
|
||||
}
|
||||
}
|
||||
|
||||
|
|
@ -62,21 +66,19 @@ resource "kubernetes_namespace" "k8s_upgrade" {
|
|||
}
|
||||
}
|
||||
lifecycle {
|
||||
# KYVERNO_LIFECYCLE_V1: goldilocks-vpa-auto-mode ClusterPolicy stamps this label on every namespace
|
||||
# KYVERNO_LIFECYCLE_V1: goldilocks-vpa-auto-mode ClusterPolicy stamps this label
|
||||
ignore_changes = [metadata[0].labels["goldilocks.fairwinds.com/vpa-update-mode"]]
|
||||
}
|
||||
}
|
||||
|
||||
# --- ExternalSecret: ssh_key + slack_webhook + agent-service bearer ---
|
||||
# --- ExternalSecret: SSH key + Slack webhook ---
|
||||
#
|
||||
# Operator populates Vault `secret/k8s-upgrade/` with:
|
||||
# - ssh_key (PEM-encoded ed25519 private key)
|
||||
# - ssh_key_pub (the matching public key — distributed to nodes' authorized_keys)
|
||||
# - slack_webhook (Slack incoming-webhook URL, separate channel from kured for clean alerting)
|
||||
# - ssh_key (ed25519 PRIVATE key, used to SSH wizard@<node> from Jobs)
|
||||
# - ssh_key_pub (matching public key, deployed to nodes' authorized_keys)
|
||||
# - slack_webhook (incoming-webhook URL)
|
||||
#
|
||||
# The claude-agent-service bearer token comes from secret/claude-agent-service
|
||||
# (reused — no parallel token needed).
|
||||
|
||||
# No claude-agent bearer needed — the chain no longer POSTs to that service.
|
||||
resource "kubernetes_manifest" "external_secret" {
|
||||
manifest = {
|
||||
apiVersion = "external-secrets.io/v1beta1"
|
||||
|
|
@ -109,191 +111,157 @@ resource "kubernetes_manifest" "external_secret" {
|
|||
property = "slack_webhook"
|
||||
}
|
||||
},
|
||||
{
|
||||
secretKey = "api_bearer_token"
|
||||
remoteRef = {
|
||||
key = "claude-agent-service"
|
||||
property = "api_bearer_token"
|
||||
}
|
||||
},
|
||||
]
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
# --- ServiceAccount + RBAC for the detection CronJob ---
|
||||
|
||||
resource "kubernetes_service_account" "k8s_version_check" {
|
||||
metadata {
|
||||
name = "k8s-version-check"
|
||||
namespace = kubernetes_namespace.k8s_upgrade.metadata[0].name
|
||||
}
|
||||
}
|
||||
|
||||
# Cluster-wide read on nodes (for kubeletVersion comparison)
|
||||
resource "kubernetes_cluster_role" "k8s_version_check" {
|
||||
metadata {
|
||||
name = "k8s-version-check"
|
||||
}
|
||||
rule {
|
||||
api_groups = [""]
|
||||
resources = ["nodes"]
|
||||
verbs = ["get", "list"]
|
||||
}
|
||||
}
|
||||
|
||||
resource "kubernetes_cluster_role_binding" "k8s_version_check" {
|
||||
metadata {
|
||||
name = "k8s-version-check"
|
||||
}
|
||||
role_ref {
|
||||
api_group = "rbac.authorization.k8s.io"
|
||||
kind = "ClusterRole"
|
||||
name = kubernetes_cluster_role.k8s_version_check.metadata[0].name
|
||||
}
|
||||
subject {
|
||||
kind = "ServiceAccount"
|
||||
name = kubernetes_service_account.k8s_version_check.metadata[0].name
|
||||
namespace = kubernetes_namespace.k8s_upgrade.metadata[0].name
|
||||
}
|
||||
}
|
||||
|
||||
# Namespace-scoped: detection CronJob reads its own creds Secret.
|
||||
resource "kubernetes_role" "k8s_version_check_secrets" {
|
||||
metadata {
|
||||
name = "k8s-version-check-secrets"
|
||||
namespace = kubernetes_namespace.k8s_upgrade.metadata[0].name
|
||||
}
|
||||
rule {
|
||||
api_groups = [""]
|
||||
resources = ["secrets"]
|
||||
resource_names = ["k8s-upgrade-creds"]
|
||||
verbs = ["get"]
|
||||
}
|
||||
}
|
||||
|
||||
resource "kubernetes_role_binding" "k8s_version_check_secrets" {
|
||||
metadata {
|
||||
name = "k8s-version-check-secrets"
|
||||
namespace = kubernetes_namespace.k8s_upgrade.metadata[0].name
|
||||
}
|
||||
role_ref {
|
||||
api_group = "rbac.authorization.k8s.io"
|
||||
kind = "Role"
|
||||
name = kubernetes_role.k8s_version_check_secrets.metadata[0].name
|
||||
}
|
||||
subject {
|
||||
kind = "ServiceAccount"
|
||||
name = kubernetes_service_account.k8s_version_check.metadata[0].name
|
||||
namespace = kubernetes_namespace.k8s_upgrade.metadata[0].name
|
||||
}
|
||||
}
|
||||
|
||||
# --- Cross-namespace RBAC: claude-agent SA reads k8s-upgrade-creds + annotates ns ---
|
||||
# --- Unified ServiceAccount + RBAC ---
|
||||
#
|
||||
# The k8s-version-upgrade agent runs inside the claude-agent-service pod (SA
|
||||
# `claude-agent` in `claude-agent` ns). It needs:
|
||||
# - GET on this namespace's k8s-upgrade-creds Secret (to fetch ssh_key + slack)
|
||||
# - PATCH on the k8s-upgrade Namespace annotations (in-flight marker)
|
||||
# One SA serves BOTH the detection CronJob and every phase Job:
|
||||
# - detection CronJob: needs nodes:get/list + secrets:get + jobs:create
|
||||
# (to spawn Job 0 = preflight)
|
||||
# - phase Jobs: same + pods/eviction:create + pods:delete + namespaces:patch
|
||||
#
|
||||
# Cluster-scoped because the chain spans the whole cluster (drain works on
|
||||
# any node, and the preflight Job creates a Job in `default` ns from
|
||||
# `cronjob/backup-etcd`).
|
||||
|
||||
resource "kubernetes_role" "claude_agent_reads_creds" {
|
||||
resource "kubernetes_service_account" "k8s_upgrade_job" {
|
||||
metadata {
|
||||
name = "claude-agent-reads-creds"
|
||||
name = "k8s-upgrade-job"
|
||||
namespace = kubernetes_namespace.k8s_upgrade.metadata[0].name
|
||||
}
|
||||
rule {
|
||||
api_groups = [""]
|
||||
resources = ["secrets"]
|
||||
resource_names = ["k8s-upgrade-creds"]
|
||||
verbs = ["get"]
|
||||
}
|
||||
}
|
||||
|
||||
resource "kubernetes_role_binding" "claude_agent_reads_creds" {
|
||||
resource "kubernetes_cluster_role" "k8s_upgrade_job" {
|
||||
metadata {
|
||||
name = "claude-agent-reads-creds"
|
||||
namespace = kubernetes_namespace.k8s_upgrade.metadata[0].name
|
||||
name = "k8s-upgrade-job"
|
||||
}
|
||||
role_ref {
|
||||
api_group = "rbac.authorization.k8s.io"
|
||||
kind = "Role"
|
||||
name = kubernetes_role.claude_agent_reads_creds.metadata[0].name
|
||||
}
|
||||
subject {
|
||||
kind = "ServiceAccount"
|
||||
name = "claude-agent"
|
||||
namespace = "claude-agent"
|
||||
}
|
||||
}
|
||||
|
||||
# The base claude-agent ClusterRole grants get/list/watch on most resources
|
||||
# but not the mutating verbs the upgrade agent needs. Rather than fork the
|
||||
# upstream stack, we add a sibling ClusterRole here scoped to exactly the
|
||||
# verbs+resources required:
|
||||
# - patch on namespace k8s-upgrade (in-flight annotation)
|
||||
# - create on batch/jobs (trigger etcd snapshot Job from cronjob/backup-etcd)
|
||||
# - patch on nodes (cordon/uncordon — drain needs this)
|
||||
# - create on pods/eviction (drain evicts pods)
|
||||
resource "kubernetes_cluster_role" "claude_agent_upgrade_ops" {
|
||||
metadata {
|
||||
name = "claude-agent-upgrade-ops"
|
||||
}
|
||||
# Annotate the k8s-upgrade namespace
|
||||
rule {
|
||||
api_groups = [""]
|
||||
resources = ["namespaces"]
|
||||
resource_names = ["k8s-upgrade"]
|
||||
verbs = ["patch", "update"]
|
||||
}
|
||||
# Trigger etcd snapshot Jobs (from cronjob/backup-etcd in default ns).
|
||||
# Cluster-scoped because we may also create test Jobs in k8s-upgrade ns.
|
||||
rule {
|
||||
api_groups = ["batch"]
|
||||
resources = ["jobs"]
|
||||
verbs = ["create", "delete"]
|
||||
}
|
||||
# Cordon / uncordon nodes
|
||||
# Read nodes (version comparison + readiness check)
|
||||
rule {
|
||||
api_groups = [""]
|
||||
resources = ["nodes"]
|
||||
verbs = ["patch", "update"]
|
||||
verbs = ["get", "list", "patch", "update"]
|
||||
}
|
||||
# Drain (evict pods)
|
||||
# Drain — evict pods
|
||||
rule {
|
||||
api_groups = [""]
|
||||
resources = ["pods/eviction"]
|
||||
verbs = ["create"]
|
||||
}
|
||||
# Delete pods stuck during drain (sometimes evict isn't enough)
|
||||
# Drain fallback — direct delete (predrain_unstick bypasses PDBs)
|
||||
rule {
|
||||
api_groups = [""]
|
||||
resources = ["pods"]
|
||||
verbs = ["delete"]
|
||||
verbs = ["get", "list", "delete"]
|
||||
}
|
||||
# Read PDBs to find drain-blocking pods
|
||||
rule {
|
||||
api_groups = ["policy"]
|
||||
resources = ["poddisruptionbudgets"]
|
||||
verbs = ["get", "list"]
|
||||
}
|
||||
# Chain dispatch — create the next Job; reconcile via apply on retry.
|
||||
# In `default` ns to also create the etcd-snapshot Job from cronjob/backup-etcd.
|
||||
rule {
|
||||
api_groups = ["batch"]
|
||||
resources = ["jobs"]
|
||||
verbs = ["create", "get", "list", "delete", "patch", "watch"]
|
||||
}
|
||||
# Pull CronJob spec for `kubectl create job --from=cronjob/backup-etcd`
|
||||
rule {
|
||||
api_groups = ["batch"]
|
||||
resources = ["cronjobs"]
|
||||
verbs = ["get", "list"]
|
||||
}
|
||||
# Annotate the k8s-upgrade namespace (in-flight marker + snapshot path)
|
||||
rule {
|
||||
api_groups = [""]
|
||||
resources = ["namespaces"]
|
||||
resource_names = [local.namespace]
|
||||
verbs = ["get", "patch", "update"]
|
||||
}
|
||||
}
|
||||
|
||||
resource "kubernetes_cluster_role_binding" "claude_agent_upgrade_ops" {
|
||||
resource "kubernetes_cluster_role_binding" "k8s_upgrade_job" {
|
||||
metadata {
|
||||
name = "claude-agent-upgrade-ops"
|
||||
name = "k8s-upgrade-job"
|
||||
}
|
||||
role_ref {
|
||||
api_group = "rbac.authorization.k8s.io"
|
||||
kind = "ClusterRole"
|
||||
name = kubernetes_cluster_role.claude_agent_upgrade_ops.metadata[0].name
|
||||
name = kubernetes_cluster_role.k8s_upgrade_job.metadata[0].name
|
||||
}
|
||||
subject {
|
||||
kind = "ServiceAccount"
|
||||
name = "claude-agent"
|
||||
namespace = "claude-agent"
|
||||
name = kubernetes_service_account.k8s_upgrade_job.metadata[0].name
|
||||
namespace = kubernetes_namespace.k8s_upgrade.metadata[0].name
|
||||
}
|
||||
}
|
||||
|
||||
# Namespaced: read the credentials Secret in k8s-upgrade (SSH key + Slack URL)
|
||||
resource "kubernetes_role" "k8s_upgrade_job_ns" {
|
||||
metadata {
|
||||
name = "k8s-upgrade-job-ns"
|
||||
namespace = kubernetes_namespace.k8s_upgrade.metadata[0].name
|
||||
}
|
||||
rule {
|
||||
api_groups = [""]
|
||||
resources = ["secrets"]
|
||||
resource_names = ["k8s-upgrade-creds"]
|
||||
verbs = ["get"]
|
||||
}
|
||||
}
|
||||
|
||||
resource "kubernetes_role_binding" "k8s_upgrade_job_ns" {
|
||||
metadata {
|
||||
name = "k8s-upgrade-job-ns"
|
||||
namespace = kubernetes_namespace.k8s_upgrade.metadata[0].name
|
||||
}
|
||||
role_ref {
|
||||
api_group = "rbac.authorization.k8s.io"
|
||||
kind = "Role"
|
||||
name = kubernetes_role.k8s_upgrade_job_ns.metadata[0].name
|
||||
}
|
||||
subject {
|
||||
kind = "ServiceAccount"
|
||||
name = kubernetes_service_account.k8s_upgrade_job.metadata[0].name
|
||||
namespace = kubernetes_namespace.k8s_upgrade.metadata[0].name
|
||||
}
|
||||
}
|
||||
|
||||
# --- ConfigMaps: scripts + Job template ---
|
||||
|
||||
resource "kubernetes_config_map" "k8s_upgrade_scripts" {
|
||||
metadata {
|
||||
name = "k8s-upgrade-scripts"
|
||||
namespace = kubernetes_namespace.k8s_upgrade.metadata[0].name
|
||||
labels = local.labels
|
||||
}
|
||||
data = {
|
||||
"upgrade-step.sh" = file("${path.module}/scripts/upgrade-step.sh")
|
||||
"update_k8s.sh" = file("${path.module}/../../scripts/update_k8s.sh")
|
||||
}
|
||||
}
|
||||
|
||||
resource "kubernetes_config_map" "k8s_upgrade_job_template" {
|
||||
metadata {
|
||||
name = "k8s-upgrade-job-template"
|
||||
namespace = kubernetes_namespace.k8s_upgrade.metadata[0].name
|
||||
labels = local.labels
|
||||
}
|
||||
data = {
|
||||
"job-template.yaml" = file("${path.module}/job-template.yaml")
|
||||
}
|
||||
}
|
||||
|
||||
# --- Detection CronJob ---
|
||||
#
|
||||
# Weekly: compares running cluster version against latest available patch
|
||||
# (apt-cache madison kubeadm on master) and latest available minor (HEAD on
|
||||
# next-minor pkgs.k8s.io repo). When a target is detected, POSTs to
|
||||
# claude-agent-service to kick the upgrade agent.
|
||||
# Probes for available patch/minor targets weekly. When one is found, renders
|
||||
# Job 0 (preflight) from the same job-template the chain uses. The CronJob no
|
||||
# longer POSTs to claude-agent-service; the whole pipeline now runs inside the
|
||||
# cluster via Job-chaining.
|
||||
|
||||
resource "kubernetes_cron_job_v1" "k8s_version_check" {
|
||||
metadata {
|
||||
|
|
@ -320,33 +288,36 @@ resource "kubernetes_cron_job_v1" "k8s_version_check" {
|
|||
labels = local.labels
|
||||
}
|
||||
spec {
|
||||
service_account_name = kubernetes_service_account.k8s_version_check.metadata[0].name
|
||||
service_account_name = kubernetes_service_account.k8s_upgrade_job.metadata[0].name
|
||||
restart_policy = "Never"
|
||||
image_pull_secrets {
|
||||
name = "registry-credentials"
|
||||
}
|
||||
volume {
|
||||
name = "creds"
|
||||
secret {
|
||||
secret_name = "k8s-upgrade-creds"
|
||||
# 0444 — non-root container needs read; SSH key gets re-installed
|
||||
# with mode 0400 in the inline command before any ssh call.
|
||||
default_mode = "0444"
|
||||
}
|
||||
}
|
||||
volume {
|
||||
name = "template"
|
||||
config_map {
|
||||
name = kubernetes_config_map.k8s_upgrade_job_template.metadata[0].name
|
||||
}
|
||||
}
|
||||
container {
|
||||
name = "version-check"
|
||||
image = local.ca_image
|
||||
image = local.image
|
||||
command = ["/bin/bash", "-c", <<-EOT
|
||||
set -euo pipefail
|
||||
echo "==> k8s-version-check ($(date -u +%FT%TZ))"
|
||||
|
||||
# 1. Load SSH key from K8s Secret
|
||||
mkdir -p /tmp
|
||||
/usr/local/bin/kubectl get secret k8s-upgrade-creds \
|
||||
-o jsonpath='{.data.ssh_key}' | base64 -d > /tmp/k8s-upgrade-ssh-key
|
||||
chmod 400 /tmp/k8s-upgrade-ssh-key
|
||||
|
||||
SLACK=$(/usr/local/bin/kubectl get secret k8s-upgrade-creds \
|
||||
-o jsonpath='{.data.slack_webhook}' | base64 -d)
|
||||
|
||||
AGENT_TOKEN=$(/usr/local/bin/kubectl get secret k8s-upgrade-creds \
|
||||
-o jsonpath='{.data.api_bearer_token}' | base64 -d)
|
||||
|
||||
SSH="ssh -i /tmp/k8s-upgrade-ssh-key \
|
||||
-o StrictHostKeyChecking=accept-new \
|
||||
-o UserKnownHostsFile=/tmp/known_hosts"
|
||||
SLACK=$(cat /secrets/k8s-upgrade/slack_webhook)
|
||||
install -m 0400 /secrets/k8s-upgrade/ssh_key /tmp/ssh_key
|
||||
SSH="ssh -i /tmp/ssh_key -o StrictHostKeyChecking=accept-new -o UserKnownHostsFile=/tmp/known_hosts -o ConnectTimeout=10"
|
||||
|
||||
slack() {
|
||||
curl -sS -X POST -H 'Content-Type: application/json' \
|
||||
|
|
@ -354,17 +325,13 @@ resource "kubernetes_cron_job_v1" "k8s_version_check" {
|
|||
"$SLACK" || true
|
||||
}
|
||||
|
||||
# 2. Detect running version
|
||||
# 1. Detect running version
|
||||
RUNNING=$(/usr/local/bin/kubectl get nodes \
|
||||
-o jsonpath='{.items[0].status.nodeInfo.kubeletVersion}' \
|
||||
| tr -d v)
|
||||
-o jsonpath='{.items[0].status.nodeInfo.kubeletVersion}' | tr -d v)
|
||||
RUNNING_MINOR=$(echo "$RUNNING" | awk -F. '{print $1"."$2}')
|
||||
echo "Running version: v$RUNNING (minor $RUNNING_MINOR)"
|
||||
|
||||
# 3. Detect highest available patch within the running minor track.
|
||||
# Refresh the local apt cache first — without this, a newly-published
|
||||
# patch won't show up via `apt-cache madison` until something else
|
||||
# triggers an `apt-get update`.
|
||||
# 2. Latest patch within current minor (refresh master's apt cache)
|
||||
LATEST_PATCH=$($SSH wizard@k8s-master \
|
||||
"sudo apt-get update -qq -o Dir::Etc::sourcelist='sources.list.d/kubernetes.list' -o Dir::Etc::sourceparts='-' -o APT::Get::List-Cleanup='0' >/dev/null 2>&1 ; \
|
||||
apt-cache madison kubeadm 2>/dev/null \
|
||||
|
|
@ -372,9 +339,9 @@ resource "kubernetes_cron_job_v1" "k8s_version_check" {
|
|||
| sed 's/-.*//' \
|
||||
| grep '^$RUNNING_MINOR\\.' \
|
||||
| sort -V | tail -1" || echo "")
|
||||
echo "Latest patch (apt): v$LATEST_PATCH"
|
||||
echo "Latest patch: v$LATEST_PATCH"
|
||||
|
||||
# 4. Detect next available minor by probing the apt repo URL.
|
||||
# 3. Next-minor probe
|
||||
NEXT_MINOR_NUM=$(( $(echo "$RUNNING_MINOR" | cut -d. -f2) + 1 ))
|
||||
NEXT_MINOR="1.$NEXT_MINOR_NUM"
|
||||
NEXT_MINOR_AVAILABLE="no"
|
||||
|
|
@ -385,14 +352,13 @@ resource "kubernetes_cron_job_v1" "k8s_version_check" {
|
|||
fi
|
||||
echo "Next minor v$NEXT_MINOR available: $NEXT_MINOR_AVAILABLE"
|
||||
|
||||
# 5. Decide what to do
|
||||
# 4. Choose target
|
||||
TARGET=""
|
||||
KIND=""
|
||||
if [ -n "$LATEST_PATCH" ] && [ "$LATEST_PATCH" != "$RUNNING" ]; then
|
||||
TARGET="$LATEST_PATCH"
|
||||
KIND="patch"
|
||||
elif [ "$NEXT_MINOR_AVAILABLE" = "yes" ]; then
|
||||
# Probe the minor track to get its latest patch.
|
||||
NEXT_MINOR_PATCH=$($SSH wizard@k8s-master \
|
||||
"curl -sf 'https://pkgs.k8s.io/core:/stable:/v$NEXT_MINOR/deb/Packages' \
|
||||
| grep -oE 'Version: [0-9.-]+' \
|
||||
|
|
@ -404,7 +370,7 @@ resource "kubernetes_cron_job_v1" "k8s_version_check" {
|
|||
fi
|
||||
fi
|
||||
|
||||
# 6. Push the discovery metric to Pushgateway
|
||||
# 5. Pushgateway discovery metric
|
||||
PG='http://prometheus-prometheus-pushgateway.monitoring:9091/metrics/job/k8s-version-check'
|
||||
{
|
||||
echo "# TYPE k8s_upgrade_available gauge"
|
||||
|
|
@ -417,64 +383,61 @@ resource "kubernetes_cron_job_v1" "k8s_version_check" {
|
|||
echo "k8s_version_check_last_run_timestamp $(date +%s)"
|
||||
} | curl -sS --data-binary @- "$PG" || echo "warn: pushgateway push failed"
|
||||
|
||||
# 7. Decide whether to dispatch
|
||||
# 6. Decide whether to spawn Job 0
|
||||
if [ -z "$TARGET" ]; then
|
||||
echo "No upgrade needed (running=$RUNNING, latest_patch=$LATEST_PATCH, next_minor_available=$NEXT_MINOR_AVAILABLE)"
|
||||
echo "No upgrade needed"
|
||||
exit 0
|
||||
fi
|
||||
|
||||
slack "K8s upgrade available: v$RUNNING → v$TARGET ($KIND)"
|
||||
|
||||
# DRY_RUN_OVERRIDE wins over DRY_RUN — but a Job copied from
|
||||
# this CronJob can't add new env vars (spec is immutable). The
|
||||
# operator path for "trigger detection without dispatch" is
|
||||
# toggling the CronJob's `var.detection_dry_run` then applying.
|
||||
# Documented in the runbook.
|
||||
EFFECTIVE_DRY_RUN="$${DRY_RUN_OVERRIDE:-$DRY_RUN}"
|
||||
if [ "$EFFECTIVE_DRY_RUN" = "true" ]; then
|
||||
echo "dry_run=true — not POSTing to claude-agent-service"
|
||||
slack "DRY_RUN — skipping agent dispatch"
|
||||
if [ "$DRY_RUN" = "true" ]; then
|
||||
slack "DRY_RUN — not spawning preflight Job"
|
||||
exit 0
|
||||
fi
|
||||
|
||||
# 8. POST to claude-agent-service
|
||||
PAYLOAD=$(jq -nc \
|
||||
--arg target "$TARGET" \
|
||||
--arg kind "$KIND" \
|
||||
'{
|
||||
prompt: ("Run the k8s-version-upgrade agent. Inputs: " + ({target_version: $target, kind: $kind, dry_run: false, stages: "all"} | tostring)),
|
||||
agent: ".claude/agents/k8s-version-upgrade",
|
||||
max_budget_usd: 30
|
||||
}')
|
||||
# 7. Spawn Job 0 (preflight) via envsubst on the job-template
|
||||
# Idempotency: deterministic name reconciles via `apply`.
|
||||
JOB_NAME="k8s-upgrade-preflight-$${TARGET//./-}"
|
||||
|
||||
echo "Dispatching agent: $PAYLOAD"
|
||||
RESP=$(curl -sS -w '\n%%{http_code}' -X POST \
|
||||
-H "Authorization: Bearer $AGENT_TOKEN" \
|
||||
-H 'Content-Type: application/json' \
|
||||
-d "$PAYLOAD" \
|
||||
http://claude-agent-service.claude-agent.svc.cluster.local:8080/execute)
|
||||
CODE=$(printf '%s' "$RESP" | tail -n1)
|
||||
BODY=$(printf '%s' "$RESP" | sed '$d')
|
||||
|
||||
if [ "$CODE" = "200" ] || [ "$CODE" = "202" ]; then
|
||||
JOB_ID=$(printf '%s' "$BODY" | jq -r '.job_id // .id // "unknown"')
|
||||
slack "Agent dispatched: job=$JOB_ID (target=v$TARGET kind=$KIND)"
|
||||
echo "OK — job=$JOB_ID"
|
||||
else
|
||||
slack "ERROR dispatching agent: HTTP $CODE — $BODY"
|
||||
echo "dispatch failed: HTTP $CODE — $BODY" >&2
|
||||
exit 1
|
||||
if /usr/local/bin/kubectl -n k8s-upgrade get job "$JOB_NAME" >/dev/null 2>&1; then
|
||||
slack "Preflight Job $JOB_NAME already exists (rerunning detection mid-flight?)"
|
||||
exit 0
|
||||
fi
|
||||
|
||||
export JOB_NAME PHASE_NEXT=preflight TARGET_NODE_NEXT="" \
|
||||
TARGET_VERSION="$TARGET" TARGET_VERSION_LABEL="$${TARGET//./-}" \
|
||||
KIND="$KIND" IMAGE="$${IMAGE}" \
|
||||
SCHEDULING_BLOCK=$' nodeSelector:\n kubernetes.io/hostname: k8s-node1'
|
||||
|
||||
envsubst < /template/job-template.yaml \
|
||||
| /usr/local/bin/kubectl apply -f -
|
||||
|
||||
slack "Spawned $JOB_NAME (target=v$TARGET kind=$KIND)"
|
||||
EOT
|
||||
]
|
||||
env {
|
||||
name = "DRY_RUN"
|
||||
value = tostring(var.detection_dry_run)
|
||||
}
|
||||
env {
|
||||
name = "IMAGE"
|
||||
value = local.image
|
||||
}
|
||||
env {
|
||||
name = "HOME"
|
||||
value = "/tmp"
|
||||
}
|
||||
volume_mount {
|
||||
name = "creds"
|
||||
mount_path = "/secrets/k8s-upgrade"
|
||||
read_only = true
|
||||
}
|
||||
volume_mount {
|
||||
name = "template"
|
||||
mount_path = "/template"
|
||||
read_only = true
|
||||
}
|
||||
resources {
|
||||
requests = {
|
||||
cpu = "50m"
|
||||
|
|
|
|||
438
stacks/k8s-version-upgrade/scripts/upgrade-step.sh
Normal file
|
|
@ -0,0 +1,438 @@
|
|||
#!/usr/bin/env bash
|
||||
#
|
||||
# Universal upgrade-step body. Each Job in the k8s-version-upgrade chain runs
|
||||
# this once, dispatching on $PHASE. On success it computes the next phase and
|
||||
# spawns the next Job. The chain is:
|
||||
#
|
||||
# preflight (run on k8s-node1)
|
||||
# ↓
|
||||
# master (drains k8s-master; run on k8s-node1)
|
||||
# ↓
|
||||
# worker k8s-node4 (run on k8s-node1)
|
||||
# ↓
|
||||
# worker k8s-node3 (run on k8s-node1)
|
||||
# ↓
|
||||
# worker k8s-node2 (run on k8s-node1)
|
||||
# ↓
|
||||
# worker k8s-node1 (drains k8s-node1; run on k8s-master with control-plane toleration)
|
||||
# ↓
|
||||
# postflight (no node pinning)
|
||||
#
|
||||
# k8s-node1 hosts every Job except the one that drains k8s-node1 itself.
|
||||
# k8s-node1 is therefore upgraded LAST.
|
||||
#
|
||||
# Required env vars (set on the Job pod by job-template.yaml):
|
||||
# PHASE preflight | master | worker | postflight
|
||||
# TARGET_NODE k8s-master | k8s-nodeN (empty for preflight/postflight)
|
||||
# TARGET_VERSION X.Y.Z
|
||||
# KIND patch | minor
|
||||
# IMAGE container image to use for next Job in the chain
|
||||
|
||||
set -euo pipefail
|
||||
|
||||
NS=k8s-upgrade
|
||||
SSH_KEY=/secrets/k8s-upgrade/ssh_key
|
||||
SLACK_FILE=/secrets/k8s-upgrade/slack_webhook
|
||||
PG='http://prometheus-prometheus-pushgateway.monitoring:9091/metrics/job/k8s-version-upgrade'
|
||||
PROM='http://prometheus-server.monitoring.svc.cluster.local:80'
|
||||
KUBECTL=kubectl
|
||||
JOB_TEMPLATE=/template/job-template.yaml
|
||||
UPDATE_K8S_SH=/scripts/update_k8s.sh
|
||||
|
||||
# SSH key must be 0400 — refresh from secret mount (defaultMode does this but
|
||||
# bind-mount semantics can preserve loose perms; chmod is idempotent).
|
||||
install -m 0400 "$SSH_KEY" /tmp/ssh_key
|
||||
SSH_KEY=/tmp/ssh_key
|
||||
|
||||
SSH_OPTS=(-i "$SSH_KEY"
|
||||
-o StrictHostKeyChecking=accept-new
|
||||
-o UserKnownHostsFile=/tmp/known_hosts
|
||||
-o ConnectTimeout=10)
|
||||
|
||||
SLACK_URL="$(cat "$SLACK_FILE")"
|
||||
|
||||
slack() {
|
||||
local msg="$1"
|
||||
curl -sS -X POST -H 'Content-Type: application/json' \
|
||||
--data "$(jq -nc --arg t "[k8s-upgrade-${PHASE}${TARGET_NODE:+:$TARGET_NODE}] $msg" \
|
||||
'{text: $t}')" \
|
||||
"$SLACK_URL" >/dev/null || echo "warn: slack post failed"
|
||||
}
|
||||
|
||||
push() {
|
||||
printf '# TYPE %s gauge\n%s %s\n' "$1" "$1" "$2" \
|
||||
| curl -sS --data-binary @- "$PG" || echo "warn: pushgateway push failed"
|
||||
}
|
||||
|
||||
halt_on_alert_query() {
|
||||
local extra_ignore="${1:-}"
|
||||
local regex='^(Watchdog|RebootRequired|KuredNodeWasNotDrained|InfoInhibitor'
|
||||
[ -n "$extra_ignore" ] && regex="$regex|$extra_ignore"
|
||||
regex="$regex)$"
|
||||
|
||||
curl -sf "$PROM/api/v1/alerts" \
|
||||
| jq -r '.data.alerts[] | select(.state == "firing") | .labels.alertname' \
|
||||
| { grep -vE "$regex" || true; } | sort -u  # grep exits 1 when nothing is left; don't trip pipefail
|
||||
}
|
||||
|
||||
wait_for_node_ready() {
|
||||
local node="$1" want_version="$2" deadline=$(( $(date +%s) + 900 )) # 15 min
|
||||
while [ "$(date +%s)" -lt "$deadline" ]; do
|
||||
local status kubelet
|
||||
status=$($KUBECTL get node "$node" -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}' 2>/dev/null || true)
|
||||
kubelet=$($KUBECTL get node "$node" -o jsonpath='{.status.nodeInfo.kubeletVersion}' 2>/dev/null | tr -d v || true)
|
||||
if [ "$status" = "True" ] && [ "$kubelet" = "$want_version" ]; then
|
||||
return 0
|
||||
fi
|
||||
sleep 15
|
||||
done
|
||||
return 1
|
||||
}
|
||||
|
||||
# Pre-drain: find pods on $node whose PDB has zero disruptionsAllowed and
|
||||
# delete them directly. Drain's eviction API respects PDBs and will loop
|
||||
# forever on single-replica deployments with `minAvailable: 1` — common
|
||||
# pattern on this cluster (e.g. Anubis instances default to replicas=1). A
|
||||
# direct delete bypasses eviction; the parent Deployment recreates the pod
|
||||
# elsewhere (the node is already cordoned by drain).
|
||||
predrain_unstick() {
|
||||
local node="$1"
|
||||
$KUBECTL get pdb -A -o json | jq -r '
|
||||
.items[]
|
||||
| select(.status.disruptionsAllowed == 0)
|
||||
| "\(.metadata.namespace) \((.spec.selector.matchLabels // {}) | to_entries | map("\(.key)=\(.value)") | join(","))"
|
||||
' | while read -r ns selector; do
|
||||
[ -z "$selector" ] && continue
|
||||
$KUBECTL -n "$ns" get pods --field-selector "spec.nodeName=$node,status.phase=Running" \
|
||||
-l "$selector" -o name 2>/dev/null \
|
||||
| while read -r pod; do
|
||||
echo "predrain_unstick: deleting PDB-blocked $ns/$pod (drain would loop on it)"
|
||||
$KUBECTL -n "$ns" delete "$pod" --wait=false || true
|
||||
done
|
||||
done
|
||||
}
|
||||
|
||||
# Drain wrapper: kick predrain_unstick before drain, then again every 60s in
|
||||
# the background while drain runs (in case new pods land mid-drain). Drain
|
||||
# exits when the node has no non-daemonset workload.
|
||||
drain_node() {
|
||||
local node="$1"
|
||||
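# Cordon first so pods deleted by predrain_unstick are not rescheduled onto this node.
$KUBECTL cordon "$node"
predrain_unstick "$node"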
|
||||
( while kill -0 $$ 2>/dev/null; do sleep 60; predrain_unstick "$node"; done ) &
|
||||
local watcher=$!
|
||||
trap "kill $watcher 2>/dev/null || true" EXIT
|
||||
$KUBECTL drain "$node" --ignore-daemonsets --delete-emptydir-data --force --grace-period=300
|
||||
kill $watcher 2>/dev/null || true
|
||||
trap - EXIT
|
||||
}
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Chain definition — what comes after the current phase
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
NEXT_PHASE=""
|
||||
NEXT_TARGET_NODE=""
|
||||
NEXT_RUN_ON=""
|
||||
|
||||
case "${PHASE}:${TARGET_NODE:-}" in
|
||||
preflight:)
|
||||
NEXT_PHASE=master
|
||||
NEXT_RUN_ON=k8s-node1 ;;
|
||||
master:)
|
||||
NEXT_PHASE=worker; NEXT_TARGET_NODE=k8s-node4
|
||||
NEXT_RUN_ON=k8s-node1 ;;
|
||||
worker:k8s-node4)
|
||||
NEXT_PHASE=worker; NEXT_TARGET_NODE=k8s-node3
|
||||
NEXT_RUN_ON=k8s-node1 ;;
|
||||
worker:k8s-node3)
|
||||
NEXT_PHASE=worker; NEXT_TARGET_NODE=k8s-node2
|
||||
NEXT_RUN_ON=k8s-node1 ;;
|
||||
worker:k8s-node2)
|
||||
NEXT_PHASE=worker; NEXT_TARGET_NODE=k8s-node1
|
||||
NEXT_RUN_ON=k8s-master ;; # control-plane toleration required
|
||||
worker:k8s-node1)
|
||||
NEXT_PHASE=postflight
|
||||
NEXT_RUN_ON="" ;; # no node pinning for postflight
|
||||
postflight:)
|
||||
NEXT_PHASE="" ;; # end of chain
|
||||
*)
|
||||
echo "ERROR: unknown phase/target combo: ${PHASE}/${TARGET_NODE:-}" >&2
|
||||
exit 2 ;;
|
||||
esac
|
||||
|
||||
spawn_next() {
|
||||
[ -z "$NEXT_PHASE" ] && { echo "End of chain."; return 0; }
|
||||
|
||||
local job_name="k8s-upgrade-${NEXT_PHASE}-${TARGET_VERSION//./-}"
|
||||
[ -n "${NEXT_TARGET_NODE:-}" ] && job_name="${job_name}-${NEXT_TARGET_NODE}"
|
||||
|
||||
if $KUBECTL -n "$NS" get job "$job_name" >/dev/null 2>&1; then
|
||||
echo "Next Job $job_name already exists; idempotent skip."
|
||||
return 0
|
||||
fi
|
||||
|
||||
local scheduling_block=""
|
||||
case "${NEXT_RUN_ON:-}" in
|
||||
k8s-master)
|
||||
scheduling_block=$' nodeSelector:\n kubernetes.io/hostname: k8s-master\n tolerations:\n - key: node-role.kubernetes.io/control-plane\n operator: Exists\n effect: NoSchedule' ;;
|
||||
"")
|
||||
scheduling_block="" ;;
|
||||
*)
|
||||
scheduling_block=$' nodeSelector:\n kubernetes.io/hostname: '"$NEXT_RUN_ON" ;;
|
||||
esac
|
||||
|
||||
export JOB_NAME="$job_name"
|
||||
export PHASE_NEXT="$NEXT_PHASE"
|
||||
export TARGET_NODE_NEXT="${NEXT_TARGET_NODE:-}"
|
||||
export TARGET_VERSION_LABEL="${TARGET_VERSION//./-}"
|
||||
export SCHEDULING_BLOCK="$scheduling_block"
|
||||
# TARGET_VERSION, KIND, IMAGE inherited from current env
|
||||
|
||||
echo "Spawning next Job: $job_name (phase=$NEXT_PHASE target=${NEXT_TARGET_NODE:-} run_on=${NEXT_RUN_ON:-anywhere})"
|
||||
envsubst <"$JOB_TEMPLATE" | $KUBECTL apply -f -
|
||||
}
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Phase bodies
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
phase_preflight() {
|
||||
slack "Starting preflight (target v$TARGET_VERSION, kind=$KIND)"
|
||||
|
||||
# 1. All nodes Ready + no pressure
|
||||
local bad_nodes
|
||||
bad_nodes=$($KUBECTL get nodes -o json | jq -r '
|
||||
.items[]
|
||||
| select(
|
||||
(.status.conditions[] | select(.type=="Ready").status) != "True"
|
||||
or (.status.conditions[] | select(.type=="MemoryPressure").status) == "True"
|
||||
or (.status.conditions[] | select(.type=="DiskPressure").status) == "True")
|
||||
| .metadata.name')
|
||||
if [ -n "$bad_nodes" ]; then
|
||||
slack "ABORT preflight — nodes unhealthy: $bad_nodes"
|
||||
exit 1
|
||||
fi
|
||||
|
||||
# 2. Halt-on-alert
|
||||
local alerts
|
||||
alerts=$(halt_on_alert_query)
|
||||
if [ -n "$alerts" ]; then
|
||||
slack "ABORT preflight — firing alerts:"$'\n'"$alerts"
|
||||
exit 1
|
||||
fi
|
||||
|
||||
# 3. 24h-quiet baseline
|
||||
local recent=0
|
||||
while IFS= read -r ts; do
|
||||
[ -z "$ts" ] && continue
|
||||
local diff=$(( $(date +%s) - $(date -d "$ts" +%s) ))
|
||||
if [ "$diff" -lt 86400 ]; then recent=1; break; fi
|
||||
done < <($KUBECTL get nodes -o jsonpath='{range .items[*]}{range .status.conditions[?(@.type=="Ready")]}{.lastTransitionTime}{"\n"}{end}{end}')
|
||||
if [ "$recent" -eq 1 ]; then
|
||||
slack "ABORT preflight — node transitioned Ready <24h ago (soak window)"
|
||||
exit 1
|
||||
fi
|
||||
|
||||
# 4. kubeadm upgrade plan matches target
|
||||
local plan_target
|
||||
plan_target=$(ssh "${SSH_OPTS[@]}" wizard@k8s-master 'sudo kubeadm upgrade plan' \
|
||||
| grep -oE 'kubeadm upgrade apply v[0-9]+\.[0-9]+\.[0-9]+' \
|
||||
| grep -oE 'v[0-9]+\.[0-9]+\.[0-9]+' | head -1 | tr -d v || true)  # empty plan_target is reported by the check below
|
||||
if [ "$plan_target" != "$TARGET_VERSION" ]; then
|
||||
slack "ABORT preflight — kubeadm plan target $plan_target ≠ requested $TARGET_VERSION"
|
||||
exit 1
|
||||
fi
|
||||
|
||||
# 5. Push in-flight + started_timestamp metrics + ns annotations
|
||||
$KUBECTL annotate ns "$NS" \
|
||||
"viktorbarzin.me/k8s-upgrade-in-flight=$(date -u +%FT%TZ)" \
|
||||
"viktorbarzin.me/k8s-upgrade-target=$TARGET_VERSION" \
|
||||
--overwrite
|
||||
push k8s_upgrade_in_flight 1
|
||||
push k8s_upgrade_started_timestamp "$(date +%s)"
|
||||
push k8s_upgrade_snapshot_taken 0
|
||||
|
||||
# 6. Trigger backup-etcd Job, wait, verify size
|
||||
local snap_job="pre-upgrade-etcd-${TARGET_VERSION//./-}-$(date +%s)"
|
||||
$KUBECTL -n default create job --from=cronjob/backup-etcd "$snap_job"
|
||||
if ! $KUBECTL -n default wait --for=condition=complete --timeout=600s "job/$snap_job"; then
|
||||
$KUBECTL -n default describe "job/$snap_job" | tail -30
|
||||
slack "ABORT preflight — etcd snapshot Job did not complete in 10 min"
|
||||
exit 1
|
||||
fi
|
||||
local snap_log size snap_file
|
||||
snap_log=$($KUBECTL -n default logs "job/$snap_job" -c backup-manage --tail=20 || \
|
||||
$KUBECTL -n default logs "job/$snap_job" --tail=20)
|
||||
size=$(echo "$snap_log" | grep -E '^Backup done:' | grep -oE '\([0-9]+ bytes\)' | grep -oE '[0-9]+' || true)
|
||||
snap_file=$(echo "$snap_log" | grep -E '^Backup done:' | awk '{print $3}' || true)
|
||||
if [ -z "$size" ] || [ "$size" -lt 1024 ]; then
|
||||
slack "ABORT preflight — etcd snapshot empty (size='${size:-unknown}')"
|
||||
exit 1
|
||||
fi
|
||||
$KUBECTL annotate ns "$NS" \
|
||||
"viktorbarzin.me/k8s-upgrade-snapshot-path=nfs://192.168.1.127:/srv/nfs/etcd-backup/$snap_file" \
|
||||
--overwrite
|
||||
push k8s_upgrade_snapshot_taken 1
|
||||
|
||||
# 7. Containerd skew fix on master (if master < workers)
|
||||
local master_ctr worker_max=0.0.0
|
||||
master_ctr=$(ssh "${SSH_OPTS[@]}" wizard@k8s-master "containerd --version | awk '{print \$3}' | tr -d v")
|
||||
for n in k8s-node1 k8s-node2 k8s-node3 k8s-node4; do
|
||||
local v
|
||||
v=$(ssh "${SSH_OPTS[@]}" "wizard@$n" "containerd --version | awk '{print \$3}' | tr -d v")
|
||||
[ "$(printf '%s\n%s' "$v" "$worker_max" | sort -V | tail -1)" = "$v" ] && worker_max="$v"
|
||||
done
|
||||
if [ "$(printf '%s\n%s' "$master_ctr" "$worker_max" | sort -V | head -1)" = "$master_ctr" ] \
|
||||
&& [ "$master_ctr" != "$worker_max" ]; then
|
||||
slack "Master containerd $master_ctr < workers $worker_max — bumping"
|
||||
ssh "${SSH_OPTS[@]}" wizard@k8s-master \
|
||||
"sudo apt-mark unhold containerd.io && sudo apt-get install -y containerd.io='$worker_max-1' \
|
||||
&& sudo apt-mark hold containerd.io && sudo systemctl restart containerd"
|
||||
wait_for_node_ready k8s-master "$($KUBECTL get node k8s-master -o jsonpath='{.status.nodeInfo.kubeletVersion}' | tr -d v)" \
|
||||
|| { slack "ABORT — k8s-master not Ready after containerd bump"; exit 1; }
|
||||
slack "Master containerd: $master_ctr → $worker_max. Master Ready."
|
||||
fi
|
||||
|
||||
# 8. Apt repo URL rewrite (minor only)
|
||||
if [ "$KIND" = "minor" ]; then
|
||||
local target_minor="${TARGET_VERSION%.*}"
|
||||
for n in k8s-master k8s-node1 k8s-node2 k8s-node3 k8s-node4; do
|
||||
ssh "${SSH_OPTS[@]}" "wizard@$n" \
|
||||
"echo 'deb [signed-by=/etc/apt/keyrings/kubernetes-apt-keyring.gpg] https://pkgs.k8s.io/core:/stable:/v$target_minor/deb/ /' | sudo tee /etc/apt/sources.list.d/kubernetes.list \
|
||||
&& curl -fsSL 'https://pkgs.k8s.io/core:/stable:/v$target_minor/deb/Release.key' \
|
||||
| sudo gpg --dearmor -o /etc/apt/keyrings/kubernetes-apt-keyring.gpg --batch --yes \
|
||||
&& sudo apt-get update"
|
||||
done
|
||||
slack "Apt repo rewritten to v$target_minor/deb on all 5 nodes"
|
||||
fi
|
||||
|
||||
slack "Preflight clean. Snapshot at nfs://...$snap_file ($size bytes). Dispatching master Job."
|
||||
}
|
||||
|
||||
phase_master() {
|
||||
slack "Draining k8s-master"
|
||||
|
||||
# Re-check halt-on-alert before drain
|
||||
local alerts
|
||||
alerts=$(halt_on_alert_query)
|
||||
[ -n "$alerts" ] && { slack "ABORT master — alerts firing pre-drain: $alerts"; exit 1; }
|
||||
|
||||
drain_node k8s-master
|
||||
|
||||
slack "Running update_k8s.sh on k8s-master (--role master --release $TARGET_VERSION)"
|
||||
ssh "${SSH_OPTS[@]}" wizard@k8s-master 'bash -s' \
|
||||
< "$UPDATE_K8S_SH" -- --role master --release "$TARGET_VERSION"
|
||||
|
||||
$KUBECTL uncordon k8s-master
|
||||
|
||||
wait_for_node_ready k8s-master "$TARGET_VERSION" \
|
||||
|| { slack "ABORT — k8s-master not Ready or wrong version after upgrade"; exit 1; }
|
||||
|
||||
local not_ready
|
||||
not_ready=$($KUBECTL -n kube-system get pods -l 'tier=control-plane' --no-headers 2>/dev/null \
|
||||
| { grep -v Running || true; } | wc -l)
|
||||
if [ "$not_ready" -gt 0 ]; then
|
||||
slack "ABORT — $not_ready control-plane pods not Running after master upgrade"
|
||||
exit 1
|
||||
fi
|
||||
|
||||
alerts=$(halt_on_alert_query RecentNodeReboot)
|
||||
[ -n "$alerts" ] && { slack "ABORT master — alerts firing post-upgrade: $alerts"; exit 1; }
|
||||
|
||||
slack "Master on v$TARGET_VERSION, control-plane Running. Dispatching worker chain."
|
||||
}
|
||||
|
||||
phase_worker() {
|
||||
[ -z "${TARGET_NODE:-}" ] && { echo "ERROR: worker phase requires TARGET_NODE" >&2; exit 2; }
|
||||
slack "Draining $TARGET_NODE"
|
||||
|
||||
# Halt-on-alert wait (up to 30 min)
|
||||
local attempt alerts
|
||||
for attempt in $(seq 1 30); do
|
||||
alerts=$(halt_on_alert_query)
|
||||
[ -z "$alerts" ] && break
|
||||
echo "Waiting for alerts to clear (attempt $attempt/30): $alerts"
|
||||
sleep 60
|
||||
done
|
||||
[ -n "$alerts" ] && { slack "ABORT $TARGET_NODE — alerts firing after 30min: $alerts"; exit 1; }
|
||||
|
||||
drain_node "$TARGET_NODE"
|
||||
|
||||
slack "Running update_k8s.sh on $TARGET_NODE (--role worker --release $TARGET_VERSION)"
|
||||
ssh "${SSH_OPTS[@]}" "wizard@$TARGET_NODE" 'bash -s' \
|
||||
< "$UPDATE_K8S_SH" -- --role worker --release "$TARGET_VERSION"
|
||||
|
||||
$KUBECTL uncordon "$TARGET_NODE"
|
||||
|
||||
wait_for_node_ready "$TARGET_NODE" "$TARGET_VERSION" \
|
||||
|| { slack "ABORT — $TARGET_NODE not Ready or wrong version"; exit 1; }
|
||||
|
||||
# Daemonsets back on the node
|
||||
local missing=0
|
||||
for ds in calico-node kube-proxy; do
|
||||
local count
|
||||
count=$($KUBECTL get pods -A -o wide --field-selector "spec.nodeName=$TARGET_NODE,status.phase=Running" --no-headers \
|
||||
| awk -v d="$ds" '$2 ~ d {n++} END{print n+0}')
|
||||
[ "$count" -lt 1 ] && missing=$((missing+1))
|
||||
done
|
||||
[ "$missing" -gt 0 ] && { slack "WARN $TARGET_NODE — $missing daemonset(s) missing"; }
|
||||
|
||||
# 10-min soak with halt-on-alert (RecentNodeReboot ignored — we know we restarted it)
|
||||
echo "Soaking $TARGET_NODE for 10 min..."
|
||||
for i in $(seq 1 10); do
|
||||
alerts=$(halt_on_alert_query RecentNodeReboot)
|
||||
[ -n "$alerts" ] && { slack "ABORT $TARGET_NODE mid-soak — alerts: $alerts"; exit 1; }
|
||||
sleep 60
|
||||
done
|
||||
|
||||
slack "$TARGET_NODE on v$TARGET_VERSION. Soaked clean (10 min)."
|
||||
}
|
||||
|
||||
phase_postflight() {
|
||||
slack "Running postflight"
|
||||
|
||||
# All 5 nodes at target
|
||||
local versions wrong
|
||||
versions=$($KUBECTL get nodes -o jsonpath='{range .items[*]}{.metadata.name}:{.status.nodeInfo.kubeletVersion}{"\n"}{end}')
|
||||
wrong=$(echo "$versions" | { grep -v ":v${TARGET_VERSION}\$" || true; } | wc -l)
|
||||
if [ "$wrong" -ne 0 ]; then
|
||||
slack "ABORT postflight — $wrong node(s) off target:"$'\n'"$versions"
|
||||
exit 1
|
||||
fi
|
||||
|
||||
# No alerts firing
|
||||
local alerts
|
||||
alerts=$(halt_on_alert_query)
|
||||
[ -n "$alerts" ] && slack "Postflight WARN — alerts still firing (cluster on target, please check):"$'\n'"$alerts"
|
||||
|
||||
# Pod-ready ratio
|
||||
local ratio
|
||||
ratio=$(curl -sf "$PROM/api/v1/query" \
|
||||
--data-urlencode 'query=sum(kube_pod_status_ready{condition="true"}) / sum(kube_pod_status_phase{phase="Running"})' \
|
||||
| jq -r '.data.result[0].value[1] // "0"') || ratio="unknown"  # informational only; do not abort cleanup
|
||||
|
||||
# Clear annotations + gauges
|
||||
$KUBECTL annotate ns "$NS" \
|
||||
'viktorbarzin.me/k8s-upgrade-in-flight-' \
|
||||
'viktorbarzin.me/k8s-upgrade-target-' \
|
||||
'viktorbarzin.me/k8s-upgrade-snapshot-path-' || true
|
||||
push k8s_upgrade_in_flight 0
|
||||
push k8s_upgrade_snapshot_taken 0
|
||||
push k8s_upgrade_started_timestamp 0
|
||||
|
||||
slack ":white_check_mark: K8s upgrade complete: cluster on v$TARGET_VERSION (pod-ready ratio $ratio)"
|
||||
}
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Dispatch
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
case "$PHASE" in
|
||||
preflight) phase_preflight ;;
|
||||
master) phase_master ;;
|
||||
worker) phase_worker ;;
|
||||
postflight) phase_postflight ;;
|
||||
*) echo "ERROR: unknown PHASE: $PHASE" >&2; exit 2 ;;
|
||||
esac
|
||||
|
||||
spawn_next
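The deterministic naming used by `spawn_next` can be tried in isolation. The following is a minimal sketch, not part of the commit: `job_name_for` is a hypothetical helper that mirrors the `k8s-upgrade-<phase>-<target_version>[-<node>]` scheme, and the version `1.34.7` is only an example value.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper (not in the commit) mirroring the naming in spawn_next:
#   k8s-upgrade-<phase>-<version with dots replaced by dashes>[-<node>]
job_name_for() {
  local phase="$1" version="$2" node="${3:-}"
  local name="k8s-upgrade-${phase}-${version//./-}"
  if [ -n "$node" ]; then
    name="${name}-${node}"
  fi
  printf '%s\n' "$name"
}

job_name_for master 1.34.7            # prints k8s-upgrade-master-1-34-7
job_name_for worker 1.34.7 k8s-node4  # prints k8s-upgrade-worker-1-34-7-k8s-node4
```

Because the name depends only on phase, target version and node, re-running a phase resolves to the same downstream Job name, which is what lets `spawn_next` skip creation when that Job already exists.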
|
||||
|
|
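Filters such as `halt_on_alert_query` run `grep -v` under `set -euo pipefail`. `grep` exits 1 when it selects no lines, so an "everything was filtered out" result can surface as a pipeline failure. The sketch below shows one way to keep the empty result at exit status 0. `filter_ignored` and the alert name `KubeNodeNotReady` are hypothetical and used only for illustration.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: drop lines matching an ignore regex. The `|| true`
# keeps grep's "no lines selected" status (1) from failing the pipeline.
# Note that it also hides grep's error status (2).
filter_ignored() {
  local regex="$1"
  { grep -vE "$regex" || true; } | sort -u
}

# Every input line is on the ignore list: empty output, exit status 0.
remaining=$(printf 'Watchdog\nInfoInhibitor\n' | filter_ignored '^(Watchdog|InfoInhibitor)$')
echo "remaining='${remaining}'"   # prints remaining=''

# A line that is not ignored passes through.
remaining=$(printf 'Watchdog\nKubeNodeNotReady\n' | filter_ignored '^(Watchdog|InfoInhibitor)$')
echo "remaining='${remaining}'"   # prints remaining='KubeNodeNotReady'
```

A failure in an upstream stage (for example `curl -sf` against Prometheus) still fails the pipeline under `pipefail`, so the check continues to fail closed when the data source is unreachable.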
@ -1917,6 +1917,21 @@ serverFiles:
|
|||
severity: critical
|
||||
annotations:
|
||||
summary: "K8s upgrade is in flight but no etcd snapshot was recorded — pipeline pre-flight failed silently"
|
||||
# K8sUpgradeStalled: the v2 Job-chain pushes `k8s_upgrade_started_timestamp`
|
||||
# in preflight and resets `k8s_upgrade_in_flight=0` in postflight. If
|
||||
# in_flight=1 persists for >90 min, a Job in the chain failed
|
||||
# (backoffLimit=1), got preempted/evicted, or is hung. Manual recovery:
|
||||
# `kubectl -n k8s-upgrade get jobs` → identify failed/stuck Job → delete
|
||||
# it → fix root cause → re-create the same Job. Next-Job creation in each
|
||||
# phase is idempotent (deterministic name = `k8s-upgrade-<phase>-<target>[-<node>]`)
|
||||
# so re-running won't duplicate downstream Jobs.
|
||||
- alert: K8sUpgradeStalled
|
||||
expr: k8s_upgrade_in_flight == 1 and (time() - k8s_upgrade_started_timestamp) > 5400
|
||||
for: 5m
|
||||
labels:
|
||||
severity: critical
|
||||
annotations:
|
||||
summary: "K8s upgrade has been in flight for >90 min — chain is stuck. Check: kubectl -n k8s-upgrade get jobs"
|
||||
- name: "Traefik Ingress"
|
||||
rules:
|
||||
- alert: TraefikDown
|
||||
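The `K8sUpgradeStalled` expression can be sanity-checked locally with plain arithmetic. This is a sketch under stated assumptions: `is_stalled` is a hypothetical helper, and the epoch values are made up for illustration. It reproduces only the `expr` condition (in flight and more than 5400 s since start), not the additional `for: 5m` hold.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical local check of the K8sUpgradeStalled condition.
# Args: in_flight (0|1), started (epoch seconds), now (epoch seconds).
is_stalled() {
  local in_flight="$1" started="$2" now="$3"
  [ "$in_flight" -eq 1 ] && [ $(( now - started )) -gt 5400 ]
}

if is_stalled 1 1000 7000; then echo "stalled"; else echo "ok"; fi   # 6000 s elapsed: stalled
if is_stalled 1 1000 6000; then echo "stalled"; else echo "ok"; fi   # 5000 s elapsed: ok
if is_stalled 0 1000 7000; then echo "stalled"; else echo "ok"; fi   # not in flight: ok
```

Since postflight pushes `k8s_upgrade_in_flight 0`, a chain that completes drops out of the condition regardless of how long it took.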