infra/.claude/skills/upgrade-state/SKILL.md
Viktor Barzin c04efa3d3a
k8s-version-upgrade: move detection to nightly 23:00 UTC (overnight upgrades)
Disruptive node drains should run when the cluster is idle. Move the
k8s-version-check detection CronJob from 12:00 UTC (noon) to 23:00 UTC
(00:00 London) — overnight, low usage, and clear of the kured OS-reboot window
(01:00-05:00 UTC) so the two drain pipelines never overlap. (Viktor, 2026-06-17.)

  - stacks/k8s-version-upgrade/main.tf: var.schedule default 0 12 → 0 23 * * *.
  - scripts/upgrade_state.sh: next_scheduled_run_utc now computes the 23:00 slot
    (was next_daily_noon_utc).
  - docs (runbook, architecture) + upgrade-state SKILL: schedule references
    updated to 23:00 UTC nightly.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-17 18:16:32 +00:00


name: upgrade-state
description: Audit the three autonomous-upgrade pipelines (apps via Keel, OS via unattended-upgrades+kured, K8s components via the version-check chain). Use when: (1) User asks "/upgrade-state" or "are we current", (2) User asks "what's pending upgrade" or "what's the upgrade state", (3) User asks if Keel / kured / k8s-version-check is healthy, (4) User asks about kept-back / held packages or pending reboots, (5) Periodic survey before the next `k8s-version-check` daily run. Read-only — no `--fix`. Exits 0 healthy / 1 attention / 2 stalled.
author: Claude Code
version: 1.0.0
date: 2026-05-18

Upgrade-state

MANDATORY: Run the script first

When this skill is invoked, your first action must be to run upgrade_state.sh and reason over its output before doing anything else. Do NOT improvise individual kubectl / ssh calls — the script is the authoritative surface.

bash /home/wizard/code/infra/scripts/upgrade_state.sh

For programmatic use:

bash /home/wizard/code/infra/scripts/upgrade_state.sh --json | tee /tmp/upgrade-state.json

Then:

  1. Report the rendered table verbatim — it answers the user's "are we current" question in three lines.
  2. For every attention or broken (✗) row, surface the relevant drill-down lines underneath and propose a next action (links in the table below).
  3. Only reach for ad-hoc commands when investigating beyond what the script reported.

Exit codes: 0 healthy, 1 attention warranted, 2 stalled / broken.
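
The exit codes make the script easy to wrap. A minimal sketch, assuming only the three codes documented above; `exit_label` and `run_and_label` are hypothetical helpers, not part of upgrade_state.sh:

```shell
#!/usr/bin/env bash
# Map an upgrade_state.sh exit code to the label documented above.
exit_label() {
    case "$1" in
        0) echo "healthy" ;;
        1) echo "attention" ;;
        2) echo "stalled" ;;
        *) echo "unknown" ;;   # anything else is outside the documented contract
    esac
}

# Run a command, print its label on stderr, and propagate the original exit code.
run_and_label() {
    local rc=0
    "$@" || rc=$?
    echo "upgrade-state: $(exit_label "$rc") (exit $rc)" >&2
    return "$rc"
}
```

Usage would look like `run_and_label bash /home/wizard/code/infra/scripts/upgrade_state.sh`, which keeps the rendered table on stdout untouched.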

What it covers (3 pipelines)

| Layer | What runs | Cadence | Data sources |
|-------|-----------|---------|--------------|
| Apps | Keel polls every watched Deployment's container registry; rolls on new digest | hourly | Prom (pending_approvals, registries_scanned_total), Keel pod logs |
| OS | unattended-upgrades in-release patching; kured reboots when /var/run/reboot-required is set | daily 02:00-06:00 London | SSH fan-out to all 5 nodes |
| K8s | k8s-version-check CronJob detects new kubeadm patch/minor; spawns the Job-chain that drains+upgrades node-by-node | nightly 23:00 UTC | Pushgateway (k8s_upgrade_*), kubectl get nodes |

The K8s pipeline pushes a small set of gauges to the Prometheus Pushgateway (prometheus-prometheus-pushgateway.monitoring:9091):

  • k8s_upgrade_available{kind="patch"|"minor",target=…} — 1 if newer release detected
  • k8s_version_check_last_run_timestamp — when detection last ran
  • k8s_upgrade_in_flight — 0/1
  • k8s_upgrade_started_timestamp — when the current chain started (0 when idle)
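
When investigating beyond the script, a single gauge can be pulled out of the Pushgateway's text output. A rough sketch, assuming the standard Prometheus text exposition format with the value as the last field (no trailing timestamp); `metric_value` is a hypothetical helper, not something upgrade_state.sh exposes:

```shell
#!/usr/bin/env bash
# Read Prometheus text exposition from stdin and print the value of the
# first sample whose metric name equals $1. Labels are ignored.
metric_value() {
    local name="$1"
    awk -v n="$name" '
        /^#/ { next }                   # skip HELP / TYPE comment lines
        {
            metric = $1
            sub(/\{.*/, "", metric)     # drop the label set, if any
            if (metric == n) { print $NF; exit }
        }'
}
```

Piping the `/metrics` output from the Pushgateway into `metric_value k8s_upgrade_in_flight` would print the current flag. Values may come back in float notation, so convert before doing integer arithmetic on timestamps.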

K8sUpgradeStalled fires when in_flight=1 and the chain has been running for more than 90 minutes. K8sUpgradeChainJobFailed fires when a phase Job terminally failed, including a preflight that aborted before in_flight was set (the gates exit pre-metric). The script raises the broken (✗) status for either, and reads the Jobs directly, so it also catches a Failed preflight that left no metric.
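
The stall condition can be sketched as a small predicate. This is an illustration of the rule as described here, not the actual Prometheus alert expression; `chain_stalled` is a hypothetical name and its inputs are assumed to be integers (0/1 and epoch seconds):

```shell
#!/usr/bin/env bash
# Return 0 (true) when the chain counts as stalled: in_flight is 1 and
# more than 90 minutes have elapsed since started_ts.
# Args: in_flight started_ts [now]   (now defaults to the current epoch)
chain_stalled() {
    local in_flight="$1" started_ts="$2" now="${3:-$(date +%s)}"
    [ "$in_flight" -eq 1 ] && [ $(( now - started_ts )) -gt $(( 90 * 60 )) ]
}
```

A preflight that aborts before in_flight is set never satisfies this predicate, which is why the failed-Job check exists separately.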

Status-icon legend

| Icon | Meaning |
|------|---------|
| | Healthy, fully current |
| | Update available, not yet applied (K8s patch/minor) |
| | In flight — chain currently running |
| | Attention: held-with-bumps, recent errors, pending approvals |
| ✗ | Broken: pod down, alert firing, chain stalled, or a chain Job failed |

Drill-down — when a row trips, what to do

Apps — pending approvals or errors

# Read recent Keel log lines
kubectl -n keel logs deploy/keel --since=24h --tail=200

# What is Keel currently tracking?
kubectl -n monitoring exec deploy/prometheus-server -c prometheus-server -- \
    wget -qO- 'http://localhost:9090/api/v1/query?query=count%20by%20(image)%20(registries_scanned_total)'

# Is the scrape live?
kubectl -n monitoring exec deploy/prometheus-server -c prometheus-server -- \
    wget -qO- 'http://localhost:9090/api/v1/query?query=up{job="kubernetes-pods",app="keel"}'

Common Keel errors:

  • failed to add image watch job — image annotation mistyped (rare; Kyverno auto-injects)
  • registry authentication required — bad imagePullSecret on the watched Deployment
  • bad tag pattern — Keel can't parse the watched image's tag against its policy

OS — held packages with bumps

The script flags any package held via apt-mark hold that ALSO appears in apt list --upgradable — excluding k8s components (the K8s pipeline owns those) and the kernel (kured handles the reboot half).
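
The "held with bumps" rule is an intersection with two exclusions. A minimal sketch, assuming the two command outputs were saved to files first; the excluded names (kubeadm, kubelet, kubectl, and linux-image/headers/modules packages) are an assumption about what "k8s components" and "the kernel" cover, and `held_with_bumps` is a hypothetical helper rather than the script's own code:

```shell
#!/usr/bin/env bash
# Args: file with `apt-mark showhold` output, file with `apt list --upgradable` output.
# Prints held packages that also have an upgrade pending, minus the exclusions.
held_with_bumps() {
    local held_file="$1" upgradable_file="$2"
    # Upgradable lines look like "pkg/suite version arch [...]"; the package
    # name is the text before the first "/". The "Listing..." header has no "/".
    awk -F/ 'NF > 1 { print $1 }' "$upgradable_file" \
        | grep -Fx -f "$held_file" \
        | grep -Ev '^(kubeadm|kubelet|kubectl|linux-(image|headers|modules)-.*)$' \
        | sort -u || true
}
```

An empty result means nothing on that node is both held and upgradable outside the two excluded groups.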

Typical cause: a major-version bump (e.g. containerd 1.7 → 2.2, runc 1.1 → 1.4). These are held because they need cluster-wide coordination, not silent in-release patching.

# Inspect the situation on the flagged node
ssh wizard@10.0.20.10X 'apt-mark showhold; apt list --upgradable 2>/dev/null'

# Unhold + upgrade a specific package
ssh wizard@10.0.20.10X 'sudo apt-mark unhold containerd && sudo apt-get install -y containerd'

Node IPs (last octet of 10.0.20.x): master=100, node1=101, node2=102, node3=103, node4=104.
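
For ad-hoc follow-up across several nodes, the mapping above can be written down once. A small sketch; `node_ip` and `all_node_ips` are hypothetical helpers, and the short names (master, node1, ...) are the labels used in this document rather than confirmed hostnames:

```shell
#!/usr/bin/env bash
# Resolve a node's short name to its address on the 10.0.20.x network.
node_ip() {
    case "$1" in
        master) echo "10.0.20.100" ;;
        node1)  echo "10.0.20.101" ;;
        node2)  echo "10.0.20.102" ;;
        node3)  echo "10.0.20.103" ;;
        node4)  echo "10.0.20.104" ;;
        *) echo "unknown node: $1" >&2; return 1 ;;
    esac
}

# Print "name ip" for all five nodes, one per line.
all_node_ips() {
    local n
    for n in master node1 node2 node3 node4; do
        echo "$n $(node_ip "$n")"
    done
}
```

For example, `ssh "wizard@$(node_ip node2)" 'apt-mark showhold'` inspects a single flagged node.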

OS — pending reboot

A node has /var/run/reboot-required. Kured will reboot it inside the next 02:00-06:00 London window (any day of the week).

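# Force a manual reboot inside the window (rare)
kubectl drain k8s-nodeX --delete-emptydir-data --ignore-daemonsets
ssh wizard@10.0.20.10X sudo systemctl reboot
# drain leaves the node cordoned; once it is back and Ready, re-enable scheduling
kubectl uncordon k8s-nodeX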

OS — kured not Running

kubectl -n kured get pods
kubectl -n kured logs daemonset/kured --tail=100
# Verify sentinel gate (kured-sentinel-gate DaemonSet writes /var/run/gated-reboot-required)
kubectl -n kured get pods -l name=kured-sentinel-gate

K8s — patch/minor available

Detection ran, target identified, chain NOT started. The chain spawns on the same daily detection cycle — typically within ~24h of the target first being detected.

# Inspect Pushgateway state
kubectl -n monitoring exec deploy/prometheus-server -c prometheus-server -- \
    wget -qO- 'http://prometheus-prometheus-pushgateway:9091/metrics' | grep ^k8s_upgrade

# Trigger a manual run of the detection CronJob
kubectl -n k8s-upgrade create job --from=cronjob/k8s-version-check manual-detect-$(date +%s)
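
To estimate when the next detection cycle will run without triggering it manually, the slot can be computed from the clock. A sketch assuming the nightly 23:00 UTC schedule described above; `next_detection_epoch` is a hypothetical name and is not the `next_scheduled_run_utc` implementation in upgrade_state.sh:

```shell
#!/usr/bin/env bash
# Print the epoch second of the next 23:00 UTC slot strictly after `now`.
# Arg: [now] in epoch seconds (defaults to the current time).
next_detection_epoch() {
    local now="${1:-$(date +%s)}"
    # Unix days start at 00:00 UTC, so today's slot is midnight + 23h.
    local slot=$(( now - now % 86400 + 23 * 3600 ))
    if [ "$now" -ge "$slot" ]; then
        slot=$(( slot + 86400 ))   # today's slot already passed; use tomorrow's
    fi
    echo "$slot"
}
```

The result can be rendered with `date -u -d "@$(next_detection_epoch)"` on systems with GNU date.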

K8s — in flight

The Job chain is running. Watch its progress:

kubectl -n k8s-upgrade get jobs --sort-by=.metadata.creationTimestamp
kubectl -n k8s-upgrade logs -l app=k8s-version-upgrade --tail=200 --prefix

K8s ✗ stalled (K8sUpgradeStalled would fire)

Chain in-flight >90m. The Job is most likely stuck on drain or a pre-flight check.

kubectl -n k8s-upgrade get jobs
kubectl -n k8s-upgrade describe job <stuck-job>
kubectl -n k8s-upgrade logs job/<stuck-job> --tail=300

# If you need to clear the in-flight flag (after diagnosing):
kubectl -n monitoring exec deploy/prometheus-server -c prometheus-server -- sh -c \
    "printf 'k8s_upgrade_in_flight 0\nk8s_upgrade_started_timestamp 0\n' | \
     wget -qO- --post-file=- 'http://prometheus-prometheus-pushgateway:9091/metrics/job/k8s-version-upgrade' \
       --header='Content-Type: text/plain'"

K8s ✗ chain failed — a phase Job terminally failed

K8sUpgradeChainJobFailed would fire. The most common cause is a preflight that aborted on a gate (a critical alert firing, a node not Ready, a kubeadm-plan mismatch). These aborts exit before in_flight is set, so K8sUpgradeStalled never sees them. In the 2026-06-12 incident the Job's deterministic name plus its 7d TTL also blocked re-spawn, which produced a 5-day wedge.

kubectl -n k8s-upgrade get jobs
kubectl -n k8s-upgrade describe job <failed-job>    # check the Failed reason
# Preflight abort reasons post to Slack ONLY (not stdout), so Loki won't have
# them. Replay the gate instead — which critical alerts were firing at the
# failure time? (ALERTS{severity="critical"} in Prometheus, query at that ts.)

Recovery is now mostly automatic: the detection CronJob and spawn_next re-spawn a terminally-Failed Job on the next cycle (retry-on-failure), so a transient gate clears within ~24h. To expedite, delete the Failed Job and trigger detection:

kubectl -n k8s-upgrade delete job <failed-job>
kubectl -n k8s-upgrade create job --from=cronjob/k8s-version-check manual-detect-$(date +%s)

K8s ✗ detection stale — last detection >9 days

kubectl -n k8s-upgrade get cronjob k8s-version-check
kubectl -n k8s-upgrade get jobs --sort-by=.metadata.creationTimestamp | tail -5

If the CronJob hasn't fired on time, suspect:

  • suspend=true on the CronJob (var.enabled=false in the k8s-version-upgrade Terraform stack)
  • Image-pull failure on the version-check pod
  • Pushgateway scrape gone stale

Companion command-line flags

bash infra/scripts/upgrade_state.sh                 # rendered table (default)
bash infra/scripts/upgrade_state.sh --json          # machine output
bash infra/scripts/upgrade_state.sh --kubeconfig X  # override kubeconfig