upgrade-state: skill + script + Keel scrape for periodic three-pipeline audit

Three autonomous-upgrade pipelines run independently — Keel for apps
(hourly registry polling), unattended-upgrades+kured for OS, and the
k8s-version-check chain for kubeadm/kubelet/kubectl. Until now there
was no single place to see whether each was healthy, what's pending,
or whether anything's stuck. The /upgrade-state skill collapses the
state of all three into one table you can run before each Sunday's
k8s-version-check fires.

- stacks/keel/main.tf: add Prometheus pod-annotation scrape on
  container port 9300. Surfaces pending_approvals,
  poll_trigger_tracked_images, and registries_scanned_total{image}
  so the skill has a real timeseries (also opens the door to a
  future "pending_approvals > 0 for 24h" alert).
- scripts/upgrade_state.sh: collector + renderer. Three-row table
  (Apps / OS / K8s) + drill-down, --json for piping, exit 0/1/2.
  SSH fan-out (parallel subshells) to all five nodes for apt
  state + reboot-required + uu log; Prometheus query for Keel;
  Pushgateway parse for k8s_upgrade_* gauges. Read-only.
- .claude/skills/upgrade-state/SKILL.md: hardlinked to
  ~/.claude/skills/upgrade-state/SKILL.md so the skill is
  discoverable from both monorepo-rooted and global sessions.

Verification: ran the script, stress-tested the ✗ stalled path by
pushing in_flight=1 + started_timestamp=-100min to Pushgateway and
resetting after — script correctly raised ✗ and exit 2.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
This commit is contained in:
Viktor Barzin 2026-05-18 10:50:43 +00:00
parent a9cb806e86
commit 9e045e2c16
3 changed files with 804 additions and 0 deletions

View file

@ -46,6 +46,16 @@ resource "helm_release" "keel" {
atomic = true
values = [yamlencode({
# Prometheus pod-annotation scrape picks up Keel-specific metrics
# (pending_approvals, poll_trigger_tracked_images, registries_scanned_total{image,registry})
# on container port 9300 /metrics. The cluster's `kubernetes-pods`
# Prometheus job keys on these annotations. Used by
# infra/scripts/upgrade_state.sh (the /upgrade-state skill).
podAnnotations = {
"prometheus.io/scrape" = "true"
"prometheus.io/port" = "9300"
"prometheus.io/path" = "/metrics"
}
polling = {
enabled = true
# Default poll cadence for workloads that don't override per-Deployment