6d224861 came from a --no-checkout worktree whose empty index made the
commit drop every file except two. This restores 05b50d2b's full tree and
correctly adds stacks/stem95su/gdrive-sync.tf + the service-catalog stem95su
entry. Forward-only (parent=6d224861, no force-push); [ci skip] since the
live infra was never applied from the broken commit.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
CronJob stem95su-gdrive-sync (*/10) mounts the content PVC RW and
rclone-syncs the read-only Drive folder "claude" (stem claude/files) onto
it (rclone/rclone:1.74.3, scope=drive.readonly, empty-source guard +
--max-delete 25). ESO ExternalSecret stem95su-rclone <- Vault
secret/stem95su. Requires the GCP OAuth app published to Production or the
refresh token expires ~weekly.
Lands the gdrive-sync stack on master (it had landed on a feature branch
by accident on the shared devvm checkout).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Was `0 12 * * 0` (Sun 12:00 UTC) — patch releases waited up to 6 days
before the chain picked them up. Now `0 12 * * *` (daily 12:00 UTC,
still outside kured's 02:00-06:00 London window). Concurrency is
bounded by Forbid + deterministic job-name idempotency (the detection
job exits early if a preflight Job for the same target already exists),
so back-to-back days can't pile up parallel runs.
- stacks/k8s-version-upgrade/main.tf: var.schedule default + rationale comment
- scripts/upgrade_state.sh: rename next_sunday_noon_utc -> next_daily_noon_utc
(now returns "Tue 2026-05-19 12:00 UTC" form); change "(Sun cron)" label
to "(daily cron)"
- .claude/skills/upgrade-state/SKILL.md: cadence column + frontmatter
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Three autonomous-upgrade pipelines run independently — Keel for apps
(hourly registry polling), unattended-upgrades+kured for OS, and the
k8s-version-check chain for kubeadm/kubelet/kubectl. Until now there
was no single place to see whether each was healthy, what's pending,
or whether anything's stuck. The /upgrade-state skill collapses the
state of all three into one table you can run before each Sunday's
k8s-version-check fires.
- stacks/keel/main.tf: add Prometheus pod-annotation scrape on
container port 9300. Surfaces pending_approvals,
poll_trigger_tracked_images, and registries_scanned_total{image}
so the skill has a real timeseries (also opens the door to a
future "pending_approvals > 0 for 24h" alert).
- scripts/upgrade_state.sh: collector + renderer. Three-row table
(Apps / OS / K8s) + drill-down, --json for piping, exit 0/1/2.
SSH fan-out (parallel subshells) to all five nodes for apt
state + reboot-required + uu log; Prometheus query for Keel;
Pushgateway parse for k8s_upgrade_* gauges. Read-only.
- .claude/skills/upgrade-state/SKILL.md: hardlinked to
~/.claude/skills/upgrade-state/SKILL.md so the skill is
discoverable from both monorepo-rooted and global sessions.
Verification: ran the script, stress-tested the ✗ stalled path by
pushing in_flight=1 + started_timestamp=-100min to Pushgateway and
resetting after — script correctly raised ✗ and exit 2.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>