infra

Author	SHA1	Message	Date
Viktor Barzin	c04efa3d3a	k8s-version-upgrade: move detection to nightly 23:00 UTC (overnight upgrades) Disruptive node drains should run when the cluster is idle. Move the k8s-version-check detection CronJob from 12:00 UTC (noon) to 23:00 UTC (00:00 London): overnight, low usage, and clear of the kured OS-reboot window (01:00-05:00 UTC) so the two drain pipelines never overlap. (Viktor, 2026-06-17.) - stacks/k8s-version-upgrade/main.tf: var.schedule default `0 12 * * *` → `0 23 * * *`. - scripts/upgrade_state.sh: next_scheduled_run_utc now computes the 23:00 slot (was next_daily_noon_utc). - docs (runbook, architecture) + upgrade-state SKILL: schedule references updated to 23:00 UTC nightly. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-17 18:16:32 +00:00
Viktor Barzin	fb638cd8ec	k8s-version-upgrade: scope chain-fail alert to terminal reasons + sync docs Refines the new K8sUpgradeChainJobFailed alert from a bare failed-pod count to the terminal job-condition reasons (BackoffLimitExceeded|DeadlineExceeded). A phase whose first pod failed but whose retry SUCCEEDED must NOT fire: every firing alert also halts kured, so a bare-count false-positive would block all OS node reboots for the Job's 7-day TTL. Verified against kube-state-metrics: the stuck preflight reports reason="BackoffLimitExceeded"; a Complete job has 0 for the terminal reasons. Docs updated to match the behaviour change (per the same-commit docs rule): - docs/runbooks/k8s-version-upgrade.md: new alert in the gates list; the "kill a stuck Job" recovery now leads with retry-on-failure self-heal. - docs/architecture/automated-upgrades.md: fourth Upgrade Gates alert; retry-on-failure note on the deterministic-naming paragraph. - .claude/skills/upgrade-state/SKILL.md: new "chain failed" status, legend entry, and drill-down (also copied to the active ~/.claude copy). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-17 13:10:18 +00:00
Viktor Barzin	fd0f4a0365	fix: restore tree dropped by `6d224861`; land stem95su gdrive-sync (10m) [ci skip] `6d224861` came from a --no-checkout worktree whose empty index made the commit drop every file except two. This restores 05b50d2b's full tree and correctly adds stacks/stem95su/gdrive-sync.tf + the service-catalog stem95su entry. Forward-only (parent=6d224861, no force-push); [ci skip] since the live infra was never applied from the broken commit. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-09 08:45:33 +00:00
Viktor Barzin	6d224861c4	stem95su: scheduled Drive->site sync CronJob (every 10m) CronJob stem95su-gdrive-sync (*/10) mounts the content PVC RW and rclone-syncs the read-only Drive folder "claude" (stem claude/files) onto it (rclone/rclone:1.74.3, scope=drive.readonly, empty-source guard + --max-delete 25). ESO ExternalSecret stem95su-rclone <- Vault secret/stem95su. Requires the GCP OAuth app published to Production or the refresh token expires ~weekly. Lands the gdrive-sync stack on master (it had landed on a feature branch by accident on the shared devvm checkout). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-09 08:42:26 +00:00
Viktor Barzin	9a06a76883	k8s-version-upgrade: switch detection cron from weekly to daily Was `0 12 * * 0` (Sun 12:00 UTC) — patch releases waited up to 6 days before the chain picked them up. Now `0 12 * * *` (daily 12:00 UTC, still outside kured's 02:00-06:00 London window). Concurrency is bounded by Forbid + deterministic job-name idempotency (the detection job exits early if a preflight Job for the same target already exists), so back-to-back days can't pile up parallel runs. - stacks/k8s-version-upgrade/main.tf: var.schedule default + rationale comment - scripts/upgrade_state.sh: rename next_sunday_noon_utc -> next_daily_noon_utc (now returns "Tue 2026-05-19 12:00 UTC" form); change "(Sun cron)" label to "(daily cron)" - .claude/skills/upgrade-state/SKILL.md: cadence column + frontmatter Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-18 18:29:08 +00:00
Viktor Barzin	9e045e2c16	upgrade-state: skill + script + Keel scrape for periodic three-pipeline audit Three autonomous-upgrade pipelines run independently — Keel for apps (hourly registry polling), unattended-upgrades+kured for OS, and the k8s-version-check chain for kubeadm/kubelet/kubectl. Until now there was no single place to see whether each was healthy, what's pending, or whether anything's stuck. The /upgrade-state skill collapses the state of all three into one table you can run before each Sunday's k8s-version-check fires. - stacks/keel/main.tf: add Prometheus pod-annotation scrape on container port 9300. Surfaces pending_approvals, poll_trigger_tracked_images, and registries_scanned_total{image} so the skill has a real timeseries (also opens the door to a future "pending_approvals > 0 for 24h" alert). - scripts/upgrade_state.sh: collector + renderer. Three-row table (Apps / OS / K8s) + drill-down, --json for piping, exit 0/1/2. SSH fan-out (parallel subshells) to all five nodes for apt state + reboot-required + uu log; Prometheus query for Keel; Pushgateway parse for k8s_upgrade_* gauges. Read-only. - .claude/skills/upgrade-state/SKILL.md: hardlinked to ~/.claude/skills/upgrade-state/SKILL.md so the skill is discoverable from both monorepo-rooted and global sessions. Verification: ran the script, stress-tested the ✗ stalled path by pushing in_flight=1 + started_timestamp=-100min to Pushgateway and resetting after — script correctly raised ✗ and exit 2. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-18 10:50:43 +00:00

6 commits
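
The scheduling change in c04efa3d3a can be illustrated with a small sketch. This is not the contents of scripts/upgrade_state.sh (a shell script whose body is not shown in this log). It is a Python approximation that assumes the helper simply returns the next daily 23:00 UTC slot, and that the kured window is the 01:00-05:00 UTC range quoted in the commit message. The function `in_kured_window` and its parameters are hypothetical names introduced only for illustration.

```python
from datetime import datetime, timedelta, timezone


def next_scheduled_run_utc(now: datetime, hour: int = 23) -> datetime:
    """Return the next daily slot at `hour`:00 UTC strictly after `now`.

    `now` must be timezone-aware; it is normalised to UTC first.
    """
    now = now.astimezone(timezone.utc)
    slot = now.replace(hour=hour, minute=0, second=0, microsecond=0)
    if slot <= now:
        # Today's slot has already passed (or is exactly now): use tomorrow's.
        slot += timedelta(days=1)
    return slot


def in_kured_window(slot: datetime, start_hour: int = 1, end_hour: int = 5) -> bool:
    """Hypothetical check: does `slot` fall inside the assumed reboot window?

    The 01:00-05:00 UTC bounds are taken from the commit message; the
    half-open interval [start_hour, end_hour) is an assumption.
    """
    return start_hour <= slot.astimezone(timezone.utc).hour < end_hour


if __name__ == "__main__":
    commit_time = datetime(2026, 6, 17, 18, 16, 32, tzinfo=timezone.utc)
    slot = next_scheduled_run_utc(commit_time)
    print(slot.isoformat(), "inside kured window:", in_kured_window(slot))
```

Under these assumptions, the 23:00 UTC detection slot always falls outside the reboot window, which matches the commit's stated intent of keeping the two drain pipelines from starting at the same time.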