infra

Author	SHA1	Message	Date
Viktor Barzin	fd0f4a0365	fix: restore tree dropped by `6d224861`; land stem95su gdrive-sync (10m) [ci skip] `6d224861` came from a --no-checkout worktree whose empty index made the commit drop every file except two. This restores 05b50d2b's full tree and correctly adds stacks/stem95su/gdrive-sync.tf + the service-catalog stem95su entry. Forward-only (parent=6d224861, no force-push); [ci skip] since the live infra was never applied from the broken commit. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-09 08:45:33 +00:00
Viktor Barzin	6d224861c4	stem95su: scheduled Drive->site sync CronJob (every 10m) CronJob stem95su-gdrive-sync (*/10) mounts the content PVC RW and rclone-syncs the read-only Drive folder "claude" (stem claude/files) onto it (rclone/rclone:1.74.3, scope=drive.readonly, empty-source guard + --max-delete 25). ESO ExternalSecret stem95su-rclone <- Vault secret/stem95su. Requires the GCP OAuth app published to Production or the refresh token expires ~weekly. Lands the gdrive-sync stack on master (it had landed on a feature branch by accident on the shared devvm checkout). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-09 08:42:26 +00:00
Viktor Barzin	7f63d35d0a	docs/plans: HA control plane — design + plan + deferral Investigated, designed, and planned the 3-master HA control plane migration triggered by 2026-05-21's autonomous k8s upgrade cascade. Locked 14 design decisions across two passes: - 10 initial decisions (LB strategy, IPs, sizing, etcd, kured gate, etc) - 4 challenger-pass amendments (cloud-init template bump, rbac stack multi-master refactor, HTTPS /readyz health check, expanded blast radius to include /home/wizard/code/infra/config root kubeconfig, config.tfvars, k8s-portal user kubeconfigs, etcd-backup nodeSelector, k8s-version-upgrade chain extension as Phase 7) Plan covers 11 phases end-to-end including panic-mode rollback. DEFERRED before execution. PVE host is 98% RAM-committed (262 GB allocated / 267 GB physical, 1.5 GB swap active); the planned 3 x 32 GB masters would push allocation to 326 GB and OOM the host. k8s-master currently uses only 4.6 GB of its 32 GB allocation (5-6x oversized). Revisit triggers documented in design doc: 1. Second PVE host added → hardware HA becomes possible. 2. Right-sizing pass OR planning masters at 16 GB each. 3. Cumulative manual upgrade nursing > ~10h. Standalone candidate worth lifting independently: Phase 1.5's rbac stack refactor (apiserver-oidc + audit-policy + etcd-tuning to loop over k8s_master_hosts list) — future-proofs the cluster without committing to the HA migration. Refs: code-n0ow (open, deferred via bd note). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-23 08:32:15 +00:00
Viktor Barzin	533a89a010	docs: HA control plane design (3 masters) Captures today's k8s-upgrade-pipeline session findings — root cause of repeated upgrade failures is the single-master apiserver outage window cascading into operator crashloops + storm I/O. HA control plane with 3 masters + apiserver LB removes the cascade entirely. Tracked in beads code-n0ow. Plan doc to follow. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-21 09:41:20 +00:00
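
The RAM figures behind the deferral in 7f63d35d0a can be re-derived with a short check. This is only a sketch: it assumes the quoted 326 GB comes from replacing the existing 32 GB k8s-master with three new masters (262 - 32 + 3 x 32), which the commit message implies but does not state. The function and constant names are illustrative.

```python
# Figures quoted in commit 7f63d35d0a. Names are illustrative, not from the repo.
PHYSICAL_GB = 267        # physical RAM on the PVE host
ALLOCATED_GB = 262       # RAM currently allocated to VMs
CURRENT_MASTER_GB = 32   # allocation of the single existing k8s-master
PLANNED_MASTERS = 3


def projected_allocation(per_master_gb: int) -> int:
    """Total allocation if the existing master is replaced by PLANNED_MASTERS masters.

    Assumption: the current master's 32 GB is released and counted once,
    which is the reading that reproduces the 326 GB figure in the commit.
    """
    return ALLOCATED_GB - CURRENT_MASTER_GB + PLANNED_MASTERS * per_master_gb


def commit_ratio(allocated_gb: int) -> float:
    """Allocated RAM as a fraction of physical RAM."""
    return allocated_gb / PHYSICAL_GB


if __name__ == "__main__":
    print(f"today: {ALLOCATED_GB} GB, {commit_ratio(ALLOCATED_GB):.0%} committed")
    for size in (32, 16):
        total = projected_allocation(size)
        print(f"{PLANNED_MASTERS} x {size} GB masters: {total} GB, "
              f"{commit_ratio(total):.0%} committed")
```

Under the same assumption, the 16 GB option named in the revisit triggers works out to 278 GB allocated.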

4 commits
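
The "empty-source guard + --max-delete 25" safeguard described in 6d224861c4 could look roughly like the sketch below. The real job lives in stacks/stem95su/gdrive-sync.tf, which is not shown here, so this only illustrates the idea: skip the sync when the Drive listing comes back empty, and otherwise cap deletions at 25. The remote name `gdrive:claude` and the mount path `/content` are hypothetical placeholders; `--max-delete` is the flag named in the commit message.

```python
from typing import List, Optional

MAX_DELETE = 25  # deletion cap quoted in the commit message


def build_sync_command(
    source_entries: List[str], remote: str, dest: str
) -> Optional[List[str]]:
    """Return the rclone argv for a guarded sync, or None to skip the run.

    An empty source listing is treated as a failure signal (for example an
    expired token or an unreachable folder) rather than as "delete everything",
    so no command is produced in that case.
    """
    if not source_entries:
        return None
    return ["rclone", "sync", remote, dest, "--max-delete", str(MAX_DELETE)]


if __name__ == "__main__":
    # Hypothetical remote and mount path, for illustration only.
    print(build_sync_command([], "gdrive:claude", "/content"))
    print(build_sync_command(["index.html"], "gdrive:claude", "/content"))
```

Returning the argv instead of running it keeps the guard testable without a configured rclone remote. A caller would pass the list to `subprocess.run` only when it is not `None`.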