The control-plane flap (etcd lease-renewal timeouts) recurred. Rather than move
etcd to SSD (code-oflt, deferred again), the chosen direction is to REDUCE etcd
load enough that the leader-election-timeout band-aid (renew 10s->30s) becomes
removable. These are the big, clean cuts:
1. Remove VPA/Goldilocks (stacks/vpa emptied). All 349 VPAs ran updateMode=Off
(no auto-right-sizing) yet cost ~800 etcd objects + continuous recommender
writes + a pod-creation admission webhook, purely to feed a dashboard. krr
(Dockerized, on-demand) replaces it. Reverses the re-add after memory 2431.
2. Disable kyverno reporting (admission/aggregate/background). policyReports were
already off, so the pipeline generated ephemeralreports + an hourly
all-resource etcd re-scan for NO user-facing output. Admission enforcement
(deny-* policies) and Keel mutation are unaffected; violations surface via
Loki->Slack.
3. descheduler */5 -> hourly (fewer list/evict cycles; rebalancing isn't urgent).
Deferred (poor ROI / unsafe as planned): ESO refreshInterval 15m->1h is a
~20-stack sprawl for ~0.1 writes/s; keel background=false is invalid for a
mutate-existing policy and its churn is apply-time not steady-state. Both filed
as follow-up beads.
Post-apply: delete the chart-orphaned VPA CRDs to cascade-clean leftover CRs.
Then measure etcd apply-latency and revert the timeouts. Docs updated
(VPA/Goldilocks -> krr). See memory 5402-5407.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
6d224861 came from a --no-checkout worktree whose empty index made the
commit drop every file except two. This restores 05b50d2b's full tree and
correctly adds stacks/stem95su/gdrive-sync.tf + the service-catalog stem95su
entry. Forward-only (parent=6d224861, no force-push); [ci skip] since the
live infra was never applied from the broken commit.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
CronJob stem95su-gdrive-sync (*/10) mounts the content PVC RW and
rclone-syncs the read-only Drive folder "claude" (stem claude/files) onto
it (rclone/rclone:1.74.3, scope=drive.readonly, empty-source guard +
--max-delete 25). ESO ExternalSecret stem95su-rclone <- Vault
secret/stem95su. Requires the GCP OAuth app published to Production or the
refresh token expires ~weekly.
Lands the gdrive-sync stack on master (it had landed on a feature branch
by accident on the shared devvm checkout).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Phase 3: all 27 platform modules now run as independent stacks.
Platform reduced to empty shell (outputs only) for backward compat
with 72 app stacks that declare dependency "platform".
Fixed technitium cross-module dashboard reference by copying file.
Woodpecker pipeline applies all 27+1 stacks in parallel via loop.
All applied with zero destroys.