The control-plane flap (etcd lease-renewal timeouts) recurred. Rather than move
etcd to SSD (code-oflt, deferred again), the chosen direction is to REDUCE etcd
load enough that the leader-election-timeout band-aid (renew 10s->30s) becomes
removable. These are the big, clean cuts:
1. Remove VPA/Goldilocks (stacks/vpa emptied). All 349 VPAs ran updateMode=Off
(no auto-right-sizing) yet cost ~800 etcd objects + continuous recommender
writes + a pod-creation admission webhook, purely to feed a dashboard. krr
(Dockerized, on-demand) replaces it. Reverses the re-add after memory 2431.
2. Disable kyverno reporting (admission/aggregate/background). policyReports were
already off, so the pipeline generated ephemeralreports + an hourly
all-resource etcd re-scan for NO user-facing output. Admission enforcement
(deny-* policies) and Keel mutation are unaffected; violations surface via
Loki->Slack.
3. descheduler */5 -> hourly (fewer list/evict cycles; rebalancing isn't urgent).
Deferred (poor ROI / unsafe as planned): ESO refreshInterval 15m->1h is a
~20-stack sprawl for ~0.1 writes/s; keel background=false is invalid for a
mutate-existing policy and its churn is apply-time not steady-state. Both filed
as follow-up beads.
Post-apply: delete the chart-orphaned VPA CRDs to cascade-clean leftover CRs.
Then measure etcd apply-latency and revert the timeouts. Docs updated
(VPA/Goldilocks -> krr). See memory 5402-5407.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
6d224861 came from a --no-checkout worktree whose empty index made the
commit drop every file except two. This restores 05b50d2b's full tree and
correctly adds stacks/stem95su/gdrive-sync.tf + the service-catalog stem95su
entry. Forward-only (parent=6d224861, no force-push); [ci skip] since the
live infra was never applied from the broken commit.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
CronJob stem95su-gdrive-sync (*/10) mounts the content PVC RW and
rclone-syncs the read-only Drive folder "claude" (stem claude/files) onto
it (rclone/rclone:1.74.3, scope=drive.readonly, empty-source guard +
--max-delete 25). ESO ExternalSecret stem95su-rclone <- Vault
secret/stem95su. Requires the GCP OAuth app published to Production or the
refresh token expires ~weekly.
Lands the gdrive-sync stack on master (it had landed on a feature branch
by accident on the shared devvm checkout).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
- MySQL InnoDB: keep required anti-affinity but document why (2/3 members OK during node loss)
- Descheduler: increase frequency from hourly to every 5 min for faster rebalancing
- Prometheus: set terminationGracePeriodSeconds=60 to prevent drain timeout [ci skip]
CPU limits cause CFS throttling even when nodes have idle capacity.
Move to a request-only CPU model: keep CPU requests for scheduling
fairness but remove all CPU limits. Memory limits stay (incompressible).
Changes across 108 files:
- Kyverno LimitRange policy: remove cpu from default/max in all 6 tiers
- Kyverno ResourceQuota policy: remove limits.cpu from all 5 tiers
- Custom ResourceQuotas: remove limits.cpu from 8 namespace quotas
- Custom LimitRanges: remove cpu from default/max (nextcloud, onlyoffice)
- RBAC module: remove cpu_limits variable and quota reference
- Freedify factory: remove cpu_limit variable and limits reference
- 86 deployment files: remove cpu from all limits blocks
- 6 Helm values files: remove cpu under limits sections
Remove the module "xxx" { source = "./module" } indirection layer
from all 66 service stacks. Resources are now defined directly in
each stack's main.tf instead of through a wrapper module.
- Merge module/main.tf contents into stack main.tf
- Apply variable replacements (var.tier -> local.tiers.X, renamed vars)
- Fix shared module paths (one fewer ../ at each level)
- Move extra files/dirs (factory/, chart_values, subdirs) to stack root
- Update state files to strip module.<name>. prefix
- Update CLAUDE.md to reflect flat structure
Verified: terragrunt plan shows 0 add, 0 destroy across all stacks.
2026-02-22 15:13:55 +00:00
Renamed from stacks/descheduler/module/values.yaml (Browse further)