- Reduce Kyverno LimitRange default limits ~4x across all tiers to fix
800-900% memory overcommitment on worker nodes
- Add cluster health check #25: per-node resource overcommitment
showing requests and limits vs allocatable capacity
- Add Kyverno policy for Goldilocks VPA mode by tier: tier-0 namespaces
get VPA Off mode (recommend only, no evictions) to prevent downtime
on critical infra (traefik, cloudflared, authentik, technitium, etc.)
- Non-tier-0 namespaces get VPA Auto mode for active right-sizing
- Scale admission controller to 2 replicas with topology spread across nodes
- Rewrite inject-priority-class-from-tier: use namespaceSelector instead of
API call per pod admission (eliminates Kyverno→API server round-trip)
- Rewrite sync-tier-label-from-namespace: same namespaceSelector approach
- Extract governance_tiers local to DRY up tier definitions
Admission controller was restarting every ~5min due to API server timeouts
causing leader election loss. failurePolicy:Fail meant the webhook blocked
all pod creation cluster-wide when Kyverno was unavailable.
Remove the module "xxx" { source = "./module" } indirection layer
from all 66 service stacks. Resources are now defined directly in
each stack's main.tf instead of through a wrapper module.
- Merge module/main.tf contents into stack main.tf
- Apply variable replacements (var.tier -> local.tiers.X, renamed vars)
- Fix shared module paths (one fewer ../ at each level)
- Move extra files/dirs (factory/, chart_values, subdirs) to stack root
- Update state files to strip module.<name>. prefix
- Update CLAUDE.md to reflect flat structure
Verified: terragrunt plan shows 0 add, 0 destroy across all stacks.
Move all 88 service modules (66 individual + 22 platform) from
modules/kubernetes/<service>/ into their corresponding stack directories:
- Service stacks: stacks/<service>/module/
- Platform stack: stacks/platform/modules/<service>/
This collocates module source code with its Terragrunt definition.
Only shared utility modules remain in modules/kubernetes/:
ingress_factory, setup_tls_secret, dockerhub_secret, oauth-proxy.
All cross-references to shared modules updated to use correct
relative paths. Verified with terragrunt run --all -- plan:
0 adds, 0 destroys across all 68 stacks.