CPU limits cause CFS throttling even when nodes have idle capacity.
Move to a request-only CPU model: keep CPU requests for scheduling
fairness but remove all CPU limits. Memory limits stay, because memory is
incompressible and cannot be throttled the way CPU can.
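The throttling effect can be sketched with a little arithmetic. This is a rough model, not a measurement from this cluster: it assumes the default CFS period of 100 ms, threads that stay runnable for the whole period, and a free core for each thread. The function names are made up for illustration.

```python
CFS_PERIOD_MS = 100.0  # default CFS period (cfs_period_us = 100000)


def cfs_quota_ms(cpu_limit_cores: float, period_ms: float = CFS_PERIOD_MS) -> float:
    """CPU time a container may consume per period under a given CPU limit."""
    return cpu_limit_cores * period_ms


def throttled_ms_per_period(
    cpu_limit_cores: float,
    busy_threads: int,
    period_ms: float = CFS_PERIOD_MS,
) -> float:
    """Wall-clock time per period the container spends throttled.

    Assumes every thread is continuously runnable and has its own core,
    so N threads burn the quota N times faster than wall-clock time.
    """
    if busy_threads < 1:
        raise ValueError("busy_threads must be at least 1")
    wall_ms_until_exhausted = cfs_quota_ms(cpu_limit_cores, period_ms) / busy_threads
    return max(0.0, period_ms - wall_ms_until_exhausted)


if __name__ == "__main__":
    # A 500m limit with 4 busy threads: quota is gone after 12.5 ms of wall time.
    print(throttled_ms_per_period(0.5, busy_threads=4))
    # A 2-core limit with a single busy thread never exhausts its quota.
    print(throttled_ms_per_period(2.0, busy_threads=1))
```

In this model the throttled time does not depend on how idle the node is, which is the behaviour the request-only model avoids.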
Changes across 108 files:
- Kyverno LimitRange policy: remove cpu from default/max in all 6 tiers
- Kyverno ResourceQuota policy: remove limits.cpu from all 5 tiers
- Custom ResourceQuotas: remove limits.cpu from 8 namespace quotas
- Custom LimitRanges: remove cpu from default/max (nextcloud, onlyoffice)
- RBAC module: remove cpu_limits variable and quota reference
- Freedify factory: remove cpu_limit variable and limits reference
- 86 deployment files: remove cpu from all limits blocks
- 6 Helm values files: remove cpu under limits sections
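The per-container change listed above amounts to dropping the `cpu` key from each `limits` block while leaving requests and memory limits alone. A minimal sketch of that transformation follows; the helper name and the sample values are hypothetical, and the actual edits were made by hand in the Terraform and Helm files rather than by a script like this.

```python
from copy import deepcopy


def strip_cpu_limit(resources: dict) -> dict:
    """Return a copy of a container `resources` block without limits.cpu."""
    result = deepcopy(resources)
    limits = result.get("limits")
    if limits is not None:
        limits.pop("cpu", None)
        if not limits:
            # Drop the limits block entirely if cpu was its only entry.
            del result["limits"]
    return result


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    before = {
        "requests": {"cpu": "250m", "memory": "256Mi"},
        "limits": {"cpu": "500m", "memory": "512Mi"},
    }
    print(strip_cpu_limit(before))
```

The CPU request and the memory limit pass through unchanged, so scheduling and OOM behaviour are unaffected.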
ISSUE RESOLVED:
- Root cause: Loki's 15GB iSCSI PVC was completely full
- Symptom: 'no space left on device' errors during TSDB operations
- Impact: Loki service completely down, logging unavailable
- Side effects: Contributed to node2 containerd corruption incident
SOLUTION APPLIED:
- Expanded PVC storage: 15Gi → 50Gi via direct kubectl patch
- Triggered pod restart to complete filesystem resize
- Verified successful expansion and service recovery
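For reference, the resize step can be sketched as below. The namespace and PVC name are placeholders, not the real ones, and the exact command used during the incident is not recorded here. The sketch only builds the merge patch and the `kubectl patch` argument list. Expansion of this kind generally requires the storage class to have `allowVolumeExpansion: true`.

```python
import json
import shlex


def pvc_resize_patch(new_size: str) -> str:
    """JSON merge patch that raises a PVC's storage request."""
    return json.dumps({"spec": {"resources": {"requests": {"storage": new_size}}}})


def kubectl_patch_argv(namespace: str, pvc_name: str, new_size: str) -> list[str]:
    """Argument list for `kubectl patch pvc`; nothing is executed here."""
    return [
        "kubectl", "patch", "pvc", pvc_name,
        "-n", namespace,
        "--type", "merge",
        "-p", pvc_resize_patch(new_size),
    ]


if __name__ == "__main__":
    # "monitoring" and "example-loki-pvc" are hypothetical placeholders.
    argv = kubectl_patch_argv("monitoring", "example-loki-pvc", "50Gi")
    print(shlex.join(argv))
```

Building the argument list instead of a shell string avoids quoting problems with the JSON payload if the command is later run through `subprocess.run(argv)`.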
CURRENT STATUS:
✅ PVC: 50Gi capacity (iscsi-truenas storage class)
✅ Loki StatefulSet: 1/1 ready
✅ Loki Pod: 2/2 containers running
✅ Service: Successfully processing log streams
✅ No storage errors in recent logs
TERRAFORM ALIGNED:
- Updated loki.yaml persistence.size to match actual PVC
- Infrastructure code now reflects deployed state
[ci skip] - Emergency fix applied locally first due to service outage
Move all 88 service modules (66 individual + 22 platform) from
modules/kubernetes/<service>/ into their corresponding stack directories:
- Service stacks: stacks/<service>/module/
- Platform stack: stacks/platform/modules/<service>/
This colocates each module's source code with its Terragrunt definition.
Only shared utility modules remain in modules/kubernetes/:
ingress_factory, setup_tls_secret, dockerhub_secret, oauth-proxy.
All cross-references to shared modules updated to use correct
relative paths. Verified with terragrunt run --all -- plan:
0 adds, 0 destroys across all 68 stacks.
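The relocation rule described above can be summarized as a small mapping. This is an illustrative sketch only: the function name is invented, and which services belong to the platform stack is passed in as an assumption, since the full list is not given here.

```python
from pathlib import PurePosixPath

# Shared utility modules that stay in modules/kubernetes/ (from the list above).
SHARED_MODULES = {"ingress_factory", "setup_tls_secret", "dockerhub_secret", "oauth-proxy"}


def new_module_path(service: str, platform_services: set[str]) -> PurePosixPath:
    """Map a module under modules/kubernetes/<service>/ to its new location."""
    if service in SHARED_MODULES:
        return PurePosixPath("modules/kubernetes") / service
    if service in platform_services:
        return PurePosixPath("stacks/platform/modules") / service
    return PurePosixPath("stacks") / service / "module"


if __name__ == "__main__":
    # Hypothetical platform membership, for illustration only.
    platform = {"example-platform-svc"}
    for name in ("example-app", "example-platform-svc", "ingress_factory"):
        print(name, "->", new_module_path(name, platform))
```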
2026-02-22 14:38:14 +00:00
Renamed from modules/kubernetes/monitoring/loki.yaml