DECISION: Disable Loki because its operational overhead outweighs its benefit
EVIDENCE FROM NODE2 INCIDENT:
- Loki was the root cause of a major cluster outage (PVC storage exhaustion)
- Centralized logging was unavailable exactly when it was needed most (Loki itself was down)
- All debugging was accomplished with simpler tools (kubectl logs, events, describe)
- Prometheus metrics proved more valuable than centralized logs (see the query sketch below)
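For reference, the PVC-exhaustion condition behind this incident can be checked directly against the Prometheus HTTP API. A minimal sketch; the Prometheus address and the 10% threshold are assumptions, and the kubelet_volume_stats_* metrics must be scraped for this to return data:

    # Flag PVCs with less than 10% free space (hypothetical Prometheus address;
    # adjust host/port for this cluster).
    curl -s 'http://prometheus.monitoring.svc:9090/api/v1/query' \
      --data-urlencode 'query=kubelet_volume_stats_available_bytes / kubelet_volume_stats_capacity_bytes < 0.10'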
OPERATIONAL OVERHEAD ELIMINATED:
✅ 50GB of expensive iSCSI storage freed
✅ ~3.5GB of memory freed (Loki plus Alloy agents across the cluster)
✅ ~2 CPU cores freed for actual workloads (reclaim verified in the sketch after this list)
✅ Reduced complexity: fewer services to maintain and troubleshoot
✅ Eliminated a single point of failure that could cascade cluster-wide
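A quick way to confirm the reclaim took effect, sketched below. Matching on "loki"/"alloy" in resource names is an assumption about how the releases were labeled, and kubectl top requires metrics-server:

    # Expect no output from either grep once the teardown is complete.
    kubectl get pods -A | grep -Ei 'loki|alloy'
    kubectl get pvc -A | grep -i 'loki'
    # Check per-node headroom after the cleanup (needs metrics-server).
    kubectl top nodes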
CONFIGURATION PRESERVED:
✅ All Terraform resources commented out (not deleted); state check sketched after this list
✅ loki.yaml preserved with its 50GB storage configuration
✅ alloy.yaml preserved with its log-shipping configuration
✅ Alert rules and Grafana datasource preserved (commented out)
✅ Easy re-enabling: uncomment the resources and apply (see RE-ENABLING below)
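One way to double-check that the commented-out resources actually left the Terraform state after the destroy apply; a sketch assuming the resource addresses contain "loki" or "alloy":

    # List any remaining Loki/Alloy resources in state; expect no output.
    terraform state list | grep -Ei 'loki|alloy'
    # A clean follow-up plan should report no pending changes.
    terraform plan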
ALTERNATIVE DEBUGGING APPROACH:
✅ kubectl logs (always works, no storage dependency); command sketch after this list
✅ kubectl get events (built-in Kubernetes events)
✅ Prometheus metrics (more reliable for monitoring)
✅ Enhanced health check scripts (direct status verification)
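The kinds of commands this approach relies on, sketched below. The namespace and resource names are placeholders, not values from this cluster:

    # Tail current and previous container logs (no storage backend needed).
    kubectl logs deploy/<app> -n <namespace> --tail=100
    kubectl logs <pod> -n <namespace> --previous
    # Recent events, oldest first.
    kubectl get events -n <namespace> --sort-by=.lastTimestamp
    # Full status, conditions, and recent events for a single pod.
    kubectl describe pod <pod> -n <namespace>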
RE-ENABLING:
To restore Loki later, uncomment all /* ... */ blocks in loki.tf
and apply via Terraform, as sketched below. All configuration is preserved.
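The restore workflow as a sketch; the -target address shown is a placeholder, and targeting is optional:

    # 1. Uncomment the /* ... */ blocks in loki.tf.
    # 2. Review exactly what will be recreated.
    terraform plan
    # 3. Apply everything, or target the Loki resources only.
    terraform apply
    # terraform apply -target=helm_release.loki   # placeholder address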
[ci skip] - Infrastructure changes applied locally first, since this commit tears down running resources