DECISION: Disable Loki, as its operational overhead outweighs its benefit
EVIDENCE FROM NODE2 INCIDENT:
- Loki was the root cause of a major cluster outage (PVC storage exhaustion)
- Centralized logging was unavailable when needed most (Loki was down)
- All debugging was accomplished with simpler tools (kubectl logs, events, describe)
- Prometheus metrics proved more valuable than centralized logs
OPERATIONAL OVERHEAD ELIMINATED:
✅ 50GB iSCSI storage freed up (expensive)
✅ ~3.5GB memory freed up (Loki + Alloy agents across cluster)
✅ ~2+ CPU cores freed up for actual workloads
✅ Reduced complexity - fewer services to maintain and troubleshoot
✅ Eliminated single point of failure that can cascade cluster-wide
CONFIGURATION PRESERVED:
✅ All Terraform resources commented out (not deleted)
✅ loki.yaml preserved with 50GB configuration
✅ alloy.yaml preserved with log shipping configuration
✅ Alert rules and Grafana datasource preserved (commented)
✅ Easy re-enabling: just uncomment resources and apply
ALTERNATIVE DEBUGGING APPROACH (sketch below):
✅ kubectl logs (always works, no storage dependency)
✅ kubectl get events (built-in Kubernetes events)
✅ Prometheus metrics (more reliable for monitoring)
✅ Enhanced health check scripts (direct status verification)
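A minimal sketch of the replacement workflow (namespace and pod names
are placeholders):

    # Pod logs straight from the kubelet; no storage dependency
    kubectl logs -n <namespace> <pod> --tail=100
    kubectl logs -n <namespace> <pod> --previous   # last crashed container

    # Built-in Kubernetes events, ordered by recency
    kubectl get events -n <namespace> --sort-by=.lastTimestamp

    # Probe results, mounts, and restart reasons for a single pod
    kubectl describe pod -n <namespace> <pod>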
RE-ENABLING:
To restore Loki later, uncomment all /* ... */ blocks in loki.tf
and apply via Terraform. All configuration is preserved.
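A sketch of the procedure, assuming it is run from whichever stack
directory owns loki.tf (verify the plan before applying):

    # 1. Uncomment the /* ... */ blocks in loki.tf
    # 2. Preview: Loki, Alloy, the alert rules, and the Grafana
    #    datasource should all show up as additions
    terragrunt plan
    # 3. Apply once the plan looks right
    terragrunt apply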
[ci skip] - infrastructure changes were applied locally first as part of the resource cleanup
- MaxRequestWorkers 25→50: with only 25 workers, SQLite lock contention
  could block every worker at once, so liveness probes failed even faster
  (131 restarts vs 50 before). 50 is a compromise: enough spare workers
  to keep answering probes.
- Excluded nextcloud from HighServiceErrorRate alert (chronic SQLite issue)
- MySQL migration attempted but hit three blockers: Group Replication
  error 3100 (fixed with GIPK; see the sketch below), emoji in the
  calendar/filecache tables (stripped), and SQLite corruption
  (pre-existing from the crash-looping). The migration was rolled back
  and Nextcloud restored to SQLite.
Critical fix: StorageClass mountOptions only apply during dynamic
provisioning. Our static PVs (created by Terraform) were missing
mount_options, so all NFS mounts defaulted to hard,timeo=600 —
the exact stale mount behavior we were trying to eliminate.
Adds mount_options directly to the nfs_volume module PV spec and
to the monitoring PVs (prometheus, loki, alertmanager).
Requires re-applying all stacks to propagate to existing PVs.
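A quick way to confirm propagation after the re-apply (PV name is a
placeholder; the running mount keeps its old flags until remounted):

    # The PV spec should now carry the options from the nfs_volume module
    kubectl get pv <pv-name> -o jsonpath='{.spec.mountOptions}'

    # On the node, check the live NFS mount flags after a remount
    mount | grep nfs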
Move all 88 service modules (66 individual + 22 platform) from
modules/kubernetes/<service>/ into their corresponding stack directories:
- Service stacks: stacks/<service>/module/
- Platform stack: stacks/platform/modules/<service>/
This collocates module source code with its Terragrunt definition.
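The per-service move follows this pattern (using the same <service>
placeholder as above):

    # Individual service module -> its own stack
    git mv modules/kubernetes/<service> stacks/<service>/module

    # Platform service module -> the shared platform stack
    git mv modules/kubernetes/<service> stacks/platform/modules/<service>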
Only shared utility modules remain in modules/kubernetes/:
ingress_factory, setup_tls_secret, dockerhub_secret, oauth-proxy.
All cross-references to shared modules updated to use correct
relative paths. Verified with terragrunt run --all -- plan:
0 adds, 0 destroys across all 68 stacks.
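The same no-op check can be repeated after any further moves; a stack
that plans changes usually means a stale relative path to one of the
shared modules:

    # Expect every stack to plan clean: 0 to add, 0 to destroy
    terragrunt run --all -- plan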