infra/stacks/platform/modules/monitoring
OpenClaw 8154103ac4 feat(monitoring): Disable Loki centralized logging while preserving configuration
DECISION: Disable Loki due to operational overhead vs benefit analysis

EVIDENCE FROM NODE2 INCIDENT:
- Loki was the root cause of major cluster outage (PVC storage exhaustion)
- Centralized logging was unavailable when needed most (Loki was down)
- All debugging was accomplished with simpler tools (kubectl logs, events, describe)
- Prometheus metrics proved more valuable than centralized logs

OPERATIONAL OVERHEAD ELIMINATED:
 50GB iSCSI storage freed up (expensive)
 ~3.5GB memory freed up (Loki + Alloy agents across cluster)
 ~2+ CPU cores freed up for actual workloads
 Reduced complexity - fewer services to maintain and troubleshoot
 Eliminated single point of failure that can cascade cluster-wide

CONFIGURATION PRESERVED:
 All Terraform resources commented out (not deleted)
 loki.yaml preserved with 50GB configuration
 alloy.yaml preserved with log shipping configuration
 Alert rules and Grafana datasource preserved (commented)
 Easy re-enabling: just uncomment resources and apply

ALTERNATIVE DEBUGGING APPROACH:
 kubectl logs (always works, no storage dependency)
 kubectl get events (built-in Kubernetes events)
 Prometheus metrics (more reliable for monitoring)
 Enhanced health check scripts (direct status verification)

RE-ENABLING:
To restore Loki later, uncomment all /* ... */ blocks in loki.tf
and apply via Terraform. All configuration is preserved.

[ci skip] - Infrastructure changes applied locally first due to resource cleanup
2026-03-17 16:51:02 +00:00
..
dashboards [ci skip] color only public IPs red in service map, private IPs (10.x, 192.168.x) get light blue 2026-02-28 19:44:16 +00:00
server-power-cycle [ci skip] Move Terraform modules into stack directories 2026-02-22 14:38:14 +00:00
alloy.yaml [ci skip] Infrastructure hardening: security, monitoring, reliability, maintainability 2026-02-23 22:05:28 +00:00
caretta.tf resource quota review: fix OOM risks, close quota gaps, add HA protections 2026-03-08 18:17:46 +00:00
Dockerfile [ci skip] Move Terraform modules into stack directories 2026-02-22 14:38:14 +00:00
goflow2.tf [ci skip] fix caretta helm values and goflow2 transport args 2026-02-28 18:51:02 +00:00
grafana.tf [ci skip] fix: add mount_options to all NFS PVs (soft,timeo=30,retrans=3) 2026-03-02 20:23:36 +00:00
grafana_chart_values.yaml resource quota review: fix OOM risks, close quota gaps, add HA protections 2026-03-08 18:17:46 +00:00
idrac.tf [ci skip] platform: add ndots=2 dns_config to all deployment pod specs 2026-02-23 22:43:05 +00:00
k8s-monitoring-values.yaml [ci skip] Move Terraform modules into stack directories 2026-02-22 14:38:14 +00:00
loki.tf feat(monitoring): Disable Loki centralized logging while preserving configuration 2026-03-17 16:51:02 +00:00
loki.yaml fix(monitoring): Expand Loki PVC from 15GB to 50GB to resolve storage exhaustion 2026-03-17 16:51:02 +00:00
main.tf resource quota review: fix OOM risks, close quota gaps, add HA protections 2026-03-08 18:17:46 +00:00
prometheus.tf [ci skip] expand Prometheus PVC to 200Gi, increase retention to 180GB for 1-year history 2026-03-06 23:16:32 +00:00
prometheus_chart_values.tpl revert MaxRequestWorkers to 50, exclude nextcloud from 5xx alert 2026-03-09 22:05:20 +00:00
prometheus_snmp_chart_values.yaml [ci skip] Move Terraform modules into stack directories 2026-02-22 14:38:14 +00:00
pve_exporter.tf [ci skip] platform: add ndots=2 dns_config to all deployment pod specs 2026-02-23 22:43:05 +00:00
snmp_exporter.tf [ci skip] platform: add ndots=2 dns_config to all deployment pod specs 2026-02-23 22:43:05 +00:00
ups_snmp_values.yaml [ci skip] Move Terraform modules into stack directories 2026-02-22 14:38:14 +00:00