infra

OpenClaw 8154103ac4	feat(monitoring): Disable Loki centralized logging while preserving configuration	2026-03-17 16:51:02 +00:00

DECISION: Disable Loki due to operational overhead vs benefit analysis

EVIDENCE FROM NODE2 INCIDENT:
- Loki was the root cause of major cluster outage (PVC storage exhaustion)
- Centralized logging was unavailable when needed most (Loki was down)
- All debugging was accomplished with simpler tools (kubectl logs, events, describe)
- Prometheus metrics proved more valuable than centralized logs

OPERATIONAL OVERHEAD ELIMINATED:
✅ 50GB iSCSI storage freed up (expensive)
✅ ~3.5GB memory freed up (Loki + Alloy agents across cluster)
✅ ~2+ CPU cores freed up for actual workloads
✅ Reduced complexity - fewer services to maintain and troubleshoot
✅ Eliminated single point of failure that can cascade cluster-wide

CONFIGURATION PRESERVED:
✅ All Terraform resources commented out (not deleted)
✅ loki.yaml preserved with 50GB configuration
✅ alloy.yaml preserved with log shipping configuration
✅ Alert rules and Grafana datasource preserved (commented)
✅ Easy re-enabling: just uncomment resources and apply

ALTERNATIVE DEBUGGING APPROACH:
✅ kubectl logs (always works, no storage dependency)
✅ kubectl get events (built-in Kubernetes events)
✅ Prometheus metrics (more reliable for monitoring)
✅ Enhanced health check scripts (direct status verification)

RE-ENABLING:
To restore Loki later, uncomment all /* ... */ blocks in loki.tf and apply via Terraform. All configuration is preserved.

[ci skip] - Infrastructure changes applied locally first due to resource cleanup

dashboards	[ci skip] color only public IPs red in service map, private IPs (10.x, 192.168.x) get light blue	2026-02-28 19:44:16 +00:00
server-power-cycle	[ci skip] Move Terraform modules into stack directories	2026-02-22 14:38:14 +00:00
alloy.yaml	[ci skip] Infrastructure hardening: security, monitoring, reliability, maintainability	2026-02-23 22:05:28 +00:00
caretta.tf	resource quota review: fix OOM risks, close quota gaps, add HA protections	2026-03-08 18:17:46 +00:00
Dockerfile	[ci skip] Move Terraform modules into stack directories	2026-02-22 14:38:14 +00:00
goflow2.tf	[ci skip] fix caretta helm values and goflow2 transport args	2026-02-28 18:51:02 +00:00
grafana.tf	[ci skip] fix: add mount_options to all NFS PVs (soft,timeo=30,retrans=3)	2026-03-02 20:23:36 +00:00
grafana_chart_values.yaml	resource quota review: fix OOM risks, close quota gaps, add HA protections	2026-03-08 18:17:46 +00:00
idrac.tf	[ci skip] platform: add ndots=2 dns_config to all deployment pod specs	2026-02-23 22:43:05 +00:00
k8s-monitoring-values.yaml	[ci skip] Move Terraform modules into stack directories	2026-02-22 14:38:14 +00:00
loki.tf	feat(monitoring): Disable Loki centralized logging while preserving configuration	2026-03-17 16:51:02 +00:00
loki.yaml	fix(monitoring): Expand Loki PVC from 15GB to 50GB to resolve storage exhaustion	2026-03-17 16:51:02 +00:00
main.tf	resource quota review: fix OOM risks, close quota gaps, add HA protections	2026-03-08 18:17:46 +00:00
prometheus.tf	[ci skip] expand Prometheus PVC to 200Gi, increase retention to 180GB for 1-year history	2026-03-06 23:16:32 +00:00
prometheus_chart_values.tpl	revert MaxRequestWorkers to 50, exclude nextcloud from 5xx alert	2026-03-09 22:05:20 +00:00
prometheus_snmp_chart_values.yaml	[ci skip] Move Terraform modules into stack directories	2026-02-22 14:38:14 +00:00
pve_exporter.tf	[ci skip] platform: add ndots=2 dns_config to all deployment pod specs	2026-02-23 22:43:05 +00:00
snmp_exporter.tf	[ci skip] platform: add ndots=2 dns_config to all deployment pod specs	2026-02-23 22:43:05 +00:00
ups_snmp_values.yaml	[ci skip] Move Terraform modules into stack directories	2026-02-22 14:38:14 +00:00