infra

OpenClaw 8154103ac4 feat(monitoring): Disable Loki centralized logging while preserving configuration	2026-03-17 16:51:02 +00:00

DECISION: Disable Loki due to operational overhead vs benefit analysis

EVIDENCE FROM NODE2 INCIDENT:
- Loki was the root cause of major cluster outage (PVC storage exhaustion)
- Centralized logging was unavailable when needed most (Loki was down)
- All debugging was accomplished with simpler tools (kubectl logs, events, describe)
- Prometheus metrics proved more valuable than centralized logs

OPERATIONAL OVERHEAD ELIMINATED:
✅ 50GB iSCSI storage freed up (expensive)
✅ ~3.5GB memory freed up (Loki + Alloy agents across cluster)
✅ ~2+ CPU cores freed up for actual workloads
✅ Reduced complexity - fewer services to maintain and troubleshoot
✅ Eliminated single point of failure that can cascade cluster-wide

CONFIGURATION PRESERVED:
✅ All Terraform resources commented out (not deleted)
✅ loki.yaml preserved with 50GB configuration
✅ alloy.yaml preserved with log shipping configuration
✅ Alert rules and Grafana datasource preserved (commented)
✅ Easy re-enabling: just uncomment resources and apply

ALTERNATIVE DEBUGGING APPROACH:
✅ kubectl logs (always works, no storage dependency)
✅ kubectl get events (built-in Kubernetes events)
✅ Prometheus metrics (more reliable for monitoring)
✅ Enhanced health check scripts (direct status verification)

RE-ENABLING:
To restore Loki later, uncomment all /* ... */ blocks in loki.tf and apply via Terraform. All configuration is preserved.

[ci skip] - Infrastructure changes applied locally first due to resource cleanup
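
The kubectl-based debugging path named in the commit message could be sketched roughly as follows. This is an illustration only, not a script from this repository: the function name `debug_pod_cmds`, the `monitoring` namespace, the pod name, and the `--tail=200` limit are all assumptions. The function only prints the commands, so nothing touches the cluster until the output is reviewed and piped to a shell.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the three kubectl commands the commit message
# lists as the Loki-free debugging path (logs, events, describe).
# Arguments: $1 = namespace, $2 = pod name (both placeholders chosen by the caller).
debug_pod_cmds() {
  local ns="$1" pod="$2"
  # Recent container logs, read straight from the kubelet (no log storage needed).
  printf 'kubectl logs %s -n %s --tail=200\n' "$pod" "$ns"
  # Built-in Kubernetes events for the namespace, oldest first.
  printf 'kubectl get events -n %s --sort-by=.lastTimestamp\n' "$ns"
  # Pod status, conditions, and recent events in one view.
  printf 'kubectl describe pod %s -n %s\n' "$pod" "$ns"
}
```

For example, `debug_pod_cmds monitoring example-pod` prints the three commands for review, and `debug_pod_cmds monitoring example-pod | sh` would run them against the current kubectl context.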
modules	feat(monitoring): Disable Loki centralized logging while preserving configuration	2026-03-17 16:51:02 +00:00
.gitkeep	[ci skip] Add Terragrunt directory skeleton and root config	2026-02-22 13:01:37 +00:00
.terraform.lock.hcl	Woodpecker CI deploy commit [CI SKIP]	2026-03-07 20:47:22 +00:00
backend.tf	Woodpecker CI deploy commit [CI SKIP]	2026-03-07 20:47:22 +00:00
main.tf	deploy Sealed Secrets controller for encrypted secret management	2026-03-08 19:49:48 +00:00
providers.tf	[ci skip] fix false-positive sensitive=true on kube_config_path	2026-03-07 15:48:19 +00:00
redis-25.3.2.tgz	[ci skip] add auto-generated tiers.tf, planning docs, and helm chart cache	2026-03-06 23:55:57 +00:00
secrets	[ci skip] Migrate 22 platform service states to stacks/platform	2026-02-22 13:35:10 +00:00
terragrunt.hcl	[ci skip] Add platform stack (core services) for Terragrunt migration	2026-02-22 13:21:09 +00:00
tiers.tf	[ci skip] Phase 1: PostgreSQL migrated to CNPG on local disk	2026-02-28 19:08:06 +00:00