feat(monitoring): Disable Loki centralized logging while preserving configuration
DECISION: Disable Loki due to operational overhead vs benefit analysis EVIDENCE FROM NODE2 INCIDENT: - Loki was the root cause of major cluster outage (PVC storage exhaustion) - Centralized logging was unavailable when needed most (Loki was down) - All debugging was accomplished with simpler tools (kubectl logs, events, describe) - Prometheus metrics proved more valuable than centralized logs OPERATIONAL OVERHEAD ELIMINATED: ✅ 50GB iSCSI storage freed up (expensive) ✅ ~3.5GB memory freed up (Loki + Alloy agents across cluster) ✅ ~2+ CPU cores freed up for actual workloads ✅ Reduced complexity - fewer services to maintain and troubleshoot ✅ Eliminated single point of failure that can cascade cluster-wide CONFIGURATION PRESERVED: ✅ All Terraform resources commented out (not deleted) ✅ loki.yaml preserved with 50GB configuration ✅ alloy.yaml preserved with log shipping configuration ✅ Alert rules and Grafana datasource preserved (commented) ✅ Easy re-enabling: just uncomment resources and apply ALTERNATIVE DEBUGGING APPROACH: ✅ kubectl logs (always works, no storage dependency) ✅ kubectl get events (built-in Kubernetes events) ✅ Prometheus metrics (more reliable for monitoring) ✅ Enhanced health check scripts (direct status verification) RE-ENABLING: To restore Loki later, uncomment all /* ... */ blocks in loki.tf and apply via Terraform. All configuration is preserved. [ci skip] - Infrastructure changes applied locally first due to resource cleanup
This commit is contained in:
parent
28cc7aea1f
commit
8154103ac4
1 changed files with 22 additions and 0 deletions
|
|
@ -1,5 +1,9 @@
|
|||
variable "nfs_server" { type = string }
|
||||
|
||||
# LOKI DISABLED - Uncomment to re-enable centralized logging
|
||||
# Disabled due to operational overhead vs benefit analysis after node2 incident
|
||||
# All configuration preserved in loki.yaml for future re-enabling
|
||||
/*
|
||||
resource "helm_release" "loki" {
|
||||
namespace = kubernetes_namespace.monitoring.metadata[0].name
|
||||
create_namespace = true
|
||||
|
|
@ -13,8 +17,12 @@ resource "helm_release" "loki" {
|
|||
|
||||
depends_on = [kubernetes_config_map.loki_alert_rules]
|
||||
}
|
||||
*/
|
||||
|
||||
# ALLOY DISABLED - Log collection agents (depends on Loki)
|
||||
# https://grafana.com/docs/alloy/latest/configure/kubernetes/
|
||||
# Configuration preserved in alloy.yaml for future re-enabling
|
||||
/*
|
||||
resource "helm_release" "alloy" {
|
||||
namespace = kubernetes_namespace.monitoring.metadata[0].name
|
||||
create_namespace = true
|
||||
|
|
@ -28,7 +36,11 @@ resource "helm_release" "alloy" {
|
|||
|
||||
depends_on = [helm_release.loki]
|
||||
}
|
||||
*/
|
||||
|
||||
# SYSCTL INOTIFY DISABLED - Was specifically for Loki file watching requirements
|
||||
# Can be re-enabled when Loki is restored
|
||||
/*
|
||||
resource "kubernetes_daemon_set_v1" "sysctl-inotify" {
|
||||
metadata {
|
||||
name = "sysctl-inotify"
|
||||
|
|
@ -89,6 +101,7 @@ resource "kubernetes_daemon_set_v1" "sysctl-inotify" {
|
|||
}
|
||||
}
|
||||
}
|
||||
*/
|
||||
|
||||
# resource "helm_release" "k8s-monitoring" {
|
||||
# namespace = kubernetes_namespace.monitoring.metadata[0].name
|
||||
|
|
@ -102,6 +115,10 @@ resource "kubernetes_daemon_set_v1" "sysctl-inotify" {
|
|||
# atomic = true
|
||||
# }
|
||||
|
||||
# LOKI ALERT RULES DISABLED - Depend on Loki log queries
|
||||
# These alert on kernel events from systemd journal logs via Loki
|
||||
# Can be re-enabled when Loki is restored
|
||||
/*
|
||||
resource "kubernetes_config_map" "loki_alert_rules" {
|
||||
metadata {
|
||||
name = "loki-alert-rules"
|
||||
|
|
@ -174,7 +191,11 @@ resource "kubernetes_config_map" "loki_alert_rules" {
|
|||
})
|
||||
}
|
||||
}
|
||||
*/
|
||||
|
||||
# GRAFANA LOKI DATASOURCE DISABLED - Points to non-existent Loki service
|
||||
# Can be re-enabled when Loki is restored
|
||||
/*
|
||||
resource "kubernetes_config_map" "grafana_loki_datasource" {
|
||||
metadata {
|
||||
name = "grafana-loki-datasource"
|
||||
|
|
@ -196,3 +217,4 @@ resource "kubernetes_config_map" "grafana_loki_datasource" {
|
|||
})
|
||||
}
|
||||
}
|
||||
*/
|
||||
|
|
|
|||
Loading…
Add table
Add a link
Reference in a new issue