Centralized Log Collection Implementation Plan

For Claude: REQUIRED SUB-SKILL: Use superpowers:executing-plans to implement this plan task-by-task.

Goal: Deploy Loki + Alloy for centralized Kubernetes log collection with 24h in-memory chunks, 7-day disk retention, and log-based alerting via existing Alertmanager.

Architecture: An Alloy DaemonSet tails pod logs on all 5 nodes and forwards them to a single-binary Loki, which holds chunks in memory (6Gi limit) for up to 24h before flushing them to NFS. The Loki ruler evaluates LogQL alert rules continuously and fires to Alertmanager. Grafana gets a Loki datasource via sidecar auto-provisioning.

Tech Stack: Terraform, Helm (Loki chart, Alloy chart), Kubernetes DaemonSet, NFS, Grafana

Design doc: docs/plans/2026-02-13-centralized-log-collection-design.md


Task 1: Add sysctl DaemonSet for inotify limits

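Alloy uses fsnotify to tail log files, and the default kernel inotify limits are low enough to cause "too many open files" errors. This DaemonSet raises the limits on every node. The values are written with sysctl -w, so they do not survive a reboot by themselves; the init container reapplies them whenever the DaemonSet pod starts on a node.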

Files:

  • Modify: modules/kubernetes/monitoring/loki.tf (replace the comment block at lines 67-71)

Step 1: Write the sysctl DaemonSet resource

Replace lines 67-71 (the comment block about sysctl) with this Terraform resource in loki.tf:

resource "kubernetes_daemon_set_v1" "sysctl-inotify" {
  metadata {
    name      = "sysctl-inotify"
    namespace = kubernetes_namespace.monitoring.metadata[0].name
    labels = {
      app = "sysctl-inotify"
    }
  }
  spec {
    selector {
      match_labels = {
        app = "sysctl-inotify"
      }
    }
    template {
      metadata {
        labels = {
          app = "sysctl-inotify"
        }
      }
      spec {
        init_container {
          name  = "sysctl"
          image = "busybox:1.37"
          command = [
            "sh", "-c",
            "sysctl -w fs.inotify.max_user_watches=1048576 && sysctl -w fs.inotify.max_user_instances=512 && sysctl -w fs.inotify.max_queued_events=1048576"
          ]
          security_context {
            privileged = true
          }
        }
        container {
          name  = "pause"
          image = "registry.k8s.io/pause:3.10"
          resources {
            requests = {
              cpu    = "1m"
              memory = "4Mi"
            }
            limits = {
              cpu    = "1m"
              memory = "4Mi"
            }
          }
        }
        host_pid = true
        toleration {
          operator = "Exists"
        }
      }
    }
  }
}
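To confirm that the limits actually took effect, the current values can be compared against the targets from the init container. The sketch below is one way to do that from a shell on a node; the function name check_inotify_limits and its optional directory argument are illustrative and not part of the plan.

```shell
# Target values are copied from the sysctl init container in this task.
# check_inotify_limits [DIR] prints one line per key and returns non-zero
# if any current value is below its target. DIR defaults to the real
# /proc path; passing another directory is only a convenience for testing.
check_inotify_limits() {
  dir="${1:-/proc/sys/fs/inotify}"
  rc=0
  for pair in max_user_watches=1048576 max_user_instances=512 max_queued_events=1048576; do
    key="${pair%%=*}"
    want="${pair#*=}"
    if [ ! -r "$dir/$key" ]; then
      echo "$key: unreadable"
      rc=1
      continue
    fi
    have="$(cat "$dir/$key")"
    if [ "$have" -ge "$want" ]; then
      echo "$key: ok ($have >= $want)"
    else
      echo "$key: low ($have < $want)"
      rc=1
    fi
  done
  return "$rc"
}
```

Running check_inotify_limits with no arguments reads the live kernel values, so it needs to run on the node itself rather than on a workstation.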

Step 2: Run terraform fmt

Run: terraform fmt -recursive modules/kubernetes/monitoring/

Step 3: Run terraform plan to verify

Run: terraform plan -target=module.kubernetes_cluster.module.monitoring -var="kube_config_path=$(pwd)/config" 2>&1 | tail -30

Expected: Plan shows 1 resource to add (kubernetes_daemon_set_v1.sysctl-inotify)

Step 4: Commit

git add modules/kubernetes/monitoring/loki.tf
git commit -m "[ci skip] Add sysctl DaemonSet for inotify limits"

Task 2: Update Loki Helm values with disk-friendly tuning

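Configure the ingester to keep chunks in memory for up to 24h (max_chunk_age), flushing streams that go idle for 12h (chunk_idle_period). Also configure the WAL on tmpfs, 7-day retention (retention_period: 168h), the ruler for alerting, and resource limits.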

Files:

  • Modify: modules/kubernetes/monitoring/loki.yaml (full rewrite)

Step 1: Write updated loki.yaml

Replace entire contents of loki.yaml with:

loki:
  commonConfig:
    replication_factor: 1
  schemaConfig:
    configs:
      - from: "2025-04-01"
        store: tsdb
        object_store: filesystem
        schema: v13
        index:
          prefix: loki_index_
          period: 24h
  ingester:
    chunk_idle_period: 12h
    max_chunk_age: 24h
    chunk_retain_period: 1m
    chunk_target_size: 1572864
    wal:
      dir: /loki-wal
  pattern_ingester:
    enabled: true
  limits_config:
    allow_structured_metadata: true
    volume_enabled: true
    retention_period: 168h
  compactor:
    retention_enabled: true
    working_directory: /loki/compactor
    compaction_interval: 1h
    delete_request_store: filesystem
  ruler:
    enable_api: true
    storage:
      type: local
      local:
        directory: /loki/rules
    alertmanager_url: http://alertmanager.monitoring.svc.cluster.local:9093
    ring:
      kvstore:
        store: inmemory
    rule_path: /loki/scratch
  storage:
    type: "filesystem"
  auth_enabled: false

minio:
  enabled: false

deploymentMode: SingleBinary

singleBinary:
  replicas: 1
  persistence:
    enabled: true
    size: 15Gi
    storageClass: ""
  extraVolumes:
    - name: wal
      emptyDir:
        medium: Memory
        sizeLimit: 2Gi
    - name: rules
      configMap:
        name: loki-alert-rules
  extraVolumeMounts:
    - name: wal
      mountPath: /loki-wal
    - name: rules
      mountPath: /loki/rules/fake
  resources:
    requests:
      cpu: 250m
      memory: 4Gi
    limits:
      cpu: "1"
      memory: 6Gi

# Zero out replica counts of other deployment modes
backend:
  replicas: 0
read:
  replicas: 0
write:
  replicas: 0
ingester:
  replicas: 0
querier:
  replicas: 0
queryFrontend:
  replicas: 0
queryScheduler:
  replicas: 0
distributor:
  replicas: 0
compactor:
  replicas: 0
indexGateway:
  replicas: 0
bloomCompactor:
  replicas: 0
bloomGateway:
  replicas: 0
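A few numbers in this file are easier to sanity-check with arithmetic. The sketch below only restates values from loki.yaml. Its last line assumes that a memory-backed emptyDir counts toward the container's memory limit, which is how Kubernetes documents medium: Memory volumes, so the result is a worst case for when the WAL volume is full. The function name loki_budget is illustrative.

```shell
# Values copied from loki.yaml above.
loki_budget() {
  retention_hours=168          # limits_config.retention_period
  chunk_target_bytes=1572864   # ingester.chunk_target_size
  memory_limit_gib=6           # singleBinary.resources.limits.memory
  wal_tmpfs_gib=2              # wal emptyDir sizeLimit (medium: Memory)

  echo "retention: $((retention_hours / 24)) days"
  echo "chunk target: $((chunk_target_bytes / 1024)) KiB"
  echo "memory left if WAL tmpfs fills: $((memory_limit_gib - wal_tmpfs_gib)) GiB"
}
```

If the last figure looks tight for 24h of chunks, the options already present in this plan are lowering max_chunk_age or shrinking the WAL sizeLimit.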

Step 2: Commit

git add modules/kubernetes/monitoring/loki.yaml
git commit -m "[ci skip] Update Loki config with disk-friendly tuning and ruler"

Task 3: Update Alloy Helm values with resource limits

The Alloy config content is already complete. Wrap it in proper Helm values with resource limits.

Files:

  • Modify: modules/kubernetes/monitoring/alloy.yaml (add resource limits)

Step 1: Add resource limits to alloy.yaml

Append after the existing alloy.configMap.content block (after the last line):


  # Resource limits for DaemonSet pods
  resources:
    requests:
      cpu: 50m
      memory: 64Mi
    limits:
      cpu: 200m
      memory: 128Mi

The final file should have the alloy.configMap.content block unchanged, with alloy.resources added as a sibling under alloy:.
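Because the placement depends entirely on indentation, a mechanical check can catch a resources block that ended up at the top level or inside configMap. The following is a rough sketch, assuming the file uses two-space indentation and a top-level alloy: key; alloy_has_resources is a hypothetical helper name.

```shell
# Returns 0 if FILE contains "resources:" indented exactly two spaces
# under the top-level "alloy:" key, non-zero otherwise.
alloy_has_resources() {
  awk '
    # Any non-indented, non-comment line starts a new top-level key.
    /^[^[:space:]#]/ { in_alloy = ($0 ~ /^alloy:/) }
    in_alloy && /^  resources:[[:space:]]*$/ { found = 1 }
    END { exit (found ? 0 : 1) }
  ' "$1"
}

# Usage:
#   alloy_has_resources modules/kubernetes/monitoring/alloy.yaml && echo "ok"
```

This is a text-level check only; it does not parse YAML, so it is no substitute for rendering the chart.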

Step 2: Commit

git add modules/kubernetes/monitoring/alloy.yaml
git commit -m "[ci skip] Add resource limits to Alloy config"

Task 4: Uncomment Loki Helm release and PV in loki.tf

Enable the Loki Helm release and its NFS persistent volume. Remove minio PV (not needed with filesystem storage).

Files:

  • Modify: modules/kubernetes/monitoring/loki.tf (uncomment Loki resources, remove minio PV)

Step 1: Uncomment the Loki Helm release (lines 1-12)

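The depends_on in this block references kubernetes_config_map.loki_alert_rules, which is not added until Task 7. terraform fmt will still pass, but terraform validate and terraform plan will report an undeclared resource until then.

Uncomment and update the helm_release to: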

resource "helm_release" "loki" {
  namespace        = kubernetes_namespace.monitoring.metadata[0].name
  create_namespace = true
  name             = "loki"

  repository = "https://grafana.github.io/helm-charts"
  chart      = "loki"

  values  = [templatefile("${path.module}/loki.yaml", {})]
  timeout = 300

  depends_on = [kubernetes_config_map.loki_alert_rules]
}

Step 2: Uncomment the Loki NFS PV (lines 14-32)

Uncomment the kubernetes_persistent_volume.loki resource as-is.

Step 3: Remove the minio PV block (lines 34-52)

Delete the entire kubernetes_persistent_volume.loki-minio commented block — minio is disabled.
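A quick way to confirm which blocks in loki.tf are live rather than commented out is to look for resource headers that are not preceded by a comment marker. This is a minimal sketch using grep; has_tf_resource is a hypothetical helper, and it only matches the header line, not the block body.

```shell
# has_tf_resource FILE TYPE NAME
# Returns 0 if FILE declares an uncommented resource "TYPE" "NAME".
has_tf_resource() {
  grep -Eq "^[[:space:]]*resource[[:space:]]+\"$2\"[[:space:]]+\"$3\"" "$1"
}

# Usage, with the resource names used in this plan:
#   f=modules/kubernetes/monitoring/loki.tf
#   has_tf_resource "$f" helm_release loki                      && echo "loki release live"
#   has_tf_resource "$f" kubernetes_persistent_volume loki      && echo "loki PV live"
#   has_tf_resource "$f" kubernetes_config_map loki_alert_rules || echo "alert rules not added yet"
```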

Step 4: Run terraform fmt

Run: terraform fmt -recursive modules/kubernetes/monitoring/

Step 5: Commit

git add modules/kubernetes/monitoring/loki.tf
git commit -m "[ci skip] Enable Loki Helm release and NFS PV"

Task 5: Uncomment Alloy Helm release in loki.tf

Enable the Alloy Helm release.

Files:

  • Modify: modules/kubernetes/monitoring/loki.tf (uncomment Alloy helm release)

Step 1: Uncomment and update the Alloy Helm release

Replace the commented Alloy block with:

resource "helm_release" "alloy" {
  namespace        = kubernetes_namespace.monitoring.metadata[0].name
  create_namespace = true
  name             = "alloy"

  repository = "https://grafana.github.io/helm-charts"
  chart      = "alloy"

  values = [file("${path.module}/alloy.yaml")]
  atomic = true

  depends_on = [helm_release.loki]
}

Step 2: Run terraform fmt

Run: terraform fmt -recursive modules/kubernetes/monitoring/

Step 3: Commit

git add modules/kubernetes/monitoring/loki.tf
git commit -m "[ci skip] Enable Alloy Helm release"

Task 6: Add Grafana Loki datasource ConfigMap

Grafana's sidecar auto-discovers ConfigMaps with label grafana_datasource: "1". Create one for Loki.

Files:

  • Modify: modules/kubernetes/monitoring/loki.tf (add ConfigMap resource)

Step 1: Add the datasource ConfigMap

Add to loki.tf:

resource "kubernetes_config_map" "grafana_loki_datasource" {
  metadata {
    name      = "grafana-loki-datasource"
    namespace = kubernetes_namespace.monitoring.metadata[0].name
    labels = {
      grafana_datasource = "1"
    }
  }
  data = {
    "loki-datasource.yaml" = yamlencode({
      apiVersion  = 1
      datasources = [{
        name      = "Loki"
        type      = "loki"
        access    = "proxy"
        url       = "http://loki.monitoring.svc.cluster.local:3100"
        isDefault = false
      }]
    })
  }
}
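Since discovery hinges on the grafana_datasource label, it can help to read that label back after the apply. The sketch below wraps a kubectl jsonpath query; it assumes kubectl is pointed at the cluster (for example via the KUBECONFIG environment variable), and datasource_label is an illustrative name.

```shell
# datasource_label NAME [NAMESPACE]
# Prints the value of the grafana_datasource label on a ConfigMap.
datasource_label() {
  kubectl get configmap "$1" -n "${2:-monitoring}" \
    -o jsonpath='{.metadata.labels.grafana_datasource}'
}

# Usage:
#   [ "$(datasource_label grafana-loki-datasource)" = "1" ] && echo "label ok"
```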

Step 2: Run terraform fmt

Run: terraform fmt -recursive modules/kubernetes/monitoring/

Step 3: Commit

git add modules/kubernetes/monitoring/loki.tf
git commit -m "[ci skip] Add Grafana Loki datasource ConfigMap"

Task 7: Add Loki alert rules ConfigMap

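Create the ConfigMap that Loki's ruler reads for alert rules. It is mounted into the Loki pod at /loki/rules/fake/; "fake" is the tenant ID Loki uses when auth_enabled is false, and the ruler's local storage expects one subdirectory per tenant.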

Files:

  • Modify: modules/kubernetes/monitoring/loki.tf (add alert rules ConfigMap)

Step 1: Add the alert rules ConfigMap

Add to loki.tf:

resource "kubernetes_config_map" "loki_alert_rules" {
  metadata {
    name      = "loki-alert-rules"
    namespace = kubernetes_namespace.monitoring.metadata[0].name
  }
  data = {
    "rules.yaml" = yamlencode({
      groups = [{
        name = "log-alerts"
        rules = [
          {
            alert = "HighErrorRate"
            expr  = "sum(rate({namespace=~\".+\"} |= \"error\" [5m])) by (namespace) > 10"
            for   = "5m"
            labels = {
              severity = "warning"
            }
            annotations = {
              summary = "High error rate in {{ $labels.namespace }}"
            }
          },
          {
            alert = "PodCrashLoopBackOff"
            expr  = "count_over_time({namespace=~\".+\"} |= \"CrashLoopBackOff\" [5m]) > 0"
            for   = "1m"
            labels = {
              severity = "critical"
            }
            annotations = {
              summary = "CrashLoopBackOff detected in {{ $labels.namespace }}"
            }
          },
          {
            alert = "OOMKilled"
            expr  = "count_over_time({namespace=~\".+\"} |= \"OOMKilled\" [5m]) > 0"
            for   = "1m"
            labels = {
              severity = "critical"
            }
            annotations = {
              summary = "OOMKilled detected in {{ $labels.namespace }}"
            }
          }
        ]
      }]
    })
  }
}
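Before relying on a rule, its expression can be tried against Loki's instant query endpoint (/loki/api/v1/query) to see what it returns. The following is a small sketch using curl; the helper name loki_instant_query and the localhost URL in the usage comment are assumptions, with the URL presuming a port-forward to the Loki service.

```shell
# loki_instant_query BASE_URL LOGQL
# Sends LOGQL to Loki's instant query endpoint and prints the JSON response.
loki_instant_query() {
  curl -sG "$1/loki/api/v1/query" --data-urlencode "query=$2"
}

# Usage, assuming a port-forward such as
#   kubectl port-forward -n monitoring svc/loki 3100:3100
# is running in another terminal:
#   loki_instant_query http://localhost:3100 \
#     'sum(rate({namespace=~".+"} |= "error" [5m])) by (namespace)'
```

Querying the expression without the "> 10" threshold shows the per-namespace rate itself, which makes it easier to judge whether the threshold is sensible for this cluster.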

Step 2: Run terraform fmt

Run: terraform fmt -recursive modules/kubernetes/monitoring/

Step 3: Commit

git add modules/kubernetes/monitoring/loki.tf
git commit -m "[ci skip] Add Loki alert rules ConfigMap"

Task 8: Deploy and verify

Apply all changes via Terraform and verify the stack is working.

Files: None (deployment only)

Step 1: Run terraform apply for monitoring module

Run: terraform apply -target=module.kubernetes_cluster.module.monitoring -var="kube_config_path=$(pwd)/config" -auto-approve

Expected: Multiple resources created (sysctl DaemonSet, Loki Helm release, Alloy Helm release, PV, ConfigMaps)

Step 2: Verify sysctl DaemonSet is running on all nodes

Run: kubectl --kubeconfig $(pwd)/config get ds -n monitoring sysctl-inotify

Expected: DESIRED=5, CURRENT=5, READY=5

Step 3: Verify Loki pod is running

Run: kubectl --kubeconfig $(pwd)/config get pods -n monitoring -l app.kubernetes.io/name=loki

Expected: 1/1 Running

Step 4: Verify Alloy DaemonSet is running

Run: kubectl --kubeconfig $(pwd)/config get ds -n monitoring -l app.kubernetes.io/name=alloy

Expected: DESIRED=5, CURRENT=5, READY=5
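The two DaemonSet checks can also be scripted by comparing desiredNumberScheduled with numberReady from the DaemonSet status. This is a sketch only: ds_ready is an illustrative helper, it assumes kubectl already targets the cluster, and the Alloy DaemonSet name in the usage comment is an assumption based on the Helm release name.

```shell
# ds_ready NAME [NAMESPACE]
# Prints "NAME: ready/desired ready" and returns 0 only when every
# scheduled pod of the DaemonSet is ready.
ds_ready() {
  status="$(kubectl get daemonset "$1" -n "${2:-monitoring}" \
    -o jsonpath='{.status.desiredNumberScheduled} {.status.numberReady}')" || return 1
  desired="${status% *}"
  ready="${status#* }"
  echo "$1: $ready/$desired ready"
  [ -n "$desired" ] && [ "$desired" -gt 0 ] && [ "$desired" = "$ready" ]
}

# Usage:
#   ds_ready sysctl-inotify
#   ds_ready alloy          # name assumed from the Helm release name
```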

Step 5: Verify Loki is receiving logs

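Run: kubectl --kubeconfig $(pwd)/config exec -n monitoring statefulset/loki -- wget -qO- 'http://localhost:3100/loki/api/v1/labels'

Expected: JSON response with labels like namespace, pod, container

Note: in SingleBinary mode the Loki chart deploys a StatefulSet, not a Deployment, so the exec target is statefulset/loki (or the pod loki-0).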

Step 6: Verify Grafana has Loki datasource

Open https://grafana.viktorbarzin.me/explore, select the "Loki" datasource, and run the query: {namespace="monitoring"}

Expected: Log lines from monitoring namespace pods

Step 7: Commit final state

git add -A
git commit -m "[ci skip] Deploy centralized log collection (Loki + Alloy)"

Troubleshooting

If Alloy pods crash with inotify errors:

  • Check sysctl DaemonSet init logs: kubectl --kubeconfig $(pwd)/config logs -n monitoring ds/sysctl-inotify -c sysctl
  • Verify sysctl values on node: kubectl --kubeconfig $(pwd)/config debug node/k8s-node2 -it --image=busybox -- sysctl fs.inotify.max_user_watches

If Loki OOMs:

  • Check memory usage: kubectl --kubeconfig $(pwd)/config top pod -n monitoring -l app.kubernetes.io/name=loki
  • Reduce max_chunk_age from 24h to 12h in loki.yaml to flush more frequently

If Grafana doesn't show Loki datasource:

  • Verify ConfigMap has correct label: kubectl --kubeconfig $(pwd)/config get cm -n monitoring grafana-loki-datasource -o yaml
  • Restart Grafana so the sidecar reloads datasources (the sidecar runs in the Grafana pod): kubectl --kubeconfig $(pwd)/config rollout restart deploy -n monitoring grafana

If Loki PV won't bind:

  • Check NFS export exists: ssh root@10.0.10.15 'showmount -e localhost | grep loki'
  • Run NFS export script: cd secrets && bash nfs_exports.sh