infra/docs/plans/2026-02-13-centralized-log-collection-plan.md

# Centralized Log Collection Implementation Plan

> **For Claude:** REQUIRED SUB-SKILL: Use superpowers:executing-plans to implement this plan task-by-task.

**Goal:** Deploy Loki + Alloy for centralized Kubernetes log collection with 24h in-memory chunks, 7-day disk retention, and log-based alerting via existing Alertmanager.

**Architecture:** An Alloy DaemonSet tails pod logs on all 5 nodes and forwards them to a single-binary Loki, which keeps chunks in memory (within a 6Gi limit) for up to 24h before flushing them to NFS. The Loki ruler evaluates LogQL alert rules on a schedule and sends firing alerts to Alertmanager. Grafana gets a Loki datasource via sidecar auto-provisioning.

**Tech Stack:** Terraform, Helm (Loki chart, Alloy chart), Kubernetes DaemonSet, NFS, Grafana

**Design doc:** `docs/plans/2026-02-13-centralized-log-collection-design.md`

---

### Task 1: Add sysctl DaemonSet for inotify limits

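Alloy uses fsnotify to tail log files. Default kernel inotify limits can cause "too many open files" errors. This DaemonSet raises the limits on every node. Values set with `sysctl -w` do not survive a reboot, but the init container reapplies them each time the pod starts on a node, including after a node restart.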

**Files:**
- Modify: `modules/kubernetes/monitoring/loki.tf` (replace the comment block at lines 67-71)

**Step 1: Write the sysctl DaemonSet resource**

Replace lines 67-71 (the comment block about sysctl) with this Terraform resource in `loki.tf`:

```hcl
resource "kubernetes_daemon_set_v1" "sysctl-inotify" {
  metadata {
    name      = "sysctl-inotify"
    namespace = kubernetes_namespace.monitoring.metadata[0].name
    labels = {
      app = "sysctl-inotify"
    }
  }
  spec {
    selector {
      match_labels = {
        app = "sysctl-inotify"
      }
    }
    template {
      metadata {
        labels = {
          app = "sysctl-inotify"
        }
      }
      spec {
        init_container {
          name  = "sysctl"
          image = "busybox:1.37"
          command = [
            "sh", "-c",
            "sysctl -w fs.inotify.max_user_watches=1048576 && sysctl -w fs.inotify.max_user_instances=512 && sysctl -w fs.inotify.max_queued_events=1048576"
          ]
          security_context {
            privileged = true
          }
        }
        container {
          name  = "pause"
          image = "registry.k8s.io/pause:3.10"
          resources {
            requests = {
              cpu    = "1m"
              memory = "4Mi"
            }
            limits = {
              cpu    = "1m"
              memory = "4Mi"
            }
          }
        }
        host_pid = true
        toleration {
          operator = "Exists"
        }
      }
    }
  }
}
```

**Step 2: Run terraform fmt**

Run: `terraform fmt -recursive modules/kubernetes/monitoring/`

**Step 3: Run terraform plan to verify**

Run: `terraform plan -target=module.kubernetes_cluster.module.monitoring -var="kube_config_path=$(pwd)/config" 2>&1 | tail -30`
Expected: Plan shows 1 resource to add (kubernetes_daemon_set_v1.sysctl-inotify)

**Step 4: Commit**

```bash
git add modules/kubernetes/monitoring/loki.tf
git commit -m "[ci skip] Add sysctl DaemonSet for inotify limits"
```

---

### Task 2: Update Loki Helm values with disk-friendly tuning

Configure ingester for 24h in-memory chunks, WAL on tmpfs, 7-day retention, ruler for alerting, and resource limits.

**Files:**
- Modify: `modules/kubernetes/monitoring/loki.yaml` (full rewrite)

**Step 1: Write updated loki.yaml**

Replace entire contents of `loki.yaml` with:

```yaml
loki:
  commonConfig:
    replication_factor: 1
  schemaConfig:
    configs:
      - from: "2025-04-01"
        store: tsdb
        object_store: filesystem
        schema: v13
        index:
          prefix: loki_index_
          period: 24h
  ingester:
    chunk_idle_period: 12h
    max_chunk_age: 24h
    chunk_retain_period: 1m
    chunk_target_size: 1572864
    wal:
      dir: /loki-wal
  pattern_ingester:
    enabled: true
  limits_config:
    allow_structured_metadata: true
    volume_enabled: true
    retention_period: 168h
  compactor:
    retention_enabled: true
    working_directory: /loki/compactor
    compaction_interval: 1h
    delete_request_store: filesystem
  rulerConfig: # chart values key that renders into Loki's `ruler:` config block
    enable_api: true
    storage:
      type: local
      local:
        directory: /loki/rules
    alertmanager_url: http://alertmanager.monitoring.svc.cluster.local:9093
    ring:
      kvstore:
        store: inmemory
    rule_path: /loki/scratch
  storage:
    type: "filesystem"
  auth_enabled: false

minio:
  enabled: false

deploymentMode: SingleBinary

singleBinary:
  replicas: 1
  persistence:
    enabled: true
    size: 15Gi
    storageClass: ""
  extraVolumes:
    - name: wal
      emptyDir:
        medium: Memory
        sizeLimit: 2Gi
    - name: rules
      configMap:
        name: loki-alert-rules
  extraVolumeMounts:
    - name: wal
      mountPath: /loki-wal
    - name: rules
      mountPath: /loki/rules/fake
  resources:
    requests:
      cpu: 250m
      memory: 4Gi
    limits:
      cpu: "1"
      memory: 6Gi

# Zero out replica counts of other deployment modes
backend:
  replicas: 0
read:
  replicas: 0
write:
  replicas: 0
ingester:
  replicas: 0
querier:
  replicas: 0
queryFrontend:
  replicas: 0
queryScheduler:
  replicas: 0
distributor:
  replicas: 0
compactor:
  replicas: 0
indexGateway:
  replicas: 0
bloomCompactor:
  replicas: 0
bloomGateway:
  replicas: 0
```
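
As a sanity check on the numbers in `loki.yaml`, the arithmetic below converts the 7-day retention to hours and `chunk_target_size` to KiB, then divides the 6Gi memory limit by the chunk target size. The last figure is only a rough ceiling under the assumption that memory holds nothing but full-size chunks; it ignores index, WAL replay and query overhead, so treat it as an order of magnitude rather than a capacity plan.

```bash
# Values copied from loki.yaml.
retention_days=7
chunk_target_size=1572864                   # bytes (ingester.chunk_target_size)
memory_limit_kib=$((6 * 1024 * 1024))       # singleBinary limits.memory: 6Gi, in KiB

chunk_target_kib=$((chunk_target_size / 1024))

echo "retention_period: $((retention_days * 24))h"                             # 168h
echo "chunk_target_size: ${chunk_target_kib} KiB"                              # 1536 KiB (1.5 MiB)
echo "target-size chunks per limit: $((memory_limit_kib / chunk_target_kib))"  # 4096
```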

**Step 2: Commit**

```bash
git add modules/kubernetes/monitoring/loki.yaml
git commit -m "[ci skip] Update Loki config with disk-friendly tuning and ruler"
```

---

### Task 3: Update Alloy Helm values with resource limits

The Alloy config content is already complete. Wrap it in proper Helm values with resource limits.

**Files:**
- Modify: `modules/kubernetes/monitoring/alloy.yaml` (add resource limits)

**Step 1: Add resource limits to alloy.yaml**

Append after the existing `alloy.configMap.content` block (after the last line):

```yaml

  # Resource limits for DaemonSet pods
  resources:
    requests:
      cpu: 50m
      memory: 64Mi
    limits:
      cpu: 200m
      memory: 128Mi
```

The final file should have the `alloy.configMap.content` block unchanged, with `alloy.resources` added as a sibling under `alloy:`.

**Step 2: Commit**

```bash
git add modules/kubernetes/monitoring/alloy.yaml
git commit -m "[ci skip] Add resource limits to Alloy config"
```

---

### Task 4: Uncomment Loki Helm release and PV in loki.tf

Enable the Loki Helm release and its NFS persistent volume. Remove the commented-out minio PV, which is not needed with filesystem storage.

**Files:**
- Modify: `modules/kubernetes/monitoring/loki.tf` (uncomment Loki resources, remove minio PV)

**Step 1: Uncomment the Loki Helm release (lines 1-12)**

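Uncomment and update the helm_release as follows. The `depends_on` target, `kubernetes_config_map.loki_alert_rules`, is only added in Task 7, so `terraform plan` will report an undeclared resource until that task is done: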

```hcl
resource "helm_release" "loki" {
  namespace        = kubernetes_namespace.monitoring.metadata[0].name
  create_namespace = true
  name             = "loki"

  repository = "https://grafana.github.io/helm-charts"
  chart      = "loki"

  values  = [templatefile("${path.module}/loki.yaml", {})]
  timeout = 300

  depends_on = [kubernetes_config_map.loki_alert_rules]
}
```

**Step 2: Uncomment the Loki NFS PV (lines 14-32)**

Uncomment the `kubernetes_persistent_volume.loki` resource as-is.

**Step 3: Remove the minio PV block (lines 34-52)**

Delete the entire `kubernetes_persistent_volume.loki-minio` commented block — minio is disabled.

**Step 4: Run terraform fmt**

Run: `terraform fmt -recursive modules/kubernetes/monitoring/`

**Step 5: Commit**

```bash
git add modules/kubernetes/monitoring/loki.tf
git commit -m "[ci skip] Enable Loki Helm release and NFS PV"
```
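
Before applying, it can help to confirm that the settings from `loki.yaml` actually reach the rendered Loki config, since Helm silently ignores values under keys the chart does not read. The sketch below is one way to do that: `check_rendered_loki_config` is a hypothetical helper (not part of the repo) that reads rendered manifests on stdin and greps for a few strings this plan expects. The list of strings is an assumption about how the chart serialises the values.

```bash
# Hypothetical helper: reads rendered manifests on stdin and reports
# whether a few settings from loki.yaml made it into the output.
check_rendered_loki_config() {
  rendered="$(cat)"
  status=0
  for needle in "alertmanager_url:" "retention_period: 168h" "max_chunk_age: 24h" "dir: /loki-wal"; do
    if printf '%s\n' "$rendered" | grep -qF -- "$needle"; then
      echo "found:   $needle"
    else
      echo "missing: $needle"
      status=1
    fi
  done
  return "$status"
}

# Usage (needs network access to the chart repository):
#   helm repo add grafana https://grafana.github.io/helm-charts
#   helm template loki grafana/loki --namespace monitoring \
#     -f modules/kubernetes/monitoring/loki.yaml | check_rendered_loki_config
```

A `missing:` line means the value was dropped during rendering and the corresponding key in `loki.yaml` should be checked against the chart's values reference.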

---

### Task 5: Uncomment Alloy Helm release in loki.tf

Enable the Alloy Helm release.

**Files:**
- Modify: `modules/kubernetes/monitoring/loki.tf` (uncomment Alloy helm release)

**Step 1: Uncomment and update the Alloy Helm release**

Replace the commented Alloy block with:

```hcl
resource "helm_release" "alloy" {
  namespace        = kubernetes_namespace.monitoring.metadata[0].name
  create_namespace = true
  name             = "alloy"

  repository = "https://grafana.github.io/helm-charts"
  chart      = "alloy"

  values = [file("${path.module}/alloy.yaml")]
  atomic = true

  depends_on = [helm_release.loki]
}
```

**Step 2: Run terraform fmt**

Run: `terraform fmt -recursive modules/kubernetes/monitoring/`

**Step 3: Commit**

```bash
git add modules/kubernetes/monitoring/loki.tf
git commit -m "[ci skip] Enable Alloy Helm release"
```

---

### Task 6: Add Grafana Loki datasource ConfigMap

Grafana's sidecar auto-discovers ConfigMaps with label `grafana_datasource: "1"`. Create one for Loki.

**Files:**
- Modify: `modules/kubernetes/monitoring/loki.tf` (add ConfigMap resource)

**Step 1: Add the datasource ConfigMap**

Add to `loki.tf`:

```hcl
resource "kubernetes_config_map" "grafana_loki_datasource" {
  metadata {
    name      = "grafana-loki-datasource"
    namespace = kubernetes_namespace.monitoring.metadata[0].name
    labels = {
      grafana_datasource = "1"
    }
  }
  data = {
    "loki-datasource.yaml" = yamlencode({
      apiVersion  = 1
      datasources = [{
        name      = "Loki"
        type      = "loki"
        access    = "proxy"
        url       = "http://loki.monitoring.svc.cluster.local:3100"
        isDefault = false
      }]
    })
  }
}
```

**Step 2: Run terraform fmt**

Run: `terraform fmt -recursive modules/kubernetes/monitoring/`

**Step 3: Commit**

```bash
git add modules/kubernetes/monitoring/loki.tf
git commit -m "[ci skip] Add Grafana Loki datasource ConfigMap"
```

---

### Task 7: Add Loki alert rules ConfigMap

Create the ConfigMap that Loki's ruler reads for alert rules. It is mounted into the Loki pod at `/loki/rules/fake/`; `fake` is the tenant ID Loki uses when `auth_enabled` is `false`.

**Files:**
- Modify: `modules/kubernetes/monitoring/loki.tf` (add alert rules ConfigMap)

**Step 1: Add the alert rules ConfigMap**

Add to `loki.tf`:

```hcl
resource "kubernetes_config_map" "loki_alert_rules" {
  metadata {
    name      = "loki-alert-rules"
    namespace = kubernetes_namespace.monitoring.metadata[0].name
  }
  data = {
    "rules.yaml" = yamlencode({
      groups = [{
        name = "log-alerts"
        rules = [
          {
            alert = "HighErrorRate"
            expr  = "sum(rate({namespace=~\".+\"} |= \"error\" [5m])) by (namespace) > 10"
            for   = "5m"
            labels = {
              severity = "warning"
            }
            annotations = {
              summary = "High error rate in {{ $labels.namespace }}"
            }
          },
          {
            alert = "PodCrashLoopBackOff"
            expr  = "count_over_time({namespace=~\".+\"} |= \"CrashLoopBackOff\" [5m]) > 0"
            for   = "1m"
            labels = {
              severity = "critical"
            }
            annotations = {
              summary = "CrashLoopBackOff detected in {{ $labels.namespace }}"
            }
          },
          {
            alert = "OOMKilled"
            expr  = "count_over_time({namespace=~\".+\"} |= \"OOMKilled\" [5m]) > 0"
            for   = "1m"
            labels = {
              severity = "critical"
            }
            annotations = {
              summary = "OOMKilled detected in {{ $labels.namespace }}"
            }
          }
        ]
      }]
    })
  }
}
```
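
To make the `HighErrorRate` threshold concrete: LogQL's `rate()` returns entries per second, so `> 10` over a `[5m]` range means a namespace has to average more than 10 matching lines per second across the window. The small sketch below turns a per-second threshold and a window into a line count; `lines_for_rate` is a hypothetical helper used only for illustration.

```bash
# Hypothetical helper: number of matching log lines within a window that
# corresponds to a given per-second rate threshold.
lines_for_rate() {
  threshold_per_sec="$1"
  window_sec="$2"
  echo $((threshold_per_sec * window_sec))
}

lines_for_rate 10 300   # HighErrorRate: 10/s over 5m, prints 3000
```

In other words, the rule fires only for namespaces logging more than roughly 3000 lines containing "error" every five minutes, sustained for the additional `for = "5m"` period.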

**Step 2: Run terraform fmt**

Run: `terraform fmt -recursive modules/kubernetes/monitoring/`

**Step 3: Commit**

```bash
git add modules/kubernetes/monitoring/loki.tf
git commit -m "[ci skip] Add Loki alert rules ConfigMap"
```

---

### Task 8: Deploy and verify

Apply all changes via Terraform and verify the stack is working.

**Files:** None (deployment only)

**Step 1: Run terraform apply for monitoring module**

Run: `terraform apply -target=module.kubernetes_cluster.module.monitoring -var="kube_config_path=$(pwd)/config" -auto-approve`
Expected: Multiple resources created (sysctl DaemonSet, Loki Helm release, Alloy Helm release, PV, ConfigMaps)

**Step 2: Verify sysctl DaemonSet is running on all nodes**

Run: `kubectl --kubeconfig $(pwd)/config get ds -n monitoring sysctl-inotify`
Expected: DESIRED=5, CURRENT=5, READY=5

**Step 3: Verify Loki pod is running**

Run: `kubectl --kubeconfig $(pwd)/config get pods -n monitoring -l app.kubernetes.io/name=loki`
Expected: 1/1 Running

**Step 4: Verify Alloy DaemonSet is running**

Run: `kubectl --kubeconfig $(pwd)/config get ds -n monitoring -l app.kubernetes.io/name=alloy`
Expected: DESIRED=5, CURRENT=5, READY=5

**Step 5: Verify Loki is receiving logs**

Run: `kubectl --kubeconfig $(pwd)/config exec -n monitoring statefulset/loki -- wget -qO- 'http://localhost:3100/loki/api/v1/labels'` (the chart deploys single-binary Loki as a StatefulSet, not a Deployment)
Expected: JSON response with labels like `namespace`, `pod`, `container`

**Step 6: Verify Grafana has Loki datasource**

Open `https://grafana.viktorbarzin.me/explore`, select "Loki" datasource, run query: `{namespace="monitoring"}`
Expected: Log lines from monitoring namespace pods
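
The steps above do not cover the ruler. One possible check, sketched below, is to list the loaded rules through Loki's ruler API (`/loki/api/v1/rules`, available because `enable_api: true` is set) and confirm that the three alerts from Task 7 are present. `missing_alerts` is a hypothetical helper, and the port-forward to `svc/loki` assumes the service name matches the datasource URL used in Task 6.

```bash
# Hypothetical helper: reads the ruler API response (YAML) on stdin and
# reports any alert from Task 7 that is not present.
missing_alerts() {
  rules="$(cat)"
  status=0
  for name in HighErrorRate PodCrashLoopBackOff OOMKilled; do
    if ! printf '%s\n' "$rules" | grep -q "alert: $name"; then
      echo "not loaded: $name"
      status=1
    fi
  done
  return "$status"
}

# Usage:
#   kubectl --kubeconfig $(pwd)/config port-forward -n monitoring svc/loki 3100:3100 &
#   curl -s http://localhost:3100/loki/api/v1/rules | missing_alerts
```

No output and a zero exit status mean all three rules were found in the response.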

**Step 7: Commit final state**

```bash
git add -A
git commit -m "[ci skip] Deploy centralized log collection (Loki + Alloy)"
```

---

### Troubleshooting

**If Alloy pods crash with inotify errors:**
- Check sysctl DaemonSet init logs: `kubectl --kubeconfig $(pwd)/config logs -n monitoring ds/sysctl-inotify -c sysctl`
- Verify sysctl values on node: `kubectl --kubeconfig $(pwd)/config debug node/k8s-node2 -it --image=busybox -- sysctl fs.inotify.max_user_watches`
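
To check all three limits at once rather than one sysctl at a time, something like the following could be run on the node or inside a node debug pod. It is a minimal sketch: `check_inotify_limits` and the `PROC_ROOT` override are hypothetical names introduced here, and the target values are the ones set by the init container in Task 1.

```bash
# Hypothetical helper: compare live inotify limits against the targets
# set by the sysctl init container. PROC_ROOT defaults to /proc/sys and
# exists only so the check can be pointed at a different mount.
check_inotify_limits() {
  proc_root="${PROC_ROOT:-/proc/sys}"
  status=0
  for pair in max_user_watches=1048576 max_user_instances=512 max_queued_events=1048576; do
    key="${pair%%=*}"
    want="${pair##*=}"
    file="$proc_root/fs/inotify/$key"
    if [ ! -r "$file" ]; then
      echo "fs.inotify.$key: unreadable ($file)"
      status=1
      continue
    fi
    have="$(cat "$file")"
    if [ "$have" -ge "$want" ]; then
      echo "fs.inotify.$key: $have (ok)"
    else
      echo "fs.inotify.$key: $have (below target $want)"
      status=1
    fi
  done
  return "$status"
}
```

A non-zero exit status means at least one limit is below its target, which points at the sysctl DaemonSet not having run on that node.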

**If Loki OOMs:**
- Check memory usage: `kubectl --kubeconfig $(pwd)/config top pod -n monitoring -l app.kubernetes.io/name=loki`
- Reduce `max_chunk_age` from 24h to 12h in `loki.yaml` to flush more frequently

**If Grafana doesn't show Loki datasource:**
- Verify ConfigMap has correct label: `kubectl --kubeconfig $(pwd)/config get cm -n monitoring grafana-loki-datasource -o yaml`
- Restart the Grafana deployment, which also restarts the datasource sidecar: `kubectl --kubeconfig $(pwd)/config rollout restart deploy -n monitoring grafana`

**If Loki PV won't bind:**
- Check NFS export exists: `ssh root@10.0.10.15 'showmount -e localhost | grep loki'`
- Run NFS export script: `cd secrets && bash nfs_exports.sh`