Centralized Log Collection Implementation Plan
For Claude: REQUIRED SUB-SKILL: Use superpowers:executing-plans to implement this plan task-by-task.
Goal: Deploy Loki + Alloy for centralized Kubernetes log collection with 24h in-memory chunks, 7-day disk retention, and log-based alerting via existing Alertmanager.
Architecture: Alloy DaemonSet tails pod logs on all 5 nodes, forwards to single-binary Loki which holds chunks in 6Gi RAM for 24h before flushing to NFS. Loki Ruler evaluates LogQL alert rules in real-time and fires to Alertmanager. Grafana gets a Loki datasource via sidecar auto-provisioning.
Tech Stack: Terraform, Helm (Loki chart, Alloy chart), Kubernetes DaemonSet, NFS, Grafana
Design doc: docs/plans/2026-02-13-centralized-log-collection-design.md
Task 1: Add sysctl DaemonSet for inotify limits
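Alloy uses fsnotify (inotify) to tail log files, and the default kernel inotify limits can cause "too many open files" errors. This DaemonSet raises the limits on every node. Values set with sysctl -w do not survive a reboot on their own, but the init container runs again whenever the pod is restarted on a node, so the limits are reapplied after each boot.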
Files:
- Modify: modules/kubernetes/monitoring/loki.tf (replace the comment block at lines 67-71)
Step 1: Write the sysctl DaemonSet resource
Replace lines 67-71 (the comment block about sysctl) with this Terraform resource in loki.tf:
resource "kubernetes_daemon_set_v1" "sysctl-inotify" {
  metadata {
    name      = "sysctl-inotify"
    namespace = kubernetes_namespace.monitoring.metadata[0].name
    labels = {
      app = "sysctl-inotify"
    }
  }
  spec {
    selector {
      match_labels = {
        app = "sysctl-inotify"
      }
    }
    template {
      metadata {
        labels = {
          app = "sysctl-inotify"
        }
      }
      spec {
        init_container {
          name  = "sysctl"
          image = "busybox:1.37"
          command = [
            "sh", "-c",
            "sysctl -w fs.inotify.max_user_watches=1048576 && sysctl -w fs.inotify.max_user_instances=512 && sysctl -w fs.inotify.max_queued_events=1048576"
          ]
          security_context {
            privileged = true
          }
        }
        container {
          name  = "pause"
          image = "registry.k8s.io/pause:3.10"
          resources {
            requests = {
              cpu    = "1m"
              memory = "4Mi"
            }
            limits = {
              cpu    = "1m"
              memory = "4Mi"
            }
          }
        }
        host_pid = true
        toleration {
          operator = "Exists"
        }
      }
    }
  }
}
Step 2: Run terraform fmt
Run: terraform fmt -recursive modules/kubernetes/monitoring/
Step 3: Run terraform plan to verify
Run: terraform plan -target=module.kubernetes_cluster.module.monitoring -var="kube_config_path=$(pwd)/config" 2>&1 | tail -30
Expected: Plan shows 1 resource to add (kubernetes_daemon_set_v1.sysctl-inotify)
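The expected result can also be checked mechanically. The following is a minimal sketch, assuming the plan output ends with Terraform's usual summary line ("Plan: N to add, N to change, N to destroy."); the function name plan_add_count is hypothetical and not part of the repository. When piping real plan output into it, adding -no-color to the terraform plan command keeps ANSI escapes out of the line.

```shell
# plan_add_count: read `terraform plan` output on stdin and print the number
# of resources to add, taken from the "Plan: N to add, ..." summary line.
# Prints nothing when no summary line is present (for example "No changes.").
plan_add_count() {
  sed -n 's/^Plan: \([0-9][0-9]*\) to add.*/\1/p'
}

# Example with a canned summary line instead of a live plan:
printf 'Plan: 1 to add, 0 to change, 0 to destroy.\n' | plan_add_count
```

A value other than 1 at this step suggests that more than the sysctl DaemonSet was picked up by the plan.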
Step 4: Commit
git add modules/kubernetes/monitoring/loki.tf
git commit -m "[ci skip] Add sysctl DaemonSet for inotify limits"
Task 2: Update Loki Helm values with disk-friendly tuning
Configure ingester for 24h in-memory chunks, WAL on tmpfs, 7-day retention, ruler for alerting, and resource limits.
Files:
- Modify: modules/kubernetes/monitoring/loki.yaml (full rewrite)
Step 1: Write updated loki.yaml
Replace entire contents of loki.yaml with:
loki:
  commonConfig:
    replication_factor: 1
  schemaConfig:
    configs:
      - from: "2025-04-01"
        store: tsdb
        object_store: filesystem
        schema: v13
        index:
          prefix: loki_index_
          period: 24h
  ingester:
    chunk_idle_period: 12h
    max_chunk_age: 24h
    chunk_retain_period: 1m
    chunk_target_size: 1572864
    wal:
      dir: /loki-wal
  pattern_ingester:
    enabled: true
  limits_config:
    allow_structured_metadata: true
    volume_enabled: true
    retention_period: 168h
  compactor:
    retention_enabled: true
    working_directory: /loki/compactor
    compaction_interval: 1h
    delete_request_store: filesystem
  ruler:
    enable_api: true
    storage:
      type: local
      local:
        directory: /loki/rules
    alertmanager_url: http://alertmanager.monitoring.svc.cluster.local:9093
    ring:
      kvstore:
        store: inmemory
    rule_path: /loki/scratch
  storage:
    type: "filesystem"
  auth_enabled: false
minio:
  enabled: false
deploymentMode: SingleBinary
singleBinary:
  replicas: 1
  persistence:
    enabled: true
    size: 15Gi
    storageClass: ""
  extraVolumes:
    - name: wal
      emptyDir:
        medium: Memory
        sizeLimit: 2Gi
    - name: rules
      configMap:
        name: loki-alert-rules
  extraVolumeMounts:
    - name: wal
      mountPath: /loki-wal
    - name: rules
      mountPath: /loki/rules/fake
  resources:
    requests:
      cpu: 250m
      memory: 4Gi
    limits:
      cpu: "1"
      memory: 6Gi
# Zero out replica counts of other deployment modes
backend:
  replicas: 0
read:
  replicas: 0
write:
  replicas: 0
ingester:
  replicas: 0
querier:
  replicas: 0
queryFrontend:
  replicas: 0
queryScheduler:
  replicas: 0
distributor:
  replicas: 0
compactor:
  replicas: 0
indexGateway:
  replicas: 0
bloomCompactor:
  replicas: 0
bloomGateway:
  replicas: 0
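The numeric settings in loki.yaml are easier to review once converted to familiar units. The sketch below uses only shell arithmetic; the helper names are hypothetical. The last line rests on an assumption worth verifying for this cluster: that a medium: Memory emptyDir is charged against the container's memory limit, so a full 2Gi WAL volume would leave roughly 4Gi of the 6Gi limit for chunks and queries.

```shell
# hours_to_days HOURS: convert a duration in hours to whole days.
hours_to_days() { echo $(( $1 / 24 )); }

# bytes_to_kib BYTES: convert a byte count to KiB.
bytes_to_kib() { echo $(( $1 / 1024 )); }

echo "retention_period 168h     = $(hours_to_days 168) days"
echo "chunk_target_size 1572864 = $(bytes_to_kib 1572864) KiB (1.5 MiB)"
echo "memory limit 6Gi - WAL 2Gi = $(( 6 - 2 ))Gi (assumes tmpfs counts against the limit)"
```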
Step 2: Commit
git add modules/kubernetes/monitoring/loki.yaml
git commit -m "[ci skip] Update Loki config with disk-friendly tuning and ruler"
Task 3: Update Alloy Helm values with resource limits
The Alloy config content is already complete. Wrap it in proper Helm values with resource limits.
Files:
- Modify: modules/kubernetes/monitoring/alloy.yaml (add resource limits)
Step 1: Add resource limits to alloy.yaml
Append after the existing alloy.configMap.content block (after the last line):
  # Resource limits for DaemonSet pods
  resources:
    requests:
      cpu: 50m
      memory: 64Mi
    limits:
      cpu: 200m
      memory: 128Mi
The final file should have the alloy.configMap.content block unchanged, with alloy.resources added as a sibling under alloy:.
Step 2: Commit
git add modules/kubernetes/monitoring/alloy.yaml
git commit -m "[ci skip] Add resource limits to Alloy config"
Task 4: Uncomment Loki Helm release and PV in loki.tf
Enable the Loki Helm release and its NFS persistent volume. Remove minio PV (not needed with filesystem storage).
Files:
- Modify: modules/kubernetes/monitoring/loki.tf (uncomment Loki resources, remove minio PV)
Step 1: Uncomment the Loki Helm release (lines 1-12)
Uncomment and update the helm_release to:
resource "helm_release" "loki" {
  namespace        = kubernetes_namespace.monitoring.metadata[0].name
  create_namespace = true
  name             = "loki"
  repository       = "https://grafana.github.io/helm-charts"
  chart            = "loki"
  values           = [templatefile("${path.module}/loki.yaml", {})]
  timeout          = 300
  depends_on       = [kubernetes_config_map.loki_alert_rules]
}
Step 2: Uncomment the Loki NFS PV (lines 14-32)
Uncomment the kubernetes_persistent_volume.loki resource as-is.
Step 3: Remove the minio PV block (lines 34-52)
Delete the entire kubernetes_persistent_volume.loki-minio commented block — minio is disabled.
Step 4: Run terraform fmt
Run: terraform fmt -recursive modules/kubernetes/monitoring/
Step 5: Commit
git add modules/kubernetes/monitoring/loki.tf
git commit -m "[ci skip] Enable Loki Helm release and NFS PV"
Task 5: Uncomment Alloy Helm release in loki.tf
Enable the Alloy Helm release.
Files:
- Modify: modules/kubernetes/monitoring/loki.tf (uncomment Alloy helm release)
Step 1: Uncomment and update the Alloy Helm release
Replace the commented Alloy block with:
resource "helm_release" "alloy" {
  namespace        = kubernetes_namespace.monitoring.metadata[0].name
  create_namespace = true
  name             = "alloy"
  repository       = "https://grafana.github.io/helm-charts"
  chart            = "alloy"
  values           = [file("${path.module}/alloy.yaml")]
  atomic           = true
  depends_on       = [helm_release.loki]
}
Step 2: Run terraform fmt
Run: terraform fmt -recursive modules/kubernetes/monitoring/
Step 3: Commit
git add modules/kubernetes/monitoring/loki.tf
git commit -m "[ci skip] Enable Alloy Helm release"
Task 6: Add Grafana Loki datasource ConfigMap
Grafana's sidecar auto-discovers ConfigMaps with label grafana_datasource: "1". Create one for Loki.
Files:
- Modify: modules/kubernetes/monitoring/loki.tf (add ConfigMap resource)
Step 1: Add the datasource ConfigMap
Add to loki.tf:
resource "kubernetes_config_map" "grafana_loki_datasource" {
  metadata {
    name      = "grafana-loki-datasource"
    namespace = kubernetes_namespace.monitoring.metadata[0].name
    labels = {
      grafana_datasource = "1"
    }
  }
  data = {
    "loki-datasource.yaml" = yamlencode({
      apiVersion = 1
      datasources = [{
        name      = "Loki"
        type      = "loki"
        access    = "proxy"
        url       = "http://loki.monitoring.svc.cluster.local:3100"
        isDefault = false
      }]
    })
  }
}
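Once the ConfigMap has been applied, the label the sidecar watches for can be checked from the command line. This is a minimal sketch: the function names are hypothetical, and it assumes the kubeconfig lives at $(pwd)/config as in the other commands in this plan.

```shell
# datasource_configmaps NAMESPACE: print the ConfigMaps that carry the label
# the Grafana sidecar selects on (grafana_datasource=1), one per line, in
# kubectl's "configmap/<name>" form.
datasource_configmaps() {
  kubectl --kubeconfig "$(pwd)/config" get configmap -n "$1" \
    -l grafana_datasource=1 -o name
}

# has_datasource NAMESPACE NAME: succeed if configmap/NAME is in that list.
has_datasource() {
  datasource_configmaps "$1" | grep -qx "configmap/$2"
}
```

For example, `has_datasource monitoring grafana-loki-datasource && echo "label is set"` prints the message only when the ConfigMap exists and carries the label.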
Step 2: Run terraform fmt
Run: terraform fmt -recursive modules/kubernetes/monitoring/
Step 3: Commit
git add modules/kubernetes/monitoring/loki.tf
git commit -m "[ci skip] Add Grafana Loki datasource ConfigMap"
Task 7: Add Loki alert rules ConfigMap
Create the ConfigMap that Loki's ruler reads for alert rules. Mounted into the Loki pod at /loki/rules/fake/.
Files:
- Modify: modules/kubernetes/monitoring/loki.tf (add alert rules ConfigMap)
Step 1: Add the alert rules ConfigMap
Add to loki.tf:
resource "kubernetes_config_map" "loki_alert_rules" {
  metadata {
    name      = "loki-alert-rules"
    namespace = kubernetes_namespace.monitoring.metadata[0].name
  }
  data = {
    "rules.yaml" = yamlencode({
      groups = [{
        name = "log-alerts"
        rules = [
          {
            alert = "HighErrorRate"
            expr  = "sum(rate({namespace=~\".+\"} |= \"error\" [5m])) by (namespace) > 10"
            for   = "5m"
            labels = {
              severity = "warning"
            }
            annotations = {
              summary = "High error rate in {{ $labels.namespace }}"
            }
          },
          {
            alert = "PodCrashLoopBackOff"
            expr  = "count_over_time({namespace=~\".+\"} |= \"CrashLoopBackOff\" [5m]) > 0"
            for   = "1m"
            labels = {
              severity = "critical"
            }
            annotations = {
              summary = "CrashLoopBackOff detected in {{ $labels.namespace }}"
            }
          },
          {
            alert = "OOMKilled"
            expr  = "count_over_time({namespace=~\".+\"} |= \"OOMKilled\" [5m]) > 0"
            for   = "1m"
            labels = {
              severity = "critical"
            }
            annotations = {
              summary = "OOMKilled detected in {{ $labels.namespace }}"
            }
          }
        ]
      }]
    })
  }
}
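To get a feel for what the HighErrorRate threshold means: rate(... [5m]) is a per-second rate averaged over the window, so "> 10" corresponds to more than 3000 matching lines per namespace in 5 minutes. The sketch below reproduces that arithmetic and imitates the |= "error" line filter (a case-sensitive substring match) with grep on sample input. The function names and sample log lines are made up for illustration; this is not how Loki evaluates rules internally.

```shell
# threshold_lines RATE_PER_SECOND WINDOW_MINUTES
# Number of matching lines in the window that corresponds to the given rate.
threshold_lines() { echo $(( $1 * $2 * 60 )); }

# count_matches PATTERN: count stdin lines that contain PATTERN as a
# case-sensitive fixed string, similar to the LogQL filter |= "PATTERN".
# grep exits non-zero when the count is 0, hence the `|| true`.
count_matches() { grep -cF -- "$1" || true; }

threshold_lines 10 5
printf 'an error\nok\nError\nerror again\n' | count_matches error
```

The second command counts 2 lines rather than 3, because "Error" with a capital E does not match; applications that log "ERROR" or "Error" would not trigger this rule as written.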
Step 2: Run terraform fmt
Run: terraform fmt -recursive modules/kubernetes/monitoring/
Step 3: Commit
git add modules/kubernetes/monitoring/loki.tf
git commit -m "[ci skip] Add Loki alert rules ConfigMap"
Task 8: Deploy and verify
Apply all changes via Terraform and verify the stack is working.
Files: None (deployment only)
Step 1: Run terraform apply for monitoring module
Run: terraform apply -target=module.kubernetes_cluster.module.monitoring -var="kube_config_path=$(pwd)/config" -auto-approve
Expected: Multiple resources created (sysctl DaemonSet, Loki Helm release, Alloy Helm release, PV, ConfigMaps)
Step 2: Verify sysctl DaemonSet is running on all nodes
Run: kubectl --kubeconfig $(pwd)/config get ds -n monitoring sysctl-inotify
Expected: DESIRED=5, CURRENT=5, READY=5
Step 3: Verify Loki pod is running
Run: kubectl --kubeconfig $(pwd)/config get pods -n monitoring -l app.kubernetes.io/name=loki
Expected: 1/1 Running
Step 4: Verify Alloy DaemonSet is running
Run: kubectl --kubeconfig $(pwd)/config get ds -n monitoring -l app.kubernetes.io/name=alloy
Expected: DESIRED=5, CURRENT=5, READY=5
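The two DaemonSet checks can be scripted rather than read by eye. A minimal sketch, assuming the DaemonSet is addressed by name (sysctl-inotify is the name used in this plan; the Alloy DaemonSet name depends on the chart and is not assumed here). The function names are hypothetical.

```shell
# ds_counts NAME: print "<desired> <ready>" for a DaemonSet in the
# monitoring namespace, read from its status fields.
ds_counts() {
  kubectl --kubeconfig "$(pwd)/config" get ds -n monitoring "$1" \
    -o jsonpath='{.status.desiredNumberScheduled} {.status.numberReady}'
}

# all_ready "<desired> <ready>": succeed when at least one pod is desired
# and every desired pod is ready. Empty input counts as not ready.
all_ready() {
  set -- $1
  [ "${1:-0}" -gt 0 ] && [ "${1:-0}" -eq "${2:-0}" ]
}
```

Usage: `all_ready "$(ds_counts sysctl-inotify)" && echo "sysctl-inotify ready on all nodes"`.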
Step 5: Verify Loki is receiving logs
Run: kubectl --kubeconfig $(pwd)/config exec -n monitoring statefulset/loki -- wget -qO- 'http://localhost:3100/loki/api/v1/labels'
Expected: JSON response with labels like namespace, pod, container
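A quick way to confirm the expected labels are present is to test the response text for each name. This sketch uses a naive string match rather than a JSON parser, and the sample response is illustrative (it follows the status/data shape of the labels endpoint, but the values are not taken from a real cluster). The function name has_label is hypothetical.

```shell
# has_label JSON LABEL: succeed if LABEL appears as a quoted string in JSON.
# This is a plain substring check, so it is only a rough sanity test.
has_label() {
  printf '%s' "$1" | grep -qF "\"$2\""
}

# Illustrative response; replace with the output of the wget command above.
sample='{"status":"success","data":["container","namespace","pod"]}'

for l in namespace pod container; do
  has_label "$sample" "$l" && echo "label present: $l"
done
```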
Step 6: Verify Grafana has Loki datasource
Open https://grafana.viktorbarzin.me/explore, select "Loki" datasource, run query: {namespace="monitoring"}
Expected: Log lines from monitoring namespace pods
Step 7: Commit final state
git add -A
git commit -m "[ci skip] Deploy centralized log collection (Loki + Alloy)"
Troubleshooting
If Alloy pods crash with inotify errors:
- Check sysctl DaemonSet init logs:
kubectl --kubeconfig $(pwd)/config logs -n monitoring ds/sysctl-inotify -c sysctl
- Verify sysctl values on node:
kubectl --kubeconfig $(pwd)/config debug node/k8s-node2 -it --image=busybox -- sysctl fs.inotify.max_user_watches
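To check all three limits at once from a shell on the node (or in a debug pod that sees the host's /proc), the values can be compared against the targets set by the DaemonSet. A minimal sketch; the function names are hypothetical, and the directory argument exists only so the check can be pointed at a different path.

```shell
# check_limit NAME OBSERVED EXPECTED: report whether OBSERVED >= EXPECTED.
check_limit() {
  if [ "$2" -ge "$3" ]; then
    echo "ok  $1=$2"
  else
    echo "low $1=$2 (want >= $3)"
    return 1
  fi
}

# check_inotify [DIR]: compare the three inotify limits found in DIR
# (default /proc/sys/fs/inotify) with the values the sysctl DaemonSet sets.
# Returns non-zero if any limit is below its target.
check_inotify() {
  dir=${1:-/proc/sys/fs/inotify}
  rc=0
  check_limit max_user_watches   "$(cat "$dir/max_user_watches")"   1048576 || rc=1
  check_limit max_user_instances "$(cat "$dir/max_user_instances")" 512     || rc=1
  check_limit max_queued_events  "$(cat "$dir/max_queued_events")"  1048576 || rc=1
  return $rc
}
```

Running `check_inotify` with no argument reads the live kernel values; any line starting with "low" points at a limit the init container did not apply.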
If Loki OOMs:
- Check memory usage:
kubectl --kubeconfig $(pwd)/config top pod -n monitoring -l app.kubernetes.io/name=loki
- Reduce max_chunk_age from 24h to 12h in loki.yaml to flush more frequently
If Grafana doesn't show Loki datasource:
- Verify ConfigMap has correct label:
kubectl --kubeconfig $(pwd)/config get cm -n monitoring grafana-loki-datasource -o yaml
- Restart Grafana sidecar:
kubectl --kubeconfig $(pwd)/config rollout restart deploy -n monitoring grafana
If Loki PV won't bind:
- Check NFS export exists:
ssh root@10.0.10.15 'showmount -e localhost | grep loki'
- Run NFS export script:
cd secrets && bash nfs_exports.sh