# Centralized Log Collection Implementation Plan

> **For Claude:** REQUIRED SUB-SKILL: Use superpowers:executing-plans to implement this plan task-by-task.

**Goal:** Deploy Loki + Alloy for centralized Kubernetes log collection with 24h in-memory chunks, 7-day disk retention, and log-based alerting via existing Alertmanager.

**Architecture:** Alloy DaemonSet tails pod logs on all 5 nodes, forwards to single-binary Loki which holds chunks in 6Gi RAM for 24h before flushing to NFS. Loki Ruler evaluates LogQL alert rules in real-time and fires to Alertmanager. Grafana gets a Loki datasource via sidecar auto-provisioning.

**Tech Stack:** Terraform, Helm (Loki chart, Alloy chart), Kubernetes DaemonSet, NFS, Grafana

**Design doc:** `docs/plans/2026-02-13-centralized-log-collection-design.md`

---

### Task 1: Add sysctl DaemonSet for inotify limits

Alloy uses fsnotify to tail log files, and the default kernel inotify limits cause "too many open files" errors. This DaemonSet raises the limits on every node. Because `sysctl -w` does not survive a reboot, the init container re-applies the values each time the pod starts.

**Files:**
- Modify: `modules/kubernetes/monitoring/loki.tf` (replace the comment block at lines 67-71)

**Step 1: Write the sysctl DaemonSet resource**

Replace lines 67-71 (the comment block about sysctl) with this Terraform resource in `loki.tf`:

```hcl
resource "kubernetes_daemon_set_v1" "sysctl-inotify" {
  metadata {
    name      = "sysctl-inotify"
    namespace = kubernetes_namespace.monitoring.metadata[0].name
    labels = {
      app = "sysctl-inotify"
    }
  }
  spec {
    selector {
      match_labels = {
        app = "sysctl-inotify"
      }
    }
    template {
      metadata {
        labels = {
          app = "sysctl-inotify"
        }
      }
      spec {
        init_container {
          name  = "sysctl"
          image = "busybox:1.37"
          command = [
            "sh", "-c",
            "sysctl -w fs.inotify.max_user_watches=1048576 && sysctl -w fs.inotify.max_user_instances=512 && sysctl -w fs.inotify.max_queued_events=1048576"
          ]
          security_context {
            privileged = true
          }
        }
        container {
          name  = "pause"
          image = "registry.k8s.io/pause:3.10"
          resources {
            requests = {
              cpu    = "1m"
              memory = "4Mi"
            }
            limits = {
              cpu    = "1m"
              memory = "4Mi"
            }
          }
        }
        host_pid = true
        toleration {
          operator = "Exists"
        }
      }
    }
  }
}
```

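Once the DaemonSet is running, the values it writes can be compared against the minimums set above. The following is a minimal sketch and not part of the original plan: `check_limit` and `read_inotify` are hypothetical helper names, and it assumes a POSIX shell on a Linux node (or a node debug pod) where `/proc/sys/fs/inotify/` is readable.

```shell
#!/bin/sh
# Hypothetical helpers (not part of the plan) for checking inotify limits.

# check_limit NAME ACTUAL WANTED: succeed when ACTUAL is at least WANTED.
check_limit() {
  if [ "$2" -ge "$3" ]; then
    echo "ok $1=$2"
  else
    echo "low $1=$2 (want >= $3)"
    return 1
  fi
}

# read_inotify NAME: print the current kernel value of fs.inotify.NAME.
read_inotify() {
  cat "/proc/sys/fs/inotify/$1"
}

# Example run on a node, using the values set by the init container:
# check_limit max_user_watches   "$(read_inotify max_user_watches)"   1048576
# check_limit max_user_instances "$(read_inotify max_user_instances)" 512
# check_limit max_queued_events  "$(read_inotify max_queued_events)"  1048576
```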
**Step 2: Run terraform fmt**

Run: `terraform fmt -recursive modules/kubernetes/monitoring/`

**Step 3: Run terraform plan to verify**

Run: `terraform plan -target=module.kubernetes_cluster.module.monitoring -var="kube_config_path=$(pwd)/config" 2>&1 | tail -30`
Expected: Plan shows 1 resource to add (kubernetes_daemon_set_v1.sysctl-inotify)

**Step 4: Commit**

```bash
git add modules/kubernetes/monitoring/loki.tf
git commit -m "[ci skip] Add sysctl DaemonSet for inotify limits"
```

---

### Task 2: Update Loki Helm values with disk-friendly tuning

Configure ingester for 24h in-memory chunks, WAL on tmpfs, 7-day retention, ruler for alerting, and resource limits.

**Files:**
- Modify: `modules/kubernetes/monitoring/loki.yaml` (full rewrite)

**Step 1: Write updated loki.yaml**

Replace entire contents of `loki.yaml` with:

```yaml
loki:
  commonConfig:
    replication_factor: 1
  schemaConfig:
    configs:
      - from: "2025-04-01"
        store: tsdb
        object_store: filesystem
        schema: v13
        index:
          prefix: loki_index_
          period: 24h
  ingester:
    chunk_idle_period: 12h
    max_chunk_age: 24h
    chunk_retain_period: 1m
    chunk_target_size: 1572864
    wal:
      dir: /loki-wal
  pattern_ingester:
    enabled: true
  limits_config:
    allow_structured_metadata: true
    volume_enabled: true
    retention_period: 168h
  compactor:
    retention_enabled: true
    working_directory: /loki/compactor
    compaction_interval: 1h
    delete_request_store: filesystem
  ruler:
    enable_api: true
    storage:
      type: local
      local:
        directory: /loki/rules
    alertmanager_url: http://alertmanager.monitoring.svc.cluster.local:9093
    ring:
      kvstore:
        store: inmemory
    rule_path: /loki/scratch
  storage:
    type: "filesystem"
  auth_enabled: false

minio:
  enabled: false

deploymentMode: SingleBinary

singleBinary:
  replicas: 1
  persistence:
    enabled: true
    size: 15Gi
    storageClass: ""
  extraVolumes:
    - name: wal
      emptyDir:
        medium: Memory
        sizeLimit: 2Gi
    - name: rules
      configMap:
        name: loki-alert-rules
  extraVolumeMounts:
    - name: wal
      mountPath: /loki-wal
    - name: rules
      mountPath: /loki/rules/fake
  resources:
    requests:
      cpu: 250m
      memory: 4Gi
    limits:
      cpu: "1"
      memory: 6Gi

# Zero out replica counts of other deployment modes
backend:
  replicas: 0
read:
  replicas: 0
write:
  replicas: 0
ingester:
  replicas: 0
querier:
  replicas: 0
queryFrontend:
  replicas: 0
queryScheduler:
  replicas: 0
distributor:
  replicas: 0
compactor:
  replicas: 0
indexGateway:
  replicas: 0
bloomCompactor:
  replicas: 0
bloomGateway:
  replicas: 0
```

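Several of the numbers in these values are easier to review in other units. The sketch below uses plain POSIX shell arithmetic, and the function names are illustrative only. It confirms that `retention_period: 168h` is the 7-day retention from the goal and that `chunk_target_size: 1572864` is 1536 KiB (1.5 MiB). It also shows the worst case for memory: a `medium: Memory` emptyDir counts against the container's memory limit, so a full 2Gi WAL would leave roughly 4Gi of the 6Gi limit for chunks.

```shell
#!/bin/sh
# Illustrative unit conversions for values in loki.yaml (helper names are hypothetical).

hours_to_days() { echo $(( $1 / 24 )); }
bytes_to_kib()  { echo $(( $1 / 1024 )); }

echo "retention_period 168h = $(hours_to_days 168) days"
echo "chunk_target_size 1572864 B = $(bytes_to_kib 1572864) KiB"
# Worst case: the tmpfs WAL (sizeLimit 2Gi) is full and counts toward the 6Gi limit.
echo "memory left for chunks with a full WAL = $(( 6 - 2 ))Gi"
```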
**Step 2: Commit**

```bash
git add modules/kubernetes/monitoring/loki.yaml
git commit -m "[ci skip] Update Loki config with disk-friendly tuning and ruler"
```

---

### Task 3: Update Alloy Helm values with resource limits

The Alloy config content is already complete. Wrap it in proper Helm values with resource limits.

**Files:**
- Modify: `modules/kubernetes/monitoring/alloy.yaml` (add resource limits)

**Step 1: Add resource limits to alloy.yaml**

Append after the existing `alloy.configMap.content` block (after the last line):

```yaml

  # Resource limits for DaemonSet pods
  resources:
    requests:
      cpu: 50m
      memory: 64Mi
    limits:
      cpu: 200m
      memory: 128Mi
```

The final file should have the `alloy.configMap.content` block unchanged, with `alloy.resources` added as a sibling under `alloy:`.

**Step 2: Commit**

```bash
git add modules/kubernetes/monitoring/alloy.yaml
git commit -m "[ci skip] Add resource limits to Alloy config"
```

---

### Task 4: Uncomment Loki Helm release and PV in loki.tf

Enable the Loki Helm release and its NFS persistent volume. Remove minio PV (not needed with filesystem storage).

**Files:**
- Modify: `modules/kubernetes/monitoring/loki.tf` (uncomment Loki resources, remove minio PV)

**Step 1: Uncomment the Loki Helm release (lines 1-12)**

Uncomment and update the helm_release to:

```hcl
resource "helm_release" "loki" {
  namespace        = kubernetes_namespace.monitoring.metadata[0].name
  create_namespace = true
  name             = "loki"

  repository = "https://grafana.github.io/helm-charts"
  chart      = "loki"

  values  = [templatefile("${path.module}/loki.yaml", {})]
  timeout = 300

  depends_on = [kubernetes_config_map.loki_alert_rules]
}
```

**Step 2: Uncomment the Loki NFS PV (lines 14-32)**

Uncomment the `kubernetes_persistent_volume.loki` resource as-is.

**Step 3: Remove the minio PV block (lines 34-52)**

Delete the entire commented-out `kubernetes_persistent_volume.loki-minio` block, since minio is disabled.

**Step 4: Run terraform fmt**

Run: `terraform fmt -recursive modules/kubernetes/monitoring/`

**Step 5: Commit**

```bash
git add modules/kubernetes/monitoring/loki.tf
git commit -m "[ci skip] Enable Loki Helm release and NFS PV"
```

---

### Task 5: Uncomment Alloy Helm release in loki.tf

Enable the Alloy Helm release.

**Files:**
- Modify: `modules/kubernetes/monitoring/loki.tf` (uncomment Alloy helm release)

**Step 1: Uncomment and update the Alloy Helm release**

Replace the commented Alloy block with:

```hcl
resource "helm_release" "alloy" {
  namespace        = kubernetes_namespace.monitoring.metadata[0].name
  create_namespace = true
  name             = "alloy"

  repository = "https://grafana.github.io/helm-charts"
  chart      = "alloy"

  values = [file("${path.module}/alloy.yaml")]
  atomic = true

  depends_on = [helm_release.loki]
}
```

**Step 2: Run terraform fmt**

Run: `terraform fmt -recursive modules/kubernetes/monitoring/`

**Step 3: Commit**

```bash
git add modules/kubernetes/monitoring/loki.tf
git commit -m "[ci skip] Enable Alloy Helm release"
```

---

### Task 6: Add Grafana Loki datasource ConfigMap

Grafana's sidecar auto-discovers ConfigMaps with label `grafana_datasource: "1"`. Create one for Loki.

**Files:**
- Modify: `modules/kubernetes/monitoring/loki.tf` (add ConfigMap resource)

**Step 1: Add the datasource ConfigMap**

Add to `loki.tf`:

```hcl
resource "kubernetes_config_map" "grafana_loki_datasource" {
  metadata {
    name      = "grafana-loki-datasource"
    namespace = kubernetes_namespace.monitoring.metadata[0].name
    labels = {
      grafana_datasource = "1"
    }
  }
  data = {
    "loki-datasource.yaml" = yamlencode({
      apiVersion = 1
      datasources = [{
        name      = "Loki"
        type      = "loki"
        access    = "proxy"
        url       = "http://loki.monitoring.svc.cluster.local:3100"
        isDefault = false
      }]
    })
  }
}
```

**Step 2: Run terraform fmt**

Run: `terraform fmt -recursive modules/kubernetes/monitoring/`

**Step 3: Commit**

```bash
git add modules/kubernetes/monitoring/loki.tf
git commit -m "[ci skip] Add Grafana Loki datasource ConfigMap"
```

---

### Task 7: Add Loki alert rules ConfigMap

Create the ConfigMap that Loki's ruler reads for alert rules. Mounted into the Loki pod at `/loki/rules/fake/`.

**Files:**
- Modify: `modules/kubernetes/monitoring/loki.tf` (add alert rules ConfigMap)

**Step 1: Add the alert rules ConfigMap**

Add to `loki.tf`:

```hcl
resource "kubernetes_config_map" "loki_alert_rules" {
  metadata {
    name      = "loki-alert-rules"
    namespace = kubernetes_namespace.monitoring.metadata[0].name
  }
  data = {
    "rules.yaml" = yamlencode({
      groups = [{
        name = "log-alerts"
        rules = [
          {
            alert = "HighErrorRate"
            expr  = "sum(rate({namespace=~\".+\"} |= \"error\" [5m])) by (namespace) > 10"
            for   = "5m"
            labels = {
              severity = "warning"
            }
            annotations = {
              summary = "High error rate in {{ $labels.namespace }}"
            }
          },
          {
            alert = "PodCrashLoopBackOff"
            expr  = "count_over_time({namespace=~\".+\"} |= \"CrashLoopBackOff\" [5m]) > 0"
            for   = "1m"
            labels = {
              severity = "critical"
            }
            annotations = {
              summary = "CrashLoopBackOff detected in {{ $labels.namespace }}"
            }
          },
          {
            alert = "OOMKilled"
            expr  = "count_over_time({namespace=~\".+\"} |= \"OOMKilled\" [5m]) > 0"
            for   = "1m"
            labels = {
              severity = "critical"
            }
            annotations = {
              summary = "OOMKilled detected in {{ $labels.namespace }}"
            }
          }
        ]
      }]
    })
  }
}
```

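The `PodCrashLoopBackOff` and `OOMKilled` rules share one LogQL shape: count the lines containing a keyword over 5 minutes and fire when the count is above zero. The sketch below captures that pattern, which may help when drafting further keyword rules. `keyword_alert_expr` is a hypothetical helper, and the commented `wget` call assumes the ruler API (`enable_api: true` in `loki.yaml`) is reachable on port 3100 from inside the Loki pod.

```shell
#!/bin/sh
# Hypothetical helper: build the keyword-match expression used by the rules above.
keyword_alert_expr() {
  printf 'count_over_time({namespace=~".+"} |= "%s" [5m]) > 0\n' "$1"
}

keyword_alert_expr "OOMKilled"
keyword_alert_expr "CrashLoopBackOff"

# After deployment, the rule groups loaded by the ruler can be listed from inside the Loki pod:
# wget -qO- 'http://localhost:3100/loki/api/v1/rules'
```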
**Step 2: Run terraform fmt**

Run: `terraform fmt -recursive modules/kubernetes/monitoring/`

**Step 3: Commit**

```bash
git add modules/kubernetes/monitoring/loki.tf
git commit -m "[ci skip] Add Loki alert rules ConfigMap"
```

---

### Task 8: Deploy and verify

Apply all changes via Terraform and verify the stack is working.

**Files:** None (deployment only)

**Step 1: Run terraform apply for monitoring module**

Run: `terraform apply -target=module.kubernetes_cluster.module.monitoring -var="kube_config_path=$(pwd)/config" -auto-approve`
Expected: Multiple resources created (sysctl DaemonSet, Loki Helm release, Alloy Helm release, PV, ConfigMaps)

**Step 2: Verify sysctl DaemonSet is running on all nodes**

Run: `kubectl --kubeconfig $(pwd)/config get ds -n monitoring sysctl-inotify`
Expected: DESIRED=5, CURRENT=5, READY=5

**Step 3: Verify Loki pod is running**

Run: `kubectl --kubeconfig $(pwd)/config get pods -n monitoring -l app.kubernetes.io/name=loki`
Expected: 1/1 Running

**Step 4: Verify Alloy DaemonSet is running**

Run: `kubectl --kubeconfig $(pwd)/config get ds -n monitoring -l app.kubernetes.io/name=alloy`
Expected: DESIRED=5, CURRENT=5, READY=5

**Step 5: Verify Loki is receiving logs**

Run: `kubectl --kubeconfig $(pwd)/config exec -n monitoring statefulset/loki -- wget -qO- 'http://localhost:3100/loki/api/v1/labels'`
Expected: JSON response with labels like `namespace`, `pod`, `container`

**Step 6: Verify Grafana has Loki datasource**

Open `https://grafana.viktorbarzin.me/explore`, select "Loki" datasource, run query: `{namespace="monitoring"}`
Expected: Log lines from monitoring namespace pods

**Step 7: Commit final state**

```bash
git add -A
git commit -m "[ci skip] Deploy centralized log collection (Loki + Alloy)"
```

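The DaemonSet checks in Steps 2 and 4 can be scripted rather than read by eye. This is a minimal sketch, assuming the 5-node cluster described in the architecture. `ds_ready` is a hypothetical helper, while `desiredNumberScheduled` and `numberReady` are standard DaemonSet status fields.

```shell
#!/bin/sh
# Hypothetical helper: succeed only when every expected node runs a ready pod.
# Arguments: DESIRED READY EXPECTED_NODES
ds_ready() {
  desired=$1 ready=$2 expected_nodes=$3
  [ "$desired" -eq "$expected_nodes" ] && [ "$ready" -eq "$desired" ]
}

# Usage against the cluster (the jsonpath prints two numbers, e.g. "5 5"):
# set -- $(kubectl --kubeconfig "$(pwd)/config" get ds -n monitoring sysctl-inotify \
#   -o jsonpath='{.status.desiredNumberScheduled} {.status.numberReady}')
# ds_ready "$1" "$2" 5 && echo "sysctl-inotify ready on all nodes"
```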
---

### Troubleshooting

**If Alloy pods crash with inotify errors:**
- Check sysctl DaemonSet init logs: `kubectl --kubeconfig $(pwd)/config logs -n monitoring ds/sysctl-inotify -c sysctl`
- Verify sysctl values on node: `kubectl --kubeconfig $(pwd)/config debug node/k8s-node2 -it --image=busybox -- sysctl fs.inotify.max_user_watches`

**If Loki OOMs:**
- Check memory usage: `kubectl --kubeconfig $(pwd)/config top pod -n monitoring -l app.kubernetes.io/name=loki`
- Reduce `max_chunk_age` from 24h to 12h in `loki.yaml` to flush more frequently

**If Grafana doesn't show Loki datasource:**
- Verify ConfigMap has correct label: `kubectl --kubeconfig $(pwd)/config get cm -n monitoring grafana-loki-datasource -o yaml`
- Restart Grafana sidecar: `kubectl --kubeconfig $(pwd)/config rollout restart deploy -n monitoring grafana`

**If Loki PV won't bind:**
- Check NFS export exists: `ssh root@10.0.10.15 'showmount -e localhost | grep loki'`
- Run NFS export script: `cd secrets && bash nfs_exports.sh`