[ci skip] Add centralized log collection implementation plan
Commit 04dd438b01 (parent 6ac8d549cb): 1 changed file with 532 additions and 0 deletions

docs/plans/2026-02-13-centralized-log-collection-plan.md (new file, 532 lines)

# Centralized Log Collection Implementation Plan

> **For Claude:** REQUIRED SUB-SKILL: Use superpowers:executing-plans to implement this plan task-by-task.

**Goal:** Deploy Loki + Alloy for centralized Kubernetes log collection with 24h in-memory chunks, 7-day disk retention, and log-based alerting via the existing Alertmanager.

**Architecture:** An Alloy DaemonSet tails pod logs on all 5 nodes and forwards them to a single-binary Loki, which holds chunks in memory (6Gi limit) for up to 24h before flushing them to NFS. The Loki ruler evaluates LogQL alert rules and fires alerts to Alertmanager. Grafana gets a Loki datasource via sidecar auto-provisioning.

**Tech Stack:** Terraform, Helm (Loki chart, Alloy chart), Kubernetes DaemonSet, NFS, Grafana

**Design doc:** `docs/plans/2026-02-13-centralized-log-collection-design.md`

---

### Task 1: Add sysctl DaemonSet for inotify limits

Alloy uses fsnotify to tail log files, and the default kernel inotify limits cause "too many open files" errors. This DaemonSet raises the limits on every node and reapplies them whenever its pod starts, for example after a node reboot.

**Files:**
- Modify: `modules/kubernetes/monitoring/loki.tf` (replace the comment block at lines 67-71)

**Step 1: Write the sysctl DaemonSet resource**

Replace lines 67-71 (the comment block about sysctl) with this Terraform resource in `loki.tf`:

```hcl
resource "kubernetes_daemon_set_v1" "sysctl-inotify" {
  metadata {
    name      = "sysctl-inotify"
    namespace = kubernetes_namespace.monitoring.metadata[0].name
    labels = {
      app = "sysctl-inotify"
    }
  }
  spec {
    selector {
      match_labels = {
        app = "sysctl-inotify"
      }
    }
    template {
      metadata {
        labels = {
          app = "sysctl-inotify"
        }
      }
      spec {
        init_container {
          name  = "sysctl"
          image = "busybox:1.37"
          command = [
            "sh", "-c",
            "sysctl -w fs.inotify.max_user_watches=1048576 && sysctl -w fs.inotify.max_user_instances=512 && sysctl -w fs.inotify.max_queued_events=1048576"
          ]
          security_context {
            privileged = true
          }
        }
        container {
          name  = "pause"
          image = "registry.k8s.io/pause:3.10"
          resources {
            requests = {
              cpu    = "1m"
              memory = "4Mi"
            }
            limits = {
              cpu    = "1m"
              memory = "4Mi"
            }
          }
        }
        host_pid = true
        toleration {
          operator = "Exists"
        }
      }
    }
  }
}
```

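
Once the DaemonSet has rolled out, the new limits can be spot-checked from a shell on a node. The following is a minimal sketch and not part of the plan itself: the helper names `limit_ok` and `report_limit` are hypothetical, and the minimums are copied from the `sysctl -w` command in the resource above.

```bash
#!/usr/bin/env bash
# Succeeds when the current value meets or exceeds the required minimum.
limit_ok() {
  [ "$1" -ge "$2" ]
}

# Prints the state of one inotify limit, read directly from /proc.
# Arguments: <sysctl name under fs.inotify> <required minimum>
report_limit() {
  limit_path="/proc/sys/fs/inotify/$1"
  if [ ! -r "$limit_path" ]; then
    echo "$1: not readable on this host"
    return 0
  fi
  limit_current="$(cat "$limit_path")"
  if limit_ok "$limit_current" "$2"; then
    echo "$1=$limit_current (ok, >= $2)"
  else
    echo "$1=$limit_current (too low, want >= $2)"
  fi
}

# Minimums taken from the DaemonSet's init container command.
report_limit max_user_watches 1048576
report_limit max_user_instances 512
report_limit max_queued_events 1048576
```

Reading `/proc/sys/fs/inotify/` avoids depending on a `sysctl` binary being present in whatever image the check runs from.
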
**Step 2: Run terraform fmt**

Run: `terraform fmt -recursive modules/kubernetes/monitoring/`

**Step 3: Run terraform plan to verify**

Run: `terraform plan -target=module.kubernetes_cluster.module.monitoring -var="kube_config_path=$(pwd)/config" 2>&1 | tail -30`
Expected: Plan shows 1 resource to add (kubernetes_daemon_set_v1.sysctl-inotify)

**Step 4: Commit**

```bash
git add modules/kubernetes/monitoring/loki.tf
git commit -m "[ci skip] Add sysctl DaemonSet for inotify limits"
```

---

### Task 2: Update Loki Helm values with disk-friendly tuning

Configure ingester for 24h in-memory chunks, WAL on tmpfs, 7-day retention, ruler for alerting, and resource limits.

**Files:**
- Modify: `modules/kubernetes/monitoring/loki.yaml` (full rewrite)

**Step 1: Write updated loki.yaml**

Replace entire contents of `loki.yaml` with:

```yaml
loki:
  commonConfig:
    replication_factor: 1
  schemaConfig:
    configs:
      - from: "2025-04-01"
        store: tsdb
        object_store: filesystem
        schema: v13
        index:
          prefix: loki_index_
          period: 24h
  ingester:
    chunk_idle_period: 12h
    max_chunk_age: 24h
    chunk_retain_period: 1m
    chunk_target_size: 1572864
    wal:
      dir: /loki-wal
  pattern_ingester:
    enabled: true
  limits_config:
    allow_structured_metadata: true
    volume_enabled: true
    retention_period: 168h
  compactor:
    retention_enabled: true
    working_directory: /loki/compactor
    compaction_interval: 1h
    delete_request_store: filesystem
  ruler:
    enable_api: true
    storage:
      type: local
      local:
        directory: /loki/rules
    alertmanager_url: http://alertmanager.monitoring.svc.cluster.local:9093
    ring:
      kvstore:
        store: inmemory
    rule_path: /loki/scratch
  storage:
    type: "filesystem"
  auth_enabled: false

minio:
  enabled: false

deploymentMode: SingleBinary

singleBinary:
  replicas: 1
  persistence:
    enabled: true
    size: 15Gi
    storageClass: ""
  extraVolumes:
    - name: wal
      emptyDir:
        medium: Memory
        sizeLimit: 2Gi
    - name: rules
      configMap:
        name: loki-alert-rules
  extraVolumeMounts:
    - name: wal
      mountPath: /loki-wal
    - name: rules
      mountPath: /loki/rules/fake
  resources:
    requests:
      cpu: 250m
      memory: 4Gi
    limits:
      cpu: "1"
      memory: 6Gi

# Zero out replica counts of other deployment modes
backend:
  replicas: 0
read:
  replicas: 0
write:
  replicas: 0
ingester:
  replicas: 0
querier:
  replicas: 0
queryFrontend:
  replicas: 0
queryScheduler:
  replicas: 0
distributor:
  replicas: 0
compactor:
  replicas: 0
indexGateway:
  replicas: 0
bloomCompactor:
  replicas: 0
bloomGateway:
  replicas: 0
```

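
The raw numbers in these values are easier to review in friendlier units. The arithmetic below is a small sketch with illustrative variable names. It also shows the memory left for chunks in the worst case where the memory-backed WAL volume fills to its `sizeLimit`, since files written to a `medium: Memory` emptyDir count against the container's memory limit.

```bash
#!/usr/bin/env bash
retention_hours=168          # limits_config.retention_period: 168h
chunk_target_bytes=1572864   # ingester.chunk_target_size
memory_limit_gib=6           # singleBinary.resources.limits.memory: 6Gi
wal_tmpfs_gib=2              # extraVolumes wal sizeLimit: 2Gi

retention_days=$((retention_hours / 24))
chunk_target_kib=$((chunk_target_bytes / 1024))
headroom_gib=$((memory_limit_gib - wal_tmpfs_gib))

echo "retention: ${retention_days} days"                     # retention: 7 days
echo "chunk target: ${chunk_target_kib} KiB"                 # chunk target: 1536 KiB
echo "memory left if WAL tmpfs is full: ${headroom_gib} GiB" # 4 GiB
```

So `168h` matches the 7-day retention goal, and `chunk_target_size` is 1.5 MiB per chunk.
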
**Step 2: Commit**

```bash
git add modules/kubernetes/monitoring/loki.yaml
git commit -m "[ci skip] Update Loki config with disk-friendly tuning and ruler"
```

---

### Task 3: Update Alloy Helm values with resource limits

The Alloy config content is already complete. Wrap it in proper Helm values with resource limits.

**Files:**
- Modify: `modules/kubernetes/monitoring/alloy.yaml` (add resource limits)

**Step 1: Add resource limits to alloy.yaml**

Append after the existing `alloy.configMap.content` block (after the last line):

```yaml

  # Resource limits for DaemonSet pods
  resources:
    requests:
      cpu: 50m
      memory: 64Mi
    limits:
      cpu: 200m
      memory: 128Mi
```

The final file should have the `alloy.configMap.content` block unchanged, with `alloy.resources` added as a sibling under `alloy:`.

**Step 2: Commit**

```bash
git add modules/kubernetes/monitoring/alloy.yaml
git commit -m "[ci skip] Add resource limits to Alloy config"
```

---

### Task 4: Uncomment Loki Helm release and PV in loki.tf

Enable the Loki Helm release and its NFS persistent volume. Remove the minio PV, which is not needed with filesystem storage.

**Files:**
- Modify: `modules/kubernetes/monitoring/loki.tf` (uncomment Loki resources, remove minio PV)

**Step 1: Uncomment the Loki Helm release (lines 1-12)**

Uncomment and update the helm_release to:

```hcl
resource "helm_release" "loki" {
  namespace        = kubernetes_namespace.monitoring.metadata[0].name
  create_namespace = true
  name             = "loki"

  repository = "https://grafana.github.io/helm-charts"
  chart      = "loki"

  values  = [templatefile("${path.module}/loki.yaml", {})]
  timeout = 300

  depends_on = [kubernetes_config_map.loki_alert_rules]
}
```

**Step 2: Uncomment the Loki NFS PV (lines 14-32)**

Uncomment the `kubernetes_persistent_volume.loki` resource as-is.

**Step 3: Remove the minio PV block (lines 34-52)**

Delete the entire commented `kubernetes_persistent_volume.loki-minio` block, since minio is disabled.

**Step 4: Run terraform fmt**

Run: `terraform fmt -recursive modules/kubernetes/monitoring/`

**Step 5: Commit**

```bash
git add modules/kubernetes/monitoring/loki.tf
git commit -m "[ci skip] Enable Loki Helm release and NFS PV"
```

---

### Task 5: Uncomment Alloy Helm release in loki.tf

Enable the Alloy Helm release.

**Files:**
- Modify: `modules/kubernetes/monitoring/loki.tf` (uncomment Alloy helm release)

**Step 1: Uncomment and update the Alloy Helm release**

Replace the commented Alloy block with:

```hcl
resource "helm_release" "alloy" {
  namespace        = kubernetes_namespace.monitoring.metadata[0].name
  create_namespace = true
  name             = "alloy"

  repository = "https://grafana.github.io/helm-charts"
  chart      = "alloy"

  values = [file("${path.module}/alloy.yaml")]
  atomic = true

  depends_on = [helm_release.loki]
}
```

**Step 2: Run terraform fmt**

Run: `terraform fmt -recursive modules/kubernetes/monitoring/`

**Step 3: Commit**

```bash
git add modules/kubernetes/monitoring/loki.tf
git commit -m "[ci skip] Enable Alloy Helm release"
```

---

### Task 6: Add Grafana Loki datasource ConfigMap

Grafana's sidecar auto-discovers ConfigMaps with label `grafana_datasource: "1"`. Create one for Loki.

**Files:**
- Modify: `modules/kubernetes/monitoring/loki.tf` (add ConfigMap resource)

**Step 1: Add the datasource ConfigMap**

Add to `loki.tf`:

```hcl
resource "kubernetes_config_map" "grafana_loki_datasource" {
  metadata {
    name      = "grafana-loki-datasource"
    namespace = kubernetes_namespace.monitoring.metadata[0].name
    labels = {
      grafana_datasource = "1"
    }
  }
  data = {
    "loki-datasource.yaml" = yamlencode({
      apiVersion = 1
      datasources = [{
        name      = "Loki"
        type      = "loki"
        access    = "proxy"
        url       = "http://loki.monitoring.svc.cluster.local:3100"
        isDefault = false
      }]
    })
  }
}
```

**Step 2: Run terraform fmt**

Run: `terraform fmt -recursive modules/kubernetes/monitoring/`

**Step 3: Commit**

```bash
git add modules/kubernetes/monitoring/loki.tf
git commit -m "[ci skip] Add Grafana Loki datasource ConfigMap"
```

---

### Task 7: Add Loki alert rules ConfigMap

Create the ConfigMap that Loki's ruler reads for alert rules. It is mounted into the Loki pod at `/loki/rules/fake/`; `fake` is the tenant ID Loki uses when `auth_enabled` is `false`.

**Files:**
- Modify: `modules/kubernetes/monitoring/loki.tf` (add alert rules ConfigMap)

**Step 1: Add the alert rules ConfigMap**

Add to `loki.tf`:

```hcl
resource "kubernetes_config_map" "loki_alert_rules" {
  metadata {
    name      = "loki-alert-rules"
    namespace = kubernetes_namespace.monitoring.metadata[0].name
  }
  data = {
    "rules.yaml" = yamlencode({
      groups = [{
        name = "log-alerts"
        rules = [
          {
            alert = "HighErrorRate"
            expr  = "sum(rate({namespace=~\".+\"} |= \"error\" [5m])) by (namespace) > 10"
            for   = "5m"
            labels = {
              severity = "warning"
            }
            annotations = {
              summary = "High error rate in {{ $labels.namespace }}"
            }
          },
          {
            alert = "PodCrashLoopBackOff"
            expr  = "count_over_time({namespace=~\".+\"} |= \"CrashLoopBackOff\" [5m]) > 0"
            for   = "1m"
            labels = {
              severity = "critical"
            }
            annotations = {
              summary = "CrashLoopBackOff detected in {{ $labels.namespace }}"
            }
          },
          {
            alert = "OOMKilled"
            expr  = "count_over_time({namespace=~\".+\"} |= \"OOMKilled\" [5m]) > 0"
            for   = "1m"
            labels = {
              severity = "critical"
            }
            annotations = {
              summary = "OOMKilled detected in {{ $labels.namespace }}"
            }
          }
        ]
      }]
    })
  }
}
```

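
LogQL's `rate()` over a log range yields log lines per second, so the `> 10` threshold in `HighErrorRate` is a per-second figure per namespace, averaged over the `[5m]` range. The short calculation below, with illustrative variable names, converts that into a line count for one window.

```bash
#!/usr/bin/env bash
threshold_per_second=10      # from the HighErrorRate expr: ... > 10
window_seconds=$((5 * 60))   # from the [5m] range selector

lines_in_window=$((threshold_per_second * window_seconds))
echo "HighErrorRate needs more than ${lines_in_window} matching lines per namespace per 5m window"
```

With `for = "5m"`, that rate also has to hold for a further five minutes before the alert fires.
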
**Step 2: Run terraform fmt**

Run: `terraform fmt -recursive modules/kubernetes/monitoring/`

**Step 3: Commit**

```bash
git add modules/kubernetes/monitoring/loki.tf
git commit -m "[ci skip] Add Loki alert rules ConfigMap"
```

---

### Task 8: Deploy and verify

Apply all changes via Terraform and verify the stack is working.

**Files:** None (deployment only)

**Step 1: Run terraform apply for monitoring module**

Run: `terraform apply -target=module.kubernetes_cluster.module.monitoring -var="kube_config_path=$(pwd)/config" -auto-approve`
Expected: Multiple resources created (sysctl DaemonSet, Loki Helm release, Alloy Helm release, PV, ConfigMaps)

**Step 2: Verify sysctl DaemonSet is running on all nodes**

Run: `kubectl --kubeconfig $(pwd)/config get ds -n monitoring sysctl-inotify`
Expected: DESIRED=5, CURRENT=5, READY=5

**Step 3: Verify Loki pod is running**

Run: `kubectl --kubeconfig $(pwd)/config get pods -n monitoring -l app.kubernetes.io/name=loki`
Expected: 1/1 Running

**Step 4: Verify Alloy DaemonSet is running**

Run: `kubectl --kubeconfig $(pwd)/config get ds -n monitoring -l app.kubernetes.io/name=alloy`
Expected: DESIRED=5, CURRENT=5, READY=5

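
The two DaemonSet checks can also be scripted instead of read by eye. This is a sketch under assumptions: the helper names `ds_all_ready` and `check_ds` are hypothetical, and the Alloy DaemonSet is assumed to be named `alloy` after its Helm release. The block only defines the functions, so it is safe to source without a cluster.

```bash
#!/usr/bin/env bash
# Succeeds when at least one pod is desired and all desired pods are ready.
# Arguments: <desired count> <ready count>
ds_all_ready() {
  [ "$1" -gt 0 ] && [ "$1" -eq "$2" ]
}

# Queries one DaemonSet in the monitoring namespace and prints a verdict.
# Uses the same kubeconfig path as the plan's kubectl commands.
check_ds() {
  ds_name="$1"
  ds_counts="$(kubectl --kubeconfig "$(pwd)/config" get ds -n monitoring "$ds_name" \
    -o jsonpath='{.status.desiredNumberScheduled} {.status.numberReady}')" || return 1
  # Split "desired ready" into positional parameters $1 and $2.
  set -- $ds_counts
  if ds_all_ready "${1:-0}" "${2:-0}"; then
    echo "$ds_name: ${2:-0}/${1:-0} ready"
  else
    echo "$ds_name: only ${2:-0}/${1:-0} ready"
    return 1
  fi
}
```

Usage: `check_ds sysctl-inotify && check_ds alloy`. On this cluster both should report 5/5.
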
**Step 5: Verify Loki is receiving logs**

Run: `kubectl --kubeconfig $(pwd)/config exec -n monitoring statefulset/loki -- wget -qO- 'http://localhost:3100/loki/api/v1/labels'`
Expected: JSON response with labels like `namespace`, `pod`, `container`

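
To check the response mechanically, the label names can be matched in the JSON body. The sketch below uses a hypothetical helper `has_label` and a sample body in the shape the labels endpoint returns; the label values in the sample are illustrative, not captured output.

```bash
#!/usr/bin/env bash
# Succeeds when the JSON body contains the given label name as a quoted string.
# A plain grep keeps the sketch dependency-free; jq would be more robust.
has_label() {
  printf '%s' "$1" | grep -q "\"$2\""
}

# Illustrative response body; replace with the output of the wget command above.
sample='{"status":"success","data":["container","namespace","pod"]}'

for label in namespace pod container; do
  if has_label "$sample" "$label"; then
    echo "found label: $label"
  else
    echo "missing label: $label"
  fi
done
```

If a label is missing, it usually means Alloy is not attaching it or no logs have arrived yet.
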
**Step 6: Verify Grafana has Loki datasource**

Open `https://grafana.viktorbarzin.me/explore`, select "Loki" datasource, run query: `{namespace="monitoring"}`
Expected: Log lines from monitoring namespace pods

**Step 7: Commit final state**

```bash
git add -A
git commit -m "[ci skip] Deploy centralized log collection (Loki + Alloy)"
```

---

### Troubleshooting

**If Alloy pods crash with inotify errors:**
- Check the sysctl DaemonSet init logs: `kubectl --kubeconfig $(pwd)/config logs -n monitoring ds/sysctl-inotify -c sysctl`
- Verify the sysctl values on a node: `kubectl --kubeconfig $(pwd)/config debug node/k8s-node2 -it --image=busybox -- sysctl fs.inotify.max_user_watches`

**If Loki OOMs:**
- Check memory usage: `kubectl --kubeconfig $(pwd)/config top pod -n monitoring -l app.kubernetes.io/name=loki`
- Reduce `max_chunk_age` from 24h to 12h in `loki.yaml` to flush more frequently

**If Grafana doesn't show the Loki datasource:**
- Verify the ConfigMap has the correct label: `kubectl --kubeconfig $(pwd)/config get cm -n monitoring grafana-loki-datasource -o yaml`
- Restart Grafana so its sidecar reloads datasources: `kubectl --kubeconfig $(pwd)/config rollout restart deploy -n monitoring grafana`

**If Loki PV won't bind:**
- Check that the NFS export exists: `ssh root@10.0.10.15 'showmount -e localhost | grep loki'`
- Run the NFS export script: `cd secrets && bash nfs_exports.sh`