[ci skip] Remove legacy files and orphaned modules
Delete 20 orphaned module directories and 3 stray files from modules/kubernetes/ that are no longer referenced by any stack. Remove 7 root-level legacy files including the empty tfstate, 27MB terraform zip, commented-out main.tf, and migration notes. Clean up commented-out dockerhub_secret and oauth-proxy references in blog, travel_blog, and city-guesser stacks. Remove stale frigate config.yaml entry from .gitignore. Remove ephemeral docs/plans/ directory.
56 changed files with 2 additions and 9402 deletions

@@ -1,140 +0,0 @@

# Centralized Log Collection Design

## Date: 2026-02-13

## Goal

Centrally collect logs from all Kubernetes pods for monitoring and alerting. Minimize disk I/O by holding logs in memory for extended periods, flushing to NFS once daily. Alert on log patterns via existing Alertmanager pipeline.

## Requirements

- **Primary use case**: Monitoring and alerting (log-based alert rules evaluated in real-time)
- **Retention**: 7 days on disk after flush
- **Memory budget**: 4-8GB total (~6.6GB used)
- **Disk strategy**: 24h in-memory chunks, WAL on tmpfs, single daily flush to NFS
- **Crash policy**: Accept up to 24h log loss on pod/node crash (alerts still fire in real-time before flush)
- **Alert delivery**: Loki Ruler -> existing Alertmanager -> Slack/email

## Architecture

```
┌──────────────────┐     ┌────────────────────────┐     ┌──────────────┐
│ Alloy DaemonSet  │     │ Loki SingleBinary      │     │ Grafana      │
│ 5 pods, 128Mi ea │────>│ 1 pod, 6Gi RAM         │<────│ (existing)   │
│ tails /var/log/  │     │                        │     │ + Loki       │
│ pods on each node│     │ Ingester: 24h chunks   │     │ datasource   │
└──────────────────┘     │ WAL: tmpfs (in-memory) │     └──────────────┘
                         │ Storage: NFS 15Gi      │
┌──────────────────┐     │ Ruler ──> Alertmanager │
│ Sysctl DaemonSet │     └────────────────────────┘
│ 5 pods (pause)   │
│ sets inotify     │
│ limits on nodes  │
└──────────────────┘
```

## Components

### 1. Sysctl DaemonSet

Solves the `too many open files` / fsnotify watcher exhaustion problem that previously blocked Alloy.

- Privileged init container runs `sysctl -w` on each node
- Settings: `fs.inotify.max_user_watches=1048576`, `fs.inotify.max_user_instances=512`, `fs.inotify.max_queued_events=1048576`
- Main container: `pause` image (near-zero resources)
- Survives node reboots (DaemonSet recreates pod)
- Namespace: `monitoring`

### 2. Loki (Helm Release)

Single-binary deployment. Existing Helm chart config in `loki.yaml`, updated with:

**Ingester tuning (disk-friendly):**
- `chunk_idle_period: 12h` — don't flush idle streams quickly
- `max_chunk_age: 24h` — hold chunks in memory for full day
- `chunk_retain_period: 1m` — brief retain after flush
- `chunk_target_size: 1572864` (1.5MB) — larger chunks = fewer writes
- WAL: tmpfs emptyDir (`medium: Memory`, 2Gi limit)

**Retention:**
- `retention_period: 168h` (7 days)
- Compactor enabled for retention enforcement

**Ruler:**
- Evaluates LogQL alert rules in real-time (before chunk flush)
- Fires to `http://prometheus-alertmanager.monitoring.svc.cluster.local:9093`

**Storage:**
- NFS PV/PVC at `/mnt/main/loki/loki` (15Gi, existing)
- TSDB index with 24h period

**Resources:**
- Memory: 6Gi limit
- CPU: 1 limit

### 3. Alloy (Helm Release)

DaemonSet log collector. Existing config in `alloy.yaml` is complete:
- Discovers pods via `discovery.kubernetes`
- Labels: namespace, pod, container, app, job, container_runtime, cluster
- Tails `/var/log/pods/` on each node
- Forwards to `http://loki.monitoring.svc.cluster.local:3100/loki/api/v1/push`

**Resources per pod:**
- Memory: 128Mi limit
- CPU: 200m limit

### 4. Grafana Datasource

ConfigMap with label `grafana_datasource: "1"` for sidecar auto-discovery:
- Name: Loki
- Type: loki
- URL: `http://loki.monitoring.svc.cluster.local:3100`
- Existing `loki.json` dashboard already in dashboards directory

### 5. Starter Alert Rules

Configured in Loki Ruler (evaluated in real-time, before disk flush):

| Alert | LogQL Expression | Severity |
|-------|-----------------|----------|
| HighErrorRate | `sum(rate({namespace=~".+"} \|= "error" [5m])) by (namespace) > 10` | warning |
| PodCrashLoopBackOff | `count_over_time({namespace=~".+"} \|= "CrashLoopBackOff" [5m]) > 0` | critical |
| OOMKilled | `count_over_time({namespace=~".+"} \|= "OOMKilled" [5m]) > 0` | critical |
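
Each table row maps onto one rule in a Prometheus-style rule group in the ruler's rule file. A minimal sketch for the first row (the group name and file layout here are illustrative, not fixed by this design):

```yaml
# Sketch of a Loki ruler rule-file entry for the HighErrorRate row above.
# The group name "log-alerts" is an assumption for illustration.
groups:
  - name: log-alerts
    rules:
      - alert: HighErrorRate
        expr: 'sum(rate({namespace=~".+"} |= "error" [5m])) by (namespace) > 10'
        for: 5m
        labels:
          severity: warning
```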

## Memory Budget

| Component | Per-pod | Pods | Total |
|-----------|---------|------|-------|
| Alloy | 128Mi | 5 | 640Mi |
| Loki | 6Gi | 1 | 6Gi |
| Sysctl DS | ~0 (pause) | 5 | ~0 |
| **Total** | | | **~6.6 GB** |

## Files to Change

| File | Action |
|------|--------|
| `modules/kubernetes/monitoring/loki.tf` | Uncomment Loki + Alloy helm releases, add sysctl DaemonSet, add Grafana Loki datasource ConfigMap |
| `modules/kubernetes/monitoring/loki.yaml` | Update with ingester tuning, ruler config, retention, resource limits |
| `modules/kubernetes/monitoring/alloy.yaml` | Add resource limits in Helm values wrapper |
| `secrets/nfs_directories.txt` | Ensure `/mnt/main/loki` entries exist |

## Implementation Steps

1. Add sysctl DaemonSet to `loki.tf`
2. Update `loki.yaml` with disk-friendly tuning, ruler, retention, resources
3. Update `alloy.yaml` with resource limits
4. Uncomment Loki Helm release in `loki.tf`, wire up NFS PV/PVC
5. Uncomment Alloy Helm release in `loki.tf`
6. Add Grafana Loki datasource ConfigMap to `loki.tf`
7. Add alert rules to Loki config
8. Ensure NFS exports exist in `secrets/nfs_directories.txt`
9. `terraform apply -target=module.kubernetes_cluster.module.monitoring`
10. Verify: Grafana Explore -> Loki datasource -> query `{namespace="monitoring"}`

## Risks

- **24h data loss on crash**: Accepted trade-off. Alerts fire in real-time before flush, so alert coverage is not affected — only historical log browsing is at risk.
- **Memory pressure**: 6Gi for Loki on a 16GB node is significant. Monitor with existing Prometheus memory alerts.
- **Log volume spikes**: A chatty pod could cause Loki to OOM. Alloy can be configured with rate limiting if needed (future enhancement).

@@ -1,532 +0,0 @@

# Centralized Log Collection Implementation Plan

> **For Claude:** REQUIRED SUB-SKILL: Use superpowers:executing-plans to implement this plan task-by-task.

**Goal:** Deploy Loki + Alloy for centralized Kubernetes log collection with 24h in-memory chunks, 7-day disk retention, and log-based alerting via existing Alertmanager.

**Architecture:** Alloy DaemonSet tails pod logs on all 5 nodes, forwards to single-binary Loki which holds chunks in 6Gi RAM for 24h before flushing to NFS. Loki Ruler evaluates LogQL alert rules in real-time and fires to Alertmanager. Grafana gets a Loki datasource via sidecar auto-provisioning.

**Tech Stack:** Terraform, Helm (Loki chart, Alloy chart), Kubernetes DaemonSet, NFS, Grafana

**Design doc:** `docs/plans/2026-02-13-centralized-log-collection-design.md`

---

### Task 1: Add sysctl DaemonSet for inotify limits

Alloy uses fsnotify to tail log files. Default kernel limits cause "too many open files" errors. This DaemonSet sets the limits on every node persistently.

**Files:**
- Modify: `modules/kubernetes/monitoring/loki.tf` (replace the comment block at lines 67-71)

**Step 1: Write the sysctl DaemonSet resource**

Replace lines 67-71 (the comment block about sysctl) with this Terraform resource in `loki.tf`:

```hcl
resource "kubernetes_daemon_set_v1" "sysctl-inotify" {
  metadata {
    name      = "sysctl-inotify"
    namespace = kubernetes_namespace.monitoring.metadata[0].name
    labels = {
      app = "sysctl-inotify"
    }
  }
  spec {
    selector {
      match_labels = {
        app = "sysctl-inotify"
      }
    }
    template {
      metadata {
        labels = {
          app = "sysctl-inotify"
        }
      }
      spec {
        init_container {
          name  = "sysctl"
          image = "busybox:1.37"
          command = [
            "sh", "-c",
            "sysctl -w fs.inotify.max_user_watches=1048576 && sysctl -w fs.inotify.max_user_instances=512 && sysctl -w fs.inotify.max_queued_events=1048576"
          ]
          security_context {
            privileged = true
          }
        }
        container {
          name  = "pause"
          image = "registry.k8s.io/pause:3.10"
          resources {
            requests = {
              cpu    = "1m"
              memory = "4Mi"
            }
            limits = {
              cpu    = "1m"
              memory = "4Mi"
            }
          }
        }
        host_pid = true
        toleration {
          operator = "Exists"
        }
      }
    }
  }
}
```

**Step 2: Run terraform fmt**

Run: `terraform fmt -recursive modules/kubernetes/monitoring/`

**Step 3: Run terraform plan to verify**

Run: `terraform plan -target=module.kubernetes_cluster.module.monitoring -var="kube_config_path=$(pwd)/config" 2>&1 | tail -30`
Expected: Plan shows 1 resource to add (`kubernetes_daemon_set_v1.sysctl-inotify`)

**Step 4: Commit**

```bash
git add modules/kubernetes/monitoring/loki.tf
git commit -m "[ci skip] Add sysctl DaemonSet for inotify limits"
```

---

### Task 2: Update Loki Helm values with disk-friendly tuning

Configure ingester for 24h in-memory chunks, WAL on tmpfs, 7-day retention, ruler for alerting, and resource limits.

**Files:**
- Modify: `modules/kubernetes/monitoring/loki.yaml` (full rewrite)

**Step 1: Write updated loki.yaml**

Replace entire contents of `loki.yaml` with:

```yaml
loki:
  commonConfig:
    replication_factor: 1
  schemaConfig:
    configs:
      - from: "2025-04-01"
        store: tsdb
        object_store: filesystem
        schema: v13
        index:
          prefix: loki_index_
          period: 24h
  ingester:
    chunk_idle_period: 12h
    max_chunk_age: 24h
    chunk_retain_period: 1m
    chunk_target_size: 1572864
    wal:
      dir: /loki-wal
  pattern_ingester:
    enabled: true
  limits_config:
    allow_structured_metadata: true
    volume_enabled: true
    retention_period: 168h
  compactor:
    retention_enabled: true
    working_directory: /loki/compactor
    compaction_interval: 1h
    delete_request_store: filesystem
  ruler:
    enable_api: true
    storage:
      type: local
      local:
        directory: /loki/rules
    alertmanager_url: http://alertmanager.monitoring.svc.cluster.local:9093
    ring:
      kvstore:
        store: inmemory
    rule_path: /loki/scratch
  storage:
    type: "filesystem"
  auth_enabled: false

minio:
  enabled: false

deploymentMode: SingleBinary

singleBinary:
  replicas: 1
  persistence:
    enabled: true
    size: 15Gi
    storageClass: ""
  extraVolumes:
    - name: wal
      emptyDir:
        medium: Memory
        sizeLimit: 2Gi
    - name: rules
      configMap:
        name: loki-alert-rules
  extraVolumeMounts:
    - name: wal
      mountPath: /loki-wal
    - name: rules
      mountPath: /loki/rules/fake
  resources:
    requests:
      cpu: 250m
      memory: 4Gi
    limits:
      cpu: "1"
      memory: 6Gi

# Zero out replica counts of other deployment modes
backend:
  replicas: 0
read:
  replicas: 0
write:
  replicas: 0
ingester:
  replicas: 0
querier:
  replicas: 0
queryFrontend:
  replicas: 0
queryScheduler:
  replicas: 0
distributor:
  replicas: 0
compactor:
  replicas: 0
indexGateway:
  replicas: 0
bloomCompactor:
  replicas: 0
bloomGateway:
  replicas: 0
```

**Step 2: Commit**

```bash
git add modules/kubernetes/monitoring/loki.yaml
git commit -m "[ci skip] Update Loki config with disk-friendly tuning and ruler"
```

---

### Task 3: Update Alloy Helm values with resource limits

The Alloy config content is already complete. Wrap it in proper Helm values with resource limits.

**Files:**
- Modify: `modules/kubernetes/monitoring/alloy.yaml` (add resource limits)

**Step 1: Add resource limits to alloy.yaml**

Append after the existing `alloy.configMap.content` block (after the last line):

```yaml

  # Resource limits for DaemonSet pods
  resources:
    requests:
      cpu: 50m
      memory: 64Mi
    limits:
      cpu: 200m
      memory: 128Mi
```

The final file should have the `alloy.configMap.content` block unchanged, with `alloy.resources` added as a sibling under `alloy:` (hence the two-space indent above).
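
A sketch of the resulting file shape, with the existing config body elided (only the layout is the point here):

```yaml
# Sketch of the final alloy.yaml layout. The configMap content is unchanged
# and elided; only the sibling placement of the new resources block is shown.
alloy:
  configMap:
    content: |
      // ... existing Alloy config, unchanged ...
  resources:
    requests:
      cpu: 50m
      memory: 64Mi
    limits:
      cpu: 200m
      memory: 128Mi
```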

**Step 2: Commit**

```bash
git add modules/kubernetes/monitoring/alloy.yaml
git commit -m "[ci skip] Add resource limits to Alloy config"
```

---

### Task 4: Uncomment Loki Helm release and PV in loki.tf

Enable the Loki Helm release and its NFS persistent volume. Remove minio PV (not needed with filesystem storage).

**Files:**
- Modify: `modules/kubernetes/monitoring/loki.tf` (uncomment Loki resources, remove minio PV)

**Step 1: Uncomment the Loki Helm release (lines 1-12)**

Uncomment and update the helm_release to:

```hcl
resource "helm_release" "loki" {
  namespace        = kubernetes_namespace.monitoring.metadata[0].name
  create_namespace = true
  name             = "loki"

  repository = "https://grafana.github.io/helm-charts"
  chart      = "loki"

  values  = [templatefile("${path.module}/loki.yaml", {})]
  timeout = 300

  depends_on = [kubernetes_config_map.loki_alert_rules]
}
```

**Step 2: Uncomment the Loki NFS PV (lines 14-32)**

Uncomment the `kubernetes_persistent_volume.loki` resource as-is.

**Step 3: Remove the minio PV block (lines 34-52)**

Delete the entire `kubernetes_persistent_volume.loki-minio` commented block — minio is disabled.

**Step 4: Run terraform fmt**

Run: `terraform fmt -recursive modules/kubernetes/monitoring/`

**Step 5: Commit**

```bash
git add modules/kubernetes/monitoring/loki.tf
git commit -m "[ci skip] Enable Loki Helm release and NFS PV"
```

---

### Task 5: Uncomment Alloy Helm release in loki.tf

Enable the Alloy Helm release.

**Files:**
- Modify: `modules/kubernetes/monitoring/loki.tf` (uncomment Alloy helm release)

**Step 1: Uncomment and update the Alloy Helm release**

Replace the commented Alloy block with:

```hcl
resource "helm_release" "alloy" {
  namespace        = kubernetes_namespace.monitoring.metadata[0].name
  create_namespace = true
  name             = "alloy"

  repository = "https://grafana.github.io/helm-charts"
  chart      = "alloy"

  values = [file("${path.module}/alloy.yaml")]
  atomic = true

  depends_on = [helm_release.loki]
}
```

**Step 2: Run terraform fmt**

Run: `terraform fmt -recursive modules/kubernetes/monitoring/`

**Step 3: Commit**

```bash
git add modules/kubernetes/monitoring/loki.tf
git commit -m "[ci skip] Enable Alloy Helm release"
```

---

### Task 6: Add Grafana Loki datasource ConfigMap

Grafana's sidecar auto-discovers ConfigMaps with label `grafana_datasource: "1"`. Create one for Loki.

**Files:**
- Modify: `modules/kubernetes/monitoring/loki.tf` (add ConfigMap resource)

**Step 1: Add the datasource ConfigMap**

Add to `loki.tf`:

```hcl
resource "kubernetes_config_map" "grafana_loki_datasource" {
  metadata {
    name      = "grafana-loki-datasource"
    namespace = kubernetes_namespace.monitoring.metadata[0].name
    labels = {
      grafana_datasource = "1"
    }
  }
  data = {
    "loki-datasource.yaml" = yamlencode({
      apiVersion = 1
      datasources = [{
        name      = "Loki"
        type      = "loki"
        access    = "proxy"
        url       = "http://loki.monitoring.svc.cluster.local:3100"
        isDefault = false
      }]
    })
  }
}
```

**Step 2: Run terraform fmt**

Run: `terraform fmt -recursive modules/kubernetes/monitoring/`

**Step 3: Commit**

```bash
git add modules/kubernetes/monitoring/loki.tf
git commit -m "[ci skip] Add Grafana Loki datasource ConfigMap"
```

---

### Task 7: Add Loki alert rules ConfigMap

Create the ConfigMap that Loki's ruler reads for alert rules. Mounted into the Loki pod at `/loki/rules/fake/`.

**Files:**
- Modify: `modules/kubernetes/monitoring/loki.tf` (add alert rules ConfigMap)

**Step 1: Add the alert rules ConfigMap**

Add to `loki.tf`:

```hcl
resource "kubernetes_config_map" "loki_alert_rules" {
  metadata {
    name      = "loki-alert-rules"
    namespace = kubernetes_namespace.monitoring.metadata[0].name
  }
  data = {
    "rules.yaml" = yamlencode({
      groups = [{
        name = "log-alerts"
        rules = [
          {
            alert = "HighErrorRate"
            expr  = "sum(rate({namespace=~\".+\"} |= \"error\" [5m])) by (namespace) > 10"
            for   = "5m"
            labels = {
              severity = "warning"
            }
            annotations = {
              summary = "High error rate in {{ $labels.namespace }}"
            }
          },
          {
            alert = "PodCrashLoopBackOff"
            expr  = "count_over_time({namespace=~\".+\"} |= \"CrashLoopBackOff\" [5m]) > 0"
            for   = "1m"
            labels = {
              severity = "critical"
            }
            annotations = {
              summary = "CrashLoopBackOff detected in {{ $labels.namespace }}"
            }
          },
          {
            alert = "OOMKilled"
            expr  = "count_over_time({namespace=~\".+\"} |= \"OOMKilled\" [5m]) > 0"
            for   = "1m"
            labels = {
              severity = "critical"
            }
            annotations = {
              summary = "OOMKilled detected in {{ $labels.namespace }}"
            }
          }
        ]
      }]
    })
  }
}
```

**Step 2: Run terraform fmt**

Run: `terraform fmt -recursive modules/kubernetes/monitoring/`

**Step 3: Commit**

```bash
git add modules/kubernetes/monitoring/loki.tf
git commit -m "[ci skip] Add Loki alert rules ConfigMap"
```

---

### Task 8: Deploy and verify

Apply all changes via Terraform and verify the stack is working.

**Files:** None (deployment only)

**Step 1: Run terraform apply for monitoring module**

Run: `terraform apply -target=module.kubernetes_cluster.module.monitoring -var="kube_config_path=$(pwd)/config" -auto-approve`
Expected: Multiple resources created (sysctl DaemonSet, Loki Helm release, Alloy Helm release, PV, ConfigMaps)

**Step 2: Verify sysctl DaemonSet is running on all nodes**

Run: `kubectl --kubeconfig $(pwd)/config get ds -n monitoring sysctl-inotify`
Expected: DESIRED=5, CURRENT=5, READY=5

**Step 3: Verify Loki pod is running**

Run: `kubectl --kubeconfig $(pwd)/config get pods -n monitoring -l app.kubernetes.io/name=loki`
Expected: 1/1 Running

**Step 4: Verify Alloy DaemonSet is running**

Run: `kubectl --kubeconfig $(pwd)/config get ds -n monitoring -l app.kubernetes.io/name=alloy`
Expected: DESIRED=5, CURRENT=5, READY=5

**Step 5: Verify Loki is receiving logs**

Run: `kubectl --kubeconfig $(pwd)/config exec -n monitoring deploy/loki -- wget -qO- 'http://localhost:3100/loki/api/v1/labels'`
Expected: JSON response with labels like `namespace`, `pod`, `container`

**Step 6: Verify Grafana has Loki datasource**

Open `https://grafana.viktorbarzin.me/explore`, select "Loki" datasource, run query: `{namespace="monitoring"}`
Expected: Log lines from monitoring namespace pods

**Step 7: Commit final state**

```bash
git add -A
git commit -m "[ci skip] Deploy centralized log collection (Loki + Alloy)"
```

---

### Troubleshooting

**If Alloy pods crash with inotify errors:**
- Check sysctl DaemonSet init logs: `kubectl --kubeconfig $(pwd)/config logs -n monitoring ds/sysctl-inotify -c sysctl`
- Verify sysctl values on node: `kubectl --kubeconfig $(pwd)/config debug node/k8s-node2 -it --image=busybox -- sysctl fs.inotify.max_user_watches`

**If Loki OOMs:**
- Check memory usage: `kubectl --kubeconfig $(pwd)/config top pod -n monitoring -l app.kubernetes.io/name=loki`
- Reduce `max_chunk_age` from 24h to 12h in `loki.yaml` to flush more frequently

**If Grafana doesn't show Loki datasource:**
- Verify ConfigMap has correct label: `kubectl --kubeconfig $(pwd)/config get cm -n monitoring grafana-loki-datasource -o yaml`
- Restart Grafana sidecar: `kubectl --kubeconfig $(pwd)/config rollout restart deploy -n monitoring grafana`

**If Loki PV won't bind:**
- Check NFS export exists: `ssh root@10.0.10.15 'showmount -e localhost | grep loki'`
- Run NFS export script: `cd secrets && bash nfs_exports.sh`

@@ -1,154 +0,0 @@

# Multi-User Kubernetes Access Design

**Date**: 2026-02-17
**Status**: Approved

## Problem

The cluster uses a single `kubernetes-admin` client certificate for all access. There is no way to:
- Give different users different levels of access
- Track who performed which actions
- Enforce resource limits per user
- Onboard new users without sharing admin credentials

## Decision

Native OIDC authentication on the kube-apiserver using Authentik as the identity provider, with Terraform-managed RBAC and a self-service Svelte portal for user onboarding.

### Alternatives Considered

1. **Pinniped (Concierge + Supervisor)**: Avoids API server changes but adds two components to maintain. Requires Pinniped CLI on user machines. Overkill for a single-cluster setup.
2. **kube-oidc-proxy**: Avoids API server changes but adds a proxy in the request path (single point of failure, extra latency). Sporadic maintenance from JetStack.

## Architecture

```
User → Self-Service Portal → Authentik Login → Download Kubeconfig

User → kubectl (with kubelogin) → kube-apiserver → OIDC validation → Authentik
                                        │
                                  RBAC evaluation
                                        │
                                  Audit logging → Alloy → Loki → Grafana
```

### User Roles

| Role | Scope | Access |
|------|-------|--------|
| `admin` | Cluster-wide | Full `cluster-admin` access |
| `power-user` | Cluster-wide | Deploy/manage workloads, view all resources, no RBAC/node modification |
| `namespace-owner` | Specific namespaces | Full `admin` within assigned namespaces only |

## Components

### 1. Authentik OIDC Provider

New OAuth2/OIDC application in Authentik configured via Terraform (`modules/kubernetes/authentik/`).

- **Application name**: `kubernetes`
- **Provider type**: OAuth2/OpenID Connect
- **Client type**: Public (no client secret, kubelogin uses PKCE)
- **Redirect URIs**: `http://localhost:8000/callback` (kubelogin default)
- **Scopes**: `openid`, `email`, `profile`, `groups`
- **Property mappings**: Include `groups` claim for RBAC group matching

### 2. kube-apiserver OIDC Flags

One-time change on k8s-master (`10.0.20.100`), automated via Terraform `null_resource` with `remote-exec`.

Added to `/etc/kubernetes/manifests/kube-apiserver.yaml`:

```yaml
- --oidc-issuer-url=https://authentik.viktorbarzin.me/application/o/kubernetes/
- --oidc-client-id=kubernetes
- --oidc-username-claim=email
- --oidc-groups-claim=groups
```

Kubelet auto-restarts the API server pod when the manifest changes. These flags persist through `kubeadm upgrade apply`.

### 3. RBAC (Terraform-managed)

New module: `modules/kubernetes/rbac/main.tf`

**User definition** in `terraform.tfvars`:

```hcl
k8s_users = {
  "viktor" = {
    role  = "admin"
    email = "viktor@viktorbarzin.me"
  }
  "alice" = {
    role  = "power-user"
    email = "alice@example.com"
  }
  "bob" = {
    role       = "namespace-owner"
    namespaces = ["bob-apps", "bob-dev"]
    email      = "bob@example.com"
  }
}
```
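
The matching variable declaration is not spelled out in this doc; a plausible sketch using Terraform 1.3+ `optional()` syntax (field shapes inferred from the example above and the `quota` field mentioned in section 6):

```hcl
# Sketch only: inferred from the tfvars example above, not a committed schema.
variable "k8s_users" {
  type = map(object({
    role       = string                     # "admin" | "power-user" | "namespace-owner"
    email      = string
    namespaces = optional(list(string), []) # only used by namespace-owner
    quota      = optional(map(string), {})  # per-user ResourceQuota overrides (section 6)
  }))
}
```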

**Resources created per role:**

| Role | Terraform Resources |
|------|-------------------|
| `admin` | `ClusterRoleBinding` → `cluster-admin` for user email |
| `power-user` | Custom `ClusterRole` (workload management, no RBAC/node access) + `ClusterRoleBinding` |
| `namespace-owner` | `Namespace`(s) + `RoleBinding` → built-in `admin` ClusterRole + `ResourceQuota` per namespace |
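
To illustrate the power-user row, a hedged sketch of what the custom ClusterRole could look like (the exact rule list is an assumption; the design only commits to "workload management, no RBAC/node access"):

```hcl
# Sketch of the power-user ClusterRole — the rule list is an assumption.
resource "kubernetes_cluster_role" "power_user" {
  metadata {
    name = "power-user"
  }
  # Full control over common workload APIs.
  rule {
    api_groups = ["apps", "batch", "networking.k8s.io"]
    resources  = ["*"]
    verbs      = ["*"]
  }
  rule {
    api_groups = [""]
    resources  = ["pods", "pods/log", "pods/exec", "services", "configmaps", "persistentvolumeclaims"]
    verbs      = ["*"]
  }
  # Read-only on everything else; no write access to RBAC objects or nodes.
  rule {
    api_groups = ["*"]
    resources  = ["*"]
    verbs      = ["get", "list", "watch"]
  }
}
```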

### 4. Self-Service Portal

Svelte (SvelteKit) app at `https://k8s-portal.viktorbarzin.me`.

**Flow:**
1. User visits portal → Authentik login via Traefik forward auth
2. Portal displays user's role and assigned namespaces
3. User downloads pre-configured kubeconfig (generated server-side)
4. Portal shows setup instructions (install kubectl + kubelogin)

**Kubeconfig template** includes (see the sketch below):
- Cluster: `https://10.0.20.100:6443` with CA cert
- Auth: `exec` credential plugin pointing to kubelogin
- OIDC issuer URL and client ID pre-configured
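
A hedged sketch of the generated kubeconfig (server address and issuer come from this doc; the exec stanza follows kubelogin's `kubectl oidc-login get-token` form, and the cluster/context names are placeholders):

```yaml
# Sketch only — kubelogin flags should be checked against the installed version.
apiVersion: v1
kind: Config
clusters:
  - name: homelab # placeholder name
    cluster:
      server: https://10.0.20.100:6443
      certificate-authority-data: <base64 CA cert>
users:
  - name: oidc
    user:
      exec:
        apiVersion: client.authentication.k8s.io/v1beta1
        command: kubectl
        args:
          - oidc-login
          - get-token
          - --oidc-issuer-url=https://authentik.viktorbarzin.me/application/o/kubernetes/
          - --oidc-client-id=kubernetes
          - --oidc-extra-scope=email
          - --oidc-extra-scope=groups
contexts:
  - name: homelab
    context:
      cluster: homelab
      user: oidc
current-context: homelab
```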

**Deployment**: Standard Kubernetes deployment + service + ingress, Terraform-managed like other services. No database needed — user role info is read from Kubernetes RBAC bindings or a Terraform-generated ConfigMap.

### 5. Audit Logging

Kubernetes audit policy deployed to master via the same `null_resource`.

**Policy** (`/etc/kubernetes/audit-policy.yaml`), sketched below:
- `RequestResponse` level for OIDC-authenticated users (captures what they changed)
- `Metadata` level for system/service accounts (keeps volume down)
- Secrets logged at `Metadata` level only (no request/response bodies)
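
A minimal sketch of such a policy (the group selectors are assumptions; the real match depends on how OIDC users and service accounts are grouped):

```yaml
# Sketch of /etc/kubernetes/audit-policy.yaml — selectors are assumptions.
apiVersion: audit.k8s.io/v1
kind: Policy
rules:
  # Secrets: metadata only, never request/response bodies.
  - level: Metadata
    resources:
      - group: ""
        resources: ["secrets"]
  # System components and service accounts: metadata keeps volume down.
  - level: Metadata
    userGroups: ["system:serviceaccounts", "system:nodes"]
  # Everything else (human/OIDC users): full request/response.
  - level: RequestResponse
```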

**Log pipeline**: Audit log file → Alloy (DaemonSet on master) → Loki → Grafana dashboard

**Grafana dashboard** shows: who accessed what resource, when, from where, and the outcome (allow/deny).

### 6. Resource Quotas

Each namespace-owner namespace gets a `ResourceQuota`:

```hcl
requests.cpu    = "2"
requests.memory = "4Gi"
limits.cpu      = "4"
limits.memory   = "8Gi"
pods            = "20"
```

Defaults can be overridden per-user via an optional `quota` field in the `k8s_users` variable; a Terraform sketch follows.
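
Wired into Terraform, that could look roughly like the following (the resource shape is the kubernetes provider's `kubernetes_resource_quota`; the merge with per-user overrides and both variable names are assumptions):

```hcl
# Sketch: one ResourceQuota per namespace-owner namespace.
resource "kubernetes_resource_quota" "namespace_owner" {
  for_each = toset(var.namespaces) # hypothetical list of the user's namespaces

  metadata {
    name      = "owner-quota"
    namespace = each.value
  }
  spec {
    hard = merge({
      "requests.cpu"    = "2"
      "requests.memory" = "4Gi"
      "limits.cpu"      = "4"
      "limits.memory"   = "8Gi"
      "pods"            = "20"
    }, var.quota_overrides) # hypothetical per-user override map
  }
}
```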

## Implementation Order

1. Authentik OIDC application setup
2. kube-apiserver OIDC flag configuration
3. RBAC Terraform module
4. Audit logging
5. Self-service portal
6. Grafana dashboard for audit logs

File diff suppressed because it is too large.

@@ -1,111 +0,0 @@

# OpenClaw Cluster Management Agent — Design

**Date**: 2026-02-21
**Status**: Approved

## Goal

Build a proactive cluster management agent that runs scheduled health checks every 30 minutes, auto-fixes safe issues, and alerts via Slack. The agent is "taught" via an OpenClaw skill and a reusable health check script.

## Architecture

```
CronJob (every 30min)
  └─ kubectl exec into OpenClaw pod
       └─ /workspace/infra/.claude/cluster-health.sh
            ├─ kubectl get nodes (check health)
            ├─ kubectl get pods -A (find problems)
            ├─ kubectl delete pod (evicted/stuck)
            └─ curl Slack webhook (report)
```

Interactive path: User asks OpenClaw via UI -> `cluster-health` skill triggers -> runs same script -> LLM analyzes output and can do deeper investigation.

## Components

### 1. `cluster-health` skill (`.claude/skills/cluster-health/SKILL.md`)

Teaches OpenClaw:
- What health checks to run
- What's safe to auto-fix vs alert-only
- How to format Slack alerts
- How to do deeper investigation when asked interactively

Trigger conditions: "check cluster", "cluster health", "what's wrong", "health check", etc.

### 2. `cluster-health.sh` helper script (`.claude/cluster-health.sh`)

Reusable script that performs all checks:

**Checks:**
- Node health (NotReady, MemoryPressure, DiskPressure, PIDPressure)
- Pod health (CrashLoopBackOff, ImagePullBackOff, Error, OOMKilled, Pending)
- Evicted pods
- Failed deployments (unavailable replicas)
- Pending PVCs
- Resource pressure (high CPU/memory allocation)
- Failed CronJobs
- DaemonSet health (missing pods)

**Safe auto-fix actions:**
- Delete evicted pods
- Delete completed/succeeded pods older than 24h
- Restart (delete) pods in CrashLoopBackOff for more than 1 hour

**Alert-only (never auto-fix):**
- Node NotReady
- Persistent OOMKilled
- ImagePullBackOff
- Pending PVCs
- Failed deployments with 0 available replicas

**Output:**
- Structured text summary
- Posts to Slack via webhook
- Exit code 0 = healthy, 1 = issues found

### 3. Kubernetes CronJob (in `modules/kubernetes/openclaw/main.tf`)

- Schedule: `*/30 * * * *`
- Container: `bitnami/kubectl` (minimal image with kubectl)
- Command: `kubectl exec deploy/openclaw -n openclaw -- /bin/bash /workspace/infra/.claude/cluster-health.sh`
- ServiceAccount with RBAC to exec into pods in `openclaw` namespace
- `concurrencyPolicy: Forbid`
- `failedJobsHistoryLimit: 3`
- `successfulJobsHistoryLimit: 3`

### 4. Slack Integration

- Webhook URL from `openclaw_skill_secrets["slack"]` (already configured)
- Passed as `SLACK_WEBHOOK_URL` env var to the OpenClaw pod

## Slack Message Format

```
:white_check_mark: Cluster Health Check — All Clear
Nodes: 5/5 Ready | Pods: 142 Running | 0 Issues
```

```
:warning: Cluster Health Check — 3 Issues Found

Auto-fixed:
- Deleted 4 evicted pods in monitoring namespace
- Restarted stuck pod calibre-web-xyz (CrashLoopBackOff >1h)

Needs attention:
- Node k8s-node3: MemoryPressure condition detected
- PVC data-tandoor pending for 45 minutes
```

## Decisions

| Decision | Choice | Rationale |
|----------|--------|-----------|
| Mode | Proactive (scheduled) | Want automated monitoring |
| Alert channel | Slack | Existing webhook in openclaw_skill_secrets |
| Auto-fix | Safe fixes only | Delete evicted, restart stuck; alert for the rest |
| Frequency | 30 minutes | Balance between detection speed and overhead |
| Checks scope | Standard K8s health | Pod/node/deployment/PVC/CronJob/DaemonSet |
| Trigger mechanism | CronJob execs into OpenClaw pod | Reuses OpenClaw's tools; LLM available interactively |
| Fallback | None | Uptime Kuma monitors OpenClaw availability |

@@ -1,800 +0,0 @@

# OpenClaw Cluster Management Agent — Implementation Plan

> **For Claude:** REQUIRED SUB-SKILL: Use superpowers:executing-plans to implement this plan task-by-task.

**Goal:** Build a proactive cluster health agent — a skill that teaches OpenClaw to check the cluster, a helper script that runs the checks and posts to Slack, and a CronJob that triggers it every 30 minutes via `kubectl exec`.

**Architecture:** CronJob (bitnami/kubectl) -> `kubectl exec` into OpenClaw pod -> runs `cluster-health.sh` which performs 8 health checks, auto-fixes safe issues, and posts a summary to Slack. The same script is available as an OpenClaw skill for interactive use.

**Tech Stack:** Bash (health check script), Terraform/HCL (CronJob + RBAC), Slack webhook API, kubectl

---

### Task 1: Add Slack webhook to openclaw_skill_secrets

**Files:**
- Modify: `terraform.tfvars:1291-1295` (add slack_webhook key)
- Modify: `modules/kubernetes/openclaw/main.tf:350-376` (add SLACK_WEBHOOK_URL env var)

**Step 1: Add slack_webhook to openclaw_skill_secrets in tfvars**

Add a new key `slack_webhook` to the existing `openclaw_skill_secrets` map. The user must provide the webhook URL. For now, use the existing `alertmanager_slack_api_url` value or a dedicated one.

In `terraform.tfvars`, change:
```hcl
openclaw_skill_secrets = {
  home_assistant_token       = "..."
  home_assistant_sofia_token = "..."
  uptime_kuma_password       = "..."
}
```
to:
```hcl
openclaw_skill_secrets = {
  home_assistant_token       = "..."
  home_assistant_sofia_token = "..."
  uptime_kuma_password       = "..."
  slack_webhook              = "https://hooks.slack.com/services/YOUR/WEBHOOK/URL"
}
```

**NOTE:** Ask the user which Slack webhook URL to use. Candidates:
- `alertmanager_slack_api_url` (line 4 in tfvars)
- `tiny_tuya_slack_url` (line 1213, comment says "K8s bot slack")
- A new webhook the user creates

**Step 2: Add SLACK_WEBHOOK_URL env var to OpenClaw container**

In `modules/kubernetes/openclaw/main.tf`, add after the `UPTIME_KUMA_PASSWORD` env block (around line 370):
```hcl
# Skill secrets - Slack
env {
  name  = "SLACK_WEBHOOK_URL"
  value = var.skill_secrets["slack_webhook"]
}
```

**Step 3: Commit**

```bash
git add modules/kubernetes/openclaw/main.tf
git commit -m "[ci skip] Add Slack webhook env var to OpenClaw deployment"
```

Do NOT commit `terraform.tfvars` separately — it will be committed with the full set of changes at the end.

---

### Task 2: Create the cluster-health.sh helper script

**Files:**
- Create: `.claude/cluster-health.sh`

**Step 1: Write the health check script**

Create `.claude/cluster-health.sh` with the following structure. The script:
- Uses `$KUBECONFIG` (already set in the OpenClaw pod) or falls back to in-cluster config
- Runs 8 checks: nodes, pods, evicted, deployments, PVCs, resources, CronJobs, DaemonSets
- Auto-fixes: deletes evicted pods, restarts CrashLoopBackOff pods stuck >1 hour
- Posts a structured Slack message via `$SLACK_WEBHOOK_URL`
- Exit code 0 = healthy, 1 = issues found

```bash
#!/usr/bin/env bash
# Cluster health check script for OpenClaw.
# Runs health checks, auto-fixes safe issues, posts to Slack.
# Designed to run inside the OpenClaw pod (has kubectl via $KUBECONFIG).
#
# Usage: ./cluster-health.sh [--no-slack] [--no-fix]
#   --no-slack   Skip Slack notification (useful for interactive/debug runs)
#   --no-fix     Skip auto-fix actions (report only)

set -euo pipefail

SEND_SLACK=true
AUTO_FIX=true
ISSUES=()
FIXES=()
WARNINGS=()

# --- Argument parsing ---
for arg in "$@"; do
  case "$arg" in
    --no-slack) SEND_SLACK=false ;;
    --no-fix) AUTO_FIX=false ;;
  esac
done

KUBECTL="kubectl"

# --- 1. Node Health ---
check_nodes() {
  local nodes not_ready
  nodes=$($KUBECTL get nodes --no-headers 2>&1) || { ISSUES+=("Cannot reach cluster API"); return; }
  not_ready=$(echo "$nodes" | awk '$2 != "Ready" {print $1}' || true)

  if [[ -n "$not_ready" ]]; then
    while IFS= read -r node; do
      ISSUES+=("Node NotReady: $node")
    done <<< "$not_ready"
  fi

  # Check conditions
  local conditions
  conditions=$($KUBECTL get nodes -o json | python3 -c '
import json, sys
data = json.load(sys.stdin)
for node in data["items"]:
    name = node["metadata"]["name"]
    for c in node["status"]["conditions"]:
        if c["type"] in ("MemoryPressure","DiskPressure","PIDPressure") and c["status"] == "True":
            print(name + ": " + c["type"])
' 2>/dev/null) || true

  if [[ -n "$conditions" ]]; then
    while IFS= read -r line; do
      ISSUES+=("$line")
    done <<< "$conditions"
  fi
}

# --- 2. Pod Health ---
check_pods() {
  local bad
  bad=$( {
    $KUBECTL get pods -A --no-headers 2>/dev/null \
      | grep -E 'CrashLoopBackOff|ImagePullBackOff|ErrImagePull|Error' || true
  } | awk '!seen[$1,$2]++' | sed '/^$/d') || true

  if [[ -z "$bad" ]]; then return; fi

  while IFS= read -r line; do
    local ns pod status
    ns=$(echo "$line" | awk '{print $1}')
    pod=$(echo "$line" | awk '{print $2}')
    status=$(echo "$line" | awk '{print $4}')

    if [[ "$status" == "CrashLoopBackOff" ]]; then
      # Use restart count as a proxy for "stuck for >1 hour"
      local restart_count
      restart_count=$(echo "$line" | awk '{print $5}')
      if [[ "$AUTO_FIX" == true && "$restart_count" -gt 10 ]]; then
        $KUBECTL delete pod -n "$ns" "$pod" --grace-period=30 2>/dev/null && \
          FIXES+=("Restarted $ns/$pod (CrashLoopBackOff, $restart_count restarts)") || \
          WARNINGS+=("Failed to restart $ns/$pod")
      else
        ISSUES+=("CrashLoopBackOff: $ns/$pod ($restart_count restarts)")
      fi
    elif [[ "$status" == "ImagePullBackOff" || "$status" == "ErrImagePull" ]]; then
      ISSUES+=("ImagePullBackOff: $ns/$pod")
    else
      ISSUES+=("Error: $ns/$pod ($status)")
    fi
  done <<< "$bad"
}

# --- 3. Evicted/Failed Pods ---
check_evicted() {
  local evicted count
  evicted=$($KUBECTL get pods -A --no-headers --field-selector=status.phase=Failed 2>/dev/null || true)

  if [[ -z "$evicted" ]]; then return; fi
  count=$(echo "$evicted" | wc -l | tr -d ' ')

  if [[ "$AUTO_FIX" == true && "$count" -gt 0 ]]; then
    $KUBECTL delete pods -A --field-selector=status.phase=Failed 2>/dev/null && \
      FIXES+=("Deleted $count evicted/failed pod(s)") || \
      WARNINGS+=("Failed to delete evicted pods")
  else
    ISSUES+=("$count evicted/failed pod(s)")
  fi
}

# --- 4. Failed Deployments ---
check_deployments() {
  local deps
  deps=$($KUBECTL get deployments -A --no-headers 2>/dev/null) || return

  while IFS= read -r line; do
    local ns name ready current desired
    ns=$(echo "$line" | awk '{print $1}')
    name=$(echo "$line" | awk '{print $2}')
    ready=$(echo "$line" | awk '{print $3}')
    current=$(echo "$ready" | cut -d/ -f1)
    desired=$(echo "$ready" | cut -d/ -f2)

    if [[ "$current" != "$desired" ]]; then
      ISSUES+=("Deployment $ns/$name: $current/$desired ready")
    fi
  done <<< "$deps"
}

# --- 5. Pending PVCs ---
check_pvcs() {
  local pvcs
  pvcs=$($KUBECTL get pvc -A --no-headers 2>/dev/null) || return

  if [[ -z "$pvcs" || "$pvcs" == *"No resources found"* ]]; then return; fi

  while IFS= read -r line; do
    local ns name status
    ns=$(echo "$line" | awk '{print $1}')
    name=$(echo "$line" | awk '{print $2}')
    status=$(echo "$line" | awk '{print $3}')

    if [[ "$status" != "Bound" ]]; then
      ISSUES+=("PVC $ns/$name: $status")
    fi
  done <<< "$pvcs"
}

# --- 6. Resource Pressure ---
check_resources() {
  local top
  top=$($KUBECTL top nodes --no-headers 2>/dev/null) || return

  while IFS= read -r line; do
    local node cpu_pct mem_pct
    node=$(echo "$line" | awk '{print $1}')
    cpu_pct=$(echo "$line" | awk '{print $3}' | tr -d '%')
    mem_pct=$(echo "$line" | awk '{print $5}' | tr -d '%')

    [[ "$cpu_pct" == *"unknown"* || "$mem_pct" == *"unknown"* ]] && continue

    if [[ "$cpu_pct" -gt 90 || "$mem_pct" -gt 90 ]]; then
      ISSUES+=("High resource usage on $node: CPU ${cpu_pct}%, Mem ${mem_pct}%")
    elif [[ "$cpu_pct" -gt 80 || "$mem_pct" -gt 80 ]]; then
      WARNINGS+=("Elevated resource usage on $node: CPU ${cpu_pct}%, Mem ${mem_pct}%")
    fi
  done <<< "$top"
}

# --- 7. CronJob Failures ---
check_cronjobs() {
  local failures
  failures=$($KUBECTL get jobs -A -o json 2>/dev/null | python3 -c '
import json, sys
from datetime import datetime, timezone, timedelta

data = json.load(sys.stdin)
cutoff = datetime.now(timezone.utc) - timedelta(hours=24)

for job in data.get("items", []):
    meta = job.get("metadata", {})
    ns = meta.get("namespace", "")
    name = meta.get("name", "")
    owners = meta.get("ownerReferences", [])
    if not any(o.get("kind") == "CronJob" for o in owners):
        continue
    for c in job.get("status", {}).get("conditions", []):
        if c.get("type") == "Failed" and c.get("status") == "True":
            ts = c.get("lastTransitionTime", "")
            if ts:
                try:
                    t = datetime.fromisoformat(ts.replace("Z", "+00:00"))
                    if t > cutoff:
                        print(f"{ns}/{name}")
                except:
                    print(f"{ns}/{name}")
' 2>/dev/null) || true

  if [[ -n "$failures" ]]; then
    local count
    count=$(echo "$failures" | wc -l | tr -d ' ')
    ISSUES+=("$count CronJob failure(s) in last 24h")
  fi
}

# --- 8. DaemonSet Health ---
check_daemonsets() {
  local ds
  ds=$($KUBECTL get daemonsets -A --no-headers 2>/dev/null) || return

  while IFS= read -r line; do
    local ns name desired ready
    ns=$(echo "$line" | awk '{print $1}')
    name=$(echo "$line" | awk '{print $2}')
    desired=$(echo "$line" | awk '{print $3}')
    ready=$(echo "$line" | awk '{print $5}')

    if [[ "$desired" != "$ready" ]]; then
      ISSUES+=("DaemonSet $ns/$name: desired=$desired ready=$ready")
    fi
  done <<< "$ds"
}

# --- Cluster summary stats ---
get_summary_stats() {
  local node_count ready_count pod_count
  node_count=$($KUBECTL get nodes --no-headers 2>/dev/null | wc -l | tr -d ' ')
  ready_count=$($KUBECTL get nodes --no-headers 2>/dev/null | awk '$2 == "Ready"' | wc -l | tr -d ' ')
  pod_count=$($KUBECTL get pods -A --no-headers --field-selector=status.phase=Running 2>/dev/null | wc -l | tr -d ' ')
  echo "${ready_count}/${node_count} nodes | ${pod_count} pods running"
}

# --- Send Slack message ---
send_slack() {
  # Default to empty so an unset variable does not trip `set -u`.
  local webhook_url="${SLACK_WEBHOOK_URL:-}"
  if [[ -z "$webhook_url" ]]; then
    echo "WARNING: SLACK_WEBHOOK_URL not set, skipping Slack notification"
    return
  fi

  local summary issue_count fix_count warning_count
  summary=$(get_summary_stats)
  issue_count=${#ISSUES[@]}
  fix_count=${#FIXES[@]}
  warning_count=${#WARNINGS[@]}

  local text=""
  local total_problems=$((issue_count + warning_count))

  if [[ "$total_problems" -eq 0 && "$fix_count" -eq 0 ]]; then
    text=":white_check_mark: *Cluster Health Check — All Clear*\n${summary} | 0 issues"
  else
    if [[ "$issue_count" -gt 0 ]]; then
      text=":rotating_light: *Cluster Health Check — ${issue_count} Issue(s) Found*\n${summary}"
    elif [[ "$warning_count" -gt 0 ]]; then
      text=":warning: *Cluster Health Check — ${warning_count} Warning(s)*\n${summary}"
    else
      text=":white_check_mark: *Cluster Health Check — All Clear (auto-fixed ${fix_count})*\n${summary}"
    fi

    if [[ "$fix_count" -gt 0 ]]; then
      text+="\n\n*Auto-fixed:*"
      for fix in "${FIXES[@]}"; do
        text+="\n• ${fix}"
      done
    fi

    if [[ "$issue_count" -gt 0 ]]; then
      text+="\n\n*Needs attention:*"
      for issue in "${ISSUES[@]}"; do
        text+="\n• ${issue}"
      done
    fi

    if [[ "$warning_count" -gt 0 ]]; then
      text+="\n\n*Warnings:*"
      for warning in "${WARNINGS[@]}"; do
        text+="\n• ${warning}"
      done
    fi
  fi

  curl -s -X POST "$webhook_url" \
    -H 'Content-Type: application/json' \
    -d "{\"text\": \"${text}\"}" > /dev/null 2>&1
}

# --- Main ---
main() {
  echo "=== Cluster Health Check — $(date '+%Y-%m-%d %H:%M:%S') ==="

  check_nodes
  check_pods
  check_evicted
  check_deployments
  check_pvcs
  check_resources
  check_cronjobs
  check_daemonsets

  local issue_count=${#ISSUES[@]}
  local fix_count=${#FIXES[@]}
  local warning_count=${#WARNINGS[@]}

  echo ""
  echo "Results: ${issue_count} issue(s), ${fix_count} fix(es), ${warning_count} warning(s)"

  if [[ "$fix_count" -gt 0 ]]; then
    echo ""
    echo "Auto-fixed:"
    for fix in "${FIXES[@]}"; do echo "  - $fix"; done
  fi

  if [[ "$issue_count" -gt 0 ]]; then
    echo ""
    echo "Issues:"
    for issue in "${ISSUES[@]}"; do echo "  - $issue"; done
  fi

  if [[ "$warning_count" -gt 0 ]]; then
    echo ""
    echo "Warnings:"
    for warning in "${WARNINGS[@]}"; do echo "  - $warning"; done
  fi

  if [[ "$SEND_SLACK" == true ]]; then
    send_slack
    echo ""
    echo "Slack notification sent."
  fi

  # Exit code
  if [[ "$issue_count" -gt 0 ]]; then
    exit 1
  fi
  exit 0
}

main "$@"
```

**Step 2: Make it executable**

```bash
chmod +x .claude/cluster-health.sh
```

**Step 3: Test locally (dry run)**

```bash
KUBECONFIG=$(pwd)/config SLACK_WEBHOOK_URL="" bash .claude/cluster-health.sh --no-slack
```

Expected: Script runs, prints check results, no Slack post.

**Step 4: Commit**

```bash
git add .claude/cluster-health.sh
git commit -m "[ci skip] Add cluster health check script for OpenClaw agent"
```

---

### Task 3: Create the cluster-health skill

**Files:**
- Create: `.claude/skills/cluster-health/SKILL.md`

**Step 1: Write the skill document**

````markdown
---
name: cluster-health
description: |
  Check Kubernetes cluster health and fix common issues. Use when:
  (1) User asks to check the cluster, check health, or "what's wrong",
  (2) User asks about pod status, node health, or deployment issues,
  (3) User asks to fix stuck pods, evicted pods, or CrashLoopBackOff,
  (4) User mentions "health check", "cluster status", "cluster health",
  (5) User asks "is everything running" or "any problems".
  Runs 8 standard K8s health checks with safe auto-fix for evicted pods
  and stuck CrashLoopBackOff pods.
author: Claude Code
version: 1.0.0
date: 2026-02-21
---

# Cluster Health Check

## Overview
- **Script**: `/workspace/infra/.claude/cluster-health.sh`
- **Schedule**: CronJob runs every 30 minutes, execs into this pod
- **Slack**: Posts results to `$SLACK_WEBHOOK_URL`
- **Auto-fix**: Deletes evicted pods, restarts CrashLoopBackOff pods (>10 restarts)

## Quick Check

Run the health check script:
```bash
bash /workspace/infra/.claude/cluster-health.sh --no-slack
```

Or with Slack notification:
```bash
bash /workspace/infra/.claude/cluster-health.sh
```

Report-only (no auto-fix):
```bash
bash /workspace/infra/.claude/cluster-health.sh --no-fix
```

## What It Checks

| # | Check | Auto-Fix | Alert |
|---|-------|----------|-------|
| 1 | Node health (NotReady, conditions) | No | Yes |
| 2 | Pod health (CrashLoopBackOff, ImagePullBackOff, Error) | Restart if >10 restarts | Yes |
| 3 | Evicted/failed pods | Delete all | Yes |
| 4 | Deployment availability (current != desired) | No | Yes |
| 5 | PVC status (not Bound) | No | Yes |
| 6 | Resource pressure (CPU/Mem >80%) | No | Yes |
| 7 | CronJob failures (last 24h) | No | Yes |
| 8 | DaemonSet health (desired != ready) | No | Yes |

## Safe Auto-Fix Rules

These are the ONLY things the script auto-fixes:
1. **Evicted/failed pods**: `kubectl delete pods -A --field-selector=status.phase=Failed`
2. **CrashLoopBackOff pods with >10 restarts**: `kubectl delete pod -n <ns> <pod> --grace-period=30`

Everything else is alert-only. NEVER auto-fix:
- Node NotReady (could be maintenance)
- ImagePullBackOff (needs image tag or registry fix)
- Pending PVCs (needs storage investigation)
- Failed deployments (needs config investigation)

## Deep Investigation

When the script reports issues and the user asks for more detail, use these commands:

### Node issues
```bash
kubectl describe node <node-name>
kubectl top node <node-name>
kubectl get events --field-selector involvedObject.name=<node-name>
```

### Pod issues
```bash
kubectl describe pod -n <namespace> <pod-name>
kubectl logs -n <namespace> <pod-name> --tail=100
kubectl logs -n <namespace> <pod-name> --previous --tail=100
kubectl get events -n <namespace> --field-selector involvedObject.name=<pod-name>
```

### Deployment issues
```bash
kubectl describe deployment -n <namespace> <deployment-name>
kubectl rollout status deployment -n <namespace> <deployment-name>
kubectl rollout history deployment -n <namespace> <deployment-name>
```

### PVC issues
```bash
kubectl describe pvc -n <namespace> <pvc-name>
kubectl get pv
kubectl get events -n <namespace> --field-selector involvedObject.name=<pvc-name>
```

### Resource pressure
```bash
kubectl top nodes
kubectl top pods -A --sort-by=memory | head -20
kubectl top pods -A --sort-by=cpu | head -20
```

## Common Remediation

### CrashLoopBackOff (persistent)
1. Check logs: `kubectl logs -n <ns> <pod> --previous --tail=100`
2. Check events: `kubectl describe pod -n <ns> <pod>`
3. Common causes: OOMKilled (increase memory limit), bad config, missing env var
4. If image issue: check if newer image exists, update in Terraform

### OOMKilled
1. Check current limits: `kubectl describe pod -n <ns> <pod> | grep -A2 Limits`
2. Fix: Update resource limits in Terraform module for the service
3. Apply: `terraform apply -target=module.kubernetes_cluster.module.<service> -var="kube_config_path=$(pwd)/config"`

### ImagePullBackOff
1. Check image: `kubectl describe pod -n <ns> <pod> | grep Image`
2. Check registry: Is the image tag valid? Is the registry reachable?
3. Check pull-through cache: Docker registry at 10.0.20.10

### Node NotReady
1. Check kubelet: SSH to node, `systemctl status kubelet`
2. Check resources: `kubectl top node <node>`
3. Check conditions: `kubectl describe node <node> | grep -A10 Conditions`

## Slack Webhook

Messages are posted to the webhook at `$SLACK_WEBHOOK_URL`. Format:
- All clear: green check + summary stats
- Issues found: red siren + list of issues + auto-fix actions taken
- Warnings only: yellow warning + elevated metrics

## Infrastructure

- **Terraform module**: `modules/kubernetes/openclaw/main.tf`
- **CronJob**: Runs in `openclaw` namespace every 30 min
- **Existing healthcheck**: `scripts/cluster_healthcheck.sh` (local-only, not for OpenClaw)
- **Repo path inside pod**: `/workspace/infra/`
````
|
||||
|
||||
**Step 2: Commit**
|
||||
|
||||
```bash
|
||||
git add .claude/skills/cluster-health/SKILL.md
|
||||
git commit -m "[ci skip] Add cluster-health skill for OpenClaw agent"
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Task 4: Add CronJob and RBAC to Terraform
|
||||
|
||||
**Files:**
|
||||
- Modify: `modules/kubernetes/openclaw/main.tf` (append CronJob + ServiceAccount + Role + RoleBinding)
|
||||
|
||||
**Step 1: Add CronJob resources**
|
||||
|
||||
Append the following to `modules/kubernetes/openclaw/main.tf` after the `module "ingress"` block:
|
||||
|
||||

```hcl
# --- CronJob: Scheduled cluster health check ---

resource "kubernetes_service_account" "healthcheck" {
  metadata {
    name      = "cluster-healthcheck"
    namespace = kubernetes_namespace.openclaw.metadata[0].name
  }
}

resource "kubernetes_role" "healthcheck_exec" {
  metadata {
    name      = "healthcheck-pod-exec"
    namespace = kubernetes_namespace.openclaw.metadata[0].name
  }
  rule {
    api_groups = [""]
    resources  = ["pods"]
    verbs      = ["get", "list"]
  }
  rule {
    api_groups = [""]
    resources  = ["pods/exec"]
    verbs      = ["create"]
  }
}

resource "kubernetes_role_binding" "healthcheck_exec" {
  metadata {
    name      = "healthcheck-pod-exec"
    namespace = kubernetes_namespace.openclaw.metadata[0].name
  }
  subject {
    kind      = "ServiceAccount"
    name      = kubernetes_service_account.healthcheck.metadata[0].name
    namespace = kubernetes_namespace.openclaw.metadata[0].name
  }
  role_ref {
    api_group = "rbac.authorization.k8s.io"
    kind      = "Role"
    name      = kubernetes_role.healthcheck_exec.metadata[0].name
  }
}

resource "kubernetes_cron_job_v1" "cluster_healthcheck" {
  metadata {
    name      = "cluster-healthcheck"
    namespace = kubernetes_namespace.openclaw.metadata[0].name
    labels = {
      app  = "cluster-healthcheck"
      tier = var.tier
    }
  }
  spec {
    schedule                      = "*/30 * * * *"
    concurrency_policy            = "Forbid"
    failed_jobs_history_limit     = 3
    successful_jobs_history_limit = 3

    job_template {
      metadata {
        labels = {
          app = "cluster-healthcheck"
        }
      }
      spec {
        active_deadline_seconds = 300
        template {
          metadata {
            labels = {
              app = "cluster-healthcheck"
            }
          }
          spec {
            service_account_name = kubernetes_service_account.healthcheck.metadata[0].name
            restart_policy       = "Never"

            container {
              name    = "healthcheck"
              image   = "bitnami/kubectl:1.34"
              command = ["bash", "-c", <<-EOF
                # Find the openclaw pod
                POD=$(kubectl get pods -n openclaw -l app=openclaw -o jsonpath='{.items[0].metadata.name}' 2>/dev/null)
                if [ -z "$POD" ]; then
                  echo "ERROR: OpenClaw pod not found"
                  exit 1
                fi
                echo "Executing health check in pod $POD..."
                kubectl exec -n openclaw "$POD" -c openclaw -- bash /workspace/infra/.claude/cluster-health.sh
              EOF
              ]

              resources {
                requests = {
                  cpu    = "50m"
                  memory = "64Mi"
                }
                limits = {
                  memory = "128Mi"
                }
              }
            }
          }
        }
      }
    }
  }
}
```

**Step 2: Verify Terraform formatting**

```bash
terraform fmt modules/kubernetes/openclaw/main.tf
```

**Step 3: Verify Terraform plan**

```bash
terraform plan -target=module.kubernetes_cluster.module.openclaw -var="kube_config_path=$(pwd)/config"
```

Expected: Plan shows 4 new resources (ServiceAccount, Role, RoleBinding, CronJobV1). No destructive changes to existing resources.
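
The summary can also be checked mechanically. A sketch (the `Plan:` line is standard terraform output; the exact counts assume no unrelated drift):

```bash
terraform plan -target=module.kubernetes_cluster.module.openclaw \
  -var="kube_config_path=$(pwd)/config" -no-color | grep "Plan:"
# Expected: Plan: 4 to add, 0 to change, 0 to destroy.
```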

**Step 4: Commit**

```bash
git add modules/kubernetes/openclaw/main.tf
git commit -m "[ci skip] Add cluster health check CronJob to OpenClaw module"
```

---

### Task 5: Deploy and verify

**Step 1: Apply Terraform**

```bash
terraform apply -target=module.kubernetes_cluster.module.openclaw -var="kube_config_path=$(pwd)/config" -auto-approve
```

**Step 2: Verify CronJob exists**

```bash
kubectl --kubeconfig $(pwd)/config get cronjob -n openclaw
```

Expected: `cluster-healthcheck` with schedule `*/30 * * * *`

**Step 3: Verify RBAC**

```bash
kubectl --kubeconfig $(pwd)/config get serviceaccount,role,rolebinding -n openclaw
```

Expected: `cluster-healthcheck` SA, `healthcheck-pod-exec` role and rolebinding

**Step 4: Trigger a manual run**

```bash
kubectl --kubeconfig $(pwd)/config create job --from=cronjob/cluster-healthcheck healthcheck-manual-test -n openclaw
```

**Step 5: Check job output**

```bash
kubectl --kubeconfig $(pwd)/config wait --for=condition=complete job/healthcheck-manual-test -n openclaw --timeout=120s
kubectl --kubeconfig $(pwd)/config logs job/healthcheck-manual-test -n openclaw
```

Expected: Health check output with results. If `SLACK_WEBHOOK_URL` is set, check Slack for the message.

**Step 6: Clean up test job**

```bash
kubectl --kubeconfig $(pwd)/config delete job healthcheck-manual-test -n openclaw
```

**Step 7: Final commit**

```bash
git add -A modules/kubernetes/openclaw/ .claude/skills/cluster-health/ .claude/cluster-health.sh
git commit -m "[ci skip] OpenClaw cluster health agent: script + skill + CronJob"
```
@ -1,387 +0,0 @@

# Terragrunt Migration Design

**Date**: 2026-02-22
**Status**: Approved

## Problem

The infrastructure repo has a monolithic Terraform setup:

- 15MB state file, 857 resources, 85+ service modules in a single root
- `terraform plan/apply` evaluates all modules even when targeting one service
- `null_resource.core_services` bottleneck blocks 73 services behind 12 core modules
- 150+ variables passed through root -> kubernetes_cluster -> individual services
- 3 providers (kubernetes, helm, proxmox) initialize on every run

## Goals

- **Speed**: Faster plan/apply by splitting state into independent stacks
- **Blast radius isolation**: Bad apply can't break unrelated services
- **DRY config**: Shared provider/backend configuration via Terragrunt
- **Proper DAG**: Full references between stacks (not hardcoded DNS strings)
- **Bootstrappable**: `terragrunt run-all apply` works from scratch
- **CI/CD**: Changed-stack detection in Drone CI

## Architecture: Flat Stacks

### Directory Structure

```
infra/
├── terragrunt.hcl             # Root config (providers, backend, common vars)
├── stacks/
│   ├── infra/                 # Proxmox VMs, templates, docker-registry
│   │   ├── terragrunt.hcl
│   │   └── main.tf
│   ├── platform/              # Core: traefik, metallb, redis, dbaas, authentik, etc.
│   │   ├── terragrunt.hcl
│   │   └── main.tf
│   ├── blog/                  # One dir per user service
│   │   ├── terragrunt.hcl
│   │   └── main.tf
│   ├── immich/
│   │   ├── terragrunt.hcl
│   │   └── main.tf
│   └── ... (~65 service dirs)
├── modules/                   # UNCHANGED — existing modules stay where they are
│   ├── kubernetes/
│   │   ├── ingress_factory/
│   │   ├── setup_tls_secret/
│   │   ├── blog/
│   │   ├── immich/
│   │   └── ...
│   ├── create-vm/
│   └── create-template-vm/
├── state/                     # Per-stack state files
│   ├── infra/terraform.tfstate
│   ├── platform/terraform.tfstate
│   ├── blog/terraform.tfstate
│   └── ...
├── terraform.tfvars           # UNCHANGED — encrypted secrets
├── secrets/                   # UNCHANGED — TLS certs
├── main.tf                    # LEGACY — gradually emptied during migration
└── terraform.tfstate          # LEGACY — gradually emptied during migration
```

Each stack has a thin `main.tf` wrapper that calls the existing module via
`source = "../../modules/kubernetes/<service>"`. We do NOT use Terragrunt's
`terraform { source }` directive because our modules use relative paths
(`../ingress_factory`, `../setup_tls_secret`) that would break when Terragrunt
copies them to `.terragrunt-cache/`.

### Stack Composition

**Infra stack** (~10 resources):
- Proxmox VM templates (k8s, non-k8s, docker-registry)
- Docker registry VM
- Uses proxmox provider (not kubernetes/helm)

**Platform stack** (~200 resources, ~20 services):
- traefik, metallb, redis, dbaas, technitium, authentik, crowdsec, cloudflared
- monitoring (prometheus, alertmanager, grafana, loki, alloy)
- kyverno, metrics-server, nvidia, mailserver, authelia
- wireguard, headscale, xray, uptime-kuma, vaultwarden, reverse-proxy
- Exports outputs consumed by service stacks

**Per-service stacks** (~65, each 5-25 resources):
- One stack per user-facing service
- Each depends on platform via Terragrunt `dependency` block
- Some depend on other services (f1-stream -> coturn, etc.)

### Dependency Graph

```
                      ┌─────────┐
                      │  infra  │
                      └────┬────┘
                           │
                      ┌────▼────┐
                      │platform │  exports: redis_host, postgresql_host,
                      │         │           mysql_host, smtp_host, tls_secret_name, ...
                      └────┬────┘
                           │
     ┌──────────┬──────────┼──────────┬──────────┐
     │          │          │          │          │
┌────▼───┐ ┌────▼───┐ ┌────▼───┐ ┌────▼───┐ ┌────▼───┐
│  blog  │ │ immich │ │ affine │ │ ollama │ │ coturn │ ...
└────────┘ └────────┘ └────────┘ └────┬───┘ └────┬───┘
                                      │          │
                                ┌─────▼────┐ ┌───▼──────┐
                                │ openclaw │ │f1-stream │
                                │ gramps   │ └──────────┘
                                │ ytdlp    │
                                └──────────┘
```

### Platform Stack Outputs

| Output | Value | Consumers |
|--------|-------|-----------|
| `redis_host` | `redis.redis.svc.cluster.local` | 10 services |
| `postgresql_host` | `postgresql.dbaas.svc.cluster.local` | 10 services |
| `postgresql_port` | `5432` | 10 services |
| `mysql_host` | `mysql.dbaas.svc.cluster.local` | 8 services |
| `mysql_port` | `3306` | 8 services |
| `smtp_host` | `mail.viktorbarzin.me` | 6 services |
| `smtp_port` | `587` | 6 services |
| `tls_secret_name` | from variable | all services |
| `authentik_outpost_url` | `http://ak-outpost-...` | traefik |
| `crowdsec_lapi_host` | `crowdsec-service...` | traefik |
| `alertmanager_url` | `http://prometheus-alertmanager...` | loki |
| `loki_push_url` | `http://loki...` | alloy |

Service-to-service dependencies:

| Service | Depends on | Outputs consumed |
|---------|-----------|-----------------|
| f1-stream | coturn | `coturn_host`, `coturn_port` |
| real-estate-crawler | osm-routing | `osrm_foot_host`, `osrm_bicycle_host` |
| openclaw, grampsweb, ytdlp | ollama | `ollama_host` |

### Module Modifications

Service modules that hardcode DNS names need modification to accept hosts as variables.
~20 modules affected. Example for affine:

**Before:**
```hcl
# modules/kubernetes/affine/main.tf
DATABASE_URL = "postgresql://...@postgresql.dbaas.svc.cluster.local:5432/affine"
REDIS_SERVER_HOST = "redis.redis.svc.cluster.local"
```

**After:**
```hcl
variable "redis_host" { type = string }
variable "postgresql_host" { type = string }
variable "postgresql_port" { type = number }

DATABASE_URL = "postgresql://...@${var.postgresql_host}:${var.postgresql_port}/affine"
REDIS_SERVER_HOST = var.redis_host
```

## Root Terragrunt Configuration

```hcl
# infra/terragrunt.hcl

remote_state {
  backend = "local"
  generate = {
    path      = "backend.tf"
    if_exists = "overwrite_terragrunt"
  }
  config = {
    path = "${get_repo_root()}/state/${path_relative_to_include()}/terraform.tfstate"
  }
}

terraform {
  extra_arguments "common_vars" {
    commands = get_terraform_commands_that_need_vars()
    required_var_files = [
      "${get_repo_root()}/terraform.tfvars"
    ]
  }
}

generate "k8s_providers" {
  path      = "providers.tf"
  if_exists = "overwrite_terragrunt"
  contents  = <<EOF
variable "kube_config_path" {
  type    = string
  default = "~/.kube/config"
}

provider "kubernetes" {
  config_path = var.kube_config_path
}

provider "helm" {
  kubernetes {
    config_path = var.kube_config_path
  }
}
EOF
}
```

## Stack Wrapper Examples

### Simple service (blog)

```hcl
# stacks/blog/terragrunt.hcl
include "root" {
  path = find_in_parent_folders()
}

dependency "platform" {
  config_path = "../platform"
}

inputs = {
  tls_secret_name = dependency.platform.outputs.tls_secret_name
}
```

```hcl
# stacks/blog/main.tf
variable "tls_secret_name" {}
variable "kube_config_path" { default = "~/.kube/config" }

module "blog" {
  source          = "../../modules/kubernetes/blog"
  tls_secret_name = var.tls_secret_name
  tier            = "4-aux"
}
```

### Database-backed service (affine)

```hcl
# stacks/affine/terragrunt.hcl
include "root" {
  path = find_in_parent_folders()
}

dependency "platform" {
  config_path = "../platform"
}

inputs = {
  tls_secret_name = dependency.platform.outputs.tls_secret_name
  redis_host      = dependency.platform.outputs.redis_host
  postgresql_host = dependency.platform.outputs.postgresql_host
  postgresql_port = dependency.platform.outputs.postgresql_port
  smtp_host       = dependency.platform.outputs.smtp_host
  smtp_port       = dependency.platform.outputs.smtp_port
}
```

```hcl
# stacks/affine/main.tf
variable "tls_secret_name" {}
variable "kube_config_path" { default = "~/.kube/config" }
variable "affine_postgresql_password" {}
variable "redis_host" { type = string }
variable "postgresql_host" { type = string }
variable "postgresql_port" { type = number }
variable "smtp_host" { type = string }
variable "smtp_port" { type = number }

module "affine" {
  source              = "../../modules/kubernetes/affine"
  tls_secret_name     = var.tls_secret_name
  postgresql_password = var.affine_postgresql_password
  redis_host          = var.redis_host
  postgresql_host     = var.postgresql_host
  postgresql_port     = var.postgresql_port
  smtp_host           = var.smtp_host
  smtp_port           = var.smtp_port
  tier                = "4-aux"
}
```

### Service-to-service dependency (f1-stream -> coturn)

```hcl
# stacks/f1-stream/terragrunt.hcl
include "root" {
  path = find_in_parent_folders()
}

dependency "platform" {
  config_path = "../platform"
}

dependency "coturn" {
  config_path = "../coturn"
}

inputs = {
  tls_secret_name = dependency.platform.outputs.tls_secret_name
  coturn_host     = dependency.coturn.outputs.coturn_host
  coturn_port     = dependency.coturn.outputs.coturn_port
}
```

## Migration Strategy

### Phase 0: Setup
- Install Terragrunt
- Create root `terragrunt.hcl`, `stacks/`, `state/` directories
- No state changes, no risk

### Phase 1: Infra Stack (VMs)
- Create `stacks/infra/` with Proxmox provider + VM module calls
- `terraform state mv` 4 root-level module resources to `state/infra/`
- Remove from root `main.tf`
- Verify: `cd stacks/infra && terragrunt plan` shows no changes

### Phase 2: Platform Stack (Core Services)
- Create `stacks/platform/main.tf` with ~20 core services + outputs
- `terraform state mv` ~200 resources from `module.kubernetes_cluster.module.<core>` (see the sketch below)
- Remove `null_resource.core_services` (Terragrunt handles ordering)
- Verify: `cd stacks/platform && terragrunt plan` shows no changes
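
A sketch of one such move; `-state`/`-state-out` operate on local state files, and the traefik addresses are illustrative:

```bash
# Move one core service out of the monolith state into the platform stack's state
terraform state mv \
  -state=terraform.tfstate \
  -state-out=state/platform/terraform.tfstate \
  'module.kubernetes_cluster.module.traefik' \
  'module.traefik'
# Rollback is the same command with the state files and addresses swapped
```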

### Phase 3: Simple Services (No DB Dependencies)
- blog, echo, privatebin, excalidraw, city-guesser, dashy, etc.
- Create stack, move state, verify — one at a time

### Phase 4: Database-Backed Services
- Modify modules to accept hosts as variables
- affine, immich, linkwarden, nextcloud, grampsweb, etc.
- Create stack, move state, verify

### Phase 5: Service-to-Service Dependencies
- ollama -> openclaw, grampsweb, ytdlp
- coturn -> f1-stream
- osm-routing -> real-estate-crawler

### Phase 6: Cleanup
- Delete DEFCON system from `modules/kubernetes/main.tf`
- Delete legacy `terraform.tfstate`
- Delete root `main.tf` kubernetes_cluster module call
- Update CI/CD to Terragrunt

### Rollback
At any phase, move the resources back into the monolith state with `terraform state mv` and restore the module calls in the root `main.tf`.

## CI/CD: Changed-Stack Detection

The Drone CI pipeline detects the files changed in each commit and maps them to affected stacks (a sketch of the mapping follows the table):

| Changed file | Affected stack |
|-------------|---------------|
| `stacks/blog/*` | blog |
| `modules/kubernetes/blog/*` | blog |
| `terraform.tfvars` | all stacks |
| `terragrunt.hcl` | all stacks |
| `modules/kubernetes/ingress_factory/*` | all stacks |
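
A sketch of the mapping logic, assumed to run as a Drone step from the repo root; the module-to-stack mapping is simplified (platform modules would need an extra lookup):

```bash
# Files touched by the commit under test
changed=$(git diff --name-only HEAD~1 HEAD)

# Global files invalidate every stack
if echo "$changed" | grep -qE '^(terraform\.tfvars|terragrunt\.hcl|modules/kubernetes/ingress_factory/)'; then
  stacks=$(ls stacks)
else
  # Otherwise map stacks/<name>/* and modules/kubernetes/<name>/* to <name>
  stacks=$(echo "$changed" \
    | sed -nE 's|^stacks/([^/]+)/.*|\1|p; s|^modules/kubernetes/([^/]+)/.*|\1|p' \
    | sort -u)
fi

for s in $stacks; do
  (cd "stacks/$s" && terragrunt apply -auto-approve)
done
```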

### Manual Workflow

```bash
# Apply single service
cd stacks/blog && terragrunt apply

# Apply everything (respects DAG ordering)
cd stacks && terragrunt run-all apply

# Plan everything
cd stacks && terragrunt run-all plan
```

## Decisions Made

| Decision | Choice | Rationale |
|----------|--------|-----------|
| Tool | Terragrunt | DRY config, dependency management, run-all orchestration |
| Stack granularity | 1 platform + 1 per service | Max isolation for apps, grouped core |
| Migration | Incremental | Lower risk, verify each step |
| Shared modules | Relative paths | Simple, no registry overhead |
| State backend | Local files | No external dependencies |
| Cross-stack refs | Full references via outputs | Proper DAG, bootstrappable from scratch |
| CI/CD | Changed-stack detection | Only apply what changed |