[ci skip] Remove legacy files and orphaned modules

Delete 20 orphaned module directories and 3 stray files from
modules/kubernetes/ that are no longer referenced by any stack.
Remove 7 root-level legacy files including the empty tfstate,
27MB terraform zip, commented-out main.tf, and migration notes.
Clean up commented-out dockerhub_secret and oauth-proxy references
in blog, travel_blog, and city-guesser stacks. Remove stale
frigate config.yaml entry from .gitignore. Remove ephemeral
docs/plans/ directory.
Viktor Barzin 2026-02-22 15:23:27 +00:00
parent c7c7047f1c
commit 116c4d9c30
56 changed files with 2 additions and 9402 deletions


@@ -1,140 +0,0 @@
# Centralized Log Collection Design
## Date: 2026-02-13
## Goal
Centrally collect logs from all Kubernetes pods for monitoring and alerting. Minimize disk I/O by holding logs in memory for extended periods, flushing to NFS once daily. Alert on log patterns via existing Alertmanager pipeline.
## Requirements
- **Primary use case**: Monitoring and alerting (log-based alert rules evaluated in real-time)
- **Retention**: 7 days on disk after flush
- **Memory budget**: 4-8GB total (~6.6GB used)
- **Disk strategy**: 24h in-memory chunks, WAL on tmpfs, single daily flush to NFS
- **Crash policy**: Accept up to 24h log loss on pod/node crash (alerts still fire in real-time before flush)
- **Alert delivery**: Loki Ruler -> existing Alertmanager -> Slack/email
## Architecture
```
┌──────────────────┐     ┌────────────────────────┐     ┌──────────────┐
│ Alloy DaemonSet  │     │ Loki SingleBinary      │     │ Grafana      │
│ 5 pods, 128Mi ea │────>│ 1 pod, 6Gi RAM         │<────│ (existing)   │
│ tails /var/log/  │     │                        │     │ + Loki       │
│ pods on each node│     │ Ingester: 24h chunks   │     │ datasource   │
└──────────────────┘     │ WAL: tmpfs (in-memory) │     └──────────────┘
                         │ Storage: NFS 15Gi      │
┌──────────────────┐     │ Ruler ──> Alertmanager │
│ Sysctl DaemonSet │     └────────────────────────┘
│ 5 pods (pause)   │
│ sets inotify     │
│ limits on nodes  │
└──────────────────┘
```
## Components
### 1. Sysctl DaemonSet
Solves the `too many open files` / fsnotify watcher exhaustion problem that previously blocked Alloy.
- Privileged init container runs `sysctl -w` on each node
- Settings: `fs.inotify.max_user_watches=1048576`, `fs.inotify.max_user_instances=512`, `fs.inotify.max_queued_events=1048576`
- Main container: `pause` image (near-zero resources)
- Survives node reboots (DaemonSet recreates pod)
- Namespace: `monitoring`
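To confirm the limits actually landed on a node, they can be read back through an ephemeral debug pod — a quick sketch (the node name is a placeholder):
```bash
# Read the inotify limits back from a node via an ephemeral debug pod
# (k8s-node1 is an example; any node name works)
kubectl debug node/k8s-node1 -it --image=busybox -- \
  sysctl fs.inotify.max_user_watches fs.inotify.max_user_instances fs.inotify.max_queued_events
```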
### 2. Loki (Helm Release)
Single-binary deployment. Existing Helm chart config in `loki.yaml`, updated with:
**Ingester tuning (disk-friendly):**
- `chunk_idle_period: 12h` — don't flush idle streams quickly
- `max_chunk_age: 24h` — hold chunks in memory for full day
- `chunk_retain_period: 1m` — brief retain after flush
- `chunk_target_size: 1572864` (1.5MB) — larger chunks = fewer writes
- WAL: tmpfs emptyDir (`medium: Memory`, 2Gi limit)
**Retention:**
- `retention_period: 168h` (7 days)
- Compactor enabled for retention enforcement
**Ruler:**
- Evaluates LogQL alert rules in real-time (before chunk flush)
- Fires to `http://prometheus-alertmanager.monitoring.svc.cluster.local:9093`
**Storage:**
- NFS PV/PVC at `/mnt/main/loki/loki` (15Gi, existing)
- TSDB index with 24h period
**Resources:**
- Memory: 6Gi limit
- CPU: 1 limit
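Loki serves its effective configuration over HTTP, so the tuning above can be verified in the running pod — a sketch, assuming the single-binary pod is reachable as `deploy/loki` (as in the verification steps later in this plan):
```bash
# Confirm the rendered ingester settings inside the running pod
# (Loki exposes its effective config at /config)
kubectl exec -n monitoring deploy/loki -- \
  wget -qO- http://localhost:3100/config | grep -E 'chunk_idle_period|max_chunk_age|chunk_target_size'
```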
### 3. Alloy (Helm Release)
DaemonSet log collector. Existing config in `alloy.yaml` is complete:
- Discovers pods via `discovery.kubernetes`
- Labels: namespace, pod, container, app, job, container_runtime, cluster
- Tails `/var/log/pods/` on each node
- Forwards to `http://loki.monitoring.svc.cluster.local:3100/loki/api/v1/push`
**Resources per pod:**
- Memory: 128Mi limit
- CPU: 200m limit
### 4. Grafana Datasource
ConfigMap with label `grafana_datasource: "1"` for sidecar auto-discovery:
- Name: Loki
- Type: loki
- URL: `http://loki.monitoring.svc.cluster.local:3100`
- Existing `loki.json` dashboard already in dashboards directory
### 5. Starter Alert Rules
Configured in Loki Ruler (evaluated in real-time, before disk flush):
| Alert | LogQL Expression | Severity |
|-------|-----------------|----------|
| HighErrorRate | `sum(rate({namespace=~".+"} \|= "error" [5m])) by (namespace) > 10` | warning |
| PodCrashLoopBackOff | `count_over_time({namespace=~".+"} \|= "CrashLoopBackOff" [5m]) > 0` | critical |
| OOMKilled | `count_over_time({namespace=~".+"} \|= "OOMKilled" [5m]) > 0` | critical |
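Each expression can be dry-run against Loki's query API before being wired into the ruler — a sketch using the HighErrorRate rule (assumes a local port-forward):
```bash
# Dry-run the HighErrorRate expression against the Loki query API
kubectl port-forward -n monitoring svc/loki 3100:3100 &
sleep 2  # give the port-forward a moment to establish
curl -sG 'http://localhost:3100/loki/api/v1/query' \
  --data-urlencode 'query=sum(rate({namespace=~".+"} |= "error" [5m])) by (namespace)'
```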
## Memory Budget
| Component | Per-pod | Pods | Total |
|-----------|---------|------|-------|
| Alloy | 128Mi | 5 | 640Mi |
| Loki | 6Gi | 1 | 6Gi |
| Sysctl DS | ~0 (pause) | 5 | ~0 |
| **Total** | | | **~6.6 GB** |
## Files to Change
| File | Action |
|------|--------|
| `modules/kubernetes/monitoring/loki.tf` | Uncomment Loki + Alloy helm releases, add sysctl DaemonSet, add Grafana Loki datasource ConfigMap |
| `modules/kubernetes/monitoring/loki.yaml` | Update with ingester tuning, ruler config, retention, resource limits |
| `modules/kubernetes/monitoring/alloy.yaml` | Add resource limits in Helm values wrapper |
| `secrets/nfs_directories.txt` | Ensure `/mnt/main/loki` entries exist |
## Implementation Steps
1. Add sysctl DaemonSet to `loki.tf`
2. Update `loki.yaml` with disk-friendly tuning, ruler, retention, resources
3. Update `alloy.yaml` with resource limits
4. Uncomment Loki Helm release in `loki.tf`, wire up NFS PV/PVC
5. Uncomment Alloy Helm release in `loki.tf`
6. Add Grafana Loki datasource ConfigMap to `loki.tf`
7. Add alert rules to Loki config
8. Ensure NFS exports exist in `secrets/nfs_directories.txt`
9. `terraform apply -target=module.kubernetes_cluster.module.monitoring`
10. Verify: Grafana Explore -> Loki datasource -> query `{namespace="monitoring"}`
## Risks
- **24h data loss on crash**: Accepted trade-off. Alerts fire in real-time before flush, so alert coverage is not affected — only historical log browsing is at risk.
- **Memory pressure**: 6Gi for Loki on a 16GB node is significant. Monitor with existing Prometheus memory alerts.
- **Log volume spikes**: A chatty pod could cause Loki to OOM. Alloy can be configured with rate limiting if needed (future enhancement).


@@ -1,532 +0,0 @@
# Centralized Log Collection Implementation Plan
> **For Claude:** REQUIRED SUB-SKILL: Use superpowers:executing-plans to implement this plan task-by-task.
**Goal:** Deploy Loki + Alloy for centralized Kubernetes log collection with 24h in-memory chunks, 7-day disk retention, and log-based alerting via existing Alertmanager.
**Architecture:** Alloy DaemonSet tails pod logs on all 5 nodes, forwards to single-binary Loki which holds chunks in 6Gi RAM for 24h before flushing to NFS. Loki Ruler evaluates LogQL alert rules in real-time and fires to Alertmanager. Grafana gets a Loki datasource via sidecar auto-provisioning.
**Tech Stack:** Terraform, Helm (Loki chart, Alloy chart), Kubernetes DaemonSet, NFS, Grafana
**Design doc:** `docs/plans/2026-02-13-centralized-log-collection-design.md`
---
### Task 1: Add sysctl DaemonSet for inotify limits
Alloy uses fsnotify to tail log files, and the default kernel inotify limits cause "too many open files" errors. This DaemonSet raises the limits on every node and re-applies them after reboots, since the DaemonSet recreates its pod.
**Files:**
- Modify: `modules/kubernetes/monitoring/loki.tf` (replace the comment block at lines 67-71)
**Step 1: Write the sysctl DaemonSet resource**
Replace lines 67-71 (the comment block about sysctl) with this Terraform resource in `loki.tf`:
```hcl
resource "kubernetes_daemon_set_v1" "sysctl-inotify" {
metadata {
name = "sysctl-inotify"
namespace = kubernetes_namespace.monitoring.metadata[0].name
labels = {
app = "sysctl-inotify"
}
}
spec {
selector {
match_labels = {
app = "sysctl-inotify"
}
}
template {
metadata {
labels = {
app = "sysctl-inotify"
}
}
spec {
init_container {
name = "sysctl"
image = "busybox:1.37"
command = [
"sh", "-c",
"sysctl -w fs.inotify.max_user_watches=1048576 && sysctl -w fs.inotify.max_user_instances=512 && sysctl -w fs.inotify.max_queued_events=1048576"
]
security_context {
privileged = true
}
}
container {
name = "pause"
image = "registry.k8s.io/pause:3.10"
resources {
requests = {
cpu = "1m"
memory = "4Mi"
}
limits = {
cpu = "1m"
memory = "4Mi"
}
}
}
host_pid = true
toleration {
operator = "Exists"
}
}
}
}
}
```
**Step 2: Run terraform fmt**
Run: `terraform fmt -recursive modules/kubernetes/monitoring/`
**Step 3: Run terraform plan to verify**
Run: `terraform plan -target=module.kubernetes_cluster.module.monitoring -var="kube_config_path=$(pwd)/config" 2>&1 | tail -30`
Expected: Plan shows 1 resource to add (kubernetes_daemon_set_v1.sysctl-inotify)
**Step 4: Commit**
```bash
git add modules/kubernetes/monitoring/loki.tf
git commit -m "[ci skip] Add sysctl DaemonSet for inotify limits"
```
---
### Task 2: Update Loki Helm values with disk-friendly tuning
Configure ingester for 24h in-memory chunks, WAL on tmpfs, 7-day retention, ruler for alerting, and resource limits.
**Files:**
- Modify: `modules/kubernetes/monitoring/loki.yaml` (full rewrite)
**Step 1: Write updated loki.yaml**
Replace entire contents of `loki.yaml` with:
```yaml
loki:
commonConfig:
replication_factor: 1
schemaConfig:
configs:
- from: "2025-04-01"
store: tsdb
object_store: filesystem
schema: v13
index:
prefix: loki_index_
period: 24h
ingester:
chunk_idle_period: 12h
max_chunk_age: 24h
chunk_retain_period: 1m
chunk_target_size: 1572864
wal:
dir: /loki-wal
pattern_ingester:
enabled: true
limits_config:
allow_structured_metadata: true
volume_enabled: true
retention_period: 168h
compactor:
retention_enabled: true
working_directory: /loki/compactor
compaction_interval: 1h
delete_request_store: filesystem
ruler:
enable_api: true
storage:
type: local
local:
directory: /loki/rules
alertmanager_url: http://alertmanager.monitoring.svc.cluster.local:9093
ring:
kvstore:
store: inmemory
rule_path: /loki/scratch
storage:
type: "filesystem"
auth_enabled: false
minio:
enabled: false
deploymentMode: SingleBinary
singleBinary:
replicas: 1
persistence:
enabled: true
size: 15Gi
storageClass: ""
extraVolumes:
- name: wal
emptyDir:
medium: Memory
sizeLimit: 2Gi
- name: rules
configMap:
name: loki-alert-rules
extraVolumeMounts:
- name: wal
mountPath: /loki-wal
- name: rules
mountPath: /loki/rules/fake
resources:
requests:
cpu: 250m
memory: 4Gi
limits:
cpu: "1"
memory: 6Gi
# Zero out replica counts of other deployment modes
backend:
replicas: 0
read:
replicas: 0
write:
replicas: 0
ingester:
replicas: 0
querier:
replicas: 0
queryFrontend:
replicas: 0
queryScheduler:
replicas: 0
distributor:
replicas: 0
compactor:
replicas: 0
indexGateway:
replicas: 0
bloomCompactor:
replicas: 0
bloomGateway:
replicas: 0
```
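Before committing, the values can be sanity-checked by rendering the chart locally — a sketch (the chart repo matches the `helm_release` in Task 4; the chart version is left unpinned here):
```bash
# Render the chart with the new values to catch YAML/schema errors early
helm repo add grafana https://grafana.github.io/helm-charts
helm template loki grafana/loki --namespace monitoring \
  -f modules/kubernetes/monitoring/loki.yaml > /dev/null && echo "values render OK"
```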
**Step 2: Commit**
```bash
git add modules/kubernetes/monitoring/loki.yaml
git commit -m "[ci skip] Update Loki config with disk-friendly tuning and ruler"
```
---
### Task 3: Update Alloy Helm values with resource limits
The Alloy config content is already complete. Wrap it in proper Helm values with resource limits.
**Files:**
- Modify: `modules/kubernetes/monitoring/alloy.yaml` (add resource limits)
**Step 1: Add resource limits to alloy.yaml**
Append after the existing `alloy.configMap.content` block (after the last line):
```yaml
# Resource limits for DaemonSet pods
resources:
requests:
cpu: 50m
memory: 64Mi
limits:
cpu: 200m
memory: 128Mi
```
The final file should have the `alloy.configMap.content` block unchanged, with `alloy.resources` added as a sibling under `alloy:`.
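One way to verify the merge, assuming `yq` (the Go implementation) is available:
```bash
# Sanity-check the final structure: configMap and resources should both be
# keys under alloy (hypothetical check, requires mikefarah yq)
yq '.alloy | keys' modules/kubernetes/monitoring/alloy.yaml
# Expected output includes: configMap, resources
```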
**Step 2: Commit**
```bash
git add modules/kubernetes/monitoring/alloy.yaml
git commit -m "[ci skip] Add resource limits to Alloy config"
```
---
### Task 4: Uncomment Loki Helm release and PV in loki.tf
Enable the Loki Helm release and its NFS persistent volume. Remove minio PV (not needed with filesystem storage).
**Files:**
- Modify: `modules/kubernetes/monitoring/loki.tf` (uncomment Loki resources, remove minio PV)
**Step 1: Uncomment the Loki Helm release (lines 1-12)**
Uncomment and update the helm_release to:
```hcl
resource "helm_release" "loki" {
namespace = kubernetes_namespace.monitoring.metadata[0].name
create_namespace = true
name = "loki"
repository = "https://grafana.github.io/helm-charts"
chart = "loki"
values = [templatefile("${path.module}/loki.yaml", {})]
timeout = 300
depends_on = [kubernetes_config_map.loki_alert_rules]
}
```
**Step 2: Uncomment the Loki NFS PV (lines 14-32)**
Uncomment the `kubernetes_persistent_volume.loki` resource as-is.
**Step 3: Remove the minio PV block (lines 34-52)**
Delete the entire `kubernetes_persistent_volume.loki-minio` commented block — minio is disabled.
**Step 4: Run terraform fmt**
Run: `terraform fmt -recursive modules/kubernetes/monitoring/`
**Step 5: Commit**
```bash
git add modules/kubernetes/monitoring/loki.tf
git commit -m "[ci skip] Enable Loki Helm release and NFS PV"
```
---
### Task 5: Uncomment Alloy Helm release in loki.tf
Enable the Alloy Helm release.
**Files:**
- Modify: `modules/kubernetes/monitoring/loki.tf` (uncomment Alloy helm release)
**Step 1: Uncomment and update the Alloy Helm release**
Replace the commented Alloy block with:
```hcl
resource "helm_release" "alloy" {
namespace = kubernetes_namespace.monitoring.metadata[0].name
create_namespace = true
name = "alloy"
repository = "https://grafana.github.io/helm-charts"
chart = "alloy"
values = [file("${path.module}/alloy.yaml")]
atomic = true
depends_on = [helm_release.loki]
}
```
**Step 2: Run terraform fmt**
Run: `terraform fmt -recursive modules/kubernetes/monitoring/`
**Step 3: Commit**
```bash
git add modules/kubernetes/monitoring/loki.tf
git commit -m "[ci skip] Enable Alloy Helm release"
```
---
### Task 6: Add Grafana Loki datasource ConfigMap
Grafana's sidecar auto-discovers ConfigMaps with label `grafana_datasource: "1"`. Create one for Loki.
**Files:**
- Modify: `modules/kubernetes/monitoring/loki.tf` (add ConfigMap resource)
**Step 1: Add the datasource ConfigMap**
Add to `loki.tf`:
```hcl
resource "kubernetes_config_map" "grafana_loki_datasource" {
metadata {
name = "grafana-loki-datasource"
namespace = kubernetes_namespace.monitoring.metadata[0].name
labels = {
grafana_datasource = "1"
}
}
data = {
"loki-datasource.yaml" = yamlencode({
apiVersion = 1
datasources = [{
name = "Loki"
type = "loki"
access = "proxy"
url = "http://loki.monitoring.svc.cluster.local:3100"
isDefault = false
}]
})
}
}
```
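After apply, the sidecar should pick the ConfigMap up; a rough check against Grafana's datasource API (the deployment name, admin credential, and `$GRAFANA_PASSWORD` variable are assumptions):
```bash
# Confirm Grafana provisioned the datasource
# (assumes the admin password is exported locally as $GRAFANA_PASSWORD)
kubectl exec -n monitoring deploy/grafana -- \
  wget -qO- "http://admin:$GRAFANA_PASSWORD@localhost:3000/api/datasources" | grep -i loki
```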
**Step 2: Run terraform fmt**
Run: `terraform fmt -recursive modules/kubernetes/monitoring/`
**Step 3: Commit**
```bash
git add modules/kubernetes/monitoring/loki.tf
git commit -m "[ci skip] Add Grafana Loki datasource ConfigMap"
```
---
### Task 7: Add Loki alert rules ConfigMap
Create the ConfigMap that Loki's ruler reads for alert rules. Mounted into the Loki pod at `/loki/rules/fake/`.
**Files:**
- Modify: `modules/kubernetes/monitoring/loki.tf` (add alert rules ConfigMap)
**Step 1: Add the alert rules ConfigMap**
Add to `loki.tf`:
```hcl
resource "kubernetes_config_map" "loki_alert_rules" {
metadata {
name = "loki-alert-rules"
namespace = kubernetes_namespace.monitoring.metadata[0].name
}
data = {
"rules.yaml" = yamlencode({
groups = [{
name = "log-alerts"
rules = [
{
alert = "HighErrorRate"
expr = "sum(rate({namespace=~\".+\"} |= \"error\" [5m])) by (namespace) > 10"
for = "5m"
labels = {
severity = "warning"
}
annotations = {
summary = "High error rate in {{ $labels.namespace }}"
}
},
{
alert = "PodCrashLoopBackOff"
expr = "count_over_time({namespace=~\".+\"} |= \"CrashLoopBackOff\" [5m]) > 0"
for = "1m"
labels = {
severity = "critical"
}
annotations = {
summary = "CrashLoopBackOff detected in {{ $labels.namespace }}"
}
},
{
alert = "OOMKilled"
expr = "count_over_time({namespace=~\".+\"} |= \"OOMKilled\" [5m]) > 0"
for = "1m"
labels = {
severity = "critical"
}
annotations = {
summary = "OOMKilled detected in {{ $labels.namespace }}"
}
}
]
}]
})
}
}
```
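Once the ConfigMap is mounted and Loki restarts, the ruler API (enabled via `enable_api: true` in Task 2) should list the group:
```bash
# Verify the ruler loaded the mounted rules
kubectl exec -n monitoring deploy/loki -- \
  wget -qO- http://localhost:3100/loki/api/v1/rules
# Expected: a log-alerts group containing the three rules above
```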
**Step 2: Run terraform fmt**
Run: `terraform fmt -recursive modules/kubernetes/monitoring/`
**Step 3: Commit**
```bash
git add modules/kubernetes/monitoring/loki.tf
git commit -m "[ci skip] Add Loki alert rules ConfigMap"
```
---
### Task 8: Deploy and verify
Apply all changes via Terraform and verify the stack is working.
**Files:** None (deployment only)
**Step 1: Run terraform apply for monitoring module**
Run: `terraform apply -target=module.kubernetes_cluster.module.monitoring -var="kube_config_path=$(pwd)/config" -auto-approve`
Expected: Multiple resources created (sysctl DaemonSet, Loki Helm release, Alloy Helm release, PV, ConfigMaps)
**Step 2: Verify sysctl DaemonSet is running on all nodes**
Run: `kubectl --kubeconfig $(pwd)/config get ds -n monitoring sysctl-inotify`
Expected: DESIRED=5, CURRENT=5, READY=5
**Step 3: Verify Loki pod is running**
Run: `kubectl --kubeconfig $(pwd)/config get pods -n monitoring -l app.kubernetes.io/name=loki`
Expected: 1/1 Running
**Step 4: Verify Alloy DaemonSet is running**
Run: `kubectl --kubeconfig $(pwd)/config get ds -n monitoring -l app.kubernetes.io/name=alloy`
Expected: DESIRED=5, CURRENT=5, READY=5
**Step 5: Verify Loki is receiving logs**
Run: `kubectl --kubeconfig $(pwd)/config exec -n monitoring deploy/loki -- wget -qO- 'http://localhost:3100/loki/api/v1/labels'`
Expected: JSON response with labels like `namespace`, `pod`, `container`
**Step 6: Verify Grafana has Loki datasource**
Open `https://grafana.viktorbarzin.me/explore`, select "Loki" datasource, run query: `{namespace="monitoring"}`
Expected: Log lines from monitoring namespace pods
**Step 7: Commit final state**
```bash
git add -A
git commit -m "[ci skip] Deploy centralized log collection (Loki + Alloy)"
```
---
### Troubleshooting
**If Alloy pods crash with inotify errors:**
- Check sysctl DaemonSet init logs: `kubectl --kubeconfig $(pwd)/config logs -n monitoring ds/sysctl-inotify -c sysctl`
- Verify sysctl values on node: `kubectl --kubeconfig $(pwd)/config debug node/k8s-node2 -it --image=busybox -- sysctl fs.inotify.max_user_watches`
**If Loki OOMs:**
- Check memory usage: `kubectl --kubeconfig $(pwd)/config top pod -n monitoring -l app.kubernetes.io/name=loki`
- Reduce `max_chunk_age` from 24h to 12h in `loki.yaml` to flush more frequently
**If Grafana doesn't show Loki datasource:**
- Verify ConfigMap has correct label: `kubectl --kubeconfig $(pwd)/config get cm -n monitoring grafana-loki-datasource -o yaml`
- Restart Grafana sidecar: `kubectl --kubeconfig $(pwd)/config rollout restart deploy -n monitoring grafana`
**If Loki PV won't bind:**
- Check NFS export exists: `ssh root@10.0.10.15 'showmount -e localhost | grep loki'`
- Run NFS export script: `cd secrets && bash nfs_exports.sh`


@@ -1,154 +0,0 @@
# Multi-User Kubernetes Access Design
**Date**: 2026-02-17
**Status**: Approved
## Problem
The cluster uses a single `kubernetes-admin` client certificate for all access. There is no way to:
- Give different users different levels of access
- Track who performed which actions
- Enforce resource limits per user
- Onboard new users without sharing admin credentials
## Decision
Native OIDC authentication on the kube-apiserver using Authentik as the identity provider, with Terraform-managed RBAC and a self-service Svelte portal for user onboarding.
### Alternatives Considered
1. **Pinniped (Concierge + Supervisor)**: Avoids API server changes but adds two components to maintain. Requires Pinniped CLI on user machines. Overkill for a single-cluster setup.
2. **kube-oidc-proxy**: Avoids API server changes but adds a proxy in the request path (single point of failure, extra latency). Sporadic maintenance from JetStack.
## Architecture
```
User → Self-Service Portal → Authentik Login → Download Kubeconfig
User → kubectl (with kubelogin) → kube-apiserver → OIDC validation → Authentik
RBAC evaluation
Audit logging → Alloy → Loki → Grafana
```
### User Roles
| Role | Scope | Access |
|------|-------|--------|
| `admin` | Cluster-wide | Full `cluster-admin` access |
| `power-user` | Cluster-wide | Deploy/manage workloads, view all resources, no RBAC/node modification |
| `namespace-owner` | Specific namespaces | Full `admin` within assigned namespaces only |
## Components
### 1. Authentik OIDC Provider
New OAuth2/OIDC application in Authentik configured via Terraform (`modules/kubernetes/authentik/`).
- **Application name**: `kubernetes`
- **Provider type**: OAuth2/OpenID Connect
- **Client type**: Public (no client secret, kubelogin uses PKCE)
- **Redirect URIs**: `http://localhost:8000/callback` (kubelogin default)
- **Scopes**: `openid`, `email`, `profile`, `groups`
- **Property mappings**: Include `groups` claim for RBAC group matching
### 2. kube-apiserver OIDC Flags
One-time change on k8s-master (`10.0.20.100`), automated via Terraform `null_resource` with `remote-exec`.
Added to `/etc/kubernetes/manifests/kube-apiserver.yaml`:
```yaml
- --oidc-issuer-url=https://authentik.viktorbarzin.me/application/o/kubernetes/
- --oidc-client-id=kubernetes
- --oidc-username-claim=email
- --oidc-groups-claim=groups
```
Kubelet auto-restarts the API server pod when the manifest changes. These flags persist through `kubeadm upgrade apply`.
### 3. RBAC (Terraform-managed)
New module: `modules/kubernetes/rbac/main.tf`
**User definition** in `terraform.tfvars`:
```hcl
k8s_users = {
"viktor" = {
role = "admin"
email = "viktor@viktorbarzin.me"
}
"alice" = {
role = "power-user"
email = "alice@example.com"
}
"bob" = {
role = "namespace-owner"
namespaces = ["bob-apps", "bob-dev"]
email = "bob@example.com"
}
}
```
**Resources created per role:**
| Role | Terraform Resources |
|------|-------------------|
| `admin` | `ClusterRoleBinding` → built-in `cluster-admin` ClusterRole for user email |
| `power-user` | Custom `ClusterRole` (workload management, no RBAC/node access) + `ClusterRoleBinding` |
| `namespace-owner` | `Namespace`(s) + `RoleBinding` → built-in `admin` ClusterRole + `ResourceQuota` per namespace |
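The resulting bindings can be spot-checked with kubectl's impersonation flags — a sketch using the example users from the tfvars snippet above:
```bash
# Verify effective permissions via impersonation (example users from above)
kubectl auth can-i create deployments --as=alice@example.com -n default   # power-user: yes
kubectl auth can-i create clusterroles --as=alice@example.com             # power-user: no
kubectl auth can-i create deployments --as=bob@example.com -n bob-apps    # namespace-owner: yes
kubectl auth can-i create deployments --as=bob@example.com -n default     # namespace-owner: no
```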
### 4. Self-Service Portal
Svelte (SvelteKit) app at `https://k8s-portal.viktorbarzin.me`.
**Flow:**
1. User visits portal → Authentik login via Traefik forward auth
2. Portal displays user's role and assigned namespaces
3. User downloads pre-configured kubeconfig (generated server-side)
4. Portal shows setup instructions (install kubectl + kubelogin)
**Kubeconfig template** includes:
- Cluster: `https://10.0.20.100:6443` with CA cert
- Auth: `exec` credential plugin pointing to kubelogin
- OIDC issuer URL and client ID pre-configured
**Deployment**: Standard Kubernetes deployment + service + ingress, Terraform-managed like other services. No database needed — user role info read from Kubernetes RBAC bindings or a Terraform-generated ConfigMap.
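Client-side, the downloaded kubeconfig boils down to an exec credential entry for kubelogin. Expressed as kubectl commands, it looks roughly like this sketch (assumes the int128/kubelogin plugin installed as `kubectl oidc-login`, and a cluster entry named `home`):
```bash
# Equivalent of the generated kubeconfig's user entry
kubectl config set-credentials oidc \
  --exec-api-version=client.authentication.k8s.io/v1beta1 \
  --exec-command=kubectl \
  --exec-arg=oidc-login --exec-arg=get-token \
  --exec-arg=--oidc-issuer-url=https://authentik.viktorbarzin.me/application/o/kubernetes/ \
  --exec-arg=--oidc-client-id=kubernetes
kubectl config set-context oidc@home --cluster=home --user=oidc
```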
### 5. Audit Logging
Kubernetes audit policy deployed to master via the same `null_resource`.
**Policy** (`/etc/kubernetes/audit-policy.yaml`):
- `RequestResponse` level for OIDC-authenticated users (captures what they changed)
- `Metadata` level for system/service accounts (keeps volume down)
- Secrets logged at `Metadata` level only (no request/response bodies)
**Log pipeline**: Audit log file → Alloy (DaemonSet on master) → Loki → Grafana dashboard
**Grafana dashboard** shows: who accessed what resource, when, from where, and the outcome (allow/deny).
### 6. Resource Quotas
Each namespace-owner namespace gets a `ResourceQuota`:
```hcl
requests.cpu = "2"
requests.memory = "4Gi"
limits.cpu = "4"
limits.memory = "8Gi"
pods = "20"
```
Defaults can be overridden per-user via an optional `quota` field in the `k8s_users` variable.
## Implementation Order
1. Authentik OIDC application setup
2. kube-apiserver OIDC flag configuration
3. RBAC Terraform module
4. Audit logging
5. Self-service portal
6. Grafana dashboard for audit logs

File diff suppressed because it is too large


@@ -1,111 +0,0 @@
# OpenClaw Cluster Management Agent — Design
**Date**: 2026-02-21
**Status**: Approved
## Goal
Build a proactive cluster management agent that runs scheduled health checks every 30 minutes, auto-fixes safe issues, and alerts via Slack. The agent is "taught" via an OpenClaw skill and a reusable health check script.
## Architecture
```
CronJob (every 30min)
└─ kubectl exec into OpenClaw pod
└─ /workspace/infra/.claude/cluster-health.sh
├─ kubectl get nodes (check health)
├─ kubectl get pods -A (find problems)
├─ kubectl delete pod (evicted/stuck)
└─ curl Slack webhook (report)
```
Interactive path: User asks OpenClaw via UI -> `cluster-health` skill triggers -> runs same script -> LLM analyzes output and can do deeper investigation.
## Components
### 1. `cluster-health` skill (`.claude/skills/cluster-health/SKILL.md`)
Teaches OpenClaw:
- What health checks to run
- What's safe to auto-fix vs alert-only
- How to format Slack alerts
- How to do deeper investigation when asked interactively
Trigger conditions: "check cluster", "cluster health", "what's wrong", "health check", etc.
### 2. `cluster-health.sh` helper script (`.claude/cluster-health.sh`)
Reusable script that performs all checks:
**Checks:**
- Node health (NotReady, MemoryPressure, DiskPressure, PIDPressure)
- Pod health (CrashLoopBackOff, ImagePullBackOff, Error, OOMKilled, Pending)
- Evicted pods
- Failed deployments (unavailable replicas)
- Pending PVCs
- Resource pressure (high CPU/memory allocation)
- Failed CronJobs
- DaemonSet health (missing pods)
**Safe auto-fix actions:**
- Delete evicted pods
- Delete completed/succeeded pods older than 24h
- Restart (delete) pods in CrashLoopBackOff for more than 1 hour
**Alert-only (never auto-fix):**
- Node NotReady
- Persistent OOMKilled
- ImagePullBackOff
- Pending PVCs
- Failed deployments with 0 available replicas
**Output:**
- Structured text summary
- Posts to Slack via webhook
- Exit code 0 = healthy, 1 = issues found
### 3. Kubernetes CronJob (in `modules/kubernetes/openclaw/main.tf`)
- Schedule: `*/30 * * * *`
- Container: `bitnami/kubectl` (minimal image with kubectl)
- Command: `kubectl exec deploy/openclaw -n openclaw -- /bin/bash /workspace/infra/.claude/cluster-health.sh`
- ServiceAccount with RBAC to exec into pods in `openclaw` namespace
- `concurrencyPolicy: Forbid`
- `failedJobsHistoryLimit: 3`
- `successfulJobsHistoryLimit: 3`
### 4. Slack Integration
- Webhook URL from `openclaw_skill_secrets["slack"]` (already configured)
- Passed as `SLACK_WEBHOOK_URL` env var to the OpenClaw pod
## Slack Message Format
```
:white_check_mark: Cluster Health Check — All Clear
Nodes: 5/5 Ready | Pods: 142 Running | 0 Issues
```
```
:warning: Cluster Health Check — 3 Issues Found
Auto-fixed:
- Deleted 4 evicted pods in monitoring namespace
- Restarted stuck pod calibre-web-xyz (CrashLoopBackOff >1h)
Needs attention:
- Node k8s-node3: MemoryPressure condition detected
- PVC data-tandoor pending for 45 minutes
```
## Decisions
| Decision | Choice | Rationale |
|----------|--------|-----------|
| Mode | Proactive (scheduled) | Want automated monitoring |
| Alert channel | Slack | Existing webhook in openclaw_skill_secrets |
| Auto-fix | Safe fixes only | Delete evicted, restart stuck; alert for the rest |
| Frequency | 30 minutes | Balance between detection speed and overhead |
| Checks scope | Standard K8s health | Pod/node/deployment/PVC/CronJob/DaemonSet |
| Trigger mechanism | CronJob execs into OpenClaw pod | Reuses OpenClaw's tools; LLM available interactively |
| Fallback | None | Uptime Kuma monitors OpenClaw availability |


@@ -1,800 +0,0 @@
# OpenClaw Cluster Management Agent — Implementation Plan
> **For Claude:** REQUIRED SUB-SKILL: Use superpowers:executing-plans to implement this plan task-by-task.
**Goal:** Build a proactive cluster health agent — a skill that teaches OpenClaw to check the cluster, a helper script that runs the checks and posts to Slack, and a CronJob that triggers it every 30 minutes via `kubectl exec`.
**Architecture:** CronJob (bitnami/kubectl) -> `kubectl exec` into OpenClaw pod -> runs `cluster-health.sh` which performs 8 health checks, auto-fixes safe issues, and posts a summary to Slack. The same script is available as an OpenClaw skill for interactive use.
**Tech Stack:** Bash (health check script), Terraform/HCL (CronJob + RBAC), Slack webhook API, kubectl
---
### Task 1: Add Slack webhook to openclaw_skill_secrets
**Files:**
- Modify: `terraform.tfvars:1291-1295` (add slack_webhook key)
- Modify: `modules/kubernetes/openclaw/main.tf:350-376` (add SLACK_WEBHOOK_URL env var)
**Step 1: Add slack_webhook to openclaw_skill_secrets in tfvars**
Add a new key `slack_webhook` to the existing `openclaw_skill_secrets` map. The user must provide the webhook URL. For now, use the existing `alertmanager_slack_api_url` value or a dedicated one.
In `terraform.tfvars`, change:
```hcl
openclaw_skill_secrets = {
home_assistant_token = "..."
home_assistant_sofia_token = "..."
uptime_kuma_password = "..."
}
```
to:
```hcl
openclaw_skill_secrets = {
home_assistant_token = "..."
home_assistant_sofia_token = "..."
uptime_kuma_password = "..."
slack_webhook = "https://hooks.slack.com/services/YOUR/WEBHOOK/URL"
}
```
**NOTE:** Ask the user which Slack webhook URL to use. Candidates:
- `alertmanager_slack_api_url` (line 4 in tfvars)
- `tiny_tuya_slack_url` (line 1213, comment says "K8s bot slack")
- A new webhook the user creates
**Step 2: Add SLACK_WEBHOOK_URL env var to OpenClaw container**
In `modules/kubernetes/openclaw/main.tf`, add after the `UPTIME_KUMA_PASSWORD` env block (around line 370):
```hcl
# Skill secrets - Slack
env {
name = "SLACK_WEBHOOK_URL"
value = var.skill_secrets["slack_webhook"]
}
```
**Step 3: Commit**
```bash
git add modules/kubernetes/openclaw/main.tf
git commit -m "[ci skip] Add Slack webhook env var to OpenClaw deployment"
```
Do NOT commit `terraform.tfvars` separately — it will be committed with the full set of changes at the end.
---
### Task 2: Create the cluster-health.sh helper script
**Files:**
- Create: `.claude/cluster-health.sh`
**Step 1: Write the health check script**
Create `.claude/cluster-health.sh` with the following structure. The script:
- Uses `$KUBECONFIG` (already set in OpenClaw pod) or falls back to in-cluster config
- Runs 8 checks: nodes, pods, evicted, deployments, PVCs, resources, CronJobs, DaemonSets
- Auto-fixes: deletes evicted pods, restarts CrashLoopBackOff pods stuck >1 hour
- Posts structured Slack message via `$SLACK_WEBHOOK_URL`
- Exit code 0 = healthy, 1 = issues found
```bash
#!/usr/bin/env bash
# Cluster health check script for OpenClaw.
# Runs health checks, auto-fixes safe issues, posts to Slack.
# Designed to run inside the OpenClaw pod (has kubectl via $KUBECONFIG).
#
# Usage: ./cluster-health.sh [--no-slack] [--no-fix]
# --no-slack Skip Slack notification (useful for interactive/debug runs)
# --no-fix Skip auto-fix actions (report only)
set -euo pipefail
SEND_SLACK=true
AUTO_FIX=true
ISSUES=()
FIXES=()
WARNINGS=()
# --- Argument parsing ---
for arg in "$@"; do
case "$arg" in
--no-slack) SEND_SLACK=false ;;
--no-fix) AUTO_FIX=false ;;
esac
done
KUBECTL="kubectl"
# --- 1. Node Health ---
check_nodes() {
local nodes not_ready
nodes=$($KUBECTL get nodes --no-headers 2>&1) || { ISSUES+=("Cannot reach cluster API"); return; }
not_ready=$(echo "$nodes" | awk '$2 != "Ready" {print $1}' || true)
if [[ -n "$not_ready" ]]; then
while IFS= read -r node; do
ISSUES+=("Node NotReady: $node")
done <<< "$not_ready"
fi
# Check conditions
local conditions
conditions=$($KUBECTL get nodes -o json | python3 -c '
import json, sys
data = json.load(sys.stdin)
for node in data["items"]:
name = node["metadata"]["name"]
for c in node["status"]["conditions"]:
if c["type"] in ("MemoryPressure","DiskPressure","PIDPressure") and c["status"] == "True":
print(name + ": " + c["type"])
' 2>/dev/null) || true
if [[ -n "$conditions" ]]; then
while IFS= read -r line; do
ISSUES+=("$line")
done <<< "$conditions"
fi
}
# --- 2. Pod Health ---
check_pods() {
local bad
bad=$( {
$KUBECTL get pods -A --no-headers 2>/dev/null \
| grep -E 'CrashLoopBackOff|ImagePullBackOff|ErrImagePull|Error' || true
} | awk '!seen[$1,$2]++' | sed '/^$/d') || true
if [[ -z "$bad" ]]; then return; fi
while IFS= read -r line; do
local ns pod status
ns=$(echo "$line" | awk '{print $1}')
pod=$(echo "$line" | awk '{print $2}')
status=$(echo "$line" | awk '{print $4}')
if [[ "$status" == "CrashLoopBackOff" ]]; then
# Restart count is used as a proxy for being stuck (>10 restarts ≈ >1h of backoff)
local restart_count
restart_count=$(echo "$line" | awk '{print $5}')
if [[ "$AUTO_FIX" == true && "$restart_count" -gt 10 ]]; then
$KUBECTL delete pod -n "$ns" "$pod" --grace-period=30 2>/dev/null && \
FIXES+=("Restarted $ns/$pod (CrashLoopBackOff, $restart_count restarts)") || \
WARNINGS+=("Failed to restart $ns/$pod")
else
ISSUES+=("CrashLoopBackOff: $ns/$pod ($restart_count restarts)")
fi
elif [[ "$status" == "ImagePullBackOff" || "$status" == "ErrImagePull" ]]; then
ISSUES+=("ImagePullBackOff: $ns/$pod")
else
ISSUES+=("Error: $ns/$pod ($status)")
fi
done <<< "$bad"
}
# --- 3. Evicted/Failed Pods ---
check_evicted() {
local evicted count
evicted=$($KUBECTL get pods -A --no-headers --field-selector=status.phase=Failed 2>/dev/null || true)
if [[ -z "$evicted" ]]; then return; fi
count=$(echo "$evicted" | wc -l | tr -d ' ')
if [[ "$AUTO_FIX" == true && "$count" -gt 0 ]]; then
$KUBECTL delete pods -A --field-selector=status.phase=Failed 2>/dev/null && \
FIXES+=("Deleted $count evicted/failed pod(s)") || \
WARNINGS+=("Failed to delete evicted pods")
else
ISSUES+=("$count evicted/failed pod(s)")
fi
}
# --- 4. Failed Deployments ---
check_deployments() {
local deps
deps=$($KUBECTL get deployments -A --no-headers 2>/dev/null) || return
while IFS= read -r line; do
local ns name ready current desired
ns=$(echo "$line" | awk '{print $1}')
name=$(echo "$line" | awk '{print $2}')
ready=$(echo "$line" | awk '{print $3}')
current=$(echo "$ready" | cut -d/ -f1)
desired=$(echo "$ready" | cut -d/ -f2)
if [[ "$current" != "$desired" ]]; then
ISSUES+=("Deployment $ns/$name: $current/$desired ready")
fi
done <<< "$deps"
}
# --- 5. Pending PVCs ---
check_pvcs() {
local pvcs
pvcs=$($KUBECTL get pvc -A --no-headers 2>/dev/null) || return
if [[ -z "$pvcs" || "$pvcs" == *"No resources found"* ]]; then return; fi
while IFS= read -r line; do
local ns name status
ns=$(echo "$line" | awk '{print $1}')
name=$(echo "$line" | awk '{print $2}')
status=$(echo "$line" | awk '{print $3}')
if [[ "$status" != "Bound" ]]; then
ISSUES+=("PVC $ns/$name: $status")
fi
done <<< "$pvcs"
}
# --- 6. Resource Pressure ---
check_resources() {
local top
top=$($KUBECTL top nodes --no-headers 2>/dev/null) || return
while IFS= read -r line; do
local node cpu_pct mem_pct
node=$(echo "$line" | awk '{print $1}')
cpu_pct=$(echo "$line" | awk '{print $3}' | tr -d '%')
mem_pct=$(echo "$line" | awk '{print $5}' | tr -d '%')
[[ "$cpu_pct" == *"unknown"* || "$mem_pct" == *"unknown"* ]] && continue
if [[ "$cpu_pct" -gt 90 || "$mem_pct" -gt 90 ]]; then
ISSUES+=("High resource usage on $node: CPU ${cpu_pct}%, Mem ${mem_pct}%")
elif [[ "$cpu_pct" -gt 80 || "$mem_pct" -gt 80 ]]; then
WARNINGS+=("Elevated resource usage on $node: CPU ${cpu_pct}%, Mem ${mem_pct}%")
fi
done <<< "$top"
}
# --- 7. CronJob Failures ---
check_cronjobs() {
local failures
failures=$($KUBECTL get jobs -A -o json 2>/dev/null | python3 -c '
import json, sys
from datetime import datetime, timezone, timedelta
data = json.load(sys.stdin)
cutoff = datetime.now(timezone.utc) - timedelta(hours=24)
for job in data.get("items", []):
meta = job.get("metadata", {})
ns = meta.get("namespace", "")
name = meta.get("name", "")
owners = meta.get("ownerReferences", [])
if not any(o.get("kind") == "CronJob" for o in owners):
continue
for c in job.get("status", {}).get("conditions", []):
if c.get("type") == "Failed" and c.get("status") == "True":
ts = c.get("lastTransitionTime", "")
if ts:
try:
t = datetime.fromisoformat(ts.replace("Z", "+00:00"))
if t > cutoff:
print(f"{ns}/{name}")
except:
print(f"{ns}/{name}")
' 2>/dev/null) || true
if [[ -n "$failures" ]]; then
local count
count=$(echo "$failures" | wc -l | tr -d ' ')
ISSUES+=("$count CronJob failure(s) in last 24h")
fi
}
# --- 8. DaemonSet Health ---
check_daemonsets() {
local ds
ds=$($KUBECTL get daemonsets -A --no-headers 2>/dev/null) || return
while IFS= read -r line; do
local ns name desired ready
ns=$(echo "$line" | awk '{print $1}')
name=$(echo "$line" | awk '{print $2}')
desired=$(echo "$line" | awk '{print $3}')
ready=$(echo "$line" | awk '{print $5}')
if [[ "$desired" != "$ready" ]]; then
ISSUES+=("DaemonSet $ns/$name: desired=$desired ready=$ready")
fi
done <<< "$ds"
}
# --- Cluster summary stats ---
get_summary_stats() {
local node_count ready_count pod_count
node_count=$($KUBECTL get nodes --no-headers 2>/dev/null | wc -l | tr -d ' ')
ready_count=$($KUBECTL get nodes --no-headers 2>/dev/null | awk '$2 == "Ready"' | wc -l | tr -d ' ')
pod_count=$($KUBECTL get pods -A --no-headers --field-selector=status.phase=Running 2>/dev/null | wc -l | tr -d ' ')
echo "${ready_count}/${node_count} nodes | ${pod_count} pods running"
}
# --- Send Slack message ---
send_slack() {
local webhook_url="${SLACK_WEBHOOK_URL:-}"  # default to empty: may be unset under set -u
if [[ -z "${webhook_url:-}" ]]; then
echo "WARNING: SLACK_WEBHOOK_URL not set, skipping Slack notification"
return
fi
local summary issue_count fix_count warning_count
summary=$(get_summary_stats)
issue_count=${#ISSUES[@]}
fix_count=${#FIXES[@]}
warning_count=${#WARNINGS[@]}
local text=""
local total_problems=$((issue_count + warning_count))
if [[ "$total_problems" -eq 0 && "$fix_count" -eq 0 ]]; then
text=":white_check_mark: *Cluster Health Check — All Clear*\n${summary} | 0 issues"
else
if [[ "$issue_count" -gt 0 ]]; then
text=":rotating_light: *Cluster Health Check — ${issue_count} Issue(s) Found*\n${summary}"
elif [[ "$warning_count" -gt 0 ]]; then
text=":warning: *Cluster Health Check — ${warning_count} Warning(s)*\n${summary}"
else
text=":white_check_mark: *Cluster Health Check — All Clear (auto-fixed ${fix_count})*\n${summary}"
fi
if [[ "$fix_count" -gt 0 ]]; then
text+="\n\n*Auto-fixed:*"
for fix in "${FIXES[@]}"; do
text+="\n• ${fix}"
done
fi
if [[ "$issue_count" -gt 0 ]]; then
text+="\n\n*Needs attention:*"
for issue in "${ISSUES[@]}"; do
text+="\n• ${issue}"
done
fi
if [[ "$warning_count" -gt 0 ]]; then
text+="\n\n*Warnings:*"
for warning in "${WARNINGS[@]}"; do
text+="\n• ${warning}"
done
fi
fi
curl -s -X POST "$webhook_url" \
-H 'Content-Type: application/json' \
-d "{\"text\": \"${text}\"}" > /dev/null 2>&1
}
# --- Main ---
main() {
echo "=== Cluster Health Check — $(date '+%Y-%m-%d %H:%M:%S') ==="
check_nodes
check_pods
check_evicted
check_deployments
check_pvcs
check_resources
check_cronjobs
check_daemonsets
local issue_count=${#ISSUES[@]}
local fix_count=${#FIXES[@]}
local warning_count=${#WARNINGS[@]}
echo ""
echo "Results: ${issue_count} issue(s), ${fix_count} fix(es), ${warning_count} warning(s)"
if [[ "$fix_count" -gt 0 ]]; then
echo ""
echo "Auto-fixed:"
for fix in "${FIXES[@]}"; do echo " - $fix"; done
fi
if [[ "$issue_count" -gt 0 ]]; then
echo ""
echo "Issues:"
for issue in "${ISSUES[@]}"; do echo " - $issue"; done
fi
if [[ "$warning_count" -gt 0 ]]; then
echo ""
echo "Warnings:"
for warning in "${WARNINGS[@]}"; do echo " - $warning"; done
fi
if [[ "$SEND_SLACK" == true ]]; then
send_slack
echo ""
echo "Slack notification sent."
fi
# Exit code
if [[ "$issue_count" -gt 0 ]]; then
exit 1
fi
exit 0
}
main "$@"
```
**Step 2: Make it executable**
```bash
chmod +x .claude/cluster-health.sh
```
**Step 3: Test locally (dry run)**
```bash
KUBECONFIG=$(pwd)/config SLACK_WEBHOOK_URL="" bash .claude/cluster-health.sh --no-slack
```
Expected: Script runs, prints check results, no Slack post.
**Step 4: Commit**
```bash
git add .claude/cluster-health.sh
git commit -m "[ci skip] Add cluster health check script for OpenClaw agent"
```
---
### Task 3: Create the cluster-health skill
**Files:**
- Create: `.claude/skills/cluster-health/SKILL.md`
**Step 1: Write the skill document**
````markdown
---
name: cluster-health
description: |
Check Kubernetes cluster health and fix common issues. Use when:
(1) User asks to check the cluster, check health, or "what's wrong",
(2) User asks about pod status, node health, or deployment issues,
(3) User asks to fix stuck pods, evicted pods, or CrashLoopBackOff,
(4) User mentions "health check", "cluster status", "cluster health",
(5) User asks "is everything running" or "any problems".
Runs 8 standard K8s health checks with safe auto-fix for evicted pods
and stuck CrashLoopBackOff pods.
author: Claude Code
version: 1.0.0
date: 2026-02-21
---
# Cluster Health Check
## Overview
- **Script**: `/workspace/infra/.claude/cluster-health.sh`
- **Schedule**: CronJob runs every 30 minutes, execs into this pod
- **Slack**: Posts results to `$SLACK_WEBHOOK_URL`
- **Auto-fix**: Deletes evicted pods, restarts CrashLoopBackOff pods (>10 restarts)
## Quick Check
Run the health check script:
```bash
bash /workspace/infra/.claude/cluster-health.sh --no-slack
```
Or with Slack notification:
```bash
bash /workspace/infra/.claude/cluster-health.sh
```
Report-only (no auto-fix):
```bash
bash /workspace/infra/.claude/cluster-health.sh --no-fix
```
## What It Checks
| # | Check | Auto-Fix | Alert |
|---|-------|----------|-------|
| 1 | Node health (NotReady, conditions) | No | Yes |
| 2 | Pod health (CrashLoopBackOff, ImagePullBackOff, Error) | Restart if >10 restarts | Yes |
| 3 | Evicted/failed pods | Delete all | Yes |
| 4 | Deployment availability (current != desired) | No | Yes |
| 5 | PVC status (not Bound) | No | Yes |
| 6 | Resource pressure (CPU/Mem >80%) | No | Yes |
| 7 | CronJob failures (last 24h) | No | Yes |
| 8 | DaemonSet health (desired != ready) | No | Yes |
## Safe Auto-Fix Rules
These are the ONLY things the script auto-fixes:
1. **Evicted/failed pods**: `kubectl delete pods -A --field-selector=status.phase=Failed`
2. **CrashLoopBackOff pods with >10 restarts**: `kubectl delete pod -n <ns> <pod> --grace-period=30`
Everything else is alert-only. NEVER auto-fix:
- Node NotReady (could be maintenance)
- ImagePullBackOff (needs image tag or registry fix)
- Pending PVCs (needs storage investigation)
- Failed deployments (needs config investigation)
## Deep Investigation
When the script reports issues and the user asks for more detail, use these commands:
### Node issues
```bash
kubectl describe node <node-name>
kubectl top node <node-name>
kubectl get events --field-selector involvedObject.name=<node-name>
```
### Pod issues
```bash
kubectl describe pod -n <namespace> <pod-name>
kubectl logs -n <namespace> <pod-name> --tail=100
kubectl logs -n <namespace> <pod-name> --previous --tail=100
kubectl get events -n <namespace> --field-selector involvedObject.name=<pod-name>
```
### Deployment issues
```bash
kubectl describe deployment -n <namespace> <deployment-name>
kubectl rollout status deployment -n <namespace> <deployment-name>
kubectl rollout history deployment -n <namespace> <deployment-name>
```
### PVC issues
```bash
kubectl describe pvc -n <namespace> <pvc-name>
kubectl get pv
kubectl get events -n <namespace> --field-selector involvedObject.name=<pvc-name>
```
### Resource pressure
```bash
kubectl top nodes
kubectl top pods -A --sort-by=memory | head -20
kubectl top pods -A --sort-by=cpu | head -20
```
## Common Remediation
### CrashLoopBackOff (persistent)
1. Check logs: `kubectl logs -n <ns> <pod> --previous --tail=100`
2. Check events: `kubectl describe pod -n <ns> <pod>`
3. Common causes: OOMKilled (increase memory limit), bad config, missing env var
4. If image issue: check if newer image exists, update in Terraform
### OOMKilled
1. Check current limits: `kubectl describe pod -n <ns> <pod> | grep -A2 Limits`
2. Fix: Update resource limits in Terraform module for the service
3. Apply: `terraform apply -target=module.kubernetes_cluster.module.<service> -var="kube_config_path=$(pwd)/config"`
### ImagePullBackOff
1. Check image: `kubectl describe pod -n <ns> <pod> | grep Image`
2. Check registry: Is the image tag valid? Is the registry reachable?
3. Check pull-through cache: Docker registry at 10.0.20.10
### Node NotReady
1. Check kubelet: SSH to node, `systemctl status kubelet`
2. Check resources: `kubectl top node <node>`
3. Check conditions: `kubectl describe node <node> | grep -A10 Conditions`
## Slack Webhook
Messages are posted to the webhook at `$SLACK_WEBHOOK_URL`. Format:
- All clear: green check + summary stats
- Issues found: red siren + list of issues + auto-fix actions taken
- Warnings only: yellow warning + elevated metrics
## Infrastructure
- **Terraform module**: `modules/kubernetes/openclaw/main.tf`
- **CronJob**: Runs in `openclaw` namespace every 30 min
- **Existing healthcheck**: `scripts/cluster_healthcheck.sh` (local-only, not for OpenClaw)
- **Repo path inside pod**: `/workspace/infra/`
````
**Step 2: Commit**
```bash
git add .claude/skills/cluster-health/SKILL.md
git commit -m "[ci skip] Add cluster-health skill for OpenClaw agent"
```
---
### Task 4: Add CronJob and RBAC to Terraform
**Files:**
- Modify: `modules/kubernetes/openclaw/main.tf` (append CronJob + ServiceAccount + Role + RoleBinding)
**Step 1: Add CronJob resources**
Append the following to `modules/kubernetes/openclaw/main.tf` after the `module "ingress"` block:
```hcl
# --- CronJob: Scheduled cluster health check ---
resource "kubernetes_service_account" "healthcheck" {
metadata {
name = "cluster-healthcheck"
namespace = kubernetes_namespace.openclaw.metadata[0].name
}
}
resource "kubernetes_role" "healthcheck_exec" {
metadata {
name = "healthcheck-pod-exec"
namespace = kubernetes_namespace.openclaw.metadata[0].name
}
rule {
api_groups = [""]
resources = ["pods"]
verbs = ["get", "list"]
}
rule {
api_groups = [""]
resources = ["pods/exec"]
verbs = ["create"]
}
}
resource "kubernetes_role_binding" "healthcheck_exec" {
metadata {
name = "healthcheck-pod-exec"
namespace = kubernetes_namespace.openclaw.metadata[0].name
}
subject {
kind = "ServiceAccount"
name = kubernetes_service_account.healthcheck.metadata[0].name
namespace = kubernetes_namespace.openclaw.metadata[0].name
}
role_ref {
api_group = "rbac.authorization.k8s.io"
kind = "Role"
name = kubernetes_role.healthcheck_exec.metadata[0].name
}
}
resource "kubernetes_cron_job_v1" "cluster_healthcheck" {
metadata {
name = "cluster-healthcheck"
namespace = kubernetes_namespace.openclaw.metadata[0].name
labels = {
app = "cluster-healthcheck"
tier = var.tier
}
}
spec {
schedule = "*/30 * * * *"
concurrency_policy = "Forbid"
failed_jobs_history_limit = 3
successful_jobs_history_limit = 3
job_template {
metadata {
labels = {
app = "cluster-healthcheck"
}
}
spec {
active_deadline_seconds = 300
template {
metadata {
labels = {
app = "cluster-healthcheck"
}
}
spec {
service_account_name = kubernetes_service_account.healthcheck.metadata[0].name
restart_policy = "Never"
container {
name = "healthcheck"
image = "bitnami/kubectl:1.34"
command = ["bash", "-c", <<-EOF
# Find the openclaw pod
POD=$(kubectl get pods -n openclaw -l app=openclaw -o jsonpath='{.items[0].metadata.name}' 2>/dev/null)
if [ -z "$POD" ]; then
echo "ERROR: OpenClaw pod not found"
exit 1
fi
echo "Executing health check in pod $POD..."
kubectl exec -n openclaw "$POD" -c openclaw -- bash /workspace/infra/.claude/cluster-health.sh
EOF
]
resources {
requests = {
cpu = "50m"
memory = "64Mi"
}
limits = {
memory = "128Mi"
}
}
}
}
}
}
}
}
}
```
**Step 2: Verify Terraform formatting**
```bash
terraform fmt modules/kubernetes/openclaw/main.tf
```
**Step 3: Verify Terraform plan**
```bash
terraform plan -target=module.kubernetes_cluster.module.openclaw -var="kube_config_path=$(pwd)/config"
```
Expected: Plan shows 4 new resources (ServiceAccount, Role, RoleBinding, CronJobV1). No destructive changes to existing resources.
**Step 4: Commit**
```bash
git add modules/kubernetes/openclaw/main.tf
git commit -m "[ci skip] Add cluster health check CronJob to OpenClaw module"
```
---
### Task 5: Deploy and verify
**Step 1: Apply Terraform**
```bash
terraform apply -target=module.kubernetes_cluster.module.openclaw -var="kube_config_path=$(pwd)/config" -auto-approve
```
**Step 2: Verify CronJob exists**
```bash
kubectl --kubeconfig $(pwd)/config get cronjob -n openclaw
```
Expected: `cluster-healthcheck` with schedule `*/30 * * * *`
**Step 3: Verify RBAC**
```bash
kubectl --kubeconfig $(pwd)/config get serviceaccount,role,rolebinding -n openclaw
```
Expected: `cluster-healthcheck` SA, `healthcheck-pod-exec` role and rolebinding
**Step 4: Trigger a manual run**
```bash
kubectl --kubeconfig $(pwd)/config create job --from=cronjob/cluster-healthcheck healthcheck-manual-test -n openclaw
```
**Step 5: Check job output**
```bash
kubectl --kubeconfig $(pwd)/config wait --for=condition=complete job/healthcheck-manual-test -n openclaw --timeout=120s
kubectl --kubeconfig $(pwd)/config logs job/healthcheck-manual-test -n openclaw
```
Expected: Health check output with results. If `SLACK_WEBHOOK_URL` is set, check Slack for the message.
**Step 6: Clean up test job**
```bash
kubectl --kubeconfig $(pwd)/config delete job healthcheck-manual-test -n openclaw
```
**Step 7: Final commit**
```bash
git add -A modules/kubernetes/openclaw/ .claude/skills/cluster-health/ .claude/cluster-health.sh
git commit -m "[ci skip] OpenClaw cluster health agent: script + skill + CronJob"
```


@@ -1,387 +0,0 @@
# Terragrunt Migration Design
**Date**: 2026-02-22
**Status**: Approved
## Problem
The infrastructure repo has a monolithic Terraform setup:
- 15MB state file, 857 resources, 85+ service modules in a single root
- `terraform plan/apply` evaluates all modules even when targeting one service
- `null_resource.core_services` bottleneck blocks 73 services behind 12 core modules
- 150+ variables passed through root -> kubernetes_cluster -> individual services
- 3 providers (kubernetes, helm, proxmox) initialize on every run
## Goals
- **Speed**: Faster plan/apply by splitting state into independent stacks
- **Blast radius isolation**: Bad apply can't break unrelated services
- **DRY config**: Shared provider/backend configuration via Terragrunt
- **Proper DAG**: Full references between stacks (not hardcoded DNS strings)
- **Bootstrappable**: `terragrunt run-all apply` works from scratch
- **CI/CD**: Changed-stack detection in Drone CI
## Architecture: Flat Stacks
### Directory Structure
```
infra/
├── terragrunt.hcl # Root config (providers, backend, common vars)
├── stacks/
│ ├── infra/ # Proxmox VMs, templates, docker-registry
│ │ ├── terragrunt.hcl
│ │ └── main.tf
│ ├── platform/ # Core: traefik, metallb, redis, dbaas, authentik, etc.
│ │ ├── terragrunt.hcl
│ │ └── main.tf
│ ├── blog/ # One dir per user service
│ │ ├── terragrunt.hcl
│ │ └── main.tf
│ ├── immich/
│ │ ├── terragrunt.hcl
│ │ └── main.tf
│ └── ... (~65 service dirs)
├── modules/ # UNCHANGED — existing modules stay where they are
│ ├── kubernetes/
│ │ ├── ingress_factory/
│ │ ├── setup_tls_secret/
│ │ ├── blog/
│ │ ├── immich/
│ │ └── ...
│ ├── create-vm/
│ └── create-template-vm/
├── state/ # Per-stack state files
│ ├── infra/terraform.tfstate
│ ├── platform/terraform.tfstate
│ ├── blog/terraform.tfstate
│ └── ...
├── terraform.tfvars # UNCHANGED — encrypted secrets
├── secrets/ # UNCHANGED — TLS certs
├── main.tf # LEGACY — gradually emptied during migration
└── terraform.tfstate # LEGACY — gradually emptied during migration
```
Each stack has a thin `main.tf` wrapper that calls the existing module via
`source = "../../modules/kubernetes/<service>"`. We do NOT use Terragrunt's
`terraform { source }` directive because our modules use relative paths
(`../ingress_factory`, `../setup_tls_secret`) that would break when Terragrunt
copies them to `.terragrunt-cache/`.
### Stack Composition
**Infra stack** (~10 resources):
- Proxmox VM templates (k8s, non-k8s, docker-registry)
- Docker registry VM
- Uses proxmox provider (not kubernetes/helm)
**Platform stack** (~200 resources, ~20 services):
- traefik, metallb, redis, dbaas, technitium, authentik, crowdsec, cloudflared
- monitoring (prometheus, alertmanager, grafana, loki, alloy)
- kyverno, metrics-server, nvidia, mailserver, authelia
- wireguard, headscale, xray, uptime-kuma, vaultwarden, reverse-proxy
- Exports outputs consumed by service stacks
**Per-service stacks** (~65, each 5-25 resources):
- One stack per user-facing service
- Each depends on platform via Terragrunt `dependency` block
- Some depend on other services (f1-stream -> coturn, etc.)
### Dependency Graph
```
┌─────────┐
│ infra │
└────┬────┘
┌────▼────┐
│platform │ exports: redis_host, postgresql_host,
│ │ mysql_host, smtp_host, tls_secret_name, ...
└────┬────┘
┌────────┬───────────┼───────────┬────────┐
│ │ │ │ │
┌────▼──┐ ┌───▼───┐ ┌────▼───┐ ┌─────▼──┐ ┌──▼───┐
│ blog │ │immich │ │ affine │ │ollama │ │coturn│ ...
└───────┘ └───────┘ └────────┘ └───┬────┘ └──┬───┘
│ │
┌────▼───┐ ┌───▼──────┐
│openclaw│ │f1-stream │
│gramps │ └──────────┘
│ytdlp │
└────────┘
```
### Platform Stack Outputs
| Output | Value | Consumers |
|--------|-------|-----------|
| `redis_host` | `redis.redis.svc.cluster.local` | 10 services |
| `postgresql_host` | `postgresql.dbaas.svc.cluster.local` | 10 services |
| `postgresql_port` | `5432` | 10 services |
| `mysql_host` | `mysql.dbaas.svc.cluster.local` | 8 services |
| `mysql_port` | `3306` | 8 services |
| `smtp_host` | `mail.viktorbarzin.me` | 6 services |
| `smtp_port` | `587` | 6 services |
| `tls_secret_name` | from variable | all services |
| `authentik_outpost_url` | `http://ak-outpost-...` | traefik |
| `crowdsec_lapi_host` | `crowdsec-service...` | traefik |
| `alertmanager_url` | `http://prometheus-alertmanager...` | loki |
| `loki_push_url` | `http://loki...` | alloy |
Service-to-service dependencies:
| Service | Depends on | Outputs consumed |
|---------|-----------|-----------------|
| f1-stream | coturn | `coturn_host`, `coturn_port` |
| real-estate-crawler | osm-routing | `osrm_foot_host`, `osrm_bicycle_host` |
| openclaw, grampsweb, ytdlp | ollama | `ollama_host` |
### Module Modifications
Service modules that hardcode DNS names need modification to accept hosts as variables.
~20 modules affected. Example for affine:
**Before:**
```hcl
# modules/kubernetes/affine/main.tf
DATABASE_URL = "postgresql://...@postgresql.dbaas.svc.cluster.local:5432/affine"
REDIS_SERVER_HOST = "redis.redis.svc.cluster.local"
```
**After:**
```hcl
variable "redis_host" { type = string }
variable "postgresql_host" { type = string }
variable "postgresql_port" { type = number }
DATABASE_URL = "postgresql://...@${var.postgresql_host}:${var.postgresql_port}/affine"
REDIS_SERVER_HOST = var.redis_host
```
## Root Terragrunt Configuration
```hcl
# infra/terragrunt.hcl
remote_state {
backend = "local"
generate = {
path = "backend.tf"
if_exists = "overwrite_terragrunt"
}
config = {
path = "${get_repo_root()}/state/${path_relative_to_include()}/terraform.tfstate"
}
}
terraform {
extra_arguments "common_vars" {
commands = get_terraform_commands_that_need_vars()
required_var_files = [
"${get_repo_root()}/terraform.tfvars"
]
}
}
generate "k8s_providers" {
path = "providers.tf"
if_exists = "overwrite_terragrunt"
contents = <<EOF
variable "kube_config_path" {
type = string
default = "~/.kube/config"
}
provider "kubernetes" {
config_path = var.kube_config_path
}
provider "helm" {
kubernetes {
config_path = var.kube_config_path
}
}
EOF
}
```
## Stack Wrapper Examples
### Simple service (blog)
```hcl
# stacks/blog/terragrunt.hcl
include "root" {
path = find_in_parent_folders()
}
dependency "platform" {
config_path = "../platform"
}
inputs = {
tls_secret_name = dependency.platform.outputs.tls_secret_name
}
```
```hcl
# stacks/blog/main.tf
variable "tls_secret_name" {}
variable "kube_config_path" { default = "~/.kube/config" }
module "blog" {
source = "../../modules/kubernetes/blog"
tls_secret_name = var.tls_secret_name
tier = "4-aux"
}
```
### Database-backed service (affine)
```hcl
# stacks/affine/terragrunt.hcl
include "root" {
path = find_in_parent_folders()
}
dependency "platform" {
config_path = "../platform"
}
inputs = {
tls_secret_name = dependency.platform.outputs.tls_secret_name
redis_host = dependency.platform.outputs.redis_host
postgresql_host = dependency.platform.outputs.postgresql_host
postgresql_port = dependency.platform.outputs.postgresql_port
smtp_host = dependency.platform.outputs.smtp_host
smtp_port = dependency.platform.outputs.smtp_port
}
```
```hcl
# stacks/affine/main.tf
variable "tls_secret_name" {}
variable "kube_config_path" { default = "~/.kube/config" }
variable "affine_postgresql_password" {}
variable "redis_host" { type = string }
variable "postgresql_host" { type = string }
variable "postgresql_port" { type = number }
variable "smtp_host" { type = string }
variable "smtp_port" { type = number }
module "affine" {
source = "../../modules/kubernetes/affine"
tls_secret_name = var.tls_secret_name
postgresql_password = var.affine_postgresql_password
redis_host = var.redis_host
postgresql_host = var.postgresql_host
postgresql_port = var.postgresql_port
smtp_host = var.smtp_host
smtp_port = var.smtp_port
tier = "4-aux"
}
```
### Service-to-service dependency (f1-stream -> coturn)
```hcl
# stacks/f1-stream/terragrunt.hcl
include "root" {
path = find_in_parent_folders()
}
dependency "platform" {
config_path = "../platform"
}
dependency "coturn" {
config_path = "../coturn"
}
inputs = {
tls_secret_name = dependency.platform.outputs.tls_secret_name
coturn_host = dependency.coturn.outputs.coturn_host
coturn_port = dependency.coturn.outputs.coturn_port
}
```
## Migration Strategy
### Phase 0: Setup
- Install Terragrunt
- Create root `terragrunt.hcl`, `stacks/`, `state/` directories
- No state changes, no risk
### Phase 1: Infra Stack (VMs)
- Create `stacks/infra/` with Proxmox provider + VM module calls
- `terraform state mv` 4 root-level module resources to `state/infra/`
- Remove from root `main.tf`
- Verify: `cd stacks/infra && terragrunt plan` shows no changes
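Each phase repeats the same move pattern; a sketch for a single Phase 1 move (the resource address is illustrative — real addresses come from `terraform state list`):
```bash
# Move one root-level module from the monolith state into the infra stack
# (-state/-state-out select the source and destination state files)
terraform state mv \
  -state=terraform.tfstate \
  -state-out=state/infra/terraform.tfstate \
  'module.docker_registry_vm' 'module.docker_registry_vm'
```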
### Phase 2: Platform Stack (Core Services)
- Create `stacks/platform/main.tf` with ~20 core services + outputs
- `terraform state mv` ~200 resources from `module.kubernetes_cluster.module.<core>`
- Remove `null_resource.core_services` (Terragrunt handles ordering)
- Verify: `cd stacks/platform && terragrunt plan` shows no changes
### Phase 3: Simple Services (No DB Dependencies)
- blog, echo, privatebin, excalidraw, city-guesser, dashy, etc.
- Create stack, move state, verify — one at a time
### Phase 4: Database-Backed Services
- Modify modules to accept hosts as variables
- affine, immich, linkwarden, nextcloud, grampsweb, etc.
- Create stack, move state, verify
### Phase 5: Service-to-Service Dependencies
- ollama -> openclaw, grampsweb, ytdlp
- coturn -> f1-stream
- osm-routing -> real-estate-crawler
### Phase 6: Cleanup
- Delete DEFCON system from `modules/kubernetes/main.tf`
- Delete legacy `terraform.tfstate`
- Delete root `main.tf` kubernetes_cluster module call
- Update CI/CD to Terragrunt
### Rollback
At any phase, `terraform state mv` resources back to monolith state and
restore module calls.
## CI/CD: Changed-Stack Detection
Drone CI pipeline detects changed files per commit and maps to affected stacks:
| Changed file | Affected stack |
|-------------|---------------|
| `stacks/blog/*` | blog |
| `modules/kubernetes/blog/*` | blog |
| `terraform.tfvars` | all stacks |
| `terragrunt.hcl` | all stacks |
| `modules/kubernetes/ingress_factory/*` | all stacks |
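Expressed as a shell step in the pipeline, the mapping above might look like this sketch (exact paths and the `ALL` sentinel are assumptions):
```bash
# Map changed files to affected stacks; "ALL" means run-all
git diff --name-only HEAD~1 HEAD | while read -r f; do
  case "$f" in
    terraform.tfvars|terragrunt.hcl|modules/kubernetes/ingress_factory/*) echo ALL ;;
    stacks/*)             echo "$f" | cut -d/ -f2 ;;
    modules/kubernetes/*) echo "$f" | cut -d/ -f3 ;;
  esac
done | sort -u
```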
### Manual Workflow
```bash
# Apply single service
cd stacks/blog && terragrunt apply
# Apply everything (respects DAG ordering)
cd stacks && terragrunt run-all apply
# Plan everything
cd stacks && terragrunt run-all plan
```
## Decisions Made
| Decision | Choice | Rationale |
|----------|--------|-----------|
| Tool | Terragrunt | DRY config, dependency management, run-all orchestration |
| Stack granularity | 1 platform + 1 per service | Max isolation for apps, grouped core |
| Migration | Incremental | Lower risk, verify each step |
| Shared modules | Relative paths | Simple, no registry overhead |
| State backend | Local files | No external dependencies |
| Cross-stack refs | Full references via outputs | Proper DAG, bootstrappable from scratch |
| CI/CD | Changed-stack detection | Only apply what changed |

File diff suppressed because it is too large