From 51cb045f12789e28ea528afdac95af75790be139 Mon Sep 17 00:00:00 2001
From: Viktor Barzin
Date: Sat, 21 Feb 2026 23:45:30 +0000
Subject: [PATCH] [ci skip] Add OpenClaw cluster management agent design doc

---
 ...026-02-21-openclaw-cluster-agent-design.md | 111 ++++++++++++++++++
 1 file changed, 111 insertions(+)
 create mode 100644 docs/plans/2026-02-21-openclaw-cluster-agent-design.md

diff --git a/docs/plans/2026-02-21-openclaw-cluster-agent-design.md b/docs/plans/2026-02-21-openclaw-cluster-agent-design.md
new file mode 100644
index 00000000..fba6446e
--- /dev/null
+++ b/docs/plans/2026-02-21-openclaw-cluster-agent-design.md
@@ -0,0 +1,111 @@
+# OpenClaw Cluster Management Agent — Design
+
+**Date**: 2026-02-21
+**Status**: Approved
+
+## Goal
+
+Build a proactive cluster management agent that runs scheduled health checks every 30 minutes, auto-fixes safe issues, and alerts via Slack. The agent is "taught" via an OpenClaw skill and a reusable health check script.
+
+## Architecture
+
+```
+CronJob (every 30min)
+  └─ kubectl exec into OpenClaw pod
+       └─ /workspace/infra/.claude/cluster-health.sh
+            ├─ kubectl get nodes (check health)
+            ├─ kubectl get pods -A (find problems)
+            ├─ kubectl delete pod (evicted/stuck)
+            └─ curl Slack webhook (report)
+```
+
+Interactive path: User asks OpenClaw via UI -> `cluster-health` skill triggers -> runs same script -> LLM analyzes output and can do deeper investigation.
+
+## Components
+
+### 1. `cluster-health` skill (`.claude/skills/cluster-health/SKILL.md`)
+
+Teaches OpenClaw:
+- What health checks to run
+- What's safe to auto-fix vs alert-only
+- How to format Slack alerts
+- How to do deeper investigation when asked interactively
+
+Trigger conditions: "check cluster", "cluster health", "what's wrong", "health check", etc.
+
+### 2. `cluster-health.sh` helper script (`.claude/cluster-health.sh`)
+
+Reusable script that performs all checks:
+
+**Checks:**
+- Node health (NotReady, MemoryPressure, DiskPressure, PIDPressure)
+- Pod health (CrashLoopBackOff, ImagePullBackOff, Error, OOMKilled, Pending)
+- Evicted pods
+- Failed deployments (unavailable replicas)
+- Pending PVCs
+- Resource pressure (high CPU/memory allocation)
+- Failed CronJobs
+- DaemonSet health (missing pods)
+
+**Safe auto-fix actions:**
+- Delete evicted pods
+- Delete completed/succeeded pods older than 24h
+- Restart (delete) pods in CrashLoopBackOff for more than 1 hour
+
+**Alert-only (never auto-fix):**
+- Node NotReady
+- Persistent OOMKilled
+- ImagePullBackOff
+- Pending PVCs
+- Failed deployments with 0 available replicas
+
+**Output:**
+- Structured text summary
+- Posts to Slack via webhook
+- Exit code 0 = healthy, 1 = issues found
+
+### 3. Kubernetes CronJob (in `modules/kubernetes/openclaw/main.tf`)
+
+- Schedule: `*/30 * * * *`
+- Container: `bitnami/kubectl` (minimal image with kubectl)
+- Command: `kubectl exec deploy/openclaw -n openclaw -- /bin/bash /workspace/infra/.claude/cluster-health.sh`
+- ServiceAccount with RBAC to exec into pods in `openclaw` namespace
+- `concurrencyPolicy: Forbid`
+- `failedJobsHistoryLimit: 3`
+- `successfulJobsHistoryLimit: 3`
+
+### 4. Slack Integration
+
+- Webhook URL from `openclaw_skill_secrets["slack"]` (already configured)
+- Passed as `SLACK_WEBHOOK_URL` env var to the OpenClaw pod
+
+## Slack Message Format
+
+```
+:white_check_mark: Cluster Health Check — All Clear
+Nodes: 5/5 Ready | Pods: 142 Running | 0 Issues
+```
+
+```
+:warning: Cluster Health Check — 3 Issues Found
+
+Auto-fixed:
+- Deleted 4 evicted pods in monitoring namespace
+- Restarted stuck pod calibre-web-xyz (CrashLoopBackOff >1h)
+
+Needs attention:
+- Node k8s-node3: MemoryPressure condition detected
+- PVC data-tandoor pending for 45 minutes
+```
+
+## Decisions
+
+| Decision | Choice | Rationale |
+|----------|--------|-----------|
+| Mode | Proactive (scheduled) | Want automated monitoring |
+| Alert channel | Slack | Existing webhook in openclaw_skill_secrets |
+| Auto-fix | Safe fixes only | Delete evicted, restart stuck; alert for the rest |
+| Frequency | 30 minutes | Balance between detection speed and overhead |
+| Checks scope | Standard K8s health | Pod/node/deployment/PVC/CronJob/DaemonSet |
+| Trigger mechanism | CronJob execs into OpenClaw pod | Reuses OpenClaw's tools; LLM available interactively |
+| Fallback | None | Uptime Kuma monitors OpenClaw availability |
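
The split between safe auto-fix actions and alert-only conditions, together with the exit-code contract of `cluster-health.sh`, can be sketched as two small shell functions. This is a minimal sketch, not the actual script: the function names `classify_issue` and `health_exit_code`, the `ignore` result for recently completed pods, and the choice to report a CrashLoopBackOff pod younger than one hour as an alert are all assumptions made for illustration.

```shell
#!/usr/bin/env bash
# Illustrative sketch only. Function names and the "ignore" outcome are
# assumptions and do not come from the design doc.

# classify_issue REASON [AGE_SECONDS]
# Prints "autofix", "alert", or "ignore" for one observed condition.
classify_issue() {
  local reason="$1"
  local age="${2:-0}"
  case "$reason" in
    Evicted)
      # Evicted pods are always safe to delete.
      echo "autofix" ;;
    Succeeded)
      # Completed pods are only cleaned up once older than 24h.
      if [ "$age" -gt 86400 ]; then echo "autofix"; else echo "ignore"; fi ;;
    CrashLoopBackOff)
      # Restart only after more than 1 hour; before that, report (assumption).
      if [ "$age" -gt 3600 ]; then echo "autofix"; else echo "alert"; fi ;;
    *)
      # NotReady, OOMKilled, ImagePullBackOff, pending PVCs, and anything
      # unrecognized are never auto-fixed.
      echo "alert" ;;
  esac
}

# health_exit_code ISSUE_COUNT
# Returns 0 when the cluster is healthy and 1 when any issue was found.
health_exit_code() {
  if [ "$1" -eq 0 ]; then return 0; else return 1; fi
}
```

For example, `classify_issue CrashLoopBackOff 7200` prints `autofix`, while `classify_issue ImagePullBackOff` prints `alert`. Falling through to `alert` by default means a failure mode the script does not recognize is reported rather than acted on.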