From 51cb045f12789e28ea528afdac95af75790be139 Mon Sep 17 00:00:00 2001
From: Viktor Barzin <viktorbarzin@meta.com>
Date: Sat, 21 Feb 2026 23:45:30 +0000
Subject: [PATCH] [ci skip] Add OpenClaw cluster management agent design doc

---
 ...026-02-21-openclaw-cluster-agent-design.md | 111 ++++++++++++++++++
 1 file changed, 111 insertions(+)
 create mode 100644 docs/plans/2026-02-21-openclaw-cluster-agent-design.md

diff --git a/docs/plans/2026-02-21-openclaw-cluster-agent-design.md b/docs/plans/2026-02-21-openclaw-cluster-agent-design.md
new file mode 100644
index 00000000..fba6446e
--- /dev/null
+++ b/docs/plans/2026-02-21-openclaw-cluster-agent-design.md
@@ -0,0 +1,111 @@
+# OpenClaw Cluster Management Agent — Design
+
+**Date**: 2026-02-21
+**Status**: Approved
+
+## Goal
+
+Build a proactive cluster management agent that runs scheduled health checks every 30 minutes, auto-fixes safe issues, and alerts via Slack. The agent is "taught" via an OpenClaw skill and a reusable health check script.
+
+## Architecture
+
+```
+CronJob (every 30min)
+  └─ kubectl exec into OpenClaw pod
+       └─ /workspace/infra/.claude/cluster-health.sh
+            ├─ kubectl get nodes (check health)
+            ├─ kubectl get pods -A (find problems)
+            ├─ kubectl delete pod (evicted/stuck)
+            └─ curl Slack webhook (report)
+```
+
+Interactive path: the user asks OpenClaw via the UI -> the `cluster-health` skill triggers -> OpenClaw runs the same script -> the LLM analyzes the output and can investigate further.
+
+## Components
+
+### 1. `cluster-health` skill (`.claude/skills/cluster-health/SKILL.md`)
+
+Teaches OpenClaw:
+- What health checks to run
+- What's safe to auto-fix vs alert-only
+- How to format Slack alerts
+- How to do deeper investigation when asked interactively
+
+Trigger conditions: "check cluster", "cluster health", "what's wrong", "health check", etc.
+
+### 2. `cluster-health.sh` helper script (`.claude/cluster-health.sh`)
+
+Reusable script that performs all checks:
+
+**Checks:**
+- Node health (NotReady, MemoryPressure, DiskPressure, PIDPressure)
+- Pod health (CrashLoopBackOff, ImagePullBackOff, Error, OOMKilled, Pending)
+- Evicted pods
+- Failed deployments (unavailable replicas)
+- Pending PVCs
+- Resource pressure (high CPU/memory allocation)
+- Failed CronJobs
+- DaemonSet health (missing pods)
+
+**Safe auto-fix actions:**
+- Delete evicted pods
+- Delete completed/succeeded pods older than 24h
+- Restart (delete) pods that have been in CrashLoopBackOff for more than 1 hour, so that the owning controller recreates them
+
+**Alert-only (never auto-fix):**
+- Node NotReady
+- Persistent OOMKilled
+- ImagePullBackOff
+- Pending PVCs
+- Failed deployments with 0 available replicas
+
+**Output:**
+- Structured text summary
+- Posts to Slack via webhook
+- Exit code 0 = healthy, 1 = issues found
+
+### 3. Kubernetes CronJob (in `modules/kubernetes/openclaw/main.tf`)
+
+- Schedule: `*/30 * * * *`
+- Container: `bitnami/kubectl` (minimal image with kubectl)
+- Command: `kubectl exec deploy/openclaw -n openclaw -- /bin/bash /workspace/infra/.claude/cluster-health.sh`
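+- ServiceAccount with RBAC to exec into pods in the `openclaw` namespace (`create` on `pods/exec`, plus `get`/`list` on pods and `get` on deployments so that `deploy/openclaw` resolves to a pod)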
+- `concurrencyPolicy: Forbid`
+- `failedJobsHistoryLimit: 3`
+- `successfulJobsHistoryLimit: 3`
+
+### 4. Slack Integration
+
+- Webhook URL from `openclaw_skill_secrets["slack"]` (already configured)
+- Passed as `SLACK_WEBHOOK_URL` env var to the OpenClaw pod
+
+## Slack Message Format
+
+```
+:white_check_mark: Cluster Health Check — All Clear
+Nodes: 5/5 Ready | Pods: 142 Running | 0 Issues
+```
+
+```
+:warning: Cluster Health Check — 4 Issues Found
+
+Auto-fixed:
+- Deleted 4 evicted pods in monitoring namespace
+- Restarted stuck pod calibre-web-xyz (CrashLoopBackOff >1h)
+
+Needs attention:
+- Node k8s-node3: MemoryPressure condition detected
+- PVC data-tandoor pending for 45 minutes
+```
+
+## Decisions
+
+| Decision | Choice | Rationale |
+|----------|--------|-----------|
+| Mode | Proactive (scheduled) | Want automated monitoring |
+| Alert channel | Slack | Existing webhook in openclaw_skill_secrets |
+| Auto-fix | Safe fixes only | Delete evicted, restart stuck; alert for the rest |
+| Frequency | 30 minutes | Balance between detection speed and overhead |
+| Checks scope | Standard K8s health | Pod/node/deployment/PVC/CronJob/DaemonSet |
+| Trigger mechanism | CronJob execs into OpenClaw pod | Reuses OpenClaw's tools; LLM available interactively |
+| Fallback | None | Uptime Kuma monitors OpenClaw availability |