OpenClaw Cluster Management Agent — Design

Date: 2026-02-21 Status: Approved

Goal

Build a proactive cluster management agent that runs scheduled health checks every 30 minutes, auto-fixes safe issues, and alerts via Slack. The agent is "taught" via an OpenClaw skill and a reusable health check script.

Architecture

CronJob (every 30min)
  └─ kubectl exec into OpenClaw pod
       └─ /workspace/infra/.claude/cluster-health.sh
            ├─ kubectl get nodes (check health)
            ├─ kubectl get pods -A (find problems)
            ├─ kubectl delete pod (evicted/stuck)
            └─ curl Slack webhook (report)

Interactive path: User asks OpenClaw via UI -> cluster-health skill triggers -> runs same script -> LLM analyzes output and can do deeper investigation.

Components

1. `cluster-health` skill (`.claude/skills/cluster-health/SKILL.md`)

Teaches OpenClaw:

What health checks to run
What's safe to auto-fix vs alert-only
How to format Slack alerts
How to do deeper investigation when asked interactively

Trigger conditions: "check cluster", "cluster health", "what's wrong", "health check", etc.

2. `cluster-health.sh` helper script (`.claude/cluster-health.sh`)

Reusable script that performs all checks:

Checks:

Node health (NotReady, MemoryPressure, DiskPressure, PIDPressure)
Pod health (CrashLoopBackOff, ImagePullBackOff, Error, OOMKilled, Pending)
Evicted pods
Failed deployments (unavailable replicas)
Pending PVCs
Resource pressure (high CPU/memory allocation)
Failed CronJobs
DaemonSet health (missing pods)

Safe auto-fix actions:

Delete evicted pods
Delete completed/succeeded pods older than 24h
Restart (delete) pods in CrashLoopBackOff for more than 1 hour

Alert-only (never auto-fix):

Node NotReady
Persistent OOMKilled
ImagePullBackOff
Pending PVCs
Failed deployments with 0 available replicas

Output:

Structured text summary
Posts to Slack via webhook
Exit code 0 = healthy, 1 = issues found

3. Kubernetes CronJob (in `modules/kubernetes/openclaw/main.tf`)

Schedule: */30 * * * *
Container: bitnami/kubectl (minimal image with kubectl)
Command: kubectl exec deploy/openclaw -n openclaw -- /bin/bash /workspace/infra/.claude/cluster-health.sh
ServiceAccount with RBAC to exec into pods in openclaw namespace
concurrencyPolicy: Forbid
failedJobsHistoryLimit: 3
successfulJobsHistoryLimit: 3

4. Slack Integration

Webhook URL from openclaw_skill_secrets["slack"] (already configured)
Passed as SLACK_WEBHOOK_URL env var to the OpenClaw pod

Slack Message Format

:white_check_mark: Cluster Health Check — All Clear
Nodes: 5/5 Ready | Pods: 142 Running | 0 Issues

:warning: Cluster Health Check — 3 Issues Found

Auto-fixed:
- Deleted 4 evicted pods in monitoring namespace
- Restarted stuck pod calibre-web-xyz (CrashLoopBackOff >1h)

Needs attention:
- Node k8s-node3: MemoryPressure condition detected
- PVC data-tandoor pending for 45 minutes

Decisions

Decision	Choice	Rationale
Mode	Proactive (scheduled)	Want automated monitoring
Alert channel	Slack	Existing webhook in openclaw_skill_secrets
Auto-fix	Safe fixes only	Delete evicted, restart stuck; alert for the rest
Frequency	30 minutes	Balance between detection speed and overhead
Checks scope	Standard K8s health	Pod/node/deployment/PVC/CronJob/DaemonSet
Trigger mechanism	CronJob execs into OpenClaw pod	Reuses OpenClaw's tools; LLM available interactively
Fallback	None	Uptime Kuma monitors OpenClaw availability

3.6 KiB Raw Blame History