[ci skip] Add OpenClaw cluster health agent implementation plan

Viktor Barzin 2026-02-21 23:48:36 +00:00
parent 51cb045f12
commit f41e2ca969


@ -0,0 +1,800 @@
# OpenClaw Cluster Management Agent — Implementation Plan
> **For Claude:** REQUIRED SUB-SKILL: Use superpowers:executing-plans to implement this plan task-by-task.
**Goal:** Build a proactive cluster health agent — a skill that teaches OpenClaw to check the cluster, a helper script that runs the checks and posts to Slack, and a CronJob that triggers it every 30 minutes via `kubectl exec`.
**Architecture:** CronJob (bitnami/kubectl) -> `kubectl exec` into OpenClaw pod -> runs `cluster-health.sh` which performs 8 health checks, auto-fixes safe issues, and posts a summary to Slack. The same script is available as an OpenClaw skill for interactive use.
**Tech Stack:** Bash (health check script), Terraform/HCL (CronJob + RBAC), Slack webhook API, kubectl
---
### Task 1: Add Slack webhook to openclaw_skill_secrets
**Files:**
- Modify: `terraform.tfvars:1291-1295` (add slack_webhook key)
- Modify: `modules/kubernetes/openclaw/main.tf:350-376` (add SLACK_WEBHOOK_URL env var)
**Step 1: Add slack_webhook to openclaw_skill_secrets in tfvars**
Add a new key `slack_webhook` to the existing `openclaw_skill_secrets` map. The user must provide the webhook URL. For now, use the existing `alertmanager_slack_api_url` value or a dedicated one.
In `terraform.tfvars`, change:
```hcl
openclaw_skill_secrets = {
home_assistant_token = "..."
home_assistant_sofia_token = "..."
uptime_kuma_password = "..."
}
```
to:
```hcl
openclaw_skill_secrets = {
home_assistant_token = "..."
home_assistant_sofia_token = "..."
uptime_kuma_password = "..."
slack_webhook = "https://hooks.slack.com/services/YOUR/WEBHOOK/URL"
}
```
**NOTE:** Ask the user which Slack webhook URL to use. Candidates:
- `alertmanager_slack_api_url` (line 4 in tfvars)
- `tiny_tuya_slack_url` (line 1213, comment says "K8s bot slack")
- A new webhook the user creates
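Whichever candidate is chosen, a quick shape check can catch a pasted placeholder or a truncated URL before it lands in `terraform.tfvars`. The sketch below is an assumption-laden illustration: `is_slack_webhook` is a hypothetical helper, and the pattern is inferred from the placeholder URL in this plan rather than from any Slack specification.

```shell
# Hypothetical helper: succeeds when the argument looks like a Slack
# incoming-webhook URL (three path segments after /services/).
# The pattern is an assumption based on the placeholder in this plan.
is_slack_webhook() {
  printf '%s\n' "${1:-}" \
    | grep -Eq '^https://hooks\.slack\.com/services/[A-Za-z0-9_]+/[A-Za-z0-9_]+/[A-Za-z0-9_]+$'
}

if is_slack_webhook "https://hooks.slack.com/services/YOUR/WEBHOOK/URL"; then
  echo "URL has the expected shape"
else
  echo "URL does not look like a Slack webhook"
fi
```

The check only validates the shape of the string; it says nothing about whether the webhook is live or which channel it posts to.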
**Step 2: Add SLACK_WEBHOOK_URL env var to OpenClaw container**
In `modules/kubernetes/openclaw/main.tf`, add after the `UPTIME_KUMA_PASSWORD` env block (around line 370):
```hcl
# Skill secrets - Slack
env {
name = "SLACK_WEBHOOK_URL"
value = var.skill_secrets["slack_webhook"]
}
```
**Step 3: Commit**
```bash
git add modules/kubernetes/openclaw/main.tf
git commit -m "[ci skip] Add Slack webhook env var to OpenClaw deployment"
```
Do NOT commit `terraform.tfvars` separately — it will be committed with the full set of changes at the end.
---
### Task 2: Create the cluster-health.sh helper script
**Files:**
- Create: `.claude/cluster-health.sh`
**Step 1: Write the health check script**
Create `.claude/cluster-health.sh` with the following structure. The script:
- Uses `$KUBECONFIG` (already set in OpenClaw pod) or falls back to in-cluster config
- Runs 8 checks: nodes, pods, evicted, deployments, PVCs, resources, CronJobs, DaemonSets
- Auto-fixes: deletes evicted pods, restarts CrashLoopBackOff pods with more than 10 restarts
- Posts structured Slack message via `$SLACK_WEBHOOK_URL`
- Exit code 0 = healthy, 1 = issues found
```bash
#!/usr/bin/env bash
# Cluster health check script for OpenClaw.
# Runs health checks, auto-fixes safe issues, posts to Slack.
# Designed to run inside the OpenClaw pod (has kubectl via $KUBECONFIG).
#
# Usage: ./cluster-health.sh [--no-slack] [--no-fix]
# --no-slack Skip Slack notification (useful for interactive/debug runs)
# --no-fix Skip auto-fix actions (report only)
set -euo pipefail
SEND_SLACK=true
AUTO_FIX=true
ISSUES=()
FIXES=()
WARNINGS=()
# --- Argument parsing ---
for arg in "$@"; do
case "$arg" in
--no-slack) SEND_SLACK=false ;;
--no-fix) AUTO_FIX=false ;;
esac
done
KUBECTL="kubectl"
# --- 1. Node Health ---
check_nodes() {
local nodes not_ready
nodes=$($KUBECTL get nodes --no-headers 2>&1) || { ISSUES+=("Cannot reach cluster API"); return; }
not_ready=$(echo "$nodes" | awk '$2 != "Ready" {print $1}' || true)
if [[ -n "$not_ready" ]]; then
while IFS= read -r node; do
ISSUES+=("Node NotReady: $node")
done <<< "$not_ready"
fi
# Check conditions
local conditions
conditions=$($KUBECTL get nodes -o json | python3 -c '
import json, sys
data = json.load(sys.stdin)
for node in data["items"]:
    name = node["metadata"]["name"]
    for c in node["status"]["conditions"]:
        if c["type"] in ("MemoryPressure","DiskPressure","PIDPressure") and c["status"] == "True":
            print(name + ": " + c["type"])
' 2>/dev/null) || true
if [[ -n "$conditions" ]]; then
while IFS= read -r line; do
ISSUES+=("$line")
done <<< "$conditions"
fi
}
# --- 2. Pod Health ---
check_pods() {
local bad
bad=$( {
$KUBECTL get pods -A --no-headers 2>/dev/null \
| grep -E 'CrashLoopBackOff|ImagePullBackOff|ErrImagePull|Error' || true
} | awk '!seen[$1,$2]++' | sed '/^$/d') || true
if [[ -z "$bad" ]]; then return; fi
while IFS= read -r line; do
local ns pod status
ns=$(echo "$line" | awk '{print $1}')
pod=$(echo "$line" | awk '{print $2}')
status=$(echo "$line" | awk '{print $4}')
if [[ "$status" == "CrashLoopBackOff" ]]; then
# Auto-restart only after more than 10 restarts (used as a proxy for being stuck)
local restart_count
restart_count=$(echo "$line" | awk '{print $5}')
if [[ "$AUTO_FIX" == true && "$restart_count" -gt 10 ]]; then
$KUBECTL delete pod -n "$ns" "$pod" --grace-period=30 2>/dev/null && \
FIXES+=("Restarted $ns/$pod (CrashLoopBackOff, $restart_count restarts)") || \
WARNINGS+=("Failed to restart $ns/$pod")
else
ISSUES+=("CrashLoopBackOff: $ns/$pod ($restart_count restarts)")
fi
elif [[ "$status" == "ImagePullBackOff" || "$status" == "ErrImagePull" ]]; then
ISSUES+=("ImagePullBackOff: $ns/$pod")
else
ISSUES+=("Error: $ns/$pod ($status)")
fi
done <<< "$bad"
}
# --- 3. Evicted/Failed Pods ---
check_evicted() {
local evicted count
evicted=$($KUBECTL get pods -A --no-headers --field-selector=status.phase=Failed 2>/dev/null || true)
if [[ -z "$evicted" ]]; then return; fi
count=$(echo "$evicted" | wc -l | tr -d ' ')
if [[ "$AUTO_FIX" == true && "$count" -gt 0 ]]; then
$KUBECTL delete pods -A --field-selector=status.phase=Failed 2>/dev/null && \
FIXES+=("Deleted $count evicted/failed pod(s)") || \
WARNINGS+=("Failed to delete evicted pods")
else
ISSUES+=("$count evicted/failed pod(s)")
fi
}
# --- 4. Failed Deployments ---
check_deployments() {
local deps
deps=$($KUBECTL get deployments -A --no-headers 2>/dev/null) || return
while IFS= read -r line; do
local ns name ready current desired
ns=$(echo "$line" | awk '{print $1}')
name=$(echo "$line" | awk '{print $2}')
ready=$(echo "$line" | awk '{print $3}')
current=$(echo "$ready" | cut -d/ -f1)
desired=$(echo "$ready" | cut -d/ -f2)
if [[ "$current" != "$desired" ]]; then
ISSUES+=("Deployment $ns/$name: $current/$desired ready")
fi
done <<< "$deps"
}
# --- 5. Pending PVCs ---
check_pvcs() {
local pvcs
pvcs=$($KUBECTL get pvc -A --no-headers 2>/dev/null) || return
if [[ -z "$pvcs" || "$pvcs" == *"No resources found"* ]]; then return; fi
while IFS= read -r line; do
local ns name status
ns=$(echo "$line" | awk '{print $1}')
name=$(echo "$line" | awk '{print $2}')
status=$(echo "$line" | awk '{print $3}')
if [[ "$status" != "Bound" ]]; then
ISSUES+=("PVC $ns/$name: $status")
fi
done <<< "$pvcs"
}
# --- 6. Resource Pressure ---
check_resources() {
local top
top=$($KUBECTL top nodes --no-headers 2>/dev/null) || return
while IFS= read -r line; do
local node cpu_pct mem_pct
node=$(echo "$line" | awk '{print $1}')
cpu_pct=$(echo "$line" | awk '{print $3}' | tr -d '%')
mem_pct=$(echo "$line" | awk '{print $5}' | tr -d '%')
[[ "$cpu_pct" == *"unknown"* || "$mem_pct" == *"unknown"* ]] && continue
if [[ "$cpu_pct" -gt 90 || "$mem_pct" -gt 90 ]]; then
ISSUES+=("High resource usage on $node: CPU ${cpu_pct}%, Mem ${mem_pct}%")
elif [[ "$cpu_pct" -gt 80 || "$mem_pct" -gt 80 ]]; then
WARNINGS+=("Elevated resource usage on $node: CPU ${cpu_pct}%, Mem ${mem_pct}%")
fi
done <<< "$top"
}
# --- 7. CronJob Failures ---
check_cronjobs() {
local failures
failures=$($KUBECTL get jobs -A -o json 2>/dev/null | python3 -c '
import json, sys
from datetime import datetime, timezone, timedelta
data = json.load(sys.stdin)
cutoff = datetime.now(timezone.utc) - timedelta(hours=24)
for job in data.get("items", []):
    meta = job.get("metadata", {})
    ns = meta.get("namespace", "")
    name = meta.get("name", "")
    owners = meta.get("ownerReferences", [])
    # Only consider Jobs that were created by a CronJob
    if not any(o.get("kind") == "CronJob" for o in owners):
        continue
    for c in job.get("status", {}).get("conditions", []):
        if c.get("type") == "Failed" and c.get("status") == "True":
            ts = c.get("lastTransitionTime", "")
            if ts:
                try:
                    t = datetime.fromisoformat(ts.replace("Z", "+00:00"))
                    if t > cutoff:
                        print(f"{ns}/{name}")
                except ValueError:
                    # Unparseable timestamp: report the failure rather than hide it
                    print(f"{ns}/{name}")
' 2>/dev/null) || true
if [[ -n "$failures" ]]; then
local count
count=$(echo "$failures" | wc -l | tr -d ' ')
ISSUES+=("$count CronJob failure(s) in last 24h")
fi
}
# --- 8. DaemonSet Health ---
check_daemonsets() {
local ds
ds=$($KUBECTL get daemonsets -A --no-headers 2>/dev/null) || return
while IFS= read -r line; do
local ns name desired ready
ns=$(echo "$line" | awk '{print $1}')
name=$(echo "$line" | awk '{print $2}')
desired=$(echo "$line" | awk '{print $3}')
ready=$(echo "$line" | awk '{print $5}')
if [[ "$desired" != "$ready" ]]; then
ISSUES+=("DaemonSet $ns/$name: desired=$desired ready=$ready")
fi
done <<< "$ds"
}
# --- Cluster summary stats ---
get_summary_stats() {
local node_count ready_count pod_count
node_count=$($KUBECTL get nodes --no-headers 2>/dev/null | wc -l | tr -d ' ')
ready_count=$($KUBECTL get nodes --no-headers 2>/dev/null | awk '$2 == "Ready"' | wc -l | tr -d ' ')
pod_count=$($KUBECTL get pods -A --no-headers --field-selector=status.phase=Running 2>/dev/null | wc -l | tr -d ' ')
echo "${ready_count}/${node_count} nodes | ${pod_count} pods running"
}
# --- Send Slack message ---
send_slack() {
# Default to empty so an unset variable does not abort the script under `set -u`
local webhook_url="${SLACK_WEBHOOK_URL:-}"
if [[ -z "$webhook_url" ]]; then
echo "WARNING: SLACK_WEBHOOK_URL not set, skipping Slack notification"
return
fi
local summary issue_count fix_count warning_count
summary=$(get_summary_stats)
issue_count=${#ISSUES[@]}
fix_count=${#FIXES[@]}
warning_count=${#WARNINGS[@]}
local text=""
local total_problems=$((issue_count + warning_count))
if [[ "$total_problems" -eq 0 && "$fix_count" -eq 0 ]]; then
text=":white_check_mark: *Cluster Health Check — All Clear*\n${summary} | 0 issues"
else
if [[ "$issue_count" -gt 0 ]]; then
text=":rotating_light: *Cluster Health Check — ${issue_count} Issue(s) Found*\n${summary}"
elif [[ "$warning_count" -gt 0 ]]; then
text=":warning: *Cluster Health Check — ${warning_count} Warning(s)*\n${summary}"
else
text=":white_check_mark: *Cluster Health Check — All Clear (auto-fixed ${fix_count})*\n${summary}"
fi
if [[ "$fix_count" -gt 0 ]]; then
text+="\n\n*Auto-fixed:*"
for fix in "${FIXES[@]}"; do
text+="\n• ${fix}"
done
fi
if [[ "$issue_count" -gt 0 ]]; then
text+="\n\n*Needs attention:*"
for issue in "${ISSUES[@]}"; do
text+="\n• ${issue}"
done
fi
if [[ "$warning_count" -gt 0 ]]; then
text+="\n\n*Warnings:*"
for warning in "${WARNINGS[@]}"; do
text+="\n• ${warning}"
done
fi
fi
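# -f makes curl fail on HTTP errors; the trailing `||` keeps a failed post
# from aborting the script under `set -e`
curl -sf -X POST "$webhook_url" \
-H 'Content-Type: application/json' \
-d "{\"text\": \"${text}\"}" > /dev/null 2>&1 \
|| echo "WARNING: failed to post to Slack"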
}
# --- Main ---
main() {
echo "=== Cluster Health Check — $(date '+%Y-%m-%d %H:%M:%S') ==="
check_nodes
check_pods
check_evicted
check_deployments
check_pvcs
check_resources
check_cronjobs
check_daemonsets
local issue_count=${#ISSUES[@]}
local fix_count=${#FIXES[@]}
local warning_count=${#WARNINGS[@]}
echo ""
echo "Results: ${issue_count} issue(s), ${fix_count} fix(es), ${warning_count} warning(s)"
if [[ "$fix_count" -gt 0 ]]; then
echo ""
echo "Auto-fixed:"
for fix in "${FIXES[@]}"; do echo " - $fix"; done
fi
if [[ "$issue_count" -gt 0 ]]; then
echo ""
echo "Issues:"
for issue in "${ISSUES[@]}"; do echo " - $issue"; done
fi
if [[ "$warning_count" -gt 0 ]]; then
echo ""
echo "Warnings:"
for warning in "${WARNINGS[@]}"; do echo " - $warning"; done
fi
if [[ "$SEND_SLACK" == true ]]; then
send_slack
echo ""
echo "Slack notification sent."
fi
# Exit code
if [[ "$issue_count" -gt 0 ]]; then
exit 1
fi
exit 0
}
main "$@"
```
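`send_slack` builds the JSON payload by interpolating `${text}` directly, which holds as long as no issue string contains a double quote or a backslash. A minimal sketch of escaping each item before it is appended to the message, assuming two hypothetical helpers (`json_escape` and `slack_payload`) that are not part of the script above:

```shell
# Hypothetical helper: escape backslashes and double quotes so a string
# can be embedded in a JSON string literal. Control characters such as
# tabs or raw newlines are NOT handled in this sketch.
json_escape() {
  printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g'
}

# Hypothetical helper: wrap one escaped message in the {"text": ...} shape
# that the script posts to the webhook.
slack_payload() {
  printf '{"text": "%s"}\n' "$(json_escape "$1")"
}

slack_payload 'Error: default/web ("bad" status)'
```

In the script, the escaping would apply to each issue, fix, and warning individually; the literal `\n` separators that `send_slack` adds between items must stay unescaped so that Slack renders them as line breaks.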
**Step 2: Make it executable**
```bash
chmod +x .claude/cluster-health.sh
```
**Step 3: Test locally (dry run)**
```bash
KUBECONFIG=$(pwd)/config SLACK_WEBHOOK_URL="" bash .claude/cluster-health.sh --no-slack
```
Expected: Script runs, prints check results, no Slack post.
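The script body in Step 1 exits 0 when no issues remain and 1 when at least one issue was recorded; any other code would come from the shell or `kubectl` failing before the checks finish. A small sketch of how a caller might interpret the result, with `describe_health_exit` as a hypothetical helper:

```shell
# Hypothetical helper: translate the exit code of cluster-health.sh into a
# short label. Codes above 1 are treated as "the script itself failed",
# which is an assumption based on the script body shown in this plan.
describe_health_exit() {
  case "$1" in
    0) echo "healthy" ;;
    1) echo "issues found" ;;
    *) echo "script failed to run (exit $1)" ;;
  esac
}

# Example: after a dry run, pass "$?" to the helper.
#   bash .claude/cluster-health.sh --no-slack --no-fix; describe_health_exit "$?"
describe_health_exit 1
```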
**Step 4: Commit**
```bash
git add .claude/cluster-health.sh
git commit -m "[ci skip] Add cluster health check script for OpenClaw agent"
```
---
### Task 3: Create the cluster-health skill
**Files:**
- Create: `.claude/skills/cluster-health/SKILL.md`
**Step 1: Write the skill document**
```markdown
---
name: cluster-health
description: |
Check Kubernetes cluster health and fix common issues. Use when:
(1) User asks to check the cluster, check health, or "what's wrong",
(2) User asks about pod status, node health, or deployment issues,
(3) User asks to fix stuck pods, evicted pods, or CrashLoopBackOff,
(4) User mentions "health check", "cluster status", "cluster health",
(5) User asks "is everything running" or "any problems".
Runs 8 standard K8s health checks with safe auto-fix for evicted pods
and stuck CrashLoopBackOff pods.
author: Claude Code
version: 1.0.0
date: 2026-02-21
---
# Cluster Health Check
## Overview
- **Script**: `/workspace/infra/.claude/cluster-health.sh`
- **Schedule**: CronJob runs every 30 minutes, execs into this pod
- **Slack**: Posts results to `$SLACK_WEBHOOK_URL`
- **Auto-fix**: Deletes evicted pods, restarts CrashLoopBackOff pods (>10 restarts)
## Quick Check
Run the health check script:
```bash
bash /workspace/infra/.claude/cluster-health.sh --no-slack
```
Or with Slack notification:
```bash
bash /workspace/infra/.claude/cluster-health.sh
```
Report-only (no auto-fix):
```bash
bash /workspace/infra/.claude/cluster-health.sh --no-fix
```
## What It Checks
| # | Check | Auto-Fix | Alert |
|---|-------|----------|-------|
| 1 | Node health (NotReady, conditions) | No | Yes |
| 2 | Pod health (CrashLoopBackOff, ImagePullBackOff, Error) | Restart if >10 restarts | Yes |
| 3 | Evicted/failed pods | Delete all | Yes |
| 4 | Deployment availability (current != desired) | No | Yes |
| 5 | PVC status (not Bound) | No | Yes |
| 6 | Resource pressure (CPU/Mem >80% warning, >90% issue) | No | Yes |
| 7 | CronJob failures (last 24h) | No | Yes |
| 8 | DaemonSet health (desired != ready) | No | Yes |
## Safe Auto-Fix Rules
These are the ONLY things the script auto-fixes:
1. **Evicted/failed pods**: `kubectl delete pods -A --field-selector=status.phase=Failed`
2. **CrashLoopBackOff pods with >10 restarts**: `kubectl delete pod -n <ns> <pod> --grace-period=30`
Everything else is alert-only. NEVER auto-fix:
- Node NotReady (could be maintenance)
- ImagePullBackOff (needs image tag or registry fix)
- Pending PVCs (needs storage investigation)
- Failed deployments (needs config investigation)
## Deep Investigation
When the script reports issues and the user asks for more detail, use these commands:
### Node issues
```bash
kubectl describe node <node-name>
kubectl top node <node-name>
kubectl get events --field-selector involvedObject.name=<node-name>
```
### Pod issues
```bash
kubectl describe pod -n <namespace> <pod-name>
kubectl logs -n <namespace> <pod-name> --tail=100
kubectl logs -n <namespace> <pod-name> --previous --tail=100
kubectl get events -n <namespace> --field-selector involvedObject.name=<pod-name>
```
### Deployment issues
```bash
kubectl describe deployment -n <namespace> <deployment-name>
kubectl rollout status deployment -n <namespace> <deployment-name>
kubectl rollout history deployment -n <namespace> <deployment-name>
```
### PVC issues
```bash
kubectl describe pvc -n <namespace> <pvc-name>
kubectl get pv
kubectl get events -n <namespace> --field-selector involvedObject.name=<pvc-name>
```
### Resource pressure
```bash
kubectl top nodes
kubectl top pods -A --sort-by=memory | head -20
kubectl top pods -A --sort-by=cpu | head -20
```
## Common Remediation
### CrashLoopBackOff (persistent)
1. Check logs: `kubectl logs -n <ns> <pod> --previous --tail=100`
2. Check events: `kubectl describe pod -n <ns> <pod>`
3. Common causes: OOMKilled (increase memory limit), bad config, missing env var
4. If image issue: check if newer image exists, update in Terraform
### OOMKilled
1. Check current limits: `kubectl describe pod -n <ns> <pod> | grep -A2 Limits`
2. Fix: Update resource limits in Terraform module for the service
3. Apply: `terraform apply -target=module.kubernetes_cluster.module.<service> -var="kube_config_path=$(pwd)/config"`
### ImagePullBackOff
1. Check image: `kubectl describe pod -n <ns> <pod> | grep Image`
2. Check registry: Is the image tag valid? Is the registry reachable?
3. Check pull-through cache: Docker registry at 10.0.20.10
### Node NotReady
1. Check kubelet: SSH to node, `systemctl status kubelet`
2. Check resources: `kubectl top node <node>`
3. Check conditions: `kubectl describe node <node> | grep -A10 Conditions`
## Slack Webhook
Messages are posted to the webhook at `$SLACK_WEBHOOK_URL`. Format:
- All clear: green check + summary stats
- Issues found: red siren + list of issues + auto-fix actions taken
- Warnings only: yellow warning + elevated metrics
## Infrastructure
- **Terraform module**: `modules/kubernetes/openclaw/main.tf`
- **CronJob**: Runs in `openclaw` namespace every 30 min
- **Existing healthcheck**: `scripts/cluster_healthcheck.sh` (local-only, not for OpenClaw)
- **Repo path inside pod**: `/workspace/infra/`
```
**Step 2: Commit**
```bash
git add .claude/skills/cluster-health/SKILL.md
git commit -m "[ci skip] Add cluster-health skill for OpenClaw agent"
```
---
### Task 4: Add CronJob and RBAC to Terraform
**Files:**
- Modify: `modules/kubernetes/openclaw/main.tf` (append CronJob + ServiceAccount + Role + RoleBinding)
**Step 1: Add CronJob resources**
Append the following to `modules/kubernetes/openclaw/main.tf` after the `module "ingress"` block:
```hcl
# --- CronJob: Scheduled cluster health check ---
resource "kubernetes_service_account" "healthcheck" {
metadata {
name = "cluster-healthcheck"
namespace = kubernetes_namespace.openclaw.metadata[0].name
}
}
resource "kubernetes_role" "healthcheck_exec" {
metadata {
name = "healthcheck-pod-exec"
namespace = kubernetes_namespace.openclaw.metadata[0].name
}
rule {
api_groups = [""]
resources = ["pods"]
verbs = ["get", "list"]
}
rule {
api_groups = [""]
resources = ["pods/exec"]
verbs = ["create"]
}
}
resource "kubernetes_role_binding" "healthcheck_exec" {
metadata {
name = "healthcheck-pod-exec"
namespace = kubernetes_namespace.openclaw.metadata[0].name
}
subject {
kind = "ServiceAccount"
name = kubernetes_service_account.healthcheck.metadata[0].name
namespace = kubernetes_namespace.openclaw.metadata[0].name
}
role_ref {
api_group = "rbac.authorization.k8s.io"
kind = "Role"
name = kubernetes_role.healthcheck_exec.metadata[0].name
}
}
resource "kubernetes_cron_job_v1" "cluster_healthcheck" {
metadata {
name = "cluster-healthcheck"
namespace = kubernetes_namespace.openclaw.metadata[0].name
labels = {
app = "cluster-healthcheck"
tier = var.tier
}
}
spec {
schedule = "*/30 * * * *"
concurrency_policy = "Forbid"
failed_jobs_history_limit = 3
successful_jobs_history_limit = 3
job_template {
metadata {
labels = {
app = "cluster-healthcheck"
}
}
spec {
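active_deadline_seconds = 300
# Do not retry a failed run; each retry would repeat the checks and the Slack post
backoff_limit = 0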
template {
metadata {
labels = {
app = "cluster-healthcheck"
}
}
spec {
service_account_name = kubernetes_service_account.healthcheck.metadata[0].name
restart_policy = "Never"
container {
name = "healthcheck"
image = "bitnami/kubectl:1.34"
command = ["bash", "-c", <<-EOF
# Find the openclaw pod
POD=$(kubectl get pods -n openclaw -l app=openclaw -o jsonpath='{.items[0].metadata.name}' 2>/dev/null)
if [ -z "$POD" ]; then
echo "ERROR: OpenClaw pod not found"
exit 1
fi
echo "Executing health check in pod $POD..."
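# cluster-health.sh exits 1 when it finds issues. Those are already reported to Slack,
# so only fail the Job when the script itself could not run (exit code above 1).
# Otherwise the failed Job would be counted by the CronJob-failure check on the next run.
rc=0
kubectl exec -n openclaw "$POD" -c openclaw -- bash /workspace/infra/.claude/cluster-health.sh || rc=$?
if [ "$rc" -gt 1 ]; then exit "$rc"; fi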
EOF
]
resources {
requests = {
cpu = "50m"
memory = "64Mi"
}
limits = {
memory = "128Mi"
}
}
}
}
}
}
}
}
}
```
**Step 2: Verify Terraform formatting**
```bash
terraform fmt modules/kubernetes/openclaw/main.tf
```
**Step 3: Verify Terraform plan**
```bash
terraform plan -target=module.kubernetes_cluster.module.openclaw -var="kube_config_path=$(pwd)/config"
```
Expected: Plan shows 4 new resources (ServiceAccount, Role, RoleBinding, CronJobV1). No destructive changes to existing resources.
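The expectation above can be checked mechanically against the summary line that `terraform plan` prints. The sketch below assumes the plan was run with `-no-color` so the line is free of escape codes; `plan_is_safe` is a hypothetical helper, not part of the repository.

```shell
# Hypothetical helper: read `terraform plan -no-color` output on stdin and
# succeed only when the summary reports the expected number of additions
# and zero resources to destroy. Returns 2 when no summary line is found.
plan_is_safe() {
  local expected_add="$1" summary add destroy
  summary=$(grep -E '^Plan: [0-9]+ to add, [0-9]+ to change, [0-9]+ to destroy\.' | tail -n 1)
  [ -n "$summary" ] || return 2
  add=$(printf '%s\n' "$summary" | sed -E 's/^Plan: ([0-9]+) to add.*/\1/')
  destroy=$(printf '%s\n' "$summary" | sed -E 's/.* ([0-9]+) to destroy\..*/\1/')
  [ "$add" -eq "$expected_add" ] && [ "$destroy" -eq 0 ]
}

# Example with a canned summary line instead of a real plan:
if printf 'Plan: 4 to add, 1 to change, 0 to destroy.\n' | plan_is_safe 4; then
  echo "plan matches expectation"
fi
```

A non-zero "to change" count is tolerated on purpose, since the `SLACK_WEBHOOK_URL` env var from Task 1 modifies the existing deployment in place if it has not been applied yet.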
**Step 4: Commit**
```bash
git add modules/kubernetes/openclaw/main.tf
git commit -m "[ci skip] Add cluster health check CronJob to OpenClaw module"
```
---
### Task 5: Deploy and verify
**Step 1: Apply Terraform**
```bash
terraform apply -target=module.kubernetes_cluster.module.openclaw -var="kube_config_path=$(pwd)/config" -auto-approve
```
**Step 2: Verify CronJob exists**
```bash
kubectl --kubeconfig $(pwd)/config get cronjob -n openclaw
```
Expected: `cluster-healthcheck` with schedule `*/30 * * * *`
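The schedule `*/30 * * * *` uses a step value in the minute field, so the job fires at minute 0 and minute 30 of every hour. A small sketch that expands such a step, with `minutes_for_step` as a hypothetical helper for illustration only:

```shell
# Hypothetical helper: list the minutes of the hour matched by a
# "*/STEP" minute field in a cron expression.
minutes_for_step() {
  local step="$1" minute=0 out=""
  [ "$step" -gt 0 ] || return 1
  while [ "$minute" -lt 60 ]; do
    out="${out:+$out }$minute"
    minute=$((minute + step))
  done
  printf '%s\n' "$out"
}

minutes_for_step 30
```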
**Step 3: Verify RBAC**
```bash
kubectl --kubeconfig $(pwd)/config get serviceaccount,role,rolebinding -n openclaw
```
Expected: `cluster-healthcheck` SA, `healthcheck-pod-exec` role and rolebinding
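Beyond listing the objects, the binding can be exercised by impersonating the ServiceAccount with `kubectl auth can-i`. The sketch below assumes the names from Task 4 and that the current kubeconfig user is allowed to impersonate; `sa_identity` and `check_exec_permission` are hypothetical helpers, and the second one is defined but not invoked because it needs cluster access.

```shell
# Hypothetical helper: build the user name Kubernetes assigns to a
# ServiceAccount, in the form system:serviceaccount:<namespace>:<name>.
sa_identity() {
  printf 'system:serviceaccount:%s:%s\n' "$1" "$2"
}

# Hypothetical helper (requires cluster access, so it is not called here):
# asks the API server whether the healthcheck ServiceAccount may create
# the pods/exec subresource in the openclaw namespace.
check_exec_permission() {
  kubectl --kubeconfig "$(pwd)/config" auth can-i create pods \
    --subresource=exec -n openclaw \
    --as "$(sa_identity openclaw cluster-healthcheck)"
}

sa_identity openclaw cluster-healthcheck
```

Running `check_exec_permission` against the cluster should print `yes` if the Role and RoleBinding were applied as written.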
**Step 4: Trigger a manual run**
```bash
kubectl --kubeconfig $(pwd)/config create job --from=cronjob/cluster-healthcheck healthcheck-manual-test -n openclaw
```
**Step 5: Check job output**
```bash
kubectl --kubeconfig $(pwd)/config wait --for=condition=complete job/healthcheck-manual-test -n openclaw --timeout=120s
kubectl --kubeconfig $(pwd)/config logs job/healthcheck-manual-test -n openclaw
```
Expected: Health check output with results. If `SLACK_WEBHOOK_URL` is set, check Slack for the message.
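The script prints a single `Results:` line, which makes the job log easy to check from a script. A minimal sketch that pulls the issue count out of the log text, with `issue_count_from_log` as a hypothetical helper:

```shell
# Hypothetical helper: read health check output on stdin and print the
# issue count from the last "Results:" line, or nothing if none is found.
issue_count_from_log() {
  sed -n -E 's/^Results: ([0-9]+) issue\(s\), [0-9]+ fix\(es\), [0-9]+ warning\(s\)$/\1/p' \
    | tail -n 1
}

# Example with canned output instead of real job logs:
printf '%s\n' \
  '=== Cluster Health Check ===' \
  'Results: 2 issue(s), 1 fix(es), 0 warning(s)' \
  | issue_count_from_log
```

Against the real job, the input would come from the `kubectl logs` command in this step.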
**Step 6: Clean up test job**
```bash
kubectl --kubeconfig $(pwd)/config delete job healthcheck-manual-test -n openclaw
```
**Step 7: Final commit**
```bash
git add -A modules/kubernetes/openclaw/ .claude/skills/cluster-health/ .claude/cluster-health.sh terraform.tfvars
git commit -m "[ci skip] OpenClaw cluster health agent: script + skill + CronJob"
```