OpenClaw Cluster Management Agent — Implementation Plan
For Claude: REQUIRED SUB-SKILL: Use superpowers:executing-plans to implement this plan task-by-task.
Goal: Build a proactive cluster health agent — a skill that teaches OpenClaw to check the cluster, a helper script that runs the checks and posts to Slack, and a CronJob that triggers it every 30 minutes via kubectl exec.
Architecture: CronJob (bitnami/kubectl) -> kubectl exec into OpenClaw pod -> runs cluster-health.sh which performs 8 health checks, auto-fixes safe issues, and posts a summary to Slack. The same script is available as an OpenClaw skill for interactive use.
Tech Stack: Bash (health check script), Terraform/HCL (CronJob + RBAC), Slack webhook API, kubectl
Task 1: Add Slack webhook to openclaw_skill_secrets
Files:
- Modify: terraform.tfvars:1291-1295 (add slack_webhook key)
- Modify: modules/kubernetes/openclaw/main.tf:350-376 (add SLACK_WEBHOOK_URL env var)
Step 1: Add slack_webhook to openclaw_skill_secrets in tfvars
Add a new key slack_webhook to the existing openclaw_skill_secrets map. The user must provide the webhook URL. For now, use the existing alertmanager_slack_api_url value or a dedicated one.
In terraform.tfvars, change:
openclaw_skill_secrets = {
home_assistant_token = "..."
home_assistant_sofia_token = "..."
uptime_kuma_password = "..."
}
to:
openclaw_skill_secrets = {
home_assistant_token = "..."
home_assistant_sofia_token = "..."
uptime_kuma_password = "..."
slack_webhook = "https://hooks.slack.com/services/YOUR/WEBHOOK/URL"
}
NOTE: Ask the user which Slack webhook URL to use. Candidates:
- alertmanager_slack_api_url (line 4 in tfvars)
- tiny_tuya_slack_url (line 1213, comment says "K8s bot slack")
- A new webhook the user creates
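Whichever candidate is chosen, a quick shape check before pasting it into tfvars can catch a truncated or mistyped URL. The sketch below is only an illustration: the function name `is_slack_webhook` is hypothetical, and the pattern simply mirrors the `https://hooks.slack.com/services/...` placeholder used above, so a webhook hosted elsewhere would need a different pattern.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of the plan's script): returns 0 when the
# argument looks like https://hooks.slack.com/services/<a>/<b>/<c>.
is_slack_webhook() {
  local pattern='^https://hooks\.slack\.com/services/[^/]+/[^/]+/[^/]+$'
  [[ "$1" =~ $pattern ]]
}

# Demo with the placeholder URL from the tfvars example.
if is_slack_webhook "https://hooks.slack.com/services/YOUR/WEBHOOK/URL"; then
  echo "looks like a Slack webhook"
fi
```

The check is purely syntactic; it says nothing about whether the webhook is still active.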
Step 2: Add SLACK_WEBHOOK_URL env var to OpenClaw container
In modules/kubernetes/openclaw/main.tf, add after the UPTIME_KUMA_PASSWORD env block (around line 370):
# Skill secrets - Slack
env {
name = "SLACK_WEBHOOK_URL"
value = var.skill_secrets["slack_webhook"]
}
Step 3: Commit
git add modules/kubernetes/openclaw/main.tf
git commit -m "[ci skip] Add Slack webhook env var to OpenClaw deployment"
Do NOT commit terraform.tfvars separately — it will be committed with the full set of changes at the end.
Task 2: Create the cluster-health.sh helper script
Files:
- Create: .claude/cluster-health.sh
Step 1: Write the health check script
Create .claude/cluster-health.sh with the following structure. The script:
- Uses $KUBECONFIG (already set in OpenClaw pod) or falls back to in-cluster config
- Runs 8 checks: nodes, pods, evicted, deployments, PVCs, resources, CronJobs, DaemonSets
- Auto-fixes: deletes evicted pods, restarts CrashLoopBackOff pods with more than 10 restarts
- Posts structured Slack message via $SLACK_WEBHOOK_URL
- Exit code 0 = healthy (warnings and auto-fixes only), 1 = issues found
#!/usr/bin/env bash
# Cluster health check script for OpenClaw.
# Runs health checks, auto-fixes safe issues, posts to Slack.
# Designed to run inside the OpenClaw pod (has kubectl via $KUBECONFIG).
#
# Usage: ./cluster-health.sh [--no-slack] [--no-fix]
# --no-slack Skip Slack notification (useful for interactive/debug runs)
# --no-fix Skip auto-fix actions (report only)
set -euo pipefail
SEND_SLACK=true
AUTO_FIX=true
ISSUES=()
FIXES=()
WARNINGS=()
# --- Argument parsing ---
for arg in "$@"; do
case "$arg" in
--no-slack) SEND_SLACK=false ;;
--no-fix) AUTO_FIX=false ;;
esac
done
KUBECTL="kubectl"
# --- 1. Node Health ---
check_nodes() {
  local nodes not_ready
  nodes=$($KUBECTL get nodes --no-headers 2>&1) || { ISSUES+=("Cannot reach cluster API"); return; }
  not_ready=$(echo "$nodes" | awk '$2 != "Ready" {print $1}' || true)
  if [[ -n "$not_ready" ]]; then
    while IFS= read -r node; do
      ISSUES+=("Node NotReady: $node")
    done <<< "$not_ready"
  fi
  # Check pressure conditions (the Python below must keep its indentation)
  local conditions
  conditions=$($KUBECTL get nodes -o json | python3 -c '
import json, sys
data = json.load(sys.stdin)
for node in data["items"]:
    name = node["metadata"]["name"]
    for c in node["status"]["conditions"]:
        if c["type"] in ("MemoryPressure","DiskPressure","PIDPressure") and c["status"] == "True":
            print(name + ": " + c["type"])
' 2>/dev/null) || true
  if [[ -n "$conditions" ]]; then
    while IFS= read -r line; do
      ISSUES+=("$line")
    done <<< "$conditions"
  fi
}
# --- 2. Pod Health ---
check_pods() {
  local bad
  bad=$( {
    $KUBECTL get pods -A --no-headers 2>/dev/null \
      | grep -E 'CrashLoopBackOff|ImagePullBackOff|ErrImagePull|Error' || true
  } | awk '!seen[$1,$2]++' | sed '/^$/d') || true
  if [[ -z "$bad" ]]; then return; fi
  while IFS= read -r line; do
    local ns pod status
    # Columns: NAMESPACE NAME READY STATUS RESTARTS AGE
    ns=$(echo "$line" | awk '{print $1}')
    pod=$(echo "$line" | awk '{print $2}')
    status=$(echo "$line" | awk '{print $4}')
    if [[ "$status" == "CrashLoopBackOff" ]]; then
      # Treat a pod as stuck once it has restarted more than 10 times
      local restart_count
      restart_count=$(echo "$line" | awk '{print $5}')
      if [[ "$AUTO_FIX" == true && "$restart_count" -gt 10 ]]; then
        $KUBECTL delete pod -n "$ns" "$pod" --grace-period=30 2>/dev/null && \
          FIXES+=("Restarted $ns/$pod (CrashLoopBackOff, $restart_count restarts)") || \
          WARNINGS+=("Failed to restart $ns/$pod")
      else
        ISSUES+=("CrashLoopBackOff: $ns/$pod ($restart_count restarts)")
      fi
    elif [[ "$status" == "ImagePullBackOff" || "$status" == "ErrImagePull" ]]; then
      ISSUES+=("ImagePullBackOff: $ns/$pod")
    else
      ISSUES+=("Error: $ns/$pod ($status)")
    fi
  done <<< "$bad"
}
# --- 3. Evicted/Failed Pods ---
check_evicted() {
local evicted count
evicted=$($KUBECTL get pods -A --no-headers --field-selector=status.phase=Failed 2>/dev/null || true)
if [[ -z "$evicted" ]]; then return; fi
count=$(echo "$evicted" | wc -l | tr -d ' ')
if [[ "$AUTO_FIX" == true && "$count" -gt 0 ]]; then
$KUBECTL delete pods -A --field-selector=status.phase=Failed 2>/dev/null && \
FIXES+=("Deleted $count evicted/failed pod(s)") || \
WARNINGS+=("Failed to delete evicted pods")
else
ISSUES+=("$count evicted/failed pod(s)")
fi
}
# --- 4. Failed Deployments ---
check_deployments() {
local deps
deps=$($KUBECTL get deployments -A --no-headers 2>/dev/null) || return
while IFS= read -r line; do
local ns name ready current desired
ns=$(echo "$line" | awk '{print $1}')
name=$(echo "$line" | awk '{print $2}')
ready=$(echo "$line" | awk '{print $3}')
current=$(echo "$ready" | cut -d/ -f1)
desired=$(echo "$ready" | cut -d/ -f2)
if [[ "$current" != "$desired" ]]; then
ISSUES+=("Deployment $ns/$name: $current/$desired ready")
fi
done <<< "$deps"
}
# --- 5. Pending PVCs ---
check_pvcs() {
local pvcs
pvcs=$($KUBECTL get pvc -A --no-headers 2>/dev/null) || return
if [[ -z "$pvcs" || "$pvcs" == *"No resources found"* ]]; then return; fi
while IFS= read -r line; do
local ns name status
ns=$(echo "$line" | awk '{print $1}')
name=$(echo "$line" | awk '{print $2}')
status=$(echo "$line" | awk '{print $3}')
if [[ "$status" != "Bound" ]]; then
ISSUES+=("PVC $ns/$name: $status")
fi
done <<< "$pvcs"
}
# --- 6. Resource Pressure ---
check_resources() {
local top
top=$($KUBECTL top nodes --no-headers 2>/dev/null) || return
while IFS= read -r line; do
local node cpu_pct mem_pct
node=$(echo "$line" | awk '{print $1}')
cpu_pct=$(echo "$line" | awk '{print $3}' | tr -d '%')
mem_pct=$(echo "$line" | awk '{print $5}' | tr -d '%')
[[ "$cpu_pct" == *"unknown"* || "$mem_pct" == *"unknown"* ]] && continue
if [[ "$cpu_pct" -gt 90 || "$mem_pct" -gt 90 ]]; then
ISSUES+=("High resource usage on $node: CPU ${cpu_pct}%, Mem ${mem_pct}%")
elif [[ "$cpu_pct" -gt 80 || "$mem_pct" -gt 80 ]]; then
WARNINGS+=("Elevated resource usage on $node: CPU ${cpu_pct}%, Mem ${mem_pct}%")
fi
done <<< "$top"
}
# --- 7. CronJob Failures ---
check_cronjobs() {
  local failures
  # Lists Jobs owned by a CronJob whose Failed condition is less than 24h old.
  # The Python below must keep its indentation.
  failures=$($KUBECTL get jobs -A -o json 2>/dev/null | python3 -c '
import json, sys
from datetime import datetime, timezone, timedelta
data = json.load(sys.stdin)
cutoff = datetime.now(timezone.utc) - timedelta(hours=24)
for job in data.get("items", []):
    meta = job.get("metadata", {})
    ns = meta.get("namespace", "")
    name = meta.get("name", "")
    owners = meta.get("ownerReferences", [])
    if not any(o.get("kind") == "CronJob" for o in owners):
        continue
    for c in job.get("status", {}).get("conditions", []):
        if c.get("type") == "Failed" and c.get("status") == "True":
            ts = c.get("lastTransitionTime", "")
            if ts:
                try:
                    t = datetime.fromisoformat(ts.replace("Z", "+00:00"))
                    if t > cutoff:
                        print(f"{ns}/{name}")
                except ValueError:
                    # Unparseable timestamp: report the job rather than hide it
                    print(f"{ns}/{name}")
' 2>/dev/null) || true
  if [[ -n "$failures" ]]; then
    local count
    count=$(echo "$failures" | wc -l | tr -d ' ')
    ISSUES+=("$count CronJob failure(s) in last 24h")
  fi
}
# --- 8. DaemonSet Health ---
check_daemonsets() {
local ds
ds=$($KUBECTL get daemonsets -A --no-headers 2>/dev/null) || return
while IFS= read -r line; do
local ns name desired ready
ns=$(echo "$line" | awk '{print $1}')
name=$(echo "$line" | awk '{print $2}')
desired=$(echo "$line" | awk '{print $3}')
ready=$(echo "$line" | awk '{print $5}')
if [[ "$desired" != "$ready" ]]; then
ISSUES+=("DaemonSet $ns/$name: desired=$desired ready=$ready")
fi
done <<< "$ds"
}
# --- Cluster summary stats ---
get_summary_stats() {
local node_count ready_count pod_count
node_count=$($KUBECTL get nodes --no-headers 2>/dev/null | wc -l | tr -d ' ')
ready_count=$($KUBECTL get nodes --no-headers 2>/dev/null | awk '$2 == "Ready"' | wc -l | tr -d ' ')
pod_count=$($KUBECTL get pods -A --no-headers --field-selector=status.phase=Running 2>/dev/null | wc -l | tr -d ' ')
echo "${ready_count}/${node_count} nodes | ${pod_count} pods running"
}
# --- Send Slack message ---
send_slack() {
  # Default to empty so that `set -u` does not abort when the variable is unset
  local webhook_url="${SLACK_WEBHOOK_URL:-}"
  if [[ -z "$webhook_url" ]]; then
    echo "WARNING: SLACK_WEBHOOK_URL not set, skipping Slack notification"
    return
  fi
local summary issue_count fix_count warning_count
summary=$(get_summary_stats)
issue_count=${#ISSUES[@]}
fix_count=${#FIXES[@]}
warning_count=${#WARNINGS[@]}
local text=""
local total_problems=$((issue_count + warning_count))
if [[ "$total_problems" -eq 0 && "$fix_count" -eq 0 ]]; then
text=":white_check_mark: *Cluster Health Check — All Clear*\n${summary} | 0 issues"
else
if [[ "$issue_count" -gt 0 ]]; then
text=":rotating_light: *Cluster Health Check — ${issue_count} Issue(s) Found*\n${summary}"
elif [[ "$warning_count" -gt 0 ]]; then
text=":warning: *Cluster Health Check — ${warning_count} Warning(s)*\n${summary}"
else
text=":white_check_mark: *Cluster Health Check — All Clear (auto-fixed ${fix_count})*\n${summary}"
fi
if [[ "$fix_count" -gt 0 ]]; then
text+="\n\n*Auto-fixed:*"
for fix in "${FIXES[@]}"; do
text+="\n• ${fix}"
done
fi
if [[ "$issue_count" -gt 0 ]]; then
text+="\n\n*Needs attention:*"
for issue in "${ISSUES[@]}"; do
text+="\n• ${issue}"
done
fi
if [[ "$warning_count" -gt 0 ]]; then
text+="\n\n*Warnings:*"
for warning in "${WARNINGS[@]}"; do
text+="\n• ${warning}"
done
fi
fi
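  # A failed post must not abort the script under `set -e`; warn and carry on
  curl -s -X POST "$webhook_url" \
    -H 'Content-Type: application/json' \
    -d "{\"text\": \"${text}\"}" > /dev/null 2>&1 \
    || echo "WARNING: failed to post to Slack" >&2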
}
# --- Main ---
main() {
echo "=== Cluster Health Check — $(date '+%Y-%m-%d %H:%M:%S') ==="
check_nodes
check_pods
check_evicted
check_deployments
check_pvcs
check_resources
check_cronjobs
check_daemonsets
local issue_count=${#ISSUES[@]}
local fix_count=${#FIXES[@]}
local warning_count=${#WARNINGS[@]}
echo ""
echo "Results: ${issue_count} issue(s), ${fix_count} fix(es), ${warning_count} warning(s)"
if [[ "$fix_count" -gt 0 ]]; then
echo ""
echo "Auto-fixed:"
for fix in "${FIXES[@]}"; do echo " - $fix"; done
fi
if [[ "$issue_count" -gt 0 ]]; then
echo ""
echo "Issues:"
for issue in "${ISSUES[@]}"; do echo " - $issue"; done
fi
if [[ "$warning_count" -gt 0 ]]; then
echo ""
echo "Warnings:"
for warning in "${WARNINGS[@]}"; do echo " - $warning"; done
fi
if [[ "$SEND_SLACK" == true ]]; then
send_slack
echo ""
echo "Slack notification sent."
fi
# Exit code
if [[ "$issue_count" -gt 0 ]]; then
exit 1
fi
exit 0
}
main "$@"
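The `curl` call in `send_slack` interpolates `${text}` straight into a JSON string, so a double quote or backslash in a pod name or status would make the payload invalid. One possible hardening, sketched here with a hypothetical `json_escape` helper that is not part of the script above, is to escape the text first. Note the assumption: this variant expects the message to be built with real newlines rather than the literal `\n` sequences the script currently appends, because it escapes backslashes too.

```bash
#!/usr/bin/env bash
# Hypothetical helper: escape a string for use inside a JSON string literal.
# Handles backslash, double quote, newline and tab; other control
# characters are not covered by this sketch.
json_escape() {
  local s=$1
  s=${s//\\/\\\\}      # backslash first, so later escapes are not doubled
  s=${s//\"/\\\"}
  s=${s//$'\n'/\\n}
  s=${s//$'\t'/\\t}
  printf '%s' "$s"
}

# Build the payload from a message that contains a real newline.
message=$'Issues:\n- pod "web-0" failed'
payload="{\"text\": \"$(json_escape "$message")\"}"
printf '%s\n' "$payload"
```

Since the script already depends on `python3`, encoding the payload with Python's `json` module would be another option.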
Step 2: Make it executable
chmod +x .claude/cluster-health.sh
Step 3: Test locally (dry run)
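KUBECONFIG=$(pwd)/config SLACK_WEBHOOK_URL="" bash .claude/cluster-health.sh --no-slack --no-fix
Expected: Script runs, prints check results, no Slack post, and no auto-fix actions (without --no-fix the script would delete failed and crash-looping pods even on a test run).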
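As `main` is written, the script exits 0 when no issues were recorded and 1 otherwise, so a caller can branch on the exit status. A minimal sketch, where `describe_health_status` is a hypothetical wrapper name and any other status is treated as a failure of the script itself:

```bash
#!/usr/bin/env bash
# Hypothetical wrapper: translate the health script's exit status into text.
describe_health_status() {
  case "$1" in
    0) echo "healthy" ;;
    1) echo "issues found" ;;
    *) echo "script error (exit $1)" ;;
  esac
}

# Example with a stand-in command; in practice this would be
#   bash .claude/cluster-health.sh --no-slack --no-fix
status=0
( exit 1 ) || status=$?
describe_health_status "$status"
```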
Step 4: Commit
git add .claude/cluster-health.sh
git commit -m "[ci skip] Add cluster health check script for OpenClaw agent"
Task 3: Create the cluster-health skill
Files:
- Create: .claude/skills/cluster-health/SKILL.md
Step 1: Write the skill document
---
name: cluster-health
description: |
Check Kubernetes cluster health and fix common issues. Use when:
(1) User asks to check the cluster, check health, or "what's wrong",
(2) User asks about pod status, node health, or deployment issues,
(3) User asks to fix stuck pods, evicted pods, or CrashLoopBackOff,
(4) User mentions "health check", "cluster status", "cluster health",
(5) User asks "is everything running" or "any problems".
Runs 8 standard K8s health checks with safe auto-fix for evicted pods
and stuck CrashLoopBackOff pods.
author: Claude Code
version: 1.0.0
date: 2026-02-21
---
# Cluster Health Check
## Overview
- **Script**: `/workspace/infra/.claude/cluster-health.sh`
- **Schedule**: CronJob runs every 30 minutes, execs into this pod
- **Slack**: Posts results to `$SLACK_WEBHOOK_URL`
- **Auto-fix**: Deletes evicted pods, restarts CrashLoopBackOff pods (>10 restarts)
## Quick Check
Run the health check script:
bash /workspace/infra/.claude/cluster-health.sh --no-slack
Or with Slack notification:
bash /workspace/infra/.claude/cluster-health.sh
Report-only (no auto-fix):
bash /workspace/infra/.claude/cluster-health.sh --no-fix
What It Checks
| # | Check | Auto-Fix | Alert |
|---|---|---|---|
| 1 | Node health (NotReady, conditions) | No | Yes |
| 2 | Pod health (CrashLoopBackOff, ImagePullBackOff, Error) | Restart if >10 restarts | Yes |
| 3 | Evicted/failed pods | Delete all | Yes |
| 4 | Deployment availability (current != desired) | No | Yes |
| 5 | PVC status (not Bound) | No | Yes |
| 6 | Resource pressure (CPU/Mem >80% warning, >90% issue) | No | Yes |
| 7 | CronJob failures (last 24h) | No | Yes |
| 8 | DaemonSet health (desired != ready) | No | Yes |
Safe Auto-Fix Rules
These are the ONLY things the script auto-fixes:
- Evicted/failed pods: kubectl delete pods -A --field-selector=status.phase=Failed
- CrashLoopBackOff pods with >10 restarts: kubectl delete pod -n <ns> <pod> --grace-period=30
Everything else is alert-only. NEVER auto-fix:
- Node NotReady (could be maintenance)
- ImagePullBackOff (needs image tag or registry fix)
- Pending PVCs (needs storage investigation)
- Failed deployments (needs config investigation)
Deep Investigation
When the script reports issues and the user asks for more detail, use these commands:
Node issues
kubectl describe node <node-name>
kubectl top node <node-name>
kubectl get events --field-selector involvedObject.name=<node-name>
Pod issues
kubectl describe pod -n <namespace> <pod-name>
kubectl logs -n <namespace> <pod-name> --tail=100
kubectl logs -n <namespace> <pod-name> --previous --tail=100
kubectl get events -n <namespace> --field-selector involvedObject.name=<pod-name>
Deployment issues
kubectl describe deployment -n <namespace> <deployment-name>
kubectl rollout status deployment -n <namespace> <deployment-name>
kubectl rollout history deployment -n <namespace> <deployment-name>
PVC issues
kubectl describe pvc -n <namespace> <pvc-name>
kubectl get pv
kubectl get events -n <namespace> --field-selector involvedObject.name=<pvc-name>
Resource pressure
kubectl top nodes
kubectl top pods -A --sort-by=memory | head -20
kubectl top pods -A --sort-by=cpu | head -20
Common Remediation
CrashLoopBackOff (persistent)
- Check logs: kubectl logs -n <ns> <pod> --previous --tail=100
- Check events: kubectl describe pod -n <ns> <pod>
- Common causes: OOMKilled (increase memory limit), bad config, missing env var
- If image issue: check if newer image exists, update in Terraform
OOMKilled
- Check current limits: kubectl describe pod -n <ns> <pod> | grep -A2 Limits
- Fix: Update resource limits in Terraform module for the service
- Apply: terraform apply -target=module.kubernetes_cluster.module.<service> -var="kube_config_path=$(pwd)/config"
ImagePullBackOff
- Check image: kubectl describe pod -n <ns> <pod> | grep Image
- Check registry: Is the image tag valid? Is the registry reachable?
- Check pull-through cache: Docker registry at 10.0.20.10
Node NotReady
- Check kubelet: SSH to node, systemctl status kubelet
- Check resources: kubectl top node <node>
- Check conditions: kubectl describe node <node> | grep -A10 Conditions
Slack Webhook
Messages are posted to the webhook at $SLACK_WEBHOOK_URL. Format:
- All clear: green check + summary stats
- Issues found: red siren + list of issues + auto-fix actions taken
- Warnings only: yellow warning + elevated metrics
Infrastructure
- Terraform module: modules/kubernetes/openclaw/main.tf
- CronJob: Runs in openclaw namespace every 30 min
- Existing healthcheck: scripts/cluster_healthcheck.sh (local-only, not for OpenClaw)
- Repo path inside pod: /workspace/infra/
Step 2: Commit
git add .claude/skills/cluster-health/SKILL.md
git commit -m "[ci skip] Add cluster-health skill for OpenClaw agent"
Task 4: Add CronJob and RBAC to Terraform
Files:
- Modify: modules/kubernetes/openclaw/main.tf (append CronJob + ServiceAccount + Role + RoleBinding)
Step 1: Add CronJob resources
Append the following to modules/kubernetes/openclaw/main.tf after the module "ingress" block:
# --- CronJob: Scheduled cluster health check ---
resource "kubernetes_service_account" "healthcheck" {
metadata {
name = "cluster-healthcheck"
namespace = kubernetes_namespace.openclaw.metadata[0].name
}
}
resource "kubernetes_role" "healthcheck_exec" {
metadata {
name = "healthcheck-pod-exec"
namespace = kubernetes_namespace.openclaw.metadata[0].name
}
rule {
api_groups = [""]
resources = ["pods"]
verbs = ["get", "list"]
}
rule {
api_groups = [""]
resources = ["pods/exec"]
verbs = ["create"]
}
}
resource "kubernetes_role_binding" "healthcheck_exec" {
metadata {
name = "healthcheck-pod-exec"
namespace = kubernetes_namespace.openclaw.metadata[0].name
}
subject {
kind = "ServiceAccount"
name = kubernetes_service_account.healthcheck.metadata[0].name
namespace = kubernetes_namespace.openclaw.metadata[0].name
}
role_ref {
api_group = "rbac.authorization.k8s.io"
kind = "Role"
name = kubernetes_role.healthcheck_exec.metadata[0].name
}
}
resource "kubernetes_cron_job_v1" "cluster_healthcheck" {
metadata {
name = "cluster-healthcheck"
namespace = kubernetes_namespace.openclaw.metadata[0].name
labels = {
app = "cluster-healthcheck"
tier = var.tier
}
}
spec {
schedule = "*/30 * * * *"
concurrency_policy = "Forbid"
failed_jobs_history_limit = 3
successful_jobs_history_limit = 3
job_template {
metadata {
labels = {
app = "cluster-healthcheck"
}
}
spec {
active_deadline_seconds = 300
template {
metadata {
labels = {
app = "cluster-healthcheck"
}
}
spec {
service_account_name = kubernetes_service_account.healthcheck.metadata[0].name
restart_policy = "Never"
container {
name = "healthcheck"
image = "bitnami/kubectl:1.34"
command = ["bash", "-c", <<-EOF
# Find the openclaw pod
POD=$(kubectl get pods -n openclaw -l app=openclaw -o jsonpath='{.items[0].metadata.name}' 2>/dev/null)
if [ -z "$POD" ]; then
echo "ERROR: OpenClaw pod not found"
exit 1
fi
echo "Executing health check in pod $POD..."
kubectl exec -n openclaw "$POD" -c openclaw -- bash /workspace/infra/.claude/cluster-health.sh
EOF
]
resources {
requests = {
cpu = "50m"
memory = "64Mi"
}
limits = {
memory = "128Mi"
}
}
}
}
}
}
}
}
}
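The lookup in the CronJob command takes `.items[0]`, the first pod carrying the `app=openclaw` label whatever its phase, which during a rollout could be a pod that is still Pending. A hedged alternative is to pick the first pod whose phase is `Running`. The helper name `first_running_pod` is hypothetical, and the sketch only filters `NAME PHASE` pairs on stdin so that it can be tried without a cluster:

```bash
#!/usr/bin/env bash
# Hypothetical helper: read "NAME PHASE" lines on stdin and print the first
# pod name whose phase is Running (prints nothing if there is none).
first_running_pod() {
  awk '$2 == "Running" { print $1; exit }'
}

# Sample input standing in for:
#   kubectl get pods -n openclaw -l app=openclaw --no-headers \
#     -o custom-columns=NAME:.metadata.name,PHASE:.status.phase
printf '%s\n' \
  "openclaw-7d9f-abcde Pending" \
  "openclaw-5c6b-fghij Running" | first_running_pod
```

An empty result can be handled by the same `-z "$POD"` guard the CronJob command already uses.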
Step 2: Verify Terraform formatting
terraform fmt modules/kubernetes/openclaw/main.tf
Step 3: Verify Terraform plan
terraform plan -target=module.kubernetes_cluster.module.openclaw -var="kube_config_path=$(pwd)/config"
Expected: Plan shows 4 new resources (ServiceAccount, Role, RoleBinding, CronJobV1). No destructive changes to existing resources.
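This expectation can be checked mechanically against the summary line that `terraform plan` prints (`Plan: N to add, N to change, N to destroy.`). A rough sketch, with the hypothetical function `plan_is_non_destructive` reading plan output on stdin; it assumes uncoloured output, for example from `terraform plan -no-color`, and it tolerates in-place changes because the env var from Task 1 may show up as one:

```bash
#!/usr/bin/env bash
# Hypothetical check: succeed only if the plan summary reports the expected
# number of additions and zero destroys (any number of in-place changes).
plan_is_non_destructive() {
  local expected=$1
  grep -Eq "^Plan: ${expected} to add, [0-9]+ to change, 0 to destroy[.]\$"
}

sample_plan='Terraform will perform the following actions:
  ...
Plan: 4 to add, 0 to change, 0 to destroy.'

if printf '%s\n' "$sample_plan" | plan_is_non_destructive 4; then
  echo "plan adds 4 resources and destroys nothing"
fi
```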
Step 4: Commit
git add modules/kubernetes/openclaw/main.tf
git commit -m "[ci skip] Add cluster health check CronJob to OpenClaw module"
Task 5: Deploy and verify
Step 1: Apply Terraform
terraform apply -target=module.kubernetes_cluster.module.openclaw -var="kube_config_path=$(pwd)/config" -auto-approve
Step 2: Verify CronJob exists
kubectl --kubeconfig $(pwd)/config get cronjob -n openclaw
Expected: cluster-healthcheck with schedule */30 * * * *
Step 3: Verify RBAC
kubectl --kubeconfig $(pwd)/config get serviceaccount,role,rolebinding -n openclaw
Expected: cluster-healthcheck SA, healthcheck-pod-exec role and rolebinding
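Listing the objects shows that they exist, not that the binding grants what the CronJob needs. `kubectl auth can-i` with `--as` can answer that by impersonating the ServiceAccount, provided the kubeconfig in use is allowed to impersonate. The sketch below only builds the impersonation subject so that it runs without a cluster; `sa_subject` is a hypothetical helper name, and the live commands are left as comments:

```bash
#!/usr/bin/env bash
# Hypothetical helper: build the user name Kubernetes assigns to a
# ServiceAccount, in the form system:serviceaccount:<namespace>:<name>.
sa_subject() {
  printf 'system:serviceaccount:%s:%s\n' "$1" "$2"
}

subject=$(sa_subject openclaw cluster-healthcheck)
echo "$subject"

# Against a live cluster the subject would be used like this:
#   kubectl --kubeconfig "$(pwd)/config" auth can-i list pods \
#     -n openclaw --as "$subject"
#   kubectl --kubeconfig "$(pwd)/config" auth can-i create pods \
#     --subresource=exec -n openclaw --as "$subject"
```

Both live commands should answer `yes` if the Role and RoleBinding above took effect.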
Step 4: Trigger a manual run
kubectl --kubeconfig $(pwd)/config create job --from=cronjob/cluster-healthcheck healthcheck-manual-test -n openclaw
Step 5: Check job output
kubectl --kubeconfig $(pwd)/config wait --for=condition=complete job/healthcheck-manual-test -n openclaw --timeout=120s
kubectl --kubeconfig $(pwd)/config logs job/healthcheck-manual-test -n openclaw
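Expected: Health check output with results. If the check reports issues, the script exits 1 and the Job's pod fails, so the wait for condition=complete above can time out; the logs remain readable in that case. If SLACK_WEBHOOK_URL is set, check Slack for the message.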
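`kubectl wait --for=condition=complete` only returns early on success. A hedged alternative is to poll for whichever terminal condition appears first. In the sketch below `job_outcome` is a hypothetical function that maps the space-separated list of Job condition types that are currently `True` to a result; the list itself would come from a `kubectl get job` jsonpath query, shown in the comment:

```bash
#!/usr/bin/env bash
# Hypothetical helper: given the Job condition types currently True
# (space-separated), print succeeded, failed, or running.
job_outcome() {
  case " $1 " in
    *" Complete "*) echo "succeeded" ;;
    *" Failed "*)   echo "failed" ;;
    *)              echo "running" ;;
  esac
}

# The argument would come from something like:
#   kubectl --kubeconfig "$(pwd)/config" get job healthcheck-manual-test -n openclaw \
#     -o jsonpath='{.status.conditions[?(@.status=="True")].type}'
job_outcome "Failed"
job_outcome ""
```

A polling loop could call this every few seconds and stop as soon as the result is no longer `running`.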
Step 6: Clean up test job
kubectl --kubeconfig $(pwd)/config delete job healthcheck-manual-test -n openclaw
Step 7: Final commit
git add -A terraform.tfvars modules/kubernetes/openclaw/ .claude/skills/cluster-health/ .claude/cluster-health.sh
git commit -m "[ci skip] OpenClaw cluster health agent: script + skill + CronJob"