# OpenClaw Cluster Management Agent — Implementation Plan

> **For Claude:** REQUIRED SUB-SKILL: Use superpowers:executing-plans to implement this plan task-by-task.

**Goal:** Build a proactive cluster health agent — a skill that teaches OpenClaw to check the cluster, a helper script that runs the checks and posts to Slack, and a CronJob that triggers it every 30 minutes via `kubectl exec`.

**Architecture:** CronJob (bitnami/kubectl) -> `kubectl exec` into OpenClaw pod -> runs `cluster-health.sh` which performs 8 health checks, auto-fixes safe issues, and posts a summary to Slack. The same script is available as an OpenClaw skill for interactive use.

**Tech Stack:** Bash (health check script), Terraform/HCL (CronJob + RBAC), Slack webhook API, kubectl

---

### Task 1: Add Slack webhook to openclaw_skill_secrets

**Files:**
- Modify: `terraform.tfvars:1291-1295` (add slack_webhook key)
- Modify: `modules/kubernetes/openclaw/main.tf:350-376` (add SLACK_WEBHOOK_URL env var)

**Step 1: Add slack_webhook to openclaw_skill_secrets in tfvars**

Add a new key `slack_webhook` to the existing `openclaw_skill_secrets` map. The user must provide the webhook URL. For now, use the existing `alertmanager_slack_api_url` value or a dedicated one.

In `terraform.tfvars`, change:
```hcl
openclaw_skill_secrets = {
  home_assistant_token       = "..."
  home_assistant_sofia_token = "..."
  uptime_kuma_password       = "..."
}
```
to:
```hcl
openclaw_skill_secrets = {
  home_assistant_token       = "..."
  home_assistant_sofia_token = "..."
  uptime_kuma_password       = "..."
  slack_webhook              = "https://hooks.slack.com/services/YOUR/WEBHOOK/URL"
}
```

**NOTE:** Ask the user which Slack webhook URL to use. Candidates:
- `alertmanager_slack_api_url` (line 4 in tfvars)
- `tiny_tuya_slack_url` (line 1213, comment says "K8s bot slack")
- A new webhook the user creates

**Step 2: Add SLACK_WEBHOOK_URL env var to OpenClaw container**

In `modules/kubernetes/openclaw/main.tf`, add after the `UPTIME_KUMA_PASSWORD` env block (around line 370):
```hcl
          # Skill secrets - Slack
          env {
            name  = "SLACK_WEBHOOK_URL"
            value = var.skill_secrets["slack_webhook"]
          }
```

**Step 3: Commit**

```bash
git add modules/kubernetes/openclaw/main.tf
git commit -m "[ci skip] Add Slack webhook env var to OpenClaw deployment"
```

Do NOT commit `terraform.tfvars` separately — it will be committed with the full set of changes at the end.

---

### Task 2: Create the cluster-health.sh helper script

**Files:**
- Create: `.claude/cluster-health.sh`

**Step 1: Write the health check script**

Create `.claude/cluster-health.sh` with the following structure. The script:
- Uses `$KUBECONFIG` (already set in OpenClaw pod) or falls back to in-cluster config
- Runs 8 checks: nodes, pods, evicted, deployments, PVCs, resources, CronJobs, DaemonSets
- Auto-fixes: deletes evicted pods, restarts CrashLoopBackOff pods stuck >1 hour
- Posts structured Slack message via `$SLACK_WEBHOOK_URL`
- Exit code 0 = healthy, 1 = issues found, 2 = critical

```bash
#!/usr/bin/env bash
# Cluster health check script for OpenClaw.
# Runs health checks, auto-fixes safe issues, posts to Slack.
# Designed to run inside the OpenClaw pod (has kubectl via $KUBECONFIG).
#
# Usage: ./cluster-health.sh [--no-slack] [--no-fix]
#   --no-slack  Skip Slack notification (useful for interactive/debug runs)
#   --no-fix    Skip auto-fix actions (report only)

set -euo pipefail

SEND_SLACK=true
AUTO_FIX=true
ISSUES=()
FIXES=()
WARNINGS=()

# --- Argument parsing ---
for arg in "$@"; do
  case "$arg" in
    --no-slack) SEND_SLACK=false ;;
    --no-fix)   AUTO_FIX=false ;;
  esac
done

KUBECTL="kubectl"

# --- 1. Node Health ---
check_nodes() {
  local nodes not_ready
  nodes=$($KUBECTL get nodes --no-headers 2>&1) || { ISSUES+=("Cannot reach cluster API"); return; }
  not_ready=$(echo "$nodes" | awk '$2 != "Ready" {print $1}' || true)

  if [[ -n "$not_ready" ]]; then
    while IFS= read -r node; do
      ISSUES+=("Node NotReady: $node")
    done <<< "$not_ready"
  fi

  # Check conditions
  local conditions
  conditions=$($KUBECTL get nodes -o json | python3 -c '
import json, sys
data = json.load(sys.stdin)
for node in data["items"]:
    name = node["metadata"]["name"]
    for c in node["status"]["conditions"]:
        if c["type"] in ("MemoryPressure","DiskPressure","PIDPressure") and c["status"] == "True":
            print(name + ": " + c["type"])
' 2>/dev/null) || true

  if [[ -n "$conditions" ]]; then
    while IFS= read -r line; do
      ISSUES+=("$line")
    done <<< "$conditions"
  fi
}

# --- 2. Pod Health ---
check_pods() {
  local bad
  bad=$( {
    $KUBECTL get pods -A --no-headers 2>/dev/null \
      | grep -E 'CrashLoopBackOff|ImagePullBackOff|ErrImagePull|Error' || true
  } | awk '!seen[$1,$2]++' | sed '/^$/d') || true

  if [[ -z "$bad" ]]; then return; fi

  while IFS= read -r line; do
    local ns pod status
    ns=$(echo "$line" | awk '{print $1}')
    pod=$(echo "$line" | awk '{print $2}')
    status=$(echo "$line" | awk '{print $4}')

    if [[ "$status" == "CrashLoopBackOff" ]]; then
      # Check if stuck for >1 hour
      local restart_count
      restart_count=$(echo "$line" | awk '{print $5}')
      if [[ "$AUTO_FIX" == true && "$restart_count" -gt 10 ]]; then
        $KUBECTL delete pod -n "$ns" "$pod" --grace-period=30 2>/dev/null && \
          FIXES+=("Restarted $ns/$pod (CrashLoopBackOff, $restart_count restarts)") || \
          WARNINGS+=("Failed to restart $ns/$pod")
      else
        ISSUES+=("CrashLoopBackOff: $ns/$pod ($restart_count restarts)")
      fi
    elif [[ "$status" == "ImagePullBackOff" || "$status" == "ErrImagePull" ]]; then
      ISSUES+=("ImagePullBackOff: $ns/$pod")
    else
      ISSUES+=("Error: $ns/$pod ($status)")
    fi
  done <<< "$bad"
}

# --- 3. Evicted/Failed Pods ---
check_evicted() {
  local evicted count
  evicted=$($KUBECTL get pods -A --no-headers --field-selector=status.phase=Failed 2>/dev/null || true)

  if [[ -z "$evicted" ]]; then return; fi
  count=$(echo "$evicted" | wc -l | tr -d ' ')

  if [[ "$AUTO_FIX" == true && "$count" -gt 0 ]]; then
    $KUBECTL delete pods -A --field-selector=status.phase=Failed 2>/dev/null && \
      FIXES+=("Deleted $count evicted/failed pod(s)") || \
      WARNINGS+=("Failed to delete evicted pods")
  else
    ISSUES+=("$count evicted/failed pod(s)")
  fi
}

# --- 4. Failed Deployments ---
check_deployments() {
  local deps
  deps=$($KUBECTL get deployments -A --no-headers 2>/dev/null) || return

  while IFS= read -r line; do
    local ns name ready current desired
    ns=$(echo "$line" | awk '{print $1}')
    name=$(echo "$line" | awk '{print $2}')
    ready=$(echo "$line" | awk '{print $3}')
    current=$(echo "$ready" | cut -d/ -f1)
    desired=$(echo "$ready" | cut -d/ -f2)

    if [[ "$current" != "$desired" ]]; then
      ISSUES+=("Deployment $ns/$name: $current/$desired ready")
    fi
  done <<< "$deps"
}

# --- 5. Pending PVCs ---
check_pvcs() {
  local pvcs
  pvcs=$($KUBECTL get pvc -A --no-headers 2>/dev/null) || return

  if [[ -z "$pvcs" || "$pvcs" == *"No resources found"* ]]; then return; fi

  while IFS= read -r line; do
    local ns name status
    ns=$(echo "$line" | awk '{print $1}')
    name=$(echo "$line" | awk '{print $2}')
    status=$(echo "$line" | awk '{print $3}')

    if [[ "$status" != "Bound" ]]; then
      ISSUES+=("PVC $ns/$name: $status")
    fi
  done <<< "$pvcs"
}

# --- 6. Resource Pressure ---
check_resources() {
  local top
  top=$($KUBECTL top nodes --no-headers 2>/dev/null) || return

  while IFS= read -r line; do
    local node cpu_pct mem_pct
    node=$(echo "$line" | awk '{print $1}')
    cpu_pct=$(echo "$line" | awk '{print $3}' | tr -d '%')
    mem_pct=$(echo "$line" | awk '{print $5}' | tr -d '%')

    [[ "$cpu_pct" == *"unknown"* || "$mem_pct" == *"unknown"* ]] && continue

    if [[ "$cpu_pct" -gt 90 || "$mem_pct" -gt 90 ]]; then
      ISSUES+=("High resource usage on $node: CPU ${cpu_pct}%, Mem ${mem_pct}%")
    elif [[ "$cpu_pct" -gt 80 || "$mem_pct" -gt 80 ]]; then
      WARNINGS+=("Elevated resource usage on $node: CPU ${cpu_pct}%, Mem ${mem_pct}%")
    fi
  done <<< "$top"
}

# --- 7. CronJob Failures ---
check_cronjobs() {
  local failures
  failures=$($KUBECTL get jobs -A -o json 2>/dev/null | python3 -c '
import json, sys
from datetime import datetime, timezone, timedelta

data = json.load(sys.stdin)
cutoff = datetime.now(timezone.utc) - timedelta(hours=24)

for job in data.get("items", []):
    meta = job.get("metadata", {})
    ns = meta.get("namespace", "")
    name = meta.get("name", "")
    owners = meta.get("ownerReferences", [])
    if not any(o.get("kind") == "CronJob" for o in owners):
        continue
    for c in job.get("status", {}).get("conditions", []):
        if c.get("type") == "Failed" and c.get("status") == "True":
            ts = c.get("lastTransitionTime", "")
            if ts:
                try:
                    t = datetime.fromisoformat(ts.replace("Z", "+00:00"))
                    if t > cutoff:
                        print(f"{ns}/{name}")
                except:
                    print(f"{ns}/{name}")
' 2>/dev/null) || true

  if [[ -n "$failures" ]]; then
    local count
    count=$(echo "$failures" | wc -l | tr -d ' ')
    ISSUES+=("$count CronJob failure(s) in last 24h")
  fi
}

# --- 8. DaemonSet Health ---
check_daemonsets() {
  local ds
  ds=$($KUBECTL get daemonsets -A --no-headers 2>/dev/null) || return

  while IFS= read -r line; do
    local ns name desired ready
    ns=$(echo "$line" | awk '{print $1}')
    name=$(echo "$line" | awk '{print $2}')
    desired=$(echo "$line" | awk '{print $3}')
    ready=$(echo "$line" | awk '{print $5}')

    if [[ "$desired" != "$ready" ]]; then
      ISSUES+=("DaemonSet $ns/$name: desired=$desired ready=$ready")
    fi
  done <<< "$ds"
}

# --- Cluster summary stats ---
get_summary_stats() {
  local node_count ready_count pod_count
  node_count=$($KUBECTL get nodes --no-headers 2>/dev/null | wc -l | tr -d ' ')
  ready_count=$($KUBECTL get nodes --no-headers 2>/dev/null | awk '$2 == "Ready"' | wc -l | tr -d ' ')
  pod_count=$($KUBECTL get pods -A --no-headers --field-selector=status.phase=Running 2>/dev/null | wc -l | tr -d ' ')
  echo "${ready_count}/${node_count} nodes | ${pod_count} pods running"
}

# --- Send Slack message ---
send_slack() {
  local webhook_url="$SLACK_WEBHOOK_URL"
  if [[ -z "${webhook_url:-}" ]]; then
    echo "WARNING: SLACK_WEBHOOK_URL not set, skipping Slack notification"
    return
  fi

  local summary issue_count fix_count warning_count
  summary=$(get_summary_stats)
  issue_count=${#ISSUES[@]}
  fix_count=${#FIXES[@]}
  warning_count=${#WARNINGS[@]}

  local text=""
  local total_problems=$((issue_count + warning_count))

  if [[ "$total_problems" -eq 0 && "$fix_count" -eq 0 ]]; then
    text=":white_check_mark: *Cluster Health Check — All Clear*\n${summary} | 0 issues"
  else
    if [[ "$issue_count" -gt 0 ]]; then
      text=":rotating_light: *Cluster Health Check — ${issue_count} Issue(s) Found*\n${summary}"
    elif [[ "$warning_count" -gt 0 ]]; then
      text=":warning: *Cluster Health Check — ${warning_count} Warning(s)*\n${summary}"
    else
      text=":white_check_mark: *Cluster Health Check — All Clear (auto-fixed ${fix_count})*\n${summary}"
    fi

    if [[ "$fix_count" -gt 0 ]]; then
      text+="\n\n*Auto-fixed:*"
      for fix in "${FIXES[@]}"; do
        text+="\n• ${fix}"
      done
    fi

    if [[ "$issue_count" -gt 0 ]]; then
      text+="\n\n*Needs attention:*"
      for issue in "${ISSUES[@]}"; do
        text+="\n• ${issue}"
      done
    fi

    if [[ "$warning_count" -gt 0 ]]; then
      text+="\n\n*Warnings:*"
      for warning in "${WARNINGS[@]}"; do
        text+="\n• ${warning}"
      done
    fi
  fi

  curl -s -X POST "$webhook_url" \
    -H 'Content-Type: application/json' \
    -d "{\"text\": \"${text}\"}" > /dev/null 2>&1
}

# --- Main ---
main() {
  echo "=== Cluster Health Check — $(date '+%Y-%m-%d %H:%M:%S') ==="

  check_nodes
  check_pods
  check_evicted
  check_deployments
  check_pvcs
  check_resources
  check_cronjobs
  check_daemonsets

  local issue_count=${#ISSUES[@]}
  local fix_count=${#FIXES[@]}
  local warning_count=${#WARNINGS[@]}

  echo ""
  echo "Results: ${issue_count} issue(s), ${fix_count} fix(es), ${warning_count} warning(s)"

  if [[ "$fix_count" -gt 0 ]]; then
    echo ""
    echo "Auto-fixed:"
    for fix in "${FIXES[@]}"; do echo "  - $fix"; done
  fi

  if [[ "$issue_count" -gt 0 ]]; then
    echo ""
    echo "Issues:"
    for issue in "${ISSUES[@]}"; do echo "  - $issue"; done
  fi

  if [[ "$warning_count" -gt 0 ]]; then
    echo ""
    echo "Warnings:"
    for warning in "${WARNINGS[@]}"; do echo "  - $warning"; done
  fi

  if [[ "$SEND_SLACK" == true ]]; then
    send_slack
    echo ""
    echo "Slack notification sent."
  fi

  # Exit code
  if [[ "$issue_count" -gt 0 ]]; then
    exit 1
  fi
  exit 0
}

main "$@"
```

**Step 2: Make it executable**

```bash
chmod +x .claude/cluster-health.sh
```

**Step 3: Test locally (dry run)**

```bash
KUBECONFIG=$(pwd)/config SLACK_WEBHOOK_URL="" bash .claude/cluster-health.sh --no-slack
```

Expected: Script runs, prints check results, no Slack post.

**Step 4: Commit**

```bash
git add .claude/cluster-health.sh
git commit -m "[ci skip] Add cluster health check script for OpenClaw agent"
```

---

### Task 3: Create the cluster-health skill

**Files:**
- Create: `.claude/skills/cluster-health/SKILL.md`

**Step 1: Write the skill document**

```markdown
---
name: cluster-health
description: |
  Check Kubernetes cluster health and fix common issues. Use when:
  (1) User asks to check the cluster, check health, or "what's wrong",
  (2) User asks about pod status, node health, or deployment issues,
  (3) User asks to fix stuck pods, evicted pods, or CrashLoopBackOff,
  (4) User mentions "health check", "cluster status", "cluster health",
  (5) User asks "is everything running" or "any problems".
  Runs 8 standard K8s health checks with safe auto-fix for evicted pods
  and stuck CrashLoopBackOff pods.
author: Claude Code
version: 1.0.0
date: 2026-02-21
---

# Cluster Health Check

## Overview
- **Script**: `/workspace/infra/.claude/cluster-health.sh`
- **Schedule**: CronJob runs every 30 minutes, execs into this pod
- **Slack**: Posts results to `$SLACK_WEBHOOK_URL`
- **Auto-fix**: Deletes evicted pods, restarts CrashLoopBackOff pods (>10 restarts)

## Quick Check

Run the health check script:
```bash
bash /workspace/infra/.claude/cluster-health.sh --no-slack
```

Or with Slack notification:
```bash
bash /workspace/infra/.claude/cluster-health.sh
```

Report-only (no auto-fix):
```bash
bash /workspace/infra/.claude/cluster-health.sh --no-fix
```

## What It Checks

| # | Check | Auto-Fix | Alert |
|---|-------|----------|-------|
| 1 | Node health (NotReady, conditions) | No | Yes |
| 2 | Pod health (CrashLoopBackOff, ImagePullBackOff, Error) | Restart if >10 restarts | Yes |
| 3 | Evicted/failed pods | Delete all | Yes |
| 4 | Deployment availability (current != desired) | No | Yes |
| 5 | PVC status (not Bound) | No | Yes |
| 6 | Resource pressure (CPU/Mem >80%) | No | Yes |
| 7 | CronJob failures (last 24h) | No | Yes |
| 8 | DaemonSet health (desired != ready) | No | Yes |

## Safe Auto-Fix Rules

These are the ONLY things the script auto-fixes:
1. **Evicted/failed pods**: `kubectl delete pods -A --field-selector=status.phase=Failed`
2. **CrashLoopBackOff pods with >10 restarts**: `kubectl delete pod -n <ns> <pod> --grace-period=30`

Everything else is alert-only. NEVER auto-fix:
- Node NotReady (could be maintenance)
- ImagePullBackOff (needs image tag or registry fix)
- Pending PVCs (needs storage investigation)
- Failed deployments (needs config investigation)

## Deep Investigation

When the script reports issues and the user asks for more detail, use these commands:

### Node issues
```bash
kubectl describe node <node-name>
kubectl top node <node-name>
kubectl get events --field-selector involvedObject.name=<node-name>
```

### Pod issues
```bash
kubectl describe pod -n <namespace> <pod-name>
kubectl logs -n <namespace> <pod-name> --tail=100
kubectl logs -n <namespace> <pod-name> --previous --tail=100
kubectl get events -n <namespace> --field-selector involvedObject.name=<pod-name>
```

### Deployment issues
```bash
kubectl describe deployment -n <namespace> <deployment-name>
kubectl rollout status deployment -n <namespace> <deployment-name>
kubectl rollout history deployment -n <namespace> <deployment-name>
```

### PVC issues
```bash
kubectl describe pvc -n <namespace> <pvc-name>
kubectl get pv
kubectl get events -n <namespace> --field-selector involvedObject.name=<pvc-name>
```

### Resource pressure
```bash
kubectl top nodes
kubectl top pods -A --sort-by=memory | head -20
kubectl top pods -A --sort-by=cpu | head -20
```

## Common Remediation

### CrashLoopBackOff (persistent)
1. Check logs: `kubectl logs -n <ns> <pod> --previous --tail=100`
2. Check events: `kubectl describe pod -n <ns> <pod>`
3. Common causes: OOMKilled (increase memory limit), bad config, missing env var
4. If image issue: check if newer image exists, update in Terraform

### OOMKilled
1. Check current limits: `kubectl describe pod -n <ns> <pod> | grep -A2 Limits`
2. Fix: Update resource limits in Terraform module for the service
3. Apply: `terraform apply -target=module.kubernetes_cluster.module.<service> -var="kube_config_path=$(pwd)/config"`

### ImagePullBackOff
1. Check image: `kubectl describe pod -n <ns> <pod> | grep Image`
2. Check registry: Is the image tag valid? Is the registry reachable?
3. Check pull-through cache: Docker registry at 10.0.20.10

### Node NotReady
1. Check kubelet: SSH to node, `systemctl status kubelet`
2. Check resources: `kubectl top node <node>`
3. Check conditions: `kubectl describe node <node> | grep -A10 Conditions`

## Slack Webhook

Messages are posted to the webhook at `$SLACK_WEBHOOK_URL`. Format:
- All clear: green check + summary stats
- Issues found: red siren + list of issues + auto-fix actions taken
- Warnings only: yellow warning + elevated metrics

## Infrastructure

- **Terraform module**: `modules/kubernetes/openclaw/main.tf`
- **CronJob**: Runs in `openclaw` namespace every 30 min
- **Existing healthcheck**: `scripts/cluster_healthcheck.sh` (local-only, not for OpenClaw)
- **Repo path inside pod**: `/workspace/infra/`
```

**Step 2: Commit**

```bash
git add .claude/skills/cluster-health/SKILL.md
git commit -m "[ci skip] Add cluster-health skill for OpenClaw agent"
```

---

### Task 4: Add CronJob and RBAC to Terraform

**Files:**
- Modify: `modules/kubernetes/openclaw/main.tf` (append CronJob + ServiceAccount + Role + RoleBinding)

**Step 1: Add CronJob resources**

Append the following to `modules/kubernetes/openclaw/main.tf` after the `module "ingress"` block:

```hcl
# --- CronJob: Scheduled cluster health check ---

resource "kubernetes_service_account" "healthcheck" {
  metadata {
    name      = "cluster-healthcheck"
    namespace = kubernetes_namespace.openclaw.metadata[0].name
  }
}

resource "kubernetes_role" "healthcheck_exec" {
  metadata {
    name      = "healthcheck-pod-exec"
    namespace = kubernetes_namespace.openclaw.metadata[0].name
  }
  rule {
    api_groups = [""]
    resources  = ["pods"]
    verbs      = ["get", "list"]
  }
  rule {
    api_groups = [""]
    resources  = ["pods/exec"]
    verbs      = ["create"]
  }
}

resource "kubernetes_role_binding" "healthcheck_exec" {
  metadata {
    name      = "healthcheck-pod-exec"
    namespace = kubernetes_namespace.openclaw.metadata[0].name
  }
  subject {
    kind      = "ServiceAccount"
    name      = kubernetes_service_account.healthcheck.metadata[0].name
    namespace = kubernetes_namespace.openclaw.metadata[0].name
  }
  role_ref {
    api_group = "rbac.authorization.k8s.io"
    kind      = "Role"
    name      = kubernetes_role.healthcheck_exec.metadata[0].name
  }
}

resource "kubernetes_cron_job_v1" "cluster_healthcheck" {
  metadata {
    name      = "cluster-healthcheck"
    namespace = kubernetes_namespace.openclaw.metadata[0].name
    labels = {
      app  = "cluster-healthcheck"
      tier = var.tier
    }
  }
  spec {
    schedule                      = "*/30 * * * *"
    concurrency_policy            = "Forbid"
    failed_jobs_history_limit     = 3
    successful_jobs_history_limit = 3

    job_template {
      metadata {
        labels = {
          app = "cluster-healthcheck"
        }
      }
      spec {
        active_deadline_seconds = 300
        template {
          metadata {
            labels = {
              app = "cluster-healthcheck"
            }
          }
          spec {
            service_account_name = kubernetes_service_account.healthcheck.metadata[0].name
            restart_policy       = "Never"

            container {
              name    = "healthcheck"
              image   = "bitnami/kubectl:1.34"
              command = ["bash", "-c", <<-EOF
                # Find the openclaw pod
                POD=$(kubectl get pods -n openclaw -l app=openclaw -o jsonpath='{.items[0].metadata.name}' 2>/dev/null)
                if [ -z "$POD" ]; then
                  echo "ERROR: OpenClaw pod not found"
                  exit 1
                fi
                echo "Executing health check in pod $POD..."
                kubectl exec -n openclaw "$POD" -c openclaw -- bash /workspace/infra/.claude/cluster-health.sh
              EOF
              ]

              resources {
                requests = {
                  cpu    = "50m"
                  memory = "64Mi"
                }
                limits = {
                  memory = "128Mi"
                }
              }
            }
          }
        }
      }
    }
  }
}
```

**Step 2: Verify Terraform formatting**

```bash
terraform fmt modules/kubernetes/openclaw/main.tf
```

**Step 3: Verify Terraform plan**

```bash
terraform plan -target=module.kubernetes_cluster.module.openclaw -var="kube_config_path=$(pwd)/config"
```

Expected: Plan shows 4 new resources (ServiceAccount, Role, RoleBinding, CronJobV1). No destructive changes to existing resources.

**Step 4: Commit**

```bash
git add modules/kubernetes/openclaw/main.tf
git commit -m "[ci skip] Add cluster health check CronJob to OpenClaw module"
```

---

### Task 5: Deploy and verify

**Step 1: Apply Terraform**

```bash
terraform apply -target=module.kubernetes_cluster.module.openclaw -var="kube_config_path=$(pwd)/config" -auto-approve
```

**Step 2: Verify CronJob exists**

```bash
kubectl --kubeconfig $(pwd)/config get cronjob -n openclaw
```

Expected: `cluster-healthcheck` with schedule `*/30 * * * *`

**Step 3: Verify RBAC**

```bash
kubectl --kubeconfig $(pwd)/config get serviceaccount,role,rolebinding -n openclaw
```

Expected: `cluster-healthcheck` SA, `healthcheck-pod-exec` role and rolebinding

**Step 4: Trigger a manual run**

```bash
kubectl --kubeconfig $(pwd)/config create job --from=cronjob/cluster-healthcheck healthcheck-manual-test -n openclaw
```

**Step 5: Check job output**

```bash
kubectl --kubeconfig $(pwd)/config wait --for=condition=complete job/healthcheck-manual-test -n openclaw --timeout=120s
kubectl --kubeconfig $(pwd)/config logs job/healthcheck-manual-test -n openclaw
```

Expected: Health check output with results. If `SLACK_WEBHOOK_URL` is set, check Slack for the message.

**Step 6: Clean up test job**

```bash
kubectl --kubeconfig $(pwd)/config delete job healthcheck-manual-test -n openclaw
```

**Step 7: Final commit**

```bash
git add -A modules/kubernetes/openclaw/ .claude/skills/cluster-health/ .claude/cluster-health.sh
git commit -m "[ci skip] OpenClaw cluster health agent: script + skill + CronJob"
```