---
name: cluster-health
description: |
  Check Kubernetes cluster health and fix common issues. Use when:
  (1) User asks to check the cluster, check health, or "what's wrong",
  (2) User asks about pod status, node health, or deployment issues,
  (3) User asks to fix stuck pods, evicted pods, or CrashLoopBackOff,
  (4) User mentions "health check", "cluster status", "cluster health",
  (5) User asks "is everything running" or "any problems".
  Runs 8 standard K8s health checks with safe auto-fix for evicted pods
  and stuck CrashLoopBackOff pods.
author: Claude Code
version: 1.0.0
date: 2026-02-21
---

# Cluster Health Check

## Overview

- **Script**: `/workspace/infra/.claude/cluster-health.sh`
- **Schedule**: CronJob runs every 30 minutes in the `openclaw` namespace
- **Slack notifications**: Posts results to the webhook URL in `$SLACK_WEBHOOK_URL`
- **Auto-fix**: Automatically deletes evicted/failed pods and CrashLoopBackOff pods with >10 restarts
- **Exit code**: 0 = healthy, 1 = issues found

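The exit code makes the script easy to wrap in other automation. A minimal sketch, assuming only the documented flags and exit codes:

```bash
# Act on the documented exit codes (0 = healthy, 1 = issues found)
if bash /workspace/infra/.claude/cluster-health.sh --no-slack; then
    echo "Cluster healthy"
else
    echo "Issues found -- review the report above" >&2
    exit 1
fi
```
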
## Quick Check

Run the health check interactively:

```bash
# Report only, no Slack notification
bash /workspace/infra/.claude/cluster-health.sh --no-slack

# Full run with Slack notification
bash /workspace/infra/.claude/cluster-health.sh

# Report only, no auto-fix and no Slack
bash /workspace/infra/.claude/cluster-health.sh --no-fix --no-slack
```

## What It Checks

| # | Check | Auto-Fix | Alerts |
|---|-------|----------|--------|
| 1 | **Node Health** — NotReady nodes, MemoryPressure, DiskPressure, PIDPressure | No | Yes |
| 2 | **Pod Health** — CrashLoopBackOff, ImagePullBackOff, ErrImagePull, Error | Yes (CrashLoop >10 restarts) | Yes |
| 3 | **Evicted/Failed Pods** — Pods in `Failed` phase | Yes (deletes all) | Yes |
| 4 | **Failed Deployments** — Deployments with ready != desired replicas | No | Yes |
| 5 | **Pending PVCs** — PersistentVolumeClaims not in `Bound` state | No | Yes |
| 6 | **Resource Pressure** — Node CPU or memory >80% (warn) or >90% (issue) | No | Yes |
| 7 | **CronJob Failures** — Failed CronJob-owned Jobs in the last 24h | No | Yes |
| 8 | **DaemonSet Health** — DaemonSets with desired != ready | No | Yes |

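As an illustration of what a check like #4 looks for, deployments whose ready replica count lags the desired count can be listed with a query along these lines (a sketch assuming `jq` is available; the script's actual implementation may differ):

```bash
# List deployments where readyReplicas < desired replicas (roughly check #4)
kubectl get deployments -A -o json | jq -r '
  .items[]
  | select((.status.readyReplicas // 0) < (.spec.replicas // 0))
  | "\(.metadata.namespace)/\(.metadata.name): \(.status.readyReplicas // 0)/\(.spec.replicas // 0) ready"'
```
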
## Safe Auto-Fix Rules

### Safe to auto-fix (the script does these automatically)

1. **Evicted/Failed pods** — These are already terminated and just cluttering the namespace:

   ```bash
   kubectl delete pods -A --field-selector=status.phase=Failed
   ```

2. **CrashLoopBackOff pods with >10 restarts** — The pod is stuck in a crash loop; deleting lets the controller recreate it with a fresh backoff timer:

   ```bash
   kubectl delete pod -n <namespace> <pod-name> --grace-period=0
   ```

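The exact selection logic lives in the script; one way to list CrashLoopBackOff containers with more than 10 restarts (a sketch assuming `jq` is available) is:

```bash
# List pods with a container in CrashLoopBackOff and more than 10 restarts
kubectl get pods -A -o json | jq -r '
  .items[] | . as $pod
  | .status.containerStatuses[]?
  | select(.state.waiting.reason == "CrashLoopBackOff" and .restartCount > 10)
  | "\($pod.metadata.namespace)/\($pod.metadata.name) restarts=\(.restartCount)"'
```
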
### NEVER auto-fix (requires human investigation)

- **NotReady nodes** — Could be network, kubelet, or hardware issue; needs SSH investigation
- **DiskPressure / MemoryPressure / PIDPressure** — Root cause must be identified
- **ImagePullBackOff** — Usually a wrong image tag or registry issue; needs config fix
- **Failed deployments** — Could be resource limits, bad config, missing secrets
- **Pending PVCs** — Usually NFS export missing or storage class issue
- **Resource pressure >90%** — Need to identify which pods are consuming resources
- **CronJob failures** — Need to check job logs to understand why it failed (see the sketch after this list)
- **DaemonSet issues** — Could be node taints, resource limits, or image issues

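For the CronJob case, a starting point for finding Jobs with recorded failures and reading their logs (a sketch assuming `jq`; unlike check #7 it does not filter by owner or by the last 24h):

```bash
# List Jobs that have recorded at least one failure
kubectl get jobs -A -o json | jq -r '
  .items[]
  | select((.status.failed // 0) > 0)
  | "\(.metadata.namespace)/\(.metadata.name)"'

# Read logs from a failed Job's pod
kubectl logs -n <namespace> job/<job-name> --tail=100
```
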
## Deep Investigation

When the health check reports issues, use these commands to investigate further.

### Node Issues

```bash
# Describe the problematic node (events, conditions, capacity)
kubectl describe node <node-name>

# Check resource usage across all nodes
kubectl top nodes

# Check recent events on a specific node
kubectl get events --field-selector involvedObject.name=<node-name> --sort-by='.lastTimestamp'

# SSH to the node for direct inspection
ssh root@<node-ip>
systemctl status kubelet
journalctl -u kubelet --since "30 minutes ago" | tail -100
df -h
free -h
```

### Pod Issues

```bash
# Describe the pod (events, conditions, container statuses)
kubectl describe pod -n <namespace> <pod-name>

# Check current logs
kubectl logs -n <namespace> <pod-name> --tail=100

# Check logs from the previous crashed container
kubectl logs -n <namespace> <pod-name> --previous --tail=100

# Check events in the namespace
kubectl get events -n <namespace> --sort-by='.lastTimestamp' | tail -20

# Check all pods in a namespace
kubectl get pods -n <namespace> -o wide
```

### Deployment Issues

```bash
# Describe the deployment (strategy, conditions, events)
kubectl describe deployment -n <namespace> <deployment-name>

# Check rollout status
kubectl rollout status deployment -n <namespace> <deployment-name>

# Check rollout history
kubectl rollout history deployment -n <namespace> <deployment-name>

# Check the replicaset
kubectl get rs -n <namespace> -l app=<app-label>
```

### PVC Issues

```bash
# Describe the PVC (events, status, storage class)
kubectl describe pvc -n <namespace> <pvc-name>

# Check PVs
kubectl get pv

# Check events related to PVCs
kubectl get events -n <namespace> --field-selector reason=FailedMount --sort-by='.lastTimestamp'

# Verify NFS export exists
showmount -e 10.0.10.15 | grep <service-name>
```

### Resource Pressure

```bash
# Top nodes (CPU and memory usage)
kubectl top nodes

# Top pods sorted by memory (cluster-wide)
kubectl top pods -A --sort-by=memory | head -20

# Top pods sorted by CPU (cluster-wide)
kubectl top pods -A --sort-by=cpu | head -20

# Check resource requests/limits in a namespace
kubectl describe resourcequota -n <namespace>
kubectl describe limitrange -n <namespace>
```

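To relate this to the check's thresholds, the CPU% and MEMORY% columns of `kubectl top nodes` can be screened against the 80% warning level with something like the following (column positions assumed; the script's own threshold logic may differ):

```bash
# Flag nodes above the 80% warning threshold
# (columns assumed: NAME CPU(cores) CPU% MEMORY(bytes) MEMORY%)
kubectl top nodes --no-headers | awk '{
  cpu = $3; mem = $5
  gsub("%", "", cpu); gsub("%", "", mem)
  if (cpu + 0 > 80 || mem + 0 > 80)
    print $1 " cpu=" cpu "% mem=" mem "%"
}'
```
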
## Common Remediation

### Persistent CrashLoopBackOff

A pod keeps crashing even after the auto-fix deletes it.

1. **Check logs from the crashed container**:

   ```bash
   kubectl logs -n <namespace> <pod-name> --previous --tail=200
   ```

2. **Check the pod description for clues**:

   ```bash
   kubectl describe pod -n <namespace> <pod-name>
   ```

   Look for (see the jsonpath sketch after this list):
   - `OOMKilled` in Last State — the container ran out of memory
   - `Error` with exit code 1 — application error (bad config, missing env var, DB connection failure)
   - `Error` with exit code 137 — killed by OOM killer or liveness probe
   - `Error` with exit code 143 — SIGTERM (graceful shutdown failure)

3. **Common causes**:
   - **OOMKilled**: Increase memory limits in Terraform (see below)
   - **Bad config**: Check environment variables, secrets, config maps
   - **DB connection failure**: Verify the database pod is running (`kubectl get pods -n dbaas`)
   - **NFS mount failure**: Verify NFS export exists (`showmount -e 10.0.10.15`)
   - **Missing secret**: Check if TLS secret or other secrets exist in the namespace

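The Last State reason and exit code mentioned in step 2 can also be pulled directly with jsonpath (field names per the Kubernetes pod status API):

```bash
# Print each container's last terminated reason and exit code
kubectl get pod -n <namespace> <pod-name> -o jsonpath='{range .status.containerStatuses[*]}{.name}{": "}{.lastState.terminated.reason}{" exit="}{.lastState.terminated.exitCode}{"\n"}{end}'
```
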
### OOMKilled

The container was killed because it exceeded its memory limit.

1. **Check current limits**:

   ```bash
   kubectl describe pod -n <namespace> <pod-name> | grep -A 5 "Limits"
   ```

2. **Fix in Terraform** — Edit `modules/kubernetes/<service>/main.tf` and increase the memory limit:

   ```hcl
   resources {
     limits = {
       memory = "2Gi" # Increase from current value
     }
   }
   ```

3. **Apply the change**:

   ```bash
   cd /workspace/infra
   terraform apply -target=module.kubernetes_cluster.module.<service> -auto-approve
   ```

### ImagePullBackOff

The container image cannot be pulled.

1. **Check the exact error**:

   ```bash
   kubectl describe pod -n <namespace> <pod-name> | grep -A 5 "Events"
   ```

2. **Common causes**:
   - **Wrong image tag**: Verify the tag exists on the registry (Docker Hub, ghcr.io, etc.)
   - **Private registry without credentials**: Check if imagePullSecrets are configured
   - **Pull-through cache issue**: The registry cache at `10.0.20.10` may have a stale entry

     ```bash
     # Check pull-through cache ports:
     # 5000 = docker.io, 5010 = ghcr.io, 5020 = quay.io, 5030 = registry.k8s.io
     curl -s http://10.0.20.10:5000/v2/_catalog | python3 -m json.tool
     ```

   - **Registry rate limit**: Docker Hub free tier has pull limits; pull-through cache helps avoid this

3. **Fix**: Update the image tag in the service's Terraform module and re-apply.

### Node NotReady

A node has gone NotReady.

1. **Check node conditions**:

   ```bash
   kubectl describe node <node-name> | grep -A 20 "Conditions"
   ```

2. **SSH to the node and check kubelet**:

   ```bash
   ssh root@<node-ip>
   systemctl status kubelet
   journalctl -u kubelet --since "10 minutes ago" | tail -50
   ```

3. **Check resources**:

   ```bash
   # On the node
   df -h    # Disk space
   free -h  # Memory
   top -bn1 # CPU/processes
   ```

4. **Node IPs** (for SSH):
   - `10.0.20.100` — k8s-master
   - `10.0.20.101` — k8s-node1 (GPU)
   - `10.0.20.102` — k8s-node2
   - `10.0.20.103` — k8s-node3
   - `10.0.20.104` — k8s-node4

## Slack Webhook

The script posts results to the Slack incoming webhook URL in `$SLACK_WEBHOOK_URL`. The message format uses Slack mrkdwn:

- All clear: green checkmark with node/pod count
- Warnings only: warning icon with details
- Issues found: red alert icon with auto-fixes applied and remaining issues

The webhook URL is passed as an environment variable from `openclaw_skill_secrets` in `terraform.tfvars`.

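For reference, Slack incoming webhooks accept a JSON payload with a `text` field; a minimal post looks like the following (the script's actual payload and formatting may be richer):

```bash
# Minimal mrkdwn message to the incoming webhook
curl -sf -X POST -H 'Content-Type: application/json' \
  --data '{"text": ":white_check_mark: *Cluster health*: all checks passed"}' \
  "$SLACK_WEBHOOK_URL"
```
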
## Infrastructure

| Component | Path / Location |
|-----------|----------------|
| Health check script | `/workspace/infra/.claude/cluster-health.sh` (in-pod) or `.claude/cluster-health.sh` (repo) |
| Terraform module | `modules/kubernetes/openclaw/main.tf` |
| CronJob definition | Defined in the OpenClaw Terraform module |
| Existing full healthcheck | `scripts/cluster_healthcheck.sh` (local-only, 24 checks with color output) |
| Infra repo (in pod) | `/workspace/infra` |
| kubectl (in pod) | `/tools/kubectl` |
| terraform (in pod) | `/tools/terraform` |

## Notes

1. This script is designed to run inside the OpenClaw pod, where kubectl is pre-configured via the ServiceAccount (an ad hoc run can also be triggered from the CronJob; see the sketch after this list)
2. The full `scripts/cluster_healthcheck.sh` script runs 24 checks and is meant for local interactive use; this skill's script runs 8 core checks optimized for automated CronJob execution
3. When investigating issues interactively, prefer running commands directly rather than re-running the script
4. All Terraform changes must go through the `.tf` files — never use `kubectl apply/edit/patch` for persistent changes

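If an ad hoc run of the scheduled check is needed (rather than the interactive invocation above), the CronJob can be triggered manually; `<cronjob-name>` below is a placeholder for the actual name defined in the OpenClaw Terraform module:

```bash
# Trigger a one-off run from the existing CronJob in the openclaw namespace
kubectl create job -n openclaw --from=cronjob/<cronjob-name> cluster-health-manual

# Follow the resulting Job's logs once its pod starts
kubectl logs -n openclaw job/cluster-health-manual -f
```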