
name: cluster-health
description: Check Kubernetes cluster health and fix common issues. Use when: (1) User asks to check the cluster, check health, or "what's wrong", (2) User asks about pod status, node health, or deployment issues, (3) User asks to fix stuck pods, evicted pods, or CrashLoopBackOff, (4) User mentions "health check", "cluster status", "cluster health", (5) User asks "is everything running" or "any problems". Runs 8 standard K8s health checks with safe auto-fix for evicted pods and stuck CrashLoopBackOff pods.
author: Claude Code
version: 1.0.0
date: 2026-02-21

Cluster Health Check

Overview

  • Script: /workspace/infra/.claude/cluster-health.sh
  • Schedule: CronJob runs every 30 minutes in the openclaw namespace
  • Slack notifications: Posts results to the webhook URL in $SLACK_WEBHOOK_URL
  • Auto-fix: Automatically deletes evicted/failed pods and CrashLoopBackOff pods with >10 restarts
  • Exit code: 0 = healthy, 1 = issues found

Quick Check

Run the health check interactively:

# Report only, no Slack notification
bash /workspace/infra/.claude/cluster-health.sh --no-slack

# Full run with Slack notification
bash /workspace/infra/.claude/cluster-health.sh

# Report only, no auto-fix and no Slack
bash /workspace/infra/.claude/cluster-health.sh --no-fix --no-slack
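The script exits 0 when healthy and 1 when issues are found, so a report-only run can gate follow-up steps. A minimal sketch:

# Branch on the health check's exit code
if bash /workspace/infra/.claude/cluster-health.sh --no-fix --no-slack; then
  echo "cluster healthy"
else
  echo "issues found, see report above"
fi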

What It Checks

| # | Check | Auto-Fix | Alerts |
|---|-------|----------|--------|
| 1 | Node Health — NotReady nodes, MemoryPressure, DiskPressure, PIDPressure | No | Yes |
| 2 | Pod Health — CrashLoopBackOff, ImagePullBackOff, ErrImagePull, Error | Yes (CrashLoop >10 restarts) | Yes |
| 3 | Evicted/Failed Pods — Pods in Failed phase | Yes (deletes all) | Yes |
| 4 | Failed Deployments — Deployments with ready != desired replicas | No | Yes |
| 5 | Pending PVCs — PersistentVolumeClaims not in Bound state | No | Yes |
| 6 | Resource Pressure — Node CPU or memory >80% (warn) or >90% (issue) | No | Yes |
| 7 | CronJob Failures — Failed CronJob-owned Jobs in the last 24h | No | Yes |
| 8 | DaemonSet Health — DaemonSets with desired != ready | No | Yes |
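Most checks can be reproduced by hand. For example, check 4 (failed deployments) amounts to comparing ready and desired replica counts; a sketch assuming jq is available (the script's exact implementation may differ):

# List deployments whose ready replica count does not match the desired count
kubectl get deployments -A -o json | jq -r \
  '.items[] | select((.status.readyReplicas // 0) != .spec.replicas)
   | "\(.metadata.namespace)/\(.metadata.name): \(.status.readyReplicas // 0)/\(.spec.replicas) ready"'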

Safe Auto-Fix Rules

Safe to auto-fix (the script does these automatically)

  1. Evicted/Failed pods — These are already terminated and just cluttering the namespace:

    kubectl delete pods -A --field-selector=status.phase=Failed
    
  2. CrashLoopBackOff pods with >10 restarts — The pod is stuck in a crash loop; deleting lets the controller recreate it with a fresh backoff timer:

    kubectl delete pod -n <namespace> <pod-name> --grace-period=0
    
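One way to find the pods that rule 2 targets (a sketch; the script's selection logic may differ):

# List CrashLoopBackOff containers with more than 10 restarts
kubectl get pods -A -o json | jq -r \
  '.items[] | . as $p | .status.containerStatuses[]?
   | select(.state.waiting.reason == "CrashLoopBackOff" and .restartCount > 10)
   | "\($p.metadata.namespace)/\($p.metadata.name) (\(.restartCount) restarts)"'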

NEVER auto-fix (requires human investigation)

  • NotReady nodes — Could be network, kubelet, or hardware issue; needs SSH investigation
  • DiskPressure / MemoryPressure / PIDPressure — Root cause must be identified
  • ImagePullBackOff — Usually a wrong image tag or registry issue; needs config fix
  • Failed deployments — Could be resource limits, bad config, missing secrets
  • Pending PVCs — Usually NFS export missing or storage class issue
  • Resource pressure >90% — Need to identify which pods are consuming resources
  • CronJob failures — Need to check job logs to understand why it failed
  • DaemonSet issues — Could be node taints, resource limits, or image issues

Deep Investigation

When the health check reports issues, use these commands to investigate further.

Node Issues

# Describe the problematic node (events, conditions, capacity)
kubectl describe node <node-name>

# Check resource usage across all nodes
kubectl top nodes

# Check recent events on a specific node
kubectl get events --field-selector involvedObject.name=<node-name> --sort-by='.lastTimestamp'

# SSH to the node for direct inspection
ssh root@<node-ip>
systemctl status kubelet
journalctl -u kubelet --since "30 minutes ago" | tail -100
df -h
free -h

Pod Issues

# Describe the pod (events, conditions, container statuses)
kubectl describe pod -n <namespace> <pod-name>

# Check current logs
kubectl logs -n <namespace> <pod-name> --tail=100

# Check logs from the previous crashed container
kubectl logs -n <namespace> <pod-name> --previous --tail=100

# Check events in the namespace
kubectl get events -n <namespace> --sort-by='.lastTimestamp' | tail -20

# Check all pods in a namespace
kubectl get pods -n <namespace> -o wide

Deployment Issues

# Describe the deployment (strategy, conditions, events)
kubectl describe deployment -n <namespace> <deployment-name>

# Check rollout status
kubectl rollout status deployment -n <namespace> <deployment-name>

# Check rollout history
kubectl rollout history deployment -n <namespace> <deployment-name>

# Check the replicaset
kubectl get rs -n <namespace> -l app=<app-label>

PVC Issues

# Describe the PVC (events, status, storage class)
kubectl describe pvc -n <namespace> <pvc-name>

# Check PVs
kubectl get pv

# Check events related to PVCs
kubectl get events -n <namespace> --field-selector reason=FailedMount --sort-by='.lastTimestamp'

# Verify NFS export exists
showmount -e 10.0.10.15 | grep <service-name>

Resource Pressure

# Top nodes (CPU and memory usage)
kubectl top nodes

# Top pods sorted by memory (cluster-wide)
kubectl top pods -A --sort-by=memory | head -20

# Top pods sorted by CPU (cluster-wide)
kubectl top pods -A --sort-by=cpu | head -20

# Check resource requests/limits in a namespace
kubectl describe resourcequota -n <namespace>
kubectl describe limitrange -n <namespace>
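To quickly flag nodes above the 80% warning threshold (a sketch; column positions assume the default kubectl top nodes output):

# Print nodes whose CPU% or MEMORY% exceeds 80
kubectl top nodes --no-headers | awk '{c=$3; m=$5; gsub("%","",c); gsub("%","",m);
  if (c+0 > 80 || m+0 > 80) print $1, "CPU="$3, "MEM="$5}'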

Common Remediation

Persistent CrashLoopBackOff

A pod keeps crashing even after the auto-fix deletes it.

  1. Check logs from the crashed container:

    kubectl logs -n <namespace> <pod-name> --previous --tail=200
    
  2. Check the pod description for clues:

    kubectl describe pod -n <namespace> <pod-name>
    

    Look for:

    • OOMKilled in Last State — the container ran out of memory
    • Error with exit code 1 — application error (bad config, missing env var, DB connection failure)
    • Error with exit code 137 — killed by OOM killer or liveness probe
    • Error with exit code 143 — SIGTERM (graceful shutdown failure)
  3. Common causes:

    • OOMKilled: Increase memory limits in Terraform (see below)
    • Bad config: Check environment variables, secrets, config maps
    • DB connection failure: Verify the database pod is running (kubectl get pods -n dbaas)
    • NFS mount failure: Verify NFS export exists (showmount -e 10.0.10.15)
    • Missing secret: Check if TLS secret or other secrets exist in the namespace
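Quick commands for ruling out the causes above (namespaces and names are placeholders):

kubectl get secrets -n <namespace>              # missing secret?
kubectl get configmaps -n <namespace>           # bad or missing config?
kubectl get pods -n dbaas                       # database pod running?
showmount -e 10.0.10.15 | grep <service-name>   # NFS export present?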

OOMKilled

The container was killed because it exceeded its memory limit.

  1. Check current limits:

    kubectl describe pod -n <namespace> <pod-name> | grep -A 5 "Limits"
    
  2. Fix in Terraform — Edit modules/kubernetes/<service>/main.tf and increase the memory limit:

    resources {
      limits = {
        memory = "2Gi"  # Increase from current value
      }
    }
    
  3. Apply the change:

    cd /workspace/infra
    terraform apply -target=module.kubernetes_cluster.module.<service> -auto-approve
    
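To confirm the kill reason before changing limits, read each container's last terminated state directly (standard pod status fields):

# Show each container's last termination reason (expect "OOMKilled")
kubectl get pod -n <namespace> <pod-name> \
  -o jsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.lastState.terminated.reason}{"\n"}{end}'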

ImagePullBackOff

The container image cannot be pulled.

  1. Check the exact error:

    kubectl describe pod -n <namespace> <pod-name> | grep -A 5 "Events"
    
  2. Common causes:

    • Wrong image tag: Verify the tag exists on the registry (Docker Hub, ghcr.io, etc.)
    • Private registry without credentials: Check if imagePullSecrets are configured
    • Pull-through cache issue: The registry cache at 10.0.20.10 may have a stale entry
      # Check pull-through cache ports:
      # 5000 = docker.io, 5010 = ghcr.io, 5020 = quay.io, 5030 = registry.k8s.io
      curl -s http://10.0.20.10:5000/v2/_catalog | python3 -m json.tool
      
    • Registry rate limit: Docker Hub free tier has pull limits; pull-through cache helps avoid this
  3. Fix: Update the image tag in the service's Terraform module and re-apply.
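Before re-applying, the tag can be checked against the registry HTTP API, for example via the ghcr.io pull-through cache on port 5010 (the image path is a placeholder; a proxy cache may only return tags it can reach upstream):

# List available tags for an image through the pull-through cache
curl -s http://10.0.20.10:5010/v2/<org>/<image>/tags/list | python3 -m json.tool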

Node NotReady

A node has gone NotReady.

  1. Check node conditions:

    kubectl describe node <node-name> | grep -A 20 "Conditions"
    
  2. SSH to the node and check kubelet:

    ssh root@<node-ip>
    systemctl status kubelet
    journalctl -u kubelet --since "10 minutes ago" | tail -50
    
  3. Check resources:

    # On the node
    df -h          # Disk space
    free -h        # Memory
    top -bn1       # CPU/processes
    
  4. Node IPs (for SSH):

    • 10.0.20.100 — k8s-master
    • 10.0.20.101 — k8s-node1 (GPU)
    • 10.0.20.102 — k8s-node2
    • 10.0.20.103 — k8s-node3
    • 10.0.20.104 — k8s-node4

Slack Webhook

The script posts results to the Slack incoming webhook URL in $SLACK_WEBHOOK_URL. The message format uses Slack mrkdwn:

  • All clear: green checkmark with node/pod count
  • Warnings only: warning icon with details
  • Issues found: red alert icon with auto-fixes applied and remaining issues

The webhook URL is passed as an environment variable from openclaw_skill_secrets in terraform.tfvars.
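The posting step is roughly a single curl against the incoming webhook (a sketch; the actual payload built by cluster-health.sh may differ):

# Minimal Slack incoming-webhook post
curl -s -X POST -H 'Content-type: application/json' \
  --data '{"text": ":white_check_mark: Cluster healthy: all nodes Ready, no problem pods"}' \
  "$SLACK_WEBHOOK_URL"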

Infrastructure

| Component | Path / Location |
|-----------|-----------------|
| Health check script | /workspace/infra/.claude/cluster-health.sh (in-pod) or .claude/cluster-health.sh (repo) |
| Terraform module | modules/kubernetes/openclaw/main.tf |
| CronJob definition | Defined in the OpenClaw Terraform module |
| Existing full healthcheck | scripts/cluster_healthcheck.sh (local-only, 24 checks with color output) |
| Infra repo (in pod) | /workspace/infra |
| kubectl (in pod) | /tools/kubectl |
| terraform (in pod) | /tools/terraform |

Auto-File Incidents for SEV1/SEV2

After running health checks, if SEV1 or SEV2 issues are found (node down, multiple services affected, core service outage, or single important service down), auto-file a GitHub Issue:

Severity Classification

  • SEV1: Node NotReady, multiple services down, data at risk, core service outage (DNS, auth, ingress, databases)
  • SEV2: Single non-core service down, degraded performance, persistent CrashLoopBackOff
  • SEV3: Warnings only, resource pressure <90%, cosmetic — do NOT auto-file

Workflow

  1. Dedup check: Before filing, query open incidents:

    GITHUB_TOKEN=$(vault kv get -field=github_pat secret/viktor)
    curl -s -H "Authorization: token $GITHUB_TOKEN" \
      "https://api.github.com/repos/ViktorBarzin/infra/issues?labels=incident&state=open&per_page=50"
    

    If an open issue already covers the same service/namespace, skip filing.

  2. File the issue with labels incident, sev1 or sev2, postmortem-required (a filing sketch follows this list):

    • Title: [AUTO] <Service/Namespace> — <brief symptom>
    • Body: full diagnostic dump (pod status, events, alerts, node state)
    • The issue-automation GHA workflow will trigger the post-mortem pipeline automatically
  3. Auto-close recovered services: If a service that previously had an auto-filed incident is now healthy:

    # Comment and close
    curl -s -X POST -H "Authorization: token $GITHUB_TOKEN" \
      "https://api.github.com/repos/ViktorBarzin/infra/issues/<N>/comments" \
      -d '{"body": "**Resolved** — Service recovered. Auto-closed by cluster health check."}'
    curl -s -X PATCH -H "Authorization: token $GITHUB_TOKEN" \
      "https://api.github.com/repos/ViktorBarzin/infra/issues/<N>" \
      -d '{"state": "closed"}'
    
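A filing sketch for step 2 (title, labels, and body per the format above; the body shown is truncated):

# File a new incident issue via the GitHub API
curl -s -X POST -H "Authorization: token $GITHUB_TOKEN" \
  "https://api.github.com/repos/ViktorBarzin/infra/issues" \
  -d '{"title": "[AUTO] <Service/Namespace> — <brief symptom>", "labels": ["incident", "sev2", "postmortem-required"], "body": "<diagnostic dump>"}'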

Post-Mortem Auto-Suggest

After running a healthcheck, if the cluster has recovered from an unhealthy state (previous run showed FAIL items that are now resolved), suggest writing a post-mortem:

The cluster has recovered from the previous unhealthy state. Would you like me to write a post-mortem? Run /post-mortem to generate one.

This ensures incidents are documented while context is fresh.

Notes

  1. This script is designed to run inside the OpenClaw pod where kubectl is pre-configured via the ServiceAccount
  2. The full scripts/cluster_healthcheck.sh script runs 24 checks and is meant for local interactive use; this skill's script runs 8 core checks optimized for automated CronJob execution
  3. When investigating issues interactively, prefer running commands directly rather than re-running the script
  4. All Terraform changes must go through the .tf files — never use kubectl apply/edit/patch for persistent changes