[cluster-health] Expand to 42 checks, remove pod CronJob path

- scripts/cluster_healthcheck.sh: add 12 new checks (cert-manager
  readiness/expiry/requests, backup freshness per-DB/offsite/LVM,
  monitoring prom+AM/vault-sealed/CSS, external reachability cloudflared
  +authentik/ExternalAccessDivergence/traefik-5xx). Bump TOTAL_CHECKS
  to 42, add --no-fix flag.
- Remove the duplicate pod-version .claude/cluster-health.sh (1728
  lines) and the openclaw cluster_healthcheck CronJob (local CLI is
  now the single authoritative runner). Keep the healthcheck SA +
  Role + RoleBinding — still reused by task_processor CronJob.
- Remove SLACK_WEBHOOK_URL env from openclaw deployment and delete
  the unused setup-monitoring.sh.
- Rewrite .claude/skills/cluster-health/SKILL.md: mandates running
  the script first, refreshes the 42-check table, drops stale
  CronJob/Slack/post-mortem sections, documents the monorepo-canonical
  + hardlink layout. File is hardlinked to
  /home/wizard/code/.claude/skills/cluster-health/SKILL.md for
  dual discovery.
- AGENTS.md + k8s-portal agent page: 25-check → 42-check.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
Viktor Barzin 2026-04-19 15:13:03 +00:00
parent 5ea079181f
commit a0d770d9a7
7 changed files with 853 additions and 2105 deletions

View file

@ -83,7 +83,7 @@ For secrets requiring admin access (shared infra passwords, API keys):
| \`modules/kubernetes/nfs_volume/\` | NFS volume module (CSI-backed, soft mount) |
| \`config.tfvars\` | Non-secret configuration (plaintext) |
| \`secrets.sops.json\` | All secrets (SOPS-encrypted JSON) |
| \`scripts/cluster_healthcheck.sh\` | 25-check cluster health script |
| \`scripts/cluster_healthcheck.sh\` | 42-check cluster health script |
| \`AGENTS.md\` | Full AI agent instructions (auto-loaded by most agents) |
### Tier System