[cluster-health] Expand to 42 checks, remove pod CronJob path
- scripts/cluster_healthcheck.sh: add 12 new checks (cert-manager readiness/expiry/requests, backup freshness per-DB/offsite/LVM, monitoring prom+AM/vault-sealed/CSS, external reachability cloudflared+authentik/ExternalAccessDivergence/traefik-5xx). Bump TOTAL_CHECKS to 42, add --no-fix flag.
- Remove the duplicate pod-version .claude/cluster-health.sh (1728 lines) and the openclaw cluster_healthcheck CronJob (local CLI is now the single authoritative runner). Keep the healthcheck SA + Role + RoleBinding (still reused by task_processor CronJob).
- Remove SLACK_WEBHOOK_URL env from openclaw deployment and delete the unused setup-monitoring.sh.
- Rewrite .claude/skills/cluster-health/SKILL.md: mandates running the script first, refreshes the 42-check table, drops stale CronJob/Slack/post-mortem sections, documents the monorepo-canonical + hardlink layout. File is hardlinked to /home/wizard/code/.claude/skills/cluster-health/SKILL.md for dual discovery.
- AGENTS.md + k8s-portal agent page: 25-check → 42-check.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
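The commit message notes that SKILL.md is hardlinked into a second location for dual discovery. A minimal sketch of how such a link could be verified, assuming bash; the helper name `same_file` and the throwaway paths are hypothetical and not taken from the repository:

```shell
#!/usr/bin/env bash
set -euo pipefail

# same_file (hypothetical helper): succeeds when both paths resolve to the
# same device and inode, i.e. one underlying file (a hardlink, or a symlink
# that resolves to it).
same_file() {
  [[ "$1" -ef "$2" ]]
}

# Demo with throwaway paths; the real SKILL.md locations are not touched.
demo_dir="$(mktemp -d)"
trap 'rm -rf "$demo_dir"' EXIT
mkdir -p "$demo_dir/repo" "$demo_dir/home"
echo "# cluster-health" > "$demo_dir/repo/SKILL.md"
ln "$demo_dir/repo/SKILL.md" "$demo_dir/home/SKILL.md"   # hardlink
cp "$demo_dir/repo/SKILL.md" "$demo_dir/home/COPY.md"    # independent copy

for candidate in SKILL.md COPY.md; do
  if same_file "$demo_dir/repo/SKILL.md" "$demo_dir/home/$candidate"; then
    echo "$candidate: hardlinked"
  else
    echo "$candidate: diverged (separate inode)"
  fi
done
```

Tools that save by writing a new file and renaming it over the old one replace the inode, which silently breaks a hardlink; a check along these lines could catch that drift between the two copies.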
parent 5ea079181f
commit a0d770d9a7
7 changed files with 853 additions and 2105 deletions
@@ -83,7 +83,7 @@ For secrets requiring admin access (shared infra passwords, API keys):
 | `modules/kubernetes/nfs_volume/` | NFS volume module (CSI-backed, soft mount) |
 | `config.tfvars` | Non-secret configuration (plaintext) |
 | `secrets.sops.json` | All secrets (SOPS-encrypted JSON) |
-| `scripts/cluster_healthcheck.sh` | 25-check cluster health script |
+| `scripts/cluster_healthcheck.sh` | 42-check cluster health script |
 | `AGENTS.md` | Full AI agent instructions (auto-loaded by most agents) |
 
 ### Tier System