[ci skip] Extend cluster healthcheck from 14 to 24 checks

Add 10 new checks covering gaps discovered during incident response:
ResourceQuota pressure, StatefulSets, node disk usage, Helm release
health, Kyverno policy engine, NFS connectivity, DNS resolution,
TLS certificate expiry, GPU health, and Cloudflare tunnel status.
This commit is contained in:
Viktor Barzin 2026-02-21 23:57:04 +00:00
parent 4700743560
commit 86d1d50ad0
No known key found for this signature in database
GPG key ID: 0EB088298288D958
2 changed files with 490 additions and 4 deletions

View file

@ -179,7 +179,7 @@ kubectl get pods -A
**Cluster Health Check** (`scripts/cluster_healthcheck.sh`):
- **ALWAYS use this script** to check cluster health — whether the user asks explicitly, after deploying/updating services, or whenever you need to verify cluster state. Never use ad-hoc kubectl commands to assess overall cluster health; use the script instead.
- Runs 14 checks: nodes, resources, conditions, pods, evicted, DaemonSets, deployments, PVCs, HPAs, CronJobs, CrowdSec, ingress, Prometheus alerts, Uptime Kuma
- Runs 24 checks: nodes, resources, conditions, pods, evicted, DaemonSets, deployments, PVCs, HPAs, CronJobs, CrowdSec, ingress, Prometheus alerts, Uptime Kuma, ResourceQuota pressure, StatefulSets, node disk, Helm releases, Kyverno, NFS, DNS, TLS certs, GPU, Cloudflare tunnel
- **When adding new healthchecks or monitoring**: Always update this script to validate the new component
**Terraform target examples:**