[ci skip] Extend cluster healthcheck from 14 to 24 checks
Add 10 new checks covering gaps discovered during incident response: ResourceQuota pressure, StatefulSets, node disk usage, Helm release health, Kyverno policy engine, NFS connectivity, DNS resolution, TLS certificate expiry, GPU health, and Cloudflare tunnel status.
This commit is contained in:
parent
4700743560
commit
86d1d50ad0
2 changed files with 490 additions and 4 deletions
|
|
@ -179,7 +179,7 @@ kubectl get pods -A
|
|||
|
||||
**Cluster Health Check** (`scripts/cluster_healthcheck.sh`):
|
||||
- **ALWAYS use this script** to check cluster health — whether the user asks explicitly, after deploying/updating services, or whenever you need to verify cluster state. Never use ad-hoc kubectl commands to assess overall cluster health; use the script instead.
|
||||
- Runs 14 checks: nodes, resources, conditions, pods, evicted, DaemonSets, deployments, PVCs, HPAs, CronJobs, CrowdSec, ingress, Prometheus alerts, Uptime Kuma
|
||||
- Runs 24 checks: nodes, resources, conditions, pods, evicted, DaemonSets, deployments, PVCs, HPAs, CronJobs, CrowdSec, ingress, Prometheus alerts, Uptime Kuma, ResourceQuota pressure, StatefulSets, node disk, Helm releases, Kyverno, NFS, DNS, TLS certs, GPU, Cloudflare tunnel
|
||||
- **When adding new healthchecks or monitoring**: Always update this script to validate the new component
|
||||
|
||||
**Terraform target examples:**
|
||||
|
|
|
|||
Loading…
Add table
Add a link
Reference in a new issue