infra/.claude
OpenClaw f30c62ee5c feat(health-check): Add Prometheus-based CPU and power monitoring
SECTIONS ADDED:
- Section 25: Advanced CPU Monitoring (Prometheus node_exporter metrics)
- Section 26: Power Monitoring (DCGM GPU power + host power)

FEATURES:
- 5-minute CPU usage averages from Prometheus (steadier than the short-window snapshot from kubectl top)
- Tesla T4 GPU power consumption monitoring
- CPU thresholds: 70% warn, 85% critical
- GPU power thresholds: 50W active, 65W high
- Maps IP addresses to friendly node names
- Integrates with existing health check infrastructure
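
A minimal sketch of how the 5-minute CPU average and the IP-to-name mapping could fit together, written in Python for illustration only (the actual logic lives in cluster-health.sh, whose contents are not shown here). The CPU expression is the common node_exporter idle-time idiom, and `DCGM_FI_DEV_POWER_USAGE` is assumed to be the dcgm-exporter power gauge in use. The IP addresses in `NODE_NAMES`, the helper names, and the sample response are hypothetical.

```python
import json

# Common node_exporter idiom: busy % = 100 * (1 - idle fraction), over a 5-minute window.
CPU_5M_QUERY = '100 * (1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])))'
# Assumed dcgm-exporter gauge for GPU power draw, in watts.
GPU_POWER_QUERY = "DCGM_FI_DEV_POWER_USAGE"

# Hypothetical IP-to-name table; the real addresses are not given in the commit.
NODE_NAMES = {
    "10.0.0.11": "k8s-node1",
    "10.0.0.14": "k8s-node4",
}


def friendly_name(instance, names=NODE_NAMES):
    """Strip the ':port' suffix from an instance label and look up a node name."""
    ip = instance.rsplit(":", 1)[0]
    # Fall back to the bare address when the node is not in the table.
    return names.get(ip, ip)


def parse_instant_vector(body, names=NODE_NAMES):
    """Turn a Prometheus /api/v1/query JSON response into {node_name: value}."""
    payload = json.loads(body)
    if payload.get("status") != "success":
        raise ValueError("Prometheus query failed")
    readings = {}
    for series in payload["data"]["result"]:
        instance = series["metric"].get("instance", "unknown")
        # Instant-vector samples are [timestamp, "value-as-string"] pairs.
        _timestamp, raw_value = series["value"]
        readings[friendly_name(instance, names)] = float(raw_value)
    return readings


# Made-up sample response in the shape Prometheus returns for an instant query.
SAMPLE_RESPONSE = json.dumps({
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"instance": "10.0.0.14:9100"}, "value": [1773766262, "87.0"]},
        ],
    },
})

if __name__ == "__main__":
    print(parse_instant_vector(SAMPLE_RESPONSE))
```

Keeping the parsing separate from the HTTP call means the mapping can be checked against a canned response without a running Prometheus server.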

CURRENT STATUS:
- All nodes have healthy disk usage (~10%)
- k8s-node4 flagged at 87% CPU, above the 85% critical threshold (explains resource pressure)
- GPU operating normally at 30.9W
- Enhanced monitoring is intended to help prevent issues like the node2 containerd corruption
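
The thresholds listed under FEATURES can be applied to the readings above as in the following sketch. It is an illustration in Python, not the code in cluster-health.sh. The function names, the "ok" and "normal" labels for readings below the lowest threshold, and the choice to treat a reading exactly at a threshold as crossing it are all assumptions.

```python
# Thresholds taken from the commit message.
CPU_WARN_PCT = 70.0
CPU_CRIT_PCT = 85.0
GPU_ACTIVE_W = 50.0
GPU_HIGH_W = 65.0


def classify_cpu(pct):
    """Map a CPU usage percentage to a status label (labels are assumed)."""
    if pct >= CPU_CRIT_PCT:
        return "critical"
    if pct >= CPU_WARN_PCT:
        return "warn"
    return "ok"


def classify_gpu_power(watts):
    """Map a GPU power draw in watts to a status label (labels are assumed)."""
    if watts >= GPU_HIGH_W:
        return "high"
    if watts >= GPU_ACTIVE_W:
        return "active"
    return "normal"


if __name__ == "__main__":
    # Readings reported in CURRENT STATUS.
    print("k8s-node4 CPU:", classify_cpu(87.0))
    print("GPU power:", classify_gpu_power(30.9))
```

With the reported readings, 87% CPU falls in the critical band and 30.9W sits below the 50W active threshold.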

Total health check sections: 26 (was 24)
Addresses the prevention requirements from the node2 incident
2026-03-17 16:51:02 +00:00
| Name | Last commit | Date |
|---|---|---|
| agents | [ci skip] refactor claude files: compact CLAUDE.md, clean memory, remove generic agents | 2026-03-06 23:27:46 +00:00 |
| commands | [ci skip] update kubectl skill to use local kubeconfig | 2026-02-07 13:42:35 +00:00 |
| reference | resource quota review: fix OOM risks, close quota gaps, add HA protections | 2026-03-08 18:17:46 +00:00 |
| skills | [ci skip] claudeception: extract 2 skills from today's session | 2026-03-07 15:46:36 +00:00 |
| calendar-query.py | add claude [ci skip] | 2026-02-06 20:10:02 +00:00 |
| CLAUDE.md | [ci skip] add sealed secrets convention: fileset + kubernetes_manifest pattern | 2026-03-08 20:03:50 +00:00 |
| cluster-health.sh | feat(health-check): Add Prometheus-based CPU and power monitoring | 2026-03-17 16:51:02 +00:00 |
| home-assistant-sofia.py | [ci skip] Add ha-sofia Home Assistant deployment to skills | 2026-02-07 21:26:05 +00:00 |
| home-assistant.py | add claude [ci skip] | 2026-02-06 20:10:02 +00:00 |
| internet-mode-used_DO_NOT_REMOVE_MANUALLY_SECURITY_RISK | add claude [ci skip] | 2026-02-06 20:10:02 +00:00 |
| pfsense.py | [ci skip] Add pfSense firewall management skill | 2026-02-14 12:42:10 +00:00 |
| settings.json | add claude files [ci skip] | 2026-01-18 15:40:43 +00:00 |