SECTIONS ADDED:
- Section 25: Advanced CPU Monitoring (Prometheus node_exporter metrics)
- Section 26: Power Monitoring (DCGM GPU power + host power)

FEATURES:
- 5-minute CPU usage averages (more accurate than kubectl top)
- Tesla T4 GPU power consumption monitoring
- CPU thresholds: 70% warn, 85% critical
- GPU power thresholds: 50W active, 65W high
- Maps IP addresses to friendly node names
- Integrates with existing health check infrastructure

CURRENT STATUS:
- All nodes have healthy disk usage (~10%)
- k8s-node4 flagged at 87% CPU (explains resource pressure)
- GPU operating normally at 30.9W
- Enhanced monitoring prevents issues like node2 containerd corruption

Total health check sections: 26 (was 24)
Addresses node2 incident prevention requirements
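
The thresholds and the IP-to-name mapping described above can be sketched roughly as follows. This is an illustrative Python sketch, not the code in cluster-health.sh. The function names, the inclusive boundaries, the "ok"/"normal" labels, the example IP address, and the PromQL string are all assumptions. Only the numeric thresholds and the node name come from the commit message.

```python
# Thresholds from the commit message. Treating each boundary as inclusive
# (>=) is an assumption; the actual script may compare differently.
CPU_WARN_PCT = 70.0
CPU_CRIT_PCT = 85.0
GPU_ACTIVE_W = 50.0
GPU_HIGH_W = 65.0

# Hypothetical mapping using a documentation-range address, not the real cluster IPs.
NODE_NAMES = {"192.0.2.14": "k8s-node4"}

# A PromQL expression commonly used with node_exporter for a 5-minute CPU
# average. Assumed for illustration; the script's real query is not shown.
CPU_QUERY = (
    '100 - (avg by (instance) '
    '(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)'
)


def friendly_name(instance: str) -> str:
    """Turn a Prometheus instance label such as 'ip:port' into a node name."""
    ip = instance.rsplit(":", 1)[0]  # drop the port if one is present (IPv4 only)
    return NODE_NAMES.get(ip, ip)    # fall back to the raw IP when unmapped


def classify_cpu(pct: float) -> str:
    """Map a 5-minute CPU average (percent) to a health level."""
    if pct >= CPU_CRIT_PCT:
        return "critical"
    if pct >= CPU_WARN_PCT:
        return "warn"
    return "ok"


def classify_gpu_power(watts: float) -> str:
    """Map GPU power draw (watts) to a level."""
    if watts >= GPU_HIGH_W:
        return "high"
    if watts >= GPU_ACTIVE_W:
        return "active"
    return "normal"


if __name__ == "__main__":
    # Values reported under CURRENT STATUS in the commit message.
    print(friendly_name("192.0.2.14:9100"), classify_cpu(87.0))
    print("gpu", classify_gpu_power(30.9))
```

With the values reported under CURRENT STATUS, 87% CPU falls in the critical band and 30.9W sits below the 50W "active" threshold.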
| Name | | |
|---|---|---|
| agents | ||
| commands | ||
| reference | ||
| skills | ||
| calendar-query.py | ||
| CLAUDE.md | ||
| cluster-health.sh | ||
| home-assistant-sofia.py | ||
| home-assistant.py | ||
| internet-mode-used_DO_NOT_REMOVE_MANUALLY_SECURITY_RISK | ||
| pfsense.py | ||
| settings.json | ||