SECTIONS ADDED:
- Section 25: Advanced CPU Monitoring (Prometheus node_exporter metrics)
- Section 26: Power Monitoring (DCGM GPU power + host power)

FEATURES:
- 5-minute CPU usage averages (more accurate than kubectl top)
- Tesla T4 GPU power consumption monitoring
- CPU thresholds: 70% warn, 85% critical
- GPU power thresholds: 50W active, 65W high
- Maps IP addresses to friendly node names
- Integrates with existing health check infrastructure

CURRENT STATUS:
- All nodes have healthy disk usage (~10%)
- k8s-node4 flagged at 87% CPU (explains resource pressure)
- GPU operating normally at 30.9W
- Enhanced monitoring prevents issues like node2 containerd corruption

Total health check sections: 26 (was 24)
Addresses node2 incident prevention requirements

| | | |
|---|---|---|
| .claude | ||
| .git-crypt | ||
| .planning | ||
| .woodpecker | ||
| cli | ||
| diagram | ||
| docs/plans | ||
| modules | ||
| playbooks | ||
| scripts | ||
| secrets | ||
| stacks | ||
| .gitattributes | ||
| .gitignore | ||
| .sops.yaml | ||
| AGENTS.md | ||
| config.tfvars | ||
| LICENSE.txt | ||
| MEMORY.md | ||
| README.md | ||
| secrets.sops.json | ||
| terragrunt.hcl | ||
| tiers.tf | ||
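
The CPU and GPU power thresholds listed in the commit summary above can be sketched as two small shell helpers. This is only an illustration, not the repo's actual health check: the function names `cpu_status` and `gpu_power_status` are hypothetical, and treating each threshold as inclusive (`>=`) is an assumption.

```shell
#!/bin/sh
# Hypothetical helper: classify an integer CPU percentage.
# Thresholds from the commit summary: 70% warn, 85% critical.
# Assumption: thresholds are inclusive.
cpu_status() {
    pct=$1
    if [ "$pct" -ge 85 ]; then
        echo critical
    elif [ "$pct" -ge 70 ]; then
        echo warn
    else
        echo ok
    fi
}

# Hypothetical helper: classify GPU power draw in watts (may be fractional).
# Thresholds from the commit summary: 50W active, 65W high.
# awk is used because POSIX test(1) only compares integers.
gpu_power_status() {
    awk -v w="$1" 'BEGIN {
        if (w >= 65)      print "high"
        else if (w >= 50) print "active"
        else              print "normal"
    }'
}
```

With the values reported in the summary, `cpu_status 87` yields `critical` and `gpu_power_status 30.9` yields `normal`.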
This repo contains my infra-as-code sources.
My infrastructure is built with Terraform and Kubernetes, and CI/CD runs on Woodpecker CI.
Read more by visiting my website: https://viktorbarzin.me
git-crypt setup
To decrypt the secrets, you need to set up git-crypt.
- Install git-crypt.
- Set up GPG keys on the machine.
- Run `git-crypt unlock`.
This unlocks the secrets in your working copy; they are locked again on commit.
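
The steps above can be sketched as a small shell wrapper. It is a minimal sketch, not part of this repo: the function names `missing_tools` and `unlock_repo` are hypothetical, and it assumes your GPG key has already been granted access to the repository's git-crypt key.

```shell
#!/bin/sh
# Hypothetical helper: print each given command that is not found on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Hypothetical wrapper around the setup steps; run it from the repository root.
unlock_repo() {
    missing=$(missing_tools git gpg git-crypt)
    if [ -n "$missing" ]; then
        echo "missing tools:" $missing >&2
        return 1
    fi
    # Assumption: at least one GPG secret key is present on this machine
    # and that key was added to the repo's git-crypt users.
    if [ -z "$(gpg --list-secret-keys 2>/dev/null)" ]; then
        echo "no GPG secret keys found" >&2
        return 1
    fi
    git-crypt unlock
}
```

After a successful `unlock_repo`, `git-crypt status` lists which files are encrypted in the repository.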