Viktor Barzin
24a23709a5
fix: update healthcheck to report internal and external monitors separately
- Increase Uptime Kuma API timeout to 120s with wait_events=0.2
- Remove hardcoded password, use Vault or UPTIME_KUMA_PASSWORD env var
- Report internal and external monitor status separately
- Install uptime-kuma-api in local venv
[ci skip]
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-14 19:44:20 +00:00
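A minimal sketch of the two ideas in the commit above: resolving the password without hardcoding it, and reporting internal and external monitors separately. The order of precedence (environment variable first, then Vault), the `vault_lookup` callable, the monitor dict shape, and the `is_external` predicate are all assumptions made for illustration. The Uptime Kuma client itself is left out.

```python
import os


def resolve_password(env=os.environ, vault_lookup=None):
    """Return the password from UPTIME_KUMA_PASSWORD, else from a Vault lookup.

    `vault_lookup` is a hypothetical zero-argument callable standing in for
    whatever Vault client the real script uses.
    """
    password = env.get("UPTIME_KUMA_PASSWORD")
    if password:
        return password
    if vault_lookup is not None:
        return vault_lookup()
    raise RuntimeError("set UPTIME_KUMA_PASSWORD or configure Vault")


def summarize(monitors, is_external):
    """Count up monitors and list down monitors, split by internal/external.

    Each monitor is assumed to be a dict with "name" and "up" keys.
    """
    summary = {
        "internal": {"up": 0, "down": []},
        "external": {"up": 0, "down": []},
    }
    for monitor in monitors:
        group = "external" if is_external(monitor) else "internal"
        if monitor["up"]:
            summary[group]["up"] += 1
        else:
            summary[group]["down"].append(monitor["name"])
    return summary
```

Keeping the two groups in separate buckets lets the report show an external outage without it being masked by a healthy internal count, or the reverse.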
Viktor Barzin
38d51ab0af
deprecate TrueNAS: migrate Immich NFS to Proxmox, remove all 10.0.10.15 references [ci skip]
- Migrate Immich (8 NFS PVs, 1.1TB) from TrueNAS to Proxmox host NFS
- Update config.tfvars nfs_server to 192.168.1.127 (Proxmox)
- Update nfs-csi StorageClass share to /srv/nfs
- Update scripts (weekly-backup, cluster-healthcheck) to Proxmox IP
- Delete obsolete TrueNAS scripts (nfs_exports.sh, truenas-status.sh)
- Rewrite nfs-health.sh for Proxmox NFS monitoring
- Update Freedify nfs_music_server default to Proxmox
- Mark CloudSync monitor CronJob as deprecated
- Update Prometheus alert summaries
- Update all architecture docs, AGENTS.md, and reference docs
- Zero PVs remain on TrueNAS — VM ready for decommission
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-13 14:42:07 +00:00
Viktor Barzin
916aa6c6cb
update claude knowledge: OpenClaw deployment and tg wrapper learnings [ci skip]
2026-03-14 23:42:17 +00:00
OpenClaw
4a9bd89b11
feat(health-check): Add Prometheus-based CPU and power monitoring
SECTIONS ADDED:
- Section 25: Advanced CPU Monitoring (Prometheus node_exporter metrics)
- Section 26: Power Monitoring (DCGM GPU power + host power)
FEATURES:
- 5-minute CPU usage averages (more accurate than kubectl top)
- Tesla T4 GPU power consumption monitoring
- CPU thresholds: 70% warn, 85% critical
- GPU power thresholds: 50W active, 65W high
- Maps IP addresses to friendly node names
- Integrates with existing health check infrastructure
CURRENT STATUS:
- All nodes have healthy disk usage (~10%)
- k8s-node4 flagged at 87% CPU (explains resource pressure)
- GPU operating normally at 30.9W
- Enhanced monitoring helps prevent issues like the node2 containerd corruption
Total health check sections: 26 (was 24)
Addresses node2 incident prevention requirements
2026-03-13 07:32:36 +00:00
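The thresholds listed in the commit above can be sketched as two small classifiers. This is an illustration only: whether the real checks treat the boundaries as inclusive, the "ok" and "normal" labels, and the `friendly_name` helper with its mapping argument are assumptions, not taken from the script.

```python
def classify_cpu(usage_percent):
    """Classify a 5-minute CPU average: 70% warn, 85% critical."""
    if usage_percent >= 85:
        return "critical"
    if usage_percent >= 70:
        return "warn"
    return "ok"


def classify_gpu_power(watts):
    """Classify GPU power draw: 50W active, 65W high."""
    if watts >= 65:
        return "high"
    if watts >= 50:
        return "active"
    return "normal"


def friendly_name(instance, names):
    """Map a Prometheus-style "ip:port" instance label to a node name.

    `names` is a hypothetical dict of IP address -> node name; unknown
    instances are returned unchanged.
    """
    host = instance.rsplit(":", 1)[0]
    return names.get(host, instance)
```

With these rules, the 87% CPU reading reported for k8s-node4 falls in the critical band and the 30.9W GPU reading falls below the active threshold.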
OpenClaw
a09967e098
feat(monitoring): Enhance disk monitoring and containerd GC after node2 incident
IMMEDIATE CHANGES (Active Now):
- Lower disk thresholds: WARN at 70%, FAIL at 85% (was 80%/90%)
- More aggressive alerting to prevent containerd corruption
- Enhanced cluster health check disk monitoring
INFRASTRUCTURE CHANGES (Requires Terraform Apply):
- Add containerd garbage collection configuration (30min intervals)
- More aggressive kubelet eviction policies (15%/20% vs 10%/15%)
- Enhanced disk space protection to prevent node2-type failures
Root Cause: node2 disk exhaustion corrupted containerd image store
Prevention: Proactive monitoring + aggressive cleanup policies
[ci skip] - Infrastructure changes require SOPS access for apply
2026-03-13 07:16:56 +00:00
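To make the effect of the lowered disk thresholds concrete, here is a small sketch comparing the old and new pairs. The function name, the "PASS" label, and the inclusive comparisons are assumptions for illustration.

```python
OLD_THRESHOLDS = (80, 90)  # (warn, fail) before this commit
NEW_THRESHOLDS = (70, 85)  # (warn, fail) after this commit


def disk_status(used_percent, thresholds=NEW_THRESHOLDS):
    """Return PASS, WARN, or FAIL for a disk usage percentage."""
    warn, fail = thresholds
    if used_percent >= fail:
        return "FAIL"
    if used_percent >= warn:
        return "WARN"
    return "PASS"
```

A node at 75% disk usage would pass under the old thresholds and warn under the new ones, which is the earlier signal the commit is after.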
Viktor Barzin
bef0c073d5
fix DNS health check: use system resolver instead of hardcoded 10.0.20.101
The check was querying Technitium DNS directly at 10.0.20.101:53, which
refuses connections from non-cluster hosts. Use the system resolver
(no @server flag) so it works from any host or pod environment.
[ci skip]
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-12 09:08:34 +00:00
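The same idea, resolving through the system resolver rather than a pinned DNS server, can be sketched in Python. This is not the script's own check (the commit message implies a `dig`-style query with an `@server` flag); `dns_resolves` and its injectable `resolver` parameter are hypothetical.

```python
import socket


def dns_resolves(hostname, resolver=socket.getaddrinfo):
    """Return True if the hostname resolves via the system resolver.

    socket.getaddrinfo uses the host's configured resolver, so no DNS
    server address is hardcoded. `resolver` is injectable for testing.
    """
    try:
        return len(resolver(hostname, None)) > 0
    except socket.gaierror:
        return False
```

Because nothing here names a server, the check behaves the same from a cluster pod or an outside host, as long as that environment's resolver can answer.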
Viktor Barzin
0638e2cc2e
[ci skip] iSCSI migration, healthcheck fixes, health probes, etcd backup
- Migrate MySQL/PostgreSQL storage from local-path to iscsi-truenas
- Add democratic-csi iSCSI driver module for TrueNAS
- Add open-iscsi to cloud-init VM template
- Fix Shlink health probe path (/api/v3 -> /rest/v3 for Shlink 5.0)
- Fix etcd backup: use etcd 3.5.21-0 (3.6.x is distroless, no /bin/sh)
- Fix cluster healthcheck CronJob: always exit 0 to prevent circular
  JobFailed alerts (reporting via Slack, not exit codes)
- Fix Uptime Kuma nested list handling in cluster-health.sh
- Add health probes to: audiobookshelf, immich ML, ntfy, headscale,
  uptime-kuma, vaultwarden, rybbit (clickhouse + server + client),
  shlink, shlink-web
- Add iSCSI storage documentation to CLAUDE.md
2026-03-06 19:54:21 +00:00
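The "always exit 0" fix in the commit above can be illustrated with a short sketch: failures are collected and passed to a notifier, and the process exit code stays 0 so the CronJob itself never registers as failed. The `run_checks` name, the `(name, callable)` check shape, and the `notify` callback are assumptions; the real script is a shell script that reports to Slack.

```python
def run_checks(checks, notify):
    """Run each check, report failures via `notify`, and always return 0.

    `checks` is a list of (name, zero-argument callable returning bool).
    A check that raises is treated as a failure rather than aborting the run.
    """
    failures = []
    for name, check in checks:
        try:
            ok = check()
        except Exception as exc:
            ok = False
            name = f"{name} ({exc})"
        if not ok:
            failures.append(name)
    notify(failures)
    return 0  # a non-zero code would trigger JobFailed alerts about the checker itself
```

A caller would typically end with `sys.exit(run_checks(checks, notify))`, so the health status travels only through the notification channel.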
Viktor Barzin
87ef313888
[ci skip] fix post-NFS-migration issues: MySQL GR, Loki, grampsweb, alerts
- Loki: reduce memory limit from 6Gi to 4Gi (within LimitRange max)
- Grampsweb: increase memory to 2Gi (was OOMKilled at 512Mi)
- Fix PostgreSQLDown alert: check pod readiness instead of deployment
- Fix MySQLDown alert: check StatefulSet replicas instead of deployment
- Fix RedisDown alert: check StatefulSet replicas instead of deployment
- Fix NFSServerUnresponsive: aggregate all NFS versions cluster-wide
- Fix Uptime Kuma healthcheck: handle nested list heartbeat format
- Update etcd backup image to registry.k8s.io/etcd:3.6.5-0
2026-03-03 21:10:26 +00:00
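For the nested-list heartbeat fix mentioned above, a sketch of tolerant parsing might look like this. The payload shape is an assumption: a dict of monitor id to a list of heartbeat dicts, where the list is sometimes wrapped in one extra list, and where the last entry is the most recent. The function name is hypothetical.

```python
def latest_heartbeats(heartbeats):
    """Map monitor id -> most recent heartbeat, flattening one nesting level."""
    latest = {}
    for monitor_id, beats in heartbeats.items():
        flat = []
        for item in beats:
            if isinstance(item, list):
                flat.extend(item)  # nested form: [[hb, hb, ...]]
            else:
                flat.append(item)  # flat form: [hb, hb, ...]
        if flat:
            latest[monitor_id] = flat[-1]
    return latest
```

Monitors with no heartbeats are omitted from the result instead of being reported as down, which avoids one source of false-down reports.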
Viktor Barzin
00dc78e0d2
[ci skip] Fix Uptime Kuma false-down reports: use bulk heartbeat API instead of per-monitor calls
2026-02-22 01:37:28 +00:00
Viktor Barzin
a5f9c1595f
[ci skip] Fix Python f-string quoting in Slack formatter
2026-02-22 01:23:56 +00:00
Viktor Barzin
b8029351fd
[ci skip] Improve Slack message formatting with human-readable names and structured blocks
2026-02-22 01:16:47 +00:00
Viktor Barzin
a90710950b
[ci skip] Upgrade cluster-health.sh to full 24-check version with Slack
2026-02-22 01:09:02 +00:00
Viktor Barzin
284fee15ca
[ci skip] Fix Slack message formatting to use Block Kit mrkdwn
2026-02-22 01:00:48 +00:00
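As an illustration of the Block Kit mrkdwn format the commit above refers to, the sketch below builds a header block and a section block. The `build_blocks` name, the message wording, and the bullet layout are assumptions; only the block structure (`header` with `plain_text`, `section` with `mrkdwn`) follows Slack's Block Kit.

```python
import json


def build_blocks(title, failures):
    """Build Slack Block Kit blocks for a health report."""
    if failures:
        body = "\n".join(f"• *{name}*" for name in failures)
    else:
        body = "All checks passed"
    return [
        {"type": "header", "text": {"type": "plain_text", "text": title}},
        {"type": "section", "text": {"type": "mrkdwn", "text": body}},
    ]


def build_payload(title, failures):
    """Serialize the blocks as a JSON request body."""
    return json.dumps({"blocks": build_blocks(title, failures)})
```

The section block's `mrkdwn` type is what makes Slack render `*bold*` markers; sending the same text as a plain string field would show the asterisks literally.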
Viktor Barzin
9233276f62
[ci skip] Add cluster health check script for OpenClaw agent
2026-02-22 00:00:47 +00:00