infra

Author	SHA1	Message	Date
Viktor Barzin	4bedabb9e8	healthcheck: fix three false-positive WARNs (HA token, cert-manager, LVM snap grep) - HA Sofia token: auto-bootstrap from Vault secret/viktor/haos_api_token when HOME_ASSISTANT_SOFIA_{URL,TOKEN} env vars are unset. Default URL = https://ha-sofia.viktorbarzin.me. - cert-manager: add cert_manager_installed() probe (kubectl get crd certificates.cert-manager.io). When not installed — which is our current state — report PASS "N/A" instead of noisy WARN "CRDs unavailable". - LVM snapshot freshness: grep pattern was `-- -snap` but actual LV names use underscore (`foo_snap_YYYY...`), so the grep matched nothing and the check always WARN'd. Fixed to `grep _snap`. After fix: PASS 36→40, WARN 9→6, FAIL 1→1 (new ha_entities FAIL is a real HA issue, not a script bug — 400/1401 sensors stale on ha-sofia).	2026-04-19 22:13:32 +00:00
Viktor Barzin	a0d770d9a7	[cluster-health] Expand to 42 checks, remove pod CronJob path - scripts/cluster_healthcheck.sh: add 12 new checks (cert-manager readiness/expiry/requests, backup freshness per-DB/offsite/LVM, monitoring prom+AM/vault-sealed/CSS, external reachability cloudflared+authentik/ExternalAccessDivergence/traefik-5xx). Bump TOTAL_CHECKS to 42, add --no-fix flag. - Remove the duplicate pod-version .claude/cluster-health.sh (1728 lines) and the openclaw cluster_healthcheck CronJob (local CLI is now the single authoritative runner). Keep the healthcheck SA + Role + RoleBinding, still reused by task_processor CronJob. - Remove SLACK_WEBHOOK_URL env from openclaw deployment and delete the unused setup-monitoring.sh. - Rewrite .claude/skills/cluster-health/SKILL.md: mandates running the script first, refreshes the 42-check table, drops stale CronJob/Slack/post-mortem sections, documents the monorepo-canonical + hardlink layout. File is hardlinked to /home/wizard/code/.claude/skills/cluster-health/SKILL.md for dual discovery. - AGENTS.md + k8s-portal agent page: 25-check → 42-check. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-19 15:13:03 +00:00
Viktor Barzin	24a23709a5	fix: update healthcheck to report internal and external monitors separately - Increase Uptime Kuma API timeout to 120s with wait_events=0.2 - Remove hardcoded password, use Vault or UPTIME_KUMA_PASSWORD env var - Report internal and external monitor status separately - Install uptime-kuma-api in local venv [ci skip] Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-14 19:44:20 +00:00
Viktor Barzin	38d51ab0af	deprecate TrueNAS: migrate Immich NFS to Proxmox, remove all 10.0.10.15 references [ci skip] - Migrate Immich (8 NFS PVs, 1.1TB) from TrueNAS to Proxmox host NFS - Update config.tfvars nfs_server to 192.168.1.127 (Proxmox) - Update nfs-csi StorageClass share to /srv/nfs - Update scripts (weekly-backup, cluster-healthcheck) to Proxmox IP - Delete obsolete TrueNAS scripts (nfs_exports.sh, truenas-status.sh) - Rewrite nfs-health.sh for Proxmox NFS monitoring - Update Freedify nfs_music_server default to Proxmox - Mark CloudSync monitor CronJob as deprecated - Update Prometheus alert summaries - Update all architecture docs, AGENTS.md, and reference docs - Zero PVs remain on TrueNAS — VM ready for decommission Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-13 14:42:07 +00:00
Viktor Barzin	9e2ac5fbb5	feat: add hardware exporter checks to cluster healthcheck (check #30) Verifies snmp-exporter, idrac-redfish-exporter, proxmox-exporter, and tuya-bridge pods are running, plus checks Prometheus scrape targets (snmp-idrac, snmp-ups, redfish-idrac, proxmox-host) are UP.	2026-04-06 14:58:46 +03:00
Viktor Barzin	72d832fee7	add HA Sofia checks (26-29) to cluster healthcheck and backup-dr docs - Healthcheck: add entity availability, integration health, automation status, and system resources checks for Home Assistant Sofia - Docs: add backup-dr architecture documentation	2026-04-06 11:57:36 +03:00
Viktor Barzin	422dadafe5	[ci skip] replace resource overcommitment check with actual usage Check real CPU/memory usage via kubectl top nodes instead of limits-vs-allocatable ratios. Thresholds: >80% WARN, >90% FAIL. Limits overcommit is expected with 70+ services on 3 worker nodes.	2026-03-06 20:28:55 +00:00
Viktor Barzin	87ef313888	[ci skip] fix post-NFS-migration issues: MySQL GR, Loki, grampsweb, alerts - Loki: reduce memory limit from 6Gi to 4Gi (within LimitRange max) - Grampsweb: increase memory to 2Gi (was OOMKilled at 512Mi) - Fix PostgreSQLDown alert: check pod readiness instead of deployment - Fix MySQLDown alert: check StatefulSet replicas instead of deployment - Fix RedisDown alert: check StatefulSet replicas instead of deployment - Fix NFSServerUnresponsive: aggregate all NFS versions cluster-wide - Fix Uptime Kuma healthcheck: handle nested list heartbeat format - Update etcd backup image to registry.k8s.io/etcd:3.6.5-0	2026-03-03 21:10:26 +00:00
Viktor Barzin	69c4c0c76e	[ci skip] VPA: reduce LimitRange defaults, add overcommit check, protect tier-0 - Reduce Kyverno LimitRange default limits ~4x across all tiers to fix 800-900% memory overcommitment on worker nodes - Add cluster health check #25: per-node resource overcommitment showing requests and limits vs allocatable capacity - Add Kyverno policy for Goldilocks VPA mode by tier: tier-0 namespaces get VPA Off mode (recommend only, no evictions) to prevent downtime on critical infra (traefik, cloudflared, authentik, technitium, etc.) - Non-tier-0 namespaces get VPA Auto mode for active right-sizing	2026-02-26 23:15:43 +00:00
Viktor Barzin	d041459ef2	[ci skip] Upgrade Woodpecker CI v3.5.1 → v3.13.0, fix helm healthcheck for v4	2026-02-23 20:14:30 +00:00
Viktor Barzin	db659b1f7a	[ci skip] Fix dashy OOMKilled and healthcheck DNS false-failure - Add explicit resource limits to dashy (2Gi memory) to prevent OOMKilled during webpack build on startup - Rewrite DNS healthcheck to test from inside the Technitium pod via kubectl exec, since MetalLB virtual IPs aren't reachable from outside the L2 network - Deleted orphaned kured/tls-secret (expired Oct 2025, module disabled, not mounted by kured DaemonSet)	2026-02-22 12:46:12 +00:00
Viktor Barzin	00dc78e0d2	[ci skip] Fix Uptime Kuma false-down reports: use bulk heartbeat API instead of per-monitor calls	2026-02-22 01:37:28 +00:00
Viktor Barzin	98b711ff8d	[ci skip] Extend cluster healthcheck from 14 to 24 checks Add 10 new checks covering gaps discovered during incident response: ResourceQuota pressure, StatefulSets, node disk usage, Helm release health, Kyverno policy engine, NFS connectivity, DNS resolution, TLS certificate expiry, GPU health, and Cloudflare tunnel status.	2026-02-21 23:57:04 +00:00
Viktor Barzin	038d4434c4	[ci skip] Fix health check false positives for completed CronJob pods	2026-02-21 19:56:39 +00:00
Viktor Barzin	2bae6ccce3	Add Uptime Kuma monitor check to cluster health script [ci skip] Adds check #14 that queries Uptime Kuma API for application-level monitor status, complementing the kubectl-level checks with HTTP/ping health data. Reports down monitors by name with PASS/WARN/FAIL thresholds.	2026-02-15 17:49:40 +00:00
Viktor Barzin	9c4ff21d58	Add cluster health check script with 13 diagnostic sections [ci skip]	2026-02-15 17:34:22 +00:00

16 commits