infra

Author	SHA1	Message	Date
Viktor Barzin	ae252f9116	cluster-health: ha_integrations — skip disabled + ignored config entries check_ha_integrations counted any config entry with state=not_loaded as a problem, but HA marks intentionally-off entries that way too: disabled_by set (user/integration disabled it) and source=="ignore" (a discovered integration the user chose to ignore — never meant to load). On ha-sofia 2026-06-04 this false-WARNed on 6 entries that are all intentional — wyoming faster-whisper/piper + ollama (disabled_by=user) and mass_queue/dlna_dms(EMO-LAPTOP2)/yalexs_ble (source=ignore). Skip disabled/ignored entries; only genuine setup_error/setup_retry/ not_loaded (without disabled/ignore) now flag. Verified: check #27 -> PASS "All 96 integrations loaded".	2026-06-05 09:19:11 +00:00
Viktor Barzin	31b8104b43	cluster-health: uptime_kuma check — only count status==0 as down check_uptime_kuma flagged a monitor as down whenever its last heartbeat status != 1, and treated "no beats" as down too. But uptime-kuma status 2 = PENDING (mid-retry) and 3 = MAINTENANCE are not outages, and no-beats = no data. So a monitor caught in a momentary pending/retry state at check time produced a false "internal/external down(N)" WARN — observed twice on 2026-06-04 (Novelapp, then ha-sofia) for monitors uptime-kuma itself logged ZERO downs against over 24h (0/2880 and 0/288 beats). Count a monitor as down ONLY on an explicit DOWN beat (status==0); pending, maintenance, and no-data are not-down. Real outages still flag (uptime-kuma persists status==0 beats for genuine downs).	2026-06-05 09:19:10 +00:00
Viktor Barzin	b64d8d6168	cluster-health: add #47 ghost-disk drift check; fix immich_search set -e crash Check #47 "Proxmox CSI — Ghost-Disk Drift": per node, compares the real virtio-scsi CSI disk count in `qm config <vmid>` (SSH PVE) against the attached proxmox-CSI VolumeAttachments k8s tracks. Catches orphaned "ghost" disks left by failed detaches (query-pci QMP timeouts) that the scheduler's 28-LUN guard can't see — exactly the drift that wedged the MAM grabber on node3 (13 tracked vs 23 real). PASS reconciled; WARN drift>0 or real 20-24; FAIL real ≥25 (near the LUN cap). Already flagging node6 at 21 disks. Single `qm list` + one `qm config` per VM keeps it ~3s (the naive once-per-VM version timed out the parallel runner). Also fixes a PRE-EXISTING set -e crash in #46 immich_search (introduced by 138894cd): `pct=$(kubectl exec … \| tr -d ' ')` and the dur_ms probe were unguarded, so with `set -o pipefail` a non-zero psql/exec propagated and tripped `set -e`, killing the check before json_add. It silently dropped from every parallel report and broke --serial entirely (whole run aborted). Guarded both substitutions with `\|\| true`; the existing `=~` numeric checks already handle the empty case. immich_search now reports PASS/WARN instead of vanishing.	2026-06-05 09:19:10 +00:00
Viktor Barzin	23d87d8885	cluster-health #20 : fix false NFS FAIL on Linux (nc -G is macOS-only) The NFS connectivity check fell through to `nc -z -G 3 192.168.1.127 2049` when `showmount` is absent (the DevVM ships no nfs-common). But `-G` is a macOS/Darwin-only connect-timeout flag — OpenBSD/GNU nc on Linux rejects it with "invalid option -- 'G'", so the elif failed and the check reported "NFS unreachable" on every Linux run even though port 2049 was wide open (confirmed via /dev/tcp). All deployment/PVC/statefulset checks were green throughout — a real PVE NFS outage would have taken down 30+ services. Fix: use the portable `-w` timeout flag, and add a final bash /dev/tcp fallback so the probe is correct even on hosts with neither showmount nor a usable nc.	2026-06-05 09:19:07 +00:00
Viktor Barzin	f201e4573e	immich: fix slow context search — prewarm clip_index + latency alert/healthcheck Context (smart) search latency was caused by the 665MB vchord clip_index decaying out of PG shared_buffers (~33% resident -> ~1.8s cold ANN reads vs ~4ms warm), NOT by yesterday's ML MODEL_TTL/clip-keepalive change (CLIP textual is warm ~15ms on GPU). The postStart prewarm runs once at pod start and pg_prewarm.autoprewarm only re-warms at startup, so the index decays under job buffer-pressure over days. - clip-index-prewarm CronJob (immich, /5): pg_prewarm('clip_index') keeps the whole index resident -> searches stay ~4ms. - immich-search-probe CronJob (immich, /5): times a random-vector ANN query + reads clip_index residency, pushes gauges to the Pushgateway. - Prometheus alerts ImmichSmartSearchSlow / ImmichClipIndexColdCache / ImmichSearchProbeStale (+ inhibition when the probe is stale). - cluster_healthcheck.sh check #46 check_immich_search (TOTAL_CHECKS 45->46). - Docs: infra CLAUDE.md immich note, monitoring.md, cluster-health skill. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 09:19:07 +00:00
Viktor Barzin	a2fa912b44	cluster-health: add check #45 — HA Sofia Status Dashboard Mirrors the verdict of emo's curated Барзини → Статус Lovelace view (dashboard-barzini / path 'status', 8 sections, ~43 mushroom-template cards). Pulls the dashboard config via the HA WebSocket API (one-shot, shared cache), batch-renders every card's secondary Jinja template against /api/template in a single POST, and classifies the rendered text per card: FAIL — contains "Offline" / "Disconnected" / "Разкачен" / "— No data" WARN — contains "⚠️" / "Abnormal" / "Trouble (" / "(ниска)" / "Пълен резервоар" / "Грешка" / "attention" / "Внимание" Roll-up is a single check with a per-section breakdown (Сигурност 0F/0W/4P; Мрежа 0F/1W/10P; …). On WARN/FAIL the non-quiet non-JSON path lists each offending card with its rendered status line. Verified live against ha-sofia: 2 offline devices (Пералня, Гардероб спалня) and 1 degraded (NAS_Barzini volume attention, 7% free) surfaced correctly in both human and JSON output.	2026-06-05 09:19:06 +00:00
Viktor Barzin	f677794379	cluster_healthcheck.sh: run checks in parallel (~3x speedup) Each check function only reads cluster state and mutates in-memory counters; that makes it safe to isolate each one in a subshell, write stdout to a per-check temp file, and replay outputs in original order after all jobs finish. Counters/JSON_RESULTS replicated through marker lines (###HCK###PASS:N etc.) so the aggregate state matches the serial run exactly. Pre-fetch the HA Sofia cache once in the parent so the four HA checks share a single API round-trip instead of each subshell re-fetching. Auto-fix mode forces --serial so mutation order stays deterministic. New flags: --parallel N (default 12, env HEALTHCHECK_PARALLEL_JOBS), --serial. Diminishing returns past ~12 workers. Benchmark (--quiet, 44 checks): 53s serial -> 18s parallel-12.	2026-05-27 19:46:40 +00:00
Viktor Barzin	4830230984	cluster-health #43 : tighten PVE thermal threshold to 65 C Per Viktor: healthy baseline range is 55-65 C; anything above 65 C is a signal a VM/workload is using too much CPU and warrants investigation. Previous thresholds were calibrated to the hardware's TjMax (75/83 C) — that was too lax, since cluster-load-driven elevation arrives a long time before throttling. The 65 C cutoff matches the live Prometheus baseline (Apr 20-May 8 2026: peak 61-69 C, avg 51-55 C) and the session-observed correlation: above 65 C means the cluster is doing sustained work that should be looked at, even if hardware is still nowhere near its limit. Updated: PASS < 65 C (within 55-65 baseline) WARN 65-82 C (elevated; check top kvm processes for the culprit) FAIL >= 83 C (at/above TjMax — throttling imminent) Verified live: 67 C now WARN (was PASS under the 75 C threshold).	2026-05-22 14:09:08 +00:00
Viktor Barzin	8228171104	cluster-health: add checks 43 + 44 (PVE host thermals + load) Both new checks SSH read-only to the PVE host and emit PASS/WARN/FAIL via the standard healthcheck output + JSON. They run alongside the existing 42 checks and surface the same alerts the 2026-05-20/21 optimization session had to gather by hand. #43 PVE Host Thermals — Xeon E5-2699v4 package + per-core temps Reads every /sys/class/hwmon/hwmon0/temp*_input in one SSH round-trip. Thresholds tuned to the live TjMax=83 / Tcrit=93: PASS < 75 °C package WARN 75-82 °C (approaching max, action time) FAIL >= 83 °C (at/above TjMax, throttling imminent) Reports hottest core label too so a single hot core doesn't hide in the package average. #44 PVE Host Load — load avg vs 44-thread capacity Reads /proc/loadavg, compares 5-min to thread count (44): PASS load_5 < 30 (< 70% threads busy) WARN 30-37 (oversubscribed but not saturating) FAIL >= 38 (~85%+ threads busy — scheduler saturation) Uses 5-min so brief work spikes don't false-fail. Both gracefully WARN-degrade if SSH BatchMode fails, matching the existing check 36 (LVM PVC snapshots) pattern. TOTAL_CHECKS bumped 42 -> 44 and the dispatcher updated.	2026-05-22 09:55:11 +00:00
Viktor Barzin	545cd2b854	healthcheck: probe uptime-kuma via internal Service (port-forward), not public URL The Uptime Kuma check was hitting https://uptime.viktorbarzin.me, which sits behind Authentik forward-auth. Authentik 302-redirects the Socket.IO handshake the uptime-kuma-api library uses, and the library can't complete the OAuth flow, so every healthcheck reported "Connection failed" even though the pod was healthy and serving 225 monitors. Fix: open a transient `kubectl port-forward` to svc/uptime-kuma in the uptime-kuma namespace for the duration of the check, connect the library to http://127.0.0.1:<port> (no auth gate), then SIGKILL the port-forward on the way out. The disown is to suppress bash's "Killed" job notification on stderr, which corrupted stdout when stderr was merged for JSON parsing. Verified end-to-end: healthcheck now reports the real signal — "external down(3): www, xray-vless, hermes-agent" — the same 3 Cloudflare-facing endpoints flagging in the uptime-kuma logs.	2026-05-11 20:02:57 +00:00
Viktor Barzin	2f0e8c88a9	healthcheck: tune noise filters + nvidia-exporter auth=none Six tuning changes to cluster_healthcheck.sh so PASS sections actually reflect "nothing to act on": 1. prometheus_alerts: only count severity=warning\|critical. Info-level alerts (RecentNodeReboot soak, PVAutoExpanding) are by design — the alert rule itself sets severity; the script should respect it. 2. tls_certs: lower WARN threshold 30d → 14d. cnpg-webhook-cert auto-rotates at 7d before expiry, kyverno tls pairs at 15d, the Lets Encrypt wildcard renews weekly; <14d is the only window where human attention is genuinely useful. 3. ha_entities: skip mobile_app/device_tracker/notify/button/scene/ event/image/update domains (transient by design), skip friendly names containing iphone/ipad/macbook/tv/bravia/laptop/etc., and only count entities whose last_changed > 24h. Was 431/1470, most of which were "phone in standby" noise. 4. ha_automations: only flag DISABLED automations as abandoned if they've also been untouched (last_changed) for >180 days; raise stale threshold 30d → 180d. Was flagging seasonal/holiday-only automations as broken. 5. problematic_pods + evicted_pods: exclude pods owned by Jobs. CronJob retry leftovers (Error/Failed phase pods that K8s keeps around for log inspection) aren't problematic at the cluster level. 6. uptime_kuma: retry the WebSocket login 3x with backoff. Single- shot failures were a recurring false-positive even though the service was healthy. Also: nvidia-exporter ingress auth=required → auth=none. HA Sofia's nvidia REST sensors (Tesla_T4_GPU_Temperature, Power_Usage, etc.) poll /metrics and got 302'd to Authentik like the idrac/snmp ones did. Same fix. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-10 22:27:39 +00:00
Viktor Barzin	64c71615e8	scripts: cluster_healthcheck defaults to ~/.kube/config The previous default of $(pwd)/config required running the script from the infra/ directory or always passing --kubeconfig. From a parent shell or any other working directory, the lookup hit a non-existent file and kubectl returned a stale-token error, masking real check results. Now: use $KUBECONFIG if set, then ~/.kube/config, then fall back to $(pwd)/config for backwards compatibility. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-10 18:12:40 +00:00
Viktor Barzin	4bedabb9e8	healthcheck: fix three false-positive WARNs (HA token, cert-manager, LVM snap grep) - HA Sofia token: auto-bootstrap from Vault secret/viktor/haos_api_token when HOME_ASSISTANT_SOFIA_{URL,TOKEN} env vars are unset. Default URL = https://ha-sofia.viktorbarzin.me. - cert-manager: add cert_manager_installed() probe (kubectl get crd certificates.cert-manager.io). When not installed — which is our current state — report PASS "N/A" instead of noisy WARN "CRDs unavailable". - LVM snapshot freshness: grep pattern was `-- -snap` but actual LV names use underscore (`foo_snap_YYYY...`), so the grep matched nothing and the check always WARN'd. Fixed to `grep _snap`. After fix: PASS 36→40, WARN 9→6, FAIL 1→1 (new ha_entities FAIL is a real HA issue, not a script bug — 400/1401 sensors stale on ha-sofia).	2026-04-19 22:13:32 +00:00
Viktor Barzin	a0d770d9a7	[cluster-health] Expand to 42 checks, remove pod CronJob path - scripts/cluster_healthcheck.sh: add 12 new checks (cert-manager readiness/expiry/requests, backup freshness per-DB/offsite/LVM, monitoring prom+AM/vault-sealed/CSS, external reachability cloudflared +authentik/ExternalAccessDivergence/traefik-5xx). Bump TOTAL_CHECKS to 42, add --no-fix flag. - Remove the duplicate pod-version .claude/cluster-health.sh (1728 lines) and the openclaw cluster_healthcheck CronJob (local CLI is now the single authoritative runner). Keep the healthcheck SA + Role + RoleBinding — still reused by task_processor CronJob. - Remove SLACK_WEBHOOK_URL env from openclaw deployment and delete the unused setup-monitoring.sh. - Rewrite .claude/skills/cluster-health/SKILL.md: mandates running the script first, refreshes the 42-check table, drops stale CronJob/Slack/post-mortem sections, documents the monorepo-canonical + hardlink layout. File is hardlinked to /home/wizard/code/.claude/skills/cluster-health/SKILL.md for dual discovery. - AGENTS.md + k8s-portal agent page: 25-check → 42-check. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-19 15:13:03 +00:00
Viktor Barzin	24a23709a5	fix: update healthcheck to report internal and external monitors separately - Increase Uptime Kuma API timeout to 120s with wait_events=0.2 - Remove hardcoded password, use Vault or UPTIME_KUMA_PASSWORD env var - Report internal and external monitor status separately - Install uptime-kuma-api in local venv [ci skip] Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-14 19:44:20 +00:00
Viktor Barzin	38d51ab0af	deprecate TrueNAS: migrate Immich NFS to Proxmox, remove all 10.0.10.15 references [ci skip] - Migrate Immich (8 NFS PVs, 1.1TB) from TrueNAS to Proxmox host NFS - Update config.tfvars nfs_server to 192.168.1.127 (Proxmox) - Update nfs-csi StorageClass share to /srv/nfs - Update scripts (weekly-backup, cluster-healthcheck) to Proxmox IP - Delete obsolete TrueNAS scripts (nfs_exports.sh, truenas-status.sh) - Rewrite nfs-health.sh for Proxmox NFS monitoring - Update Freedify nfs_music_server default to Proxmox - Mark CloudSync monitor CronJob as deprecated - Update Prometheus alert summaries - Update all architecture docs, AGENTS.md, and reference docs - Zero PVs remain on TrueNAS — VM ready for decommission Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-13 14:42:07 +00:00
Viktor Barzin	9e2ac5fbb5	feat: add hardware exporter checks to cluster healthcheck (check #30 ) Verifies snmp-exporter, idrac-redfish-exporter, proxmox-exporter, and tuya-bridge pods are running, plus checks Prometheus scrape targets (snmp-idrac, snmp-ups, redfish-idrac, proxmox-host) are UP.	2026-04-06 14:58:46 +03:00
Viktor Barzin	72d832fee7	add HA Sofia checks (26-29) to cluster healthcheck and backup-dr docs - Healthcheck: add entity availability, integration health, automation status, and system resources checks for Home Assistant Sofia - Docs: add backup-dr architecture documentation	2026-04-06 11:57:36 +03:00
Viktor Barzin	422dadafe5	[ci skip] replace resource overcommitment check with actual usage Check real CPU/memory usage via kubectl top nodes instead of limits-vs-allocatable ratios. Thresholds: >80% WARN, >90% FAIL. Limits overcommit is expected with 70+ services on 3 worker nodes.	2026-03-06 20:28:55 +00:00
Viktor Barzin	87ef313888	[ci skip] fix post-NFS-migration issues: MySQL GR, Loki, grampsweb, alerts - Loki: reduce memory limit from 6Gi to 4Gi (within LimitRange max) - Grampsweb: increase memory to 2Gi (was OOMKilled at 512Mi) - Fix PostgreSQLDown alert: check pod readiness instead of deployment - Fix MySQLDown alert: check StatefulSet replicas instead of deployment - Fix RedisDown alert: check StatefulSet replicas instead of deployment - Fix NFSServerUnresponsive: aggregate all NFS versions cluster-wide - Fix Uptime Kuma healthcheck: handle nested list heartbeat format - Update etcd backup image to registry.k8s.io/etcd:3.6.5-0	2026-03-03 21:10:26 +00:00
Viktor Barzin	69c4c0c76e	[ci skip] VPA: reduce LimitRange defaults, add overcommit check, protect tier-0 - Reduce Kyverno LimitRange default limits ~4x across all tiers to fix 800-900% memory overcommitment on worker nodes - Add cluster health check #25: per-node resource overcommitment showing requests and limits vs allocatable capacity - Add Kyverno policy for Goldilocks VPA mode by tier: tier-0 namespaces get VPA Off mode (recommend only, no evictions) to prevent downtime on critical infra (traefik, cloudflared, authentik, technitium, etc.) - Non-tier-0 namespaces get VPA Auto mode for active right-sizing	2026-02-26 23:15:43 +00:00
Viktor Barzin	d041459ef2	[ci skip] Upgrade Woodpecker CI v3.5.1 → v3.13.0, fix helm healthcheck for v4	2026-02-23 20:14:30 +00:00
Viktor Barzin	db659b1f7a	[ci skip] Fix dashy OOMKilled and healthcheck DNS false-failure - Add explicit resource limits to dashy (2Gi memory) to prevent OOMKilled during webpack build on startup - Rewrite DNS healthcheck to test from inside the Technitium pod via kubectl exec, since MetalLB virtual IPs aren't reachable from outside the L2 network - Deleted orphaned kured/tls-secret (expired Oct 2025, module disabled, not mounted by kured DaemonSet)	2026-02-22 12:46:12 +00:00
Viktor Barzin	00dc78e0d2	[ci skip] Fix Uptime Kuma false-down reports: use bulk heartbeat API instead of per-monitor calls	2026-02-22 01:37:28 +00:00
Viktor Barzin	98b711ff8d	[ci skip] Extend cluster healthcheck from 14 to 24 checks Add 10 new checks covering gaps discovered during incident response: ResourceQuota pressure, StatefulSets, node disk usage, Helm release health, Kyverno policy engine, NFS connectivity, DNS resolution, TLS certificate expiry, GPU health, and Cloudflare tunnel status.	2026-02-21 23:57:04 +00:00
Viktor Barzin	038d4434c4	[ci skip] Fix health check false positives for completed CronJob pods	2026-02-21 19:56:39 +00:00
Viktor Barzin	2bae6ccce3	Add Uptime Kuma monitor check to cluster health script [ci skip] Adds check #14 that queries Uptime Kuma API for application-level monitor status, complementing the kubectl-level checks with HTTP/ping health data. Reports down monitors by name with PASS/WARN/FAIL thresholds.	2026-02-15 17:49:40 +00:00
Viktor Barzin	9c4ff21d58	Add cluster health check script with 13 diagnostic sections [ci skip]	2026-02-15 17:34:22 +00:00

28 commits