infra

Author	SHA1	Message	Date
Viktor Barzin	b64d8d6168	cluster-health: add #47 ghost-disk drift check; fix immich_search set -e crash Check #47 "Proxmox CSI — Ghost-Disk Drift": per node, compares the real virtio-scsi CSI disk count in `qm config <vmid>` (SSH PVE) against the attached proxmox-CSI VolumeAttachments k8s tracks. Catches orphaned "ghost" disks left by failed detaches (query-pci QMP timeouts) that the scheduler's 28-LUN guard can't see, exactly the drift that wedged the MAM grabber on node3 (13 tracked vs 23 real). PASS reconciled; WARN drift>0 or real 20-24; FAIL real ≥25 (near the LUN cap). Already flagging node6 at 21 disks. Single `qm list` + one `qm config` per VM keeps it ~3s (the naive once-per-VM version timed out the parallel runner). Also fixes a PRE-EXISTING set -e crash in #46 immich_search (introduced by 138894cd): `pct=$(kubectl exec … | tr -d ' ')` and the dur_ms probe were unguarded, so with `set -o pipefail` a non-zero psql/exec propagated and tripped `set -e`, killing the check before json_add. It silently dropped from every parallel report and broke --serial entirely (whole run aborted). Guarded both substitutions with `|| true`; the existing `=~` numeric checks already handle the empty case. immich_search now reports PASS/WARN instead of vanishing.	2026-06-05 09:19:10 +00:00
Viktor Barzin	f201e4573e	immich: fix slow context search — prewarm clip_index + latency alert/healthcheck Context (smart) search latency was caused by the 665MB vchord clip_index decaying out of PG shared_buffers (~33% resident -> ~1.8s cold ANN reads vs ~4ms warm), NOT by yesterday's ML MODEL_TTL/clip-keepalive change (CLIP textual is warm ~15ms on GPU). The postStart prewarm runs once at pod start and pg_prewarm.autoprewarm only re-warms at startup, so the index decays under job buffer-pressure over days. - clip-index-prewarm CronJob (immich, /5): pg_prewarm('clip_index') keeps the whole index resident -> searches stay ~4ms. - immich-search-probe CronJob (immich, /5): times a random-vector ANN query + reads clip_index residency, pushes gauges to the Pushgateway. - Prometheus alerts ImmichSmartSearchSlow / ImmichClipIndexColdCache / ImmichSearchProbeStale (+ inhibition when the probe is stale). - cluster_healthcheck.sh check #46 check_immich_search (TOTAL_CHECKS 45->46). - Docs: infra CLAUDE.md immich note, monitoring.md, cluster-health skill. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 09:19:07 +00:00
Viktor Barzin	a2fa912b44	cluster-health: add check #45 — HA Sofia Status Dashboard Mirrors the verdict of emo's curated Барзини → Статус Lovelace view (dashboard-barzini / path 'status', 8 sections, ~43 mushroom-template cards). Pulls the dashboard config via the HA WebSocket API (one-shot, shared cache), batch-renders every card's secondary Jinja template against /api/template in a single POST, and classifies the rendered text per card: FAIL — contains "Offline" / "Disconnected" / "Разкачен" / "— No data" WARN — contains "⚠️" / "Abnormal" / "Trouble (" / "(ниска)" / "Пълен резервоар" / "Грешка" / "attention" / "Внимание" Roll-up is a single check with a per-section breakdown (Сигурност 0F/0W/4P; Мрежа 0F/1W/10P; …). On WARN/FAIL the non-quiet non-JSON path lists each offending card with its rendered status line. Verified live against ha-sofia: 2 offline devices (Пералня, Гардероб спалня) and 1 degraded (NAS_Barzini volume attention, 7% free) surfaced correctly in both human and JSON output.	2026-06-05 09:19:06 +00:00
Viktor Barzin	247afdb220	cluster-health skill: document tightened #43 thermal threshold (65 C)	2026-05-22 14:09:12 +00:00
Viktor Barzin	a0d770d9a7	[cluster-health] Expand to 42 checks, remove pod CronJob path - scripts/cluster_healthcheck.sh: add 12 new checks (cert-manager readiness/expiry/requests, backup freshness per-DB/offsite/LVM, monitoring prom+AM/vault-sealed/CSS, external reachability cloudflared +authentik/ExternalAccessDivergence/traefik-5xx). Bump TOTAL_CHECKS to 42, add --no-fix flag. - Remove the duplicate pod-version .claude/cluster-health.sh (1728 lines) and the openclaw cluster_healthcheck CronJob (local CLI is now the single authoritative runner). Keep the healthcheck SA + Role + RoleBinding — still reused by task_processor CronJob. - Remove SLACK_WEBHOOK_URL env from openclaw deployment and delete the unused setup-monitoring.sh. - Rewrite .claude/skills/cluster-health/SKILL.md: mandates running the script first, refreshes the 42-check table, drops stale CronJob/Slack/post-mortem sections, documents the monorepo-canonical + hardlink layout. File is hardlinked to /home/wizard/code/.claude/skills/cluster-health/SKILL.md for dual discovery. - AGENTS.md + k8s-portal agent page: 25-check → 42-check. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-19 15:13:03 +00:00
Viktor Barzin	7bb9ec2934	Add agent task tracking documentation Documents the centralized Beads/Dolt task tracking system used by all Claude Code sessions. Covers architecture, session lifecycle, settings hierarchy, known issues, and E2E test verification. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-15 17:11:26 +00:00
Viktor Barzin	8badb8181a	feat: post-mortem automation pipeline E2E workflow for incident post-mortems: 1. /post-mortem skill generates structured post-mortem markdown 2. Woodpecker pipeline triggers on docs/post-mortems/*.md changes 3. parse-postmortem-todos.sh extracts safe TODOs (Alert/Config/Monitor) 4. postmortem-todo-resolver agent implements TODOs headlessly 5. Agent updates post-mortem with Follow-up Implementation table Components: - .claude/skills/post-mortem/ — writer skill + template - .claude/agents/postmortem-todo-resolver.md — headless agent - .woodpecker/postmortem-todos.yml — CI pipeline - scripts/parse-postmortem-todos.sh — TODO extractor - cluster-health skill — auto-suggest post-mortem after recovery Safety: only auto-implements Alert/Config/Monitor types. Architecture/Migration/Investigation items are skipped. [ci skip] Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-14 15:34:42 +00:00
Viktor Barzin	8b5b389f31	[ci skip] Add cluster-health skill for OpenClaw agent	2026-02-22 00:04:15 +00:00

8 commits
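
The `set -e` crash described in b64d8d6168 can be reproduced in isolation. The following is a minimal sketch, not the code from `cluster_healthcheck.sh`: `failing_probe`, `healthy_probe`, `read_pct`, `classify_pct` and `unguarded_output` are hypothetical names, and the probes stand in for the real `kubectl exec … psql` call. It shows why an unguarded command substitution aborts under `set -o pipefail`, and how `|| true` plus a numeric `=~` check turns the failure into a WARN.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical stand-in for a probe whose psql/exec fails:
# prints nothing and exits non-zero.
failing_probe() { return 1; }

# Hypothetical stand-in for a healthy probe with padded output.
healthy_probe() { echo ' 97 '; }

# Run a probe through a pipeline, as in the commit message.
# With pipefail the pipeline returns the probe's non-zero status, so the
# plain assignment would trip `set -e`; `|| true` absorbs that status and
# leaves pct empty instead.
read_pct() {
  local probe="$1" pct
  pct=$("$probe" | tr -d ' ') || true
  printf '%s' "$pct"
}

# Numeric check: anything that is not a plain integer, including the
# empty string from a failed probe, is reported as WARN.
classify_pct() {
  local pct="$1"
  if [[ "$pct" =~ ^[0-9]+$ ]]; then
    echo "PASS pct=$pct"
  else
    echo "WARN pct unavailable"
  fi
}

# Unguarded variant, run in a separate bash process so its abort can be
# observed without killing this script. The echo is never reached.
unguarded_output() {
  bash -c 'set -euo pipefail; pct=$(false | tr -d " "); echo "reached: [$pct]"' || true
}
```

Calling `classify_pct "$(read_pct failing_probe)"` yields a WARN line, while `unguarded_output` prints nothing because the child shell exits at the assignment. Note that `local pct=$(…)` on a single line would hide the failure for a different reason (the status of `local` wins), which is why the sketch declares and assigns separately.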