infra

Author	SHA1	Message	Date
Viktor Barzin	a0d770d9a7	[cluster-health] Expand to 42 checks, remove pod CronJob path - scripts/cluster_healthcheck.sh: add 12 new checks (cert-manager readiness/expiry/requests, backup freshness per-DB/offsite/LVM, monitoring prom+AM/vault-sealed/CSS, external reachability cloudflared +authentik/ExternalAccessDivergence/traefik-5xx). Bump TOTAL_CHECKS to 42, add --no-fix flag. - Remove the duplicate pod-version .claude/cluster-health.sh (1728 lines) and the openclaw cluster_healthcheck CronJob (local CLI is now the single authoritative runner). Keep the healthcheck SA + Role + RoleBinding — still reused by task_processor CronJob. - Remove SLACK_WEBHOOK_URL env from openclaw deployment and delete the unused setup-monitoring.sh. - Rewrite .claude/skills/cluster-health/SKILL.md: mandates running the script first, refreshes the 42-check table, drops stale CronJob/Slack/post-mortem sections, documents the monorepo-canonical + hardlink layout. File is hardlinked to /home/wizard/code/.claude/skills/cluster-health/SKILL.md for dual discovery. - AGENTS.md + k8s-portal agent page: 25-check → 42-check. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-19 15:13:03 +00:00
Viktor Barzin	82b7866bc9	[claude-agent-service] Remove orphaned DevVM SSH key wiring ## Context The remote-executor pattern that SSHed into the DevVM (10.0.10.10) to run `claude -p` was fully migrated to the in-cluster service `claude-agent-service.claude-agent.svc:8080/execute` in commits `42f1c3cf` and `99180bec` (2026-04-18). Five parallel codebase audits (GH Actions, Woodpecker + scripts, K8s CronJobs/Deployments, n8n, local scripts/hooks/docs) confirmed zero remaining SSH+claude sites. This commit removes two cleanup artifacts left behind by that migration. ## This change 1. Deletes `.claude/skills/archived/setup-remote-executor.md` — the archived skill doc for the obsolete SSH-based pattern. Already in `archived/`, harmless but noise; deleting prevents anyone copy-pasting the old approach. 2. Removes `kubernetes_secret.ssh_key` from `stacks/claude-agent-service/main.tf`. The Secret was created from the `devvm_ssh_key` field at Vault `secret/ci/infra` but was never mounted into the agent pod. The pod's `git-init` init container uses HTTPS + `$GITHUB_TOKEN` exclusively and force-rewrites every `git@github.com:` and `https://github.com/` URL via `git config url.insteadOf`, so no downstream `git` invocation could fall through to SSH even if it tried. 3. Removes the now-orphaned `data "vault_kv_secret_v2" "ci_secrets"` block — the SSH key resource was its only consumer. ## What is NOT in this change - The `devvm_ssh_key` field at Vault `secret/ci/infra` stays in place. Removing it requires read/modify/put of the full secret and the upside is one unused Vault key. Not worth it without strong justification. - DevVM host decommission is out of scope (separate audit needed for non-Claude users of the host). - Pre-existing `terraform fmt` warnings at lines 464-505 (CronJob alignment) left untouched per no-adjacent-refactor rule. 
## Test plan ### Automated - `terraform fmt -check stacks/claude-agent-service/main.tf` — only the pre-existing lines 464-505 are flagged; no new fmt warnings introduced by these deletions. ### Manual verification 1. `cd infra/stacks/claude-agent-service && ../../scripts/tg apply` 2. Expect exactly one resource destroyed: `kubernetes_secret.ssh_key`. The `ci_secrets` data source removal is plan-time only; does not appear in resource counts. 3. `kubectl -n claude-agent get secret ssh-key` → `NotFound`. 4. `kubectl -n claude-agent get pod` → both pods Running, no restart events. 5. Submit a synthetic agent job via HTTP API to confirm pipeline still works: curl -X POST http://claude-agent-service.claude-agent.svc.cluster.local:8080/execute with a minimal prompt; expect job completes with `exit_code=0`. Closes: code-bck Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-18 13:31:15 +00:00
Viktor Barzin	50e8184d99	[uptime-kuma] Codify MySQL monitor (id=663) via idempotent sync CronJob ## Context Monitor id 663 "MySQL Standalone (dbaas)" was created manually yesterday via the `uptime-kuma-api` Python library when the dbaas stack migrated from InnoDB Cluster to standalone MySQL. It worked and was UP, but lived only in Uptime Kuma's MariaDB — if UK's DB were wiped or restored from an older backup, the monitor would be lost. ## This change Adds declarative, self-healing management for internal-service monitors (databases, non-HTTP endpoints) that can't be discovered from ingress annotations. Modelled on the existing `external-monitor-sync` CronJob. - `local.internal_monitors` — list of desired monitors (name, type, connection string, Vault password key, interval, retries). Seeded with the MySQL Standalone monitor. Add new entries here to manage more. - `kubernetes_secret.internal_monitor_sync` — pulls admin password and all referenced DB passwords from Vault `secret/viktor` at apply time. Secret key names are derived from monitor name (`DB_PASSWORD_<upper_snake>`). - `kubernetes_config_map_v1.internal_monitor_targets` — renders the target list to JSON for the sync container. - `kubernetes_cron_job_v1.internal_monitor_sync` — runs every 10 min, looks up monitors by name, creates if missing, patches if drifted, leaves id and history untouched when already in desired state. ## Why this approach (Option B, not a Terraform provider) The `louislam/uptime-kuma` Terraform provider does NOT exist in the public registry (verified — only a CLI tool of the same name). Option A from the task brief was therefore unavailable. Option B (idempotent K8s CronJob) matches the established pattern in the same module for `external-monitor-sync` — no new machinery introduced. ## Monitor 663: no-op on first sync Manual import was not possible (no provider → no state to import). 
The sync job correctly identifies the existing monitor by name and reports: Monitor MySQL Standalone (dbaas) (id=663) already in desired state Internal monitor sync complete DB heartbeats confirm monitor 663 stayed UP throughout with `status=1` and `Rows: 1` responses every 60s — no disruption. ## Vault key — left manual (by design) `secret/viktor` is not Terraform-managed anywhere in the repo (only read via `data "vault_kv_secret_v2"`). It is a user-edited Vault entry holding 135 keys. The `uptimekuma_db_password` key was added manually yesterday; this change does NOT codify it. Codifying the whole `secret/viktor` entry is out of scope for this task (would need a separate migration + rotation story). The sync job reads the existing value at apply time — so if the value is ever rotated in Vault, the next sync picks it up. ## Plan + apply Plan: 3 to add, 0 to change, 0 to destroy. Apply complete! Resources: 3 added, 0 changed, 0 destroyed. Re-plan: No changes. Your infrastructure matches the configuration. Also updated `.claude/skills/uptime-kuma/SKILL.md` with the new pattern. Closes: code-ed2	2026-04-18 12:04:17 +00:00
Viktor Barzin	5e9e487661	feat(setup-project): auto-PR working Dockerfiles back to upstream ## Context The setup-project skill treats "build from a Dockerfile" as priority 6 — "last resort, avoid if possible" — with no formalized path for apps whose upstream lacks a working Dockerfile. When we end up writing one to get the deploy green, that Dockerfile stays private in the infra repo and upstream never benefits. ## This change Adds a closed-loop flow: when we author a new Dockerfile (or fix a broken upstream one) and the deploy is healthy for 10 minutes, auto-open a PR against the upstream repo so the self-hosting community gets the working recipe. Flow: 1. Classify dockerfile_state during research phase (image-used / used-as-is / fixed-broken-upstream / written-from-scratch). Persist to modules/kubernetes/<service>/.contribution-state.json. 2. After Terraform apply, run scripts/stability-gate.sh — polls pod Ready + HTTP 200 every 30s x 20 iterations, requires 18/20 successes. 3. On pass with a trigger state, scripts/contribute-dockerfile.sh does the GitHub API dance: fork → merge-upstream → branch → commit Dockerfile / .dockerignore / BUILD.md via Contents API → open PR with body rendered from templates/PR_BODY.md. Idempotent (skips on recorded PR URL, existing fork, existing branch, open PR, upstream landed a Dockerfile mid-deploy). GitHub API via curl (gh CLI is sandbox-blocked per .claude/CLAUDE.md); token pulled from Vault (`secret/viktor` → `github_pat`). Commits include Signed-off-by for DCO-enforcing repos. Fork branch name is `add-dockerfile` for written-from-scratch or `fix-dockerfile` for fixed-broken-upstream, with timestamp suffix on collision. 
## Files - SKILL.md — state classification table, quality bar checklist, §8b stability gate, §10 contribute-upstream step, checklist updates - scripts/stability-gate.sh — 10-minute health probe - scripts/contribute-dockerfile.sh — GitHub API orchestrator - templates/PR_BODY.md — `{{VAR}}` placeholder template for PR description - templates/Dockerfile.README.md — BUILD.md template shipped with the PR ## What is NOT in this change - No Woodpecker / GHA changes (skill-local flow). - No auto-tracking of merge/reject outcomes upstream (manual follow-up). - Not yet exercised end-to-end; first real-world run will validate the API dance. Plan to dry-run against a throwaway sink repo before pointing at a real upstream. ## Test Plan ### Automated - bash -n on both scripts → pass - Manual read-through of SKILL.md — step numbering coherent, existing §1-9 untouched semantics, new §8b/§10 reference real files ### Manual Verification 1. Next time setup-project onboards a Dockerfile-less app: - Confirm .contribution-state.json is written with `written-from-scratch` - Run stability-gate.sh — expect 18/20 passes on a healthy deploy - Run contribute-dockerfile.sh — expect a fork + branch + PR on ViktorBarzin - Verify contribution_pr_url is back-written to the state file 2. Re-run contribute-dockerfile.sh → must be a no-op (idempotent) 3. Upstream-archived case: manually archive a test upstream → re-run → expect SKIP, no PR created [ci skip] Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-17 18:12:13 +00:00
Viktor Barzin	26abd8fe94	[skill] Add /disk-wear skill for periodic disk write analysis ## Context After the MySQL standalone migration + Technitium SQLite disable saved ~130 GB/day of disk writes, this methodology should be reusable for periodic health reviews. ## This change: Adds `/disk-wear` skill that combines three data sources: - SSH to PVE host for real-time 30s I/O snapshots and SSD SMART health - Prometheus PromQL for per-app write attribution (node_disk_written_bytes_total joined with node_disk_device_mapper_info for dm->LVM mapping) - kubectl for PVC UUID -> pod/namespace mapping Produces ranked breakdowns by physical disk, VM, k8s namespace, and individual PVC. Includes baselines, red flag detection, and annualized wear projections. Note: container_fs_writes_bytes_total has 0 series (cadvisor doesn't track block device writes per container), so per-app attribution uses the PVE host's dm-device level metrics mapped through Prometheus and kubectl. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-17 11:15:26 +00:00
Viktor Barzin	7bb9ec2934	Add agent task tracking documentation Documents the centralized Beads/Dolt task tracking system used by all Claude Code sessions. Covers architecture, session lifecycle, settings hierarchy, known issues, and E2E test verification. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-15 17:11:26 +00:00
Viktor Barzin	460c68e015	feat: add incident management system with user reporting - Status page (status.viktorbarzin.me): incident cards with SEV badges, expandable timelines, postmortem links, user report rendering - Issue templates on infra repo for user outage reports - CronJob reads incidents + user-reports from ViktorBarzin/infra - "Report an Outage" button on status page links to infra repo - Post-mortem agents restored (4-stage pipeline: triage → investigation → historian → report writer) with updated paths and issue linking - Post-mortem skill/template updated to link reports to GitHub Issues and manage postmortem-required/postmortem-done labels - Labels: incident, sev1-3, user-report, postmortem-required, postmortem-done on infra repo [ci skip] Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-14 20:00:31 +00:00
Viktor Barzin	8badb8181a	feat: post-mortem automation pipeline E2E workflow for incident post-mortems: 1. /post-mortem skill generates structured post-mortem markdown 2. Woodpecker pipeline triggers on docs/post-mortems/*.md changes 3. parse-postmortem-todos.sh extracts safe TODOs (Alert/Config/Monitor) 4. postmortem-todo-resolver agent implements TODOs headlessly 5. Agent updates post-mortem with Follow-up Implementation table Components: - .claude/skills/post-mortem/ — writer skill + template - .claude/agents/postmortem-todo-resolver.md — headless agent - .woodpecker/postmortem-todos.yml — CI pipeline - scripts/parse-postmortem-todos.sh — TODO extractor - cluster-health skill — auto-suggest post-mortem after recovery Safety: only auto-implements Alert/Config/Monitor types. Architecture/Migration/Investigation items are skipped. [ci skip] Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-14 15:34:42 +00:00
Viktor Barzin	cc670d949c	docs: add ha-sofia Version Control add-on to HA skill [ci skip] HomeAssistantVersionControl v1.2.0 installed on ha-sofia for git-based config tracking. Auto-commits on file change, pushes hourly to private GitHub repo ViktorBarzin/ha-sofia-config.	2026-04-12 11:37:02 +01:00
Viktor Barzin	469fcb12b5	remove duplicate deploy-app skill, now global agent [ci skip]	2026-03-23 00:17:53 +02:00
Viktor Barzin	e51c063600	docs(add-user): update skill with actual working flow (no auto TF apply)	2026-03-18 00:28:46 +00:00
Viktor Barzin	fd130971aa	feat(provision): automated user provisioning via Authentik webhook - Expand CI Vault policy: write secret/data/platform + Transit SOPS keys - Add Woodpecker provision-user.yml pipeline (manual event, API-triggered) - Add env vars to webhook-handler deployment for Woodpecker/Authentik integration - Update add-user skill with automated flow documentation - Update Woodpecker repo ID list in CLAUDE.md	2026-03-17 23:56:30 +00:00
Viktor Barzin	ccbcebb670	feat(vault): automate SOPS onboarding for namespace-owners - Add Transit mount + per-stack Transit keys to vault stack TF - Auto-create sops-user-<name> policy scoping decrypt to owned stacks - Auto-create sops-<name> external group + alias for Authentik mapping - Add sops-admin policy to authentik-admins group - Attach sops-user policy to namespace-owner identity entities - Update add-user skill with SOPS onboarding steps and Authentik group - Adding a user to k8s_users + applying vault stack = full SOPS access [ci skip]	2026-03-17 23:15:25 +00:00
Viktor Barzin	0abb6b83ad	add deploy-app skill and agent for automated repo→app deployment [ci skip]	2026-03-16 18:06:24 +00:00
Viktor Barzin	6c8a42b4e3	add add-user skill for cluster onboarding Interactive skill that collects user info, updates Vault KV k8s_users, and applies vault/platform/woodpecker stacks. Includes verification checklist and auto-generated resource table.	2026-03-15 22:28:54 +00:00
Viktor Barzin	160fda882f	authentik: cleanup unused resources + add invitation enrollment flow [ci skip] Cleanup: - Deleted 5 unused flows (enrollment-inviation, headscale-auth/authz, default-enrollment, oauth-enrollment) - Deleted 8 orphaned stages bound only to deleted flows - Deleted authentik Read-only group and role (0 users) - Deleted 2 unbound policies (map github username, Map Google Attributes) Invitation enrollment: - Created invitation-enrollment flow with 5 stages (invitation validation, identification with social login, prompt, user write, auto-login) - Set all OAuth sources (Google/GitHub/Facebook) enrollment_flow to invitation-enrollment - New users can only sign up via single-use invitation links - Added authentik-invite.sh script for invitation management - Updated reference docs and authentik skill	2026-03-13 22:21:10 +00:00
Viktor Barzin	7cc7991ce6	[ci skip] claudeception: extract 2 skills from today's session 1. sops-age-secrets-migration: Complete guide for migrating from git-crypt to SOPS+age. Covers JSON format requirement, race condition avoidance, CI integration, complex types, and migration sequence. 2. iterative-plan-review-with-subagents: Design pattern for reviewing plans with parallel security + implementation subagents. 2-3 iterations to zero CRITICALs. Used successfully for the SOPS migration design.	2026-03-07 15:46:36 +00:00
Viktor Barzin	5907e50fda	[ci skip] update ha-london skill: SSH is hassio@192.168.8.103 (HA OS) Old Pi at 192.168.8.104 no longer runs HA. Updated SSH host, user, config path, and platform info to reflect HA OS on 192.168.8.103.	2026-03-07 14:34:44 +00:00
Viktor Barzin	bcbe8b23b4	[ci skip] archive 28 unused skills, add runbook index to CLAUDE.md, add cluster-health agent - Move 28 never-invoked troubleshooting runbook skills to .claude/skills/archived/ - Keep 7 active workflow skills: cluster-health, uptime-kuma, pfsense, home-assistant, setup-project, extend-vm-storage, k8s-ndots - Add one-line runbook index to CLAUDE.md for quick reference - Create cluster-health-checker custom agent (haiku model, read-only + bash) for autonomous health checks without consuming main context	2026-03-06 23:17:40 +00:00
Viktor Barzin	53be356f41	[ci skip] add clickhouse-k8s-nfs-system-log-bloat skill, update GPU skill with auto-recovery New skill: ClickHouse on K8s/NFS burns CPU from unbounded system log tables and background merges. Covers config.d mount crash (exit code 36), CronJob truncation workaround, and diagnostic commands. Updated: k8s-gpu-no-nvidia-devices v1.1.0 — added automatic GPU recovery via liveness probe pattern (nvidia-smi + app health check).	2026-03-01 21:04:19 +00:00
Viktor Barzin	f2c66f070b	[ci skip] add nfsv4-idmapd-uid-mapping skill, cross-ref from NFS troubleshooting New skill documenting the NFSv4 idmapd UID mapping crisis where all file UIDs show as 65534 (nobody) inside K8s containers. Root cause: containers auto-negotiate NFSv4.2, and idmapd domain mismatch maps all UIDs to nobody. Fix: v4_v3owner=true on TrueNAS for numeric UID passthrough.	2026-03-01 18:14:37 +00:00
Viktor Barzin	4beadc2ca2	[ci skip] add openclaw-k8s-deployment skill from claudeception Extracts all non-obvious gotchas from deploying OpenClaw on Kubernetes: - wizard block required for Telegram, exec.host valid values, - VPA resource overrides, file permissions, startup command, - modelrelay sidecar, NFS caching strategy	2026-03-01 18:10:33 +00:00
Viktor Barzin	f7acc31d83	[ci skip] update NFS mount skill: add stale mount variant after node reboots New variant documents ghost Running pods with frozen processes after kured rolling reboots. Key diagnostic: Running 1/1 but zero listening sockets from ss -tlnp. Fix: force-delete pods to get fresh NFS mounts.	2026-02-28 19:38:30 +00:00
Viktor Barzin	abe89c926e	[ci skip] Refactor knowledge: CLAUDE.md 881→190 lines, extract reference data CLAUDE.md changes: - Extract service catalog + Cloudflare domains → .claude/reference/service-catalog.md - Extract Proxmox VMs, hardware, network → .claude/reference/proxmox-inventory.md - Extract GitHub/Drone API patterns → .claude/reference/github-drone-api.md - Extract Authentik state snapshot → .claude/reference/authentik-state.md - Remove Init Container pattern (duplicates setup-project skill) - Remove Poison Fountain service notes (duplicates Anti-AI section) - Consolidate Authentik section (link to skills + reference) - Remove resource limit tables (kept tier definitions inline) Skill merges (37→32): - helm-release-force-rerender + helm-stuck-release-recovery → helm-release-troubleshooting - containerd-multi-registry-pull-through-cache + k8s-docker-registry-cache-bypass → k8s-container-image-caching - (traefik merges in previous commits)	2026-02-22 22:11:31 +00:00
Viktor Barzin	d3d0b4281c	[ci skip] Merge 3 Traefik skills into traefik-helm-configuration Consolidated traefik-http3-quic, traefik-udp-cross-namespace, and traefik-plugin-download-failure-404 into a single skill with sections for HTTP/3 (QUIC), UDP cross-namespace routing, and plugin download failure troubleshooting.	2026-02-22 22:09:26 +00:00
Viktor Barzin	92a90d129a	[ci skip] Merge 2 rewrite-body skills into traefik-rewrite-body-troubleshooting	2026-02-22 22:09:03 +00:00
Viktor Barzin	7557c8ca4a	[ci skip] Add rewrite-body Accept header skill, update NFS skill New skill: traefik-rewrite-body-accept-header — rewrite-body plugin silently skips injection when request Accept header doesn't contain text/html (curl default Accept: */* doesn't match). Updated: k8s-nfs-mount-troubleshooting v1.1.0 — added variant for non-root container UID permission denied on NFS writes.	2026-02-22 21:41:07 +00:00
Viktor Barzin	8b5b389f31	[ci skip] Add cluster-health skill for OpenClaw agent	2026-02-22 00:04:15 +00:00
Viktor Barzin	9b2ec7716e	[ci skip] Add skills: pfsense-nat-rule-creation, coturn-k8s-without-hostnetwork	2026-02-21 18:29:32 +00:00
Viktor Barzin	f3361e3a47	[ci skip] Add Music Assistant librespot stale credentials skill New skill: music-assistant-librespot-wrong-account - Documents fix for Spotify playback failing with "librespot does not support free accounts" when cached credentials point to wrong Spotify account - Includes step-by-step solution: find container, inspect cache, clear and restart Updated: home-assistant skill with Music Assistant addon details for ha-sofia	2026-02-21 11:23:24 +00:00
Viktor Barzin	41d3358cc1	[ci skip] Add skills: authentik-oidc-kubernetes, kubelet-static-pod-manifest-update Two skills extracted from multi-user k8s access implementation: - authentik-oidc-kubernetes: 6 gotchas for Authentik OIDC + kube-apiserver - kubelet-static-pod-manifest-update: full restart cycle for static pod changes	2026-02-17 22:56:03 +00:00
Viktor Barzin	7e73965bdd	[ci skip] Add Authentik management skill for API-based identity provider control	2026-02-17 22:55:41 +00:00
Viktor Barzin	7e3286e572	[ci skip] Pass skill secrets to moltbot container and fix Python env - Add skill_secrets variable to moltbot module with HA tokens and Uptime Kuma password as container env vars - Install Python packages (requests, caldav, icalendar, uptime-kuma-api) in init container with PYTHONPATH for main container access - Update all skills to use python3 directly instead of ~/.venvs/claude venv path that doesn't exist in the container - Remove hardcoded Uptime Kuma password from skill, use env var	2026-02-17 21:53:32 +00:00
Viktor Barzin	5a2803736d	[ci skip] Import Claude skills into OpenClaw moltbot - Convert setup-project and extend-vm-storage from standalone .md to directory-based SKILL.md format with YAML frontmatter - Add symlink in moltbot init container to expose Claude skills at ~/.openclaw/skills/ for auto-discovery by OpenClaw - Update CLAUDE.md skill path references	2026-02-17 21:09:12 +00:00
Viktor Barzin	80ea818476	[ci skip] Add pfsense-dnsmasq-interface-binding skill, update ndots skill to v1.1.0	2026-02-16 22:30:57 +00:00
Viktor Barzin	6f33c3008f	[ci skip] Add skill: k8s-ndots-search-domain-nxdomain-flood Documents how Kubernetes ndots:5 search domain expansion floods external DNS with NxDomain queries, and the CoreDNS template block fix.	2026-02-15 21:52:27 +00:00
Viktor Barzin	3da35166ab	[ci skip] Add skills: helm-stuck-release-recovery, k8s-hpa-scaling-storm, crowdsec-agent-registration-failure	2026-02-15 17:18:17 +00:00
Viktor Barzin	606a79078e	[ci skip] Add skills: containerd-multi-registry-pull-through-cache, traefik-plugin-download-failure-404	2026-02-15 14:36:50 +00:00
Viktor Barzin	a7f2d6b9e6	[ci skip] Add uptime-kuma management skill with tiered monitoring	2026-02-15 14:35:53 +00:00
Viktor Barzin	c473663b98	[ci skip] Add pfSense firewall management skill	2026-02-14 12:42:10 +00:00
Viktor Barzin	ca43b97fa0	[ci skip] Add skills: loki-helm-deployment-pitfalls, grafana-stale-datasource-cleanup	2026-02-13 23:47:45 +00:00
Viktor Barzin	08ea489fe0	[ci skip] Add extend-vm-storage script and skills - Script to automate K8s node VM disk expansion (drain, shutdown, resize, boot, expand FS, uncordon) - Skill docs for the workflow and troubleshooting pitfalls (growpart, macOS grep -P, drain timeouts) - Successfully tested on k8s-node2, k8s-node3, k8s-node4 (64G → 128G)	2026-02-13 22:08:46 +00:00
Viktor Barzin	92f392f64c	[ci skip] Add skill: local-llm-gpu-selection	2026-02-13 19:26:19 +00:00
Viktor Barzin	d48052276e	[ci skip] Add skill: traefik-rewrite-body-compression Extracted from debugging session where packruler/rewrite-body plugin corrupted gzip responses, breaking HA Companion app auth flow and WebSocket connections. Fix: strip Accept-Encoding header before rewrite-body plugin so backends send uncompressed responses.	2026-02-11 21:42:07 +00:00
Viktor Barzin	c82f82af57	[ci skip] Add ingress-factory-migration skill	2026-02-10 21:31:48 +00:00
Viktor Barzin	945d2d90a7	[ci skip] update claude knowledge: always apply cloudflared module for DNS When deploying a new service, the cloudflared module must also be applied to create the Cloudflare DNS record. Updated CLAUDE.md and setup-project skill.	2026-02-08 02:30:19 +00:00
Viktor Barzin	7f871d7675	[ci skip] update add-service skill: require NFS setup before deployment Add step 3 (NFS Storage Setup) to ensure NFS directories are created and exported on TrueNAS before deploying services that need persistent storage. Prevents pods getting stuck in ContainerCreating due to missing NFS mounts.	2026-02-08 01:51:44 +00:00
Viktor Barzin	4671ef34a3	[ci skip] Add LLM agents, voice stack, and automations to ha-london knowledge map	2026-02-07 22:40:12 +00:00
Viktor Barzin	c6a05d8e26	[ci skip] Add ha-london knowledge map: RPi Docker setup, smart plugs, air quality, e-bike ha-london runs on Raspberry Pi at 192.168.8.104 (Docker rootless, HA 2025.9.1). Key systems: TP-Link Kasa smart plugs with energy monitoring, Apollo AIR-1 air quality sensor (ESPHome), Cowboy e-bike, UptimeRobot, Oral-B BLE toothbrush. SSH access via pi@192.168.8.104, config at /home/pi/docker/homeAssistant/.	2026-02-07 22:39:20 +00:00
Viktor Barzin	f8c25d9c23	[ci skip] Add skill: traefik-udp-cross-namespace Extracted from debugging DNS forwarding through Traefik v3. Documents two non-obvious requirements for custom UDP entrypoints in the Helm chart: expose.default=true (port not added to Service by default) and allowCrossNamespace=true (IngressRouteUDP cross-namespace refs blocked by default). Both issues compound silently.	2026-02-07 22:25:54 +00:00
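
Commit 5e9e487661 describes the stability gate as polling pod Ready + HTTP 200 every 30s for 20 iterations and requiring 18/20 successes. The threshold logic can be sketched roughly as below. This is a minimal illustration, not the contents of `scripts/stability-gate.sh`: the function name `stability_gate`, the `probe` callable, and the injectable `sleep` parameter are all hypothetical, and a probe that raises is simply counted as a failure.

```python
import time
from typing import Callable


def stability_gate(
    probe: Callable[[], bool],
    iterations: int = 20,        # from the commit message: 20 iterations
    interval_s: float = 30.0,    # from the commit message: every 30s
    required: int = 18,          # from the commit message: 18/20 successes
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True when at least `required` of `iterations` probes succeed.

    `probe` is a hypothetical callable standing in for the combined
    "pod Ready + HTTP 200" check; how that check is done is not shown here.
    """
    successes = 0
    for i in range(iterations):
        try:
            ok = bool(probe())
        except Exception:
            # Assumption: an erroring probe counts as a failed iteration.
            ok = False
        if ok:
            successes += 1
        # Wait between probes, but not after the final one.
        if i < iterations - 1:
            sleep(interval_s)
    return successes >= required
```

Passing `sleep=lambda _: None` lets the gate be exercised in tests without waiting; tolerating two failed probes out of twenty is what allows a brief blip without failing the whole gate.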

58 commits
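
Commit 50e8184d99 describes the internal monitor sync as: look up monitors by name, create if missing, patch if drifted, and leave id and history untouched when already in the desired state. A minimal sketch of that decision step follows. It is an assumption-laden illustration rather than the CronJob's actual code: `plan_action`, the dictionary shapes, and the field names `type` and `interval` are hypothetical, and no Uptime Kuma API calls are made.

```python
from typing import Any, Dict, Tuple

Monitor = Dict[str, Any]


def plan_action(
    desired: Monitor, existing_by_name: Dict[str, Monitor]
) -> Tuple[str, Monitor]:
    """Decide what the sync should do for one desired monitor.

    Returns ("create", desired) when no monitor with that name exists,
    ("patch", drifted_fields) when some desired field differs, and
    ("noop", {}) when the existing monitor already matches.
    Fields present only on the existing monitor (such as its id) are
    ignored, so an in-sync monitor is never touched.
    """
    current = existing_by_name.get(desired["name"])
    if current is None:
        return ("create", desired)
    # Only compare the fields the desired spec actually declares.
    drift = {k: v for k, v in desired.items() if current.get(k) != v}
    if drift:
        return ("patch", drift)
    return ("noop", {})


# Illustrative data only; the real monitor's fields are not shown in the log.
desired = {"name": "MySQL Standalone (dbaas)", "type": "mysql", "interval": 60}
existing = {
    "MySQL Standalone (dbaas)": {
        "id": 663,
        "name": "MySQL Standalone (dbaas)",
        "type": "mysql",
        "interval": 60,
    }
}
print(plan_action(desired, existing))
```

Matching on name rather than id is what makes the first sync a no-op for a monitor that was created by hand, as the commit reports for monitor 663.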