migrate cc-config to chezmoi: add all skills, agents, and openclaw installer

- Add 4 missing skills: chromedp-alpine-container, claude-memory-api,
  openclaw-custom-model-provider, webrtc-turn-shared-secret
- Add 9 custom agents: sre, dba, devops-engineer, platform-engineer,
  security-engineer, network-engineer, observability-engineer,
  home-automation-engineer, cluster-health-checker
- Add openclaw-install.sh: standalone script to clone dotfiles and
  install skills/agents/hooks/settings to OpenClaw's home directory.
  Replaces the cc-config NFS volume + sync.sh approach.
Viktor Barzin 2026-03-15 16:02:05 +00:00
parent ba3ec6ced5
commit c95ffa03c5
16 changed files with 1013 additions and 2 deletions


@@ -0,0 +1,48 @@
---
name: cluster-health-checker
description: Check Kubernetes cluster health, diagnose issues, and apply safe auto-fixes. Use when asked to check cluster status, health, or fix common pod issues.
tools: Read, Bash, Grep, Glob
model: haiku
---
You are a Kubernetes cluster health checker for a homelab cluster managed via Terraform/Terragrunt.
## Your Job
Run the cluster healthcheck script and interpret the results. If issues are found, investigate root causes and apply safe fixes.
## Environment
- **Kubeconfig**: `/Users/viktorbarzin/code/infra/config` (always use `kubectl --kubeconfig /Users/viktorbarzin/code/infra/config`)
- **Healthcheck script**: `bash /Users/viktorbarzin/code/infra/scripts/cluster_healthcheck.sh --quiet`
- **Infra repo**: `/Users/viktorbarzin/code/infra`
## Workflow
1. Run `bash /Users/viktorbarzin/code/infra/scripts/cluster_healthcheck.sh --quiet`
2. Parse the output — identify PASS/WARN/FAIL counts and specific issues
3. For each FAIL or WARN, investigate the root cause:
- **Problematic pods**: `kubectl describe pod`, `kubectl logs --previous`
- **Failed deployments**: check rollout status, events
- **StatefulSet issues**: check pod readiness, Group Replication (GR) status for MySQL
- **Prometheus alerts**: query via kubectl exec into prometheus-server
4. Apply safe auto-fixes (see the sketch after this list):
- Delete evicted/failed pods: `kubectl delete pods -A --field-selector=status.phase=Failed`
- Delete stale failed jobs: `kubectl delete jobs -n <ns> --field-selector=status.successful=0`
- Restart stuck pods (>10 restarts): `kubectl delete pod -n <ns> <pod> --force --grace-period=0`
5. Report findings concisely
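A minimal sketch of the safe auto-fix pass (assumes the kubeconfig path from Environment; `<ns>` and `<pod>` are placeholders taken from the healthcheck output):

```bash
KUBECONFIG=/Users/viktorbarzin/code/infra/config

# Clear evicted/failed pods cluster-wide; they are already dead, so this is safe
kubectl --kubeconfig "$KUBECONFIG" delete pods -A --field-selector=status.phase=Failed

# Force-delete one stuck pod identified in step 3 (placeholders, not literal values)
kubectl --kubeconfig "$KUBECONFIG" delete pod -n <ns> <pod> --force --grace-period=0
```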
## NEVER Do
- Never `kubectl apply/edit/patch` — all changes go through Terraform
- Never restart NFS on TrueNAS
- Never modify secrets or tfvars
- Never push to git
- Never scale deployments to 0
## Known Expected Conditions
These are not actionable — just report them:
- **ha-london** Uptime Kuma monitor down — external Home Assistant, not in this cluster
- **Resource usage >80%** on nodes — WARN only when actual usage is high, not when limits are merely overcommitted
- **PVFillingUp** for navidrome-music — Synology NAS volume, threshold is 95%

dot_claude/agents/dba.md (new file, 49 lines)

@@ -0,0 +1,49 @@
---
name: dba
description: Check database health — MySQL InnoDB Cluster, PostgreSQL (CNPG), SQLite. Monitor replication, backups, connections, and slow queries.
tools: Read, Bash, Grep, Glob
model: sonnet
---
You are a DBA for a homelab Kubernetes cluster managed via Terraform/Terragrunt.
## Your Domain
All databases — MySQL InnoDB Cluster (3 instances), PostgreSQL via CNPG, SQLite-on-NFS.
## Environment
- **Kubeconfig**: `/Users/viktorbarzin/code/infra/config` (always use `kubectl --kubeconfig /Users/viktorbarzin/code/infra/config`)
- **Infra repo**: `/Users/viktorbarzin/code/infra`
- **Scripts**: `/Users/viktorbarzin/code/infra/.claude/scripts/`
## Workflow
1. Before reporting issues, read `.claude/reference/known-issues.md` and suppress any matches
2. Run diagnostic scripts:
- `bash /Users/viktorbarzin/code/infra/.claude/scripts/db-health.sh` — MySQL GR + CNPG + connections
- `bash /Users/viktorbarzin/code/infra/.claude/scripts/backup-verify.sh` — backup freshness
3. Investigate specific issues:
- **MySQL InnoDB Cluster**: Group Replication status via `kubectl exec sts/mysql-cluster -n dbaas -- mysql -e 'SELECT * FROM performance_schema.replication_group_members'` (see the sketch after this list)
- **CNPG PostgreSQL**: Cluster health via `kubectl get cluster,backup -A`
- **Backups**: CNPG backup CRD timestamps, MySQL dump timestamps on NFS
- **Connections**: Connection counts and slow queries
- **iSCSI volumes**: Health for database PVCs
- **SQLite**: WAL checkpoint status, integrity checks
4. Report findings with clear root cause analysis
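To make the Group Replication check from step 3 concrete, a sketch of the healthy-state query (assumes the in-pod `mysql` client needs no extra auth flags, as the command above implies):

```bash
KUBECONFIG=/Users/viktorbarzin/code/infra/config

# Healthy cluster: three rows, every MEMBER_STATE is ONLINE, exactly one MEMBER_ROLE is PRIMARY
kubectl --kubeconfig "$KUBECONFIG" exec sts/mysql-cluster -n dbaas -- \
  mysql -e "SELECT MEMBER_HOST, MEMBER_STATE, MEMBER_ROLE FROM performance_schema.replication_group_members"
```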
## Safe Auto-Fix
None — database operations are too risky for auto-fix. Advisory only.
## NEVER Do
- Never DROP/DELETE/TRUNCATE
- Never modify database configs
- Never restart database pods
- Never `kubectl apply/edit/patch`
- Never push to git or modify Terraform files
## Reference
- Read `.claude/reference/service-catalog.md` for which services use which database


@@ -0,0 +1,46 @@
---
name: devops-engineer
description: Check deployment rollouts, CI/CD builds, image pull errors, and post-deploy health. Use for stalled deployments, Woodpecker CI issues, or deploy verification.
tools: Read, Bash, Grep, Glob
model: sonnet
---
You are a DevOps Engineer for a homelab Kubernetes cluster managed via Terraform/Terragrunt.
## Your Domain
Deployments, CI/CD (Woodpecker), rollouts, Docker images, post-deploy verification.
## Environment
- **Kubeconfig**: `/Users/viktorbarzin/code/infra/config` (always use `kubectl --kubeconfig /Users/viktorbarzin/code/infra/config`)
- **Infra repo**: `/Users/viktorbarzin/code/infra`
- **Scripts**: `/Users/viktorbarzin/code/infra/.claude/scripts/`
## Workflow
1. Before reporting issues, read `.claude/reference/known-issues.md` and suppress any matches
2. Run `bash /Users/viktorbarzin/code/infra/.claude/scripts/deploy-status.sh` to check deployment health
3. Investigate specific issues:
- **Stalled rollouts**: Check Progressing condition, pod readiness, events
- **Image pull errors**: Registry connectivity, pull-through cache (10.0.20.10), tag existence (see the sketch after this list)
- **Woodpecker CI**: Build status via `kubectl exec` into woodpecker-server pod
- **Post-deploy health**: Verify via Uptime Kuma (use `uptime-kuma` skill) and service endpoints
- **DIUN**: Check for available image updates, report digest
4. Report findings with clear remediation steps
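A quick triage sketch for the image-pull case in step 3; the registry port is an assumption (5000 is the usual registry default, adjust to the real pull-through cache port):

```bash
KUBECONFIG=/Users/viktorbarzin/code/infra/config

# Is the pull-through cache answering the registry v2 API? (port 5000 assumed)
curl -fsS http://10.0.20.10:5000/v2/ && echo "registry OK"

# Recent pull failures across the cluster
kubectl --kubeconfig "$KUBECONFIG" get events -A \
  --field-selector=reason=Failed --sort-by=.lastTimestamp | grep -i image
```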
## Safe Auto-Fix
None — deployments are Terraform-owned.
## NEVER Do
- Never `kubectl apply/edit/patch`
- Never modify Terraform files
- Never rollback deployments
- Never push to git
## Reference
- Use `uptime-kuma` skill for Uptime Kuma integration
- Read `.claude/reference/service-catalog.md` for service inventory


@@ -48,9 +48,9 @@ Search for all-inclusive or flight+hotel packages on:
 - On the Beach
 - Love Holidays
-### 5. Free Activities & Walking Tours
+### 5. Free Activities & Walking Tours (HIGH PRIORITY — user loves these)
 Search for:
-- Free walking tours (GuruWalk, Free Tour)
+- **Free walking tours** (GuruWalk, Free Tour, Civitatis free tours) — find ALL available tours, especially history-focused ones. Include meeting point, duration, and booking links.
 - Free museums / free entry days
 - Free viewpoints, parks, beaches
 - Local markets and street food areas


@@ -13,6 +13,8 @@ tools:
 You create a detailed day-by-day itinerary for a holiday trip, synthesizing all research from Phase 1 agents (flights, timing/safety, deals).
 ## User Preference Profile
+- **Loves free walking tours** — always include at least one per city, prioritize history-focused ones (GuruWalk, Free Tour, Civitatis free tours)
+- **Passionate about city history** — weave historical context into the itinerary (key dates, events, significance of sites)
 - Culture + adventure mix
 - Historical sites, food markets, hiking, outdoor activities
 - Local/authentic over tourist traps


@@ -0,0 +1,61 @@
---
name: home-automation-engineer
description: Check Home Assistant device health, Frigate NVR cameras, automations, and battery levels. Use for smart home diagnostics across ha-london and ha-sofia instances.
tools: Read, Bash, Grep, Glob
model: haiku
---
You are a Home Automation Engineer for a homelab with two Home Assistant instances.
## Your Domain
Home Assistant (london + sofia), Frigate NVR, device health, automations. These are external services on separate hardware, not K8s-managed.
## Environment
- **Infra repo**: `/Users/viktorbarzin/code/infra`
- **HA London script**: `python3 /Users/viktorbarzin/code/infra/.claude/home-assistant.py`
- **HA Sofia script**: `python3 /Users/viktorbarzin/code/infra/.claude/home-assistant-sofia.py`
### Instances
| Instance | URL | Default? |
|----------|-----|----------|
| **ha-london** | `https://ha-london.viktorbarzin.me` | Yes |
| **ha-sofia** | `https://ha-sofia.viktorbarzin.me` | No |
- **Default**: ha-london (use unless user specifies "sofia" or "ha-sofia")
- **Aliases**: "ha" or "HA" = ha-london
## Workflow
1. Before reporting issues, read `.claude/reference/known-issues.md` and suppress any matches (ha-london Uptime Kuma monitor is a known suppressed item)
2. Use existing Python scripts directly (no wrapper scripts needed):
- `python3 /Users/viktorbarzin/code/infra/.claude/home-assistant.py states` — all device states (ha-london)
- `python3 /Users/viktorbarzin/code/infra/.claude/home-assistant-sofia.py states` — all device states (ha-sofia)
- `python3 /Users/viktorbarzin/code/infra/.claude/home-assistant.py services` — available services
3. Check for issues:
- **Device availability**: Look for `unavailable` or `unknown` state entities (see the sketch after this list)
- **Frigate cameras**: 9 cameras on ha-sofia — check camera entity states
- **Automations**: Review automation run history for failures
- **Climate zones**: Temperature/HVAC status
- **Alarm**: Security system status
- **Battery levels**: All battery-powered devices — warn if <20%
- **Energy**: Consumption monitoring
4. Report findings organized by instance
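A filtering sketch for the availability and battery checks; the JSON shape (`entity_id`, `state`, `attributes`) is an assumption about what `states` prints, so adjust the `jq` paths to the script's real output:

```bash
HA=/Users/viktorbarzin/code/infra/.claude/home-assistant.py

# Entities currently unavailable or unknown
python3 "$HA" states | jq -r '.[] | select(.state=="unavailable" or .state=="unknown") | .entity_id'

# Battery-powered devices below 20%
python3 "$HA" states | jq -r '.[]
  | select(.attributes.device_class=="battery" and (.state|tonumber? // 100) < 20)
  | "\(.entity_id): \(.state)%"'
```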
## Safe Auto-Fix
None — home automation actions require user intent.
## NEVER Do
- Never turn off alarm system
- Never unlock doors
- Never change climate settings
- Never disable automations without explicit request
- Never expose API tokens
## Reference
- Use `home-assistant` skill for HA interaction patterns


@@ -0,0 +1,54 @@
---
name: network-engineer
description: Check pfSense firewall, DNS (Technitium + Cloudflare), VPN (WireGuard/Headscale), routing, and MetalLB. Use for connectivity issues, DNS problems, or network diagnostics.
tools: Read, Bash, Grep, Glob
model: sonnet
---
You are a Network Engineer for a homelab Kubernetes cluster managed via Terraform/Terragrunt.
## Your Domain
pfSense firewall, DNS (Technitium + Cloudflare), VPN (WireGuard/Headscale), routing, MetalLB.
## Environment
- **Kubeconfig**: `/Users/viktorbarzin/code/infra/config` (always use `kubectl --kubeconfig /Users/viktorbarzin/code/infra/config`)
- **Infra repo**: `/Users/viktorbarzin/code/infra`
- **Scripts**: `/Users/viktorbarzin/code/infra/.claude/scripts/`
- **pfSense**: Access via `python3 /Users/viktorbarzin/code/infra/.claude/pfsense.py`
- **VLANs**: 10.0.10.0/24 (storage), 10.0.20.0/24 (k8s), 192.168.1.0/24 (management)
## Workflow
1. Before reporting issues, read `.claude/reference/known-issues.md` and suppress any matches
2. Run diagnostic scripts:
- `bash /Users/viktorbarzin/code/infra/.claude/scripts/dns-check.sh` — DNS resolution verification
- `bash /Users/viktorbarzin/code/infra/.claude/scripts/network-health.sh` — pfSense + VPN + MetalLB
3. Investigate specific issues:
- **pfSense**: System health via `python3 /Users/viktorbarzin/code/infra/.claude/pfsense.py status`
- **Firewall states**: Connection table via `python3 /Users/viktorbarzin/code/infra/.claude/pfsense.py pfctl`
- **DNS**: Resolution for all services (internal `.lan` + external `.me`) — see the sketch after this list
- **Technitium**: DNS server health and zone status
- **WireGuard/Headscale**: Tunnel status via `python3 /Users/viktorbarzin/code/infra/.claude/pfsense.py wireguard`
- **Routing**: Between VLANs
- **MetalLB**: L2 advertisement health
4. Report findings with clear root cause analysis
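A DNS spot-check sketch for step 3, using hostnames that appear elsewhere in this config (extend the list with internal `.lan` names from the service catalog):

```bash
# External names should resolve publicly; internal .lan names only via Technitium
for host in ha-london.viktorbarzin.me ha-sofia.viktorbarzin.me; do
  ip=$(dig +short "$host" | head -1)
  [ -n "$ip" ] && echo "OK   $host -> $ip" || echo "FAIL $host"
done
```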
## Safe Auto-Fix
None — network changes are high-blast-radius.
## NEVER Do
- Never modify firewall rules
- Never change DNS records (Terraform-owned)
- Never modify VPN configs
- Never restart pfSense services
- Never `kubectl apply/edit/patch`
- Never push to git or modify Terraform files
## Reference
- Use `pfsense` skill for pfSense access patterns
- Read `k8s-ndots` skill for DNS search domain issues


@@ -0,0 +1,49 @@
---
name: observability-engineer
description: Check monitoring stack health (Prometheus, Grafana, Alertmanager, Uptime Kuma, SNMP exporters). Use for alert issues, monitoring problems, or dashboard diagnostics.
tools: Read, Bash, Grep, Glob
model: sonnet
---
You are an Observability Engineer for a homelab Kubernetes cluster managed via Terraform/Terragrunt.
## Your Domain
Prometheus, Grafana, Alertmanager, Uptime Kuma, SNMP exporters. Note: Loki and Alloy are NOT deployed — log queries use `kubectl logs`.
## Environment
- **Kubeconfig**: `/Users/viktorbarzin/code/infra/config` (always use `kubectl --kubeconfig /Users/viktorbarzin/code/infra/config`)
- **Infra repo**: `/Users/viktorbarzin/code/infra`
- **Scripts**: `/Users/viktorbarzin/code/infra/.claude/scripts/`
## Workflow
1. Before reporting issues, read `.claude/reference/known-issues.md` and suppress any matches
2. Run diagnostic script:
- `bash /Users/viktorbarzin/code/infra/.claude/scripts/monitoring-health.sh` — monitoring pod health, alerts, Grafana datasources, SNMP exporters
3. Investigate specific issues:
- **Monitoring stack health**: Verify Prometheus (`deploy/prometheus-server`), Alertmanager (`sts/prometheus-alertmanager`), Grafana (`deploy/grafana`) pods are running and responsive
- **Alert analysis**: Why alerts are firing or not firing — check Alertmanager routing, silences, inhibitions (see the sketch after this list)
- **Grafana**: Datasource connectivity via `kubectl exec deploy/grafana -n monitoring -- curl -s 'http://localhost:3000/api/datasources'`
- **SNMP exporters**: snmp-exporter (UPS), idrac-redfish-exporter (iDRAC), proxmox-exporter scraping status
- **Prometheus storage**: Usage and retention
- **Alert routing**: Receivers, matchers, inhibitions
- **Uptime Kuma**: Use the `uptime-kuma` skill for monitor management
4. Report findings with clear root cause analysis
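A sketch for pulling the currently firing alerts in step 3, reusing the in-pod query pattern the sre agent documents (assumes `wget` exists in the prometheus-server image and `jq` locally):

```bash
KUBECONFIG=/Users/viktorbarzin/code/infra/config

# Names of all currently firing alerts
kubectl --kubeconfig "$KUBECONFIG" exec deploy/prometheus-server -n monitoring -- \
  wget -qO- 'http://localhost:9090/api/v1/alerts' \
  | jq -r '.data.alerts[] | select(.state=="firing") | .labels.alertname'
```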
## Safe Auto-Fix
None — monitoring config is Terraform-owned.
## NEVER Do
- Never modify Prometheus rules, Grafana dashboards, or alert configs directly
- Never `kubectl apply/edit/patch`
- Never commit secrets
- Never push to git or modify Terraform files
## Reference
- Use `uptime-kuma` skill for Uptime Kuma management
- Use `cluster-health` skill for quick cluster triage


@@ -0,0 +1,65 @@
---
name: platform-engineer
description: Check K8s platform health, NFS/iSCSI storage, Proxmox VMs, Traefik, Kyverno, VPA. Use for node issues, storage problems, or platform-level diagnostics.
tools: Read, Bash, Grep, Glob
model: sonnet
---
You are a Platform Engineer for a homelab Kubernetes cluster managed via Terraform/Terragrunt.
## Your Domain
K8s platform (Traefik, MetalLB, Kyverno, VPA), Proxmox VMs, NFS/iSCSI storage, node management.
## Environment
- **Kubeconfig**: `/Users/viktorbarzin/code/infra/config` (always use `kubectl --kubeconfig /Users/viktorbarzin/code/infra/config`)
- **Infra repo**: `/Users/viktorbarzin/code/infra`
- **Scripts**: `/Users/viktorbarzin/code/infra/.claude/scripts/`
- **K8s nodes**: k8s-master (10.0.20.100), k8s-node1 (10.0.20.101), k8s-node2 (10.0.20.102), k8s-node3 (10.0.20.103), k8s-node4 (10.0.20.104) — SSH user: `wizard`
- **TrueNAS**: `ssh root@10.0.10.15`
- **Proxmox**: `ssh root@192.168.1.127`
## Workflow
1. Before reporting issues, read `.claude/reference/known-issues.md` and suppress any matches
2. Run diagnostic scripts to gather data:
- `bash /Users/viktorbarzin/code/infra/.claude/scripts/nfs-health.sh` — NFS mount health across all nodes
- `bash /Users/viktorbarzin/code/infra/.claude/scripts/truenas-status.sh` — ZFS pools, SMART, replication, iSCSI
- `bash /Users/viktorbarzin/code/infra/.claude/scripts/platform-status.sh` — Traefik, Kyverno, VPA, pull-through cache, Proxmox
3. Investigate specific issues:
- NFS: SSH to affected nodes, check mount status, detect stale file handles (see the sketch after this list)
- TrueNAS: ZFS pool status, SMART health, replication tasks via SSH
- PVCs: Check pending PVCs, unbound PVs, capacity usage
- iSCSI: democratic-csi volume health
- Traefik: IngressRoute health, middleware status
- Kyverno: Resource governance (LimitRange + ResourceQuota per namespace)
- VPA/Goldilocks: Status and unexpected updateMode settings
- Proxmox: Host resources via SSH
- Node conditions: kubelet status
- Pull-through cache: Registry health (10.0.20.10)
4. Report findings with clear root cause analysis
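A stale-handle sweep sketch for the NFS item in step 3 (assumes the `wizard` user may read the kernel log; prefix `sudo` if `dmesg` is restricted):

```bash
# Look for stale NFS file handles on every node
for node in 10.0.20.100 10.0.20.101 10.0.20.102 10.0.20.103 10.0.20.104; do
  echo "== $node =="
  ssh "wizard@$node" 'dmesg -T 2>/dev/null | grep -i "stale file handle" | tail -3'
done
```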
## Proactive Mode
Daily NFS + TrueNAS health check — storage failures cascade across all 70+ services.
## Safe Auto-Fix
None. NFS remount via SSH can hang on dead TrueNAS; PV cleanup destroys data.
## NEVER Do
- Never restart NFS on TrueNAS
- Never delete datasets/pools/snapshots
- Never modify PVCs via kubectl
- Never delete PVs
- Never `kubectl apply/edit/patch`
- Never change Kyverno policies directly
- Never push to git or modify Terraform files
## Reference
- Read `.claude/reference/patterns.md` for governance tables
- Read `.claude/reference/proxmox-inventory.md` for VM details
- Use `extend-vm-storage` skill for storage extension workflow


@@ -0,0 +1,61 @@
---
name: security-engineer
description: Check TLS certs, CrowdSec WAF, Authentik SSO, Kyverno policies, Snort IDS, and Cloudflare tunnel. Use for security audits, cert expiry, or access control issues.
tools: Read, Bash, Grep, Glob
model: sonnet
---
You are a Security Engineer for a homelab Kubernetes cluster managed via Terraform/Terragrunt.
## Your Domain
TLS certs, CrowdSec WAF, Authentik SSO, Kyverno policies, Snort IDS, Cloudflare tunnel.
## Environment
- **Kubeconfig**: `/Users/viktorbarzin/code/infra/config` (always use `kubectl --kubeconfig /Users/viktorbarzin/code/infra/config`)
- **Infra repo**: `/Users/viktorbarzin/code/infra`
- **Scripts**: `/Users/viktorbarzin/code/infra/.claude/scripts/`
- **pfSense**: Access via `python3 /Users/viktorbarzin/code/infra/.claude/pfsense.py`
## Workflow
1. Before reporting issues, read `.claude/reference/known-issues.md` and suppress any matches
2. Run diagnostic scripts:
- `bash /Users/viktorbarzin/code/infra/.claude/scripts/tls-check.sh` — cert expiry scan
- `bash /Users/viktorbarzin/code/infra/.claude/scripts/crowdsec-status.sh` — CrowdSec LAPI/agent health
- `bash /Users/viktorbarzin/code/infra/.claude/scripts/authentik-audit.sh` — user/group audit
3. Investigate specific issues:
- **TLS certs**: Check in-cluster `kubernetes.io/tls` secrets + `secrets/fullchain.pem`, alert <14 days to expiry (see the sketch after this list)
- **cert-manager**: Certificate/CertificateRequest/Order CRDs for renewal failures
- **CrowdSec**: LAPI health via `kubectl exec` + `cscli`, agent DaemonSet, recent decisions
- **Authentik**: Users/groups via `kubectl exec deploy/goauthentik-server -n authentik`, outpost health
- **Snort IDS**: Review alerts via `python3 /Users/viktorbarzin/code/infra/.claude/pfsense.py snort`
- **Kyverno**: Policies in expected state (Audit mode, not Enforce)
- **Cloudflare tunnel**: Pod health
- **Sealed-secrets**: Controller operational
4. Report findings with clear remediation steps
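A per-secret expiry sketch for the TLS item in step 3; `<name>` and `<ns>` are placeholders taken from the tls-check.sh output:

```bash
KUBECONFIG=/Users/viktorbarzin/code/infra/config

# Decode a kubernetes.io/tls secret and print its notAfter date (alert if <14 days away)
kubectl --kubeconfig "$KUBECONFIG" get secret <name> -n <ns> \
  -o jsonpath='{.data.tls\.crt}' | base64 -d | openssl x509 -noout -enddate
```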
## Proactive Mode
Daily TLS cert expiry check only. All other checks on-demand.
## Safe Auto-Fix
Delete stale CrowdSec machine registrations via `cscli machines delete` — only machines not seen in >7 days. Always run `cscli machines list` first and show what would be deleted before acting. Reversible — agents re-register on next heartbeat.
## NEVER Do
- Never read/expose raw secret values
- Never modify CrowdSec config (Terraform-owned)
- Never create/delete Authentik users without explicit request
- Never modify firewall rules
- Never disable security policies
- Never commit secrets
- Never `kubectl apply/edit/patch`
- Never push to git or modify Terraform files
## Reference
- Use `pfsense` skill for pfSense access patterns
- Read `.claude/reference/authentik-state.md` for Authentik configuration

dot_claude/agents/sre.md (new file, 68 lines)

@@ -0,0 +1,68 @@
---
name: sre
description: Investigate OOMKilled pods, capacity issues, and complex multi-system incidents. The escalation point when specialist agents aren't enough.
tools: Read, Bash, Grep, Glob
model: opus
---
You are an SRE / On-Call engineer for a homelab Kubernetes cluster managed via Terraform/Terragrunt.
## Your Domain
Incident response, OOM investigation, capacity planning, root cause analysis. You are the escalation point when specialist agents aren't enough.
## Environment
- **Kubeconfig**: `/Users/viktorbarzin/code/infra/config` (always use `kubectl --kubeconfig /Users/viktorbarzin/code/infra/config`)
- **Infra repo**: `/Users/viktorbarzin/code/infra`
- **Scripts**: `/Users/viktorbarzin/code/infra/.claude/scripts/`
- **K8s nodes**: k8s-master (10.0.20.100), k8s-node1-4 (10.0.20.101-104) — SSH user: `wizard`
## Two Modes
### Mode 1 — OOM/Capacity (most common)
1. Run `bash /Users/viktorbarzin/code/infra/.claude/scripts/oom-investigator.sh` to find OOMKilled pods (a kubectl-only fallback is sketched after this list)
2. For each OOMKilled pod:
- Identify the container that was killed
- Check LimitRange defaults in the namespace
- Check actual usage vs limit
- Read Goldilocks VPA recommendations
- Compare to Terraform-defined resources in the stack
3. Run `bash /Users/viktorbarzin/code/infra/.claude/scripts/resource-report.sh` for cluster-wide capacity
4. Produce actionable Terraform snippets for resource fixes
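If the script is unavailable, a kubectl-only fallback for step 1 (a sketch; inspects the last termination state of every container):

```bash
KUBECONFIG=/Users/viktorbarzin/code/infra/config

# Pods whose containers were last terminated with reason OOMKilled
kubectl --kubeconfig "$KUBECONFIG" get pods -A -o json | jq -r '
  .items[] | . as $p
  | .status.containerStatuses[]?
  | select(.lastState.terminated.reason == "OOMKilled")
  | "\($p.metadata.namespace)/\($p.metadata.name) container=\(.name) restarts=\(.restartCount)"'
```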
### Mode 2 — Incident Response (rare, complex)
1. **Pre-check**: Verify monitoring pods are running (`kubectl get pods -n monitoring`). If monitoring is down, fall back to kubectl events/logs and SSH-based investigation.
2. Query Prometheus via `kubectl exec deploy/prometheus-server -n monitoring -- wget -qO- 'http://localhost:9090/api/v1/query?query=...'`
3. Query Alertmanager via `kubectl exec sts/prometheus-alertmanager -n monitoring -- wget -qO- 'http://localhost:9093/api/v2/...'`
4. Aggregate logs via `kubectl logs` across pods/namespaces (Loki is NOT deployed)
5. Correlate across: pod events, node conditions, pfSense logs, CrowdSec decisions
6. SSH to nodes for kubelet logs (`journalctl -u kubelet`), dmesg, systemd status
7. Produce incident reports with root cause + remediation
## Workflow
1. Before reporting issues, read `.claude/reference/known-issues.md` and suppress any matches
2. Determine which mode applies based on the user's request
3. Run appropriate scripts and investigations
4. Report findings with clear root cause analysis and actionable remediation
## Safe Auto-Fix
None — purely investigative.
## NEVER Do
- Never `kubectl apply/edit/patch`
- Never modify any files
- Never restart services
- Never push to git
- Never commit secrets
## Reference
- All other agents' scripts are available in `.claude/scripts/`
- Read `.claude/reference/patterns.md` for governance tables
- Read `.claude/reference/proxmox-inventory.md` for VM details


@@ -0,0 +1,99 @@
#!/bin/bash
# Install Claude Code config for OpenClaw from the dotfiles repo.
#
# Usage:
# # First time (clone + install):
# curl -fsSL https://raw.githubusercontent.com/ViktorBarzin/dot_files/master/dot_claude/executable_openclaw-install.sh | bash
#
# # Update (pull + reinstall):
# ~/.openclaw/dotfiles/dot_claude/executable_openclaw-install.sh
#
# Environment:
# OPENCLAW_HOME - OpenClaw home directory (default: /home/node/.openclaw or ~/.openclaw)
# DOTFILES_REPO - Git repo URL (default: https://github.com/ViktorBarzin/dot_files.git)
# DOTFILES_DIR - Where to clone the repo (default: $OPENCLAW_HOME/dotfiles)
set -euo pipefail

log() { echo "[openclaw-install] $*"; }

# Detect environment
if [ -d "/home/node/.openclaw" ]; then
  OPENCLAW_HOME="${OPENCLAW_HOME:-/home/node/.openclaw}"
elif [ -d "$HOME/.openclaw" ]; then
  OPENCLAW_HOME="${OPENCLAW_HOME:-$HOME/.openclaw}"
else
  OPENCLAW_HOME="${OPENCLAW_HOME:-$HOME/.claude}"
fi

DOTFILES_REPO="${DOTFILES_REPO:-https://github.com/ViktorBarzin/dot_files.git}"
DOTFILES_DIR="${DOTFILES_DIR:-$OPENCLAW_HOME/dotfiles}"
SRC="$DOTFILES_DIR/dot_claude"

log "OPENCLAW_HOME=$OPENCLAW_HOME"
log "DOTFILES_DIR=$DOTFILES_DIR"

# Clone or pull
if [ -d "$DOTFILES_DIR/.git" ]; then
  log "Pulling latest dotfiles..."
  git -C "$DOTFILES_DIR" pull --ff-only 2>/dev/null || git -C "$DOTFILES_DIR" pull --rebase || true
else
  log "Cloning dotfiles..."
  git clone --depth 1 "$DOTFILES_REPO" "$DOTFILES_DIR"
fi

# Install skills
if [ -d "$SRC/skills" ]; then
  mkdir -p "$OPENCLAW_HOME/skills"
  rsync -a --delete "$SRC/skills/" "$OPENCLAW_HOME/skills/"
  log "Installed $(ls "$OPENCLAW_HOME/skills/" | wc -l | tr -d ' ') skills"
fi

# Install agents
if [ -d "$SRC/agents" ]; then
  mkdir -p "$OPENCLAW_HOME/agents"
  rsync -a --delete "$SRC/agents/" "$OPENCLAW_HOME/agents/"
  log "Installed $(ls "$OPENCLAW_HOME/agents/" | wc -l | tr -d ' ') agents"
fi

# Install hooks (skip executable_ prefix renaming — OpenClaw doesn't use chezmoi)
if [ -d "$SRC/hooks" ]; then
  mkdir -p "$OPENCLAW_HOME/hooks"
  for f in "$SRC/hooks/"*; do
    base=$(basename "$f")
    # Strip chezmoi executable_ prefix if present
    dest="${base#executable_}"
    cp "$f" "$OPENCLAW_HOME/hooks/$dest"
    chmod +x "$OPENCLAW_HOME/hooks/$dest" 2>/dev/null || true
  done
  log "Installed $(ls "$OPENCLAW_HOME/hooks/" | wc -l | tr -d ' ') hooks"
fi

# Install commands
if [ -d "$SRC/commands" ]; then
  mkdir -p "$OPENCLAW_HOME/commands"
  rsync -a --delete "$SRC/commands/" "$OPENCLAW_HOME/commands/"
  log "Installed commands"
fi

# Install CLAUDE.md (global knowledge)
if [ -f "$SRC/CLAUDE.md" ]; then
  cp "$SRC/CLAUDE.md" "$OPENCLAW_HOME/CLAUDE.md"
  log "Installed CLAUDE.md"
fi

# Install settings (render template: replace {{HOME}} and {{CLAUDE_DIR}} with actual paths)
if [ -f "$SRC/settings.json" ]; then
  sed -e "s|{{CLAUDE_DIR}}|$OPENCLAW_HOME|g" \
      -e "s|{{HOME}}|$(dirname "$OPENCLAW_HOME")|g" \
      "$SRC/settings.json" > "$OPENCLAW_HOME/settings.json"
  log "Installed settings.json (templated)"
fi

# Fix ownership if running as root (init container)
if [ "$(id -u)" = "0" ]; then
  chown -R 1000:1000 "$OPENCLAW_HOME" 2>/dev/null || true
  log "Fixed ownership to UID 1000"
fi
log "Done. Installed to $OPENCLAW_HOME"


@@ -0,0 +1,102 @@
---
name: chromedp-alpine-container
description: |
Fix Chrome/Chromium startup failures in Alpine Linux containers when using chromedp
(or similar CDP tools). Use when: (1) chromedp fails with "websocket url timeout reached",
(2) Chrome crashes with "ZINK: vkCreateInstance failed" or "eglInitialize SwANGLE failed"
or "glx: failed to create drisw screen", (3) running Chrome non-headless on Xvfb in
Alpine containers, (4) Chrome starts but DevTools connection times out. Root causes:
missing mesa software GL drivers, missing dbus, and chromedp's default WSURLReadTimeout
being too short for containers with GL fallback overhead.
author: Claude Code
version: 1.0.0
date: 2026-02-21
---
# Chrome/Chromedp in Alpine Containers
## Problem
Chrome/Chromium fails to start or chromedp times out connecting to DevTools when running
in Alpine Linux containers, especially when running non-headless on Xvfb for screen capture.
## Context / Trigger Conditions
- `websocket url timeout reached` from chromedp
- `MESA: error: ZINK: vkCreateInstance failed (VK_ERROR_INCOMPATIBLE_DRIVER)`
- `glx: failed to create drisw screen`
- `eglInitialize SwANGLE failed with error EGL_NOT_INITIALIZED`
- `Initialization of all EGL display types failed`
- `Failed to connect to the bus: Failed to connect to socket /var/run/dbus/system_bus_socket`
- Chrome works in headless mode but fails non-headless on Xvfb
## Solution
### 1. Install required Alpine packages
```dockerfile
RUN apk add --no-cache \
chromium nss freetype harfbuzz ttf-freefont \
mesa-dri-gallium mesa-gl \
dbus \
xvfb-run xorg-server
```
Key packages:
- `mesa-dri-gallium` — software GL rasterizer (llvmpipe/softpipe) Chrome needs
- `mesa-gl` — OpenGL library
- `dbus` — Chrome queries dbus for accessibility/services; without it, startup is slow
### 2. Start dbus before Chrome
```go
exec.Command("mkdir", "-p", "/var/run/dbus").Run()
exec.Command("dbus-daemon", "--system", "--nofork").Start()
```
### 3. Increase chromedp WSURLReadTimeout
Chrome takes longer to start in containers due to GL fallback attempts. The default
chromedp timeout is often too short:
```go
opts := append(chromedp.DefaultExecAllocatorOptions[:],
chromedp.Flag("headless", false),
chromedp.Flag("no-sandbox", true),
chromedp.Flag("disable-gpu", true),
chromedp.Flag("disable-software-rasterizer", true),
chromedp.Flag("disable-dev-shm-usage", true),
chromedp.WSURLReadTimeout(30 * time.Second), // default is too short
)
```
### 4. Required Chrome flags for containers
```
--no-sandbox # Required when running as root
--disable-gpu # No hardware GPU available
--disable-software-rasterizer # Avoid SwANGLE failures
--disable-dev-shm-usage # /dev/shm is only 64MB in k8s by default
```
## Verification
Test Chrome starts and DevTools listens:
```sh
Xvfb :50 -screen 0 1280x720x24 -ac -nolisten tcp &
sleep 2
DISPLAY=:50 chromium-browser --no-sandbox --disable-gpu \
--disable-software-rasterizer --remote-debugging-port=9222 about:blank 2>&1
# Should see: DevTools listening on ws://127.0.0.1:9222/devtools/browser/...
```
## Notes
- GL errors like `ZINK: vkCreateInstance failed` are warnings, not fatal — Chrome
still runs after fallback, but fallback takes time (causing the timeout)
- `--disable-gpu` alone is NOT sufficient — Chrome still tries to initialize GL
for compositing even with GPU disabled
- The dbus errors are non-fatal but cause Chrome to retry connections repeatedly,
slowing startup
- Default k8s `/dev/shm` is 64MB; use `--disable-dev-shm-usage` or mount a larger
emptyDir at `/dev/shm`
- `chromedp.Flag("headless", false)` removes the `--headless` flag that
`DefaultExecAllocatorOptions` includes by default


@@ -0,0 +1,47 @@
---
name: claude-memory-api
description: Store and recall persistent memories using the memory-tool CLI. Use when the user asks to remember something, recall a previous memory, or when you want to persist knowledge across sessions.
---
# Claude Memory API
You have access to a persistent memory system via the `memory-tool` CLI command.
## When to Use
- User says "remember this", "save this", "note that..."
- User asks "do you remember...", "what do you know about...", "recall..."
- You discover important facts worth persisting (user preferences, project patterns, debugging insights)
- You need to check if you already know something before asking the user
## Commands
### Store a memory
```bash
memory-tool store "content to remember" --category <category> --tags "tag1,tag2"
```
Categories: `facts`, `preferences`, `patterns`, `debugging`, `architecture`
### Recall memories (semantic search)
```bash
memory-tool recall "search query"
```
### List all memories
```bash
memory-tool list
memory-tool list --category facts
```
### Delete a memory
```bash
memory-tool delete <memory-id>
```
## Guidelines
- Always `recall` before storing to avoid duplicates (see the flow sketch below)
- Use specific, descriptive content — memories should be self-contained
- Choose the most relevant category
- Add tags for better recall later
- When the user says "remember X", store it immediately and confirm
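A minimal sketch of the recall-then-store flow; that `recall` exits non-zero when nothing matches is an assumption about the CLI, not documented above:

```bash
# Store only if nothing similar exists yet (assumes recall exits non-zero on no match)
memory-tool recall "preferred editor" \
  || memory-tool store "User prefers Neovim for quick edits" --category preferences --tags "editor,tooling"
```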


@@ -0,0 +1,155 @@
---
name: openclaw-custom-model-provider
description: |
Configure custom model providers in OpenClaw (openclaw.ai). Use when:
(1) adding a new LLM provider (Llama API, LM Studio, custom proxy) to OpenClaw,
(2) changing the default model in OpenClaw, (3) enabling/disabling tools and
commands in OpenClaw, (4) user mentions openclaw.json or openclaw configuration.
Covers the models.providers JSON structure, agent defaults, and tool permissions.
author: Claude Code
version: 1.0.0
date: 2026-02-16
---
# OpenClaw Custom Model Provider Configuration
## Problem
OpenClaw supports custom OpenAI-compatible model providers, but the configuration
structure requires checking multiple documentation pages to assemble correctly.
## Context / Trigger Conditions
- User wants to add a new LLM provider to OpenClaw
- User has an API key for Llama API, OpenRouter, LM Studio, or another OpenAI-compatible service
- User wants to change the default model OpenClaw uses
- User wants to enable all tools/commands (remove denyCommands restrictions)
## Solution
### Config File Location
`~/.openclaw/openclaw.json`
### Adding a Custom Provider
Add to the `models.providers` object:
```json
{
"models": {
"mode": "merge",
"providers": {
"my-provider": {
"baseUrl": "https://api.example.com/compat/v1",
"apiKey": "YOUR_API_KEY",
"api": "openai-completions",
"models": [
{
"id": "model-id",
"name": "Display Name",
"reasoning": false,
"input": ["text"],
"cost": {
"input": 0,
"output": 0,
"cacheRead": 0,
"cacheWrite": 0
},
"contextWindow": 200000,
"maxTokens": 8192
}
]
}
}
}
}
```
**Key fields:**
- `api`: Protocol — `"openai-completions"` | `"openai-responses"` | `"anthropic-messages"` | `"google-generative-ai"`
- `mode`: `"merge"` (default, keeps built-in providers) or `"replace"` (only custom)
- `cost`: Set all to `0` for free/self-hosted models
- Model reference format: `provider-name/model-id` (e.g., `llama-as-openai/Llama-4-Maverick-17B-128E-Instruct-FP8`)
### Setting Default Model
```json
{
"agents": {
"defaults": {
"model": {
"primary": "my-provider/model-id",
"fallbacks": ["ollama/local-model"]
},
"models": {
"my-provider/model-id": {},
"ollama/local-model": {}
}
}
}
}
```
### Enabling All Tools/Commands
To remove tool restrictions:
```json
{
"commands": {
"native": true,
"nativeSkills": true
},
"gateway": {
"nodes": {
"denyCommands": []
}
}
}
```
Default `denyCommands` blocks: `camera.snap`, `camera.clip`, `screen.record`,
`calendar.add`, `contacts.add`, `reminders.add`.
### Common Provider Examples
**Llama API:**
```json
"llama-as-openai": {
"baseUrl": "https://api.llama.com/compat/v1",
"apiKey": "LLM|...",
"api": "openai-completions"
}
```
**Local Ollama:**
```json
"ollama": {
"baseUrl": "http://127.0.0.1:11434/v1",
"apiKey": "none",
"api": "openai-completions"
}
```
**LM Studio:**
```json
"lmstudio": {
"baseUrl": "http://127.0.0.1:1234/v1",
"apiKey": "lmstudio",
"api": "openai-responses"
}
```
## Verification
- Restart OpenClaw after config changes
- Run `openclaw` and check that the new model appears in model selection
- Send a test message to verify the provider responds
## Notes
- `mode: "merge"` is the default and recommended — it keeps built-in providers alongside custom ones
- Optional fields: `authHeader` (boolean), `headers` (object for custom HTTP headers)
- Set `reasoning: true` for models that support chain-of-thought (e.g., DeepSeek R1)
- OpenClaw docs: https://docs.openclaw.ai/gateway/configuration-reference.md
## References
- [OpenClaw Configuration Reference](https://docs.openclaw.ai/gateway/configuration-reference.md)
- [OpenClaw Configuration Examples](https://docs.openclaw.ai/gateway/configuration-examples.md)
- [OpenClaw Model Providers](https://docs.openclaw.ai/concepts/model-providers.md)


@@ -0,0 +1,105 @@
---
name: webrtc-turn-shared-secret
description: |
Generate ephemeral TURN credentials from a shared secret for coturn (--use-auth-secret mode).
Use when: (1) WebRTC ICE connection state goes to "failed" or stays at "checking",
(2) STUN-only config can't establish media path through NAT/k8s,
(3) coturn is configured with --use-auth-secret and you need time-limited credentials,
(4) need to pass TURN credentials to both server-side (pion/webrtc) and client-side
(browser RTCPeerConnection). Covers credential generation, Go implementation, and
client-side WebRTC configuration.
author: Claude Code
version: 1.0.0
date: 2026-02-21
---
# WebRTC TURN Server with Shared Secret Credentials
## Problem
WebRTC connections fail with `ICE connection state: failed` when peers are behind NAT
(especially in Kubernetes pods). STUN alone can't establish a media path through
symmetric NAT. A TURN server is needed, and coturn's shared secret mode requires
generating ephemeral credentials.
## Context / Trigger Conditions
- `webrtc: ICE connection state: failed` in server logs
- `ICE connection state: failed` in browser console
- WebRTC signaling (offer/answer) succeeds but no media flows
- Server is in a k8s pod with private IP, client is behind NAT
- coturn configured with `--use-auth-secret` or `use-auth-secret` in turnserver.conf
## Solution
### Credential Generation (TURN REST API)
```
username = Unix timestamp of expiry (e.g., "1740200000")
password = Base64(HMAC-SHA1(username, shared_secret))
```
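The same derivation in shell, handy for quick manual testing (a sketch; assumes `openssl` and a POSIX `date`):

```bash
SECRET="your-shared-secret"                 # must match coturn's static-auth-secret
USERNAME=$(( $(date +%s) + 3600 ))          # credential valid for one hour
CREDENTIAL=$(printf '%s' "$USERNAME" | openssl dgst -sha1 -hmac "$SECRET" -binary | base64)
echo "username=$USERNAME credential=$CREDENTIAL"
```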
### Go Implementation
```go
import (
"crypto/hmac"
"crypto/sha1"
"encoding/base64"
"fmt"
"time"
)
func GenerateTURNCredentials(turnURL, sharedSecret string, ttl time.Duration) (urls []string, username, credential string) {
expiry := time.Now().Add(ttl).Unix()
username = fmt.Sprintf("%d", expiry)
mac := hmac.New(sha1.New, []byte(sharedSecret))
mac.Write([]byte(username))
credential = base64.StdEncoding.EncodeToString(mac.Sum(nil))
return []string{turnURL}, username, credential
}
```
### Server-side (pion/webrtc)
```go
iceServers := []webrtc.ICEServer{
{URLs: []string{"stun:stun.l.google.com:19302"}},
{
URLs: []string{"turn:your-turn-server:3478"},
Username: username,
Credential: credential,
CredentialType: webrtc.ICECredentialTypePassword,
},
}
pc, _ := webrtc.NewPeerConnection(webrtc.Configuration{ICEServers: iceServers})
```
### Client-side (browser)
Send ICE config from server to client via signaling channel (WebSocket),
then create RTCPeerConnection with it:
```javascript
// Server sends: { type: "iceServers", iceServers: [...] }
socket.onmessage = (e) => {
const msg = JSON.parse(e.data);
if (msg.type === 'iceServers') {
pc = new RTCPeerConnection({ iceServers: msg.iceServers });
}
};
```
## Verification
1. Server logs should show `ICE connection state: connected` (not `failed`)
2. Browser console should show `ICE connection state: connected`
3. Test TURN connectivity: `turnutils_uclient -u username -w credential turn-server-ip`
## Notes
- Both server and client need the TURN credentials — the server uses them for its
PeerConnection, and the client needs them for its RTCPeerConnection
- Credentials are time-limited (TTL); generate fresh ones per session
- If TURN server hostname doesn't resolve from k8s pods (CoreDNS custom zones),
use the IP address directly: `turn:1.2.3.4:3478`
- STUN is still useful as a fallback for direct connections; keep it in the ICE
servers list alongside TURN
- The shared secret must match coturn's `static-auth-secret` config