Commit graph

69 commits

Author SHA1 Message Date
Viktor Barzin
1742a79fb2
[ci skip] Fix Uptime Kuma false-down reports: use bulk heartbeat API instead of per-monitor calls 2026-02-22 01:37:28 +00:00
Viktor Barzin
a79c4bc062
[ci skip] Fix Python f-string quoting in Slack formatter 2026-02-22 01:23:56 +00:00
Viktor Barzin
6103445e53
[ci skip] Improve Slack message formatting with human-readable names and structured blocks 2026-02-22 01:16:47 +00:00
Viktor Barzin
86a60bcfe6
[ci skip] Upgrade cluster-health.sh to full 24-check version with Slack 2026-02-22 01:09:02 +00:00
Viktor Barzin
f6a3e15cd1
[ci skip] Fix Slack message formatting to use Block Kit mrkdwn 2026-02-22 01:00:48 +00:00
Viktor Barzin
a459533225
[ci skip] Add cluster-health skill for OpenClaw agent 2026-02-22 00:04:15 +00:00
Viktor Barzin
ac08707068
[ci skip] Add cluster health check script for OpenClaw agent 2026-02-22 00:00:47 +00:00
Viktor Barzin
86d1d50ad0
[ci skip] Extend cluster healthcheck from 14 to 24 checks
Add 10 new checks covering gaps discovered during incident response:
ResourceQuota pressure, StatefulSets, node disk usage, Helm release
health, Kyverno policy engine, NFS connectivity, DNS resolution,
TLS certificate expiry, GPU health, and Cloudflare tunnel status.
2026-02-21 23:57:04 +00:00
Viktor Barzin
d6cdbeaabe
[ci skip] Increase tier-based resource quotas to prevent quota exhaustion
Tier 2-gpu: 32→48 CPU limits, 64→96Gi mem limits, 30→40 pods
Tier 3-edge: 2→4 req CPU, 8→16 CPU limits, 16→32Gi mem limits, 20→30 pods
Tier 4-aux: 1→2 req CPU, 4→8 CPU limits, 8→16Gi mem limits, 15→20 pods

Fixes realestate-crawler (100% quota), nvidia (89.7%), resume/website (75%),
and actualbudget (75%) quota exhaustion causing pod creation failures.
2026-02-21 23:26:00 +00:00
Viktor Barzin
5fde3ab145
[ci skip] Add GitHub & Drone CI API access documentation 2026-02-21 19:14:41 +00:00
Viktor Barzin
3739906747
[ci skip] Add skills: pfsense-nat-rule-creation, coturn-k8s-without-hostnetwork 2026-02-21 18:29:32 +00:00
Viktor Barzin
71a4fb172c
[ci skip] Add Music Assistant librespot stale credentials skill
New skill: music-assistant-librespot-wrong-account
- Documents fix for Spotify playback failing with "librespot does not support
  free accounts" when cached credentials point to wrong Spotify account
- Includes step-by-step solution: find container, inspect cache, clear and restart

Updated: home-assistant skill with Music Assistant addon details for ha-sofia
2026-02-21 11:23:24 +00:00
Viktor Barzin
238db327f9
[ci skip] Add ground rules: no secrets, CI/CD required, monitoring required 2026-02-19 23:48:44 +00:00
Viktor Barzin
66f1867ccb
[ci skip] Update knowledge base: add OpenClaw service, rename moltbot references 2026-02-18 22:39:58 +00:00
Viktor Barzin
f23246f9a3
[ci skip] Add skills: authentik-oidc-kubernetes, kubelet-static-pod-manifest-update
Two skills extracted from multi-user k8s access implementation:
- authentik-oidc-kubernetes: 6 gotchas for Authentik OIDC + kube-apiserver
- kubelet-static-pod-manifest-update: full restart cycle for static pod changes
2026-02-17 22:56:03 +00:00
Viktor Barzin
1457a04695
[ci skip] Add Authentik management skill for API-based identity provider control 2026-02-17 22:55:41 +00:00
Viktor Barzin
84d9a3f926
[ci skip] Update CLAUDE.md with OIDC gotchas and k8s multi-user notes 2026-02-17 22:16:46 +00:00
Viktor Barzin
03651080bb
[ci skip] Update Authentik API token reference to terraform.tfvars 2026-02-17 22:03:55 +00:00
Viktor Barzin
79ce0db11c
[ci skip] Pass skill secrets to moltbot container and fix Python env
- Add skill_secrets variable to moltbot module with HA tokens and
  Uptime Kuma password as container env vars
- Install Python packages (requests, caldav, icalendar, uptime-kuma-api)
  in init container with PYTHONPATH for main container access
- Update all skills to use python3 directly instead of ~/.venvs/claude
  venv path that doesn't exist in the container
- Remove hardcoded Uptime Kuma password from skill, use env var
2026-02-17 21:53:32 +00:00
Viktor Barzin
72ff14e2df
[ci skip] Add Authentik API management knowledge 2026-02-17 21:10:40 +00:00
Viktor Barzin
6a8efa69c4
[ci skip] Import Claude skills into OpenClaw moltbot
- Convert setup-project and extend-vm-storage from standalone .md
  to directory-based SKILL.md format with YAML frontmatter
- Add symlink in moltbot init container to expose Claude skills
  at ~/.openclaw/skills/ for auto-discovery by OpenClaw
- Update CLAUDE.md skill path references
2026-02-17 21:09:12 +00:00
Viktor Barzin
c0363be5e4
[ci skip] Add Grafana dashboard for Technitium DNS query logs
Add MySQL datasource and 15-panel dashboard for DNS analytics:
queries over time, response codes, top domains/clients, response
times, blocked/NxDomain domains. Enable Grafana dashboard sidecar
for auto-provisioning dashboards from ConfigMaps.
2026-02-16 23:06:41 +00:00
Viktor Barzin
1802b4c86d
[ci skip] Add pfsense-dnsmasq-interface-binding skill, update ndots skill to v1.1.0 2026-02-16 22:30:57 +00:00
Viktor Barzin
c08bab7da7
[ci skip] Update preference: always use cluster_healthcheck.sh for health checks 2026-02-16 21:19:49 +00:00
Viktor Barzin
32a7c04deb
[ci skip] Remember to use cluster_healthcheck.sh for cluster status checks 2026-02-16 19:45:31 +00:00
Viktor Barzin
7854f89add
[ci skip] Add skill: k8s-ndots-search-domain-nxdomain-flood
Documents how Kubernetes ndots:5 search domain expansion floods external
DNS with NxDomain queries, and the CoreDNS template block fix.
2026-02-15 21:52:27 +00:00
Viktor Barzin
8d38e1474b
[ci skip] Document Terraform state splitting plan for future implementation 2026-02-15 21:10:40 +00:00
Viktor Barzin
1564ec7e79
Add tier-based resource governance via Kyverno [ci skip]
Four layers of noisy-neighbor protection using existing tier system:
- PriorityClasses (tier-0-core through tier-4-aux)
- LimitRange defaults auto-generated per namespace tier
- ResourceQuotas auto-generated per namespace tier
- PriorityClassName injection on pods via Kyverno mutate

Custom quota overrides for monitoring and crowdsec namespaces
which exceed the default tier quotas.
2026-02-15 18:48:33 +00:00
Viktor Barzin
df840a0078
[ci skip] remember: spawn subagent to monitor pods instead of sleeping 2026-02-15 17:48:42 +00:00
Viktor Barzin
2d475c8496
[ci skip] Add skills: helm-stuck-release-recovery, k8s-hpa-scaling-storm, crowdsec-agent-registration-failure 2026-02-15 17:18:17 +00:00
Viktor Barzin
f68ed73434
[ci skip] Strengthen Terraform-only change policy in project instructions 2026-02-15 15:10:11 +00:00
Viktor Barzin
0da86577fb
[ci skip] Add skills: containerd-multi-registry-pull-through-cache, traefik-plugin-download-failure-404 2026-02-15 14:36:50 +00:00
Viktor Barzin
dca2b0cabd
[ci skip] Add uptime-kuma management skill with tiered monitoring 2026-02-15 14:35:53 +00:00
Viktor Barzin
36d32b49e7
[ci skip] Fix pull-through cache for all registries
Replace deprecated wildcard containerd mirror with per-registry
config_path approach. Add proxy containers for ghcr.io, quay.io,
registry.k8s.io, and reg.kyverno.io on the docker-registry VM.
Set static IP for docker-registry VM to avoid DHCP issues.
2026-02-15 14:35:52 +00:00
Viktor Barzin
22fdb8fbf0
[ci skip] Add pfSense firewall management skill 2026-02-14 12:42:10 +00:00
Viktor Barzin
2b6bcae77f
[ci skip] Add skills: loki-helm-deployment-pitfalls, grafana-stale-datasource-cleanup 2026-02-13 23:47:45 +00:00
Viktor Barzin
ecebf401bf
[ci skip] Update knowledge base with Loki + Alloy service notes 2026-02-13 23:46:01 +00:00
Viktor Barzin
9df9ab1654
[ci skip] Add extend-vm-storage script and skills
- Script to automate K8s node VM disk expansion (drain, shutdown, resize, boot, expand FS, uncordon)
- Skill docs for the workflow and troubleshooting pitfalls (growpart, macOS grep -P, drain timeouts)
- Successfully tested on k8s-node2, k8s-node3, k8s-node4 (64G → 128G)
2026-02-13 22:08:46 +00:00
Viktor Barzin
06f9a7fe74
[ci skip] Add skill: local-llm-gpu-selection 2026-02-13 19:26:19 +00:00
Viktor Barzin
81bdf53386
[ci skip] Add skill: traefik-rewrite-body-compression
Extracted from debugging session where packruler/rewrite-body plugin
corrupted gzip responses, breaking HA Companion app auth flow and
WebSocket connections. Fix: strip Accept-Encoding header before
rewrite-body plugin so backends send uncompressed responses.
2026-02-11 21:42:07 +00:00
Viktor Barzin
ed3cb6df07
[ci skip] Add ingress-factory-migration skill 2026-02-10 21:31:48 +00:00
Viktor Barzin
748c77ff9f
[ci skip] update claude knowledge: fix NFS scripts path to secrets/ 2026-02-08 02:41:42 +00:00
Viktor Barzin
87f851c20e
[ci skip] update claude knowledge: always apply cloudflared module for DNS
When deploying a new service, the cloudflared module must also be applied
to create the Cloudflare DNS record. Updated CLAUDE.md and setup-project skill.
2026-02-08 02:30:19 +00:00
Viktor Barzin
d911db6cd9
[ci skip] Deploy Gramps Web genealogy service
Add grampsweb module with web app + Celery worker in a single pod,
using shared Redis (DB 2/3), NFS storage, email via mailserver,
and Ollama AI integration. Available at family.viktorbarzin.me.
2026-02-08 02:30:18 +00:00
Viktor Barzin
71223c02ad
[ci skip] update claude knowledge: add health service 2026-02-08 01:55:30 +00:00
Viktor Barzin
00943e92fe
[ci skip] update add-service skill: require NFS setup before deployment
Add step 3 (NFS Storage Setup) to ensure NFS directories are created
and exported on TrueNAS before deploying services that need persistent
storage. Prevents pods getting stuck in ContainerCreating due to missing
NFS mounts.
2026-02-08 01:51:44 +00:00
Viktor Barzin
d89947c2fd
[ci skip] Deploy Wyoming Whisper STT service for Home Assistant voice input
Add Wyoming Faster Whisper (rhasspy/wyoming-whisper) as a new K8s service
exposed via Traefik TCP entrypoint on port 10300. Accessible from ha-london
RPi via VPN at 10.0.20.202:10300.
2026-02-08 01:51:43 +00:00
Viktor Barzin
e067504170
[ci skip] update claude knowledge: fix ha-london IP to 192.168.8.103 2026-02-08 01:51:42 +00:00
Viktor Barzin
9bc56b3f04
[ci skip] Add LLM agents, voice stack, and automations to ha-london knowledge map 2026-02-07 22:40:12 +00:00
Viktor Barzin
a5c782578c
[ci skip] Add ha-london knowledge map: RPi Docker setup, smart plugs, air quality, e-bike
ha-london runs on Raspberry Pi at 192.168.8.104 (Docker rootless, HA 2025.9.1).
Key systems: TP-Link Kasa smart plugs with energy monitoring, Apollo AIR-1 air
quality sensor (ESPHome), Cowboy e-bike, UptimeRobot, Oral-B BLE toothbrush.
SSH access via pi@192.168.8.104, config at /home/pi/docker/homeAssistant/.
2026-02-07 22:39:20 +00:00