From bcbe8b23b4be2f7a88683f9683e787ae42451403 Mon Sep 17 00:00:00 2001
From: Viktor Barzin
Date: Fri, 6 Mar 2026 23:17:40 +0000
Subject: [PATCH] [ci skip] archive 28 unused skills, add runbook index to CLAUDE.md, add cluster-health agent

- Move 28 never-invoked troubleshooting runbook skills to .claude/skills/archived/
- Keep 7 active workflow skills: cluster-health, uptime-kuma, pfsense, home-assistant, setup-project, extend-vm-storage, k8s-ndots
- Add one-line runbook index to CLAUDE.md for quick reference
- Create cluster-health-checker custom agent (haiku model, read-only + bash) for autonomous health checks without consuming main context
---
 .claude/CLAUDE.md                             | 32 ++++++++++++-
 .claude/agents/cluster-health-checker.md      | 48 +++++++++++++++++++
 .../authentik-oidc-kubernetes/SKILL.md        |  0
 .../skills/{ => archived}/authentik/SKILL.md  |  0
 .../bluestacks-burp-interception/SKILL.md     |  0
 .../SKILL.md                                  |  0
 .../coturn-k8s-without-hostnetwork/SKILL.md   |  0
 .../SKILL.md                                  |  0
 .../fastapi-svelte-gpu-webui/SKILL.md         |  0
 .../grafana-stale-datasource-cleanup/SKILL.md |  0
 .../helm-release-troubleshooting/SKILL.md     |  0
 .../ingress-factory-migration/SKILL.md        |  0
 .../k8s-container-image-caching/SKILL.md      |  0
 .../k8s-gpu-no-nvidia-devices/SKILL.md        |  0
 .../k8s-hpa-scaling-storm/SKILL.md            |  0
 .../k8s-nfs-mount-troubleshooting/SKILL.md    |  0
 .../SKILL.md                                  |  0
 .../local-llm-gpu-selection/SKILL.md          |  0
 .../loki-helm-deployment-pitfalls/SKILL.md    |  0
 .../SKILL.md                                  |  0
 .../nextcloud-calendar/SKILL.md               |  0
 .../nfsv4-idmapd-uid-mapping/SKILL.md         |  0
 .../openclaw-k8s-deployment/SKILL.md          |  0
 .../SKILL.md                                  |  0
 .../pfsense-nat-rule-creation/SKILL.md        |  0
 .../SKILL.md                                  |  0
 .../python-filename-sanitization/SKILL.md     |  0
 .../SKILL.md                                  |  0
 .../traefik-helm-configuration/SKILL.md       |  0
 .../SKILL.md                                  |  0
 30 files changed, 79 insertions(+), 1 deletion(-)
 create mode 100644 .claude/agents/cluster-health-checker.md
 rename .claude/skills/{ => archived}/authentik-oidc-kubernetes/SKILL.md (100%)
 rename .claude/skills/{ => archived}/authentik/SKILL.md (100%)
 rename .claude/skills/{ => archived}/bluestacks-burp-interception/SKILL.md (100%)
 rename .claude/skills/{ => archived}/clickhouse-k8s-nfs-system-log-bloat/SKILL.md (100%)
 rename .claude/skills/{ => archived}/coturn-k8s-without-hostnetwork/SKILL.md (100%)
 rename .claude/skills/{ => archived}/crowdsec-agent-registration-failure/SKILL.md (100%)
 rename .claude/skills/{ => archived}/fastapi-svelte-gpu-webui/SKILL.md (100%)
 rename .claude/skills/{ => archived}/grafana-stale-datasource-cleanup/SKILL.md (100%)
 rename .claude/skills/{ => archived}/helm-release-troubleshooting/SKILL.md (100%)
 rename .claude/skills/{ => archived}/ingress-factory-migration/SKILL.md (100%)
 rename .claude/skills/{ => archived}/k8s-container-image-caching/SKILL.md (100%)
 rename .claude/skills/{ => archived}/k8s-gpu-no-nvidia-devices/SKILL.md (100%)
 rename .claude/skills/{ => archived}/k8s-hpa-scaling-storm/SKILL.md (100%)
 rename .claude/skills/{ => archived}/k8s-nfs-mount-troubleshooting/SKILL.md (100%)
 rename .claude/skills/{ => archived}/kubelet-static-pod-manifest-update/SKILL.md (100%)
 rename .claude/skills/{ => archived}/local-llm-gpu-selection/SKILL.md (100%)
 rename .claude/skills/{ => archived}/loki-helm-deployment-pitfalls/SKILL.md (100%)
 rename .claude/skills/{ => archived}/music-assistant-librespot-wrong-account/SKILL.md (100%)
 rename .claude/skills/{ => archived}/nextcloud-calendar/SKILL.md (100%)
 rename .claude/skills/{ => archived}/nfsv4-idmapd-uid-mapping/SKILL.md (100%)
 rename .claude/skills/{ => archived}/openclaw-k8s-deployment/SKILL.md (100%)
 rename .claude/skills/{ => archived}/pfsense-dnsmasq-interface-binding/SKILL.md (100%)
 rename .claude/skills/{ => archived}/pfsense-nat-rule-creation/SKILL.md (100%)
 rename .claude/skills/{ => archived}/proxmox-vm-disk-expansion-pitfalls/SKILL.md (100%)
 rename .claude/skills/{ => archived}/python-filename-sanitization/SKILL.md (100%)
 rename .claude/skills/{ => archived}/terraform-state-identity-mismatch/SKILL.md (100%)
 rename .claude/skills/{ => archived}/traefik-helm-configuration/SKILL.md (100%)
 rename .claude/skills/{ => archived}/traefik-rewrite-body-troubleshooting/SKILL.md (100%)

diff --git a/.claude/CLAUDE.md b/.claude/CLAUDE.md
index dfd4bf2f..c5a32e5b 100755
--- a/.claude/CLAUDE.md
+++ b/.claude/CLAUDE.md
@@ -227,4 +227,34 @@ Custom quota namespaces: `authentik` (16 req CPU/16Gi req mem/48 lim CPU/96Gi li
 - **Architecture**: 3 server + 3 worker + 3 PgBouncer + embedded outpost
 - **Traefik integration**: Forward auth via `protected = true` in ingress_factory
 - **OIDC for K8s**: Issuer `https://authentik.viktorbarzin.me/application/o/kubernetes/`, client `kubernetes` (public)
-- For management tasks and OIDC gotchas: see `authentik` and `authentik-oidc-kubernetes` skills
+- For management tasks and OIDC gotchas: see archived skills `authentik` and `authentik-oidc-kubernetes`
+
+## Archived Troubleshooting Runbooks
+Skills moved to `.claude/skills/archived/` — reference when the specific issue arises:
+- **authentik** / **authentik-oidc-kubernetes**: Authentik REST API management, OIDC for K8s setup
+- **bluestacks-burp-interception**: Android HTTPS interception via BlueStacks + Burp Suite
+- **clickhouse-k8s-nfs-system-log-bloat**: ClickHouse high CPU from unbounded system log tables on NFS
+- **coturn-k8s-without-hostnetwork**: Deploy coturn on K8s with narrow relay port range + MetalLB
+- **crowdsec-agent-registration-failure**: CrowdSec agents stuck after LAPI restart (stale machine registrations)
+- **fastapi-svelte-gpu-webui**: Pattern for wrapping GPU CLI tools with FastAPI + Svelte web UI
+- **grafana-stale-datasource-cleanup**: Fix stale Grafana datasources via direct MySQL access
+- **helm-release-troubleshooting**: Fix stuck Helm releases (pending-upgrade, failed state)
+- **ingress-factory-migration**: Migrate raw kubernetes_ingress_v1 to ingress_factory module
+- **k8s-container-image-caching**: Pull-through cache setup/troubleshooting for containerd
+- **k8s-gpu-no-nvidia-devices**: Fix pods with GPU allocation but no /dev/nvidia* devices
+- **k8s-hpa-scaling-storm**: Fix HPA scaling to maxReplicas uncontrollably
+- **k8s-nfs-mount-troubleshooting**: Debug NFS mount failures (ContainerCreating, permission denied, stale mounts)
+- **kubelet-static-pod-manifest-update**: Force kubelet to pick up static pod manifest changes
+- **local-llm-gpu-selection**: GPU selection guide for local LLM inference on Dell R730
+- **loki-helm-deployment-pitfalls**: Fix Loki Helm chart issues (read-only FS, canary, stuck releases)
+- **music-assistant-librespot-wrong-account**: Fix librespot "free account" error from stale credential cache
+- **nextcloud-calendar**: CalDAV calendar management via Nextcloud API
+- **nfsv4-idmapd-uid-mapping**: Fix all UIDs showing as 65534 in containers (NFSv4 idmapd)
+- **openclaw-k8s-deployment**: OpenClaw gateway K8s deployment gotchas
+- **pfsense-dnsmasq-interface-binding**: Restrict dnsmasq to specific interfaces for port 53 forwarding
+- **pfsense-nat-rule-creation**: Create NAT rules programmatically via PHP/SSH
+- **proxmox-vm-disk-expansion-pitfalls**: Fix growpart/drain issues when expanding Proxmox VM disks
+- **python-filename-sanitization**: Secure filename sanitization for Python web apps
+- **terraform-state-identity-mismatch**: Fix "Unexpected Identity Change" via state rm + reimport
+- **traefik-helm-configuration**: HTTP/3, UDP routing, plugin download failures
+- **traefik-rewrite-body-troubleshooting**: Fix compression corruption and silent skip in rewrite-body plugin
diff --git a/.claude/agents/cluster-health-checker.md b/.claude/agents/cluster-health-checker.md
new file mode 100644
index 00000000..144ed6da
--- /dev/null
+++ b/.claude/agents/cluster-health-checker.md
@@ -0,0 +1,48 @@
+---
+name: cluster-health-checker
+description: Check Kubernetes cluster health, diagnose issues, and apply safe auto-fixes. Use when asked to check cluster status, health, or fix common pod issues.
+tools: Read, Bash, Grep, Glob
+model: haiku
+---
+
+You are a Kubernetes cluster health checker for a homelab cluster managed via Terraform/Terragrunt.
+
+## Your Job
+
+Run the cluster healthcheck script and interpret the results. If issues are found, investigate root causes and apply safe fixes.
+
+## Environment
+
+- **Kubeconfig**: `/Users/viktorbarzin/code/infra/config` (always use `kubectl --kubeconfig /Users/viktorbarzin/code/infra/config`)
+- **Healthcheck script**: `bash /Users/viktorbarzin/code/infra/scripts/cluster_healthcheck.sh --quiet`
+- **Infra repo**: `/Users/viktorbarzin/code/infra`
+
+## Workflow
+
+1. Run `bash /Users/viktorbarzin/code/infra/scripts/cluster_healthcheck.sh --quiet`
+2. Parse the output — identify PASS/WARN/FAIL counts and specific issues
+3. For each FAIL or WARN, investigate the root cause:
+   - **Problematic pods**: `kubectl describe pod`, `kubectl logs --previous`
+   - **Failed deployments**: check rollout status, events
+   - **StatefulSet issues**: check pod readiness, GR status for MySQL
+   - **Prometheus alerts**: query via port-forward to prometheus-server
+4. Apply safe auto-fixes:
+   - Delete evicted/failed pods: `kubectl delete pods -A --field-selector=status.phase=Failed`
+   - Delete stale failed jobs: `kubectl delete jobs -n <namespace> --field-selector=status.successful=0`
+   - Restart stuck pods (>10 restarts): `kubectl delete pod <pod> -n <namespace> --grace-period=0`
+5. Report findings concisely
+
+## NEVER Do
+
+- Never `kubectl apply/edit/patch` — all changes go through Terraform
+- Never restart NFS on TrueNAS
+- Never modify secrets or tfvars
+- Never push to git
+- Never scale deployments to 0
+
+## Known Expected Conditions
+
+These are not actionable — just report them:
+- **ha-london** Uptime Kuma monitor down — external Home Assistant, not in this cluster
+- **Resource usage >80%** on nodes — WARN only if actual usage is high, not limits overcommit
+- **PVFillingUp** for navidrome-music — Synology NAS volume, threshold is 95%
diff --git a/.claude/skills/authentik-oidc-kubernetes/SKILL.md b/.claude/skills/archived/authentik-oidc-kubernetes/SKILL.md
similarity index 100%
rename from .claude/skills/authentik-oidc-kubernetes/SKILL.md
rename to .claude/skills/archived/authentik-oidc-kubernetes/SKILL.md
diff --git a/.claude/skills/authentik/SKILL.md b/.claude/skills/archived/authentik/SKILL.md
similarity index 100%
rename from .claude/skills/authentik/SKILL.md
rename to .claude/skills/archived/authentik/SKILL.md
diff --git a/.claude/skills/bluestacks-burp-interception/SKILL.md b/.claude/skills/archived/bluestacks-burp-interception/SKILL.md
similarity index 100%
rename from .claude/skills/bluestacks-burp-interception/SKILL.md
rename to .claude/skills/archived/bluestacks-burp-interception/SKILL.md
diff --git a/.claude/skills/clickhouse-k8s-nfs-system-log-bloat/SKILL.md b/.claude/skills/archived/clickhouse-k8s-nfs-system-log-bloat/SKILL.md
similarity index 100%
rename from .claude/skills/clickhouse-k8s-nfs-system-log-bloat/SKILL.md
rename to .claude/skills/archived/clickhouse-k8s-nfs-system-log-bloat/SKILL.md
diff --git a/.claude/skills/coturn-k8s-without-hostnetwork/SKILL.md b/.claude/skills/archived/coturn-k8s-without-hostnetwork/SKILL.md
similarity index 100%
rename from .claude/skills/coturn-k8s-without-hostnetwork/SKILL.md
rename to .claude/skills/archived/coturn-k8s-without-hostnetwork/SKILL.md
diff --git a/.claude/skills/crowdsec-agent-registration-failure/SKILL.md b/.claude/skills/archived/crowdsec-agent-registration-failure/SKILL.md
similarity index 100%
rename from .claude/skills/crowdsec-agent-registration-failure/SKILL.md
rename to .claude/skills/archived/crowdsec-agent-registration-failure/SKILL.md
diff --git a/.claude/skills/fastapi-svelte-gpu-webui/SKILL.md b/.claude/skills/archived/fastapi-svelte-gpu-webui/SKILL.md
similarity index 100%
rename from .claude/skills/fastapi-svelte-gpu-webui/SKILL.md
rename to .claude/skills/archived/fastapi-svelte-gpu-webui/SKILL.md
diff --git a/.claude/skills/grafana-stale-datasource-cleanup/SKILL.md b/.claude/skills/archived/grafana-stale-datasource-cleanup/SKILL.md
similarity index 100%
rename from .claude/skills/grafana-stale-datasource-cleanup/SKILL.md
rename to .claude/skills/archived/grafana-stale-datasource-cleanup/SKILL.md
diff --git a/.claude/skills/helm-release-troubleshooting/SKILL.md b/.claude/skills/archived/helm-release-troubleshooting/SKILL.md
similarity index 100%
rename from .claude/skills/helm-release-troubleshooting/SKILL.md
rename to .claude/skills/archived/helm-release-troubleshooting/SKILL.md
diff --git a/.claude/skills/ingress-factory-migration/SKILL.md b/.claude/skills/archived/ingress-factory-migration/SKILL.md
similarity index 100%
rename from .claude/skills/ingress-factory-migration/SKILL.md
rename to .claude/skills/archived/ingress-factory-migration/SKILL.md
diff --git a/.claude/skills/k8s-container-image-caching/SKILL.md b/.claude/skills/archived/k8s-container-image-caching/SKILL.md
similarity index 100%
rename from .claude/skills/k8s-container-image-caching/SKILL.md
rename to .claude/skills/archived/k8s-container-image-caching/SKILL.md
diff --git a/.claude/skills/k8s-gpu-no-nvidia-devices/SKILL.md b/.claude/skills/archived/k8s-gpu-no-nvidia-devices/SKILL.md
similarity index 100%
rename from .claude/skills/k8s-gpu-no-nvidia-devices/SKILL.md
rename to .claude/skills/archived/k8s-gpu-no-nvidia-devices/SKILL.md
diff --git a/.claude/skills/k8s-hpa-scaling-storm/SKILL.md b/.claude/skills/archived/k8s-hpa-scaling-storm/SKILL.md
similarity index 100%
rename from .claude/skills/k8s-hpa-scaling-storm/SKILL.md
rename to .claude/skills/archived/k8s-hpa-scaling-storm/SKILL.md
diff --git a/.claude/skills/k8s-nfs-mount-troubleshooting/SKILL.md b/.claude/skills/archived/k8s-nfs-mount-troubleshooting/SKILL.md
similarity index 100%
rename from .claude/skills/k8s-nfs-mount-troubleshooting/SKILL.md
rename to .claude/skills/archived/k8s-nfs-mount-troubleshooting/SKILL.md
diff --git a/.claude/skills/kubelet-static-pod-manifest-update/SKILL.md b/.claude/skills/archived/kubelet-static-pod-manifest-update/SKILL.md
similarity index 100%
rename from .claude/skills/kubelet-static-pod-manifest-update/SKILL.md
rename to .claude/skills/archived/kubelet-static-pod-manifest-update/SKILL.md
diff --git a/.claude/skills/local-llm-gpu-selection/SKILL.md b/.claude/skills/archived/local-llm-gpu-selection/SKILL.md
similarity index 100%
rename from .claude/skills/local-llm-gpu-selection/SKILL.md
rename to .claude/skills/archived/local-llm-gpu-selection/SKILL.md
diff --git a/.claude/skills/loki-helm-deployment-pitfalls/SKILL.md b/.claude/skills/archived/loki-helm-deployment-pitfalls/SKILL.md
similarity index 100%
rename from .claude/skills/loki-helm-deployment-pitfalls/SKILL.md
rename to .claude/skills/archived/loki-helm-deployment-pitfalls/SKILL.md
diff --git a/.claude/skills/music-assistant-librespot-wrong-account/SKILL.md b/.claude/skills/archived/music-assistant-librespot-wrong-account/SKILL.md
similarity index 100%
rename from .claude/skills/music-assistant-librespot-wrong-account/SKILL.md
rename to .claude/skills/archived/music-assistant-librespot-wrong-account/SKILL.md
diff --git a/.claude/skills/nextcloud-calendar/SKILL.md b/.claude/skills/archived/nextcloud-calendar/SKILL.md
similarity index 100%
rename from .claude/skills/nextcloud-calendar/SKILL.md
rename to .claude/skills/archived/nextcloud-calendar/SKILL.md
diff --git a/.claude/skills/nfsv4-idmapd-uid-mapping/SKILL.md b/.claude/skills/archived/nfsv4-idmapd-uid-mapping/SKILL.md
similarity index 100%
rename from .claude/skills/nfsv4-idmapd-uid-mapping/SKILL.md
rename to .claude/skills/archived/nfsv4-idmapd-uid-mapping/SKILL.md
diff --git a/.claude/skills/openclaw-k8s-deployment/SKILL.md b/.claude/skills/archived/openclaw-k8s-deployment/SKILL.md
similarity index 100%
rename from .claude/skills/openclaw-k8s-deployment/SKILL.md
rename to .claude/skills/archived/openclaw-k8s-deployment/SKILL.md
diff --git a/.claude/skills/pfsense-dnsmasq-interface-binding/SKILL.md b/.claude/skills/archived/pfsense-dnsmasq-interface-binding/SKILL.md
similarity index 100%
rename from .claude/skills/pfsense-dnsmasq-interface-binding/SKILL.md
rename to .claude/skills/archived/pfsense-dnsmasq-interface-binding/SKILL.md
diff --git a/.claude/skills/pfsense-nat-rule-creation/SKILL.md b/.claude/skills/archived/pfsense-nat-rule-creation/SKILL.md
similarity index 100%
rename from .claude/skills/pfsense-nat-rule-creation/SKILL.md
rename to .claude/skills/archived/pfsense-nat-rule-creation/SKILL.md
diff --git a/.claude/skills/proxmox-vm-disk-expansion-pitfalls/SKILL.md b/.claude/skills/archived/proxmox-vm-disk-expansion-pitfalls/SKILL.md
similarity index 100%
rename from .claude/skills/proxmox-vm-disk-expansion-pitfalls/SKILL.md
rename to .claude/skills/archived/proxmox-vm-disk-expansion-pitfalls/SKILL.md
diff --git a/.claude/skills/python-filename-sanitization/SKILL.md b/.claude/skills/archived/python-filename-sanitization/SKILL.md
similarity index 100%
rename from .claude/skills/python-filename-sanitization/SKILL.md
rename to .claude/skills/archived/python-filename-sanitization/SKILL.md
diff --git a/.claude/skills/terraform-state-identity-mismatch/SKILL.md b/.claude/skills/archived/terraform-state-identity-mismatch/SKILL.md
similarity index 100%
rename from .claude/skills/terraform-state-identity-mismatch/SKILL.md
rename to .claude/skills/archived/terraform-state-identity-mismatch/SKILL.md
diff --git a/.claude/skills/traefik-helm-configuration/SKILL.md b/.claude/skills/archived/traefik-helm-configuration/SKILL.md
similarity index 100%
rename from .claude/skills/traefik-helm-configuration/SKILL.md
rename to .claude/skills/archived/traefik-helm-configuration/SKILL.md
diff --git a/.claude/skills/traefik-rewrite-body-troubleshooting/SKILL.md b/.claude/skills/archived/traefik-rewrite-body-troubleshooting/SKILL.md
similarity index 100%
rename from .claude/skills/traefik-rewrite-body-troubleshooting/SKILL.md
rename to .claude/skills/archived/traefik-rewrite-body-troubleshooting/SKILL.md
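
Note on workflow step 2 of `cluster-health-checker.md` (parse the healthcheck output and identify PASS/WARN/FAIL counts): the patch does not show what `cluster_healthcheck.sh --quiet` prints, so the following is only a minimal sketch of that parsing step. The `[PASS]`/`[WARN]`/`[FAIL]` line prefix and the `summarize` helper are assumptions made for illustration, not the script's confirmed format.

```python
import re
from collections import Counter

# Hypothetical line format: "[FAIL] 2 pods in CrashLoopBackOff".
# The real output of cluster_healthcheck.sh is not shown in this patch.
STATUS_LINE = re.compile(r"^\s*\[(PASS|WARN|FAIL)\]\s*(.*)$")


def summarize(output: str) -> tuple[Counter, list[str]]:
    """Return per-status counts and the WARN/FAIL lines to investigate."""
    counts: Counter = Counter()
    issues: list[str] = []
    for line in output.splitlines():
        match = STATUS_LINE.match(line)
        if match is None:
            continue  # skip banners, blank lines, and other unrecognized text
        status, detail = match.groups()
        counts[status] += 1
        if status != "PASS":
            issues.append(f"{status}: {detail}")
    return counts, issues
```

If the script does emit lines in this shape, its captured stdout could be passed straight to `summarize`, and the returned issue list would drive the per-item investigation in step 3.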