[ci skip] archive 28 unused skills, add runbook index to CLAUDE.md, add cluster-health agent
- Move 28 never-invoked troubleshooting runbook skills to .claude/skills/archived/
- Keep 7 active workflow skills: cluster-health, uptime-kuma, pfsense, home-assistant, setup-project, extend-vm-storage, k8s-ndots
- Add a one-line runbook index to CLAUDE.md for quick reference
- Create the cluster-health-checker custom agent (haiku model, read-only + bash) for autonomous health checks without consuming main context
parent 614d14c47d
commit bcbe8b23b4
30 changed files with 79 additions and 1 deletion
CLAUDE.md
@@ -227,4 +227,34 @@ Custom quota namespaces: `authentik` (16 req CPU/16Gi req mem/48 lim CPU/96Gi li
 - **Architecture**: 3 server + 3 worker + 3 PgBouncer + embedded outpost
 - **Traefik integration**: Forward auth via `protected = true` in ingress_factory
 - **OIDC for K8s**: Issuer `https://authentik.viktorbarzin.me/application/o/kubernetes/`, client `kubernetes` (public)
-- For management tasks and OIDC gotchas: see `authentik` and `authentik-oidc-kubernetes` skills
+- For management tasks and OIDC gotchas: see archived skills `authentik` and `authentik-oidc-kubernetes`
+
+## Archived Troubleshooting Runbooks
+Skills moved to `.claude/skills/archived/` — reference when the specific issue arises:
+- **authentik** / **authentik-oidc-kubernetes**: Authentik REST API management, OIDC for K8s setup
+- **bluestacks-burp-interception**: Android HTTPS interception via BlueStacks + Burp Suite
+- **clickhouse-k8s-nfs-system-log-bloat**: ClickHouse high CPU from unbounded system log tables on NFS
+- **coturn-k8s-without-hostnetwork**: Deploy coturn on K8s with narrow relay port range + MetalLB
+- **crowdsec-agent-registration-failure**: CrowdSec agents stuck after LAPI restart (stale machine registrations)
+- **fastapi-svelte-gpu-webui**: Pattern for wrapping GPU CLI tools with FastAPI + Svelte web UI
+- **grafana-stale-datasource-cleanup**: Fix stale Grafana datasources via direct MySQL access
+- **helm-release-troubleshooting**: Fix stuck Helm releases (pending-upgrade, failed state)
+- **ingress-factory-migration**: Migrate raw kubernetes_ingress_v1 to ingress_factory module
+- **k8s-container-image-caching**: Pull-through cache setup/troubleshooting for containerd
+- **k8s-gpu-no-nvidia-devices**: Fix pods with GPU allocation but no /dev/nvidia* devices
+- **k8s-hpa-scaling-storm**: Fix HPA scaling to maxReplicas uncontrollably
+- **k8s-nfs-mount-troubleshooting**: Debug NFS mount failures (ContainerCreating, permission denied, stale mounts)
+- **kubelet-static-pod-manifest-update**: Force kubelet to pick up static pod manifest changes
+- **local-llm-gpu-selection**: GPU selection guide for local LLM inference on Dell R730
+- **loki-helm-deployment-pitfalls**: Fix Loki Helm chart issues (read-only FS, canary, stuck releases)
+- **music-assistant-librespot-wrong-account**: Fix librespot "free account" error from stale credential cache
+- **nextcloud-calendar**: CalDAV calendar management via Nextcloud API
+- **nfsv4-idmapd-uid-mapping**: Fix all UIDs showing as 65534 in containers (NFSv4 idmapd)
+- **openclaw-k8s-deployment**: OpenClaw gateway K8s deployment gotchas
+- **pfsense-dnsmasq-interface-binding**: Restrict dnsmasq to specific interfaces for port 53 forwarding
+- **pfsense-nat-rule-creation**: Create NAT rules programmatically via PHP/SSH
+- **proxmox-vm-disk-expansion-pitfalls**: Fix growpart/drain issues when expanding Proxmox VM disks
+- **python-filename-sanitization**: Secure filename sanitization for Python web apps
+- **terraform-state-identity-mismatch**: Fix "Unexpected Identity Change" via state rm + reimport
+- **traefik-helm-configuration**: HTTP/3, UDP routing, plugin download failures
+- **traefik-rewrite-body-troubleshooting**: Fix compression corruption and silent skip in rewrite-body plugin
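The `OIDC for K8s` line above gives an issuer URL and a public client id. A minimal sketch of wiring that pair into kubeconfig exec credentials, assuming the third-party int128 kubelogin plugin (`kubectl oidc-login`) as the token helper — the plugin choice and the `oidc` user name are assumptions not taken from this commit, so the command is printed for review rather than executed:

```shell
# Sketch only: build (do not run) a set-credentials command for the public
# OIDC client. ASSUMPTION: the int128 kubelogin plugin is the token helper;
# only the issuer URL and client id below come from the repo docs.
issuer="https://authentik.viktorbarzin.me/application/o/kubernetes/"
client_id="kubernetes"

cmd="kubectl config set-credentials oidc \
--exec-api-version=client.authentication.k8s.io/v1beta1 \
--exec-command=kubectl \
--exec-arg=oidc-login --exec-arg=get-token \
--exec-arg=--oidc-issuer-url=$issuer \
--exec-arg=--oidc-client-id=$client_id"

# Print for review before applying against the real kubeconfig.
echo "$cmd"
```

Since the client is public, no client secret argument is needed; kubelogin opens a browser against the issuer to obtain the token.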
.claude/agents/cluster-health-checker.md (new file, 48 lines)
@@ -0,0 +1,48 @@
+---
+name: cluster-health-checker
+description: Check Kubernetes cluster health, diagnose issues, and apply safe auto-fixes. Use when asked to check cluster status, health, or fix common pod issues.
+tools: Read, Bash, Grep, Glob
+model: haiku
+---
+
+You are a Kubernetes cluster health checker for a homelab cluster managed via Terraform/Terragrunt.
+
+## Your Job
+
+Run the cluster healthcheck script and interpret the results. If issues are found, investigate root causes and apply safe fixes.
+
+## Environment
+
+- **Kubeconfig**: `/Users/viktorbarzin/code/infra/config` (always use `kubectl --kubeconfig /Users/viktorbarzin/code/infra/config`)
+- **Healthcheck script**: `bash /Users/viktorbarzin/code/infra/scripts/cluster_healthcheck.sh --quiet`
+- **Infra repo**: `/Users/viktorbarzin/code/infra`
+
+## Workflow
+
+1. Run `bash /Users/viktorbarzin/code/infra/scripts/cluster_healthcheck.sh --quiet`
+2. Parse the output — identify PASS/WARN/FAIL counts and specific issues
+3. For each FAIL or WARN, investigate the root cause:
+   - **Problematic pods**: `kubectl describe pod`, `kubectl logs --previous`
+   - **Failed deployments**: check rollout status, events
+   - **StatefulSet issues**: check pod readiness, GR status for MySQL
+   - **Prometheus alerts**: query via port-forward to prometheus-server
+4. Apply safe auto-fixes:
+   - Delete evicted/failed pods: `kubectl delete pods -A --field-selector=status.phase=Failed`
+   - Delete stale failed jobs: `kubectl delete jobs -n <ns> --field-selector=status.successful=0`
+   - Restart stuck pods (>10 restarts): `kubectl delete pod -n <ns> <pod> --grace-period=0`
+5. Report findings concisely
+
+## NEVER Do
+
+- Never `kubectl apply/edit/patch` — all changes go through Terraform
+- Never restart NFS on TrueNAS
+- Never modify secrets or tfvars
+- Never push to git
+- Never scale deployments to 0
+
+## Known Expected Conditions
+
+These are not actionable — just report them:
+- **ha-london** Uptime Kuma monitor down — external Home Assistant, not in this cluster
+- **Resource usage >80%** on nodes — WARN only if actual usage is high, not limits overcommit
+- **PVFillingUp** for navidrome-music — Synology NAS volume, threshold is 95%
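The agent workflow in the file above can be sketched as a dry-run shell script. The PASS/WARN/FAIL line format below is an assumed example — the real output format of `cluster_healthcheck.sh` is not shown in this commit — and the auto-fix command is only printed, never executed:

```shell
# Dry-run sketch of workflow steps 1, 2, and 4 from the agent file.
# ASSUMPTION: healthcheck lines look like "[PASS] ...", "[WARN] ...",
# "[FAIL] ..." -- the real script's format is not shown in this commit.
set -eu

KUBECONFIG=/Users/viktorbarzin/code/infra/config

# Step 1 would be:
#   bash /Users/viktorbarzin/code/infra/scripts/cluster_healthcheck.sh --quiet
# A canned sample stands in for it so the sketch runs anywhere:
sample_output="$(printf '%s\n' \
  '[PASS] all nodes Ready' \
  '[WARN] node resource usage >80%' \
  '[FAIL] pod CrashLoopBackOff in ns media')"

# Step 2: count PASS/WARN/FAIL lines (grep -c prints 0 on no match).
count() { printf '%s\n' "$sample_output" | grep -c "^\[$1\]" || true; }
pass=$(count PASS); warn=$(count WARN); fail=$(count FAIL)
echo "PASS=$pass WARN=$warn FAIL=$fail"

# Step 4: when something failed, print (do not run) the first safe auto-fix.
if [ "$fail" -gt 0 ]; then
  echo "kubectl --kubeconfig $KUBECONFIG delete pods -A --field-selector=status.phase=Failed"
fi
```

Printing rather than running the fix keeps the sketch consistent with the agent's "NEVER Do" list: only the deletion commands named in step 4 of the file are ever safe to execute for real.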