From 3da35166ab234b4f8d0fe5123a5802f2d40fc32c Mon Sep 17 00:00:00 2001
From: Viktor Barzin
Date: Sun, 15 Feb 2026 17:18:17 +0000
Subject: [PATCH] [ci skip] Add skills: helm-stuck-release-recovery, k8s-hpa-scaling-storm, crowdsec-agent-registration-failure
---
 .../SKILL.md                                  |  99 +++++++++++++++
 .../helm-stuck-release-recovery/SKILL.md      |  93 ++++++++++++++
 .claude/skills/k8s-hpa-scaling-storm/SKILL.md | 113 ++++++++++++++++++
 3 files changed, 305 insertions(+)
 create mode 100644 .claude/skills/crowdsec-agent-registration-failure/SKILL.md
 create mode 100644 .claude/skills/helm-stuck-release-recovery/SKILL.md
 create mode 100644 .claude/skills/k8s-hpa-scaling-storm/SKILL.md

diff --git a/.claude/skills/crowdsec-agent-registration-failure/SKILL.md b/.claude/skills/crowdsec-agent-registration-failure/SKILL.md
new file mode 100644
index 00000000..099c0461
--- /dev/null
+++ b/.claude/skills/crowdsec-agent-registration-failure/SKILL.md
@@ -0,0 +1,99 @@
---
name: crowdsec-agent-registration-failure
description: |
  Fix CrowdSec agent pods stuck in CrashLoopBackOff after LAPI restart due to stale
  machine registrations. Use when: (1) CrowdSec agent init container fails with
  "user already exist" error during cscli lapi register, (2) agent pods show hundreds
  of init container restarts, (3) LAPI was restarted or redeployed but agents kept
  running with old credentials, (4) cscli machines list shows stale entries for
  current agent pod names. Covers deleting stale registrations to allow re-registration.
author: Claude Code
version: 1.0.0
date: 2026-02-15
---

# CrowdSec Agent Registration Failure

## Problem
After a CrowdSec LAPI restart or redeployment, agent DaemonSet pods lose their
credentials but LAPI retains the old machine registrations. When agents try to
re-register with the same pod name, the `wait-for-lapi-and-register` init container
fails with `user already exist`, causing CrashLoopBackOff with hundreds of restarts.

## Context / Trigger Conditions
- Agent init container logs show: `Error: cscli lapi register: api client register: api register ... user 'crowdsec-agent-xxxxx': user already exist`
- Agent pods show status `CrashLoopBackOff` or `Init:CrashLoopBackOff` with many restarts
- `kubectl describe pod` shows `BackOff restarting failed container wait-for-lapi-and-register`
- LAPI pods were recently restarted or redeployed
- `cscli machines list` on LAPI shows entries matching the stuck agent pod names

## Solution

### Step 1: Identify stuck agents
```bash
kubectl --kubeconfig $(pwd)/config get pods -n crowdsec
```
Note the pod names that are in CrashLoopBackOff (e.g., `crowdsec-agent-jr5q7`).

### Step 2: Confirm the init container error
```bash
kubectl --kubeconfig $(pwd)/config logs -n crowdsec <agent-pod-name> -c wait-for-lapi-and-register --tail=5
```
The output should show the `user already exist` error.

### Step 3: Find a running LAPI pod
```bash
kubectl --kubeconfig $(pwd)/config get pods -n crowdsec | grep lapi
```

### Step 4: Delete stale machine registrations from LAPI
```bash
kubectl --kubeconfig $(pwd)/config exec -n crowdsec <lapi-pod-name> -- cscli machines delete <agent-pod-name>
```
Repeat for each stuck agent.

### Step 5: Wait for agents to recover
The agents are in CrashLoopBackOff with exponential backoff (up to 5 minutes). They'll
automatically retry registration and succeed once the stale entry is deleted. This can
take up to 5 minutes per agent, depending on where each pod is in the backoff cycle.
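
If you'd rather not wait out the backoff, one shortcut (a sketch, assuming the agents are
DaemonSet-managed so deleted pods are recreated immediately) is to delete the stuck pods
after removing the stale registrations; the fresh pods retry registration on startup.

```bash
# Optional: force an immediate retry instead of waiting out the backoff.
# Pod names below are examples -- substitute the stuck agents from Step 1.
kubectl --kubeconfig $(pwd)/config delete pod -n crowdsec \
  crowdsec-agent-jr5q7 crowdsec-agent-mtgxh crowdsec-agent-pfw2l
```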

## Verification
```bash
# All agents should show Running status
kubectl --kubeconfig $(pwd)/config get pods -n crowdsec | grep agent
# DaemonSet should show all pods READY
kubectl --kubeconfig $(pwd)/config get ds -n crowdsec
```

## Example
```bash
# Identify stuck agents
$ kubectl get pods -n crowdsec | grep agent
crowdsec-agent-jr5q7   0/1   CrashLoopBackOff   485   3d
crowdsec-agent-jw76q   1/1   Running            8     3d
crowdsec-agent-mtgxh   0/1   CrashLoopBackOff   483   3d
crowdsec-agent-pfw2l   0/1   CrashLoopBackOff   481   3d

# Delete stale registrations
$ kubectl exec -n crowdsec crowdsec-lapi-xxx -- cscli machines delete crowdsec-agent-jr5q7
level=info msg="machine 'crowdsec-agent-jr5q7' deleted successfully"
$ kubectl exec -n crowdsec crowdsec-lapi-xxx -- cscli machines delete crowdsec-agent-mtgxh
$ kubectl exec -n crowdsec crowdsec-lapi-xxx -- cscli machines delete crowdsec-agent-pfw2l

# Wait ~5 minutes, then verify
$ kubectl get pods -n crowdsec | grep agent
crowdsec-agent-jr5q7   1/1   Running   1   3d
crowdsec-agent-jw76q   1/1   Running   8   3d
crowdsec-agent-mtgxh   1/1   Running   1   3d
crowdsec-agent-pfw2l   1/1   Running   1   3d
```

## Notes
- This is a known limitation of the CrowdSec Helm chart: the init container registration
  script is not idempotent (it doesn't handle "already exists" by deleting and re-registering).
- The `cscli machines list` output will show many historical stale entries from past
  DaemonSet rollouts. These are harmless but can be cleaned up if desired.
- This issue also causes the CrowdSec blocklist import CronJob to fail, since it selects
  agent pods alphabetically and may pick a non-running one. Fixing the agents also fixes
  the blocklist import.
- See also: `k8s-nfs-mount-troubleshooting` for other common pod startup failures.

diff --git a/.claude/skills/helm-stuck-release-recovery/SKILL.md b/.claude/skills/helm-stuck-release-recovery/SKILL.md
new file mode 100644
index 00000000..ac9f999d
--- /dev/null
+++ b/.claude/skills/helm-stuck-release-recovery/SKILL.md
@@ -0,0 +1,93 @@
---
name: helm-stuck-release-recovery
description: |
  Fix Helm releases stuck in pending-upgrade, pending-rollback, or pending-install states.
  Use when: (1) terraform apply fails with "another operation (install/upgrade/rollback) is
  in progress", (2) helm history shows status "pending-upgrade" or "pending-rollback",
  (3) a Helm upgrade was interrupted by network timeout, etcd timeout, or VPN drop,
  (4) helm upgrade fails with "an error occurred while finding last successful release".
  Covers manual secret cleanup to restore a Helm release to a deployable state.
author: Claude Code
version: 1.0.0
date: 2026-02-15
---

# Helm Stuck Release Recovery

## Problem
Helm releases can get stuck in `pending-upgrade`, `pending-rollback`, or `pending-install`
states when an upgrade is interrupted (network drop, etcd timeout, resource exhaustion).
Subsequent upgrades or terraform applies then fail because Helm thinks an operation is still in progress.
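
Helm 3 stores each revision of a release as a Secret in the release namespace, labelled
with the release name, status, and version. A quick way to see which revision is wedged
(a minimal sketch; `<release>` and `<namespace>` are placeholders for your release) is to
list those secrets with their status label:

```bash
# Each Helm revision is stored as a Secret labelled with its status
# (deployed, failed, pending-upgrade, pending-rollback, ...).
kubectl --kubeconfig $(pwd)/config get secrets -n <namespace> \
  -l owner=helm,name=<release> -L status
```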

## Context / Trigger Conditions
- `terraform apply` fails with: `another operation (install/upgrade/rollback) is in progress`
- `helm history <release> -n <namespace>` shows `pending-upgrade`, `pending-rollback`, or `pending-install`
- A previous Helm upgrade was interrupted by network timeout, VPN drop, or etcd timeout
- `helm upgrade` fails with: `an error occurred while finding last successful release`

## Solution

### Step 1: Identify the stuck release
```bash
helm --kubeconfig $(pwd)/config history <release> -n <namespace> | tail -5
```

Look for revisions with status `pending-upgrade`, `pending-rollback`, or `pending-install`.

### Step 2: Delete the stuck Helm release secrets
Each Helm revision is stored as a Kubernetes secret named `sh.helm.release.v1.<release>.v<revision>`.
Delete all stuck revisions:

```bash
# Delete specific stuck revision (e.g., revision 5)
kubectl --kubeconfig $(pwd)/config delete secret sh.helm.release.v1.<release>.v5 -n <namespace>

# If multiple stuck revisions exist, delete all of them
kubectl --kubeconfig $(pwd)/config delete secret sh.helm.release.v1.<release>.v6 -n <namespace>
```

### Step 3: Verify the release is clean
```bash
helm --kubeconfig $(pwd)/config history <release> -n <namespace> | tail -3
```

The latest revision should now show `deployed` status.

### Step 4: Retry the upgrade
```bash
terraform apply -target=module.kubernetes_cluster.module.<module-name> -var="kube_config_path=$(pwd)/config" -auto-approve
```

## Important Notes

- **Never patch the secret labels** (e.g., changing `status: pending-rollback` to `status: failed`).
  This changes the label but not the encoded release data inside the secret, leaving Helm in an
  inconsistent state. Always delete the stuck secrets entirely.
- If the failed upgrade partially applied changes to the cluster (e.g., modified a Deployment),
  the next successful upgrade will reconcile the state.
- When VPN/network is unstable, prefer a direct `helm upgrade <release> <chart> --reuse-values --set key=value`
  over `terraform apply`, since Helm upgrades are faster than the full Terraform refresh cycle.

## Verification
After deleting stuck secrets and re-applying:
- `helm history` shows the new revision as `deployed`
- `terraform apply` completes without errors

## Example
```bash
# Helm history shows stuck state
$ helm history nextcloud -n nextcloud | tail -3
4   deployed          nextcloud-8.8.1   Upgrade complete
5   failed            nextcloud-8.8.1   Upgrade failed: etcd timeout
6   pending-rollback  nextcloud-8.8.1   Rollback to 4

# Fix: delete stuck revisions
$ kubectl delete secret sh.helm.release.v1.nextcloud.v5 sh.helm.release.v1.nextcloud.v6 -n nextcloud

# Verify clean state
$ helm history nextcloud -n nextcloud | tail -1
4   deployed          nextcloud-8.8.1   Upgrade complete

# Re-apply
$ terraform apply -target=module.kubernetes_cluster.module.nextcloud -auto-approve
```

diff --git a/.claude/skills/k8s-hpa-scaling-storm/SKILL.md b/.claude/skills/k8s-hpa-scaling-storm/SKILL.md
new file mode 100644
index 00000000..cdeab5e8
--- /dev/null
+++ b/.claude/skills/k8s-hpa-scaling-storm/SKILL.md
@@ -0,0 +1,113 @@
---
name: k8s-hpa-scaling-storm
description: |
  Fix and prevent HPA (HorizontalPodAutoscaler) scaling storms where pods scale to
  maxReplicas uncontrollably.
  Use when: (1) HPA shows memory or CPU utilization at
  200%+ causing rapid scale-up, (2) dozens or hundreds of pods created by HPA in minutes,
  (3) cluster becomes unstable due to resource exhaustion from too many pods,
  (4) etcd timeouts or API server crashes from pod churn, (5) adding resource requests
  to a deployment that previously had none causes HPA to miscalculate utilization.
  Covers emergency response and prevention patterns.
author: Claude Code
version: 1.0.0
date: 2026-02-15
---

# Kubernetes HPA Scaling Storm

## Problem
When an HPA is configured with a memory or CPU utilization target but the underlying
deployment has insufficient resource requests, the HPA calculates artificially high
utilization percentages (e.g., actual usage of 570Mi against a 256Mi request reports ~220%).
This causes the HPA to scale pods to maxReplicas (often 100) within minutes, exhausting
cluster resources and potentially crashing etcd and the API server.

## Context / Trigger Conditions
- `kubectl get hpa` shows targets like `<current>%/70%` with very high current utilization (200%+)
- Pod count for a deployment rapidly increases to maxReplicas
- etcd timeout errors in `kubectl` or `terraform apply`
- API server becomes unreachable (`connection refused` or `network is unreachable`)
- Adding resource requests to a Helm chart that previously had none
- Memory-based HPA targets with real usage far exceeding requests

## Solution

### Emergency Response (stop the storm)

**Step 1: Delete the HPA immediately**
```bash
kubectl --kubeconfig $(pwd)/config delete hpa <hpa-name> -n <namespace>
```

**Step 2: Scale the deployment down**
```bash
kubectl --kubeconfig $(pwd)/config scale deployment <deployment-name> -n <namespace> --replicas=2
```

**Step 3: Wait for pods to terminate and cluster to stabilize**
```bash
# Watch pod count decrease
kubectl --kubeconfig $(pwd)/config get pods -n <namespace> -l