[ci skip] Add skills: helm-stuck-release-recovery, k8s-hpa-scaling-storm, crowdsec-agent-registration-failure

Viktor Barzin 2026-02-15 17:18:17 +00:00
parent 95013c9056
commit 3da35166ab
3 changed files with 305 additions and 0 deletions

@@ -0,0 +1,99 @@
---
name: crowdsec-agent-registration-failure
description: |
Fix CrowdSec agent pods stuck in CrashLoopBackOff after LAPI restart due to stale
machine registrations. Use when: (1) CrowdSec agent init container fails with
"user already exist" error during cscli lapi register, (2) agent pods show hundreds
of init container restarts, (3) LAPI was restarted or redeployed but agents kept
running with old credentials, (4) cscli machines list shows stale entries for
current agent pod names. Covers deleting stale registrations to allow re-registration.
author: Claude Code
version: 1.0.0
date: 2026-02-15
---
# CrowdSec Agent Registration Failure
## Problem
After a CrowdSec LAPI restart or redeployment, agent DaemonSet pods lose their
credentials but LAPI retains the old machine registrations. When agents try to
re-register with the same pod name, the `wait-for-lapi-and-register` init container
fails with `user already exist`, causing CrashLoopBackOff with hundreds of restarts.
## Context / Trigger Conditions
- Agent init container logs show: `Error: cscli lapi register: api client register: api register ... user 'crowdsec-agent-xxxxx': user already exist`
- Agent pods show status `CrashLoopBackOff` or `Init:CrashLoopBackOff` with many restarts
- `kubectl describe pod` shows `BackOff restarting failed container wait-for-lapi-and-register`
- LAPI pods were recently restarted or redeployed
- `cscli machines list` on LAPI shows entries matching the stuck agent pod names
## Solution
### Step 1: Identify stuck agents
```bash
kubectl --kubeconfig $(pwd)/config get pods -n crowdsec
```
Note the pod names that are in CrashLoopBackOff (e.g., `crowdsec-agent-jr5q7`).
### Step 2: Confirm the init container error
```bash
kubectl --kubeconfig $(pwd)/config logs -n crowdsec <agent-pod> -c wait-for-lapi-and-register --tail=5
```
Should show `user already exist` error.
### Step 3: Find a running LAPI pod
```bash
kubectl --kubeconfig $(pwd)/config get pods -n crowdsec | grep lapi
```
### Step 4: Delete stale machine registrations from LAPI
```bash
kubectl --kubeconfig $(pwd)/config exec -n crowdsec <lapi-pod> -- cscli machines delete <agent-pod-name>
```
Repeat for each stuck agent.
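If several agents are stuck, a short loop avoids the repetition (a sketch that assumes the stale machine name is exactly the pod name, which is how the chart registers agents; confirm with `cscli machines list` first):
```bash
# Delete the stale registration for every agent pod currently in CrashLoopBackOff.
# Assumes machine name == pod name; double-check with `cscli machines list` first.
LAPI_POD=$(kubectl --kubeconfig $(pwd)/config get pods -n crowdsec -o name | grep lapi | head -1 | cut -d/ -f2)
for pod in $(kubectl --kubeconfig $(pwd)/config get pods -n crowdsec --no-headers | awk '/agent/ && /CrashLoopBackOff/ {print $1}'); do
  kubectl --kubeconfig $(pwd)/config exec -n crowdsec "$LAPI_POD" -- cscli machines delete "$pod"
done
```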
### Step 5: Wait for agents to recover
The agents are in CrashLoopBackOff with exponential backoff (up to 5 minutes). They'll
automatically retry registration and succeed after the stale entry is deleted. This can
take up to 5 minutes per agent depending on where they are in the backoff cycle.
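If you would rather not wait out the backoff, deleting the stuck pods makes the DaemonSet recreate them immediately, and the fresh init container registers on its first attempt (optional; the agents recover on their own either way):
```bash
# Optional: skip the remaining backoff by letting the DaemonSet recreate the pods
kubectl --kubeconfig $(pwd)/config delete pod -n crowdsec <agent-pod> <agent-pod> ...
```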
## Verification
```bash
# All agents should show Running status
kubectl --kubeconfig $(pwd)/config get pods -n crowdsec | grep agent
# DaemonSet should show all pods READY
kubectl --kubeconfig $(pwd)/config get ds -n crowdsec
```
## Example
```bash
# Identify stuck agents
$ kubectl get pods -n crowdsec | grep agent
crowdsec-agent-jr5q7 0/1 CrashLoopBackOff 485 3d
crowdsec-agent-jw76q 1/1 Running 8 3d
crowdsec-agent-mtgxh 0/1 CrashLoopBackOff 483 3d
crowdsec-agent-pfw2l 0/1 CrashLoopBackOff 481 3d
# Delete stale registrations
$ kubectl exec -n crowdsec crowdsec-lapi-xxx -- cscli machines delete crowdsec-agent-jr5q7
level=info msg="machine 'crowdsec-agent-jr5q7' deleted successfully"
$ kubectl exec -n crowdsec crowdsec-lapi-xxx -- cscli machines delete crowdsec-agent-mtgxh
$ kubectl exec -n crowdsec crowdsec-lapi-xxx -- cscli machines delete crowdsec-agent-pfw2l
# Wait ~5 minutes, then verify
$ kubectl get pods -n crowdsec | grep agent
crowdsec-agent-jr5q7 1/1 Running 1 3d
crowdsec-agent-jw76q 1/1 Running 8 3d
crowdsec-agent-mtgxh 1/1 Running 1 3d
crowdsec-agent-pfw2l 1/1 Running 1 3d
```
## Notes
- This is a known limitation of the CrowdSec Helm chart — the init container registration
script is not idempotent (it doesn't handle "already exists" by deleting and re-registering).
- The `cscli machines list` output will show many historical stale entries from past
DaemonSet rollouts. These are harmless but can be cleaned up if desired (see the cleanup
sketch after this list).
- This issue also causes the CrowdSec blocklist import CronJob to fail, since it selects
agent pods alphabetically and may pick a non-running one. Fixing the agents also fixes
the blocklist import.
- See also: `k8s-nfs-mount-troubleshooting` for other common pod startup failures.
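For the optional cleanup of historical stale entries mentioned above, newer cscli releases include a `machines prune` subcommand (an assumption about your CrowdSec version; check `cscli machines --help` first and fall back to per-name `cscli machines delete` as in Step 4 if it is missing):
```bash
# Hypothetical cleanup of long-dead registrations; prune prompts before deleting
kubectl --kubeconfig $(pwd)/config exec -it -n crowdsec <lapi-pod> -- cscli machines prune
```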

@@ -0,0 +1,93 @@
---
name: helm-stuck-release-recovery
description: |
Fix Helm releases stuck in pending-upgrade, pending-rollback, or pending-install states.
Use when: (1) terraform apply fails with "another operation (install/upgrade/rollback) is
in progress", (2) helm history shows status "pending-upgrade" or "pending-rollback",
(3) a Helm upgrade was interrupted by network timeout, etcd timeout, or VPN drop,
(4) helm upgrade fails with "an error occurred while finding last successful release".
Covers manual secret cleanup to restore Helm release to a deployable state.
author: Claude Code
version: 1.0.0
date: 2026-02-15
---
# Helm Stuck Release Recovery
## Problem
Helm releases can get stuck in `pending-upgrade`, `pending-rollback`, or `pending-install`
states when an upgrade is interrupted (network drop, etcd timeout, resource exhaustion).
Subsequent upgrades or terraform applies fail because Helm thinks an operation is in progress.
## Context / Trigger Conditions
- `terraform apply` fails with: `another operation (install/upgrade/rollback) is in progress`
- `helm history <release> -n <namespace>` shows `pending-upgrade`, `pending-rollback`, or `pending-install`
- A previous Helm upgrade was interrupted by network timeout, VPN drop, or etcd timeout
- `helm upgrade` fails with: `an error occurred while finding last successful release`
## Solution
### Step 1: Identify the stuck release
```bash
helm --kubeconfig $(pwd)/config history <release> -n <namespace> | tail -5
```
Look for revisions with status `pending-upgrade`, `pending-rollback`, or `pending-install`.
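If you are not sure which release is wedged, scan all namespaces for pending releases first (`--pending` lists releases stuck in any pending state):
```bash
# List releases in pending-install/upgrade/rollback across all namespaces
helm --kubeconfig $(pwd)/config list -A --pending
```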
### Step 2: Delete the stuck Helm release secrets
Each Helm revision is stored as a Kubernetes secret named `sh.helm.release.v1.<release>.v<revision>`.
Delete all stuck revisions:
```bash
# Delete specific stuck revision (e.g., revision 5)
kubectl --kubeconfig $(pwd)/config delete secret sh.helm.release.v1.<release>.v5 -n <namespace>
# If multiple stuck revisions exist, delete all of them
kubectl --kubeconfig $(pwd)/config delete secret sh.helm.release.v1.<release>.v6 -n <namespace>
```
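To see exactly which revision secrets exist and the status each records before deleting anything, you can query by the labels Helm v3 puts on its release secrets (a sketch assuming the standard `owner=helm`, `name=<release>`, `status=<status>` labels):
```bash
# Show every revision secret for the release together with its status label
kubectl --kubeconfig $(pwd)/config get secret -n <namespace> -l "owner=helm,name=<release>" \
  --sort-by=.metadata.creationTimestamp \
  -o custom-columns=NAME:.metadata.name,STATUS:.metadata.labels.status
```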
### Step 3: Verify the release is clean
```bash
helm --kubeconfig $(pwd)/config history <release> -n <namespace> | tail -3
```
The latest revision should now show `deployed` status.
### Step 4: Retry the upgrade
```bash
terraform apply -target=module.kubernetes_cluster.module.<service> -var="kube_config_path=$(pwd)/config" -auto-approve
```
## Important Notes
- **Never patch the secret labels** (e.g., changing `status: pending-rollback` to `status: failed`).
This changes the label but not the encoded release data inside the secret, leaving Helm in an
inconsistent state. Always delete the stuck secrets entirely.
- If the failed upgrade partially applied changes to the cluster (e.g., modified a Deployment),
the next successful upgrade will reconcile the state.
- When VPN/network is unstable, prefer direct `helm upgrade --reuse-values --set key=value`
over `terraform apply`, since Helm upgrades are faster than the full Terraform refresh cycle.
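As a concrete shape of that fallback (the chart reference and the `--set` key are illustrative placeholders, not this project's actual values):
```bash
# Direct upgrade, skipping the Terraform refresh cycle; placeholders are illustrative
helm --kubeconfig $(pwd)/config upgrade <release> <chart> -n <namespace> \
  --reuse-values --set image.tag=<new-tag> --wait --timeout 5m
```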
## Verification
After deleting stuck secrets and re-applying:
- `helm history` shows the new revision as `deployed`
- `terraform apply` completes without errors
## Example
```bash
# Helm history shows stuck state
$ helm history nextcloud -n nextcloud | tail -3
4 deployed nextcloud-8.8.1 Upgrade complete
5 failed nextcloud-8.8.1 Upgrade failed: etcd timeout
6 pending-rollback nextcloud-8.8.1 Rollback to 4
# Fix: delete stuck revisions
$ kubectl delete secret sh.helm.release.v1.nextcloud.v5 sh.helm.release.v1.nextcloud.v6 -n nextcloud
# Verify clean state
$ helm history nextcloud -n nextcloud | tail -1
4 deployed nextcloud-8.8.1 Upgrade complete
# Re-apply
$ terraform apply -target=module.kubernetes_cluster.module.nextcloud -auto-approve
```

@@ -0,0 +1,113 @@
---
name: k8s-hpa-scaling-storm
description: |
Fix and prevent HPA (HorizontalPodAutoscaler) scaling storms where pods scale to
maxReplicas uncontrollably. Use when: (1) HPA shows memory or CPU utilization at
200%+ causing rapid scale-up, (2) dozens or hundreds of pods created by HPA in minutes,
(3) cluster becomes unstable due to resource exhaustion from too many pods,
(4) etcd timeouts or API server crashes from pod churn, (5) adding resource requests
to a deployment that previously had none causes HPA to miscalculate utilization.
Covers emergency response and prevention patterns.
author: Claude Code
version: 1.0.0
date: 2026-02-15
---
# Kubernetes HPA Scaling Storm
## Problem
When an HPA is configured with a memory or CPU utilization target but the underlying
deployment has insufficient resource requests, the HPA calculates artificially high
utilization percentages (e.g., 220% of a 256Mi request when actual usage is 570Mi).
This causes the HPA to scale pods to maxReplicas (often 100) within minutes, exhausting
cluster resources and potentially crashing etcd and the API server.
## Context / Trigger Conditions
- `kubectl get hpa` shows `<unknown>/70%` or very high percentages (200%+)
- Pod count for a deployment rapidly increases to maxReplicas
- etcd timeout errors in `kubectl` or `terraform apply`
- API server becomes unreachable (`connection refused` or `network is unreachable`)
- Adding resource requests to a Helm chart that previously had none
- Memory-based HPA targets with real usage far exceeding requests
## Solution
### Emergency Response (stop the storm)
**Step 1: Delete the HPA immediately**
```bash
kubectl --kubeconfig $(pwd)/config delete hpa <hpa-name> -n <namespace>
```
**Step 2: Scale the deployment down**
```bash
kubectl --kubeconfig $(pwd)/config scale deployment <name> -n <namespace> --replicas=2
```
**Step 3: Wait for pods to terminate and cluster to stabilize**
```bash
# Watch pod count decrease
kubectl --kubeconfig $(pwd)/config get pods -n <namespace> -l <label> | wc -l
```
If the API server is unresponsive, wait 3-5 minutes for it to self-recover. The kubelet
will restart static pods (etcd, kube-apiserver) automatically.
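Once the API server starts answering again, a quick check that the static control-plane pods came back and that the apiserver reports ready (endpoint availability may vary with your setup):
```bash
# Confirm the control plane recovered after the pod churn
kubectl --kubeconfig $(pwd)/config get pods -n kube-system | grep -E 'etcd|kube-apiserver'
kubectl --kubeconfig $(pwd)/config get --raw='/readyz?verbose' | tail -5
```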
### Prevention
**Rule 1: Set resource requests to match actual usage**
Before enabling HPA, check actual resource consumption:
```bash
kubectl top pods -n <namespace> -l <label>
```
Set requests to the baseline (idle) usage, not the minimum possible value.
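As a quick sanity check, the utilization the HPA reports is roughly actual usage divided by the request; with the numbers from this incident (570Mi of usage against a 256Mi request):
```bash
# usage / request * 100 = the percentage the HPA compares against its target
echo $(( 570 * 100 / 256 ))   # => 222, i.e. the ~220% seen against a 50% target
```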
**Rule 2: Set reasonable maxReplicas**
Never use maxReplicas > 10 unless you've verified the cluster can handle it.
The default of 100 is almost never appropriate for a home/small cluster.
**Rule 3: Prefer CPU-only HPA targets**
Memory-based scaling is problematic because:
- Memory usage grows over time and rarely decreases
- Memory-based scaling creates pods that never scale down
- CPU is more responsive to load changes
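Putting Rules 2 and 3 together, one way to create a conservative CPU-only HPA (the target and replica ceiling are illustrative, not values from this repo):
```bash
# CPU-only HPA with a small, verified replica ceiling
kubectl --kubeconfig $(pwd)/config autoscale deployment <name> -n <namespace> \
  --cpu-percent=70 --min=2 --max=6
```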
**Rule 4: Test HPA changes on a deployment with 0 existing pods first**
If adding resource requests to a deployment managed by HPA, temporarily disable
the HPA first, set the requests, verify utilization is reasonable, then re-enable.
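A sketch of that sequence (resource values are illustrative):
```bash
# 1. pause autoscaling, 2. set requests, 3. confirm real usage fits under them
kubectl --kubeconfig $(pwd)/config delete hpa <hpa-name> -n <namespace>
kubectl --kubeconfig $(pwd)/config set resources deployment <name> -n <namespace> \
  --requests=cpu=250m,memory=1Gi
kubectl --kubeconfig $(pwd)/config top pods -n <namespace> -l <label>
# 4. re-enable the HPA (via Helm values or kubectl autoscale as above)
```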
## Cascade Effects
A scaling storm can cause:
1. etcd storage exhaustion (too many pod objects)
2. API server OOM or connection limits
3. VPN/network connectivity loss (if VPN runs in the cluster)
4. Kyverno webhook failures (admission controller overwhelmed)
5. Other pods evicted or unable to schedule
## Verification
- `kubectl get hpa -n <namespace>` shows reasonable utilization (< 100%)
- Pod count is stable at expected replicas
- `kubectl get nodes` responds promptly
- No etcd timeout errors
## Example
```bash
# Observed: HPA scaling Collabora to 100 pods
$ kubectl get hpa -n nextcloud
NAME TARGETS MINPODS MAXPODS REPLICAS
nextcloud-collabora cpu: 0%/70%, memory: 220%/50% 2 100 83
# Emergency fix
$ kubectl delete hpa nextcloud-collabora -n nextcloud
$ kubectl scale deployment nextcloud-collabora -n nextcloud --replicas=2
# Root cause: 256Mi memory request, actual usage 570Mi
# Fix: increase request to 1Gi or disable memory target
```
## Notes
- If the HPA is managed by a Helm chart, deleting it via kubectl is temporary—the next
Helm upgrade will recreate it. You must also update the Helm values.
- In this project, Collabora was ultimately disabled in favor of OnlyOffice to avoid
the HPA issue entirely.
- See also: `helm-stuck-release-recovery` for fixing Helm releases broken by the storm.