[ci skip] Add skills: helm-stuck-release-recovery, k8s-hpa-scaling-storm, crowdsec-agent-registration-failure
This commit is contained in:
parent 95013c9056
commit 3da35166ab
3 changed files with 305 additions and 0 deletions
99  .claude/skills/crowdsec-agent-registration-failure/SKILL.md  (new file)
@@ -0,0 +1,99 @@
---
name: crowdsec-agent-registration-failure
description: |
  Fix CrowdSec agent pods stuck in CrashLoopBackOff after LAPI restart due to stale
  machine registrations. Use when: (1) CrowdSec agent init container fails with
  "user already exist" error during cscli lapi register, (2) agent pods show hundreds
  of init container restarts, (3) LAPI was restarted or redeployed but agents kept
  running with old credentials, (4) cscli machines list shows stale entries for
  current agent pod names. Covers deleting stale registrations to allow re-registration.
author: Claude Code
version: 1.0.0
date: 2026-02-15
---

# CrowdSec Agent Registration Failure

## Problem
After a CrowdSec LAPI restart or redeployment, agent DaemonSet pods lose their
credentials but LAPI retains the old machine registrations. When agents try to
re-register with the same pod name, the `wait-for-lapi-and-register` init container
fails with `user already exist`, causing CrashLoopBackOff with hundreds of restarts.

## Context / Trigger Conditions
- Agent init container logs show: `Error: cscli lapi register: api client register: api register ... user 'crowdsec-agent-xxxxx': user already exist`
- Agent pods show status `CrashLoopBackOff` or `Init:CrashLoopBackOff` with many restarts
- `kubectl describe pod` shows `BackOff restarting failed container wait-for-lapi-and-register`
- LAPI pods were recently restarted or redeployed
- `cscli machines list` on LAPI shows entries matching the stuck agent pod names

## Solution

### Step 1: Identify stuck agents
```bash
kubectl --kubeconfig $(pwd)/config get pods -n crowdsec
```
Note the pod names that are in CrashLoopBackOff (e.g., `crowdsec-agent-jr5q7`).

### Step 2: Confirm the init container error
```bash
kubectl --kubeconfig $(pwd)/config logs -n crowdsec <agent-pod> -c wait-for-lapi-and-register --tail=5
```
The output should end with the `user already exist` error.

### Step 3: Find a running LAPI pod
```bash
kubectl --kubeconfig $(pwd)/config get pods -n crowdsec | grep lapi
```

### Step 4: Delete stale machine registrations from LAPI
```bash
kubectl --kubeconfig $(pwd)/config exec -n crowdsec <lapi-pod> -- cscli machines delete <agent-pod-name>
```
Repeat for each stuck agent.

### Step 5: Wait for agents to recover
The agents retry registration automatically and succeed once the stale entry is deleted.
Because CrashLoopBackOff uses exponential backoff (capped at 5 minutes), recovery can take
up to 5 minutes per agent depending on where each one is in the backoff cycle.

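To watch recovery, or to skip the remaining backoff instead of waiting it out, a minimal
sketch (the awk filter follows the pod-name patterns above; pods deleted this way come back
under new names via the DaemonSet):

```bash
# Watch the agent pods until they all reach Running
kubectl --kubeconfig $(pwd)/config get pods -n crowdsec -w

# Optional: skip the remaining backoff by recreating only the stuck agent pods.
# The DaemonSet controller replaces each deleted pod immediately (with a new name),
# and the fresh init container registers cleanly now that the stale entry is gone.
kubectl --kubeconfig $(pwd)/config get pods -n crowdsec --no-headers \
  | awk '/agent/ && /CrashLoopBackOff/ {print $1}' \
  | xargs -r kubectl --kubeconfig $(pwd)/config delete pod -n crowdsec
```
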
## Verification
```bash
# All agents should show Running status
kubectl --kubeconfig $(pwd)/config get pods -n crowdsec | grep agent

# DaemonSet should show all pods READY
kubectl --kubeconfig $(pwd)/config get ds -n crowdsec
```

## Example
```bash
# Identify stuck agents
$ kubectl get pods -n crowdsec | grep agent
crowdsec-agent-jr5q7   0/1   CrashLoopBackOff   485   3d
crowdsec-agent-jw76q   1/1   Running            8     3d
crowdsec-agent-mtgxh   0/1   CrashLoopBackOff   483   3d
crowdsec-agent-pfw2l   0/1   CrashLoopBackOff   481   3d

# Delete stale registrations
$ kubectl exec -n crowdsec crowdsec-lapi-xxx -- cscli machines delete crowdsec-agent-jr5q7
level=info msg="machine 'crowdsec-agent-jr5q7' deleted successfully"
$ kubectl exec -n crowdsec crowdsec-lapi-xxx -- cscli machines delete crowdsec-agent-mtgxh
$ kubectl exec -n crowdsec crowdsec-lapi-xxx -- cscli machines delete crowdsec-agent-pfw2l

# Wait ~5 minutes, then verify
$ kubectl get pods -n crowdsec | grep agent
crowdsec-agent-jr5q7   1/1   Running   1   3d
crowdsec-agent-jw76q   1/1   Running   8   3d
crowdsec-agent-mtgxh   1/1   Running   1   3d
crowdsec-agent-pfw2l   1/1   Running   1   3d
```

## Notes
- This is a known limitation of the CrowdSec Helm chart: the init container registration
  script is not idempotent (it doesn't handle "already exists" by deleting and re-registering).
- The `cscli machines list` output will show many historical stale entries from past
  DaemonSet rollouts. These are harmless but can be cleaned up if desired (see the sketch
  after this list).
- This issue also causes the CrowdSec blocklist import CronJob to fail, since it selects
  agent pods alphabetically and may pick a non-running one. Fixing the agents also fixes
  the blocklist import.
- See also: `k8s-nfs-mount-troubleshooting` for other common pod startup failures.

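A rough sketch of that optional cleanup, reusing the Step 4 commands; review the list
carefully and only delete entries that match no current pod:

```bash
# Review all machine registrations known to the LAPI
kubectl --kubeconfig $(pwd)/config exec -n crowdsec <lapi-pod> -- cscli machines list

# Delete an entry that no longer corresponds to any existing agent pod
kubectl --kubeconfig $(pwd)/config exec -n crowdsec <lapi-pod> -- cscli machines delete <stale-machine-name>
```
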
93  .claude/skills/helm-stuck-release-recovery/SKILL.md  (new file)
@@ -0,0 +1,93 @@
---
name: helm-stuck-release-recovery
description: |
  Fix Helm releases stuck in pending-upgrade, pending-rollback, or pending-install states.
  Use when: (1) terraform apply fails with "another operation (install/upgrade/rollback) is
  in progress", (2) helm history shows status "pending-upgrade" or "pending-rollback",
  (3) a Helm upgrade was interrupted by network timeout, etcd timeout, or VPN drop,
  (4) helm upgrade fails with "an error occurred while finding last successful release".
  Covers manual secret cleanup to restore the Helm release to a deployable state.
author: Claude Code
version: 1.0.0
date: 2026-02-15
---

# Helm Stuck Release Recovery

## Problem
Helm releases can get stuck in `pending-upgrade`, `pending-rollback`, or `pending-install`
states when an upgrade is interrupted (network drop, etcd timeout, resource exhaustion).
Subsequent upgrades or terraform applies fail because Helm thinks an operation is in progress.

## Context / Trigger Conditions
- `terraform apply` fails with: `another operation (install/upgrade/rollback) is in progress`
- `helm history <release> -n <namespace>` shows `pending-upgrade`, `pending-rollback`, or `pending-install`
- A previous Helm upgrade was interrupted by network timeout, VPN drop, or etcd timeout
- `helm upgrade` fails with: `an error occurred while finding last successful release`

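To spot every affected release in one pass, `helm list` can filter on the pending states
directly; a quick check using the same kubeconfig convention as the steps below:

```bash
# Show all releases currently stuck in pending-install / pending-upgrade / pending-rollback
helm --kubeconfig $(pwd)/config list --pending --all-namespaces
```
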
## Solution

### Step 1: Identify the stuck release
```bash
helm --kubeconfig $(pwd)/config history <release> -n <namespace> | tail -5
```

Look for revisions with status `pending-upgrade`, `pending-rollback`, or `pending-install`.

### Step 2: Delete the stuck Helm release secrets
Each Helm revision is stored as a Kubernetes secret named `sh.helm.release.v1.<release>.v<revision>`.
Delete all stuck revisions:

```bash
# Delete specific stuck revision (e.g., revision 5)
kubectl --kubeconfig $(pwd)/config delete secret sh.helm.release.v1.<release>.v5 -n <namespace>

# If multiple stuck revisions exist, delete all of them
kubectl --kubeconfig $(pwd)/config delete secret sh.helm.release.v1.<release>.v6 -n <namespace>
```

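If it's unclear which revision numbers are stuck, the release secrets carry Helm's
bookkeeping as labels, so listing them shows each revision's status (a sketch, assuming
Helm 3's default Secret storage driver):

```bash
# One secret per revision; the status label shows which revisions are pending
kubectl --kubeconfig $(pwd)/config get secrets -n <namespace> \
  -l owner=helm,name=<release> --show-labels
```
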
### Step 3: Verify the release is clean
```bash
helm --kubeconfig $(pwd)/config history <release> -n <namespace> | tail -3
```

The latest revision should now show `deployed` status.

### Step 4: Retry the upgrade
```bash
terraform apply -target=module.kubernetes_cluster.module.<service> -var="kube_config_path=$(pwd)/config" -auto-approve
```

## Important Notes

- **Never patch the secret labels** (e.g., changing `status: pending-rollback` to `status: failed`).
  This changes the label but not the encoded release data inside the secret, leaving Helm in an
  inconsistent state. Always delete the stuck secrets entirely.
- If the failed upgrade partially applied changes to the cluster (e.g., modified a Deployment),
  the next successful upgrade will reconcile the state.
- When VPN/network is unstable, prefer a direct `helm upgrade --reuse-values --set key=value`
  over `terraform apply`, since Helm upgrades are faster than the full Terraform refresh cycle
  (see the sketch after this list).

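A sketch of that direct-upgrade path; `<chart>` and the `--set` pair are placeholders for
whatever the release actually uses:

```bash
# Re-run only the Helm upgrade, reusing the values already stored in the release
helm --kubeconfig $(pwd)/config upgrade <release> <chart> -n <namespace> \
  --reuse-values --set <key>=<value> --timeout 5m
```
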
## Verification
After deleting stuck secrets and re-applying:
- `helm history` shows the new revision as `deployed`
- `terraform apply` completes without errors

## Example
```bash
# Helm history shows stuck state
$ helm history nextcloud -n nextcloud | tail -3
4   deployed           nextcloud-8.8.1   Upgrade complete
5   failed             nextcloud-8.8.1   Upgrade failed: etcd timeout
6   pending-rollback   nextcloud-8.8.1   Rollback to 4

# Fix: delete stuck revisions
$ kubectl delete secret sh.helm.release.v1.nextcloud.v5 sh.helm.release.v1.nextcloud.v6 -n nextcloud

# Verify clean state
$ helm history nextcloud -n nextcloud | tail -1
4   deployed   nextcloud-8.8.1   Upgrade complete

# Re-apply
$ terraform apply -target=module.kubernetes_cluster.module.nextcloud -auto-approve
```
113  .claude/skills/k8s-hpa-scaling-storm/SKILL.md  (new file)
@@ -0,0 +1,113 @@
---
name: k8s-hpa-scaling-storm
description: |
  Fix and prevent HPA (HorizontalPodAutoscaler) scaling storms where pods scale to
  maxReplicas uncontrollably. Use when: (1) HPA shows memory or CPU utilization at
  200%+ causing rapid scale-up, (2) dozens or hundreds of pods created by HPA in minutes,
  (3) cluster becomes unstable due to resource exhaustion from too many pods,
  (4) etcd timeouts or API server crashes from pod churn, (5) adding resource requests
  to a deployment that previously had none causes HPA to miscalculate utilization.
  Covers emergency response and prevention patterns.
author: Claude Code
version: 1.0.0
date: 2026-02-15
---

# Kubernetes HPA Scaling Storm

## Problem
When an HPA is configured with a memory or CPU utilization target but the underlying
deployment has insufficient resource requests, the HPA calculates artificially high
utilization percentages (e.g., 220% of a 256Mi request when actual usage is 570Mi).
This causes the HPA to scale pods to maxReplicas (often 100) within minutes, exhausting
cluster resources and potentially crashing etcd and the API server.

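The utilization figure is just current usage divided by the per-pod request; these two
commands show the numbers the HPA is working from (names are placeholders):

```bash
# Actual usage per pod, and the metrics/targets the HPA currently sees
kubectl --kubeconfig $(pwd)/config top pods -n <namespace> -l <label>
kubectl --kubeconfig $(pwd)/config describe hpa <hpa-name> -n <namespace>

# From the incident below: 570Mi usage / 256Mi request ≈ 220% utilization against a 50%
# target, so the HPA keeps adding replicas until it hits maxReplicas.
```
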
## Context / Trigger Conditions
- `kubectl get hpa` shows `<unknown>/70%` or very high percentages (200%+)
- Pod count for a deployment rapidly increases to maxReplicas
- etcd timeout errors in `kubectl` or `terraform apply`
- API server becomes unreachable (`connection refused` or `network is unreachable`)
- Adding resource requests to a Helm chart that previously had none
- Memory-based HPA targets with real usage far exceeding requests

## Solution

### Emergency Response (stop the storm)

**Step 1: Delete the HPA immediately**
```bash
kubectl --kubeconfig $(pwd)/config delete hpa <hpa-name> -n <namespace>
```

**Step 2: Scale the deployment down**
```bash
kubectl --kubeconfig $(pwd)/config scale deployment <name> -n <namespace> --replicas=2
```

**Step 3: Wait for pods to terminate and cluster to stabilize**
```bash
# Watch pod count decrease
kubectl --kubeconfig $(pwd)/config get pods -n <namespace> -l <label> | wc -l
```

If the API server is unresponsive, wait 3-5 minutes for it to self-recover. The kubelet
will restart static pods (etcd, kube-apiserver) automatically.

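A small poll loop for that recovery window, a sketch that just waits for the API server's
readiness endpoint to answer again:

```bash
# Poll the API server readiness endpoint until it responds
until kubectl --kubeconfig $(pwd)/config get --raw /readyz >/dev/null 2>&1; do
  echo "API server not ready yet, retrying in 15s..."
  sleep 15
done
echo "API server is back"
```
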
### Prevention

**Rule 1: Set resource requests to match actual usage**
Before enabling HPA, check actual resource consumption:
```bash
kubectl top pods -n <namespace> -l <label>
```
Set requests to the baseline (idle) usage, not the minimum possible value.

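The durable place for the requests is the chart's Helm values, but as an imperative sketch
(the cpu figure is a placeholder; 1Gi matches the fix in the example below):

```bash
# Raise requests to roughly the observed baseline usage.
# Note: the next Helm upgrade will overwrite this unless the chart values are updated too.
kubectl --kubeconfig $(pwd)/config set resources deployment <name> -n <namespace> \
  --requests=cpu=200m,memory=1Gi
```
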
**Rule 2: Set reasonable maxReplicas**
Never use maxReplicas > 10 unless you've verified the cluster can handle it.
The default of 100 is almost never appropriate for a home/small cluster.

**Rule 3: Prefer CPU-only HPA targets**
Memory-based scaling is problematic because:
- Memory usage grows over time and rarely decreases
- Memory-based scaling creates pods that never scale down
- CPU is more responsive to load changes

**Rule 4: Test HPA changes on a deployment with 0 existing pods first**
If adding resource requests to a deployment managed by HPA, temporarily disable
the HPA first, set the requests, verify utilization is reasonable, then re-enable
(sketched below).

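The Rule 4 sequence as a rough ordering, assuming the HPA is re-created afterwards from
the Helm values rather than by hand:

```bash
# 1. Remove the HPA so it cannot react while requests change
kubectl --kubeconfig $(pwd)/config delete hpa <hpa-name> -n <namespace>

# 2. Apply the new requests (via Helm values, or the Rule 1 sketch above)

# 3. Confirm pod usage sits comfortably below the new requests
kubectl --kubeconfig $(pwd)/config top pods -n <namespace> -l <label>

# 4. Re-enable the HPA by re-applying the chart, then check its reported utilization
kubectl --kubeconfig $(pwd)/config get hpa -n <namespace>
```
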
## Cascade Effects
A scaling storm can cause:
1. etcd storage exhaustion (too many pod objects)
2. API server OOM or connection limits
3. VPN/network connectivity loss (if VPN runs in the cluster)
4. Kyverno webhook failures (admission controller overwhelmed)
5. Other pods evicted or unable to schedule

## Verification
- `kubectl get hpa -n <namespace>` shows reasonable utilization (< 100%)
- Pod count is stable at expected replicas
- `kubectl get nodes` responds promptly
- No etcd timeout errors

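The same checks as commands, for convenience (namespace and label are placeholders):

```bash
# HPA utilization back under control
kubectl --kubeconfig $(pwd)/config get hpa -n <namespace>

# Replica count stable at the expected value
kubectl --kubeconfig $(pwd)/config get pods -n <namespace> -l <label> --no-headers | wc -l

# Control plane responding promptly, no timeouts
kubectl --kubeconfig $(pwd)/config get nodes
```
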
## Example
```bash
# Observed: HPA scaling Collabora to 100 pods
$ kubectl get hpa -n nextcloud
NAME                  TARGETS                         MINPODS   MAXPODS   REPLICAS
nextcloud-collabora   cpu: 0%/70%, memory: 220%/50%   2         100       83

# Emergency fix
$ kubectl delete hpa nextcloud-collabora -n nextcloud
$ kubectl scale deployment nextcloud-collabora -n nextcloud --replicas=2

# Root cause: 256Mi memory request, actual usage 570Mi
# Fix: increase request to 1Gi or disable memory target
```

## Notes
- If the HPA is managed by a Helm chart, deleting it via kubectl is only temporary: the next
  Helm upgrade will recreate it. You must also update the Helm values (see the sketch after
  this list).
- In this project, Collabora was ultimately disabled in favor of OnlyOffice to avoid
  the HPA issue entirely.
- See also: `helm-stuck-release-recovery` for fixing Helm releases broken by the storm.

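A quick way to check whether an HPA is chart-managed before relying on a kubectl delete;
most charts set the standard managed-by label, and the values key mentioned in the comment
is hypothetical and chart-specific:

```bash
# "Helm" in the output means the next helm upgrade / terraform apply will recreate the HPA,
# so the chart values (e.g., a hypothetical autoscaling.enabled flag) must change too.
kubectl --kubeconfig $(pwd)/config get hpa <hpa-name> -n <namespace> \
  -o jsonpath='{.metadata.labels.app\.kubernetes\.io/managed-by}{"\n"}'
```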