infra/.claude/skills/archived/helm-release-troubleshooting/SKILL.md
Viktor Barzin fd0f4a0365 fix: restore tree dropped by 6d224861; land stem95su gdrive-sync (10m) [ci skip]
6d224861 came from a --no-checkout worktree whose empty index made the
commit drop every file except two. This restores 05b50d2b's full tree and
correctly adds stacks/stem95su/gdrive-sync.tf + the service-catalog stem95su
entry. Forward-only (parent=6d224861, no force-push); [ci skip] since the
live infra was never applied from the broken commit.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-09 08:45:33 +00:00

253 lines
9.7 KiB
Markdown

---
name: helm-release-troubleshooting
description: |
Troubleshoot and fix Helm release issues managed by Terraform. Use when:
(1) Terraform applies successfully but K8s resources don't reflect new Helm values,
(2) New ports/volumes/containers from Helm chart values don't appear in deployed resources,
(3) helm upgrade --reuse-values doesn't re-render templates for structural changes,
(4) Terraform thinks Helm release is up-to-date but actual K8s resources are stale,
(5) terraform apply fails with "another operation (install/upgrade/rollback) is in progress",
(6) helm history shows status "pending-upgrade" or "pending-rollback",
(7) a Helm upgrade was interrupted by network timeout, etcd timeout, or VPN drop,
(8) helm upgrade fails with "an error occurred while finding last successful release".
Covers force re-rendering via state removal/reimport and stuck release recovery via
secret cleanup.
author: Claude Code
version: 1.0.0
date: 2026-02-22
---
# Helm Release Troubleshooting
## Force Re-render
### Problem
After changing Helm chart values in a Terraform `helm_release` resource, Terraform applies
successfully but the actual Kubernetes resources (Services, Deployments, etc.) don't reflect
the new values. For example, adding a new port in Helm values doesn't result in that port
appearing in the Service spec.
### Context / Trigger Conditions
- Terraform `helm_release` applies with "1 changed" but `kubectl get svc -o yaml` shows
the old configuration
- Structural changes to Helm values (new ports, new containers, new volumes) are not
reflected in deployed resources
- The Helm chart templates need to be fully re-rendered, not just patched
- Common with Traefik, ingress-nginx, and other charts where template logic conditionally
includes resources based on values
### Root Cause
Terraform's `helm_release` resource uses `helm upgrade` under the hood. When values are
changed, Helm may use `--reuse-values` behavior where it merges new values into existing
ones rather than doing a full template re-render. For structural changes (like enabling
HTTP/3 which adds a new UDP port to the Service template), the templates may not be
re-rendered with the new conditional branches active.
Additionally, Terraform may see the stored Helm release state as matching the desired state
even though the actual Kubernetes resources don't reflect it, creating a state drift that
Terraform doesn't detect.
### Solution
#### Step 1: Verify the Discrepancy
Confirm that K8s resources don't match Helm values:
```bash
# Check the actual resource
kubectl get svc <service-name> -n <namespace> -o yaml
# Check what Helm thinks is deployed
helm get values <release-name> -n <namespace>
helm get manifest <release-name> -n <namespace> | grep -A10 "<expected-config>"
```
#### Step 2: Remove Helm Release from Terraform State
```bash
terraform state rm 'module.kubernetes_cluster.module.<service>.helm_release.<name>'
```
**IMPORTANT**: This only removes from Terraform state. The actual Helm release and K8s
resources remain untouched in the cluster.
#### Step 3: Import the Helm Release Back
```bash
terraform import 'module.kubernetes_cluster.module.<service>.helm_release.<name>' '<namespace>/<release-name>'
```
For Helm releases, the import ID format is `namespace/release-name`.
#### Step 4: Force Apply with Terraform
After reimporting, run terraform apply. Terraform should now detect the drift between
the desired Helm values and the actual release state:
```bash
terraform apply -target=module.kubernetes_cluster.module.<service>
```
If Terraform still shows "no changes", you may need to taint the resource:
```bash
terraform taint 'module.kubernetes_cluster.module.<service>.helm_release.<name>'
terraform apply -target=module.kubernetes_cluster.module.<service>
```
#### Step 5: Manual Helm Force Upgrade (Last Resort)
If Terraform still doesn't fix it, use Helm directly as a one-time fix, then reimport:
```bash
# Get the current values file
helm get values <release-name> -n <namespace> -o yaml > /tmp/values.yaml
# Edit /tmp/values.yaml to include the correct values, or use --set flags
# Force upgrade (re-renders all templates)
helm upgrade --force <release-name> <chart> -n <namespace> -f /tmp/values.yaml
# Then reimport into Terraform
terraform state rm 'module.kubernetes_cluster.module.<service>.helm_release.<name>'
terraform import 'module.kubernetes_cluster.module.<service>.helm_release.<name>' '<namespace>/<release-name>'
terraform apply -target=module.kubernetes_cluster.module.<service>
```
**WARNING**: Direct Helm operations bypass Terraform. Always reimport into Terraform state
afterward, and use `terraform apply` to verify Terraform is back in sync.
### Verification
```bash
# Check the K8s resources now match expected configuration
kubectl get svc <service-name> -n <namespace> -o yaml
kubectl get deployment <deployment-name> -n <namespace> -o yaml
# Verify Terraform is in sync
terraform plan -target=module.kubernetes_cluster.module.<service>
# Should show "No changes" or minimal expected drift
```
### Example: Traefik HTTP/3 UDP Port Not Appearing
**Problem**: Added `http3.enabled=true` to Traefik Helm values. Terraform applied
successfully, but the Traefik Service only had TCP port 443, missing the expected
UDP port 443 (`websecure-http3`).
**Fix**:
```bash
# 1. Remove from state
terraform state rm 'module.kubernetes_cluster.module.traefik.helm_release.traefik'
# 2. Reimport
terraform import 'module.kubernetes_cluster.module.traefik.helm_release.traefik' 'traefik/traefik'
# 3. Apply (Terraform now detects the drift)
terraform apply -target=module.kubernetes_cluster.module.traefik
# 4. Verify
kubectl get svc traefik -n traefik -o yaml | grep -A3 "websecure-http3"
# Should show: port: 443, protocol: UDP
```
### Notes
- This issue is more common with structural Helm value changes (new ports, new sidecars,
conditional template blocks) than with simple value changes (image tags, replica counts)
- The `helm upgrade --force` flag deletes and recreates resources that have changed,
which causes brief downtime. Use with caution on production ingress controllers.
- Always verify with `terraform plan` after fixing to ensure Terraform state is consistent
---
## Stuck Release Recovery
### Problem
Helm releases can get stuck in `pending-upgrade`, `pending-rollback`, or `pending-install`
states when an upgrade is interrupted (network drop, etcd timeout, resource exhaustion).
Subsequent upgrades or terraform applies fail because Helm thinks an operation is in progress.
### Context / Trigger Conditions
- `terraform apply` fails with: `another operation (install/upgrade/rollback) is in progress`
- `helm history <release> -n <namespace>` shows `pending-upgrade`, `pending-rollback`, or `pending-install`
- A previous Helm upgrade was interrupted by network timeout, VPN drop, or etcd timeout
- `helm upgrade` fails with: `an error occurred while finding last successful release`
### Solution
#### Step 1: Identify the stuck release
```bash
helm --kubeconfig $(pwd)/config history <release> -n <namespace> | tail -5
```
Look for revisions with status `pending-upgrade`, `pending-rollback`, or `pending-install`.
#### Step 2: Delete the stuck Helm release secrets
Each Helm revision is stored as a Kubernetes secret named `sh.helm.release.v1.<release>.v<revision>`.
Delete all stuck revisions:
```bash
# Delete specific stuck revision (e.g., revision 5)
kubectl --kubeconfig $(pwd)/config delete secret sh.helm.release.v1.<release>.v5 -n <namespace>
# If multiple stuck revisions exist, delete all of them
kubectl --kubeconfig $(pwd)/config delete secret sh.helm.release.v1.<release>.v6 -n <namespace>
```
#### Step 3: Verify the release is clean
```bash
helm --kubeconfig $(pwd)/config history <release> -n <namespace> | tail -3
```
The latest revision should now show `deployed` status.
#### Step 4: Retry the upgrade
```bash
terraform apply -target=module.kubernetes_cluster.module.<service> -var="kube_config_path=$(pwd)/config" -auto-approve
```
### Important Notes
- **Never patch the secret labels** (e.g., changing `status: pending-rollback` to `status: failed`).
This changes the label but not the encoded release data inside the secret, leaving Helm in an
inconsistent state. Always delete the stuck secrets entirely.
- If the failed upgrade partially applied changes to the cluster (e.g., modified a Deployment),
the next successful upgrade will reconcile the state.
- When VPN/network is unstable, prefer direct `helm upgrade --reuse-values --set key=value`
over `terraform apply`, since Helm upgrades are faster than the full Terraform refresh cycle.
### Verification
After deleting stuck secrets and re-applying:
- `helm history` shows the new revision as `deployed`
- `terraform apply` completes without errors
### Example
```bash
# Helm history shows stuck state
$ helm history nextcloud -n nextcloud | tail -3
4 deployed nextcloud-8.8.1 Upgrade complete
5 failed nextcloud-8.8.1 Upgrade failed: etcd timeout
6 pending-rollback nextcloud-8.8.1 Rollback to 4
# Fix: delete stuck revisions
$ kubectl delete secret sh.helm.release.v1.nextcloud.v5 sh.helm.release.v1.nextcloud.v6 -n nextcloud
# Verify clean state
$ helm history nextcloud -n nextcloud | tail -1
4 deployed nextcloud-8.8.1 Upgrade complete
# Re-apply
$ terraform apply -target=module.kubernetes_cluster.module.nextcloud -auto-approve
```
---
## See Also
- `terraform-state-identity-mismatch` - For Terraform provider identity errors
- `traefik-http3-quic` - For enabling HTTP/3 on Traefik (common trigger for force re-render)
## References
- [Terraform helm_release Resource](https://registry.terraform.io/providers/hashicorp/helm/latest/docs/resources/release)
- [Helm Upgrade Documentation](https://helm.sh/docs/helm/helm_upgrade/)
- [Helm --force Flag](https://helm.sh/docs/helm/helm_upgrade/#options)