[ci skip] archive 28 unused skills, add runbook index to CLAUDE.md, add cluster-health agent

- Move 28 never-invoked troubleshooting runbook skills to .claude/skills/archived/
- Keep 7 active workflow skills: cluster-health, uptime-kuma, pfsense,
  home-assistant, setup-project, extend-vm-storage, k8s-ndots
- Add one-line runbook index to CLAUDE.md for quick reference
- Create cluster-health-checker custom agent (haiku model, read-only + bash)
  for autonomous health checks without consuming main context
Viktor Barzin 2026-03-06 23:17:40 +00:00
parent 614d14c47d
commit bcbe8b23b4
30 changed files with 79 additions and 1 deletions


@ -0,0 +1,170 @@
---
name: authentik-oidc-kubernetes
description: |
  Configure Authentik as OIDC provider for Kubernetes API server authentication.
  Use when: (1) setting up OIDC auth for kubectl with Authentik, (2) kube-apiserver
  rejects OIDC tokens with "oidc: email not verified", (3) JWKS endpoint returns
  empty {} despite provider being configured, (4) kubelogin fails with "claim not
  present" for email, (5) redirect_uri mismatch errors during kubelogin browser auth,
  (6) kube-apiserver static pod manifest changes don't take effect after restart.
  Covers all gotchas discovered when integrating Authentik 2025.10.x with Kubernetes
  1.34.x using kubelogin (int128/kubelogin).
author: Claude Code
version: 1.0.0
date: 2026-02-17
---
# Authentik OIDC for Kubernetes API Authentication
## Problem
Setting up Authentik as an OIDC identity provider for Kubernetes kubectl access
involves multiple non-obvious pitfalls that cause silent failures at different
stages of the authentication flow.
## Context / Trigger Conditions
- Setting up multi-user kubectl access with OIDC
- Using Authentik as the identity provider and kubelogin (int128/kubelogin) as the kubectl plugin
- Any of these errors:
  - `oidc: email not verified`
  - `oidc: parse username claims "email": claim not present`
  - `The request fails due to a missing, invalid, or mismatching redirection URI`
  - JWKS endpoint (`/application/o/<app>/jwks/`) returns `{}`
  - `Unauthorized` after successful browser login
## Solution
### Gotcha 1: Signing Key Must Be Assigned
Authentik's OAuth2 provider does NOT assign a signing key by default. Without it,
the JWKS endpoint returns `{}` and kube-apiserver can't validate tokens.
**Fix:** Assign a signing key (e.g., "authentik Self-signed Certificate") to the
OAuth2 provider:
```python
# Via Django shell (kubectl exec into authentik server pod)
from authentik.providers.oauth2.models import OAuth2Provider
from authentik.crypto.models import CertificateKeyPair
provider = OAuth2Provider.objects.get(name='kubernetes')
cert = CertificateKeyPair.objects.filter(name='authentik Self-signed Certificate').first()
provider.signing_key = cert
provider.save()
```
Or via API:
```bash
curl -X PATCH -H "Authorization: Bearer $TOKEN" -H "Content-Type: application/json" \
"$AUTHENTIK_URL/api/v3/providers/oauth2/<pk>/" \
-d '{"signing_key": "<certificate-keypair-uuid>"}'
```
### Gotcha 2: Default Email Mapping Sets `email_verified: False`
Authentik's built-in email scope mapping hardcodes `email_verified: False`:
```python
return {
    "email": request.user.email,
    "email_verified": False  # <-- This causes kube-apiserver to reject the token
}
```
When the username claim is `email`, kube-apiserver rejects any token whose `email_verified` claim is present and not `true`.
**Fix:** Create a custom scope mapping with `email_verified: True` and assign it
to the provider instead of the default:
```python
from authentik.providers.oauth2.models import OAuth2Provider, ScopeMapping
# Create custom mapping
mapping, _ = ScopeMapping.objects.get_or_create(
    name='Kubernetes Email (verified)',
    defaults={
        'scope_name': 'email',
        'expression': 'return {"email": request.user.email, "email_verified": True}'
    }
)
# Replace default email mapping on the provider
provider = OAuth2Provider.objects.get(name='kubernetes')
default_email = ScopeMapping.objects.filter(
    managed='goauthentik.io/providers/oauth2/scope-email'
).first()
if default_email:
    provider.property_mappings.remove(default_email)
provider.property_mappings.add(mapping)
```
### Gotcha 3: kubelogin Needs Extra Scopes
By default, kubelogin only requests the `openid` scope. The token will lack
`email` and `groups` claims, causing:
```
oidc: parse username claims "email": claim not present
```
**Fix:** Add `--oidc-extra-scope` flags to the kubeconfig exec plugin:
```yaml
users:
- name: oidc-user
  user:
    exec:
      command: kubectl
      args:
      - oidc-login
      - get-token
      - --oidc-issuer-url=https://authentik.example.com/application/o/kubernetes/
      - --oidc-client-id=kubernetes
      - --oidc-extra-scope=email  # Required!
      - --oidc-extra-scope=profile
      - --oidc-extra-scope=groups
```
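When chasing the `claim not present` or `email not verified` errors, it can help to look at what the ID token actually contains. The following is a minimal Python sketch, not part of kubelogin or Authentik: `decode_jwt_payload` and `claim_problems` are hypothetical helper names, and the list of required claims is an assumption based on the claims this guide relies on. It decodes the payload without verifying the signature, so it is suitable for debugging only.

```python
import base64
import json

# Assumption: these are the claims this guide expects in the ID token.
REQUIRED_CLAIMS = ("email", "email_verified", "groups")


def decode_jwt_payload(token: str) -> dict:
    """Decode a JWT payload WITHOUT verifying the signature (debugging only)."""
    parts = token.split(".")
    if len(parts) != 3:
        raise ValueError("expected a JWT with three dot-separated parts")
    payload_b64 = parts[1]
    # JWTs use unpadded base64url, so restore the padding before decoding.
    payload_b64 += "=" * (-len(payload_b64) % 4)
    return json.loads(base64.urlsafe_b64decode(payload_b64))


def claim_problems(claims: dict) -> list[str]:
    """Return human-readable problems that would trip up kube-apiserver."""
    problems = [f"missing claim: {name}" for name in REQUIRED_CLAIMS if name not in claims]
    if claims.get("email_verified") is False:
        problems.append("email_verified is false")
    return problems
```

An empty list from `claim_problems` suggests the scope and mapping fixes took effect; a `missing claim: email` entry points back at the `--oidc-extra-scope` flags.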
### Gotcha 4: Redirect URIs Must Use Regex Mode
kubelogin's local callback server does not use a fixed port: it tries 8000 first
and falls back to another port (18000 by default) if that one is taken. Strict
redirect URI matching like `http://localhost:8000/callback` fails whenever
kubelogin ends up on a different port.
**Fix:** Use regex matching in the Authentik provider:
```json
{
  "redirect_uris": [
    {"matching_mode": "regex", "url": "http://localhost:.*"},
    {"matching_mode": "regex", "url": "http://127\\.0\\.0\\.1:.*"}
  ]
}
```
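To see what the two patterns accept, here is a small Python sketch. It assumes each pattern is applied as a full-string match, which is an assumption about Authentik's matching behaviour rather than something this runbook verifies; `redirect_allowed` is a hypothetical helper. The doubled backslashes in the JSON are JSON escaping, so the regex itself is `http://127\.0\.0\.1:.*`.

```python
import re

# The two regexes from the provider config, with JSON escaping removed.
PATTERNS = [r"http://localhost:.*", r"http://127\.0\.0\.1:.*"]


def redirect_allowed(uri: str, patterns=PATTERNS) -> bool:
    """Return True if any pattern matches the whole redirect URI."""
    return any(re.fullmatch(pattern, uri) for pattern in patterns)


for candidate in (
    "http://localhost:8000",
    "http://localhost:18000",
    "http://127.0.0.1:18000/",
    "http://localhost.evil.example:8000",
    "https://evil.example/callback",
):
    print(candidate, redirect_allowed(candidate))
```

The patterns are deliberately loose about the port and path; they still require the literal `localhost:` or `127.0.0.1:` prefix, so a look-alike host such as `localhost.evil.example` is not accepted under the full-match assumption.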
### Gotcha 5: Property Mappings API Endpoint Changed
In Authentik 2025.10.x, scope mappings are at:
- `propertymappings/provider/scope/` (new, correct)
- NOT `propertymappings/scope/` (old, returns 405 Method Not Allowed on POST)
### Gotcha 6: Static Pod Manifest Changes Need Full Cycle
See skill: `kubelet-static-pod-manifest-update` for the full restart procedure.
## Verification
After all fixes:
```bash
# 1. JWKS has a key
curl -s https://authentik.example.com/application/o/kubernetes/jwks/ | jq '.keys | length'
# Expected: 1 (or more)
# 2. Test auth
KUBECONFIG=/path/to/oidc-kubeconfig kubectl get namespaces
# Expected: browser opens, login, namespaces returned
# 3. Check API server logs for success
ssh master "sudo kubectl logs -n kube-system -l component=kube-apiserver --tail=-1 | grep -i oidc | tail -5"
# Expected: no "Unable to authenticate" errors
```
## Notes
- The OAuth2 provider should use `client_type: public` (no client secret needed for kubelogin)
- Set `sub_mode: user_email` so the OIDC subject matches the RBAC binding
- Set `include_claims_in_id_token: true` for the token to contain claims directly
- Use `issuer_mode: per_provider` for a clean issuer URL
- RBAC ClusterRoleBindings should match on the user's email (the `--oidc-username-claim=email` value)
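
As an illustration of the last note, the sketch below builds a ClusterRoleBinding whose subject is the user's email. It is a hedged example, not the author's manifest: `cluster_role_binding` is a hypothetical helper, the binding name format is invented for illustration, and `view` is used only because it is a built-in ClusterRole. The output is JSON, which `kubectl apply -f -` accepts.

```python
import json


def cluster_role_binding(email: str, role: str = "view") -> dict:
    """Build a ClusterRoleBinding that grants `role` to the OIDC user `email`."""
    # Hypothetical naming scheme; any unique, valid object name works.
    safe = email.lower().replace("@", "-at-").replace(".", "-")
    return {
        "apiVersion": "rbac.authorization.k8s.io/v1",
        "kind": "ClusterRoleBinding",
        "metadata": {"name": f"oidc-{safe}-{role}"},
        "roleRef": {
            "apiGroup": "rbac.authorization.k8s.io",
            "kind": "ClusterRole",
            "name": role,
        },
        # The subject name must equal the username kube-apiserver derives from the token.
        "subjects": [
            {"apiGroup": "rbac.authorization.k8s.io", "kind": "User", "name": email}
        ],
    }


print(json.dumps(cluster_role_binding("user@example.com"), indent=2))
```

If the API server is started with `--oidc-username-prefix`, the subject name would need the same prefix; the sketch assumes no prefix is configured.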


@ -0,0 +1,254 @@
---
name: authentik
description: |
  Manage the Authentik identity provider via its REST API. Use when:
  (1) User asks to create, update, or delete users in Authentik,
  (2) User asks to manage groups or group memberships,
  (3) User asks to create a new OAuth2/OIDC application or provider,
  (4) User asks to protect a service with forward auth (Authentik + Traefik),
  (5) User asks about SSO, single sign-on, authentication, or identity,
  (6) User asks to manage Authentik flows, stages, or policies,
  (7) User asks to configure social login (Google, GitHub, Facebook),
  (8) User asks about OIDC for Kubernetes or who has access to what,
  (9) User deploys a new service that needs authentication.
  Authentik v2025.10.3 running in Kubernetes, managed via REST API.
author: Claude Code
version: 1.0.0
date: 2026-02-17
---
# Authentik Identity Provider Management
## Overview
- **URL**: `https://authentik.viktorbarzin.me`
- **Admin UI**: `https://authentik.viktorbarzin.me/if/admin/`
- **API Base**: `https://authentik.viktorbarzin.me/api/v3/`
- **API Docs**: `https://authentik.viktorbarzin.me/api/v3/docs/`
- **Helm Chart**: authentik v2025.10.3
- **Namespace**: `authentik`
## API Access
### Getting the Token
The API token is stored in `terraform.tfvars` (git-crypt encrypted):
```bash
AUTHENTIK_TOKEN=$(grep authentik_api_token terraform.tfvars | cut -d'"' -f2)
```
### Making API Calls
```bash
# Generic pattern
curl -s -H "Authorization: Bearer $AUTHENTIK_TOKEN" \
"https://authentik.viktorbarzin.me/api/v3/<endpoint>/"
# With JSON body (POST/PATCH/PUT)
curl -s -X POST \
-H "Authorization: Bearer $AUTHENTIK_TOKEN" \
-H "Content-Type: application/json" \
"https://authentik.viktorbarzin.me/api/v3/<endpoint>/" \
-d '{"key": "value"}'
```
### Verify Token Works
```bash
curl -s -H "Authorization: Bearer $AUTHENTIK_TOKEN" \
"https://authentik.viktorbarzin.me/api/v3/core/users/me/" | python3 -m json.tool
```
## Key API Endpoints
| Endpoint | Methods | Purpose |
|----------|---------|---------|
| `core/users/` | GET, POST | List/create users |
| `core/users/{id}/` | GET, PATCH, DELETE | Get/update/delete user |
| `core/groups/` | GET, POST | List/create groups |
| `core/groups/{pk}/` | GET, PATCH, DELETE | Get/update/delete group |
| `core/applications/` | GET, POST | List/create applications |
| `core/tokens/` | GET, POST | List/create tokens |
| `core/tokens/{identifier}/view_key/` | GET | View token secret key |
| `providers/all/` | GET | List all providers |
| `providers/oauth2/` | GET, POST | OAuth2/OIDC providers |
| `providers/proxy/` | GET, POST | Proxy providers (forward auth) |
| `flows/instances/` | GET | List flows |
| `stages/all/` | GET | List stages |
| `sources/all/` | GET | List sources (social login) |
| `outposts/instances/` | GET | List outposts |
| `propertymappings/provider/scope/` | GET, POST | OIDC scope mappings |
| `rbac/roles/` | GET | List roles |
## Common Operations
### List All Users
```bash
curl -s -H "Authorization: Bearer $AUTHENTIK_TOKEN" \
"https://authentik.viktorbarzin.me/api/v3/core/users/?page_size=50" | \
python3 -c "
import json,sys
for u in json.load(sys.stdin)['results']:
    groups=[g['name'] for g in (u.get('groups_obj') or [])]
    print(f\" {u['username']:<40} {u['name']:<30} groups={groups}\")
"
```
### Create a New User
```bash
curl -s -X POST \
-H "Authorization: Bearer $AUTHENTIK_TOKEN" \
-H "Content-Type: application/json" \
"https://authentik.viktorbarzin.me/api/v3/core/users/" \
-d '{
"username": "user@example.com",
"name": "Full Name",
"email": "user@example.com",
"is_active": true,
"type": "internal",
"path": "users"
}'
```
### Add User to Group
```bash
# First get the group to find current users
GROUP_PK="<group-uuid>"
CURRENT_USERS=$(curl -s -H "Authorization: Bearer $AUTHENTIK_TOKEN" \
"https://authentik.viktorbarzin.me/api/v3/core/groups/$GROUP_PK/" | \
python3 -c "import json,sys; print(json.load(sys.stdin)['users'])")
# Then PATCH with the updated user list (add new user pk)
curl -s -X PATCH \
-H "Authorization: Bearer $AUTHENTIK_TOKEN" \
-H "Content-Type: application/json" \
"https://authentik.viktorbarzin.me/api/v3/core/groups/$GROUP_PK/" \
-d '{"users": [<existing_pks>, <new_pk>]}'
```
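Because the PATCH replaces the whole user list, the body has to contain the existing members plus the new one. The sketch below shows one way to build that body; `merged_users_payload` is a hypothetical helper, and it assumes the `users` field is a flat list of user primary keys, as the command above implies.

```python
import json


def merged_users_payload(current_users, new_user_pk):
    """Return the JSON body for the group PATCH: existing members plus the new one."""
    # dict.fromkeys keeps the original order and drops duplicates,
    # so re-adding an existing member is harmless.
    merged = list(dict.fromkeys([*current_users, new_user_pk]))
    return json.dumps({"users": merged})


# Example: the group currently holds users 3 and 7, and user 12 is being added.
print(merged_users_payload([3, 7], 12))
```

The printed string can be passed to `curl -d` in place of the placeholder list.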
### Create a New Group
```bash
curl -s -X POST \
-H "Authorization: Bearer $AUTHENTIK_TOKEN" \
-H "Content-Type: application/json" \
"https://authentik.viktorbarzin.me/api/v3/core/groups/" \
-d '{
"name": "My New Group",
"is_superuser": false,
"parent": "<parent-group-pk-or-null>"
}'
```
### Create OAuth2/OIDC Application (Full Flow)
**Step 1: Create the OAuth2 Provider**
```bash
curl -s -X POST \
-H "Authorization: Bearer $AUTHENTIK_TOKEN" \
-H "Content-Type: application/json" \
"https://authentik.viktorbarzin.me/api/v3/providers/oauth2/" \
-d '{
  "name": "Provider for myapp",
  "authorization_flow": "<flow-pk>",
  "invalidation_flow": "<invalidation-flow-pk>",
  "client_type": "confidential",
  "client_id": "<generated-or-custom>",
  "client_secret": "<generated-or-custom>",
  "redirect_uris": [
    {"matching_mode": "strict", "url": "https://myapp.viktorbarzin.me/callback"}
  ],
  "property_mappings": ["<scope-mapping-pks>"],
  "signing_key": "<signing-key-pk>"
}'
```
**Step 2: Create the Application**
```bash
curl -s -X POST \
-H "Authorization: Bearer $AUTHENTIK_TOKEN" \
-H "Content-Type: application/json" \
"https://authentik.viktorbarzin.me/api/v3/core/applications/" \
-d '{
"name": "My App",
"slug": "myapp",
"provider": <provider-pk-from-step-1>,
"meta_launch_url": "https://myapp.viktorbarzin.me"
}'
```
### List Applications
```bash
curl -s -H "Authorization: Bearer $AUTHENTIK_TOKEN" \
"https://authentik.viktorbarzin.me/api/v3/core/applications/?page_size=50" | \
python3 -c "
import json,sys
for a in json.load(sys.stdin)['results']:
    ptype = (a.get('provider_obj') or {}).get('verbose_name','N/A')
    print(f\" {a['name']:<30} slug={a['slug']:<25} provider={ptype}\")
"
```
### Create a Non-Expiring API Token
```bash
# Create token
curl -s -X POST \
-H "Authorization: Bearer $AUTHENTIK_TOKEN" \
-H "Content-Type: application/json" \
"https://authentik.viktorbarzin.me/api/v3/core/tokens/" \
-d '{
"identifier": "my-token-name",
"intent": "api",
"expiring": false,
"description": "Description here"
}'
# Retrieve the key
curl -s -H "Authorization: Bearer $AUTHENTIK_TOKEN" \
"https://authentik.viktorbarzin.me/api/v3/core/tokens/my-token-name/view_key/"
```
## Important Reference UUIDs
### Authorization Flows
| Flow | Slug | Use For |
|------|------|---------|
| Authorize Application (explicit consent) | `default-provider-authorization-explicit-consent` | Apps that should show consent screen |
| Authorize Application (implicit consent) | `default-provider-authorization-implicit-consent` | Internal/trusted apps, auto-redirect |
| Logout | `default-invalidation-flow` | Invalidation/logout flow |
### Common Property Mappings (OIDC Scopes)
These are the standard scope mappings used by most providers:
- `60e33a8c-66a2-414f-840c-b13012b4d4bd` — openid
- `1f51c659-f13b-4ad4-ba89-70458ef88e9c` — email
- `4c0bf430-7f74-4216-b9d7-23703ab544ba` — profile
### Login Sources
| Source | Slug | Matching Mode |
|--------|------|---------------|
| Google | `google` | identifier |
| GitHub | `github` | email_link |
| Facebook | `facebook` | email_link |
## Protecting a Service with Forward Auth
To protect a service via Authentik + Traefik forward auth:
1. In the service's Terraform module, set `protected = true` in the `ingress_factory` call
2. This adds the `authentik-forward-auth` Traefik middleware
3. Unauthenticated users get redirected to the Authentik login page
4. After login, these headers are forwarded to the service:
   - `X-authentik-username`
   - `X-authentik-uid`
   - `X-authentik-email`
   - `X-authentik-name`
   - `X-authentik-groups`
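
A backend behind the middleware can read these headers to identify the caller. The following is a minimal sketch under stated assumptions: `AuthentikIdentity` and `identity_from_headers` are hypothetical names, and the group list is assumed to be pipe-separated (`a|b|c`). The headers should only be trusted when the service is reachable exclusively through Traefik.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AuthentikIdentity:
    username: str
    uid: str
    email: str
    name: str
    groups: list


def identity_from_headers(headers: dict) -> Optional[AuthentikIdentity]:
    """Build an identity from forwarded headers, or None if the user is unknown."""
    # HTTP header names are case-insensitive, so normalise them first.
    lowered = {key.lower(): value for key, value in headers.items()}
    username = lowered.get("x-authentik-username")
    if not username:
        return None
    raw_groups = lowered.get("x-authentik-groups", "")
    return AuthentikIdentity(
        username=username,
        uid=lowered.get("x-authentik-uid", ""),
        email=lowered.get("x-authentik-email", ""),
        name=lowered.get("x-authentik-name", ""),
        # Assumption: groups arrive as a single pipe-separated string.
        groups=[group for group in raw_groups.split("|") if group],
    )
```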
## Gotchas
1. **API pagination**: All list endpoints return paginated results. Use `?page_size=50` or check `pagination.next` for more pages.
2. **Group user updates**: PATCH to groups replaces the entire user list — always fetch current users first, then append.
3. **Provider property mappings**: Must reference existing scope mapping UUIDs. Query `propertymappings/provider/scope/` to find them.
4. **Signing key for OIDC**: Must assign a signing key to OAuth2 providers or JWKS endpoint returns empty `{}`.
5. **Email verified claim**: Default email scope mapping sets `email_verified: False`. For Kubernetes OIDC, create a custom mapping that returns `True`.
6. **Token identifier uniqueness**: Token identifiers must be unique across the entire instance.
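
For the pagination gotcha, the sketch below walks every page of a list endpoint. It is a hedged example: `fetch_json` and `list_all` are hypothetical helpers, and it assumes that `pagination.next` holds the next page number and is `0` (or absent) on the last page, and that the page is selected with a `page` query parameter.

```python
import json
import urllib.request


def fetch_json(url, token):
    """GET a URL with a bearer token and parse the JSON response."""
    request = urllib.request.Request(url, headers={"Authorization": f"Bearer {token}"})
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def list_all(base_url, endpoint, token, page_size=50, fetch=fetch_json):
    """Collect `results` from every page of a paginated list endpoint."""
    results = []
    page = 1
    while page:
        url = f"{base_url}/{endpoint}/?page={page}&page_size={page_size}"
        data = fetch(url, token)
        results.extend(data["results"])
        # Assumption: `next` is the next page number, falsy when there is none.
        page = data.get("pagination", {}).get("next") or 0
    return results
```

For example, `list_all("https://authentik.viktorbarzin.me/api/v3", "core/users", token)` would return all users rather than only the first page. The `fetch` parameter exists so the loop can be exercised without network access.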
## Notes
- Authentik is classified as DEFCON Level 1 (Critical) — handle with care
- Changes to Authentik configuration (Helm chart, PgBouncer, etc.) must go through Terraform
- API-level changes (users, groups, applications) are fine to make directly via the API
- The embedded outpost auto-discovers providers assigned to it
- See also: `ingress-factory-migration` skill for protecting services


@ -0,0 +1,175 @@
---
name: bluestacks-burp-interception
description: |
  Intercept Android app HTTPS traffic using BlueStacks and Burp Suite on macOS.
  Use when: (1) Need to analyze Android app API calls, (2) App ignores HTTP proxy,
  (3) App uses SSL pinning that blocks interception, (4) Need to install Burp CA
  as system certificate. Covers ADB setup, proxy configuration, Zygisk SSL unpinning,
  and Magisk trustusercerts module for system CA installation.
author: Claude Code
version: 1.0.0
date: 2026-01-24
---
# BlueStacks + Burp Suite HTTPS Traffic Interception
## Problem
You want to intercept HTTPS traffic from an Android app running in BlueStacks to analyze
API calls, but the app either ignores the proxy or uses SSL certificate pinning.
## Context / Trigger Conditions
- Running BlueStacks on macOS with Burp Suite
- App traffic not appearing in Burp Suite
- App crashes or refuses to connect when proxy is set
- Need to bypass SSL pinning for security testing/research
## Prerequisites
- BlueStacks with Magisk (kitsune variant) and root enabled
- Zygisk-SSL-Unpinning module installed
- trustusercerts Magisk module installed
- Android SDK installed (for ADB)
- Burp Suite running on port 8080
## Solution
### Step 1: Connect ADB to BlueStacks
```bash
# ADB location on macOS (Android SDK)
ADB=~/Library/Android/sdk/platform-tools/adb
# Connect to BlueStacks
$ADB connect localhost:5555
# Verify connection
$ADB devices
# Should show: emulator-5554 or localhost:5555
```
Note: BlueStacks runs **arm64-v8a** (not x86 as you might expect).
### Step 2: Set HTTP Proxy
Use your Mac's WiFi IP address (not 10.0.2.2 or localhost):
```bash
# Get Mac WiFi IP
IP=$(ipconfig getifaddr en0)
# Set proxy (Burp default port 8080)
$ADB shell settings put global http_proxy ${IP}:8080
# Verify
$ADB shell settings get global http_proxy
# Disable proxy when done
$ADB shell settings put global http_proxy :0
```
### Step 3: Configure SSL Unpinning for Target App
```bash
# Find app package name
$ADB shell pm list packages | grep <keyword>
# Edit config
$ADB shell "su -c 'cat > /data/local/tmp/zyg.ssl/config.json << EOF
{
\"targets\": [
{
\"pkg_name\" : \"com.example.app\",
\"enable\": true,
\"start_safe\": true,
\"start_delay\": 1000
}
]
}
EOF'"
# Restart the app
$ADB shell am force-stop com.example.app
$ADB shell monkey -p com.example.app -c android.intent.category.LAUNCHER 1
# Verify SSL unpinning is active
$ADB shell "logcat -d | grep -i ZygiskSSL | tail -10"
# Should show: "App detected: com.example.app" and "[*] SSL UNPINNING [#]"
```
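Since a malformed `config.json` is one reason unpinning silently does not activate, it can be worth sanity-checking the file before pushing it. The sketch below is a hypothetical checker, not part of the Zygisk module: `check_zygisk_config` is an invented name, and it only knows about the fields shown in the example above (`targets`, `pkg_name`, `enable`).

```python
import json


def check_zygisk_config(text: str) -> list:
    """Return a list of problems found in a Zygisk SSL unpinning config."""
    try:
        config = json.loads(text)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg} (line {exc.lineno}, column {exc.colno})"]
    targets = config.get("targets") if isinstance(config, dict) else None
    if not isinstance(targets, list) or not targets:
        return ['"targets" must be a non-empty list']
    problems = []
    for index, target in enumerate(targets):
        if not isinstance(target, dict) or not isinstance(target.get("pkg_name"), str):
            problems.append(f"targets[{index}]: missing string pkg_name")
        elif target.get("enable") is not True:
            problems.append(f"targets[{index}] ({target['pkg_name']}): enable is not true")
    return problems
```

An empty list means the file parses and every target is enabled; it says nothing about whether the module itself accepts the file.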
### Step 4: Install Burp CA as System Certificate
```bash
# Download Burp CA cert
curl -x http://127.0.0.1:8080 http://burp/cert -o /tmp/burp-cert.der
# Convert to PEM
openssl x509 -inform DER -in /tmp/burp-cert.der -out /tmp/burp-cert.pem
# Get hash for Android cert store naming
HASH=$(openssl x509 -inform PEM -subject_hash_old -in /tmp/burp-cert.pem | head -1)
cp /tmp/burp-cert.pem /tmp/${HASH}.0
# Push to device
$ADB push /tmp/${HASH}.0 /sdcard/
# Install via trustusercerts Magisk module
$ADB shell "su -c 'cp /sdcard/${HASH}.0 /data/adb/modules/trustusercerts/system/etc/security/cacerts/'"
$ADB shell "su -c 'chmod 644 /data/adb/modules/trustusercerts/system/etc/security/cacerts/${HASH}.0'"
# Reboot required for Magisk overlay
$ADB shell "su -c 'reboot'"
# After reboot, verify cert is in system store
$ADB shell "su -c 'ls /system/etc/security/cacerts/${HASH}.0'"
```
### Step 5: Test Interception
1. Re-enable proxy after reboot: `$ADB shell settings put global http_proxy ${IP}:8080`
2. Launch target app
3. Check Burp Suite → Proxy → HTTP history for requests
## Verification
- Proxy set: `adb shell settings get global http_proxy` returns `<ip>:8080`
- SSL unpinning active: `logcat | grep ZygiskSSL` shows "SSL UNPINNING"
- Burp CA installed: `ls /system/etc/security/cacerts/<hash>.0` exists
- Traffic visible in Burp Suite HTTP history
## Troubleshooting
| Symptom | Cause | Fix |
|---------|-------|-----|
| No traffic in Burp | Proxy not set | Check `settings get global http_proxy` |
| App shows SSL error | Cert not installed | Verify cert in system store, reboot |
| SSL unpinning not working | Config not loaded | Force-stop app, check config.json syntax |
| ADB connection refused | BlueStacks ADB disabled | Enable in BlueStacks Settings → Advanced |
| Wrong cert hash | Using wrong openssl flag | Use `-subject_hash_old`, not `-subject_hash` |
## Notes
- BlueStacks runs arm64-v8a, so Zygisk modules need arm64 support
- The trustusercerts module copies certs at boot via Magisk overlay
- System partition is read-only; use Magisk modules instead of direct mounting
- Burp cert hash is typically `9a5ba575` but verify for your instance
- Some apps may use additional protections (root detection, Frida detection)
## Quick Reference
```bash
# Set proxy
adb shell settings put global http_proxy <ip>:8080
# Disable proxy
adb shell settings put global http_proxy :0
# Check SSL unpinning logs
adb shell "logcat -d | grep -i ZygiskSSL"
# Force restart app
adb shell am force-stop <package> && adb shell monkey -p <package> -c android.intent.category.LAUNCHER 1
```
## References
- [Zygisk-SSL-Unpinning](https://github.com/m0szy/Zygisk-SSL-Unpinning)
- [MagiskTrustUserCerts](https://github.com/NVISOsecurity/MagiskTrustUserCerts)
- [Burp Suite Documentation](https://portswigger.net/burp/documentation)


@ -0,0 +1,189 @@
---
name: clickhouse-k8s-nfs-system-log-bloat
description: |
  Fix for ClickHouse consuming excessive CPU (500m-1000m+) on Kubernetes when running on
  NFS storage, caused by unbounded system log table growth triggering continuous background
  merges. Use when: (1) ClickHouse burns ~1 CPU core with no active user queries,
  (2) system.merges shows constant merge activity on system.metric_log or system.trace_log,
  (3) system log tables (metric_log, trace_log, text_log, asynchronous_metric_log) have
  grown to gigabytes while actual user data is tiny, (4) ClickHouse crashes with exit code
  76 (loadOutdatedDataParts SIGSEGV), (5) attempting to mount custom config.d XML via
  Kubernetes ConfigMap causes exit code 36 (BAD_ARGUMENTS) crashes. Also covers why
  ClickHouse's MergeTree engine performs poorly on NFS and the CronJob workaround for
  system log truncation.
author: Claude Code
version: 1.0.0
date: 2026-03-01
---
# ClickHouse on Kubernetes/NFS: System Log Bloat & CPU Overhead
## Problem
ClickHouse deployed on Kubernetes with NFS storage consumes ~1 CPU core continuously,
even when actual user queries are negligible. The CPU is consumed by background merge
operations on system log tables that grow unboundedly with no default TTL.
## Context / Trigger Conditions
- ClickHouse pod using 500m-1000m+ CPU with no active user queries
- `SELECT * FROM system.processes` shows only diagnostic queries
- `SELECT * FROM system.merges` shows constant merge activity on `system.metric_log`
- System log tables have grown to gigabytes:
  - `system.trace_log`: 5+ GiB, 200M+ rows
  - `system.text_log`: 3+ GiB, 90M+ rows
  - `system.metric_log`: 1+ GiB with 80-100+ active parts (healthy is <20)
  - `system.asynchronous_metric_log`: 500+ MiB, 1B+ rows
- Actual user data (e.g., `clickhouse.events`) is only kilobytes
- ClickHouse crashes periodically with exit code 76 (`loadOutdatedDataParts` SIGSEGV)
- Data directory is on NFS (e.g., `/mnt/main/clickhouse`)
## Root Cause
Two compounding issues:
1. **No TTL on system log tables**: ClickHouse system tables (`metric_log`, `trace_log`,
   `text_log`, `asynchronous_metric_log`, `query_log`, `part_log`) have no default
   retention policy and grow indefinitely.
2. **NFS amplifies merge overhead**: ClickHouse's MergeTree engine relies on background
   merge operations that involve heavy sequential I/O. NFS latency makes merges 10-100x
   slower than local disk, creating a feedback loop:
   - Slow merges → parts accumulate faster than they can be merged
   - More parts → more merge operations spawned
   - More merges → more CPU for decompression/recompression while waiting on NFS I/O
## Solution
### Immediate Fix: Truncate System Tables
```bash
CH_POD=$(kubectl get pod -n <namespace> -l app=clickhouse -o jsonpath='{.items[0].metadata.name}')
kubectl exec -n <namespace> $CH_POD -- clickhouse-client --query "TRUNCATE TABLE IF EXISTS system.metric_log"
kubectl exec -n <namespace> $CH_POD -- clickhouse-client --query "TRUNCATE TABLE IF EXISTS system.trace_log"
kubectl exec -n <namespace> $CH_POD -- clickhouse-client --query "TRUNCATE TABLE IF EXISTS system.text_log"
kubectl exec -n <namespace> $CH_POD -- clickhouse-client --query "TRUNCATE TABLE IF EXISTS system.asynchronous_metric_log"
kubectl exec -n <namespace> $CH_POD -- clickhouse-client --query "TRUNCATE TABLE IF EXISTS system.query_log"
kubectl exec -n <namespace> $CH_POD -- clickhouse-client --query "TRUNCATE TABLE IF EXISTS system.part_log"
```
This can take 30-60+ seconds per table on NFS due to part cleanup I/O.
### Permanent Fix: CronJob for Periodic Truncation
Add a Kubernetes CronJob that truncates system tables via the ClickHouse HTTP API:
```hcl
resource "kubernetes_cron_job_v1" "clickhouse_truncate_logs" {
  metadata {
    name      = "clickhouse-truncate-logs"
    namespace = "<namespace>"
  }
  spec {
    schedule                      = "0 */6 * * *"
    successful_jobs_history_limit = 1
    failed_jobs_history_limit     = 1
    job_template {
      metadata {}
      spec {
        template {
          metadata {}
          spec {
            restart_policy = "OnFailure"
            container {
              name  = "truncate"
              image = "curlimages/curl:8.12.1"
              # -f makes curl exit non-zero on HTTP errors, so the && chain stops and the job fails visibly
              command = ["sh", "-c", join(" && ", [
                "curl -sf 'http://clickhouse.<ns>.svc.cluster.local:8123/?user=default&password=<pw>' -d 'TRUNCATE TABLE IF EXISTS system.metric_log'",
                "curl -sf 'http://clickhouse.<ns>.svc.cluster.local:8123/?user=default&password=<pw>' -d 'TRUNCATE TABLE IF EXISTS system.trace_log'",
                "curl -sf 'http://clickhouse.<ns>.svc.cluster.local:8123/?user=default&password=<pw>' -d 'TRUNCATE TABLE IF EXISTS system.text_log'",
                "curl -sf 'http://clickhouse.<ns>.svc.cluster.local:8123/?user=default&password=<pw>' -d 'TRUNCATE TABLE IF EXISTS system.asynchronous_metric_log'",
                "curl -sf 'http://clickhouse.<ns>.svc.cluster.local:8123/?user=default&password=<pw>' -d 'TRUNCATE TABLE IF EXISTS system.query_log'",
                "curl -sf 'http://clickhouse.<ns>.svc.cluster.local:8123/?user=default&password=<pw>' -d 'TRUNCATE TABLE IF EXISTS system.part_log'",
                "echo 'System logs truncated'"
              ])]
            }
          }
        }
      }
    }
  }
}
```
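The same truncation can be tried by hand before committing to the CronJob. The sketch below mirrors the CronJob's HTTP calls in Python; `build_truncate_calls` and `truncate_system_logs` are hypothetical helpers, and the base URL, user, and password are placeholders to be supplied by the caller.

```python
import urllib.parse
import urllib.request

# The six system log tables named in this runbook.
SYSTEM_LOG_TABLES = [
    "metric_log",
    "trace_log",
    "text_log",
    "asynchronous_metric_log",
    "query_log",
    "part_log",
]


def build_truncate_calls(base_url, user, password):
    """Return (url, statement) pairs, one per system log table."""
    # urlencode escapes special characters in the password for use in the query string.
    query = urllib.parse.urlencode({"user": user, "password": password})
    url = f"{base_url}/?{query}"
    return [(url, f"TRUNCATE TABLE IF EXISTS system.{table}") for table in SYSTEM_LOG_TABLES]


def truncate_system_logs(base_url, user, password, timeout=120):
    """POST each TRUNCATE statement; urlopen raises on HTTP error responses."""
    for url, statement in build_truncate_calls(base_url, user, password):
        request = urllib.request.Request(url, data=statement.encode(), method="POST")
        with urllib.request.urlopen(request, timeout=timeout) as response:
            response.read()
```

The generous timeout reflects the note above that each truncation can take 30-60+ seconds on NFS.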
### What Does NOT Work: Config.d XML Mount
**DO NOT** attempt to mount custom XML config files into `/etc/clickhouse-server/config.d/`
via Kubernetes ConfigMap. Both approaches crash ClickHouse with exit code 36 (BAD_ARGUMENTS):
- **Full directory mount** (`mount_path = "/etc/clickhouse-server/config.d"`): Replaces
the entire directory, deleting the built-in `docker_related_config.xml` that the
entrypoint expects. Even if you include it in your ConfigMap, ClickHouse still crashes.
- **sub_path mount** (`sub_path = "custom.xml"`): Also crashes with exit code 36, even
with minimal valid XML containing only `<background_pool_size>4</background_pool_size>`.
- Both `remove="1"` (to disable tables) and `<ttl>` (to set retention) config overrides
crash with exit code 36.
This appears to be an issue with the `clickhouse/clickhouse-server:25.4.2` Docker image
and how it preprocesses config at startup. The CronJob approach bypasses this entirely.
## Verification
After truncation, verify:
```bash
# CPU should drop from ~900m to ~100m within minutes
kubectl top pod -n <namespace> -l app=clickhouse
# No active merges
kubectl exec -n <namespace> $CH_POD -- clickhouse-client --query \
"SELECT count() FROM system.merges"
# System tables should be small
kubectl exec -n <namespace> $CH_POD -- clickhouse-client --query \
"SELECT database, table, formatReadableSize(sum(bytes_on_disk)) as size, sum(rows) as rows \
FROM system.parts WHERE active GROUP BY database, table ORDER BY sum(bytes_on_disk) DESC \
FORMAT Pretty"
```
## Diagnostic Commands
```bash
# Check what's consuming CPU (merges vs queries)
kubectl exec -n <ns> $CH_POD -- clickhouse-client --query \
"SELECT * FROM system.merges FORMAT Pretty"
kubectl exec -n <ns> $CH_POD -- clickhouse-client --query \
"SELECT query_id, elapsed, query FROM system.processes WHERE is_initial_query FORMAT Pretty"
# Check background pool config
kubectl exec -n <ns> $CH_POD -- clickhouse-client --query \
"SELECT name, value FROM system.server_settings \
WHERE name IN ('background_pool_size', 'background_merges_mutations_concurrency_ratio') \
FORMAT Pretty"
# Default is background_pool_size=16, concurrency_ratio=2 → up to 32 concurrent merges
```
## Notes
- **Exit code 76**: ClickHouse crashes in `loadOutdatedDataParts()` when there are hundreds
of outdated parts on NFS. The truncation CronJob prevents this by keeping tables small.
- **Exit code 36**: `BAD_ARGUMENTS` in ClickHouse. Triggered by config.d XML mounts in
Kubernetes. Root cause unclear but reproducible across mount methods.
- **Default thread pools**: ClickHouse defaults to `background_pool_size=16` and
`background_schedule_pool_size=512`, spawning 700+ threads even for a single-table
workload. This overhead is unavoidable without config file changes.
- **NFS is fundamentally unsuitable** for ClickHouse's MergeTree engine. If data
persistence is not critical (e.g., analytics data is small), consider `emptyDir` or
local PV storage instead.
## See Also
- `k8s-nfs-mount-troubleshooting` — NFS mount failures and permission issues
- `k8s-limitrange-oom-silent-kill` — LimitRange defaults causing OOM in ClickHouse containers


@ -0,0 +1,145 @@
---
name: coturn-k8s-without-hostnetwork
description: |
  Deploy coturn (TURN/STUN server) on Kubernetes without hostNetwork by using a
  narrow relay port range and MetalLB LoadBalancer service. Use when: (1) deploying
  a WebRTC relay server on k8s, (2) want coturn to run on any node (not pinned),
  (3) avoiding hostNetwork for better pod scheduling and multi-replica support,
  (4) need TURN for NAT traversal in WebRTC apps (video streaming, conferencing).
  Covers relay port range sizing, MetalLB IP sharing, ephemeral TURN credentials
  via HMAC-SHA1, and pfSense port forwarding.
author: Claude Code
version: 1.0.0
date: 2026-02-21
---
# coturn on Kubernetes Without hostNetwork
## Problem
TURN servers traditionally require hostNetwork because they relay media over a wide
UDP port range (49152-65535). This pins the server to a single node, prevents rolling
updates, and wastes cluster flexibility.
## Context / Trigger Conditions
- Deploying a TURN/STUN server for WebRTC applications on Kubernetes
- Want the TURN pod to be schedulable on any node
- Need to avoid hostNetwork for better availability and scheduling
## Solution
### Key insight: Narrow the relay port range
A home lab with ~20 concurrent WebRTC viewers needs ~40 relay ports (2 per viewer).
Use roughly 100 ports (49152-49252, which is 101 ports inclusive) instead of the
full ~16K range. This makes it practical to expose via a K8s LoadBalancer service.
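
The sizing arithmetic can be checked with a few lines of Python. This is only a worked example of the numbers in this section; the helper names are hypothetical, and the 2-ports-per-viewer figure is the rule of thumb stated above.

```python
def relay_ports_needed(viewers: int, ports_per_viewer: int = 2) -> int:
    """Relay ports required for a given number of concurrent viewers."""
    return viewers * ports_per_viewer


def range_size(min_port: int, max_port: int) -> int:
    """Number of ports in an inclusive port range."""
    return max_port - min_port + 1


def max_viewers(min_port: int, max_port: int, ports_per_viewer: int = 2) -> int:
    """Concurrent viewers an inclusive port range can serve."""
    return range_size(min_port, max_port) // ports_per_viewer


print(relay_ports_needed(20))       # ports needed for ~20 viewers
print(range_size(49152, 49252))     # size of the narrowed range
print(max_viewers(49152, 49252))    # viewers the narrowed range supports
print(range_size(49152, 65535))     # size of the full default range
```

The narrowed range leaves comfortable headroom over the ~40 ports a 20-viewer home lab needs, while keeping the Service definition to about a hundred port entries.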
### Terraform module structure
```hcl
locals {
  turn_port = 3478
  min_port  = 49152
  max_port  = 49252 # 101 ports inclusive, enough for ~50 concurrent streams
}

resource "kubernetes_deployment" "coturn" {
  spec {
    # No hostNetwork, no nodeSelector: runs anywhere
    template {
      spec {
        container {
          image = "coturn/coturn:latest"
          args  = ["-c", "/etc/turnserver/turnserver.conf"]
          port {
            container_port = 3478
            protocol       = "UDP"
          }
        }
      }
    }
  }
}

resource "kubernetes_service" "coturn" {
  metadata {
    annotations = {
      # Share an existing MetalLB IP to avoid consuming a new one
      "metallb.universe.tf/loadBalancerIPs" = "10.0.20.200"
      "metallb.universe.tf/allow-shared-ip" = "shared"
    }
  }
  spec {
    type = "LoadBalancer"
    # Signaling port
    port {
      name     = "turn-udp"
      port     = 3478
      protocol = "UDP"
    }
    # Relay ports: the dynamic block generates one port definition per relay port (101 here)
    dynamic "port" {
      for_each = range(49152, 49253)
      content {
        name        = "relay-${port.value}"
        port        = port.value
        target_port = port.value
        protocol    = "UDP"
      }
    }
  }
}
```
### coturn config (turnserver.conf)
```
listening-port=3478
fingerprint
lt-cred-mech
use-auth-secret
static-auth-secret=YOUR_SECRET_HERE
realm=yourdomain.com
listening-ip=0.0.0.0
min-port=49152
max-port=49252
no-multicast-peers
no-cli
```
### MetalLB IP sharing
To reuse an existing MetalLB IP (e.g., the WireGuard/Shadowsocks shared IP):
1. Add `metallb.universe.tf/allow-shared-ip: shared` to the coturn service
2. The same annotation must exist on all other services sharing that IP
3. **Port conflicts are not allowed** — verify no other service uses 3478 or 49152-49252
4. After changing the IP annotation, **delete and recreate** the service — MetalLB won't reassign IPs on annotation changes alone
### Ephemeral TURN credentials
coturn's `use-auth-secret` mode generates time-limited credentials via HMAC-SHA1:
```javascript
const crypto = require('crypto');
const TURN_SECRET = 'your-shared-secret';
function getTurnCredentials(name = 'user', ttl = 86400) {
const timestamp = Math.floor(Date.now() / 1000) + ttl;
const username = `${timestamp}:${name}`;
const credential = crypto.createHmac('sha1', TURN_SECRET)
.update(username).digest('base64');
return { username, credential };
}
```
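For a Python backend, the same scheme can be sketched with the standard library. This is a hedged counterpart to the JavaScript above, not code from the original setup: `get_turn_credentials` and `is_expired` are hypothetical names, and the optional `now` parameter exists only to make the function deterministic in tests.

```python
import base64
import hashlib
import hmac
import time
from typing import Optional


def get_turn_credentials(secret: str, name: str = "user", ttl: int = 86400,
                         now: Optional[float] = None) -> dict:
    """Build a time-limited username/credential pair for use-auth-secret mode."""
    expiry = int(now if now is not None else time.time()) + ttl
    username = f"{expiry}:{name}"
    # HMAC-SHA1 over the username, keyed by the shared secret, base64-encoded
    digest = hmac.new(secret.encode(), username.encode(), hashlib.sha1).digest()
    return {"username": username, "credential": base64.b64encode(digest).decode()}


def is_expired(username: str, now: Optional[float] = None) -> bool:
    """Return True if the timestamp prefix of the username is in the past."""
    expiry = int(username.split(":", 1)[0])
    return expiry < int(now if now is not None else time.time())
```

The secret passed in must match `static-auth-secret` in `turnserver.conf`; the returned pair goes into the `username` and `credential` fields of the browser's ICE server entry.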
## Verification
```bash
# STUN binding request (raw UDP probe)
echo -ne '\x00\x01\x00\x00\x21\x12\xa4\x42\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00' \
| nc -u -w2 <METALLB_IP> 3478 | xxd | head -3
# Response starting with 0101 = successful STUN binding response
```
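If `nc` or `xxd` are not available, the same probe can be sketched in Python. The header layout follows the bytes used above (message type `0x0001`, zero length, magic cookie `0x2112A442`, 12-byte transaction ID); the function names are hypothetical, and `probe` is a minimal sketch that treats any socket error or timeout as a failure.

```python
import os
import socket
import struct
from typing import Optional

MAGIC_COOKIE = 0x2112A442
BINDING_REQUEST = 0x0001
BINDING_SUCCESS = 0x0101


def build_binding_request(transaction_id: Optional[bytes] = None) -> bytes:
    """Return a 20-byte STUN binding request with no attributes."""
    txid = transaction_id if transaction_id is not None else os.urandom(12)
    if len(txid) != 12:
        raise ValueError("transaction ID must be 12 bytes")
    # Message type, message length (0), magic cookie, then the transaction ID
    return struct.pack("!HHI", BINDING_REQUEST, 0, MAGIC_COOKIE) + txid


def is_binding_success(response: bytes, transaction_id: bytes) -> bool:
    """Check the STUN header: success type, magic cookie, echoed transaction ID."""
    if len(response) < 20:
        return False
    msg_type, _length, cookie = struct.unpack("!HHI", response[:8])
    return (msg_type == BINDING_SUCCESS and cookie == MAGIC_COOKIE
            and response[8:20] == transaction_id)


def probe(host: str, port: int = 3478, timeout: float = 2.0) -> bool:
    """Send one binding request over UDP and report whether it succeeded."""
    txid = os.urandom(12)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(build_binding_request(txid), (host, port))
            response, _addr = sock.recvfrom(2048)
        except OSError:  # includes timeouts
            return False
    return is_binding_success(response, txid)
```

Calling `probe("<METALLB_IP>")` from a machine on the LAN should return `True` once the service is reachable.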
## Notes
- 100 relay ports support ~50 concurrent streams (2 ports per stream)
- If you need more, raise `max_port`, the service's relay port range, and `max-port` in `turnserver.conf` together so coturn and the service stay in sync
- coturn auto-detects pod IP — no need to set `relay-ip` or `external-ip` explicitly
- For public access, add NAT port forwards on pfSense for UDP 3478 + 49152-49252
- See also: `pfsense-nat-rule-creation` skill for adding the port forwards

@ -0,0 +1,99 @@
---
name: crowdsec-agent-registration-failure
description: |
  Fix CrowdSec agent pods stuck in CrashLoopBackOff after LAPI restart due to stale
  machine registrations. Use when: (1) CrowdSec agent init container fails with
  "user already exist" error during cscli lapi register, (2) agent pods show hundreds
  of init container restarts, (3) LAPI was restarted or redeployed but agents kept
  running with old credentials, (4) cscli machines list shows stale entries for
  current agent pod names. Covers deleting stale registrations to allow re-registration.
author: Claude Code
version: 1.0.0
date: 2026-02-15
---
# CrowdSec Agent Registration Failure
## Problem
After a CrowdSec LAPI restart or redeployment, agent DaemonSet pods lose their
credentials but LAPI retains the old machine registrations. When agents try to
re-register with the same pod name, the `wait-for-lapi-and-register` init container
fails with `user already exist`, causing CrashLoopBackOff with hundreds of restarts.
## Context / Trigger Conditions
- Agent init container logs show: `Error: cscli lapi register: api client register: api register ... user 'crowdsec-agent-xxxxx': user already exist`
- Agent pods show status `CrashLoopBackOff` or `Init:CrashLoopBackOff` with many restarts
- `kubectl describe pod` shows `BackOff restarting failed container wait-for-lapi-and-register`
- LAPI pods were recently restarted or redeployed
- `cscli machines list` on LAPI shows entries matching the stuck agent pod names
## Solution
### Step 1: Identify stuck agents
```bash
kubectl --kubeconfig $(pwd)/config get pods -n crowdsec
```
Note the pod names that are in CrashLoopBackOff (e.g., `crowdsec-agent-jr5q7`).
### Step 2: Confirm the init container error
```bash
kubectl --kubeconfig $(pwd)/config logs -n crowdsec <agent-pod> -c wait-for-lapi-and-register --tail=5
```
Should show `user already exist` error.
### Step 3: Find a running LAPI pod
```bash
kubectl --kubeconfig $(pwd)/config get pods -n crowdsec | grep lapi
```
### Step 4: Delete stale machine registrations from LAPI
```bash
kubectl --kubeconfig $(pwd)/config exec -n crowdsec <lapi-pod> -- cscli machines delete <agent-pod-name>
```
Repeat for each stuck agent.
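With many stuck agents, Steps 1 and 4 can be combined in a small script. The sketch below only parses text and builds command strings; it does not call the cluster. `stuck_agents` and `delete_commands` are hypothetical helpers, the `crowdsec-agent-` prefix is taken from the pod names in this runbook, and the generated commands omit the `--kubeconfig` flag, so add it if your setup needs it.

```python
def stuck_agents(kubectl_output: str, prefix: str = "crowdsec-agent-") -> list:
    """Return agent pod names whose STATUS column contains CrashLoopBackOff."""
    stuck = []
    for line in kubectl_output.splitlines():
        fields = line.split()
        # Expected columns from `kubectl get pods`: NAME READY STATUS RESTARTS AGE
        if (len(fields) >= 3 and fields[0].startswith(prefix)
                and "CrashLoopBackOff" in fields[2]):
            stuck.append(fields[0])
    return stuck


def delete_commands(pods: list, lapi_pod: str, namespace: str = "crowdsec") -> list:
    """Build one `cscli machines delete` command per stuck agent."""
    return [
        f"kubectl exec -n {namespace} {lapi_pod} -- cscli machines delete {pod}"
        for pod in pods
    ]
```

Feed it the output of `kubectl get pods -n crowdsec`, review the generated commands, then run them.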
### Step 5: Wait for agents to recover
The agents are in CrashLoopBackOff with exponential backoff (up to 5 minutes). They'll
automatically retry registration and succeed after the stale entry is deleted. This can
take up to 5 minutes per agent depending on where they are in the backoff cycle.
## Verification
```bash
# All agents should show Running status
kubectl --kubeconfig $(pwd)/config get pods -n crowdsec | grep agent
# DaemonSet should show all pods READY
kubectl --kubeconfig $(pwd)/config get ds -n crowdsec
```
## Example
```bash
# Identify stuck agents
$ kubectl get pods -n crowdsec | grep agent
crowdsec-agent-jr5q7 0/1 CrashLoopBackOff 485 3d
crowdsec-agent-jw76q 1/1 Running 8 3d
crowdsec-agent-mtgxh 0/1 CrashLoopBackOff 483 3d
crowdsec-agent-pfw2l 0/1 CrashLoopBackOff 481 3d
# Delete stale registrations
$ kubectl exec -n crowdsec crowdsec-lapi-xxx -- cscli machines delete crowdsec-agent-jr5q7
level=info msg="machine 'crowdsec-agent-jr5q7' deleted successfully"
$ kubectl exec -n crowdsec crowdsec-lapi-xxx -- cscli machines delete crowdsec-agent-mtgxh
$ kubectl exec -n crowdsec crowdsec-lapi-xxx -- cscli machines delete crowdsec-agent-pfw2l
# Wait ~5 minutes, then verify
$ kubectl get pods -n crowdsec | grep agent
crowdsec-agent-jr5q7 1/1 Running 1 3d
crowdsec-agent-jw76q 1/1 Running 8 3d
crowdsec-agent-mtgxh 1/1 Running 1 3d
crowdsec-agent-pfw2l 1/1 Running 1 3d
```
## Notes
- This is a known limitation of the CrowdSec Helm chart — the init container registration
script is not idempotent (it doesn't handle "already exists" by deleting and re-registering).
- The `cscli machines list` output will show many historical stale entries from past
DaemonSet rollouts. These are harmless but can be cleaned up if desired.
- This issue also causes the CrowdSec blocklist import CronJob to fail, since it selects
agent pods alphabetically and may pick a non-running one. Fixing the agents also fixes
the blocklist import.
- See also: `k8s-nfs-mount-troubleshooting` for other common pod startup failures.

@ -0,0 +1,310 @@
---
name: fastapi-svelte-gpu-webui
description: |
  Pattern for building web UIs for GPU-based CLI tools. Use when:
  (1) Wrapping a command-line tool with a web interface, (2) Building job queue
  systems for long-running GPU tasks, (3) Creating file upload/download workflows,
  (4) Need real-time progress updates via WebSocket, (5) Deploying to Kubernetes
  with GPU scheduling. Covers FastAPI backend, Svelte 5 frontend, NFS storage,
  and Terraform deployment.
author: Claude Code
version: 1.0.0
date: 2025-01-31
---
# FastAPI + Svelte GPU WebUI Pattern
## Problem
Many powerful tools are command-line only, making them inaccessible to non-technical
users. Building a web UI requires handling file uploads, job queuing, progress tracking,
and GPU resource scheduling.
## Context / Trigger Conditions
- You have a CLI tool that does heavy processing (ML inference, media conversion, etc.)
- Want to add a web interface for easier access
- Need to track long-running job progress
- Deploying to Kubernetes with GPU nodes
- Files need to persist across pod restarts (NFS storage)
## Solution Overview
### Directory Structure
```
project-web/
├── backend/
│   ├── main.py              # FastAPI app
│   ├── api/
│   │   ├── __init__.py
│   │   └── routes.py        # REST endpoints
│   ├── services/
│   │   ├── __init__.py
│   │   └── converter.py     # CLI wrapper + job manager
│   ├── models/
│   │   ├── __init__.py
│   │   └── schemas.py       # Pydantic models
│   └── requirements.txt
├── frontend/
│   ├── src/
│   │   ├── App.svelte
│   │   ├── lib/
│   │   │   ├── FileUpload.svelte
│   │   │   ├── JobsList.svelte
│   │   │   └── ProgressBar.svelte
│   │   └── stores/
│   │       └── jobs.js
│   ├── package.json
│   └── vite.config.js
├── Dockerfile
└── README.md
```
### Backend: Job Manager Pattern
```python
# services/converter.py
import asyncio
import re
import uuid
from dataclasses import dataclass
from datetime import datetime
from pathlib import Path
from typing import Callable, Optional


@dataclass
class Job:
    id: str
    filename: str
    status: str  # pending, processing, completed, failed
    progress: float
    created_at: datetime
    output_file: Optional[str] = None
    error: Optional[str] = None


class JobManager:
    def __init__(self, storage_path: str = "/mnt"):
        self.storage_path = Path(storage_path)
        self.jobs: dict[str, Job] = {}
        self.progress_callbacks: dict[str, list[Callable]] = {}

    def create_job(self, filename: str, **options) -> Job:
        job_id = str(uuid.uuid4())
        job = Job(
            id=job_id,
            filename=filename,
            status="pending",
            progress=0.0,
            created_at=datetime.now(),
            **options,
        )
        self.jobs[job_id] = job
        return job

    def get_job(self, job_id: str) -> Optional[Job]:
        return self.jobs.get(job_id)

    def get_all_jobs(self) -> list[Job]:
        return list(self.jobs.values())

    def update_progress(self, job_id: str, progress: float) -> None:
        # Store the latest value and notify any registered listeners
        self.jobs[job_id].progress = progress
        for callback in self.progress_callbacks.get(job_id, []):
            callback(progress)

    async def run_conversion(self, job_id: str):
        job = self.jobs[job_id]
        job.status = "processing"
        input_path = self.storage_path / "uploads" / job.filename
        output_dir = self.storage_path / "outputs" / job_id
        output_dir.mkdir(parents=True, exist_ok=True)

        # Build command for CLI tool
        cmd = [
            "/path/to/cli-tool",
            str(input_path),
            "-o", str(output_dir),
            # Add other options...
        ]

        # Run with output capture for progress parsing
        process = await asyncio.create_subprocess_exec(
            *cmd,
            stdout=asyncio.subprocess.PIPE,
            stderr=asyncio.subprocess.PIPE,
        )

        # Parse output for progress updates
        async def read_output(stream):
            while True:
                line = await stream.readline()
                if not line:
                    break
                line_str = line.decode(errors="replace").strip()
                # Assumes the CLI prints progress like "42%"; adjust the pattern to your tool
                match = re.search(r"(\d+(?:\.\d+)?)%", line_str)
                if match:
                    self.update_progress(job_id, float(match.group(1)))

        # Drain stdout and stderr concurrently so neither pipe fills up and blocks
        await asyncio.gather(
            read_output(process.stdout),
            read_output(process.stderr),
        )
        returncode = await process.wait()

        if returncode == 0:
            output_files = list(output_dir.glob("*.m4b"))
            if output_files:
                job.output_file = output_files[0].name
                job.status = "completed"
            else:
                job.status = "failed"
                job.error = "No output file produced"
        else:
            job.status = "failed"
            job.error = f"Exit code {returncode}"


job_manager = JobManager()
```
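One caveat with a `readline()` loop like the one above: many CLI tools redraw their progress bar with carriage returns rather than newlines, so several updates can arrive in a single "line". A minimal sketch of a parser that handles this is shown below. `extract_progress` is a hypothetical helper, and the percentage format it matches is an assumption about the tool's output.

```python
import re
from typing import Optional

_PERCENT = re.compile(r"(\d+(?:\.\d+)?)\s*%")


def extract_progress(chunk: str) -> Optional[float]:
    """Return the last percentage found in a chunk of CLI output, or None."""
    latest = None
    # Treat carriage returns as line breaks so redrawn bars are seen individually
    for segment in re.split(r"[\r\n]+", chunk):
        match = _PERCENT.search(segment)
        if match:
            latest = float(match.group(1))
    if latest is None:
        return None
    # Clamp to a sane range in case the tool overshoots
    return max(0.0, min(100.0, latest))
```

Keeping the parser a pure function makes it easy to unit-test against captured output from the real tool before wiring it into the job manager.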
### Backend: API Routes
```python
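# api/routes.py
import asyncio
import shutil
from pathlib import Path

from fastapi import APIRouter, File, HTTPException, UploadFile
from fastapi.responses import FileResponse

# Import paths assume uvicorn is started from backend/ (see the directory structure);
# JobCreate is assumed to be a Pydantic model with a `filename` field
from models.schemas import JobCreate
from services.converter import job_manager

router = APIRouter(prefix="/api")

# Hold references so running tasks are not garbage-collected mid-execution
background_tasks: set[asyncio.Task] = set()


@router.post("/upload")
async def upload_file(file: UploadFile = File(...)):
    upload_dir = Path("/mnt/uploads")
    upload_dir.mkdir(parents=True, exist_ok=True)
    # Keep only the final path component so a name like ../../etc/passwd stays in upload_dir
    safe_name = Path(file.filename or "upload").name
    file_path = upload_dir / safe_name
    with file_path.open("wb") as buffer:
        shutil.copyfileobj(file.file, buffer)
    return {"filename": safe_name, "size": file_path.stat().st_size}


@router.post("/jobs")
async def create_job(request: JobCreate):
    # Pass any further Job fields from the request as keyword arguments here
    job = job_manager.create_job(filename=request.filename)
    task = asyncio.create_task(job_manager.run_conversion(job.id))
    background_tasks.add(task)
    task.add_done_callback(background_tasks.discard)
    return job


@router.get("/jobs")
async def list_jobs():
    return job_manager.get_all_jobs()


@router.get("/jobs/{job_id}/download")
async def download_job(job_id: str):
    job = job_manager.get_job(job_id)
    if not job or job.status != "completed":
        raise HTTPException(status_code=404)
    output_path = Path("/mnt/outputs") / job_id / job.output_file
    return FileResponse(output_path, filename=job.output_file)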
```
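The upload route writes a client-supplied filename to disk, so the name is worth sanitizing first (the `python-filename-sanitization` skill listed under See Also covers this in depth). A minimal stdlib sketch follows; `safe_filename` is a hypothetical helper, and the allowlist of characters is a conservative assumption rather than a requirement.

```python
import re
from pathlib import PurePosixPath, PureWindowsPath


def safe_filename(raw: str, default: str = "upload") -> str:
    """Reduce a client-supplied filename to a single safe path component."""
    # Strip directory components for both POSIX and Windows separators
    name = PureWindowsPath(PurePosixPath(raw).name).name
    # Replace anything outside a conservative allowlist
    name = re.sub(r"[^A-Za-z0-9._-]", "_", name)
    # Avoid hidden files and names that are empty or only dots
    name = name.lstrip(".")
    return name or default
```

Calling this on `file.filename` before building `file_path` keeps uploads inside the upload directory regardless of what the client sends.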
### Frontend: Svelte 5 Components
```svelte
<!-- FileUpload.svelte -->
<script>
  let { onUpload } = $props();
  let dragOver = $state(false);
  let uploading = $state(false);

  async function handleUpload(file) {
    uploading = true;
    try {
      const formData = new FormData();
      formData.append('file', file);
      const response = await fetch('/api/upload', {
        method: 'POST',
        body: formData
      });
      if (response.ok) {
        const data = await response.json();
        onUpload(data.filename);
      }
    } finally {
      // Reset even if the request throws, so the dropzone stays usable
      uploading = false;
    }
  }
</script>

<div class="dropzone"
     class:dragover={dragOver}
     ondragover={(e) => { e.preventDefault(); dragOver = true; }}
     ondragleave={() => dragOver = false}
     ondrop={(e) => { e.preventDefault(); dragOver = false; handleUpload(e.dataTransfer.files[0]); }}>
  Drop file here
</div>
```
### Dockerfile
```dockerfile
FROM python:3.12-slim
# Install Node for frontend build
RUN apt-get update && apt-get install -y nodejs npm
# Build frontend
COPY frontend/ /app/frontend/
WORKDIR /app/frontend
RUN npm install && npm run build
# Install backend
COPY backend/ /app/backend/
WORKDIR /app/backend
RUN pip install -r requirements.txt
# Serve static files from FastAPI
EXPOSE 8000
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]
```
### Terraform Deployment (GPU)
```hcl
resource "kubernetes_deployment" "myapp" {
  spec {
    template {
      spec {
        node_selector = { "gpu" : "true" }
        toleration {
          key      = "nvidia.com/gpu"
          operator = "Equal"
          value    = "true"
          effect   = "NoSchedule"
        }
        container {
          image = "myregistry/myapp@sha256:..."
          name  = "myapp"
          resources {
            limits = { "nvidia.com/gpu" = "1" }
          }
          volume_mount {
            name       = "data"
            mount_path = "/mnt"
          }
        }
        volume {
          name = "data"
          nfs {
            server = "10.0.10.15"
            path   = "/mnt/main/myapp"
          }
        }
      }
    }
  }
}
```
## Verification
1. Upload a file via the UI
2. Start a conversion job
3. Watch progress update in real-time
4. Download the completed file
5. Verify files persist across pod restarts
## Notes
- Use image digest for reliable deployments (see `k8s-docker-registry-cache-bypass` skill)
- NFS storage persists across pod restarts
- GPU node taints require matching tolerations
- Consider adding job persistence (database) for production use
- WebSocket can provide smoother progress updates than polling
## See Also
- `k8s-docker-registry-cache-bypass` - Fixing image cache issues
- `k8s-gpu-no-nvidia-devices` - GPU device troubleshooting
- `python-filename-sanitization` - Secure file handling

@ -0,0 +1,105 @@
---
name: grafana-stale-datasource-cleanup
description: |
  Fix Grafana datasource errors when a Helm chart creates a datasource that conflicts
  with provisioned ones, or when stale datasources persist in the MySQL database.
  Use when: (1) Grafana shows "dial tcp: lookup <service> no such host" for a datasource,
  (2) Grafana API returns "datasources:delete permissions needed" when trying to remove
  a datasource, (3) provisioned datasource exists but Grafana uses a stale one from
  the database, (4) Helm chart auto-creates a datasource pointing to a disabled gateway
  service (e.g., loki-gateway). Requires direct MySQL access to fix when Grafana RBAC
  blocks API operations.
author: Claude Code
version: 1.0.0
date: 2026-02-13
---
# Grafana Stale Datasource Cleanup
## Problem
Grafana uses a stale or incorrect datasource from its MySQL database instead of
the correctly provisioned one. Common when Helm charts auto-create datasources
that point to services you've disabled (e.g., Loki gateway).
## Context / Trigger Conditions
- Grafana shows error: `dial tcp: lookup loki-gateway on 10.96.0.10:53: no such host`
- A provisioned datasource (via ConfigMap sidecar) is correct but Grafana uses a
different one stored in MySQL
- Grafana API returns `"permissions needed: datasources:delete"` or
`"permissions needed: datasources:write"` even with admin credentials
- Dashboard references a datasource UID that points to a wrong URL
## Solution
### Step 1: Identify the stale datasource
List all datasources via API (this usually works even with RBAC):
```bash
kubectl exec -n monitoring deploy/grafana -c grafana -- \
sh -c 'curl -s "http://localhost:3000/api/datasources" \
-u "admin:$GF_SECURITY_ADMIN_PASSWORD"' | python3 -c \
"import sys,json; [print(d['uid'], d['name'], d['url']) for d in json.load(sys.stdin)]"
```
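The inline `python3 -c` one-liner above prints every datasource. To flag only the stale ones, the same JSON can be filtered by hostname. This is a sketch under assumptions: `stale_datasources` is a hypothetical helper, and it expects the list-of-objects payload with `uid`, `name`, and `url` keys that the one-liner already relies on.

```python
import json
from urllib.parse import urlparse


def stale_datasources(payload: str, stale_hosts: set) -> list:
    """Return datasources whose URL points at a service name in stale_hosts."""
    result = []
    for ds in json.loads(payload):
        host = urlparse(ds.get("url") or "").hostname or ""
        # Compare on the first DNS label so "loki-gateway.monitoring.svc" also matches
        if host.split(".")[0] in stale_hosts:
            result.append({"uid": ds["uid"], "name": ds["name"], "url": ds["url"]})
    return result
```

Save the API response to a file or pipe it in, then call `stale_datasources(text, {"loki-gateway"})` to get the UIDs to use in the following steps.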
### Step 2: Try API deletion first
```bash
kubectl exec -n monitoring deploy/grafana -c grafana -- \
sh -c 'curl -s -X DELETE "http://localhost:3000/api/datasources/uid/<STALE_UID>" \
-u "admin:$GF_SECURITY_ADMIN_PASSWORD"'
```
If this returns a permissions error, proceed to Step 3.
### Step 3: Delete directly from MySQL
When Grafana RBAC blocks API operations, go through MySQL:
```bash
# Find the Grafana MySQL password
kubectl exec -n monitoring deploy/grafana -c grafana -- \
sh -c 'echo $GF_DATABASE_PASSWORD'
# Find the stale datasource
kubectl exec -n dbaas deploy/mysql -- mysql -u grafana -p"<PASSWORD>" grafana \
-e "SELECT id, uid, name, url FROM data_source;"
# Delete it
kubectl exec -n dbaas deploy/mysql -- mysql -u grafana -p"<PASSWORD>" grafana \
-e "DELETE FROM data_source WHERE uid='<STALE_UID>';"
```
### Step 4: Fix dashboards referencing the old UID
Dashboards store datasource UIDs in their JSON. Update via MySQL:
```bash
kubectl exec -n dbaas deploy/mysql -- mysql -u grafana -p"<PASSWORD>" grafana \
-e "UPDATE dashboard SET data = REPLACE(data, '<OLD_UID>', '<NEW_UID>') WHERE title LIKE '%Dashboard Name%';"
```
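The SQL `REPLACE` above is a blind string substitution, so it also rewrites the old UID if it happens to appear in a query or title. A more targeted alternative is to export the dashboard JSON, rewrite only datasource references, and re-import it. The sketch below assumes datasource references have the shape `{"datasource": {"uid": ...}}`, which newer Grafana dashboards use; older dashboards may reference datasources by name instead. `replace_datasource_uid` is a hypothetical helper.

```python
import json
from typing import Any


def replace_datasource_uid(node: Any, old_uid: str, new_uid: str) -> int:
    """Rewrite datasource UIDs in place; return how many references changed."""
    changed = 0
    if isinstance(node, dict):
        ds = node.get("datasource")
        if isinstance(ds, dict) and ds.get("uid") == old_uid:
            ds["uid"] = new_uid
            changed += 1
        # Recurse into nested panels, targets, templating variables, and so on
        for value in node.values():
            changed += replace_datasource_uid(value, old_uid, new_uid)
    elif isinstance(node, list):
        for item in node:
            changed += replace_datasource_uid(item, old_uid, new_uid)
    return changed
```

The return value gives a quick sanity check: zero means the dashboard never referenced the old UID in this form.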
### Step 5: Refresh Grafana
Hard-refresh browser (Cmd+Shift+R). If datasource still doesn't appear:
```bash
kubectl rollout restart deploy -n monitoring grafana
```
## Verification
```bash
# Verify only correct datasources remain
kubectl exec -n monitoring deploy/grafana -c grafana -- \
sh -c 'curl -s "http://localhost:3000/api/datasources" \
-u "admin:$GF_SECURITY_ADMIN_PASSWORD"' | python3 -m json.tool
```
## Notes
- Grafana's sidecar auto-discovers ConfigMaps with label `grafana_datasource: "1"`
and provisions datasources from them. These are file-provisioned and show as
"provisioned" in the UI.
- Helm charts (e.g., Loki) may auto-create their own datasource in the Grafana
database pointing to services like `loki-gateway`. If you disable the gateway,
this datasource becomes stale.
- Grafana dashboards in this repo are stored in MySQL (not file-provisioned),
so dashboard JSON files in the repo are reference copies only.
- The `GF_SECURITY_ADMIN_PASSWORD` env var is set by the Grafana Helm chart.
- See also: `loki-helm-deployment-pitfalls` for related Loki deployment issues.

@ -0,0 +1,253 @@
---
name: helm-release-troubleshooting
description: |
  Troubleshoot and fix Helm release issues managed by Terraform. Use when:
  (1) Terraform applies successfully but K8s resources don't reflect new Helm values,
  (2) New ports/volumes/containers from Helm chart values don't appear in deployed resources,
  (3) helm upgrade --reuse-values doesn't re-render templates for structural changes,
  (4) Terraform thinks Helm release is up-to-date but actual K8s resources are stale,
  (5) terraform apply fails with "another operation (install/upgrade/rollback) is in progress",
  (6) helm history shows status "pending-upgrade" or "pending-rollback",
  (7) a Helm upgrade was interrupted by network timeout, etcd timeout, or VPN drop,
  (8) helm upgrade fails with "an error occurred while finding last successful release".
  Covers force re-rendering via state removal/reimport and stuck release recovery via
  secret cleanup.
author: Claude Code
version: 1.0.0
date: 2026-02-22
---
# Helm Release Troubleshooting
## Force Re-render
### Problem
After changing Helm chart values in a Terraform `helm_release` resource, Terraform applies
successfully but the actual Kubernetes resources (Services, Deployments, etc.) don't reflect
the new values. For example, adding a new port in Helm values doesn't result in that port
appearing in the Service spec.
### Context / Trigger Conditions
- Terraform `helm_release` applies with "1 changed" but `kubectl get svc -o yaml` shows
the old configuration
- Structural changes to Helm values (new ports, new containers, new volumes) are not
reflected in deployed resources
- The Helm chart templates need to be fully re-rendered, not just patched
- Common with Traefik, ingress-nginx, and other charts where template logic conditionally
includes resources based on values
### Root Cause
Terraform's `helm_release` resource uses `helm upgrade` under the hood. When values are
changed, Helm may use `--reuse-values` behavior where it merges new values into existing
ones rather than doing a full template re-render. For structural changes (like enabling
HTTP/3 which adds a new UDP port to the Service template), the templates may not be
re-rendered with the new conditional branches active.
Additionally, Terraform may see the stored Helm release state as matching the desired state
even though the actual Kubernetes resources don't reflect it, creating a state drift that
Terraform doesn't detect.
### Solution
#### Step 1: Verify the Discrepancy
Confirm that K8s resources don't match Helm values:
```bash
# Check the actual resource
kubectl get svc <service-name> -n <namespace> -o yaml
# Check what Helm thinks is deployed
helm get values <release-name> -n <namespace>
helm get manifest <release-name> -n <namespace> | grep -A10 "<expected-config>"
```
#### Step 2: Remove Helm Release from Terraform State
```bash
terraform state rm 'module.kubernetes_cluster.module.<service>.helm_release.<name>'
```
**IMPORTANT**: This only removes from Terraform state. The actual Helm release and K8s
resources remain untouched in the cluster.
#### Step 3: Import the Helm Release Back
```bash
terraform import 'module.kubernetes_cluster.module.<service>.helm_release.<name>' '<namespace>/<release-name>'
```
For Helm releases, the import ID format is `namespace/release-name`.
#### Step 4: Force Apply with Terraform
After reimporting, run terraform apply. Terraform should now detect the drift between
the desired Helm values and the actual release state:
```bash
terraform apply -target=module.kubernetes_cluster.module.<service>
```
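If Terraform still shows "no changes", you can force replacement by tainting the resource. Be aware that a tainted `helm_release` is destroyed and recreated on the next apply, which uninstalls and reinstalls the release (on Terraform 0.15.2 and later, `terraform apply -replace=<address>` does the same in one step):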
```bash
terraform taint 'module.kubernetes_cluster.module.<service>.helm_release.<name>'
terraform apply -target=module.kubernetes_cluster.module.<service>
```
#### Step 5: Manual Helm Force Upgrade (Last Resort)
If Terraform still doesn't fix it, use Helm directly as a one-time fix, then reimport:
```bash
# Get the current values file
helm get values <release-name> -n <namespace> -o yaml > /tmp/values.yaml
# Edit /tmp/values.yaml to include the correct values, or use --set flags
# Force upgrade (re-renders all templates)
helm upgrade --force <release-name> <chart> -n <namespace> -f /tmp/values.yaml
# Then reimport into Terraform
terraform state rm 'module.kubernetes_cluster.module.<service>.helm_release.<name>'
terraform import 'module.kubernetes_cluster.module.<service>.helm_release.<name>' '<namespace>/<release-name>'
terraform apply -target=module.kubernetes_cluster.module.<service>
```
**WARNING**: Direct Helm operations bypass Terraform. Always reimport into Terraform state
afterward, and use `terraform apply` to verify Terraform is back in sync.
### Verification
```bash
# Check the K8s resources now match expected configuration
kubectl get svc <service-name> -n <namespace> -o yaml
kubectl get deployment <deployment-name> -n <namespace> -o yaml
# Verify Terraform is in sync
terraform plan -target=module.kubernetes_cluster.module.<service>
# Should show "No changes" or minimal expected drift
```
### Example: Traefik HTTP/3 UDP Port Not Appearing
**Problem**: Added `http3.enabled=true` to Traefik Helm values. Terraform applied
successfully, but the Traefik Service only had TCP port 443, missing the expected
UDP port 443 (`websecure-http3`).
**Fix**:
```bash
# 1. Remove from state
terraform state rm 'module.kubernetes_cluster.module.traefik.helm_release.traefik'
# 2. Reimport
terraform import 'module.kubernetes_cluster.module.traefik.helm_release.traefik' 'traefik/traefik'
# 3. Apply (Terraform now detects the drift)
terraform apply -target=module.kubernetes_cluster.module.traefik
# 4. Verify
kubectl get svc traefik -n traefik -o yaml | grep -A3 "websecure-http3"
# Should show: port: 443, protocol: UDP
```
### Notes
- This issue is more common with structural Helm value changes (new ports, new sidecars,
conditional template blocks) than with simple value changes (image tags, replica counts)
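- In Helm 3, the `helm upgrade --force` flag forces resource updates through a replacement
  strategy (replace instead of patch); Helm 2 deleted and recreated changed resources. Either
  way it can briefly disrupt traffic, so use it with caution on production ingress controllers.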
- Always verify with `terraform plan` after fixing to ensure Terraform state is consistent
---
## Stuck Release Recovery
### Problem
Helm releases can get stuck in `pending-upgrade`, `pending-rollback`, or `pending-install`
states when an upgrade is interrupted (network drop, etcd timeout, resource exhaustion).
Subsequent upgrades or terraform applies fail because Helm thinks an operation is in progress.
### Context / Trigger Conditions
- `terraform apply` fails with: `another operation (install/upgrade/rollback) is in progress`
- `helm history <release> -n <namespace>` shows `pending-upgrade`, `pending-rollback`, or `pending-install`
- A previous Helm upgrade was interrupted by network timeout, VPN drop, or etcd timeout
- `helm upgrade` fails with: `an error occurred while finding last successful release`
### Solution
#### Step 1: Identify the stuck release
```bash
helm --kubeconfig $(pwd)/config history <release> -n <namespace> | tail -5
```
Look for revisions with status `pending-upgrade`, `pending-rollback`, or `pending-install`.
#### Step 2: Delete the stuck Helm release secrets
Each Helm revision is stored as a Kubernetes secret named `sh.helm.release.v1.<release>.v<revision>`.
Delete all stuck revisions:
```bash
# Delete specific stuck revision (e.g., revision 5)
kubectl --kubeconfig $(pwd)/config delete secret sh.helm.release.v1.<release>.v5 -n <namespace>
# If multiple stuck revisions exist, delete all of them
kubectl --kubeconfig $(pwd)/config delete secret sh.helm.release.v1.<release>.v6 -n <namespace>
```
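When several revisions are stuck, the secret names can be derived from `helm history <release> -n <namespace> -o json` rather than typed by hand. The sketch below is a hedged helper, not part of Helm: `stuck_release_secrets` is a hypothetical name, and it assumes that every revision newer than the last `deployed` one should be removed, which matches the nextcloud example in this runbook but should be reviewed before deleting anything.

```python
import json


def stuck_release_secrets(history_json: str, release: str) -> list:
    """Return secret names for every revision newer than the last deployed one."""
    revisions = sorted(json.loads(history_json), key=lambda r: r["revision"])
    last_deployed = max(
        (r["revision"] for r in revisions if r["status"] == "deployed"),
        default=None,
    )
    if last_deployed is None:
        return []  # no deployed revision to fall back to; inspect manually
    return [
        f"sh.helm.release.v1.{release}.v{r['revision']}"
        for r in revisions
        if r["revision"] > last_deployed
    ]
```

Pass each returned name to `kubectl delete secret <name> -n <namespace>`.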
#### Step 3: Verify the release is clean
```bash
helm --kubeconfig $(pwd)/config history <release> -n <namespace> | tail -3
```
The latest revision should now show `deployed` status.
#### Step 4: Retry the upgrade
```bash
terraform apply -target=module.kubernetes_cluster.module.<service> -var="kube_config_path=$(pwd)/config" -auto-approve
```
### Important Notes
- **Never patch the secret labels** (e.g., changing `status: pending-rollback` to `status: failed`).
This changes the label but not the encoded release data inside the secret, leaving Helm in an
inconsistent state. Always delete the stuck secrets entirely.
- If the failed upgrade partially applied changes to the cluster (e.g., modified a Deployment),
the next successful upgrade will reconcile the state.
- When VPN/network is unstable, prefer direct `helm upgrade --reuse-values --set key=value`
over `terraform apply`, since Helm upgrades are faster than the full Terraform refresh cycle.
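To see why patching the label is not enough, it helps to look inside a release secret. In Helm 3 the `release` field is a gzip-compressed JSON document that Helm base64-encodes, and Kubernetes base64-encodes it once more as secret data; the status lives in that JSON under `info.status` as well as in the label. A minimal decoding sketch follows. `decode_release` is a hypothetical helper, and `encode_release` exists only to build a test fixture.

```python
import base64
import gzip
import json


def decode_release(secret_data_release: str) -> dict:
    """Decode the `release` field of a Helm 3 release secret into a dict."""
    # Outer layer: Kubernetes secret encoding. Inner layer: Helm's own base64.
    helm_blob = base64.b64decode(secret_data_release)
    compressed = base64.b64decode(helm_blob)
    return json.loads(gzip.decompress(compressed))


def encode_release(release: dict) -> str:
    """Inverse of decode_release, used here only to build a test fixture."""
    compressed = gzip.compress(json.dumps(release).encode())
    return base64.b64encode(base64.b64encode(compressed)).decode()
```

Feeding it the output of `kubectl get secret <name> -n <namespace> -o jsonpath='{.data.release}'` shows the status Helm actually reads, which a label patch leaves unchanged.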
### Verification
After deleting stuck secrets and re-applying:
- `helm history` shows the new revision as `deployed`
- `terraform apply` completes without errors
### Example
```bash
# Helm history shows stuck state
$ helm history nextcloud -n nextcloud | tail -3
4 deployed nextcloud-8.8.1 Upgrade complete
5 failed nextcloud-8.8.1 Upgrade failed: etcd timeout
6 pending-rollback nextcloud-8.8.1 Rollback to 4
# Fix: delete stuck revisions
$ kubectl delete secret sh.helm.release.v1.nextcloud.v5 sh.helm.release.v1.nextcloud.v6 -n nextcloud
# Verify clean state
$ helm history nextcloud -n nextcloud | tail -1
4 deployed nextcloud-8.8.1 Upgrade complete
# Re-apply
$ terraform apply -target=module.kubernetes_cluster.module.nextcloud -auto-approve
```
---
## See Also
- `terraform-state-identity-mismatch` - For Terraform provider identity errors
- `traefik-http3-quic` - For enabling HTTP/3 on Traefik (common trigger for force re-render)
## References
- [Terraform helm_release Resource](https://registry.terraform.io/providers/hashicorp/helm/latest/docs/resources/release)
- [Helm Upgrade Documentation](https://helm.sh/docs/helm/helm_upgrade/)
- [Helm --force Flag](https://helm.sh/docs/helm/helm_upgrade/#options)

@ -0,0 +1,157 @@
---
name: ingress-factory-migration
description: |
  Migrate raw kubernetes_ingress_v1 resources to the centralized ingress_factory module.
  Use when: (1) a service defines a raw kubernetes_ingress_v1 with hand-rolled Traefik
  middleware annotations, (2) adding a new service that needs standard ingress with
  rate limiting, CrowdSec, CSP headers, rybbit analytics, or authentik auth,
  (3) refactoring existing ingresses for consistency. Covers single-path, multi-path,
  split UI/API, full_host overrides, custom rate limits, and extra middleware injection.
author: Claude Code
version: 1.0.0
date: 2026-02-10
---
# Ingress Factory Migration
## Problem
Services define raw `kubernetes_ingress_v1` resources with hand-rolled Traefik middleware
chains. This creates inconsistency - middleware chains are copy-pasted per service, making
it easy to miss security middleware (CrowdSec, rate limiting) or analytics (rybbit). The
`ingress_factory` module at `modules/kubernetes/ingress_factory/main.tf` provides a single
point of control.
## Context / Trigger Conditions
- Service has a raw `kubernetes_ingress_v1` resource instead of using `module "ingress"`
- Service has a manually defined `kubernetes_manifest` for rybbit analytics middleware
- New service needs standard ingress configuration
- Middleware chain needs to be updated across many services
## Solution
### Standard single-path ingress
Replace the raw resource with:
```hcl
module "ingress" {
source = "../ingress_factory"
namespace = kubernetes_namespace.<service>.metadata[0].name
name = "<service-name>" # becomes the ingress name AND default hostname
host = "<subdomain>" # optional: override hostname (if different from name)
service_name = "<k8s-service-name>" # optional: defaults to name
port = 80 # optional: defaults to 80
tls_secret_name = var.tls_secret_name
protected = false # set true for authentik forward auth
}
```
### Multi-path / split UI+API
Use two module calls with different names but same host:
```hcl
module "ingress" {
source = "../ingress_factory"
namespace = kubernetes_namespace.<service>.metadata[0].name
name = "<service>"
host = "<subdomain>"
service_name = "<ui-service>"
tls_secret_name = var.tls_secret_name
rybbit_site_id = "<id>" # optional: adds rybbit analytics
}
module "ingress-api" {
source = "../ingress_factory"
namespace = kubernetes_namespace.<service>.metadata[0].name
name = "<service>-api"
host = "<subdomain>" # same host as UI
service_name = "<api-service>"
ingress_path = ["/api"]
tls_secret_name = var.tls_secret_name
# No rybbit_site_id - API returns JSON, not HTML
}
```
### Full host override (for root domain like viktorbarzin.me)
```hcl
module "ingress" {
source = "../ingress_factory"
namespace = kubernetes_namespace.<service>.metadata[0].name
name = "<service>"
service_name = "<k8s-service>"
full_host = "viktorbarzin.me" # bypasses name.root_domain construction
tls_secret_name = var.tls_secret_name
}
```
### Custom rate limiting (e.g., immich)
```hcl
module "ingress" {
source = "../ingress_factory"
namespace = kubernetes_namespace.<service>.metadata[0].name
name = "<service>"
skip_default_rate_limit = true
extra_middlewares = ["traefik-<custom>-rate-limit@kubernetescrd"]
tls_secret_name = var.tls_secret_name
}
```
### Key variables reference
| Variable | Default | Purpose |
|----------|---------|---------|
| `name` | required | Ingress resource name + default hostname |
| `host` | null | Override hostname prefix (name used if null) |
| `full_host` | null | Override entire hostname (bypasses root_domain) |
| `service_name` | null | K8s service name (name used if null) |
| `port` | 80 | Backend service port |
| `ingress_path` | ["/"] | URL paths to match |
| `protected` | false | Adds authentik forward auth middleware |
| `rybbit_site_id` | null | Adds rybbit analytics script injection |
| `skip_default_rate_limit` | false | Omits default rate limiter |
| `extra_middlewares` | [] | Additional middleware references to append |
| `extra_annotations` | {} | Additional ingress annotations |
| `allow_local_access_only` | false | Restricts to LAN/VPN |
| `exclude_crowdsec` | false | Skips CrowdSec middleware |
| `custom_content_security_policy` | null | Custom CSP header |
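The precedence between `name`, `host`, and `full_host` in the table can be modeled in a few lines. This is a Python model of the behavior as described here, not the module's actual HCL: `resolve_host` is a hypothetical function, and `root_domain` is assumed to be the domain the factory appends to the hostname prefix.

```python
from typing import Optional


def resolve_host(name: str, root_domain: str,
                 host: Optional[str] = None,
                 full_host: Optional[str] = None) -> str:
    """Return the hostname an ingress would get for the given variables."""
    if full_host is not None:
        return full_host  # bypasses root_domain entirely
    # host overrides the prefix; otherwise the ingress name is used
    prefix = host if host is not None else name
    return f"{prefix}.{root_domain}"
```

For example, `name = "immich"` alone yields `immich.<root_domain>`, adding `host = "photos"` yields `photos.<root_domain>`, and `full_host` wins over both.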
### After migration, delete:
1. The raw `kubernetes_ingress_v1` resource
2. Any manually defined `kubernetes_manifest "rybbit_analytics"` (the factory creates this automatically when `rybbit_site_id` is set)
## Gotchas
### Duplicate module names
If the service directory has multiple `.tf` files (e.g., `main.tf` and `frame.tf`), check
for existing `module "ingress"` blocks. Module names must be unique within a directory.
Use a descriptive name like `module "ingress-immich"` instead.
### Terraform target module names with hyphens
Module names in `terraform state list` may use hyphens (e.g., `module.real-estate-crawler`).
When using `-target`, you must match the exact name including hyphens:
```bash
# Wrong - underscores:
terraform apply -target=module.kubernetes_cluster.module.real_estate_crawler
# Correct - hyphens (quote to prevent shell interpretation):
terraform apply '-target=module.kubernetes_cluster.module.real-estate-crawler'
```
### Service name defaults
The factory defaults `service_name` to `name`. If the K8s service has a different name
than the ingress, you must explicitly set `service_name`. Common case: headscale has one
K8s service named `headscale` with multiple ports, so the UI ingress needs
`service_name = "headscale"` even though `name = "headscale-ui"`.
### Servarr subdirectory source path
Services under `servarr/` need `../../ingress_factory` as the source path instead of
`../ingress_factory`.
## Verification
1. `terraform validate` - check for syntax errors
2. `terraform plan -target=module.kubernetes_cluster.module.<service>` - verify old ingress destroyed, new created
3. `kubectl get ingress -n <namespace>` - verify ingress exists with correct host/paths
4. Browse the service URL to confirm accessibility
## Notes
- Services using special protocols (gRPC, mTLS, WebSocket with custom headers) should NOT
be migrated - keep raw `kubernetes_ingress_v1` for those
- The factory automatically includes: rate-limit, CSP headers, CrowdSec, and entrypoint=websecure
- When `rybbit_site_id` is set, the factory creates a `kubernetes_manifest` for the
rewrite-body middleware that injects the analytics script into HTML responses


@ -0,0 +1,244 @@
---
name: k8s-container-image-caching
description: |
Set up and troubleshoot container image pull-through caches in Kubernetes. Use when:
(1) ImagePullBackOff for non-Docker-Hub images routed through a wildcard mirror,
(2) containerd has deprecated `registry.mirrors."*"` catching all image pulls,
(3) need to add pull-through cache for a new upstream registry,
(4) `mirrors` cannot be set when `config_path` is provided error in containerd,
(5) containerd 1.6.x vs 1.7.x config_path compatibility issues,
(6) kubectl shows correct image tag but container runs old code,
(7) local registry mirror caches stale images,
(8) imagePullPolicy: Always doesn't force fresh pulls,
(9) containerd config has mirror that intercepts pulls serving stale images.
Covers multi-registry pull-through cache setup (Docker Registry v2) and cache bypass
via image digest pinning.
author: Claude Code
version: 1.0.0
date: 2026-02-22
---
# Kubernetes Container Image Caching
## Pull-Through Cache Setup
### Problem
Docker Registry v2 can only proxy **one upstream registry per instance**. A common
misconfiguration is using a containerd wildcard mirror (`registry.mirrors."*"`) pointing
to a single Docker Hub proxy, which breaks pulls from ghcr.io, quay.io, registry.k8s.io,
and other registries -- they get routed to the Docker Hub proxy which can't serve them,
causing `ImagePullBackOff`.
### Context / Trigger Conditions
- `ImagePullBackOff` for images from ghcr.io, quay.io, registry.k8s.io, or other non-Docker-Hub registries
- Containerd config has deprecated `[plugins."io.containerd.grpc.v1.cri".registry.mirrors."*"]`
- Error: `failed to load plugin io.containerd.grpc.v1.cri: invalid plugin config: mirrors cannot be set when config_path is provided`
- Need to migrate from deprecated wildcard mirrors to modern `config_path` approach
### Solution
#### 1. Run one Registry v2 container per upstream
Each upstream needs its own Docker Registry v2 instance on a different port:
| Port | Registry | Container Name |
|------|----------|---------------|
| 5000 | docker.io | registry |
| 5010 | ghcr.io | registry-ghcr |
| 5020 | quay.io | registry-quay |
| 5030 | registry.k8s.io | registry-k8s |
| 5040 | reg.kyverno.io | registry-kyverno |
Config for non-Docker-Hub proxies (no auth needed -- they're public):
```yaml
version: 0.1
storage:
  cache:
    blobdescriptor: inmemory
  filesystem:
    rootdirectory: /var/lib/registry
http:
  addr: :5000
proxy:
  remoteurl: https://ghcr.io  # change per registry
```
```bash
docker run -p 5010:5000 -d --restart always --name registry-ghcr \
-v /etc/docker-registry/ghcr/config.yml:/etc/docker/registry/config.yml registry:2
```
#### 2. Replace deprecated wildcard mirror with `config_path`
Instead of:
```toml
# DEPRECATED - breaks non-Docker-Hub registries
[plugins."io.containerd.grpc.v1.cri".registry.mirrors."*"]
endpoint = ["http://10.0.20.10:5000"]
```
Use the modern `config_path` approach:
```toml
[plugins."io.containerd.grpc.v1.cri".registry]
config_path = "/etc/containerd/certs.d"
```
Then create per-registry `hosts.toml` files:
```bash
mkdir -p /etc/containerd/certs.d/docker.io
cat > /etc/containerd/certs.d/docker.io/hosts.toml <<'EOF'
server = "https://registry-1.docker.io"
[host."http://10.0.20.10:5000"]
capabilities = ["pull", "resolve"]
EOF
```
Registries without a `hosts.toml` entry **fall through to direct pull** (no breakage).
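The per-registry files all follow the same shape, so they can be generated. The sketch below is illustrative only: `render_hosts_toml` and `write_hosts_toml` are hypothetical helper names, and the upstream URL for `quay.io` in the usage comment is an assumption that follows the `ghcr.io` pattern rather than something stated in this skill.

```bash
#!/usr/bin/env bash
# Illustrative sketch: generate hosts.toml files for containerd's config_path layout.

# render_hosts_toml UPSTREAM_URL MIRROR_URL -> prints a hosts.toml body to stdout
render_hosts_toml() {
  local upstream_url=$1 mirror_url=$2
  printf 'server = "%s"\n\n' "$upstream_url"
  printf '[host."%s"]\n' "$mirror_url"
  printf '  capabilities = ["pull", "resolve"]\n'
}

# write_hosts_toml CERTS_DIR REGISTRY UPSTREAM_URL MIRROR_URL
# Writes CERTS_DIR/REGISTRY/hosts.toml, creating the directory if needed.
write_hosts_toml() {
  local certs_dir=$1 registry=$2 upstream_url=$3 mirror_url=$4
  mkdir -p "$certs_dir/$registry"
  render_hosts_toml "$upstream_url" "$mirror_url" > "$certs_dir/$registry/hosts.toml"
}

# Example (run as root on a node; ports follow the table in step 1):
#   write_hosts_toml /etc/containerd/certs.d ghcr.io https://ghcr.io http://10.0.20.10:5010
#   write_hosts_toml /etc/containerd/certs.d quay.io https://quay.io http://10.0.20.10:5020
```

Passing the target directory as an argument makes it possible to render into a scratch directory and inspect the result before touching `/etc/containerd/certs.d`.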
#### 3. Critical: `config_path` and `mirrors` cannot coexist
Containerd will **refuse to start the CRI plugin** if both `config_path` and any
`mirrors` entries exist in `config.toml`. You must remove ALL `mirrors` entries
(including the `[plugins."...registry.mirrors"]` parent section) before setting
`config_path`.
This is especially dangerous on containerd 1.6.x (used on older nodes like k8s-master)
where the config format is slightly different. If unsure, either:
- Don't use config_path on that node (skip the pull-through cache)
- Remove the entire `mirrors` section first, then add `config_path`
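Before restarting containerd, the conflict can be caught with a quick pre-flight check. This is a minimal sketch, assuming the containerd 1.x CRI plugin section names shown above; `check_registry_config` is a hypothetical helper, and it uses line-based `grep` rather than a real TOML parser, so treat a clean result as a hint rather than proof.

```bash
#!/usr/bin/env bash
# Illustrative sketch: detect config_path and registry.mirrors in the same config.toml.

# check_registry_config CONFIG_FILE
# Prints "conflict" and returns 1 if a non-empty config_path and any
# registry.mirrors table are both present; prints "ok" otherwise.
check_registry_config() {
  local config_file=$1
  local has_config_path=0 has_mirrors=0
  if grep -Eq '^[[:space:]]*config_path[[:space:]]*=[[:space:]]*"[^"]+"' "$config_file"; then
    has_config_path=1
  fi
  if grep -Eq '^[[:space:]]*\[plugins\..*\.registry\.mirrors' "$config_file"; then
    has_mirrors=1
  fi
  if (( has_config_path == 1 && has_mirrors == 1 )); then
    echo "conflict"
    return 1
  fi
  echo "ok"
}

# Example: check_registry_config /etc/containerd/config.toml
```

Commented-out `mirrors` lines are ignored because the pattern requires the line to start with `[`, and an empty `config_path = ""` is not counted as set.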
#### 4. Static IP for registry VM
If the registry VM uses DHCP and gets the wrong IP, all mirrors break. Use static IP
via cloud-init `ipconfig0 = "ip=10.0.20.10/24,gw=10.0.20.1"` instead of DHCP.
### Verification
```bash
# Test each proxy responds
for port in 5000 5010 5020 5030 5040; do
  curl -s "http://10.0.20.10:${port}/v2/_catalog"
done
# Test containerd can pull through cache
crictl pull ghcr.io/some/image:tag
# Check containerd logs for mirror usage
journalctl -u containerd --since "5 minutes ago" | grep -i "mirror\|registry"
```
### Notes
- **Fallback behavior**: If the local mirror is unreachable, containerd falls through to
direct pull from the upstream `server` URL. This provides graceful degradation.
- **GC crontabs**: Add weekly garbage collection for each registry container, staggered
to avoid I/O spikes.
- **Hourly restart**: Registry v2 has known memory leak issues; hourly restart mitigates.
- **Cache is ephemeral**: VM recreation clears the cache. Images re-cache on demand.
---
## Cache Bypass / Stale Image Fix
### Problem
Kubernetes pods continue running old Docker images even after pushing new versions with
the same tag (e.g., `:latest`). This happens when a local registry mirror caches images
and serves stale versions, ignoring `imagePullPolicy: Always`.
### Context / Trigger Conditions
- Pod is running but application code is outdated
- `docker push` succeeded with new layers
- `kubectl describe pod` shows correct image tag
- Cluster has a local registry mirror configured (e.g., in containerd config)
- `imagePullPolicy: Always` doesn't fix the issue
- Nodes configured with registry mirrors at `/etc/containerd/certs.d/` or similar
### Solution
#### 1. Get the image digest after pushing
```bash
docker push viktorbarzin/myimage:latest
# Output includes: latest: digest: sha256:abc123... size: 856
```
#### 2. Use digest instead of tag in deployment
```hcl
# Terraform
container {
  # Use digest to bypass local registry cache
  image             = "docker.io/viktorbarzin/myimage@sha256:abc123..."
  image_pull_policy = "Always"
  name              = "myimage"
}
```
```yaml
# Kubernetes YAML
containers:
  - name: myimage
    image: docker.io/viktorbarzin/myimage@sha256:abc123...
    imagePullPolicy: Always
```
#### 3. Apply and restart
```bash
terraform apply -target=module.kubernetes_cluster.module.myservice
kubectl rollout restart deployment/myservice -n mynamespace
```
### Why This Works
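- A pull-through mirror caches the tag-to-manifest mapping, so a mutable tag such as `:latest` can keep resolving to the old manifest
- A digest is content-addressed: the node must fetch exactly that manifest, whatever the tag currently points to
- If the mirror has not cached that digest, it has to fetch it from upstream
- If the mirror has cached it, the content is still guaranteed to be the exact image you pushed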
### Verification
```bash
# Check the pod is using the new image
kubectl get pod -n mynamespace -o jsonpath='{.items[*].spec.containers[*].image}'
# Verify application behavior reflects new code
kubectl exec -n mynamespace deploy/myservice -- <verification-command>
```
### Example
Before (problematic):
```hcl
image = "docker.io/viktorbarzin/audiblez-web:latest"
```
After (fixed):
```hcl
image = "docker.io/viktorbarzin/audiblez-web@sha256:4d0e2c839555e2229bc91a0b1273569bac88529e8b3c3cadad3c3cf9d865fa29"
```
### Notes
- You must update the digest each time you push a new image
- Consider automating digest extraction in CI/CD pipelines
- This is a workaround; ideally fix the registry mirror configuration
- To find your registry mirror config: `cat /etc/containerd/config.toml` on nodes
- Common mirror locations: `/etc/containerd/certs.d/docker.io/hosts.toml`
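Digest extraction for a CI/CD pipeline can be sketched as follows. This assumes `docker push` ends its output with a line of the form `latest: digest: sha256:... size: ...`, as shown in step 1; `extract_digest` and `pinned_ref` are hypothetical helper names, not part of any tool.

```bash
#!/usr/bin/env bash
# Illustrative sketch: turn `docker push` output into a digest-pinned image reference.

# extract_digest: reads `docker push` output on stdin, prints the last sha256 digest seen
extract_digest() {
  grep -oE 'sha256:[0-9a-f]{64}' | tail -n 1
}

# pinned_ref REPOSITORY DIGEST -> REPOSITORY@DIGEST
pinned_ref() {
  local repository=$1 digest=$2
  printf '%s@%s\n' "$repository" "$digest"
}

# Example in a CI job (repository name taken from the example above):
#   digest=$(docker push viktorbarzin/myimage:latest | extract_digest)
#   pinned_ref docker.io/viktorbarzin/myimage "$digest"
```

The resulting reference can then be written into the Terraform variable or manifest that sets the container image.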
### Diagnosing Registry Mirror Issues
```bash
# On a k8s node, check containerd config for legacy mirror entries
grep -A5 mirrors /etc/containerd/config.toml
# Check if a mirror is intercepting (--debug is a global crictl flag, so it goes before the subcommand)
crictl --debug pull docker.io/library/alpine:latest 2>&1 | grep -i mirror
# List cached images on node
crictl images | grep myimage
```
---
## References
- [Kubernetes imagePullPolicy documentation](https://kubernetes.io/docs/concepts/containers/images/#image-pull-policy)
- [containerd registry configuration](https://github.com/containerd/containerd/blob/main/docs/hosts.md)


@ -0,0 +1,186 @@
---
name: k8s-gpu-no-nvidia-devices
description: |
Fix for Kubernetes GPU pods showing "CUDA not supported" or no /dev/nvidia* devices
despite nvidia.com/gpu resource allocation. Use when: (1) container runs but torch.cuda.is_available()
returns False, (2) ls /dev/nvidia* shows "no matches found", (3) nvidia-smi fails inside pod
but works on host, (4) PyTorch/TensorFlow falls back to CPU despite GPU allocation.
Covers NVIDIA device plugin, time-slicing, and container runtime issues.
author: Claude Code
version: 1.1.0
date: 2026-03-01
---
# Kubernetes GPU Pod - No NVIDIA Devices Found
## Problem
A Kubernetes pod requests GPU resources (`nvidia.com/gpu: 1`) and schedules on a GPU node,
but inside the container there are no NVIDIA devices visible. The application falls back
to CPU with messages like "CUDA not supported by the Torch installed!" despite running
in a CUDA-enabled container image.
## Context / Trigger Conditions
- Pod shows `Running` status and is on a node with `gpu=true` label
- `kubectl describe pod` shows GPU limit/request is satisfied
- Inside container: `ls /dev/nvidia*` returns "no matches found"
- Inside container: `nvidia-smi` fails or command not found
- Application logs show: "CUDA not supported", "Switching to CPU", "torch.cuda.is_available() = False"
- On the host node: `nvidia-smi` works fine
## Solution
### Step 1: Verify GPU Availability
Check if other pods are consuming the GPU:
```bash
# List all pods using GPU resources
kubectl get pods -A -o json | jq -r '.items[] | select(.spec.containers[].resources.limits."nvidia.com/gpu" != null) | "\(.metadata.namespace)/\(.metadata.name)"'
# Check NVIDIA device plugin pods
kubectl get pods -n nvidia -l app=nvidia-device-plugin
kubectl logs -n nvidia -l app=nvidia-device-plugin --tail=50
```
### Step 2: Free GPU Resources
If another workload is using the GPU, unload it:
```bash
# For Ollama specifically
kubectl exec -n ollama deployment/ollama -- ollama stop <model_name>
# Or scale down the conflicting deployment
kubectl scale deployment/<name> -n <namespace> --replicas=0
```
### Step 3: Restart the Affected Pod
After freeing GPU resources, restart the pod to get fresh device allocation:
```bash
kubectl rollout restart deployment/<name> -n <namespace>
# Or delete the pod directly
kubectl delete pod <pod-name> -n <namespace>
```
### Step 4: Verify GPU Access
```bash
# Check devices are now visible (run the glob inside the container, not in your local shell)
kubectl exec -n <namespace> deployment/<name> -- sh -c 'ls -la /dev/nvidia*'
# Test nvidia-smi
kubectl exec -n <namespace> deployment/<name> -- nvidia-smi
# Test PyTorch CUDA
kubectl exec -n <namespace> deployment/<name> -- python3 -c "import torch; print('CUDA:', torch.cuda.is_available())"
```
## Verification
After restart, you should see:
```
/dev/nvidia0
/dev/nvidiactl
/dev/nvidia-uvm
/dev/nvidia-uvm-tools
```
And `nvidia-smi` should show the GPU with your container process.
## Example
```bash
# Problem: ebook2audiobook shows "CUDA not supported"
$ kubectl exec -n ebook2audiobook deployment/ebook2audiobook -- ls /dev/nvidia*
zsh:1: no matches found: /dev/nvidia*
# Solution: Unload Ollama model holding the GPU
$ kubectl exec -n ollama deployment/ollama -- ollama ps
NAME SIZE PROCESSOR
qwen2.5:14b 10 GB 33%/67% CPU/GPU
$ kubectl exec -n ollama deployment/ollama -- ollama stop qwen2.5:14b
# Restart the affected pod
$ kubectl rollout restart deployment/ebook2audiobook -n ebook2audiobook
# Verify
$ kubectl exec -n ebook2audiobook deployment/ebook2audiobook -- nvidia-smi
# Should now show the Tesla T4 GPU
```
## Notes
- **GPU Time-Slicing**: If using NVIDIA GPU time-slicing (configured in GPU Operator),
multiple pods can share a GPU. However, device injection still requires proper timing.
- **Pod Scheduling Order**: Pods that start while GPU is fully allocated may not get
devices injected even after GPU becomes available - a restart is required.
- **Container Runtime**: The NVIDIA Container Toolkit must be properly configured.
Issues can arise from:
- cgroup driver mismatch (systemd vs cgroupfs)
- Container updates causing device loss
- SELinux blocking device access
- **Image Compatibility**: The container image must have CUDA libraries matching the
driver version. Check with `nvidia-smi` on host for driver version.
- **This Cluster**: Uses NVIDIA GPU Operator with time-slicing (20 replicas per GPU).
GPU node is `k8s-node1` with Tesla T4.
## See Also
- Check GPU Operator status: `kubectl get pods -n nvidia`
- View time-slicing config: `kubectl get configmap -n nvidia time-slicing-config -o yaml`
## Automatic GPU Recovery via Liveness Probe
To prevent GPU loss from requiring manual intervention, add a liveness probe that checks
both GPU availability and application health. Example for Frigate (but applicable to any
GPU workload):
```hcl
# Restart pod if GPU becomes unavailable or app hangs
liveness_probe {
  exec {
    command = ["sh", "-c", "nvidia-smi > /dev/null 2>&1 && curl -sf http://localhost:<port>/health > /dev/null"]
  }
  initial_delay_seconds = 120
  period_seconds        = 60
  timeout_seconds       = 10
  failure_threshold     = 3
}
# Allow time for GPU model loading at startup
startup_probe {
  http_get {
    path = "/health"
    port = <port>
  }
  period_seconds    = 10
  failure_threshold = 30 # up to 5 minutes
}
```
```
The liveness probe checks:
- `nvidia-smi` — fails if GPU devices are no longer accessible (CUDA context corruption, device plugin issues)
- `curl` health endpoint — fails if the application process is hung
If either fails 3 times in a row (3 minutes), Kubernetes automatically restarts the pod,
which re-acquires the GPU device through the NVIDIA device plugin.
**Important**: Always pair with a `startup_probe` when using GPU workloads — model loading
(TensorRT, ONNX, PyTorch) can take several minutes and would trip a liveness probe
configured with a short `initial_delay_seconds`.
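The timing figures quoted for the two probes are simply period multiplied by failure threshold. A small worked sketch, with `probe_window_seconds` as a hypothetical helper; it ignores `timeout_seconds` and `initial_delay_seconds`, so real detection time can be somewhat longer.

```bash
#!/usr/bin/env bash
# Illustrative sketch: approximate how long a probe must keep failing before Kubernetes acts.

# probe_window_seconds PERIOD_SECONDS FAILURE_THRESHOLD
probe_window_seconds() {
  local period=$1 threshold=$2
  echo $(( period * threshold ))
}

echo "liveness restart after ~$(probe_window_seconds 60 3)s of consecutive failures"
echo "startup budget: up to $(probe_window_seconds 10 30)s"
```

With the values used here this gives 180 seconds for the liveness probe and 300 seconds for the startup probe, matching the 3 and 5 minute figures in the text.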
## References
- [NVIDIA Container Toolkit Troubleshooting](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/troubleshooting.html)
- [Kubernetes GPU Device Plugin](https://github.com/NVIDIA/k8s-device-plugin)
- [NVIDIA GPU Operator](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/overview.html)


@ -0,0 +1,113 @@
---
name: k8s-hpa-scaling-storm
description: |
Fix and prevent HPA (HorizontalPodAutoscaler) scaling storms where pods scale to
maxReplicas uncontrollably. Use when: (1) HPA shows memory or CPU utilization at
200%+ causing rapid scale-up, (2) dozens or hundreds of pods created by HPA in minutes,
(3) cluster becomes unstable due to resource exhaustion from too many pods,
(4) etcd timeouts or API server crashes from pod churn, (5) adding resource requests
to a deployment that previously had none causes HPA to miscalculate utilization.
Covers emergency response and prevention patterns.
author: Claude Code
version: 1.0.0
date: 2026-02-15
---
# Kubernetes HPA Scaling Storm
## Problem
When an HPA is configured with a memory or CPU utilization target but the underlying
deployment has insufficient resource requests, the HPA calculates artificially high
utilization percentages (e.g., 220% of a 256Mi request when actual usage is 570Mi).
This causes the HPA to scale pods to maxReplicas (often 100) within minutes, exhausting
cluster resources and potentially crashing etcd and the API server.
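The feedback loop can be sketched with the replica formula from the Kubernetes HPA documentation, `desiredReplicas = ceil(currentReplicas * currentUtilization / targetUtilization)`. The helper names below are hypothetical and the inputs are the figures from this incident (570Mi usage, 256Mi request, 50% memory target, maxReplicas 100). The real controller also applies a tolerance, a stabilization window, and scale-up rate policies, all of which this sketch ignores.

```bash
#!/usr/bin/env bash
# Illustrative sketch only: integer approximation of the HPA replica formula.

# utilization_pct USAGE_MI REQUEST_MI -> usage as a percentage of the request (rounded down)
utilization_pct() {
  local usage_mi=$1 request_mi=$2
  echo $(( usage_mi * 100 / request_mi ))
}

# desired_replicas CURRENT UTIL_PCT TARGET_PCT MAX -> ceil(CURRENT * UTIL / TARGET), capped at MAX
desired_replicas() {
  local current=$1 util_pct=$2 target_pct=$3 max=$4
  local desired=$(( (current * util_pct + target_pct - 1) / target_pct ))
  if (( desired > max )); then
    desired=$max
  fi
  echo "$desired"
}

# Every new pod also uses ~570Mi against a 256Mi request, so utilization never drops.
util=$(utilization_pct 570 256)
replicas=2
for round in 1 2 3; do
  replicas=$(desired_replicas "$replicas" "$util" 50 100)
  echo "round $round: utilization ${util}% -> ${replicas} replicas"
done
```

With these inputs the replica count goes from 2 to 9, then 40, then the cap of 100. Adding pods cannot lower per-pod memory usage, so the ratio stays above the target and each evaluation multiplies the replica count again.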
## Context / Trigger Conditions
- `kubectl get hpa` shows `<unknown>/70%` or very high percentages (200%+)
- Pod count for a deployment rapidly increases to maxReplicas
- etcd timeout errors in `kubectl` or `terraform apply`
- API server becomes unreachable (`connection refused` or `network is unreachable`)
- Adding resource requests to a Helm chart that previously had none
- Memory-based HPA targets with real usage far exceeding requests
## Solution
### Emergency Response (stop the storm)
**Step 1: Delete the HPA immediately**
```bash
kubectl --kubeconfig $(pwd)/config delete hpa <hpa-name> -n <namespace>
```
**Step 2: Scale the deployment down**
```bash
kubectl --kubeconfig $(pwd)/config scale deployment <name> -n <namespace> --replicas=2
```
**Step 3: Wait for pods to terminate and cluster to stabilize**
```bash
# Count remaining pods (re-run, or wrap in `watch`, until the count settles)
kubectl --kubeconfig $(pwd)/config get pods -n <namespace> -l <label> --no-headers | wc -l
```
If the API server is unresponsive, wait 3-5 minutes for it to self-recover. The kubelet
will restart static pods (etcd, kube-apiserver) automatically.
### Prevention
**Rule 1: Set resource requests to match actual usage**
Before enabling HPA, check actual resource consumption:
```bash
kubectl top pods -n <namespace> -l <label>
```
Set requests to the baseline (idle) usage, not the minimum possible value.
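To turn the `kubectl top` output into a single number, the memory column can be summarized with `awk`. This is a minimal sketch: `max_memory_mi` is a hypothetical helper, and it assumes every value in the MEMORY column is reported in `Mi`. Run it while the workload is idle to get a baseline figure.

```bash
#!/usr/bin/env bash
# Illustrative sketch: largest per-pod memory reading from `kubectl top pods`.

# max_memory_mi: reads `kubectl top pods --no-headers` rows (NAME CPU MEMORY) on stdin
# and prints the largest MEMORY value, assuming all values are in Mi.
max_memory_mi() {
  awk '{ sub(/Mi$/, "", $3); if ($3 + 0 > max) max = $3 + 0 } END { print max + 0 }'
}

# Example:
#   kubectl top pods -n <namespace> -l <label> --no-headers | max_memory_mi
```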
**Rule 2: Set reasonable maxReplicas**
Never use maxReplicas > 10 unless you've verified the cluster can handle it.
Default of 100 is almost never appropriate for a home/small cluster.
**Rule 3: Prefer CPU-only HPA targets**
Memory-based scaling is problematic because:
- Memory usage grows over time and rarely decreases
- Memory-based scaling creates pods that never scale down
- CPU is more responsive to load changes
**Rule 4: Test HPA changes on a deployment with 0 existing pods first**
If adding resource requests to a deployment managed by HPA, temporarily disable
the HPA first, set the requests, verify utilization is reasonable, then re-enable.
## Cascade Effects
A scaling storm can cause:
1. etcd storage exhaustion (too many pod objects)
2. API server OOM or connection limits
3. VPN/network connectivity loss (if VPN runs in the cluster)
4. Kyverno webhook failures (admission controller overwhelmed)
5. Other pods evicted or unable to schedule
## Verification
- `kubectl get hpa -n <namespace>` shows reasonable utilization (< 100%)
- Pod count is stable at expected replicas
- `kubectl get nodes` responds promptly
- No etcd timeout errors
## Example
```bash
# Observed: HPA scaling Collabora to 100 pods
$ kubectl get hpa -n nextcloud
NAME TARGETS MINPODS MAXPODS REPLICAS
nextcloud-collabora cpu: 0%/70%, memory: 220%/50% 2 100 83
# Emergency fix
$ kubectl delete hpa nextcloud-collabora -n nextcloud
$ kubectl scale deployment nextcloud-collabora -n nextcloud --replicas=2
# Root cause: 256Mi memory request, actual usage 570Mi
# Fix: increase request to 1Gi or disable memory target
```
## Notes
- If the HPA is managed by a Helm chart, deleting it via kubectl is temporary—the next
Helm upgrade will recreate it. You must also update the Helm values.
- In this project, Collabora was ultimately disabled in favor of OnlyOffice to avoid
the HPA issue entirely.
- See also: `helm-stuck-release-recovery` for fixing Helm releases broken by the storm.


@ -0,0 +1,235 @@
---
name: k8s-nfs-mount-troubleshooting
description: |
Debug Kubernetes NFS volume mount failures. Use when: (1) Pod stuck in ContainerCreating
for extended time, (2) kubectl describe shows "MountVolume.SetUp failed" with NFS errors,
(3) Error message shows "Protocol not supported" or "mount.nfs: access denied",
(4) NFS volume defined in pod spec but container won't start, (5) Container starts but
gets "Permission denied" writing to NFS volume (non-root container UID mismatch),
(6) CronJob or init container fails silently when writing to NFS, (7) Pod shows Running
1/1 but service is unresponsive after a node reboot — stale NFS mount causes frozen
processes with zero listening sockets. Common root causes are missing NFS export on the
server, UID mismatch for non-root containers, and stale mounts after node reboots.
author: Claude Code
version: 1.2.0
date: 2026-02-28
---
# Kubernetes NFS Mount Troubleshooting
## Problem
Pods with NFS volumes get stuck in `ContainerCreating` state indefinitely. The error
messages from `kubectl describe pod` can be misleading, showing protocol or permission
errors when the actual issue is the NFS export doesn't exist.
## Context / Trigger Conditions
- Pod status shows `ContainerCreating` for more than 1-2 minutes
- `kubectl describe pod` shows events like:
- `MountVolume.SetUp failed for volume "data" : mount failed: exit status 32`
- `mount.nfs: Protocol not supported`
- `mount.nfs: access denied by server`
- Pod spec includes an NFS volume mount
- Other pods on the same node work fine
## Solution
### Step 1: Identify the NFS path
```bash
kubectl describe pod -n <namespace> <pod-name> | grep -A5 "Volumes:"
```
Look for the NFS server and path (e.g., `10.0.10.15:/mnt/main/myservice`)
### Step 2: Verify the export exists on NFS server
SSH to the NFS server and check:
```bash
ssh root@<nfs-server> "ls -la /mnt/main/myservice"
```
### Step 3: If directory doesn't exist, create it
```bash
ssh root@<nfs-server> "mkdir -p /mnt/main/myservice && chmod 777 /mnt/main/myservice"
```
### Step 4: Add to NFS exports (TrueNAS specific)
For TrueNAS, add the path to the NFS share configuration:
1. Add directory to `scripts/nfs_directories.txt`
2. Run `scripts/nfs_exports.sh` to update the share via API
### Step 5: Restart the pod
```bash
kubectl delete pod -n <namespace> -l app=<app-label>
```
The deployment will create a new pod that should now mount successfully.
## Verification
```bash
kubectl get pods -n <namespace>
# Should show 1/1 Running instead of 0/1 ContainerCreating
kubectl exec -n <namespace> <pod-name> -- ls -la /app/data
# Should show the mounted directory contents
```
## Example
**Symptom:**
```
Events:
Warning FailedMount 55s (x13 over 11m) kubelet MountVolume.SetUp failed for volume "data" : mount failed: exit status 32
Mounting command: mount
Mounting arguments: -t nfs 10.0.10.15:/mnt/main/resume /var/lib/kubelet/pods/.../data
Output: mount.nfs: Protocol not supported
```
**Root Cause:** The directory `/mnt/main/resume` didn't exist on the TrueNAS server.
**Fix:**
```bash
ssh root@10.0.10.15 'mkdir -p /mnt/main/resume && chmod 777 /mnt/main/resume'
# Then add to NFS exports and restart pod
```
## Notes
- The "Protocol not supported" error is misleading - it often means the export path doesn't exist
- Always check the NFS server first before investigating protocol/firewall issues
- For TrueNAS, the NFS share must be updated via API/UI after creating new directories
- NFSv3 vs NFSv4 issues are rare in modern setups; missing paths are more common
- Check that the NFS client packages are installed on Kubernetes nodes if this is a new cluster
## Variant: Non-Root Container UID Permission Denied
### Problem
Container starts and mounts NFS successfully, but gets "Permission denied" when
writing files. The pod appears healthy but operations fail silently.
### Trigger Conditions
- Container logs show "Permission denied" or "client returned ERROR on write"
- Pod is Running (not stuck in ContainerCreating)
- NFS directory exists and is mounted, but owned by root (uid 0)
- Container image runs as a non-root user (e.g., `curlimages/curl` runs as uid 101)
- CronJobs or init containers that write to NFS fail with no obvious error
### Common Non-Root Container UIDs
| Image | UID | User |
|-------|-----|------|
| `curlimages/curl` | 101 | curl_user |
| `nginx` (unprivileged) | 101 | nginx |
| `node` | 1000 | node |
| `python` (slim) | 0 | root (safe) |
| `grafana/grafana` | 472 | grafana |
### Solution
Fix permissions on the NFS server:
```bash
# Option 1: World-writable (simplest, suitable for non-sensitive data)
ssh root@10.0.10.15 "chmod -R 777 /mnt/main/<service>/<subdir>"
# Option 2: Match container UID (more secure)
ssh root@10.0.10.15 "chown -R <uid>:<gid> /mnt/main/<service>/<subdir>"
# Option 3: Run the container as root via securityContext in the pod spec (YAML, not a shell command):
#   spec:
#     securityContext:
#       runAsUser: 0
```
### Debugging
```bash
# Check what UID the container runs as
kubectl exec -n <namespace> <pod> -- id
# Test write access from inside container
kubectl exec -n <namespace> <pod> -- sh -c 'echo test > /path/to/nfs/testfile'
# Check NFS directory ownership on server
ssh root@10.0.10.15 "ls -la /mnt/main/<service>/"
```
## Variant: Stale NFS Mounts After Node Reboot (Ghost Running Pods)
### Problem
After a node reboot (e.g., from kured rolling kernel updates), pods are rescheduled and
show `Running 1/1` status, but the application process is frozen/hung. The service is
completely unresponsive despite appearing healthy to Kubernetes.
### Trigger Conditions
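- Node was recently rebooted (check `uptime` on the node, the `Ready` condition's last transition time in `kubectl describe node`, or kured logs; the AGE column of `kubectl get nodes` only shows when the node joined the cluster)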
- Pod shows `Running 1/1` with 0 restarts (looks perfectly healthy)
- Service is unresponsive — Uptime Kuma or curl shows timeout/connection refused
- `kubectl exec <pod> -- ss -tlnp` shows **zero listening sockets** (the process started but is hung)
- Pod uses NFS volumes (inline `nfs {}` or PVC backed by NFS)
- Multiple pods across different namespaces all exhibit the same symptom simultaneously
- `kubectl describe pod` shows no warnings or errors — everything looks normal
### Root Cause
When a node reboots, the NFS client mounts go stale. If the pod is rescheduled to the
same or different node before NFS fully recovers, the application process starts but
immediately hangs when it tries to access the NFS-mounted filesystem. The process is
stuck in an uninterruptible I/O wait (D state) but Kubernetes sees the container as
running because the PID exists and liveness probes (if any) may not exercise the NFS path.
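Processes stuck in uninterruptible sleep can be confirmed from `ps` output. The filter below is a sketch, assuming a `ps` from procps that supports `-eo pid,stat,comm`; `d_state_processes` is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# Illustrative sketch: list processes whose state is D (uninterruptible sleep).

# d_state_processes: reads `ps -eo pid,stat,comm` output on stdin and prints
# the rows whose STAT column starts with D, skipping the header row.
d_state_processes() {
  awk 'NR > 1 && $2 ~ /^D/ { print }'
}

# Example, run on the node or inside a container that ships procps:
#   ps -eo pid,stat,comm | d_state_processes
```

A process that stays in the list across several runs, rather than appearing briefly during normal disk I/O, is consistent with a hang on a stale mount.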
### Solution
Force-delete the affected pods to trigger a clean reschedule with fresh NFS mounts:
```bash
# Identify hung pods — Running but no listening sockets
kubectl exec -n <namespace> <pod> -- ss -tlnp 2>/dev/null
# If output is empty or shows no expected ports, the pod is hung
# Force-delete to skip graceful shutdown (hung process won't respond to SIGTERM)
kubectl delete pod -n <namespace> <pod> --force --grace-period=0
# The deployment controller creates a new pod with fresh NFS mounts
kubectl get pods -n <namespace> -w
```
For bulk remediation after a cluster-wide event:
```bash
# Find pods with NFS volumes that might be hung.
# Check each service's expected port: if ss -tlnp shows nothing, force-delete.
# Caveat: an image without `ss` also yields 0 lines, so confirm the binary exists before trusting this.
for ns in calibre stirling-pdf send speedtest n8n paperless-ngx; do
  pod=$(kubectl get pod -n "$ns" -o name | head -1)
  sockets=$(kubectl exec -n "$ns" "$pod" -- ss -tlnp 2>/dev/null | wc -l)
  # A single line is just the ss header, so <= 1 means no listening sockets
  if [ "$sockets" -le 1 ]; then
    echo "HUNG: $ns/$pod (no listening sockets)"
    kubectl delete "$pod" -n "$ns" --force --grace-period=0
  fi
done
```
### Verification
```bash
# New pod should have listening sockets
kubectl exec -n <namespace> <new-pod> -- ss -tlnp
# Should show the application's expected port (e.g., *:8080)
# Service should respond
kubectl exec -n <namespace> <new-pod> -- curl -sI http://localhost:<port>/
# Should return HTTP response
```
### Key Diagnostic Insight
The critical signal is **Running 1/1 but zero listening sockets**. Normal healthy pods
always have at least one listening socket for their application port. If `ss -tlnp`
returns nothing, the process is hung on a stale NFS mount, not crashed — that's why
Kubernetes thinks it's fine.
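The socket check can be made explicit by counting `LISTEN` rows instead of counting lines. A minimal sketch, with `count_listeners` as a hypothetical helper; it assumes the image ships `ss`, and a missing binary would also produce a count of 0, so rule that out first.

```bash
#!/usr/bin/env bash
# Illustrative sketch: count listening TCP sockets in `ss -tln` output.

# count_listeners: reads `ss -tln` output on stdin, prints the number of LISTEN rows
count_listeners() {
  awk '$1 == "LISTEN" { n++ } END { print n + 0 }'
}

# Example:
#   kubectl exec -n <namespace> <pod> -- ss -tln 2>/dev/null | count_listeners
```

Matching on the `LISTEN` state avoids the off-by-one that comes from the header row when piping to `wc -l`.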
### Prevention
- Add **liveness probes** that hit the application's HTTP endpoint (not just TCP connect):
```hcl
liveness_probe {
  http_get {
    path = "/"
    port = 8080
  }
  initial_delay_seconds = 60
  period_seconds        = 30
  timeout_seconds       = 5
}
```
- This ensures Kubernetes detects hung pods and restarts them automatically.
## See Also
- **nfsv4-idmapd-uid-mapping** — All UIDs show as 65534 (nobody) inside containers. Different from permission denied; the UIDs are wrong, not the permissions.
- TrueNAS NFS configuration documentation
- Kubernetes NFS volume documentation
- k8s-limitrange-oom-silent-kill (for OOM issues often confused with NFS hangs)


@ -0,0 +1,109 @@
---
name: kubelet-static-pod-manifest-update
description: |
Force kubelet to pick up changes to static pod manifests in /etc/kubernetes/manifests/.
Use when: (1) edited kube-apiserver.yaml but the running process still has old flags,
(2) kubelet restart doesn't pick up manifest changes, (3) touching the manifest file
doesn't trigger pod recreation, (4) killing the API server process results in the
same old args on restart, (5) the pod's config.hash annotation doesn't match the
file's hash. Requires a full cycle: remove manifest, stop kubelet, remove containers,
re-add manifest, start kubelet.
author: Claude Code
version: 1.0.0
date: 2026-02-17
---
# Kubelet Static Pod Manifest Update
## Problem
After editing a static pod manifest (e.g., `/etc/kubernetes/manifests/kube-apiserver.yaml`
to add OIDC or audit flags), kubelet continues running the pod with the old configuration.
Standard approaches like `touch`, `systemctl restart kubelet`, or `kubectl delete pod`
do not force kubelet to reconcile the new manifest.
## Context / Trigger Conditions
- Edited `/etc/kubernetes/manifests/kube-apiserver.yaml` (or other static pod manifests)
- The running process (`ps aux | grep kube-apiserver`) shows old flags
- `kubectl get pod -n kube-system kube-apiserver-* -o jsonpath='{.metadata.annotations.kubernetes\.io/config\.hash}'` returns a stale hash
- Any of these actions failed to apply the changes:
- `touch /etc/kubernetes/manifests/kube-apiserver.yaml`
- `systemctl restart kubelet`
- `kubectl delete pod kube-apiserver-*`
- Killing the API server process directly
## Root Cause
Kubelet maintains an internal cache of static pod specs keyed by a hash of the manifest.
When the manifest changes, kubelet should detect the new hash and recreate the pod.
However, in practice (observed on Kubernetes 1.34.x), kubelet can get stuck with the
old hash if:
- The pod's mirror object in the API server still exists with the old hash
- Kubelet's internal pod cache wasn't cleared between restarts
- The container runtime (containerd) still has the old container running
## Solution
Full restart cycle on the master node:
```bash
# 1. Back up the manifest
sudo cp /etc/kubernetes/manifests/kube-apiserver.yaml /tmp/kube-apiserver.yaml.bak
# 2. Remove the manifest (kubelet will stop the pod)
sudo rm /etc/kubernetes/manifests/kube-apiserver.yaml
# 3. Stop kubelet
sudo systemctl stop kubelet
# 4. Wait for the API server container to stop
sleep 5
# 5. Force-remove any remaining API server containers
sudo crictl rm -f $(sudo crictl ps -aq --name kube-apiserver 2>/dev/null) 2>/dev/null
# 6. Re-add the manifest (with your changes)
sudo cp /tmp/kube-apiserver.yaml.bak /etc/kubernetes/manifests/kube-apiserver.yaml
# 7. Start kubelet
sudo systemctl start kubelet
# 8. Wait for API server to come up (30-60 seconds)
sleep 45
# 9. Verify new flags are active
sudo cat /proc/$(pgrep -f 'kube-apiserver --' | head -1)/cmdline | tr '\0' '\n' | grep 'your-new-flag'
```
**Critical:** The order matters. Removing the manifest BEFORE stopping kubelet ensures
kubelet processes the removal. Clearing the containers while kubelet is stopped ensures no
stale state survives. Re-adding the manifest and then starting kubelet makes kubelet build
the pod from the new manifest on startup.
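The fixed `sleep` calls in the cycle can be replaced with a polling loop that returns as soon as the API server answers. This is a sketch under assumptions: `wait_until` is a hypothetical helper, and the example probes the API server's `/readyz` endpoint through `kubectl get --raw` using the admin kubeconfig on the master.

```bash
#!/usr/bin/env bash
# Illustrative sketch: retry a command until it succeeds or a timeout expires.

# wait_until TIMEOUT_SECONDS INTERVAL_SECONDS COMMAND [ARGS...]
# Re-runs COMMAND until it succeeds; returns 1 once TIMEOUT_SECONDS have elapsed.
wait_until() {
  local timeout=$1 interval=$2
  shift 2
  local deadline=$(( SECONDS + timeout ))
  until "$@" >/dev/null 2>&1; do
    if (( SECONDS >= deadline )); then
      return 1
    fi
    sleep "$interval"
  done
}

# Example replacement for the fixed wait after starting kubelet (run on the master):
#   wait_until 120 5 sudo kubectl --kubeconfig=/etc/kubernetes/admin.conf get --raw /readyz
```

Polling avoids both waiting longer than necessary and moving on to the flag check before the API server is actually serving.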
## What Does NOT Work
| Approach | Why it fails |
|----------|-------------|
| `touch manifest.yaml` | Kubelet may not detect mtime-only changes |
| `systemctl restart kubelet` | Kubelet reuses cached pod spec if hash matches |
| `kubectl delete pod` | Deletes mirror pod but kubelet recreates from cached spec |
| `kill <apiserver-pid>` | Kubelet restarts the container from the same cached spec, with the old args |
| Moving manifest away and back without stopping kubelet | Kubelet may cache the old spec in memory |
## Verification
```bash
# Check the running process has new flags
ps aux | grep kube-apiserver | grep -v grep | grep 'your-new-flag'
# Check the config hash changed
kubectl get pod -n kube-system kube-apiserver-$(hostname) \
-o jsonpath='{.metadata.annotations.kubernetes\.io/config\.hash}'
# Check API server logs for successful startup
kubectl logs -n kube-system kube-apiserver-$(hostname) | tail -5
```
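The flag check can also be scripted. The following is a minimal Python sketch, assuming only the NUL-separated layout of `/proc/<pid>/cmdline`; the helper name `has_flag` and the sample flags are illustrative, not part of any Kubernetes tooling.

```python
def has_flag(cmdline: bytes, flag: str) -> bool:
    """Return True if any argument equals `flag` or starts with `flag=`."""
    args = [a.decode() for a in cmdline.split(b"\0") if a]
    return any(a == flag or a.startswith(flag + "=") for a in args)


# Sample data in the same NUL-separated layout as /proc/<pid>/cmdline
# (hypothetical values, for illustration only).
sample = b"kube-apiserver\0--advertise-address=10.0.0.1\0--oidc-client-id=kubernetes\0"

print(has_flag(sample, "--oidc-client-id"))   # True
print(has_flag(sample, "--oidc-issuer-url"))  # False
```

On the node itself, the bytes would come from `open(f"/proc/{pid}/cmdline", "rb").read()`, run as root.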
## Notes
- This applies to ALL static pods, not just kube-apiserver (etcd, controller-manager, scheduler)
- The cluster will be briefly unavailable during the restart (30-60 seconds)
- On single-master clusters, kubectl commands will fail during the restart — use `sudo kubectl --kubeconfig=/etc/kubernetes/admin.conf` from the master
- Always validate the YAML before removing the manifest: `python3 -c "import yaml; yaml.safe_load(open('/etc/kubernetes/manifests/kube-apiserver.yaml'))"`
- See also: `authentik-oidc-kubernetes` skill for the full OIDC setup context

@ -0,0 +1,143 @@
---
name: local-llm-gpu-selection
description: |
Guide for selecting GPUs and hardware for local LLM inference on Dell R730 and
comparing to Apple Silicon alternatives. Use when: (1) user asks about running
local models (Ollama, llama.cpp), (2) user asks which GPU to buy for LLMs,
(3) user wants to compare local models to Claude for coding, (4) user asks about
quantized model selection, (5) user asks about Mac Mini/Studio vs GPU server for
LLMs. Covers VRAM requirements, memory bandwidth as key metric, R730 GPU compatibility,
multi-GPU considerations, and realistic quality comparisons to Claude models.
author: Claude Code
version: 1.0.0
date: 2025-06-11
---
# Local LLM GPU Selection & Performance Guide
## Problem
Choosing the right hardware for local LLM inference requires understanding the
relationship between VRAM capacity, memory bandwidth, GPU compatibility with
server chassis, and realistic model quality expectations.
## Context / Trigger Conditions
- User asks about running quantized models locally (Ollama, llama.cpp)
- User wants to know which GPU fits their server (Dell R730 or similar 2U)
- User asks about Apple Silicon (Mac Mini/Studio) vs datacenter GPUs for LLMs
- User wants to compare local model quality to Claude (Opus/Sonnet/Haiku) for coding
## Key Principle: Memory Bandwidth Is Everything
LLM token generation is **memory-bandwidth bound**, not compute bound. The formula:
```
approx tokens/sec = memory_bandwidth_GB_s / model_size_GB
```
This is why Apple Silicon (high bandwidth unified memory) competes with datacenter GPUs
despite having less raw compute.
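As a rough illustration of the formula, the sketch below plugs in bandwidth and model-size figures quoted elsewhere in this guide. The function name is hypothetical, and the result is an upper bound that ignores compute, KV-cache reads, and multi-GPU overhead.

```python
def approx_tokens_per_sec(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Upper-bound estimate: each generated token reads all weights once."""
    if model_size_gb <= 0:
        raise ValueError("model_size_gb must be positive")
    return bandwidth_gb_s / model_size_gb


# Tesla T4 (320 GB/s) with an 8B Q4_K_M model (~5 GB)
print(approx_tokens_per_sec(320, 5))   # 64.0
# Tesla P40 (346 GB/s) with a 32B Q4_K_M model (~20 GB)
print(approx_tokens_per_sec(346, 20))  # 17.3
```

Real throughput is typically lower than these ceilings, so treat the numbers as a way to rank hardware rather than as predictions.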
## VRAM Requirements by Model Size
| Model Size | Quant | VRAM Needed | Examples |
|------------|-------|-------------|----------|
| 7-8B | Q4_K_M | ~5 GB | Llama 3.1 8B, Mistral 7B |
| 7-8B | Q8_0 | ~8 GB | |
| 13-14B | Q4_K_M | ~8 GB | Qwen 2.5 Coder 14B |
| 22-24B | Q4_K_M | ~13-14 GB | Mistral Small, Codestral |
| 32B | Q4_K_M | ~20 GB | Qwen 2.5 Coder 32B |
| 32B | Q8_0 | ~34 GB | |
| 70B | Q4_K_M | ~40 GB | Llama 3.1 70B |
| 70B | Q8_0 | ~70 GB | |
Add ~1-2 GB overhead for KV cache and context. Longer conversations use more.
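The table values can be approximated from parameter count and bits per weight. This is a sketch under stated assumptions: the bits-per-weight figures are approximate averages for GGUF quantizations, not exact values, and the function names and 1.5 GB default overhead are illustrative.

```python
# Approximate average bits per weight (assumed values, not exact).
BITS_PER_WEIGHT = {"Q4_K_M": 4.85, "Q8_0": 8.5, "FP16": 16.0}


def approx_vram_gb(params_billion: float, quant: str, overhead_gb: float = 1.5) -> float:
    """Weights size in GB plus a flat allowance for KV cache and context."""
    weights_gb = params_billion * BITS_PER_WEIGHT[quant] / 8
    return weights_gb + overhead_gb


def fits(params_billion: float, quant: str, vram_gb: float) -> bool:
    """True if the estimated footprint fits in the given VRAM."""
    return approx_vram_gb(params_billion, quant) <= vram_gb


print(fits(32, "Q4_K_M", 24))  # True: ~19.4 GB weights + 1.5 GB overhead
print(fits(70, "Q4_K_M", 24))  # False: ~42.4 GB weights alone
```

The estimates land close to the table above but not exactly on it, which is expected given that the bits-per-weight averages vary by model architecture.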
## Dell R730 GPU Compatibility
### Constraints
- **2U chassis**: Full-height cards fit, but limited to dual-slot width
- **PCIe 3.0 x16 slots**: 2-3 usable slots depending on riser configuration
- **Power**: Needs Dell GPU power cable (P/N 0D4J0T) for GPUs >75W TDP
- **PSU**: Check wattage headroom (dual 750W or 1100W typical)
### Compatible GPUs
**No external power needed (<=75W):**
- Tesla T4: 16 GB, 320 GB/s, 70W — best drop-in option
- Tesla P4: 8 GB, 192 GB/s, 75W — too little VRAM for modern LLMs
- NVIDIA L4: 24 GB, 300 GB/s, 72W — T4 successor, Ada Lovelace, expensive
- NVIDIA A2: 16 GB, 200 GB/s, 60W — worse than T4 in every way, avoid
**Requires power cable (>75W):**
- Tesla P40: 24 GB, 346 GB/s, 250W — best value per GB
- Tesla V100 PCIe: 32 GB, 900 GB/s, 250W — excellent bandwidth
- Tesla P100 PCIe: 16 GB, 732 GB/s, 250W — same VRAM as T4, not worth it
**Won't fit or impractical:**
- RTX 3090/4090: Too thick (3-slot), too long
- A100 PCIe: Fits physically, but very expensive
- Other consumer RTX cards: Generally too large for 2U
### Multi-GPU Considerations
- Ollama splits model layers across GPUs automatically
- PCIe 3.0 cross-GPU transfer adds ~30-40% latency penalty
- Mismatched GPUs (e.g., T4 + P40) work but the slower card bottlenecks
- The R730's PCIe 3.0 slots halve the host-link bandwidth of PCIe 4.0 cards such as the L4; on-card memory bandwidth is unaffected
## Apple Silicon Comparison
Apple Silicon unified memory lets the GPU address most of system RAM as VRAM with no PCIe bus penalty (macOS reserves a share of it for the system by default).
| Device | Memory | Bandwidth | Advantage |
|--------|--------|-----------|-----------|
| Mac Mini M4 Pro 48 GB | 48 GB | 273 GB/s | Silent, 25W, no PCIe penalty |
| Mac Studio M4 Max 128 GB | 128 GB | 546 GB/s | Run 100B+ models |
| Mac Studio M3 Ultra 96-512 GB | 96-512 GB | 819 GB/s | Run anything |
A Mac Mini M4 Pro 48GB often matches or beats a T4+L4 multi-GPU setup for
LLM inference due to zero cross-GPU overhead and high unified bandwidth.
## Best Coding Models (for Ollama)
For coding tasks specifically, prefer dedicated coding models:
1. **Qwen 2.5 Coder 32B** — best open-source coding model in this size class
2. **Codestral 22B** — Mistral's dedicated coding model
3. **DeepSeek Coder V2** — good quality, efficient
4. **Llama 3.1 70B** — strong general purpose but needs ~40 GB
## Realistic Quality Comparison to Claude
For Claude Code-style agentic coding workflows:
| Capability | Opus/Sonnet | Haiku | Qwen 2.5 Coder 32B | 70B General |
|-----------|-------------|-------|---------------------|-------------|
| Single function gen | Excellent | Good | Good | Decent |
| Multi-file refactoring | Excellent | Decent | Weak | Weak |
| Tool use / agentic loops | Excellent | Good | Poor | Poor |
| Long context (large codebases) | Excellent | Good | Weak | Weak |
Local models work for simple completions and code questions. They struggle badly
with Claude Code's complex multi-step tool-use workflows, long context windows,
and self-correction capabilities.
## Quantization Quality Guide
From best to worst quality (and largest to smallest):
- FP16: Full precision, baseline quality
- Q8_0: Near-lossless, ~50% size reduction
- Q6_K: Minimal quality loss
- Q5_K_M: Good balance
- Q4_K_M: **Recommended default** — best quality/size tradeoff
- Q3_K_M: Noticeable degradation on complex reasoning
- Q2_K: Significant quality loss, emergency only
## Verification
- Check GPU compatibility: `lspci | grep -i nvidia` on the host
- Check available VRAM: `nvidia-smi` inside the GPU VM
- Check model fit: Ollama shows VRAM usage during `ollama run`
- Check inference speed: Count tokens/sec in Ollama output
## Notes
- GPU prices fluctuate significantly in the used market; check current prices
- The T4 is PCIe 3.0 only; newer PCIe 4.0 GPUs in PCIe 3.0 slots run at reduced host-link bandwidth
- Power consumption matters for 24/7 homelab use (electricity cost)
- For Claude Code specifically, API-based Claude models remain significantly
superior to any local model for agentic coding workflows

@ -0,0 +1,143 @@
---
name: loki-helm-deployment-pitfalls
description: |
Fix common Loki Helm chart deployment failures on Kubernetes with Terraform.
Use when: (1) Loki pod fails with "mkdir: read-only file system" for compactor
or ruler paths, (2) Helm chart fails with "Helm test requires the Loki Canary
to be enabled", (3) Helm install fails with "cannot re-use a name that is still
in use" after a failed atomic deploy, (4) PV stuck in Released state after failed
Helm install, (5) "entry too far behind" errors flooding Loki logs after initial
Alloy deployment. Covers single-binary mode with filesystem storage on NFS.
author: Claude Code
version: 1.0.0
date: 2026-02-13
---
# Loki Helm Chart Deployment Pitfalls
## Problem
Deploying the Grafana Loki Helm chart in single-binary mode with Terraform hits
multiple non-obvious failures that aren't documented together.
## Context / Trigger Conditions
- Deploying Loki via `helm_release` in Terraform
- Using `deploymentMode: SingleBinary` with filesystem storage on NFS
- First-time deployment or redeployment after failures
## Pitfall 1: Read-Only Root Filesystem
**Error:** `mkdir /loki/compactor: read-only file system`
**Cause:** The Loki Helm chart runs containers with a read-only root filesystem
for security. The compactor `working_directory` and ruler `rule_path` default to
paths under `/loki/` which is on the read-only root FS.
**Fix:** Use paths under `/var/loki/` — the Helm chart mounts the persistence
volume there:
```yaml
compactor:
  working_directory: /var/loki/compactor # NOT /loki/compactor
ruler:
  rule_path: /var/loki/scratch # NOT /loki/scratch
```
## Pitfall 2: Canary Required
**Error:** `Helm test requires the Loki Canary to be enabled`
**Cause:** The Loki Helm chart's validation template fails when `test.enabled`
is true (the default) and `lokiCanary.enabled` is false. The canary cannot be
disabled on its own.
**Fix:** Leave `lokiCanary` enabled (default), or disable `test` together with it.
You can disable `gateway`, `chunksCache`, and `resultsCache` to reduce resource usage:
```yaml
gateway:
  enabled: false
chunksCache:
  enabled: false
resultsCache:
  enabled: false
# Do NOT add `lokiCanary: enabled: false` while `test.enabled` is true
```
## Pitfall 3: Stale Helm Release After Failed Atomic Deploy
**Error:** `cannot re-use a name that is still in use`
**Cause:** When `atomic = true` and the deploy fails, Helm rolls back but
sometimes leaves a stale release secret in Kubernetes. Terraform then can't
create a new release with the same name.
**Fix:** Delete the stale Helm secret:
```bash
kubectl delete secret -n monitoring sh.helm.release.v1.loki.v1
```
Also consider removing `atomic = true` for initial deployments and adding it
back after the first successful install. Use a longer `timeout` (600s+) for
first deploy since image pulls take time.
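The secret in the delete command ends in `.v1` because Helm 3 stores each release revision in a Secret named `sh.helm.release.v1.<release>.v<revision>`; after several failed attempts the stale revision may be higher than 1. A small sketch of the naming scheme (the helper name is illustrative):

```python
def helm_release_secret(release: str, revision: int) -> str:
    """Build the name of the Secret that Helm 3 uses for one release revision."""
    if revision < 1:
        raise ValueError("Helm revisions start at 1")
    return f"sh.helm.release.v1.{release}.v{revision}"


print(helm_release_secret("loki", 1))  # sh.helm.release.v1.loki.v1
print(helm_release_secret("loki", 3))  # sh.helm.release.v1.loki.v3
```

Listing the secrets first (for example `kubectl get secret -n monitoring | grep sh.helm.release.v1.loki`) shows which revisions actually exist before anything is deleted.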
## Pitfall 4: PV Stuck in Released State
**Symptom:** PV shows `Released` status, PVC can't bind, Loki pod stuck in Pending.
**Cause:** After a failed Helm deploy, the PVC is deleted but the PV retains a
`claimRef` to the old PVC. New PVCs can't bind to a `Released` PV.
**Fix:** Clear the stale claimRef:
```bash
kubectl patch pv loki --type json -p '[{"op": "remove", "path": "/spec/claimRef"}]'
```
The PV will transition from `Released` to `Available` and can be bound again.
## Pitfall 5: "Entry Too Far Behind" Log Spam
**Error:** `entry too far behind, entry timestamp is: ... oldest acceptable timestamp is: ...`
**Cause:** Alloy reads all historical log files from the Kubernetes API on first
startup. Old entries are rejected by Loki's ingester because they're behind the
newest entry for that stream.
**Fix:** This is harmless and self-resolving — Alloy catches up to present time
and errors stop. To clear immediately:
```bash
kubectl rollout restart ds -n monitoring alloy
```
After restart, Alloy tails from approximately "now" for each container.
## Pitfall 6: Alertmanager Service Name
**Symptom:** Loki ruler alerts never fire despite correct LogQL rules.
**Cause:** The Prometheus Helm chart names the Alertmanager service
`prometheus-alertmanager`, not `alertmanager`. Using the wrong name causes
silent alert delivery failures.
**Fix:**
```yaml
ruler:
  alertmanager_url: http://prometheus-alertmanager.monitoring.svc.cluster.local:9093
```
Verify the actual service name: `kubectl get svc -n monitoring | grep alertmanager`
## Verification
```bash
# Loki pod running
kubectl get pods -n monitoring -l app.kubernetes.io/name=loki
# Loki receiving logs
kubectl port-forward -n monitoring svc/loki 3100:3100 &
curl -s 'http://localhost:3100/loki/api/v1/labels'
# Should return JSON with namespace, pod, container labels
# PV bound
kubectl get pv loki
# STATUS should be "Bound"
```
## Notes
- Always check PV status before retrying a failed deploy
- The Loki Helm chart creates many components by default (gateway, canary,
memcached caches) — disable what you don't need for single-binary mode
- WAL directory can be on tmpfs (emptyDir with `medium: Memory`) for
disk-friendly setups, but data is lost on pod crash
- See also: `helm-release-force-rerender` for Helm values not updating resources

@ -0,0 +1,148 @@
---
name: music-assistant-librespot-wrong-account
description: |
Fix for Music Assistant Spotify playback failing with "librespot does not support free
accounts" even when the Spotify account has Premium. Use when: (1) Songs load for 1-2
seconds then auto-pause, (2) Music Assistant logs show "librespot does not support free
accounts" followed by FFmpeg "Invalid data found when processing input" exit code 183,
(3) Spotify provider shows "Successfully logged in" but streaming fails. Root cause is
stale librespot credential cache pointing to a different (free-tier) Spotify account.
author: Claude Code
version: 1.0.0
date: 2026-02-21
---
# Music Assistant Librespot Wrong Account / Stale Credentials
## Problem
Music Assistant (MASS) Spotify playback fails immediately — songs appear to load for 1-2
seconds then auto-pause. Every track is marked "unplayable". The error log shows librespot
rejecting the account as "free" despite the configured Spotify account having Premium.
## Context / Trigger Conditions
- Music Assistant addon on Home Assistant (tested with v2.7.8, addon `d5369777_music_assistant`)
- Symptoms: Song starts loading, pauses after 1-2 seconds, skipped as "unplayable"
- Log pattern (all three appear together on every play attempt):
```
WARNING [music_assistant.spotify] [librespot] librespot does not support "free" accounts.
WARNING [music_assistant.audio.media_stream] Error opening input: Invalid data found when processing input
ERROR [music_assistant.streams] AudioError while streaming queue item ... FFMpeg exited with code 183
```
- OAuth login succeeds: `Successfully logged in to Spotify as <Name>`
- But librespot streaming fails with the "free" account error
## Root Cause
Music Assistant uses **two separate auth mechanisms** for Spotify:
1. **OAuth (PKCE flow)** — for browsing, search, metadata. Uses access tokens refreshed via
the Spotify Web API. This is what produces the "Successfully logged in" message.
2. **Librespot** — for actual audio streaming. Uses cached credentials stored in
`/data/.cache/spotify--<id>/credentials.json` inside the addon container.
The librespot credential cache can become stale or point to a **different Spotify account**
(e.g., if another family member logged in, or credentials were cached from before a Premium
upgrade). Librespot uses these cached credentials to connect to Spotify's internal API, which
returns a `ProductInfo` XML packet containing the account `type`. If the cached account is
"free", librespot calls `exit(1)`, killing the audio pipeline before FFmpeg receives any data.
## How Librespot Determines Account Type
Librespot reads the `type` field from Spotify's `ProductInfo` server packet
(`librespot-org/librespot`, `core/src/session.rs`):
```rust
fn check_catalogue(attributes: &UserAttributes) {
    if let Some(account_type) = attributes.get("type") {
        if account_type != "premium" {
            error!("librespot does not support {account_type:?} accounts.");
            exit(1);
        }
    }
}
```
The check is an exact string match against `"premium"`.
## Solution
### Step 1: Verify the Problem
Check Music Assistant addon logs for the "free accounts" error:
```bash
# Via HA API (from a machine with the HA token)
python3 -c "
import os, json, requests
url = os.environ.get('HOME_ASSISTANT_SOFIA_URL', '').rstrip('/')
token = os.environ.get('HOME_ASSISTANT_SOFIA_TOKEN', '')
headers = {'Authorization': f'Bearer {token}'}
r = requests.get(f'{url}/api/hassio/addons/d5369777_music_assistant/logs', headers=headers)
for line in r.text.split('\n'):
    if 'free' in line.lower() or 'librespot' in line.lower():
        print(line)
"
```
### Step 2: Identify the Music Assistant Container
From the SSH addon (ha-sofia: `ssh vbarzin@192.168.1.8`):
```bash
sudo curl -s --unix-socket /run/docker.sock http://localhost/containers/json | \
python3 -c "import sys,json; [print(c['Names'][0], c['Id'][:12]) for c in json.load(sys.stdin) if 'music' in c['Names'][0].lower()]"
```
### Step 3: Check Cached Credentials
Exec into the container to read the librespot cache:
```bash
# Create exec
EXEC_ID=$(sudo curl -s --unix-socket /run/docker.sock \
"http://localhost/containers/<CONTAINER_ID>/exec" \
-H 'Content-Type: application/json' \
-d '{"Cmd":["cat","/data/.cache/spotify--5s3mSP8y/credentials.json"],"AttachStdout":true,"AttachStderr":true}' | python3 -c "import sys,json; print(json.load(sys.stdin)['Id'])")
# Run exec
sudo curl -s --unix-socket /run/docker.sock \
"http://localhost/exec/$EXEC_ID/start" \
-H 'Content-Type: application/json' -d '{"Detach":false}'
```
Check the `username` field — if it doesn't match the expected Premium account, that's the problem.
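The comparison can be scripted once the file contents have been captured. This is a minimal sketch that relies only on the `username` field described here; the function name and the sample usernames are hypothetical.

```python
import json


def cached_account_matches(credentials_json: str, expected_username: str) -> bool:
    """Compare the cached librespot username with the expected Premium account."""
    data = json.loads(credentials_json)
    return data.get("username") == expected_username


# Hypothetical cache contents; real files contain additional fields.
sample = '{"username": "someone-else"}'

print(cached_account_matches(sample, "premium-owner"))  # False
print(cached_account_matches(sample, "someone-else"))   # True
```

A `False` result for the Premium account's user ID indicates that the cache points at a different account and should be cleared.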
### Step 4: Clear the Cache
```bash
# Create exec to delete cache
EXEC_ID=$(sudo curl -s --unix-socket /run/docker.sock \
"http://localhost/containers/<CONTAINER_ID>/exec" \
-H 'Content-Type: application/json' \
-d '{"Cmd":["rm","-rf","/data/.cache/spotify--5s3mSP8y"],"AttachStdout":true,"AttachStderr":true}' | python3 -c "import sys,json; print(json.load(sys.stdin)['Id'])")
# Run exec
sudo curl -s --unix-socket /run/docker.sock \
"http://localhost/exec/$EXEC_ID/start" \
-H 'Content-Type: application/json' -d '{"Detach":false}'
```
### Step 5: Restart Music Assistant
```bash
sudo curl -s --unix-socket /run/docker.sock \
"http://localhost/containers/<CONTAINER_ID>/restart" -X POST
```
### Step 6: Verify
After restart, check logs for:
- `Successfully logged in to Spotify as <Name>` (OAuth OK)
- No "free accounts" error when playing a track
- Optionally re-check `/data/.cache/spotify--5s3mSP8y/credentials.json` to confirm the
`username` now matches the Premium account
## Verification
1. Play any Spotify track through Music Assistant
2. The track should stream without pausing after 1-2 seconds
3. Logs should show `Start Queue Flow stream` without subsequent `AudioError`
## Notes
- The cache directory name `spotify--5s3mSP8y` is an internal Music Assistant provider ID
and may differ across installations. Use `find /data -name credentials.json` to locate it.
- The `username` field in the credentials cache is Spotify's internal user ID (numeric for
newer accounts, text for older ones), not necessarily the display name or email.
- Spotify Family plan **owners** have account type `"premium"`. Family plan **members** also
report as `"premium"` when their membership is active.
- If the problem recurs, it may indicate that Music Assistant's Spotify provider re-caches
the wrong credentials — check if multiple Spotify accounts are configured or if another
user logged in via the Music Assistant UI.
- The SSH addon on HA OS needs `sudo` for Docker socket access (`/run/docker.sock` is owned
by `root:messagebus`).
- The HA long-lived token typically does NOT have Supervisor API access (hassio endpoints
return 401), so addon management must go through the Docker socket from the SSH addon.

@ -0,0 +1,128 @@
---
name: nextcloud-calendar
description: |
Create, list, and query calendar events in Nextcloud via CalDAV. Use when:
(1) User asks to create a calendar event, (2) User asks what's on their calendar,
(3) User says "add to calendar" or "schedule", (4) User asks about upcoming events.
Always use Nextcloud calendar unless user specifies otherwise.
author: Claude Code
version: 1.0.0
date: 2025-01-25
---
# Nextcloud Calendar Management
## Problem
Need to create, query, or manage calendar events in the user's Nextcloud calendar.
## Context / Trigger Conditions
- User asks to create/add a calendar event
- User asks "what's on my calendar?" or similar
- User mentions scheduling something
- User says "remind me" with a date (create calendar event)
- Default calendar is always Nextcloud unless otherwise specified
## Prerequisites
- Python 3 with `caldav` and `icalendar` packages available (installed via PYTHONPATH or system packages)
- Environment variables `NEXTCLOUD_USER` and `NEXTCLOUD_APP_PASSWORD` must be set
## Solution
### Script Location
```
.claude/calendar-query.py
```
### Execution Pattern (CRITICAL)
Run the script directly with python3 (env vars are set in the environment):
```bash
python3 .claude/calendar-query.py [command] [options]
```
### Available Commands
#### List Calendars
```bash
python3 .claude/calendar-query.py list
```
#### Query Events
```bash
# Today's events
python3 .claude/calendar-query.py today
# Tomorrow's events
python3 .claude/calendar-query.py tomorrow
# This week
python3 .claude/calendar-query.py week
# This month
python3 .claude/calendar-query.py month
# Custom date range
python3 .claude/calendar-query.py events --days 14
python3 .claude/calendar-query.py events --date 2026-04-10
# From specific calendar
python3 .claude/calendar-query.py today --calendar "Work"
```
#### Create Events
```bash
# All-day event (single day)
python3 .claude/calendar-query.py create --title "Doctor appointment" --start "2026-03-15" --all-day
# All-day event (multi-day) - end date is EXCLUSIVE
# For April 10-13, use end date April 14
python3 .claude/calendar-query.py create --title "Vacation" --start "2026-04-10" --end "2026-04-14" --all-day
# Timed event
python3 .claude/calendar-query.py create --title "Meeting" --start "2026-03-15 14:00" --end "2026-03-15 15:00"
# With location and description
python3 .claude/calendar-query.py create --title "Lunch" --start "tomorrow 12:00" --location "Cafe" --description "Team lunch"
# Relative dates work
python3 .claude/calendar-query.py create --title "Call" --start "today 16:00"
python3 .claude/calendar-query.py create --title "Review" --start "tomorrow 10:00"
```
### Output Formats
```bash
# JSON output (for parsing)
python3 .claude/calendar-query.py today --json
# Text output (default, human-readable)
python3 .claude/calendar-query.py week
```
## Complete Example
To create an event "Team offsite" from March 20-22, 2026:
```bash
python3 .claude/calendar-query.py create --title "Team offsite" --start "2026-03-20" --end "2026-03-23" --all-day
```
## Important Notes
1. **End dates are exclusive** for all-day events (iCalendar's `DTEND` is non-inclusive, and CalDAV follows it). To create an event spanning April 10-13, set the end to April 14.
2. **No delete/update commands** - The script currently only supports create and query. To modify events, user must do it manually in Nextcloud.
3. **Default calendar** is "Personal" - use `--calendar` flag for others.
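The exclusive end date for an all-day event is simply the last day plus one. A short sketch using only the standard library (the helper name is illustrative and not part of `calendar-query.py`):

```python
from datetime import date, timedelta


def exclusive_end(last_day: date) -> date:
    """Return the value to pass as --end for an all-day event ending on last_day."""
    return last_day + timedelta(days=1)


# Vacation covering April 10-13
print(exclusive_end(date(2026, 4, 13)).isoformat())  # 2026-04-14
# Team offsite covering March 20-22
print(exclusive_end(date(2026, 3, 22)).isoformat())  # 2026-03-23
```

Using date arithmetic rather than incrementing the day number by hand keeps month and year boundaries correct.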
## Verification
- For queries: Output shows formatted event list
- For creates: Output shows "Event created: [title]" with calendar name and start date
- Exit code 0 = success, 1 = error (check output for details)
## Common Errors
| Error | Cause | Fix |
|-------|-------|-----|
| `NEXTCLOUD_USER and NEXTCLOUD_APP_PASSWORD must be set` | Env vars not set | Ensure `NEXTCLOUD_USER` and `NEXTCLOUD_APP_PASSWORD` are in the environment |
| `Required packages not installed` | caldav/icalendar missing | Ensure PYTHONPATH includes the installed packages |
| `Calendar 'X' not found` | Wrong calendar name | Run `list` command to see available calendars |

@ -0,0 +1,132 @@
---
name: nfsv4-idmapd-uid-mapping
description: |
Fix for all file UIDs showing as 65534 (nobody) inside Kubernetes containers when using
NFS volumes from TrueNAS/FreeBSD. Use when: (1) ls -lan inside a container shows all files
owned by 65534:65534 despite correct ownership on the NFS server, (2) PostgreSQL fails with
"data directory has wrong ownership", (3) chown inside containers returns "Invalid argument"
on NFS volumes, (4) services that check file ownership (PostgreSQL, MySQL) crash on startup,
(5) the same NFS mount shows correct UIDs on the host but 65534 inside containers,
(6) NFSv4.2 appears in container mount output even though host mounts use NFSv3.
Root cause: Kubernetes inline NFS volumes auto-negotiate NFSv4.2 (not NFSv3), and NFSv4
idmapd fails to map UIDs when domains don't match or users don't exist on the server.
author: Claude Code
version: 1.0.0
date: 2026-03-01
---
# NFSv4 idmapd UID Mapping — All Files Show as nobody (65534)
## Problem
All files on NFS volumes appear owned by UID 65534 (nobody:nogroup) inside Kubernetes
containers, even though `ls -lan` on the NFS server shows the correct UIDs (e.g., 999, 472).
This breaks any service that checks file ownership: PostgreSQL refuses to start ("data
directory has wrong ownership"), MySQL's entrypoint `chown` fails with "Invalid argument",
and any `chown` inside the container returns EINVAL.
## Context / Trigger Conditions
- TrueNAS CORE (FreeBSD) or TrueNAS SCALE as NFS server
- NFSv4 enabled on the NFS server (`v4: true` in TrueNAS NFS config)
- Kubernetes using inline NFS volumes (not PV/PVC with mount options)
- **Key symptom**: `mount` inside the container shows `type nfs4 (vers=4.2,...)` even
though existing kubelet mounts on the host show `vers=3`
- **Key symptom**: Same NFS path mounted directly on the host shows correct UIDs, but
inside any container shows 65534
## Root Cause
Kubernetes inline NFS volumes don't support `mountOptions`. When kubelet mounts NFS for a
new pod, the Linux NFS client auto-negotiates the highest available version — NFSv4.2 if
the server supports it.
NFSv4 uses **idmapd** for UID translation: the server translates UID→username (e.g.,
`999→postgres@domain`), sends the username string over the wire, and the client translates
it back to a local UID. This fails when:
1. **Domain mismatch**: Server domain (from hostname) differs from client domain
- TrueNAS: `viktorbarzin.me` (from `truenas.viktorbarzin.me`)
- K8s nodes: `viktorbarzin.lan` (from `k8s-node4.viktorbarzin.lan`)
- When domains don't match, ALL UIDs fall back to `nobody` (65534)
2. **Unknown UIDs**: Even with matching domains, if the NFS server has no local user for
UID 999 (common for container UIDs), idmapd maps it to `nobody`
**Why existing mounts work**: Older kubelet mounts (established before NFSv4 was enabled,
or when the NFS client defaulted to v3) continue using NFSv3 with direct numeric UID
passthrough. Only NEW mounts negotiate NFSv4.2.
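The domain-mismatch condition can be checked ahead of time. The sketch below assumes the idmap domain is derived from the hostname by dropping the first label, as in the examples above; an explicit `Domain` or `v4_domain` setting would override that. The function names are illustrative.

```python
def idmap_domain(fqdn: str) -> str:
    """Domain part of an FQDN: everything after the first label."""
    _host, _sep, domain = fqdn.partition(".")
    return domain.lower()


def domains_match(server_fqdn: str, client_fqdn: str) -> bool:
    """True only if both hosts have a domain and the domains are identical."""
    server, client = idmap_domain(server_fqdn), idmap_domain(client_fqdn)
    return bool(server) and server == client


print(idmap_domain("truenas.viktorbarzin.me"))     # viktorbarzin.me
print(idmap_domain("k8s-node4.viktorbarzin.lan"))  # viktorbarzin.lan
print(domains_match("truenas.viktorbarzin.me", "k8s-node4.viktorbarzin.lan"))  # False
```

A `False` result here corresponds to the case where every UID falls back to `nobody` (65534).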
## Solution
**Fix on TrueNAS (no NFS restart required):**
```bash
# 1. Enable NFSv3-style numeric UID passthrough for NFSv4
midclt call nfs.update '{"v4_v3owner": true, "v4_domain": "viktorbarzin.lan"}'
# 2. Restart nfsuserd with the correct domain (NOT nfsd — that would crash the cluster)
killall nfsuserd
nfsuserd -domain viktorbarzin.lan -force
```
**Clear caches on all K8s nodes:**
```bash
for node in k8s-node1 k8s-node2 k8s-node3 k8s-node4; do
ssh wizard@$node "sudo nfsidmap -c && sudo keyctl clear @u"
done
```
**Key settings explained:**
- `v4_v3owner = true`: Makes NFSv4 use numeric UID passthrough like NFSv3, completely
bypassing the username-based idmapd translation. **This is the critical fix.**
- `v4_domain`: Should match the K8s nodes' DNS domain (check with `hostname -d` on a node)
- `nfsuserd -domain <domain> -force`: FreeBSD daemon that handles NFSv4 user mapping.
The `-force` flag is required if it thinks it's already running.
## Verification
```bash
# Run a test pod and check UIDs
kubectl run nfs-test --rm -it --restart=Never --image=alpine \
--overrides='{"spec":{"containers":[{"name":"test","image":"alpine",
"command":["sh","-c","ls -lan /data | head -5"],
"volumeMounts":[{"name":"nfs","mountPath":"/data"}]}],
"volumes":[{"name":"nfs","nfs":{"server":"10.0.10.15","path":"/mnt/main/some-path"}}]}}'
# Should show actual UIDs (e.g., 999, 472) instead of 65534
```
## Debugging Steps
If you're not sure whether this is the issue:
```bash
# 1. Check mount type INSIDE a container (not on the host!)
kubectl exec <pod> -- mount | grep nfs
# If it shows "type nfs4" with "vers=4.2" — this is the issue
# 2. Compare UIDs: host vs container
# On host (via kubelet mount path):
sudo ls -lan /var/lib/kubelet/pods/<pod-uid>/volumes/kubernetes.io~nfs/<vol>/
# Inside container:
kubectl exec <pod> -- ls -lan /mount-path/
# 3. Check TrueNAS NFS config
midclt call nfs.config # Look for v4: true, v4_v3owner, v4_domain
# 4. Check nfsuserd is running with the right domain
ps aux | grep nfsuserd # On TrueNAS
```
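Step 1 can be automated by parsing the `mount` output captured from a pod. This is a minimal sketch assuming the usual Linux `mount` line format (`<source> on <target> type <fstype> (<options>)`); the function name and sample lines are illustrative.

```python
import re


def nfs_versions(mount_output: str) -> dict:
    """Map each NFS mount point to its negotiated `vers=` option."""
    result = {}
    for line in mount_output.splitlines():
        m = re.match(r"\S+ on (\S+) type (nfs4?) \((.*)\)", line)
        if not m:
            continue  # not an NFS mount
        options = m.group(3).split(",")
        vers = next((o.split("=", 1)[1] for o in options if o.startswith("vers=")), None)
        result[m.group(1)] = vers
    return result


# Hypothetical output of `kubectl exec <pod> -- mount`
sample = (
    "10.0.10.15:/mnt/main/some-path on /data type nfs4 (rw,relatime,vers=4.2,rsize=131072)\n"
    "overlay on / type overlay (rw,relatime)\n"
)

print(nfs_versions(sample))  # {'/data': '4.2'}
```

Any mount point reporting a `4.x` version is subject to the idmapd behavior described above.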
## Notes
- **NEVER restart NFS (nfsd)** on TrueNAS — it causes mount failures across ALL pods
cluster-wide. Only restart `nfsuserd` (the ID mapping daemon).
- Existing NFSv3 mounts continue working fine. The issue only affects NEW mounts.
- The `v4_v3owner` setting is persistent across TrueNAS reboots (stored in middleware config).
- The `nfsuserd` restart is NOT persistent — TrueNAS may restart it without the `-domain`
flag after a reboot. The `v4_domain` setting in the middleware config should handle this,
but verify after any TrueNAS restart.
- On Linux NFS servers (not FreeBSD/TrueNAS), the equivalent fix is setting `Domain` in
`/etc/idmapd.conf` on both server and all clients.

@ -0,0 +1,216 @@
---
name: openclaw-k8s-deployment
description: |
Deploy and troubleshoot OpenClaw gateway on Kubernetes. Use when:
(1) OpenClaw gateway won't start or shows "Telegram configured, not enabled yet",
(2) exec fails with "requires a paired node (none available)",
(3) gateway shows "Config invalid" for exec.host or exec.security values,
(4) OpenClaw can't write files (EACCES on workspace or home),
(5) gateway takes 5+ minutes to start (CPU throttling by VPA/LimitRange),
(6) 502 Bad Gateway from Traefik after pod restart,
(7) setting up Telegram bot channel,
(8) configuring modelrelay sidecar for free model routing.
Covers all non-obvious deployment gotchas discovered through trial and error.
author: Claude Code
version: 1.0.0
date: 2026-03-01
---
# OpenClaw Kubernetes Deployment
## Problem
Deploying OpenClaw as a Kubernetes pod involves many non-obvious configuration
requirements. The gateway process, Telegram integration, exec permissions, and
file ownership all have specific constraints not documented together.
## Context / Trigger Conditions
- Deploying OpenClaw from `ghcr.io/openclaw/openclaw` container image
- Running in Kubernetes with NFS volumes, Traefik ingress, Goldilocks/VPA
- Want Telegram bot integration, tool execution, and persistent state
## Solution
### 1. Gateway Configuration (openclaw.json)
**Required fields that aren't obvious:**
```json
{
  "gateway": {
    "mode": "local",
    "bind": "lan",
    "controlUi": {
      "dangerouslyDisableDeviceAuth": true,
      "dangerouslyAllowHostHeaderOriginFallback": true
    }
  },
  "wizard": {
    "lastRunAt": "2026-03-01T00:00:00.000Z",
    "lastRunVersion": "2026.2.26",
    "lastRunCommand": "configure",
    "lastRunMode": "local"
  }
}
```
- `gateway.mode = "local"`: **required**, or the gateway refuses to start
- `dangerouslyAllowHostHeaderOriginFallback = true`: required in v2026.2.26+
  for non-loopback Control UI (error: "non-loopback Control UI requires
  gateway.controlUi.allowedOrigins")
- `wizard` block: **required** for Telegram to start. Without it, the gateway logs
  "Telegram configured, not enabled yet" on every startup. The wizard block
  signals that initial setup was completed.
### 2. Exec Configuration
Valid values for `tools.exec`:
| Field | Valid Values | Notes |
|-------|-------------|-------|
| `host` | `sandbox`, `gateway`, `node` | NOT "local" — that's invalid |
| `security` | `deny`, `allowlist`, `full` | NOT "off" — that's invalid |
| `ask` | `"off"` | Disables confirmation prompts |
- `host = "gateway"` — runs commands on the container host directly
- `host = "node"` — requires a "paired node" companion app (doesn't work in containers)
- `host = "sandbox"` — requires Docker-in-Docker
- `security = "full"` — most permissive valid option
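A config can be checked against these values before the pod is restarted. The sketch below hard-codes the valid values from the table above; the function name is illustrative and this is not OpenClaw's own validation logic.

```python
# Valid values taken from the table in this section.
VALID_EXEC = {
    "host": {"sandbox", "gateway", "node"},
    "security": {"deny", "allowlist", "full"},
}


def invalid_exec_fields(exec_cfg: dict) -> list:
    """Return one message per `tools.exec` field that holds an invalid value."""
    errors = []
    for field, allowed in VALID_EXEC.items():
        value = exec_cfg.get(field)
        if value is not None and value not in allowed:
            errors.append(f"{field}={value!r} (expected one of {sorted(allowed)})")
    return errors


# The two invalid values called out above produce two messages.
for message in invalid_exec_fields({"host": "local", "security": "off"}):
    print(message)

print(invalid_exec_fields({"host": "gateway", "security": "full", "ask": "off"}))  # []
```

Running a check like this in the init container would surface a "Config invalid" condition before the gateway starts.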
### 3. Sandbox Mode
```json
{
  "agents": {
    "defaults": {
      "sandbox": { "mode": "off" },
      "workspace": "/workspace/infra"
    }
  }
}
```
- `sandbox.mode = "off"` disables Docker sandboxing
- `workspace` must be set explicitly — defaults to `~/.openclaw/workspace`
### 4. File Permissions
The init container runs as root but the main container runs as `node` (UID 1000).
**Must chown in init container:**
```sh
chown -R 1000:1000 /workspace/infra
chown -R 1000:1000 /openclaw-home
chmod 700 /openclaw-home
```
**Must create directories:**
```sh
mkdir -p /openclaw-home/agents/main/sessions \
/openclaw-home/credentials \
/openclaw-home/canvas \
/openclaw-home/devices \
/openclaw-home/cron
```
Without these: `EACCES: permission denied` errors for AGENTS.md, canvas,
cron/jobs.json, devices, and other runtime files.
### 5. Startup Command
```sh
node openclaw.mjs doctor --fix 2>/dev/null; exec node openclaw.mjs gateway --allow-unconfigured --bind lan
```
Run `doctor --fix` before the gateway to auto-enable Telegram and fix
config issues. Without this, Telegram stays "not enabled yet".
### 6. Resource Requirements
- **CPU limit: 2 cores minimum** — the Node.js gateway startup is CPU-intensive.
With 150-300m CPU, startup takes 5+ minutes.
- **Memory limit: 2Gi minimum** — the gateway OOM-kills at 1Gi during startup
(V8 heap exhaustion).
- **Goldilocks VPA will override these**: see the Goldilocks VPA entry under Notes.
### 7. Readiness Probe
```hcl
readiness_probe {
tcp_socket { port = 18789 }
initial_delay_seconds = 30
period_seconds = 10
}
```
Do NOT use a startup probe — the gateway can take 2-3 minutes to start listening
and a startup probe will kill it. Use readiness-only to prevent 502s from Traefik
during startup without killing the container.
### 8. Telegram Integration
```json
{
"channels": {
"telegram": {
"enabled": true,
"botToken": "...",
"dmPolicy": "allowlist",
"allowFrom": ["tg:USER_ID"],
"groupPolicy": "allowlist",
"streamMode": "partial"
}
}
}
```
Telegram won't start without:
1. The `wizard` block in config (signals setup was run)
2. `doctor --fix` at startup (auto-enables the channel)
3. Both `groupPolicy` and `streamMode` fields
### 9. NFS Volume Strategy
| Volume | Purpose | Type |
|--------|---------|------|
| `/home/node/.openclaw` | Persistent state (SOUL.md, sessions, memory, telegram) | NFS |
| `/tools` | Cached binaries (kubectl, terraform, terragrunt, python libs) | NFS |
| `/workspace` | Infra repo clone | NFS |
| `/data` | General data | NFS |
Using NFS for tools cache reduces restart time from ~2.5min to ~38s by skipping
binary downloads and pip installs on subsequent starts.
### 10. ModelRelay Sidecar
Deploy as a sidecar container for automatic free model routing:
```hcl
container {
  name    = "modelrelay"
  image   = "node:22-alpine"
  command = ["sh", "-c", "npm install -g modelrelay; exec modelrelay --port 7352"]
  # HCL has no ";" separator, so each env block lists its arguments on separate lines
  env {
    name  = "NVIDIA_API_KEY"
    value = "..."
  }
  env {
    name  = "OPENROUTER_API_KEY"
    value = "..."
  }
}
```
Configure as provider: `baseUrl = "http://127.0.0.1:7352/v1"`, model `auto-fastest`.
## Verification
1. `kubectl logs -c openclaw` should show `[gateway] listening on ws://0.0.0.0:18789`
2. No "Telegram configured, not enabled yet" message
3. No `EACCES` permission errors
4. `kubectl exec ... -- cat /proc/net/tcp` shows listening sockets
5. Telegram bot responds to `/start`
## Notes
- ConfigMap changes require pod restart (init container copies config at start)
- ConfigMap taint+reinit sometimes needed when Terraform state gets out of sync
- Goldilocks VPA recreates itself from namespace labels — must delete VPA on
every pod recreation if namespace has `goldilocks.fairwinds.com/vpa-update-mode`
- The `--allow-unconfigured` flag is needed for the gateway command
- v2026.2.26 introduced breaking change requiring `dangerouslyAllowHostHeaderOriginFallback`
## See also
- `openclaw-custom-model-provider` — basic model provider configuration
- `k8s-limitrange-oom-silent-kill` — LimitRange causing OOM (related but different)


@ -0,0 +1,169 @@
---
name: pfsense-dnsmasq-interface-binding
description: |
Restrict pfSense dnsmasq (DNS Forwarder) to specific interfaces to free port 53 on
other interfaces for port forwarding. Use when: (1) pfSense blocks port 53 NAT port
forward because dnsmasq is listening on *:53, (2) need to forward DNS from WAN to an
internal DNS server while preserving client source IPs, (3) dnsmasq shows *:53 in
sockstat despite --listen-address flags, (4) pfSense loses DNS resolution after
restricting dnsmasq interfaces, (5) NAT rdr rules for port 53 silently fail to
generate in /tmp/rules.debug.
author: Claude Code
version: 1.0.0
date: 2026-02-17
---
# pfSense dnsmasq Interface Binding for DNS Port Forwarding
## Problem
pfSense's dnsmasq (DNS Forwarder) binds to `*:53` by default. This prevents creating
NAT port forward rules for port 53 — pfSense silently skips generating the pf `rdr`
directive. You need to restrict dnsmasq to specific interfaces to free port 53 on other
interfaces (e.g., WAN) for forwarding to an internal DNS server.
## Context / Trigger Conditions
- Attempting to create a NAT port forward for port 53 on the WAN interface
- Port forward rule saves to config.xml but `pfctl -sn` shows no corresponding `rdr` rule
- `sockstat -4 | grep ":53"` shows `dnsmasq` on `*:53`
- Goal: Forward DNS queries from one network to an internal DNS server (e.g., Technitium)
while preserving client source IPs (no masquerading)
## Solution
### Step 1: Bind dnsmasq to specific interfaces
Set the interface field in pfSense's dnsmasq config:
```php
ssh admin@10.0.20.1 'php -r '"'"'
require_once("config.inc");
require_once("service-utils.inc");
global $config;
$config = parse_config(true);
$config["dnsmasq"]["interface"] = "lan,opt1"; // Only LAN and OPT1, NOT wan
write_config("Bind dnsmasq to LAN and OPT1 only");
'"'"''
```
This adds `--listen-address=<IP>` flags to dnsmasq but does NOT change socket binding.
### Step 2: Add bind-dynamic (CRITICAL)
Without `bind-dynamic`, dnsmasq still binds the socket to `*:53` even with
`--listen-address` flags. The `--listen-address` only controls which queries get
responses, not the actual socket binding.
```php
ssh admin@10.0.20.1 'php -r '"'"'
require_once("config.inc");
require_once("service-utils.inc");
global $config;
$config = parse_config(true);
$existing = base64_decode($config["dnsmasq"]["custom_options"]);
if (strpos($existing, "bind-dynamic") === false) {
$existing = "bind-dynamic\n" . $existing;
$config["dnsmasq"]["custom_options"] = base64_encode($existing);
write_config("Add bind-dynamic to restrict dnsmasq socket binding");
}
'"'"''
```
### Step 3: Add localhost listen address (CRITICAL)
pfSense's own `resolv.conf` points to `127.0.0.1`. Without this, pfSense itself
loses DNS resolution after the interface restriction.
```conf
# Add to custom_options (base64-encoded in config):
listen-address=127.0.0.1
```
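
Because `custom_options` is stored base64-encoded, the edits in Steps 2 and 3 amount to decode, merge, re-encode. The sketch below illustrates that round trip offline; `add_dnsmasq_options` is a hypothetical helper that only manipulates the string and does not talk to pfSense.

```python
import base64


def add_dnsmasq_options(encoded: str, wanted: list) -> str:
    """Prepend any missing option lines to base64-encoded custom_options."""
    lines = base64.b64decode(encoded).decode().splitlines()
    present = {line.strip() for line in lines}
    # Keep the requested order and skip options that are already there,
    # so running this twice produces the same result.
    missing = [option for option in wanted if option not in present]
    merged = "\n".join(missing + lines)
    return base64.b64encode(merged.encode()).decode()
```

Calling it with `["bind-dynamic", "listen-address=127.0.0.1"]` yields the value to write back into `$config["dnsmasq"]["custom_options"]`.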
### Step 4: Restart dnsmasq
```php
services_dnsmasq_configure();
```
### Step 5: Verify binding
```bash
sockstat -4 | grep ":53 "
# Should show specific IPs, not *:53:
# 127.0.0.1:53
# 10.0.10.1:53 (lan)
# 10.0.20.1:53 (opt1)
# NOT 192.168.1.2:53 (wan)
```
### Step 6: Add the port forward rule
**Critical format note**: The `source` field must use `array("any" => "")`, NOT
`array("network" => "192.168.1.0/24")`. The CIDR source format silently fails to
generate the pf `rdr` directive.
```php
ssh admin@10.0.20.1 'php -r '"'"'
require_once("config.inc");
require_once("filter.inc");
require_once("shaper.inc");
global $config;
$config = parse_config(true);
$rule = array(
"source" => array("any" => ""), // MUST be "any", not CIDR
"destination" => array(
"network" => "wanip",
"port" => "53"
),
"ipprotocol" => "inet",
"protocol" => "udp",
"target" => "10.0.20.204", // Internal DNS server
"local-port" => "53",
"interface" => "wan",
"associated-rule-id" => "pass",
"descr" => "DNS to internal DNS (preserve client IP)",
"created" => array("time" => (string)time(), "username" => "admin"),
"updated" => array("time" => (string)time(), "username" => "admin")
);
array_unshift($config["nat"]["rule"], $rule);
write_config("Add DNS port forward");
filter_configure();
'"'"''
```
### Step 7: Verify the redirect rule
```bash
pfctl -sn | grep "domain\|:53"
# Should show: rdr pass on vtnet0 inet proto udp from any to 192.168.1.2 port = domain -> 10.0.20.204
```
## Verification
1. pfSense own DNS: `nslookup google.com 127.0.0.1` (from pfSense shell)
2. Internal DNS: `nslookup google.com 10.0.20.1` (from LAN/OPT1 clients)
3. Port forward: `dig @192.168.1.2 example.com` (from WAN-side client)
4. Client IP: Check DNS server logs — should show real client IP, not pfSense IP
## Pitfalls
| Pitfall | Symptom | Fix |
|---------|---------|-----|
| Missing `bind-dynamic` | sockstat shows `*:53`, port forward still blocked | Add `bind-dynamic` to custom_options |
| Missing `listen-address=127.0.0.1` | pfSense loses all DNS resolution | Add to custom_options |
| Source `"network" => "CIDR"` in NAT rule | Rule saves to config but no `rdr` in `pfctl -sn` | Use `"any" => ""` instead |
| Using local `$config` variable | Config not persisted after PHP exit | Always use `global $config` |
| Not calling `filter_configure()` | Rule in config.xml but not in pf | Call after `write_config()` |
| Custom options not base64 | dnsmasq fails to start | pfSense stores custom_options as base64 |
## Notes
- `bind-dynamic` is preferred over `bind-interfaces` because it handles interfaces that
come up after dnsmasq starts (e.g., VPN tunnels)
- The pf `rdr` rule is a redirect, not masquerade — source IP is preserved
- dnsmasq custom_options in pfSense config.xml are base64-encoded
- Check `/tmp/rules.debug` for the generated pf ruleset (before loading into pf)
- Use `pfctl -sn` to see rules actually loaded in the running firewall
## See also
- `pfsense` — General pfSense management skill
- `k8s-ndots-search-domain-nxdomain-flood` — Related DNS optimization


@ -0,0 +1,105 @@
---
name: pfsense-nat-rule-creation
description: |
Create NAT port forward rules on pfSense programmatically via PHP/SSH.
Use when: (1) adding port forwards for new K8s services, (2) NAT rules
added via PHP don't appear in pfctl output, (3) config_read_array() throws
"undefined function" error, (4) destination "wanip" not working in NAT rules,
(5) rules saved to config.xml but not loaded into pfctl. Covers the correct
PHP array structure, config API differences between pfSense versions, and
the required pfctl reload step.
author: Claude Code
version: 1.0.0
date: 2026-02-21
---
# pfSense NAT Rule Creation via PHP
## Problem
Creating NAT port forward rules on pfSense programmatically via SSH/PHP has
multiple gotchas around the config API, rule structure, and rule loading.
## Context / Trigger Conditions
- Adding a port forward for a new Kubernetes service (e.g., TURN, game server)
- Using `ssh admin@10.0.20.1` + PHP to automate pfSense config
- NAT rules don't appear in `pfctl -sn` after `write_config()` + `filter_configure()`
- `config_read_array()` throws "Call to undefined function"
- Rules saved to config.xml but pfctl doesn't have them
## Solution
### Correct PHP for adding NAT rules
```php
<?php
require_once("config.inc");
require_once("filter.inc");
global $config; // NOT config_read_array() — that doesn't exist in pfSense 2.7.x
$config["nat"]["rule"][] = array(
"interface" => "wan",
"ipprotocol" => "inet", // Required! Must be "inet" for IPv4
"protocol" => "tcp/udp", // Or "udp" or "tcp"
"source" => array("any" => ""),
"destination" => array(
"network" => "wanip", // Use "network" => "wanip", NOT "address" => "wanip"
"port" => "3478" // Single port or "start:end" for range
),
"target" => "10.0.20.200", // Internal destination IP
"local-port" => "3478", // Internal port (for ranges, just the start port)
"descr" => "My port forward",
"associated-rule-id" => "pass" // Auto-create firewall pass rule
);
write_config("Description for config history");
filter_configure();
```
### Key gotchas
1. **`config_read_array()` doesn't exist** in pfSense 2.7.x. Use `global $config` instead.
2. **Destination format**: Use `"network" => "wanip"`, NOT `"address" => "wanip"` or `"address" => "192.168.1.2"`. The `"network"` key with `"wanip"` tells pfSense to resolve the WAN IP dynamically.
3. **`ipprotocol` is required**: Must include `"ipprotocol" => "inet"` or rules won't generate in `/tmp/rules.debug`.
4. **Port ranges**: Use `"port" => "49152:49252"` for ranges. The `"local-port"` should be just the start port — pfSense maps the range automatically.
5. **Rules may not load immediately**: After `write_config()` + `filter_configure()`, rules appear in `/tmp/rules.debug` but may not be in pfctl until the next filter reload. Force with:
```bash
pfctl -f /tmp/rules.debug
```
6. **SSH quoting**: The pfsense.py `php` command breaks on `\n` in strings. For multi-line PHP, write a `.php` file, `scp` it, and execute:
```bash
scp script.php admin@10.0.20.1:/tmp/
ssh admin@10.0.20.1 "php /tmp/script.php"
```
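
The rule structure and the port-range gotcha can be captured in a small builder, for instance when generating the `.php` file that gets copied over with `scp`. This is a sketch under stated assumptions: `build_nat_rule` is a hypothetical helper that mirrors the PHP array above as a Python dict and does not contact pfSense.

```python
def build_nat_rule(port_spec: str, target: str, protocol: str = "tcp/udp", descr: str = "") -> dict:
    """Mirror the PHP NAT rule array; port_spec is "3478" or "49152:49252"."""
    start, _, end = port_spec.partition(":")
    ports = [start] + ([end] if end else [])
    if not all(p.isdigit() and 1 <= int(p) <= 65535 for p in ports):
        raise ValueError(f"invalid port spec: {port_spec!r}")
    if end and int(end) < int(start):
        raise ValueError(f"range end before start: {port_spec!r}")
    return {
        "interface": "wan",
        "ipprotocol": "inet",          # required for IPv4 rules
        "protocol": protocol,
        "source": {"any": ""},
        "destination": {"network": "wanip", "port": port_spec},
        "target": target,
        "local-port": start,           # for ranges, only the start port
        "descr": descr,
        "associated-rule-id": "pass",
    }
```

`build_nat_rule("49152:49252", "10.0.20.200")` sets `destination.port` to the full range and `local-port` to `49152`, matching gotcha 4.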
### Execution via pfsense.py
For simple single-line PHP (no newlines or backslashes):
```bash
python3 .claude/pfsense.py php 'require_once("config.inc"); ...; echo "Done";'
```
For complex scripts, use scp + ssh as above.
## Verification
```bash
# Check rules in config
ssh admin@10.0.20.1 "grep 'YOUR_PORT' /cf/conf/config.xml"
# Check generated pf rules
ssh admin@10.0.20.1 "grep 'YOUR_PORT' /tmp/rules.debug"
# Check active pfctl rules
python3 .claude/pfsense.py pfctl "-sn" | grep YOUR_PORT
```
## Notes
- Existing working NAT rules on this pfSense use the same structure (check WireGuard port 51820 as reference)
- The `associated-rule-id: pass` auto-creates a WAN firewall rule to allow the forwarded traffic
- pfSense applies NAT rules across ALL interfaces when using the web UI, but PHP-created rules only apply to the specified interface
- See also: `pfsense` skill for general pfSense management


@ -0,0 +1,136 @@
---
name: proxmox-vm-disk-expansion-pitfalls
description: |
Troubleshoot common failures when expanding Proxmox VM disks on Ubuntu 24.04
cloud-init images and draining Kubernetes nodes. Use when: (1) growpart fails
with "command not found" on Ubuntu cloud-init VMs, (2) grep -P fails on macOS
with "invalid option -- P", (3) kubectl drain times out with pods stuck
terminating, (4) filesystem shows old size after qm resize. Covers
cloud-guest-utils installation, macOS-portable regex parsing, drain timeout
tuning, and recovery from partial failures.
author: Claude Code
version: 1.0.0
date: 2026-02-13
---
# Proxmox VM Disk Expansion Pitfalls
## Problem
Expanding disk storage on Proxmox-hosted Ubuntu 24.04 cloud-init VMs (used as
Kubernetes nodes) fails at multiple points due to missing tools, cross-platform
incompatibilities, and Kubernetes drain timeouts.
## Context / Trigger Conditions
- Running disk expansion scripts from macOS against Proxmox + Ubuntu VMs
- Ubuntu 24.04 cloud-init images (the default k8s node template)
- Kubernetes nodes with many pods or stateful workloads
- Using `scripts/extend_vm_storage.sh` or similar automation
## Issues and Solutions
### 1. `growpart: command not found` on Ubuntu 24.04
**Symptom**: After `qm resize`, SSH into VM, run `growpart /dev/sda 1` — fails
with "command not found". `resize2fs` then reports "Nothing to do!" because the
partition table hasn't been updated.
**Root cause**: Ubuntu 24.04 cloud-init images don't include `cloud-guest-utils`
by default. The `growpart` tool (which updates the partition table to use new
disk space) is in this package.
**Fix**:
```bash
sudo apt-get update -qq && sudo apt-get install -y -qq cloud-guest-utils
sudo growpart /dev/sda 1
sudo resize2fs /dev/sda1
```
**Prevention**: Check for `growpart` before attempting partition expansion:
```bash
if ! command -v growpart &>/dev/null; then
sudo apt-get update -qq && sudo apt-get install -y -qq cloud-guest-utils
fi
```
### 2. `grep -P` (PCRE) not available on macOS
**Symptom**: Script running on macOS fails with `grep: invalid option -- P`.
**Root cause**: macOS ships BSD grep, which doesn't support `-P` (Perl-compatible
regex). GNU grep (from Homebrew) does, but scripts shouldn't assume it's installed.
**Fix**: Replace `grep -oP 'pattern\Kcapture'` with portable `sed`:
```bash
# BAD (GNU grep only):
CURRENT_SIZE=$(echo "$LINE" | grep -oP 'size=\K[0-9]+G')
# GOOD (portable):
CURRENT_SIZE=$(echo "$LINE" | sed -n 's/.*size=\([0-9]*G\).*/\1/p')
```
**General rule**: In scripts that run on macOS, avoid `grep -P`, and account for
the `sed -i ''` (BSD) vs `sed -i` (GNU) and `date` flag differences. Use `sed` with
basic regex or the bash built-in `[[ =~ ]]` for pattern matching.
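
Where the automation can call Python instead of shell tools, the standard `re` module behaves the same on macOS and Linux, which sidesteps the grep and sed differences entirely. This is an alternative to the `sed` fix, not what the script uses; `parse_disk_size` and the sample line are hypothetical.

```python
import re
from typing import Optional


def parse_disk_size(line: str) -> Optional[str]:
    """Extract a value like "64G" from a disk config line, or None."""
    match = re.search(r"size=([0-9]+G)", line)
    return match.group(1) if match else None


# Hypothetical disk line in the style of Proxmox VM config output.
sample = "scsi0: local-lvm:vm-201-disk-0,size=64G"
size = parse_disk_size(sample)
```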
### 3. `kubectl drain` timeout with stuck pods
**Symptom**: `kubectl drain --timeout=120s` fails with "context deadline exceeded"
for multiple pods. Pods are evicted but don't terminate in time.
**Root cause**: Some pods (stateful services like ClickHouse, Paperless-ngx,
OnlyOffice) need more time to shut down gracefully. 120s isn't enough when many
pods are draining simultaneously.
**Fix**: Use `--force` flag and a longer timeout, or retry:
```bash
# First attempt with standard timeout
kubectl drain <node> --ignore-daemonsets --delete-emptydir-data --timeout=120s
# If it fails, force with longer timeout (pods already evicting)
kubectl drain <node> --ignore-daemonsets --delete-emptydir-data --timeout=300s --force
```
**Note**: After a failed drain, the node is already cordoned. A second drain
attempt only needs to wait for already-evicting pods to finish. `--force` also
deletes pods that are not managed by a controller, so check for bare pods first.
### 4. Recovery from partial failure
If the script fails mid-way (after drain but before uncordon):
```bash
# Check VM status
ssh root@192.168.1.127 "qm status <vmid>"
# Start VM if stopped
ssh root@192.168.1.127 "qm start <vmid>"
# Uncordon node
kubectl --kubeconfig "$(pwd)/config" uncordon <node-name>
```
## Verification
After successful expansion:
```bash
# On the VM
df -h /
# Should show new size (128G disk → ~126G usable for ext4)
# On the cluster
kubectl get node <name>
# Should show Ready status
```
## Notes
- The k8s node VMs use direct partition layout (`/dev/sda1`), not LVM, despite
the script handling both paths
- `growpart` returns exit code 1 for "NOCHANGE" (partition already at max) —
this is not an error
- Proxmox `qm resize` uses `scsi0` as the disk identifier for these VMs
- SSH host keys may change if VMs are recreated or network changes — use
`-o StrictHostKeyChecking=no` in automated scripts
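
The `growpart` exit-code note matters for automation, since a naive "non-zero means failure" check aborts on a partition that is already at full size. A minimal sketch of the interpretation, assuming the tool prints a line starting with `NOCHANGE` in that case; `growpart_ok` is a hypothetical helper.

```python
def growpart_ok(returncode: int, stdout: str) -> bool:
    """Treat exit 0 (grown) and exit 1 with NOCHANGE (already at max) as success."""
    if returncode == 0:
        return True
    # Assumption: the no-op case is reported on stdout with a NOCHANGE prefix.
    return returncode == 1 and stdout.lstrip().startswith("NOCHANGE")
```

A wrapper around `subprocess.run` could pass `result.returncode` and `result.stdout` to this function and continue to `resize2fs` only when it returns `True`.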
See also: `extend-vm-storage.md` (the operational skill for running the script)


@ -0,0 +1,182 @@
---
name: python-filename-sanitization
description: |
Secure filename sanitization pattern for Python web applications. Use when:
(1) Accepting user-provided filenames for file operations, (2) Building file
rename/upload functionality, (3) Preventing path traversal attacks (../../../etc/passwd),
(4) Preventing shell injection through filenames, (5) FastAPI/Flask file handling.
Provides regex-based whitelist approach with pathlib for safe file operations.
author: Claude Code
version: 1.0.0
date: 2025-01-31
---
# Python Filename Sanitization
## Problem
User-provided filenames can contain malicious characters that enable path traversal
attacks, shell injection, or filesystem corruption. Direct use of user input in
file paths is a security vulnerability.
## Context / Trigger Conditions
- Building file upload, rename, or download functionality
- User can specify filenames via API or form input
- Files are stored on server filesystem
- Need to prevent: `../`, shell metacharacters, null bytes, etc.
## Solution
### Complete Sanitization Function
```python
import re
from pathlib import Path


def sanitize_filename(filename: str, max_length: int = 200) -> str:
    """
    Sanitize a filename to prevent path traversal and shell injection.

    Only allows alphanumeric characters, spaces, hyphens, underscores,
    parentheses, and dots.
    """
    if not filename:
        raise ValueError("Filename cannot be empty")

    # Remove any path components (prevent path traversal)
    filename = Path(filename).name

    # Only allow safe characters: alphanumeric, space, hyphen, underscore, parentheses, dot
    # This regex removes anything that isn't in the allowed set
    safe_filename = re.sub(r'[^a-zA-Z0-9\s\-_().]', '', filename)

    # Collapse multiple spaces/dots
    safe_filename = re.sub(r'\s+', ' ', safe_filename)
    safe_filename = re.sub(r'\.+', '.', safe_filename)

    # Strip leading/trailing whitespace and dots
    safe_filename = safe_filename.strip(' .')

    # Limit length
    if len(safe_filename) > max_length:
        safe_filename = safe_filename[:max_length]

    if not safe_filename:
        raise ValueError("Filename contains no valid characters")

    return safe_filename
```
### FastAPI Integration Example
```python
from fastapi import APIRouter, HTTPException
from pydantic import BaseModel
from pathlib import Path

# sanitize_filename is the function defined above
router = APIRouter()


class RenameRequest(BaseModel):
    new_name: str


@router.patch("/files/{file_id}/rename")
async def rename_file(file_id: str, request: RenameRequest):
    """Rename a file with sanitized input."""
    file_dir = Path("/data/files") / file_id

    if not file_dir.exists():
        raise HTTPException(status_code=404, detail="File not found")

    # Find existing file
    files = list(file_dir.glob("*"))
    if not files:
        raise HTTPException(status_code=404, detail="No file found")

    current_file = files[0]
    current_extension = current_file.suffix

    # Sanitize the new name
    try:
        safe_name = sanitize_filename(request.new_name)
    except ValueError as e:
        raise HTTPException(status_code=400, detail=str(e))

    # Preserve original extension
    if not safe_name.lower().endswith(current_extension.lower()):
        safe_name = safe_name + current_extension

    # Create new path (same directory, new filename)
    new_file = file_dir / safe_name

    # Check for conflicts
    if new_file.exists() and new_file != current_file:
        raise HTTPException(status_code=400, detail="A file with that name already exists")

    # Rename using pathlib (no shell commands!)
    current_file.rename(new_file)

    return {"status": "renamed", "new_filename": safe_name}
```
## Key Security Principles
### 1. Whitelist, Don't Blacklist
```python
# BAD: Trying to block dangerous characters
filename = filename.replace('../', '').replace('\x00', '')
# GOOD: Only allow known-safe characters
safe_filename = re.sub(r'[^a-zA-Z0-9\s\-_().]', '', filename)
```
### 2. Use pathlib, Not Shell Commands
```python
# BAD: Shell command (vulnerable to injection)
os.system(f'mv "{old_path}" "{new_path}"')
# GOOD: Pure Python (no shell)
old_path.rename(new_path)
```
### 3. Extract Basename First
```python
# BAD: User could submit "../../../etc/passwd"
filename = user_input
# GOOD: Extract just the filename part
filename = Path(user_input).name
```
### 4. Validate After Sanitization
```python
# Ensure something remains after sanitization
if not safe_filename:
    raise ValueError("Filename contains no valid characters")
```
## Verification
```python
# Test cases that should be handled safely
assert sanitize_filename("normal.txt") == "normal.txt"
assert sanitize_filename("../../../etc/passwd") == "passwd"  # only the basename survives
assert sanitize_filename("file; rm -rf /") == "file rm -rf"
assert sanitize_filename(" spaces .txt") == "spaces .txt"  # inner space is kept
assert sanitize_filename("$(whoami).txt") == "(whoami).txt"  # parentheses are allowed

# Test cases that should raise errors
for bad in ("", "$#@!"):  # empty input; no valid characters
    try:
        sanitize_filename(bad)
    except ValueError:
        pass
    else:
        raise AssertionError(f"expected ValueError for {bad!r}")
```
## Notes
- This is intentionally restrictive; expand the regex if you need Unicode support
- For Unicode filenames, consider `unicodedata.normalize('NFKD', ...)` first
- Max length of 200 is conservative; filesystem limits vary (255 bytes typical)
- Always preserve file extensions when renaming to avoid breaking file associations
- Consider adding a UUID prefix for guaranteed uniqueness in upload scenarios
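
The Unicode and UUID suggestions in the notes above can be sketched as two small pre- and post-processing helpers. Both `ascii_fold` and `with_unique_prefix` are hypothetical names; folding runs before the whitelist regex, and the prefix is added to the already sanitized name.

```python
import unicodedata
import uuid


def ascii_fold(filename: str) -> str:
    """Decompose accented characters (NFKD) and drop what ASCII cannot represent."""
    decomposed = unicodedata.normalize("NFKD", filename)
    return decomposed.encode("ascii", "ignore").decode("ascii")


def with_unique_prefix(filename: str) -> str:
    """Prefix a sanitized filename with a random 32-character hex UUID."""
    return f"{uuid.uuid4().hex}_{filename}"
```

Note that folding silently drops characters with no ASCII decomposition (for example CJK text), so a name made only of such characters still ends in the "no valid characters" error.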
## References
- [OWASP Path Traversal](https://owasp.org/www-community/attacks/Path_Traversal)
- [CWE-22: Path Traversal](https://cwe.mitre.org/data/definitions/22.html)
- [Python pathlib documentation](https://docs.python.org/3/library/pathlib.html)


@ -0,0 +1,97 @@
---
name: terraform-state-identity-mismatch
description: |
Fix Terraform "Unexpected Identity Change" errors during plan/apply. Use when:
(1) Terraform fails with "the Terraform Provider unexpectedly returned a different
identity", (2) State refresh shows identity mismatch between stored and current values,
(3) Resource was created but terraform apply timed out, leaving state inconsistent.
Solution involves removing and reimporting the affected resource.
author: Claude Code
version: 1.0.0
date: 2026-01-28
---
# Terraform State Identity Mismatch Fix
## Problem
Terraform fails during plan or apply with an "Unexpected Identity Change" error,
indicating the stored state identity doesn't match what the provider returns when
reading the resource.
## Context / Trigger Conditions
- Error message contains: "Unexpected Identity Change: During the read operation,
the Terraform Provider unexpectedly returned a different identity"
- Often occurs after a terraform apply times out mid-creation
- Resource exists in the cluster/cloud but state is corrupted
- Common with Kubernetes provider after deployment rollout timeouts
## Solution
### Step 1: Identify the affected resource
The error message includes the resource address:
```
with module.kubernetes_cluster.module.resume["resume"].kubernetes_deployment.resume
```
### Step 2: Remove from state
```bash
terraform state rm 'module.kubernetes_cluster.module.resume["resume"].kubernetes_deployment.resume'
```
Note: Use single quotes around the address to handle brackets properly.
### Step 3: Import the resource back
```bash
terraform import 'module.kubernetes_cluster.module.resume["resume"].kubernetes_deployment.resume' <namespace>/<name>
```
For Kubernetes deployments, the import ID is `namespace/deployment-name`.
### Step 4: Verify with plan
```bash
terraform plan -target=<module-path>
```
Should show minimal or no changes if import was successful.
### Step 5: Apply to sync any drift
```bash
terraform apply -target=<module-path>
```
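
The remove-and-reimport sequence is easy to get wrong because of the brackets and double quotes in module addresses. A minimal sketch that builds correctly quoted command lines with the standard `shlex` module; `reimport_commands` is a hypothetical helper and only returns strings, it does not run Terraform.

```python
import shlex


def reimport_commands(address: str, import_id: str) -> list:
    """Build the state rm / import command lines with shell-safe quoting."""
    quoted_address = shlex.quote(address)  # wraps in single quotes when needed
    return [
        f"terraform state rm {quoted_address}",
        f"terraform import {quoted_address} {shlex.quote(import_id)}",
    ]
```

Printing the result for review before executing it keeps a human in the loop for what is a state-modifying operation.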
## Verification
- `terraform plan` runs without identity errors
- `terraform apply` completes successfully
- Resource still exists and functions correctly
## Example
**Error:**
```
Error: Unexpected Identity Change
Current Identity: cty.ObjectVal(map[string]cty.Value{"api_version":cty.NullVal...})
New Identity: cty.ObjectVal(map[string]cty.Value{"api_version":cty.StringVal("apps/v1")...})
with module.kubernetes_cluster.module.resume["resume"].kubernetes_deployment.resume
```
**Fix:**
```bash
terraform state rm 'module.kubernetes_cluster.module.resume["resume"].kubernetes_deployment.resume'
# Output: Removed ... Successfully removed 1 resource instance(s).
terraform import 'module.kubernetes_cluster.module.resume["resume"].kubernetes_deployment.resume' resume/resume
# Output: Import successful!
terraform apply -target=module.kubernetes_cluster.module.resume -auto-approve
# Output: Apply complete! Resources: 0 added, 1 changed, 0 destroyed.
```
## Notes
- This is a provider bug, not user error - consider reporting to provider maintainers
- The resource continues to work fine; only the terraform state is affected
- Always verify the resource exists before importing (don't import non-existent resources)
- For Kubernetes resources, import IDs are typically `namespace/name`
- For AWS resources, import IDs vary by resource type (check provider docs)
- Consider adding `-lock=false` if state locking causes issues during recovery
## See Also
- Terraform state management documentation
- Kubernetes provider import documentation


@ -0,0 +1,405 @@
---
name: traefik-helm-configuration
description: |
Consolidated Traefik Helm chart configuration skill covering HTTP/3 (QUIC), UDP
cross-namespace routing, and plugin download failures. Use when:
(1) enabling HTTP/3 on Traefik or Alt-Svc header shows wrong port (e.g., 8443 instead of 443),
(2) HTTP/3 is configured in Helm values but not working end-to-end,
(3) Cloudflare-proxied domains need HTTP/3 enabled,
(4) custom UDP entrypoints don't appear in the LoadBalancer Service,
(5) IngressRouteUDP logs show "udp service is not in the parent resource namespace",
(6) DNS or other UDP traffic through Traefik times out despite correct IngressRouteUDP config,
(7) all Traefik routes suddenly return 404 after a restart or pod recreation,
(8) Traefik logs show "Plugins are disabled because an error has occurred",
(9) plugin download fails with "context deadline exceeded" for crowdsec-bouncer or rewrite-body.
author: Claude Code
version: 1.0.0
date: 2026-02-22
---
# Traefik Helm Chart Configuration
Consolidated guide for three common Traefik Helm chart issues: HTTP/3 (QUIC) enablement,
UDP cross-namespace routing, and plugin download failures causing global 404s.
---
## HTTP/3 (QUIC)
### Problem
You want to enable HTTP/3 (QUIC) on a Traefik ingress controller in Kubernetes so that
clients can negotiate HTTP/3 connections via the `Alt-Svc` response header.
### Context / When to Use
- Enabling HTTP/3 for the first time on Traefik
- Troubleshooting HTTP/3 not working despite configuration
- Alt-Svc header shows internal container port (8443) instead of external port (443)
- Need to enable HTTP/3 on both origin (Traefik) and CDN (Cloudflare)
### Solution
#### Step 1: Configure Traefik Helm Chart Values
In the Traefik Helm release values, add `http3` configuration to the `websecure` entrypoint:
```hcl
# In modules/kubernetes/traefik/main.tf
ports = {
websecure = {
port = 8443
exposedPort = 443
protocol = "TCP"
http = {
tls = {
enabled = true
}
}
# Enable HTTP/3 (QUIC)
http3 = {
enabled = true
advertisedPort = 443 # CRITICAL: Must match the external port
}
}
}
```
**Key gotcha: `advertisedPort = 443`**
Without `advertisedPort`, Traefik advertises the *internal container port* (8443) in the
`Alt-Svc` header:
```
Alt-Svc: h3=":8443"; ma=2592000
```
This is wrong because clients connect on external port 443, not 8443. The correct header is:
```
Alt-Svc: h3=":443"; ma=2592000
```
Setting `advertisedPort = 443` fixes this.
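
The advertised port can be checked programmatically from a captured header value. A minimal sketch, assuming the header has the simple form shown above; `h3_advertised_port` is a hypothetical helper and does not handle bracketed IPv6 hosts.

```python
import re
from typing import Optional


def h3_advertised_port(alt_svc: str) -> Optional[int]:
    """Return the port advertised for h3 in an Alt-Svc value, or None."""
    # Matches h3=":443" as well as h3="host:443"; ignores h3-29 and h2 entries.
    match = re.search(r'\bh3="[^":]*:(\d+)"', alt_svc)
    return int(match.group(1)) if match else None
```

A result of `8443` for a public hostname indicates that `advertisedPort` is missing from the Helm values.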
#### Step 2: Ensure Helm Chart Fully Re-renders
Changing `http3.enabled=true` in values alone may not cause the Helm chart to add the
required UDP port to the Service and Deployment specs. The Traefik Helm chart templates
need to re-render to include `websecure-http3: 443/UDP` in the Service.
If the Service doesn't show a UDP port after applying:
- See the companion skill `helm-release-force-rerender` for fixing this
- The root cause is that `helm upgrade --reuse-values` (Terraform's default behavior)
may not trigger template re-rendering for structural changes like adding new ports
After a successful apply, verify the Service has the UDP port:
```bash
kubectl get svc traefik -n traefik -o yaml | grep -A5 "443"
```
Expected output should include both:
```yaml
- name: websecure
port: 443
protocol: TCP
targetPort: websecure
- name: websecure-http3
port: 443
protocol: UDP
targetPort: websecure-http3
```
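
The same check can be scripted against the JSON form of the Service (`kubectl get svc traefik -n traefik -o json`). A minimal sketch; `has_http3_ports` is a hypothetical helper that only inspects an already parsed dict.

```python
def has_http3_ports(service: dict, port: int = 443) -> bool:
    """True when the Service exposes the given port over both TCP and UDP."""
    protocols = {
        entry.get("protocol", "TCP")  # Kubernetes defaults protocol to TCP
        for entry in service.get("spec", {}).get("ports", [])
        if entry.get("port") == port
    }
    return {"TCP", "UDP"} <= protocols
```

If this returns `False` after an apply, the chart did not render the `websecure-http3` port and the re-render steps in this section apply.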
#### Step 3: Enable HTTP/3 on Cloudflare (if using Cloudflare proxy)
For Cloudflare-proxied domains, HTTP/3 must also be enabled at the Cloudflare zone level.
**Cloudflare Provider v4** (current in this repo):
```hcl
resource "cloudflare_zone_settings_override" "http3" {
zone_id = var.cloudflare_zone_id
settings {
http3 = "on" # String values: "on" or "off"
}
}
```
**Note**: In Cloudflare provider v5, this uses `cloudflare_zone_setting` (singular) with
different syntax. The v4 resource is `cloudflare_zone_settings_override` (plural + override).
#### Step 4: Verify End-to-End
##### Testing from macOS
macOS system curl does NOT support HTTP/3. Install curl with HTTP/3:
```bash
brew install curl
```
Then use the Homebrew version explicitly:
```bash
# Test HTTP/3 negotiation (Alt-Svc header)
/opt/homebrew/opt/curl/bin/curl -sI https://example.viktorbarzin.me 2>&1 | grep -i alt-svc
# Expected: alt-svc: h3=":443"; ma=2592000
# Test actual HTTP/3 connection
/opt/homebrew/opt/curl/bin/curl --http3-only -sI https://example.viktorbarzin.me
# Expected: HTTP/3 200
```
##### Testing from within the Cluster
```bash
# Use a curl image with HTTP/3 support (amd64 only)
kubectl run curl-h3 --rm -it --image=ymuski/curl-http3 --restart=Never -- \
curl --http3-only -sI https://example.viktorbarzin.me
# Note: ymuski/curl-http3 is amd64-only; it will fail on arm64 nodes
```
##### Checking Traefik Logs
```bash
kubectl logs -n traefik -l app.kubernetes.io/name=traefik --tail=100 | grep -i quic
```
### Verification Checklist
1. Traefik Service shows UDP port 443 (`websecure-http3`)
2. `Alt-Svc` response header shows `h3=":443"` (not `h3=":8443"`)
3. `/opt/homebrew/opt/curl/bin/curl --http3-only` successfully connects
4. Cloudflare zone has HTTP/3 enabled (for proxied domains)
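Checklist item 2 is easy to misread by eye, so it can help to extract the advertised port directly. This is a sketch under assumptions: `alt_svc_h3_port` is a hypothetical helper (not part of this repo) that reads raw response headers on stdin and prints the port advertised for `h3`.

```bash
# Hypothetical helper (not part of this repo).
# stdin: raw HTTP response headers (for example from `curl -sI`).
# Prints the port advertised for h3 in the first matching Alt-Svc header,
# or nothing if no such header is present.
alt_svc_h3_port() {
  tr -d '\r' |
    sed -n 's/^[Aa][Ll][Tt]-[Ss][Vv][Cc]:.*h3=":\([0-9][0-9]*\)".*/\1/p' |
    head -n 1
}

# Usage (assumes the Homebrew curl path used in this guide):
#   /opt/homebrew/opt/curl/bin/curl -sI https://example.viktorbarzin.me | alt_svc_h3_port
```

A result of `8443` instead of `443` would indicate that Traefik is advertising its internal entrypoint port rather than the exposed one.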
### Current Configuration (This Repo)
- **Traefik config**: `modules/kubernetes/traefik/main.tf` (lines 89-92)
- **Cloudflare HTTP/3**: `modules/kubernetes/cloudflared/cloudflare.tf` (line 153)
- **MetalLB IP**: 10.0.20.202 (Traefik LoadBalancer service)
### Notes
- HTTP/3 uses QUIC over UDP. Firewalls must allow UDP 443 inbound.
- Traefik automatically handles TLS for HTTP/3 using the same certs as HTTPS.
- The `Alt-Svc` header is sent on HTTP/1.1 and HTTP/2 responses to advertise that HTTP/3
is available. Clients then switch to HTTP/3 on subsequent requests.
- For non-Cloudflare (direct DNS) domains, only the Traefik-side config is needed.
- Cloudflare handles its own HTTP/3 negotiation with end users; the origin connection
between Cloudflare and Traefik uses HTTP/1.1 or HTTP/2 (not HTTP/3).
---
## UDP Cross-Namespace Routing
### Problem
Adding a custom UDP entrypoint (e.g., DNS on port 53) to Traefik v3 via Helm chart values
doesn't work out of the box. Traffic times out even though the Traefik pod listens on the
port internally. Two separate issues compound:
1. The Helm chart defaults `expose` to `false` for custom entrypoints -- the port is never
added to the LoadBalancer Service
2. `allowCrossNamespace` defaults to `false` -- IngressRouteUDP in namespace A can't
reference a Service in namespace B
### Context / Trigger Conditions
- Traefik Helm chart v39.0.0+ (Traefik v3.x)
- Custom UDP entrypoint defined in `ports` values
- `IngressRouteUDP` referencing a service in a different namespace
- Symptoms:
- `kubectl get svc traefik` doesn't show your custom UDP port
- UDP traffic to the LoadBalancer IP times out
- Traefik logs show: `"udp service <namespace>/<service> is not in the parent resource namespace <traefik-namespace>"`
- `netstat -ulnp` inside Traefik pod confirms it IS listening on the port
### Solution
#### Fix 1: Expose the UDP port on the Service
In the Helm values, add `expose = { default = true }` to the entrypoint:
```hcl
# Terraform HCL
ports = {
  dns-udp = {
    port = 5353
    exposedPort = 53
    protocol = "UDP"
    expose = { default = true } # <-- Required for custom entrypoints
  }
}
```
```yaml
# Helm values YAML equivalent
ports:
  dns-udp:
    port: 5353
    exposedPort: 53
    protocol: UDP
    expose:
      default: true
```
Note: The built-in `web` and `websecure` entrypoints have `expose.default = true` by
default, but custom entrypoints do NOT.
#### Fix 2: Enable cross-namespace CRD references
In the Helm values, add `allowCrossNamespace = true` to the kubernetesCRD provider:
```hcl
# Terraform HCL
providers = {
  kubernetesCRD = {
    enabled = true
    allowCrossNamespace = true # <-- Required for cross-namespace IngressRouteUDP
  }
}
```
```yaml
# Helm values YAML
providers:
  kubernetesCRD:
    enabled: true
    allowCrossNamespace: true
```
This is required whenever an `IngressRouteUDP` (or `IngressRouteTCP`, `IngressRoute`)
references a Kubernetes Service in a different namespace.
### Verification
```bash
# 1. Verify the port appears in the Service
kubectl get svc -n traefik traefik -o jsonpath='{.spec.ports[*].name}'
# Should include your custom entrypoint name (e.g., "dns-udp")
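# 2. Check Traefik logs for cross-namespace errors
# (--tail=-1 is needed: with a label selector, kubectl logs only returns the last 10 lines per pod)
kubectl logs -n traefik -l app.kubernetes.io/name=traefik --tail=-1 | grep "not in the parent resource namespace"
# Should return nothing after the fix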
# 3. Test the UDP service
dig @<traefik-lb-ip> example.com
```
### Example
DNS forwarding through Traefik to Technitium DNS:
- IngressRouteUDP in `traefik` namespace routes `dns-udp` entrypoint to
`technitium-dns:53` in `technitium` namespace
- Without Fix 1: port 53 never exposed on LoadBalancer -- traffic can't reach Traefik
- Without Fix 2: Traefik rejects the route -- it logs an error roughly every 60 seconds
- With both fixes: DNS queries to LoadBalancer IP:53 -> Traefik -> Technitium
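For reference, the route described above could be generated as follows. This is a minimal sketch, not the manifest used in this repo: `render_ingressroute_udp` is a hypothetical helper and the route name `dns` is illustrative. The namespaces, entrypoint, service name, and port come from the example.

```bash
# Hypothetical helper (not part of this repo): prints an IngressRouteUDP manifest.
# Args: <route-name> <route-namespace> <entrypoint> <service> <service-namespace> <port>
render_ingressroute_udp() {
  cat <<EOF
apiVersion: traefik.io/v1alpha1
kind: IngressRouteUDP
metadata:
  name: $1
  namespace: $2
spec:
  entryPoints:
    - $3
  routes:
    - services:
        - name: $4
          namespace: $5
          port: $6
EOF
}

# Usage (the route name "dns" is illustrative):
#   render_ingressroute_udp dns traefik dns-udp technitium-dns technitium 53 \
#     | kubectl apply -f -
```

The `namespace` field under `services` is what makes this a cross-namespace reference, and therefore what Fix 2 has to permit.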
### Notes
1. **Debugging order matters**: Fix 1 (expose) must come first. Without the port on the
Service, you can't even test if the routing works. Fix 2 (cross-namespace) errors only
appear in Traefik logs, not as user-visible failures.
2. **`allowCrossNamespace` is a security consideration**: It allows any IngressRoute CRD
to reference services in any namespace. If this is too broad, consider using a
`TraefikService` resource or moving the IngressRouteUDP to the target namespace.
3. **Rolling update**: Changing `allowCrossNamespace` triggers a Traefik pod restart
(new CLI args). Changing `expose` only updates the Service (no pod restart needed).
4. **This applies to TCP too**: `IngressRouteTCP` with cross-namespace services needs the
same `allowCrossNamespace` setting.
---
## Plugin Download Failure (Global 404)
### Problem
After a node maintenance operation (containerd restart, node drain/uncordon, etc.),
all Traefik-managed routes return 404. Services, Ingresses, and Middlewares all exist
and look correct, making this extremely confusing to debug.
### Context / Trigger Conditions
- ALL Traefik routes return 404 simultaneously (not just one service)
- Traefik pods are Running and Ready
- Ingress resources exist with correct annotations
- Middlewares exist in the correct namespaces
- TLS secrets exist
- Traefik startup logs contain: `Plugins are disabled because an error has occurred`
- Plugin download error: `unable to download plugin ... context deadline exceeded`
- Happened after a node restart, containerd restart, or network disruption
### Root Cause
Traefik downloads plugins (crowdsec-bouncer, rewrite-body, etc.) from
`plugins.traefik.io` on **every pod startup**. If the download fails (network
unreachable, DNS not ready, timeout), Traefik **disables ALL plugins entirely**.
Since the `crowdsec` middleware is a plugin-based middleware referenced in virtually
every Ingress annotation (`traefik-crowdsec@kubernetescrd`), Traefik treats the
missing plugin middleware as a fatal routing error and returns 404 for every route
that references it -- which is typically all of them.
### Solution
```bash
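# 1. Confirm the diagnosis - check Traefik startup logs
# (--tail=-1 is needed: with a label selector, kubectl logs only returns the last 10 lines per pod)
kubectl logs -n traefik -l app.kubernetes.io/name=traefik --tail=-1 | head -20
# Look for: "Plugins are disabled because an error has occurred"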
# 2. Verify outbound connectivity is restored
kubectl exec -n traefik $(kubectl get pods -n traefik -l app.kubernetes.io/name=traefik \
-o jsonpath='{.items[0].metadata.name}') -- wget -q -O- --timeout=5 https://plugins.traefik.io
# 3. Rollout restart to retry plugin download
kubectl rollout restart deployment -n traefik traefik
# 4. Verify plugins loaded (--tail=-1 so the startup lines are included)
kubectl logs -n traefik -l app.kubernetes.io/name=traefik --tail=-1 | grep "Plugins"
# Should show: "Plugins loaded."
# 5. Verify routes work
curl -s -o /dev/null -w "%{http_code}" -H "Host: viktorbarzin.me" https://10.0.20.202 -k
# Should return 200 instead of 404
```
### Verification
- Traefik logs show `Plugins loaded.` (not `Plugins are disabled`)
- Routes return expected HTTP status codes (200, 302, etc.) instead of 404
- `kubectl logs -n traefik <pod> | grep "does not exist"` shows no middleware errors
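The first check can be reduced to a single word that is easy to alert on. The following is a sketch under assumptions: `plugin_status` is a hypothetical helper (not part of this repo), and it matches only the two log messages quoted in this section.

```bash
# Hypothetical helper (not part of this repo).
# stdin: Traefik log lines. Prints "disabled", "loaded", or "unknown".
# "disabled" wins if both messages appear (for example across several pods),
# because it is the state that needs action.
plugin_status() {
  awk '
    /Plugins are disabled because an error has occurred/ { s = "disabled" }
    /Plugins loaded/ && s != "disabled" { s = "loaded" }
    END { print (s == "" ? "unknown" : s) }
  '
}

# Usage:
#   kubectl logs -n traefik -l app.kubernetes.io/name=traefik --tail=-1 | plugin_status
```

An `unknown` result usually means the startup lines were not in the input, so it should be treated as "check again with full logs" rather than as healthy.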
### Why This Is Hard to Debug
1. **Traefik pods show Running/Ready** -- health checks pass even without plugins
2. **All Kubernetes resources look correct** -- Ingresses, Services, Middlewares all exist
3. **The error is in startup logs only** -- not in per-request logs (requests just get 404)
4. **The 404 is Traefik's default** -- same as "no route matched", not a backend error
5. **The middleware error is logged once at startup** -- easy to miss in a stream of logs
### Prevention
- During planned maintenance (node drain, containerd restart), restart Traefik pods
AFTER network connectivity is confirmed restored
- Consider pre-caching Traefik plugins in the container image or using an init container
- Monitor for the `Plugins are disabled` log message in your alerting system
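The first prevention item (restart only after connectivity is back) can be sketched as a small retry wrapper. This is an illustration, not an existing script in this repo: `retry` is a hypothetical helper, and the attempt count and delay in the usage comment are arbitrary.

```bash
# Hypothetical helper (not part of this repo).
# retry <attempts> <delay-seconds> <command...>
# Runs the command until it succeeds; returns 1 after <attempts> failures.
retry() {
  attempts=$1
  delay=$2
  shift 2
  n=1
  until "$@"; do
    if [ "$n" -ge "$attempts" ]; then
      return 1
    fi
    n=$((n + 1))
    sleep "$delay"
  done
}

# Usage sketch: restart Traefik only once plugins.traefik.io is reachable from the pod.
#   retry 10 5 kubectl exec -n traefik deploy/traefik -- \
#     wget -q -O /dev/null --timeout=5 https://plugins.traefik.io \
#     && kubectl rollout restart deployment -n traefik traefik
```

Probing from inside the Traefik pod rather than from a workstation matters here, because the failure mode is about the pod's own DNS and egress path at startup.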
### Notes
- This affects ALL plugin-based middlewares, not just crowdsec
- The `rewrite-body` plugin (used for rybbit analytics injection) is also affected
- Traefik v3.x downloads plugins on every startup; there is no persistent cache
- If only some routes return 404, the problem is likely different (missing middleware
or TLS secret, not a plugin issue)
---
## References
- [Traefik HTTP/3 Documentation](https://doc.traefik.io/traefik/routing/entrypoints/#http3)
- [Traefik Helm Chart Values](https://github.com/traefik/traefik-helm-chart/blob/master/traefik/values.yaml)
- [Cloudflare HTTP/3 Settings](https://developers.cloudflare.com/speed/optimization/protocol/http3/)
- [Traefik Helm Chart Ports Configuration](https://github.com/traefik/traefik-helm-chart)
- [Traefik v3 Providers Documentation](https://doc.traefik.io/traefik/providers/kubernetes-crd/)
## See Also
- `traefik-rewrite-body-troubleshooting` -- Traefik rewrite-body plugin troubleshooting (compression, Accept header issues)
- `helm-release-force-rerender` -- Force Helm chart re-render when structural changes don't take effect


@ -0,0 +1,200 @@
---
name: traefik-rewrite-body-troubleshooting
description: |
Troubleshooting guide for the Traefik rewrite-body plugin (packruler/rewrite-body).
Covers two failure modes: (1) Compression failure — plugin logs "flate: corrupt input
before offset 5" when backends send gzip-compressed responses, corrupting response
bodies and breaking WebSocket connections, authentication flows, and mobile app
connectivity. (2) Silent skip — plugin silently skips content injection (rybbit
analytics, trap links, or any HTML rewriting) when the request Accept header doesn't
contain "text/html" (e.g., curl's default Accept: */*), making it appear broken
despite correct configuration.
author: Claude Code
version: 1.0.0
date: 2026-02-22
---
# Traefik Rewrite-Body Plugin Troubleshooting
Two distinct failure modes for the `packruler/rewrite-body` Traefik plugin used for
injecting analytics scripts (rybbit) and anti-AI trap links into HTML responses.
---
## Problem 1: Compression Failure
### Symptoms
- Traefik logs show: `Rewrite-Body | ERROR ... Error loading content: flate: corrupt input before offset 5`
- Mobile apps (e.g., Home Assistant Companion) fail while browsers work
- HA Companion app shows repeated `GET /?external_auth=1` requests (auth loop)
- WebSocket connections (`/api/websocket`) are very short-lived (seconds instead of minutes)
- HTTP 499 errors on API calls (client disconnects due to corrupted responses)
- Using `packruler/rewrite-body` plugin v1.2.0 with `monitoring.types = ["text/html"]`
### Root Cause
Despite the `monitoring.types = ["text/html"]` filter, the plugin attempts to decompress
ALL responses before checking content type. When decompression fails on certain gzip
encodings, it corrupts the response body, breaking:
- WebSocket upgrade handshakes
- Authentication flows (HA Companion app's `external_auth` callback)
- Mobile app connectivity (browsers appear to work because they auto-reconnect)
### Misleading Symptoms
- HTTP/3 (QUIC) may appear to be the cause because HTTP/3 requests show 499 errors.
This is a red herring -- the rewrite-body plugin corruption affects all protocols.
- WebSocket issues may look like a timeout or proxy configuration problem.
- The `monitoring.types = ["text/html"]` config suggests the plugin should only touch
HTML, but it still processes all responses for decompression before filtering.
### Solution
#### Step 1: Create a strip-accept-encoding middleware
Add a Traefik middleware that removes `Accept-Encoding` from requests, forcing
backends to send uncompressed responses that the plugin can safely process:
```hcl
# In traefik/middleware.tf
resource "kubernetes_manifest" "middleware_strip_accept_encoding" {
  manifest = {
    apiVersion = "traefik.io/v1alpha1"
    kind = "Middleware"
    metadata = {
      name = "strip-accept-encoding"
      namespace = kubernetes_namespace.traefik.metadata[0].name
    }
    spec = {
      headers = {
        customRequestHeaders = {
          "Accept-Encoding" = ""
        }
      }
    }
  }
  depends_on = [helm_release.traefik]
}
```
#### Step 2: Add middleware to routes with rewrite-body
In the ingress factory middleware chain, add `strip-accept-encoding` BEFORE the
rewrite-body middleware:
```hcl
var.rybbit_site_id != null ? "traefik-strip-accept-encoding@kubernetescrd" : null,
var.rybbit_site_id != null ? "${var.namespace}-rybbit-analytics-${var.name}@kubernetescrd" : null,
```
The order matters: strip-accept-encoding must come first so the request reaches
the backend without Accept-Encoding, and the uncompressed response then passes
through the rewrite-body plugin.
### Verification (Compression Fix)
1. Check Traefik logs for absence of `flate: corrupt input` errors:
```bash
kubectl logs -n traefik -l app.kubernetes.io/name=traefik --tail=200 | grep -i "flate\|rewrite-body"
```
2. Verify the middleware chain includes strip-accept-encoding before rybbit:
```bash
kubectl get ingress -n <namespace> <name> -o jsonpath='{.metadata.annotations.traefik\.ingress\.kubernetes\.io/router\.middlewares}'
```
3. Test mobile app connectivity (HA Companion, etc.)
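The ordering check in step 2 can be automated. This is a minimal sketch under assumptions: `strip_before_rewrite` is a hypothetical helper (not part of this repo), and it relies on the middleware names containing `strip-accept-encoding` and `rybbit-analytics`, as in the ingress factory snippet above.

```bash
# Hypothetical helper (not part of this repo).
# stdin: the comma-separated value of the router.middlewares annotation.
# Exit status is 0 only if strip-accept-encoding appears before rybbit-analytics.
strip_before_rewrite() {
  tr ',' '\n' | awk '
    /strip-accept-encoding/ && !s { s = NR }
    /rybbit-analytics/ && !r { r = NR }
    END { exit !(s && r && s < r) }
  '
}

# Usage:
#   kubectl get ingress -n <namespace> <name> \
#     -o jsonpath='{.metadata.annotations.traefik\.ingress\.kubernetes\.io/router\.middlewares}' \
#     | strip_before_rewrite || echo "middleware order is wrong or a middleware is missing"
```

The helper also fails when either middleware is absent, so it should only be run against ingresses that are expected to have analytics injection.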
### Notes (Compression)
- This affects ALL services using the rewrite-body plugin, not just HA
- The fix is applied conditionally: `strip-accept-encoding` is only added to the
middleware chain when `rybbit_site_id` is set, so services without analytics
are unaffected
- Both `ingress_factory` and `reverse_proxy/factory` modules need the fix
- Traefik may still compress responses to clients via its own compression middleware;
the strip only affects the backend request
- The plugin's `monitoring.types` filter works for deciding what to rewrite, but
decompression is attempted on all responses regardless
---
## Problem 2: Silent Skip (Accept Header Mismatch)
### Symptoms
- rewrite-body middleware is in the ingress middleware chain and shows status "enabled" in Traefik API
- `curl https://example.com/` returns original HTML with no injected content
- Browser shows injected content (rybbit script, trap links, etc.)
- No errors in Traefik logs -- the plugin silently skips processing
- `monitoring.types = ["text/html"]` is configured in the middleware spec
- Middleware chain order is correct (strip-accept-encoding before rewrite-body)
### Root Cause
In the plugin source code, `SupportsProcessing()` checks the **request** `Accept`
header (not the response `Content-Type`) against `monitoring.types`:
```go
func (r *Rewriter) SupportsProcessing(req *http.Request) bool {
	accept := req.Header.Get("Accept")
	for _, monitoringType := range r.monitoring.Types {
		if strings.Contains(accept, monitoringType) {
			return true
		}
	}
	return false
}
```
It uses `strings.Contains(accept, "text/html")`. The curl default `Accept: */*` does
NOT contain the substring `text/html`, so the plugin returns false and skips all
processing. Browser requests include `Accept: text/html,application/xhtml+xml,...`
which does match.
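The effect of the substring check can be reproduced locally without a cluster. The snippet below is an illustrative shell re-implementation of the same logic, not plugin code; `accept_matches` is a hypothetical name.

```bash
# Illustrative re-implementation of the substring check (not plugin code).
# accept_matches <accept-header-value> <monitored-type>
# Succeeds only if the monitored type occurs literally inside the Accept value.
accept_matches() {
  case "$1" in
    *"$2"*) return 0 ;;
    *) return 1 ;;
  esac
}

# Compare curl's default, a typical API client, and a browser-like header.
for accept in '*/*' 'application/json' 'text/html,application/xhtml+xml'; do
  if accept_matches "$accept" 'text/html'; then
    echo "processed: $accept"
  else
    echo "skipped:   $accept"
  fi
done
```

Only the browser-like header is reported as processed, because the check does no wildcard or media-range matching.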
### Misleading Symptoms
- Appears as if the middleware isn't working at all
- May look like a middleware ordering issue or configuration error
- `kubectl get middleware` shows the resource exists with correct spec
- Traefik API (`/api/http/middlewares/`) shows the middleware as "enabled"
- Checking the rewrite-body regex patterns seems pointless since nothing is being processed
### Solution
This is **working as designed** -- not a bug. The fix depends on context:
#### For testing with curl
Add the `Accept` header to simulate a browser:
```bash
curl -s -H "Accept: text/html,application/xhtml+xml" https://example.com/
```
#### For verifying injection is working
```bash
# Check for injected content (trap links, analytics, etc.)
curl -s -H "Accept: text/html,application/xhtml+xml" https://example.com/ \
| grep -oE 'href="https://poison[^"]*"'
# Check for rybbit analytics
curl -s -H "Accept: text/html,application/xhtml+xml" https://example.com/ \
| grep -oE 'src="https://rybbit[^"]*"'
```
#### For programmatic clients that need injection
If a non-browser client needs to receive injected content, ensure it sends
`Accept: text/html` in its request headers.
### Verification (Accept Header)
```bash
# Without Accept header -- no injection (expected)
curl -s https://example.com/ | grep -c "rybbit"
# Output: 0
# With Accept header -- injection works
curl -s -H "Accept: text/html" https://example.com/ | grep -c "rybbit"
# Output: 1
```
### Notes (Accept Header)
- This behavior is independent of the compression issue (Problem 1 above)
- The check is on the **request** `Accept` header, not the **response** `Content-Type`
- `Accept: */*` does NOT match -- `strings.Contains("*/*", "text/html")` is false
- Real AI scrapers typically send browser-like Accept headers, so trap links will be
injected for them correctly
- API calls (which typically send `Accept: application/json`) are correctly skipped
---
## See Also
- `traefik-helm-configuration` -- Traefik Helm chart configuration and entrypoints
- `ingress-factory-migration` -- Covers the ingress factory module that creates
rybbit analytics middlewares