fix: restore tree dropped by 6d224861; land stem95su gdrive-sync (10m) [ci skip]
6d224861 came from a --no-checkout worktree whose empty index made the
commit drop every file except two. This restores 05b50d2b's full tree and
correctly adds stacks/stem95su/gdrive-sync.tf + the service-catalog stem95su
entry. Forward-only (parent=6d224861, no force-push); [ci skip] since the
live infra was never applied from the broken commit.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
parent 6d224861c4
commit fd0f4a0365
1166 changed files with 358546 additions and 0 deletions
170
.claude/skills/archived/authentik-oidc-kubernetes/SKILL.md
Normal file
|
|
@@ -0,0 +1,170 @@
|
|||
---
|
||||
name: authentik-oidc-kubernetes
|
||||
description: |
|
||||
Configure Authentik as OIDC provider for Kubernetes API server authentication.
|
||||
Use when: (1) setting up OIDC auth for kubectl with Authentik, (2) kube-apiserver
|
||||
rejects OIDC tokens with "oidc: email not verified", (3) JWKS endpoint returns
|
||||
empty {} despite provider being configured, (4) kubelogin fails with "claim not
|
||||
present" for email, (5) redirect_uri mismatch errors during kubelogin browser auth,
|
||||
(6) kube-apiserver static pod manifest changes don't take effect after restart.
|
||||
Covers all gotchas discovered when integrating Authentik 2025.10.x with Kubernetes
|
||||
1.34.x using kubelogin (int128/kubelogin).
|
||||
author: Claude Code
|
||||
version: 1.0.0
|
||||
date: 2026-02-17
|
||||
---
|
||||
|
||||
# Authentik OIDC for Kubernetes API Authentication
|
||||
|
||||
## Problem
|
||||
Setting up Authentik as an OIDC identity provider for Kubernetes kubectl access
|
||||
involves multiple non-obvious pitfalls that cause silent failures at different
|
||||
stages of the authentication flow.
|
||||
|
||||
## Context / Trigger Conditions
|
||||
- Setting up multi-user kubectl access with OIDC
|
||||
- Using Authentik as the identity provider and kubelogin (int128/kubelogin) as the kubectl plugin
|
||||
- Any of these errors:
|
||||
- `oidc: email not verified`
|
||||
- `oidc: parse username claims "email": claim not present`
|
||||
- `The request fails due to a missing, invalid, or mismatching redirection URI`
|
||||
- JWKS endpoint (`/application/o/<app>/jwks/`) returns `{}`
|
||||
- `Unauthorized` after successful browser login
|
||||
|
||||
## Solution
|
||||
|
||||
### Gotcha 1: Signing Key Must Be Assigned
|
||||
|
||||
Authentik's OAuth2 provider does NOT assign a signing key by default. Without it,
|
||||
the JWKS endpoint returns `{}` and kube-apiserver can't validate tokens.
|
||||
|
||||
**Fix:** Assign a signing key (e.g., "authentik Self-signed Certificate") to the
|
||||
OAuth2 provider:
|
||||
```python
|
||||
# Via Django shell (kubectl exec into authentik server pod)
|
||||
from authentik.providers.oauth2.models import OAuth2Provider
|
||||
from authentik.crypto.models import CertificateKeyPair
|
||||
|
||||
provider = OAuth2Provider.objects.get(name='kubernetes')
|
||||
cert = CertificateKeyPair.objects.filter(name='authentik Self-signed Certificate').first()
|
||||
provider.signing_key = cert
|
||||
provider.save()
|
||||
```
|
||||
|
||||
Or via API:
|
||||
```bash
|
||||
curl -X PATCH -H "Authorization: Bearer $TOKEN" -H "Content-Type: application/json" \
|
||||
"$AUTHENTIK_URL/api/v3/providers/oauth2/<pk>/" \
|
||||
-d '{"signing_key": "<certificate-keypair-uuid>"}'
|
||||
```
|
||||
|
||||
### Gotcha 2: Default Email Mapping Sets `email_verified: False`
|
||||
|
||||
Authentik's built-in email scope mapping hardcodes `email_verified: False`:
|
||||
```python
|
||||
return {
|
||||
"email": request.user.email,
|
||||
"email_verified": False # <-- This causes kube-apiserver to reject the token
|
||||
}
|
||||
```
|
||||
|
||||
When `email` is the username claim, kube-apiserver rejects any token whose `email_verified` claim is present but not `true`.
|
||||
|
||||
**Fix:** Create a custom scope mapping with `email_verified: True` and assign it
|
||||
to the provider instead of the default:
|
||||
```python
|
||||
from authentik.providers.oauth2.models import OAuth2Provider, ScopeMapping
|
||||
|
||||
# Create custom mapping
|
||||
mapping, _ = ScopeMapping.objects.get_or_create(
|
||||
name='Kubernetes Email (verified)',
|
||||
defaults={
|
||||
'scope_name': 'email',
|
||||
'expression': 'return {"email": request.user.email, "email_verified": True}'
|
||||
}
|
||||
)
|
||||
|
||||
# Replace default email mapping on the provider
|
||||
provider = OAuth2Provider.objects.get(name='kubernetes')
|
||||
default_email = ScopeMapping.objects.filter(
|
||||
managed='goauthentik.io/providers/oauth2/scope-email'
|
||||
).first()
|
||||
if default_email:
|
||||
provider.property_mappings.remove(default_email)
|
||||
provider.property_mappings.add(mapping)
|
||||
```
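
To confirm which mapping a token was actually issued with, the ID token's payload can be decoded locally. This is a minimal sketch, assuming a standard three-part JWT. `decode_claims` and `check_email_claims` are hypothetical helper names, and the signature is not verified, so it is suitable for debugging only:

```python
import base64
import json


def decode_claims(id_token: str) -> dict:
    """Return the (unverified) payload claims of a JWT."""
    payload_b64 = id_token.split(".")[1]
    # JWT segments drop base64url padding; restore it before decoding.
    payload_b64 += "=" * (-len(payload_b64) % 4)
    return json.loads(base64.urlsafe_b64decode(payload_b64))


def check_email_claims(claims: dict) -> list[str]:
    """List the email-related problems described in Gotchas 2 and 3."""
    problems = []
    if "email" not in claims:
        problems.append("email claim not present")
    if "email_verified" in claims and claims["email_verified"] is not True:
        problems.append("email_verified is present but not true")
    return problems
```

An empty list from `check_email_claims(decode_claims(token))` suggests the custom mapping is in effect. Any other result points back to the mapping or scope configuration.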
|
||||
|
||||
### Gotcha 3: kubelogin Needs Extra Scopes
|
||||
|
||||
By default, kubelogin only requests the `openid` scope. The token will lack
|
||||
`email` and `groups` claims, causing:
|
||||
```
|
||||
oidc: parse username claims "email": claim not present
|
||||
```
|
||||
|
||||
**Fix:** Add `--oidc-extra-scope` flags to the kubeconfig exec plugin:
|
||||
```yaml
|
||||
users:
|
||||
- name: oidc-user
|
||||
user:
|
||||
exec:
|
||||
command: kubectl
|
||||
args:
|
||||
- oidc-login
|
||||
- get-token
|
||||
- --oidc-issuer-url=https://authentik.example.com/application/o/kubernetes/
|
||||
- --oidc-client-id=kubernetes
|
||||
- --oidc-extra-scope=email # Required!
|
||||
- --oidc-extra-scope=profile
|
||||
- --oidc-extra-scope=groups
|
||||
```
|
||||
|
||||
### Gotcha 4: Redirect URIs Must Use Regex Mode
|
||||
|
||||
kubelogin does not listen on a fixed port: it tries 8000, then 18000, then falls
back to a random available port. A strict redirect URI such as
`http://localhost:8000/callback` therefore fails whenever kubelogin ends up on a
different port.
|
||||
|
||||
**Fix:** Use regex matching in the Authentik provider:
|
||||
```json
|
||||
{
|
||||
"redirect_uris": [
|
||||
{"matching_mode": "regex", "url": "http://localhost:.*"},
|
||||
{"matching_mode": "regex", "url": "http://127\\.0\\.0\\.1:.*"}
|
||||
]
|
||||
}
|
||||
```
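
The two patterns can be sanity-checked offline before they are saved on the provider. This is a rough sketch that assumes each pattern must match the whole redirect URI. How Authentik anchors regex patterns is not confirmed here, and `is_allowed` is a hypothetical helper:

```python
import re

# Patterns copied from the redirect_uris example above.
PATTERNS = [r"http://localhost:.*", r"http://127\.0\.0\.1:.*"]


def is_allowed(redirect_uri: str) -> bool:
    """True if any pattern matches the entire redirect URI."""
    return any(re.fullmatch(p, redirect_uri) for p in PATTERNS)


if __name__ == "__main__":
    for uri in ("http://localhost:8000", "http://127.0.0.1:18000/callback"):
        print(uri, is_allowed(uri))
```

Because `.*` accepts anything after the colon, a tighter pattern such as `http://localhost:\d+(/.*)?` may be preferable. That is a suggestion, not something the original setup used.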
|
||||
|
||||
### Gotcha 5: Property Mappings API Endpoint Changed
|
||||
|
||||
In Authentik 2025.10.x, scope mappings are at:
|
||||
- `propertymappings/provider/scope/` (new, correct)
|
||||
- NOT `propertymappings/scope/` (old, returns 405 Method Not Allowed on POST)
|
||||
|
||||
### Gotcha 6: Static Pod Manifest Changes Need Full Cycle
|
||||
|
||||
See skill: `kubelet-static-pod-manifest-update` for the full restart procedure.
|
||||
|
||||
## Verification
|
||||
|
||||
After all fixes:
|
||||
```bash
|
||||
# 1. JWKS has a key
|
||||
curl -s https://authentik.example.com/application/o/kubernetes/jwks/ | jq '.keys | length'
|
||||
# Expected: 1 (or more)
|
||||
|
||||
# 2. Test auth
|
||||
KUBECONFIG=/path/to/oidc-kubeconfig kubectl get namespaces
|
||||
# Expected: browser opens, login, namespaces returned
|
||||
|
||||
# 3. Check API server logs for success
|
||||
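# kubectl does not expand pod-name globs; select by label instead (assumes kubeadm's component label)
ssh master "sudo kubectl logs -n kube-system -l component=kube-apiserver --tail=-1 | grep oidc | tail -5"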
|
||||
# Expected: no "Unable to authenticate" errors
|
||||
```
|
||||
|
||||
## Notes
|
||||
- The OAuth2 provider should use `client_type: public` (no client secret needed for kubelogin)
|
||||
- Set `sub_mode: user_email` so the OIDC subject matches the RBAC binding
|
||||
- Set `include_claims_in_id_token: true` for the token to contain claims directly
|
||||
- Use `issuer_mode: per_provider` for a clean issuer URL
|
||||
- RBAC ClusterRoleBindings should match on the user's email (the `--oidc-username-claim=email` value)
|
||||
297
.claude/skills/archived/authentik/SKILL.md
Normal file
|
|
@@ -0,0 +1,297 @@
|
|||
---
|
||||
name: authentik
|
||||
description: |
|
||||
Manage the Authentik identity provider via its REST API. Use when:
|
||||
(1) User asks to create, update, or delete users in Authentik,
|
||||
(2) User asks to manage groups or group memberships,
|
||||
(3) User asks to create a new OAuth2/OIDC application or provider,
|
||||
(4) User asks to protect a service with forward auth (Authentik + Traefik),
|
||||
(5) User asks about SSO, single sign-on, authentication, or identity,
|
||||
(6) User asks to manage Authentik flows, stages, or policies,
|
||||
(7) User asks to configure social login (Google, GitHub, Facebook),
|
||||
(8) User asks about OIDC for Kubernetes or who has access to what,
|
||||
(9) User deploys a new service that needs authentication.
|
||||
Authentik v2025.10.3 running in Kubernetes, managed via REST API.
|
||||
author: Claude Code
|
||||
version: 1.0.0
|
||||
date: 2026-02-17
|
||||
---
|
||||
|
||||
# Authentik Identity Provider Management
|
||||
|
||||
## Overview
|
||||
- **URL**: `https://authentik.viktorbarzin.me`
|
||||
- **Admin UI**: `https://authentik.viktorbarzin.me/if/admin/`
|
||||
- **API Base**: `https://authentik.viktorbarzin.me/api/v3/`
|
||||
- **API Docs**: `https://authentik.viktorbarzin.me/api/v3/docs/`
|
||||
- **Helm Chart**: authentik v2025.10.3
|
||||
- **Namespace**: `authentik`
|
||||
|
||||
## API Access
|
||||
|
||||
### Getting the Token
|
||||
The API token is stored in `terraform.tfvars` (git-crypt encrypted):
|
||||
```bash
|
||||
AUTHENTIK_TOKEN=$(grep authentik_api_token terraform.tfvars | cut -d'"' -f2)
|
||||
```
|
||||
|
||||
### Making API Calls
|
||||
```bash
|
||||
# Generic pattern
|
||||
curl -s -H "Authorization: Bearer $AUTHENTIK_TOKEN" \
|
||||
"https://authentik.viktorbarzin.me/api/v3/<endpoint>/"
|
||||
|
||||
# With JSON body (POST/PATCH/PUT)
|
||||
curl -s -X POST \
|
||||
-H "Authorization: Bearer $AUTHENTIK_TOKEN" \
|
||||
-H "Content-Type: application/json" \
|
||||
"https://authentik.viktorbarzin.me/api/v3/<endpoint>/" \
|
||||
-d '{"key": "value"}'
|
||||
```
|
||||
|
||||
### Verify Token Works
|
||||
```bash
|
||||
curl -s -H "Authorization: Bearer $AUTHENTIK_TOKEN" \
|
||||
"https://authentik.viktorbarzin.me/api/v3/core/users/me/" | python3 -m json.tool
|
||||
```
|
||||
|
||||
## Key API Endpoints
|
||||
|
||||
| Endpoint | Methods | Purpose |
|
||||
|----------|---------|---------|
|
||||
| `core/users/` | GET, POST | List/create users |
|
||||
| `core/users/{id}/` | GET, PATCH, DELETE | Get/update/delete user |
|
||||
| `core/groups/` | GET, POST | List/create groups |
|
||||
| `core/groups/{pk}/` | GET, PATCH, DELETE | Get/update/delete group |
|
||||
| `core/applications/` | GET, POST | List/create applications |
|
||||
| `core/tokens/` | GET, POST | List/create tokens |
|
||||
| `core/tokens/{identifier}/view_key/` | GET | View token secret key |
|
||||
| `providers/all/` | GET | List all providers |
|
||||
| `providers/oauth2/` | GET, POST | OAuth2/OIDC providers |
|
||||
| `providers/proxy/` | GET, POST | Proxy providers (forward auth) |
|
||||
| `flows/instances/` | GET | List flows |
|
||||
| `stages/all/` | GET | List stages |
|
||||
| `sources/all/` | GET | List sources (social login) |
|
||||
| `outposts/instances/` | GET | List outposts |
|
||||
| `propertymappings/provider/scope/` | GET, POST | OIDC scope mappings |
|
||||
| `rbac/roles/` | GET | List roles |
|
||||
|
||||
## Common Operations
|
||||
|
||||
### List All Users
|
||||
```bash
|
||||
curl -s -H "Authorization: Bearer $AUTHENTIK_TOKEN" \
|
||||
"https://authentik.viktorbarzin.me/api/v3/core/users/?page_size=50" | \
|
||||
python3 -c "
|
||||
import json,sys
|
||||
for u in json.load(sys.stdin)['results']:
|
||||
groups=[g['name'] for g in u.get('groups_obj',[])]
|
||||
print(f\" {u['username']:<40} {u['name']:<30} groups={groups}\")
|
||||
"
|
||||
```
|
||||
|
||||
### Create a New User
|
||||
```bash
|
||||
curl -s -X POST \
|
||||
-H "Authorization: Bearer $AUTHENTIK_TOKEN" \
|
||||
-H "Content-Type: application/json" \
|
||||
"https://authentik.viktorbarzin.me/api/v3/core/users/" \
|
||||
-d '{
|
||||
"username": "user@example.com",
|
||||
"name": "Full Name",
|
||||
"email": "user@example.com",
|
||||
"is_active": true,
|
||||
"type": "internal",
|
||||
"path": "users"
|
||||
}'
|
||||
```
|
||||
|
||||
### Add User to Group
|
||||
```bash
|
||||
# First get the group to find current users
|
||||
GROUP_PK="<group-uuid>"
|
||||
CURRENT_USERS=$(curl -s -H "Authorization: Bearer $AUTHENTIK_TOKEN" \
|
||||
"https://authentik.viktorbarzin.me/api/v3/core/groups/$GROUP_PK/" | \
|
||||
python3 -c "import json,sys; print(json.load(sys.stdin)['users'])")
|
||||
|
||||
# Then PATCH with the updated user list (add new user pk)
|
||||
curl -s -X PATCH \
|
||||
-H "Authorization: Bearer $AUTHENTIK_TOKEN" \
|
||||
-H "Content-Type: application/json" \
|
||||
"https://authentik.viktorbarzin.me/api/v3/core/groups/$GROUP_PK/" \
|
||||
-d '{"users": [<existing_pks>, <new_pk>]}'
|
||||
```
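
Since the PATCH replaces the whole member list, the body is best derived from the list just fetched rather than typed by hand. This is a minimal sketch, assuming user pks are integers as returned in the group's `users` field. `build_group_patch` is a hypothetical helper:

```python
import json


def build_group_patch(current_users: list[int], new_user_pk: int) -> str:
    """Return a JSON PATCH body that keeps existing members and adds one."""
    users = list(current_users)
    # Avoid duplicating a user who is already a member.
    if new_user_pk not in users:
        users.append(new_user_pk)
    return json.dumps({"users": users})


if __name__ == "__main__":
    print(build_group_patch([4, 7], 12))
```

The returned string can be passed to `curl -d` in the PATCH call above.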
|
||||
|
||||
### Create a New Group
|
||||
```bash
|
||||
curl -s -X POST \
|
||||
-H "Authorization: Bearer $AUTHENTIK_TOKEN" \
|
||||
-H "Content-Type: application/json" \
|
||||
"https://authentik.viktorbarzin.me/api/v3/core/groups/" \
|
||||
-d '{
|
||||
"name": "My New Group",
|
||||
"is_superuser": false,
|
||||
"parent": "<parent-group-pk-or-null>"
|
||||
}'
|
||||
```
|
||||
|
||||
### Create OAuth2/OIDC Application (Full Flow)
|
||||
|
||||
**Step 1: Create the OAuth2 Provider**
|
||||
```bash
|
||||
curl -s -X POST \
|
||||
-H "Authorization: Bearer $AUTHENTIK_TOKEN" \
|
||||
-H "Content-Type: application/json" \
|
||||
"https://authentik.viktorbarzin.me/api/v3/providers/oauth2/" \
|
||||
-d '{
|
||||
"name": "Provider for myapp",
|
||||
"authorization_flow": "<flow-pk>",
|
||||
"invalidation_flow": "<invalidation-flow-pk>",
|
||||
"client_type": "confidential",
|
||||
"client_id": "<generated-or-custom>",
|
||||
"client_secret": "<generated-or-custom>",
|
||||
"redirect_uris": [{"matching_mode": "strict", "url": "https://myapp.viktorbarzin.me/callback"}],
|
||||
"property_mappings": ["<scope-mapping-pks>"],
|
||||
"signing_key": "<signing-key-pk>"
|
||||
}'
|
||||
```
|
||||
|
||||
**Step 2: Create the Application**
|
||||
```bash
|
||||
curl -s -X POST \
|
||||
-H "Authorization: Bearer $AUTHENTIK_TOKEN" \
|
||||
-H "Content-Type: application/json" \
|
||||
"https://authentik.viktorbarzin.me/api/v3/core/applications/" \
|
||||
-d '{
|
||||
"name": "My App",
|
||||
"slug": "myapp",
|
||||
"provider": <provider-pk-from-step-1>,
|
||||
"meta_launch_url": "https://myapp.viktorbarzin.me"
|
||||
}'
|
||||
```
|
||||
|
||||
### List Applications
|
||||
```bash
|
||||
curl -s -H "Authorization: Bearer $AUTHENTIK_TOKEN" \
|
||||
"https://authentik.viktorbarzin.me/api/v3/core/applications/?page_size=50" | \
|
||||
python3 -c "
|
||||
import json,sys
|
||||
for a in json.load(sys.stdin)['results']:
|
||||
ptype = a.get('provider_obj',{}).get('verbose_name','N/A')
|
||||
print(f\" {a['name']:<30} slug={a['slug']:<25} provider={ptype}\")
|
||||
"
|
||||
```
|
||||
|
||||
### Create a Non-Expiring API Token
|
||||
```bash
|
||||
# Create token
|
||||
curl -s -X POST \
|
||||
-H "Authorization: Bearer $AUTHENTIK_TOKEN" \
|
||||
-H "Content-Type: application/json" \
|
||||
"https://authentik.viktorbarzin.me/api/v3/core/tokens/" \
|
||||
-d '{
|
||||
"identifier": "my-token-name",
|
||||
"intent": "api",
|
||||
"expiring": false,
|
||||
"description": "Description here"
|
||||
}'
|
||||
|
||||
# Retrieve the key
|
||||
curl -s -H "Authorization: Bearer $AUTHENTIK_TOKEN" \
|
||||
"https://authentik.viktorbarzin.me/api/v3/core/tokens/my-token-name/view_key/"
|
||||
```
|
||||
|
||||
## Important Reference UUIDs
|
||||
|
||||
### Authorization Flows
|
||||
| Flow | Slug | Use For |
|
||||
|------|------|---------|
|
||||
| Authorize Application (explicit consent) | `default-provider-authorization-explicit-consent` | Apps that should show consent screen |
|
||||
| Authorize Application (implicit consent) | `default-provider-authorization-implicit-consent` | Internal/trusted apps, auto-redirect |
|
||||
| Logout | `default-invalidation-flow` | Invalidation/logout flow |
|
||||
|
||||
### Common Property Mappings (OIDC Scopes)
|
||||
These are the standard scope mappings used by most providers:
|
||||
- `60e33a8c-66a2-414f-840c-b13012b4d4bd` — openid
|
||||
- `1f51c659-f13b-4ad4-ba89-70458ef88e9c` — email
|
||||
- `4c0bf430-7f74-4216-b9d7-23703ab544ba` — profile
|
||||
|
||||
### Login Sources
|
||||
| Source | Slug | Matching Mode |
|
||||
|--------|------|---------------|
|
||||
| Google | `google` | identifier |
|
||||
| GitHub | `github` | email_link |
|
||||
| Facebook | `facebook` | email_link |
|
||||
|
||||
## Protecting a Service with Forward Auth
|
||||
|
||||
To protect a service via Authentik + Traefik forward auth:
|
||||
|
||||
1. In the service's Terraform module, set `protected = true` in the `ingress_factory` call
|
||||
2. This adds the `authentik-forward-auth` Traefik middleware
|
||||
3. Unauthenticated users get redirected to the Authentik login page
|
||||
4. After login, these headers are forwarded to the service:
|
||||
- `X-authentik-username`
|
||||
- `X-authentik-uid`
|
||||
- `X-authentik-email`
|
||||
- `X-authentik-name`
|
||||
- `X-authentik-groups`
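
A protected service can read these headers to identify the caller. The following is a small sketch, not tied to any web framework. `identity_from_headers` is a hypothetical helper, and the `|` separator for groups is an assumption that should be checked against a real request:

```python
def identity_from_headers(headers: dict[str, str]) -> dict:
    """Extract the Authentik identity from forwarded request headers."""
    # HTTP header names are case-insensitive, so normalise before lookup.
    lower = {k.lower(): v for k, v in headers.items()}
    groups_raw = lower.get("x-authentik-groups", "")
    return {
        "username": lower.get("x-authentik-username"),
        "uid": lower.get("x-authentik-uid"),
        "email": lower.get("x-authentik-email"),
        "name": lower.get("x-authentik-name"),
        # Assumption: groups arrive as a single "|"-separated string.
        "groups": [g for g in groups_raw.split("|") if g],
    }
```

The headers are only trustworthy when the service is reachable exclusively through the Traefik middleware. A client that can reach the service directly could set them itself.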
|
||||
|
||||
## Invitation Management
|
||||
|
||||
### Create Invitation
|
||||
```bash
|
||||
curl -s -X POST \
|
||||
-H "Authorization: Bearer $AUTHENTIK_TOKEN" \
|
||||
-H "Content-Type: application/json" \
|
||||
"https://authentik.viktorbarzin.me/api/v3/stages/invitation/invitations/" \
|
||||
-d '{
|
||||
"name": "invite-slug-name",
|
||||
"single_use": true,
|
||||
"fixed_data": {"group": "Target Group Name"},
|
||||
"flow": "<invitation-enrollment-flow-pk>"
|
||||
}'
|
||||
# Returns PK which is the itoken
|
||||
# Link: https://authentik.viktorbarzin.me/if/flow/invitation-enrollment/?itoken=<pk>
|
||||
```
|
||||
|
||||
### List Invitations
|
||||
```bash
|
||||
curl -s -H "Authorization: Bearer $AUTHENTIK_TOKEN" \
|
||||
"https://authentik.viktorbarzin.me/api/v3/stages/invitation/invitations/?page_size=50"
|
||||
```
|
||||
|
||||
### Delete Invitation
|
||||
```bash
|
||||
curl -s -X DELETE -H "Authorization: Bearer $AUTHENTIK_TOKEN" \
|
||||
"https://authentik.viktorbarzin.me/api/v3/stages/invitation/invitations/<pk>/"
|
||||
```
|
||||
|
||||
### Helper Script
|
||||
Use `.claude/scripts/authentik-invite.sh` for invitation management:
|
||||
```bash
|
||||
./authentik-invite.sh create "Group Name" [--days N]
|
||||
./authentik-invite.sh assign <username> "Group Name"
|
||||
./authentik-invite.sh list
|
||||
```
|
||||
|
||||
### Important Notes
|
||||
- OAuth source `enrollment_flow` is set to `invitation-enrollment` -- new social login users require invitation
|
||||
- Source updates require Django ORM (PATCH not supported on `sources/oauth/<slug>/`)
|
||||
- Invitation `name` field must be a slug (letters, numbers, hyphens, underscores)
|
||||
|
||||
## Gotchas
|
||||
|
||||
1. **API pagination**: All list endpoints return paginated results. Use `?page_size=50` or check `pagination.next` for more pages.
|
||||
2. **Group user updates**: PATCH to groups replaces the entire user list — always fetch current users first, then append.
|
||||
3. **Provider property mappings**: Must reference existing scope mapping UUIDs. Query `propertymappings/provider/scope/` to find them.
|
||||
4. **Signing key for OIDC**: Must assign a signing key to OAuth2 providers or JWKS endpoint returns empty `{}`.
|
||||
5. **Email verified claim**: Default email scope mapping sets `email_verified: False`. For Kubernetes OIDC, create a custom mapping that returns `True`.
|
||||
6. **Token identifier uniqueness**: Token identifiers must be unique across the entire instance.
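
For the pagination gotcha, a loop over `pagination.next` avoids silently truncated results. This is a minimal sketch in which `fetch_page` is a caller-supplied function (hypothetical) that returns the parsed JSON for a given page number, for example a wrapper around `GET core/users/?page=N`. It assumes `pagination.next` holds the next page number and is `0` on the last page:

```python
from typing import Callable, Iterator


def iter_results(fetch_page: Callable[[int], dict]) -> Iterator[dict]:
    """Yield every item from a paginated list endpoint."""
    page = 1
    while page:
        data = fetch_page(page)
        yield from data["results"]
        # Assumption: "next" is the next page number, 0 (or missing) at the end.
        page = data.get("pagination", {}).get("next") or 0
```

Keeping the HTTP call outside the generator lets the same loop serve users, groups, and applications.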
|
||||
|
||||
## Notes
|
||||
- Authentik is classified as DEFCON Level 1 (Critical) — handle with care
|
||||
- Changes to Authentik configuration (Helm chart, PgBouncer, etc.) must go through Terraform
|
||||
- API-level changes (users, groups, applications) are fine to make directly via the API
|
||||
- The embedded outpost auto-discovers providers assigned to it
|
||||
- See also: `ingress-factory-migration` skill for protecting services
|
||||
175
.claude/skills/archived/bluestacks-burp-interception/SKILL.md
Normal file
|
|
@@ -0,0 +1,175 @@
|
|||
---
|
||||
name: bluestacks-burp-interception
|
||||
description: |
|
||||
Intercept Android app HTTPS traffic using BlueStacks and Burp Suite on macOS.
|
||||
Use when: (1) Need to analyze Android app API calls, (2) App ignores HTTP proxy,
|
||||
(3) App uses SSL pinning that blocks interception, (4) Need to install Burp CA
|
||||
as system certificate. Covers ADB setup, proxy configuration, Zygisk SSL unpinning,
|
||||
and Magisk trustusercerts module for system CA installation.
|
||||
author: Claude Code
|
||||
version: 1.0.0
|
||||
date: 2026-01-24
|
||||
---
|
||||
|
||||
# BlueStacks + Burp Suite HTTPS Traffic Interception
|
||||
|
||||
## Problem
|
||||
You want to intercept HTTPS traffic from an Android app running in BlueStacks to analyze
|
||||
API calls, but the app either ignores the proxy or uses SSL certificate pinning.
|
||||
|
||||
## Context / Trigger Conditions
|
||||
- Running BlueStacks on macOS with Burp Suite
|
||||
- App traffic not appearing in Burp Suite
|
||||
- App crashes or refuses to connect when proxy is set
|
||||
- Need to bypass SSL pinning for security testing/research
|
||||
|
||||
## Prerequisites
|
||||
- BlueStacks with Magisk (kitsune variant) and root enabled
|
||||
- Zygisk-SSL-Unpinning module installed
|
||||
- trustusercerts Magisk module installed
|
||||
- Android SDK installed (for ADB)
|
||||
- Burp Suite running on port 8080
|
||||
|
||||
## Solution
|
||||
|
||||
### Step 1: Connect ADB to BlueStacks
|
||||
|
||||
```bash
|
||||
# ADB location on macOS (Android SDK)
|
||||
ADB=~/Library/Android/sdk/platform-tools/adb
|
||||
|
||||
# Connect to BlueStacks
|
||||
$ADB connect localhost:5555
|
||||
|
||||
# Verify connection
|
||||
$ADB devices
|
||||
# Should show: emulator-5554 or localhost:5555
|
||||
```
|
||||
|
||||
Note: BlueStacks runs **arm64-v8a** (not x86 as you might expect).
|
||||
|
||||
### Step 2: Set HTTP Proxy
|
||||
|
||||
Use your Mac's WiFi IP address (not 10.0.2.2 or localhost):
|
||||
|
||||
```bash
|
||||
# Get Mac WiFi IP
|
||||
IP=$(ipconfig getifaddr en0)
|
||||
|
||||
# Set proxy (Burp default port 8080)
|
||||
$ADB shell settings put global http_proxy ${IP}:8080
|
||||
|
||||
# Verify
|
||||
$ADB shell settings get global http_proxy
|
||||
|
||||
# Disable proxy when done
|
||||
$ADB shell settings put global http_proxy :0
|
||||
```
|
||||
|
||||
### Step 3: Configure SSL Unpinning for Target App
|
||||
|
||||
```bash
|
||||
# Find app package name
|
||||
$ADB shell pm list packages | grep <keyword>
|
||||
|
||||
# Edit config
|
||||
$ADB shell "su -c 'cat > /data/local/tmp/zyg.ssl/config.json << EOF
|
||||
{
|
||||
\"targets\": [
|
||||
{
|
||||
\"pkg_name\" : \"com.example.app\",
|
||||
\"enable\": true,
|
||||
\"start_safe\": true,
|
||||
\"start_delay\": 1000
|
||||
}
|
||||
]
|
||||
}
|
||||
EOF'"
|
||||
|
||||
# Restart the app
|
||||
$ADB shell am force-stop com.example.app
|
||||
$ADB shell monkey -p com.example.app -c android.intent.category.LAUNCHER 1
|
||||
|
||||
# Verify SSL unpinning is active
|
||||
$ADB shell "logcat -d | grep -i ZygiskSSL | tail -10"
|
||||
# Should show: "App detected: com.example.app" and "[*] SSL UNPINNING [#]"
|
||||
```
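
When several apps need unpinning, generating `config.json` on the host sidesteps the nested shell quoting used above. This is a minimal sketch that reuses the field names from that config. `build_config` is a hypothetical helper, and pushing the file to the device is left to `adb push`:

```python
import json


def build_config(packages: list[str], start_delay_ms: int = 1000) -> str:
    """Return config.json content with one target entry per package."""
    targets = [
        {
            "pkg_name": pkg,
            "enable": True,
            "start_safe": True,
            "start_delay": start_delay_ms,
        }
        for pkg in packages
    ]
    return json.dumps({"targets": targets}, indent=2)


if __name__ == "__main__":
    print(build_config(["com.example.app"]))
```

Writing the output to a local file also makes it easy to validate the JSON before the module reads it, which addresses the "check config.json syntax" failure mode.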
|
||||
|
||||
### Step 4: Install Burp CA as System Certificate
|
||||
|
||||
```bash
|
||||
# Download Burp CA cert
|
||||
curl -x http://127.0.0.1:8080 http://burp/cert -o /tmp/burp-cert.der
|
||||
|
||||
# Convert to PEM
|
||||
openssl x509 -inform DER -in /tmp/burp-cert.der -out /tmp/burp-cert.pem
|
||||
|
||||
# Get hash for Android cert store naming
|
||||
HASH=$(openssl x509 -inform PEM -subject_hash_old -in /tmp/burp-cert.pem | head -1)
|
||||
cp /tmp/burp-cert.pem /tmp/${HASH}.0
|
||||
|
||||
# Push to device
|
||||
$ADB push /tmp/${HASH}.0 /sdcard/
|
||||
|
||||
# Install via trustusercerts Magisk module
|
||||
$ADB shell "su -c 'cp /sdcard/${HASH}.0 /data/adb/modules/trustusercerts/system/etc/security/cacerts/'"
|
||||
$ADB shell "su -c 'chmod 644 /data/adb/modules/trustusercerts/system/etc/security/cacerts/${HASH}.0'"
|
||||
|
||||
# Reboot required for Magisk overlay
|
||||
$ADB shell "su -c 'reboot'"
|
||||
|
||||
# After reboot, verify cert is in system store
|
||||
$ADB shell "su -c 'ls /system/etc/security/cacerts/${HASH}.0'"
|
||||
```
|
||||
|
||||
### Step 5: Test Interception
|
||||
|
||||
1. Re-enable proxy after reboot: `$ADB shell settings put global http_proxy ${IP}:8080`
|
||||
2. Launch target app
|
||||
3. Check Burp Suite → Proxy → HTTP history for requests
|
||||
|
||||
## Verification
|
||||
|
||||
- Proxy set: `adb shell settings get global http_proxy` returns `<ip>:8080`
|
||||
- SSL unpinning active: `logcat | grep ZygiskSSL` shows "SSL UNPINNING"
|
||||
- Burp CA installed: `ls /system/etc/security/cacerts/<hash>.0` exists
|
||||
- Traffic visible in Burp Suite HTTP history
|
||||
|
||||
## Troubleshooting
|
||||
|
||||
| Symptom | Cause | Fix |
|
||||
|---------|-------|-----|
|
||||
| No traffic in Burp | Proxy not set | Check `settings get global http_proxy` |
|
||||
| App shows SSL error | Cert not installed | Verify cert in system store, reboot |
|
||||
| SSL unpinning not working | Config not loaded | Force-stop app, check config.json syntax |
|
||||
| ADB connection refused | BlueStacks ADB disabled | Enable in BlueStacks Settings → Advanced |
|
||||
| Wrong cert hash | Using wrong openssl flag | Use `subject_hash_old` not `subject_hash` |
|
||||
|
||||
## Notes
|
||||
|
||||
- BlueStacks runs arm64-v8a, so Zygisk modules need arm64 support
|
||||
- The trustusercerts module copies certs at boot via Magisk overlay
|
||||
- System partition is read-only; use Magisk modules instead of direct mounting
|
||||
- Burp cert hash is typically `9a5ba575` but verify for your instance
|
||||
- Some apps may use additional protections (root detection, Frida detection)
|
||||
|
||||
## Quick Reference
|
||||
|
||||
```bash
|
||||
# Set proxy
|
||||
adb shell settings put global http_proxy <ip>:8080
|
||||
|
||||
# Disable proxy
|
||||
adb shell settings put global http_proxy :0
|
||||
|
||||
# Check SSL unpinning logs
|
||||
adb shell "logcat -d | grep -i ZygiskSSL"
|
||||
|
||||
# Force restart app
|
||||
adb shell am force-stop <package> && adb shell monkey -p <package> -c android.intent.category.LAUNCHER 1
|
||||
```
|
||||
|
||||
## References
|
||||
- [Zygisk-SSL-Unpinning](https://github.com/m0szy/Zygisk-SSL-Unpinning)
|
||||
- [MagiskTrustUserCerts](https://github.com/NVISOsecurity/MagiskTrustUserCerts)
|
||||
- [Burp Suite Documentation](https://portswigger.net/burp/documentation)
|
||||
|
|
@@ -0,0 +1,189 @@
|
|||
---
|
||||
name: clickhouse-k8s-nfs-system-log-bloat
|
||||
description: |
|
||||
Fix for ClickHouse consuming excessive CPU (500m-1000m+) on Kubernetes when running on
|
||||
NFS storage, caused by unbounded system log table growth triggering continuous background
|
||||
merges. Use when: (1) ClickHouse burns ~1 CPU core with no active user queries,
|
||||
(2) system.merges shows constant merge activity on system.metric_log or system.trace_log,
|
||||
(3) system log tables (metric_log, trace_log, text_log, asynchronous_metric_log) have
|
||||
grown to gigabytes while actual user data is tiny, (4) ClickHouse crashes with exit code
|
||||
76 (loadOutdatedDataParts SIGSEGV), (5) attempting to mount custom config.d XML via
|
||||
Kubernetes ConfigMap causes exit code 36 (BAD_ARGUMENTS) crashes. Also covers why
|
||||
ClickHouse's MergeTree engine performs poorly on NFS and the CronJob workaround for
|
||||
system log truncation.
|
||||
author: Claude Code
|
||||
version: 1.0.0
|
||||
date: 2026-03-01
|
||||
---
|
||||
|
||||
# ClickHouse on Kubernetes/NFS: System Log Bloat & CPU Overhead
|
||||
|
||||
## Problem
|
||||
|
||||
ClickHouse deployed on Kubernetes with NFS storage consumes ~1 CPU core continuously,
|
||||
even when actual user queries are negligible. The CPU is consumed by background merge
|
||||
operations on system log tables that grow unboundedly with no default TTL.
|
||||
|
||||
## Context / Trigger Conditions
|
||||
|
||||
- ClickHouse pod using 500m-1000m+ CPU with no active user queries
|
||||
- `SELECT * FROM system.processes` shows only diagnostic queries
|
||||
- `SELECT * FROM system.merges` shows constant merge activity on `system.metric_log`
|
||||
- System log tables have grown to gigabytes:
|
||||
- `system.trace_log`: 5+ GiB, 200M+ rows
|
||||
- `system.text_log`: 3+ GiB, 90M+ rows
|
||||
- `system.metric_log`: 1+ GiB with 80-100+ active parts (healthy is <20)
|
||||
- `system.asynchronous_metric_log`: 500+ MiB, 1B+ rows
|
||||
- Actual user data (e.g., `clickhouse.events`) is only kilobytes
|
||||
- ClickHouse crashes periodically with exit code 76 (`loadOutdatedDataParts` SIGSEGV)
|
||||
- Data directory is on NFS (e.g., `/mnt/main/clickhouse`)
|
||||
|
||||
## Root Cause
|
||||
|
||||
Two compounding issues:
|
||||
|
||||
1. **No TTL on system log tables**: ClickHouse system tables (`metric_log`, `trace_log`,
|
||||
`text_log`, `asynchronous_metric_log`, `query_log`, `part_log`) have no default
|
||||
retention policy and grow indefinitely.
|
||||
|
||||
2. **NFS amplifies merge overhead**: ClickHouse's MergeTree engine relies on background
|
||||
merge operations that involve heavy sequential I/O. NFS latency makes merges 10-100x
|
||||
slower than local disk, creating a feedback loop:
|
||||
- Slow merges → parts accumulate faster than they can be merged
|
||||
- More parts → more merge operations spawned
|
||||
- More merges → more CPU for decompression/recompression while waiting on NFS I/O
|
||||
|
||||
## Solution
|
||||
|
||||
### Immediate Fix: Truncate System Tables
|
||||
|
||||
```bash
|
||||
CH_POD=$(kubectl get pod -n <namespace> -l app=clickhouse -o jsonpath='{.items[0].metadata.name}')
|
||||
kubectl exec -n <namespace> $CH_POD -- clickhouse-client --query "TRUNCATE TABLE IF EXISTS system.metric_log"
|
||||
kubectl exec -n <namespace> $CH_POD -- clickhouse-client --query "TRUNCATE TABLE IF EXISTS system.trace_log"
|
||||
kubectl exec -n <namespace> $CH_POD -- clickhouse-client --query "TRUNCATE TABLE IF EXISTS system.text_log"
|
||||
kubectl exec -n <namespace> $CH_POD -- clickhouse-client --query "TRUNCATE TABLE IF EXISTS system.asynchronous_metric_log"
|
||||
kubectl exec -n <namespace> $CH_POD -- clickhouse-client --query "TRUNCATE TABLE IF EXISTS system.query_log"
|
||||
kubectl exec -n <namespace> $CH_POD -- clickhouse-client --query "TRUNCATE TABLE IF EXISTS system.part_log"
|
||||
```
|
||||
|
||||
This can take 30-60+ seconds per table on NFS due to part cleanup I/O.
|
||||
|
||||
### Permanent Fix: CronJob for Periodic Truncation
|
||||
|
||||
Add a Kubernetes CronJob that truncates system tables via the ClickHouse HTTP API:
|
||||
|
||||
```hcl
|
||||
resource "kubernetes_cron_job_v1" "clickhouse_truncate_logs" {
|
||||
metadata {
|
||||
name = "clickhouse-truncate-logs"
|
||||
namespace = "<namespace>"
|
||||
}
|
||||
spec {
|
||||
schedule = "0 */6 * * *"
|
||||
successful_jobs_history_limit = 1
|
||||
failed_jobs_history_limit = 1
|
||||
job_template {
|
||||
metadata {}
|
||||
spec {
|
||||
template {
|
||||
metadata {}
|
||||
spec {
|
||||
restart_policy = "OnFailure"
|
||||
container {
|
||||
name = "truncate"
|
||||
image = "curlimages/curl:8.12.1"
|
||||
command = ["sh", "-c", join(" && ", [
|
||||
"curl -s 'http://clickhouse.<ns>.svc.cluster.local:8123/?user=default&password=<pw>' -d 'TRUNCATE TABLE IF EXISTS system.metric_log'",
|
||||
"curl -s 'http://clickhouse.<ns>.svc.cluster.local:8123/?user=default&password=<pw>' -d 'TRUNCATE TABLE IF EXISTS system.trace_log'",
|
||||
"curl -s 'http://clickhouse.<ns>.svc.cluster.local:8123/?user=default&password=<pw>' -d 'TRUNCATE TABLE IF EXISTS system.text_log'",
|
||||
"curl -s 'http://clickhouse.<ns>.svc.cluster.local:8123/?user=default&password=<pw>' -d 'TRUNCATE TABLE IF EXISTS system.asynchronous_metric_log'",
|
||||
"curl -s 'http://clickhouse.<ns>.svc.cluster.local:8123/?user=default&password=<pw>' -d 'TRUNCATE TABLE IF EXISTS system.query_log'",
|
||||
"curl -s 'http://clickhouse.<ns>.svc.cluster.local:8123/?user=default&password=<pw>' -d 'TRUNCATE TABLE IF EXISTS system.part_log'",
|
||||
"echo 'System logs truncated'"
|
||||
])]
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
```
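
The six curl commands in the CronJob differ only in the table name. As a rough sketch of that pattern (not part of the original setup), the commands could be generated from one list. `build_truncate_commands` and its arguments are hypothetical names; the table list is the one used in the manual fix above.

```python
from urllib.parse import urlencode

# Tables truncated by the manual fix and the CronJob above.
SYSTEM_LOG_TABLES = [
    "metric_log",
    "trace_log",
    "text_log",
    "asynchronous_metric_log",
    "query_log",
    "part_log",
]


def build_truncate_commands(base_url, user, password, tables=SYSTEM_LOG_TABLES):
    """Return one curl command per table (hypothetical helper)."""
    query = urlencode({"user": user, "password": password})
    commands = []
    for table in tables:
        statement = f"TRUNCATE TABLE IF EXISTS system.{table}"
        commands.append(f"curl -s '{base_url}/?{query}' -d '{statement}'")
    return commands
```

Joining the returned list with `" && "` gives the same shell string the Terraform `join()` call builds, so adding a table becomes a one-line change.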
|
||||
|
||||
### What Does NOT Work: Config.d XML Mount
|
||||
|
||||
**DO NOT** attempt to mount custom XML config files into `/etc/clickhouse-server/config.d/`
|
||||
via Kubernetes ConfigMap. Both approaches crash ClickHouse with exit code 36 (BAD_ARGUMENTS):
|
||||
|
||||
- **Full directory mount** (`mount_path = "/etc/clickhouse-server/config.d"`): Replaces
|
||||
the entire directory, deleting the built-in `docker_related_config.xml` that the
|
||||
entrypoint expects. Even if you include it in your ConfigMap, ClickHouse still crashes.
|
||||
|
||||
- **sub_path mount** (`sub_path = "custom.xml"`): Also crashes with exit code 36, even
|
||||
with minimal valid XML containing only `<background_pool_size>4</background_pool_size>`.
|
||||
|
||||
- Both `remove="1"` (to disable tables) and `<ttl>` (to set retention) config overrides
|
||||
crash with exit code 36.
|
||||
|
||||
This appears to be an issue with the `clickhouse/clickhouse-server:25.4.2` Docker image
|
||||
and how it preprocesses config at startup. The CronJob approach bypasses this entirely.
|
||||
|
||||
## Verification
|
||||
|
||||
After truncation, verify:
|
||||
|
||||
```bash
|
||||
# CPU should drop from ~900m to ~100m within minutes
|
||||
kubectl top pod -n <namespace> -l app=clickhouse
|
||||
|
||||
# No active merges
|
||||
kubectl exec -n <namespace> $CH_POD -- clickhouse-client --query \
|
||||
"SELECT count() FROM system.merges"
|
||||
|
||||
# System tables should be small
|
||||
kubectl exec -n <namespace> $CH_POD -- clickhouse-client --query \
|
||||
"SELECT database, table, formatReadableSize(sum(bytes_on_disk)) as size, sum(rows) as rows \
|
||||
FROM system.parts WHERE active GROUP BY database, table ORDER BY sum(bytes_on_disk) DESC \
|
||||
FORMAT Pretty"
|
||||
```
|
||||
|
||||
## Diagnostic Commands
|
||||
|
||||
```bash
|
||||
# Check what's consuming CPU (merges vs queries)
|
||||
kubectl exec -n <ns> $CH_POD -- clickhouse-client --query \
|
||||
"SELECT * FROM system.merges FORMAT Pretty"
|
||||
|
||||
kubectl exec -n <ns> $CH_POD -- clickhouse-client --query \
|
||||
"SELECT query_id, elapsed, query FROM system.processes WHERE is_initial_query FORMAT Pretty"
|
||||
|
||||
# Check background pool config
|
||||
kubectl exec -n <ns> $CH_POD -- clickhouse-client --query \
|
||||
"SELECT name, value FROM system.server_settings \
|
||||
WHERE name IN ('background_pool_size', 'background_merges_mutations_concurrency_ratio') \
|
||||
FORMAT Pretty"
|
||||
|
||||
# Default is background_pool_size=16, concurrency_ratio=2 → up to 32 concurrent merges
|
||||
```
|
||||
|
||||
## Notes
|
||||
|
||||
- **Exit code 76**: ClickHouse crashes in `loadOutdatedDataParts()` when there are hundreds
|
||||
of outdated parts on NFS. The truncation CronJob prevents this by keeping tables small.
|
||||
|
||||
- **Exit code 36**: `BAD_ARGUMENTS` in ClickHouse. Triggered by config.d XML mounts in
|
||||
Kubernetes. Root cause unclear but reproducible across mount methods.
|
||||
|
||||
- **Default thread pools**: ClickHouse defaults to `background_pool_size=16` and
|
||||
`background_schedule_pool_size=512`, spawning 700+ threads even for a single-table
|
||||
workload. This overhead is unavoidable without config file changes.
|
||||
|
||||
- **NFS is fundamentally unsuitable** for ClickHouse's MergeTree engine. If data
|
||||
persistence is not critical (e.g., analytics data is small), consider `emptyDir` or
|
||||
local PV storage instead.
|
||||
|
||||
## See Also
|
||||
|
||||
- `k8s-nfs-mount-troubleshooting` — NFS mount failures and permission issues
|
||||
- `k8s-limitrange-oom-silent-kill` — LimitRange defaults causing OOM in ClickHouse containers
|
||||
145
.claude/skills/archived/coturn-k8s-without-hostnetwork/SKILL.md
Normal file
|
|
@ -0,0 +1,145 @@
|
|||
---
|
||||
name: coturn-k8s-without-hostnetwork
|
||||
description: |
|
||||
Deploy coturn (TURN/STUN server) on Kubernetes without hostNetwork by using a
|
||||
narrow relay port range and MetalLB LoadBalancer service. Use when: (1) deploying
|
||||
a WebRTC relay server on k8s, (2) want coturn to run on any node (not pinned),
|
||||
(3) avoiding hostNetwork for better pod scheduling and multi-replica support,
|
||||
(4) need TURN for NAT traversal in WebRTC apps (video streaming, conferencing).
|
||||
Covers relay port range sizing, MetalLB IP sharing, ephemeral TURN credentials
|
||||
via HMAC-SHA1, and pfSense port forwarding.
|
||||
author: Claude Code
|
||||
version: 1.0.0
|
||||
date: 2026-02-21
|
||||
---
|
||||
|
||||
# coturn on Kubernetes Without hostNetwork
|
||||
|
||||
## Problem
|
||||
TURN servers traditionally require hostNetwork because they relay media over a wide
|
||||
UDP port range (49152-65535). This pins the server to a single node, prevents rolling
|
||||
updates, and wastes cluster flexibility.
|
||||
|
||||
## Context / Trigger Conditions
|
||||
- Deploying a TURN/STUN server for WebRTC applications on Kubernetes
|
||||
- Want the TURN pod to be schedulable on any node
|
||||
- Need to avoid hostNetwork for better availability and scheduling
|
||||
|
||||
## Solution
|
||||
|
||||
### Key insight: Narrow the relay port range
|
||||
A home lab with ~20 concurrent WebRTC viewers needs ~40 relay ports (2 per viewer).
|
||||
Use about 100 ports (49152-49252, which is 101 ports inclusive) instead of the full 16K range. This makes it practical to expose via
|
||||
a K8s LoadBalancer service.
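
The sizing arithmetic can be sketched as follows. The two-ports-per-viewer figure comes from the text above; the `headroom` multiplier and both function names are assumptions made for illustration.

```python
import math


def relay_ports_needed(viewers, ports_per_viewer=2, headroom=2.0):
    """Estimate relay ports for a number of concurrent viewers."""
    return math.ceil(viewers * ports_per_viewer * headroom)


def inclusive_port_count(min_port, max_port):
    """Number of ports in an inclusive range such as min-port..max-port."""
    return max_port - min_port + 1
```

With 20 viewers and no headroom this gives 40 ports, and `inclusive_port_count(49152, 49252)` gives 101, which is why `range()` needs an upper bound of 49253 in the Terraform below.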
|
||||
|
||||
### Terraform module structure
|
||||
|
||||
```hcl
|
||||
locals {
|
||||
turn_port = 3478
|
||||
min_port = 49152
|
||||
max_port = 49252 # 49152-49252 inclusive is 101 ports, enough for ~50 concurrent streams
|
||||
}
|
||||
|
||||
resource "kubernetes_deployment" "coturn" {
|
||||
spec {
|
||||
# No hostNetwork, no nodeSelector — runs anywhere
|
||||
template {
|
||||
spec {
|
||||
container {
|
||||
image = "coturn/coturn:latest"
|
||||
args = ["-c", "/etc/turnserver/turnserver.conf"]
|
||||
port {
|
||||
container_port = 3478
|
||||
protocol = "UDP"
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
resource "kubernetes_service" "coturn" {
|
||||
metadata {
|
||||
annotations = {
|
||||
# Share an existing MetalLB IP to avoid consuming a new one
|
||||
"metallb.universe.tf/loadBalancerIPs" = "10.0.20.200"
|
||||
"metallb.universe.tf/allow-shared-ip" = "shared"
|
||||
}
|
||||
}
|
||||
spec {
|
||||
type = "LoadBalancer"
|
||||
# Signaling port
|
||||
port {
|
||||
name = "turn-udp"
|
||||
port = 3478
|
||||
protocol = "UDP"
|
||||
}
|
||||
# Relay ports: the dynamic block generates one definition per port (101 for 49152-49252)
dynamic "port" {
for_each = range(local.min_port, local.max_port + 1)
|
||||
content {
|
||||
name = "relay-${port.value}"
|
||||
port = port.value
|
||||
target_port = port.value
|
||||
protocol = "UDP"
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
### coturn config (turnserver.conf)
|
||||
|
||||
```
|
||||
listening-port=3478
|
||||
fingerprint
|
||||
lt-cred-mech
|
||||
use-auth-secret
|
||||
static-auth-secret=YOUR_SECRET_HERE
|
||||
realm=yourdomain.com
|
||||
listening-ip=0.0.0.0
|
||||
min-port=49152
|
||||
max-port=49252
|
||||
no-multicast-peers
|
||||
no-cli
|
||||
```
|
||||
|
||||
### MetalLB IP sharing
|
||||
To reuse an existing MetalLB IP (e.g., the WireGuard/Shadowsocks shared IP):
|
||||
1. Add `metallb.universe.tf/allow-shared-ip: shared` to the coturn service
|
||||
2. The same annotation must exist on all other services sharing that IP
|
||||
3. **Port conflicts are not allowed** — verify no other service uses 3478 or 49152-49252
|
||||
4. After changing the IP annotation, **delete and recreate** the service — MetalLB won't reassign IPs on annotation changes alone
|
||||
|
||||
### Ephemeral TURN credentials
|
||||
In coturn's `use-auth-secret` mode, your application generates time-limited credentials via HMAC-SHA1 and coturn validates them against the shared secret:
|
||||
|
||||
```javascript
|
||||
const crypto = require('crypto');
|
||||
const TURN_SECRET = 'your-shared-secret';
|
||||
|
||||
function getTurnCredentials(name = 'user', ttl = 86400) {
|
||||
const timestamp = Math.floor(Date.now() / 1000) + ttl;
|
||||
const username = `${timestamp}:${name}`;
|
||||
const credential = crypto.createHmac('sha1', TURN_SECRET)
|
||||
.update(username).digest('base64');
|
||||
return { username, credential };
|
||||
}
|
||||
```
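
For backends written in Python, the same scheme might look like the sketch below. It mirrors the JavaScript above (username is `<expiry>:<name>`, credential is the base64 HMAC-SHA1 of the username). `is_credential_valid` is an illustrative approximation of the server-side check, not coturn's actual code, and the `now` parameter exists only to make the functions deterministic.

```python
import base64
import hashlib
import hmac
import time


def get_turn_credentials(secret, name="user", ttl=86400, now=None):
    """Build a (username, credential) pair for coturn's use-auth-secret mode."""
    now = int(time.time()) if now is None else int(now)
    username = f"{now + ttl}:{name}"  # expiry timestamp, then a label
    digest = hmac.new(secret.encode(), username.encode(), hashlib.sha1).digest()
    return username, base64.b64encode(digest).decode()


def is_credential_valid(secret, username, credential, now=None):
    """Recompute the HMAC and check expiry, roughly as a server would."""
    now = int(time.time()) if now is None else int(now)
    expiry = int(username.split(":", 1)[0])
    expected = base64.b64encode(
        hmac.new(secret.encode(), username.encode(), hashlib.sha1).digest()
    ).decode()
    return expiry > now and hmac.compare_digest(expected, credential)
```

Because the expiry is part of the signed username, a client cannot extend its own credential without knowing the shared secret.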
|
||||
|
||||
## Verification
|
||||
|
||||
```bash
|
||||
# STUN binding request (raw UDP probe)
|
||||
echo -ne '\x00\x01\x00\x00\x21\x12\xa4\x42\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00' \
|
||||
| nc -u -w2 <METALLB_IP> 3478 | xxd | head -3
|
||||
# Response starting with 0101 = successful STUN binding response
|
||||
```
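
The raw bytes in the probe above are a 20-byte STUN header: message type `0x0001` (Binding request), length 0, the magic cookie `0x2112A442`, and a 12-byte transaction ID. A minimal sketch of the same probe in Python, which also checks that the response echoes the transaction ID; all function names are hypothetical, and `probe` is untested against a live server here.

```python
import os
import socket
import struct

MAGIC_COOKIE = 0x2112A442
BINDING_REQUEST = 0x0001
BINDING_SUCCESS = 0x0101


def build_binding_request(transaction_id=None):
    """Return a 20-byte STUN Binding request with no attributes."""
    transaction_id = transaction_id or os.urandom(12)
    return struct.pack("!HHI", BINDING_REQUEST, 0, MAGIC_COOKIE) + transaction_id


def is_binding_success(response, transaction_id):
    """Check message type, magic cookie, and echoed transaction ID."""
    if len(response) < 20:
        return False
    msg_type, _length, cookie = struct.unpack("!HHI", response[:8])
    return (
        msg_type == BINDING_SUCCESS
        and cookie == MAGIC_COOKIE
        and response[8:20] == transaction_id
    )


def probe(host, port=3478, timeout=2.0):
    """Send one Binding request over UDP; returns False on timeout or socket error."""
    tid = os.urandom(12)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(build_binding_request(tid), (host, port))
            data, _addr = sock.recvfrom(2048)
        except OSError:
            return False
    return is_binding_success(data, tid)
```

Calling `probe("<METALLB_IP>")` should return `True` when the signaling port is reachable; it says nothing about whether the relay ports are forwarded.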
|
||||
|
||||
## Notes
|
||||
- About 100 relay ports support ~50 concurrent streams (2 ports per stream)
|
||||
- If you need more, increase `max_port` and add more ports to the service
|
||||
- coturn auto-detects pod IP — no need to set `relay-ip` or `external-ip` explicitly
|
||||
- For public access, add NAT port forwards on pfSense for UDP 3478 + 49152-49252
|
||||
- See also: `pfsense-nat-rule-creation` skill for adding the port forwards
|
||||
|
|
@ -0,0 +1,99 @@
|
|||
---
|
||||
name: crowdsec-agent-registration-failure
|
||||
description: |
|
||||
Fix CrowdSec agent pods stuck in CrashLoopBackOff after LAPI restart due to stale
|
||||
machine registrations. Use when: (1) CrowdSec agent init container fails with
|
||||
"user already exist" error during cscli lapi register, (2) agent pods show hundreds
|
||||
of init container restarts, (3) LAPI was restarted or redeployed but agents kept
|
||||
running with old credentials, (4) cscli machines list shows stale entries for
|
||||
current agent pod names. Covers deleting stale registrations to allow re-registration.
|
||||
author: Claude Code
|
||||
version: 1.0.0
|
||||
date: 2026-02-15
|
||||
---
|
||||
|
||||
# CrowdSec Agent Registration Failure
|
||||
|
||||
## Problem
|
||||
After a CrowdSec LAPI restart or redeployment, agent DaemonSet pods lose their
|
||||
credentials but LAPI retains the old machine registrations. When agents try to
|
||||
re-register with the same pod name, the `wait-for-lapi-and-register` init container
|
||||
fails with `user already exist`, causing CrashLoopBackOff with hundreds of restarts.
|
||||
|
||||
## Context / Trigger Conditions
|
||||
- Agent init container logs show: `Error: cscli lapi register: api client register: api register ... user 'crowdsec-agent-xxxxx': user already exist`
|
||||
- Agent pods show status `CrashLoopBackOff` or `Init:CrashLoopBackOff` with many restarts
|
||||
- `kubectl describe pod` shows `BackOff restarting failed container wait-for-lapi-and-register`
|
||||
- LAPI pods were recently restarted or redeployed
|
||||
- `cscli machines list` on LAPI shows entries matching the stuck agent pod names
|
||||
|
||||
## Solution
|
||||
|
||||
### Step 1: Identify stuck agents
|
||||
```bash
|
||||
kubectl --kubeconfig $(pwd)/config get pods -n crowdsec
|
||||
```
|
||||
Note the pod names that are in CrashLoopBackOff (e.g., `crowdsec-agent-jr5q7`).
|
||||
|
||||
### Step 2: Confirm the init container error
|
||||
```bash
|
||||
kubectl --kubeconfig $(pwd)/config logs -n crowdsec <agent-pod> -c wait-for-lapi-and-register --tail=5
|
||||
```
|
||||
The output should show the `user already exist` error.
|
||||
|
||||
### Step 3: Find a running LAPI pod
|
||||
```bash
|
||||
kubectl --kubeconfig $(pwd)/config get pods -n crowdsec | grep lapi
|
||||
```
|
||||
|
||||
### Step 4: Delete stale machine registrations from LAPI
|
||||
```bash
|
||||
kubectl --kubeconfig $(pwd)/config exec -n crowdsec <lapi-pod> -- cscli machines delete <agent-pod-name>
|
||||
```
|
||||
Repeat for each stuck agent.
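
When many agents are stuck, the per-agent commands can be generated from the `kubectl get pods` output. The sketch below assumes the default column layout (NAME, READY, STATUS, ...) and that the LAPI machine name equals the pod name, as described above; `stuck_agents` and `delete_commands` are hypothetical helpers that only build strings and run nothing.

```python
def stuck_agents(kubectl_output, prefix="crowdsec-agent-"):
    """Return agent pod names whose STATUS column mentions CrashLoopBackOff."""
    names = []
    for line in kubectl_output.splitlines():
        fields = line.split()
        if len(fields) < 3 or not fields[0].startswith(prefix):
            continue
        if "CrashLoopBackOff" in fields[2]:
            names.append(fields[0])
    return names


def delete_commands(lapi_pod, agent_names, namespace="crowdsec"):
    """Build one `cscli machines delete` command per stuck agent."""
    return [
        f"kubectl exec -n {namespace} {lapi_pod} -- cscli machines delete {name}"
        for name in agent_names
    ]
```

Review the generated commands before running them, since deleting the registration of a healthy agent would force it to re-register.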
|
||||
|
||||
### Step 5: Wait for agents to recover
|
||||
The agents are in CrashLoopBackOff with exponential backoff (up to 5 minutes). They'll
|
||||
automatically retry registration and succeed after the stale entry is deleted. This can
|
||||
take up to 5 minutes per agent depending on where they are in the backoff cycle.
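
To see why the wait is bounded at about five minutes, here is a small sketch of the restart back-off schedule. It assumes the documented kubelet behaviour of a 10-second initial delay that doubles per restart and is capped at 300 seconds; `backoff_delays` is an illustrative name.

```python
def backoff_delays(restarts, base=10, cap=300):
    """Delay in seconds before each of the first `restarts` restarts."""
    return [min(base * 2 ** n, cap) for n in range(restarts)]
```

A pod that has already restarted hundreds of times is at the cap, so after the stale entry is deleted the next retry should arrive within roughly 300 seconds.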
|
||||
|
||||
## Verification
|
||||
```bash
|
||||
# All agents should show Running status
|
||||
kubectl --kubeconfig $(pwd)/config get pods -n crowdsec | grep agent
|
||||
# DaemonSet should show all pods READY
|
||||
kubectl --kubeconfig $(pwd)/config get ds -n crowdsec
|
||||
```
|
||||
|
||||
## Example
|
||||
```bash
|
||||
# Identify stuck agents
|
||||
$ kubectl get pods -n crowdsec | grep agent
|
||||
crowdsec-agent-jr5q7 0/1 CrashLoopBackOff 485 3d
|
||||
crowdsec-agent-jw76q 1/1 Running 8 3d
|
||||
crowdsec-agent-mtgxh 0/1 CrashLoopBackOff 483 3d
|
||||
crowdsec-agent-pfw2l 0/1 CrashLoopBackOff 481 3d
|
||||
|
||||
# Delete stale registrations
|
||||
$ kubectl exec -n crowdsec crowdsec-lapi-xxx -- cscli machines delete crowdsec-agent-jr5q7
|
||||
level=info msg="machine 'crowdsec-agent-jr5q7' deleted successfully"
|
||||
$ kubectl exec -n crowdsec crowdsec-lapi-xxx -- cscli machines delete crowdsec-agent-mtgxh
|
||||
$ kubectl exec -n crowdsec crowdsec-lapi-xxx -- cscli machines delete crowdsec-agent-pfw2l
|
||||
|
||||
# Wait ~5 minutes, then verify
|
||||
$ kubectl get pods -n crowdsec | grep agent
|
||||
crowdsec-agent-jr5q7 1/1 Running 1 3d
|
||||
crowdsec-agent-jw76q 1/1 Running 8 3d
|
||||
crowdsec-agent-mtgxh 1/1 Running 1 3d
|
||||
crowdsec-agent-pfw2l 1/1 Running 1 3d
|
||||
```
|
||||
|
||||
## Notes
|
||||
- This is a known limitation of the CrowdSec Helm chart — the init container registration
|
||||
script is not idempotent (it doesn't handle "already exists" by deleting and re-registering).
|
||||
- The `cscli machines list` output will show many historical stale entries from past
|
||||
DaemonSet rollouts. These are harmless but can be cleaned up if desired.
|
||||
- This issue also causes the CrowdSec blocklist import CronJob to fail, since it selects
|
||||
agent pods alphabetically and may pick a non-running one. Fixing the agents also fixes
|
||||
the blocklist import.
|
||||
- See also: `k8s-nfs-mount-troubleshooting` for other common pod startup failures.
|
||||
310
.claude/skills/archived/fastapi-svelte-gpu-webui/SKILL.md
Normal file
|
|
@ -0,0 +1,310 @@
|
|||
---
|
||||
name: fastapi-svelte-gpu-webui
|
||||
description: |
|
||||
Pattern for building web UIs for GPU-based CLI tools. Use when:
|
||||
(1) Wrapping a command-line tool with a web interface, (2) Building job queue
|
||||
systems for long-running GPU tasks, (3) Creating file upload/download workflows,
|
||||
(4) Need real-time progress updates via WebSocket, (5) Deploying to Kubernetes
|
||||
with GPU scheduling. Covers FastAPI backend, Svelte 5 frontend, NFS storage,
|
||||
and Terraform deployment.
|
||||
author: Claude Code
|
||||
version: 1.0.0
|
||||
date: 2025-01-31
|
||||
---
|
||||
|
||||
# FastAPI + Svelte GPU WebUI Pattern
|
||||
|
||||
## Problem
|
||||
Many powerful tools are command-line only, making them inaccessible to non-technical
|
||||
users. Building a web UI requires handling file uploads, job queuing, progress tracking,
|
||||
and GPU resource scheduling.
|
||||
|
||||
## Context / Trigger Conditions
|
||||
- You have a CLI tool that does heavy processing (ML inference, media conversion, etc.)
|
||||
- Want to add a web interface for easier access
|
||||
- Need to track long-running job progress
|
||||
- Deploying to Kubernetes with GPU nodes
|
||||
- Files need to persist across pod restarts (NFS storage)
|
||||
|
||||
## Solution Overview
|
||||
|
||||
### Directory Structure
|
||||
```
|
||||
project-web/
|
||||
├── backend/
|
||||
│ ├── main.py # FastAPI app
|
||||
│ ├── api/
|
||||
│ │ ├── __init__.py
|
||||
│ │ └── routes.py # REST endpoints
|
||||
│ ├── services/
|
||||
│ │ ├── __init__.py
|
||||
│ │ └── converter.py # CLI wrapper + job manager
|
||||
│ ├── models/
|
||||
│ │ ├── __init__.py
|
||||
│ │ └── schemas.py # Pydantic models
|
||||
│ └── requirements.txt
|
||||
├── frontend/
|
||||
│ ├── src/
|
||||
│ │ ├── App.svelte
|
||||
│ │ ├── lib/
|
||||
│ │ │ ├── FileUpload.svelte
|
||||
│ │ │ ├── JobsList.svelte
|
||||
│ │ │ └── ProgressBar.svelte
|
||||
│ │ └── stores/
|
||||
│ │ └── jobs.js
|
||||
│ ├── package.json
|
||||
│ └── vite.config.js
|
||||
├── Dockerfile
|
||||
└── README.md
|
||||
```
|
||||
|
||||
### Backend: Job Manager Pattern
|
||||
```python
# services/converter.py
import asyncio
import re
import uuid
from dataclasses import dataclass
from datetime import datetime
from pathlib import Path
from typing import Callable, Optional


@dataclass  # needed so Job(...) accepts keyword arguments
class Job:
    id: str
    filename: str
    status: str  # pending, processing, completed, failed
    progress: float
    created_at: datetime
    output_file: Optional[str] = None
    error: Optional[str] = None


class JobManager:
    def __init__(self, storage_path: str = "/mnt"):
        self.storage_path = Path(storage_path)
        self.jobs: dict[str, Job] = {}
        self.progress_callbacks: dict[str, list[Callable]] = {}

    def create_job(self, filename: str, **options) -> Job:
        job_id = str(uuid.uuid4())
        job = Job(
            id=job_id,
            filename=filename,
            status="pending",
            progress=0.0,
            created_at=datetime.now(),
            **options
        )
        self.jobs[job_id] = job
        return job

    def get_job(self, job_id: str) -> Optional[Job]:
        return self.jobs.get(job_id)

    def get_all_jobs(self) -> list[Job]:
        return list(self.jobs.values())

    def update_progress(self, job_id: str, progress: float) -> None:
        self.jobs[job_id].progress = progress
        for callback in self.progress_callbacks.get(job_id, []):
            callback(progress)

    async def run_conversion(self, job_id: str):
        job = self.jobs[job_id]
        job.status = "processing"

        input_path = self.storage_path / "uploads" / job.filename
        output_dir = self.storage_path / "outputs" / job_id
        output_dir.mkdir(parents=True, exist_ok=True)

        # Build command for CLI tool
        cmd = [
            "/path/to/cli-tool",
            str(input_path),
            "-o", str(output_dir),
            # Add other options...
        ]

        # Run with output capture for progress parsing
        process = await asyncio.create_subprocess_exec(
            *cmd,
            stdout=asyncio.subprocess.PIPE,
            stderr=asyncio.subprocess.PIPE,
        )

        # Parse output for progress updates
        async def read_output(stream):
            while True:
                line = await stream.readline()
                if not line:
                    break
                line_str = line.decode(errors="replace").strip()
                # The "NN%" format is an assumption; adjust the pattern
                # to whatever your CLI tool prints.
                match = re.search(r"(\d+(?:\.\d+)?)\s*%", line_str)
                if match:
                    self.update_progress(job_id, float(match.group(1)))

        await asyncio.gather(
            read_output(process.stdout),
            read_output(process.stderr)
        )

        returncode = await process.wait()

        if returncode == 0:
            output_files = list(output_dir.glob("*.m4b"))
            if output_files:
                job.output_file = output_files[0].name
                job.status = "completed"
            else:
                # Exit code 0 but nothing written: report it instead of
                # leaving the job stuck in "processing".
                job.status = "failed"
                job.error = "No output file produced"
        else:
            job.status = "failed"
            job.error = f"Exit code {returncode}"


job_manager = JobManager()
```
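
Each job is started as its own task, so several conversions could compete for a single GPU. One possible way to serialize them is a semaphore-guarded wrapper, sketched below with only `asyncio`. `GpuJobQueue` is a hypothetical helper, not part of the pattern above, and `active`/`peak` exist only to make the behaviour observable.

```python
import asyncio


class GpuJobQueue:
    """Run submitted coroutines with at most `max_concurrent` active at once."""

    def __init__(self, max_concurrent=1):
        self._semaphore = asyncio.Semaphore(max_concurrent)
        self.active = 0
        self.peak = 0

    async def run(self, job_coro_factory):
        # Wait for a free GPU slot before starting the job.
        async with self._semaphore:
            self.active += 1
            self.peak = max(self.peak, self.active)
            try:
                return await job_coro_factory()
            finally:
                self.active -= 1
```

A route handler could then schedule `queue.run(lambda: job_manager.run_conversion(job.id))` instead of the bare coroutine. Jobs waiting on the semaphore stay in `pending` until a slot frees up.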
|
||||
|
||||
### Backend: API Routes
|
||||
```python
# api/routes.py
import asyncio
import shutil
from pathlib import Path

from fastapi import APIRouter, File, HTTPException, UploadFile
from fastapi.responses import FileResponse

# Import paths are assumptions based on the directory structure above;
# JobCreate is assumed to be a Pydantic model with a `filename` field.
from models.schemas import JobCreate
from services.converter import job_manager

router = APIRouter(prefix="/api")


@router.post("/upload")
async def upload_file(file: UploadFile = File(...)):
    upload_dir = Path("/mnt/uploads")
    upload_dir.mkdir(parents=True, exist_ok=True)
    # file.filename is client-controlled; sanitize it before use
    # (see the `python-filename-sanitization` skill).
    file_path = upload_dir / file.filename

    with file_path.open("wb") as buffer:
        shutil.copyfileobj(file.file, buffer)

    return {"filename": file.filename, "size": file_path.stat().st_size}


@router.post("/jobs")
async def create_job(request: JobCreate):
    # Pass any additional job options as keyword arguments here.
    job = job_manager.create_job(filename=request.filename)
    asyncio.create_task(job_manager.run_conversion(job.id))
    return job


@router.get("/jobs")
async def list_jobs():
    return job_manager.get_all_jobs()


@router.get("/jobs/{job_id}/download")
async def download_job(job_id: str):
    job = job_manager.get_job(job_id)
    if not job or job.status != "completed":
        raise HTTPException(404)
    output_path = Path("/mnt/outputs") / job_id / job.output_file
    return FileResponse(output_path, filename=job.output_file)
```
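
The upload route builds a path from the client-supplied filename, so a name like `../../etc/passwd` needs to be neutralized first. A minimal sketch of one way to do that with `pathlib`; `safe_upload_path` is a hypothetical helper, and the referenced `python-filename-sanitization` skill may take a different approach.

```python
from pathlib import Path


def safe_upload_path(upload_dir, filename):
    """Return a path inside upload_dir, or raise ValueError for unusable names."""
    # Drop any directory components supplied by the client.
    name = Path(filename.replace("\\", "/")).name
    if name in ("", ".", ".."):
        raise ValueError(f"unusable filename: {filename!r}")
    base = Path(upload_dir).resolve()
    candidate = (base / name).resolve()
    # Defence in depth: the resolved path must still live under upload_dir.
    if base not in candidate.parents:
        raise ValueError(f"path escapes upload dir: {filename!r}")
    return candidate
```

In the route, `file_path = safe_upload_path(upload_dir, file.filename)` would replace the direct join, with a `ValueError` mapped to an HTTP 400 response.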
|
||||
|
||||
### Frontend: Svelte 5 Components
|
||||
```svelte
|
||||
<!-- FileUpload.svelte -->
|
||||
<script>
|
||||
let { onUpload } = $props();
|
||||
let dragOver = $state(false);
|
||||
let uploading = $state(false);
|
||||
|
||||
async function handleUpload(file) {
|
||||
uploading = true;
|
||||
const formData = new FormData();
|
||||
formData.append('file', file);
|
||||
|
||||
const response = await fetch('/api/upload', {
|
||||
method: 'POST',
|
||||
body: formData
|
||||
});
|
||||
|
||||
if (response.ok) {
|
||||
const data = await response.json();
|
||||
onUpload(data.filename);
|
||||
}
|
||||
uploading = false;
|
||||
}
|
||||
</script>
|
||||
|
||||
<div class="dropzone"
|
||||
class:dragover={dragOver}
|
||||
ondragover={(e) => { e.preventDefault(); dragOver = true; }}
|
||||
ondragleave={() => dragOver = false}
|
||||
ondrop={(e) => { e.preventDefault(); handleUpload(e.dataTransfer.files[0]); }}>
|
||||
Drop file here
|
||||
</div>
|
||||
```
|
||||
|
||||
### Dockerfile
|
||||
```dockerfile
|
||||
FROM python:3.12-slim
|
||||
|
||||
# Install Node for frontend build
|
||||
RUN apt-get update && apt-get install -y nodejs npm
|
||||
|
||||
# Build frontend
|
||||
COPY frontend/ /app/frontend/
|
||||
WORKDIR /app/frontend
|
||||
RUN npm install && npm run build
|
||||
|
||||
# Install backend
|
||||
COPY backend/ /app/backend/
|
||||
WORKDIR /app/backend
|
||||
RUN pip install -r requirements.txt
|
||||
|
||||
# Serve static files from FastAPI
|
||||
EXPOSE 8000
|
||||
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]
|
||||
```
|
||||
|
||||
### Terraform Deployment (GPU)
|
||||
```hcl
|
||||
resource "kubernetes_deployment" "myapp" {
|
||||
spec {
|
||||
template {
|
||||
spec {
|
||||
node_selector = { "gpu" : "true" }
|
||||
|
||||
toleration {
|
||||
key = "nvidia.com/gpu"
|
||||
operator = "Equal"
|
||||
value = "true"
|
||||
effect = "NoSchedule"
|
||||
}
|
||||
|
||||
container {
|
||||
image = "myregistry/myapp@sha256:..."
|
||||
name = "myapp"
|
||||
|
||||
resources {
|
||||
limits = { "nvidia.com/gpu" = "1" }
|
||||
}
|
||||
|
||||
volume_mount {
|
||||
name = "data"
|
||||
mount_path = "/mnt"
|
||||
}
|
||||
}
|
||||
|
||||
volume {
|
||||
name = "data"
|
||||
nfs {
|
||||
server = "10.0.10.15"
|
||||
path = "/mnt/main/myapp"
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
## Verification
|
||||
1. Upload a file via the UI
|
||||
2. Start a conversion job
|
||||
3. Watch progress update in real-time
|
||||
4. Download the completed file
|
||||
5. Verify files persist across pod restarts
|
||||
|
||||
## Notes
|
||||
- Use image digest for reliable deployments (see `k8s-docker-registry-cache-bypass` skill)
|
||||
- NFS storage persists across pod restarts
|
||||
- GPU node taints require matching tolerations
|
||||
- Consider adding job persistence (database) for production use
|
||||
- WebSocket can provide smoother progress updates than polling
|
||||
|
||||
## See Also
|
||||
- `k8s-docker-registry-cache-bypass` - Fixing image cache issues
|
||||
- `k8s-gpu-no-nvidia-devices` - GPU device troubleshooting
|
||||
- `python-filename-sanitization` - Secure file handling
|
||||
|
|
@ -0,0 +1,105 @@
|
|||
---
|
||||
name: grafana-stale-datasource-cleanup
|
||||
description: |
|
||||
Fix Grafana datasource errors when a Helm chart creates a datasource that conflicts
|
||||
with provisioned ones, or when stale datasources persist in the MySQL database.
|
||||
Use when: (1) Grafana shows "dial tcp: lookup <service> no such host" for a datasource,
|
||||
(2) Grafana API returns "datasources:delete permissions needed" when trying to remove
|
||||
a datasource, (3) provisioned datasource exists but Grafana uses a stale one from
|
||||
the database, (4) Helm chart auto-creates a datasource pointing to a disabled gateway
|
||||
service (e.g., loki-gateway). Requires direct MySQL access to fix when Grafana RBAC
|
||||
blocks API operations.
|
||||
author: Claude Code
|
||||
version: 1.0.0
|
||||
date: 2026-02-13
|
||||
---
|
||||
|
||||
# Grafana Stale Datasource Cleanup
|
||||
|
||||
## Problem
|
||||
Grafana uses a stale or incorrect datasource from its MySQL database instead of
|
||||
the correctly provisioned one. Common when Helm charts auto-create datasources
|
||||
that point to services you've disabled (e.g., Loki gateway).
|
||||
|
||||
## Context / Trigger Conditions
|
||||
- Grafana shows error: `dial tcp: lookup loki-gateway on 10.96.0.10:53: no such host`
|
||||
- A provisioned datasource (via ConfigMap sidecar) is correct but Grafana uses a
|
||||
different one stored in MySQL
|
||||
- Grafana API returns `"permissions needed: datasources:delete"` or
|
||||
`"permissions needed: datasources:write"` even with admin credentials
|
||||
- Dashboard references a datasource UID that points to a wrong URL
|
||||
|
||||
## Solution
|
||||
|
||||
### Step 1: Identify the stale datasource
|
||||
|
||||
List all datasources via API (this usually works even with RBAC):
|
||||
```bash
|
||||
kubectl exec -n monitoring deploy/grafana -c grafana -- \
|
||||
sh -c 'curl -s "http://localhost:3000/api/datasources" \
|
||||
-u "admin:$GF_SECURITY_ADMIN_PASSWORD"' | python3 -c \
|
||||
"import sys,json; [print(d['uid'], d['name'], d['url']) for d in json.load(sys.stdin)]"
|
||||
```
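
With many datasources, it can help to filter the API response by the host that no longer resolves. The sketch below works on the JSON returned by `/api/datasources` and assumes each entry has the `uid`, `name`, and `url` fields printed above; `find_by_host` is an illustrative name.

```python
import json
from urllib.parse import urlparse


def find_by_host(datasources_json, host):
    """Return (uid, name, url) for datasources whose URL host matches `host`."""
    matches = []
    for ds in json.loads(datasources_json):
        url = ds.get("url") or ""
        if urlparse(url).hostname == host:
            matches.append((ds["uid"], ds["name"], url))
    return matches
```

For the error shown earlier, `find_by_host(response_text, "loki-gateway")` would list the UIDs to use as `<STALE_UID>` in the next steps.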
|
||||
|
||||
### Step 2: Try API deletion first
|
||||
|
||||
```bash
|
||||
kubectl exec -n monitoring deploy/grafana -c grafana -- \
|
||||
sh -c 'curl -s -X DELETE "http://localhost:3000/api/datasources/uid/<STALE_UID>" \
|
||||
-u "admin:$GF_SECURITY_ADMIN_PASSWORD"'
|
||||
```
|
||||
|
||||
If this returns a permissions error, proceed to Step 3.
|
||||
|
||||
### Step 3: Delete directly from MySQL
|
||||
|
||||
When Grafana RBAC blocks API operations, go through MySQL:
|
||||
|
||||
```bash
|
||||
# Find the Grafana MySQL password
|
||||
kubectl exec -n monitoring deploy/grafana -c grafana -- \
|
||||
sh -c 'echo $GF_DATABASE_PASSWORD'
|
||||
|
||||
# Find the stale datasource
|
||||
kubectl exec -n dbaas deploy/mysql -- mysql -u grafana -p"<PASSWORD>" grafana \
|
||||
-e "SELECT id, uid, name, url FROM data_source;"
|
||||
|
||||
# Delete it
|
||||
kubectl exec -n dbaas deploy/mysql -- mysql -u grafana -p"<PASSWORD>" grafana \
|
||||
-e "DELETE FROM data_source WHERE uid='<STALE_UID>';"
|
||||
```
|
||||
|
||||
### Step 4: Fix dashboards referencing the old UID
|
||||
|
||||
Dashboards store datasource UIDs in their JSON. Update via MySQL:
|
||||
```bash
|
||||
kubectl exec -n dbaas deploy/mysql -- mysql -u grafana -p"<PASSWORD>" grafana \
|
||||
-e "UPDATE dashboard SET data = REPLACE(data, '<OLD_UID>', '<NEW_UID>') WHERE title LIKE '%Dashboard Name%';"
|
||||
```
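
SQL `REPLACE` rewrites every occurrence of the old UID string in the dashboard JSON, including any that happen to appear in titles or queries. If that is a concern, a structured rewrite is one alternative. The sketch below assumes the newer dashboard format where references look like `"datasource": {"type": ..., "uid": ...}`; `replace_datasource_uid` is a hypothetical helper, and writing the result back to Grafana is left out.

```python
import json


def replace_datasource_uid(node, old_uid, new_uid):
    """Recursively rewrite {"datasource": {"uid": old_uid}} references in place.

    Returns the number of references changed.
    """
    changed = 0
    if isinstance(node, dict):
        ds = node.get("datasource")
        if isinstance(ds, dict) and ds.get("uid") == old_uid:
            ds["uid"] = new_uid
            changed += 1
        for value in node.values():
            changed += replace_datasource_uid(value, old_uid, new_uid)
    elif isinstance(node, list):
        for item in node:
            changed += replace_datasource_uid(item, old_uid, new_uid)
    return changed
```

Only datasource references are touched, so a dashboard title that contains the old UID as a substring stays unchanged.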
|
||||
|
||||
### Step 5: Refresh Grafana
|
||||
|
||||
Hard-refresh the browser (Cmd+Shift+R, or Ctrl+Shift+R on Windows/Linux). If the datasource still doesn't appear:
|
||||
```bash
|
||||
kubectl rollout restart deploy -n monitoring grafana
|
||||
```
|
||||
|
||||
## Verification
|
||||
```bash
|
||||
# Verify only correct datasources remain
|
||||
kubectl exec -n monitoring deploy/grafana -c grafana -- \
|
||||
sh -c 'curl -s "http://localhost:3000/api/datasources" \
|
||||
-u "admin:$GF_SECURITY_ADMIN_PASSWORD"' | python3 -m json.tool
|
||||
```
|
||||
|
||||
## Notes
|
||||
- Grafana's sidecar auto-discovers ConfigMaps with label `grafana_datasource: "1"`
|
||||
and provisions datasources from them. These are file-provisioned and show as
|
||||
"provisioned" in the UI.
|
||||
- Helm charts (e.g., Loki) may auto-create their own datasource in the Grafana
|
||||
database pointing to services like `loki-gateway`. If you disable the gateway,
|
||||
this datasource becomes stale.
|
||||
- Grafana dashboards in this repo are stored in MySQL (not file-provisioned),
|
||||
so dashboard JSON files in the repo are reference copies only.
|
||||
- The `GF_SECURITY_ADMIN_PASSWORD` env var is set by the Grafana Helm chart.
|
||||
- See also: `loki-helm-deployment-pitfalls` for related Loki deployment issues.
|
||||
253
.claude/skills/archived/helm-release-troubleshooting/SKILL.md
Normal file
|
|
@ -0,0 +1,253 @@
|
|||
---
|
||||
name: helm-release-troubleshooting
|
||||
description: |
|
||||
Troubleshoot and fix Helm release issues managed by Terraform. Use when:
|
||||
(1) Terraform applies successfully but K8s resources don't reflect new Helm values,
|
||||
(2) New ports/volumes/containers from Helm chart values don't appear in deployed resources,
|
||||
(3) helm upgrade --reuse-values doesn't re-render templates for structural changes,
|
||||
(4) Terraform thinks Helm release is up-to-date but actual K8s resources are stale,
|
||||
(5) terraform apply fails with "another operation (install/upgrade/rollback) is in progress",
|
||||
(6) helm history shows status "pending-upgrade" or "pending-rollback",
|
||||
(7) a Helm upgrade was interrupted by network timeout, etcd timeout, or VPN drop,
|
||||
(8) helm upgrade fails with "an error occurred while finding last successful release".
|
||||
Covers force re-rendering via state removal/reimport and stuck release recovery via
|
||||
secret cleanup.
|
||||
author: Claude Code
|
||||
version: 1.0.0
|
||||
date: 2026-02-22
|
||||
---
|
||||
|
||||
# Helm Release Troubleshooting
|
||||
|
||||
## Force Re-render
|
||||
|
||||
### Problem
|
||||
After changing Helm chart values in a Terraform `helm_release` resource, Terraform applies
|
||||
successfully but the actual Kubernetes resources (Services, Deployments, etc.) don't reflect
|
||||
the new values. For example, adding a new port in Helm values doesn't result in that port
|
||||
appearing in the Service spec.
|
||||
|
||||
### Context / Trigger Conditions
|
||||
- Terraform `helm_release` applies with "1 changed" but `kubectl get svc -o yaml` shows
|
||||
the old configuration
|
||||
- Structural changes to Helm values (new ports, new containers, new volumes) are not
|
||||
reflected in deployed resources
|
||||
- The Helm chart templates need to be fully re-rendered, not just patched
|
||||
- Common with Traefik, ingress-nginx, and other charts where template logic conditionally
|
||||
includes resources based on values
|
||||
|
||||
### Root Cause
|
||||
Terraform's `helm_release` resource performs the equivalent of `helm upgrade` under the hood (via the Helm SDK). When values are
|
||||
changed, Helm may use `--reuse-values` behavior where it merges new values into existing
|
||||
ones rather than doing a full template re-render. For structural changes (like enabling
|
||||
HTTP/3 which adds a new UDP port to the Service template), the templates may not be
|
||||
re-rendered with the new conditional branches active.
|
||||
|
||||
Additionally, Terraform may see the stored Helm release state as matching the desired state
|
||||
even though the actual Kubernetes resources don't reflect it, creating a state drift that
|
||||
Terraform doesn't detect.
|
||||
|
||||
### Solution
|
||||
|
||||
#### Step 1: Verify the Discrepancy
|
||||
|
||||
Confirm that K8s resources don't match Helm values:
|
||||
```bash
|
||||
# Check the actual resource
|
||||
kubectl get svc <service-name> -n <namespace> -o yaml
|
||||
|
||||
# Check what Helm thinks is deployed
|
||||
helm get values <release-name> -n <namespace>
|
||||
helm get manifest <release-name> -n <namespace> | grep -A10 "<expected-config>"
|
||||
```
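The comparison in Step 1 can be made mechanical by testing both outputs for the same marker. This is a minimal sketch, assuming the expected change can be recognised by a `name:` entry such as a port name; `has_named_entry` is a hypothetical helper, not part of Helm or kubectl.

```bash
# has_named_entry NAME
# Hypothetical helper: reads YAML on stdin and succeeds if any line is a
# "name: NAME" entry (optionally quoted, optionally a list item).
# NAME is used inside a regular expression, so keep it to plain characters.
has_named_entry() {
  grep -Eq "^[[:space:]-]*name:[[:space:]]*\"?$1\"?[[:space:]]*\$"
}

# Usage against a cluster (not run here):
#   helm get manifest <release-name> -n <namespace> | has_named_entry websecure-http3
#   kubectl get svc <service-name> -n <namespace> -o yaml | has_named_entry websecure-http3
```

If the entry is present in one output and missing from the other, the release record and the live object disagree, which is the situation this section addresses.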
|
||||
|
||||
#### Step 2: Remove Helm Release from Terraform State
|
||||
|
||||
```bash
|
||||
terraform state rm 'module.kubernetes_cluster.module.<service>.helm_release.<name>'
|
||||
```
|
||||
|
||||
**IMPORTANT**: This only removes from Terraform state. The actual Helm release and K8s
|
||||
resources remain untouched in the cluster.
|
||||
|
||||
#### Step 3: Import the Helm Release Back
|
||||
|
||||
```bash
|
||||
terraform import 'module.kubernetes_cluster.module.<service>.helm_release.<name>' '<namespace>/<release-name>'
|
||||
```
|
||||
|
||||
For Helm releases, the import ID format is `namespace/release-name`.
|
||||
|
||||
#### Step 4: Force Apply with Terraform
|
||||
|
||||
After reimporting, run terraform apply. Terraform should now detect the drift between
|
||||
the desired Helm values and the actual release state:
|
||||
|
||||
```bash
|
||||
terraform apply -target=module.kubernetes_cluster.module.<service>
|
||||
```
|
||||
|
||||
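If Terraform still shows "no changes", you may need to taint the resource. This forces Terraform to destroy and recreate the release on the next apply (an uninstall followed by a fresh install, so expect downtime). `terraform taint` is deprecated since Terraform v0.15.2; `terraform apply -replace=<address>` is the current equivalent: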
|
||||
```bash
|
||||
terraform taint 'module.kubernetes_cluster.module.<service>.helm_release.<name>'
|
||||
terraform apply -target=module.kubernetes_cluster.module.<service>
|
||||
```
|
||||
|
||||
#### Step 5: Manual Helm Force Upgrade (Last Resort)
|
||||
|
||||
If Terraform still doesn't fix it, use Helm directly as a one-time fix, then reimport:
|
||||
|
||||
```bash
|
||||
# Get the current values file
|
||||
helm get values <release-name> -n <namespace> -o yaml > /tmp/values.yaml
|
||||
|
||||
# Edit /tmp/values.yaml to include the correct values, or use --set flags
|
||||
|
||||
# Force upgrade (templates are re-rendered; changed resources are replaced rather than patched)
|
||||
helm upgrade --force <release-name> <chart> -n <namespace> -f /tmp/values.yaml
|
||||
|
||||
# Then reimport into Terraform
|
||||
terraform state rm 'module.kubernetes_cluster.module.<service>.helm_release.<name>'
|
||||
terraform import 'module.kubernetes_cluster.module.<service>.helm_release.<name>' '<namespace>/<release-name>'
|
||||
terraform apply -target=module.kubernetes_cluster.module.<service>
|
||||
```
|
||||
|
||||
**WARNING**: Direct Helm operations bypass Terraform. Always reimport into Terraform state
|
||||
afterward, and use `terraform apply` to verify Terraform is back in sync.
|
||||
|
||||
### Verification
|
||||
|
||||
```bash
|
||||
# Check the K8s resources now match expected configuration
|
||||
kubectl get svc <service-name> -n <namespace> -o yaml
|
||||
kubectl get deployment <deployment-name> -n <namespace> -o yaml
|
||||
|
||||
# Verify Terraform is in sync
|
||||
terraform plan -target=module.kubernetes_cluster.module.<service>
|
||||
# Should show "No changes" or minimal expected drift
|
||||
```
|
||||
|
||||
### Example: Traefik HTTP/3 UDP Port Not Appearing
|
||||
|
||||
**Problem**: Added `http3.enabled=true` to Traefik Helm values. Terraform applied
|
||||
successfully, but the Traefik Service only had TCP port 443, missing the expected
|
||||
UDP port 443 (`websecure-http3`).
|
||||
|
||||
**Fix**:
|
||||
```bash
|
||||
# 1. Remove from state
|
||||
terraform state rm 'module.kubernetes_cluster.module.traefik.helm_release.traefik'
|
||||
|
||||
# 2. Reimport
|
||||
terraform import 'module.kubernetes_cluster.module.traefik.helm_release.traefik' 'traefik/traefik'
|
||||
|
||||
# 3. Apply (Terraform now detects the drift)
|
||||
terraform apply -target=module.kubernetes_cluster.module.traefik
|
||||
|
||||
# 4. Verify
|
||||
kubectl get svc traefik -n traefik -o yaml | grep -A3 "websecure-http3"
|
||||
# Should show: port: 443, protocol: UDP
|
||||
```
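The commands in this example follow a fixed pattern, so they can be generated for any release. The sketch below only prints the sequence and does not run it; `print_reimport_plan` is a hypothetical helper, and it assumes the `module.kubernetes_cluster.module.<service>` address layout used throughout this document.

```bash
# print_reimport_plan SERVICE RESOURCE NAMESPACE RELEASE
# Hypothetical helper: prints (does not execute) the state rm / import / apply
# sequence for a helm_release nested under module.kubernetes_cluster.
print_reimport_plan() {
  addr="module.kubernetes_cluster.module.$1.helm_release.$2"
  printf "terraform state rm '%s'\n" "$addr"
  printf "terraform import '%s' '%s/%s'\n" "$addr" "$3" "$4"
  printf "terraform apply -target=module.kubernetes_cluster.module.%s\n" "$1"
}

# Example: reproduce the Traefik commands shown above
#   print_reimport_plan traefik traefik traefik traefik
```

Printing rather than executing keeps a review step between generating the commands and touching Terraform state.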
|
||||
|
||||
### Notes
|
||||
|
||||
- This issue is more common with structural Helm value changes (new ports, new sidecars,
|
||||
conditional template blocks) than with simple value changes (image tags, replica counts)
|
||||
- The `helm upgrade --force` flag forces resource updates through a replacement strategy
(changed resources are replaced rather than patched), which can cause brief downtime. Use with caution on production ingress controllers.
|
||||
- Always verify with `terraform plan` after fixing to ensure Terraform state is consistent
|
||||
|
||||
---
|
||||
|
||||
## Stuck Release Recovery
|
||||
|
||||
### Problem
|
||||
Helm releases can get stuck in `pending-upgrade`, `pending-rollback`, or `pending-install`
|
||||
states when an upgrade is interrupted (network drop, etcd timeout, resource exhaustion).
|
||||
Subsequent upgrades or terraform applies fail because Helm thinks an operation is in progress.
|
||||
|
||||
### Context / Trigger Conditions
|
||||
- `terraform apply` fails with: `another operation (install/upgrade/rollback) is in progress`
|
||||
- `helm history <release> -n <namespace>` shows `pending-upgrade`, `pending-rollback`, or `pending-install`
|
||||
- A previous Helm upgrade was interrupted by network timeout, VPN drop, or etcd timeout
|
||||
- `helm upgrade` fails with: `an error occurred while finding last successful release`
|
||||
|
||||
### Solution
|
||||
|
||||
#### Step 1: Identify the stuck release
|
||||
```bash
|
||||
helm --kubeconfig "$(pwd)/config" history <release> -n <namespace> | tail -5
|
||||
```
|
||||
|
||||
Look for revisions with status `pending-upgrade`, `pending-rollback`, or `pending-install`.
|
||||
|
||||
#### Step 2: Delete the stuck Helm release secrets
|
||||
Each Helm revision is stored as a Kubernetes secret named `sh.helm.release.v1.<release>.v<revision>`.
|
||||
Delete all stuck revisions:
|
||||
|
||||
```bash
|
||||
# Delete specific stuck revision (e.g., revision 5)
|
||||
kubectl --kubeconfig "$(pwd)/config" delete secret sh.helm.release.v1.<release>.v5 -n <namespace>
|
||||
|
||||
# If multiple stuck revisions exist, delete all of them
|
||||
kubectl --kubeconfig "$(pwd)/config" delete secret sh.helm.release.v1.<release>.v6 -n <namespace>
|
||||
```
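The secret names for stuck revisions can be derived from the `helm history` output instead of being typed by hand. This is a sketch under stated assumptions: `stuck_secret_names` is a hypothetical helper, it reads the plain-text history on stdin, and it only reports revisions whose status is `pending-*` (a `failed` revision, as in the example further down, still has to be picked manually).

```bash
# stuck_secret_names RELEASE
# Hypothetical helper: reads `helm history` text on stdin and prints the
# secret name of every revision whose status column is pending-install,
# pending-upgrade or pending-rollback. The revision number is the first field.
stuck_secret_names() {
  awk -v rel="$1" '
    {
      for (i = 2; i <= NF; i++) {
        if ($i ~ /^pending-(install|upgrade|rollback)$/) {
          printf "sh.helm.release.v1.%s.v%s\n", rel, $1
          break
        }
      }
    }'
}

# Usage against a cluster (not run here):
#   helm history <release> -n <namespace> | stuck_secret_names <release>
```

Review the printed names before passing them to `kubectl delete secret`.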
|
||||
|
||||
#### Step 3: Verify the release is clean
|
||||
```bash
|
||||
helm --kubeconfig "$(pwd)/config" history <release> -n <namespace> | tail -3
|
||||
```
|
||||
|
||||
The latest revision should now show `deployed` status.
|
||||
|
||||
#### Step 4: Retry the upgrade
|
||||
```bash
|
||||
terraform apply -target=module.kubernetes_cluster.module.<service> -var="kube_config_path=$(pwd)/config" -auto-approve
|
||||
```
|
||||
|
||||
### Important Notes
|
||||
|
||||
- **Never patch the secret labels** (e.g., changing `status: pending-rollback` to `status: failed`).
|
||||
This changes the label but not the encoded release data inside the secret, leaving Helm in an
|
||||
inconsistent state. Always delete the stuck secrets entirely.
|
||||
- If the failed upgrade partially applied changes to the cluster (e.g., modified a Deployment),
|
||||
the next successful upgrade will reconcile the state.
|
||||
- When VPN/network is unstable, prefer direct `helm upgrade --reuse-values --set key=value`
|
||||
over `terraform apply`, since Helm upgrades are faster than the full Terraform refresh cycle.
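To see why patching the label is not enough, it helps to look at the encoded release data itself. The sketch below assumes the Helm 3 secret storage layout (JSON, gzip-compressed, base64-encoded by Helm, then base64-encoded again by the Kubernetes API) and a `base64` that accepts `-d`; `decode_release` is a hypothetical helper.

```bash
# decode_release
# Hypothetical helper: reads the value of .data.release on stdin and prints
# the release JSON. Two base64 layers are removed before decompressing.
decode_release() {
  base64 -d | base64 -d | gunzip
}

# Usage against a cluster (not run here):
#   kubectl get secret sh.helm.release.v1.<release>.v<revision> -n <namespace> \
#     -o jsonpath='{.data.release}' | decode_release | grep -o '"status":"[^"]*"'
```

The status printed from the decoded payload is the one Helm reads, which is why changing only the `status` label leaves the release inconsistent.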
|
||||
|
||||
### Verification
|
||||
After deleting stuck secrets and re-applying:
|
||||
- `helm history` shows the new revision as `deployed`
|
||||
- `terraform apply` completes without errors
|
||||
|
||||
### Example
|
||||
```bash
|
||||
# Helm history shows stuck state
|
||||
$ helm history nextcloud -n nextcloud | tail -3
|
||||
4 deployed nextcloud-8.8.1 Upgrade complete
|
||||
5 failed nextcloud-8.8.1 Upgrade failed: etcd timeout
|
||||
6 pending-rollback nextcloud-8.8.1 Rollback to 4
|
||||
|
||||
# Fix: delete stuck revisions
|
||||
$ kubectl delete secret sh.helm.release.v1.nextcloud.v5 sh.helm.release.v1.nextcloud.v6 -n nextcloud
|
||||
|
||||
# Verify clean state
|
||||
$ helm history nextcloud -n nextcloud | tail -1
|
||||
4 deployed nextcloud-8.8.1 Upgrade complete
|
||||
|
||||
# Re-apply
|
||||
$ terraform apply -target=module.kubernetes_cluster.module.nextcloud -auto-approve
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## See Also
|
||||
|
||||
- `terraform-state-identity-mismatch` - For Terraform provider identity errors
|
||||
- `traefik-http3-quic` - For enabling HTTP/3 on Traefik (common trigger for force re-render)
|
||||
|
||||
## References
|
||||
|
||||
- [Terraform helm_release Resource](https://registry.terraform.io/providers/hashicorp/helm/latest/docs/resources/release)
|
||||
- [Helm Upgrade Documentation](https://helm.sh/docs/helm/helm_upgrade/)
|
||||
- [Helm --force Flag](https://helm.sh/docs/helm/helm_upgrade/#options)
|
||||
157
.claude/skills/archived/ingress-factory-migration/SKILL.md
Normal file
|
|
@ -0,0 +1,157 @@
|
|||
---
|
||||
name: ingress-factory-migration
|
||||
description: |
|
||||
Migrate raw kubernetes_ingress_v1 resources to the centralized ingress_factory module.
|
||||
Use when: (1) a service defines a raw kubernetes_ingress_v1 with hand-rolled Traefik
|
||||
middleware annotations, (2) adding a new service that needs standard ingress with
|
||||
rate limiting, CrowdSec, CSP headers, rybbit analytics, or authentik auth,
|
||||
(3) refactoring existing ingresses for consistency. Covers single-path, multi-path,
|
||||
split UI/API, full_host overrides, custom rate limits, and extra middleware injection.
|
||||
author: Claude Code
|
||||
version: 1.0.0
|
||||
date: 2026-02-10
|
||||
---
|
||||
|
||||
# Ingress Factory Migration
|
||||
|
||||
## Problem
|
||||
Services define raw `kubernetes_ingress_v1` resources with hand-rolled Traefik middleware
|
||||
chains. This creates inconsistency - middleware chains are copy-pasted per service, making
|
||||
it easy to miss security middleware (CrowdSec, rate limiting) or analytics (rybbit). The
|
||||
`ingress_factory` module at `modules/kubernetes/ingress_factory/main.tf` provides a single
|
||||
point of control.
|
||||
|
||||
## Context / Trigger Conditions
|
||||
- Service has a raw `kubernetes_ingress_v1` resource instead of using `module "ingress"`
|
||||
- Service has a manually defined `kubernetes_manifest` for rybbit analytics middleware
|
||||
- New service needs standard ingress configuration
|
||||
- Middleware chain needs to be updated across many services
|
||||
|
||||
## Solution
|
||||
|
||||
### Standard single-path ingress
|
||||
Replace the raw resource with:
|
||||
```hcl
module "ingress" {
  source          = "../ingress_factory"
  namespace       = kubernetes_namespace.<service>.metadata[0].name
  name            = "<service-name>"     # becomes the ingress name AND default hostname
  host            = "<subdomain>"        # optional: override hostname (if different from name)
  service_name    = "<k8s-service-name>" # optional: defaults to name
  port            = 80                   # optional: defaults to 80
  tls_secret_name = var.tls_secret_name
  protected       = false                # set true for authentik forward auth
}
```
|
||||
|
||||
### Multi-path / split UI+API
|
||||
Use two module calls with different names but same host:
|
||||
```hcl
module "ingress" {
  source          = "../ingress_factory"
  namespace       = kubernetes_namespace.<service>.metadata[0].name
  name            = "<service>"
  host            = "<subdomain>"
  service_name    = "<ui-service>"
  tls_secret_name = var.tls_secret_name
  rybbit_site_id  = "<id>" # optional: adds rybbit analytics
}

module "ingress-api" {
  source          = "../ingress_factory"
  namespace       = kubernetes_namespace.<service>.metadata[0].name
  name            = "<service>-api"
  host            = "<subdomain>" # same host as UI
  service_name    = "<api-service>"
  ingress_path    = ["/api"]
  tls_secret_name = var.tls_secret_name
  # No rybbit_site_id - API returns JSON, not HTML
}
```
|
||||
|
||||
### Full host override (for root domain like viktorbarzin.me)
|
||||
```hcl
module "ingress" {
  source          = "../ingress_factory"
  namespace       = kubernetes_namespace.<service>.metadata[0].name
  name            = "<service>"
  service_name    = "<k8s-service>"
  full_host       = "viktorbarzin.me" # bypasses name.root_domain construction
  tls_secret_name = var.tls_secret_name
}
```
|
||||
|
||||
### Custom rate limiting (e.g., immich)
|
||||
```hcl
module "ingress" {
  source                  = "../ingress_factory"
  namespace               = kubernetes_namespace.<service>.metadata[0].name
  name                    = "<service>"
  skip_default_rate_limit = true
  extra_middlewares       = ["traefik-<custom>-rate-limit@kubernetescrd"]
  tls_secret_name         = var.tls_secret_name
}
```
|
||||
|
||||
### Key variables reference
|
||||
| Variable | Default | Purpose |
|----------|---------|---------|
| `name` | required | Ingress resource name + default hostname |
| `host` | null | Override hostname prefix (name used if null) |
| `full_host` | null | Override entire hostname (bypasses root_domain) |
| `service_name` | null | K8s service name (name used if null) |
| `port` | 80 | Backend service port |
| `ingress_path` | ["/"] | URL paths to match |
| `protected` | false | Adds authentik forward auth middleware |
| `rybbit_site_id` | null | Adds rybbit analytics script injection |
| `skip_default_rate_limit` | false | Omits default rate limiter |
| `extra_middlewares` | [] | Additional middleware references to append |
| `extra_annotations` | {} | Additional ingress annotations |
| `allow_local_access_only` | false | Restricts to LAN/VPN |
| `exclude_crowdsec` | false | Skips CrowdSec middleware |
| `custom_content_security_policy` | null | Custom CSP header |
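The interaction between `name`, `host` and `full_host` is easiest to see as a precedence rule. The shell sketch below is only an illustration inferred from the table; the module's actual Terraform is not shown here, and `resolve_ingress_host` and the `root_domain` argument are hypothetical names.

```bash
# resolve_ingress_host NAME HOST FULL_HOST ROOT_DOMAIN
# Hypothetical illustration of the hostname precedence described in the table:
#   1. full_host, if set, is used verbatim
#   2. otherwise host (or name, if host is empty) is prefixed to root_domain
resolve_ingress_host() {
  name=$1 host=$2 full_host=$3 root_domain=$4
  if [ -n "$full_host" ]; then
    printf '%s\n' "$full_host"
  else
    printf '%s.%s\n' "${host:-$name}" "$root_domain"
  fi
}

# Example calls (empty string stands for null):
#   resolve_ingress_host immich "" "" example.com
#   resolve_ingress_host blog "" viktorbarzin.me example.com
```

Under these assumptions, setting `full_host` is the only way to serve a name that does not end in `.<root_domain>` with a prefix.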
|
||||
|
||||
### After migration, delete:
|
||||
1. The raw `kubernetes_ingress_v1` resource
|
||||
2. Any manually defined `kubernetes_manifest "rybbit_analytics"` (the factory creates this automatically when `rybbit_site_id` is set)
|
||||
|
||||
## Gotchas
|
||||
|
||||
### Duplicate module names
|
||||
If the service directory has multiple `.tf` files (e.g., `main.tf` and `frame.tf`), check
|
||||
for existing `module "ingress"` blocks. Module names must be unique within a directory.
|
||||
Use a descriptive name like `module "ingress-immich"` instead.
|
||||
|
||||
### Terraform target module names with hyphens
|
||||
Module names in `terraform state list` may use hyphens (e.g., `module.real-estate-crawler`).
|
||||
When using `-target`, you must match the exact name including hyphens:
|
||||
```bash
|
||||
# Wrong - underscores:
|
||||
terraform apply -target=module.kubernetes_cluster.module.real_estate_crawler
|
||||
|
||||
# Correct - hyphens, exactly as the module block is named (quoting is optional here):
|
||||
terraform apply '-target=module.kubernetes_cluster.module.real-estate-crawler'
|
||||
```
|
||||
|
||||
### Service name defaults
|
||||
The factory defaults `service_name` to `name`. If the K8s service has a different name
|
||||
than the ingress, you must explicitly set `service_name`. Common case: headscale has one
|
||||
K8s service named `headscale` with multiple ports, so the UI ingress needs
|
||||
`service_name = "headscale"` even though `name = "headscale-ui"`.
|
||||
|
||||
### Servarr subdirectory source path
|
||||
Services under `servarr/` need `../../ingress_factory` as the source path instead of
|
||||
`../ingress_factory`.
|
||||
|
||||
## Verification
|
||||
1. `terraform validate` - check for syntax errors
|
||||
2. `terraform plan -target=module.kubernetes_cluster.module.<service>` - verify old ingress destroyed, new created
|
||||
3. `kubectl get ingress -n <namespace>` - verify ingress exists with correct host/paths
|
||||
4. Browse the service URL to confirm accessibility
|
||||
|
||||
## Notes
|
||||
- Services using special protocols (gRPC, mTLS, WebSocket with custom headers) should NOT
|
||||
be migrated - keep raw `kubernetes_ingress_v1` for those
|
||||
- The factory automatically includes: rate-limit, CSP headers, CrowdSec, and entrypoint=websecure
|
||||
- When `rybbit_site_id` is set, the factory creates a `kubernetes_manifest` for the
|
||||
rewrite-body middleware that injects the analytics script into HTML responses
|
||||
|
|
@ -0,0 +1,80 @@
|
|||
---
|
||||
name: iterative-plan-review-with-subagents
|
||||
description: |
|
||||
Design pattern for reviewing implementation plans using parallel subagent reviewers
|
||||
with iterative refinement. Use when: (1) designing a complex infrastructure change
|
||||
that needs security + implementation review, (2) creating a migration plan with
|
||||
multiple phases, (3) any plan where missing a critical issue could cause data loss
|
||||
or security exposure. Spawns 2 reviewer agents (security + implementation), collects
|
||||
CRITICAL/IMPORTANT/NIT findings, fixes all CRITICALs, re-runs until zero CRITICALs.
|
||||
Typically converges in 2-3 iterations.
|
||||
author: Claude Code
|
||||
version: 1.0.0
|
||||
date: 2026-03-07
|
||||
---
|
||||
|
||||
# Iterative Plan Review with Subagents
|
||||
|
||||
## Problem
|
||||
Complex infrastructure plans have blind spots — security issues, implementation
|
||||
incompatibilities, race conditions, format mismatches. A single reviewer misses things.
|
||||
Multiple reviewers with different expertise catch more.
|
||||
|
||||
## Context / Trigger Conditions
|
||||
- Writing a migration plan (e.g., secrets management, storage migration)
|
||||
- Designing a multi-phase infrastructure change
|
||||
- Any plan where a missed issue = downtime, data loss, or security exposure
|
||||
- User explicitly asks for plan review
|
||||
|
||||
## Solution
|
||||
|
||||
### 1. Write the plan as a markdown document
|
||||
Save to `docs/plans/YYYY-MM-DD-<topic>.md`
|
||||
|
||||
### 2. Spawn 2 reviewer agents in parallel
|
||||
```
Agent 1: Security reviewer
  - Focus: secret exposure, access control, key management, CI pipeline security
  - Classify each finding: CRITICAL / IMPORTANT / NIT

Agent 2: Implementation reviewer
  - Focus: format compatibility, race conditions, ordering, tool behavior
  - Classify each finding: CRITICAL / IMPORTANT / NIT
```
|
||||
|
||||
Key: give each reviewer specific focus areas and the actual source code to check against.
|
||||
|
||||
### 3. Consolidate and fix CRITICALs
|
||||
- Merge findings from both reviewers
|
||||
- Deduplicate (both often find the same issue)
|
||||
- Fix ALL CRITICALs in the plan document
|
||||
- Note IMPORTANTs for implementation phase
|
||||
|
||||
### 4. Re-run reviewers on the updated plan
|
||||
- Same 2 agents, but tell them which CRITICALs were fixed
|
||||
- Ask them to VERIFY fixes are correct AND find new issues
|
||||
- Repeat until zero CRITICALs
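The stop condition for this loop can be checked mechanically if reviewers write their findings in a predictable form. The sketch below assumes one finding per line, each starting with its severity (for example `CRITICAL: ...`); that file format and both helper names are hypothetical.

```bash
# count_criticals
# Hypothetical helper: counts lines on stdin that start with "CRITICAL".
# grep -c exits non-zero when the count is 0, so the status is normalised.
count_criticals() {
  grep -c '^CRITICAL' || true
}

# needs_another_round FILE...
# Succeeds while any reviewer findings file still contains a CRITICAL.
needs_another_round() {
  total=$(cat "$@" | count_criticals)
  [ "$total" -gt 0 ]
}

# Usage (file names are placeholders):
#   needs_another_round security-findings.txt implementation-findings.txt \
#     && echo "fix CRITICALs and re-run reviewers"
```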
|
||||
|
||||
### 5. Typical convergence
|
||||
- v1: 5-6 CRITICALs (format issues, race conditions, missing steps)
|
||||
- v2: 2-3 CRITICALs (fixes introduced new issues, missed edge cases)
|
||||
- v3: 0 CRITICALs, only IMPORTANTs remaining
|
||||
|
||||
## Example Findings from Real Usage (SOPS migration)
|
||||
|
||||
| Iteration | CRITICALs Found | Examples |
|-----------|-----------------|----------|
| v1 | 6 | YAML≠HCL format, `git add .` commits secrets, no branch protection, parallel race condition |
| v2 | 3 | `SOPS_AGE_KEY_FILE` misunderstanding, `renew-tls.yml` not updated, plan leaks in PR logs |
| v3 | 0 | All verified fixed. 6 IMPORTANTs noted for implementation. |
|
||||
|
||||
## Verification
|
||||
- Zero CRITICALs from both reviewers on the final iteration
|
||||
- IMPORTANTs documented as implementation notes (not blockers)
|
||||
|
||||
## Notes
|
||||
- Use `sonnet` model for reviewers (fast, thorough enough for review)
|
||||
- Give reviewers actual source code paths to read, not just the plan
|
||||
- Tell v2+ reviewers what was fixed so they verify, not re-discover
|
||||
- The final review should say "ONLY report CRITICALs" to avoid noise
|
||||
- This pattern cost ~$3-5 in API calls but caught issues that would have caused hours of debugging
|
||||
244
.claude/skills/archived/k8s-container-image-caching/SKILL.md
Normal file
|
|
@ -0,0 +1,244 @@
|
|||
---
|
||||
name: k8s-container-image-caching
|
||||
description: |
|
||||
Set up and troubleshoot container image pull-through caches in Kubernetes. Use when:
|
||||
(1) ImagePullBackOff for non-Docker-Hub images routed through a wildcard mirror,
|
||||
(2) containerd has deprecated `registry.mirrors."*"` catching all image pulls,
|
||||
(3) need to add pull-through cache for a new upstream registry,
|
||||
(4) `mirrors` cannot be set when `config_path` is provided error in containerd,
|
||||
(5) containerd 1.6.x vs 1.7.x config_path compatibility issues,
|
||||
(6) kubectl shows correct image tag but container runs old code,
|
||||
(7) local registry mirror caches stale images,
|
||||
(8) imagePullPolicy: Always doesn't force fresh pulls,
|
||||
(9) containerd config has mirror that intercepts pulls serving stale images.
|
||||
Covers multi-registry pull-through cache setup (Docker Registry v2) and cache bypass
|
||||
via image digest pinning.
|
||||
author: Claude Code
|
||||
version: 1.0.0
|
||||
date: 2026-02-22
|
||||
---
|
||||
|
||||
# Kubernetes Container Image Caching
|
||||
|
||||
## Pull-Through Cache Setup
|
||||
|
||||
### Problem
|
||||
|
||||
Docker Registry v2 can only proxy **one upstream registry per instance**. A common
|
||||
misconfiguration is using a containerd wildcard mirror (`registry.mirrors."*"`) pointing
|
||||
to a single Docker Hub proxy, which breaks pulls from ghcr.io, quay.io, registry.k8s.io,
|
||||
and other registries -- they get routed to the Docker Hub proxy which can't serve them,
|
||||
causing `ImagePullBackOff`.
|
||||
|
||||
### Context / Trigger Conditions
|
||||
|
||||
- `ImagePullBackOff` for images from ghcr.io, quay.io, registry.k8s.io, or other non-Docker-Hub registries
|
||||
- Containerd config has deprecated `[plugins."io.containerd.grpc.v1.cri".registry.mirrors."*"]`
|
||||
- Error: `failed to load plugin io.containerd.grpc.v1.cri: invalid plugin config: mirrors cannot be set when config_path is provided`
|
||||
- Need to migrate from deprecated wildcard mirrors to modern `config_path` approach
|
||||
|
||||
### Solution
|
||||
|
||||
#### 1. Run one Registry v2 container per upstream
|
||||
|
||||
Each upstream needs its own Docker Registry v2 instance on a different port:
|
||||
|
||||
| Port | Registry | Container Name |
|------|----------|----------------|
| 5000 | docker.io | registry |
| 5010 | ghcr.io | registry-ghcr |
| 5020 | quay.io | registry-quay |
| 5030 | registry.k8s.io | registry-k8s |
| 5040 | reg.kyverno.io | registry-kyverno |
|
||||
|
||||
Config for non-Docker-Hub proxies (no auth needed -- they're public):
|
||||
|
||||
```yaml
version: 0.1
storage:
  cache:
    blobdescriptor: inmemory
  filesystem:
    rootdirectory: /var/lib/registry
http:
  addr: :5000
proxy:
  remoteurl: https://ghcr.io # change per registry
```
|
||||
|
||||
```bash
docker run -p 5010:5000 -d --restart always --name registry-ghcr \
  -v /etc/docker-registry/ghcr/config.yml:/etc/docker/registry/config.yml registry:2
```
|
||||
|
||||
#### 2. Replace deprecated wildcard mirror with `config_path`
|
||||
|
||||
Instead of:
|
||||
```toml
# DEPRECATED - breaks non-Docker-Hub registries
[plugins."io.containerd.grpc.v1.cri".registry.mirrors."*"]
  endpoint = ["http://10.0.20.10:5000"]
```
|
||||
|
||||
Use the modern `config_path` approach:
|
||||
```toml
[plugins."io.containerd.grpc.v1.cri".registry]
  config_path = "/etc/containerd/certs.d"
```
|
||||
|
||||
Then create per-registry `hosts.toml` files:
|
||||
```bash
mkdir -p /etc/containerd/certs.d/docker.io
cat > /etc/containerd/certs.d/docker.io/hosts.toml <<'EOF'
server = "https://registry-1.docker.io"

[host."http://10.0.20.10:5000"]
  capabilities = ["pull", "resolve"]
EOF
```
|
||||
|
||||
Registries without a `hosts.toml` entry **fall through to direct pull** (no breakage).
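The same file has to be written once per upstream, differing only in the upstream URL and the mirror port. A small generator keeps the entries consistent; this is a sketch, `render_hosts_toml` is a hypothetical helper, and the upstream URLs other than Docker Hub are assumed to be `https://<registry-host>`.

```bash
# render_hosts_toml UPSTREAM_URL MIRROR_URL
# Hypothetical helper: prints a hosts.toml that tries MIRROR_URL first and
# names UPSTREAM_URL as the server to fall back to.
render_hosts_toml() {
  printf 'server = "%s"\n\n[host."%s"]\n  capabilities = ["pull", "resolve"]\n' "$1" "$2"
}

# Usage on a node (not run here), using the ghcr.io row of the port table:
#   mkdir -p /etc/containerd/certs.d/ghcr.io
#   render_hosts_toml https://ghcr.io http://10.0.20.10:5010 \
#     > /etc/containerd/certs.d/ghcr.io/hosts.toml
```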
|
||||
|
||||
#### 3. Critical: `config_path` and `mirrors` cannot coexist
|
||||
|
||||
Containerd will **refuse to start the CRI plugin** if both `config_path` and any
|
||||
`mirrors` entries exist in `config.toml`. You must remove ALL `mirrors` entries
|
||||
(including the `[plugins."...registry.mirrors"]` parent section) before setting
|
||||
`config_path`.
|
||||
|
||||
This is especially dangerous on containerd 1.6.x (used on older nodes like k8s-master)
|
||||
where the config format is slightly different. If unsure, either:
|
||||
- Don't use config_path on that node (skip the pull-through cache)
|
||||
- Remove the entire `mirrors` section first, then add `config_path`
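Because a conflicting config stops the CRI plugin from loading, it is worth checking the file before restarting containerd. The following is a rough preflight sketch: `has_mirror_conflict` is a hypothetical helper that pattern-matches the TOML text rather than parsing it, so treat a clean result as a hint, not a guarantee.

```bash
# has_mirror_conflict CONFIG_FILE
# Hypothetical helper: succeeds (exit 0) when CONFIG_FILE sets a non-empty
# config_path AND still contains an uncommented registry.mirrors table.
has_mirror_conflict() {
  grep -Eq '^[[:space:]]*config_path[[:space:]]*=[[:space:]]*"[^"]+"' "$1" &&
    grep -Eq '^[[:space:]]*\[plugins\..*registry\.mirrors' "$1"
}

# Usage on a node (not run here):
#   if has_mirror_conflict /etc/containerd/config.toml; then
#     echo "remove the mirrors section before restarting containerd" >&2
#   fi
```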
|
||||
|
||||
#### 4. Static IP for registry VM
|
||||
|
||||
If the registry VM uses DHCP and gets the wrong IP, all mirrors break. Use static IP
|
||||
via cloud-init `ipconfig0 = "ip=10.0.20.10/24,gw=10.0.20.1"` instead of DHCP.
|
||||
|
||||
### Verification
|
||||
|
||||
```bash
# Test each proxy responds
for port in 5000 5010 5020 5030 5040; do
  curl -s "http://10.0.20.10:${port}/v2/_catalog"
done

# Test containerd can pull through cache
crictl pull ghcr.io/some/image:tag

# Check containerd logs for mirror usage
journalctl -u containerd --since "5 minutes ago" | grep -i "mirror\|registry"
```
|
||||
|
||||
### Notes
|
||||
|
||||
- **Fallback behavior**: If the local mirror is unreachable, containerd falls through to
|
||||
direct pull from the upstream `server` URL. This provides graceful degradation.
|
||||
- **GC crontabs**: Add weekly garbage collection for each registry container, staggered
|
||||
to avoid I/O spikes.
|
||||
- **Hourly restart**: Registry v2 has known memory leak issues; an hourly restart mitigates them.
|
||||
- **Cache is ephemeral**: VM recreation clears the cache. Images re-cache on demand.
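The staggered garbage collection mentioned above can be expressed as generated crontab lines. This is a sketch under several assumptions: the schedule (Sundays, one hour apart starting at 02:00) is arbitrary, the command form `bin/registry garbage-collect <config>` is the one commonly used with the `registry:2` image, and `gc_cron_lines` is a hypothetical helper that only prints lines.

```bash
# gc_cron_lines CONTAINER...
# Hypothetical helper: prints one weekly crontab entry per registry
# container, each an hour later than the previous one to spread the I/O.
gc_cron_lines() {
  hour=2
  for name in "$@"; do
    printf '0 %d * * 0 docker exec %s bin/registry garbage-collect /etc/docker/registry/config.yml\n' "$hour" "$name"
    hour=$((hour + 1))
  done
}

# Example, using the container names from the port table:
#   gc_cron_lines registry registry-ghcr registry-quay registry-k8s registry-kyverno
```

Review the output before installing it with `crontab`, and adjust the start hour to a quiet window for the cluster.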
|
||||
|
||||
---
|
||||
|
||||
## Cache Bypass / Stale Image Fix
|
||||
|
||||
### Problem
|
||||
Kubernetes pods continue running old Docker images even after pushing new versions with
|
||||
the same tag (e.g., `:latest`). This happens when a local registry mirror caches images
|
||||
and serves stale versions, ignoring `imagePullPolicy: Always`.
|
||||
|
||||
### Context / Trigger Conditions
|
||||
- Pod is running but application code is outdated
|
||||
- `docker push` succeeded with new layers
|
||||
- `kubectl describe pod` shows correct image tag
|
||||
- Cluster has a local registry mirror configured (e.g., in containerd config)
|
||||
- `imagePullPolicy: Always` doesn't fix the issue
|
||||
- Nodes configured with registry mirrors at `/etc/containerd/certs.d/` or similar
|
||||
|
||||
### Solution
|
||||
|
||||
#### 1. Get the image digest after pushing
|
||||
```bash
|
||||
docker push viktorbarzin/myimage:latest
|
||||
# Output includes: latest: digest: sha256:abc123... size: 856
|
||||
```
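Copying the digest by hand is error-prone, so it can be pulled out of the push output instead. A minimal sketch, assuming the output contains a `sha256:` digest of 64 hex characters as in the line above and that `grep -o` is available; `pinned_ref` is a hypothetical helper.

```bash
# pinned_ref REPOSITORY
# Hypothetical helper: reads `docker push` output on stdin, takes the first
# sha256 digest it finds and prints REPOSITORY@DIGEST. Fails if none is found.
pinned_ref() {
  repo=$1
  digest=$(grep -Eo 'sha256:[0-9a-f]{64}' | head -n 1)
  [ -n "$digest" ] || return 1
  printf '%s@%s\n' "$repo" "$digest"
}

# Usage (not run here):
#   docker push viktorbarzin/myimage:latest | pinned_ref docker.io/viktorbarzin/myimage
```

The printed reference is the value to place in the `image` field in the next step.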
|
||||
|
||||
#### 2. Use digest instead of tag in deployment
|
||||
```hcl
# Terraform
container {
  # Use digest to bypass local registry cache
  image             = "docker.io/viktorbarzin/myimage@sha256:abc123..."
  image_pull_policy = "Always"
  name              = "myimage"
}
```
|
||||
|
||||
```yaml
# Kubernetes YAML
containers:
  - name: myimage
    image: docker.io/viktorbarzin/myimage@sha256:abc123...
    imagePullPolicy: Always
```
|
||||
|
||||
#### 3. Apply and restart
|
||||
```bash
|
||||
terraform apply -target=module.kubernetes_cluster.module.myservice
|
||||
kubectl rollout restart deployment/myservice -n mynamespace
|
||||
```
|
||||
|
||||
### Why This Works
|
||||
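- A pull-through mirror caches the tag-to-manifest mapping and can keep serving an old one; a digest is content-addressed, so it cannot point at a different image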
|
||||
- When you specify a digest, the node must fetch that exact manifest
|
||||
- The mirror may not have the digest cached, forcing a pull from upstream
|
||||
- Even if cached, the digest guarantees the exact image version
|
||||
|
||||
### Verification
|
||||
```bash
|
||||
# Check the pod is using the new image
|
||||
kubectl get pod -n mynamespace -o jsonpath='{.items[*].spec.containers[*].image}'
|
||||
|
||||
# Verify application behavior reflects new code
|
||||
kubectl exec -n mynamespace deploy/myservice -- <verification-command>
|
||||
```
|
||||
|
||||
### Example
|
||||
|
||||
Before (problematic):
|
||||
```hcl
|
||||
image = "docker.io/viktorbarzin/audiblez-web:latest"
|
||||
```
|
||||
|
||||
After (fixed):
|
||||
```hcl
|
||||
image = "docker.io/viktorbarzin/audiblez-web@sha256:4d0e2c839555e2229bc91a0b1273569bac88529e8b3c3cadad3c3cf9d865fa29"
|
||||
```
|
||||
|
||||
### Notes
|
||||
- You must update the digest each time you push a new image
|
||||
- Consider automating digest extraction in CI/CD pipelines
|
||||
- This is a workaround; ideally fix the registry mirror configuration
|
||||
- To find your registry mirror config: `cat /etc/containerd/config.toml` on nodes
|
||||
- Common mirror locations: `/etc/containerd/certs.d/docker.io/hosts.toml`
|
||||
|
||||
### Diagnosing Registry Mirror Issues
|
||||
```bash
# On a k8s node, check containerd config
grep -A5 mirrors /etc/containerd/config.toml

# Check if mirror is intercepting (--debug is a global crictl flag, so it goes before the subcommand)
crictl --debug pull docker.io/library/alpine:latest 2>&1 | grep -i mirror

# List cached images on node
crictl images | grep myimage
```
|
||||
|
||||
---
|
||||
|
||||
## References
|
||||
|
||||
- [Kubernetes imagePullPolicy documentation](https://kubernetes.io/docs/concepts/containers/images/#image-pull-policy)
|
||||
- [containerd registry configuration](https://github.com/containerd/containerd/blob/main/docs/hosts.md)
|
||||
186
.claude/skills/archived/k8s-gpu-no-nvidia-devices/SKILL.md
Normal file
|
|
@ -0,0 +1,186 @@
|
|||
---
|
||||
name: k8s-gpu-no-nvidia-devices
|
||||
description: |
|
||||
Fix for Kubernetes GPU pods showing "CUDA not supported" or no /dev/nvidia* devices
|
||||
despite nvidia.com/gpu resource allocation. Use when: (1) container runs but torch.cuda.is_available()
|
||||
returns False, (2) ls /dev/nvidia* shows "no matches found", (3) nvidia-smi fails inside pod
|
||||
but works on host, (4) PyTorch/TensorFlow falls back to CPU despite GPU allocation.
|
||||
Covers NVIDIA device plugin, time-slicing, and container runtime issues.
|
||||
author: Claude Code
|
||||
version: 1.1.0
|
||||
date: 2026-03-01
|
||||
---
|
||||
|
||||
# Kubernetes GPU Pod - No NVIDIA Devices Found
|
||||
|
||||
## Problem
|
||||
|
||||
A Kubernetes pod requests GPU resources (`nvidia.com/gpu: 1`) and schedules on a GPU node,
|
||||
but inside the container there are no NVIDIA devices visible. The application falls back
|
||||
to CPU with messages like "CUDA not supported by the Torch installed!" despite running
|
||||
in a CUDA-enabled container image.
|
||||
|
||||
## Context / Trigger Conditions
|
||||
|
||||
- Pod shows `Running` status and is on a node with `gpu=true` label
|
||||
- `kubectl describe pod` shows GPU limit/request is satisfied
|
||||
- Inside container: `ls /dev/nvidia*` returns "no matches found"
|
||||
- Inside container: `nvidia-smi` fails or command not found
|
||||
- Application logs show: "CUDA not supported", "Switching to CPU", "torch.cuda.is_available() = False"
|
||||
- On the host node: `nvidia-smi` works fine
|
||||
|
||||
## Solution
|
||||
|
||||
### Step 1: Verify GPU Availability
|
||||
|
||||
Check if other pods are consuming the GPU:
|
||||
|
||||
```bash
# List all pods using GPU resources (one line per pod, even with several GPU containers)
kubectl get pods -A -o json | jq -r '.items[] | select(any(.spec.containers[]; .resources.limits."nvidia.com/gpu" != null)) | "\(.metadata.namespace)/\(.metadata.name)"'

# Check NVIDIA device plugin pods
kubectl get pods -n nvidia -l app=nvidia-device-plugin
kubectl logs -n nvidia -l app=nvidia-device-plugin --tail=50
```
|
||||
|
||||
### Step 2: Free GPU Resources
|
||||
|
||||
If another workload is using the GPU, unload it:
|
||||
|
||||
```bash
|
||||
# For Ollama specifically
|
||||
kubectl exec -n ollama deployment/ollama -- ollama stop <model_name>
|
||||
|
||||
# Or scale down the conflicting deployment
|
||||
kubectl scale deployment/<name> -n <namespace> --replicas=0
|
||||
```
|
||||
|
||||
### Step 3: Restart the Affected Pod
|
||||
|
||||
After freeing GPU resources, restart the pod to get fresh device allocation:
|
||||
|
||||
```bash
|
||||
kubectl rollout restart deployment/<name> -n <namespace>
|
||||
|
||||
# Or delete the pod directly
|
||||
kubectl delete pod <pod-name> -n <namespace>
|
||||
```
|
||||
|
||||
### Step 4: Verify GPU Access
|
||||
|
||||
```bash
# Check devices are now visible (sh -c makes the glob expand inside the container,
# not in your local shell)
kubectl exec -n <namespace> deployment/<name> -- sh -c 'ls -la /dev/nvidia*'

# Test nvidia-smi
kubectl exec -n <namespace> deployment/<name> -- nvidia-smi

# Test PyTorch CUDA
kubectl exec -n <namespace> deployment/<name> -- python3 -c "import torch; print('CUDA:', torch.cuda.is_available())"
```
|
||||
|
||||
## Verification
|
||||
|
||||
After restart, you should see:
|
||||
|
||||
```
|
||||
/dev/nvidia0
|
||||
/dev/nvidiactl
|
||||
/dev/nvidia-uvm
|
||||
/dev/nvidia-uvm-tools
|
||||
```
|
||||
|
||||
And `nvidia-smi` should show the GPU with your container process.
|
||||
|
||||
## Example
|
||||
|
||||
```bash
|
||||
# Problem: ebook2audiobook shows "CUDA not supported"
|
||||
$ kubectl exec -n ebook2audiobook deployment/ebook2audiobook -- ls /dev/nvidia*
|
||||
zsh:1: no matches found: /dev/nvidia*
|
||||
|
||||
# Solution: Unload Ollama model holding the GPU
|
||||
$ kubectl exec -n ollama deployment/ollama -- ollama ps
|
||||
NAME SIZE PROCESSOR
|
||||
qwen2.5:14b 10 GB 33%/67% CPU/GPU
|
||||
|
||||
$ kubectl exec -n ollama deployment/ollama -- ollama stop qwen2.5:14b
|
||||
|
||||
# Restart the affected pod
|
||||
$ kubectl rollout restart deployment/ebook2audiobook -n ebook2audiobook
|
||||
|
||||
# Verify
|
||||
$ kubectl exec -n ebook2audiobook deployment/ebook2audiobook -- nvidia-smi
|
||||
# Should now show the Tesla T4 GPU
|
||||
```
|
||||
|
||||
## Notes
|
||||
|
||||
- **GPU Time-Slicing**: If using NVIDIA GPU time-slicing (configured in GPU Operator),
|
||||
multiple pods can share a GPU. However, device injection still requires proper timing.
|
||||
|
||||
- **Pod Scheduling Order**: Pods that start while GPU is fully allocated may not get
|
||||
devices injected even after GPU becomes available - a restart is required.
|
||||
|
||||
- **Container Runtime**: The NVIDIA Container Toolkit must be properly configured.
|
||||
Issues can arise from:
|
||||
- cgroup driver mismatch (systemd vs cgroupfs)
|
||||
- Container updates causing device loss
|
||||
- SELinux blocking device access
|
||||
|
||||
- **Image Compatibility**: The container image must have CUDA libraries matching the
|
||||
driver version. Check with `nvidia-smi` on host for driver version.
|
||||
|
||||
- **This Cluster**: Uses NVIDIA GPU Operator with time-slicing (20 replicas per GPU).
|
||||
GPU node is `k8s-node1` with Tesla T4.
|
||||
|
||||
## See Also
|
||||
|
||||
- Check GPU Operator status: `kubectl get pods -n nvidia`
|
||||
- View time-slicing config: `kubectl get configmap -n nvidia time-slicing-config -o yaml`
|
||||
|
||||
## Automatic GPU Recovery via Liveness Probe
|
||||
|
||||
To prevent GPU loss from requiring manual intervention, add a liveness probe that checks
|
||||
both GPU availability and application health. Example for Frigate (but applicable to any
|
||||
GPU workload):
|
||||
|
||||
```hcl
|
||||
# Restart pod if GPU becomes unavailable or app hangs
|
||||
liveness_probe {
|
||||
exec {
|
||||
command = ["sh", "-c", "nvidia-smi > /dev/null 2>&1 && curl -sf http://localhost:<port>/health > /dev/null"]
|
||||
}
|
||||
initial_delay_seconds = 120
|
||||
period_seconds = 60
|
||||
timeout_seconds = 10
|
||||
failure_threshold = 3
|
||||
}
|
||||
# Allow time for GPU model loading at startup
|
||||
startup_probe {
|
||||
http_get {
|
||||
path = "/health"
|
||||
port = <port>
|
||||
}
|
||||
period_seconds = 10
|
||||
failure_threshold = 30 # up to 5 minutes
|
||||
}
|
||||
```
|
||||
|
||||
The liveness probe checks:
|
||||
- `nvidia-smi` — fails if GPU devices are no longer accessible (CUDA context corruption, device plugin issues)
|
||||
- `curl` health endpoint — fails if the application process is hung
|
||||
|
||||
If either fails 3 times in a row (3 minutes), Kubernetes automatically restarts the pod,
|
||||
which re-acquires the GPU device through the NVIDIA device plugin.
|
||||
|
||||
**Important**: Always pair with a `startup_probe` when using GPU workloads — model loading
|
||||
(TensorRT, ONNX, PyTorch) can take several minutes and would trip a liveness probe
|
||||
configured with a short `initial_delay_seconds`.
|
||||
|
||||
## References
|
||||
|
||||
- [NVIDIA Container Toolkit Troubleshooting](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/troubleshooting.html)
|
||||
- [Kubernetes GPU Device Plugin](https://github.com/NVIDIA/k8s-device-plugin)
|
||||
- [NVIDIA GPU Operator](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/overview.html)
|
||||
113
.claude/skills/archived/k8s-hpa-scaling-storm/SKILL.md
Normal file
|
|
@ -0,0 +1,113 @@
|
|||
---
|
||||
name: k8s-hpa-scaling-storm
|
||||
description: |
|
||||
Fix and prevent HPA (HorizontalPodAutoscaler) scaling storms where pods scale to
|
||||
maxReplicas uncontrollably. Use when: (1) HPA shows memory or CPU utilization at
|
||||
200%+ causing rapid scale-up, (2) dozens or hundreds of pods created by HPA in minutes,
|
||||
(3) cluster becomes unstable due to resource exhaustion from too many pods,
|
||||
(4) etcd timeouts or API server crashes from pod churn, (5) adding resource requests
|
||||
to a deployment that previously had none causes HPA to miscalculate utilization.
|
||||
Covers emergency response and prevention patterns.
|
||||
author: Claude Code
|
||||
version: 1.0.0
|
||||
date: 2026-02-15
|
||||
---
|
||||
|
||||
# Kubernetes HPA Scaling Storm
|
||||
|
||||
## Problem
|
||||
When an HPA is configured with a memory or CPU utilization target but the underlying
|
||||
deployment has insufficient resource requests, the HPA calculates artificially high
|
||||
utilization percentages (e.g., 220% of a 256Mi request when actual usage is 570Mi).
|
||||
This causes the HPA to scale pods to maxReplicas (often 100) within minutes, exhausting
|
||||
cluster resources and potentially crashing etcd and the API server.
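
The runaway follows from the HPA's documented scaling rule, `desiredReplicas = ceil(currentReplicas * currentUtilization / targetUtilization)`. A minimal sketch of that arithmetic with the numbers from this incident; the function names are hypothetical, and the sketch ignores the HPA's tolerance band and scale-up rate limits:

```bash
# Utilization as an integer percent of the request (rounded down).
# Hypothetical helper, not part of kubectl.
hpa_utilization_pct() {
  local usage_mi="$1" request_mi="$2"
  echo $(( usage_mi * 100 / request_mi ))
}

# ceil(current * utilization / target) using integer arithmetic.
hpa_desired_replicas() {
  local current="$1" utilization_pct="$2" target_pct="$3"
  echo $(( (current * utilization_pct + target_pct - 1) / target_pct ))
}

util=$(hpa_utilization_pct 570 256)   # 570Mi used against a 256Mi request -> 222
hpa_desired_replicas 2 "$util" 50     # 2 replicas, 50% memory target -> prints 9
```

Because every new pod also idles above its request, the ratio does not fall after scaling, so each evaluation multiplies the replica count again until maxReplicas is reached.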
|
||||
|
||||
## Context / Trigger Conditions
|
||||
- `kubectl get hpa` shows `<unknown>/70%` or very high percentages (200%+)
|
||||
- Pod count for a deployment rapidly increases to maxReplicas
|
||||
- etcd timeout errors in `kubectl` or `terraform apply`
|
||||
- API server becomes unreachable (`connection refused` or `network is unreachable`)
|
||||
- Adding resource requests to a Helm chart that previously had none
|
||||
- Memory-based HPA targets with real usage far exceeding requests
|
||||
|
||||
## Solution
|
||||
|
||||
### Emergency Response (stop the storm)
|
||||
|
||||
**Step 1: Delete the HPA immediately**
|
||||
```bash
|
||||
kubectl --kubeconfig $(pwd)/config delete hpa <hpa-name> -n <namespace>
|
||||
```
|
||||
|
||||
**Step 2: Scale the deployment down**
|
||||
```bash
|
||||
kubectl --kubeconfig $(pwd)/config scale deployment <name> -n <namespace> --replicas=2
|
||||
```
|
||||
|
||||
**Step 3: Wait for pods to terminate and cluster to stabilize**
|
||||
```bash
# Count remaining pods (re-run until the number stops falling)
kubectl --kubeconfig "$(pwd)/config" get pods -n <namespace> -l <label> --no-headers | wc -l
```
|
||||
|
||||
If the API server is unresponsive, wait 3-5 minutes for it to self-recover. The kubelet
|
||||
will restart static pods (etcd, kube-apiserver) automatically.
|
||||
|
||||
### Prevention
|
||||
|
||||
**Rule 1: Set resource requests to match actual usage**
|
||||
Before enabling HPA, check actual resource consumption:
|
||||
```bash
|
||||
kubectl top pods -n <namespace> -l <label>
|
||||
```
|
||||
Set requests to the baseline (idle) usage, not the minimum possible value.
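
One way to pick that baseline is to take the highest per-pod memory figure from `kubectl top`. A rough sketch, assuming the default three-column output (`NAME CPU MEMORY`) with memory reported in `Mi`; `max_memory_mi` is a hypothetical helper:

```bash
# Reads `kubectl top pods --no-headers` output on stdin and prints the
# largest memory value in Mi. Assumes every value carries the Mi suffix.
max_memory_mi() {
  awk '{
    value = $3
    sub(/Mi$/, "", value)
    if (value + 0 > max) max = value + 0
  } END { print max + 0 }'
}

# Usage against a live cluster:
#   kubectl top pods -n <namespace> -l <label> --no-headers | max_memory_mi
```

Sample the pods while they are idle, so the request reflects baseline usage rather than a load spike.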
|
||||
|
||||
**Rule 2: Set reasonable maxReplicas**
|
||||
Never use maxReplicas > 10 unless you've verified the cluster can handle it.
A chart default of 100 (Kubernetes itself sets none) is almost never appropriate for a home/small cluster.
|
||||
|
||||
**Rule 3: Prefer CPU-only HPA targets**
|
||||
Memory-based scaling is problematic because:
|
||||
- Memory usage grows over time and rarely decreases
|
||||
- Memory-based scaling creates pods that never scale down
|
||||
- CPU is more responsive to load changes
|
||||
|
||||
**Rule 4: Disable the HPA before adding or changing resource requests**
|
||||
If adding resource requests to a deployment managed by HPA, temporarily disable
|
||||
the HPA first, set the requests, verify utilization is reasonable, then re-enable.
|
||||
|
||||
## Cascade Effects
|
||||
A scaling storm can cause:
|
||||
1. etcd storage exhaustion (too many pod objects)
|
||||
2. API server OOM or connection limits
|
||||
3. VPN/network connectivity loss (if VPN runs in the cluster)
|
||||
4. Kyverno webhook failures (admission controller overwhelmed)
|
||||
5. Other pods evicted or unable to schedule
|
||||
|
||||
## Verification
|
||||
- `kubectl get hpa -n <namespace>` shows reasonable utilization (< 100%)
|
||||
- Pod count is stable at expected replicas
|
||||
- `kubectl get nodes` responds promptly
|
||||
- No etcd timeout errors
|
||||
|
||||
## Example
|
||||
```bash
|
||||
# Observed: HPA scaling Collabora to 100 pods
|
||||
$ kubectl get hpa -n nextcloud
|
||||
NAME TARGETS MINPODS MAXPODS REPLICAS
|
||||
nextcloud-collabora cpu: 0%/70%, memory: 220%/50% 2 100 83
|
||||
|
||||
# Emergency fix
|
||||
$ kubectl delete hpa nextcloud-collabora -n nextcloud
|
||||
$ kubectl scale deployment nextcloud-collabora -n nextcloud --replicas=2
|
||||
|
||||
# Root cause: 256Mi memory request, actual usage 570Mi
|
||||
# Fix: increase request to 1Gi or disable memory target
|
||||
```
|
||||
|
||||
## Notes
|
||||
- If the HPA is managed by a Helm chart, deleting it via kubectl is temporary—the next
|
||||
Helm upgrade will recreate it. You must also update the Helm values.
|
||||
- In this project, Collabora was ultimately disabled in favor of OnlyOffice to avoid
|
||||
the HPA issue entirely.
|
||||
- See also: `helm-stuck-release-recovery` for fixing Helm releases broken by the storm.
|
||||
235
.claude/skills/archived/k8s-nfs-mount-troubleshooting/SKILL.md
Normal file
|
|
@ -0,0 +1,235 @@
|
|||
---
|
||||
name: k8s-nfs-mount-troubleshooting
|
||||
description: |
|
||||
Debug Kubernetes NFS volume mount failures. Use when: (1) Pod stuck in ContainerCreating
|
||||
for extended time, (2) kubectl describe shows "MountVolume.SetUp failed" with NFS errors,
|
||||
(3) Error message shows "Protocol not supported" or "mount.nfs: access denied",
|
||||
(4) NFS volume defined in pod spec but container won't start, (5) Container starts but
|
||||
gets "Permission denied" writing to NFS volume (non-root container UID mismatch),
|
||||
(6) CronJob or init container fails silently when writing to NFS, (7) Pod shows Running
|
||||
1/1 but service is unresponsive after a node reboot — stale NFS mount causes frozen
|
||||
processes with zero listening sockets. Common root causes are missing NFS export on the
|
||||
server, UID mismatch for non-root containers, and stale mounts after node reboots.
|
||||
author: Claude Code
|
||||
version: 1.2.0
|
||||
date: 2026-02-28
|
||||
---
|
||||
|
||||
# Kubernetes NFS Mount Troubleshooting
|
||||
|
||||
## Problem
|
||||
Pods with NFS volumes get stuck in `ContainerCreating` state indefinitely. The error
|
||||
messages from `kubectl describe pod` can be misleading, showing protocol or permission
|
||||
errors when the actual issue is the NFS export doesn't exist.
|
||||
|
||||
## Context / Trigger Conditions
|
||||
- Pod status shows `ContainerCreating` for more than 1-2 minutes
|
||||
- `kubectl describe pod` shows events like:
|
||||
- `MountVolume.SetUp failed for volume "data" : mount failed: exit status 32`
|
||||
- `mount.nfs: Protocol not supported`
|
||||
- `mount.nfs: access denied by server`
|
||||
- Pod spec includes an NFS volume mount
|
||||
- Other pods on the same node work fine
|
||||
|
||||
## Solution
|
||||
|
||||
### Step 1: Identify the NFS path
|
||||
```bash
|
||||
kubectl describe pod -n <namespace> <pod-name> | grep -A5 "Volumes:"
|
||||
```
|
||||
Look for the NFS server and path (e.g., `10.0.10.15:/mnt/main/myservice`)
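
When the pod is already failing, the same server and path also appear in the `Mounting arguments:` line of the FailedMount event. A small sketch that pulls them out; `nfs_source_from_events` is a hypothetical helper, and it assumes the source is written as `host:/path` (IPv4 address or hostname):

```bash
# Reads `kubectl describe pod` output on stdin and prints "server path" for
# each argument of the form host:/path on a "Mounting arguments:" line.
nfs_source_from_events() {
  awk '/Mounting arguments:/ {
    for (i = 1; i <= NF; i++) {
      n = index($i, ":/")
      if (n > 1) print substr($i, 1, n - 1), substr($i, n + 1)
    }
  }'
}

# Usage against a live cluster:
#   kubectl describe pod -n <namespace> <pod-name> | nfs_source_from_events
```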
|
||||
|
||||
### Step 2: Verify the export exists on NFS server
|
||||
SSH to the NFS server and check:
|
||||
```bash
|
||||
ssh root@<nfs-server> "ls -la /mnt/main/myservice"
|
||||
```
|
||||
|
||||
### Step 3: If directory doesn't exist, create it
|
||||
```bash
|
||||
ssh root@<nfs-server> "mkdir -p /mnt/main/myservice && chmod 777 /mnt/main/myservice"
|
||||
```
|
||||
|
||||
### Step 4: Add to NFS exports (TrueNAS specific)
|
||||
For TrueNAS, add the path to the NFS share configuration:
|
||||
1. Add directory to `scripts/nfs_directories.txt`
|
||||
2. Run `scripts/nfs_exports.sh` to update the share via API
|
||||
|
||||
### Step 5: Restart the pod
|
||||
```bash
|
||||
kubectl delete pod -n <namespace> -l app=<app-label>
|
||||
```
|
||||
The deployment will create a new pod that should now mount successfully.
|
||||
|
||||
## Verification
|
||||
```bash
|
||||
kubectl get pods -n <namespace>
|
||||
# Should show 1/1 Running instead of 0/1 ContainerCreating
|
||||
|
||||
kubectl exec -n <namespace> <pod-name> -- ls -la /app/data
|
||||
# Should show the mounted directory contents
|
||||
```
|
||||
|
||||
## Example
|
||||
**Symptom:**
|
||||
```
|
||||
Events:
|
||||
Warning FailedMount 55s (x13 over 11m) kubelet MountVolume.SetUp failed for volume "data" : mount failed: exit status 32
|
||||
Mounting command: mount
|
||||
Mounting arguments: -t nfs 10.0.10.15:/mnt/main/resume /var/lib/kubelet/pods/.../data
|
||||
Output: mount.nfs: Protocol not supported
|
||||
```
|
||||
|
||||
**Root Cause:** The directory `/mnt/main/resume` didn't exist on the TrueNAS server.
|
||||
|
||||
**Fix:**
|
||||
```bash
|
||||
ssh root@10.0.10.15 'mkdir -p /mnt/main/resume && chmod 777 /mnt/main/resume'
|
||||
# Then add to NFS exports and restart pod
|
||||
```
|
||||
|
||||
## Notes
|
||||
- The "Protocol not supported" error is misleading - it often means the export path doesn't exist
|
||||
- Always check the NFS server first before investigating protocol/firewall issues
|
||||
- For TrueNAS, the NFS share must be updated via API/UI after creating new directories
|
||||
- NFSv3 vs NFSv4 issues are rare in modern setups; missing paths are more common
|
||||
- Check that the NFS client packages are installed on Kubernetes nodes if this is a new cluster
|
||||
|
||||
## Variant: Non-Root Container UID Permission Denied
|
||||
|
||||
### Problem
|
||||
Container starts and mounts NFS successfully, but gets "Permission denied" when
|
||||
writing files. The pod appears healthy but operations fail silently.
|
||||
|
||||
### Trigger Conditions
|
||||
- Container logs show "Permission denied" or "client returned ERROR on write"
|
||||
- Pod is Running (not stuck in ContainerCreating)
|
||||
- NFS directory exists and is mounted, but owned by root (uid 0)
|
||||
- Container image runs as a non-root user (e.g., `curlimages/curl` runs as uid 101)
|
||||
- CronJobs or init containers that write to NFS fail with no obvious error
|
||||
|
||||
### Common Non-Root Container UIDs
|
||||
| Image | UID | User |
|-------|-----|------|
| `curlimages/curl` | 101 | curl_user |
| `nginx` (unprivileged) | 101 | nginx |
| `node` | 1000 | node (only when the image or pod selects this user; the official image defaults to root) |
| `python` (slim) | 0 | root (safe) |
| `grafana/grafana` | 472 | grafana |
|
||||
|
||||
### Solution
|
||||
Fix permissions on the NFS server:
|
||||
```bash
|
||||
# Option 1: World-writable (simplest, suitable for non-sensitive data)
|
||||
ssh root@10.0.10.15 "chmod -R 777 /mnt/main/<service>/<subdir>"
|
||||
|
||||
# Option 2: Match container UID (more secure)
|
||||
ssh root@10.0.10.15 "chown -R <uid>:<gid> /mnt/main/<service>/<subdir>"
|
||||
|
||||
# Option 3: Use securityContext in pod spec to run as root
|
||||
spec:
|
||||
securityContext:
|
||||
runAsUser: 0
|
||||
```
|
||||
|
||||
### Debugging
|
||||
```bash
|
||||
# Check what UID the container runs as
|
||||
kubectl exec -n <namespace> <pod> -- id
|
||||
|
||||
# Test write access from inside container
|
||||
kubectl exec -n <namespace> <pod> -- sh -c 'echo test > /path/to/nfs/testfile'
|
||||
|
||||
# Check NFS directory ownership on server
|
||||
ssh root@10.0.10.15 "ls -la /mnt/main/<service>/"
|
||||
```
|
||||
|
||||
## Variant: Stale NFS Mounts After Node Reboot (Ghost Running Pods)
|
||||
|
||||
### Problem
|
||||
After a node reboot (e.g., from kured rolling kernel updates), pods are rescheduled and
|
||||
show `Running 1/1` status, but the application process is frozen/hung. The service is
|
||||
completely unresponsive despite appearing healthy to Kubernetes.
|
||||
|
||||
### Trigger Conditions
|
||||
- Node was recently rebooted (check `uptime` on the node or the kured logs; the AGE column of `kubectl get nodes` does not reset on reboot)
|
||||
- Pod shows `Running 1/1` with 0 restarts (looks perfectly healthy)
|
||||
- Service is unresponsive — Uptime Kuma or curl shows timeout/connection refused
|
||||
- `kubectl exec <pod> -- ss -tlnp` shows **zero listening sockets** (the process started but is hung)
|
||||
- Pod uses NFS volumes (inline `nfs {}` or PVC backed by NFS)
|
||||
- Multiple pods across different namespaces all exhibit the same symptom simultaneously
|
||||
- `kubectl describe pod` shows no warnings or errors — everything looks normal
|
||||
|
||||
### Root Cause
|
||||
When a node reboots, the NFS client mounts go stale. If the pod is rescheduled to the
|
||||
same or different node before NFS fully recovers, the application process starts but
|
||||
immediately hangs when it tries to access the NFS-mounted filesystem. The process is
|
||||
stuck in an uninterruptible I/O wait (D state) but Kubernetes sees the container as
|
||||
running because the PID exists and liveness probes (if any) may not exercise the NFS path.
|
||||
|
||||
### Solution
|
||||
Force-delete the affected pods to trigger a clean reschedule with fresh NFS mounts:
|
||||
|
||||
```bash
|
||||
# Identify hung pods — Running but no listening sockets
|
||||
kubectl exec -n <namespace> <pod> -- ss -tlnp 2>/dev/null
|
||||
# If output is empty or shows no expected ports, the pod is hung
|
||||
|
||||
# Force-delete to skip graceful shutdown (hung process won't respond to SIGTERM)
|
||||
kubectl delete pod -n <namespace> <pod> --force --grace-period=0
|
||||
|
||||
# The deployment controller creates a new pod with fresh NFS mounts
|
||||
kubectl get pods -n <namespace> -w
|
||||
```
|
||||
|
||||
For bulk remediation after a cluster-wide event:
|
||||
```bash
# Find all pods with NFS volumes that might be hung
# Check each service's expected port: if ss -tlnp shows nothing, force-delete
for ns in calibre stirling-pdf send speedtest n8n paperless-ngx; do
  pod=$(kubectl get pod -n "$ns" -o name | head -1)
  [ -n "$pod" ] || continue   # namespace has no pods
  # Skip pods where exec fails or ss is missing; an empty result there proves nothing
  out=$(kubectl exec -n "$ns" "$pod" -- ss -tlnp 2>/dev/null) || continue
  sockets=$(printf '%s\n' "$out" | wc -l)
  if [ "$sockets" -le 1 ]; then   # header line only, or no output at all
    echo "HUNG: $ns/$pod (no listening sockets)"
    kubectl delete "$pod" -n "$ns" --force --grace-period=0
  fi
done
```
|
||||
|
||||
### Verification
|
||||
```bash
|
||||
# New pod should have listening sockets
|
||||
kubectl exec -n <namespace> <new-pod> -- ss -tlnp
|
||||
# Should show the application's expected port (e.g., *:8080)
|
||||
|
||||
# Service should respond
|
||||
kubectl exec -n <namespace> <new-pod> -- curl -sI http://localhost:<port>/
|
||||
# Should return HTTP response
|
||||
```
|
||||
|
||||
### Key Diagnostic Insight
|
||||
The critical signal is **Running 1/1 but zero listening sockets**. Normal healthy pods
|
||||
always have at least one listening socket for their application port. If `ss -tlnp`
|
||||
returns nothing, the process is hung on a stale NFS mount, not crashed — that's why
|
||||
Kubernetes thinks it's fine.
|
||||
|
||||
### Prevention
|
||||
- Add **liveness probes** that hit the application's HTTP endpoint (not just TCP connect):
|
||||
```hcl
|
||||
liveness_probe {
|
||||
http_get {
|
||||
path = "/"
|
||||
port = 8080
|
||||
}
|
||||
initial_delay_seconds = 60
|
||||
period_seconds = 30
|
||||
timeout_seconds = 5
|
||||
}
|
||||
```
|
||||
- This ensures Kubernetes detects hung pods and restarts them automatically.
|
||||
|
||||
## See Also
|
||||
- **nfsv4-idmapd-uid-mapping** — All UIDs show as 65534 (nobody) inside containers. Different from permission denied; the UIDs are wrong, not the permissions.
|
||||
- TrueNAS NFS configuration documentation
|
||||
- Kubernetes NFS volume documentation
|
||||
- k8s-limitrange-oom-silent-kill (for OOM issues often confused with NFS hangs)
|
||||
|
|
@ -0,0 +1,109 @@
|
|||
---
|
||||
name: kubelet-static-pod-manifest-update
|
||||
description: |
|
||||
Force kubelet to pick up changes to static pod manifests in /etc/kubernetes/manifests/.
|
||||
Use when: (1) edited kube-apiserver.yaml but the running process still has old flags,
|
||||
(2) kubelet restart doesn't pick up manifest changes, (3) touching the manifest file
|
||||
doesn't trigger pod recreation, (4) killing the API server process results in the
|
||||
same old args on restart, (5) the pod's config.hash annotation doesn't match the
|
||||
file's hash. Requires a full cycle: remove manifest, stop kubelet, remove containers,
|
||||
re-add manifest, start kubelet.
|
||||
author: Claude Code
|
||||
version: 1.0.0
|
||||
date: 2026-02-17
|
||||
---
|
||||
|
||||
# Kubelet Static Pod Manifest Update
|
||||
|
||||
## Problem
|
||||
After editing a static pod manifest (e.g., `/etc/kubernetes/manifests/kube-apiserver.yaml`
|
||||
to add OIDC or audit flags), kubelet continues running the pod with the old configuration.
|
||||
Standard approaches like `touch`, `systemctl restart kubelet`, or `kubectl delete pod`
|
||||
do not force kubelet to reconcile the new manifest.
|
||||
|
||||
## Context / Trigger Conditions
|
||||
- Edited `/etc/kubernetes/manifests/kube-apiserver.yaml` (or other static pod manifests)
|
||||
- The running process (`ps aux | grep kube-apiserver`) shows old flags
|
||||
- `kubectl get pod -n kube-system kube-apiserver-* -o jsonpath='{.metadata.annotations.kubernetes\.io/config\.hash}'` returns a stale hash
|
||||
- Any of these actions failed to apply the changes:
|
||||
- `touch /etc/kubernetes/manifests/kube-apiserver.yaml`
|
||||
- `systemctl restart kubelet`
|
||||
- `kubectl delete pod kube-apiserver-*`
|
||||
- Killing the API server process directly
|
||||
|
||||
## Root Cause
|
||||
Kubelet maintains an internal cache of static pod specs keyed by a hash of the manifest.
|
||||
When the manifest changes, kubelet should detect the new hash and recreate the pod.
|
||||
However, in practice (observed on Kubernetes 1.34.x), kubelet can get stuck with the
|
||||
old hash if:
|
||||
- The pod's mirror object in the API server still exists with the old hash
|
||||
- Kubelet's internal pod cache wasn't cleared between restarts
|
||||
- The container runtime (containerd) still has the old container running
|
||||
|
||||
## Solution
|
||||
|
||||
Full restart cycle on the master node:
|
||||
|
||||
```bash
|
||||
# 1. Back up the manifest
|
||||
sudo cp /etc/kubernetes/manifests/kube-apiserver.yaml /tmp/kube-apiserver.yaml.bak
|
||||
|
||||
# 2. Remove the manifest (kubelet will stop the pod)
|
||||
sudo rm /etc/kubernetes/manifests/kube-apiserver.yaml
|
||||
|
||||
# 3. Stop kubelet
|
||||
sudo systemctl stop kubelet
|
||||
|
||||
# 4. Wait for the API server container to stop
|
||||
sleep 5
|
||||
|
||||
# 5. Force-remove any remaining API server containers
|
||||
sudo crictl rm -f $(sudo crictl ps -aq --name kube-apiserver 2>/dev/null) 2>/dev/null
|
||||
|
||||
# 6. Re-add the manifest (with your changes)
|
||||
sudo cp /tmp/kube-apiserver.yaml.bak /etc/kubernetes/manifests/kube-apiserver.yaml
|
||||
|
||||
# 7. Start kubelet
|
||||
sudo systemctl start kubelet
|
||||
|
||||
# 8. Wait for API server to come up (30-60 seconds)
|
||||
sleep 45
|
||||
|
||||
# 9. Verify new flags are active
|
||||
sudo cat /proc/$(pgrep -f 'kube-apiserver --' | head -1)/cmdline | tr '\0' '\n' | grep 'your-new-flag'
|
||||
```
|
||||
|
||||
**Critical:** The order matters. Removing the manifest BEFORE stopping kubelet ensures
kubelet processes the removal. Clearing the containers while kubelet is stopped ensures
no stale state. Finally, re-adding the manifest and then starting kubelet triggers a
fresh pod creation.
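
The fixed `sleep 45` in step 8 can be replaced by polling until the API server answers. A minimal sketch, assuming a POSIX-style shell with `local`; `retry_until` is a hypothetical helper, not a kubectl feature:

```bash
# Runs the given command up to <attempts> times, sleeping <delay> seconds
# between tries. Returns 0 on the first success, 1 if every attempt fails.
retry_until() {
  local attempts="$1" delay="$2" i=0
  shift 2
  while [ "$i" -lt "$attempts" ]; do
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    sleep "$delay"
    i=$(( i + 1 ))
  done
  return 1
}

# Usage on the master node (up to 30 tries, 2 seconds apart):
#   retry_until 30 2 sudo kubectl --kubeconfig=/etc/kubernetes/admin.conf get --raw /readyz
```

Polling returns as soon as the server is ready and fails loudly if it never comes up, which a fixed sleep cannot do.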
|
||||
|
||||
## What Does NOT Work
|
||||
|
||||
| Approach | Why it fails |
|----------|-------------|
| `touch manifest.yaml` | Kubelet may not detect mtime-only changes |
| `systemctl restart kubelet` | Kubelet reuses cached pod spec if hash matches |
| `kubectl delete pod` | Deletes mirror pod but kubelet recreates from cached spec |
| `kill <apiserver-pid>` | Kubelet restarts the container from the same cached spec, so the old args return |
| Moving manifest away and back without stopping kubelet | Kubelet may cache the old spec in memory |
|
||||
|
||||
## Verification
|
||||
|
||||
```bash
|
||||
# Check the running process has new flags
|
||||
ps aux | grep kube-apiserver | grep -v grep | grep 'your-new-flag'
|
||||
|
||||
# Check the config hash changed
|
||||
kubectl get pod -n kube-system kube-apiserver-$(hostname) \
|
||||
-o jsonpath='{.metadata.annotations.kubernetes\.io/config\.hash}'
|
||||
|
||||
# Check API server logs for successful startup
|
||||
kubectl logs -n kube-system kube-apiserver-$(hostname) | tail -5
|
||||
```
|
||||
|
||||
## Notes
|
||||
- This applies to ALL static pods, not just kube-apiserver (etcd, controller-manager, scheduler)
|
||||
- The cluster will be briefly unavailable during the restart (30-60 seconds)
|
||||
- On single-master clusters, kubectl commands will fail during the restart — use `sudo kubectl --kubeconfig=/etc/kubernetes/admin.conf` from the master
|
||||
- Always validate the YAML before removing the manifest: `python3 -c "import yaml; yaml.safe_load(open('/etc/kubernetes/manifests/kube-apiserver.yaml'))"`
|
||||
- See also: `authentik-oidc-kubernetes` skill for the full OIDC setup context
|
||||
143
.claude/skills/archived/local-llm-gpu-selection/SKILL.md
Normal file
|
|
@ -0,0 +1,143 @@
|
|||
---
|
||||
name: local-llm-gpu-selection
|
||||
description: |
|
||||
Guide for selecting GPUs and hardware for local LLM inference on Dell R730 and
|
||||
comparing to Apple Silicon alternatives. Use when: (1) user asks about running
|
||||
local models (Ollama, llama.cpp), (2) user asks which GPU to buy for LLMs,
|
||||
(3) user wants to compare local models to Claude for coding, (4) user asks about
|
||||
quantized model selection, (5) user asks about Mac Mini/Studio vs GPU server for
|
||||
LLMs. Covers VRAM requirements, memory bandwidth as key metric, R730 GPU compatibility,
|
||||
multi-GPU considerations, and realistic quality comparisons to Claude models.
|
||||
author: Claude Code
|
||||
version: 1.0.0
|
||||
date: 2025-06-11
|
||||
---
|
||||
|
||||
# Local LLM GPU Selection & Performance Guide
|
||||
|
||||
## Problem
|
||||
Choosing the right hardware for local LLM inference requires understanding the
|
||||
relationship between VRAM capacity, memory bandwidth, GPU compatibility with
|
||||
server chassis, and realistic model quality expectations.
|
||||
|
||||
## Context / Trigger Conditions
|
||||
- User asks about running quantized models locally (Ollama, llama.cpp)
|
||||
- User wants to know which GPU fits their server (Dell R730 or similar 2U)
|
||||
- User asks about Apple Silicon (Mac Mini/Studio) vs datacenter GPUs for LLMs
|
||||
- User wants to compare local model quality to Claude (Opus/Sonnet/Haiku) for coding
|
||||
|
||||
## Key Principle: Memory Bandwidth Is Everything
|
||||
|
||||
LLM token generation is **memory-bandwidth bound**, not compute bound. The formula:
|
||||
```
|
||||
approx tokens/sec = memory_bandwidth_GB_s / model_size_GB
|
||||
```
|
||||
This is why Apple Silicon (high bandwidth unified memory) competes with datacenter GPUs
|
||||
despite having less raw compute.
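
As a worked example of the formula, the sketch below plugs in bandwidth and model-size figures quoted elsewhere in this guide. `est_tokens_per_sec` is a hypothetical helper, and the result is an upper bound rather than a benchmark:

```bash
# Upper-bound estimate: bandwidth (GB/s) divided by model size (GB).
# Real throughput is lower; this ignores compute, KV cache reads and overhead.
est_tokens_per_sec() {
  awk -v bw="$1" -v size="$2" 'BEGIN { printf "%.1f\n", bw / size }'
}

est_tokens_per_sec 320 5    # Tesla T4, 7-8B at Q4_K_M   -> 64.0
est_tokens_per_sec 346 20   # Tesla P40, 32B at Q4_K_M   -> 17.3
```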
|
||||
|
||||
## VRAM Requirements by Model Size
|
||||
|
||||
| Model Size | Quant | VRAM Needed | Examples |
|------------|-------|-------------|----------|
| 7-8B | Q4_K_M | ~5 GB | Llama 3.1 8B, Mistral 7B |
| 7-8B | Q8_0 | ~8 GB | |
| 13-14B | Q4_K_M | ~8 GB | Qwen 2.5 Coder 14B |
| 22-24B | Q4_K_M | ~13-14 GB | Mistral Small, Codestral |
| 32B | Q4_K_M | ~20 GB | Qwen 2.5 Coder 32B |
| 32B | Q8_0 | ~34 GB | |
| 70B | Q4_K_M | ~40 GB | Llama 3.1 70B |
| 70B | Q8_0 | ~70 GB | |
|
||||
|
||||
Add ~1-2 GB overhead for KV cache and context. Longer conversations use more.
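
A quick way to apply the table plus overhead rule when shortlisting cards. This is only a sketch: `fits_in_vram` is a hypothetical helper that works in whole GB and defaults to the 2 GB end of the overhead range:

```bash
# Succeeds (exit 0) when model size plus overhead fits in the card's VRAM.
# All values are whole GB; overhead defaults to 2 GB.
fits_in_vram() {
  local model_gb="$1" gpu_gb="$2" overhead_gb="${3:-2}"
  [ $(( model_gb + overhead_gb )) -le "$gpu_gb" ]
}

fits_in_vram 20 24 && echo "32B Q4_K_M fits on a 24 GB card"
fits_in_vram 20 16 || echo "32B Q4_K_M does not fit on a 16 GB card"
```

Long contexts can need more than 2 GB, so treat a narrow pass as a warning rather than a guarantee.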
|
||||
|
||||
## Dell R730 GPU Compatibility
|
||||
|
||||
### Constraints
|
||||
- **2U chassis**: Full-height cards fit, but limited to dual-slot width
|
||||
- **PCIe 3.0 x16 slots**: 2-3 usable slots depending on riser configuration
|
||||
- **Power**: Needs Dell GPU power cable (P/N 0D4J0T) for GPUs >75W TDP
|
||||
- **PSU**: Check wattage headroom (dual 750W or 1100W typical)
|
||||
|
||||
### Compatible GPUs
|
||||
|
||||
**No external power needed (<=75W):**
|
||||
- Tesla T4: 16 GB, 320 GB/s, 70W — best drop-in option
|
||||
- Tesla P4: 8 GB, 192 GB/s, 75W — too little VRAM for modern LLMs
|
||||
- NVIDIA L4: 24 GB, 300 GB/s, 72W — T4 successor, Ada Lovelace, expensive
|
||||
- NVIDIA A2: 16 GB, 200 GB/s, 60W — worse than T4 in every way, avoid
|
||||
|
||||
**Requires power cable (>75W):**
|
||||
- Tesla P40: 24 GB, 346 GB/s, 250W — best value per GB
|
||||
- Tesla V100 PCIe: 32 GB, 900 GB/s, 250W — excellent bandwidth
|
||||
- Tesla P100 PCIe: 16 GB, 732 GB/s, 250W — same VRAM as T4, not worth it
|
||||
|
||||
**Won't fit or impractical:**
|
||||
- RTX 3090/4090: Too thick (3-slot), too long
|
||||
- A100: Fits physically but very expensive
|
||||
- Any consumer RTX: Generally too large for 2U
|
||||
|
||||
### Multi-GPU Considerations
|
||||
- Ollama splits model layers across GPUs automatically
|
||||
- PCIe 3.0 cross-GPU transfer adds ~30-40% latency penalty
|
||||
- Mismatched GPUs (e.g., T4 + P40) work but the slower card bottlenecks
|
||||
- R730 PCIe 3.0 halves the host-link bandwidth of PCIe 4.0 cards such as the L4; their on-card memory bandwidth is unaffected
|
||||
|
||||
## Apple Silicon Comparison
|
||||
|
||||
Apple Silicon unified memory means ALL system RAM = VRAM with no bus penalty.
|
||||
|
||||
| Device | Memory | Bandwidth | Advantage |
|--------|--------|-----------|-----------|
| Mac Mini M4 Pro 48 GB | 48 GB | 273 GB/s | Silent, 25W, no PCIe penalty |
| Mac Studio M4 Max 128 GB | 128 GB | 546 GB/s | Run 100B+ models |
| Mac Studio M3 Ultra | 96-512 GB | 819 GB/s | Run anything |
|
||||
|
||||
A Mac Mini M4 Pro 48GB often matches or beats a T4+L4 multi-GPU setup for
|
||||
LLM inference due to zero cross-GPU overhead and high unified bandwidth.
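
As a rough sanity check on the bandwidth figures quoted in this document, decode speed for a dense model is often approximated as memory bandwidth divided by the bytes read per token (roughly the size of the quantized weights). The sketch below applies that rule of thumb. The function name and the 9 GB example model size are illustrative assumptions, and real throughput will be lower than this ceiling.

```python
def max_tokens_per_sec(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Rule-of-thumb ceiling: every generated token reads all weights once."""
    if model_size_gb <= 0:
        raise ValueError("model_size_gb must be positive")
    return bandwidth_gb_s / model_size_gb


# Bandwidth figures are the ones quoted in this document;
# the 9 GB model size is a hypothetical example.
for name, bandwidth in [("Tesla T4", 320), ("Tesla P40", 346), ("Mac Mini M4 Pro", 273)]:
    ceiling = max_tokens_per_sec(bandwidth, 9.0)
    print(f"{name}: at most ~{ceiling:.0f} tokens/s for a 9 GB model")
```

This ignores compute limits, KV cache reads, and any cross-GPU transfers, so treat the result as an upper bound for comparing devices, not a prediction.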
|
||||
|
||||
## Best Coding Models (for Ollama)
|
||||
|
||||
For coding tasks specifically, prefer dedicated coding models:
|
||||
1. **Qwen 2.5 Coder 32B** — best open-source coding model in this size class
|
||||
2. **Codestral 22B** — Mistral's dedicated coding model
|
||||
3. **DeepSeek Coder V2** — good quality, efficient
|
||||
4. **Llama 3.1 70B** — strong general purpose but needs ~40 GB
|
||||
|
||||
## Realistic Quality Comparison to Claude
|
||||
|
||||
For Claude Code-style agentic coding workflows:
|
||||
|
||||
| Capability | Opus/Sonnet | Haiku | Qwen 2.5 Coder 32B | 70B General |
|
||||
|-----------|-------------|-------|---------------------|-------------|
|
||||
| Single function gen | Excellent | Good | Good | Decent |
|
||||
| Multi-file refactoring | Excellent | Decent | Weak | Weak |
|
||||
| Tool use / agentic loops | Excellent | Good | Poor | Poor |
|
||||
| Long context (large codebases) | Excellent | Good | Weak | Weak |
|
||||
|
||||
Local models work for simple completions and code questions. They struggle badly
|
||||
with Claude Code's complex multi-step tool-use workflows, long context windows,
|
||||
and self-correction capabilities.
|
||||
|
||||
## Quantization Quality Guide
|
||||
|
||||
From best to worst quality (and largest to smallest):
|
||||
- FP16: Full precision, baseline quality
|
||||
- Q8_0: Near-lossless, ~50% size reduction
|
||||
- Q6_K: Minimal quality loss
|
||||
- Q5_K_M: Good balance
|
||||
- Q4_K_M: **Recommended default** — best quality/size tradeoff
|
||||
- Q3_K_M: Noticeable degradation on complex reasoning
|
||||
- Q2_K: Significant quality loss, emergency only
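
To see how the quantization level translates into memory, the following sketch estimates the footprint of a model from its parameter count. The bits-per-weight values are approximations assumed for illustration (real GGUF files vary by architecture), and the default overhead is the midpoint of the ~1-2 GB allowance mentioned earlier.

```python
# Approximate bits per weight for common GGUF quantizations.
# These are assumed ballpark values, not exact figures for any given model.
BITS_PER_WEIGHT = {
    "FP16": 16.0,
    "Q8_0": 8.5,
    "Q6_K": 6.6,
    "Q5_K_M": 5.7,
    "Q4_K_M": 4.85,
    "Q3_K_M": 3.9,
    "Q2_K": 3.0,
}


def estimate_vram_gb(params_billion: float, quant: str, overhead_gb: float = 1.5) -> float:
    """Weights (params * bits / 8) plus a flat allowance for KV cache and context."""
    weights_gb = params_billion * BITS_PER_WEIGHT[quant] / 8
    return weights_gb + overhead_gb


def fits(params_billion: float, quant: str, vram_gb: float) -> bool:
    """True if the estimated footprint fits in the given amount of VRAM."""
    return estimate_vram_gb(params_billion, quant) <= vram_gb


print(f"32B at Q4_K_M: ~{estimate_vram_gb(32, 'Q4_K_M'):.1f} GB")
print("Fits in 24 GB:", fits(32, "Q4_K_M", 24))
print("Fits in 16 GB:", fits(32, "Q4_K_M", 16))
```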
|
||||
|
||||
## Verification
|
||||
- Check GPU compatibility: `lspci | grep -i nvidia` on the host
|
||||
- Check available VRAM: `nvidia-smi` inside the GPU VM
|
||||
- Check model fit: while a model is loaded, `ollama ps` shows its size and how it is split between CPU and GPU
|
||||
- Check inference speed: run `ollama run <model> --verbose` and read the `eval rate` (tokens/s) it prints after each response
|
||||
|
||||
## Notes
|
||||
- GPU prices fluctuate significantly in the used market; check current prices
|
||||
- The T4 is a PCIe 3.0 card, so the R730's slots do not limit it; PCIe 4.0 cards in those slots run at reduced host-link bandwidth
|
||||
- Power consumption matters for 24/7 homelab use (electricity cost)
|
||||
- For Claude Code specifically, API-based Claude models remain significantly
|
||||
superior to any local model for agentic coding workflows
|
||||
143
.claude/skills/archived/loki-helm-deployment-pitfalls/SKILL.md
Normal file
|
|
@ -0,0 +1,143 @@
|
|||
---
|
||||
name: loki-helm-deployment-pitfalls
|
||||
description: |
|
||||
Fix common Loki Helm chart deployment failures on Kubernetes with Terraform.
|
||||
Use when: (1) Loki pod fails with "mkdir: read-only file system" for compactor
|
||||
or ruler paths, (2) Helm chart fails with "Helm test requires the Loki Canary
|
||||
to be enabled", (3) Helm install fails with "cannot re-use a name that is still
|
||||
in use" after a failed atomic deploy, (4) PV stuck in Released state after failed
|
||||
Helm install, (5) "entry too far behind" errors flooding Loki logs after initial
|
||||
Alloy deployment. Covers single-binary mode with filesystem storage on NFS.
|
||||
author: Claude Code
|
||||
version: 1.0.0
|
||||
date: 2026-02-13
|
||||
---
|
||||
|
||||
# Loki Helm Chart Deployment Pitfalls
|
||||
|
||||
## Problem
|
||||
Deploying the Grafana Loki Helm chart in single-binary mode with Terraform hits
|
||||
multiple non-obvious failures that aren't documented together.
|
||||
|
||||
## Context / Trigger Conditions
|
||||
- Deploying Loki via `helm_release` in Terraform
|
||||
- Using `deploymentMode: SingleBinary` with filesystem storage on NFS
|
||||
- First-time deployment or redeployment after failures
|
||||
|
||||
## Pitfall 1: Read-Only Root Filesystem
|
||||
|
||||
**Error:** `mkdir /loki/compactor: read-only file system`
|
||||
|
||||
**Cause:** The Loki Helm chart runs containers with a read-only root filesystem
|
||||
for security. The compactor `working_directory` and ruler `rule_path` default to
|
||||
paths under `/loki/` which is on the read-only root FS.
|
||||
|
||||
**Fix:** Use paths under `/var/loki/` — the Helm chart mounts the persistence
|
||||
volume there:
|
||||
```yaml
|
||||
compactor:
|
||||
working_directory: /var/loki/compactor # NOT /loki/compactor
|
||||
ruler:
|
||||
rule_path: /var/loki/scratch # NOT /loki/scratch
|
||||
```
|
||||
|
||||
## Pitfall 2: Canary Required
|
||||
|
||||
**Error:** `Helm test requires the Loki Canary to be enabled`
|
||||
|
||||
**Cause:** The Loki Helm chart's validation template fails when `test.enabled` is true
(the default) while `lokiCanary.enabled` is false, because the Helm test depends on the canary.
|
||||
|
||||
**Fix:** Leave `lokiCanary` enabled (default). You can disable `gateway`,
|
||||
`chunksCache`, and `resultsCache` to reduce resource usage:
|
||||
```yaml
|
||||
gateway:
|
||||
enabled: false
|
||||
chunksCache:
|
||||
enabled: false
|
||||
resultsCache:
|
||||
enabled: false
|
||||
# Do NOT add: lokiCanary: enabled: false
|
||||
```
|
||||
|
||||
## Pitfall 3: Stale Helm Release After Failed Atomic Deploy
|
||||
|
||||
**Error:** `cannot re-use a name that is still in use`
|
||||
|
||||
**Cause:** When `atomic = true` and the deploy fails, Helm rolls back but
|
||||
sometimes leaves a stale release secret in Kubernetes. Terraform then can't
|
||||
create a new release with the same name.
|
||||
|
||||
**Fix:** Delete the stale Helm secret:
|
||||
```bash
|
||||
kubectl delete secret -n monitoring sh.helm.release.v1.loki.v1
|
||||
```
|
||||
Also consider removing `atomic = true` for initial deployments and adding it
|
||||
back after the first successful install. Use a longer `timeout` (600s+) for
|
||||
first deploy since image pulls take time.
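
Helm 3 stores each release revision in a Secret named `sh.helm.release.v1.<release>.v<revision>`, so the stale secret is not always the `.v1` one. A minimal sketch for picking the right names out of a secret listing, assuming the names come from something like `kubectl get secret -n monitoring -o name` with the `secret/` prefix stripped (the helper name is hypothetical):

```python
import re

# Helm 3 release secret naming: sh.helm.release.v1.<release>.v<revision>
_SECRET_RE = re.compile(r"^sh\.helm\.release\.v1\.(?P<release>.+)\.v(?P<rev>\d+)$")


def release_secrets(secret_names, release):
    """Return (revision, secret_name) pairs for one release, oldest first."""
    found = []
    for name in secret_names:
        match = _SECRET_RE.match(name)
        if match and match.group("release") == release:
            found.append((int(match.group("rev")), name))
    return sorted(found)


names = [
    "sh.helm.release.v1.loki.v1",
    "sh.helm.release.v1.loki.v2",
    "sh.helm.release.v1.alloy.v1",
    "some-other-secret",
]
for revision, name in release_secrets(names, "loki"):
    print(revision, name)
```

Listing the revisions first avoids deleting a secret that belongs to a healthy revision of the same release.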
|
||||
|
||||
## Pitfall 4: PV Stuck in Released State
|
||||
|
||||
**Symptom:** PV shows `Released` status, PVC can't bind, Loki pod stuck in Pending.
|
||||
|
||||
**Cause:** After a failed Helm deploy, the PVC is deleted but the PV retains a
|
||||
`claimRef` to the old PVC. New PVCs can't bind to a `Released` PV.
|
||||
|
||||
**Fix:** Clear the stale claimRef:
|
||||
```bash
|
||||
kubectl patch pv loki --type json -p '[{"op": "remove", "path": "/spec/claimRef"}]'
|
||||
```
|
||||
The PV will transition from `Released` to `Available` and can be bound again.
|
||||
|
||||
## Pitfall 5: "Entry Too Far Behind" Log Spam
|
||||
|
||||
**Error:** `entry too far behind, entry timestamp is: ... oldest acceptable timestamp is: ...`
|
||||
|
||||
**Cause:** Alloy reads all historical log files from the Kubernetes API on first
|
||||
startup. Old entries are rejected by Loki's ingester because they're behind the
|
||||
newest entry for that stream.
|
||||
|
||||
**Fix:** This is harmless and self-resolving — Alloy catches up to present time
|
||||
and errors stop. To clear immediately:
|
||||
```bash
|
||||
kubectl rollout restart ds -n monitoring alloy
|
||||
```
|
||||
After restart, Alloy tails from approximately "now" for each container.
|
||||
|
||||
## Pitfall 6: Alertmanager Service Name
|
||||
|
||||
**Symptom:** Loki ruler alerts never fire despite correct LogQL rules.
|
||||
|
||||
**Cause:** The Prometheus Helm chart names the Alertmanager service
|
||||
`prometheus-alertmanager`, not `alertmanager`. Using the wrong name causes
|
||||
silent alert delivery failures.
|
||||
|
||||
**Fix:**
|
||||
```yaml
|
||||
ruler:
|
||||
alertmanager_url: http://prometheus-alertmanager.monitoring.svc.cluster.local:9093
|
||||
```
|
||||
Verify the actual service name: `kubectl get svc -n monitoring | grep alertmanager`
|
||||
|
||||
## Verification
|
||||
```bash
|
||||
# Loki pod running
|
||||
kubectl get pods -n monitoring -l app.kubernetes.io/name=loki
|
||||
|
||||
# Loki receiving logs
|
||||
kubectl port-forward -n monitoring svc/loki 3100:3100 &
|
||||
curl -s 'http://localhost:3100/loki/api/v1/labels'
|
||||
# Should return JSON with namespace, pod, container labels
|
||||
|
||||
# PV bound
|
||||
kubectl get pv loki
|
||||
# STATUS should be "Bound"
|
||||
```
|
||||
|
||||
## Notes
|
||||
- Always check PV status before retrying a failed deploy
|
||||
- The Loki Helm chart creates many components by default (gateway, canary,
|
||||
memcached caches) — disable what you don't need for single-binary mode
|
||||
- WAL directory can be on tmpfs (emptyDir with `medium: Memory`) for
|
||||
disk-friendly setups, but data is lost on pod crash
|
||||
- See also: `helm-release-force-rerender` for Helm values not updating resources
|
||||
|
|
@ -0,0 +1,148 @@
|
|||
---
|
||||
name: music-assistant-librespot-wrong-account
|
||||
description: |
|
||||
Fix for Music Assistant Spotify playback failing with "librespot does not support free
|
||||
accounts" even when the Spotify account has Premium. Use when: (1) Songs load for 1-2
|
||||
seconds then auto-pause, (2) Music Assistant logs show "librespot does not support free
|
||||
accounts" followed by FFmpeg "Invalid data found when processing input" exit code 183,
|
||||
(3) Spotify provider shows "Successfully logged in" but streaming fails. Root cause is
|
||||
stale librespot credential cache pointing to a different (free-tier) Spotify account.
|
||||
author: Claude Code
|
||||
version: 1.0.0
|
||||
date: 2026-02-21
|
||||
---
|
||||
|
||||
# Music Assistant Librespot Wrong Account / Stale Credentials
|
||||
|
||||
## Problem
|
||||
Music Assistant (MASS) Spotify playback fails immediately — songs appear to load for 1-2
|
||||
seconds then auto-pause. Every track is marked "unplayable". The error log shows librespot
|
||||
rejecting the account as "free" despite the configured Spotify account having Premium.
|
||||
|
||||
## Context / Trigger Conditions
|
||||
- Music Assistant addon on Home Assistant (tested with v2.7.8, addon `d5369777_music_assistant`)
|
||||
- Symptoms: Song starts loading, pauses after 1-2 seconds, skipped as "unplayable"
|
||||
- Log pattern (all three appear together on every play attempt):
|
||||
```
|
||||
WARNING [music_assistant.spotify] [librespot] librespot does not support "free" accounts.
|
||||
WARNING [music_assistant.audio.media_stream] Error opening input: Invalid data found when processing input
|
||||
ERROR [music_assistant.streams] AudioError while streaming queue item ... FFMpeg exited with code 183
|
||||
```
|
||||
- OAuth login succeeds: `Successfully logged in to Spotify as <Name>`
|
||||
- But librespot streaming fails with the "free" account error
|
||||
|
||||
## Root Cause
|
||||
Music Assistant uses **two separate auth mechanisms** for Spotify:
|
||||
1. **OAuth (PKCE flow)** — for browsing, search, metadata. Uses access tokens refreshed via
|
||||
the Spotify Web API. This is what produces the "Successfully logged in" message.
|
||||
2. **Librespot** — for actual audio streaming. Uses cached credentials stored in
|
||||
`/data/.cache/spotify--<id>/credentials.json` inside the addon container.
|
||||
|
||||
The librespot credential cache can become stale or point to a **different Spotify account**
|
||||
(e.g., if another family member logged in, or credentials were cached from before a Premium
|
||||
upgrade). Librespot uses these cached credentials to connect to Spotify's internal API, which
|
||||
returns a `ProductInfo` XML packet containing the account `type`. If the cached account is
|
||||
"free", librespot calls `exit(1)`, killing the audio pipeline before FFmpeg receives any data.
|
||||
|
||||
## How Librespot Determines Account Type
|
||||
Librespot reads the `type` field from Spotify's `ProductInfo` server packet
|
||||
(`librespot-org/librespot`, `core/src/session.rs`):
|
||||
```rust
|
||||
fn check_catalogue(attributes: &UserAttributes) {
|
||||
if let Some(account_type) = attributes.get("type") {
|
||||
if account_type != "premium" {
|
||||
error!("librespot does not support {account_type:?} accounts.");
|
||||
exit(1);
|
||||
}
|
||||
}
|
||||
}
|
||||
```
|
||||
The check is an exact string match against `"premium"`.
|
||||
|
||||
## Solution
|
||||
|
||||
### Step 1: Verify the Problem
|
||||
Check Music Assistant addon logs for the "free accounts" error:
|
||||
```bash
|
||||
# Via HA API (from a machine with the HA token); if this returns 401, read the logs from the SSH addon instead
|
||||
python3 -c "
|
||||
import os, json, requests
|
||||
url = os.environ.get('HOME_ASSISTANT_SOFIA_URL', '').rstrip('/')
|
||||
token = os.environ.get('HOME_ASSISTANT_SOFIA_TOKEN', '')
|
||||
headers = {'Authorization': f'Bearer {token}'}
|
||||
r = requests.get(f'{url}/api/hassio/addons/d5369777_music_assistant/logs', headers=headers)
|
||||
for line in r.text.split('\n'):
|
||||
if 'free' in line.lower() or 'librespot' in line.lower():
|
||||
print(line)
|
||||
"
|
||||
```
|
||||
|
||||
### Step 2: Identify the Music Assistant Container
|
||||
From the SSH addon (ha-sofia: `ssh vbarzin@192.168.1.8`):
|
||||
```bash
|
||||
sudo curl -s --unix-socket /run/docker.sock http://localhost/containers/json | \
|
||||
python3 -c "import sys,json; [print(c['Names'][0], c['Id'][:12]) for c in json.load(sys.stdin) if 'music' in c['Names'][0].lower()]"
|
||||
```
|
||||
|
||||
### Step 3: Check Cached Credentials
|
||||
Exec into the container to read the librespot cache:
|
||||
```bash
|
||||
# Create exec
|
||||
EXEC_ID=$(sudo curl -s --unix-socket /run/docker.sock \
|
||||
"http://localhost/containers/<CONTAINER_ID>/exec" \
|
||||
-H 'Content-Type: application/json' \
|
||||
-d '{"Cmd":["cat","/data/.cache/spotify--5s3mSP8y/credentials.json"],"AttachStdout":true,"AttachStderr":true}' | python3 -c "import sys,json; print(json.load(sys.stdin)['Id'])")
|
||||
|
||||
# Run exec
|
||||
sudo curl -s --unix-socket /run/docker.sock \
|
||||
"http://localhost/exec/$EXEC_ID/start" \
|
||||
-H 'Content-Type: application/json' -d '{"Detach":false}'
|
||||
```
|
||||
Check the `username` field — if it doesn't match the expected Premium account, that's the problem.
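
The comparison can be scripted once the file contents are in hand. This is a minimal sketch that assumes only what the step above relies on, namely that `credentials.json` is a JSON object with a `username` key; the function name and the sample values are hypothetical.

```python
import json


def cached_account_matches(credentials_json: str, expected_username: str) -> bool:
    """Compare the cached librespot username against the expected Premium account."""
    data = json.loads(credentials_json)
    return data.get("username") == expected_username


# Hypothetical file contents; a real file carries additional fields.
sample = '{"username": "some_other_user"}'
if not cached_account_matches(sample, "premium_user"):
    print("Cached credentials belong to a different account; clear the cache.")
```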
|
||||
|
||||
### Step 4: Clear the Cache
|
||||
```bash
|
||||
# Create exec to delete cache
|
||||
EXEC_ID=$(sudo curl -s --unix-socket /run/docker.sock \
|
||||
"http://localhost/containers/<CONTAINER_ID>/exec" \
|
||||
-H 'Content-Type: application/json' \
|
||||
-d '{"Cmd":["rm","-rf","/data/.cache/spotify--5s3mSP8y"],"AttachStdout":true,"AttachStderr":true}' | python3 -c "import sys,json; print(json.load(sys.stdin)['Id'])")
|
||||
|
||||
# Run exec
|
||||
sudo curl -s --unix-socket /run/docker.sock \
|
||||
"http://localhost/exec/$EXEC_ID/start" \
|
||||
-H 'Content-Type: application/json' -d '{"Detach":false}'
|
||||
```
|
||||
|
||||
### Step 5: Restart Music Assistant
|
||||
```bash
|
||||
sudo curl -s --unix-socket /run/docker.sock \
|
||||
"http://localhost/containers/<CONTAINER_ID>/restart" -X POST
|
||||
```
|
||||
|
||||
### Step 6: Verify
|
||||
After restart, check logs for:
|
||||
- `Successfully logged in to Spotify as <Name>` (OAuth OK)
|
||||
- No "free accounts" error when playing a track
|
||||
- Optionally re-check `/data/.cache/spotify--5s3mSP8y/credentials.json` to confirm the
|
||||
`username` now matches the Premium account
|
||||
|
||||
## Verification
|
||||
1. Play any Spotify track through Music Assistant
|
||||
2. The track should stream without pausing after 1-2 seconds
|
||||
3. Logs should show `Start Queue Flow stream` without subsequent `AudioError`
|
||||
|
||||
## Notes
|
||||
- The cache directory name `spotify--5s3mSP8y` is an internal Music Assistant provider ID
|
||||
and may differ across installations. Use `find /data -name credentials.json` to locate it.
|
||||
- The `username` field in the credentials cache is Spotify's internal user ID (numeric for
|
||||
newer accounts, text for older ones), not necessarily the display name or email.
|
||||
- Spotify Family plan **owners** have account type `"premium"`. Family plan **members** also
|
||||
report as `"premium"` when their membership is active.
|
||||
- If the problem recurs, it may indicate that Music Assistant's Spotify provider re-caches
|
||||
the wrong credentials — check if multiple Spotify accounts are configured or if another
|
||||
user logged in via the Music Assistant UI.
|
||||
- The SSH addon on HA OS needs `sudo` for Docker socket access (`/run/docker.sock` is owned
|
||||
by `root:messagebus`).
|
||||
- The HA long-lived token typically does NOT have Supervisor API access (hassio endpoints
|
||||
return 401), so addon management must go through the Docker socket from the SSH addon.
|
||||
128
.claude/skills/archived/nextcloud-calendar/SKILL.md
Normal file
|
|
@ -0,0 +1,128 @@
|
|||
---
|
||||
name: nextcloud-calendar
|
||||
description: |
|
||||
Create, list, and query calendar events in Nextcloud via CalDAV. Use when:
|
||||
(1) User asks to create a calendar event, (2) User asks what's on their calendar,
|
||||
(3) User says "add to calendar" or "schedule", (4) User asks about upcoming events.
|
||||
Always use Nextcloud calendar unless user specifies otherwise.
|
||||
author: Claude Code
|
||||
version: 1.0.0
|
||||
date: 2025-01-25
|
||||
---
|
||||
|
||||
# Nextcloud Calendar Management
|
||||
|
||||
## Problem
|
||||
Need to create, query, or manage calendar events in the user's Nextcloud calendar.
|
||||
|
||||
## Context / Trigger Conditions
|
||||
- User asks to create/add a calendar event
|
||||
- User asks "what's on my calendar?" or similar
|
||||
- User mentions scheduling something
|
||||
- User says "remind me" with a date (create calendar event)
|
||||
- Default calendar is always Nextcloud unless otherwise specified
|
||||
|
||||
## Prerequisites
|
||||
- Python 3 with `caldav` and `icalendar` packages available (installed via PYTHONPATH or system packages)
|
||||
- Environment variables `NEXTCLOUD_USER` and `NEXTCLOUD_APP_PASSWORD` must be set
|
||||
|
||||
## Solution
|
||||
|
||||
### Script Location
|
||||
```
|
||||
.claude/calendar-query.py
|
||||
```
|
||||
|
||||
### Execution Pattern (CRITICAL)
|
||||
Run the script directly with python3 (env vars are set in the environment):
|
||||
|
||||
```bash
|
||||
python3 .claude/calendar-query.py [command] [options]
|
||||
```
|
||||
|
||||
### Available Commands
|
||||
|
||||
#### List Calendars
|
||||
```bash
|
||||
python3 .claude/calendar-query.py list
|
||||
```
|
||||
|
||||
#### Query Events
|
||||
```bash
|
||||
# Today's events
python3 .claude/calendar-query.py today

# Tomorrow's events
python3 .claude/calendar-query.py tomorrow

# This week
python3 .claude/calendar-query.py week

# This month
python3 .claude/calendar-query.py month

# Custom date range
python3 .claude/calendar-query.py events --days 14
python3 .claude/calendar-query.py events --date 2026-04-10

# From specific calendar
python3 .claude/calendar-query.py today --calendar "Work"
|
||||
```
|
||||
|
||||
#### Create Events
|
||||
```bash
|
||||
# All-day event (single day)
python3 .claude/calendar-query.py create --title "Doctor appointment" --start "2026-03-15" --all-day

# All-day event (multi-day) - end date is EXCLUSIVE
# For April 10-13, use end date April 14
python3 .claude/calendar-query.py create --title "Vacation" --start "2026-04-10" --end "2026-04-14" --all-day

# Timed event
python3 .claude/calendar-query.py create --title "Meeting" --start "2026-03-15 14:00" --end "2026-03-15 15:00"

# With location and description
python3 .claude/calendar-query.py create --title "Lunch" --start "tomorrow 12:00" --location "Cafe" --description "Team lunch"

# Relative dates work
python3 .claude/calendar-query.py create --title "Call" --start "today 16:00"
python3 .claude/calendar-query.py create --title "Review" --start "tomorrow 10:00"
|
||||
```
|
||||
|
||||
### Output Formats
|
||||
```bash
|
||||
# JSON output (for parsing)
python3 .claude/calendar-query.py today --json

# Text output (default, human-readable)
python3 .claude/calendar-query.py week
|
||||
```
|
||||
|
||||
## Complete Example
|
||||
|
||||
To create an event "Team offsite" from March 20-22, 2026:
|
||||
|
||||
```bash
|
||||
python3 .claude/calendar-query.py create --title "Team offsite" --start "2026-03-20" --end "2026-03-23" --all-day
|
||||
```
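
Because the end date of an all-day event is exclusive, the value passed to `--end` is the last day of the event plus one. A small sketch of that arithmetic (the helper name is hypothetical and not part of `calendar-query.py`):

```python
from datetime import date, timedelta


def all_day_end(last_day: date) -> str:
    """Return the exclusive --end value for an all-day event ending on last_day."""
    return (last_day + timedelta(days=1)).isoformat()


# Team offsite runs March 20-22, so --end is the day after March 22.
print(all_day_end(date(2026, 3, 22)))
```

Using `timedelta` rather than incrementing the day number by hand keeps month and year boundaries correct.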
|
||||
|
||||
## Important Notes
|
||||
|
||||
1. **End dates are exclusive** for all-day events (iCalendar `DTEND` semantics, which CalDAV uses). To create an event spanning April 10-13, set end to April 14.
|
||||
|
||||
2. **No delete/update commands** - The script currently only supports create and query. To modify events, user must do it manually in Nextcloud.
|
||||
|
||||
3. **Default calendar** is "Personal" - use `--calendar` flag for others.
|
||||
|
||||
## Verification
|
||||
- For queries: Output shows formatted event list
|
||||
- For creates: Output shows "Event created: [title]" with calendar name and start date
|
||||
- Exit code 0 = success, 1 = error (check output for details)
|
||||
|
||||
## Common Errors
|
||||
|
||||
| Error | Cause | Fix |
|
||||
|-------|-------|-----|
|
||||
| `NEXTCLOUD_USER and NEXTCLOUD_APP_PASSWORD must be set` | Env vars not set | Ensure `NEXTCLOUD_USER` and `NEXTCLOUD_APP_PASSWORD` are in the environment |
|
||||
| `Required packages not installed` | caldav/icalendar missing | Ensure PYTHONPATH includes the installed packages |
|
||||
| `Calendar 'X' not found` | Wrong calendar name | Run `list` command to see available calendars |
|
||||
132
.claude/skills/archived/nfsv4-idmapd-uid-mapping/SKILL.md
Normal file
|
|
@ -0,0 +1,132 @@
|
|||
---
|
||||
name: nfsv4-idmapd-uid-mapping
|
||||
description: |
|
||||
Fix for all file UIDs showing as 65534 (nobody) inside Kubernetes containers when using
|
||||
NFS volumes from TrueNAS/FreeBSD. Use when: (1) ls -lan inside a container shows all files
|
||||
owned by 65534:65534 despite correct ownership on the NFS server, (2) PostgreSQL fails with
|
||||
"data directory has wrong ownership", (3) chown inside containers returns "Invalid argument"
|
||||
on NFS volumes, (4) services that check file ownership (PostgreSQL, MySQL) crash on startup,
|
||||
(5) the same NFS mount shows correct UIDs on the host but 65534 inside containers,
|
||||
(6) NFSv4.2 appears in container mount output even though host mounts use NFSv3.
|
||||
Root cause: Kubernetes inline NFS volumes auto-negotiate NFSv4.2 (not NFSv3), and NFSv4
|
||||
idmapd fails to map UIDs when domains don't match or users don't exist on the server.
|
||||
author: Claude Code
|
||||
version: 1.0.0
|
||||
date: 2026-03-01
|
||||
---
|
||||
|
||||
# NFSv4 idmapd UID Mapping — All Files Show as nobody (65534)
|
||||
|
||||
## Problem
|
||||
All files on NFS volumes appear owned by UID 65534 (nobody:nogroup) inside Kubernetes
|
||||
containers, even though `ls -lan` on the NFS server shows the correct UIDs (e.g., 999, 472).
|
||||
This breaks any service that checks file ownership: PostgreSQL refuses to start ("data
|
||||
directory has wrong ownership"), MySQL's entrypoint `chown` fails with "Invalid argument",
|
||||
and any `chown` inside the container returns EINVAL.
|
||||
|
||||
## Context / Trigger Conditions
|
||||
|
||||
- TrueNAS CORE (FreeBSD) or TrueNAS SCALE as NFS server
|
||||
- NFSv4 enabled on the NFS server (`v4: true` in TrueNAS NFS config)
|
||||
- Kubernetes using inline NFS volumes (not PV/PVC with mount options)
|
||||
- **Key symptom**: `mount` inside the container shows `type nfs4 (vers=4.2,...)` even
|
||||
though existing kubelet mounts on the host show `vers=3`
|
||||
- **Key symptom**: Same NFS path mounted directly on the host shows correct UIDs, but
|
||||
inside any container shows 65534
|
||||
|
||||
## Root Cause
|
||||
|
||||
Kubernetes inline NFS volumes don't support `mountOptions`. When kubelet mounts NFS for a
|
||||
new pod, the Linux NFS client auto-negotiates the highest available version — NFSv4.2 if
|
||||
the server supports it.
|
||||
|
||||
NFSv4 uses **idmapd** for UID translation: the server translates UID→username (e.g.,
|
||||
`999→postgres@domain`), sends the username string over the wire, and the client translates
|
||||
it back to a local UID. This fails when:
|
||||
|
||||
1. **Domain mismatch**: Server domain (from hostname) differs from client domain
|
||||
- TrueNAS: `viktorbarzin.me` (from `truenas.viktorbarzin.me`)
|
||||
- K8s nodes: `viktorbarzin.lan` (from `k8s-node4.viktorbarzin.lan`)
|
||||
- When domains don't match, ALL UIDs fall back to `nobody` (65534)
|
||||
|
||||
2. **Unknown UIDs**: Even with matching domains, if the NFS server has no local user for
|
||||
UID 999 (common for container UIDs), idmapd maps it to `nobody`
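
The domain-mismatch case can be checked from the hostnames alone. The sketch below assumes, as described above, that each side derives its idmap domain from its fully qualified hostname by dropping the first label; the function names are hypothetical.

```python
def idmap_domain(fqdn: str) -> str:
    """Domain part of a fully qualified hostname (everything after the first dot)."""
    _host, _sep, domain = fqdn.partition(".")
    return domain.lower()


def domains_match(server_fqdn: str, client_fqdn: str) -> bool:
    """True if server and client would derive the same idmap domain."""
    return idmap_domain(server_fqdn) == idmap_domain(client_fqdn)


server = "truenas.viktorbarzin.me"
client = "k8s-node4.viktorbarzin.lan"
print(idmap_domain(server), idmap_domain(client), domains_match(server, client))
```

A `False` result here corresponds to the case where every UID falls back to `nobody`, unless the domain is set explicitly on one side.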
|
||||
|
||||
**Why existing mounts work**: Older kubelet mounts (established before NFSv4 was enabled,
|
||||
or when the NFS client defaulted to v3) continue using NFSv3 with direct numeric UID
|
||||
passthrough. Only NEW mounts negotiate NFSv4.2.
|
||||
|
||||
## Solution
|
||||
|
||||
**Fix on TrueNAS (no NFS restart required):**
|
||||
|
||||
```bash
|
||||
# 1. Enable NFSv3-style numeric UID passthrough for NFSv4
|
||||
midclt call nfs.update '{"v4_v3owner": true, "v4_domain": "viktorbarzin.lan"}'
|
||||
|
||||
# 2. Restart nfsuserd with the correct domain (NOT nfsd — that would crash the cluster)
|
||||
killall nfsuserd
|
||||
nfsuserd -domain viktorbarzin.lan -force
|
||||
```
|
||||
|
||||
**Clear caches on all K8s nodes:**
|
||||
|
||||
```bash
|
||||
for node in k8s-node1 k8s-node2 k8s-node3 k8s-node4; do
|
||||
ssh wizard@$node "sudo nfsidmap -c && sudo keyctl clear @u"
|
||||
done
|
||||
```
|
||||
|
||||
**Key settings explained:**
|
||||
- `v4_v3owner = true`: Makes NFSv4 use numeric UID passthrough like NFSv3, completely
|
||||
bypassing the username-based idmapd translation. **This is the critical fix.**
|
||||
- `v4_domain`: Should match the K8s nodes' DNS domain (check with `hostname -d` on a node)
|
||||
- `nfsuserd -domain <domain> -force`: FreeBSD daemon that handles NFSv4 user mapping.
|
||||
The `-force` flag is required if it thinks it's already running.
|
||||
|
||||
## Verification
|
||||
|
||||
```bash
|
||||
# Run a test pod and check UIDs
|
||||
kubectl run nfs-test --rm -it --restart=Never --image=alpine \
|
||||
--overrides='{"spec":{"containers":[{"name":"test","image":"alpine",
|
||||
"command":["sh","-c","ls -lan /data | head -5"],
|
||||
"volumeMounts":[{"name":"nfs","mountPath":"/data"}]}],
|
||||
"volumes":[{"name":"nfs","nfs":{"server":"10.0.10.15","path":"/mnt/main/some-path"}}]}}'
|
||||
|
||||
# Should show actual UIDs (e.g., 999, 472) instead of 65534
|
||||
```
|
||||
|
||||
## Debugging Steps
|
||||
|
||||
If you're not sure whether this is the issue:
|
||||
|
||||
```bash
|
||||
# 1. Check mount type INSIDE a container (not on the host!)
|
||||
kubectl exec <pod> -- mount | grep nfs
|
||||
# If it shows "type nfs4" with "vers=4.2" — this is the issue
|
||||
|
||||
# 2. Compare UIDs: host vs container
|
||||
# On host (via kubelet mount path):
|
||||
sudo ls -lan /var/lib/kubelet/pods/<pod-uid>/volumes/kubernetes.io~nfs/<vol>/
|
||||
# Inside container:
|
||||
kubectl exec <pod> -- ls -lan /mount-path/
|
||||
|
||||
# 3. Check TrueNAS NFS config
|
||||
midclt call nfs.config # Look for v4: true, v4_v3owner, v4_domain
|
||||
|
||||
# 4. Check nfsuserd is running with the right domain
|
||||
ps aux | grep nfsuserd # On TrueNAS
|
||||
```
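
When checking many pods, the output of step 1 can be parsed rather than read by eye. This is a minimal sketch that assumes the usual Linux `mount` line format (`<source> on <mountpoint> type <fstype> (<options>)`); the function name and the sample lines are illustrative.

```python
import re

_MOUNT_RE = re.compile(r"^\S+ on (\S+) type (nfs4?) \((.*)\)$")


def nfs_versions(mount_output: str) -> dict:
    """Map each NFS mount point to the protocol version found in its options."""
    versions = {}
    for line in mount_output.splitlines():
        match = _MOUNT_RE.match(line.strip())
        if not match:
            continue  # not an NFS mount
        mount_point, fstype, options = match.groups()
        vers = next(
            (opt.split("=", 1)[1] for opt in options.split(",") if opt.startswith("vers=")),
            None,
        )
        versions[mount_point] = vers or ("4" if fstype == "nfs4" else "unknown")
    return versions


sample = """\
overlay on / type overlay (rw,relatime)
10.0.10.15:/mnt/main/some-path on /data type nfs4 (rw,relatime,vers=4.2,rsize=131072)
10.0.10.15:/mnt/main/old-path on /old type nfs (rw,relatime,vers=3,proto=tcp)
"""
for mount_point, version in nfs_versions(sample).items():
    print(mount_point, version)
```

Any mount point reporting a version starting with `4` is subject to the idmapd translation described under Root Cause.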
|
||||
|
||||
## Notes
|
||||
|
||||
- **NEVER restart NFS (nfsd)** on TrueNAS — it causes mount failures across ALL pods
|
||||
cluster-wide. Only restart `nfsuserd` (the ID mapping daemon).
|
||||
- Existing NFSv3 mounts continue working fine. The issue only affects NEW mounts.
|
||||
- The `v4_v3owner` setting is persistent across TrueNAS reboots (stored in middleware config).
|
||||
- The `nfsuserd` restart is NOT persistent — TrueNAS may restart it without the `-domain`
|
||||
flag after a reboot. The `v4_domain` setting in the middleware config should handle this,
|
||||
but verify after any TrueNAS restart.
|
||||
- On Linux NFS servers (not FreeBSD/TrueNAS), the equivalent fix is setting `Domain` in
|
||||
`/etc/idmapd.conf` on both server and all clients.
|
||||
216
.claude/skills/archived/openclaw-k8s-deployment/SKILL.md
Normal file
|
|
@ -0,0 +1,216 @@
|
|||
---
|
||||
name: openclaw-k8s-deployment
|
||||
description: |
|
||||
Deploy and troubleshoot OpenClaw gateway on Kubernetes. Use when:
|
||||
(1) OpenClaw gateway won't start or shows "Telegram configured, not enabled yet",
|
||||
(2) exec fails with "requires a paired node (none available)",
|
||||
(3) gateway shows "Config invalid" for exec.host or exec.security values,
|
||||
(4) OpenClaw can't write files (EACCES on workspace or home),
|
||||
(5) gateway takes 5+ minutes to start (CPU throttling by VPA/LimitRange),
|
||||
(6) 502 Bad Gateway from Traefik after pod restart,
|
||||
(7) setting up Telegram bot channel,
|
||||
(8) configuring modelrelay sidecar for free model routing.
|
||||
Covers all non-obvious deployment gotchas discovered through trial and error.
|
||||
author: Claude Code
|
||||
version: 1.0.0
|
||||
date: 2026-03-01
|
||||
---
|
||||
|
||||
# OpenClaw Kubernetes Deployment
|
||||
|
||||
## Problem
|
||||
Deploying OpenClaw as a Kubernetes pod involves many non-obvious configuration
|
||||
requirements. The gateway process, Telegram integration, exec permissions, and
|
||||
file ownership all have specific constraints not documented together.
|
||||
|
||||
## Context / Trigger Conditions
|
||||
- Deploying OpenClaw from `ghcr.io/openclaw/openclaw` container image
|
||||
- Running in Kubernetes with NFS volumes, Traefik ingress, Goldilocks/VPA
|
||||
- Want Telegram bot integration, tool execution, and persistent state
|
||||
|
||||
## Solution
|
||||
|
||||
### 1. Gateway Configuration (openclaw.json)
|
||||
|
||||
**Required fields that aren't obvious:**
|
||||
|
||||
```json
|
||||
{
|
||||
"gateway": {
|
||||
"mode": "local",
|
||||
"bind": "lan",
|
||||
"controlUi": {
|
||||
"dangerouslyDisableDeviceAuth": true,
|
||||
"dangerouslyAllowHostHeaderOriginFallback": true
|
||||
}
|
||||
},
|
||||
"wizard": {
|
||||
"lastRunAt": "2026-03-01T00:00:00.000Z",
|
||||
"lastRunVersion": "2026.2.26",
|
||||
"lastRunCommand": "configure",
|
||||
"lastRunMode": "local"
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
- `gateway.mode = "local"` — **required** or gateway refuses to start
|
||||
- `dangerouslyAllowHostHeaderOriginFallback = true` — required in v2026.2.26+
|
||||
for non-loopback Control UI (error: "non-loopback Control UI requires
|
||||
gateway.controlUi.allowedOrigins")
|
||||
- `wizard` block — **required** for Telegram to start. Without it, gateway logs
|
||||
"Telegram configured, not enabled yet" on every startup. The wizard block
|
||||
signals that initial setup was completed.
|
||||
|
||||
### 2. Exec Configuration
|
||||
|
||||
Valid values for `tools.exec`:
|
||||
|
||||
| Field | Valid Values | Notes |
|
||||
|-------|-------------|-------|
|
||||
| `host` | `sandbox`, `gateway`, `node` | NOT "local" — that's invalid |
|
||||
| `security` | `deny`, `allowlist`, `full` | NOT "off" — that's invalid |
|
||||
| `ask` | `"off"` | Disables confirmation prompts |
|
||||
|
||||
- `host = "gateway"` — runs commands on the container host directly
|
||||
- `host = "node"` — requires a "paired node" companion app (doesn't work in containers)
|
||||
- `host = "sandbox"` — requires Docker-in-Docker
|
||||
- `security = "full"` — most permissive valid option
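
Since invalid values only surface as "Config invalid" at gateway startup, it can help to check them before deploying. The sketch below encodes the `host` and `security` values from the table above. It is an illustrative pre-flight check, not OpenClaw's own validation, and the names are hypothetical. `ask` is left unchecked because the table lists only one of its values.

```python
# Valid values taken from the table above; this is not OpenClaw's own schema.
VALID_EXEC = {
    "host": {"sandbox", "gateway", "node"},
    "security": {"deny", "allowlist", "full"},
}


def exec_config_errors(exec_cfg: dict) -> list:
    """Return a message for every tools.exec field with an unlisted value."""
    errors = []
    for field, allowed in VALID_EXEC.items():
        value = exec_cfg.get(field)
        if value is not None and value not in allowed:
            errors.append(
                f"tools.exec.{field}: {value!r} is not one of {sorted(allowed)}"
            )
    return errors


for problem in exec_config_errors({"host": "local", "security": "off", "ask": "off"}):
    print(problem)
```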
|
||||
|
||||
### 3. Sandbox Mode
|
||||
|
||||
```json
|
||||
{
|
||||
"agents": {
|
||||
"defaults": {
|
||||
"sandbox": { "mode": "off" },
|
||||
"workspace": "/workspace/infra"
|
||||
}
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
- `sandbox.mode = "off"` disables Docker sandboxing
|
||||
- `workspace` must be set explicitly — defaults to `~/.openclaw/workspace`
|
||||
|
||||
### 4. File Permissions
|
||||
|
||||
The init container runs as root but the main container runs as `node` (UID 1000).
|
||||
|
||||
**Must chown in init container:**
|
||||
```sh
|
||||
chown -R 1000:1000 /workspace/infra
|
||||
chown -R 1000:1000 /openclaw-home
|
||||
chmod 700 /openclaw-home
|
||||
```
|
||||
|
||||
**Must create directories:**
|
||||
```sh
|
||||
mkdir -p /openclaw-home/agents/main/sessions \
|
||||
/openclaw-home/credentials \
|
||||
/openclaw-home/canvas \
|
||||
/openclaw-home/devices \
|
||||
/openclaw-home/cron
|
||||
```
|
||||
|
||||
Without these: `EACCES: permission denied` errors for AGENTS.md, canvas,
|
||||
cron/jobs.json, devices, and other runtime files.
|
||||
|
||||
### 5. Startup Command
|
||||
|
||||
```sh
|
||||
node openclaw.mjs doctor --fix 2>/dev/null; exec node openclaw.mjs gateway --allow-unconfigured --bind lan
|
||||
```
|
||||
|
||||
Run `doctor --fix` before the gateway to auto-enable Telegram and fix
|
||||
config issues. Without this, Telegram stays "not enabled yet".
|
||||
|
||||
### 6. Resource Requirements
|
||||
|
||||
- **CPU limit: 2 cores minimum** — the Node.js gateway startup is CPU-intensive.
|
||||
With 150-300m CPU, startup takes 5+ minutes.
|
||||
- **Memory limit: 2Gi minimum** — the gateway OOM-kills at 1Gi during startup
|
||||
(V8 heap exhaustion).
|
||||
- **Goldilocks VPA will override these**: see the Goldilocks VPA item under "Notes" below.
|
||||
|
||||
### 7. Readiness Probe
|
||||
|
||||
```hcl
|
||||
readiness_probe {
|
||||
tcp_socket { port = 18789 }
|
||||
initial_delay_seconds = 30
|
||||
period_seconds = 10
|
||||
}
|
||||
```
|
||||
|
||||
Do NOT use a startup probe: the gateway can take 2-3 minutes to start listening, and a startup probe whose failure threshold runs out before then makes the kubelet restart the container. A readiness-only probe keeps Traefik from routing to the pod during startup (no 502s) without killing the container.
|
||||
|
||||
### 8. Telegram Integration
|
||||
|
||||
```json
|
||||
{
|
||||
"channels": {
|
||||
"telegram": {
|
||||
"enabled": true,
|
||||
"botToken": "...",
|
||||
"dmPolicy": "allowlist",
|
||||
"allowFrom": ["tg:USER_ID"],
|
||||
"groupPolicy": "allowlist",
|
||||
"streamMode": "partial"
|
||||
}
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
Telegram won't start without:
|
||||
1. The `wizard` block in config (signals setup was run)
|
||||
2. `doctor --fix` at startup (auto-enables the channel)
|
||||
3. Both `groupPolicy` and `streamMode` fields
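
The config-level prerequisites in this list can be checked mechanically. This is a rough sketch under the assumption that the config has the shape shown in this document; `telegram_prerequisites_missing` is a hypothetical name, and the `doctor --fix` step cannot be verified this way because it happens at startup.

```python
def telegram_prerequisites_missing(config: dict) -> list[str]:
    """List the config-level Telegram prerequisites that are absent."""
    missing = []
    # The wizard block signals that initial setup was completed.
    if "wizard" not in config:
        missing.append("wizard")
    telegram = config.get("channels", {}).get("telegram", {})
    # Both policy fields must be present for the channel to start.
    for field in ("groupPolicy", "streamMode"):
        if field not in telegram:
            missing.append(f"channels.telegram.{field}")
    return missing


if __name__ == "__main__":
    incomplete = {"channels": {"telegram": {"enabled": True, "dmPolicy": "allowlist"}}}
    print(telegram_prerequisites_missing(incomplete))
```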
|
||||
|
||||
### 9. NFS Volume Strategy
|
||||
|
||||
| Volume | Purpose | Type |
|--------|---------|------|
| `/home/node/.openclaw` | Persistent state (SOUL.md, sessions, memory, telegram) | NFS |
| `/tools` | Cached binaries (kubectl, terraform, terragrunt, python libs) | NFS |
| `/workspace` | Infra repo clone | NFS |
| `/data` | General data | NFS |
|
||||
|
||||
Using NFS for tools cache reduces restart time from ~2.5min to ~38s by skipping
|
||||
binary downloads and pip installs on subsequent starts.
|
||||
|
||||
### 10. ModelRelay Sidecar
|
||||
|
||||
Deploy as a sidecar container for automatic free model routing:
|
||||
|
||||
```hcl
|
||||
container {
|
||||
name = "modelrelay"
|
||||
image = "node:22-alpine"
|
||||
command = ["sh", "-c", "npm install -g modelrelay; exec modelrelay --port 7352"]
|
||||
env { name = "NVIDIA_API_KEY"; value = "..." }
|
||||
env { name = "OPENROUTER_API_KEY"; value = "..." }
|
||||
}
|
||||
```
|
||||
|
||||
Configure as provider: `baseUrl = "http://127.0.0.1:7352/v1"`, model `auto-fastest`.
|
||||
|
||||
## Verification
|
||||
1. `kubectl logs -c openclaw` should show `[gateway] listening on ws://0.0.0.0:18789`
|
||||
2. No "Telegram configured, not enabled yet" message
|
||||
3. No `EACCES` permission errors
|
||||
4. `kubectl exec ... -- cat /proc/net/tcp` shows a listening socket for the gateway (port 18789 appears as `:4965` in hex, with state `0A` for LISTEN)
|
||||
5. Telegram bot responds to `/start`
|
||||
|
||||
## Notes
|
||||
- ConfigMap changes require pod restart (init container copies config at start)
|
||||
- ConfigMap taint+reinit sometimes needed when Terraform state gets out of sync
|
||||
- Goldilocks VPA recreates itself from namespace labels — must delete VPA on
|
||||
every pod recreation if namespace has `goldilocks.fairwinds.com/vpa-update-mode`
|
||||
- The `--allow-unconfigured` flag is needed for the gateway command
|
||||
- v2026.2.26 introduced breaking change requiring `dangerouslyAllowHostHeaderOriginFallback`
|
||||
|
||||
## See also
|
||||
- `openclaw-custom-model-provider` — basic model provider configuration
|
||||
- `k8s-limitrange-oom-silent-kill` — LimitRange causing OOM (related but different)
|
||||
|
|
@ -0,0 +1,169 @@
|
|||
---
|
||||
name: pfsense-dnsmasq-interface-binding
|
||||
description: |
|
||||
Restrict pfSense dnsmasq (DNS Forwarder) to specific interfaces to free port 53 on
|
||||
other interfaces for port forwarding. Use when: (1) pfSense blocks port 53 NAT port
|
||||
forward because dnsmasq is listening on *:53, (2) need to forward DNS from WAN to an
|
||||
internal DNS server while preserving client source IPs, (3) dnsmasq shows *:53 in
|
||||
sockstat despite --listen-address flags, (4) pfSense loses DNS resolution after
|
||||
restricting dnsmasq interfaces, (5) NAT rdr rules for port 53 silently fail to
|
||||
generate in /tmp/rules.debug.
|
||||
author: Claude Code
|
||||
version: 1.0.0
|
||||
date: 2026-02-17
|
||||
---
|
||||
|
||||
# pfSense dnsmasq Interface Binding for DNS Port Forwarding
|
||||
|
||||
## Problem
|
||||
pfSense's dnsmasq (DNS Forwarder) binds to `*:53` by default. This prevents creating
|
||||
NAT port forward rules for port 53 — pfSense silently skips generating the pf `rdr`
|
||||
directive. You need to restrict dnsmasq to specific interfaces to free port 53 on other
|
||||
interfaces (e.g., WAN) for forwarding to an internal DNS server.
|
||||
|
||||
## Context / Trigger Conditions
|
||||
- Attempting to create a NAT port forward for port 53 on the WAN interface
|
||||
- Port forward rule saves to config.xml but `pfctl -sn` shows no corresponding `rdr` rule
|
||||
- `sockstat -4 | grep ":53"` shows `dnsmasq` on `*:53`
|
||||
- Goal: Forward DNS queries from one network to an internal DNS server (e.g., Technitium)
|
||||
while preserving client source IPs (no masquerading)
|
||||
|
||||
## Solution
|
||||
|
||||
### Step 1: Bind dnsmasq to specific interfaces
|
||||
|
||||
Set the interface field in pfSense's dnsmasq config:
|
||||
|
||||
```php
|
||||
ssh admin@10.0.20.1 'php -r '"'"'
|
||||
require_once("config.inc");
|
||||
require_once("service-utils.inc");
|
||||
global $config;
|
||||
$config = parse_config(true);
|
||||
$config["dnsmasq"]["interface"] = "lan,opt1"; // Only LAN and OPT1, NOT wan
|
||||
write_config("Bind dnsmasq to LAN and OPT1 only");
|
||||
'"'"''
|
||||
```
|
||||
|
||||
This adds `--listen-address=<IP>` flags to dnsmasq but does NOT change socket binding.
|
||||
|
||||
### Step 2: Add bind-dynamic (CRITICAL)
|
||||
|
||||
Without `bind-dynamic`, dnsmasq still binds the socket to `*:53` even with
|
||||
`--listen-address` flags. The `--listen-address` only controls which queries get
|
||||
responses, not the actual socket binding.
|
||||
|
||||
```php
|
||||
ssh admin@10.0.20.1 'php -r '"'"'
|
||||
require_once("config.inc");
|
||||
require_once("service-utils.inc");
|
||||
global $config;
|
||||
$config = parse_config(true);
|
||||
$existing = base64_decode($config["dnsmasq"]["custom_options"]);
|
||||
if (strpos($existing, "bind-dynamic") === false) {
|
||||
$existing = "bind-dynamic\n" . $existing;
|
||||
$config["dnsmasq"]["custom_options"] = base64_encode($existing);
|
||||
write_config("Add bind-dynamic to restrict dnsmasq socket binding");
|
||||
}
|
||||
'"'"''
|
||||
```
|
||||
|
||||
### Step 3: Add localhost listen address (CRITICAL)
|
||||
|
||||
pfSense's own `resolv.conf` points to `127.0.0.1`. Without this, pfSense itself
|
||||
loses DNS resolution after the interface restriction.
|
||||
|
||||
```php
|
||||
# Add to custom_options (base64-encoded in config):
|
||||
listen-address=127.0.0.1
|
||||
```
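
Steps 2 and 3 both edit the same base64-encoded `custom_options` value, so it can help to see the transformation in isolation. The sketch below only shows the decode, modify, re-encode round trip. `add_required_options` is a hypothetical helper, and reading the value from or writing it back to pfSense's config is deliberately left out.

```python
import base64

# Options this document says must be present in dnsmasq custom_options.
REQUIRED_OPTIONS = ("bind-dynamic", "listen-address=127.0.0.1")


def add_required_options(encoded: str) -> str:
    """Return custom_options (base64) with any missing required lines prepended."""
    lines = base64.b64decode(encoded).decode("utf-8").splitlines()
    # Compare whole lines so an option is never added twice.
    missing = [opt for opt in REQUIRED_OPTIONS if opt not in lines]
    updated = "\n".join(missing + lines)
    return base64.b64encode(updated.encode("utf-8")).decode("ascii")


if __name__ == "__main__":
    result = add_required_options("")  # start from empty custom options
    print(base64.b64decode(result).decode("utf-8"))
```

The function is idempotent: running it on its own output changes nothing, which mirrors the `strpos` guard in the PHP from Step 2.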
|
||||
|
||||
### Step 4: Restart dnsmasq
|
||||
|
||||
```php
|
||||
services_dnsmasq_configure();
|
||||
```
|
||||
|
||||
### Step 5: Verify binding
|
||||
|
||||
```bash
|
||||
sockstat -4 | grep ":53 "
|
||||
# Should show specific IPs, not *:53:
|
||||
# 127.0.0.1:53
|
||||
# 10.0.10.1:53 (lan)
|
||||
# 10.0.20.1:53 (opt1)
|
||||
# NOT 192.168.1.2:53 (wan)
|
||||
```
|
||||
|
||||
### Step 6: Add the port forward rule
|
||||
|
||||
**Critical format note**: The `source` field must use `array("any" => "")`, NOT
|
||||
`array("network" => "192.168.1.0/24")`. The CIDR source format silently fails to
|
||||
generate the pf `rdr` directive.
|
||||
|
||||
```php
|
||||
ssh admin@10.0.20.1 'php -r '"'"'
|
||||
require_once("config.inc");
|
||||
require_once("filter.inc");
|
||||
require_once("shaper.inc");
|
||||
global $config;
|
||||
$config = parse_config(true);
|
||||
|
||||
$rule = array(
|
||||
"source" => array("any" => ""), // MUST be "any", not CIDR
|
||||
"destination" => array(
|
||||
"network" => "wanip",
|
||||
"port" => "53"
|
||||
),
|
||||
"ipprotocol" => "inet",
|
||||
"protocol" => "udp",
|
||||
"target" => "10.0.20.204", // Internal DNS server
|
||||
"local-port" => "53",
|
||||
"interface" => "wan",
|
||||
"associated-rule-id" => "pass",
|
||||
"descr" => "DNS to internal DNS (preserve client IP)",
|
||||
"created" => array("time" => (string)time(), "username" => "admin"),
|
||||
"updated" => array("time" => (string)time(), "username" => "admin")
|
||||
);
|
||||
array_unshift($config["nat"]["rule"], $rule);
|
||||
write_config("Add DNS port forward");
|
||||
filter_configure();
|
||||
'"'"''
|
||||
```
|
||||
|
||||
### Step 7: Verify the redirect rule
|
||||
|
||||
```bash
|
||||
pfctl -sn | grep "domain\|:53"
|
||||
# Should show: rdr pass on vtnet0 inet proto udp from any to 192.168.1.2 port = domain -> 10.0.20.204
|
||||
```
|
||||
|
||||
## Verification
|
||||
|
||||
1. pfSense own DNS: `nslookup google.com 127.0.0.1` (from pfSense shell)
|
||||
2. Internal DNS: `nslookup google.com 10.0.20.1` (from LAN/OPT1 clients)
|
||||
3. Port forward: `dig @192.168.1.2 example.com` (from WAN-side client)
|
||||
4. Client IP: Check DNS server logs — should show real client IP, not pfSense IP
|
||||
|
||||
## Pitfalls
|
||||
|
||||
| Pitfall | Symptom | Fix |
|---------|---------|-----|
| Missing `bind-dynamic` | sockstat shows `*:53`, port forward still blocked | Add `bind-dynamic` to custom_options |
| Missing `listen-address=127.0.0.1` | pfSense loses all DNS resolution | Add to custom_options |
| Source `"network" => "CIDR"` in NAT rule | Rule saves to config but no `rdr` in `pfctl -sn` | Use `"any" => ""` instead |
| Using local `$config` variable | Config not persisted after PHP exit | Always use `global $config` |
| Not calling `filter_configure()` | Rule in config.xml but not in pf | Call after `write_config()` |
| Custom options not base64 | dnsmasq fails to start | pfSense stores custom_options as base64 |
|
||||
|
||||
## Notes
|
||||
- `bind-dynamic` is preferred over `bind-interfaces` because it handles interfaces that
|
||||
come up after dnsmasq starts (e.g., VPN tunnels)
|
||||
- The pf `rdr` rule is a redirect, not masquerade — source IP is preserved
|
||||
- dnsmasq custom_options in pfSense config.xml are base64-encoded
|
||||
- Check `/tmp/rules.debug` for the generated pf ruleset (before loading into pf)
|
||||
- Use `pfctl -sn` to see rules actually loaded in the running firewall
|
||||
|
||||
## See also
|
||||
- `pfsense` — General pfSense management skill
|
||||
- `k8s-ndots-search-domain-nxdomain-flood` — Related DNS optimization
|
||||
105
.claude/skills/archived/pfsense-nat-rule-creation/SKILL.md
Normal file
|
|
@ -0,0 +1,105 @@
|
|||
---
|
||||
name: pfsense-nat-rule-creation
|
||||
description: |
|
||||
Create NAT port forward rules on pfSense programmatically via PHP/SSH.
|
||||
Use when: (1) adding port forwards for new K8s services, (2) NAT rules
|
||||
added via PHP don't appear in pfctl output, (3) config_read_array() throws
|
||||
"undefined function" error, (4) destination "wanip" not working in NAT rules,
|
||||
(5) rules saved to config.xml but not loaded into pfctl. Covers the correct
|
||||
PHP array structure, config API differences between pfSense versions, and
|
||||
the required pfctl reload step.
|
||||
author: Claude Code
|
||||
version: 1.0.0
|
||||
date: 2026-02-21
|
||||
---
|
||||
|
||||
# pfSense NAT Rule Creation via PHP
|
||||
|
||||
## Problem
|
||||
Creating NAT port forward rules on pfSense programmatically via SSH/PHP has
|
||||
multiple gotchas around the config API, rule structure, and rule loading.
|
||||
|
||||
## Context / Trigger Conditions
|
||||
- Adding a port forward for a new Kubernetes service (e.g., TURN, game server)
|
||||
- Using `ssh admin@10.0.20.1` + PHP to automate pfSense config
|
||||
- NAT rules don't appear in `pfctl -sn` after `write_config()` + `filter_configure()`
|
||||
- `config_read_array()` throws "Call to undefined function"
|
||||
- Rules saved to config.xml but pfctl doesn't have them
|
||||
|
||||
## Solution
|
||||
|
||||
### Correct PHP for adding NAT rules
|
||||
|
||||
```php
|
||||
<?php
|
||||
require_once("config.inc");
|
||||
require_once("filter.inc");
|
||||
global $config; // NOT config_read_array() — that doesn't exist in pfSense 2.7.x
|
||||
|
||||
$config["nat"]["rule"][] = array(
|
||||
"interface" => "wan",
|
||||
"ipprotocol" => "inet", // Required! Must be "inet" for IPv4
|
||||
"protocol" => "tcp/udp", // Or "udp" or "tcp"
|
||||
"source" => array("any" => ""),
|
||||
"destination" => array(
|
||||
"network" => "wanip", // Use "network" => "wanip", NOT "address" => "wanip"
|
||||
"port" => "3478" // Single port or "start:end" for range
|
||||
),
|
||||
"target" => "10.0.20.200", // Internal destination IP
|
||||
"local-port" => "3478", // Internal port (for ranges, just the start port)
|
||||
"descr" => "My port forward",
|
||||
"associated-rule-id" => "pass" // Auto-create firewall pass rule
|
||||
);
|
||||
|
||||
write_config("Description for config history");
|
||||
filter_configure();
|
||||
```
|
||||
|
||||
### Key gotchas
|
||||
|
||||
1. **`config_read_array()` doesn't exist** in pfSense 2.7.x. Use `global $config` instead.
|
||||
|
||||
2. **Destination format**: Use `"network" => "wanip"`, NOT `"address" => "wanip"` or `"address" => "192.168.1.2"`. The `"network"` key with `"wanip"` tells pfSense to resolve the WAN IP dynamically.
|
||||
|
||||
3. **`ipprotocol` is required**: Must include `"ipprotocol" => "inet"` or rules won't generate in `/tmp/rules.debug`.
|
||||
|
||||
4. **Port ranges**: Use `"port" => "49152:49252"` for ranges. The `"local-port"` should be just the start port — pfSense maps the range automatically.
|
||||
|
||||
5. **Rules may not load immediately**: After `write_config()` + `filter_configure()`, rules appear in `/tmp/rules.debug` but may not be in pfctl until the next filter reload. Force with:
|
||||
```bash
|
||||
pfctl -f /tmp/rules.debug
|
||||
```
|
||||
|
||||
6. **SSH quoting**: The pfsense.py `php` command breaks on `\n` in strings. For multi-line PHP, write a `.php` file, `scp` it, and execute:
|
||||
```bash
|
||||
scp script.php admin@10.0.20.1:/tmp/
|
||||
ssh admin@10.0.20.1 "php /tmp/script.php"
|
||||
```
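
Gotchas 2 and 3 are structural, so a rule can be linted before it is sent to pfSense. The following is a sketch only: it assumes the rule is first assembled as a Python dict mirroring the PHP array above, and `nat_rule_problems` is a hypothetical helper rather than anything pfSense provides.

```python
def nat_rule_problems(rule: dict) -> list[str]:
    """Check a NAT rule dict against the structural gotchas listed above."""
    problems = []
    # Gotcha 3: without ipprotocol the rule is not generated.
    if "ipprotocol" not in rule:
        problems.append('missing "ipprotocol" (use "inet" for IPv4)')
    # Gotcha 2: the destination must use the "network" key, not "address".
    if "address" in rule.get("destination", {}):
        problems.append('destination uses "address"; use "network": "wanip"')
    return problems


if __name__ == "__main__":
    rule = {
        "interface": "wan",
        "protocol": "tcp/udp",
        "destination": {"address": "wanip", "port": "3478"},
    }
    for problem in nat_rule_problems(rule):
        print(problem)
```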
|
||||
|
||||
### Execution via pfsense.py
|
||||
|
||||
For simple single-line PHP (no newlines or backslashes):
|
||||
```bash
|
||||
python3 .claude/pfsense.py php 'require_once("config.inc"); ...; echo "Done";'
|
||||
```
|
||||
|
||||
For complex scripts, use scp + ssh as above.
|
||||
|
||||
## Verification
|
||||
|
||||
```bash
|
||||
# Check rules in config
|
||||
ssh admin@10.0.20.1 "grep 'YOUR_PORT' /cf/conf/config.xml"
|
||||
|
||||
# Check generated pf rules
|
||||
ssh admin@10.0.20.1 "grep 'YOUR_PORT' /tmp/rules.debug"
|
||||
|
||||
# Check active pfctl rules
|
||||
python3 .claude/pfsense.py pfctl "-sn" | grep YOUR_PORT
|
||||
```
|
||||
|
||||
## Notes
|
||||
- Existing working NAT rules on this pfSense use the same structure (check WireGuard port 51820 as reference)
|
||||
- The `associated-rule-id: pass` auto-creates a WAN firewall rule to allow the forwarded traffic
|
||||
- pfSense applies NAT rules across ALL interfaces when using the web UI, but PHP-created rules only apply to the specified interface
|
||||
- See also: `pfsense` skill for general pfSense management
|
||||
|
|
@ -0,0 +1,136 @@
|
|||
---
|
||||
name: proxmox-vm-disk-expansion-pitfalls
|
||||
description: |
|
||||
Troubleshoot common failures when expanding Proxmox VM disks on Ubuntu 24.04
|
||||
cloud-init images and draining Kubernetes nodes. Use when: (1) growpart fails
|
||||
with "command not found" on Ubuntu cloud-init VMs, (2) grep -P fails on macOS
|
||||
with "invalid option -- P", (3) kubectl drain times out with pods stuck
|
||||
terminating, (4) filesystem shows old size after qm resize. Covers
|
||||
cloud-guest-utils installation, macOS-portable regex parsing, drain timeout
|
||||
tuning, and recovery from partial failures.
|
||||
author: Claude Code
|
||||
version: 1.0.0
|
||||
date: 2026-02-13
|
||||
---
|
||||
|
||||
# Proxmox VM Disk Expansion Pitfalls
|
||||
|
||||
## Problem
|
||||
|
||||
Expanding disk storage on Proxmox-hosted Ubuntu 24.04 cloud-init VMs (used as
|
||||
Kubernetes nodes) fails at multiple points due to missing tools, cross-platform
|
||||
incompatibilities, and Kubernetes drain timeouts.
|
||||
|
||||
## Context / Trigger Conditions
|
||||
|
||||
- Running disk expansion scripts from macOS against Proxmox + Ubuntu VMs
|
||||
- Ubuntu 24.04 cloud-init images (the default k8s node template)
|
||||
- Kubernetes nodes with many pods or stateful workloads
|
||||
- Using `scripts/extend_vm_storage.sh` or similar automation
|
||||
|
||||
## Issues and Solutions
|
||||
|
||||
### 1. `growpart: command not found` on Ubuntu 24.04
|
||||
|
||||
**Symptom**: After `qm resize`, SSH into VM, run `growpart /dev/sda 1` — fails
|
||||
with "command not found". `resize2fs` then reports "Nothing to do!" because the
|
||||
partition table hasn't been updated.
|
||||
|
||||
**Root cause**: Ubuntu 24.04 cloud-init images don't include `cloud-guest-utils`
|
||||
by default. The `growpart` tool (which updates the partition table to use new
|
||||
disk space) is in this package.
|
||||
|
||||
**Fix**:
|
||||
```bash
|
||||
sudo apt-get update -qq && sudo apt-get install -y -qq cloud-guest-utils
|
||||
sudo growpart /dev/sda 1
|
||||
sudo resize2fs /dev/sda1
|
||||
```
|
||||
|
||||
**Prevention**: Check for `growpart` before attempting partition expansion:
|
||||
```bash
|
||||
if ! command -v growpart &>/dev/null; then
|
||||
sudo apt-get update -qq && sudo apt-get install -y -qq cloud-guest-utils
|
||||
fi
|
||||
```
|
||||
|
||||
### 2. `grep -P` (PCRE) not available on macOS
|
||||
|
||||
**Symptom**: Script running on macOS fails with `grep: invalid option -- P`.
|
||||
|
||||
**Root cause**: macOS ships BSD grep, which doesn't support `-P` (Perl-compatible
|
||||
regex). GNU grep (from Homebrew) does, but scripts shouldn't assume it's installed.
|
||||
|
||||
**Fix**: Replace `grep -oP 'pattern\Kcapture'` with portable `sed`:
|
||||
```bash
|
||||
# BAD (GNU grep only):
|
||||
CURRENT_SIZE=$(echo "$LINE" | grep -oP 'size=\K[0-9]+G')
|
||||
|
||||
# GOOD (portable):
|
||||
CURRENT_SIZE=$(echo "$LINE" | sed -n 's/.*size=\([0-9]*G\).*/\1/p')
|
||||
```
|
||||
|
||||
**General rule**: In scripts that may run on macOS, avoid `grep -P`, and watch for BSD/GNU differences in `sed -i` (BSD sed requires an explicit backup suffix, as in `sed -i ''`) and in `date` flags. Use `sed` with basic regex or the bash built-in `[[ =~ ]]` for pattern matching.
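
Where Python is available on the machine running the script, the same extraction can sidestep grep and sed differences entirely. This is a minimal sketch, not what `scripts/extend_vm_storage.sh` does; `parse_disk_size` is a hypothetical name and the sample line is only an illustration of a disk config line containing `size=`.

```python
import re
from typing import Optional

# Same capture as the sed expression above: digits followed by "G" after "size=".
SIZE_PATTERN = re.compile(r"size=([0-9]+G)")


def parse_disk_size(line: str) -> Optional[str]:
    """Return the size token (for example "64G") from a config line, or None."""
    match = SIZE_PATTERN.search(line)
    return match.group(1) if match else None


if __name__ == "__main__":
    # Illustrative input only; the real line comes from the Proxmox VM config.
    print(parse_disk_size("scsi0: local-lvm:vm-201-disk-0,size=64G"))
```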
|
||||
|
||||
### 3. `kubectl drain` timeout with stuck pods
|
||||
|
||||
**Symptom**: `kubectl drain --timeout=120s` fails with "context deadline exceeded"
|
||||
for multiple pods. Pods are evicted but don't terminate in time.
|
||||
|
||||
**Root cause**: Some pods (stateful services like ClickHouse, Paperless-ngx,
|
||||
OnlyOffice) need more time to shut down gracefully. 120s isn't enough when many
|
||||
pods are draining simultaneously.
|
||||
|
||||
**Fix**: Use `--force` flag and a longer timeout, or retry:
|
||||
```bash
|
||||
# First attempt with standard timeout
|
||||
kubectl drain <node> --ignore-daemonsets --delete-emptydir-data --timeout=120s
|
||||
|
||||
# If it fails, force with longer timeout (pods already evicting)
|
||||
kubectl drain <node> --ignore-daemonsets --delete-emptydir-data --timeout=300s --force
|
||||
```
|
||||
|
||||
**Note**: After a failed drain, the node is already cordoned. A second drain
|
||||
attempt only needs to wait for already-evicting pods to finish.
|
||||
|
||||
### 4. Recovery from partial failure
|
||||
|
||||
If the script fails mid-way (after drain but before uncordon):
|
||||
|
||||
```bash
|
||||
# Check VM status
|
||||
ssh root@192.168.1.127 "qm status <vmid>"
|
||||
|
||||
# Start VM if stopped
|
||||
ssh root@192.168.1.127 "qm start <vmid>"
|
||||
|
||||
# Uncordon node
|
||||
kubectl --kubeconfig $(pwd)/config uncordon <node-name>
|
||||
```
|
||||
|
||||
## Verification
|
||||
|
||||
After successful expansion:
|
||||
```bash
|
||||
# On the VM
|
||||
df -h /
|
||||
# Should show new size (128G disk → ~126G usable for ext4)
|
||||
|
||||
# On the cluster
|
||||
kubectl get node <name>
|
||||
# Should show Ready status
|
||||
```
|
||||
|
||||
## Notes
|
||||
|
||||
- The k8s node VMs use direct partition layout (`/dev/sda1`), not LVM, despite
|
||||
the script handling both paths
|
||||
- `growpart` returns exit code 1 for "NOCHANGE" (partition already at max) —
|
||||
this is not an error
|
||||
- Proxmox `qm resize` uses `scsi0` as the disk identifier for these VMs
|
||||
- SSH host keys may change if VMs are recreated or network changes — use
|
||||
`-o StrictHostKeyChecking=no` in automated scripts
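
The `growpart` exit-code note above matters for automation, because a naive "non-zero means failure" check aborts on partitions that are already at full size. A sketch of the decision logic, assuming the behavior described in that note (exit code 1 accompanied by a `NOCHANGE` message); `growpart_ok` and `grow_partition` are hypothetical helper names.

```python
import subprocess


def growpart_ok(returncode: int, output: str) -> bool:
    """Treat exit 0 as success and exit 1 with NOCHANGE as already at max size."""
    if returncode == 0:
        return True
    return returncode == 1 and "NOCHANGE" in output


def grow_partition(device: str, partition: int) -> bool:
    """Run growpart and report whether the partition is now at full size."""
    result = subprocess.run(
        ["growpart", device, str(partition)],
        capture_output=True,
        text=True,
    )
    # growpart may report on either stream, so check both.
    return growpart_ok(result.returncode, result.stdout + result.stderr)
```

On the node itself this would be called as `grow_partition("/dev/sda", 1)`, followed by `resize2fs` only when it returns `True`.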
|
||||
|
||||
See also: `extend-vm-storage.md` (the operational skill for running the script)
|
||||
182
.claude/skills/archived/python-filename-sanitization/SKILL.md
Normal file
|
|
@ -0,0 +1,182 @@
|
|||
---
|
||||
name: python-filename-sanitization
|
||||
description: |
|
||||
Secure filename sanitization pattern for Python web applications. Use when:
|
||||
(1) Accepting user-provided filenames for file operations, (2) Building file
|
||||
rename/upload functionality, (3) Preventing path traversal attacks (../../../etc/passwd),
|
||||
(4) Preventing shell injection through filenames, (5) FastAPI/Flask file handling.
|
||||
Provides regex-based whitelist approach with pathlib for safe file operations.
|
||||
author: Claude Code
|
||||
version: 1.0.0
|
||||
date: 2025-01-31
|
||||
---
|
||||
|
||||
# Python Filename Sanitization
|
||||
|
||||
## Problem
|
||||
User-provided filenames can contain malicious characters that enable path traversal
|
||||
attacks, shell injection, or filesystem corruption. Direct use of user input in
|
||||
file paths is a security vulnerability.
|
||||
|
||||
## Context / Trigger Conditions
|
||||
- Building file upload, rename, or download functionality
|
||||
- User can specify filenames via API or form input
|
||||
- Files are stored on server filesystem
|
||||
- Need to prevent: `../`, shell metacharacters, null bytes, etc.
|
||||
|
||||
## Solution
|
||||
|
||||
### Complete Sanitization Function
|
||||
```python
import re
from pathlib import Path


def sanitize_filename(filename: str, max_length: int = 200) -> str:
    """
    Sanitize a filename to prevent path traversal and shell injection.
    Only allows alphanumeric characters, spaces, hyphens, underscores,
    parentheses, and dots.
    """
    if not filename:
        raise ValueError("Filename cannot be empty")

    # Remove any path components (prevent path traversal)
    filename = Path(filename).name

    # Only allow safe characters: alphanumeric, space, hyphen, underscore, parentheses, dot
    # This regex removes anything that isn't in the allowed set
    safe_filename = re.sub(r'[^a-zA-Z0-9\s\-_().]', '', filename)

    # Collapse multiple spaces/dots
    safe_filename = re.sub(r'\s+', ' ', safe_filename)
    safe_filename = re.sub(r'\.+', '.', safe_filename)

    # Strip leading/trailing whitespace and dots
    safe_filename = safe_filename.strip(' .')

    # Limit length
    if len(safe_filename) > max_length:
        safe_filename = safe_filename[:max_length]

    if not safe_filename:
        raise ValueError("Filename contains no valid characters")

    return safe_filename
```
|
||||
|
||||
### FastAPI Integration Example
|
||||
```python
from pathlib import Path

from fastapi import APIRouter, HTTPException
from pydantic import BaseModel

# sanitize_filename is the function defined in the previous section.
router = APIRouter()


class RenameRequest(BaseModel):
    new_name: str


@router.patch("/files/{file_id}/rename")
async def rename_file(file_id: str, request: RenameRequest):
    """Rename a file with sanitized input."""
    # file_id is assumed to be a server-generated identifier;
    # validate it as well if clients can choose it.
    file_dir = Path("/data/files") / file_id

    if not file_dir.exists():
        raise HTTPException(status_code=404, detail="File not found")

    # Find existing file
    files = list(file_dir.glob("*"))
    if not files:
        raise HTTPException(status_code=404, detail="No file found")

    current_file = files[0]
    current_extension = current_file.suffix

    # Sanitize the new name
    try:
        safe_name = sanitize_filename(request.new_name)
    except ValueError as e:
        raise HTTPException(status_code=400, detail=str(e))

    # Preserve original extension
    if not safe_name.lower().endswith(current_extension.lower()):
        safe_name = safe_name + current_extension

    # Create new path (same directory, new filename)
    new_file = file_dir / safe_name

    # Check for conflicts
    if new_file.exists() and new_file != current_file:
        raise HTTPException(status_code=400, detail="A file with that name already exists")

    # Rename using pathlib (no shell commands!)
    current_file.rename(new_file)

    return {"status": "renamed", "new_filename": safe_name}
```
|
||||
|
||||
## Key Security Principles
|
||||
|
||||
### 1. Whitelist, Don't Blacklist
|
||||
```python
|
||||
# BAD: Trying to block dangerous characters
|
||||
filename = filename.replace('../', '').replace('\x00', '')
|
||||
|
||||
# GOOD: Only allow known-safe characters
|
||||
safe_filename = re.sub(r'[^a-zA-Z0-9\s\-_().]', '', filename)
|
||||
```
|
||||
|
||||
### 2. Use pathlib, Not Shell Commands
|
||||
```python
|
||||
# BAD: Shell command (vulnerable to injection)
|
||||
os.system(f'mv "{old_path}" "{new_path}"')
|
||||
|
||||
# GOOD: Pure Python (no shell)
|
||||
old_path.rename(new_path)
|
||||
```
|
||||
|
||||
### 3. Extract Basename First
|
||||
```python
|
||||
# BAD: User could submit "../../../etc/passwd"
|
||||
filename = user_input
|
||||
|
||||
# GOOD: Extract just the filename part
|
||||
filename = Path(user_input).name
|
||||
```
|
||||
|
||||
### 4. Validate After Sanitization
|
||||
```python
# Ensure something remains after sanitization
if not safe_filename:
    raise ValueError("Filename contains no valid characters")
```
|
||||
|
||||
## Verification
|
||||
```python
# Test cases that should be handled safely
assert sanitize_filename("normal.txt") == "normal.txt"
# Path(...).name keeps only the last path component
assert sanitize_filename("../../../etc/passwd") == "passwd"
assert sanitize_filename("file; rm -rf /") == "file rm -rf"
# Leading whitespace is stripped; the inner space is allowed and stays
assert sanitize_filename(" spaces .txt") == "spaces .txt"
# Parentheses are in the allowed set, so only "$" is removed
assert sanitize_filename("$(whoami).txt") == "(whoami).txt"

# Test cases that should raise errors
for bad_name in ("", "$#@!"):  # empty input; no valid characters
    try:
        sanitize_filename(bad_name)
    except ValueError:
        pass
    else:
        raise AssertionError(f"expected ValueError for {bad_name!r}")
```
|
||||
|
||||
## Notes
|
||||
- This is intentionally restrictive; expand the regex if you need Unicode support
|
||||
- For Unicode filenames, consider `unicodedata.normalize('NFKD', ...)` first
|
||||
- Max length of 200 is conservative; filesystem limits vary (255 bytes typical)
|
||||
- Always preserve file extensions when renaming to avoid breaking file associations
|
||||
- Consider adding a UUID prefix for guaranteed uniqueness in upload scenarios
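
The UUID-prefix idea from the last note could look like the sketch below. `unique_upload_name` is a hypothetical helper; it assumes its input has already been through `sanitize_filename`, and the underscore separator is an arbitrary choice.

```python
import uuid


def unique_upload_name(safe_name: str) -> str:
    """Prefix an already-sanitized filename with a random UUID."""
    # uuid4().hex is 32 lowercase hex characters, all within the allowed set.
    return f"{uuid.uuid4().hex}_{safe_name}"


if __name__ == "__main__":
    print(unique_upload_name("report.pdf"))
```

The prefix adds 33 characters, so with the default `max_length` of 200 the result stays below the typical 255-byte filesystem limit.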
|
||||
|
||||
## References
|
||||
- [OWASP Path Traversal](https://owasp.org/www-community/attacks/Path_Traversal)
|
||||
- [CWE-22: Path Traversal](https://cwe.mitre.org/data/definitions/22.html)
|
||||
- [Python pathlib documentation](https://docs.python.org/3/library/pathlib.html)
|
||||
116
.claude/skills/archived/sops-age-secrets-migration/SKILL.md
Normal file
|
|
@ -0,0 +1,116 @@
|
|||
---
|
||||
name: sops-age-secrets-migration
|
||||
description: |
|
||||
Migrate from git-crypt to SOPS + age for multi-user secret management in a
|
||||
Terraform/Terragrunt infrastructure repo. Use when: (1) need per-user secret
|
||||
access control (git-crypt is all-or-nothing), (2) want operators to push PRs
|
||||
without seeing secrets (CI decrypts), (3) migrating from a single encrypted
|
||||
terraform.tfvars to structured secret management. Covers: JSON format (not YAML
|
||||
— Terraform can't parse YAML tfvars), race condition avoidance with parallel
|
||||
terragrunt applies, CI pipeline integration with Woodpecker, age key management,
|
||||
and the complete migration sequence.
|
||||
author: Claude Code
|
||||
version: 1.0.0
|
||||
date: 2026-03-07
|
||||
---
|
||||
|
||||
# SOPS + age Secrets Migration from git-crypt
|
||||
|
||||
## Problem
|
||||
git-crypt encrypts entire files — anyone with the key decrypts everything. For multi-user
|
||||
setups where operators should push code without seeing secrets, you need per-value encryption
|
||||
with CI-only decryption.
|
||||
|
||||
## Context / Trigger Conditions
|
||||
- Single `terraform.tfvars` encrypted with git-crypt containing 100+ secrets
|
||||
- Need to onboard operators who shouldn't see API keys, passwords, SSH keys
|
||||
- Want GitOps (secrets in git) but with access control
|
||||
- Terraform/Terragrunt stack-per-service architecture
|
||||
|
||||
## Solution
|
||||
|
||||
### 1. Use JSON, not YAML
|
||||
SOPS outputs the same format as input. `sops -d file.yaml` → YAML. `sops -d file.json` → JSON.
|
||||
Terraform natively supports `*.auto.tfvars.json` files. YAML is NOT valid HCL.
|
||||
|
||||
```
|
||||
secrets.sops.json → sops -d → secrets.auto.tfvars.json → Terraform reads it
|
||||
```
|
||||
|
||||
### 2. Split tfvars into config + secrets
|
||||
```
|
||||
config.tfvars ← plaintext (hostnames, IPs, DNS records)
|
||||
secrets.sops.json ← SOPS-encrypted (passwords, tokens, keys)
|
||||
```
|
||||
|
||||
### 3. Global decrypt, not per-stack hooks
|
||||
**CRITICAL**: Do NOT use `before_hook`/`after_hook` for decryption. With `terragrunt run --all`,
|
||||
70+ stacks run hooks in parallel, all writing to the same output file — race condition.
|
||||
|
||||
Instead, use a wrapper script that decrypts once:
|
||||
```bash
|
||||
#!/usr/bin/env bash
# scripts/tg: decrypt once (via a temp file, so a failed decrypt never leaves a truncated output), then run terragrunt
set -euo pipefail
REPO_ROOT="$(cd "$(dirname "$0")/.." && pwd)"
ENC="$REPO_ROOT/secrets.sops.json"
DEC="$REPO_ROOT/secrets.auto.tfvars.json"
# Re-decrypt only when the plaintext is missing or older than the encrypted file.
if [ ! -f "$DEC" ] || [ "$ENC" -nt "$DEC" ]; then
  tmp="$(mktemp "$DEC.XXXXXX")"
  trap 'rm -f "$tmp"' EXIT
  sops -d "$ENC" > "$tmp"
  mv "$tmp" "$DEC"
fi
exec terragrunt "$@"
|
||||
```
|
||||
|
||||
### 4. Terragrunt loads both (backward compatible)
|
||||
```hcl
|
||||
terraform {
|
||||
extra_arguments "common_vars" {
|
||||
commands = get_terraform_commands_that_need_vars()
|
||||
required_var_files = ["${get_repo_root()}/config.tfvars"]
|
||||
optional_var_files = [
|
||||
"${get_repo_root()}/terraform.tfvars", # legacy (git-crypt)
|
||||
"${get_repo_root()}/secrets.auto.tfvars.json" # new (SOPS)
|
||||
]
|
||||
}
|
||||
before_hook "check_secrets" {
|
||||
commands = ["apply", "plan", "destroy"]
|
||||
execute = ["test", "-f", "${get_repo_root()}/secrets.auto.tfvars.json"]
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
### 5. Complex types work in JSON
|
||||
Maps, lists, nested objects, multiline strings (SSH keys as `\n`-escaped) all work:
|
||||
```json
|
||||
{
|
||||
"simple_password": "abc123",
|
||||
"mailserver_accounts": {"user@domain": "pass"},
|
||||
"ssh_key": "-----BEGIN OPENSSH PRIVATE KEY-----\nb3Blbn...\n-----END OPENSSH PRIVATE KEY-----\n"
|
||||
}
|
||||
```
|
||||
|
||||
### 6. CI integration (Woodpecker)
|
||||
- Store age private key as CI secret (`SOPS_AGE_KEY`)
|
||||
- Write to temp file for `SOPS_AGE_KEY_FILE` (Woodpecker `from_secret` only does env vars)
|
||||
- `git add stacks/ state/ .woodpecker/` — NEVER `git add .`
|
||||
- Cleanup step with `status: [success, failure]`
|
||||
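The key-file step could look roughly like the sketch below. It assumes the CI secret reaches the step as the `SOPS_AGE_KEY` environment variable; the helper names `write_age_key_file` and `cleanup_age_key_file` and the use of `mktemp` are illustrative choices, not taken from the actual pipeline.

```bash
#!/usr/bin/env bash
# Sketch: write the age private key from an env var into a private temp file
# and point SOPS at it. Both function names are hypothetical.
write_age_key_file() {
  local keyfile
  # Fail loudly if the CI secret was not injected.
  : "${SOPS_AGE_KEY:?SOPS_AGE_KEY is not set}"
  keyfile="$(mktemp)"                      # created readable by the owner only
  printf '%s\n' "$SOPS_AGE_KEY" > "$keyfile"
  export SOPS_AGE_KEY_FILE="$keyfile"
}

# Remove the key file again; intended for a step that runs on success and failure.
cleanup_age_key_file() {
  if [ -n "${SOPS_AGE_KEY_FILE:-}" ]; then
    rm -f "$SOPS_AGE_KEY_FILE"
    unset SOPS_AGE_KEY_FILE
  fi
}
```

In a pipeline, `write_age_key_file` would run before `sops -d`, and `cleanup_age_key_file` would run in the cleanup step. Call the first function directly rather than inside `$(...)`, otherwise the exported variable is lost with the subshell.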
|
||||
## Verification
|
||||
```bash
|
||||
# Encrypt
|
||||
sops -e -i secrets.sops.json
|
||||
|
||||
# Decrypt and verify
|
||||
sops -d secrets.sops.json | jq .
|
||||
|
||||
# Verify SSH keys
|
||||
sops -d secrets.sops.json | jq -r '.ssh_key' | ssh-keygen -l -f -
|
||||
|
||||
# Test with terragrunt
|
||||
scripts/tg validate
|
||||
```
|
||||
|
||||
## Notes
|
||||
- Keep git-crypt for binary files (TLS certs, deploy keys): SOPS can only encrypt a binary file as one opaque blob, so it offers no per-value benefit there
|
||||
- `sensitive = true` on all secret variable declarations — prevents plan output leaks
|
||||
- Don't add `sensitive = true` to non-secret variables (e.g., `tls_secret_name`, whose name merely contains "secret", or `ingress_path`): Terraform rejects sensitive values as `for_each` arguments
|
||||
- Age keys are one line — much simpler than GPG
|
||||
- `.sops.yaml` path_regex should be anchored: `^secrets\.sops\.json$`
|
||||
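To see why the anchors matter, the sketch below applies the same pattern with and without `^`/`$` to a few paths using bash's `[[ =~ ]]`. It only illustrates regex anchoring: the sample file names and the `matches` helper are made up, and how SOPS resolves the path it matches against is outside the scope of this example.

```bash
#!/usr/bin/env bash
# matches REGEX PATH: returns 0 when PATH matches REGEX (POSIX extended regex).
matches() {
  local re="$1"
  [[ "$2" =~ $re ]]
}

anchored='^secrets\.sops\.json$'
unanchored='secrets\.sops\.json'

# Compare both patterns against an exact name, a nested path, and a backup copy.
for path in secrets.sops.json backup/secrets.sops.json secrets.sops.json.bak; do
  matches "$anchored" "$path"   && a=yes || a=no
  matches "$unanchored" "$path" && u=yes || u=no
  printf '%-28s anchored=%s unanchored=%s\n' "$path" "$a" "$u"
done
```

Only the exact name matches the anchored pattern, while the unanchored one also matches the nested path and the `.bak` copy.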
|
|
@ -0,0 +1,97 @@
|
|||
---
|
||||
name: terraform-state-identity-mismatch
|
||||
description: |
|
||||
Fix Terraform "Unexpected Identity Change" errors during plan/apply. Use when:
|
||||
(1) Terraform fails with "the Terraform Provider unexpectedly returned a different
|
||||
identity", (2) State refresh shows identity mismatch between stored and current values,
|
||||
(3) Resource was created but terraform apply timed out, leaving state inconsistent.
|
||||
Solution involves removing and reimporting the affected resource.
|
||||
author: Claude Code
|
||||
version: 1.0.0
|
||||
date: 2026-01-28
|
||||
---
|
||||
|
||||
# Terraform State Identity Mismatch Fix
|
||||
|
||||
## Problem
|
||||
Terraform fails during plan or apply with an "Unexpected Identity Change" error,
|
||||
indicating the stored state identity doesn't match what the provider returns when
|
||||
reading the resource.
|
||||
|
||||
## Context / Trigger Conditions
|
||||
- Error message contains: "Unexpected Identity Change: During the read operation,
|
||||
the Terraform Provider unexpectedly returned a different identity"
|
||||
- Often occurs after a terraform apply times out mid-creation
|
||||
- Resource exists in the cluster/cloud but state is corrupted
|
||||
- Common with Kubernetes provider after deployment rollout timeouts
|
||||
|
||||
## Solution
|
||||
|
||||
### Step 1: Identify the affected resource
|
||||
The error message includes the resource address:
|
||||
```
|
||||
with module.kubernetes_cluster.module.resume["resume"].kubernetes_deployment.resume
|
||||
```
|
||||
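When the error is buried in long plan output, the address can be pulled out mechanically. The following is a minimal sketch, assuming the address always appears on a line of the form `with <address>` (optionally followed by a comma); `extract_address` is a hypothetical helper, not part of Terraform.

```bash
#!/usr/bin/env bash
# Sketch: print the first resource address found in Terraform error text on stdin.
# Leading non-letter characters (indentation, box-drawing borders) are skipped.
extract_address() {
  sed -n 's/^[^a-zA-Z]*with \([^ ,]*\),\{0,1\}[[:space:]]*$/\1/p' | head -n 1
}

# Sample error text standing in for real plan output.
printf '%s\n' \
  'Error: Unexpected Identity Change' \
  '' \
  '  with module.kubernetes_cluster.module.resume["resume"].kubernetes_deployment.resume,' \
  | extract_address
```

With real output this would be used as `terraform plan -no-color 2>&1 | extract_address`; `-no-color` keeps ANSI escape codes out of the text being parsed.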
|
||||
### Step 2: Remove from state
|
||||
```bash
|
||||
terraform state rm 'module.kubernetes_cluster.module.resume["resume"].kubernetes_deployment.resume'
|
||||
```
|
||||
Note: Use single quotes around the address to handle brackets properly.
|
||||
|
||||
### Step 3: Import the resource back
|
||||
```bash
|
||||
terraform import 'module.kubernetes_cluster.module.resume["resume"].kubernetes_deployment.resume' <namespace>/<name>
|
||||
```
|
||||
For Kubernetes deployments, the import ID is `namespace/deployment-name`.
|
||||
|
||||
### Step 4: Verify with plan
|
||||
```bash
|
||||
terraform plan -target=<module-path>
|
||||
```
|
||||
Should show minimal or no changes if import was successful.
|
||||
|
||||
### Step 5: Apply to sync any drift
|
||||
```bash
|
||||
terraform apply -target=<module-path>
|
||||
```
|
||||
|
||||
## Verification
|
||||
- `terraform plan` runs without identity errors
|
||||
- `terraform apply` completes successfully
|
||||
- Resource still exists and functions correctly
|
||||
|
||||
## Example
|
||||
**Error:**
|
||||
```
|
||||
Error: Unexpected Identity Change
|
||||
|
||||
Current Identity: cty.ObjectVal(map[string]cty.Value{"api_version":cty.NullVal...})
|
||||
New Identity: cty.ObjectVal(map[string]cty.Value{"api_version":cty.StringVal("apps/v1")...})
|
||||
|
||||
with module.kubernetes_cluster.module.resume["resume"].kubernetes_deployment.resume
|
||||
```
|
||||
|
||||
**Fix:**
|
||||
```bash
|
||||
terraform state rm 'module.kubernetes_cluster.module.resume["resume"].kubernetes_deployment.resume'
|
||||
# Output: Removed ... Successfully removed 1 resource instance(s).
|
||||
|
||||
terraform import 'module.kubernetes_cluster.module.resume["resume"].kubernetes_deployment.resume' resume/resume
|
||||
# Output: Import successful!
|
||||
|
||||
terraform apply -target=module.kubernetes_cluster.module.resume -auto-approve
|
||||
# Output: Apply complete! Resources: 0 added, 1 changed, 0 destroyed.
|
||||
```
|
||||
|
||||
## Notes
|
||||
- This is a provider bug, not user error - consider reporting to provider maintainers
|
||||
- The resource continues to work fine; only the terraform state is affected
|
||||
- Always verify the resource exists before importing (don't import non-existent resources)
|
||||
- For Kubernetes resources, import IDs are typically `namespace/name`
|
||||
- For AWS resources, import IDs vary by resource type (check provider docs)
|
||||
- If a stale state lock blocks recovery, prefer `terraform force-unlock <LOCK_ID>`; use `-lock=false` only when you are certain no other run is touching the state
|
||||
|
||||
## See Also
|
||||
- Terraform state management documentation
|
||||
- Kubernetes provider import documentation
|
||||
405
.claude/skills/archived/traefik-helm-configuration/SKILL.md
Normal file
|
|
@ -0,0 +1,405 @@
|
|||
---
|
||||
name: traefik-helm-configuration
|
||||
description: |
|
||||
Consolidated Traefik Helm chart configuration skill covering HTTP/3 (QUIC), UDP
|
||||
cross-namespace routing, and plugin download failures. Use when:
|
||||
(1) enabling HTTP/3 on Traefik or Alt-Svc header shows wrong port (e.g., 8443 instead of 443),
|
||||
(2) HTTP/3 is configured in Helm values but not working end-to-end,
|
||||
(3) Cloudflare-proxied domains need HTTP/3 enabled,
|
||||
(4) custom UDP entrypoints don't appear in the LoadBalancer Service,
|
||||
(5) IngressRouteUDP logs show "udp service is not in the parent resource namespace",
|
||||
(6) DNS or other UDP traffic through Traefik times out despite correct IngressRouteUDP config,
|
||||
(7) all Traefik routes suddenly return 404 after a restart or pod recreation,
|
||||
(8) Traefik logs show "Plugins are disabled because an error has occurred",
|
||||
(9) plugin download fails with "context deadline exceeded" for crowdsec-bouncer or rewrite-body.
|
||||
author: Claude Code
|
||||
version: 1.0.0
|
||||
date: 2026-02-22
|
||||
---
|
||||
|
||||
# Traefik Helm Chart Configuration
|
||||
|
||||
Consolidated guide for three common Traefik Helm chart issues: HTTP/3 (QUIC) enablement,
|
||||
UDP cross-namespace routing, and plugin download failures causing global 404s.
|
||||
|
||||
---
|
||||
|
||||
## HTTP/3 (QUIC)
|
||||
|
||||
### Problem
|
||||
|
||||
You want to enable HTTP/3 (QUIC) on a Traefik ingress controller in Kubernetes so that
|
||||
clients can negotiate HTTP/3 connections via the `Alt-Svc` response header.
|
||||
|
||||
### Context / When to Use
|
||||
|
||||
- Enabling HTTP/3 for the first time on Traefik
|
||||
- Troubleshooting HTTP/3 not working despite configuration
|
||||
- Alt-Svc header shows internal container port (8443) instead of external port (443)
|
||||
- Need to enable HTTP/3 on both origin (Traefik) and CDN (Cloudflare)
|
||||
|
||||
### Solution
|
||||
|
||||
#### Step 1: Configure Traefik Helm Chart Values
|
||||
|
||||
In the Traefik Helm release values, add `http3` configuration to the `websecure` entrypoint:
|
||||
|
||||
```hcl
|
||||
# In modules/kubernetes/traefik/main.tf
|
||||
ports = {
|
||||
websecure = {
|
||||
port = 8443
|
||||
exposedPort = 443
|
||||
protocol = "TCP"
|
||||
http = {
|
||||
tls = {
|
||||
enabled = true
|
||||
}
|
||||
}
|
||||
# Enable HTTP/3 (QUIC)
|
||||
http3 = {
|
||||
enabled = true
|
||||
advertisedPort = 443 # CRITICAL: Must match the external port
|
||||
}
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
**Key gotcha: `advertisedPort = 443`**
|
||||
|
||||
Without `advertisedPort`, Traefik advertises the *internal container port* (8443) in the
|
||||
`Alt-Svc` header:
|
||||
```
|
||||
Alt-Svc: h3=":8443"; ma=2592000
|
||||
```
|
||||
|
||||
This is wrong because clients connect on external port 443, not 8443. The correct header is:
|
||||
```
|
||||
Alt-Svc: h3=":443"; ma=2592000
|
||||
```
|
||||
|
||||
Setting `advertisedPort = 443` fixes this.
|
||||
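The check for the advertised port can be scripted. This is a small sketch, assuming response headers are piped in as text (for example from `curl -sI`); `alt_svc_h3_port` is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# Sketch: read HTTP response headers on stdin and print the port advertised for h3.
alt_svc_h3_port() {
  grep -i '^alt-svc:' | sed -n 's/.*h3=":\([0-9][0-9]*\)".*/\1/p' | head -n 1
}

# Sample headers standing in for real `curl -sI` output.
printf 'HTTP/2 200\r\nalt-svc: h3=":8443"; ma=2592000\r\n' | alt_svc_h3_port   # prints 8443
printf 'HTTP/2 200\r\nalt-svc: h3=":443"; ma=2592000\r\n'  | alt_svc_h3_port   # prints 443
```

A result other than the external port (443 here) suggests `advertisedPort` is missing or wrong; empty output means no `Alt-Svc` header with an `h3` entry was returned.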
|
||||
#### Step 2: Ensure Helm Chart Fully Re-renders
|
||||
|
||||
Changing `http3.enabled=true` in values alone may not cause the Helm chart to add the
|
||||
required UDP port to the Service and Deployment specs. The Traefik Helm chart templates
|
||||
need to re-render to include `websecure-http3: 443/UDP` in the Service.
|
||||
|
||||
If the Service doesn't show a UDP port after applying:
|
||||
- See the companion skill `helm-release-force-rerender` for fixing this
|
||||
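- One possible cause is an upgrade that reuses the previous release's values (`helm upgrade --reuse-values`, or `reuse_values = true` on a Terraform `helm_release`; the provider's default is `false`), which can leave structural changes such as new ports out of the rendered manifests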
|
||||
|
||||
After a successful apply, verify the Service has the UDP port:
|
||||
```bash
|
||||
kubectl get svc traefik -n traefik -o yaml | grep -A5 "443"
|
||||
```
|
||||
|
||||
Expected output should include both:
|
||||
```yaml
|
||||
- name: websecure
|
||||
port: 443
|
||||
protocol: TCP
|
||||
targetPort: websecure
|
||||
- name: websecure-http3
|
||||
port: 443
|
||||
protocol: UDP
|
||||
targetPort: websecure-http3
|
||||
```
|
||||
|
||||
#### Step 3: Enable HTTP/3 on Cloudflare (if using Cloudflare proxy)
|
||||
|
||||
For Cloudflare-proxied domains, HTTP/3 must also be enabled at the Cloudflare zone level.
|
||||
|
||||
**Cloudflare Provider v4** (current in this repo):
|
||||
```hcl
|
||||
resource "cloudflare_zone_settings_override" "http3" {
|
||||
zone_id = var.cloudflare_zone_id
|
||||
|
||||
settings {
|
||||
http3 = "on" # String values: "on" or "off"
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
**Note**: In Cloudflare provider v5, this uses `cloudflare_zone_setting` (singular) with
|
||||
different syntax. The v4 resource is `cloudflare_zone_settings_override` (plural + override).
|
||||
|
||||
#### Step 4: Verify End-to-End
|
||||
|
||||
##### Testing from macOS
|
||||
|
||||
macOS system curl does NOT support HTTP/3. Install curl with HTTP/3:
|
||||
```bash
|
||||
brew install curl
|
||||
```
|
||||
|
||||
Then use the Homebrew version explicitly (the path below is for Apple Silicon; on Intel Macs Homebrew installs under `/usr/local`):
|
||||
```bash
|
||||
# Test HTTP/3 negotiation (Alt-Svc header)
|
||||
/opt/homebrew/opt/curl/bin/curl -sI https://example.viktorbarzin.me 2>&1 | grep -i alt-svc
|
||||
# Expected: alt-svc: h3=":443"; ma=2592000
|
||||
|
||||
# Test actual HTTP/3 connection
|
||||
/opt/homebrew/opt/curl/bin/curl --http3-only -sI https://example.viktorbarzin.me
|
||||
# Expected: HTTP/3 200
|
||||
```
|
||||
|
||||
##### Testing from within the Cluster
|
||||
|
||||
```bash
|
||||
# Use a curl image with HTTP/3 support (amd64 only)
|
||||
kubectl run curl-h3 --rm -it --image=ymuski/curl-http3 --restart=Never -- \
|
||||
curl --http3-only -sI https://example.viktorbarzin.me
|
||||
|
||||
# Note: ymuski/curl-http3 is amd64-only; it will fail on arm64 nodes
|
||||
```
|
||||
|
||||
##### Checking Traefik Logs
|
||||
|
||||
```bash
|
||||
kubectl logs -n traefik -l app.kubernetes.io/name=traefik --tail=100 | grep -i quic
|
||||
```
|
||||
|
||||
### Verification Checklist
|
||||
|
||||
1. Traefik Service shows UDP port 443 (`websecure-http3`)
|
||||
2. `Alt-Svc` response header shows `h3=":443"` (not `h3=":8443"`)
|
||||
3. `/opt/homebrew/opt/curl/bin/curl --http3-only` successfully connects
|
||||
4. Cloudflare zone has HTTP/3 enabled (for proxied domains)
|
||||
|
||||
### Current Configuration (This Repo)
|
||||
|
||||
- **Traefik config**: `modules/kubernetes/traefik/main.tf` (lines 89-92)
|
||||
- **Cloudflare HTTP/3**: `modules/kubernetes/cloudflared/cloudflare.tf` (line 153)
|
||||
- **MetalLB IP**: 10.0.20.202 (Traefik LoadBalancer service)
|
||||
|
||||
### Notes
|
||||
|
||||
- HTTP/3 uses QUIC over UDP. Firewalls must allow UDP 443 inbound.
|
||||
- Traefik automatically handles TLS for HTTP/3 using the same certs as HTTPS.
|
||||
- The `Alt-Svc` header is sent on HTTP/1.1 and HTTP/2 responses to tell clients HTTP/3 is available.
  Clients then switch to HTTP/3 on subsequent requests.
|
||||
- For non-Cloudflare (direct DNS) domains, only the Traefik-side config is needed.
|
||||
- Cloudflare handles its own HTTP/3 negotiation with end users; the origin connection
|
||||
between Cloudflare and Traefik uses HTTP/1.1 or HTTP/2 (not HTTP/3).
|
||||
|
||||
---
|
||||
|
||||
## UDP Cross-Namespace Routing
|
||||
|
||||
### Problem
|
||||
|
||||
Adding a custom UDP entrypoint (e.g., DNS on port 53) to Traefik v3 via Helm chart values
|
||||
doesn't work out of the box. Traffic times out even though the Traefik pod listens on the
|
||||
port internally. Two separate issues compound:
|
||||
|
||||
1. The Helm chart defaults `expose` to `false` for custom entrypoints -- the port is never
|
||||
added to the LoadBalancer Service
|
||||
2. `allowCrossNamespace` defaults to `false` -- IngressRouteUDP in namespace A can't
|
||||
reference a Service in namespace B
|
||||
|
||||
### Context / Trigger Conditions
|
||||
|
||||
- Traefik Helm chart v39.0.0+ (Traefik v3.x)
|
||||
- Custom UDP entrypoint defined in `ports` values
|
||||
- `IngressRouteUDP` referencing a service in a different namespace
|
||||
- Symptoms:
|
||||
- `kubectl get svc traefik` doesn't show your custom UDP port
|
||||
- UDP traffic to the LoadBalancer IP times out
|
||||
- Traefik logs show: `"udp service <namespace>/<service> is not in the parent resource namespace <traefik-namespace>"`
|
||||
- `netstat -ulnp` inside Traefik pod confirms it IS listening on the port
|
||||
|
||||
### Solution
|
||||
|
||||
#### Fix 1: Expose the UDP port on the Service
|
||||
|
||||
In the Helm values, add `expose = { default = true }` to the entrypoint:
|
||||
|
||||
```hcl
|
||||
# Terraform HCL
|
||||
ports = {
|
||||
dns-udp = {
|
||||
port = 5353
|
||||
exposedPort = 53
|
||||
protocol = "UDP"
|
||||
expose = { default = true } # <-- Required for custom entrypoints
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
```yaml
|
||||
# Helm values YAML equivalent
|
||||
ports:
|
||||
dns-udp:
|
||||
port: 5353
|
||||
exposedPort: 53
|
||||
protocol: UDP
|
||||
expose:
|
||||
default: true
|
||||
```
|
||||
|
||||
Note: The built-in `web` and `websecure` entrypoints have `expose.default = true` by
|
||||
default, but custom entrypoints do NOT.
|
||||
|
||||
#### Fix 2: Enable cross-namespace CRD references
|
||||
|
||||
In the Helm values, add `allowCrossNamespace = true` to the kubernetesCRD provider:
|
||||
|
||||
```hcl
|
||||
# Terraform HCL
|
||||
providers = {
|
||||
kubernetesCRD = {
|
||||
enabled = true
|
||||
allowCrossNamespace = true # <-- Required for cross-namespace IngressRouteUDP
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
```yaml
|
||||
# Helm values YAML
|
||||
providers:
|
||||
kubernetesCRD:
|
||||
enabled: true
|
||||
allowCrossNamespace: true
|
||||
```
|
||||
|
||||
This is required whenever an `IngressRouteUDP` (or `IngressRouteTCP`, `IngressRoute`)
|
||||
references a Kubernetes Service in a different namespace.
|
||||
|
||||
### Verification
|
||||
|
||||
```bash
|
||||
# 1. Verify the port appears in the Service
|
||||
kubectl get svc -n traefik traefik -o jsonpath='{.spec.ports[*].name}'
|
||||
# Should include your custom entrypoint name (e.g., "dns-udp")
|
||||
|
||||
# 2. Check Traefik logs for cross-namespace errors
|
||||
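# --tail=-1 is needed: with a label selector, kubectl shows only the last 10 lines per pod by default
kubectl logs -n traefik -l app.kubernetes.io/name=traefik --tail=-1 | grep "not in the parent resource namespace"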
|
||||
# Should return nothing after the fix
|
||||
|
||||
# 3. Test the UDP service
|
||||
dig @<traefik-lb-ip> example.com
|
||||
```
|
||||
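Step 1 of the verification can be turned into a pass/fail check. The sketch below assumes the space-separated list of port names produced by the jsonpath query; `service_has_port` is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# Sketch: succeed if a space-separated list of Service port names contains
# exactly the given entrypoint name.
service_has_port() {
  local names="$1" want="$2" name
  for name in $names; do
    [ "$name" = "$want" ] && return 0
  done
  return 1
}

# Intended use (requires cluster access, so left as a comment):
#   ports="$(kubectl get svc -n traefik traefik -o jsonpath='{.spec.ports[*].name}')"
#   service_has_port "$ports" dns-udp || echo "dns-udp is not exposed on the Service"
```

Comparing whole names instead of using `grep` avoids a false positive when one port name is a prefix of another.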
|
||||
### Example
|
||||
|
||||
DNS forwarding through Traefik to Technitium DNS:
|
||||
- IngressRouteUDP in `traefik` namespace routes `dns-udp` entrypoint to
|
||||
`technitium-dns:53` in `technitium` namespace
|
||||
- Without Fix 1: port 53 never exposed on LoadBalancer -- traffic can't reach Traefik
|
||||
- Without Fix 2: Traefik rejects the route -- logs error every ~60 seconds
|
||||
- With both fixes: DNS queries to LoadBalancer IP:53 -> Traefik -> Technitium
|
||||
|
||||
### Notes
|
||||
|
||||
1. **Debugging order matters**: Fix 1 (expose) must come first. Without the port on the
|
||||
Service, you can't even test if the routing works. Fix 2 (cross-namespace) errors only
|
||||
appear in Traefik logs, not as user-visible failures.
|
||||
2. **`allowCrossNamespace` is a security consideration**: It allows any IngressRoute CRD
   to reference services in any namespace. If this is too broad, consider moving the
   IngressRouteUDP into the target Service's namespace so no cross-namespace reference is needed.
|
||||
3. **Rolling update**: Changing `allowCrossNamespace` triggers a Traefik pod restart
|
||||
(new CLI args). Changing `expose` only updates the Service (no pod restart needed).
|
||||
4. **This applies to TCP too**: `IngressRouteTCP` with cross-namespace services needs the
|
||||
same `allowCrossNamespace` setting.
|
||||
|
||||
---
|
||||
|
||||
## Plugin Download Failure (Global 404)
|
||||
|
||||
### Problem
|
||||
|
||||
After a node maintenance operation (containerd restart, node drain/uncordon, etc.),
|
||||
all Traefik-managed routes return 404. Services, Ingresses, and Middlewares all exist
|
||||
and look correct, making this extremely confusing to debug.
|
||||
|
||||
### Context / Trigger Conditions
|
||||
|
||||
- ALL Traefik routes return 404 simultaneously (not just one service)
|
||||
- Traefik pods are Running and Ready
|
||||
- Ingress resources exist with correct annotations
|
||||
- Middlewares exist in the correct namespaces
|
||||
- TLS secrets exist
|
||||
- Traefik startup logs contain: `Plugins are disabled because an error has occurred`
|
||||
- Plugin download error: `unable to download plugin ... context deadline exceeded`
|
||||
- Happened after a node restart, containerd restart, or network disruption
|
||||
|
||||
### Root Cause
|
||||
|
||||
Traefik downloads plugins (crowdsec-bouncer, rewrite-body, etc.) from
|
||||
`plugins.traefik.io` on **every pod startup**. If the download fails (network
|
||||
unreachable, DNS not ready, timeout), Traefik **disables ALL plugins entirely**.
|
||||
|
||||
Since the `crowdsec` middleware is a plugin-based middleware referenced in virtually
|
||||
every Ingress annotation (`traefik-crowdsec@kubernetescrd`), Traefik treats the
|
||||
missing plugin middleware as a fatal routing error and returns 404 for every route
|
||||
that references it -- which is typically all of them.
|
||||
|
||||
### Solution
|
||||
|
||||
```bash
|
||||
# 1. Confirm the diagnosis - check Traefik startup logs
|
||||
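# --tail=-1 is needed: with a label selector, kubectl shows only the last 10 lines per pod by default
kubectl logs -n traefik -l app.kubernetes.io/name=traefik --tail=-1 | head -20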
|
||||
# Look for: "Plugins are disabled because an error has occurred"
|
||||
|
||||
# 2. Verify outbound connectivity is restored
|
||||
kubectl exec -n traefik $(kubectl get pods -n traefik -l app.kubernetes.io/name=traefik \
|
||||
-o jsonpath='{.items[0].metadata.name}') -- wget -q -O- --timeout=5 https://plugins.traefik.io
|
||||
|
||||
# 3. Rollout restart to retry plugin download
|
||||
kubectl rollout restart deployment -n traefik traefik
|
||||
|
||||
# 4. Verify plugins loaded
|
||||
kubectl logs -n traefik -l app.kubernetes.io/name=traefik --tail=-1 | grep "Plugins"
|
||||
# Should show: "Plugins loaded."
|
||||
|
||||
# 5. Verify routes work
|
||||
curl -s -o /dev/null -w "%{http_code}" -H "Host: viktorbarzin.me" https://10.0.20.202 -k
|
||||
# Should return 200 instead of 404
|
||||
```
|
||||
|
||||
### Verification
|
||||
|
||||
- Traefik logs show `Plugins loaded.` (not `Plugins are disabled`)
|
||||
- Routes return expected HTTP status codes (200, 302, etc.) instead of 404
|
||||
- `kubectl logs -n traefik <pod> | grep "does not exist"` shows no middleware errors
|
||||
|
||||
### Why This Is Hard to Debug
|
||||
|
||||
1. **Traefik pods show Running/Ready** -- health checks pass even without plugins
|
||||
2. **All Kubernetes resources look correct** -- Ingresses, Services, Middlewares all exist
|
||||
3. **The error is in startup logs only** -- not in per-request logs (requests just get 404)
|
||||
4. **The 404 is Traefik's default** -- same as "no route matched", not a backend error
|
||||
5. **The middleware error is logged once at startup** -- easy to miss in a stream of logs
|
||||
|
||||
### Prevention
|
||||
|
||||
- During planned maintenance (node drain, containerd restart), restart Traefik pods
|
||||
AFTER network connectivity is confirmed restored
|
||||
- Consider pre-caching Traefik plugins in the container image or using an init container
|
||||
- Monitor for the `Plugins are disabled` log message in your alerting system
|
||||
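The monitoring suggestion can be approximated with a simple log check. This is a minimal sketch, assuming Traefik's log text is available on stdin; `plugins_disabled` is a hypothetical helper name and the alerting hook is left to the reader.

```bash
#!/usr/bin/env bash
# Sketch: exit 0 when log text on stdin contains the "plugins disabled" marker.
plugins_disabled() {
  grep -q 'Plugins are disabled because an error has occurred'
}

# Intended use (requires cluster access, so left as a comment):
#   kubectl logs -n traefik -l app.kubernetes.io/name=traefik --tail=-1 \
#     | plugins_disabled && echo "ALERT: Traefik started without plugins"
```

Because the message is logged at startup, the check needs the full log (`--tail=-1`) rather than only the most recent lines.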
|
||||
### Notes
|
||||
|
||||
- This affects ALL plugin-based middlewares, not just crowdsec
|
||||
- The `rewrite-body` plugin (used for rybbit analytics injection) is also affected
|
||||
- Traefik v3.x downloads plugins on every startup; there is no persistent cache
|
||||
- If only some routes return 404, the problem is likely different (missing middleware
|
||||
or TLS secret, not a plugin issue)
|
||||
|
||||
---
|
||||
|
||||
## References
|
||||
|
||||
- [Traefik HTTP/3 Documentation](https://doc.traefik.io/traefik/routing/entrypoints/#http3)
|
||||
- [Traefik Helm Chart Values](https://github.com/traefik/traefik-helm-chart/blob/master/traefik/values.yaml)
|
||||
- [Cloudflare HTTP/3 Settings](https://developers.cloudflare.com/speed/optimization/protocol/http3/)
|
||||
- [Traefik Helm Chart Ports Configuration](https://github.com/traefik/traefik-helm-chart)
|
||||
- [Traefik v3 Providers Documentation](https://doc.traefik.io/traefik/providers/kubernetes-crd/)
|
||||
|
||||
## See Also
|
||||
|
||||
- `traefik-rewrite-body-troubleshooting` -- Traefik rewrite-body plugin troubleshooting (compression, Accept header issues)
|
||||
- `helm-release-force-rerender` -- Force Helm chart re-render when structural changes don't take effect
|
||||
|
|
@ -0,0 +1,200 @@
|
|||
---
|
||||
name: traefik-rewrite-body-troubleshooting
|
||||
description: |
|
||||
Troubleshooting guide for the Traefik rewrite-body plugin (packruler/rewrite-body).
|
||||
Covers two failure modes: (1) Compression failure — plugin logs "flate: corrupt input
|
||||
before offset 5" when backends send gzip-compressed responses, corrupting response
|
||||
bodies and breaking WebSocket connections, authentication flows, and mobile app
|
||||
connectivity. (2) Silent skip — plugin silently skips content injection (rybbit
|
||||
analytics, trap links, or any HTML rewriting) when the request Accept header doesn't
|
||||
contain "text/html" (e.g., curl's default Accept: */*), making it appear broken
|
||||
despite correct configuration.
|
||||
author: Claude Code
|
||||
version: 1.0.0
|
||||
date: 2026-02-22
|
||||
---
|
||||
|
||||
# Traefik Rewrite-Body Plugin Troubleshooting
|
||||
|
||||
Two distinct failure modes for the `packruler/rewrite-body` Traefik plugin used for
|
||||
injecting analytics scripts (rybbit) and anti-AI trap links into HTML responses.
|
||||
|
||||
---
|
||||
|
||||
## Problem 1: Compression Failure
|
||||
|
||||
### Symptoms
|
||||
- Traefik logs show: `Rewrite-Body | ERROR ... Error loading content: flate: corrupt input before offset 5`
|
||||
- Mobile apps (e.g., Home Assistant Companion) fail while browser works
|
||||
- HA Companion app shows repeated `GET /?external_auth=1` requests (auth loop)
|
||||
- WebSocket connections (`/api/websocket`) are very short-lived (seconds instead of minutes)
|
||||
- HTTP 499 errors on API calls (client disconnects due to corrupted responses)
|
||||
- Using `packruler/rewrite-body` plugin v1.2.0 with `monitoring.types = ["text/html"]`
|
||||
|
||||
### Root Cause
|
||||
Despite the `monitoring.types = ["text/html"]` filter, the plugin attempts to decompress
|
||||
ALL responses before checking content type. When decompression fails on certain gzip
|
||||
encodings, it corrupts the response body, breaking:
|
||||
- WebSocket upgrade handshakes
|
||||
- Authentication flows (HA Companion app's `external_auth` callback)
|
||||
- Mobile app connectivity (while browser appears to work due to auto-reconnect)
|
||||
|
||||
### Misleading Symptoms
|
||||
- HTTP/3 (QUIC) may appear to be the cause because HTTP/3 requests show 499 errors.
|
||||
This is a red herring -- the rewrite-body plugin corruption affects all protocols.
|
||||
- WebSocket issues may look like a timeout or proxy configuration problem.
|
||||
- The `monitoring.types = ["text/html"]` config suggests the plugin should only touch
|
||||
HTML, but it still processes all responses for decompression before filtering.
|
||||
|
||||
### Solution
|
||||
|
||||
#### Step 1: Create a strip-accept-encoding middleware
|
||||
Add a Traefik middleware that removes `Accept-Encoding` from requests, forcing
|
||||
backends to send uncompressed responses that the plugin can safely process:
|
||||
|
||||
```hcl
|
||||
# In traefik/middleware.tf
|
||||
resource "kubernetes_manifest" "middleware_strip_accept_encoding" {
|
||||
manifest = {
|
||||
apiVersion = "traefik.io/v1alpha1"
|
||||
kind = "Middleware"
|
||||
metadata = {
|
||||
name = "strip-accept-encoding"
|
||||
namespace = kubernetes_namespace.traefik.metadata[0].name
|
||||
}
|
||||
spec = {
|
||||
headers = {
|
||||
customRequestHeaders = {
|
||||
"Accept-Encoding" = ""
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
depends_on = [helm_release.traefik]
|
||||
}
|
||||
```
|
||||
|
||||
#### Step 2: Add middleware to routes with rewrite-body
|
||||
In the ingress factory middleware chain, add `strip-accept-encoding` BEFORE the
|
||||
rewrite-body middleware:
|
||||
|
||||
```hcl
|
||||
var.rybbit_site_id != null ? "traefik-strip-accept-encoding@kubernetescrd" : null,
|
||||
var.rybbit_site_id != null ? "${var.namespace}-rybbit-analytics-${var.name}@kubernetescrd" : null,
|
||||
```
|
||||
|
||||
The order matters: strip-accept-encoding must come first so the request reaches
|
||||
the backend without Accept-Encoding, and the uncompressed response then passes
|
||||
through the rewrite-body plugin.
|
||||
|
||||
### Verification (Compression Fix)
|
||||
1. Check Traefik logs for absence of `flate: corrupt input` errors:
|
||||
```bash
|
||||
kubectl logs -n traefik -l app.kubernetes.io/name=traefik --tail=200 | grep -i "flate\|rewrite-body"
|
||||
```
|
||||
2. Verify the middleware chain includes strip-accept-encoding before rybbit:
|
||||
```bash
|
||||
kubectl get ingress -n <namespace> <name> -o jsonpath='{.metadata.annotations.traefik\.ingress\.kubernetes\.io/router\.middlewares}'
|
||||
```
|
||||
3. Test mobile app connectivity (HA Companion, etc.)
|
||||
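The ordering check in item 2 can be automated. The sketch below assumes the annotation value is a comma-separated list and that the relevant entries contain `strip-accept-encoding` and `rybbit-analytics` in their names, as in the chain shown earlier; `strip_before_rewrite` is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# Sketch: succeed only if a strip-accept-encoding entry precedes the
# rybbit-analytics entry in a comma-separated middleware chain.
# A chain without a rybbit-analytics entry is reported as failure.
strip_before_rewrite() {
  local chain="$1" item seen_strip=0
  local IFS=','
  for item in $chain; do
    case "$item" in
      *strip-accept-encoding*) seen_strip=1 ;;
      *rybbit-analytics*)
        if [ "$seen_strip" -eq 1 ]; then return 0; else return 1; fi
        ;;
    esac
  done
  return 1
}
```

The function takes the output of the `kubectl get ingress ... -o jsonpath=...` command from item 2 as its single argument.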
|
||||
### Notes (Compression)
|
||||
- This affects ALL services using the rewrite-body plugin, not just HA
|
||||
- The fix is applied conditionally: `strip-accept-encoding` is only added to the
|
||||
middleware chain when `rybbit_site_id` is set, so services without analytics
|
||||
are unaffected
|
||||
- Both `ingress_factory` and `reverse_proxy/factory` modules need the fix
|
||||
- Traefik may still compress responses to clients via its own compression middleware;
|
||||
the strip only affects the backend request
|
||||
- The plugin's `monitoring.types` filter works for deciding what to rewrite, but
|
||||
decompression is attempted on all responses regardless
|
||||
|
||||
---
|
||||
|
||||
## Problem 2: Silent Skip (Accept Header Mismatch)
|
||||
|
||||
### Symptoms
|
||||
- rewrite-body middleware is in the ingress middleware chain and shows status "enabled" in Traefik API
|
||||
- `curl https://example.com/` returns original HTML with no injected content
|
||||
- Browser shows injected content (rybbit script, trap links, etc.)
|
||||
- No errors in Traefik logs -- the plugin silently skips processing
|
||||
- `monitoring.types = ["text/html"]` is configured in the middleware spec
|
||||
- Middleware chain order is correct (strip-accept-encoding before rewrite-body)
|
||||
|
||||
### Root Cause
|
||||
In the plugin source code, `SupportsProcessing()` checks the **request** `Accept`
|
||||
header (not the response `Content-Type`) against `monitoring.types`:
|
||||
|
||||
```go
|
||||
func (r *Rewriter) SupportsProcessing(req *http.Request) bool {
|
||||
accept := req.Header.Get("Accept")
|
||||
for _, monitoringType := range r.monitoring.Types {
|
||||
if strings.Contains(accept, monitoringType) {
|
||||
return true
|
||||
}
|
||||
}
|
||||
return false
|
||||
}
|
||||
```
|
||||
|
||||
It uses `strings.Contains(accept, "text/html")`. The curl default `Accept: */*` does
|
||||
NOT contain the substring `text/html`, so the plugin returns false and skips all
|
||||
processing. Browser requests include `Accept: text/html,application/xhtml+xml,...`
|
||||
which does match.
|
||||
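The substring behaviour is easy to reproduce outside the plugin. The following is a bash sketch of the same literal-substring test, not the plugin's code; `accept_matches` is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# Sketch: accept_matches ACCEPT TYPE... succeeds if ACCEPT contains any TYPE
# as a literal substring (no wildcard or q-value handling, like strings.Contains).
accept_matches() {
  local accept="$1" type
  shift
  for type in "$@"; do
    case "$accept" in
      *"$type"*) return 0 ;;
    esac
  done
  return 1
}

accept_matches '*/*' 'text/html' && echo processed || echo skipped                              # prints skipped
accept_matches 'text/html,application/xhtml+xml' 'text/html' && echo processed || echo skipped  # prints processed
```

The wildcard `*/*` is compared as plain text, which is why a client that accepts everything still gets no injected content.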
|
||||
### Misleading Symptoms
|
||||
- Appears as if the middleware isn't working at all
|
||||
- May look like a middleware ordering issue or configuration error
|
||||
- `kubectl get middleware` shows the resource exists with correct spec
|
||||
- Traefik API (`/api/http/middlewares/`) shows the middleware as "enabled"
|
||||
- Checking the rewrite-body regex patterns seems pointless since nothing is being processed
|
||||
|
||||
### Solution
|
||||
This is **working as designed** -- not a bug. The fix depends on context:
|
||||
|
||||
#### For testing with curl
|
||||
Add the `Accept` header to simulate a browser:
|
||||
```bash
|
||||
curl -s -H "Accept: text/html,application/xhtml+xml" https://example.com/
|
||||
```
|
||||
|
||||
#### For verifying injection is working
|
||||
```bash
|
||||
# Check for injected content (trap links, analytics, etc.)
|
||||
curl -s -H "Accept: text/html,application/xhtml+xml" https://example.com/ \
|
||||
| grep -oE 'href="https://poison[^"]*"'
|
||||
|
||||
# Check for rybbit analytics
|
||||
curl -s -H "Accept: text/html,application/xhtml+xml" https://example.com/ \
|
||||
| grep -oE 'src="https://rybbit[^"]*"'
|
||||
```
|
||||
|
||||
#### For programmatic clients that need injection
|
||||
If a non-browser client needs to receive injected content, ensure it sends
|
||||
`Accept: text/html` in its request headers.
|
||||
|
||||
### Verification (Accept Header)
|
||||
```bash
|
||||
# curl's default "Accept: */*" -- no injection (expected)
|
||||
curl -s https://example.com/ | grep -c "rybbit"
|
||||
# Output: 0
|
||||
|
||||
# With Accept header -- injection works
|
||||
curl -s -H "Accept: text/html" https://example.com/ | grep -c "rybbit"
|
||||
# Output: 1 or more (grep -c counts matching lines)
|
||||
```
|
||||
|
||||
### Notes (Accept Header)
|
||||
- This behavior is independent of the compression issue (Problem 1 above)
|
||||
- The check is on the **request** `Accept` header, not the **response** `Content-Type`
|
||||
- `Accept: */*` does NOT match -- `strings.Contains("*/*", "text/html")` is false
|
||||
- Real AI scrapers typically send browser-like Accept headers, so trap links will be
|
||||
injected for them correctly
|
||||
- API calls (which typically send `Accept: application/json`) are correctly skipped
|
||||
|
||||
---
|
||||
|
||||
## See Also
|
||||
- `traefik-helm-configuration` -- Traefik Helm chart configuration and entrypoints
|
||||
- `ingress-factory-migration` -- Covers the ingress factory module that creates
|
||||
rybbit analytics middlewares
|
||||