fix: restore tree dropped by 6d224861; land stem95su gdrive-sync (10m) [ci skip]

6d224861 came from a --no-checkout worktree whose empty index made the commit drop every file except two. This restores 05b50d2b's full tree and correctly adds stacks/stem95su/gdrive-sync.tf + the service-catalog stem95su entry. Forward-only (parent=6d224861, no force-push); [ci skip] since the live infra was never applied from the broken commit.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-09 08:45:33 +00:00 · 2026-06-09 08:45:33 +00:00 · fd0f4a0365
commit fd0f4a0365
parent 6d224861c4
1166 changed files with 358546 additions and 0 deletions
--- a/.claude/skills/archived/authentik-oidc-kubernetes/SKILL.md
+++ b/.claude/skills/archived/authentik-oidc-kubernetes/SKILL.md
@ -0,0 +1,170 @@
+---
+name: authentik-oidc-kubernetes
+description: |
+  Configure Authentik as OIDC provider for Kubernetes API server authentication.
+  Use when: (1) setting up OIDC auth for kubectl with Authentik, (2) kube-apiserver
+  rejects OIDC tokens with "oidc: email not verified", (3) JWKS endpoint returns
+  empty {} despite provider being configured, (4) kubelogin fails with "claim not
+  present" for email, (5) redirect_uri mismatch errors during kubelogin browser auth,
+  (6) kube-apiserver static pod manifest changes don't take effect after restart.
+  Covers all gotchas discovered when integrating Authentik 2025.10.x with Kubernetes
+  1.34.x using kubelogin (int128/kubelogin).
+author: Claude Code
+version: 1.0.0
+date: 2026-02-17
+---
+
+# Authentik OIDC for Kubernetes API Authentication
+
+## Problem
+Setting up Authentik as an OIDC identity provider for Kubernetes kubectl access
+involves multiple non-obvious pitfalls that cause silent failures at different
+stages of the authentication flow.
+
+## Context / Trigger Conditions
+- Setting up multi-user kubectl access with OIDC
+- Using Authentik as the identity provider and kubelogin (int128/kubelogin) as the kubectl plugin
+- Any of these errors:
+  - `oidc: email not verified`
+  - `oidc: parse username claims "email": claim not present`
+  - `The request fails due to a missing, invalid, or mismatching redirection URI`
+  - JWKS endpoint (`/application/o/<app>/jwks/`) returns `{}`
+  - `Unauthorized` after successful browser login
+
+## Solution
+
+### Gotcha 1: Signing Key Must Be Assigned
+
+Authentik's OAuth2 provider does NOT assign a signing key by default. Without it,
+the JWKS endpoint returns `{}` and kube-apiserver can't validate tokens.
+
+**Fix:** Assign a signing key (e.g., "authentik Self-signed Certificate") to the
+OAuth2 provider:
+```python
+# Via Django shell (kubectl exec into authentik server pod)
+from authentik.providers.oauth2.models import OAuth2Provider
+from authentik.crypto.models import CertificateKeyPair
+
+provider = OAuth2Provider.objects.get(name='kubernetes')
+cert = CertificateKeyPair.objects.filter(name='authentik Self-signed Certificate').first()
+provider.signing_key = cert
+provider.save()
+```
+
+Or via API:
+```bash
+curl -X PATCH -H "Authorization: Bearer $TOKEN" -H "Content-Type: application/json" \
+  "$AUTHENTIK_URL/api/v3/providers/oauth2/<pk>/" \
+  -d '{"signing_key": "<certificate-keypair-uuid>"}'
+```
+
+### Gotcha 2: Default Email Mapping Sets `email_verified: False`
+
+Authentik's built-in email scope mapping hardcodes `email_verified: False`:
+```python
+return {
+    "email": request.user.email,
+    "email_verified": False  # <-- This causes kube-apiserver to reject the token
+}
+```
+
+kube-apiserver requires `email_verified: true` by default.
+
+**Fix:** Create a custom scope mapping with `email_verified: True` and assign it
+to the provider instead of the default:
+```python
+from authentik.providers.oauth2.models import OAuth2Provider, ScopeMapping
+
+# Create custom mapping
+mapping, _ = ScopeMapping.objects.get_or_create(
+    name='Kubernetes Email (verified)',
+    defaults={
+        'scope_name': 'email',
+        'expression': 'return {"email": request.user.email, "email_verified": True}'
+    }
+)
+
+# Replace default email mapping on the provider
+provider = OAuth2Provider.objects.get(name='kubernetes')
+default_email = ScopeMapping.objects.filter(
+    managed='goauthentik.io/providers/oauth2/scope-email'
+).first()
+if default_email:
+    provider.property_mappings.remove(default_email)
+provider.property_mappings.add(mapping)
+```
+
+### Gotcha 3: kubelogin Needs Extra Scopes
+
+By default, kubelogin only requests the `openid` scope. The token will lack
+`email` and `groups` claims, causing:
+```
+oidc: parse username claims "email": claim not present
+```
+
+**Fix:** Add `--oidc-extra-scope` flags to the kubeconfig exec plugin:
+```yaml
+users:
+- name: oidc-user
+  user:
+    exec:
+      command: kubectl
+      args:
+        - oidc-login
+        - get-token
+        - --oidc-issuer-url=https://authentik.example.com/application/o/kubernetes/
+        - --oidc-client-id=kubernetes
+        - --oidc-extra-scope=email      # Required!
+        - --oidc-extra-scope=profile
+        - --oidc-extra-scope=groups
+```
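+
+To see which claims actually reach kube-apiserver, the ID token payload can be decoded
+locally. This is a minimal sketch, not part of kubelogin or Authentik: `decode_claims`
+and `missing_for_kubernetes` are hypothetical helper names, the token at the bottom is
+fabricated and unsigned, and no signature verification is performed.
+```python
+import base64
+import json
+
+def decode_claims(id_token: str) -> dict:
+    """Return the (unverified) payload of a JWT; for debugging only."""
+    payload = id_token.split(".")[1]
+    payload += "=" * (-len(payload) % 4)  # restore stripped base64 padding
+    return json.loads(base64.urlsafe_b64decode(payload))
+
+def missing_for_kubernetes(claims: dict) -> list:
+    """List the problems from Gotchas 2 and 3 that this token would hit."""
+    problems = []
+    if "email" not in claims:
+        problems.append("email claim not present")
+    if claims.get("email_verified") is not True:
+        problems.append("email not verified")
+    if "groups" not in claims:
+        problems.append("groups claim not present")
+    return problems
+
+def _fake_token(claims: dict) -> str:
+    # Fabricated token for illustration: header.payload.signature, unsigned
+    body = base64.urlsafe_b64encode(json.dumps(claims).encode()).rstrip(b"=").decode()
+    return f"e30.{body}.sig"
+
+token = _fake_token({"sub": "user@example.com", "email": "user@example.com", "email_verified": False})
+print(missing_for_kubernetes(decode_claims(token)))
+```
+
+An empty list means the token carries everything the gotchas above are about.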
+
+### Gotcha 4: Redirect URIs Must Use Regex Mode
+
+kubelogin does not listen on a fixed port: it tries 8000, then 18000, then a random
+free port. A strict redirect URI such as `http://localhost:8000/callback` fails
+whenever kubelogin ends up on a different port.
+
+**Fix:** Use regex matching in the Authentik provider:
+```json
+{
+  "redirect_uris": [
+    {"matching_mode": "regex", "url": "http://localhost:.*"},
+    {"matching_mode": "regex", "url": "http://127\\.0\\.0\\.1:.*"}
+  ]
+}
+```
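+
+The difference between the two matching modes can be sketched as follows. This is an
+illustration only: `redirect_allowed` is a hypothetical function, and it assumes regex
+mode matches the whole URI (Authentik's real matching logic may differ in detail).
+```python
+import re
+
+def redirect_allowed(pattern: str, mode: str, uri: str) -> bool:
+    """Strict mode compares strings; regex mode matches the full URI (assumed)."""
+    if mode == "strict":
+        return uri == pattern
+    return re.fullmatch(pattern, uri) is not None
+
+# Strict entry pinned to port 8000, kubelogin fell back to 18000
+print(redirect_allowed("http://localhost:8000/callback", "strict", "http://localhost:18000"))
+# Regex entries from the JSON above accept any port
+print(redirect_allowed(r"http://localhost:.*", "regex", "http://localhost:18000"))
+print(redirect_allowed(r"http://127\.0\.0\.1:.*", "regex", "http://127.0.0.1:43210/"))
+```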
+
+### Gotcha 5: Property Mappings API Endpoint Changed
+
+In Authentik 2025.10.x, scope mappings are at:
+- `propertymappings/provider/scope/` (new, correct)
+- NOT `propertymappings/scope/` (old, returns 405 Method Not Allowed on POST)
+
+### Gotcha 6: Static Pod Manifest Changes Need Full Cycle
+
+See skill: `kubelet-static-pod-manifest-update` for the full restart procedure.
+
+## Verification
+
+After all fixes:
+```bash
+# 1. JWKS has a key
+curl -s https://authentik.example.com/application/o/kubernetes/jwks/ | jq '.keys | length'
+# Expected: 1 (or more)
+
+# 2. Test auth
+KUBECONFIG=/path/to/oidc-kubeconfig kubectl get namespaces
+# Expected: browser opens, login, namespaces returned
+
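+# 3. Check API server logs for success
+# (kubectl logs does not expand globs; select the pod by label instead.
+#  component=kube-apiserver is the label kubeadm sets on the static pod.)
+ssh master "sudo kubectl logs -n kube-system -l component=kube-apiserver --tail=-1 | grep oidc | tail -5"
+# Expected: no "Unable to authenticate" errors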
+```
+
+## Notes
+- The OAuth2 provider should use `client_type: public` (no client secret needed for kubelogin)
+- Set `sub_mode: user_email` so the OIDC subject matches the RBAC binding
+- Set `include_claims_in_id_token: true` for the token to contain claims directly
+- Use `issuer_mode: per_provider` for a clean issuer URL
+- RBAC ClusterRoleBindings should match on the user's email (the `--oidc-username-claim=email` value)
--- a/.claude/skills/archived/authentik/SKILL.md
+++ b/.claude/skills/archived/authentik/SKILL.md
@ -0,0 +1,297 @@
+---
+name: authentik
+description: |
+  Manage the Authentik identity provider via its REST API. Use when:
+  (1) User asks to create, update, or delete users in Authentik,
+  (2) User asks to manage groups or group memberships,
+  (3) User asks to create a new OAuth2/OIDC application or provider,
+  (4) User asks to protect a service with forward auth (Authentik + Traefik),
+  (5) User asks about SSO, single sign-on, authentication, or identity,
+  (6) User asks to manage Authentik flows, stages, or policies,
+  (7) User asks to configure social login (Google, GitHub, Facebook),
+  (8) User asks about OIDC for Kubernetes or who has access to what,
+  (9) User deploys a new service that needs authentication.
+  Authentik v2025.10.3 running in Kubernetes, managed via REST API.
+author: Claude Code
+version: 1.0.0
+date: 2026-02-17
+---
+
+# Authentik Identity Provider Management
+
+## Overview
+- **URL**: `https://authentik.viktorbarzin.me`
+- **Admin UI**: `https://authentik.viktorbarzin.me/if/admin/`
+- **API Base**: `https://authentik.viktorbarzin.me/api/v3/`
+- **API Docs**: `https://authentik.viktorbarzin.me/api/v3/docs/`
+- **Helm Chart**: authentik v2025.10.3
+- **Namespace**: `authentik`
+
+## API Access
+
+### Getting the Token
+The API token is stored in `terraform.tfvars` (git-crypt encrypted):
+```bash
+AUTHENTIK_TOKEN=$(grep authentik_api_token terraform.tfvars | cut -d'"' -f2)
+```
+
+### Making API Calls
+```bash
+# Generic pattern
+curl -s -H "Authorization: Bearer $AUTHENTIK_TOKEN" \
+  "https://authentik.viktorbarzin.me/api/v3/<endpoint>/"
+
+# With JSON body (POST/PATCH/PUT)
+curl -s -X POST \
+  -H "Authorization: Bearer $AUTHENTIK_TOKEN" \
+  -H "Content-Type: application/json" \
+  "https://authentik.viktorbarzin.me/api/v3/<endpoint>/" \
+  -d '{"key": "value"}'
+```
+
+### Verify Token Works
+```bash
+curl -s -H "Authorization: Bearer $AUTHENTIK_TOKEN" \
+  "https://authentik.viktorbarzin.me/api/v3/core/users/me/" | python3 -m json.tool
+```
+
+## Key API Endpoints
+
+| Endpoint | Methods | Purpose |
+|----------|---------|---------|
+| `core/users/` | GET, POST | List/create users |
+| `core/users/{id}/` | GET, PATCH, DELETE | Get/update/delete user |
+| `core/groups/` | GET, POST | List/create groups |
+| `core/groups/{pk}/` | GET, PATCH, DELETE | Get/update/delete group |
+| `core/applications/` | GET, POST | List/create applications |
+| `core/tokens/` | GET, POST | List/create tokens |
+| `core/tokens/{identifier}/view_key/` | GET | View token secret key |
+| `providers/all/` | GET | List all providers |
+| `providers/oauth2/` | GET, POST | OAuth2/OIDC providers |
+| `providers/proxy/` | GET, POST | Proxy providers (forward auth) |
+| `flows/instances/` | GET | List flows |
+| `stages/all/` | GET | List stages |
+| `sources/all/` | GET | List sources (social login) |
+| `outposts/instances/` | GET | List outposts |
+| `propertymappings/provider/scope/` | GET, POST | OIDC scope mappings |
+| `rbac/roles/` | GET | List roles |
+
+## Common Operations
+
+### List All Users
+```bash
+curl -s -H "Authorization: Bearer $AUTHENTIK_TOKEN" \
+  "https://authentik.viktorbarzin.me/api/v3/core/users/?page_size=50" | \
+  python3 -c "
+import json,sys
+for u in json.load(sys.stdin)['results']:
+    groups=[g['name'] for g in u.get('groups_obj',[])]
+    print(f\"  {u['username']:<40} {u['name']:<30} groups={groups}\")
+"
+```
+
+### Create a New User
+```bash
+curl -s -X POST \
+  -H "Authorization: Bearer $AUTHENTIK_TOKEN" \
+  -H "Content-Type: application/json" \
+  "https://authentik.viktorbarzin.me/api/v3/core/users/" \
+  -d '{
+    "username": "user@example.com",
+    "name": "Full Name",
+    "email": "user@example.com",
+    "is_active": true,
+    "type": "internal",
+    "path": "users"
+  }'
+```
+
+### Add User to Group
+```bash
+# First get the group to find current users
+GROUP_PK="<group-uuid>"
+CURRENT_USERS=$(curl -s -H "Authorization: Bearer $AUTHENTIK_TOKEN" \
+  "https://authentik.viktorbarzin.me/api/v3/core/groups/$GROUP_PK/" | \
+  python3 -c "import json,sys; print(json.load(sys.stdin)['users'])")
+
+# Then PATCH with the updated user list (add new user pk)
+curl -s -X PATCH \
+  -H "Authorization: Bearer $AUTHENTIK_TOKEN" \
+  -H "Content-Type: application/json" \
+  "https://authentik.viktorbarzin.me/api/v3/core/groups/$GROUP_PK/" \
+  -d '{"users": [<existing_pks>, <new_pk>]}'
+```
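+
+Because the PATCH replaces the whole list, the body has to be built from the current
+members plus the new one. A minimal sketch of that merge, assuming user pks are integers
+as returned in the group's `users` field; `build_group_patch` is a hypothetical helper:
+```python
+import json
+
+def build_group_patch(current_users: list, new_pk: int) -> str:
+    """Return the JSON body for the PATCH, keeping every existing member."""
+    users = list(current_users)  # copy so the caller's list is untouched
+    if new_pk not in users:      # adding an existing member is a no-op
+        users.append(new_pk)
+    return json.dumps({"users": users})
+
+print(build_group_patch([4, 7], 12))
+```
+
+The resulting string can be passed to `curl -d` in place of the placeholder body above.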
+
+### Create a New Group
+```bash
+curl -s -X POST \
+  -H "Authorization: Bearer $AUTHENTIK_TOKEN" \
+  -H "Content-Type: application/json" \
+  "https://authentik.viktorbarzin.me/api/v3/core/groups/" \
+  -d '{
+    "name": "My New Group",
+    "is_superuser": false,
+    "parent": "<parent-group-pk-or-null>"
+  }'
+```
+
+### Create OAuth2/OIDC Application (Full Flow)
+
+**Step 1: Create the OAuth2 Provider**
+```bash
+curl -s -X POST \
+  -H "Authorization: Bearer $AUTHENTIK_TOKEN" \
+  -H "Content-Type: application/json" \
+  "https://authentik.viktorbarzin.me/api/v3/providers/oauth2/" \
+  -d '{
+    "name": "Provider for myapp",
+    "authorization_flow": "<flow-pk>",
+    "invalidation_flow": "<invalidation-flow-pk>",
+    "client_type": "confidential",
+    "client_id": "<generated-or-custom>",
+    "client_secret": "<generated-or-custom>",
+    "redirect_uris": [{"matching_mode": "strict", "url": "https://myapp.viktorbarzin.me/callback"}],
+    "property_mappings": ["<scope-mapping-pks>"],
+    "signing_key": "<signing-key-pk>"
+  }'
+```
+
+**Step 2: Create the Application**
+```bash
+curl -s -X POST \
+  -H "Authorization: Bearer $AUTHENTIK_TOKEN" \
+  -H "Content-Type: application/json" \
+  "https://authentik.viktorbarzin.me/api/v3/core/applications/" \
+  -d '{
+    "name": "My App",
+    "slug": "myapp",
+    "provider": <provider-pk-from-step-1>,
+    "meta_launch_url": "https://myapp.viktorbarzin.me"
+  }'
+```
+
+### List Applications
+```bash
+curl -s -H "Authorization: Bearer $AUTHENTIK_TOKEN" \
+  "https://authentik.viktorbarzin.me/api/v3/core/applications/?page_size=50" | \
+  python3 -c "
+import json,sys
+for a in json.load(sys.stdin)['results']:
+    ptype = (a.get('provider_obj') or {}).get('verbose_name','N/A')
+    print(f\"  {a['name']:<30} slug={a['slug']:<25} provider={ptype}\")
+"
+```
+
+### Create a Non-Expiring API Token
+```bash
+# Create token
+curl -s -X POST \
+  -H "Authorization: Bearer $AUTHENTIK_TOKEN" \
+  -H "Content-Type: application/json" \
+  "https://authentik.viktorbarzin.me/api/v3/core/tokens/" \
+  -d '{
+    "identifier": "my-token-name",
+    "intent": "api",
+    "expiring": false,
+    "description": "Description here"
+  }'
+
+# Retrieve the key
+curl -s -H "Authorization: Bearer $AUTHENTIK_TOKEN" \
+  "https://authentik.viktorbarzin.me/api/v3/core/tokens/my-token-name/view_key/"
+```
+
+## Important Reference UUIDs
+
+### Authorization Flows
+| Flow | Slug | Use For |
+|------|------|---------|
+| Authorize Application (explicit consent) | `default-provider-authorization-explicit-consent` | Apps that should show consent screen |
+| Authorize Application (implicit consent) | `default-provider-authorization-implicit-consent` | Internal/trusted apps, auto-redirect |
+| Logout | `default-invalidation-flow` | Invalidation/logout flow |
+
+### Common Property Mappings (OIDC Scopes)
+These are the standard scope mappings used by most providers:
+- `60e33a8c-66a2-414f-840c-b13012b4d4bd` — openid
+- `1f51c659-f13b-4ad4-ba89-70458ef88e9c` — email
+- `4c0bf430-7f74-4216-b9d7-23703ab544ba` — profile
+
+### Login Sources
+| Source | Slug | Matching Mode |
+|--------|------|---------------|
+| Google | `google` | identifier |
+| GitHub | `github` | email_link |
+| Facebook | `facebook` | email_link |
+
+## Protecting a Service with Forward Auth
+
+To protect a service via Authentik + Traefik forward auth:
+
+1. In the service's Terraform module, set `protected = true` in the `ingress_factory` call
+2. This adds the `authentik-forward-auth` Traefik middleware
+3. Unauthenticated users get redirected to the Authentik login page
+4. After login, these headers are forwarded to the service:
+   - `X-authentik-username`
+   - `X-authentik-uid`
+   - `X-authentik-email`
+   - `X-authentik-name`
+   - `X-authentik-groups`
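+
+On the service side, these headers can be turned into an identity record. The sketch
+below is hypothetical (`identity_from_headers` is not part of any library) and assumes
+that `X-authentik-groups` is a `|`-separated list. The headers are only trustworthy when
+the service is reachable exclusively through Traefik.
+```python
+def identity_from_headers(headers: dict) -> dict:
+    """Extract the forwarded identity; header names are matched case-insensitively."""
+    lowered = {k.lower(): v for k, v in headers.items()}
+    groups_raw = lowered.get("x-authentik-groups", "")
+    return {
+        "username": lowered.get("x-authentik-username"),
+        "email": lowered.get("x-authentik-email"),
+        # Assumption: groups arrive as "group1|group2"
+        "groups": [g for g in groups_raw.split("|") if g],
+    }
+
+print(identity_from_headers({
+    "X-authentik-username": "alice",
+    "X-authentik-groups": "admins|k8s-users",
+}))
+```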
+
+## Invitation Management
+
+### Create Invitation
+```bash
+curl -s -X POST \
+  -H "Authorization: Bearer $AUTHENTIK_TOKEN" \
+  -H "Content-Type: application/json" \
+  "https://authentik.viktorbarzin.me/api/v3/stages/invitation/invitations/" \
+  -d '{
+    "name": "invite-slug-name",
+    "single_use": true,
+    "fixed_data": {"group": "Target Group Name"},
+    "flow": "<invitation-enrollment-flow-pk>"
+  }'
+# Returns PK which is the itoken
+# Link: https://authentik.viktorbarzin.me/if/flow/invitation-enrollment/?itoken=<pk>
+```
+
+### List Invitations
+```bash
+curl -s -H "Authorization: Bearer $AUTHENTIK_TOKEN" \
+  "https://authentik.viktorbarzin.me/api/v3/stages/invitation/invitations/?page_size=50"
+```
+
+### Delete Invitation
+```bash
+curl -s -X DELETE -H "Authorization: Bearer $AUTHENTIK_TOKEN" \
+  "https://authentik.viktorbarzin.me/api/v3/stages/invitation/invitations/<pk>/"
+```
+
+### Helper Script
+Use `.claude/scripts/authentik-invite.sh` for invitation management:
+```bash
+./authentik-invite.sh create "Group Name" [--days N]
+./authentik-invite.sh assign <username> "Group Name"
+./authentik-invite.sh list
+```
+
+### Important Notes
+- OAuth source `enrollment_flow` is set to `invitation-enrollment` -- new social login users require invitation
+- Source updates require Django ORM (PATCH not supported on `sources/oauth/<slug>/`)
+- Invitation `name` field must be a slug (letters, numbers, hyphens, underscores)
+
+## Gotchas
+
+1. **API pagination**: All list endpoints return paginated results. Use `?page_size=50` or check `pagination.next` for more pages.
+2. **Group user updates**: PATCH to groups replaces the entire user list — always fetch current users first, then append.
+3. **Provider property mappings**: Must reference existing scope mapping UUIDs. Query `propertymappings/provider/scope/` to find them.
+4. **Signing key for OIDC**: Must assign a signing key to OAuth2 providers or JWKS endpoint returns empty `{}`.
+5. **Email verified claim**: Default email scope mapping sets `email_verified: False`. For Kubernetes OIDC, create a custom mapping that returns `True`.
+6. **Token identifier uniqueness**: Token identifiers must be unique across the entire instance.
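+
+For gotcha 1, walking every page can be sketched as below. This is an illustration
+under assumptions: `iter_results` and `fetch_page` are hypothetical names, and
+`pagination.next` is assumed to hold the next page number, with `0` (or a missing value)
+meaning there are no more pages. The two-page response is fabricated test data.
+```python
+from typing import Callable, Iterator
+
+def iter_results(fetch_page: Callable[[int], dict]) -> Iterator[dict]:
+    """Yield every item across pages; fetch_page(n) returns the parsed JSON of page n."""
+    page = 1
+    while page:
+        body = fetch_page(page)
+        yield from body["results"]
+        # Assumption: a falsy "next" marks the last page
+        page = body.get("pagination", {}).get("next") or 0
+
+# Fabricated two-page response standing in for real API calls
+_pages = {
+    1: {"results": [{"username": "a"}, {"username": "b"}], "pagination": {"next": 2}},
+    2: {"results": [{"username": "c"}], "pagination": {"next": 0}},
+}
+print([u["username"] for u in iter_results(_pages.__getitem__)])
+```
+
+In real use, `fetch_page` would wrap the `curl` pattern from "Making API Calls" with a
+`page=<n>` query parameter.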
+
+## Notes
+- Authentik is classified as DEFCON Level 1 (Critical) — handle with care
+- Changes to Authentik configuration (Helm chart, PgBouncer, etc.) must go through Terraform
+- API-level changes (users, groups, applications) are fine to make directly via the API
+- The embedded outpost auto-discovers providers assigned to it
+- See also: `ingress-factory-migration` skill for protecting services
--- a/.claude/skills/archived/bluestacks-burp-interception/SKILL.md
+++ b/.claude/skills/archived/bluestacks-burp-interception/SKILL.md
@ -0,0 +1,175 @@
+---
+name: bluestacks-burp-interception
+description: |
+  Intercept Android app HTTPS traffic using BlueStacks and Burp Suite on macOS.
+  Use when: (1) Need to analyze Android app API calls, (2) App ignores HTTP proxy,
+  (3) App uses SSL pinning that blocks interception, (4) Need to install Burp CA
+  as system certificate. Covers ADB setup, proxy configuration, Zygisk SSL unpinning,
+  and Magisk trustusercerts module for system CA installation.
+author: Claude Code
+version: 1.0.0
+date: 2026-01-24
+---
+
+# BlueStacks + Burp Suite HTTPS Traffic Interception
+
+## Problem
+You want to intercept HTTPS traffic from an Android app running in BlueStacks to analyze
+API calls, but the app either ignores the proxy or uses SSL certificate pinning.
+
+## Context / Trigger Conditions
+- Running BlueStacks on macOS with Burp Suite
+- App traffic not appearing in Burp Suite
+- App crashes or refuses to connect when proxy is set
+- Need to bypass SSL pinning for security testing/research
+
+## Prerequisites
+- BlueStacks with Magisk (kitsune variant) and root enabled
+- Zygisk-SSL-Unpinning module installed
+- trustusercerts Magisk module installed
+- Android SDK installed (for ADB)
+- Burp Suite running on port 8080
+
+## Solution
+
+### Step 1: Connect ADB to BlueStacks
+
+```bash
+# ADB location on macOS (Android SDK)
+ADB=~/Library/Android/sdk/platform-tools/adb
+
+# Connect to BlueStacks
+$ADB connect localhost:5555
+
+# Verify connection
+$ADB devices
+# Should show: emulator-5554 or localhost:5555
+```
+
+Note: BlueStacks runs **arm64-v8a** (not x86 as you might expect).
+
+### Step 2: Set HTTP Proxy
+
+Use your Mac's WiFi IP address (not 10.0.2.2 or localhost):
+
+```bash
+# Get Mac WiFi IP
+IP=$(ipconfig getifaddr en0)
+
+# Set proxy (Burp default port 8080)
+$ADB shell settings put global http_proxy ${IP}:8080
+
+# Verify
+$ADB shell settings get global http_proxy
+
+# Disable proxy when done
+$ADB shell settings put global http_proxy :0
+```
+
+### Step 3: Configure SSL Unpinning for Target App
+
+```bash
+# Find app package name
+$ADB shell pm list packages | grep <keyword>
+
+# Edit config
+$ADB shell "su -c 'cat > /data/local/tmp/zyg.ssl/config.json << EOF
+{
+    \"targets\": [
+        {
+            \"pkg_name\" : \"com.example.app\",
+            \"enable\": true,
+            \"start_safe\": true,
+            \"start_delay\": 1000
+        }
+    ]
+}
+EOF'"
+
+# Restart the app
+$ADB shell am force-stop com.example.app
+$ADB shell monkey -p com.example.app -c android.intent.category.LAUNCHER 1
+
+# Verify SSL unpinning is active
+$ADB shell "logcat -d | grep -i ZygiskSSL | tail -10"
+# Should show: "App detected: com.example.app" and "[*] SSL UNPINNING [#]"
+```
+
+### Step 4: Install Burp CA as System Certificate
+
+```bash
+# Download Burp CA cert
+curl -x http://127.0.0.1:8080 http://burp/cert -o /tmp/burp-cert.der
+
+# Convert to PEM
+openssl x509 -inform DER -in /tmp/burp-cert.der -out /tmp/burp-cert.pem
+
+# Get hash for Android cert store naming
+HASH=$(openssl x509 -inform PEM -subject_hash_old -in /tmp/burp-cert.pem | head -1)
+cp /tmp/burp-cert.pem /tmp/${HASH}.0
+
+# Push to device
+$ADB push /tmp/${HASH}.0 /sdcard/
+
+# Install via trustusercerts Magisk module
+$ADB shell "su -c 'cp /sdcard/${HASH}.0 /data/adb/modules/trustusercerts/system/etc/security/cacerts/'"
+$ADB shell "su -c 'chmod 644 /data/adb/modules/trustusercerts/system/etc/security/cacerts/${HASH}.0'"
+
+# Reboot required for Magisk overlay
+$ADB shell "su -c 'reboot'"
+
+# After reboot, reconnect ADB, then verify cert is in system store
+$ADB connect localhost:5555
+$ADB shell "su -c 'ls /system/etc/security/cacerts/${HASH}.0'"
+```
+
+### Step 5: Test Interception
+
+1. Re-enable proxy after reboot: `$ADB shell settings put global http_proxy ${IP}:8080`
+2. Launch target app
+3. Check Burp Suite → Proxy → HTTP history for requests
+
+## Verification
+
+- Proxy set: `adb shell settings get global http_proxy` returns `<ip>:8080`
+- SSL unpinning active: `logcat | grep ZygiskSSL` shows "SSL UNPINNING"
+- Burp CA installed: `ls /system/etc/security/cacerts/<hash>.0` exists
+- Traffic visible in Burp Suite HTTP history
+
+## Troubleshooting
+
+| Symptom | Cause | Fix |
+|---------|-------|-----|
+| No traffic in Burp | Proxy not set | Check `settings get global http_proxy` |
+| App shows SSL error | Cert not installed | Verify cert in system store, reboot |
+| SSL unpinning not working | Config not loaded | Force-stop app, check config.json syntax |
+| ADB connection refused | BlueStacks ADB disabled | Enable in BlueStacks Settings → Advanced |
+| Wrong cert hash | Using wrong openssl flag | Use `-subject_hash_old`, not `-subject_hash` |
+
+## Notes
+
+- BlueStacks runs arm64-v8a, so Zygisk modules need arm64 support
+- The trustusercerts module copies certs at boot via Magisk overlay
+- System partition is read-only; use Magisk modules instead of direct mounting
+- Burp cert hash is typically `9a5ba575` but verify for your instance
+- Some apps may use additional protections (root detection, Frida detection)
+
+## Quick Reference
+
+```bash
+# Set proxy
+adb shell settings put global http_proxy <ip>:8080
+
+# Disable proxy
+adb shell settings put global http_proxy :0
+
+# Check SSL unpinning logs
+adb shell "logcat -d | grep -i ZygiskSSL"
+
+# Force restart app
+adb shell am force-stop <package> && adb shell monkey -p <package> -c android.intent.category.LAUNCHER 1
+```
+
+## References
+- [Zygisk-SSL-Unpinning](https://github.com/m0szy/Zygisk-SSL-Unpinning)
+- [MagiskTrustUserCerts](https://github.com/NVISOsecurity/MagiskTrustUserCerts)
+- [Burp Suite Documentation](https://portswigger.net/burp/documentation)
--- a/.claude/skills/archived/clickhouse-k8s-nfs-system-log-bloat/SKILL.md
+++ b/.claude/skills/archived/clickhouse-k8s-nfs-system-log-bloat/SKILL.md
@ -0,0 +1,189 @@
+---
+name: clickhouse-k8s-nfs-system-log-bloat
+description: |
+  Fix for ClickHouse consuming excessive CPU (500m-1000m+) on Kubernetes when running on
+  NFS storage, caused by unbounded system log table growth triggering continuous background
+  merges. Use when: (1) ClickHouse burns ~1 CPU core with no active user queries,
+  (2) system.merges shows constant merge activity on system.metric_log or system.trace_log,
+  (3) system log tables (metric_log, trace_log, text_log, asynchronous_metric_log) have
+  grown to gigabytes while actual user data is tiny, (4) ClickHouse crashes with exit code
+  76 (loadOutdatedDataParts SIGSEGV), (5) attempting to mount custom config.d XML via
+  Kubernetes ConfigMap causes exit code 36 (BAD_ARGUMENTS) crashes. Also covers why
+  ClickHouse's MergeTree engine performs poorly on NFS and the CronJob workaround for
+  system log truncation.
+author: Claude Code
+version: 1.0.0
+date: 2026-03-01
+---
+
+# ClickHouse on Kubernetes/NFS: System Log Bloat & CPU Overhead
+
+## Problem
+
+ClickHouse deployed on Kubernetes with NFS storage consumes ~1 CPU core continuously,
+even when actual user queries are negligible. The CPU is consumed by background merge
+operations on system log tables that grow unboundedly with no default TTL.
+
+## Context / Trigger Conditions
+
+- ClickHouse pod using 500m-1000m+ CPU with no active user queries
+- `SELECT * FROM system.processes` shows only diagnostic queries
+- `SELECT * FROM system.merges` shows constant merge activity on `system.metric_log`
+- System log tables have grown to gigabytes:
+  - `system.trace_log`: 5+ GiB, 200M+ rows
+  - `system.text_log`: 3+ GiB, 90M+ rows
+  - `system.metric_log`: 1+ GiB with 80-100+ active parts (healthy is <20)
+  - `system.asynchronous_metric_log`: 500+ MiB, 1B+ rows
+- Actual user data (e.g., `clickhouse.events`) is only kilobytes
+- ClickHouse crashes periodically with exit code 76 (`loadOutdatedDataParts` SIGSEGV)
+- Data directory is on NFS (e.g., `/mnt/main/clickhouse`)
+
+## Root Cause
+
+Two compounding issues:
+
+1. **No TTL on system log tables**: ClickHouse system tables (`metric_log`, `trace_log`,
+   `text_log`, `asynchronous_metric_log`, `query_log`, `part_log`) have no default
+   retention policy and grow indefinitely.
+
+2. **NFS amplifies merge overhead**: ClickHouse's MergeTree engine relies on background
+   merge operations that involve heavy sequential I/O. NFS latency makes merges 10-100x
+   slower than local disk, creating a feedback loop:
+   - Slow merges → parts accumulate faster than they can be merged
+   - More parts → more merge operations spawned
+   - More merges → more CPU for decompression/recompression while waiting on NFS I/O
+
+## Solution
+
+### Immediate Fix: Truncate System Tables
+
+```bash
+CH_POD=$(kubectl get pod -n <namespace> -l app=clickhouse -o jsonpath='{.items[0].metadata.name}')
+kubectl exec -n <namespace> $CH_POD -- clickhouse-client --query "TRUNCATE TABLE IF EXISTS system.metric_log"
+kubectl exec -n <namespace> $CH_POD -- clickhouse-client --query "TRUNCATE TABLE IF EXISTS system.trace_log"
+kubectl exec -n <namespace> $CH_POD -- clickhouse-client --query "TRUNCATE TABLE IF EXISTS system.text_log"
+kubectl exec -n <namespace> $CH_POD -- clickhouse-client --query "TRUNCATE TABLE IF EXISTS system.asynchronous_metric_log"
+kubectl exec -n <namespace> $CH_POD -- clickhouse-client --query "TRUNCATE TABLE IF EXISTS system.query_log"
+kubectl exec -n <namespace> $CH_POD -- clickhouse-client --query "TRUNCATE TABLE IF EXISTS system.part_log"
+```
+
+This can take 30-60+ seconds per table on NFS due to part cleanup I/O.
+
+### Permanent Fix: CronJob for Periodic Truncation
+
+Add a Kubernetes CronJob that truncates system tables via the ClickHouse HTTP API:
+
+```hcl
+resource "kubernetes_cron_job_v1" "clickhouse_truncate_logs" {
+  metadata {
+    name      = "clickhouse-truncate-logs"
+    namespace = "<namespace>"
+  }
+  spec {
+    schedule                      = "0 */6 * * *"
+    successful_jobs_history_limit = 1
+    failed_jobs_history_limit     = 1
+    job_template {
+      metadata {}
+      spec {
+        template {
+          metadata {}
+          spec {
+            restart_policy = "OnFailure"
+            container {
+              name  = "truncate"
+              image = "curlimages/curl:8.12.1"
+              command = ["sh", "-c", join(" && ", [
+                "curl -s 'http://clickhouse.<ns>.svc.cluster.local:8123/?user=default&password=<pw>' -d 'TRUNCATE TABLE IF EXISTS system.metric_log'",
+                "curl -s 'http://clickhouse.<ns>.svc.cluster.local:8123/?user=default&password=<pw>' -d 'TRUNCATE TABLE IF EXISTS system.trace_log'",
+                "curl -s 'http://clickhouse.<ns>.svc.cluster.local:8123/?user=default&password=<pw>' -d 'TRUNCATE TABLE IF EXISTS system.text_log'",
+                "curl -s 'http://clickhouse.<ns>.svc.cluster.local:8123/?user=default&password=<pw>' -d 'TRUNCATE TABLE IF EXISTS system.asynchronous_metric_log'",
+                "curl -s 'http://clickhouse.<ns>.svc.cluster.local:8123/?user=default&password=<pw>' -d 'TRUNCATE TABLE IF EXISTS system.query_log'",
+                "curl -s 'http://clickhouse.<ns>.svc.cluster.local:8123/?user=default&password=<pw>' -d 'TRUNCATE TABLE IF EXISTS system.part_log'",
+                "echo 'System logs truncated'"
+              ])]
+            }
+          }
+        }
+      }
+    }
+  }
+}
+```
+
+### What Does NOT Work: Config.d XML Mount
+
+**DO NOT** attempt to mount custom XML config files into `/etc/clickhouse-server/config.d/`
+via Kubernetes ConfigMap. Both approaches crash ClickHouse with exit code 36 (BAD_ARGUMENTS):
+
+- **Full directory mount** (`mount_path = "/etc/clickhouse-server/config.d"`): Replaces
+  the entire directory, deleting the built-in `docker_related_config.xml` that the
+  entrypoint expects. Even if you include it in your ConfigMap, ClickHouse still crashes.
+
+- **sub_path mount** (`sub_path = "custom.xml"`): Also crashes with exit code 36, even
+  with minimal valid XML containing only `<background_pool_size>4</background_pool_size>`.
+
+- Both `remove="1"` (to disable tables) and `<ttl>` (to set retention) config overrides
+  crash with exit code 36.
+
+This appears to be an issue with the `clickhouse/clickhouse-server:25.4.2` Docker image
+and how it preprocesses config at startup. The CronJob approach bypasses this entirely.
+
+## Verification
+
+After truncation, verify:
+
+```bash
+# CPU should drop from ~900m to ~100m within minutes
+kubectl top pod -n <namespace> -l app=clickhouse
+
+# No active merges
+kubectl exec -n <namespace> $CH_POD -- clickhouse-client --query \
+  "SELECT count() FROM system.merges"
+
+# System tables should be small
+kubectl exec -n <namespace> $CH_POD -- clickhouse-client --query \
+  "SELECT database, table, formatReadableSize(sum(bytes_on_disk)) as size, sum(rows) as rows \
+   FROM system.parts WHERE active GROUP BY database, table ORDER BY sum(bytes_on_disk) DESC \
+   FORMAT Pretty"
+```
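+
+The "healthy is <20 active parts" rule of thumb from the trigger conditions can be
+checked mechanically. A minimal sketch, assuming TSV output from a query such as
+`SELECT table, count() FROM system.parts WHERE active AND database = 'system' GROUP BY table FORMAT TSV`;
+`bloated_tables` is a hypothetical helper and the sample data is fabricated:
+```python
+def bloated_tables(tsv: str, max_parts: int = 20) -> list:
+    """Return tables whose active part count is at or above max_parts."""
+    flagged = []
+    for line in tsv.strip().splitlines():
+        table, parts = line.split("\t")
+        if int(parts) >= max_parts:
+            flagged.append(table)
+    return flagged
+
+# Fabricated sample: table name, tab, active part count
+sample = "metric_log\t94\ntrace_log\t12\ntext_log\t31\n"
+print(bloated_tables(sample))
+```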
+
+## Diagnostic Commands
+
+```bash
+# Check what's consuming CPU (merges vs queries)
+kubectl exec -n <ns> $CH_POD -- clickhouse-client --query \
+  "SELECT * FROM system.merges FORMAT Pretty"
+
+kubectl exec -n <ns> $CH_POD -- clickhouse-client --query \
+  "SELECT query_id, elapsed, query FROM system.processes WHERE is_initial_query FORMAT Pretty"
+
+# Check background pool config
+kubectl exec -n <ns> $CH_POD -- clickhouse-client --query \
+  "SELECT name, value FROM system.server_settings \
+   WHERE name IN ('background_pool_size', 'background_merges_mutations_concurrency_ratio') \
+   FORMAT Pretty"
+
+# Default is background_pool_size=16, concurrency_ratio=2 → up to 32 concurrent merges
+```
+
+## Notes
+
+- **Exit code 76**: ClickHouse crashes in `loadOutdatedDataParts()` when there are hundreds
+  of outdated parts on NFS. The truncation CronJob prevents this by keeping tables small.
+
+- **Exit code 36**: `BAD_ARGUMENTS` in ClickHouse. Triggered by config.d XML mounts in
+  Kubernetes. Root cause unclear but reproducible across mount methods.
+
+- **Default thread pools**: ClickHouse defaults to `background_pool_size=16` and
+  `background_schedule_pool_size=512`, spawning 700+ threads even for a single-table
+  workload. This overhead is unavoidable without config file changes.
+
+- **NFS is fundamentally unsuitable** for ClickHouse's MergeTree engine. If data
+  persistence is not critical (e.g., analytics data is small), consider `emptyDir` or
+  local PV storage instead.
+
+## See Also
+
+- `k8s-nfs-mount-troubleshooting` — NFS mount failures and permission issues
+- `k8s-limitrange-oom-silent-kill` — LimitRange defaults causing OOM in ClickHouse containers
--- a/.claude/skills/archived/coturn-k8s-without-hostnetwork/SKILL.md
+++ b/.claude/skills/archived/coturn-k8s-without-hostnetwork/SKILL.md
@ -0,0 +1,145 @@
+---
+name: coturn-k8s-without-hostnetwork
+description: |
+  Deploy coturn (TURN/STUN server) on Kubernetes without hostNetwork by using a
+  narrow relay port range and MetalLB LoadBalancer service. Use when: (1) deploying
+  a WebRTC relay server on k8s, (2) want coturn to run on any node (not pinned),
+  (3) avoiding hostNetwork for better pod scheduling and multi-replica support,
+  (4) need TURN for NAT traversal in WebRTC apps (video streaming, conferencing).
+  Covers relay port range sizing, MetalLB IP sharing, ephemeral TURN credentials
+  via HMAC-SHA1, and pfSense port forwarding.
+author: Claude Code
+version: 1.0.0
+date: 2026-02-21
+---
+
+# coturn on Kubernetes Without hostNetwork
+
+## Problem
+TURN servers traditionally require hostNetwork because they relay media over a wide
+UDP port range (49152-65535). This pins the server to a single node, prevents rolling
+updates, and wastes cluster flexibility.
+
+## Context / Trigger Conditions
+- Deploying a TURN/STUN server for WebRTC applications on Kubernetes
+- Want the TURN pod to be schedulable on any node
+- Need to avoid hostNetwork for better availability and scheduling
+
+## Solution
+
+### Key insight: Narrow the relay port range
+A home lab with ~20 concurrent WebRTC viewers needs ~40 relay ports (2 per viewer).
+Use about 100 ports (49152-49252 inclusive, which is 101 ports) instead of the default
+16K range. This makes it practical to expose via a K8s LoadBalancer service.
+
+### Terraform module structure
+
+```hcl
+locals {
+  turn_port = 3478
+  min_port  = 49152
+  max_port  = 49252  # 101 ports inclusive; enough for ~50 concurrent streams
+}
+
+resource "kubernetes_deployment" "coturn" {
+  spec {
+    # No hostNetwork, no nodeSelector — runs anywhere
+    template {
+      spec {
+        container {
+          image = "coturn/coturn:latest"
+          args  = ["-c", "/etc/turnserver/turnserver.conf"]
+          port {
+            container_port = 3478
+            protocol       = "UDP"
+          }
+        }
+      }
+    }
+  }
+}
+
+resource "kubernetes_service" "coturn" {
+  metadata {
+    annotations = {
+      # Share an existing MetalLB IP to avoid consuming a new one
+      "metallb.universe.tf/loadBalancerIPs"  = "10.0.20.200"
+      "metallb.universe.tf/allow-shared-ip" = "shared"
+    }
+  }
+  spec {
+    type = "LoadBalancer"
+    # Signaling port
+    port {
+      name     = "turn-udp"
+      port     = 3478
+      protocol = "UDP"
+    }
+    # Relay ports: dynamic block generates one port definition per relay port
+    dynamic "port" {
+      for_each = range(local.min_port, local.max_port + 1)
+      content {
+        name        = "relay-${port.value}"
+        port        = port.value
+        target_port = port.value
+        protocol    = "UDP"
+      }
+    }
+  }
+}
+```
+
+### coturn config (turnserver.conf)
+
+```
+listening-port=3478
+fingerprint
+lt-cred-mech
+use-auth-secret
+static-auth-secret=YOUR_SECRET_HERE
+realm=yourdomain.com
+listening-ip=0.0.0.0
+min-port=49152
+max-port=49252
+no-multicast-peers
+no-cli
+```
+
+### MetalLB IP sharing
+To reuse an existing MetalLB IP (e.g., the WireGuard/Shadowsocks shared IP):
+1. Add `metallb.universe.tf/allow-shared-ip: shared` to the coturn service
+2. The same annotation must exist on all other services sharing that IP
+3. **Port conflicts are not allowed** — verify no other service uses 3478 or 49152-49252
+4. After changing the IP annotation, **delete and recreate** the service — MetalLB won't reassign IPs on annotation changes alone
+
+### Ephemeral TURN credentials
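+In `use-auth-secret` mode, coturn accepts time-limited credentials that the application
+derives from the shared secret via HMAC-SHA1. The username is `<expiry-timestamp>:<name>`
+and the credential is the base64-encoded HMAC of that username: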
+
+```javascript
+const crypto = require('crypto');
+const TURN_SECRET = 'your-shared-secret';
+
+function getTurnCredentials(name = 'user', ttl = 86400) {
+  const timestamp = Math.floor(Date.now() / 1000) + ttl;
+  const username = `${timestamp}:${name}`;
+  const credential = crypto.createHmac('sha1', TURN_SECRET)
+    .update(username).digest('base64');
+  return { username, credential };
+}
+```
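+
+For a backend that is not Node.js, the same derivation can be sketched in Python. This
+is a minimal sketch, not coturn's own code: the function names and the `now` parameter
+(added so expiry can be checked deterministically) are illustrative assumptions.
+
+```python
+import base64
+import hashlib
+import hmac
+import time
+from typing import Optional, Tuple
+
+
+def _sign(secret: str, username: str) -> str:
+    # base64(HMAC-SHA1(secret, username)), matching the JavaScript example
+    digest = hmac.new(secret.encode(), username.encode(), hashlib.sha1).digest()
+    return base64.b64encode(digest).decode()
+
+
+def turn_credentials(secret: str, name: str = "user", ttl: int = 86400,
+                     now: Optional[float] = None) -> Tuple[str, str]:
+    """Return (username, credential); username is "<expiry>:<name>"."""
+    current = int(time.time() if now is None else now)
+    username = f"{current + ttl}:{name}"
+    return username, _sign(secret, username)
+
+
+def credential_is_valid(secret: str, username: str, credential: str,
+                        now: Optional[float] = None) -> bool:
+    """Recompute the HMAC and check that the embedded expiry is in the future."""
+    current = int(time.time() if now is None else now)
+    expiry = int(username.split(":", 1)[0])
+    return hmac.compare_digest(_sign(secret, username), credential) and current < expiry
+```
+
+`credential_is_valid` is only a local sanity check for the generator; the server-side
+check is performed by coturn itself.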
+
+## Verification
+
+```bash
+# STUN binding request (raw UDP probe)
+echo -ne '\x00\x01\x00\x00\x21\x12\xa4\x42\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00' \
+  | nc -u -w2 <METALLB_IP> 3478 | xxd | head -3
+# Response starting with 0101 = successful STUN binding response
+```
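+
+Where `nc` or `xxd` is unavailable, the same probe can be sketched in Python. This is
+a hedged sketch: the helper names are hypothetical, and it checks only the 20-byte STUN
+header (message type, magic cookie, transaction ID), not the response attributes.
+
+```python
+import os
+import socket
+import struct
+from typing import Optional
+
+MAGIC_COOKIE = 0x2112A442  # fixed value in every STUN header
+
+
+def build_binding_request(transaction_id: Optional[bytes] = None) -> bytes:
+    """Build a 20-byte STUN Binding Request with no attributes."""
+    tid = os.urandom(12) if transaction_id is None else transaction_id
+    if len(tid) != 12:
+        raise ValueError("transaction ID must be 12 bytes")
+    # type=0x0001 (Binding Request), length=0, magic cookie, transaction ID
+    return struct.pack("!HHI", 0x0001, 0, MAGIC_COOKIE) + tid
+
+
+def is_binding_success(response: bytes, transaction_id: bytes) -> bool:
+    """True if the response header is a Binding Success (0x0101) for our request."""
+    if len(response) < 20:
+        return False
+    msg_type, _length, cookie = struct.unpack("!HHI", response[:8])
+    return (msg_type == 0x0101 and cookie == MAGIC_COOKIE
+            and response[8:20] == transaction_id)
+
+
+def probe(host: str, port: int = 3478, timeout: float = 2.0) -> bool:
+    """Send one Binding Request over UDP and wait for a matching success reply."""
+    tid = os.urandom(12)
+    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
+        sock.settimeout(timeout)
+        sock.sendto(build_binding_request(tid), (host, port))
+        try:
+            data, _addr = sock.recvfrom(2048)
+        except socket.timeout:
+            return False
+    return is_binding_success(data, tid)
+```
+
+Call `probe("<METALLB_IP>")` from a machine that can reach the LoadBalancer IP.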
+
+## Notes
+- ~100 relay ports support ~50 concurrent streams (2 ports per stream)
+- If you need more, increase `max_port` and add more ports to the service
+- coturn auto-detects pod IP — no need to set `relay-ip` or `external-ip` explicitly
+- For public access, add NAT port forwards on pfSense for UDP 3478 + 49152-49252
+- See also: `pfsense-nat-rule-creation` skill for adding the port forwards
--- a/.claude/skills/archived/crowdsec-agent-registration-failure/SKILL.md
+++ b/.claude/skills/archived/crowdsec-agent-registration-failure/SKILL.md
@ -0,0 +1,99 @@
+---
+name: crowdsec-agent-registration-failure
+description: |
+  Fix CrowdSec agent pods stuck in CrashLoopBackOff after LAPI restart due to stale
+  machine registrations. Use when: (1) CrowdSec agent init container fails with
+  "user already exist" error during cscli lapi register, (2) agent pods show hundreds
+  of init container restarts, (3) LAPI was restarted or redeployed but agents kept
+  running with old credentials, (4) cscli machines list shows stale entries for
+  current agent pod names. Covers deleting stale registrations to allow re-registration.
+author: Claude Code
+version: 1.0.0
+date: 2026-02-15
+---
+
+# CrowdSec Agent Registration Failure
+
+## Problem
+After a CrowdSec LAPI restart or redeployment, agent DaemonSet pods lose their
+credentials but LAPI retains the old machine registrations. When agents try to
+re-register with the same pod name, the `wait-for-lapi-and-register` init container
+fails with `user already exist`, causing CrashLoopBackOff with hundreds of restarts.
+
+## Context / Trigger Conditions
+- Agent init container logs show: `Error: cscli lapi register: api client register: api register ... user 'crowdsec-agent-xxxxx': user already exist`
+- Agent pods show status `CrashLoopBackOff` or `Init:CrashLoopBackOff` with many restarts
+- `kubectl describe pod` shows `BackOff restarting failed container wait-for-lapi-and-register`
+- LAPI pods were recently restarted or redeployed
+- `cscli machines list` on LAPI shows entries matching the stuck agent pod names
+
+## Solution
+
+### Step 1: Identify stuck agents
+```bash
+kubectl --kubeconfig $(pwd)/config get pods -n crowdsec
+```
+Note the pod names that are in CrashLoopBackOff (e.g., `crowdsec-agent-jr5q7`).
+
+### Step 2: Confirm the init container error
+```bash
+kubectl --kubeconfig $(pwd)/config logs -n crowdsec <agent-pod> -c wait-for-lapi-and-register --tail=5
+```
+Should show `user already exist` error.
+
+### Step 3: Find a running LAPI pod
+```bash
+kubectl --kubeconfig $(pwd)/config get pods -n crowdsec | grep lapi
+```
+
+### Step 4: Delete stale machine registrations from LAPI
+```bash
+kubectl --kubeconfig $(pwd)/config exec -n crowdsec <lapi-pod> -- cscli machines delete <agent-pod-name>
+```
+Repeat for each stuck agent.
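+
+With many stuck agents, Steps 1 and 4 can be scripted. The following is a minimal
+sketch that assumes the default `kubectl get pods` column layout (NAME, READY, STATUS,
+...); the function names are hypothetical, and it only builds the command strings so
+they can be reviewed before running.
+
+```python
+from typing import List
+
+
+def stuck_agents(kubectl_output: str, prefix: str = "crowdsec-agent-") -> List[str]:
+    """Return agent pod names whose STATUS column contains CrashLoopBackOff."""
+    stuck = []
+    for line in kubectl_output.splitlines():
+        cols = line.split()
+        # Skip the header, blank lines, and non-agent pods
+        if len(cols) < 3 or not cols[0].startswith(prefix):
+            continue
+        if "CrashLoopBackOff" in cols[2]:  # also matches Init:CrashLoopBackOff
+            stuck.append(cols[0])
+    return stuck
+
+
+def delete_commands(pods: List[str], lapi_pod: str,
+                    namespace: str = "crowdsec") -> List[str]:
+    """Build one `cscli machines delete` command per stuck agent."""
+    return [
+        f"kubectl exec -n {namespace} {lapi_pod} -- cscli machines delete {pod}"
+        for pod in pods
+    ]
+```
+
+Feed it the text output of the Step 1 command and print the resulting commands, then
+run them once the list looks right.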
+
+### Step 5: Wait for agents to recover
+The agents are in CrashLoopBackOff with exponential backoff (up to 5 minutes). They'll
+automatically retry registration and succeed after the stale entry is deleted. This can
+take up to 5 minutes per agent depending on where they are in the backoff cycle.
+
+## Verification
+```bash
+# All agents should show Running status
+kubectl --kubeconfig $(pwd)/config get pods -n crowdsec | grep agent
+# DaemonSet should show all pods READY
+kubectl --kubeconfig $(pwd)/config get ds -n crowdsec
+```
+
+## Example
+```bash
+# Identify stuck agents
+$ kubectl get pods -n crowdsec | grep agent
+crowdsec-agent-jr5q7  0/1  CrashLoopBackOff  485  3d
+crowdsec-agent-jw76q  1/1  Running            8    3d
+crowdsec-agent-mtgxh  0/1  CrashLoopBackOff  483  3d
+crowdsec-agent-pfw2l  0/1  CrashLoopBackOff  481  3d
+
+# Delete stale registrations
+$ kubectl exec -n crowdsec crowdsec-lapi-xxx -- cscli machines delete crowdsec-agent-jr5q7
+level=info msg="machine 'crowdsec-agent-jr5q7' deleted successfully"
+$ kubectl exec -n crowdsec crowdsec-lapi-xxx -- cscli machines delete crowdsec-agent-mtgxh
+$ kubectl exec -n crowdsec crowdsec-lapi-xxx -- cscli machines delete crowdsec-agent-pfw2l
+
+# Wait ~5 minutes, then verify
+$ kubectl get pods -n crowdsec | grep agent
+crowdsec-agent-jr5q7  1/1  Running  1  3d
+crowdsec-agent-jw76q  1/1  Running  8  3d
+crowdsec-agent-mtgxh  1/1  Running  1  3d
+crowdsec-agent-pfw2l  1/1  Running  1  3d
+```
+
+## Notes
+- This is a known limitation of the CrowdSec Helm chart — the init container registration
+  script is not idempotent (it doesn't handle "already exists" by deleting and re-registering).
+- The `cscli machines list` output will show many historical stale entries from past
+  DaemonSet rollouts. These are harmless but can be cleaned up if desired.
+- This issue also causes the CrowdSec blocklist import CronJob to fail, since it selects
+  agent pods alphabetically and may pick a non-running one. Fixing the agents also fixes
+  the blocklist import.
+- See also: `k8s-nfs-mount-troubleshooting` for other common pod startup failures.
--- a/.claude/skills/archived/fastapi-svelte-gpu-webui/SKILL.md
+++ b/.claude/skills/archived/fastapi-svelte-gpu-webui/SKILL.md
@ -0,0 +1,310 @@
+---
+name: fastapi-svelte-gpu-webui
+description: |
+  Pattern for building web UIs for GPU-based CLI tools. Use when:
+  (1) Wrapping a command-line tool with a web interface, (2) Building job queue
+  systems for long-running GPU tasks, (3) Creating file upload/download workflows,
+  (4) Need real-time progress updates via WebSocket, (5) Deploying to Kubernetes
+  with GPU scheduling. Covers FastAPI backend, Svelte 5 frontend, NFS storage,
+  and Terraform deployment.
+author: Claude Code
+version: 1.0.0
+date: 2025-01-31
+---
+
+# FastAPI + Svelte GPU WebUI Pattern
+
+## Problem
+Many powerful tools are command-line only, making them inaccessible to non-technical
+users. Building a web UI requires handling file uploads, job queuing, progress tracking,
+and GPU resource scheduling.
+
+## Context / Trigger Conditions
+- You have a CLI tool that does heavy processing (ML inference, media conversion, etc.)
+- Want to add a web interface for easier access
+- Need to track long-running job progress
+- Deploying to Kubernetes with GPU nodes
+- Files need to persist across pod restarts (NFS storage)
+
+## Solution Overview
+
+### Directory Structure
+```
+project-web/
+├── backend/
+│   ├── main.py              # FastAPI app
+│   ├── api/
+│   │   ├── __init__.py
+│   │   └── routes.py        # REST endpoints
+│   ├── services/
+│   │   ├── __init__.py
+│   │   └── converter.py     # CLI wrapper + job manager
+│   ├── models/
+│   │   ├── __init__.py
+│   │   └── schemas.py       # Pydantic models
+│   └── requirements.txt
+├── frontend/
+│   ├── src/
+│   │   ├── App.svelte
+│   │   ├── lib/
+│   │   │   ├── FileUpload.svelte
+│   │   │   ├── JobsList.svelte
+│   │   │   └── ProgressBar.svelte
+│   │   └── stores/
+│   │       └── jobs.js
+│   ├── package.json
+│   └── vite.config.js
+├── Dockerfile
+└── README.md
+```
+
+### Backend: Job Manager Pattern
+```python
+# services/converter.py
+import asyncio
+import uuid
+from dataclasses import dataclass
+from datetime import datetime
+from pathlib import Path
+from typing import Optional, Callable
+
+@dataclass
+class Job:
+    id: str
+    filename: str
+    status: str  # pending, processing, completed, failed
+    progress: float
+    created_at: datetime
+    # Optional fields need defaults so create_job() can omit them
+    output_file: Optional[str] = None
+    error: Optional[str] = None
+
+class JobManager:
+    def __init__(self, storage_path: str = "/mnt"):
+        self.storage_path = Path(storage_path)
+        self.jobs: dict[str, Job] = {}
+        self.progress_callbacks: dict[str, list[Callable]] = {}
+
+    def create_job(self, filename: str, **options) -> Job:
+        job_id = str(uuid.uuid4())
+        job = Job(
+            id=job_id,
+            filename=filename,
+            status="pending",
+            progress=0.0,
+            created_at=datetime.now(),
+            **options
+        )
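+        self.jobs[job_id] = job
+        return job
+
+    # Lookup helpers used by the API routes
+    def get_job(self, job_id: str) -> Optional[Job]:
+        return self.jobs.get(job_id)
+
+    def get_all_jobs(self) -> list[Job]:
+        return list(self.jobs.values())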
+
+    async def run_conversion(self, job_id: str):
+        job = self.jobs[job_id]
+        job.status = "processing"
+
+        input_path = self.storage_path / "uploads" / job.filename
+        output_dir = self.storage_path / "outputs" / job_id
+        output_dir.mkdir(parents=True, exist_ok=True)
+
+        # Build command for CLI tool
+        cmd = [
+            "/path/to/cli-tool",
+            str(input_path),
+            "-o", str(output_dir),
+            # Add other options...
+        ]
+
+        # Run with output capture for progress parsing
+        process = await asyncio.create_subprocess_exec(
+            *cmd,
+            stdout=asyncio.subprocess.PIPE,
+            stderr=asyncio.subprocess.PIPE,
+        )
+
+        # Parse output for progress updates
+        async def read_output(stream):
+            while True:
+                line = await stream.readline()
+                if not line:
+                    break
+                line_str = line.decode(errors="replace").strip()
+                # Parse progress from CLI output, assuming lines like "... 42% ..."
+                if "%" in line_str:
+                    tokens = line_str.split("%", 1)[0].split()
+                    try:
+                        job.progress = float(tokens[-1])
+                    except (IndexError, ValueError):
+                        pass  # line contained "%" but no parsable number
+
+        await asyncio.gather(
+            read_output(process.stdout),
+            read_output(process.stderr)
+        )
+
+        returncode = await process.wait()
+
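+        if returncode == 0:
+            output_files = list(output_dir.glob("*.m4b"))
+            if output_files:
+                job.output_file = output_files[0].name
+                job.status = "completed"
+            else:
+                # Avoid leaving the job stuck in "processing"
+                job.status = "failed"
+                job.error = "Tool exited 0 but produced no output file"
+        else:
+            job.status = "failed"
+            job.error = f"Exit code {returncode}"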
+
+job_manager = JobManager()
+```
+
+### Backend: API Routes
+```python
+# api/routes.py
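+from fastapi import APIRouter, UploadFile, File, HTTPException
+from fastapi.responses import FileResponse
+from pathlib import Path
+import shutil
+import asyncio
+
+# Import paths follow the directory structure above (app started from backend/)
+from models.schemas import JobCreate
+from services.converter import job_manager
+
+router = APIRouter(prefix="/api")
+
+@router.post("/upload")
+async def upload_file(file: UploadFile = File(...)):
+    upload_dir = Path("/mnt/uploads")
+    upload_dir.mkdir(parents=True, exist_ok=True)
+    # Keep only the final path component so "../x" cannot escape upload_dir
+    safe_name = Path(file.filename or "").name
+    if safe_name in ("", ".", ".."):
+        raise HTTPException(400, "Invalid filename")
+    file_path = upload_dir / safe_name
+
+    with file_path.open("wb") as buffer:
+        shutil.copyfileobj(file.file, buffer)
+
+    return {"filename": safe_name, "size": file_path.stat().st_size}
+
+@router.post("/jobs")
+async def create_job(request: JobCreate):
+    # Pass any further tool options from the request as keyword arguments
+    job = job_manager.create_job(filename=request.filename)
+    asyncio.create_task(job_manager.run_conversion(job.id))
+    return job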
+
+@router.get("/jobs")
+async def list_jobs():
+    return job_manager.get_all_jobs()
+
+@router.get("/jobs/{job_id}/download")
+async def download_job(job_id: str):
+    job = job_manager.get_job(job_id)
+    if not job or job.status != "completed":
+        raise HTTPException(404)
+    output_path = Path("/mnt/outputs") / job_id / job.output_file
+    return FileResponse(output_path, filename=job.output_file)
+```
+
+### Frontend: Svelte 5 Components
+```svelte
+<!-- FileUpload.svelte -->
+<script>
+  let { onUpload } = $props();
+  let dragOver = $state(false);
+  let uploading = $state(false);
+
+  async function handleUpload(file) {
+    if (!file) return;
+    uploading = true;
+    try {
+      const formData = new FormData();
+      formData.append('file', file);
+
+      const response = await fetch('/api/upload', {
+        method: 'POST',
+        body: formData
+      });
+
+      if (response.ok) {
+        const data = await response.json();
+        onUpload(data.filename);
+      }
+    } finally {
+      // Reset the flag even if the request throws (e.g. network error)
+      uploading = false;
+    }
+  }
+</script>
+
+<div class="dropzone"
+     class:dragover={dragOver}
+     ondragover={(e) => { e.preventDefault(); dragOver = true; }}
+     ondragleave={() => dragOver = false}
+     ondrop={(e) => { e.preventDefault(); handleUpload(e.dataTransfer.files[0]); }}>
+  Drop file here
+</div>
+```
+
+### Dockerfile
+```dockerfile
+FROM python:3.12-slim
+
+# Install Node for frontend build
+RUN apt-get update && apt-get install -y nodejs npm
+
+# Build frontend
+COPY frontend/ /app/frontend/
+WORKDIR /app/frontend
+RUN npm install && npm run build
+
+# Install backend
+COPY backend/ /app/backend/
+WORKDIR /app/backend
+RUN pip install -r requirements.txt
+
+# Serve static files from FastAPI
+EXPOSE 8000
+CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]
+```
+
+### Terraform Deployment (GPU)
+```hcl
+resource "kubernetes_deployment" "myapp" {
+  spec {
+    template {
+      spec {
+        node_selector = { "gpu" : "true" }
+
+        toleration {
+          key      = "nvidia.com/gpu"
+          operator = "Equal"
+          value    = "true"
+          effect   = "NoSchedule"
+        }
+
+        container {
+          image = "myregistry/myapp@sha256:..."
+          name  = "myapp"
+
+          resources {
+            limits = { "nvidia.com/gpu" = "1" }
+          }
+
+          volume_mount {
+            name       = "data"
+            mount_path = "/mnt"
+          }
+        }
+
+        volume {
+          name = "data"
+          nfs {
+            server = "10.0.10.15"
+            path   = "/mnt/main/myapp"
+          }
+        }
+      }
+    }
+  }
+}
+```
+
+## Verification
+1. Upload a file via the UI
+2. Start a conversion job
+3. Watch progress update in real-time
+4. Download the completed file
+5. Verify files persist across pod restarts
+
+## Notes
+- Use image digest for reliable deployments (see `k8s-docker-registry-cache-bypass` skill)
+- NFS storage persists across pod restarts
+- GPU node taints require matching tolerations
+- Consider adding job persistence (database) for production use
+- WebSocket can provide smoother progress updates than polling
+
+## See Also
+- `k8s-docker-registry-cache-bypass` - Fixing image cache issues
+- `k8s-gpu-no-nvidia-devices` - GPU device troubleshooting
+- `python-filename-sanitization` - Secure file handling
--- a/.claude/skills/archived/grafana-stale-datasource-cleanup/SKILL.md
+++ b/.claude/skills/archived/grafana-stale-datasource-cleanup/SKILL.md
@ -0,0 +1,105 @@
+---
+name: grafana-stale-datasource-cleanup
+description: |
+  Fix Grafana datasource errors when a Helm chart creates a datasource that conflicts
+  with provisioned ones, or when stale datasources persist in the MySQL database.
+  Use when: (1) Grafana shows "dial tcp: lookup <service> no such host" for a datasource,
+  (2) Grafana API returns "datasources:delete permissions needed" when trying to remove
+  a datasource, (3) provisioned datasource exists but Grafana uses a stale one from
+  the database, (4) Helm chart auto-creates a datasource pointing to a disabled gateway
+  service (e.g., loki-gateway). Requires direct MySQL access to fix when Grafana RBAC
+  blocks API operations.
+author: Claude Code
+version: 1.0.0
+date: 2026-02-13
+---
+
+# Grafana Stale Datasource Cleanup
+
+## Problem
+Grafana uses a stale or incorrect datasource from its MySQL database instead of
+the correctly provisioned one. Common when Helm charts auto-create datasources
+that point to services you've disabled (e.g., Loki gateway).
+
+## Context / Trigger Conditions
+- Grafana shows error: `dial tcp: lookup loki-gateway on 10.96.0.10:53: no such host`
+- A provisioned datasource (via ConfigMap sidecar) is correct but Grafana uses a
+  different one stored in MySQL
+- Grafana API returns `"permissions needed: datasources:delete"` or
+  `"permissions needed: datasources:write"` even with admin credentials
+- Dashboard references a datasource UID that points to a wrong URL
+
+## Solution
+
+### Step 1: Identify the stale datasource
+
+List all datasources via API (this usually works even with RBAC):
+```bash
+kubectl exec -n monitoring deploy/grafana -c grafana -- \
+  sh -c 'curl -s "http://localhost:3000/api/datasources" \
+  -u "admin:$GF_SECURITY_ADMIN_PASSWORD"' | python3 -c \
+  "import sys,json; [print(d['uid'], d['name'], d['url']) for d in json.load(sys.stdin)]"
+```
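+
+When the list is long, filtering by the dead hostname narrows it down. This is a small
+sketch that reads the same JSON the API call above returns; the function name is
+hypothetical, and the `uid`, `name`, and `url` keys are the ones the one-liner above
+already relies on.
+
+```python
+import json
+from typing import List, Tuple
+
+
+def stale_datasources(api_json: str, dead_host: str) -> List[Tuple[str, str, str]]:
+    """Return (uid, name, url) for datasources whose URL mentions dead_host."""
+    matches = []
+    for ds in json.loads(api_json):
+        url = ds.get("url") or ""
+        if dead_host in url:
+            matches.append((ds["uid"], ds["name"], url))
+    return matches
+```
+
+For example, save the API output to a file and call
+`stale_datasources(open("datasources.json").read(), "loki-gateway")` to get the UIDs to
+use as `<STALE_UID>` in the following steps.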
+
+### Step 2: Try API deletion first
+
+```bash
+kubectl exec -n monitoring deploy/grafana -c grafana -- \
+  sh -c 'curl -s -X DELETE "http://localhost:3000/api/datasources/uid/<STALE_UID>" \
+  -u "admin:$GF_SECURITY_ADMIN_PASSWORD"'
+```
+
+If this returns a permissions error, proceed to Step 3.
+
+### Step 3: Delete directly from MySQL
+
+When Grafana RBAC blocks API operations, go through MySQL:
+
+```bash
+# Find the Grafana MySQL password
+kubectl exec -n monitoring deploy/grafana -c grafana -- \
+  sh -c 'echo $GF_DATABASE_PASSWORD'
+
+# Find the stale datasource
+kubectl exec -n dbaas deploy/mysql -- mysql -u grafana -p"<PASSWORD>" grafana \
+  -e "SELECT id, uid, name, url FROM data_source;"
+
+# Delete it
+kubectl exec -n dbaas deploy/mysql -- mysql -u grafana -p"<PASSWORD>" grafana \
+  -e "DELETE FROM data_source WHERE uid='<STALE_UID>';"
+```
+
+### Step 4: Fix dashboards referencing the old UID
+
+Dashboards store datasource UIDs in their JSON. Update via MySQL:
+```bash
+kubectl exec -n dbaas deploy/mysql -- mysql -u grafana -p"<PASSWORD>" grafana \
+  -e "UPDATE dashboard SET data = REPLACE(data, '<OLD_UID>', '<NEW_UID>') WHERE title LIKE '%Dashboard Name%';"
+```
+
+### Step 5: Refresh Grafana
+
+Hard-refresh browser (Cmd+Shift+R). If datasource still doesn't appear:
+```bash
+kubectl rollout restart deploy -n monitoring grafana
+```
+
+## Verification
+```bash
+# Verify only correct datasources remain
+kubectl exec -n monitoring deploy/grafana -c grafana -- \
+  sh -c 'curl -s "http://localhost:3000/api/datasources" \
+  -u "admin:$GF_SECURITY_ADMIN_PASSWORD"' | python3 -m json.tool
+```
+
+## Notes
+- Grafana's sidecar auto-discovers ConfigMaps with label `grafana_datasource: "1"`
+  and provisions datasources from them. These are file-provisioned and show as
+  "provisioned" in the UI.
+- Helm charts (e.g., Loki) may auto-create their own datasource in the Grafana
+  database pointing to services like `loki-gateway`. If you disable the gateway,
+  this datasource becomes stale.
+- Grafana dashboards in this repo are stored in MySQL (not file-provisioned),
+  so dashboard JSON files in the repo are reference copies only.
+- The `GF_SECURITY_ADMIN_PASSWORD` env var is set by the Grafana Helm chart.
+- See also: `loki-helm-deployment-pitfalls` for related Loki deployment issues.
--- a/.claude/skills/archived/helm-release-troubleshooting/SKILL.md
+++ b/.claude/skills/archived/helm-release-troubleshooting/SKILL.md
@ -0,0 +1,253 @@
+---
+name: helm-release-troubleshooting
+description: |
+  Troubleshoot and fix Helm release issues managed by Terraform. Use when:
+  (1) Terraform applies successfully but K8s resources don't reflect new Helm values,
+  (2) New ports/volumes/containers from Helm chart values don't appear in deployed resources,
+  (3) helm upgrade --reuse-values doesn't re-render templates for structural changes,
+  (4) Terraform thinks Helm release is up-to-date but actual K8s resources are stale,
+  (5) terraform apply fails with "another operation (install/upgrade/rollback) is in progress",
+  (6) helm history shows status "pending-upgrade" or "pending-rollback",
+  (7) a Helm upgrade was interrupted by network timeout, etcd timeout, or VPN drop,
+  (8) helm upgrade fails with "an error occurred while finding last successful release".
+  Covers force re-rendering via state removal/reimport and stuck release recovery via
+  secret cleanup.
+author: Claude Code
+version: 1.0.0
+date: 2026-02-22
+---
+
+# Helm Release Troubleshooting
+
+## Force Re-render
+
+### Problem
+After changing Helm chart values in a Terraform `helm_release` resource, Terraform applies
+successfully but the actual Kubernetes resources (Services, Deployments, etc.) don't reflect
+the new values. For example, adding a new port in Helm values doesn't result in that port
+appearing in the Service spec.
+
+### Context / Trigger Conditions
+- Terraform `helm_release` applies with "1 changed" but `kubectl get svc -o yaml` shows
+  the old configuration
+- Structural changes to Helm values (new ports, new containers, new volumes) are not
+  reflected in deployed resources
+- The Helm chart templates need to be fully re-rendered, not just patched
+- Common with Traefik, ingress-nginx, and other charts where template logic conditionally
+  includes resources based on values
+
+### Root Cause
+Terraform's `helm_release` resource performs a Helm upgrade under the hood. When values
+are changed, Helm may apply `--reuse-values` behavior (the resource's `reuse_values`
+argument), merging new values into the previous release's values rather than rendering
+from the chart defaults plus the supplied values. For structural changes (like enabling
+HTTP/3, which adds a new UDP port to the Service template), the rendered templates may
+then not have the new conditional branches active.
+
+Additionally, Terraform may see the stored Helm release state as matching the desired state
+even though the actual Kubernetes resources don't reflect it, creating a state drift that
+Terraform doesn't detect.
+
+### Solution
+
+#### Step 1: Verify the Discrepancy
+
+Confirm that K8s resources don't match Helm values:
+```bash
+# Check the actual resource
+kubectl get svc <service-name> -n <namespace> -o yaml
+
+# Check what Helm thinks is deployed
+helm get values <release-name> -n <namespace>
+helm get manifest <release-name> -n <namespace> | grep -A10 "<expected-config>"
+```
+
+#### Step 2: Remove Helm Release from Terraform State
+
+```bash
+terraform state rm 'module.kubernetes_cluster.module.<service>.helm_release.<name>'
+```
+
+**IMPORTANT**: This only removes from Terraform state. The actual Helm release and K8s
+resources remain untouched in the cluster.
+
+#### Step 3: Import the Helm Release Back
+
+```bash
+terraform import 'module.kubernetes_cluster.module.<service>.helm_release.<name>' '<namespace>/<release-name>'
+```
+
+For Helm releases, the import ID format is `namespace/release-name`.
+
+#### Step 4: Force Apply with Terraform
+
+After reimporting, run terraform apply. Terraform should now detect the drift between
+the desired Helm values and the actual release state:
+
+```bash
+terraform apply -target=module.kubernetes_cluster.module.<service>
+```
+
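+If Terraform still shows "no changes", you may need to taint the resource. Note that a
+tainted `helm_release` is destroyed and recreated, which uninstalls and reinstalls the
+release, so expect downtime: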
+```bash
+terraform taint 'module.kubernetes_cluster.module.<service>.helm_release.<name>'
+terraform apply -target=module.kubernetes_cluster.module.<service>
+```
+
+#### Step 5: Manual Helm Force Upgrade (Last Resort)
+
+If Terraform still doesn't fix it, use Helm directly as a one-time fix, then reimport:
+
+```bash
+# Get the current values file
+helm get values <release-name> -n <namespace> -o yaml > /tmp/values.yaml
+
+# Edit /tmp/values.yaml to include the correct values, or use --set flags
+
+# Force upgrade (re-renders templates and forces resource updates via replacement)
+helm upgrade --force <release-name> <chart> -n <namespace> -f /tmp/values.yaml
+
+# Then reimport into Terraform
+terraform state rm 'module.kubernetes_cluster.module.<service>.helm_release.<name>'
+terraform import 'module.kubernetes_cluster.module.<service>.helm_release.<name>' '<namespace>/<release-name>'
+terraform apply -target=module.kubernetes_cluster.module.<service>
+```
+
+**WARNING**: Direct Helm operations bypass Terraform. Always reimport into Terraform state
+afterward, and use `terraform apply` to verify Terraform is back in sync.
+
+### Verification
+
+```bash
+# Check the K8s resources now match expected configuration
+kubectl get svc <service-name> -n <namespace> -o yaml
+kubectl get deployment <deployment-name> -n <namespace> -o yaml
+
+# Verify Terraform is in sync
+terraform plan -target=module.kubernetes_cluster.module.<service>
+# Should show "No changes" or minimal expected drift
+```
+
+### Example: Traefik HTTP/3 UDP Port Not Appearing
+
+**Problem**: Added `http3.enabled=true` to Traefik Helm values. Terraform applied
+successfully, but the Traefik Service only had TCP port 443, missing the expected
+UDP port 443 (`websecure-http3`).
+
+**Fix**:
+```bash
+# 1. Remove from state
+terraform state rm 'module.kubernetes_cluster.module.traefik.helm_release.traefik'
+
+# 2. Reimport
+terraform import 'module.kubernetes_cluster.module.traefik.helm_release.traefik' 'traefik/traefik'
+
+# 3. Apply (Terraform now detects the drift)
+terraform apply -target=module.kubernetes_cluster.module.traefik
+
+# 4. Verify
+kubectl get svc traefik -n traefik -o yaml | grep -A3 "websecure-http3"
+# Should show: port: 443, protocol: UDP
+```
+
+### Notes
+
+- This issue is more common with structural Helm value changes (new ports, new sidecars,
+  conditional template blocks) than with simple value changes (image tags, replica counts)
+- The `helm upgrade --force` flag forces resource updates through a replacement strategy
+  instead of patching, which can cause brief downtime. Use with caution on production
+  ingress controllers.
+- Always verify with `terraform plan` after fixing to ensure Terraform state is consistent
+
+---
+
+## Stuck Release Recovery
+
+### Problem
+Helm releases can get stuck in `pending-upgrade`, `pending-rollback`, or `pending-install`
+states when an upgrade is interrupted (network drop, etcd timeout, resource exhaustion).
+Subsequent upgrades or terraform applies fail because Helm thinks an operation is in progress.
+
+### Context / Trigger Conditions
+- `terraform apply` fails with: `another operation (install/upgrade/rollback) is in progress`
+- `helm history <release> -n <namespace>` shows `pending-upgrade`, `pending-rollback`, or `pending-install`
+- A previous Helm upgrade was interrupted by network timeout, VPN drop, or etcd timeout
+- `helm upgrade` fails with: `an error occurred while finding last successful release`
+
+### Solution
+
+#### Step 1: Identify the stuck release
+```bash
+helm --kubeconfig $(pwd)/config history <release> -n <namespace> | tail -5
+```
+
+Look for revisions with status `pending-upgrade`, `pending-rollback`, or `pending-install`.
+
+#### Step 2: Delete the stuck Helm release secrets
+Each Helm revision is stored as a Kubernetes secret named `sh.helm.release.v1.<release>.v<revision>`.
+Delete all stuck revisions:
+
+```bash
+# Delete specific stuck revision (e.g., revision 5)
+kubectl --kubeconfig $(pwd)/config delete secret sh.helm.release.v1.<release>.v5 -n <namespace>
+
+# If multiple stuck revisions exist, delete all of them
+kubectl --kubeconfig $(pwd)/config delete secret sh.helm.release.v1.<release>.v6 -n <namespace>
+```
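+
+The secret names can also be derived from the history output instead of being typed by
+hand. This is a sketch under one assumption worth verifying on your Helm version: that
+`helm history <release> -n <namespace> -o json` emits a list of objects with `revision`
+and `status` keys. The function name is hypothetical.
+
+```python
+import json
+from typing import List
+
+PENDING_STATES = {"pending-install", "pending-upgrade", "pending-rollback"}
+
+
+def stuck_secret_names(history_json: str, release: str) -> List[str]:
+    """Return the release secret names for revisions stuck in a pending state."""
+    names = []
+    for rev in json.loads(history_json):
+        if rev.get("status") in PENDING_STATES:
+            names.append(f"sh.helm.release.v1.{release}.v{rev['revision']}")
+    return names
+```
+
+The sketch lists only `pending-*` revisions. Whether to also remove an adjacent
+`failed` revision is left as a manual decision.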
+
+#### Step 3: Verify the release is clean
+```bash
+helm --kubeconfig $(pwd)/config history <release> -n <namespace> | tail -3
+```
+
+The latest revision should now show `deployed` status.
+
+#### Step 4: Retry the upgrade
+```bash
+terraform apply -target=module.kubernetes_cluster.module.<service> -var="kube_config_path=$(pwd)/config" -auto-approve
+```
+
+### Important Notes
+
+- **Never patch the secret labels** (e.g., changing `status: pending-rollback` to `status: failed`).
+  This changes the label but not the encoded release data inside the secret, leaving Helm in an
+  inconsistent state. Always delete the stuck secrets entirely.
+- If the failed upgrade partially applied changes to the cluster (e.g., modified a Deployment),
+  the next successful upgrade will reconcile the state.
+- When VPN/network is unstable, prefer direct `helm upgrade --reuse-values --set key=value`
+  over `terraform apply`, since Helm upgrades are faster than the full Terraform refresh cycle.
+
+### Verification
+After deleting stuck secrets and re-applying:
+- `helm history` shows the new revision as `deployed`
+- `terraform apply` completes without errors
+
+### Example
+```bash
+# Helm history shows stuck state
+$ helm history nextcloud -n nextcloud | tail -3
+4  deployed        nextcloud-8.8.1  Upgrade complete
+5  failed          nextcloud-8.8.1  Upgrade failed: etcd timeout
+6  pending-rollback nextcloud-8.8.1 Rollback to 4
+
+# Fix: delete stuck revisions
+$ kubectl delete secret sh.helm.release.v1.nextcloud.v5 sh.helm.release.v1.nextcloud.v6 -n nextcloud
+
+# Verify clean state
+$ helm history nextcloud -n nextcloud | tail -1
+4  deployed  nextcloud-8.8.1  Upgrade complete
+
+# Re-apply
+$ terraform apply -target=module.kubernetes_cluster.module.nextcloud -auto-approve
+```
+
+---
+
+## See Also
+
+- `terraform-state-identity-mismatch` - For Terraform provider identity errors
+- `traefik-http3-quic` - For enabling HTTP/3 on Traefik (common trigger for force re-render)
+
+## References
+
+- [Terraform helm_release Resource](https://registry.terraform.io/providers/hashicorp/helm/latest/docs/resources/release)
+- [Helm Upgrade Documentation](https://helm.sh/docs/helm/helm_upgrade/)
+- [Helm --force Flag](https://helm.sh/docs/helm/helm_upgrade/#options)
--- a/.claude/skills/archived/ingress-factory-migration/SKILL.md
+++ b/.claude/skills/archived/ingress-factory-migration/SKILL.md
@ -0,0 +1,157 @@
+---
+name: ingress-factory-migration
+description: |
+  Migrate raw kubernetes_ingress_v1 resources to the centralized ingress_factory module.
+  Use when: (1) a service defines a raw kubernetes_ingress_v1 with hand-rolled Traefik
+  middleware annotations, (2) adding a new service that needs standard ingress with
+  rate limiting, CrowdSec, CSP headers, rybbit analytics, or authentik auth,
+  (3) refactoring existing ingresses for consistency. Covers single-path, multi-path,
+  split UI/API, full_host overrides, custom rate limits, and extra middleware injection.
+author: Claude Code
+version: 1.0.0
+date: 2026-02-10
+---
+
+# Ingress Factory Migration
+
+## Problem
+Services define raw `kubernetes_ingress_v1` resources with hand-rolled Traefik middleware
+chains. This creates inconsistency - middleware chains are copy-pasted per service, making
+it easy to miss security middleware (CrowdSec, rate limiting) or analytics (rybbit). The
+`ingress_factory` module at `modules/kubernetes/ingress_factory/main.tf` provides a single
+point of control.
+
+## Context / Trigger Conditions
+- Service has a raw `kubernetes_ingress_v1` resource instead of using `module "ingress"`
+- Service has a manually defined `kubernetes_manifest` for rybbit analytics middleware
+- New service needs standard ingress configuration
+- Middleware chain needs to be updated across many services
+
+## Solution
+
+### Standard single-path ingress
+Replace the raw resource with:
+```hcl
+module "ingress" {
+  source          = "../ingress_factory"
+  namespace       = kubernetes_namespace.<service>.metadata[0].name
+  name            = "<service-name>"        # becomes the ingress name AND default hostname
+  host            = "<subdomain>"           # optional: override hostname (if different from name)
+  service_name    = "<k8s-service-name>"    # optional: defaults to name
+  port            = 80                      # optional: defaults to 80
+  tls_secret_name = var.tls_secret_name
+  protected       = false                   # set true for authentik forward auth
+}
+```
+
+### Multi-path / split UI+API
+Use two module calls with different names but the same host:
+```hcl
+module "ingress" {
+  source          = "../ingress_factory"
+  namespace       = kubernetes_namespace.<service>.metadata[0].name
+  name            = "<service>"
+  host            = "<subdomain>"
+  service_name    = "<ui-service>"
+  tls_secret_name = var.tls_secret_name
+  rybbit_site_id  = "<id>"                  # optional: adds rybbit analytics
+}
+
+module "ingress-api" {
+  source          = "../ingress_factory"
+  namespace       = kubernetes_namespace.<service>.metadata[0].name
+  name            = "<service>-api"
+  host            = "<subdomain>"           # same host as UI
+  service_name    = "<api-service>"
+  ingress_path    = ["/api"]
+  tls_secret_name = var.tls_secret_name
+  # No rybbit_site_id - API returns JSON, not HTML
+}
+```
+
+### Full host override (for root domain like viktorbarzin.me)
+```hcl
+module "ingress" {
+  source          = "../ingress_factory"
+  namespace       = kubernetes_namespace.<service>.metadata[0].name
+  name            = "<service>"
+  service_name    = "<k8s-service>"
+  full_host       = "viktorbarzin.me"       # bypasses name.root_domain construction
+  tls_secret_name = var.tls_secret_name
+}
+```
+
+### Custom rate limiting (e.g., immich)
+```hcl
+module "ingress" {
+  source                  = "../ingress_factory"
+  namespace               = kubernetes_namespace.<service>.metadata[0].name
+  name                    = "<service>"
+  skip_default_rate_limit = true
+  extra_middlewares       = ["traefik-<custom>-rate-limit@kubernetescrd"]
+  tls_secret_name         = var.tls_secret_name
+}
+```
+
+### Key variables reference
+| Variable | Default | Purpose |
+|----------|---------|---------|
+| `name` | required | Ingress resource name + default hostname |
+| `host` | null | Override hostname prefix (name used if null) |
+| `full_host` | null | Override entire hostname (bypasses root_domain) |
+| `service_name` | null | K8s service name (name used if null) |
+| `port` | 80 | Backend service port |
+| `ingress_path` | ["/"] | URL paths to match |
+| `protected` | false | Adds authentik forward auth middleware |
+| `rybbit_site_id` | null | Adds rybbit analytics script injection |
+| `skip_default_rate_limit` | false | Omits default rate limiter |
+| `extra_middlewares` | [] | Additional middleware references to append |
+| `extra_annotations` | {} | Additional ingress annotations |
+| `allow_local_access_only` | false | Restricts to LAN/VPN |
+| `exclude_crowdsec` | false | Skips CrowdSec middleware |
+| `custom_content_security_policy` | null | Custom CSP header |
+
+### After migration, delete:
+1. The raw `kubernetes_ingress_v1` resource
+2. Any manually defined `kubernetes_manifest "rybbit_analytics"` (the factory creates this automatically when `rybbit_site_id` is set)
+
+## Gotchas
+
+### Duplicate module names
+If the service directory has multiple `.tf` files (e.g., `main.tf` and `frame.tf`), check
+for existing `module "ingress"` blocks. Module names must be unique within a directory.
+Use a descriptive name like `module "ingress-immich"` instead.
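+
+A quick way to spot such clashes before running `terraform validate` is sketched below.
+`duplicate_module_names` is a hypothetical helper, and it assumes each module block starts
+on its own line in the usual `module "<name>" {` form.
+
+```bash
+# Hypothetical helper (assumption): print module names that are declared more
+# than once across the .tf files directly inside a directory.
+duplicate_module_names() {
+  local dir="$1"
+  # -h drops file names so identical declarations from different files compare equal.
+  grep -hoE '^[[:space:]]*module[[:space:]]+"[^"]+"' "$dir"/*.tf 2>/dev/null \
+    | sed -E 's/^[[:space:]]*module[[:space:]]+"([^"]+)"$/\1/' \
+    | sort | uniq -d
+}
+```
+
+An empty result means no module name is reused in that directory.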
+
+### Terraform target module names with hyphens
+Module names in `terraform state list` may use hyphens (e.g., `module.real-estate-crawler`).
+When using `-target`, you must match the exact name including hyphens:
+```bash
+# Wrong - underscores:
+terraform apply -target=module.kubernetes_cluster.module.real_estate_crawler
+
+# Correct - hyphens (quoting is optional; a hyphen inside a word is not special to the shell):
+terraform apply '-target=module.kubernetes_cluster.module.real-estate-crawler'
+```
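+
+Rather than guessing between hyphens and underscores, the exact address can be looked up in
+the state. This is a sketch under assumptions: `matching_module_address` is a hypothetical
+helper, and it assumes addresses follow the `module.kubernetes_cluster.module.<name>...`
+layout used in this document, without indexed module instances.
+
+```bash
+# Hypothetical helper (assumption). Reads `terraform state list` output on stdin
+# and prints the module address whose name equals "$1" when hyphens and
+# underscores are treated as interchangeable.
+matching_module_address() {
+  local wanted
+  wanted="$(printf '%s' "$1" | tr '_' '-')"
+  awk -F. -v wanted="$wanted" '
+    $1 == "module" && $3 == "module" {
+      normalized = $4
+      gsub(/_/, "-", normalized)
+      if (normalized == wanted) print $1 "." $2 "." $3 "." $4
+    }' | sort -u
+}
+```
+
+For example, `terraform state list | matching_module_address real_estate_crawler` prints the
+address exactly as it is stored, ready to paste after `-target=`.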
+
+### Service name defaults
+The factory defaults `service_name` to `name`. If the K8s service has a different name
+than the ingress, you must explicitly set `service_name`. Common case: headscale has one
+K8s service named `headscale` with multiple ports, so the UI ingress needs
+`service_name = "headscale"` even though `name = "headscale-ui"`.
+
+### Servarr subdirectory source path
+Services under `servarr/` need `../../ingress_factory` as the source path instead of
+`../ingress_factory`.
+
+## Verification
+1. `terraform validate` - check for syntax errors
+2. `terraform plan -target=module.kubernetes_cluster.module.<service>` - verify old ingress destroyed, new created
+3. `kubectl get ingress -n <namespace>` - verify ingress exists with correct host/paths
+4. Browse the service URL to confirm accessibility
+
+## Notes
+- Services using special protocols (gRPC, mTLS, WebSocket with custom headers) should NOT
+  be migrated - keep raw `kubernetes_ingress_v1` for those
+- The factory automatically includes: rate-limit, CSP headers, CrowdSec, and entrypoint=websecure
+- When `rybbit_site_id` is set, the factory creates a `kubernetes_manifest` for the
+  rewrite-body middleware that injects the analytics script into HTML responses
--- a/.claude/skills/archived/iterative-plan-review-with-subagents/SKILL.md
+++ b/.claude/skills/archived/iterative-plan-review-with-subagents/SKILL.md
@ -0,0 +1,80 @@
+---
+name: iterative-plan-review-with-subagents
+description: |
+  Design pattern for reviewing implementation plans using parallel subagent reviewers
+  with iterative refinement. Use when: (1) designing a complex infrastructure change
+  that needs security + implementation review, (2) creating a migration plan with
+  multiple phases, (3) any plan where missing a critical issue could cause data loss
+  or security exposure. Spawns 2 reviewer agents (security + implementation), collects
+  CRITICAL/IMPORTANT/NIT findings, fixes all CRITICALs, re-runs until zero CRITICALs.
+  Typically converges in 2-3 iterations.
+author: Claude Code
+version: 1.0.0
+date: 2026-03-07
+---
+
+# Iterative Plan Review with Subagents
+
+## Problem
+Complex infrastructure plans have blind spots — security issues, implementation
+incompatibilities, race conditions, format mismatches. A single reviewer misses things.
+Multiple reviewers with different expertise catch more.
+
+## Context / Trigger Conditions
+- Writing a migration plan (e.g., secrets management, storage migration)
+- Designing a multi-phase infrastructure change
+- Any plan where a missed issue = downtime, data loss, or security exposure
+- User explicitly asks for plan review
+
+## Solution
+
+### 1. Write the plan as a markdown document
+Save to `docs/plans/YYYY-MM-DD-<topic>.md`
+
+### 2. Spawn 2 reviewer agents in parallel
+```
+Agent 1: Security reviewer
+- Focus: secret exposure, access control, key management, CI pipeline security
+- Classify each finding: CRITICAL / IMPORTANT / NIT
+
+Agent 2: Implementation reviewer
+- Focus: format compatibility, race conditions, ordering, tool behavior
+- Classify each finding: CRITICAL / IMPORTANT / NIT
+```
+
+Key: give each reviewer specific focus areas and the actual source code to check against.
+
+### 3. Consolidate and fix CRITICALs
+- Merge findings from both reviewers
+- Deduplicate (both often find the same issue)
+- Fix ALL CRITICALs in the plan document
+- Note IMPORTANTs for implementation phase
+
+### 4. Re-run reviewers on the updated plan
+- Same 2 agents, but tell them which CRITICALs were fixed
+- Ask them to VERIFY fixes are correct AND find new issues
+- Repeat until zero CRITICALs
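+
+The stop condition for this loop can be checked mechanically. The sketch below assumes the
+merged reviewer report has one finding per line and that each line carries its
+classification as a standalone word. `count_criticals` is a hypothetical helper, not part of
+any reviewer tooling.
+
+```bash
+# Hypothetical helper (assumption): count findings classified as CRITICAL in a
+# report read from stdin. -w matches the whole word only, so a sentence that
+# mentions "CRITICALs" in passing is not counted.
+count_criticals() {
+  # grep exits non-zero when nothing matches; `|| true` keeps the function usable
+  # under `set -e` while still printing 0.
+  grep -cw 'CRITICAL' || true
+}
+```
+
+Iterate until the count printed for the merged report of both reviewers is `0`.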
+
+### 5. Typical convergence
+- v1: 5-6 CRITICALs (format issues, race conditions, missing steps)
+- v2: 2-3 CRITICALs (fixes introduced new issues, missed edge cases)
+- v3: 0 CRITICALs, only IMPORTANTs remaining
+
+## Example Findings from Real Usage (SOPS migration)
+
+| Iteration | CRITICALs Found | Examples |
+|-----------|----------------|---------|
+| v1 | 6 | YAML≠HCL format, `git add .` commits secrets, no branch protection, parallel race condition |
+| v2 | 3 | `SOPS_AGE_KEY_FILE` misunderstanding, `renew-tls.yml` not updated, plan leaks in PR logs |
+| v3 | 0 | All verified fixed. 6 IMPORTANTs noted for implementation. |
+
+## Verification
+- Zero CRITICALs from both reviewers on the final iteration
+- IMPORTANTs documented as implementation notes (not blockers)
+
+## Notes
+- Use the `sonnet` model for reviewers (fast, thorough enough for review)
+- Give reviewers actual source code paths to read, not just the plan
+- Tell v2+ reviewers what was fixed so they verify, not re-discover
+- The final review should say "ONLY report CRITICALs" to avoid noise
+- This pattern cost ~$3-5 in API calls but caught issues that would have caused hours of debugging
--- a/.claude/skills/archived/k8s-container-image-caching/SKILL.md
+++ b/.claude/skills/archived/k8s-container-image-caching/SKILL.md
@ -0,0 +1,244 @@
+---
+name: k8s-container-image-caching
+description: |
+  Set up and troubleshoot container image pull-through caches in Kubernetes. Use when:
+  (1) ImagePullBackOff for non-Docker-Hub images routed through a wildcard mirror,
+  (2) containerd has deprecated `registry.mirrors."*"` catching all image pulls,
+  (3) need to add pull-through cache for a new upstream registry,
+  (4) `mirrors` cannot be set when `config_path` is provided error in containerd,
+  (5) containerd 1.6.x vs 1.7.x config_path compatibility issues,
+  (6) kubectl shows correct image tag but container runs old code,
+  (7) local registry mirror caches stale images,
+  (8) imagePullPolicy: Always doesn't force fresh pulls,
+  (9) containerd config has mirror that intercepts pulls serving stale images.
+  Covers multi-registry pull-through cache setup (Docker Registry v2) and cache bypass
+  via image digest pinning.
+author: Claude Code
+version: 1.0.0
+date: 2026-02-22
+---
+
+# Kubernetes Container Image Caching
+
+## Pull-Through Cache Setup
+
+### Problem
+
+Docker Registry v2 can only proxy **one upstream registry per instance**. A common
+misconfiguration is using a containerd wildcard mirror (`registry.mirrors."*"`) pointing
+to a single Docker Hub proxy, which breaks pulls from ghcr.io, quay.io, registry.k8s.io,
+and other registries -- they get routed to the Docker Hub proxy which can't serve them,
+causing `ImagePullBackOff`.
+
+### Context / Trigger Conditions
+
+- `ImagePullBackOff` for images from ghcr.io, quay.io, registry.k8s.io, or other non-Docker-Hub registries
+- Containerd config has deprecated `[plugins."io.containerd.grpc.v1.cri".registry.mirrors."*"]`
+- Error: `failed to load plugin io.containerd.grpc.v1.cri: invalid plugin config: mirrors cannot be set when config_path is provided`
+- Need to migrate from deprecated wildcard mirrors to modern `config_path` approach
+
+### Solution
+
+#### 1. Run one Registry v2 container per upstream
+
+Each upstream needs its own Docker Registry v2 instance on a different port:
+
+| Port | Registry | Container Name |
+|------|----------|---------------|
+| 5000 | docker.io | registry |
+| 5010 | ghcr.io | registry-ghcr |
+| 5020 | quay.io | registry-quay |
+| 5030 | registry.k8s.io | registry-k8s |
+| 5040 | reg.kyverno.io | registry-kyverno |
+
+Config for non-Docker-Hub proxies (no auth needed -- they're public):
+
+```yaml
+version: 0.1
+storage:
+  cache:
+    blobdescriptor: inmemory
+  filesystem:
+    rootdirectory: /var/lib/registry
+http:
+  addr: :5000
+proxy:
+  remoteurl: https://ghcr.io  # change per registry
+```
+
+```bash
+docker run -p 5010:5000 -d --restart always --name registry-ghcr \
+  -v /etc/docker-registry/ghcr/config.yml:/etc/docker/registry/config.yml registry:2
+```
+
+#### 2. Replace deprecated wildcard mirror with `config_path`
+
+Instead of:
+```toml
+# DEPRECATED - breaks non-Docker-Hub registries
+[plugins."io.containerd.grpc.v1.cri".registry.mirrors."*"]
+  endpoint = ["http://10.0.20.10:5000"]
+```
+
+Use the modern `config_path` approach:
+```toml
+[plugins."io.containerd.grpc.v1.cri".registry]
+  config_path = "/etc/containerd/certs.d"
+```
+
+Then create per-registry `hosts.toml` files:
+```bash
+mkdir -p /etc/containerd/certs.d/docker.io
+cat > /etc/containerd/certs.d/docker.io/hosts.toml <<'EOF'
+server = "https://registry-1.docker.io"
+
+[host."http://10.0.20.10:5000"]
+  capabilities = ["pull", "resolve"]
+EOF
+```
+
+Registries without a `hosts.toml` entry **fall through to direct pull** (no breakage).
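+
+With several upstreams, the `hosts.toml` files differ only in two values, so they can be
+generated. This is a minimal sketch: `render_hosts_toml` is a hypothetical helper, and the
+upstream URL `https://ghcr.io` in the usage note is an assumption (only the Docker Hub
+server URL is given in this document).
+
+```bash
+# Hypothetical helper (assumption): print a hosts.toml for one upstream registry.
+# $1 = upstream server URL, $2 = local pull-through mirror URL.
+render_hosts_toml() {
+  local server="$1" mirror="$2"
+  printf 'server = "%s"\n\n' "$server"
+  printf '[host."%s"]\n' "$mirror"
+  printf '  capabilities = ["pull", "resolve"]\n'
+}
+```
+
+For the ghcr.io proxy on port 5010, one possible use is
+`render_hosts_toml https://ghcr.io http://10.0.20.10:5010 > /etc/containerd/certs.d/ghcr.io/hosts.toml`
+after creating the directory with `mkdir -p`.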
+
+#### 3. Critical: `config_path` and `mirrors` cannot coexist
+
+Containerd will **refuse to start the CRI plugin** if both `config_path` and any
+`mirrors` entries exist in `config.toml`. You must remove ALL `mirrors` entries
+(including the `[plugins."...registry.mirrors"]` parent section) before setting
+`config_path`.
+
+This is especially dangerous on containerd 1.6.x (used on older nodes like k8s-master)
+where the config format is slightly different. If unsure, either:
+- Don't use config_path on that node (skip the pull-through cache)
+- Remove the entire `mirrors` section first, then add `config_path`
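+
+Because a bad combination stops the CRI plugin, it may be worth checking the file before
+restarting containerd. The check below is a sketch under assumptions:
+`check_registry_config` is a hypothetical helper that only looks at uncommented lines and
+does not parse TOML properly.
+
+```bash
+# Hypothetical pre-flight check (assumption): return non-zero when a containerd
+# config.toml sets a non-empty config_path while registry.mirrors sections remain.
+check_registry_config() {
+  local file="$1"
+  if grep -Eq '^[[:space:]]*config_path[[:space:]]*=[[:space:]]*"[^"]' "$file" &&
+     grep -Eq '^[[:space:]]*\[plugins\..*registry\.mirrors' "$file"; then
+    echo "conflict: config_path and registry.mirrors are both set in $file" >&2
+    return 1
+  fi
+  return 0
+}
+```
+
+Running `check_registry_config /etc/containerd/config.toml` before the restart gives an
+early warning instead of a node whose CRI plugin fails to load.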
+
+#### 4. Static IP for registry VM
+
+If the registry VM uses DHCP and gets the wrong IP, all mirrors break. Use static IP
+via cloud-init `ipconfig0 = "ip=10.0.20.10/24,gw=10.0.20.1"` instead of DHCP.
+
+### Verification
+
+```bash
+# Test each proxy responds
+for port in 5000 5010 5020 5030 5040; do
+  curl -s http://10.0.20.10:$port/v2/_catalog
+done
+
+# Test containerd can pull through cache
+crictl pull ghcr.io/some/image:tag
+
+# Check containerd logs for mirror usage
+journalctl -u containerd --since "5 minutes ago" | grep -i "mirror\|registry"
+```
+
+### Notes
+
+- **Fallback behavior**: If the local mirror is unreachable, containerd falls through to
+  direct pull from the upstream `server` URL. This provides graceful degradation.
+- **GC crontabs**: Add weekly garbage collection for each registry container, staggered
+  to avoid I/O spikes.
+- **Hourly restart**: Registry v2 has known memory leak issues; an hourly restart mitigates them.
+- **Cache is ephemeral**: VM recreation clears the cache. Images re-cache on demand.
+
+---
+
+## Cache Bypass / Stale Image Fix
+
+### Problem
+Kubernetes pods continue running old Docker images even after pushing new versions with
+the same tag (e.g., `:latest`). This happens when a local registry mirror caches images
+and serves stale versions, ignoring `imagePullPolicy: Always`.
+
+### Context / Trigger Conditions
+- Pod is running but application code is outdated
+- `docker push` succeeded with new layers
+- `kubectl describe pod` shows correct image tag
+- Cluster has a local registry mirror configured (e.g., in containerd config)
+- `imagePullPolicy: Always` doesn't fix the issue
+- Nodes configured with registry mirrors at `/etc/containerd/certs.d/` or similar
+
+### Solution
+
+#### 1. Get the image digest after pushing
+```bash
+docker push viktorbarzin/myimage:latest
+# Output includes: latest: digest: sha256:abc123... size: 856
+```
+
+#### 2. Use digest instead of tag in deployment
+```hcl
+# Terraform
+container {
+  # Use digest to bypass local registry cache
+  image             = "docker.io/viktorbarzin/myimage@sha256:abc123..."
+  image_pull_policy = "Always"
+  name              = "myimage"
+}
+```
+
+```yaml
+# Kubernetes YAML
+containers:
+  - name: myimage
+    image: docker.io/viktorbarzin/myimage@sha256:abc123...
+    imagePullPolicy: Always
+```
+
+#### 3. Apply and restart
+```bash
+terraform apply -target=module.kubernetes_cluster.module.myservice
+kubectl rollout restart deployment/myservice -n mynamespace
+```
+
+### Why This Works
+- A tag is a mutable pointer, so a mirror can keep serving the manifest it cached for that tag
+- A digest is content-addressed, so the node must fetch that exact manifest
+- If the mirror has not cached that digest, it has to fetch it from upstream
+- Even if it is cached, the digest guarantees the exact image version
+
+### Verification
+```bash
+# Check the pod is using the new image
+kubectl get pod -n mynamespace -o jsonpath='{.items[*].spec.containers[*].image}'
+
+# Verify application behavior reflects new code
+kubectl exec -n mynamespace deploy/myservice -- <verification-command>
+```
+
+### Example
+
+Before (problematic):
+```hcl
+image = "docker.io/viktorbarzin/audiblez-web:latest"
+```
+
+After (fixed):
+```hcl
+image = "docker.io/viktorbarzin/audiblez-web@sha256:4d0e2c839555e2229bc91a0b1273569bac88529e8b3c3cadad3c3cf9d865fa29"
+```
+
+### Notes
+- You must update the digest each time you push a new image
+- Consider automating digest extraction in CI/CD pipelines
+- This is a workaround; ideally fix the registry mirror configuration
+- To find your registry mirror config: `cat /etc/containerd/config.toml` on nodes
+- Common mirror locations: `/etc/containerd/certs.d/docker.io/hosts.toml`
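+
+For the digest automation mentioned in these notes, one possible starting point is to parse
+the `docker push` output. This is a sketch: `digest_from_push_output` is a hypothetical
+helper, and it assumes the output contains a `digest: sha256:<64 hex chars>` line like the
+one shown in step 1.
+
+```bash
+# Hypothetical helper (assumption): read `docker push` output on stdin and print
+# the first sha256 digest that appears after "digest: ".
+digest_from_push_output() {
+  sed -nE 's/.*digest: (sha256:[0-9a-f]{64}).*/\1/p' | head -n 1
+}
+```
+
+A pipeline could then build the pinned reference as
+`docker.io/viktorbarzin/myimage@$(docker push viktorbarzin/myimage:latest | digest_from_push_output)`.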
+
+### Diagnosing Registry Mirror Issues
+```bash
+# On a k8s node, check containerd config
+grep -A5 mirrors /etc/containerd/config.toml
+
+# Check if mirror is intercepting (--debug is a global crictl flag, so it goes before the subcommand)
+crictl --debug pull docker.io/library/alpine:latest 2>&1 | grep -i mirror
+
+# List cached images on node
+crictl images | grep myimage
+```
+
+---
+
+## References
+
+- [Kubernetes imagePullPolicy documentation](https://kubernetes.io/docs/concepts/containers/images/#image-pull-policy)
+- [containerd registry configuration](https://github.com/containerd/containerd/blob/main/docs/hosts.md)
--- a/.claude/skills/archived/k8s-gpu-no-nvidia-devices/SKILL.md
+++ b/.claude/skills/archived/k8s-gpu-no-nvidia-devices/SKILL.md
@ -0,0 +1,186 @@
+---
+name: k8s-gpu-no-nvidia-devices
+description: |
+  Fix for Kubernetes GPU pods showing "CUDA not supported" or no /dev/nvidia* devices
+  despite nvidia.com/gpu resource allocation. Use when: (1) container runs but torch.cuda.is_available()
+  returns False, (2) ls /dev/nvidia* shows "no matches found", (3) nvidia-smi fails inside pod
+  but works on host, (4) PyTorch/TensorFlow falls back to CPU despite GPU allocation.
+  Covers NVIDIA device plugin, time-slicing, and container runtime issues.
+author: Claude Code
+version: 1.1.0
+date: 2026-03-01
+---
+
+# Kubernetes GPU Pod - No NVIDIA Devices Found
+
+## Problem
+
+A Kubernetes pod requests GPU resources (`nvidia.com/gpu: 1`) and schedules on a GPU node,
+but inside the container there are no NVIDIA devices visible. The application falls back
+to CPU with messages like "CUDA not supported by the Torch installed!" despite running
+in a CUDA-enabled container image.
+
+## Context / Trigger Conditions
+
+- Pod shows `Running` status and is on a node with `gpu=true` label
+- `kubectl describe pod` shows GPU limit/request is satisfied
+- Inside container: `ls /dev/nvidia*` returns "no matches found"
+- Inside container: `nvidia-smi` fails or command not found
+- Application logs show: "CUDA not supported", "Switching to CPU", "torch.cuda.is_available() = False"
+- On the host node: `nvidia-smi` works fine
+
+## Solution
+
+### Step 1: Verify GPU Availability
+
+Check if other pods are consuming the GPU:
+
+```bash
+# List all pods using GPU resources (any() reports each pod once, even with several GPU containers)
+kubectl get pods -A -o json | jq -r '.items[] | select(any(.spec.containers[]; .resources.limits."nvidia.com/gpu" != null)) | "\(.metadata.namespace)/\(.metadata.name)"'
+
+# Check NVIDIA device plugin pods
+kubectl get pods -n nvidia -l app=nvidia-device-plugin
+kubectl logs -n nvidia -l app=nvidia-device-plugin --tail=50
+```
+
+### Step 2: Free GPU Resources
+
+If another workload is using the GPU, unload it:
+
+```bash
+# For Ollama specifically
+kubectl exec -n ollama deployment/ollama -- ollama stop <model_name>
+
+# Or scale down the conflicting deployment
+kubectl scale deployment/<name> -n <namespace> --replicas=0
+```
+
+### Step 3: Restart the Affected Pod
+
+After freeing GPU resources, restart the pod to get fresh device allocation:
+
+```bash
+kubectl rollout restart deployment/<name> -n <namespace>
+
+# Or delete the pod directly
+kubectl delete pod <pod-name> -n <namespace>
+```
+
+### Step 4: Verify GPU Access
+
+```bash
+# Check devices are now visible (sh -c makes the glob expand inside the container, not locally)
+kubectl exec -n <namespace> deployment/<name> -- sh -c 'ls -la /dev/nvidia*'
+
+# Test nvidia-smi
+kubectl exec -n <namespace> deployment/<name> -- nvidia-smi
+
+# Test PyTorch CUDA
+kubectl exec -n <namespace> deployment/<name> -- python3 -c "import torch; print('CUDA:', torch.cuda.is_available())"
+```
+
+## Verification
+
+After restart, you should see:
+
+```
+/dev/nvidia0
+/dev/nvidiactl
+/dev/nvidia-uvm
+/dev/nvidia-uvm-tools
+```
+
+And `nvidia-smi` should show the GPU with your container process.
+
+## Example
+
+```bash
+# Problem: ebook2audiobook shows "CUDA not supported"
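+$ kubectl exec -n ebook2audiobook deployment/ebook2audiobook -- ls /dev/nvidia*
+zsh:1: no matches found: /dev/nvidia*
+# Note: this zsh error comes from the local shell expanding the glob before kubectl runs.
+# To test inside the container, use: -- sh -c 'ls /dev/nvidia*'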
+
+# Solution: Unload Ollama model holding the GPU
+$ kubectl exec -n ollama deployment/ollama -- ollama ps
+NAME           SIZE     PROCESSOR
+qwen2.5:14b    10 GB    33%/67% CPU/GPU
+
+$ kubectl exec -n ollama deployment/ollama -- ollama stop qwen2.5:14b
+
+# Restart the affected pod
+$ kubectl rollout restart deployment/ebook2audiobook -n ebook2audiobook
+
+# Verify
+$ kubectl exec -n ebook2audiobook deployment/ebook2audiobook -- nvidia-smi
+# Should now show the Tesla T4 GPU
+```
+
+## Notes
+
+- **GPU Time-Slicing**: If using NVIDIA GPU time-slicing (configured in GPU Operator),
+  multiple pods can share a GPU. However, device injection still requires proper timing.
+
+- **Pod Scheduling Order**: Pods that start while GPU is fully allocated may not get
+  devices injected even after GPU becomes available - a restart is required.
+
+- **Container Runtime**: The NVIDIA Container Toolkit must be properly configured.
+  Issues can arise from:
+  - cgroup driver mismatch (systemd vs cgroupfs)
+  - Container updates causing device loss
+  - SELinux blocking device access
+
+- **Image Compatibility**: The container image must have CUDA libraries matching the
+  driver version. Check with `nvidia-smi` on host for driver version.
+
+- **This Cluster**: Uses NVIDIA GPU Operator with time-slicing (20 replicas per GPU).
+  GPU node is `k8s-node1` with Tesla T4.
+
+## See Also
+
+- Check GPU Operator status: `kubectl get pods -n nvidia`
+- View time-slicing config: `kubectl get configmap -n nvidia time-slicing-config -o yaml`
+
+## Automatic GPU Recovery via Liveness Probe
+
+To prevent GPU loss from requiring manual intervention, add a liveness probe that checks
+both GPU availability and application health. Example for Frigate (but applicable to any
+GPU workload):
+
+```hcl
+# Restart pod if GPU becomes unavailable or app hangs
+liveness_probe {
+  exec {
+    command = ["sh", "-c", "nvidia-smi > /dev/null 2>&1 && curl -sf http://localhost:<port>/health > /dev/null"]
+  }
+  initial_delay_seconds = 120
+  period_seconds        = 60
+  timeout_seconds       = 10
+  failure_threshold     = 3
+}
+# Allow time for GPU model loading at startup
+startup_probe {
+  http_get {
+    path = "/health"
+    port = <port>
+  }
+  period_seconds    = 10
+  failure_threshold = 30  # up to 5 minutes
+}
+```
+
+The liveness probe checks:
+- `nvidia-smi` — fails if GPU devices are no longer accessible (CUDA context corruption, device plugin issues)
+- `curl` health endpoint — fails if the application process is hung
+
+If either fails 3 times in a row (3 minutes), Kubernetes automatically restarts the pod,
+which re-acquires the GPU device through the NVIDIA device plugin.
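+
+The timing quoted above follows from `period_seconds * failure_threshold`. The small sketch
+below makes that arithmetic explicit for both probes. `probe_window_seconds` is a
+hypothetical helper, and it ignores `timeout_seconds` and `initial_delay_seconds`.
+
+```bash
+# Hypothetical helper (assumption): seconds of consecutive failures a probe
+# tolerates before Kubernetes acts on it.
+probe_window_seconds() {
+  local period="$1" threshold="$2"
+  echo $(( period * threshold ))
+}
+
+probe_window_seconds 60 3    # liveness probe: prints 180 (3 minutes)
+probe_window_seconds 10 30   # startup probe: prints 300 (5 minutes)
+```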
+
+**Important**: Always pair with a `startup_probe` when using GPU workloads — model loading
+(TensorRT, ONNX, PyTorch) can take several minutes and would trip a liveness probe
+configured with a short `initial_delay_seconds`.
+
+## References
+
+- [NVIDIA Container Toolkit Troubleshooting](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/troubleshooting.html)
+- [Kubernetes GPU Device Plugin](https://github.com/NVIDIA/k8s-device-plugin)
+- [NVIDIA GPU Operator](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/overview.html)
--- a/.claude/skills/archived/k8s-hpa-scaling-storm/SKILL.md
+++ b/.claude/skills/archived/k8s-hpa-scaling-storm/SKILL.md
@ -0,0 +1,113 @@
+---
+name: k8s-hpa-scaling-storm
+description: |
+  Fix and prevent HPA (HorizontalPodAutoscaler) scaling storms where pods scale to
+  maxReplicas uncontrollably. Use when: (1) HPA shows memory or CPU utilization at
+  200%+ causing rapid scale-up, (2) dozens or hundreds of pods created by HPA in minutes,
+  (3) cluster becomes unstable due to resource exhaustion from too many pods,
+  (4) etcd timeouts or API server crashes from pod churn, (5) adding resource requests
+  to a deployment that previously had none causes HPA to miscalculate utilization.
+  Covers emergency response and prevention patterns.
+author: Claude Code
+version: 1.0.0
+date: 2026-02-15
+---
+
+# Kubernetes HPA Scaling Storm
+
+## Problem
+When an HPA is configured with a memory or CPU utilization target but the underlying
+deployment has insufficient resource requests, the HPA calculates artificially high
+utilization percentages (e.g., 220% of a 256Mi request when actual usage is 570Mi).
+This causes the HPA to scale pods to maxReplicas (often 100) within minutes, exhausting
+cluster resources and potentially crashing etcd and the API server.
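+
+The arithmetic behind the storm can be sketched as follows. The function names are
+hypothetical. The replica rule is the documented HPA formula
+`ceil(currentReplicas * currentUtilization / targetUtilization)`, simplified here by
+ignoring the tolerance band and scale-up rate limits. The 50% target is taken from the
+example later in this document.
+
+```bash
+# Utilization as the HPA sees it: usage relative to the *request*, in percent
+# (integer division, so the result is rounded down).
+hpa_utilization_pct() {
+  local usage="$1" request="$2"
+  echo $(( usage * 100 / request ))
+}
+
+# desired = ceil(current * utilization / target), using integer arithmetic.
+hpa_desired_replicas() {
+  local current="$1" util_pct="$2" target_pct="$3"
+  echo $(( (current * util_pct + target_pct - 1) / target_pct ))
+}
+```
+
+With 570Mi used against a 256Mi request, `hpa_utilization_pct 570 256` prints `222`, and
+`hpa_desired_replicas 2 222 50` prints `9`. Because new pods also sit far above the
+request, utilization does not drop, and every evaluation multiplies the replica count
+again until maxReplicas is reached.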
+
+## Context / Trigger Conditions
+- `kubectl get hpa` shows `<unknown>/70%` or very high percentages (200%+)
+- Pod count for a deployment rapidly increases to maxReplicas
+- etcd timeout errors in `kubectl` or `terraform apply`
+- API server becomes unreachable (`connection refused` or `network is unreachable`)
+- Adding resource requests to a Helm chart that previously had none
+- Memory-based HPA targets with real usage far exceeding requests
+
+## Solution
+
+### Emergency Response (stop the storm)
+
+**Step 1: Delete the HPA immediately**
+```bash
+kubectl --kubeconfig $(pwd)/config delete hpa <hpa-name> -n <namespace>
+```
+
+**Step 2: Scale the deployment down**
+```bash
+kubectl --kubeconfig $(pwd)/config scale deployment <name> -n <namespace> --replicas=2
+```
+
+**Step 3: Wait for pods to terminate and cluster to stabilize**
+```bash
+# Count pods (re-run to watch the count decrease; --no-headers keeps the header row out of the count)
+kubectl --kubeconfig $(pwd)/config get pods -n <namespace> -l <label> --no-headers | wc -l
+```
+
+If the API server is unresponsive, wait 3-5 minutes for it to self-recover. The kubelet
+will restart static pods (etcd, kube-apiserver) automatically.
+
+### Prevention
+
+**Rule 1: Set resource requests to match actual usage**
+Before enabling HPA, check actual resource consumption:
+```bash
+kubectl top pods -n <namespace> -l <label>
+```
+Set requests to the baseline (idle) usage, not the minimum possible value.
+
+**Rule 2: Set reasonable maxReplicas**
+Never use maxReplicas > 10 unless you've verified the cluster can handle it.
+Default of 100 is almost never appropriate for a home/small cluster.
+
+**Rule 3: Prefer CPU-only HPA targets**
+Memory-based scaling is problematic because:
+- Memory usage grows over time and rarely decreases
+- Memory-based scaling creates pods that never scale down
+- CPU is more responsive to load changes
+
+**Rule 4: Disable the HPA before changing resource requests**
+If adding resource requests to a deployment managed by HPA, temporarily disable
+the HPA first, set the requests, verify utilization is reasonable, then re-enable.
+
+## Cascade Effects
+A scaling storm can cause:
+1. etcd storage exhaustion (too many pod objects)
+2. API server OOM or connection limits
+3. VPN/network connectivity loss (if VPN runs in the cluster)
+4. Kyverno webhook failures (admission controller overwhelmed)
+5. Other pods evicted or unable to schedule
+
+## Verification
+- `kubectl get hpa -n <namespace>` shows reasonable utilization (< 100%)
+- Pod count is stable at expected replicas
+- `kubectl get nodes` responds promptly
+- No etcd timeout errors
+
+## Example
+```bash
+# Observed: HPA scaling Collabora to 100 pods
+$ kubectl get hpa -n nextcloud
+NAME                 TARGETS                          MINPODS  MAXPODS  REPLICAS
+nextcloud-collabora  cpu: 0%/70%, memory: 220%/50%   2        100      83
+
+# Emergency fix
+$ kubectl delete hpa nextcloud-collabora -n nextcloud
+$ kubectl scale deployment nextcloud-collabora -n nextcloud --replicas=2
+
+# Root cause: 256Mi memory request, actual usage 570Mi
+# Fix: increase request to 1Gi or disable memory target
+```
+
+## Notes
+- If the HPA is managed by a Helm chart, deleting it via kubectl is temporary—the next
+  Helm upgrade will recreate it. You must also update the Helm values.
+- In this project, Collabora was ultimately disabled in favor of OnlyOffice to avoid
+  the HPA issue entirely.
+- See also: `helm-stuck-release-recovery` for fixing Helm releases broken by the storm.
--- a/.claude/skills/archived/k8s-nfs-mount-troubleshooting/SKILL.md
+++ b/.claude/skills/archived/k8s-nfs-mount-troubleshooting/SKILL.md
@ -0,0 +1,235 @@
+---
+name: k8s-nfs-mount-troubleshooting
+description: |
+  Debug Kubernetes NFS volume mount failures. Use when: (1) Pod stuck in ContainerCreating
+  for extended time, (2) kubectl describe shows "MountVolume.SetUp failed" with NFS errors,
+  (3) Error message shows "Protocol not supported" or "mount.nfs: access denied",
+  (4) NFS volume defined in pod spec but container won't start, (5) Container starts but
+  gets "Permission denied" writing to NFS volume (non-root container UID mismatch),
+  (6) CronJob or init container fails silently when writing to NFS, (7) Pod shows Running
+  1/1 but service is unresponsive after a node reboot — stale NFS mount causes frozen
+  processes with zero listening sockets. Common root causes are missing NFS export on the
+  server, UID mismatch for non-root containers, and stale mounts after node reboots.
+author: Claude Code
+version: 1.2.0
+date: 2026-02-28
+---
+
+# Kubernetes NFS Mount Troubleshooting
+
+## Problem
+Pods with NFS volumes get stuck in `ContainerCreating` state indefinitely. The error
+messages from `kubectl describe pod` can be misleading, showing protocol or permission
+errors when the actual issue is that the NFS export doesn't exist.
+
+## Context / Trigger Conditions
+- Pod status shows `ContainerCreating` for more than 1-2 minutes
+- `kubectl describe pod` shows events like:
+  - `MountVolume.SetUp failed for volume "data" : mount failed: exit status 32`
+  - `mount.nfs: Protocol not supported`
+  - `mount.nfs: access denied by server`
+- Pod spec includes an NFS volume mount
+- Other pods on the same node work fine
+
+## Solution
+
+### Step 1: Identify the NFS path
+```bash
+kubectl describe pod -n <namespace> <pod-name> | grep -A5 "Volumes:"
+```
+Look for the NFS server and path (e.g., `10.0.10.15:/mnt/main/myservice`)
+
+### Step 2: Verify the exported directory exists on the NFS server
+SSH to the NFS server and check that the path is present:
+```bash
+ssh root@<nfs-server> "ls -la /mnt/main/myservice"
+```
+
+### Step 3: If directory doesn't exist, create it
+```bash
+ssh root@<nfs-server> "mkdir -p /mnt/main/myservice && chmod 777 /mnt/main/myservice"
+```
+
+### Step 4: Add to NFS exports (TrueNAS specific)
+For TrueNAS, add the path to the NFS share configuration:
+1. Add directory to `scripts/nfs_directories.txt`
+2. Run `scripts/nfs_exports.sh` to update the share via API
+
+### Step 5: Restart the pod
+```bash
+kubectl delete pod -n <namespace> -l app=<app-label>
+```
+The deployment will create a new pod that should now mount successfully.
+
+## Verification
+```bash
+kubectl get pods -n <namespace>
+# Should show 1/1 Running instead of 0/1 ContainerCreating
+
+kubectl exec -n <namespace> <pod-name> -- ls -la /app/data
+# Should show the mounted directory contents
+```
+
+## Example
+**Symptom:**
+```
+Events:
+  Warning  FailedMount  55s (x13 over 11m)  kubelet  MountVolume.SetUp failed for volume "data" : mount failed: exit status 32
+  Mounting command: mount
+  Mounting arguments: -t nfs 10.0.10.15:/mnt/main/resume /var/lib/kubelet/pods/.../data
+  Output: mount.nfs: Protocol not supported
+```
+
+**Root Cause:** The directory `/mnt/main/resume` didn't exist on the TrueNAS server.
+
+**Fix:**
+```bash
+ssh root@10.0.10.15 'mkdir -p /mnt/main/resume && chmod 777 /mnt/main/resume'
+# Then add to NFS exports and restart pod
+```
+
+## Notes
+- The "Protocol not supported" error is misleading - it often means the export path doesn't exist
+- Always check the NFS server first before investigating protocol/firewall issues
+- For TrueNAS, the NFS share must be updated via API/UI after creating new directories
+- NFSv3 vs NFSv4 issues are rare in modern setups; missing paths are more common
+- Check that the NFS client packages are installed on Kubernetes nodes if this is a new cluster
+
+## Variant: Non-Root Container UID Permission Denied
+
+### Problem
+Container starts and mounts NFS successfully, but gets "Permission denied" when
+writing files. The pod appears healthy but operations fail silently.
+
+### Trigger Conditions
+- Container logs show "Permission denied" or "client returned ERROR on write"
+- Pod is Running (not stuck in ContainerCreating)
+- NFS directory exists and is mounted, but owned by root (uid 0)
+- Container image runs as a non-root user (e.g., `curlimages/curl` runs as uid 101)
+- CronJobs or init containers that write to NFS fail with no obvious error
+
+### Common Non-Root Container UIDs
+| Image | UID | User |
+|-------|-----|------|
+| `curlimages/curl` | 101 | curl_user |
+| `nginx` (unprivileged) | 101 | nginx |
+| `node` | 1000 | node |
+| `python` (slim) | 0 | root (safe) |
+| `grafana/grafana` | 472 | grafana |
+
+### Solution
+Fix permissions on the NFS server:
+```bash
+# Option 1: World-writable (simplest, suitable for non-sensitive data)
+ssh root@10.0.10.15 "chmod -R 777 /mnt/main/<service>/<subdir>"
+
+# Option 2: Match container UID (more secure)
+ssh root@10.0.10.15 "chown -R <uid>:<gid> /mnt/main/<service>/<subdir>"
+
+# Option 3: Run the container as root. This is a pod spec change (YAML), not a shell command:
+#   spec:
+#     securityContext:
+#       runAsUser: 0
+```
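+
+To pair with Option 2, the pod can pin its UID/GID so that it matches the ownership set on
+the export. A sketch, assuming a hypothetical UID/GID of 1000 (use whatever `id` reports
+inside the container, or whatever was passed to `chown`):
+```yaml
+spec:
+  securityContext:
+    runAsUser: 1000   # assumed UID; must match the owner of the NFS directory
+    runAsGroup: 1000  # assumed GID
+```
+This only helps if the image tolerates running under that UID.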
+
+### Debugging
+```bash
+# Check what UID the container runs as
+kubectl exec -n <namespace> <pod> -- id
+
+# Test write access from inside container
+kubectl exec -n <namespace> <pod> -- sh -c 'echo test > /path/to/nfs/testfile'
+
+# Check NFS directory ownership on server
+ssh root@10.0.10.15 "ls -la /mnt/main/<service>/"
+```
+
+## Variant: Stale NFS Mounts After Node Reboot (Ghost Running Pods)
+
+### Problem
+After a node reboot (e.g., from kured rolling kernel updates), pods are rescheduled and
+show `Running 1/1` status, but the application process is frozen/hung. The service is
+completely unresponsive despite appearing healthy to Kubernetes.
+
+### Trigger Conditions
+- Node was recently rebooted (check `uptime` on the node or the kured logs; the AGE column of `kubectl get nodes` does not reset on reboot)
+- Pod shows `Running 1/1` with 0 restarts (looks perfectly healthy)
+- Service is unresponsive — Uptime Kuma or curl shows timeout/connection refused
+- `kubectl exec <pod> -- ss -tlnp` shows **zero listening sockets** (the process started but is hung)
+- Pod uses NFS volumes (inline `nfs {}` or PVC backed by NFS)
+- Multiple pods across different namespaces all exhibit the same symptom simultaneously
+- `kubectl describe pod` shows no warnings or errors — everything looks normal
+
+### Root Cause
+When a node reboots, the NFS client mounts go stale. If the pod is rescheduled to the
+same or different node before NFS fully recovers, the application process starts but
+immediately hangs when it tries to access the NFS-mounted filesystem. The process is
+stuck in an uninterruptible I/O wait (D state) but Kubernetes sees the container as
+running because the PID exists and liveness probes (if any) may not exercise the NFS path.
+
+### Solution
+Force-delete the affected pods to trigger a clean reschedule with fresh NFS mounts:
+
+```bash
+# Identify hung pods — Running but no listening sockets
+kubectl exec -n <namespace> <pod> -- ss -tlnp 2>/dev/null
+# If output is empty or shows no expected ports, the pod is hung
+
+# Force-delete to skip graceful shutdown (hung process won't respond to SIGTERM)
+kubectl delete pod -n <namespace> <pod> --force --grace-period=0
+
+# The deployment controller creates a new pod with fresh NFS mounts
+kubectl get pods -n <namespace> -w
+```
+
+For bulk remediation after a cluster-wide event:
+```bash
+# Find all pods with NFS volumes that might be hung
+# Check each service's expected port — if ss -tlnp shows nothing, force-delete
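+for ns in calibre stirling-pdf send speedtest n8n paperless-ngx; do
+  pod=$(kubectl get pod -n "$ns" -o name | head -1)   # returns pod/<name>
+  [ -z "$pod" ] && continue                           # namespace has no pods
+  # ss prints one header line, so <= 1 line means no listening sockets.
+  # Caveat: an image without ss also yields 0 lines; confirm before deleting.
+  sockets=$(kubectl exec -n "$ns" "$pod" -- ss -tlnp 2>/dev/null | wc -l)
+  if [ "$sockets" -le 1 ]; then
+    echo "HUNG: $ns/$pod (no listening sockets)"
+    kubectl delete "$pod" -n "$ns" --force --grace-period=0
+  fi
+done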
+```
+
+### Verification
+```bash
+# New pod should have listening sockets
+kubectl exec -n <namespace> <new-pod> -- ss -tlnp
+# Should show the application's expected port (e.g., *:8080)
+
+# Service should respond
+kubectl exec -n <namespace> <new-pod> -- curl -sI http://localhost:<port>/
+# Should return HTTP response
+```
+
+### Key Diagnostic Insight
+The critical signal is **Running 1/1 but zero listening sockets**. Normal healthy pods
+always have at least one listening socket for their application port. If `ss -tlnp`
+returns nothing, the process is hung on a stale NFS mount, not crashed — that's why
+Kubernetes thinks it's fine.
+
+### Prevention
+- Add **liveness probes** that hit the application's HTTP endpoint (not just TCP connect):
+  ```hcl
+  liveness_probe {
+    http_get {
+      path = "/"
+      port = 8080
+    }
+    initial_delay_seconds = 60
+    period_seconds        = 30
+    timeout_seconds       = 5
+  }
+  ```
+- This ensures Kubernetes detects hung pods and restarts them automatically.
+
+## See Also
+- **nfsv4-idmapd-uid-mapping** — All UIDs show as 65534 (nobody) inside containers. Different from permission denied; the UIDs are wrong, not the permissions.
+- TrueNAS NFS configuration documentation
+- Kubernetes NFS volume documentation
+- k8s-limitrange-oom-silent-kill (for OOM issues often confused with NFS hangs)
--- a/.claude/skills/archived/kubelet-static-pod-manifest-update/SKILL.md
+++ b/.claude/skills/archived/kubelet-static-pod-manifest-update/SKILL.md
@ -0,0 +1,109 @@
+---
+name: kubelet-static-pod-manifest-update
+description: |
+  Force kubelet to pick up changes to static pod manifests in /etc/kubernetes/manifests/.
+  Use when: (1) edited kube-apiserver.yaml but the running process still has old flags,
+  (2) kubelet restart doesn't pick up manifest changes, (3) touching the manifest file
+  doesn't trigger pod recreation, (4) killing the API server process results in the
+  same old args on restart, (5) the pod's config.hash annotation doesn't match the
+  file's hash. Requires a full cycle: remove manifest, stop kubelet, remove containers,
+  re-add manifest, start kubelet.
+author: Claude Code
+version: 1.0.0
+date: 2026-02-17
+---
+
+# Kubelet Static Pod Manifest Update
+
+## Problem
+After editing a static pod manifest (e.g., `/etc/kubernetes/manifests/kube-apiserver.yaml`
+to add OIDC or audit flags), kubelet continues running the pod with the old configuration.
+Standard approaches like `touch`, `systemctl restart kubelet`, or `kubectl delete pod`
+do not force kubelet to reconcile the new manifest.
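+
+For orientation, the flags in question live in the container `command` list of the
+manifest. A trimmed sketch of that fragment follows; the issuer URL and client ID are
+placeholders, not values taken from this cluster:
+```yaml
+# /etc/kubernetes/manifests/kube-apiserver.yaml (excerpt)
+spec:
+  containers:
+    - name: kube-apiserver
+      command:
+        - kube-apiserver
+        # existing flags stay unchanged
+        - --oidc-issuer-url=https://auth.example.com/application/o/kubernetes/  # placeholder
+        - --oidc-client-id=kubernetes  # placeholder
+        - --oidc-username-claim=email
+```
+These `command` entries are what should appear in the running process once the update has
+actually been applied.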
+
+## Context / Trigger Conditions
+- Edited `/etc/kubernetes/manifests/kube-apiserver.yaml` (or other static pod manifests)
+- The running process (`ps aux | grep kube-apiserver`) shows old flags
+- `kubectl get pod -n kube-system kube-apiserver-<node-name> -o jsonpath='{.metadata.annotations.kubernetes\.io/config\.hash}'` returns a stale hash
+- Any of these actions failed to apply the changes:
+  - `touch /etc/kubernetes/manifests/kube-apiserver.yaml`
+  - `systemctl restart kubelet`
+  - `kubectl delete pod -n kube-system kube-apiserver-<node-name>`
+  - Killing the API server process directly
+
+## Root Cause
+Kubelet maintains an internal cache of static pod specs keyed by a hash of the manifest.
+When the manifest changes, kubelet should detect the new hash and recreate the pod.
+However, in practice (observed on Kubernetes 1.34.x), kubelet can get stuck with the
+old hash if:
+- The pod's mirror object in the API server still exists with the old hash
+- Kubelet's internal pod cache wasn't cleared between restarts
+- The container runtime (containerd) still has the old container running
+
+## Solution
+
+Full restart cycle on the master node:
+
+```bash
+# 1. Back up the manifest
+sudo cp /etc/kubernetes/manifests/kube-apiserver.yaml /tmp/kube-apiserver.yaml.bak
+
+# 2. Remove the manifest (kubelet will stop the pod)
+sudo rm /etc/kubernetes/manifests/kube-apiserver.yaml
+
+# 3. Stop kubelet
+sudo systemctl stop kubelet
+
+# 4. Wait for the API server container to stop
+sleep 5
+
+# 5. Force-remove any remaining API server containers
+sudo crictl rm -f $(sudo crictl ps -aq --name kube-apiserver 2>/dev/null) 2>/dev/null
+
+# 6. Re-add the manifest (with your changes)
+sudo cp /tmp/kube-apiserver.yaml.bak /etc/kubernetes/manifests/kube-apiserver.yaml
+
+# 7. Start kubelet
+sudo systemctl start kubelet
+
+# 8. Wait for API server to come up (30-60 seconds)
+sleep 45
+
+# 9. Verify new flags are active
+sudo cat /proc/$(pgrep -f 'kube-apiserver --' | head -1)/cmdline | tr '\0' '\n' | grep 'your-new-flag'
+```
+
+**Critical:** The order matters. Removing the manifest BEFORE stopping kubelet ensures
+kubelet processes the removal. Clearing the containers while kubelet is stopped ensures no
+stale state survives. Finally, re-adding the manifest and then starting kubelet triggers a
+fresh pod creation.
+
+## What Does NOT Work
+
+| Approach | Why it fails |
+|----------|-------------|
+| `touch manifest.yaml` | Kubelet may not detect mtime-only changes |
+| `systemctl restart kubelet` | Kubelet reuses cached pod spec if hash matches |
+| `kubectl delete pod` | Deletes mirror pod but kubelet recreates from cached spec |
+| `kill <apiserver-pid>` | Container runtime restarts the same container with old args |
+| Moving manifest away and back without stopping kubelet | Kubelet may cache the old spec in memory |
+
+## Verification
+
+```bash
+# Check the running process has new flags
+ps aux | grep kube-apiserver | grep -v grep | grep 'your-new-flag'
+
+# Check the config hash changed
+kubectl get pod -n kube-system kube-apiserver-$(hostname) \
+  -o jsonpath='{.metadata.annotations.kubernetes\.io/config\.hash}'
+
+# Check API server logs for successful startup
+kubectl logs -n kube-system kube-apiserver-$(hostname) | tail -5
+```
+
+## Notes
+- This applies to ALL static pods, not just kube-apiserver (etcd, controller-manager, scheduler)
+- The cluster will be briefly unavailable during the restart (30-60 seconds)
+- On single-master clusters, kubectl commands will fail during the restart — use `sudo kubectl --kubeconfig=/etc/kubernetes/admin.conf` from the master
+- Always validate the YAML before removing the manifest: `python3 -c "import yaml; yaml.safe_load(open('/etc/kubernetes/manifests/kube-apiserver.yaml'))"`
+- See also: `authentik-oidc-kubernetes` skill for the full OIDC setup context
--- a/.claude/skills/archived/local-llm-gpu-selection/SKILL.md
+++ b/.claude/skills/archived/local-llm-gpu-selection/SKILL.md
@ -0,0 +1,143 @@
+---
+name: local-llm-gpu-selection
+description: |
+  Guide for selecting GPUs and hardware for local LLM inference on Dell R730 and
+  comparing to Apple Silicon alternatives. Use when: (1) user asks about running
+  local models (Ollama, llama.cpp), (2) user asks which GPU to buy for LLMs,
+  (3) user wants to compare local models to Claude for coding, (4) user asks about
+  quantized model selection, (5) user asks about Mac Mini/Studio vs GPU server for
+  LLMs. Covers VRAM requirements, memory bandwidth as key metric, R730 GPU compatibility,
+  multi-GPU considerations, and realistic quality comparisons to Claude models.
+author: Claude Code
+version: 1.0.0
+date: 2025-06-11
+---
+
+# Local LLM GPU Selection & Performance Guide
+
+## Problem
+Choosing the right hardware for local LLM inference requires understanding the
+relationship between VRAM capacity, memory bandwidth, GPU compatibility with
+server chassis, and realistic model quality expectations.
+
+## Context / Trigger Conditions
+- User asks about running quantized models locally (Ollama, llama.cpp)
+- User wants to know which GPU fits their server (Dell R730 or similar 2U)
+- User asks about Apple Silicon (Mac Mini/Studio) vs datacenter GPUs for LLMs
+- User wants to compare local model quality to Claude (Opus/Sonnet/Haiku) for coding
+
+## Key Principle: Memory Bandwidth Is Everything
+
+LLM token generation is **memory-bandwidth bound**, not compute bound. The formula:
+```
+approx tokens/sec = memory_bandwidth_GB_s / model_size_GB
+```
+This is why Apple Silicon (high bandwidth unified memory) competes with datacenter GPUs
+despite having less raw compute.
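+
+As a rough worked example using bandwidth and VRAM figures from the tables in this guide
+(an upper bound only; real throughput is lower once KV cache reads and runtime overhead
+are included):
+```
+Tesla T4,        8B  Q4_K_M:  320 GB/s / 5 GB   ≈ 64 tokens/sec
+Tesla P40,       32B Q4_K_M:  346 GB/s / 20 GB  ≈ 17 tokens/sec
+Mac Mini M4 Pro, 32B Q4_K_M:  273 GB/s / 20 GB  ≈ 14 tokens/sec
+```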
+
+## VRAM Requirements by Model Size
+
+| Model Size | Quant | VRAM Needed | Examples |
+|------------|-------|-------------|----------|
+| 7-8B | Q4_K_M | ~5 GB | Llama 3.1 8B, Mistral 7B |
+| 7-8B | Q8_0 | ~8 GB | |
+| 13-14B | Q4_K_M | ~8 GB | Qwen 2.5 Coder 14B |
+| 22-24B | Q4_K_M | ~13-14 GB | Mistral Small, Codestral |
+| 32B | Q4_K_M | ~20 GB | Qwen 2.5 Coder 32B |
+| 32B | Q8_0 | ~34 GB | |
+| 70B | Q4_K_M | ~40 GB | Llama 3.1 70B |
+| 70B | Q8_0 | ~70 GB | |
+
+Add ~1-2 GB overhead for KV cache and context. Longer conversations use more.
+
+## Dell R730 GPU Compatibility
+
+### Constraints
+- **2U chassis**: Full-height cards fit, but limited to dual-slot width
+- **PCIe 3.0 x16 slots**: 2-3 usable slots depending on riser configuration
+- **Power**: Needs Dell GPU power cable (P/N 0D4J0T) for GPUs >75W TDP
+- **PSU**: Check wattage headroom (dual 750W or 1100W typical)
+
+### Compatible GPUs
+
+**No external power needed (<=75W):**
+- Tesla T4: 16 GB, 320 GB/s, 70W — best drop-in option
+- Tesla P4: 8 GB, 192 GB/s, 75W — too little VRAM for modern LLMs
+- NVIDIA L4: 24 GB, 300 GB/s, 72W — T4 successor, Ada Lovelace, expensive
+- NVIDIA A2: 16 GB, 200 GB/s, 60W — worse than T4 in every way, avoid
+
+**Requires power cable (>75W):**
+- Tesla P40: 24 GB, 346 GB/s, 250W — best value per GB
+- Tesla V100 PCIe: 32 GB, 900 GB/s, 250W — excellent bandwidth
+- Tesla P100 PCIe: 16 GB, 732 GB/s, 250W — same VRAM as T4, not worth it
+
+**Won't fit or not worth it:**
+- RTX 3090/4090: Too thick (3-slot), too long
+- A100: Fits physically but very expensive
+- Any consumer RTX: Generally too large for 2U
+
+### Multi-GPU Considerations
+- Ollama splits model layers across GPUs automatically
+- PCIe 3.0 cross-GPU transfer adds ~30-40% latency penalty
+- Mismatched GPUs (e.g., T4 + P40) work but the slower card bottlenecks
+- R730 PCIe 3.0 halves the host link speed of PCIe 4.0 cards such as the L4; on-card memory bandwidth is unaffected
+
+## Apple Silicon Comparison
+
+Apple Silicon unified memory means ALL system RAM = VRAM with no bus penalty.
+
+| Device | Memory | Bandwidth | Advantage |
+|--------|--------|-----------|-----------|
+| Mac Mini M4 Pro 48 GB | 48 GB | 273 GB/s | Silent, 25W, no PCIe penalty |
+| Mac Studio M4 Max 128 GB | 128 GB | 546 GB/s | Run 100B+ models |
+| Mac Studio M3 Ultra | 96-512 GB | 819 GB/s | Run anything |
+
+A Mac Mini M4 Pro 48GB often matches or beats a T4+L4 multi-GPU setup for
+LLM inference due to zero cross-GPU overhead and high unified bandwidth.
+
+## Best Coding Models (for Ollama)
+
+For coding tasks specifically, prefer dedicated coding models:
+1. **Qwen 2.5 Coder 32B** — best open-source coding model in this size class
+2. **Codestral 22B** — Mistral's dedicated coding model
+3. **DeepSeek Coder V2** — good quality, efficient
+4. **Llama 3.1 70B** — strong general purpose but needs ~40 GB
+
+## Realistic Quality Comparison to Claude
+
+For Claude Code-style agentic coding workflows:
+
+| Capability | Opus/Sonnet | Haiku | Qwen 2.5 Coder 32B | 70B General |
+|-----------|-------------|-------|---------------------|-------------|
+| Single function gen | Excellent | Good | Good | Decent |
+| Multi-file refactoring | Excellent | Decent | Weak | Weak |
+| Tool use / agentic loops | Excellent | Good | Poor | Poor |
+| Long context (large codebases) | Excellent | Good | Weak | Weak |
+
+Local models work for simple completions and code questions. They struggle badly
+with Claude Code's complex multi-step tool-use workflows, long context windows,
+and self-correction capabilities.
+
+## Quantization Quality Guide
+
+From best to worst quality (and largest to smallest):
+- FP16: Full precision, baseline quality
+- Q8_0: Near-lossless, ~50% size reduction
+- Q6_K: Minimal quality loss
+- Q5_K_M: Good balance
+- Q4_K_M: **Recommended default** — best quality/size tradeoff
+- Q3_K_M: Noticeable degradation on complex reasoning
+- Q2_K: Significant quality loss, emergency only
+
+## Verification
+- Check GPU compatibility: `lspci | grep -i nvidia` on the host
+- Check available VRAM: `nvidia-smi` inside the GPU VM
+- Check model fit: `ollama ps` shows the loaded model's size and its CPU/GPU split
+- Check inference speed: `ollama run <model> --verbose` prints the eval rate in tokens/sec
+
+## Notes
+- GPU prices fluctuate significantly in the used market; check current prices
+- The T4 is PCIe 3.0 only; newer PCIe 4.0 GPUs in PCIe 3.0 slots run at reduced host-link bandwidth
+- Power consumption matters for 24/7 homelab use (electricity cost)
+- For Claude Code specifically, API-based Claude models remain significantly
+  superior to any local model for agentic coding workflows
--- a/.claude/skills/archived/loki-helm-deployment-pitfalls/SKILL.md
+++ b/.claude/skills/archived/loki-helm-deployment-pitfalls/SKILL.md
@ -0,0 +1,143 @@
+---
+name: loki-helm-deployment-pitfalls
+description: |
+  Fix common Loki Helm chart deployment failures on Kubernetes with Terraform.
+  Use when: (1) Loki pod fails with "mkdir: read-only file system" for compactor
+  or ruler paths, (2) Helm chart fails with "Helm test requires the Loki Canary
+  to be enabled", (3) Helm install fails with "cannot re-use a name that is still
+  in use" after a failed atomic deploy, (4) PV stuck in Released state after failed
+  Helm install, (5) "entry too far behind" errors flooding Loki logs after initial
+  Alloy deployment. Covers single-binary mode with filesystem storage on NFS.
+author: Claude Code
+version: 1.0.0
+date: 2026-02-13
+---
+
+# Loki Helm Chart Deployment Pitfalls
+
+## Problem
+Deploying the Grafana Loki Helm chart in single-binary mode with Terraform hits
+multiple non-obvious failures that aren't documented together.
+
+## Context / Trigger Conditions
+- Deploying Loki via `helm_release` in Terraform
+- Using `deploymentMode: SingleBinary` with filesystem storage on NFS
+- First-time deployment or redeployment after failures
+
+## Pitfall 1: Read-Only Root Filesystem
+
+**Error:** `mkdir /loki/compactor: read-only file system`
+
+**Cause:** The Loki Helm chart runs containers with a read-only root filesystem
+for security. The compactor `working_directory` and ruler `rule_path` default to
+paths under `/loki/` which is on the read-only root FS.
+
+**Fix:** Use paths under `/var/loki/` — the Helm chart mounts the persistence
+volume there:
+```yaml
+compactor:
+  working_directory: /var/loki/compactor    # NOT /loki/compactor
+ruler:
+  rule_path: /var/loki/scratch              # NOT /loki/scratch
+```
+
+## Pitfall 2: Canary Required
+
+**Error:** `Helm test requires the Loki Canary to be enabled`
+
+**Cause:** The Loki Helm chart's validation template requires `lokiCanary.enabled`
+to be true while the chart's Helm test is enabled (the default). Disabling the canary
+on its own fails validation.
+
+**Fix:** Leave `lokiCanary` enabled (default). You can disable `gateway`,
+`chunksCache`, and `resultsCache` to reduce resource usage:
+```yaml
+gateway:
+  enabled: false
+chunksCache:
+  enabled: false
+resultsCache:
+  enabled: false
+# Do NOT add: lokiCanary: enabled: false
+```
+
+## Pitfall 3: Stale Helm Release After Failed Atomic Deploy
+
+**Error:** `cannot re-use a name that is still in use`
+
+**Cause:** When `atomic = true` and the deploy fails, Helm rolls back but
+sometimes leaves a stale release secret in Kubernetes. Terraform then can't
+create a new release with the same name.
+
+**Fix:** Delete the stale Helm secret:
+```bash
+kubectl delete secret -n monitoring sh.helm.release.v1.loki.v1
+```
+Also consider removing `atomic = true` for initial deployments and adding it
+back after the first successful install. Use a longer `timeout` (600s+) for
+first deploy since image pulls take time.
+
+## Pitfall 4: PV Stuck in Released State
+
+**Symptom:** PV shows `Released` status, PVC can't bind, Loki pod stuck in Pending.
+
+**Cause:** After a failed Helm deploy, the PVC is deleted but the PV retains a
+`claimRef` to the old PVC. New PVCs can't bind to a `Released` PV.
+
+**Fix:** Clear the stale claimRef:
+```bash
+kubectl patch pv loki --type json -p '[{"op": "remove", "path": "/spec/claimRef"}]'
+```
+The PV will transition from `Released` to `Available` and can be bound again.
+
+## Pitfall 5: "Entry Too Far Behind" Log Spam
+
+**Error:** `entry too far behind, entry timestamp is: ... oldest acceptable timestamp is: ...`
+
+**Cause:** Alloy reads all historical log files from the Kubernetes API on first
+startup. Old entries are rejected by Loki's ingester because they're behind the
+newest entry for that stream.
+
+**Fix:** This is harmless and self-resolving — Alloy catches up to present time
+and errors stop. To clear immediately:
+```bash
+kubectl rollout restart ds -n monitoring alloy
+```
+After restart, Alloy tails from approximately "now" for each container.
+
+## Pitfall 6: Alertmanager Service Name
+
+**Symptom:** Loki ruler alerts never fire despite correct LogQL rules.
+
+**Cause:** The Prometheus Helm chart names the Alertmanager service
+`prometheus-alertmanager`, not `alertmanager`. Using the wrong name causes
+silent alert delivery failures.
+
+**Fix:**
+```yaml
+ruler:
+  alertmanager_url: http://prometheus-alertmanager.monitoring.svc.cluster.local:9093
+```
+Verify the actual service name: `kubectl get svc -n monitoring | grep alertmanager`
+
+## Verification
+```bash
+# Loki pod running
+kubectl get pods -n monitoring -l app.kubernetes.io/name=loki
+
+# Loki receiving logs
+kubectl port-forward -n monitoring svc/loki 3100:3100 &
+curl -s 'http://localhost:3100/loki/api/v1/labels'
+# Should return JSON with namespace, pod, container labels
+
+# PV bound
+kubectl get pv loki
+# STATUS should be "Bound"
+```
+
+## Notes
+- Always check PV status before retrying a failed deploy
+- The Loki Helm chart creates many components by default (gateway, canary,
+  memcached caches) — disable what you don't need for single-binary mode
+- WAL directory can be on tmpfs (emptyDir with `medium: Memory`) for
+  disk-friendly setups, but data is lost on pod crash
+- See also: `helm-release-force-rerender` for Helm values not updating resources
--- a/.claude/skills/archived/music-assistant-librespot-wrong-account/SKILL.md
+++ b/.claude/skills/archived/music-assistant-librespot-wrong-account/SKILL.md
@ -0,0 +1,148 @@
+---
+name: music-assistant-librespot-wrong-account
+description: |
+  Fix for Music Assistant Spotify playback failing with "librespot does not support free
+  accounts" even when the Spotify account has Premium. Use when: (1) Songs load for 1-2
+  seconds then auto-pause, (2) Music Assistant logs show "librespot does not support free
+  accounts" followed by FFmpeg "Invalid data found when processing input" exit code 183,
+  (3) Spotify provider shows "Successfully logged in" but streaming fails. Root cause is
+  stale librespot credential cache pointing to a different (free-tier) Spotify account.
+author: Claude Code
+version: 1.0.0
+date: 2026-02-21
+---
+
+# Music Assistant Librespot Wrong Account / Stale Credentials
+
+## Problem
+Music Assistant (MASS) Spotify playback fails immediately — songs appear to load for 1-2
+seconds then auto-pause. Every track is marked "unplayable". The error log shows librespot
+rejecting the account as "free" despite the configured Spotify account having Premium.
+
+## Context / Trigger Conditions
+- Music Assistant addon on Home Assistant (tested with v2.7.8, addon `d5369777_music_assistant`)
+- Symptoms: Song starts loading, pauses after 1-2 seconds, skipped as "unplayable"
+- Log pattern (all three appear together on every play attempt):
+  ```
+  WARNING [music_assistant.spotify] [librespot] librespot does not support "free" accounts.
+  WARNING [music_assistant.audio.media_stream] Error opening input: Invalid data found when processing input
+  ERROR [music_assistant.streams] AudioError while streaming queue item ... FFMpeg exited with code 183
+  ```
+- OAuth login succeeds: `Successfully logged in to Spotify as <Name>`
+- But librespot streaming fails with the "free" account error
+
+## Root Cause
+Music Assistant uses **two separate auth mechanisms** for Spotify:
+1. **OAuth (PKCE flow)** — for browsing, search, metadata. Uses access tokens refreshed via
+   the Spotify Web API. This is what produces the "Successfully logged in" message.
+2. **Librespot** — for actual audio streaming. Uses cached credentials stored in
+   `/data/.cache/spotify--<id>/credentials.json` inside the addon container.
+
+The librespot credential cache can become stale or point to a **different Spotify account**
+(e.g., if another family member logged in, or credentials were cached from before a Premium
+upgrade). Librespot uses these cached credentials to connect to Spotify's internal API, which
+returns a `ProductInfo` XML packet containing the account `type`. If the cached account is
+"free", librespot calls `exit(1)`, killing the audio pipeline before FFmpeg receives any data.
+
+## How Librespot Determines Account Type
+Librespot reads the `type` field from Spotify's `ProductInfo` server packet
+(`librespot-org/librespot`, `core/src/session.rs`):
+```rust
+fn check_catalogue(attributes: &UserAttributes) {
+    if let Some(account_type) = attributes.get("type") {
+        if account_type != "premium" {
+            error!("librespot does not support {account_type:?} accounts.");
+            exit(1);
+        }
+    }
+}
+```
+The check is an exact string match against `"premium"`.
+
+## Solution
+
+### Step 1: Verify the Problem
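+Check Music Assistant addon logs for the "free accounts" error, for example via the API
+call below. It needs a token with Supervisor API access; if it returns 401 (see Notes),
+read the same logs from the add-on's Log tab in the Home Assistant UI instead.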
+```bash
+# Via HA API (from a machine with the HA token)
+python3 -c "
+import os, json, requests
+url = os.environ.get('HOME_ASSISTANT_SOFIA_URL', '').rstrip('/')
+token = os.environ.get('HOME_ASSISTANT_SOFIA_TOKEN', '')
+headers = {'Authorization': f'Bearer {token}'}
+r = requests.get(f'{url}/api/hassio/addons/d5369777_music_assistant/logs', headers=headers)
+for line in r.text.split('\n'):
+    if 'free' in line.lower() or 'librespot' in line.lower():
+        print(line)
+"
+```
+
+### Step 2: Identify the Music Assistant Container
+From the SSH addon (ha-sofia: `ssh vbarzin@192.168.1.8`):
+```bash
+sudo curl -s --unix-socket /run/docker.sock http://localhost/containers/json | \
+  python3 -c "import sys,json; [print(c['Names'][0], c['Id'][:12]) for c in json.load(sys.stdin) if 'music' in c['Names'][0].lower()]"
+```
+
+### Step 3: Check Cached Credentials
+Exec into the container to read the librespot cache:
+```bash
+# Create exec
+EXEC_ID=$(sudo curl -s --unix-socket /run/docker.sock \
+  "http://localhost/containers/<CONTAINER_ID>/exec" \
+  -H 'Content-Type: application/json' \
+  -d '{"Cmd":["cat","/data/.cache/spotify--5s3mSP8y/credentials.json"],"AttachStdout":true,"AttachStderr":true}' | python3 -c "import sys,json; print(json.load(sys.stdin)['Id'])")
+
+# Run exec
+sudo curl -s --unix-socket /run/docker.sock \
+  "http://localhost/exec/$EXEC_ID/start" \
+  -H 'Content-Type: application/json' -d '{"Detach":false}'
+```
+Check the `username` field — if it doesn't match the expected Premium account, that's the problem.
+
+### Step 4: Clear the Cache
+```bash
+# Create exec to delete cache
+EXEC_ID=$(sudo curl -s --unix-socket /run/docker.sock \
+  "http://localhost/containers/<CONTAINER_ID>/exec" \
+  -H 'Content-Type: application/json' \
+  -d '{"Cmd":["rm","-rf","/data/.cache/spotify--5s3mSP8y"],"AttachStdout":true,"AttachStderr":true}' | python3 -c "import sys,json; print(json.load(sys.stdin)['Id'])")
+
+# Run exec
+sudo curl -s --unix-socket /run/docker.sock \
+  "http://localhost/exec/$EXEC_ID/start" \
+  -H 'Content-Type: application/json' -d '{"Detach":false}'
+```
+
+### Step 5: Restart Music Assistant
+```bash
+sudo curl -s --unix-socket /run/docker.sock \
+  "http://localhost/containers/<CONTAINER_ID>/restart" -X POST
+```
+
+### Step 6: Verify
+After restart, check logs for:
+- `Successfully logged in to Spotify as <Name>` (OAuth OK)
+- No "free accounts" error when playing a track
+- Optionally re-check `/data/.cache/spotify--5s3mSP8y/credentials.json` to confirm the
+  `username` now matches the Premium account
+
+## Verification
+1. Play any Spotify track through Music Assistant
+2. The track should stream without pausing after 1-2 seconds
+3. Logs should show `Start Queue Flow stream` without subsequent `AudioError`
+
+## Notes
+- The cache directory name `spotify--5s3mSP8y` is an internal Music Assistant provider ID
+  and may differ across installations. Use `find /data -name credentials.json` to locate it.
+- The `username` field in the credentials cache is Spotify's internal user ID (numeric for
+  newer accounts, text for older ones), not necessarily the display name or email.
+- Spotify Family plan **owners** have account type `"premium"`. Family plan **members** also
+  report as `"premium"` when their membership is active.
+- If the problem recurs, it may indicate that Music Assistant's Spotify provider re-caches
+  the wrong credentials — check if multiple Spotify accounts are configured or if another
+  user logged in via the Music Assistant UI.
+- The SSH addon on HA OS needs `sudo` for Docker socket access (`/run/docker.sock` is owned
+  by `root:messagebus`).
+- The HA long-lived token typically does NOT have Supervisor API access (hassio endpoints
+  return 401), so addon management must go through the Docker socket from the SSH addon.
--- a/.claude/skills/archived/nextcloud-calendar/SKILL.md
+++ b/.claude/skills/archived/nextcloud-calendar/SKILL.md
@ -0,0 +1,128 @@
+---
+name: nextcloud-calendar
+description: |
+  Create, list, and query calendar events in Nextcloud via CalDAV. Use when:
+  (1) User asks to create a calendar event, (2) User asks what's on their calendar,
+  (3) User says "add to calendar" or "schedule", (4) User asks about upcoming events.
+  Always use Nextcloud calendar unless user specifies otherwise.
+author: Claude Code
+version: 1.0.0
+date: 2025-01-25
+---
+
+# Nextcloud Calendar Management
+
+## Problem
+Need to create, query, or manage calendar events in the user's Nextcloud calendar.
+
+## Context / Trigger Conditions
+- User asks to create/add a calendar event
+- User asks "what's on my calendar?" or similar
+- User mentions scheduling something
+- User says "remind me" with a date (create calendar event)
+- Default calendar is always Nextcloud unless otherwise specified
+
+## Prerequisites
+- Python 3 with `caldav` and `icalendar` packages available (installed via PYTHONPATH or system packages)
+- Environment variables `NEXTCLOUD_USER` and `NEXTCLOUD_APP_PASSWORD` must be set
+
+## Solution
+
+### Script Location
+```
+.claude/calendar-query.py
+```
+
+### Execution Pattern (CRITICAL)
+Run the script directly with python3 (env vars are set in the environment):
+
+```bash
+python3 .claude/calendar-query.py [command] [options]
+```
+
+### Available Commands
+
+#### List Calendars
+```bash
+python3 .claude/calendar-query.py list
+```
+
+#### Query Events
+```bash
+# Today's events
+python3 .claude/calendar-query.py today
+
+# Tomorrow's events
+python3 .claude/calendar-query.py tomorrow
+
+# This week
+python3 .claude/calendar-query.py week
+
+# This month
+python3 .claude/calendar-query.py month
+
+# Custom date range
+python3 .claude/calendar-query.py events --days 14
+python3 .claude/calendar-query.py events --date 2026-04-10
+
+# From specific calendar
+python3 .claude/calendar-query.py today --calendar "Work"
+```
+
+#### Create Events
+```bash
+# All-day event (single day)
+python3 .claude/calendar-query.py create --title "Doctor appointment" --start "2026-03-15" --all-day
+
+# All-day event (multi-day) - end date is EXCLUSIVE
+# For April 10-13, use end date April 14
+python3 .claude/calendar-query.py create --title "Vacation" --start "2026-04-10" --end "2026-04-14" --all-day
+
+# Timed event
+python3 .claude/calendar-query.py create --title "Meeting" --start "2026-03-15 14:00" --end "2026-03-15 15:00"
+
+# With location and description
+python3 .claude/calendar-query.py create --title "Lunch" --start "tomorrow 12:00" --location "Cafe" --description "Team lunch"
+
+# Relative dates work
+python3 .claude/calendar-query.py create --title "Call" --start "today 16:00"
+python3 .claude/calendar-query.py create --title "Review" --start "tomorrow 10:00"
+```
+
+### Output Formats
+```bash
+# JSON output (for parsing)
+python3 .claude/calendar-query.py today --json
+
+# Text output (default, human-readable)
+python3 .claude/calendar-query.py week
+```
+
+## Complete Example
+
+To create an event "Team offsite" from March 20-22, 2026:
+
+```bash
+python3 .claude/calendar-query.py create --title "Team offsite" --start "2026-03-20" --end "2026-03-23" --all-day
+```
+
+## Important Notes
+
+1. **End dates are exclusive** for all-day events (the iCalendar convention that CalDAV follows). To create an event spanning April 10-13, set end to April 14.
+
+2. **No delete/update commands** - The script currently supports only create and query. To modify or delete events, the user must do it manually in Nextcloud.
+
+3. **Default calendar** is "Personal" - use `--calendar` flag for others.
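+
+The exclusive end date in note 1 is easy to get wrong by one day. A minimal sketch of the arithmetic, assuming a hypothetical helper `exclusive_end` (it is not part of `calendar-query.py`) that converts the last day an all-day event should cover into the value to pass as `--end`:
+
+```python
+from datetime import date, timedelta
+
+
+def exclusive_end(last_day: str) -> str:
+    """Return the exclusive end date (YYYY-MM-DD) for an all-day event
+    whose final covered day is `last_day`."""
+    return (date.fromisoformat(last_day) + timedelta(days=1)).isoformat()
+
+
+# An event covering April 10-13 needs the day after April 13 as --end.
+print(exclusive_end("2026-04-13"))
+```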
+
+## Verification
+- For queries: Output shows formatted event list
+- For creates: Output shows "Event created: [title]" with calendar name and start date
+- Exit code 0 = success, 1 = error (check output for details)
+
+## Common Errors
+
+| Error | Cause | Fix |
+|-------|-------|-----|
+| `NEXTCLOUD_USER and NEXTCLOUD_APP_PASSWORD must be set` | Env vars not set | Ensure `NEXTCLOUD_USER` and `NEXTCLOUD_APP_PASSWORD` are in the environment |
+| `Required packages not installed` | caldav/icalendar missing | Ensure PYTHONPATH includes the installed packages |
+| `Calendar 'X' not found` | Wrong calendar name | Run `list` command to see available calendars |
--- a/.claude/skills/archived/nfsv4-idmapd-uid-mapping/SKILL.md
+++ b/.claude/skills/archived/nfsv4-idmapd-uid-mapping/SKILL.md
@ -0,0 +1,132 @@
+---
+name: nfsv4-idmapd-uid-mapping
+description: |
+  Fix for all file UIDs showing as 65534 (nobody) inside Kubernetes containers when using
+  NFS volumes from TrueNAS/FreeBSD. Use when: (1) ls -lan inside a container shows all files
+  owned by 65534:65534 despite correct ownership on the NFS server, (2) PostgreSQL fails with
+  "data directory has wrong ownership", (3) chown inside containers returns "Invalid argument"
+  on NFS volumes, (4) services that check file ownership (PostgreSQL, MySQL) crash on startup,
+  (5) the same NFS mount shows correct UIDs on the host but 65534 inside containers,
+  (6) NFSv4.2 appears in container mount output even though host mounts use NFSv3.
+  Root cause: Kubernetes inline NFS volumes auto-negotiate NFSv4.2 (not NFSv3), and NFSv4
+  idmapd fails to map UIDs when domains don't match or users don't exist on the server.
+author: Claude Code
+version: 1.0.0
+date: 2026-03-01
+---
+
+# NFSv4 idmapd UID Mapping — All Files Show as nobody (65534)
+
+## Problem
+All files on NFS volumes appear owned by UID 65534 (nobody:nogroup) inside Kubernetes
+containers, even though `ls -lan` on the NFS server shows the correct UIDs (e.g., 999, 472).
+This breaks any service that checks file ownership: PostgreSQL refuses to start ("data
+directory has wrong ownership"), MySQL's entrypoint `chown` fails with "Invalid argument",
+and any `chown` inside the container returns EINVAL.
+
+## Context / Trigger Conditions
+
+- TrueNAS CORE (FreeBSD) or TrueNAS SCALE as NFS server
+- NFSv4 enabled on the NFS server (`v4: true` in TrueNAS NFS config)
+- Kubernetes using inline NFS volumes (not PV/PVC with mount options)
+- **Key symptom**: `mount` inside the container shows `type nfs4 (vers=4.2,...)` even
+  though existing kubelet mounts on the host show `vers=3`
+- **Key symptom**: Same NFS path mounted directly on the host shows correct UIDs, but
+  inside any container shows 65534
+
+## Root Cause
+
+Kubernetes inline NFS volumes don't support `mountOptions`. When kubelet mounts NFS for a
+new pod, the Linux NFS client auto-negotiates the highest available version — NFSv4.2 if
+the server supports it.
+
+NFSv4 uses **idmapd** for UID translation: the server translates UID→username (e.g.,
+`999→postgres@domain`), sends the username string over the wire, and the client translates
+it back to a local UID. This fails when:
+
+1. **Domain mismatch**: Server domain (from hostname) differs from client domain
+   - TrueNAS: `viktorbarzin.me` (from `truenas.viktorbarzin.me`)
+   - K8s nodes: `viktorbarzin.lan` (from `k8s-node4.viktorbarzin.lan`)
+   - When domains don't match, ALL UIDs fall back to `nobody` (65534)
+
+2. **Unknown UIDs**: Even with matching domains, if the NFS server has no local user for
+   UID 999 (common for container UIDs), idmapd maps it to `nobody`
+
+**Why existing mounts work**: Older kubelet mounts (established before NFSv4 was enabled,
+or when the NFS client defaulted to v3) continue using NFSv3 with direct numeric UID
+passthrough. Only NEW mounts negotiate NFSv4.2.
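+
+The domain-mismatch case can be checked on paper before touching the server. A rough sketch, assuming each side derives its idmap domain as everything after the first label of its FQDN, as described above (an explicit domain setting on either side would override this). `idmap_domain` and `domains_match` are illustrative names, not part of any NFS tooling:
+
+```python
+def idmap_domain(fqdn: str) -> str:
+    """Domain part of a hostname: everything after the first dot, lowercased."""
+    _, _, domain = fqdn.strip().rstrip(".").partition(".")
+    return domain.lower()
+
+
+def domains_match(server_fqdn: str, client_fqdn: str) -> bool:
+    """True when both hosts would derive the same NFSv4 idmap domain."""
+    return idmap_domain(server_fqdn) == idmap_domain(client_fqdn)
+
+
+# Hostnames from the example above: the derived domains differ.
+print(domains_match("truenas.viktorbarzin.me", "k8s-node4.viktorbarzin.lan"))
+```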
+
+## Solution
+
+**Fix on TrueNAS (no NFS restart required):**
+
+```bash
+# 1. Enable NFSv3-style numeric UID passthrough for NFSv4
+midclt call nfs.update '{"v4_v3owner": true, "v4_domain": "viktorbarzin.lan"}'
+
+# 2. Restart nfsuserd with the correct domain (NOT nfsd — that would crash the cluster)
+killall nfsuserd
+nfsuserd -domain viktorbarzin.lan -force
+```
+
+**Clear caches on all K8s nodes:**
+
+```bash
+for node in k8s-node1 k8s-node2 k8s-node3 k8s-node4; do
+  ssh wizard@$node "sudo nfsidmap -c && sudo keyctl clear @u"
+done
+```
+
+**Key settings explained:**
+- `v4_v3owner = true`: Makes NFSv4 use numeric UID passthrough like NFSv3, completely
+  bypassing the username-based idmapd translation. **This is the critical fix.**
+- `v4_domain`: Should match the K8s nodes' DNS domain (check with `hostname -d` on a node)
+- `nfsuserd -domain <domain> -force`: FreeBSD daemon that handles NFSv4 user mapping.
+  The `-force` flag is required if it thinks it's already running.
+
+## Verification
+
+```bash
+# Run a test pod and check UIDs
+kubectl run nfs-test --rm -it --restart=Never --image=alpine \
+  --overrides='{"spec":{"containers":[{"name":"test","image":"alpine",
+  "command":["sh","-c","ls -lan /data | head -5"],
+  "volumeMounts":[{"name":"nfs","mountPath":"/data"}]}],
+  "volumes":[{"name":"nfs","nfs":{"server":"10.0.10.15","path":"/mnt/main/some-path"}}]}}'
+
+# Should show actual UIDs (e.g., 999, 472) instead of 65534
+```
+
+## Debugging Steps
+
+If you're not sure whether this is the issue:
+
+```bash
+# 1. Check mount type INSIDE a container (not on the host!)
+kubectl exec <pod> -- mount | grep nfs
+# If it shows "type nfs4" with "vers=4.2" — this is the issue
+
+# 2. Compare UIDs: host vs container
+# On host (via kubelet mount path):
+sudo ls -lan /var/lib/kubelet/pods/<pod-uid>/volumes/kubernetes.io~nfs/<vol>/
+# Inside container:
+kubectl exec <pod> -- ls -lan /mount-path/
+
+# 3. Check TrueNAS NFS config
+midclt call nfs.config  # Look for v4: true, v4_v3owner, v4_domain
+
+# 4. Check nfsuserd is running with the right domain
+ps aux | grep nfsuserd  # On TrueNAS
+```
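+
+Debugging step 1 can be scripted when many pods need checking. A small sketch, assuming the usual Linux `mount` line layout (`<source> on <mountpoint> type <fstype> (<options>)`); the function name `nfs_mounts` is made up for this example:
+
+```python
+import re
+
+# Matches Linux `mount` lines such as:
+#   server:/export on /data type nfs4 (rw,relatime,vers=4.2,...)
+MOUNT_RE = re.compile(r"^(\S+) on (\S+) type (nfs4?) \(([^)]*)\)")
+
+
+def nfs_mounts(mount_output: str) -> list:
+    """Return (mountpoint, fstype, vers) for every NFS mount in `mount` output."""
+    found = []
+    for line in mount_output.splitlines():
+        m = MOUNT_RE.match(line.strip())
+        if not m:
+            continue  # not an NFS mount
+        options = m.group(4).split(",")
+        vers = next((o.split("=", 1)[1] for o in options if o.startswith("vers=")), "")
+        found.append((m.group(2), m.group(3), vers))
+    return found
+```
+
+Feeding it the output of `kubectl exec <pod> -- mount` and looking for entries whose version starts with `4` points at the mounts where idmapd is involved.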
+
+## Notes
+
+- **NEVER restart NFS (nfsd)** on TrueNAS — it causes mount failures across ALL pods
+  cluster-wide. Only restart `nfsuserd` (the ID mapping daemon).
+- Existing NFSv3 mounts continue working fine. The issue only affects NEW mounts.
+- The `v4_v3owner` setting is persistent across TrueNAS reboots (stored in middleware config).
+- The `nfsuserd` restart is NOT persistent — TrueNAS may restart it without the `-domain`
+  flag after a reboot. The `v4_domain` setting in the middleware config should handle this,
+  but verify after any TrueNAS restart.
+- On Linux NFS servers (not FreeBSD/TrueNAS), the equivalent fix is setting `Domain` in
+  `/etc/idmapd.conf` on both server and all clients.
--- a/.claude/skills/archived/openclaw-k8s-deployment/SKILL.md
+++ b/.claude/skills/archived/openclaw-k8s-deployment/SKILL.md
@ -0,0 +1,216 @@
+---
+name: openclaw-k8s-deployment
+description: |
+  Deploy and troubleshoot OpenClaw gateway on Kubernetes. Use when:
+  (1) OpenClaw gateway won't start or shows "Telegram configured, not enabled yet",
+  (2) exec fails with "requires a paired node (none available)",
+  (3) gateway shows "Config invalid" for exec.host or exec.security values,
+  (4) OpenClaw can't write files (EACCES on workspace or home),
+  (5) gateway takes 5+ minutes to start (CPU throttling by VPA/LimitRange),
+  (6) 502 Bad Gateway from Traefik after pod restart,
+  (7) setting up Telegram bot channel,
+  (8) configuring modelrelay sidecar for free model routing.
+  Covers all non-obvious deployment gotchas discovered through trial and error.
+author: Claude Code
+version: 1.0.0
+date: 2026-03-01
+---
+
+# OpenClaw Kubernetes Deployment
+
+## Problem
+Deploying OpenClaw as a Kubernetes pod involves many non-obvious configuration
+requirements. The gateway process, Telegram integration, exec permissions, and
+file ownership all have specific constraints not documented together.
+
+## Context / Trigger Conditions
+- Deploying OpenClaw from `ghcr.io/openclaw/openclaw` container image
+- Running in Kubernetes with NFS volumes, Traefik ingress, Goldilocks/VPA
+- Want Telegram bot integration, tool execution, and persistent state
+
+## Solution
+
+### 1. Gateway Configuration (openclaw.json)
+
+**Required fields that aren't obvious:**
+
+```json
+{
+  "gateway": {
+    "mode": "local",
+    "bind": "lan",
+    "controlUi": {
+      "dangerouslyDisableDeviceAuth": true,
+      "dangerouslyAllowHostHeaderOriginFallback": true
+    }
+  },
+  "wizard": {
+    "lastRunAt": "2026-03-01T00:00:00.000Z",
+    "lastRunVersion": "2026.2.26",
+    "lastRunCommand": "configure",
+    "lastRunMode": "local"
+  }
+}
+```
+
+- `gateway.mode = "local"` — **required** or gateway refuses to start
+- `dangerouslyAllowHostHeaderOriginFallback = true` — required in v2026.2.26+
+  for non-loopback Control UI (error: "non-loopback Control UI requires
+  gateway.controlUi.allowedOrigins")
+- `wizard` block — **required** for Telegram to start. Without it, gateway logs
+  "Telegram configured, not enabled yet" on every startup. The wizard block
+  signals that initial setup was completed.
+
+### 2. Exec Configuration
+
+Valid values for `tools.exec`:
+
+| Field | Valid Values | Notes |
+|-------|-------------|-------|
+| `host` | `sandbox`, `gateway`, `node` | NOT "local" — that's invalid |
+| `security` | `deny`, `allowlist`, `full` | NOT "off" — that's invalid |
+| `ask` | `"off"` | Disables confirmation prompts |
+
+- `host = "gateway"` — runs commands on the container host directly
+- `host = "node"` — requires a "paired node" companion app (doesn't work in containers)
+- `host = "sandbox"` — requires Docker-in-Docker
+- `security = "full"` — most permissive valid option
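+
+Because an invalid value only shows up as "Config invalid" at gateway start, it can help to check `tools.exec` before deploying. A minimal sketch that encodes the table above; `VALID_EXEC_VALUES` and `invalid_exec_fields` are hypothetical names and not part of OpenClaw:
+
+```python
+# Allowed values taken from the table above. "ask" is left unchecked because
+# only the "off" value is documented here.
+VALID_EXEC_VALUES = {
+    "host": {"sandbox", "gateway", "node"},
+    "security": {"deny", "allowlist", "full"},
+}
+
+
+def invalid_exec_fields(exec_cfg: dict) -> dict:
+    """Return {field: bad_value} for tools.exec fields set outside the table."""
+    return {
+        field: exec_cfg[field]
+        for field, allowed in VALID_EXEC_VALUES.items()
+        if field in exec_cfg and exec_cfg[field] not in allowed
+    }
+
+
+# The two invalid values called out in the table are both reported.
+print(invalid_exec_fields({"host": "local", "security": "off", "ask": "off"}))
+```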
+
+### 3. Sandbox Mode
+
+```json
+{
+  "agents": {
+    "defaults": {
+      "sandbox": { "mode": "off" },
+      "workspace": "/workspace/infra"
+    }
+  }
+}
+```
+
+- `sandbox.mode = "off"` disables Docker sandboxing
+- `workspace` must be set explicitly — defaults to `~/.openclaw/workspace`
+
+### 4. File Permissions
+
+The init container runs as root but the main container runs as `node` (UID 1000).
+
+**Must chown in init container:**
+```sh
+chown -R 1000:1000 /workspace/infra
+chown -R 1000:1000 /openclaw-home
+chmod 700 /openclaw-home
+```
+
+**Must create directories:**
+```sh
+mkdir -p /openclaw-home/agents/main/sessions \
+         /openclaw-home/credentials \
+         /openclaw-home/canvas \
+         /openclaw-home/devices \
+         /openclaw-home/cron
+```
+
+Without these: `EACCES: permission denied` errors for AGENTS.md, canvas,
+cron/jobs.json, devices, and other runtime files.
+
+### 5. Startup Command
+
+```sh
+node openclaw.mjs doctor --fix 2>/dev/null; exec node openclaw.mjs gateway --allow-unconfigured --bind lan
+```
+
+Run `doctor --fix` before the gateway to auto-enable Telegram and fix
+config issues. Without this, Telegram stays "not enabled yet".
+
+### 6. Resource Requirements
+
+- **CPU limit: 2 cores minimum** — the Node.js gateway startup is CPU-intensive.
+  With 150-300m CPU, startup takes 5+ minutes.
+- **Memory limit: 2Gi minimum** — the gateway OOM-kills at 1Gi during startup
+  (V8 heap exhaustion).
+- **Goldilocks VPA will override these**; see the Goldilocks VPA item under Notes.
+
+### 7. Readiness Probe
+
+```hcl
+readiness_probe {
+  tcp_socket { port = 18789 }
+  initial_delay_seconds = 30
+  period_seconds        = 10
+}
+```
+
+Do NOT use a startup probe — the gateway can take 2-3 minutes to start listening
+and a startup probe will kill it. Use readiness-only to prevent 502s from Traefik
+during startup without killing the container.
+
+### 8. Telegram Integration
+
+```json
+{
+  "channels": {
+    "telegram": {
+      "enabled": true,
+      "botToken": "...",
+      "dmPolicy": "allowlist",
+      "allowFrom": ["tg:USER_ID"],
+      "groupPolicy": "allowlist",
+      "streamMode": "partial"
+    }
+  }
+}
+```
+
+Telegram won't start without:
+1. The `wizard` block in config (signals setup was run)
+2. `doctor --fix` at startup (auto-enables the channel)
+3. Both `groupPolicy` and `streamMode` fields
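+
+The config-level conditions in that list (items 1 and 3) can be checked mechanically before a rollout. A sketch under the assumption that `openclaw.json` has already been loaded into a dict; `telegram_blockers` is an illustrative name, and the `doctor --fix` requirement is a startup step that this check cannot see:
+
+```python
+def telegram_blockers(config: dict) -> list:
+    """List config-level reasons Telegram would stay disabled."""
+    problems = []
+    if not config.get("wizard"):
+        problems.append("missing wizard block")
+    telegram = config.get("channels", {}).get("telegram", {})
+    for field in ("groupPolicy", "streamMode"):
+        if field not in telegram:
+            problems.append(f"missing channels.telegram.{field}")
+    return problems
+
+
+# A config with the channel enabled but no wizard block and no streamMode.
+print(telegram_blockers({"channels": {"telegram": {"enabled": True, "groupPolicy": "allowlist"}}}))
+```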
+
+### 9. NFS Volume Strategy
+
+| Volume | Purpose | Type |
+|--------|---------|------|
+| `/home/node/.openclaw` | Persistent state (SOUL.md, sessions, memory, telegram) | NFS |
+| `/tools` | Cached binaries (kubectl, terraform, terragrunt, python libs) | NFS |
+| `/workspace` | Infra repo clone | NFS |
+| `/data` | General data | NFS |
+
+Using NFS for tools cache reduces restart time from ~2.5min to ~38s by skipping
+binary downloads and pip installs on subsequent starts.
+
+### 10. ModelRelay Sidecar
+
+Deploy as a sidecar container for automatic free model routing:
+
+```hcl
+container {
+  name  = "modelrelay"
+  image = "node:22-alpine"
+  command = ["sh", "-c", "npm install -g modelrelay; exec modelrelay --port 7352"]
+  env { name = "NVIDIA_API_KEY"; value = "..." }
+  env { name = "OPENROUTER_API_KEY"; value = "..." }
+}
+```
+
+Configure as provider: `baseUrl = "http://127.0.0.1:7352/v1"`, model `auto-fastest`.
+
+## Verification
+1. `kubectl logs -c openclaw` should show `[gateway] listening on ws://0.0.0.0:18789`
+2. No "Telegram configured, not enabled yet" message
+3. No `EACCES` permission errors
+4. `kubectl exec ... -- cat /proc/net/tcp` shows listening sockets
+5. Telegram bot responds to `/start`
+
+## Notes
+- ConfigMap changes require pod restart (init container copies config at start)
+- ConfigMap taint+reinit sometimes needed when Terraform state gets out of sync
+- Goldilocks VPA recreates itself from namespace labels — must delete VPA on
+  every pod recreation if namespace has `goldilocks.fairwinds.com/vpa-update-mode`
+- The `--allow-unconfigured` flag is needed for the gateway command
+- v2026.2.26 introduced breaking change requiring `dangerouslyAllowHostHeaderOriginFallback`
+
+## See also
+- `openclaw-custom-model-provider` — basic model provider configuration
+- `k8s-limitrange-oom-silent-kill` — LimitRange causing OOM (related but different)
--- a/.claude/skills/archived/pfsense-dnsmasq-interface-binding/SKILL.md
+++ b/.claude/skills/archived/pfsense-dnsmasq-interface-binding/SKILL.md
@ -0,0 +1,169 @@
+---
+name: pfsense-dnsmasq-interface-binding
+description: |
+  Restrict pfSense dnsmasq (DNS Forwarder) to specific interfaces to free port 53 on
+  other interfaces for port forwarding. Use when: (1) pfSense blocks port 53 NAT port
+  forward because dnsmasq is listening on *:53, (2) need to forward DNS from WAN to an
+  internal DNS server while preserving client source IPs, (3) dnsmasq shows *:53 in
+  sockstat despite --listen-address flags, (4) pfSense loses DNS resolution after
+  restricting dnsmasq interfaces, (5) NAT rdr rules for port 53 silently fail to
+  generate in /tmp/rules.debug.
+author: Claude Code
+version: 1.0.0
+date: 2026-02-17
+---
+
+# pfSense dnsmasq Interface Binding for DNS Port Forwarding
+
+## Problem
+pfSense's dnsmasq (DNS Forwarder) binds to `*:53` by default. This prevents creating
+NAT port forward rules for port 53 — pfSense silently skips generating the pf `rdr`
+directive. You need to restrict dnsmasq to specific interfaces to free port 53 on other
+interfaces (e.g., WAN) for forwarding to an internal DNS server.
+
+## Context / Trigger Conditions
+- Attempting to create a NAT port forward for port 53 on the WAN interface
+- Port forward rule saves to config.xml but `pfctl -sn` shows no corresponding `rdr` rule
+- `sockstat -4 | grep ":53"` shows `dnsmasq` on `*:53`
+- Goal: Forward DNS queries from one network to an internal DNS server (e.g., Technitium)
+  while preserving client source IPs (no masquerading)
+
+## Solution
+
+### Step 1: Bind dnsmasq to specific interfaces
+
+Set the interface field in pfSense's dnsmasq config:
+
+```php
+ssh admin@10.0.20.1 'php -r '"'"'
+require_once("config.inc");
+require_once("service-utils.inc");
+global $config;
+$config = parse_config(true);
+$config["dnsmasq"]["interface"] = "lan,opt1";  // Only LAN and OPT1, NOT wan
+write_config("Bind dnsmasq to LAN and OPT1 only");
+'"'"''
+```
+
+This adds `--listen-address=<IP>` flags to dnsmasq but does NOT change socket binding.
+
+### Step 2: Add bind-dynamic (CRITICAL)
+
+Without `bind-dynamic`, dnsmasq still binds the socket to `*:53` even with
+`--listen-address` flags. The `--listen-address` only controls which queries get
+responses, not the actual socket binding.
+
+```php
+ssh admin@10.0.20.1 'php -r '"'"'
+require_once("config.inc");
+require_once("service-utils.inc");
+global $config;
+$config = parse_config(true);
+$existing = base64_decode($config["dnsmasq"]["custom_options"]);
+if (strpos($existing, "bind-dynamic") === false) {
+    $existing = "bind-dynamic\n" . $existing;
+    $config["dnsmasq"]["custom_options"] = base64_encode($existing);
+    write_config("Add bind-dynamic to restrict dnsmasq socket binding");
+}
+'"'"''
+```
+
+### Step 3: Add localhost listen address (CRITICAL)
+
+pfSense's own `resolv.conf` points to `127.0.0.1`. Without this, pfSense itself
+loses DNS resolution after the interface restriction.
+
+```
+# Add to custom_options (base64-encoded in config):
+listen-address=127.0.0.1
+```
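+
+Since `custom_options` is stored base64-encoded, every added option means decode, edit, re-encode. A sketch of that round trip in Python, mirroring the PHP in Step 2 for illustration only (it does not write pfSense config). `add_custom_option` is a hypothetical helper, and it compares whole lines where the PHP above uses a substring check:
+
+```python
+import base64
+
+
+def add_custom_option(encoded: str, option: str) -> str:
+    """Prepend `option` to base64-encoded dnsmasq custom_options unless already present."""
+    lines = base64.b64decode(encoded).decode().splitlines()
+    if option in lines:
+        return encoded  # already set, leave the stored value untouched
+    return base64.b64encode("\n".join([option, *lines]).encode()).decode()
+
+
+# Start from empty options and add both settings from Steps 2 and 3.
+opts = add_custom_option("", "bind-dynamic")
+opts = add_custom_option(opts, "listen-address=127.0.0.1")
+print(base64.b64decode(opts).decode())
+```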
+
+### Step 4: Restart dnsmasq
+
+```php
+services_dnsmasq_configure();
+```
+
+### Step 5: Verify binding
+
+```bash
+sockstat -4 | grep ":53 "
+# Should show specific IPs, not *:53:
+# 127.0.0.1:53
+# 10.0.10.1:53  (lan)
+# 10.0.20.1:53  (opt1)
+# NOT 192.168.1.2:53 (wan)
+```
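+
+The same check can be automated against captured `sockstat` output. A minimal sketch, assuming the local address appears as its own whitespace-separated column (as in `sockstat -4`); `wildcard_dns_listeners` is an illustrative name:
+
+```python
+def wildcard_dns_listeners(sockstat_output: str) -> list:
+    """Return sockstat lines where a process is still bound to *:53."""
+    return [
+        line
+        for line in sockstat_output.splitlines()
+        if "*:53" in line.split()  # exact column match, so *:* and *:5353 are ignored
+    ]
+```
+
+An empty result after Step 4 suggests dnsmasq is no longer holding the wildcard socket; any returned line names the process that still is.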
+
+### Step 6: Add the port forward rule
+
+**Critical format note**: The `source` field must use `array("any" => "")`, NOT
+`array("network" => "192.168.1.0/24")`. The CIDR source format silently fails to
+generate the pf `rdr` directive.
+
+```php
+ssh admin@10.0.20.1 'php -r '"'"'
+require_once("config.inc");
+require_once("filter.inc");
+require_once("shaper.inc");
+global $config;
+$config = parse_config(true);
+
+$rule = array(
+    "source" => array("any" => ""),           // MUST be "any", not CIDR
+    "destination" => array(
+        "network" => "wanip",
+        "port" => "53"
+    ),
+    "ipprotocol" => "inet",
+    "protocol" => "udp",
+    "target" => "10.0.20.204",                // Internal DNS server
+    "local-port" => "53",
+    "interface" => "wan",
+    "associated-rule-id" => "pass",
+    "descr" => "DNS to internal DNS (preserve client IP)",
+    "created" => array("time" => (string)time(), "username" => "admin"),
+    "updated" => array("time" => (string)time(), "username" => "admin")
+);
+array_unshift($config["nat"]["rule"], $rule);
+write_config("Add DNS port forward");
+filter_configure();
+'"'"''
+```
+
+### Step 7: Verify the redirect rule
+
+```bash
+pfctl -sn | grep "domain\|:53"
+# Should show: rdr pass on vtnet0 inet proto udp from any to 192.168.1.2 port = domain -> 10.0.20.204
+```
+
+## Verification
+
+1. pfSense own DNS: `nslookup google.com 127.0.0.1` (from pfSense shell)
+2. Internal DNS: `nslookup google.com 10.0.20.1` (from LAN/OPT1 clients)
+3. Port forward: `dig @192.168.1.2 example.com` (from WAN-side client)
+4. Client IP: Check DNS server logs — should show real client IP, not pfSense IP
+
+## Pitfalls
+
+| Pitfall | Symptom | Fix |
+|---------|---------|-----|
+| Missing `bind-dynamic` | sockstat shows `*:53`, port forward still blocked | Add `bind-dynamic` to custom_options |
+| Missing `listen-address=127.0.0.1` | pfSense loses all DNS resolution | Add to custom_options |
+| Source `"network" => "CIDR"` in NAT rule | Rule saves to config but no `rdr` in `pfctl -sn` | Use `"any" => ""` instead |
+| Using local `$config` variable | Config not persisted after PHP exit | Always use `global $config` |
+| Not calling `filter_configure()` | Rule in config.xml but not in pf | Call after `write_config()` |
+| Custom options not base64 | dnsmasq fails to start | pfSense stores custom_options as base64 |
+
+## Notes
+- `bind-dynamic` is preferred over `bind-interfaces` because it handles interfaces that
+  come up after dnsmasq starts (e.g., VPN tunnels)
+- The pf `rdr` rule is a redirect, not masquerade — source IP is preserved
+- dnsmasq custom_options in pfSense config.xml are base64-encoded
+- Check `/tmp/rules.debug` for the generated pf ruleset (before loading into pf)
+- Use `pfctl -sn` to see rules actually loaded in the running firewall
+
+## See also
+- `pfsense` — General pfSense management skill
+- `k8s-ndots-search-domain-nxdomain-flood` — Related DNS optimization
--- a/.claude/skills/archived/pfsense-nat-rule-creation/SKILL.md
+++ b/.claude/skills/archived/pfsense-nat-rule-creation/SKILL.md
@ -0,0 +1,105 @@
+---
+name: pfsense-nat-rule-creation
+description: |
+  Create NAT port forward rules on pfSense programmatically via PHP/SSH.
+  Use when: (1) adding port forwards for new K8s services, (2) NAT rules
+  added via PHP don't appear in pfctl output, (3) config_read_array() throws
+  "undefined function" error, (4) destination "wanip" not working in NAT rules,
+  (5) rules saved to config.xml but not loaded into pfctl. Covers the correct
+  PHP array structure, config API differences between pfSense versions, and
+  the required pfctl reload step.
+author: Claude Code
+version: 1.0.0
+date: 2026-02-21
+---
+
+# pfSense NAT Rule Creation via PHP
+
+## Problem
+Creating NAT port forward rules on pfSense programmatically via SSH/PHP has
+multiple gotchas around the config API, rule structure, and rule loading.
+
+## Context / Trigger Conditions
+- Adding a port forward for a new Kubernetes service (e.g., TURN, game server)
+- Using `ssh admin@10.0.20.1` + PHP to automate pfSense config
+- NAT rules don't appear in `pfctl -sn` after `write_config()` + `filter_configure()`
+- `config_read_array()` throws "Call to undefined function"
+- Rules saved to config.xml but pfctl doesn't have them
+
+## Solution
+
+### Correct PHP for adding NAT rules
+
+```php
+<?php
+require_once("config.inc");
+require_once("filter.inc");
+global $config;  // NOT config_read_array() — that doesn't exist in pfSense 2.7.x
+
+$config["nat"]["rule"][] = array(
+    "interface"          => "wan",
+    "ipprotocol"         => "inet",          // Required! Must be "inet" for IPv4
+    "protocol"           => "tcp/udp",       // Or "udp" or "tcp"
+    "source"             => array("any" => ""),
+    "destination"        => array(
+        "network" => "wanip",               // Use "network" => "wanip", NOT "address" => "wanip"
+        "port"    => "3478"                  // Single port or "start:end" for range
+    ),
+    "target"             => "10.0.20.200",   // Internal destination IP
+    "local-port"         => "3478",          // Internal port (for ranges, just the start port)
+    "descr"              => "My port forward",
+    "associated-rule-id" => "pass"           // Auto-create firewall pass rule
+);
+
+write_config("Description for config history");
+filter_configure();
+```
+
+### Key gotchas
+
+1. **`config_read_array()` doesn't exist** in pfSense 2.7.x. Use `global $config` instead.
+
+2. **Destination format**: Use `"network" => "wanip"`, NOT `"address" => "wanip"` or `"address" => "192.168.1.2"`. The `"network"` key with `"wanip"` tells pfSense to resolve the WAN IP dynamically.
+
+3. **`ipprotocol` is required**: Must include `"ipprotocol" => "inet"` or rules won't generate in `/tmp/rules.debug`.
+
+4. **Port ranges**: Use `"port" => "49152:49252"` for ranges. The `"local-port"` should be just the start port — pfSense maps the range automatically.
+
+5. **Rules may not load immediately**: After `write_config()` + `filter_configure()`, rules appear in `/tmp/rules.debug` but may not be in pfctl until the next filter reload. Force with:
+   ```bash
+   pfctl -f /tmp/rules.debug
+   ```
+
+6. **SSH quoting**: The pfsense.py `php` command breaks on `\n` in strings. For multi-line PHP, write a `.php` file, `scp` it, and execute:
+   ```bash
+   scp script.php admin@10.0.20.1:/tmp/
+   ssh admin@10.0.20.1 "php /tmp/script.php"
+   ```
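+
+Gotcha 4 (port ranges) is the one most likely to be mistyped when rules are generated from a service list. A small sketch of the mapping, written in Python as a stand-in for whatever generates the PHP array; `nat_port_fields` is a hypothetical helper:
+
+```python
+from typing import Optional
+
+
+def nat_port_fields(start: int, end: Optional[int] = None) -> dict:
+    """Build the destination "port" and "local-port" values for a forward.
+
+    A range becomes "start:end" on the destination side, while local-port
+    carries only the start port.
+    """
+    if end is not None and end < start:
+        raise ValueError("end port must not be below start port")
+    port = str(start) if end is None or end == start else f"{start}:{end}"
+    return {"port": port, "local-port": str(start)}
+
+
+# Single port and range, using the port numbers from the examples above.
+print(nat_port_fields(3478))
+print(nat_port_fields(49152, 49252))
+```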
+
+### Execution via pfsense.py
+
+For simple single-line PHP (no newlines or backslashes):
+```bash
+python3 .claude/pfsense.py php 'require_once("config.inc"); ...; echo "Done";'
+```
+
+For complex scripts, use scp + ssh as above.
+
+## Verification
+
+```bash
+# Check rules in config
+ssh admin@10.0.20.1 "grep 'YOUR_PORT' /cf/conf/config.xml"
+
+# Check generated pf rules
+ssh admin@10.0.20.1 "grep 'YOUR_PORT' /tmp/rules.debug"
+
+# Check active pfctl rules
+python3 .claude/pfsense.py pfctl "-sn" | grep YOUR_PORT
+```
+
+## Notes
+- Existing working NAT rules on this pfSense use the same structure (check WireGuard port 51820 as reference)
+- The `associated-rule-id: pass` auto-creates a WAN firewall rule to allow the forwarded traffic
+- pfSense applies NAT rules across ALL interfaces when using the web UI, but PHP-created rules only apply to the specified interface
+- See also: `pfsense` skill for general pfSense management
--- a/.claude/skills/archived/proxmox-vm-disk-expansion-pitfalls/SKILL.md
+++ b/.claude/skills/archived/proxmox-vm-disk-expansion-pitfalls/SKILL.md
@ -0,0 +1,136 @@
+---
+name: proxmox-vm-disk-expansion-pitfalls
+description: |
+  Troubleshoot common failures when expanding Proxmox VM disks on Ubuntu 24.04
+  cloud-init images and draining Kubernetes nodes. Use when: (1) growpart fails
+  with "command not found" on Ubuntu cloud-init VMs, (2) grep -P fails on macOS
+  with "invalid option -- P", (3) kubectl drain times out with pods stuck
+  terminating, (4) filesystem shows old size after qm resize. Covers
+  cloud-guest-utils installation, macOS-portable regex parsing, drain timeout
+  tuning, and recovery from partial failures.
+author: Claude Code
+version: 1.0.0
+date: 2026-02-13
+---
+
+# Proxmox VM Disk Expansion Pitfalls
+
+## Problem
+
+Expanding disk storage on Proxmox-hosted Ubuntu 24.04 cloud-init VMs (used as
+Kubernetes nodes) fails at multiple points due to missing tools, cross-platform
+incompatibilities, and Kubernetes drain timeouts.
+
+## Context / Trigger Conditions
+
+- Running disk expansion scripts from macOS against Proxmox + Ubuntu VMs
+- Ubuntu 24.04 cloud-init images (the default k8s node template)
+- Kubernetes nodes with many pods or stateful workloads
+- Using `scripts/extend_vm_storage.sh` or similar automation
+
+## Issues and Solutions
+
+### 1. `growpart: command not found` on Ubuntu 24.04
+
+**Symptom**: After `qm resize`, SSH into VM, run `growpart /dev/sda 1` — fails
+with "command not found". `resize2fs` then reports "Nothing to do!" because the
+partition table hasn't been updated.
+
+**Root cause**: Ubuntu 24.04 cloud-init images don't include `cloud-guest-utils`
+by default. The `growpart` tool (which updates the partition table to use new
+disk space) is in this package.
+
+**Fix**:
+```bash
+sudo apt-get update -qq && sudo apt-get install -y -qq cloud-guest-utils
+sudo growpart /dev/sda 1
+sudo resize2fs /dev/sda1
+```
+
+**Prevention**: Check for `growpart` before attempting partition expansion:
+```bash
+if ! command -v growpart &>/dev/null; then
+    sudo apt-get update -qq && sudo apt-get install -y -qq cloud-guest-utils
+fi
+```
+
+### 2. `grep -P` (PCRE) not available on macOS
+
+**Symptom**: Script running on macOS fails with `grep: invalid option -- P`.
+
+**Root cause**: macOS ships BSD grep, which doesn't support `-P` (Perl-compatible
+regex). GNU grep (from Homebrew) does, but scripts shouldn't assume it's installed.
+
+**Fix**: Replace `grep -oP 'pattern\Kcapture'` with portable `sed`:
+```bash
+# BAD (GNU grep only):
+CURRENT_SIZE=$(echo "$LINE" | grep -oP 'size=\K[0-9]+G')
+
+# GOOD (portable):
+CURRENT_SIZE=$(echo "$LINE" | sed -n 's/.*size=\([0-9]*G\).*/\1/p')
+```
+
+**General rule**: In scripts that may run on macOS, avoid GNU-only features such as
+`grep -P`, and watch for `sed -i` (BSD sed requires `sed -i ''`) and `date` flag
+differences. Prefer `sed` with basic regex or the bash built-in `[[ =~ ]]` for
+pattern matching.
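+
+A minimal sketch of the `[[ =~ ]]` approach. The function name `parse_disk_size`
+and the sample config line are illustrative assumptions, not taken from the
+script. Keeping the regex in a variable avoids quoting differences between bash
+versions, including the bash 3.2 that ships with macOS:
+```bash
+# Hypothetical helper: print the "<N>G" size from a disk config line
+parse_disk_size() {
+  local line="$1"
+  local re='size=([0-9]+G)'   # ERE kept in a variable for portability
+  if [[ "$line" =~ $re ]]; then
+    printf '%s\n' "${BASH_REMATCH[1]}"
+  else
+    return 1                  # no size= field found
+  fi
+}
+
+# Example with a made-up config line
+CURRENT_SIZE=$(parse_disk_size 'scsi0: local-lvm:vm-201-disk-0,size=64G')
+```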
+
+### 3. `kubectl drain` timeout with stuck pods
+
+**Symptom**: `kubectl drain --timeout=120s` fails with "context deadline exceeded"
+for multiple pods. Pods are evicted but don't terminate in time.
+
+**Root cause**: Some pods (stateful services like ClickHouse, Paperless-ngx,
+OnlyOffice) need more time to shut down gracefully. 120s isn't enough when many
+pods are draining simultaneously.
+
+**Fix**: Retry with a longer timeout. Add `--force` only if the node also runs pods
+that are not managed by a controller (bare pods); `--force` deletes those pods and
+does not speed up termination of the others:
+```bash
+# First attempt with standard timeout
+kubectl drain <node> --ignore-daemonsets --delete-emptydir-data --timeout=120s
+
+# If it fails, retry with a longer timeout (pods are already evicting)
+kubectl drain <node> --ignore-daemonsets --delete-emptydir-data --timeout=300s --force
+```
+
+**Note**: After a failed drain, the node is already cordoned. A second drain
+attempt only needs to wait for already-evicting pods to finish.
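+
+The retry can also be scripted. A minimal sketch, assuming a hypothetical helper
+named `drain_with_retry` that tries each timeout (in seconds) in turn; the
+timeouts in the usage line are illustrative:
+```bash
+# Hypothetical helper: drain a node, retrying with each timeout given
+drain_with_retry() {
+  local node="$1"; shift
+  local timeout
+  for timeout in "$@"; do
+    if kubectl drain "$node" --ignore-daemonsets --delete-emptydir-data \
+        --timeout="${timeout}s"; then
+      return 0
+    fi
+    echo "drain of $node did not finish within ${timeout}s" >&2
+  done
+  return 1                    # every attempt timed out or failed
+}
+
+# Usage: drain_with_retry <node> 120 300
+```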
+
+### 4. Recovery from partial failure
+
+If the script fails mid-way (after drain but before uncordon):
+
+```bash
+# Check VM status
+ssh root@192.168.1.127 "qm status <vmid>"
+
+# Start VM if stopped
+ssh root@192.168.1.127 "qm start <vmid>"
+
+# Uncordon node
+kubectl --kubeconfig "$(pwd)/config" uncordon <node-name>
+```
+
+## Verification
+
+After successful expansion:
+```bash
+# On the VM
+df -h /
+# Should show new size (128G disk → ~126G usable for ext4)
+
+# On the cluster
+kubectl get node <name>
+# Should show Ready status
+```
+
+## Notes
+
+- The k8s node VMs use direct partition layout (`/dev/sda1`), not LVM, despite
+  the script handling both paths
+- `growpart` returns exit code 1 for "NOCHANGE" (partition already at max) —
+  this is not an error
+- Proxmox `qm resize` uses `scsi0` as the disk identifier for these VMs
+- SSH host keys may change if VMs are recreated or the network changes. Automated
+  scripts can pass `-o StrictHostKeyChecking=no`, but this disables host-key
+  verification, so use it only on a trusted network
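+
+The `NOCHANGE` exit code noted above can be handled explicitly so that a script
+running under `set -e` does not abort. A sketch, assuming a hypothetical wrapper
+named `grow_partition`:
+```bash
+# Hypothetical wrapper: treat growpart's NOCHANGE (exit code 1) as success
+grow_partition() {
+  local disk="$1" part="$2" out rc=0
+  out="$(sudo growpart "$disk" "$part" 2>&1)" || rc=$?
+  if [ "$rc" -eq 0 ]; then
+    echo "partition grown"
+  elif [ "$rc" -eq 1 ] && [[ "$out" == *NOCHANGE* ]]; then
+    echo "partition already at maximum size"
+  else
+    echo "$out" >&2           # real failure: surface growpart's message
+    return "$rc"
+  fi
+}
+
+# Usage: grow_partition /dev/sda 1
+```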
+
+See also: `extend-vm-storage.md` (the operational skill for running the script)
--- a/.claude/skills/archived/python-filename-sanitization/SKILL.md
+++ b/.claude/skills/archived/python-filename-sanitization/SKILL.md
@ -0,0 +1,182 @@
+---
+name: python-filename-sanitization
+description: |
+  Secure filename sanitization pattern for Python web applications. Use when:
+  (1) Accepting user-provided filenames for file operations, (2) Building file
+  rename/upload functionality, (3) Preventing path traversal attacks (../../../etc/passwd),
+  (4) Preventing shell injection through filenames, (5) FastAPI/Flask file handling.
+  Provides regex-based whitelist approach with pathlib for safe file operations.
+author: Claude Code
+version: 1.0.0
+date: 2025-01-31
+---
+
+# Python Filename Sanitization
+
+## Problem
+User-provided filenames can contain malicious characters that enable path traversal
+attacks, shell injection, or filesystem corruption. Direct use of user input in
+file paths is a security vulnerability.
+
+## Context / Trigger Conditions
+- Building file upload, rename, or download functionality
+- User can specify filenames via API or form input
+- Files are stored on server filesystem
+- Need to prevent: `../`, shell metacharacters, null bytes, etc.
+
+## Solution
+
+### Complete Sanitization Function
+```python
+import re
+from pathlib import Path
+
+def sanitize_filename(filename: str, max_length: int = 200) -> str:
+    """
+    Sanitize a filename to prevent path traversal and shell injection.
+    Only allows alphanumeric characters, spaces, hyphens, underscores,
+    parentheses, and dots.
+    """
+    if not filename:
+        raise ValueError("Filename cannot be empty")
+
+    # Remove any path components (prevent path traversal)
+    filename = Path(filename).name
+
+    # Only allow safe characters: alphanumeric, space, hyphen, underscore, parentheses, dot
+    # This regex removes anything that isn't in the allowed set
+    safe_filename = re.sub(r'[^a-zA-Z0-9\s\-_().]', '', filename)
+
+    # Collapse multiple spaces/dots
+    safe_filename = re.sub(r'\s+', ' ', safe_filename)
+    safe_filename = re.sub(r'\.+', '.', safe_filename)
+
+    # Strip leading/trailing whitespace and dots
+    safe_filename = safe_filename.strip(' .')
+
+    # Limit length
+    if len(safe_filename) > max_length:
+        safe_filename = safe_filename[:max_length]
+
+    if not safe_filename:
+        raise ValueError("Filename contains no valid characters")
+
+    return safe_filename
+```
+
+### FastAPI Integration Example
+```python
+from fastapi import APIRouter, HTTPException
+from pydantic import BaseModel
+from pathlib import Path
+
+router = APIRouter()  # router instance used by the decorator below
+
+class RenameRequest(BaseModel):
+    new_name: str
+
+@router.patch("/files/{file_id}/rename")
+async def rename_file(file_id: str, request: RenameRequest):
+    """Rename a file with sanitized input."""
+    file_dir = Path("/data/files") / file_id
+
+    if not file_dir.exists():
+        raise HTTPException(status_code=404, detail="File not found")
+
+    # Find existing file
+    files = list(file_dir.glob("*"))
+    if not files:
+        raise HTTPException(status_code=404, detail="No file found")
+
+    current_file = files[0]
+    current_extension = current_file.suffix
+
+    # Sanitize the new name
+    try:
+        safe_name = sanitize_filename(request.new_name)
+    except ValueError as e:
+        raise HTTPException(status_code=400, detail=str(e))
+
+    # Preserve original extension
+    if not safe_name.lower().endswith(current_extension.lower()):
+        safe_name = safe_name + current_extension
+
+    # Create new path (same directory, new filename)
+    new_file = file_dir / safe_name
+
+    # Check for conflicts
+    if new_file.exists() and new_file != current_file:
+        raise HTTPException(status_code=400, detail="A file with that name already exists")
+
+    # Rename using pathlib (no shell commands!)
+    current_file.rename(new_file)
+
+    return {"status": "renamed", "new_filename": safe_name}
+```
+
+## Key Security Principles
+
+### 1. Whitelist, Don't Blacklist
+```python
+# BAD: Trying to block dangerous characters
+filename = filename.replace('../', '').replace('\x00', '')
+
+# GOOD: Only allow known-safe characters
+safe_filename = re.sub(r'[^a-zA-Z0-9\s\-_().]', '', filename)
+```
+
+### 2. Use pathlib, Not Shell Commands
+```python
+# BAD: Shell command (vulnerable to injection)
+os.system(f'mv "{old_path}" "{new_path}"')
+
+# GOOD: Pure Python (no shell)
+old_path.rename(new_path)
+```
+
+### 3. Extract Basename First
+```python
+# BAD: User could submit "../../../etc/passwd"
+filename = user_input
+
+# GOOD: Extract just the filename part
+filename = Path(user_input).name
+```
+
+### 4. Validate After Sanitization
+```python
+# Ensure something remains after sanitization
+if not safe_filename:
+    raise ValueError("Filename contains no valid characters")
+```
+
+## Verification
+```python
+# Test cases that should be handled safely
+assert sanitize_filename("normal.txt") == "normal.txt"
+assert sanitize_filename("../../../etc/passwd") == "passwd"  # only the basename survives
+assert sanitize_filename("file; rm -rf /") == "file rm -rf"
+assert sanitize_filename("  spaces  .txt") == "spaces .txt"  # inner space is kept
+assert sanitize_filename("$(whoami).txt") == "(whoami).txt"  # parentheses are allowed
+
+# Test cases that should raise ValueError
+for bad in ("", "$#@!"):  # empty input; input with no valid characters
+    try:
+        sanitize_filename(bad)
+    except ValueError:
+        pass
+    else:
+        raise AssertionError(f"expected ValueError for {bad!r}")
+```
+
+## Notes
+- This is intentionally restrictive; expand the regex if you need Unicode support
+- For Unicode filenames, consider `unicodedata.normalize('NFKD', ...)` first
+- Max length of 200 is conservative; filesystem limits vary (255 bytes typical)
+- Always preserve file extensions when renaming to avoid breaking file associations
+- Consider adding a UUID prefix for guaranteed uniqueness in upload scenarios
+
+## References
+- [OWASP Path Traversal](https://owasp.org/www-community/attacks/Path_Traversal)
+- [CWE-22: Path Traversal](https://cwe.mitre.org/data/definitions/22.html)
+- [Python pathlib documentation](https://docs.python.org/3/library/pathlib.html)
--- a/.claude/skills/archived/sops-age-secrets-migration/SKILL.md
+++ b/.claude/skills/archived/sops-age-secrets-migration/SKILL.md
@ -0,0 +1,116 @@
+---
+name: sops-age-secrets-migration
+description: |
+  Migrate from git-crypt to SOPS + age for multi-user secret management in a
+  Terraform/Terragrunt infrastructure repo. Use when: (1) need per-user secret
+  access control (git-crypt is all-or-nothing), (2) want operators to push PRs
+  without seeing secrets (CI decrypts), (3) migrating from a single encrypted
+  terraform.tfvars to structured secret management. Covers: JSON format (not YAML
+  — Terraform can't parse YAML tfvars), race condition avoidance with parallel
+  terragrunt applies, CI pipeline integration with Woodpecker, age key management,
+  and the complete migration sequence.
+author: Claude Code
+version: 1.0.0
+date: 2026-03-07
+---
+
+# SOPS + age Secrets Migration from git-crypt
+
+## Problem
+git-crypt encrypts entire files — anyone with the key decrypts everything. For multi-user
+setups where operators should push code without seeing secrets, you need per-value encryption
+with CI-only decryption.
+
+## Context / Trigger Conditions
+- Single `terraform.tfvars` encrypted with git-crypt containing 100+ secrets
+- Need to onboard operators who shouldn't see API keys, passwords, SSH keys
+- Want GitOps (secrets in git) but with access control
+- Terraform/Terragrunt stack-per-service architecture
+
+## Solution
+
+### 1. Use JSON, not YAML
+SOPS outputs the same format as input. `sops -d file.yaml` → YAML. `sops -d file.json` → JSON.
+Terraform natively supports `*.auto.tfvars.json` files. YAML is NOT valid HCL.
+
+```
+secrets.sops.json → sops -d → secrets.auto.tfvars.json → Terraform reads it
+```
+
+### 2. Split tfvars into config + secrets
+```
+config.tfvars          ← plaintext (hostnames, IPs, DNS records)
+secrets.sops.json      ← SOPS-encrypted (passwords, tokens, keys)
+```
+
+### 3. Global decrypt, not per-stack hooks
+**CRITICAL**: Do NOT use `before_hook`/`after_hook` for decryption. With `terragrunt run --all`,
+70+ stacks run hooks in parallel, all writing to the same output file — race condition.
+
+Instead, use a wrapper script that decrypts once:
+```bash
+#!/usr/bin/env bash
+# scripts/tg: decrypt once, then run terragrunt
+set -euo pipefail
+REPO_ROOT="$(cd "$(dirname "$0")/.." && pwd)"
+ENC="$REPO_ROOT/secrets.sops.json"
+DEC="$REPO_ROOT/secrets.auto.tfvars.json"
+if [ ! -f "$DEC" ] || [ "$ENC" -nt "$DEC" ]; then
+  # Decrypt to a temp file first so a failed decrypt never leaves a truncated output
+  sops -d "$ENC" > "$DEC.tmp"
+  mv "$DEC.tmp" "$DEC"
+fi
+exec terragrunt "$@"
+```
+
+### 4. Terragrunt loads both (backward compatible)
+```hcl
+terraform {
+  extra_arguments "common_vars" {
+    commands = get_terraform_commands_that_need_vars()
+    required_var_files = ["${get_repo_root()}/config.tfvars"]
+    optional_var_files = [
+      "${get_repo_root()}/terraform.tfvars",        # legacy (git-crypt)
+      "${get_repo_root()}/secrets.auto.tfvars.json"  # new (SOPS)
+    ]
+  }
+  before_hook "check_secrets" {
+    commands = ["apply", "plan", "destroy"]
+    execute  = ["test", "-f", "${get_repo_root()}/secrets.auto.tfvars.json"]
+  }
+}
+```
+
+### 5. Complex types work in JSON
+Maps, lists, nested objects, multiline strings (SSH keys as `\n`-escaped) all work:
+```json
+{
+  "simple_password": "abc123",
+  "mailserver_accounts": {"user@domain": "pass"},
+  "ssh_key": "-----BEGIN OPENSSH PRIVATE KEY-----\nb3Blbn...\n-----END OPENSSH PRIVATE KEY-----\n"
+}
+```
+
+### 6. CI integration (Woodpecker)
+- Store age private key as CI secret (`SOPS_AGE_KEY`)
+- Write to temp file for `SOPS_AGE_KEY_FILE` (Woodpecker `from_secret` only does env vars)
+- `git add stacks/ state/ .woodpecker/` — NEVER `git add .`
+- Cleanup step with `status: [success, failure]`
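+
+The temp-file step above might look like the following inside a CI step. This is a
+sketch under assumptions: the helper name `setup_age_key_file` is hypothetical,
+and `SOPS_AGE_KEY` is expected to be injected by Woodpecker's `from_secret`:
+```bash
+# Hypothetical CI step helper: expose the age key to sops via a temp file
+setup_age_key_file() {
+  SOPS_AGE_KEY_FILE="$(mktemp)"                 # mktemp creates the file with mode 0600
+  export SOPS_AGE_KEY_FILE
+  printf '%s\n' "$SOPS_AGE_KEY" > "$SOPS_AGE_KEY_FILE"
+  trap 'rm -f "$SOPS_AGE_KEY_FILE"' EXIT        # remove the key when the step exits
+}
+
+# Usage: setup_age_key_file && sops -d secrets.sops.json > secrets.auto.tfvars.json
+```
+The `trap` only covers the current step; the separate cleanup step is still needed
+for files that outlive it, such as the decrypted tfvars.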
+
+## Verification
+```bash
+# Encrypt
+sops -e -i secrets.sops.json
+
+# Decrypt and verify
+sops -d secrets.sops.json | jq .
+
+# Verify SSH keys
+sops -d secrets.sops.json | jq -r '.ssh_key' | ssh-keygen -l -f -
+
+# Test with terragrunt
+scripts/tg validate
+```
+
+## Notes
+- Keep git-crypt for binary files (TLS certs, deploy keys): SOPS can only encrypt them as opaque whole-file blobs, so it adds no per-value benefit there
+- `sensitive = true` on all secret variable declarations — prevents plan output leaks
+- Don't add `sensitive = true` to non-secret variables just because they sound secret (e.g., `tls_secret_name`) or to values used for iteration (e.g., `ingress_path`): Terraform rejects sensitive values as `for_each` arguments
+- Age keys are one line — much simpler than GPG
+- `.sops.yaml` path_regex should be anchored: `^secrets\.sops\.json$`
--- a/.claude/skills/archived/terraform-state-identity-mismatch/SKILL.md
+++ b/.claude/skills/archived/terraform-state-identity-mismatch/SKILL.md
@ -0,0 +1,97 @@
+---
+name: terraform-state-identity-mismatch
+description: |
+  Fix Terraform "Unexpected Identity Change" errors during plan/apply. Use when:
+  (1) Terraform fails with "the Terraform Provider unexpectedly returned a different 
+  identity", (2) State refresh shows identity mismatch between stored and current values,
+  (3) Resource was created but terraform apply timed out, leaving state inconsistent.
+  Solution involves removing and reimporting the affected resource.
+author: Claude Code
+version: 1.0.0
+date: 2026-01-28
+---
+
+# Terraform State Identity Mismatch Fix
+
+## Problem
+Terraform fails during plan or apply with an "Unexpected Identity Change" error, 
+indicating the stored state identity doesn't match what the provider returns when 
+reading the resource.
+
+## Context / Trigger Conditions
+- Error message contains: "Unexpected Identity Change: During the read operation, 
+  the Terraform Provider unexpectedly returned a different identity"
+- Often occurs after a terraform apply times out mid-creation
+- Resource exists in the cluster/cloud but state is corrupted
+- Common with Kubernetes provider after deployment rollout timeouts
+
+## Solution
+
+### Step 1: Identify the affected resource
+The error message includes the resource address:
+```
+with module.kubernetes_cluster.module.resume["resume"].kubernetes_deployment.resume
+```
+
+### Step 2: Remove from state
+```bash
+terraform state rm 'module.kubernetes_cluster.module.resume["resume"].kubernetes_deployment.resume'
+```
+Note: Use single quotes around the address so the shell does not interpret the brackets and double quotes.
+
+### Step 3: Import the resource back
+```bash
+terraform import 'module.kubernetes_cluster.module.resume["resume"].kubernetes_deployment.resume' <namespace>/<name>
+```
+For Kubernetes deployments, the import ID is `namespace/deployment-name`.
+
+### Step 4: Verify with plan
+```bash
+terraform plan -target=<module-path>
+```
+Should show minimal or no changes if import was successful.
+
+### Step 5: Apply to sync any drift
+```bash
+terraform apply -target=<module-path>
+```
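+
+Steps 2 to 4 can be wrapped in one helper. A sketch, assuming a hypothetical
+function name `reimport_resource`; the state backup via `terraform state pull` is
+an extra precaution that is not part of the steps above:
+```bash
+# Hypothetical helper: back up state, then remove, re-import, and plan one resource
+reimport_resource() {
+  local addr="$1" import_id="$2"
+  # Save a copy of the current state before modifying it
+  terraform state pull > "state-backup-$(date +%Y%m%d-%H%M%S).tfstate" &&
+    terraform state rm "$addr" &&
+    terraform import "$addr" "$import_id" &&
+    terraform plan -target="$addr"
+}
+
+# Usage: reimport_resource '<resource-address>' '<namespace>/<name>'
+```
+Chaining with `&&` stops at the first failing command, so a failed import is never
+followed by a plan.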
+
+## Verification
+- `terraform plan` runs without identity errors
+- `terraform apply` completes successfully
+- Resource still exists and functions correctly
+
+## Example
+**Error:**
+```
+Error: Unexpected Identity Change
+
+Current Identity: cty.ObjectVal(map[string]cty.Value{"api_version":cty.NullVal...})
+New Identity: cty.ObjectVal(map[string]cty.Value{"api_version":cty.StringVal("apps/v1")...})
+
+with module.kubernetes_cluster.module.resume["resume"].kubernetes_deployment.resume
+```
+
+**Fix:**
+```bash
+terraform state rm 'module.kubernetes_cluster.module.resume["resume"].kubernetes_deployment.resume'
+# Output: Removed ... Successfully removed 1 resource instance(s).
+
+terraform import 'module.kubernetes_cluster.module.resume["resume"].kubernetes_deployment.resume' resume/resume
+# Output: Import successful!
+
+terraform apply -target=module.kubernetes_cluster.module.resume -auto-approve
+# Output: Apply complete! Resources: 0 added, 1 changed, 0 destroyed.
+```
+
+## Notes
+- This is a provider bug, not user error - consider reporting to provider maintainers
+- The resource continues to work fine; only the terraform state is affected
+- Always verify the resource exists before importing (don't import non-existent resources)
+- For Kubernetes resources, import IDs are typically `namespace/name`
+- For AWS resources, import IDs vary by resource type (check provider docs)
+- If a stale state lock blocks recovery, prefer `terraform force-unlock <LOCK_ID>`; use `-lock=false` only when you are sure no other run is touching the state
+
+## See Also
+- Terraform state management documentation
+- Kubernetes provider import documentation
--- a/.claude/skills/archived/traefik-helm-configuration/SKILL.md
+++ b/.claude/skills/archived/traefik-helm-configuration/SKILL.md
@ -0,0 +1,405 @@
+---
+name: traefik-helm-configuration
+description: |
+  Consolidated Traefik Helm chart configuration skill covering HTTP/3 (QUIC), UDP
+  cross-namespace routing, and plugin download failures. Use when:
+  (1) enabling HTTP/3 on Traefik or Alt-Svc header shows wrong port (e.g., 8443 instead of 443),
+  (2) HTTP/3 is configured in Helm values but not working end-to-end,
+  (3) Cloudflare-proxied domains need HTTP/3 enabled,
+  (4) custom UDP entrypoints don't appear in the LoadBalancer Service,
+  (5) IngressRouteUDP logs show "udp service is not in the parent resource namespace",
+  (6) DNS or other UDP traffic through Traefik times out despite correct IngressRouteUDP config,
+  (7) all Traefik routes suddenly return 404 after a restart or pod recreation,
+  (8) Traefik logs show "Plugins are disabled because an error has occurred",
+  (9) plugin download fails with "context deadline exceeded" for crowdsec-bouncer or rewrite-body.
+author: Claude Code
+version: 1.0.0
+date: 2026-02-22
+---
+
+# Traefik Helm Chart Configuration
+
+Consolidated guide for three common Traefik Helm chart issues: HTTP/3 (QUIC) enablement,
+UDP cross-namespace routing, and plugin download failures causing global 404s.
+
+---
+
+## HTTP/3 (QUIC)
+
+### Problem
+
+You want to enable HTTP/3 (QUIC) on a Traefik ingress controller in Kubernetes so that
+clients can negotiate HTTP/3 connections via the `Alt-Svc` response header.
+
+### Context / When to Use
+
+- Enabling HTTP/3 for the first time on Traefik
+- Troubleshooting HTTP/3 not working despite configuration
+- Alt-Svc header shows internal container port (8443) instead of external port (443)
+- Need to enable HTTP/3 on both origin (Traefik) and CDN (Cloudflare)
+
+### Solution
+
+#### Step 1: Configure Traefik Helm Chart Values
+
+In the Traefik Helm release values, add `http3` configuration to the `websecure` entrypoint:
+
+```hcl
+# In modules/kubernetes/traefik/main.tf
+ports = {
+  websecure = {
+    port        = 8443
+    exposedPort = 443
+    protocol    = "TCP"
+    http = {
+      tls = {
+        enabled = true
+      }
+    }
+    # Enable HTTP/3 (QUIC)
+    http3 = {
+      enabled        = true
+      advertisedPort = 443  # CRITICAL: Must match the external port
+    }
+  }
+}
+```
+
+**Key gotcha: `advertisedPort = 443`**
+
+Without `advertisedPort`, Traefik advertises the *internal container port* (8443) in the
+`Alt-Svc` header:
+```
+Alt-Svc: h3=":8443"; ma=2592000
+```
+
+This is wrong because clients connect on external port 443, not 8443. The correct header is:
+```
+Alt-Svc: h3=":443"; ma=2592000
+```
+
+Setting `advertisedPort = 443` fixes this.
+
+#### Step 2: Ensure Helm Chart Fully Re-renders
+
+Changing `http3.enabled=true` in values alone may not cause the Helm chart to add the
+required UDP port to the Service and Deployment specs. The Traefik Helm chart templates
+need to re-render to include `websecure-http3: 443/UDP` in the Service.
+
+If the Service doesn't show a UDP port after applying:
+- See the companion skill `helm-release-force-rerender` for fixing this
+- A likely root cause is value reuse on upgrade (`helm upgrade --reuse-values`, or
+  `reuse_values = true` on a Terraform `helm_release`), which may not trigger
+  template re-rendering for structural changes like adding new ports
+
+After a successful apply, verify the Service has the UDP port:
+```bash
+kubectl get svc traefik -n traefik -o yaml | grep -A5 "443"
+```
+
+Expected output should include both:
+```yaml
+- name: websecure
+  port: 443
+  protocol: TCP
+  targetPort: websecure
+- name: websecure-http3
+  port: 443
+  protocol: UDP
+  targetPort: websecure-http3
+```
+
+#### Step 3: Enable HTTP/3 on Cloudflare (if using Cloudflare proxy)
+
+For Cloudflare-proxied domains, HTTP/3 must also be enabled at the Cloudflare zone level.
+
+**Cloudflare Provider v4** (current in this repo):
+```hcl
+resource "cloudflare_zone_settings_override" "http3" {
+  zone_id = var.cloudflare_zone_id
+
+  settings {
+    http3 = "on"  # String values: "on" or "off"
+  }
+}
+```
+
+**Note**: In Cloudflare provider v5, this uses `cloudflare_zone_setting` (singular) with
+different syntax. The v4 resource is `cloudflare_zone_settings_override` (plural + override).
+
+#### Step 4: Verify End-to-End
+
+##### Testing from macOS
+
+macOS system curl does NOT support HTTP/3. Install curl with HTTP/3:
+```bash
+brew install curl
+```
+
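+Then use the Homebrew version explicitly (the path below is for Apple Silicon; on
+Intel Macs use `/usr/local/opt/curl/bin/curl`, or `$(brew --prefix curl)/bin/curl`
+on either):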
+```bash
+# Test HTTP/3 negotiation (Alt-Svc header)
+/opt/homebrew/opt/curl/bin/curl -sI https://example.viktorbarzin.me 2>&1 | grep -i alt-svc
+# Expected: alt-svc: h3=":443"; ma=2592000
+
+# Test actual HTTP/3 connection
+/opt/homebrew/opt/curl/bin/curl --http3-only -sI https://example.viktorbarzin.me
+# Expected: HTTP/3 200
+```
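+
+To check the advertised port in a script rather than by eye, the header can be
+parsed with bash's `[[ =~ ]]`. A sketch, assuming a hypothetical helper named
+`alt_svc_h3_port`:
+```bash
+# Hypothetical helper: print the h3 port advertised in an Alt-Svc header line
+alt_svc_h3_port() {
+  local header="$1"
+  local re='h3=":([0-9]+)"'
+  [[ "$header" =~ $re ]] && printf '%s\n' "${BASH_REMATCH[1]}"
+}
+
+# Usage (same Homebrew curl and example host as above):
+#   hdr=$(/opt/homebrew/opt/curl/bin/curl -sI https://example.viktorbarzin.me | grep -i '^alt-svc')
+#   [ "$(alt_svc_h3_port "$hdr")" = "443" ] || echo "unexpected advertised port" >&2
+```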
+
+##### Testing from within the Cluster
+
+```bash
+# Use a curl image with HTTP/3 support (amd64 only)
+kubectl run curl-h3 --rm -it --image=ymuski/curl-http3 --restart=Never -- \
+  curl --http3-only -sI https://example.viktorbarzin.me
+
+# Note: ymuski/curl-http3 is amd64-only; it will fail on arm64 nodes
+```
+
+##### Checking Traefik Logs
+
+```bash
+kubectl logs -n traefik -l app.kubernetes.io/name=traefik --tail=100 | grep -i quic
+```
+
+### Verification Checklist
+
+1. Traefik Service shows UDP port 443 (`websecure-http3`)
+2. `Alt-Svc` response header shows `h3=":443"` (not `h3=":8443"`)
+3. `/opt/homebrew/opt/curl/bin/curl --http3-only` successfully connects
+4. Cloudflare zone has HTTP/3 enabled (for proxied domains)
+
+### Current Configuration (This Repo)
+
+- **Traefik config**: `modules/kubernetes/traefik/main.tf` (lines 89-92)
+- **Cloudflare HTTP/3**: `modules/kubernetes/cloudflared/cloudflare.tf` (line 153)
+- **MetalLB IP**: 10.0.20.202 (Traefik LoadBalancer service)
+
+### Notes
+
+- HTTP/3 uses QUIC over UDP. Firewalls must allow UDP 443 inbound.
+- Traefik automatically handles TLS for HTTP/3 using the same certs as HTTPS.
+- The `Alt-Svc` header is sent on HTTP/1.1 and HTTP/2 responses to tell clients HTTP/3
+  is available. Clients can then switch to HTTP/3 on subsequent requests.
+- For non-Cloudflare (direct DNS) domains, only the Traefik-side config is needed.
+- Cloudflare handles its own HTTP/3 negotiation with end users; the origin connection
+  between Cloudflare and Traefik uses HTTP/1.1 or HTTP/2 (not HTTP/3).
+
+---
+
+## UDP Cross-Namespace Routing
+
+### Problem
+
+Adding a custom UDP entrypoint (e.g., DNS on port 53) to Traefik v3 via Helm chart values
+doesn't work out of the box. Traffic times out even though the Traefik pod listens on the
+port internally. Two separate issues compound:
+
+1. The Helm chart defaults `expose` to `false` for custom entrypoints -- the port is never
+   added to the LoadBalancer Service
+2. `allowCrossNamespace` defaults to `false` -- IngressRouteUDP in namespace A can't
+   reference a Service in namespace B
+
+### Context / Trigger Conditions
+
+- Traefik Helm chart v39.0.0+ (Traefik v3.x)
+- Custom UDP entrypoint defined in `ports` values
+- `IngressRouteUDP` referencing a service in a different namespace
+- Symptoms:
+  - `kubectl get svc traefik` doesn't show your custom UDP port
+  - UDP traffic to the LoadBalancer IP times out
+  - Traefik logs show: `"udp service <namespace>/<service> is not in the parent resource namespace <traefik-namespace>"`
+  - `netstat -ulnp` inside Traefik pod confirms it IS listening on the port
+
+### Solution
+
+#### Fix 1: Expose the UDP port on the Service
+
+In the Helm values, add `expose = { default = true }` to the entrypoint:
+
+```hcl
+# Terraform HCL
+ports = {
+  dns-udp = {
+    port        = 5353
+    exposedPort = 53
+    protocol    = "UDP"
+    expose      = { default = true }  # <-- Required for custom entrypoints
+  }
+}
+```
+
+```yaml
+# Helm values YAML equivalent
+ports:
+  dns-udp:
+    port: 5353
+    exposedPort: 53
+    protocol: UDP
+    expose:
+      default: true
+```
+
+Note: The built-in `web` and `websecure` entrypoints have `expose.default = true` by
+default, but custom entrypoints do NOT.
+
+#### Fix 2: Enable cross-namespace CRD references
+
+In the Helm values, add `allowCrossNamespace = true` to the kubernetesCRD provider:
+
+```hcl
+# Terraform HCL
+providers = {
+  kubernetesCRD = {
+    enabled              = true
+    allowCrossNamespace  = true  # <-- Required for cross-namespace IngressRouteUDP
+  }
+}
+```
+
+```yaml
+# Helm values YAML
+providers:
+  kubernetesCRD:
+    enabled: true
+    allowCrossNamespace: true
+```
+
+This is required whenever an `IngressRouteUDP` (or `IngressRouteTCP`, `IngressRoute`)
+references a Kubernetes Service in a different namespace.
+
+### Verification
+
+```bash
+# 1. Verify the port appears in the Service
+kubectl get svc -n traefik traefik -o jsonpath='{.spec.ports[*].name}'
+# Should include your custom entrypoint name (e.g., "dns-udp")
+
+# 2. Check Traefik logs for cross-namespace errors
+kubectl logs -n traefik -l app.kubernetes.io/name=traefik | grep "not in the parent resource namespace"
+# Should return nothing after the fix
+
+# 3. Test the UDP service
+dig @<traefik-lb-ip> example.com
+```
+
+### Example
+
+DNS forwarding through Traefik to Technitium DNS:
+- IngressRouteUDP in `traefik` namespace routes `dns-udp` entrypoint to
+  `technitium-dns:53` in `technitium` namespace
+- Without Fix 1: port 53 never exposed on LoadBalancer -- traffic can't reach Traefik
+- Without Fix 2: Traefik rejects the route -- logs error every ~60 seconds
+- With both fixes: DNS queries to LoadBalancer IP:53 -> Traefik -> Technitium
+
+### Notes
+
+1. **Debugging order matters**: Fix 1 (expose) must come first. Without the port on the
+   Service, you can't even test if the routing works. Fix 2 (cross-namespace) errors only
+   appear in Traefik logs, not as user-visible failures.
+2. **`allowCrossNamespace` is a security consideration**: It allows any IngressRoute CRD
+   to reference services in any namespace. If this is too broad, consider a
+   `TraefikService` resource (HTTP routes only) or moving the IngressRouteUDP to the
+   target namespace.
+3. **Rolling update**: Changing `allowCrossNamespace` triggers a Traefik pod restart
+   (new CLI args). Changing `expose` only updates the Service (no pod restart needed).
+4. **This applies to TCP too**: `IngressRouteTCP` with cross-namespace services needs the
+   same `allowCrossNamespace` setting.
+
+---
+
+## Plugin Download Failure (Global 404)
+
+### Problem
+
+After a node maintenance operation (containerd restart, node drain/uncordon, etc.),
+all Traefik-managed routes return 404. Services, Ingresses, and Middlewares all exist
+and look correct, making this extremely confusing to debug.
+
+### Context / Trigger Conditions
+
+- ALL Traefik routes return 404 simultaneously (not just one service)
+- Traefik pods are Running and Ready
+- Ingress resources exist with correct annotations
+- Middlewares exist in the correct namespaces
+- TLS secrets exist
+- Traefik startup logs contain: `Plugins are disabled because an error has occurred`
+- Plugin download error: `unable to download plugin ... context deadline exceeded`
+- Happened after a node restart, containerd restart, or network disruption
+
+### Root Cause
+
+Traefik downloads plugins (crowdsec-bouncer, rewrite-body, etc.) from
+`plugins.traefik.io` on **every pod startup**. If the download fails (network
+unreachable, DNS not ready, timeout), Traefik **disables ALL plugins entirely**.
+
+Since the `crowdsec` middleware is a plugin-based middleware referenced in virtually
+every Ingress annotation (`traefik-crowdsec@kubernetescrd`), Traefik treats the
+missing plugin middleware as a fatal routing error and returns 404 for every route
+that references it -- which is typically all of them.
+
+### Solution
+
+```bash
+# 1. Confirm the diagnosis - check Traefik startup logs
+kubectl logs -n traefik -l app.kubernetes.io/name=traefik | head -20
+# Look for: "Plugins are disabled because an error has occurred"
+
+# 2. Verify outbound connectivity is restored
+kubectl exec -n traefik $(kubectl get pods -n traefik -l app.kubernetes.io/name=traefik \
+  -o jsonpath='{.items[0].metadata.name}') -- wget -q -O- --timeout=5 https://plugins.traefik.io
+
+# 3. Rollout restart to retry plugin download
+kubectl rollout restart deployment -n traefik traefik
+
+# 4. Verify plugins loaded
+kubectl logs -n traefik -l app.kubernetes.io/name=traefik | grep "Plugins"
+# Should show: "Plugins loaded."
+
+# 5. Verify routes work
+curl -s -o /dev/null -w "%{http_code}" -H "Host: viktorbarzin.me" https://10.0.20.202 -k
+# Should return 200 instead of 404
+```
+
+### Verification
+
+- Traefik logs show `Plugins loaded.` (not `Plugins are disabled`)
+- Routes return expected HTTP status codes (200, 302, etc.) instead of 404
+- `kubectl logs -n traefik <pod> | grep "does not exist"` shows no middleware errors
+
+### Why This Is Hard to Debug
+
+1. **Traefik pods show Running/Ready** -- health checks pass even without plugins
+2. **All Kubernetes resources look correct** -- Ingresses, Services, Middlewares all exist
+3. **The error is in startup logs only** -- not in per-request logs (requests just get 404)
+4. **The 404 is Traefik's default** -- same as "no route matched", not a backend error
+5. **The middleware error is logged once at startup** -- easy to miss in a stream of logs
+
+### Prevention
+
+- During planned maintenance (node drain, containerd restart), restart Traefik pods
+  AFTER network connectivity is confirmed restored
+- Consider pre-caching Traefik plugins in the container image or using an init container
+- Monitor for the `Plugins are disabled` log message in your alerting system
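+
+A minimal sketch of the log check behind the monitoring suggestion, assuming a
+hypothetical helper named `plugins_disabled` that reads Traefik logs on stdin; how
+the alert is delivered is left open:
+```bash
+# Hypothetical check: succeed if the piped-in Traefik logs report disabled plugins
+plugins_disabled() {
+  grep -q 'Plugins are disabled because an error has occurred'
+}
+
+# Usage:
+#   kubectl logs -n traefik -l app.kubernetes.io/name=traefik | plugins_disabled \
+#     && echo "ALERT: Traefik plugins are disabled" >&2
+```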
+
+### Notes
+
+- This affects ALL plugin-based middlewares, not just crowdsec
+- The `rewrite-body` plugin (used for rybbit analytics injection) is also affected
+- Traefik v3.x downloads plugins on every startup; there is no persistent cache
+- If only some routes return 404, the problem is likely different (missing middleware
+  or TLS secret, not a plugin issue)
+
+---
+
+## References
+
+- [Traefik HTTP/3 Documentation](https://doc.traefik.io/traefik/routing/entrypoints/#http3)
+- [Traefik Helm Chart Values](https://github.com/traefik/traefik-helm-chart/blob/master/traefik/values.yaml)
+- [Cloudflare HTTP/3 Settings](https://developers.cloudflare.com/speed/optimization/protocol/http3/)
+- [Traefik Helm Chart Ports Configuration](https://github.com/traefik/traefik-helm-chart)
+- [Traefik v3 Providers Documentation](https://doc.traefik.io/traefik/providers/kubernetes-crd/)
+
+## See Also
+
+- `traefik-rewrite-body-troubleshooting` -- Traefik rewrite-body plugin troubleshooting (compression, Accept header issues)
+- `helm-release-force-rerender` -- Force Helm chart re-render when structural changes don't take effect
--- a/.claude/skills/archived/traefik-rewrite-body-troubleshooting/SKILL.md
+++ b/.claude/skills/archived/traefik-rewrite-body-troubleshooting/SKILL.md
@ -0,0 +1,200 @@
+---
+name: traefik-rewrite-body-troubleshooting
+description: |
+  Troubleshooting guide for the Traefik rewrite-body plugin (packruler/rewrite-body).
+  Covers two failure modes: (1) Compression failure -- plugin logs "flate: corrupt input
+  before offset 5" when backends send gzip-compressed responses, corrupting response
+  bodies and breaking WebSocket connections, authentication flows, and mobile app
+  connectivity. (2) Silent skip -- plugin silently skips content injection (rybbit
+  analytics, trap links, or any HTML rewriting) when the request Accept header doesn't
+  contain "text/html" (e.g., curl's default Accept: */*), making it appear broken
+  despite correct configuration.
+author: Claude Code
+version: 1.0.0
+date: 2026-02-22
+---
+
+# Traefik Rewrite-Body Plugin Troubleshooting
+
+Two distinct failure modes for the `packruler/rewrite-body` Traefik plugin used for
+injecting analytics scripts (rybbit) and anti-AI trap links into HTML responses.
+
+---
+
+## Problem 1: Compression Failure
+
+### Symptoms
+- Traefik logs show: `Rewrite-Body | ERROR ... Error loading content: flate: corrupt input before offset 5`
+- Mobile apps (e.g., Home Assistant Companion) fail while browser works
+- HA Companion app shows repeated `GET /?external_auth=1` requests (auth loop)
+- WebSocket connections (`/api/websocket`) are very short-lived (seconds instead of minutes)
+- HTTP 499 errors on API calls (client disconnects due to corrupted responses)
+- Using `packruler/rewrite-body` plugin v1.2.0 with `monitoring.types = ["text/html"]`
+
+### Root Cause
+Despite the `monitoring.types = ["text/html"]` filter, the plugin attempts to decompress
+ALL responses before checking content type. When decompression fails on certain gzip
+encodings, it corrupts the response body, breaking:
+- WebSocket upgrade handshakes
+- Authentication flows (HA Companion app's `external_auth` callback)
+- Mobile app connectivity (while browser appears to work due to auto-reconnect)
+
+### Misleading Symptoms
+- HTTP/3 (QUIC) may appear to be the cause because HTTP/3 requests show 499 errors.
+  This is a red herring -- the rewrite-body plugin corruption affects all protocols.
+- WebSocket issues may look like a timeout or proxy configuration problem.
+- The `monitoring.types = ["text/html"]` config suggests the plugin should only touch
+  HTML, but it still processes all responses for decompression before filtering.
+
+### Solution
+
+#### Step 1: Create a strip-accept-encoding middleware
+Add a Traefik middleware that removes `Accept-Encoding` from requests, forcing
+backends to send uncompressed responses that the plugin can safely process:
+
+```hcl
+# In traefik/middleware.tf
+resource "kubernetes_manifest" "middleware_strip_accept_encoding" {
+  manifest = {
+    apiVersion = "traefik.io/v1alpha1"
+    kind       = "Middleware"
+    metadata = {
+      name      = "strip-accept-encoding"
+      namespace = kubernetes_namespace.traefik.metadata[0].name
+    }
+    spec = {
+      headers = {
+        customRequestHeaders = {
+          "Accept-Encoding" = ""
+        }
+      }
+    }
+  }
+  depends_on = [helm_release.traefik]
+}
+```
+
+#### Step 2: Add middleware to routes with rewrite-body
+In the ingress factory middleware chain, add `strip-accept-encoding` BEFORE the
+rewrite-body middleware:
+
+```hcl
+var.rybbit_site_id != null ? "traefik-strip-accept-encoding@kubernetescrd" : null,
+var.rybbit_site_id != null ? "${var.namespace}-rybbit-analytics-${var.name}@kubernetescrd" : null,
+```
+
+The order matters: strip-accept-encoding must come first so the request reaches
+the backend without Accept-Encoding, and the uncompressed response then passes
+through the rewrite-body plugin.
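+
+The effect of the header-stripping step can be illustrated with plain `net/http`.
+This is a sketch of the idea only, not Traefik's implementation:
+`stripAcceptEncoding` and `seenEncoding` are hypothetical names, and the inner
+handler stands in for everything after the strip middleware in the chain.
+
+```go
+package main
+
+import (
+	"fmt"
+	"net/http"
+	"net/http/httptest"
+)
+
+// stripAcceptEncoding removes Accept-Encoding before calling next, so
+// nothing further down the chain is asked for a compressed response.
+func stripAcceptEncoding(next http.Handler) http.Handler {
+	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
+		r.Header.Del("Accept-Encoding")
+		next.ServeHTTP(w, r)
+	})
+}
+
+// seenEncoding sends one request through the chain and returns the
+// Accept-Encoding value observed by the innermost handler.
+func seenEncoding(clientValue string) string {
+	var seen string
+	backend := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
+		seen = r.Header.Get("Accept-Encoding")
+	})
+	req := httptest.NewRequest(http.MethodGet, "http://example.com/", nil)
+	req.Header.Set("Accept-Encoding", clientValue)
+	stripAcceptEncoding(backend).ServeHTTP(httptest.NewRecorder(), req)
+	return seen
+}
+
+func main() {
+	fmt.Printf("backend saw Accept-Encoding=%q\n", seenEncoding("gzip, br"))
+}
+```
+
+If the wrapper were applied after the backend call instead of before it, the
+backend would still see `gzip, br`, which is why the chain order matters.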
+
+### Verification (Compression Fix)
+1. Check Traefik logs for absence of `flate: corrupt input` errors:
+   ```bash
+   kubectl logs -n traefik -l app.kubernetes.io/name=traefik --tail=200 | grep -i "flate\|rewrite-body"
+   ```
+2. Verify the middleware chain includes strip-accept-encoding before rybbit:
+   ```bash
+   kubectl get ingress -n <namespace> <name> -o jsonpath='{.metadata.annotations.traefik\.ingress\.kubernetes\.io/router\.middlewares}'
+   ```
+3. Test mobile app connectivity (HA Companion, etc.)
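+
+The ordering check in step 2 can also be done mechanically. The sketch below
+assumes the annotation value is a comma-separated list of middleware names and
+matches on the name fragments used in this guide. `stripComesFirst` is a
+hypothetical helper, and the `demo`/`app` names in `main` are made up.
+
+```go
+package main
+
+import (
+	"fmt"
+	"strings"
+)
+
+// stripComesFirst reports whether a strip-accept-encoding entry appears
+// before the first rybbit-analytics entry in the annotation value.
+// It returns false if either entry is missing.
+func stripComesFirst(annotation string) bool {
+	strip, rybbit := -1, -1
+	for i, name := range strings.Split(annotation, ",") {
+		name = strings.TrimSpace(name)
+		if strip < 0 && strings.Contains(name, "strip-accept-encoding") {
+			strip = i
+		}
+		if rybbit < 0 && strings.Contains(name, "rybbit-analytics") {
+			rybbit = i
+		}
+	}
+	return strip >= 0 && rybbit >= 0 && strip < rybbit
+}
+
+func main() {
+	// Hypothetical annotation for a service "app" in namespace "demo".
+	value := "traefik-strip-accept-encoding@kubernetescrd,demo-rybbit-analytics-app@kubernetescrd"
+	fmt.Println(stripComesFirst(value))
+}
+```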
+
+### Notes (Compression)
+- This affects ALL services using the rewrite-body plugin, not just HA
+- The fix is applied conditionally: `strip-accept-encoding` is only added to the
+  middleware chain when `rybbit_site_id` is set, so services without analytics
+  are unaffected
+- Both `ingress_factory` and `reverse_proxy/factory` modules need the fix
+- Traefik may still compress responses to clients via its own compression middleware;
+  the strip only affects the backend request
+- The plugin's `monitoring.types` filter works for deciding what to rewrite, but
+  decompression is attempted on all responses regardless
+
+---
+
+## Problem 2: Silent Skip (Accept Header Mismatch)
+
+### Symptoms
+- rewrite-body middleware is in the ingress middleware chain and shows status "enabled" in Traefik API
+- `curl https://example.com/` returns original HTML with no injected content
+- Browser shows injected content (rybbit script, trap links, etc.)
+- No errors in Traefik logs -- the plugin silently skips processing
+- `monitoring.types = ["text/html"]` is configured in the middleware spec
+- Middleware chain order is correct (strip-accept-encoding before rewrite-body)
+
+### Root Cause
+In the plugin source code, `SupportsProcessing()` checks the **request** `Accept`
+header (not the response `Content-Type`) against `monitoring.types`:
+
+```go
+func (r *Rewriter) SupportsProcessing(req *http.Request) bool {
+    accept := req.Header.Get("Accept")
+    for _, monitoringType := range r.monitoring.Types {
+        if strings.Contains(accept, monitoringType) {
+            return true
+        }
+    }
+    return false
+}
+```
+
+It uses `strings.Contains(accept, "text/html")`. The curl default `Accept: */*` does
+NOT contain the substring `text/html`, so the plugin returns false and skips all
+processing. Browser requests include `Accept: text/html,application/xhtml+xml,...`
+which does match.
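+
+The substring behaviour is easy to confirm in isolation. The following is a
+standalone sketch of the same check, not the plugin's code: `acceptMatches` is a
+hypothetical function that takes the header value and type list directly.
+
+```go
+package main
+
+import (
+	"fmt"
+	"strings"
+)
+
+// acceptMatches reports whether any configured type appears verbatim
+// as a substring of the Accept header value. Wildcards are not expanded.
+func acceptMatches(accept string, types []string) bool {
+	for _, t := range types {
+		if strings.Contains(accept, t) {
+			return true
+		}
+	}
+	return false
+}
+
+func main() {
+	types := []string{"text/html"}
+	for _, accept := range []string{
+		"*/*", // curl default
+		"text/html,application/xhtml+xml", // browser-like
+		"application/json",                // typical API client
+		"",                                // header absent
+	} {
+		fmt.Printf("%q -> %v\n", accept, acceptMatches(accept, types))
+	}
+}
+```
+
+Only the browser-like value yields `true`; the wildcard, JSON, and empty values
+all yield `false`.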
+
+### Misleading Symptoms
+- Appears as if the middleware isn't working at all
+- May look like a middleware ordering issue or configuration error
+- `kubectl get middleware` shows the resource exists with correct spec
+- Traefik API (`/api/http/middlewares/`) shows the middleware as "enabled"
+- Checking the rewrite-body regex patterns seems pointless since nothing is being processed
+
+### Solution
+This is **working as designed** -- not a bug. The fix depends on context:
+
+#### For testing with curl
+Send a browser-like `Accept` header (curl sends `Accept: */*` by default):
+```bash
+curl -s -H "Accept: text/html,application/xhtml+xml" https://example.com/
+```
+
+#### For verifying injection is working
+```bash
+# Check for injected content (trap links, analytics, etc.)
+curl -s -H "Accept: text/html,application/xhtml+xml" https://example.com/ \
+  | grep -oE 'href="https://poison[^"]*"'
+
+# Check for rybbit analytics
+curl -s -H "Accept: text/html,application/xhtml+xml" https://example.com/ \
+  | grep -oE 'src="https://rybbit[^"]*"'
+```
+
+#### For programmatic clients that need injection
+If a non-browser client needs to receive injected content, ensure it sends
+`Accept: text/html` in its request headers.
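+
+For a Go client, setting the header could look like the sketch below.
+`fetchAsHTML` is a hypothetical helper, and the `httptest` server in `main` is a
+local stand-in that echoes the header back. A real deployment would target the
+service URL behind Traefik instead.
+
+```go
+package main
+
+import (
+	"fmt"
+	"io"
+	"net/http"
+	"net/http/httptest"
+)
+
+// fetchAsHTML issues a GET with an explicit Accept: text/html header
+// and returns the response body as a string.
+func fetchAsHTML(client *http.Client, url string) (string, error) {
+	req, err := http.NewRequest(http.MethodGet, url, nil)
+	if err != nil {
+		return "", err
+	}
+	req.Header.Set("Accept", "text/html")
+	resp, err := client.Do(req)
+	if err != nil {
+		return "", err
+	}
+	defer resp.Body.Close()
+	body, err := io.ReadAll(resp.Body)
+	return string(body), err
+}
+
+func main() {
+	// Stand-in server that echoes the Accept header it received.
+	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
+		fmt.Fprint(w, r.Header.Get("Accept"))
+	}))
+	defer srv.Close()
+
+	body, err := fetchAsHTML(srv.Client(), srv.URL)
+	if err != nil {
+		panic(err)
+	}
+	fmt.Println("server saw Accept:", body)
+}
+```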
+
+### Verification (Accept Header)
+```bash
+# curl's default Accept: */* -- no injection (expected)
+curl -s https://example.com/ | grep -c "rybbit"
+# Output: 0
+
+# With Accept: text/html -- injection works
+curl -s -H "Accept: text/html" https://example.com/ | grep -c "rybbit"
+# Output: 1 or more (grep -c counts matching lines)
+```
+
+### Notes (Accept Header)
+- This behavior is independent of the compression issue (Problem 1 above)
+- The check is on the **request** `Accept` header, not the **response** `Content-Type`
+- `Accept: */*` does NOT match -- `strings.Contains("*/*", "text/html")` is false
+- Real AI scrapers typically send browser-like Accept headers, so trap links will be
+  injected for them correctly
+- API calls (which typically send `Accept: application/json`) are correctly skipped
+
+---
+
+## See Also
+- `traefik-helm-configuration` -- Traefik Helm chart configuration and entrypoints
+- `ingress-factory-migration` -- Covers the ingress factory module that creates
+  rybbit analytics middlewares